title,url,pdf,tldr,abstract,keywords
Quantum reinforcement learning,https://openreview.net/forum?id=kRvZ2PcsxjJj,https://openreview.net/pdf?id=kRvZ2PcsxjJj,"A review and implementation of quantum reinforcement learning. We use QRL to train agents for several games, and conclude with predictions and an outlook on future applications and trends.","With the rapid development of quantum technology, it has been confirmed that quantum computing can surpass traditional computing in speed in some fields. Quantum advantage can also be manifested in the field of machine learning. We review many current papers related to quantum reinforcement learning, and discuss in depth how quantum reinforcement learning is implemented and its core techniques. The quantum reinforcement learning (QRL) method is proposed by combining quantum theory and reinforcement learning (RL). The field of quantum reinforcement learning actually includes two aspects: one is using quantum properties to help reinforcement learning, and the other is using reinforcement learning to help quantum circuit design. We have trained agents for several classic games using quantum reinforcement learning methods, and evaluated the superiority and feasibility of the approach in simulation experiments. The QRL algorithm can be used in many areas such as finance, industrial simulation, mechanical control, quantum communication, and quantum circuit optimization. We survey the field of quantum reinforcement learning and make bold predictions that many future applications will benefit from the development of this technology.","quantum reinforcement learning, multi-agent, quantum technology, control optimization, quantum circuit"
Quantifying and Mitigating the Impact of Label Errors on Model Disparity Metrics,https://openreview.net/forum?id=RUzSobdYy0V,https://openreview.net/pdf?id=RUzSobdYy0V,,"Errors in labels obtained via human annotation adversely affect a trained model's performance. Existing approaches propose ways to mitigate the effect of label error on a model's downstream accuracy, yet little is known about its impact on a model's group-based disparity metrics\footnote{Group-based disparity metrics like subgroup calibration, false positive rate, false negative rate, equalized odds, and equal opportunity are more often known, colloquially, as \textit{fairness metrics} in the literature. We use the term group-based disparity metrics in this work.}. Here we study the effect of label error on a model's group-based disparity metrics like group calibration. We empirically characterize how varying levels of label error, in both training and test data, affect these disparity metrics. We find that group calibration and other metrics are sensitive to train-time and test-time label error---particularly for minority groups. For the same level of label error, the percentage change in group calibration error for the minority group is on average 1.5 times larger than the change for the majority group. Towards mitigating the impact of training-time label error, we present an approach to estimate how changing a single training input's label affects a model's group disparity metric on a test set. We empirically assess the proposed approach on a variety of datasets and find a 10-40\% improvement, compared to alternative approaches, in identifying training inputs that improve a model's disparity metric.
The proposed approach can help surface training inputs that may need to be corrected for improving a model's group-based disparity metrics.",
Suppression helps: Lateral Inhibition-inspired Convolutional Neural Network for Image Classification,https://openreview.net/forum?id=N3kGYG3ZcTi,https://openreview.net/pdf?id=N3kGYG3ZcTi,Improving feature learning with lateral inhibition,"Convolutional neural networks (CNNs) have become powerful and popular tools for image classification since deep learning emerged in the computer vision field. For better recognition, the dimensions of depth and width have been explored, leading to convolutional neural networks with more layers and more channels. In addition to these factors, neurobiology also suggests the widely existing mechanism of lateral inhibition (e.g., the Mach band effect), which increases the contrast of nearby neuron excitation in the lateral direction to help recognition. However, such an important mechanism has not been well explored in modern convolutional neural networks. In this paper, we explicitly explore the filter dimension in the lateral direction and propose our lateral inhibition-inspired (LI) design. Our naive design incorporates a low-pass filter while eliminating the central weight to mimic the decay of inhibition strength. The inhibition value is computed from the filtering result of the input, with a simple learnable weight parameter per channel deciding the strength via multiplication. Then the inhibition value is subtracted from the input as suppression, which could increase the contrast to help recognition. We also suggest an alternative using depthwise convolution, as a general form. Our design could work on both the plain convolution and the convolutional block with residual connection, while being compatible with existing modules. Without any channel attention along the channel dimension, the preliminary results demonstrate an absolute improvement of 3.68\% and 0.69\% over AlexNet and ResNet-18, respectively, on the ImageNet dataset, with little increase in parameters, indicating the merits of our design to help feature learning for image classification.","Lateral Inhibition, Convolutional Neural Networks"
Factorized Fourier Neural Operators,https://openreview.net/forum?id=tmIiMPl4IPa,https://openreview.net/pdf?id=tmIiMPl4IPa,An efficient and scalable neural PDE solver using Fourier transform.,"We propose the Factorized Fourier Neural Operator (F-FNO), a learning-based approach for simulating partial differential equations (PDEs). Starting from a recently proposed Fourier representation of flow fields, the F-FNO bridges the performance gap between pure machine learning approaches and the best numerical or hybrid solvers. This is achieved with several insights that collectively have a significant effect: separable spectral representations, improved residual connections, and carefully designed training strategies. On several challenging benchmark PDEs on regular grids, structured meshes, and point clouds, the F-FNO can scale to deeper networks and outperform both the FNO and the geo-FNO, reducing the error by 83% on the Kolmogorov flow, 31% on the elasticity problem, 57% on the airfoil flow problem, and 60% on the plastic forging problem.
Compared to the state-of-the-art pseudo-spectral method, the F-FNO can take a step size that is an order of magnitude larger in time and achieve an order of magnitude speedup to produce the same solution quality.","fourier transform, fourier operators, pde, navier stokes"
DFPC: Data flow driven pruning of coupled channels without data.,https://openreview.net/forum?id=mhnHqRqcjYU,https://openreview.net/pdf?id=mhnHqRqcjYU,We propose a novel data-free algorithm to accelerate neural networks via pruning coupled channels.,"Most structured pruning algorithms produce subnetworks that not only have high predictive accuracy but also have significantly lower FLOPs. However, the decrease in FLOPs seldom results in a similar decrease in inference time, partly because these algorithms avoid pruning coupled channels (CCs). These channels contribute significantly to the total inference time; layers with CCs as input or output take more than 66% of the inference time in ResNet-50. Motivated by this, in this paper we study the problem of pruning CCs in the data-free regime. Formal studies for pruning CCs are sparse due to a lack of proper characterization. Thus, we define Data Flow Couplings (DFCs) that abstract the notion of coupling and aid us in scoring coupled elements of the network. Gauging the saliency of CCs is not straightforward, as conventional scoring strategies yield discrepant layerwise importances for CCs. This necessitates the definition of grouped saliencies to gauge the importance of coupled elements in a network. Since we do not have access to data, we propose the Backwards Graph-based Saliency Computation (BGSC) algorithm that computes saliencies by estimating an upper bound to the reconstruction error of intermediate layers. We then compare saliencies to prune CCs and call this pruning strategy DFPC. Finally, we show the efficacy of DFPC for models trained on CIFAR-10, CIFAR-100, and ImageNet datasets. For instance, we find that for a 5% accuracy drop and a 1.64x reduction of FLOPs for ResNet-101 trained on the CIFAR-10 dataset, the inference time speedup obtained by DFPC is up to 1.66x, without finetuning. When access to the ImageNet training set is assumed, we significantly improve over the data-free method. We see at least a 47.1% improvement in speedup for a 2.3% accuracy drop for ResNet-50 against our baselines.","Pruning, Data Free, Model Compression"
TVSPrune - Pruning Non-discriminative filters via Total Variation separability of intermediate representations without fine tuning,https://openreview.net/forum?id=sZI1Oj9KBKy,https://openreview.net/pdf?id=sZI1Oj9KBKy,We use the total variation distance between the class conditional distributions of filter outputs for structured pruning of neural networks.,"Achieving structured, data-free sparsity of deep neural networks (DNNs) remains an open area of research. In this work, we address the challenge of pruning filters with only access to random samples drawn from the original distribution and without access to the original training set or loss function. We posit the following hypothesis: well-trained models possess discriminative filters, and any non-discriminative filters can be pruned without impacting the predictive performance of the classifier. Based on this hypothesis, we propose a new paradigm for pruning neural networks: distributional pruning, wherein we only require access to the distributions that generated the original datasets.
We formalise and quantify the discriminating ability of filters through the total variation (TV) distance between the class-conditional distributions of the filter outputs. We present empirical results that, using this definition of discriminability, support our hypothesis on a variety of datasets and architectures. Next, we define the LDIFF score, a heuristic to quantify the extent to which a layer possesses a mixture of discriminative and non-discriminative filters. We empirically demonstrate that the LDIFF score is indicative of the performance of random pruning for a given layer, and thereby indicates the extent to which a layer may be pruned. Our main contribution is a novel one-shot pruning algorithm, called TVSPrune, that identifies non-discriminative filters for pruning. We extend this algorithm to IterTVSPrune, wherein we iteratively apply TVSPrune, thereby enabling us to achieve greater sparsity. Last, we demonstrate the efficacy of TVSPrune on a variety of datasets, and show that in some cases, we can prune up to 60% of parameters with only a 2% loss of accuracy without any fine-tuning of the model, beating the nearest baseline by almost 10%.","Structured pruning, model compression"
Adversarial Training descends without descent: Finding actual descent directions based on Danskin's theorem,https://openreview.net/forum?id=I3HCE7Ro78H,https://openreview.net/pdf?id=I3HCE7Ro78H,There is a subtle bug in the theory behind PGD. We show how to correct it and that it matters in practice,"Adversarial Training using a strong first-order adversary (PGD) is the gold standard for training Deep Neural Networks that are robust to adversarial examples. We show that, contrary to the general understanding of the method, the gradient at an optimal adversarial example may increase, rather than decrease, the adversarially robust loss. This holds independently of the learning rate. More precisely, we provide a counterexample to a corollary of Danskin's Theorem presented in the seminal paper of Madry et al. (2018) which states that a solution of the inner maximization problem can yield a descent direction for the adversarially robust loss. Based on a correct interpretation of Danskin's Theorem, we propose Danskin's Descent Direction (DDD) and we verify experimentally that it provides better directions than those obtained by a PGD adversary. Using the CIFAR10 dataset we further provide a real-world example showing that our method achieves a steeper increase in robustness levels in the early stages of training, and is more stable than the PGD baseline.","Adversarial Training, Adversarial Examples, non-convex optimization, robustness"
A Study of Biologically Plausible Neural Network: the Role and Interactions of Brain-Inspired Mechanisms in Continual Learning,https://openreview.net/forum?id=9Zx6tTcX0SE,https://openreview.net/pdf?id=9Zx6tTcX0SE,"a comprehensive study on the role and interactions of different mechanisms inspired by the brain including sparse non-overlapping representations, Hebbian learning, synaptic consolidation, and replay of past activations","Humans excel at continually acquiring, consolidating, and retaining information from an ever-changing environment, whereas artificial neural networks (ANNs) exhibit catastrophic forgetting.
There are considerable differences in the complexity of synapses, the processing of information, and the learning mechanisms in biological neural networks and their artificial counterparts, which may explain the mismatch in performance. We consider a biologically plausible framework comprising separate populations of exclusively excitatory and inhibitory neurons that adhere to Dale's principle, in which the excitatory pyramidal neurons are augmented with dendritic-like structures for context-dependent processing of stimuli. We then conduct a comprehensive study on the role and interactions of different mechanisms inspired by the brain, including sparse non-overlapping representations, Hebbian learning, synaptic consolidation, and replay of past activations that accompanied the learning event. Our study suggests that employing multiple complementary mechanisms in a biologically plausible architecture, similar to the brain, can be effective in enabling continual learning in ANNs.","Continual Learning, Catastrophic Forgetting, Brain-inspired Mechanisms, Active Dendrites, Dale's Principle, Hebbian Learning, Sparsity"
Learning Continuous Normalizing Flows For Faster Convergence To Target Distribution via Ascent Regularizations,https://openreview.net/forum?id=6iEoTr-jeB7,https://openreview.net/pdf?id=6iEoTr-jeB7,,"Normalizing flows (NFs) have been shown to be advantageous in modeling complex distributions and improving sampling efficiency for unbiased sampling. In this work, we propose a new class of continuous NFs, ascent continuous normalizing flows (ACNFs), that makes a base distribution converge faster to a target distribution. Although solving such a flow exactly is non-trivial and barely possible, we propose a practical implementation that learns flexibly parametric ACNFs via ascent regularization, and apply it in two learning cases: maximum likelihood learning for density estimation and minimizing reverse KL divergence for unbiased sampling and variational inference. The learned ACNFs demonstrate faster convergence towards the target distributions, therefore achieving better density estimation, unbiased sampling and variational approximation at lower computational cost. Furthermore, the flows are shown to stabilize themselves to mitigate performance deterioration and are less sensitive to the choice of training flow length $T$.","normalizing flows, gradient flows, density estimation, unbiased sampling, variational inference"
pFedKT: Personalized Federated Learning via Knowledge Transfer,https://openreview.net/forum?id=Vx6G9W5M4sQ,https://openreview.net/pdf?id=Vx6G9W5M4sQ,,"Federated learning (FL) has been widely studied as a new paradigm to achieve multi-party collaborative modelling on decentralized data with privacy protection. Unfortunately, traditional FL suffers from Non-IID data distributions, where clients' private models after FL are even inferior to models trained standalone. Existing approaches to tackle this challenge fall into two directions: a) pursuing a better global model through mitigating biases of private models, and b) improving personalized private models by personalized federated learning (PFL). Still, both of them yield limited accuracy improvements in private models. To this end, \textit{we design pFedKT, a novel personalized federated learning framework with knowledge transfer, towards boosting the performances of personalized private models on Non-IID data}.
It involves two types of knowledge transfer: a) transferring \textit{historical private knowledge} to new private models by local hypernetworks; b) transferring \textit{the global model's knowledge} to private models through contrastive learning. After absorbing the historical private knowledge and the latest global knowledge, both the personalization and generalization of private models are enhanced. Besides, we derive pFedKT's generalization bound and prove its convergence theoretically. Extensive experiments verify that pFedKT achieves $0.31\%-3.46\%$ accuracy improvements for private models over the state-of-the-art baseline.","Personalized Federated Learning, Knowledge Transfer, Local Hypernetwork, Contrastive Learning"
FARE: Provably Fair Representation Learning,https://openreview.net/forum?id=vzdrgR2nomD,https://openreview.net/pdf?id=vzdrgR2nomD,We present the first provably fair representation learning method.,"Fair representation learning (FRL) is a popular class of methods aiming to produce fair classifiers via data preprocessing. However, recent work has shown that prior methods achieve worse accuracy-fairness tradeoffs than originally suggested by their results. This dictates the need for FRL methods that provide provable upper bounds on the unfairness of any downstream classifier, a challenge yet unsolved. In this work we address this challenge and propose Fairness with Restricted Encoders (FARE), the first FRL method with provable fairness guarantees. Our key insight is that restricting the representation space of the encoder enables us to derive suitable fairness guarantees, while allowing empirical accuracy-fairness tradeoffs comparable to prior work. FARE instantiates this idea with a tree-based encoder, a choice motivated by inherent advantages of decision trees when applied in our setting. Crucially, we develop and apply a practical statistical procedure that computes a high-confidence upper bound on the unfairness of any downstream classifier. In our experimental evaluation on several datasets and settings we demonstrate that FARE produces tight upper bounds, often comparable with empirical results of prior methods, which establishes the practical value of our approach.","fairness, fair representation learning"
ONLINE RESTLESS BANDITS WITH UNOBSERVED STATES,https://openreview.net/forum?id=NOKUQ9JMohJ,https://openreview.net/pdf?id=NOKUQ9JMohJ,"We propose TSEETC to solve restless bandits with unknown transition kernels, unknown reward functions and unobserved states.","We study the online restless bandit problem, where each arm evolves independently according to a Markov chain, and the reward of pulling an arm depends on both the current state of the corresponding Markov chain and the action. The agent (decision maker) does not know the transition kernels and reward functions, and cannot observe the states of arms all the time. The goal is to sequentially choose which arms to pull so as to maximize the expected cumulative rewards collected. In this paper, we propose TSEETC, a learning algorithm based on Thompson Sampling with Episodic Explore-Then-Commit. The algorithm proceeds in episodes of increasing length, and each episode is divided into exploration and exploitation phases. In the exploration phase of each episode, action-reward samples are collected in a round-robin way and then used to update the posterior as a mixture of Dirichlet distributions. At the beginning of the exploitation phase, TSEETC generates a sample from the posterior distribution as the true parameters.
It then follows the optimal policy for the sampled model for the rest of the episode. We establish the Bayesian regret bound $\tilde{\mathcal{O}}(\sqrt{T})$ for TSEETC, where $T$ is the time horizon. This is the first bound that is close to the lower bound of restless bandits, especially in an unobserved-state setting. We show through simulations that TSEETC outperforms existing algorithms in regret.","Thompson Sampling, Explore-Then-Commit, online restless bandit"
Dual-Domain Diffusion Based Progressive Style Rendering towards Semantic Structure Preservation,https://openreview.net/forum?id=91efl6aSU2d,https://openreview.net/pdf?id=91efl6aSU2d,,"In this paper, we propose a Dual-Domain Diffusion based Progressive Style Rendering (D3PSR) method to achieve style rendering from the semantic Domain A to the style Domain B. Different from classic diffusion models, our model takes two unpaired images from two domains as inputs, and the output is obtained at the middle layer. With the benefits from diffusion models, a dynamic rendering process is leveraged to progressively incorporate the texture strokes from the style domain while preserving the semantic structure in the noise-adding steps. Our experiments show that a range of artistic styles can be successfully transferred onto the target images without breaking their semantic structures, demonstrating the merits of our new diffusion-based approach with beyond state-of-the-art performance in style transfer. A further study utilized similarity scores to measure this diffusion-based process, quantitatively showing how semantic structures are rendered in our progressive process.",
UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers,https://openreview.net/forum?id=KNL8KSH7b_F,https://openreview.net/pdf?id=KNL8KSH7b_F,"For the first time, we propose a multimodal compression approach UPop for vision-language Transformers from the perspective of pruning.","Data from the real world contains a vast amount of multimodal information, among which vision and language are the two most representative modalities. On the other hand, researchers have spent much effort on model compression to reduce the huge memory and computational consumption of increasingly large models. However, how to compress multimodal models, especially vision-language Transformers, is still under-explored. This paper proposes Unified and Progressive Pruning (UPop), which compresses vision-language Transformers via pruning. UPop incorporates 1) unifiedly searching countless multimodal subnetworks in a continuous optimization space from the uncompressed model; 2) progressively and simultaneously retraining the subnetwork. The subnetworks are learned in multiple components, including the self-attention modules, MLPs in both vision and language branches, and cross-attention modules. To ease the process of pruning, we design \textit{Unified Pruning} to automatically assign the optimal pruning ratio to each compressible component, instead of manually assigning each component a pruning ratio. To explore the limit of the compression ratio, we propose \textit{Progressive Pruning} to maintain convergence between searching and retraining. In addition, UPop enables zero-cost subnetwork selection after searching countless multimodal subnetworks, and the searched subnetwork can be used without any retraining. Experiments on multiple discriminative and generative vision-language tasks demonstrate the versatility of the proposed UPop.
For example, we achieve \textbf{2$\times$} compression and \textbf{1.66$\times$} FLOPs reduction on the COCO image captioning dataset with a \textbf{0.8} SPICE drop, and \textbf{4$\times$} compression and \textbf{2.96$\times$} FLOPs reduction with a \textbf{2.1} SPICE drop.","Multimodal Model, Model Compression, Vision-Language Transformers"
Learning to aggregate: A parameterized aggregator to debias aggregation for cross-device federated learning,https://openreview.net/forum?id=IQM-3_Tzldw,https://openreview.net/pdf?id=IQM-3_Tzldw,Our idea is to learn an aggregator that debiases aggregation and calibrates and controls the direction of aggregated parameters to deal with both client drift and period drift.,"Federated learning (FL) emerged as a novel machine learning setting that enables collaboratively training deep models on decentralized private data. Due to the heterogeneity (non-iidness) of the decentralized data, FL methods (e.g., FedAvg) suffer from unstable and slow convergence. Recent works explain the non-iid problem in FL as client drift, and deal with it by enforcing regularization on local updates. However, these works neglect the heterogeneity among different communication rounds: the data of sampled candidates at different communication rounds are also non-iid, which we term period drift; like client drift, it can lead to aggregation bias that degrades convergence. To deal with this, we propose a novel aggregation strategy, named FedPA, that uses a Parameterized Aggregator as an alternative to averaging. We frame FedPA within a meta-learning setting, and formulate the aggregator as a meta-learner that learns to aggregate the model parameters of clients. FedPA can directly learn the aggregation bias and calibrate and control the direction of the aggregated parameters towards the optimum. Experiments show that FedPA can achieve competitive performance compared with conventional baselines.",Federated learning
NeuralStagger: accelerating physics constrained neural PDE solver with spatial-temporal decomposition,https://openreview.net/forum?id=pGR2gNO5c4p,https://openreview.net/pdf?id=pGR2gNO5c4p,,"Neural networks have shown great potential in accelerating the solution of partial differential equations (PDEs). Recently, there has been a growing interest in introducing physics constraints into training neural PDE solvers to reduce the use of costly data and improve the generalization ability. However, these physics constraints, based on certain finite-dimensional approximations over the function space, must resolve the smallest-scale physics to ensure the accuracy and stability of the simulation, resulting in heavy computational costs from large inputs, outputs, and neural networks. This paper proposes a general acceleration methodology called NeuralStagger that spatially and temporally decomposes the original learning tasks into several coarser-resolution subtasks. We define a coarse-resolution neural solver for each subtask, which requires fewer computational resources, and jointly train them with the vanilla physics-constrained loss by simply arranging their outputs to reconstruct the original solution. Due to the perfect parallelism between them, the solution is achieved as fast as a coarse-resolution neural solver. In addition, the trained solvers give users the flexibility to simulate at multiple levels of resolution.
We demonstrate the successful application of NeuralStagger on various fluid dynamics simulations, which leads to an additional 10 to 100 times speed-up. Moreover, the experiments also show that the learned model can be readily used for optimal control.",
Towards Robust Online Dialogue Response Generation,https://openreview.net/forum?id=s6l6ks1iooc,https://openreview.net/pdf?id=s6l6ks1iooc,,"Although pre-trained sequence-to-sequence models have achieved great success in dialogue response generation, chatbots still suffer from generating inconsistent responses in real-world applications, especially in multi-turn settings. We argue that this can be caused by a discrepancy between training and real-world testing. While the chatbot generates the response based on the gold context during training, it has to predict the next utterance based on a context consisting of both the user's and the bot's own utterances in real-world testing. With the growing number of utterances, this discrepancy becomes more severe in multi-turn settings. In this paper, we propose a hierarchical sampling-based method consisting of both utterance-level sampling and semi-utterance-level sampling to alleviate the discrepancy, which implicitly increases dialogue coherence. We further adopt reinforcement learning and re-ranking methods to explicitly optimize dialogue coherence during training and inference, respectively. Empirical experiments show the effectiveness of the proposed methods for improving the robustness of chatbots in real practice.",
Deep Reinforcement Learning based Insight Selection Policy,https://openreview.net/forum?id=3uDXZZLBAwd,https://openreview.net/pdf?id=3uDXZZLBAwd,,"We live in the era of ubiquitous sensing and computing. More and more data is being collected and processed from devices, sensors and systems. This opens up opportunities to discover patterns from these data that could help in gaining a better understanding of the sources that produce them. This is useful in a wide range of domains, especially in the area of personal health, in which such knowledge could help users comprehend their behaviour and indirectly improve their lifestyle. Insight generators are systems that identify such patterns and verbalise them in a readable text format, referred to as insights. The selection of insights is done using a scoring algorithm which aims at optimizing this process based on multiple objectives, e.g., factual correctness, usefulness and interestingness of insights. In this paper, we propose a novel Reinforcement Learning (RL) framework for insight selection where the scoring model is trained by user feedback on interestingness and their lifestyle quality estimates. With the use of highly reusable and simple principles of automatic user simulation based on real data, we demonstrate in this preliminary study that the RL solution may improve the selection of insights towards multiple pre-defined objectives.","recommender, insight, reinforcement learning, behavior change support system, health coaching, lifestyle simulator, Gaussian mixture modeling"
Data Leakage in Tabular Federated Learning,https://openreview.net/forum?id=QN_VgTeOYGl,https://openreview.net/pdf?id=QN_VgTeOYGl,We introduce a novel data leakage attack on FL for tabular data.,"While federated learning (FL) promises to preserve privacy in distributed training of deep learning models, recent work in the image and NLP domains showed that training updates leak private data of participating clients.
At the same time, most high-stakes applications of FL (e.g., legal and financial) use tabular data. Compared to the NLP and image domains, reconstruction of tabular data poses several unique challenges: (i) categorical features introduce a significantly more difficult mixed discrete-continuous optimization problem, (ii) the mix of categorical and continuous features causes high variance in the final reconstructions, and (iii) structured data makes it difficult for the adversary to judge reconstruction quality. In this work, we tackle these challenges and propose the first comprehensive reconstruction attack on tabular data, called TabLeak. TabLeak is based on three key ingredients: (i) a softmax structural prior, implicitly converting the mixed discrete-continuous optimization problem into an easier fully continuous one, (ii) a way to reduce the variance of our reconstructions through a pooled ensembling scheme exploiting the structure of tabular data, and (iii) an entropy measure which can successfully assess reconstruction quality. Our experimental evaluation demonstrates the effectiveness of TabLeak, reaching state-of-the-art results on four popular tabular datasets. For instance, on the Adult dataset, we improve attack accuracy by 10% compared to the baseline on the practically relevant batch size of 32, and further obtain non-trivial reconstructions for batch sizes as large as 128. Our findings are important as they show that performing FL on tabular data, which often poses high privacy risks, is highly vulnerable.","federated learning, tabular data, data leakage attacks, gradient inversion"
Long-horizon video prediction using a dynamic latent hierarchy,https://openreview.net/forum?id=TZG_XsO4x6y,https://openreview.net/pdf?id=TZG_XsO4x6y,Hierarchical generative model for long-horizon video prediction,"The task of video prediction and generation is known to be notoriously difficult, with research in this area largely limited to short-term predictions. Though plagued with noise and stochasticity, videos consist of features organised in a spatiotemporal hierarchy, with different features possessing different temporal dynamics. In this paper, we introduce Dynamic Latent Hierarchy (DLH) -- a deep hierarchical latent model that represents videos as a hierarchy of latent states that evolve over separate and fluid timescales. Each latent state is a mixture distribution with two components, representing the immediate past and the predicted future, causing the model to learn transitions only between sufficiently dissimilar states, while clustering temporally persistent states closer together. Using this unique property, DLH naturally discovers the spatiotemporal structure of a dataset and learns disentangled representations across its hierarchy. We hypothesise that this simplifies the task of modeling the temporal dynamics of a video, improves the learning of long-term dependencies, and reduces error accumulation. As evidence, we demonstrate that DLH outperforms state-of-the-art benchmarks in video prediction, better represents stochasticity, and dynamically adjusts its hierarchical and temporal structure.
Our paper shows, among other things, how progress in representation learning can translate into progress in prediction tasks.","long-term video prediction, hierarchical generative model, spatiotemporal disentanglement, event-based model"
SwinZS3: Zero-Shot Semantic Segmentation with a Swin Transformer,https://openreview.net/forum?id=yqe0BZeN_xH,https://openreview.net/pdf?id=yqe0BZeN_xH,,"Zero-shot semantic segmentation (ZS3) aims at learning to classify never-seen classes with zero training samples. Convolutional neural networks (CNNs) have recently achieved great success in this task. However, their limited attention ability constrains existing network architectures to reason based on word embeddings. In light of the recent successes achieved by Swin Transformers, we propose SwinZS3, a new framework exploiting visual embeddings and semantic embeddings in a joint embedding space. SwinZS3 combines a transformer image encoder with a language encoder. The image encoder is trained with pixel-text score maps using dense language-guided semantic prototypes computed by the language encoder. This allows SwinZS3 to recognize unseen classes at test time without retraining. We evaluate our method on standard ZS3 benchmarks (PASCAL VOC and PASCAL Context), and the results demonstrate its effectiveness with state-of-the-art performance.","zero shot semantic segmentation, deep learning, transformer"
Softened Symbol Grounding for Neuro-symbolic Systems,https://openreview.net/forum?id=HTJE5Krui0g,https://openreview.net/pdf?id=HTJE5Krui0g,,"Neuro-symbolic learning usually consists of two worlds, i.e., neural network learning and symbolic constraint satisfaction, whose effectiveness hinges on symbol grounding, a fundamental problem in AI. This paper presents a novel, softened symbol grounding process, enabling the interactions of the two worlds in a mutually beneficial manner. Technically, we design a neuro-symbolic learning framework that features (1) modeling of deterministic symbol solution states as a Boltzmann distribution, which avoids expensive state searching and facilitates the interaction between network training and symbolic reasoning; (2) an efficient MCMC sampling technique leveraging projection and SMT solvers, which overcomes the connectivity barrier in sampling symbol solution spaces; (3) an annealing mechanism that avoids the trap of sub-optimal symbol groundings. Experiments with three representative neuro-symbolic learning tasks demonstrate that, thanks to its superior symbol grounding capability, our framework successfully solves problems well beyond the frontier of the existing proposals.","neuro-symbolic learning, symbol grounding problem, projection-based sampling"
Encoding Recurrence into Transformers,https://openreview.net/forum?id=7YfHla7IxBJ,https://openreview.net/pdf?id=7YfHla7IxBJ,We propose a new module to encode the recurrent dynamics of an RNN layer into Transformers and achieve higher sample efficiency.,"This paper breaks down an RNN layer, with negligible loss, into a sequence of simple RNNs, each of which can be further rewritten as a lightweight positional encoding matrix of a self-attention, named the Recurrence Encoding Matrix (REM).
Thus, the recurrent dynamics introduced by the RNN layer can be encapsulated into the positional encodings of a multihead self-attention, which makes it possible to seamlessly incorporate these recurrent dynamics into a Transformer, leading to a new module, Self-Attention with Recurrence (RSA). The proposed module can leverage the recurrent inductive bias of REMs to achieve better sample efficiency than its corresponding baseline Transformer, while the self-attention is used to model the remaining non-recurrent signals. The relative proportions of these two components are controlled by a data-driven gated mechanism, and the effectiveness of RSA modules is demonstrated on four sequential learning tasks.","Recurrent models, Transformers, sample efficiency, gated mechanism"
Generating Intuitive Fairness Specifications for Natural Language Processing,https://openreview.net/forum?id=N_g8TT9Cy7f,https://openreview.net/pdf?id=N_g8TT9Cy7f,We provide new methods for generating individual fairness specifications for NLP based on LLMs and validate them in a human study.,"Text classifiers have promising applications in high-stakes tasks such as resume screening and content moderation. These classifiers must be fair and avoid discriminatory decisions by being invariant to perturbations of sensitive attributes such as gender or ethnicity. However, there is a gap between human intuition about these perturbations and the formal similarity specifications capturing them. While existing research has started to address this gap, current methods are based on hardcoded word replacements, resulting in specifications with limited expressivity or ones that fail to fully align with human intuition (e.g., in cases of asymmetric counterfactuals). This work proposes novel methods for bridging this gap by discovering expressive and intuitive individual fairness specifications. We show how to leverage unsupervised style transfer and GPT-3's zero-shot capabilities to automatically generate expressive candidate pairs of semantically similar sentences that differ along sensitive attributes. We then validate the generated pairs via an extensive crowdsourcing study, which confirms that many of these pairs align with human intuition about fairness in the context of toxicity classification. Finally, we show how limited amounts of human feedback can be leveraged to learn a similarity specification that can be used to train downstream fairness-aware models.","Individual Fairness, Style Transfer, NLP, Crowdsourcing, Human Evaluation"
Learning to Perturb for Contrastive Learning of Unsupervised Sentence Representations,https://openreview.net/forum?id=gZYbGIpFYpA,https://openreview.net/pdf?id=gZYbGIpFYpA,,"Recently, contrastive learning has been shown effective in fine-tuning pre-trained language models (PLMs) to learn sentence representations, incorporating perturbations into unlabeled sentences to augment semantically related positive examples for training. However, previous works mostly adopt heuristic perturbation methods that are independent of the sentence representations. Since the perturbations are unaware of the goal or process of sentence representation learning during training, they are likely to lead to sub-optimal augmentations for contrastive learning. To address this issue, we propose a new framework \textbf{L2P-CSR} that adopts a learnable perturbation strategy for improving contrastive learning of sentence representations.
In our L2P-CSR, we design a safer perturbation mechanism that only weakens the influence of tokens and features on the sentence representation, which avoids dramatically changing the semantics of the sentence representations. Besides, we devise a gradient-based algorithm to generate adaptive perturbations specifically for the dynamically updated sentence representation during training. This approach is better able to augment high-quality examples that guide sentence representation learning. Extensive experiments on diverse sentence-related tasks show that our approach outperforms competitive baselines.","Unsupervised Sentence Representations, Contrastive Learning"
Proper Scoring Rules for Survival Analysis,https://openreview.net/forum?id=Xj9V-stmIcO,https://openreview.net/pdf?id=Xj9V-stmIcO,Theoretical analysis of scoring rules for survival analysis.,"Survival analysis is the problem of estimating probability distributions for future events, which can be seen as a problem in uncertainty quantification. Although there are fundamental theories on strictly proper scoring rules for uncertainty quantification, little is known about those for survival analysis. In this paper, we investigate extensions of four major strictly proper scoring rules for survival analysis. Through the extensions, we discuss and clarify the assumptions arising from the discretization of the estimation of probability distributions. We also discuss the relationship between the existing algorithms and the extended scoring rules, and we propose new algorithms based on our extensions of the scoring rules for survival analysis.","scoring rules, survival analysis, time-to-event analysis"
Social Network Structure Shapes Innovation: Experience-sharing in RL with SAPIENS,https://openreview.net/forum?id=BO5_Lm7iD_,https://openreview.net/pdf?id=BO5_Lm7iD_,"We show that a group's ability to collectively solve tasks depends on the social network structure that determines who shares information with whom, with dynamically changing structures performing best.","The human cultural repertoire relies on innovation: our ability to continuously explore how existing elements can be combined to create new ones. Innovation is not solitary; it relies on collective accumulation and merging of previous solutions. Machine learning approaches commonly assume that fully connected multi-agent networks are best suited for innovation. However, human laboratory and field studies have shown that hierarchical innovation is more robustly achieved by dynamic social network structures. In dynamic settings, humans oscillate between innovating individually or in small clusters, and then sharing outcomes with others. To our knowledge, the role of multi-agent topology on innovation has not been systematically studied in machine learning. It remains unclear a) which social network topologies are optimal for which innovation tasks, and b) which properties of experience sharing improve multi-level innovation. Here we use a multi-level hierarchical problem setting (WordCraft), with three different innovation tasks. We systematically design networks of DQNs sharing experiences from their replay buffers in varying topologies (fully connected, small world, dynamic, ring). Comparing the level of innovation achieved by different experience-sharing topologies across different tasks shows that, first, consistent with human findings, experience sharing within a dynamic topology achieves the highest level of innovation across tasks.
Second, experience sharing is not as helpful when there is a single clear path to innovation. Third, two metrics we propose, conformity and diversity of shared experience, can explain the success of different topologies on different tasks. These contributions can advance our understanding of optimal AI-AI, human-human, and human-AI collaborative networks, inspiring future tools for fostering collective innovation in large organizations.","collective innovation, social network, multi-agent model, collective dynamics, communication topology, collective cognition"
Mini-batch $k$-means terminates within $O(d/\epsilon)$ iterations,https://openreview.net/forum?id=jREF4bkfi_S,https://openreview.net/pdf?id=jREF4bkfi_S,,"We answer the question: ""Does \emph{local} progress (on batches) imply \emph{global} progress (on the entire dataset) for mini-batch $k$-means?"". Specifically, we consider mini-batch $k$-means which terminates only when the improvement in the quality of the clustering on the sampled batch is below some threshold. Although at first glance it appears that this algorithm might execute forever, we answer the above question in the affirmative and show that if the batch is of size $\tilde{\Omega}((d/\epsilon)^2)$, it must terminate within $O(d/\epsilon)$ iterations with high probability, where $d$ is the dimension of the input, and $\epsilon$ is a threshold parameter for termination. This is true \emph{regardless} of how the centers are initialized. Finally, we show the applicability of our results to the mini-batch $k$-means algorithm implemented in the scikit-learn (sklearn) Python library.",
Convergence is Not Enough: Average-Case Performance of No-Regret Learning Dynamics,https://openreview.net/forum?id=Jdj0fZhswJC,https://openreview.net/pdf?id=Jdj0fZhswJC,"Beyond convergence, average-case metrics rely on regions of attraction to compare the performance of different dynamics in multi-agent games.","Learning in games involves two main challenges, even in settings in which agents seek to coordinate: convergence to equilibria and selection of good equilibria. Unfortunately, solving the issue of convergence, which is the focus of state-of-the-art models, conveys little information about the quality of the equilibria that are eventually reached, often none at all. In this paper, we study a class of games in which the q-replicator dynamics (QRD), a widely-studied class of no-regret learning dynamics that includes gradient descent, “standard” replicator, and log-barrier dynamics as special cases, can be shown to converge pointwise to Nash equilibria. This is the starting point for our main task, which is the mathematically challenging problem of quantifying performance. In our main contribution, we quantify both conceptually and experimentally the outcome of optimal learning dynamics via average performance metrics, i.e., metrics that couple the regions of attraction with the quality of each attracting point. We provide an exhaustive comparison between gradient descent and “standard” replicator in a class of games with severe equilibrium selection problems and empirically extend our results to all dynamics in the QRD class.
Our results combine tools from machine learning, game theory, and dynamical systems and provide a framework to initiate the systematic comparison of different optimal learning dynamics in arbitrary games.","q-replicator dynamics, potential games, average price of anarchy, learning"
Gene finding revisited: improved robustness through structured decoding from learning embeddings,https://openreview.net/forum?id=Rn50hCOX9XX,https://openreview.net/pdf?id=Rn50hCOX9XX,Improving the robustness of predicting the exact coding sequences of genomes by combining deep learning with a graphical model encoding gene structure.,"Gene finding is the task of identifying the locations of coding sequences within the vast amount of genetic code contained in the genome. With an ever-increasing quantity of raw genome sequences, gene finding is an important avenue towards understanding the genetic information of (novel) organisms, as well as learning shared patterns across evolutionarily diverse species. The current state of the art consists of graphical models, usually trained per organism and requiring manually curated data sets. However, these models lack the flexibility to incorporate deep representation learning techniques that have in recent years been transformative in the analysis of protein sequences, and which could potentially help gene finders exploit the growing number of sequenced genomes to expand performance across multiple organisms. Here, we propose a novel approach, combining learned embeddings of raw genetic sequences with exact decoding using a latent conditional random field. We show that the model achieves performance matching the current state of the art, while increasing training robustness and removing the need for manually fitted length distributions. As language models for DNA improve, this paves the way for more performant cross-organism gene-finders.","gene finding, graphical model, gene prediction, gene splicing, conditional random fields, structured decoding, DNA, learned embeddings"
PPAT: Progressive Graph Pairwise Attention Network for Event Causality Identification,https://openreview.net/forum?id=eGWEfaW-5Yt,https://openreview.net/pdf?id=eGWEfaW-5Yt,"We propose PPAT for event causality identification, which infers inter-sentence event causality from intra-sentence event causality and outperforms all previous methods.","Event Causality Identification (ECI) aims to identify the causality between a pair of event mentions in a document, a task composed of sentence-level ECI (SECI) and document-level ECI (DECI). Previous work applies various reasoning models to help identify implicit event causality. However, it ignores that most inter-sentence event causality depends on intra-sentence event causality for inference. In this paper, we propose a progressive graph pairwise attention network (PPAT) to consider the above dependence. PPAT applies a progressive reasoning strategy: it first predicts the intra-sentence causality, and then infers the more implicit inter-sentence causality based on the SECI result. We construct a sentence-boundary event relational graph, and PPAT leverages a novel pairwise attention mechanism that attends to different reasoning chains on the graph. In addition, we propose a causality-guided training strategy for assisting PPAT in learning causality-related representations on every layer.
Extensive experiments on two well-established benchmark datasets show that our model achieves state-of-the-art performance (5.5% F1 gains on EventStoryLine and 4.5% F1 gains on Causal-TimeBank).","Event causality identification, graph neural network, natural language processing"
Learning Uncertainty for Unknown Domains with Zero-Target-Assumption,https://openreview.net/forum?id=pWVASryOyFw,https://openreview.net/pdf?id=pWVASryOyFw,,"We introduce our Maximum-Entropy Rewarded Reinforcement Learning (MERRL) framework that selects training data for more accurate Natural Language Processing (NLP). Because conventional data selection methods select training samples based on test-domain knowledge and not on real-life data, they frequently fail in unknown domains like patents and Twitter. Our approach selects training samples that maximize information uncertainty measured by entropy, including observation entropy like empirical Shannon entropy, Min-entropy, R\'enyi entropy, and prediction entropy using mutual information, to cover more possible queries that may appear in unknown worlds. Our MERRL framework using regularized A2C and SAC achieves up to a 99.7 perplexity decrease (43.4\% relative) in language modeling, a +25.0 accuracy increase (+40.0\% relative) in sentiment analysis, and a +5.0 F1 score increase (+30.8\% relative) in named entity recognition over various domains, demonstrating strong generalization power on unknown test sets.",
"Detecting Out-of-Distribution Data with Semi-supervised Graph “Feature” Networks",https://openreview.net/forum?id=0OlEBibFa_g,https://openreview.net/pdf?id=0OlEBibFa_g,,"Anomalous and out-of-distribution (OOD) data present a significant challenge to the robustness of decisions taken by deep neural networks, with myriad real-world consequences. State-of-the-art OOD detection techniques use embeddings learned by large pre-trained transformers. We demonstrate that graph structures and topological properties can be leveraged to detect both far-OOD and near-OOD data reliably, simply by characterising each data point (image) as a network of related features (visual concepts). Furthermore, we facilitate human-in-the-loop machine learning by expressing the data in terms of high-level domain-specific concepts. We obtained \textit{97.95\% AUROC} on far-OOD and \textit{98.79\% AUROC} on near-OOD detection tasks based on the LSUN dataset (comparable to the performance of state-of-the-art techniques).",
Let Offline RL Flow: Training Conservative Agents in the Latent Space of Normalizing Flow,https://openreview.net/forum?id=1Wo0vqaZ8WJ,https://openreview.net/pdf?id=1Wo0vqaZ8WJ,Latent-Variable Policy Optimization for Offline RL based on Normalizing Flows (outperforms both PLAS and LAPO),"Offline reinforcement learning aims to train a policy on a pre-recorded and fixed dataset without any additional environment interactions. There are two major challenges in this setting: (1) extrapolation error caused by approximating the value of state-action pairs not well-covered by the training data, and (2) distributional shift between behavior and inference policies. One way to tackle these problems is to induce conservatism, i.e., keeping the learned policies closer to the behavioral ones. To achieve this, we build upon recent works on learning policies in latent action spaces and use a special form of normalizing flow for constructing a generative model, which we use as a conservative action encoder.
This normalizing flow action encoder is pre-trained in a supervised manner on the offline dataset, and then an additional policy model, a controller in the latent space, is trained via reinforcement learning. This approach avoids querying actions outside of the training dataset and therefore does not require additional regularization for out-of-dataset actions. We evaluate our method on various locomotion and navigation tasks, demonstrating that our approach outperforms recently proposed algorithms with generative action models on a large portion of datasets.","Offline Reinforcement Learning, Normalizing Flows"
Towards a Complete Theory of Neural Networks with Few Neurons,https://openreview.net/forum?id=MjikLUwiB3M,https://openreview.net/pdf?id=MjikLUwiB3M,"We analytically study the landscapes of neural networks with a few neurons, shedding light on how the neurons move following gradient flow.","Deep learning has seen unprecedented progress thanks to the deployment of models with millions of parameters. On the theoretical side, an immense amount of effort has gone into understanding the dynamics of overparameterized networks. Although there is now a well-developed theory of networks with infinitely many neurons, the classic problem of understanding how a neural network with a few neurons learns remains unsolved. To attack this problem, we analytically study the landscapes of neural networks with few neurons. We prove for the first time that a student network with one neuron has only one critical point, its global minimum, when learning from a teacher network with arbitrarily many orthogonal neurons. In addition, we prove how a neuron-addition mechanism turns a minimum into a line of critical points with transitions from saddles to local minima via non-strict saddles. Finally, we discuss how the insights we get from our novel proof techniques may shed light on the dynamics of neural networks with few neurons.","theory of neural networks, non-convex landscapes, critical manifolds, gradient flow dynamics"
Machine Learning from Explanations,https://openreview.net/forum?id=UPQualDj1oo,https://openreview.net/pdf?id=UPQualDj1oo,,"Machine learning needs a huge amount of (labeled) data, as otherwise it might not learn the right model for different sub-populations or, even worse, it might pick up spurious correlations in the training data, leading to brittle prediction mechanisms. Also, for small training datasets, there is huge variability in the models learned on randomly sampled training sets, which makes the whole process less reliable. But collecting large amounts of useful, representative data and training on large datasets are very costly. In this paper, we present a technique to train reliable classification models on small datasets, assuming we have access to some simple explanations (e.g., a subset of influential input features) on labeled data. We also propose a novel two-stage training pipeline that optimizes the model's output and fine-tunes its attention in an interleaving manner, to help the model agree with the provided explanations while learning from the data.
We show that our training pipeline enables faster convergence to better models, especially when there is a severe class imbalance in the population or spurious features in the training data.","model explanations, trustworthy machine learning, explainable ai, interpretable machine learning" Functional Risk Minimization,https://openreview.net/forum?id=9D5FH6LFbRu,https://openreview.net/pdf?id=9D5FH6LFbRu,"We propose to model uncertainty in function space rather than output space. We derive a learning framework, with experimental results, and show connections to recent theory on over-parameterized generalization.","In this work, we break the classic assumption of data coming from a single function $f_{\theta^*}(x)$ followed by some noise in output space $p(y|f_{\theta^*}(x))$. Instead, we model each data point $(x_i,y_i)$ as coming from its own function $f_{\theta_i}$. We show that this model subsumes Empirical Risk Minimization for many common loss functions, and provides an avenue for more realistic noise processes. We derive Functional Risk Minimization~(FRM), a general framework for scalable training objectives which results in better performance in small experiments in regression and reinforcement learning. We also show that FRM can be seen as finding the simplest model that memorizes the training data, providing an avenue towards understanding generalization in the over-parameterized regime.","learning framework, theory, meta-learning, supervised learning" Latent Linear ODEs with Neural Kalman Filtering for Irregular Time Series Forecasting,https://openreview.net/forum?id=a-bD9-0ycs0,https://openreview.net/pdf?id=a-bD9-0ycs0,,"Over the past four years, models based on Neural Ordinary Differential Equations have become state of the art in the forecasting of irregularly sampled time series. Describing the data-generating process as a dynamical system in continuous time allows predictions at arbitrary time points. However, the numerical integration of Neural ODEs typically comes with a high computational burden or may even fail completely. We propose a novel Neural ODE model that embeds the observations into a latent space with dynamics governed by a linear ODE. Consequently, we do not require any specialized numerical integrator but only an implementation of the matrix exponential readily available in many numerical linear algebra libraries. We also introduce a novel state update component inspired by the classical Kalman filter, which, to our knowledge, makes our model the first Neural ODE variant to explicitly satisfy a specific self-consistency property. It allows forecasting irregularly sampled time series with missing values and comes with some numerical stability guarantees. We evaluate the performance on medical and climate benchmark datasets, where the model outperforms the state of the art by margins up to 30%.","Time Series Forecasting, Neural ODE, Kalman Filter, Koopman Operator, Missing Values" Transformer-based model for symbolic regression via joint supervised learning,https://openreview.net/forum?id=ULzyv9M1j5,https://openreview.net/pdf?id=ULzyv9M1j5,,"Symbolic regression (SR) is an important technique for discovering hidden mathematical expressions from observed data. Transformer-based approaches have been widely used for machine translation due to their high performance, and have recently shown promise for SR. They input the data points, then output the expression skeleton, and finally optimize the coefficients.
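The practical payoff of a linear latent ODE is that the flow map is just a matrix exponential; a minimal scipy sketch, with a randomly generated stand-in for the learned dynamics matrix `A`:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
A = rng.normal(scale=0.3, size=(4, 4))   # stand-in for the learned latent dynamics matrix
z0 = rng.normal(size=4)                  # latent state encoded from past observations

# The solution of dz/dt = A z is z(t) = expm(A t) z0: no numerical ODE solver is
# needed, and predictions at arbitrary (irregular) times are all equally cheap.
for t in [0.1, 0.37, 2.5]:
    z_t = expm(A * t) @ z0
    print(t, z_t)
```

A decoder network (omitted here) would then map each latent `z_t` back to observation space.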
However, recent transformer-based methods for SR focus on large-scale training data and ignore an ill-posed problem: the lack of sufficient supervision, i.e., completely different expressions may receive the same supervision because they share a skeleton, which makes it challenging to deal with data that may come from the same expression skeleton but with different coefficients. Therefore, we present a transformer-based model for SR with the ability to alleviate this problem. Specifically, we leverage a feature extractor based on pure residual MLP networks to obtain more information about data points. Furthermore, the core idea is that we propose a joint learning mechanism combining supervised contrastive learning, which makes features of data points from expressions with the same skeleton more similar, effectively alleviating the ill-posed problem. Benchmark results show that the proposed method achieves a skeleton recovery rate up to 25% higher than that of typical transformer-based methods. Moreover, our method outperforms state-of-the-art SR methods based on reinforcement learning and genetic programming in terms of the coefficient of determination ($R^2$).", Gradient-Based Transfer Learning,https://openreview.net/forum?id=hChYEyebNm1,https://openreview.net/pdf?id=hChYEyebNm1,We formulate transfer learning as a meta-learning problem and extend current gradient-based meta-learning methods to this setting. ,"We formulate transfer learning as a meta-learning problem by extending upon the current meta-learning paradigm in that support and query data are drawn from different, but related distributions of tasks. Inspired by the success of Gradient-Based Meta-Learning (GBML), we propose to expand it to the transfer learning setting by constructing a general encoder-decoder architecture that learns a map between functionals of different domains. This is achieved by leveraging the idea that the task-adapted parameters of a meta-learner can serve as an informative representation of the task itself. We demonstrate the proposed method on regression, prediction of dynamical systems and meta-imitation learning problems.","meta-learning, gradient-based meta-learning, transfer learning, representation learning" Coreset for Rational Functions,https://openreview.net/forum?id=pgJp7rDc_hk,https://openreview.net/pdf?id=pgJp7rDc_hk,,"We consider the problem of fitting a rational function $f:\mathbb{R}\to\mathbb{R}$ to a time-series $g:\{1,\cdots,n\}\to\mathbb{R}$. This is done by minimizing the sum of distances (loss function) $\ell(f):=\sum_{i=1}^n |f(i)-g(i)|$, possibly with additional constraints and regularization terms that may depend on $f$. Our main motivation is to approximate such a time-series by a recursive sequence model $F_n=\sum_{i=1}^k \theta_i F_{n-i}$, e.g. a Fibonacci sequence, where $\theta\in \mathbb{R}^k$ are the model parameters, and $k\geq1$ is constant. For $\varepsilon\in(0,1)$, an $\varepsilon$-coreset for this problem is a small data structure that approximates $\ell(f)$ up to a $1\pm\varepsilon$ multiplicative factor, for every rational function $f$ of constant degree. We prove that every signal has an $\varepsilon$-coreset of size $O(n^{0.001}/\varepsilon^2)$, and provide a construction algorithm that computes it in $O(n^{1.001})$ time.
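For intuition on the recursive sequence model $F_n=\sum_{i=1}^k \theta_i F_{n-i}$ just mentioned, fitting $\theta$ reduces to a linear least-squares problem; a toy numpy illustration (the paper's contribution is the coreset that preserves such fits, not this solver):

```python
import numpy as np

def fit_recurrence(g, k):
    """Least-squares fit of theta in F_n = sum_i theta_i * F_{n-i} to a signal g."""
    # Row n of X holds (g[n-1], ..., g[n-k]); the target is g[n].
    X = np.column_stack([g[k - i - 1: len(g) - i - 1] for i in range(k)])
    y = g[k:]
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

# Noisy Fibonacci: F_n = F_{n-1} + F_{n-2}, so we expect theta ~ (1, 1).
g = np.array([1.0, 1.0])
for _ in range(20):
    g = np.append(g, g[-1] + g[-2])
g += np.random.default_rng(0).normal(scale=0.01, size=len(g))
print(fit_recurrence(g, k=2))  # ~ [1., 1.]
```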
Open source code is provided, as well as extensive experimental results, on both real and synthetic datasets, which compare our method to existing solvers from Scipy.","Coreset, Auto-regression, rational functions, non-convex optimization" Joint Representations of Text and Knowledge Graphs for Retrieval and Evaluation,https://openreview.net/forum?id=EgJ0PbRPkCW,https://openreview.net/pdf?id=EgJ0PbRPkCW,"We learn joint representations for knowledge base elements and corresponding text, which allow us to perform retrieval and reference-less adequacy evaluation","A key feature of neural models is that they can produce semantic vector representations of objects (texts, images, speech, etc.) ensuring that similar objects are close to each other in the vector space. While much work has focused on learning text, image, knowledge-base (KB) and image-text representations, there are no aligned cross-modal text-KB representations. One challenge for learning such representations is the lack of parallel data. We train retrieval models on datasets of (graph, text) pairs where the graph is a KB subgraph and the text has been heuristically aligned with the graph. When performing retrieval on WebNLG, a clean parallel corpus, our best model achieves 80\% accuracy and 99\% recall@10, showing that similar texts and KB graphs are mapped close to each other. We use this property to create a similarity metric between English text and KB graphs, matching state-of-the-art metrics in terms of correlation with human judgments even though, unlike them, it does not require a reference text to compare against.","Representation learning, Text generation, Knowledge bases, Evaluation" Transformer needs NMDA receptor nonlinearity for long-term memory,https://openreview.net/forum?id=0z_cXcu1N6o,https://openreview.net/pdf?id=0z_cXcu1N6o,,"The NMDA receptor (NMDAR) in the hippocampus is essential for learning and memory. We find an interesting resemblance between deep models' nonlinear activation function and the NMDAR's nonlinear dynamics. In light of a recent study that compared the transformer architecture to the formation of hippocampal memory, this paper presents new findings that NMDAR-like nonlinearity may be essential for consolidating short-term working memory into long-term reference memory. We design a navigation task assessing these two memory functions and show that manipulating the activation function (i.e., mimicking the Mg$^{2+}$-gating of NMDAR) disrupts long-term memory formation. Our experimental data suggest that the concept of place cells and reference memory may reside in the feed-forward network layer of transformers and that nonlinearity plays a key role in these processes. Our findings suggest that the transformer architecture and hippocampal spatial representation resemble each other in sharing NMDAR-like nonlinearity.","NMDAR, hippocampus, transformer, memory" Simple Spectral Graph Convolution from an Optimization Perspective,https://openreview.net/forum?id=cZM4iZmxzR7,https://openreview.net/pdf?id=cZM4iZmxzR7,We define a learnable and unsupervised graph convolution framework as self-representation on graph.,"Recent studies on SGC, PageRank and S\textsuperscript{2}GC have demonstrated that several graph diffusion techniques are straightforward, quick, and effective for tasks in the graph domain like node classification.
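For context, the label-free graph diffusion these methods build on takes only a few lines; a minimal numpy sketch of SGC-style propagation with the symmetrically normalized adjacency (toy graph and attributes, not the paper's experimental setup):

```python
import numpy as np

def sgc_features(adj, X, k=2):
    """SGC-style diffusion: propagate attributes k steps over the normalized graph.

    A_hat = D^{-1/2} (A + I) D^{-1/2}; returns A_hat^k X, computed without labels.
    """
    A = adj + np.eye(adj.shape[0])            # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    A_hat = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    for _ in range(k):
        X = A_hat @ X
    return X

# Toy 4-node path graph with 3-dimensional node attributes.
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
X = np.random.default_rng(0).normal(size=(4, 3))
print(sgc_features(adj, X, k=2))
```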
Even though these techniques do not need labels, they can nevertheless produce more discriminating features than raw attributes for downstream tasks with different classifiers. These methods are data-independent and thus primarily rely on some empirical parameters on polynomial bases (e.g., Monomial and Chebyshev), which ignore the homophily of graphs and the attribute distribution. Moreover, their low-pass filtering makes them poorly suited to heterophilous graphs. Although many approaches focus on GNNs for heterophilous graphs, these approaches depend on label information to learn model parameters. In this paper, we study the question: are labels a necessity for GNNs with heterophilous graphs? Based on this question, we propose a framework of self-representation on graphs related to the Least Squares problem. Specifically, we use the Generalized Minimum RESidual (GMRES) method, which finds the least squares solution over Krylov subspaces. Our theoretical analysis shows that, even without label information, we enjoy better features with graph convolution. The proposed method, like previous data-independent methods, is not a deep model and is, therefore, quick, scalable, and simple. We also show performance guarantees for models on real and synthetic data. On a benchmark of real-world datasets, empirically, our method is competitive with existing deep models for node classification.","Graph Convolution, Graph Fourier Transformation, Unsupervised Learning" QAID: Question Answering Inspired Few-shot Intent Detection,https://openreview.net/forum?id=gNI4_85Cyve,https://openreview.net/pdf?id=gNI4_85Cyve,"Our method achieves SOTA results on few-shot intent detection by combining a Question-Answering architecture, Contrastive Learning techniques and use of the intent name as the answer. ","Intent detection with semantically similar fine-grained intents is a challenging task. To address it, we reformulate intent detection as a question-answering task by treating utterances and intent names as questions and answers. To that end, we utilize a question-answering retrieval architecture and adopt a two-stage training schema with batch contrastive loss. In the first stage, we train the model to learn better query representations in a self-supervised manner. Then, in the second stage, we fine-tune the model to optimize contextualized token-level similarity scores between queries and answers from the same intent. Our results on three few-shot intent detection benchmarks achieve state-of-the-art performance.","Intent Detection, Question Answering, Contrastive Learning, Passage Retrieval" Rethinking the Value of Prompt Learning for Vision-Language Models,https://openreview.net/forum?id=1FsdIfRngtw,https://openreview.net/pdf?id=1FsdIfRngtw,,"Large-scale visual-language pre-training like CLIP has demonstrated great success in open-set visual concept learning that enables zero-shot transfer to downstream tasks through prompting. To automate prompt engineering, prompt learning is proposed to automatically learn the optimal task-relevant prompts. In this paper, we make some surprising observations that contradict common beliefs about prompts. We observe that even random prompts can achieve pretty good performance for zero-shot recognition. We also find that prompt learning gives comparable or worse performance than directly fine-tuning the linear classifier. Moreover, prompt learning is no more than parameter-efficient learning, and is a trade-off between optimality and generalization.
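A hedged sketch of how GMRES can produce smoothed node features by solving a linear system over the graph; the personalized-PageRank-style system below is an assumption chosen for illustration, not necessarily the paper's exact self-representation objective:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

rng = np.random.default_rng(0)
n, d, alpha = 50, 8, 0.9
A = (rng.random((n, n)) < 0.1).astype(float)
A = np.maximum(A, A.T)                                  # symmetric toy graph
A_loop = A + np.eye(n)                                  # add self-loops
d_inv_sqrt = 1.0 / np.sqrt(A_loop.sum(axis=1))
A_hat = A_loop * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
X = rng.normal(size=(n, d))                             # raw node attributes

# Solve (I - alpha * A_hat) Z = (1 - alpha) X with GMRES, column by column.
# GMRES finds the least-squares solution over Krylov subspaces, so no labels
# and no hand-tuned polynomial-basis coefficients are needed.
op = LinearOperator((n, n), matvec=lambda v: v - alpha * (A_hat @ v), dtype=float)
Z = np.column_stack([gmres(op, (1 - alpha) * X[:, j], atol=1e-8)[0] for j in range(d)])
print(Z.shape)  # (50, 8): smoothed features for any downstream classifier
```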
Our results highlight the need to rethink existing prompt learning and to conduct more careful baseline evaluations in future research on prompt learning methods in vision-language models. ","Prompt Tuning, Visual-Language Pre-training" Partial Output Norm: Mitigating the Model Output Blow-up Effect of Cross Entropy Loss ,https://openreview.net/forum?id=zygzt8QFsV,https://openreview.net/pdf?id=zygzt8QFsV,,"Cross entropy loss is a very popular optimization objective and has been successfully applied for diverse classification tasks. The discrepancy between the cross entropy objective and the real classification target is not fully studied, because researchers usually regard such discrepancy as the price to pay for a differentiable objective that can be optimized through gradient-based methods. In this paper, we carefully study this discrepancy and find that it leads to a side effect: the model output has a useless growth tendency even when the classification result is already correct. We call this side effect the ""model output blow-up effect"". This effect distracts the CE objective from effective updates, which negatively influences model training. To mitigate this side effect, we introduce a partial normalization layer that regularizes the model output to reduce its useless growth tendency. We further provide theoretical analysis of our findings and our approach. The experimental results show that the proposed partial normalization layer improves model training, and it can be combined with other methods like weight decay to achieve additional performance gains. ", Disentangled Feature Swapping Augmentation for Weakly Supervised Semantic Segmentation,https://openreview.net/forum?id=pW_jGk1D_Ww,https://openreview.net/pdf?id=pW_jGk1D_Ww,We propose a novel feature augmentation for weakly supervised semantic segmentation to prevent the classifier from being biased by misleading correlation.,"Weakly supervised semantic segmentation utilizes a localization map obtained from a classifier to generate a pseudo-mask. However, classifiers utilize background cues to predict class labels because of a biased dataset consisting of images, in which specific objects frequently co-occur with certain backgrounds. Consequently, the classifier confuses the background with the target objects, resulting in inaccurate localization maps. To this end, we propose DisEntangled FeaTure swapping augmentation (DEFT) to prevent the classifier from being biased by a misleading correlation. Our method first disentangles the foreground and background features. Then, we randomly swap the disentangled features within mini-batches via a two-way process. These features contain various contexts that do not appear in the biased dataset, but the class-relevant representation is preserved. In addition, we introduce training schemes to obtain further performance gains. Experimental results showed that when our augmentation was used in various weakly supervised semantic segmentation methods trained on the Pascal VOC 2012 dataset, the performance of the localization maps and pseudo-masks as well as the segmentation results improved.
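The abstract above does not spell out the partial normalization layer, so the following torch sketch shows only one plausible way to curb the blow-up effect it describes: capping the logit norm so that pure logit growth past the cap no longer changes the loss (an illustration, not the paper's layer):

```python
import torch
import torch.nn.functional as F

def norm_capped_logits(logits, max_norm=10.0):
    """Rescale each sample's logit vector so its L2 norm never exceeds max_norm.

    Beyond the cap, uniformly growing the logits leaves the output unchanged,
    so the useless growth direction receives zero gradient.
    """
    norms = logits.norm(dim=-1, keepdim=True).clamp(min=1e-12)
    return logits * torch.clamp(max_norm / norms, max=1.0)

logits = torch.tensor([[30.0, -5.0, -5.0]], requires_grad=True)     # correct, huge margin
target = torch.tensor([0])
print(F.cross_entropy(logits, target).item())                       # plain CE still rewards growth
print(F.cross_entropy(norm_capped_logits(logits), target).item())   # capped: growth saturates
```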
","Weakly Supervised Semantic Segmentation, Data Augmentation, Feature Disentanglement" FLOP: Tasks for Fitness Landscapes Of Protein families using sequence- and structure-based representations,https://openreview.net/forum?id=gHwpv9pSEP2,https://openreview.net/pdf?id=gHwpv9pSEP2,Novel benchmark dataset for exploration of single family protein fitness landscapes for protein engineering,"Protein engineering has the potential to create optimized protein variants with improved properties and function. An initial step in the protein optimization process typically consists of a search among natural (wildtype) sequences to find the naturally occurring protein with the most desirable properties. This chosen candidate is then the basis for the second step: a more local optimization procedure, exploring the space of variants separated from this candidate by a few mutations. While advances in protein representation learning promise to facilitate the exploration of wildtype space, results from real-life cases are often underwhelming, and progress in the area difficult to track. In this paper, we have carefully curated a representative benchmark dataset, which reflects industrially-relevant scenarios for the initial wildtype exploration of protein engineering. We focus on the exploration within a protein family or superfamily, and investigate the downstream predictive power of various dominating protein representation paradigms, i.e., transformer-based language representations, structure-based representations, and evolution-based representations. Our benchmark highlights the importance of coherent split strategies, and how we can be misled into overly optimistic estimates of the state of the field. We hope our benchmark can drive further methodological developments in this important field.","protein engineering, representation learning, generalization, benchmark, enzyme engineering, protein structure, protein language model" Distributed Least Square Ranking with Random Features,https://openreview.net/forum?id=tORS9qGBNpT,https://openreview.net/pdf?id=tORS9qGBNpT,"We study the statistical properties of pairwise ranking using distributed learning and random features, establish the convergence rate in probability, and demonstrate the power of the proposed methods via numerical experiments.","In this paper, we study the statistical properties of pairwise ranking using distributed learning and random features (called DRank-RF) and establish its convergence analysis in probability. Theoretical analysis shows that DRank-RF remarkably reduces the computational requirements while preserving a satisfactory convergence rate. An extensive experiment verifies the effectiveness of DRank-RF. Furthermore, to improve the learning performance of DRank-RF, we propose an effective communication strategy for it and demonstrate the power of communications via theoretical assessments and numerical experiments.","least square ranking, distributed learning, learning theory, random features" Doing Fast Adaptation Fast: Conditionally Independent Deep Ensembles for Distribution Shifts,https://openreview.net/forum?id=17RDXeF-skZ,https://openreview.net/pdf?id=17RDXeF-skZ,,"Classifiers in a diverse ensemble capture distinct predictive signals, which is valuable for datasets containing multiple strongly predictive signals. Performing fast adaptation at test time allows us to generalize to distributions where certain signals are no longer predictive, or to avoid relying on sensitive or protected attributes. 
However, ensemble learning is often expensive, even more so when we need to enforce diversity constraints between the high-dimensional representations of the classifiers. Instead, we propose an efficient and fast method for learning ensemble diversity. We minimize the conditional mutual information between classifiers' output distributions, a quantity which can be cheaply and exactly computed from empirical data. The resulting ensemble contains individually strong predictors that are dependent only through the label they predict. We demonstrate the efficacy of our method on shortcut learning tasks. Performing fast adaptation on our ensemble selects shortcut-invariant models that generalize well to test distributions where the shortcuts are uncorrelated with the label. ","deep ensemble, diverse ensemble, shortcut learning, spurious correlations, conditional mutual information" Solving stochastic weak Minty variational inequalities without increasing batch size,https://openreview.net/forum?id=ejR4E1jaH9k,https://openreview.net/pdf?id=ejR4E1jaH9k,Weak MVIs can be solved with only stochastic feedback using extragradient-like algorithms by introducing a bias-correction term,"This paper introduces a family of stochastic extragradient-type algorithms for a class of nonconvex-nonconcave problems characterized by the weak Minty variational inequality (MVI). Unlike existing results on extragradient methods in the monotone setting, employing diminishing stepsizes is no longer possible in the weak MVI setting. This has led to approaches such as increasing batch sizes per iteration, which can however be prohibitively expensive. In contrast, our proposed method involves two stepsizes and requires only one additional oracle evaluation per iteration. We show that it is possible to keep one fixed stepsize while it is only the second stepsize that is taken to be diminishing, making it interesting even in the monotone setting. Almost sure convergence is established and we provide a unified analysis for this family of schemes which contains a nonlinear generalization of the celebrated primal dual hybrid gradient algorithm. ","Variational inequalities, stochastic first-order methods, nonconvex-nonconcave, minimax" Diversity Boosted Learning for Domain Generalization with a Large Number of Domains,https://openreview.net/forum?id=8Ygoj2IeXfW,https://openreview.net/pdf?id=8Ygoj2IeXfW,We propose a novel sampling framework to efficiently sample the most informative domains and data points to help train robust models against two kinds of spurious correlations in Domain Generalization field.,"Machine learning algorithms minimizing the average training loss typically suffer from poor generalization performance. This has inspired various works on domain generalization (DG), among which a series of methods rely on $O(n^2)$ pairwise domain operations over $n$ domains, where each one is often costly. Moreover, while a common objective in the DG literature is to learn invariant representations against spurious correlations induced by domains, we point out the insufficiency of it and highlight the importance of alleviating spurious correlations caused by objects. Based on the observation that diversity helps mitigate spurious correlations, we propose a Diversity boosted twO-level saMplIng framework (DOMI) to efficiently sample the most informative ones among a large number of domains and data points.
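The conditional mutual information minimized above can indeed be computed exactly from empirical counts when predictions are discrete; a small numpy sketch for two predictors and a label (toy data, natural-log units):

```python
import numpy as np

def conditional_mutual_information(pred1, pred2, y, n_classes):
    """Empirical I(pred1; pred2 | y) in nats from three discrete arrays."""
    joint = np.zeros((n_classes,) * 3)
    for a, b, c in zip(pred1, pred2, y):
        joint[a, b, c] += 1
    joint /= joint.sum()
    p_y = joint.sum((0, 1))   # P(y)
    p_1y = joint.sum(1)       # P(pred1, y)
    p_2y = joint.sum(0)       # P(pred2, y)
    cmi = 0.0
    for a in range(n_classes):
        for b in range(n_classes):
            for c in range(n_classes):
                if joint[a, b, c] > 0:
                    cmi += joint[a, b, c] * np.log(
                        joint[a, b, c] * p_y[c] / (p_1y[a, c] * p_2y[b, c]))
    return cmi

y = np.random.default_rng(0).integers(0, 2, size=1000)
noise = np.random.default_rng(1).integers(0, 2, size=1000)
print(conditional_mutual_information(y, y, y, 2))          # two copies of y: ~0 given y
print(conditional_mutual_information(noise, noise, y, 2))  # shared non-label signal: > 0
```

The second case is exactly the failure mode the objective penalizes: two classifiers that agree beyond what the label explains.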
We show that DOMI helps train robust models against spurious correlations from both the domain side and the object side, substantially enhancing the performance of five backbone DG algorithms on Rotated MNIST and Rotated Fashion MNIST.","Domain Generalization, Spurious Correlation" A Hybrid Framework for Generating A Country-scale Synthetic Population,https://openreview.net/forum?id=wSysC6I_S0z,https://openreview.net/pdf?id=wSysC6I_S0z,This paper provides a hybrid framework to generate a country-scale synthetic population and also provides metrics to assess the quality of our population.,"Population censuses are vital to public policy decision-making. They provide insights into human resources, demography, culture, and economic structure at local, regional, and national levels. However, such surveys are very expensive (especially for low and middle income countries with high populations, such as India), and may also raise privacy concerns, depending upon the kinds of data collected. We introduce a novel hybrid framework which can combine data from multiple real-world surveys (with different, partially overlapping sets of attributes) to produce a real-scale synthetic population of humans. Critically, our population maintains family structures comprising individuals with demographic, socioeconomic, health, and geolocation attributes: this means that our ""fake"" people live in realistic locations, have realistic families, etc. Such data can be used for a variety of purposes: we explore one such use case, agent-based modelling of infectious disease in India. We use both machine learning and statistical metrics to gauge the quality of our synthetic population. Our experimental results show that synthetic data can realistically simulate the population for various administrative units of India, producing real-scale, detailed data at the desired level of zoom -- from cities, to districts, to states, eventually combining to form a country-scale synthetic population.","Synthetic Data, Synthetic Population, Agent-based Modelling, Statistical Methods, Machine Learning" Towards Performance-maximizing Network Pruning via Global Channel Attention,https://openreview.net/forum?id=dNmkN_z72P4,https://openreview.net/pdf?id=dNmkN_z72P4,"GlobalPru is a static channel pruning method which utilizes advantages from both static and dynamic methods via global channel attention, achieving much higher compression rates and better accuracy.","Network pruning has attracted increasing attention recently for its capability of transferring large-scale neural networks (e.g., CNNs) into resource-constrained devices. Such a transfer is typically achieved by removing redundant network parameters while retaining its generalization performance in a static or dynamic pruning manner. Concretely, static pruning usually maintains a larger and fit-to-all (samples) compressed network by removing the same channels for all samples, while dynamic pruning can adaptively remove (more) different channels for different samples and obtain state-of-the-art performance along with a higher compression ratio. However, since the system has to preserve the complete network information for sample-specific pruning, dynamic pruning methods are usually not memory-efficient. In this paper, our interest is to explore a static alternative, dubbed GlobalPru, to conventional static pruning methods that can take into account both compression ratio and model performance maximization.
Specifically, a novel channel attention-based learn-to-rank algorithm is proposed to learn the global channel attention of the network for various samples, wherein each sample-specific channel saliency is forced to reach an agreement on the global ranking. Hence, all samples can empirically share the same pruning priority of channels to achieve channel pruning with minimal performance loss. Extensive experiments demonstrate that the proposed GlobalPru can achieve better performance than state-of-the-art static and dynamic pruning methods by significant margins.","Channel Pruning, Global Attention, Deep Neural Networks, Model Compression" Adaptive Block-wise Learning for Knowledge Distillation,https://openreview.net/forum?id=8XfHh4XSQ0Q,https://openreview.net/pdf?id=8XfHh4XSQ0Q,,"Knowledge distillation allows the student network to improve its performance under the supervision of transferred knowledge. Existing knowledge distillation methods are implemented under the implicit hypothesis that knowledge from the teacher and the student contributes to each layer of the student network to the same extent. In this work, we argue that there should be different contributions of knowledge from the teacher and the student during training for each layer, and experimental results evidence this argument. To this end, we propose a novel Adaptive Block-wise Learning~(ABL) for Knowledge Distillation to automatically balance teacher-guided knowledge and self-knowledge in each block. Specifically, to solve the problem that the error backpropagation algorithm cannot assign weights to each block of the student network independently, we leverage local error signals to approximate the global error signals on student objectives. Moreover, we utilize a set of meta variables to control the contribution of the student knowledge and teacher knowledge to each block during the training process. Finally, extensive experiments prove the effectiveness of our method. Meanwhile, ABL provides an insightful view that in the shallow blocks, the weight of teacher guidance is greater, while in the deep blocks, student knowledge has more influence.","Knowledge distillation, Local error signals, Bilevel optimization" Curriculum-based Co-design of Morphology and Control of Voxel-based Soft Robots,https://openreview.net/forum?id=r9fX833CsuN,https://openreview.net/pdf?id=r9fX833CsuN,Curriculum-based Co-design of Morphology and Control of Voxel-based Soft Robots,"Co-design of morphology and control of a Voxel-based Soft Robot (VSR) for solving a given task is challenging due to the bi-level optimization in the enormous combined design and policy space. In this paper, we present a Curriculum-based Co-design (CuCo) method for learning to design and control VSRs through an easy-to-difficult process. Specifically, we expand the design space from a small size to the target size gradually through a predefined curriculum. At each stage of the curriculum, we use reinforcement learning to simultaneously train the design and policy, which is enabled by incorporating the design process into the environment and using differentiable policy representations. The converged morphology and the learned design and control policies from the last stage are inherited and serve as the starting point for the next stage.
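A hedged sketch of the block-wise balancing idea from the ABL abstract above: per-block meta variables gate a teacher-guided term against a local self-supervision term. The shapes, loss choices, and sigmoid gating are illustrative assumptions; the paper optimizes the meta variables via bilevel optimization, which is omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_blocks = 4
gates = nn.Parameter(torch.zeros(n_blocks))  # meta variables, one per block

def blockwise_loss(student_feats, teacher_feats, local_logits, labels):
    """Per-block loss: sigmoid(gate) weights teacher guidance vs. self supervision."""
    total = 0.0
    for b in range(n_blocks):
        w = torch.sigmoid(gates[b])
        kd = F.mse_loss(student_feats[b], teacher_feats[b])   # teacher-guided term
        ce = F.cross_entropy(local_logits[b], labels)         # local (self) error signal
        total = total + w * kd + (1 - w) * ce
    return total

# Toy tensors standing in for per-block features and auxiliary classifier logits.
feats_s = [torch.randn(8, 16, requires_grad=True) for _ in range(n_blocks)]
feats_t = [torch.randn(8, 16) for _ in range(n_blocks)]
logits = [torch.randn(8, 10, requires_grad=True) for _ in range(n_blocks)]
labels = torch.randint(0, 10, (8,))
loss = blockwise_loss(feats_s, feats_t, logits, labels)
loss.backward()
print(loss.item(), gates.grad)  # each block gets its own teacher/self trade-off
```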
In empirical studies, we show that CuCo is more efficient in creating larger robots with better performance by reusing the practical design and control patterns learned within each stage, in comparison to prior approaches that learn from scratch in the space of the target size.","Artificial Life, Brain-body Co-design, Robotics, Modular Soft Robots" Object-Centric Learning with Slot Mixture Models,https://openreview.net/forum?id=AqX3oSbzyQ1,https://openreview.net/pdf?id=AqX3oSbzyQ1,"We propose to use a Gaussian Mixture Model to represent slots in object-centric tasks, which leads to more expressive slot representations and state-of-the-art results in the set property prediction task.","Object-centric architectures usually apply some differentiable module on the whole feature map to decompose it into sets of entity representations called slots. Some of these methods structurally resemble clustering algorithms, where the center of a cluster in latent space serves as the slot representation. Slot Attention is an example of such a method, as a learnable analog of the soft k-Means algorithm. In our work, we use a learnable clustering method based on the Gaussian Mixture Model; unlike other approaches, we represent slots not only as centers of clusters but also use information about the distance between clusters and assigned vectors, which leads to more expressive slot representations. Our experiments demonstrate that using this approach instead of Slot Attention improves performance in different scenarios, achieving state-of-the-art performance in the set property prediction task.","object-centric task, gaussian mixture model, slot attention" WiNeRT: Towards Neural Ray Tracing for Wireless Channel Modelling and Differentiable Simulations,https://openreview.net/forum?id=tPKKXeW33YU,https://openreview.net/pdf?id=tPKKXeW33YU,Neural wireless ray tracer,"In this paper, we work towards a neural surrogate to model wireless electro-magnetic propagation effects in indoor environments. Such neural surrogates provide a fast, differentiable, and continuous representation of the environment and enable end-to-end optimization for downstream tasks (e.g., network planning). Specifically, the goal of the paper is to render the wireless signal (e.g., time-of-flights, power of each path) in an environment as a function of the sensor's spatial configuration (e.g., placement of transmit and receive antennas). NeRF-based approaches have shown promising results in the visual setting (RGB image signal, with a camera sensor), where the key idea is to algorithmically evaluate the `global' signal (e.g., using volumetric rendering) by breaking it down into a sequence of `local' evaluations (e.g., using co-ordinate neural networks). In a similar spirit, we model the time-angle channel impulse response (the global wireless signal) as a superposition of multiple paths. The wireless characteristics (e.g., power) of each path are a result of multiple evaluations of a neural network that learns implicit ray-surface interaction properties. We evaluate our approach in multiple indoor scenarios and demonstrate that our model achieves strong performance (e.g., $<$0.33ns error in time-of-flight predictions).
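To make the contrast with soft k-Means concrete, one EM-style iteration of a diagonal-covariance Gaussian mixture over encoder features looks as follows; a torch sketch with toy shapes, whereas the trained Slot Mixture model wraps such updates in a learned, differentiable module:

```python
import torch

def gmm_slot_step(feats, mu, log_var, log_pi):
    """One EM iteration: feats (N, D), mu/log_var (K, D), log_pi (K,)."""
    var = log_var.exp()
    # E-step: log N(x | mu_k, diag(var_k)) + log pi_k, normalized over slots k
    # (the constant log(2*pi) term cancels in the softmax).
    diff = feats[:, None, :] - mu[None, :, :]                  # (N, K, D)
    log_prob = -0.5 * ((diff ** 2 / var) + log_var).sum(-1) + log_pi
    resp = log_prob.softmax(dim=1)                             # (N, K) responsibilities
    # M-step: weighted updates; distances enter via var, unlike soft k-Means.
    nk = resp.sum(0).clamp(min=1e-6)
    mu_new = (resp.T @ feats) / nk[:, None]
    var_new = (resp.T @ (feats ** 2)) / nk[:, None] - mu_new ** 2
    return mu_new, var_new.clamp(min=1e-6).log(), nk.log() - nk.sum().log()

feats = torch.randn(100, 8)                      # encoder features of one image
mu, log_var = torch.randn(5, 8), torch.zeros(5, 8)
log_pi = (torch.ones(5) / 5).log()
mu, log_var, log_pi = gmm_slot_step(feats, mu, log_var, log_pi)
print(mu.shape, log_var.shape)  # slots can combine mu with the variance statistics
```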
Furthermore, we demonstrate that our neural surrogate whitens the `black-box' wireless simulators, and thus enables inverse rendering applications (e.g., user localization).","neural rendering, wireless, ray tracing, nerf" Pocket-specific 3D Molecule Generation by Fragment-based Autoregressive Diffusion Models,https://openreview.net/forum?id=HGsoe1wmRW5,https://openreview.net/pdf?id=HGsoe1wmRW5,Using fragment-based autoregressive diffusion model to generate 3D molecules for protein binding pockets,"Autoregressive models are widely adopted to generate 3D molecules that can fit any protein binding pocket. Current autoregressive models suffer from two major drawbacks. First, it is hard to capture local geometric patterns as only one atom is generated at each step. Second, most of the autoregressive models generate atoms and chemical bonds in two separate processes, which causes a number of problems such as incorrect counts of rings, a biased distribution of bond lengths, and inaccurate 3D molecular structures. To tackle these problems, we designed a model, named FragDiff, to generate 3D molecules fragment-by-fragment for pockets. In each generation step, FragDiff places a molecular fragment around the pocket by using E(3)-equivariant diffusion generative models to simultaneously predict the atom types, atom coordinates and the chemical bonds of the fragment. Extensive experimental results confirm our assumption that unifying the atom and bond generation could significantly improve the quality of the sampled 3D molecules in terms of more accurate distributions of 2D subgraphs and 3D substructures.","3D molecule generation, drug design, protein binding pocket, generative model, diffusion model" Learning with Non-Uniform Label Noise: A Cluster-Dependent Semi-Supervised Approach,https://openreview.net/forum?id=P_48ZG7ySK,https://openreview.net/pdf?id=P_48ZG7ySK,"For robust learning with non-uniform label noise, we propose a cluster-dependent sample selection algorithm followed by a semi-supervised training mechanism.","Learning with noisy labels is a challenging task in machine learning. Most existing methods explicitly or implicitly assume uniform label noise across all samples. In reality, label noise can be highly non-uniform in the feature space, with a higher error rate for more difficult samples. Some recent works consider instance-dependent label noise but they require additional information such as some cleanly labeled data and confidence scores, which are usually unavailable or costly to obtain. In this paper, we consider learning with non-uniform label noise that requires no such additional information. We propose a cluster-dependent sample selection algorithm followed by a semi-supervised training mechanism based on the cluster-dependent label noise. The proposed self-adaptive multi-scale sample selection method increases the consistency of the sample space by forcing the selection of clean samples from the entire feature space. Despite its simplicity, the proposed method can distinguish clean data from the corrupt ones more precisely and achieve state-of-the-art performance on image classification benchmarks, especially when the number of training samples is small and the noise rate is large.","Non-uniform label noise, Cluster-dependent sample selection mechanism, Semi-supervised training." 
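A hedged sketch of the cluster-dependent selection idea from the label-noise abstract above: select small-loss samples per feature-space cluster rather than with one global threshold. KMeans, the keep fraction, and the cluster count are illustrative stand-ins, not the paper's exact algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_clean(features, losses, n_clusters=10, keep_frac=0.7):
    """Select likely-clean samples per cluster instead of with one global cutoff.

    Under non-uniform noise, hard regions have higher loss overall; a global
    cutoff would discard them wholesale, while a per-cluster cutoff keeps clean
    samples from every part of the feature space.
    """
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(features)
    keep = np.zeros(len(losses), dtype=bool)
    for c in range(n_clusters):
        idx = np.where(clusters == c)[0]
        if len(idx) == 0:
            continue
        cutoff = np.quantile(losses[idx], keep_frac)
        keep[idx[losses[idx] <= cutoff]] = True
    return keep

features = np.random.default_rng(0).normal(size=(500, 32))   # toy embeddings
losses = np.random.default_rng(1).exponential(size=500)      # toy per-sample losses
print(select_clean(features, losses).sum(), "samples kept")
```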
Towards scalable and non-IID robust Hierarchical Federated Learning via Label-driven Knowledge Aggregator,https://openreview.net/forum?id=3WYtm7UzsR,https://openreview.net/pdf?id=3WYtm7UzsR,We propose a Hierarchical FL framework to divide and conquer non-IID group-by-group,"In real-world applications, Federated Learning (FL) meets two challenges: (1) scalability, especially when applied to massive IoT networks, and (2) robustness against environments with heterogeneous data. To address the first problem, we design a novel FL framework named Full-stack FL (F2L). More specifically, F2L utilizes a hierarchical network architecture, making it possible to extend the FL network without reconstructing the whole network system. Moreover, leveraging the advantages of hierarchical network design, we propose a new label-driven knowledge distillation (LKD) technique at the global server to address the second problem. As opposed to current knowledge distillation techniques, LKD is capable of training a student model that consolidates good knowledge from all teachers' models. Therefore, our proposed algorithm can effectively extract the knowledge of the regions' data distributions (i.e., the regional aggregated models) to reduce the divergence between clients' models when operating under an FL system with non-independent identically distributed data. Extensive experimental results reveal that: (i) our F2L method can significantly improve the overall FL efficiency in all global distillations, and (ii) F2L rapidly achieves convergence as global distillation stages occur, instead of increasing on each communication cycle.","Federated Learning, Knowledge Distillation, non-IID" Free Bits: Platform-Aware Latency Optimization of Mixed-Precision Neural Networks for Edge Deployment,https://openreview.net/forum?id=_GcWoi0SQm,https://openreview.net/pdf?id=_GcWoi0SQm,"By combining differentiable precision search with platform-aware heuristics, we can reduce end-to-end latency of DNNs running on microcontrollers by up to 29.2%.","Mixed-precision quantization, where a deep neural network's layers are quantized to different precisions, offers the opportunity to optimize the trade-offs between model size, latency, and statistical accuracy beyond what can be achieved with homogeneous-bit-width quantization. However, the search space for layer-wise quantization policies is intractable, and the execution latency of mixed-precision networks is related non-trivially and non-monotonically to precision, depending on the deployment target. This establishes the need for hardware-aware, directed heuristic search algorithms. This paper proposes a hybrid search methodology for mixed-precision network configurations consisting of a hardware-agnostic differentiable search algorithm followed by a hardware-aware heuristic optimization to find mixed-precision configurations latency-optimized for a specific hardware target. We evaluate our algorithm on MobileNetV1 and MobileNetV2 and deploy the resulting networks on a family of multi-core RISC-V microcontroller platforms, each with different hardware characteristics. We achieve up to $29.2\%$ reduction of end-to-end latency compared to an 8-bit model at a negligible accuracy drop from a full-precision baseline on the 1000-class ImageNet dataset. We demonstrate speedups relative to an 8-bit baseline even on systems with no hardware support for sub-byte arithmetic at zero accuracy drop.
Furthermore, we show the superiority of our approach to both a purely heuristic search and differentiable search targeting reduced binary operation counts.","Edge AI, TinyML, Mixed-Precision Quantization" LS-IQ: Implicit Reward Regularization for Inverse Reinforcement Learning,https://openreview.net/forum?id=o3Q4m8jg4BR,https://openreview.net/pdf?id=o3Q4m8jg4BR,We propose a novel perspective on implicit L2 reward regularization for inverse reinforcement learning.,"Recent methods for imitation learning directly learn a $Q$-function using an implicit reward formulation, rather than an explicit reward function. However, these methods generally require implicit reward regularization for improving stability, mistreating or even neglecting absorbing states. Previous works show that a squared norm regularization on the implicit reward function is effective, but do not provide a theoretical analysis of the resulting properties of the algorithms. In this work, we show that using this regularizer under a mixture distribution of the policy and the expert provides a particularly illuminating perspective: the original objective can be understood as squared Bellman error minimization, and the corresponding optimization problem minimizes the $\chi^2$-Divergence between the expert and the mixture distribution. This perspective allows us to address instabilities and properly treat absorbing states. We show that our method, Least Squares Inverse Q-Learning, outperforms state-of-the-art algorithms, particularly in environments with absorbing states. Finally, we propose to use an inverse dynamics model to learn from observations only. Using this approach, we retain performance in settings where no expert actions are available.","Inverse Reinforcement Learning, Imitation Learning, Reward Regularization, Deep Reinforcement Learning" On the Certification of Classifiers for Outperforming Human Annotators,https://openreview.net/forum?id=X5ZMzRYqUjB,https://openreview.net/pdf?id=X5ZMzRYqUjB,"A theory for estimating the performance of a classifier by comparing with human annotators, even when the humans are inferior to the classifier.","This paper addresses a key question in current machine learning research: if we believe that a model's predictions might be better than those given by human experts, how can we (humans) verify these beliefs? In some cases, this ``superhuman'' performance is readily demonstrated; for example by defeating top-tier human players in traditional two player games. On the other hand, it can be challenging to evaluate classification models that potentially surpass human performance. Indeed, human annotations are often treated as a ground truth, which implicitly assumes the superiority of the human over any models trained on human annotations. In reality, human annotators are subjective and can make mistakes. Evaluating the performance with respect to a genuine oracle is more objective and reliable, even when querying the oracle is more expensive or sometimes impossible. In this paper, we first raise the challenge of evaluating the performance of both humans and models with respect to an oracle which is $\textit{unobserved}$. We develop a theory for estimating the accuracy compared to the oracle, using only imperfect human annotations for reference. Our analysis provides an executable recipe for detecting and certifying superhuman performance in this setting, which we believe will assist in understanding the stage of current research on classification. 
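A hedged sketch of the squared implicit-reward regularization discussed in the LS-IQ abstract above: the implicit reward $r = Q(s,a) - \gamma V(s')$ is raised on expert transitions while $r^2$ is penalized over the expert/policy mixture batch. The term weights and the exact composition of the objective are illustrative assumptions, not the paper's full algorithm:

```python
import torch

def implicit_reward_loss(q_sa, v_next, v_s0, is_expert, gamma=0.99, lam=0.5):
    """IQ-style objective sketch with squared implicit-reward regularization.

    r(s, a, s') = Q(s, a) - gamma * V(s') is the implicit reward. We maximize it
    on expert transitions, anchor the value at initial states, and penalize r^2
    over the mixture batch (expert + policy samples), mirroring the chi^2 view.
    """
    r = q_sa - gamma * v_next
    loss = -r[is_expert].mean()              # raise implicit reward on expert data
    loss = loss + (1 - gamma) * v_s0.mean()  # standard initial-state value term
    loss = loss + lam * (r ** 2).mean()      # squared-norm regularizer on the mixture
    return loss

q_sa = torch.randn(64, requires_grad=True)   # Q(s, a) on a mixed batch
v_next, v_s0 = torch.randn(64), torch.randn(8)
is_expert = torch.rand(64) < 0.5             # half expert, half policy samples
implicit_reward_loss(q_sa, v_next, v_s0, is_expert).backward()
print(q_sa.grad is not None)
```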
We validate the convergence of the bounds and the assumptions of our theory on carefully designed toy experiments with known oracles. Moreover, we demonstrate the utility of our theory by meta-analyzing large-scale natural language processing tasks, for which an oracle does not exist, and show that under our mild assumptions a number of models from recent years have already achieved superhuman performance with high probability, suggesting that our new oracle-based performance evaluation metrics are overdue as an alternative to the widely used accuracy metrics that are naively based on imperfect human annotations.","Evaluation theory, Oracle accuracy, Superhuman classifier" Loss Adapted Plasticity: Learning From Data With Unreliable Sources,https://openreview.net/forum?id=VPX0ln_YoG,https://openreview.net/pdf?id=VPX0ln_YoG,"To learn from reliable and unreliable data sources, this paper demonstrates a technique that can be applied to any gradient descent optimiser: Update model weights as a function of the perceived reliability of data sources within a wider data set.","When data is streaming from multiple sources, conventional training methods often update model weights assuming the same level of reliability for each source; that is, a model does not consider the data quality of a specific source during training. In many applications, sources can have varied levels of noise or corruption that can produce negative effects on the learning of a robust machine learning model. A key issue is that the quality of data or labels for individual sources is often not available to a model during training and could vary over time. A solution to this problem is to consider the mistakes made while training on data originating from each source and utilise these to create a perceived data quality for each source. This paper demonstrates a technique that can be applied to any gradient descent optimiser: Update model weights as a function of the perceived reliability of data sources within a wider data set. The algorithm controls the plasticity of a given model to weight updates based on the history of losses from individual data sources. We show that applying this technique can significantly improve model performance when trained on a mixture of reliable and unreliable data sources, and maintain performance when models are trained on data sources that are all considered reliable. ","Data Sources, Unreliable Data, Noisy Data, Noisy Labels" Share Your Representation Only: Guaranteed Improvement of the Privacy-Utility Tradeoff in Federated Learning,https://openreview.net/forum?id=oJpVVGXu9i,https://openreview.net/pdf?id=oJpVVGXu9i,,"Repeated parameter sharing in federated learning causes significant information leakage about private data, thus defeating its main purpose: data privacy. Mitigating the risk of this information leakage, using state-of-the-art differentially private algorithms, also does not come for free. Randomized mechanisms can prevent models from learning even the useful representation functions, especially if there is more disagreement between local models on the classification functions (due to data heterogeneity). In this paper, we consider a representation federated learning objective that encourages various parties to collaboratively refine the consensus part of the model, with differential privacy guarantees, while separately allowing sufficient freedom for local personalization (without releasing it).
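A minimal sketch of the loss-history idea from the Loss Adapted Plasticity abstract above: track an exponential moving average of loss per source and damp weight updates from sources whose history looks unreliable. The EMA and the exponential damping are illustrative choices, not the paper's exact weighting function:

```python
import numpy as np

class SourceReliability:
    """Track an EMA of each source's loss and damp updates from unreliable sources."""
    def __init__(self, n_sources, momentum=0.9):
        self.ema = np.zeros(n_sources)
        self.momentum = momentum

    def update(self, source_id, loss_value):
        m = self.momentum
        self.ema[source_id] = m * self.ema[source_id] + (1 - m) * loss_value

    def plasticity(self, source_id):
        """Weight in (0, 1]: a source whose loss history exceeds the mean gets damped."""
        gap = self.ema[source_id] - self.ema.mean()
        return float(np.exp(-max(gap, 0.0)))

rng = np.random.default_rng(0)
rel = SourceReliability(n_sources=3)
for _ in range(200):                       # source 2 is corrupted: persistently high loss
    for s, base in enumerate([0.3, 0.35, 1.5]):
        rel.update(s, base + 0.05 * rng.normal())
print([round(rel.plasticity(s), 3) for s in range(3)])
# A training step would then scale the update: param -= lr * rel.plasticity(s) * grad
```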
We prove that in the linear representation setting, while the objective is non-convex, CENTAUR converges to a ball centered around the \emph{globally optimal} solution at a linear rate, and the radius of the ball is proportional to the reciprocal of the privacy budget. With this novel utility analysis, we improve the SOTA utility-privacy trade-off for this problem by a factor of $\sqrt{d}$, where $d$ is the input dimension. We empirically evaluate our method with the image classification task on CIFAR10, CIFAR100, and EMNIST, and observe a significant performance improvement over the prior work under the same small privacy budget.","Differential Privacy, Representation Learning, Federated Learning" Quantized Disentangled Representations for Object-Centric Visual Tasks,https://openreview.net/forum?id=JIptuwnqwn,https://openreview.net/pdf?id=JIptuwnqwn,We propose quantised disentangled representations that demonstrate state-of-the-art performance in set prediction tasks among a class of object-centric methods.,"Recently, the pre-quantization of image features into discrete latent variables has helped to achieve remarkable results in image modeling. In this paper, we propose a method to learn discrete latent variables applied to object-centric tasks. In our approach, each object is assigned a slot which is represented as a vector generated by sampling from non-overlapping sets of low-dimensional discrete variables. We empirically demonstrate that embeddings from the learned discrete latent spaces have the disentanglement property. The model is trained with set prediction and object discovery as downstream tasks. It achieves state-of-the-art results on the CLEVR dataset among a class of object-centric methods for the set prediction task. We also demonstrate manipulation of individual objects in a scene with controllable image generation in the object discovery setting.","quantised representation, disentangled representation, object-centric task" Supervised Random Feature Regression via Projection Pursuit,https://openreview.net/forum?id=BDjGGZk9yz,https://openreview.net/pdf?id=BDjGGZk9yz,,"Random feature methods and neural network models are two popular nonparametric modeling methods, which are regarded as representatives of shallow learning and deep learning, respectively. In practice, random feature methods lack the capacity for feature learning, while neural network methods are computationally heavy. This paper aims at proposing a flexible but computationally efficient method for general nonparametric problems. Precisely, our proposed method is a feed-forward two-layer nonparametric estimator: the first layer learns a series of univariate basis functions for each projection variable and then searches for their optimal linear combination within each group of learnt functions. Based on all the features derived in the first layer, the second layer attempts to learn a single-index function with an unknown activation function.
Our nonparametric estimator takes advantage of both random features and neural networks, and can be seen as a bridge between them.","Random Feature, multi-kernel, projection pursuit, semi-parametric regression, neural networks" Graph Spline Networks for Efficient Continuous Simulation of Dynamical Systems,https://openreview.net/forum?id=loc3CUXeuzH,https://openreview.net/pdf?id=loc3CUXeuzH,We propose a novel model to exploit the synergy between graph neural networks and orthogonal spline collocation to accelerate learned simulations of physical systems by interpolating solutions of graph neural networks.,"While complex simulations of physical systems have been widely studied in engineering and scientific computing, lowering their often prohibitive computational requirements has only recently been tackled by deep learning approaches. In this paper, we present GraphSplineNets, a novel deep learning approach to speed up simulation of physical systems with spatio-temporal continuous outputs by exploiting the synergy between graph neural networks (GNN) and orthogonal spline collocation (OSC). Two differentiable OSC schemes, one time-oriented and one space-oriented, are applied to bridge the gap from discrete GNN outputs to continuous solutions at any location in space and time without explicit prior knowledge of the underlying differential equations. Moreover, we introduce an adaptive collocation strategy in space to enable the model to sample from the most important regions. Our model improves on widely used graph neural networks for physics simulation in both efficiency and solution accuracy. We demonstrate GraphSplineNets in predicting complex dynamical systems such as the heat equation, damped wave propagation and the Navier-Stokes equations for incompressible flow, where it improves accuracy by more than 25% while providing at least a 60% speedup. ","Graph, Spline Collocation Method, Graph Neural Networks, Simulation, Partial Differential Equations, PDEs, Physics, Scientific Computing" Online black-box adaptation to label-shift in the presence of conditional-shift,https://openreview.net/forum?id=kL67fyKb6A,https://openreview.net/pdf?id=kL67fyKb6A,Learning hyper-parameters on an OOD validation set can improve online black-box adaptation to label-shift when there is also conditional-shift in deployment,"We consider an out-of-distribution setting where trained predictive models are deployed online in new locations (inducing conditional-shift), such that these locations are also associated with differently skewed target distributions (label-shift). While approaches for online adaptation to label-shift have recently been discussed by Wu et al. (2021), the potential presence of concurrent conditional-shift has not been considered in the literature, although one might anticipate such distributional shifts in realistic deployments. In this paper, we empirically explore the effectiveness of online adaptation methods in such situations on three synthetic and two realistic datasets, comprising both classification and regression problems. We show that it is possible to improve performance in these settings by learning additional hyper-parameters to account for the presence of conditional-shift by using appropriate validation sets. 
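The OSC idea of turning discrete network outputs into a continuous, differentiable solution can be illustrated with a plain cubic spline in time; this is a simplification, since GraphSplineNets uses orthogonal spline collocation in both space and time, and the values below are synthetic stand-ins for GNN predictions:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Discrete predictions at coarse solver steps (stand-ins for real GNN outputs).
t_coarse = np.linspace(0.0, 1.0, 6)
u_coarse = np.sin(2 * np.pi * t_coarse)   # predicted field value at one mesh node

# Spline collocation in time: a continuous solution queryable at any t,
# without re-running the network and without knowing the underlying PDE.
u_continuous = CubicSpline(t_coarse, u_coarse)
t_fine = np.array([0.05, 0.333, 0.71])
print(u_continuous(t_fine))
print(u_continuous(t_fine, 1))  # the first time derivative also comes for free
```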
","label-shift, online, black-box, adaptation, Bayesian" RuDar: Weather Radar Dataset for Precipitation Nowcasting with Geographical and Seasonal Variability,https://openreview.net/forum?id=WVZQa2QYJN,https://openreview.net/pdf?id=WVZQa2QYJN,Weather radar dataset with benchmarks for nowcasting (next frame prediction) tasks with seasonal and geographical dependencies,"Precipitation nowcasting, a short-term (up to six hours) rain prediction, is arguably one of the most demanding weather forecasting tasks. To achieve accurate predictions, a forecasting model should consider miscellaneous meteorological and geographical data sources. Currently available datasets provide information only about precipitation intensity, vertically integrated liquid (VIL), or maximum reflectivity on the vertical section. Such single-level or aggregated data lacks description of the reflectivity change in vertical dimension, simplifying or distorting the corresponding models. To fill this gap, we introduce an additional dimension of the precipitation measurements in the RuDar dataset that incorporates 3D radar echo observations. Measurements are collected from 30 weather radars located mostly in the European part of Russia, covering multiple climate zones. Radar product updates every 10 minutes with a 2 km spatial resolution. The measurements include precipitation intensity (mm/h) at an altitude of 600 m, reflectivity (dBZ) and radial velocity (m/s) at 10 altitude levels from 1 km to 10 km with 1 km step. We also add the orography information as it affects the intensity and distribution of precipitation. The dataset includes over 50 000 timestamps over a two-year period from 2019 to 2021, totalling in roughly 100 GB of data. We evaluate several baselines, including optical flow and neural network models, for precipitation nowcasting on the proposed data. We also evaluate the uncertainty quantification for the ensemble scenario and show that the corresponding estimates do correlate with the ensemble errors on different sections of data. We believe that RuDar dataset will become a reliable benchmark for precipitation nowcasting models and also will be used in other machine learning tasks, e.g., in data shift studying, anomaly detection, or uncertainty estimation. Both dataset and code for data processing and model preparation are publicly available.","precipitation nowcasting, weather forecasting, weather radar, benchmark" Learning Representations for Reinforcement Learning with Hierarchical Forward Models,https://openreview.net/forum?id=jkMT2AtccX,https://openreview.net/pdf?id=jkMT2AtccX,Hierarchical forward models that predict at varying temporal coarseness and learn to communicate lead to more informative representations and better downstream control.,"Learning control from pixels is difficult for reinforcement learning (RL) agents because representation learning and policy learning are intertwined. Previous approaches remedy this issue with auxiliary representation learning tasks, but they either do not consider the temporal aspect of the problem or only consider single-step transitions, which may miss relevant information if important environmental changes take many steps to manifest. We propose Hierarchical $k$-Step Latent (HKSL), an auxiliary task that learns representations via a hierarchy of forward models that operate at varying magnitudes of step skipping while also learning to communicate between levels in the hierarchy. 
We evaluate HKSL in a suite of 30 robotic control tasks with and without distractors and in a task of our own creation. We find that HKSL either converges to higher episodic returns or reaches optimal performance more quickly than several current baselines. Furthermore, we find that HKSL's representations capture task-relevant details accurately across timescales (even in the presence of distractors) and that communication channels between hierarchy levels organize information based on both sides of the communication process, both of which improve sample efficiency.","Reinforcement learning, Representation learning, Continuous control" xTrimoABFold: Improving Antibody Structure Prediction without Multiple Sequence Alignments ,https://openreview.net/forum?id=F5Cj26wfiu,https://openreview.net/pdf?id=F5Cj26wfiu,,"Antibodies, used by the immune system to identify and neutralize foreign objects such as pathogenic bacteria and viruses, play an important role in the immune system. In the field of drug engineering, the essential task is designing a novel antibody whose paratope (substructures in the antibody) binds to the epitope of a specific antigen with high precision. Understanding the structure of an antibody and its paratope can also facilitate a mechanistic understanding of its function. Therefore, antibody structure prediction has always been a highly valuable problem for drug discovery. AlphaFold2, a breakthrough in the field of structural biology, provides a feasible solution to predict protein structure based on protein sequences and computationally expensive coevolutionary multiple sequence alignments (MSAs). However, its computational cost and unsatisfactory prediction accuracy on antibodies, especially on the complementarity-determining regions (CDRs), limit its application to industrial high-throughput drug design. In this paper, we present a novel method named xTrimoABFold to predict antibody structure from the antibody sequence based on a pretrained antibody language model (ALM) as well as homologous templates, which are searched from the Protein Data Bank (PDB) via fast and cheap algorithms. xTrimoABFold outperforms the MSA-based AlphaFold2 and the protein language model based SOTAs, e.g., OmegaFold, HelixFold-Single and IgFold, by a large margin (30+% improvement in RMSD) while running 151x faster than AlphaFold2. To the best of our knowledge, xTrimoABFold is the best antibody structure predictor to date.","Protein structure prediction, antibody structure prediction, amino acid sequence, homologous structure" Thresholded Lexicographic Ordered Multi-Objective Reinforcement Learning,https://openreview.net/forum?id=mmFtinp4wQ_,https://openreview.net/pdf?id=mmFtinp4wQ_,We investigate reinforcement learning for thresholded lexicographic ordered multi-objective settings.,"Lexicographic multi-objective problems, which impose a lexicographic importance order over the objectives, arise in many real-life scenarios. Existing Reinforcement Learning work directly addressing lexicographic tasks has been scarce. The few proposed approaches were all noted to be heuristics without theoretical guarantees, as the Bellman equation is not applicable to them. Additionally, the practical applicability of these prior approaches also suffers from various issues such as not being able to reach the goal state. While some of these issues have been known before, in this work we investigate further shortcomings and propose fixes for improving practical performance in many cases. 
We also present a policy optimization approach using our Lexicographic Projection Optimization (LPO) algorithm that has the potential to address these theoretical and practical concerns. Finally, we demonstrate our proposed algorithms on benchmark problems.","Reinforcement Learning, Lexicographic Ordered Multi-Objectives" HOW SAMPLING AFFECTS TRAINING: AN EFFECTIVE SAMPLING THEORY STUDY FOR LONG-TAILED IMAGE CLASSIFICATION,https://openreview.net/forum?id=5WOIluv9Xop,https://openreview.net/pdf?id=5WOIluv9Xop,,"The long-tailed image classification problem has been very challenging for a long time. Suffering from the unbalanced distribution of categories, many deep vision classification methods perform well on the head classes but poorly on the tail ones. This paper proposes an effective sampling theory, attempting to provide a theoretical explanation for decoupling representation and classifier learning in long-tailed image classification. To apply the above sampling theory in practice, a general jitter sampling strategy is proposed. Experiments show that a variety of long-tailed distribution algorithms exhibit better performance based on the effective sampling theory. The code will be released later.", MolBART: Generative Masked Language Models for Molecular Representations,https://openreview.net/forum?id=-4HJSA3Y2vg,https://openreview.net/pdf?id=-4HJSA3Y2vg,We develop self-supervised representations of molecules using generative masked language models that set state-of-the-art for many chemical property and reaction prediction tasks and implicitly learn features and substructures important in chemistry,"We discover a robust self-supervised strategy tailored towards molecular representations for generative masked language models through a series of in-depth ablations. Using this pre-training strategy, we train MolBART, a BART-like model with an order of magnitude more compute than previous self-supervised molecular representations. In-depth evaluations show that MolBART consistently outperforms other self-supervised representations across classification, regression, and generation tasks, setting a new state-of-the-art on 10 tasks. We then quantitatively show that when applied to the molecular domain, the BART objective learns representations that implicitly encode our downstream tasks of interest. For example, by selecting seven neurons from a frozen MolBART, we can obtain a model having performance within two percentage points of the fully fine-tuned model on the ClinTox task. Lastly, we show that standard attribution interpretability methods, when applied to MolBART, highlight certain substructures that chemists use to explain specific properties of molecules.","representation learning, machine learning for chemistry, self-supervised learning, molecular representations" EquiMod: An Equivariance Module to Improve Self-Supervised Learning,https://openreview.net/forum?id=eDLwjKmtYFt,https://openreview.net/pdf?id=eDLwjKmtYFt,We propose a generic equivariance module that structures the learned latent space by learning to predict the displacement in the embedding space caused by augmentations; we show that it improves the representation of usual self-supervised methods.,"Self-supervised visual representation methods are closing the gap with supervised learning performance. These methods rely on maximizing the similarity between embeddings of related synthetic inputs created through data augmentations. 
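As an aside on the thresholded lexicographic setting of "Thresholded Lexicographic Ordered Multi-Objective Reinforcement Learning" above, the following is a minimal sketch of thresholded lexicographic action selection as it is commonly defined in this literature; the slack-based filtering and all names are our illustrative assumptions, not the paper's LPO algorithm.

```python
import numpy as np

def lexicographic_action(q_values, slacks):
    """Thresholded lexicographic action selection (illustrative sketch).

    q_values: list of length K; q_values[k] is an array of per-action
              Q-value estimates for objective k (highest priority first).
    slacks:   list of length K-1; slacks[k] is how far below the best
              Q-value of objective k an action may fall and still be kept.
    """
    candidates = np.arange(len(q_values[0]))
    for k, q in enumerate(q_values):
        if k == len(q_values) - 1:
            # Last (lowest-priority) objective: pick the best remaining action.
            return candidates[q[candidates].argmax()]
        # Keep actions within a slack of the best value for this objective.
        best = q[candidates].max()
        candidates = candidates[q[candidates] >= best - slacks[k]]
    return candidates[0]

# Example: objective 0 (e.g., safety) dominates objective 1 (e.g., speed).
q_safety = np.array([0.90, 0.88, 0.20, 0.87])
q_speed  = np.array([0.10, 0.70, 0.90, 0.60])
print(lexicographic_action([q_safety, q_speed], slacks=[0.05]))  # -> 1
```

Actions far below the best safety value are discarded before speed is even consulted, which is the source of the goal-reachability issues the paper discusses.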
This can be seen as a task that encourages embeddings to leave out factors modified by these augmentations, i.e. to be invariant to them. However, this only considers one side of the trade-off in the choice of the augmentations: they need to strongly modify the images to avoid shortcut-learning solutions (e.g. using only color histograms), but on the other hand, augmentation-related information may be lacking in the representations for some downstream tasks (e.g. color is important for bird and flower classification). A few recent works proposed to mitigate the problem of using only an invariance task by exploring some form of equivariance to augmentations. This has been performed by learning additional embedding space(s), where some augmentation(s) cause embeddings to differ, yet in a non-controlled way. In this work, we introduce a generic equivariance module that structures the learned latent space, in the sense that our module learns to predict the displacement in the embedding space caused by the augmentations. We show that applying this module to state-of-the-art invariance models, such as SimCLR and BYOL, improves performance on the CIFAR10 and ImageNet datasets. Moreover, while our model could collapse to a trivial equivariance, i.e. invariance, we observe that it instead automatically learns to keep some augmentation-related information beneficial to the representations.","Representation learning, Self-supervised learning, Contrastive learning, Equivariance" Cross-utterance Conditioned Coherent Speech Editing via Biased Training and Entire Inference,https://openreview.net/forum?id=O_er9uNktN,https://openreview.net/pdf?id=O_er9uNktN,,"Text-based speech editing systems are developed to enable users to select, cut, copy and paste speech based on the transcript. Existing state-of-the-art neural editing systems, without exception, perform only partial inference, that is, they generate only the new words that need to be replaced or inserted. This usually leads to the prosody of the edited part being inconsistent with the preceding and subsequent speech, and to failures in handling alterations of intonation. To address these problems, we propose a cross-utterance conditioned coherent speech editing system, which is the first to perform inference over the entire utterance. Benefiting from a cross-utterance conditioned variational autoencoder, our proposed system can synthesize speech by utilizing speaker information, context, acoustic features, and the mel-spectrogram of unedited fragments from the original audio. We also apply biased training to concentrate more attention on the part that needs to be reconstructed during training. Experiments conducted on subjective and objective metrics demonstrate that our approach outperforms the partial inference method on various editing operations regarding naturalness and prosody consistency.", Manipulating Multi-agent Navigation Task via Emergent Communications,https://openreview.net/forum?id=cUX2psP06OL,https://openreview.net/pdf?id=cUX2psP06OL,,Multi-agent collaborations struggle to efficiently sustain grounded communications with a specific task goal. Existing approaches are limited in their simple task settings and single-turn communications. This work describes a multi-agent communication scenario via emergent language in a navigation task. 
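To make the displacement-prediction idea of EquiMod (above) concrete, here is a minimal sketch; the predictor architecture, the cosine loss, and the augmentation-parameter encoding are our assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EquivariancePredictor(nn.Module):
    """Predicts how an augmentation displaces an embedding (sketch).

    Given the embedding z of an image and a vector aug_params describing
    the augmentation (e.g., crop offsets, jitter strengths), it outputs
    a prediction of the embedding of the augmented image.
    """
    def __init__(self, dim=128, aug_dim=8, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + aug_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, z, aug_params):
        # Predict a displacement and apply it to the source embedding,
        # so the module can only collapse to invariance if the
        # predicted displacement degenerates to zero.
        return z + self.net(torch.cat([z, aug_params], dim=-1))

def equivariance_loss(predictor, z, z_aug, aug_params):
    # Pull the predicted displaced embedding towards the actual embedding
    # of the augmented view (cosine distance, a common choice).
    z_pred = predictor(z, aug_params)
    return 1 - F.cosine_similarity(z_pred, z_aug.detach(), dim=-1).mean()
```

In use, this loss would be added to the invariance objective of a backbone such as SimCLR or BYOL.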
This task involves two agents with unequal abilities: the tourist (agent A), who can only observe the surroundings, and the guide (agent B), who has a holistic view but does not know the initial position of agent A. They communicate with an emergent language grounded in the environment and a common task goal: to help the tourist find the target place. We release a new dataset of 3000 scenarios that involve multi-agent visual and language navigation. We also address multi-agent emergent communication by proposing a collaborative learning framework that enables the agents to generate and understand emergent language and solve tasks. The framework is trained with reinforcement learning by maximizing the task success rate in an end-to-end manner. Results show that the proposed framework achieves competitive performance in both the accuracy of language understanding and the task success rate. We also discuss explanations of the emergent language., Task-Aware Information Routing from Common Representation Space in Lifelong Learning,https://openreview.net/forum?id=-M0TNnyWFT5,https://openreview.net/pdf?id=-M0TNnyWFT5,A continual learning method that entails task-attention modules to capture task-specific information from the common representation space,"Intelligent systems deployed in the real world suffer from catastrophic forgetting when exposed to a sequence of tasks. Humans, on the other hand, acquire, consolidate, and transfer knowledge between tasks that rarely interfere with the consolidated knowledge. Accompanied by self-regulated neurogenesis, continual learning in the brain is governed by a rich set of neurophysiological processes that harbor different types of knowledge, which are then integrated by conscious processing. Thus, inspired by the Global Workspace Theory of conscious information access in the brain, we propose TAMiL, a continual learning method that entails task-attention modules to capture task-specific information from the common representation space. We employ simple, undercomplete autoencoders to create a communication bottleneck between the common representation space and the global workspace, allowing only task-relevant information into the global workspace, thereby greatly reducing task interference. Experimental results show that our method outperforms state-of-the-art rehearsal-based and dynamic sparse approaches and bridges the gap between fixed-capacity and parameter isolation approaches while being scalable. We also show that our method effectively mitigates catastrophic forgetting while being well-calibrated with reduced task-recency bias.","Continual learning, Lifelong learning, Representation learning, Global workspace theory, Task-specific attention" CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code,https://openreview.net/forum?id=htL4UZ344nF,https://openreview.net/pdf?id=htL4UZ344nF,,"Recent works have widely adopted large language model pretraining for source code, suggested source code-specific pretraining objectives and investigated the applicability of various Transformer-based language model architectures for source code. This work investigates another important aspect of such models, the effect of different subtokenization options, and aims at identifying the most effective and length-efficient subtokenizations, taking into account source code specifics. 
We propose a subtokenization that reduces average length by 17--40% without a drop in downstream performance, and show that a carefully chosen subtokenization may improve quality by 0.5--2%, possibly with some length increase.","source code processing, tokenization, byte-pair encoding" SWRM: Similarity Window Reweighting and Margins for Long-Tailed Recognition,https://openreview.net/forum?id=fQejLClfsw,https://openreview.net/pdf?id=fQejLClfsw,,"Real-world data usually obeys a long-tailed distribution. Many previous works merely focus on the superficial phenomenon that tail classes lack samples in long-tailed datasets, yet they do not conduct in-depth analysis of the datasets and the model prediction results. In this paper, we experimentally find that due to the easily confused visual features between head and tail classes, the cross-entropy model is prone to misclassifying tail samples as head classes with high appearance similarity. We propose a Similarity Window Reweighting and Margins (SWRM) algorithm to tackle this problem. Specifically, we pretrain a cross-entropy model to model category similarity, then a sliding window is adopted upon the modeling result to constrain the impact of similarity. We design weights for different classes with the help of the similarity window, which is named Similarity Window Reweighting (SWR). Besides, different margins computed inside the similarity window are assigned to different classes; this is called Similarity Window Margin (SWM). In a nutshell, SWR considers the category frequency difference and the category similarity impact simultaneously, so the weight coefficients computed by SWR are more reasonable. SWM prompts the model to learn fine-grained features and is conducive to the model's discriminative ability. Therefore, our methods effectively alleviate the issue of misclassification. In order to enhance the robustness and generalization of the model, we introduce a learnable similarity vector and further propose a Dynamic Similarity Window Reweighting and Margins (DySWRM) algorithm, which incurs less computation cost than SWRM. Extensive experiments verify our proposed approaches' effectiveness and superiority over SOTA reweighting and logit adjustment methods.","Long-tailed recognition, class re-balancing, reweighting, logit adjustment" Transport with Support: Data-Conditional Diffusion Bridges,https://openreview.net/forum?id=me09xlTmm8,https://openreview.net/pdf?id=me09xlTmm8,Conditioning diffusion Schrödinger bridges on intermediate sparse observations via particle filtering,"The dynamic Schrödinger bridge problem provides an appealing setting for posing optimal transport problems as learning non-linear diffusion processes and enables efficient iterative solvers. Recent works have demonstrated state-of-the-art results (e.g., in modelling single-cell embryo RNA sequences or sampling from complex posteriors) but are typically limited to learning bridges with only initial and terminal constraints. Our work extends this paradigm by proposing the Iterative Smoothing Bridge (ISB). We combine learning diffusion models with Bayesian filtering and optimal control, allowing for constrained stochastic processes governed by sparse observations at intermediate stages and terminal constraints. 
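For readers unfamiliar with the subtokenization family that CodeBPE (above) builds on, here is a textbook byte-pair-encoding merge learner; this is generic BPE, not the paper's length-efficient variants.

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn BPE merge rules from whitespace-tokenized code (sketch).

    Each word starts as a tuple of single characters; at every step the
    most frequent adjacent symbol pair is merged into one new symbol.
    """
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        vocab = merged
    return merges

# Toy "code" corpus: frequent identifier fragments get merged first.
print(learn_bpe_merges(["get_value", "set_value", "value", "values"], 5))
```

The subtokenization questions the paper studies (vocabulary size, treatment of identifiers and punctuation, length efficiency) are all choices layered on top of this basic merge procedure.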
We assess the effectiveness of our method on synthetic and real-world data and show that the ISB generalises well to high-dimensional data, is computationally efficient, and provides accurate estimates of the marginals at intermediate and terminal times. ","diffusion models, optimal transport, particle filtering, stochastic control, sequential Monte Carlo" Supervised Q-Learning can be a Strong Baseline for Continuous Control,https://openreview.net/forum?id=b5M2oNm3nA,https://openreview.net/pdf?id=b5M2oNm3nA,We propose to use Zeroth-Order supervised policy optimization based on Q-learning as alternative to policy gradient in continuous control tasks.,"Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL). However, PG algorithms exploit the value function being learned only locally, through first-order updates, which results in limited sample efficiency. In this work, we propose an alternative method called Zeroth-Order Supervised Policy Improvement (ZOSPI). ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of the PG methods, based on zeroth-order policy optimization. This learning paradigm follows Q-learning but overcomes the difficulty of efficiently performing argmax in continuous action spaces. It finds the max-valued action within a small number of samples. The policy learning of ZOSPI has two steps: First, it samples actions and evaluates those actions with a learned value estimator, and then it learns to perform the action with the highest value through supervised learning. We further demonstrate that such a supervised learning framework can learn multi-modal policies. Experiments show that ZOSPI achieves competitive results on the continuous control benchmarks with a remarkable sample efficiency.","Zeroth-Order Method, Continuous Control, Supervised Optimization" Randomized Sharpness-Aware Training for Boosting Computational Efficiency in Deep Learning,https://openreview.net/forum?id=8foynpwwRb,https://openreview.net/pdf?id=8foynpwwRb,"We propose a randomized training policy, called randomized sharpness-aware training, for boosting the computation efficiency in sharpness-aware training.","By driving optimizers to converge to flat minima, sharpness-aware learning algorithms (such as SAM) have shown the power to achieve state-of-the-art performance. However, these algorithms generally incur one extra forward-backward propagation at each training iteration, which largely increases the computational burden, especially for large models. To this end, we propose an efficient training scheme, called Randomized Sharpness-Aware Training (RST). Optimizers in RST perform a Bernoulli trial at each iteration to choose randomly between the base algorithm (SGD) and the sharpness-aware algorithm (SAM), with a probability arranged by a predefined scheduling function. Due to the mixture with the base algorithm, the overall count of propagation pairs can be largely reduced. We also give a theoretical analysis of the convergence of RST. Then, we empirically study the computation cost and effect of various types of scheduling functions, and give directions on setting appropriate scheduling functions. Further, we extend RST to a general framework (G-RST), where the degree of sharpness regularization can be adjusted freely for any scheduling function. We show that G-RST can outperform SAM in most cases while saving 50\% of the extra computation cost. ","Optimization, Sharpness-aware Training, Computation Efficiency." 
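A minimal sketch of the Bernoulli mixing in RST (above): with probability p a standard two-pass SAM update is taken, otherwise a plain single-pass step. The linear schedule at the end is an illustrative assumption, and the sketch assumes every parameter receives a gradient.

```python
import torch

def rst_step(model, loss_fn, batch, opt, p_sam, rho=0.05):
    """One Randomized Sharpness-Aware Training step (sketch)."""
    x, y = batch
    loss = loss_fn(model(x), y)
    opt.zero_grad()
    loss.backward()
    if torch.rand(1).item() < p_sam:
        # SAM branch: perturb weights along the gradient direction ...
        grads = [p.grad.clone() for p in model.parameters()]
        norm = torch.norm(torch.stack([g.norm() for g in grads]))
        with torch.no_grad():
            for p, g in zip(model.parameters(), grads):
                p.add_(rho * g / (norm + 1e-12))
        # ... take the gradient at the perturbed point ...
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        # ... and restore the original weights before stepping.
        with torch.no_grad():
            for p, g in zip(model.parameters(), grads):
                p.sub_(rho * g / (norm + 1e-12))
    opt.step()
    return loss.item()

# An illustrative linear schedule: more SAM steps late in training.
p_schedule = lambda t, T: t / T
```

The expected number of extra forward-backward passes over training is simply the mean of the schedule, which is how the 50% saving arises for a schedule averaging 0.5.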
Self-Supervised Off-Policy Ranking via Crowd Layer,https://openreview.net/forum?id=GX0uI5T8kd,https://openreview.net/pdf?id=GX0uI5T8kd,,"Off-policy evaluation (OPE) aims to estimate the online performance of target policies given a dataset collected by some behavioral policies. OPE is crucial in many applications where online policy evaluation is expensive. However, existing OPE methods are far from reliable. Fortunately, in many real-world scenarios, we care only about the ranking of the evaluated policies, rather than their exact online performance. Existing works on off-policy ranking (OPR) adopt a supervised training paradigm, which assumes that there are plenty of deployed policies and that labels of their performance are available. However, this assumption does not apply to most OPE scenarios because collecting such training data might be highly expensive. In this paper, we propose a novel OPR framework called SOCCER, where the existing OPE methods are modeled as workers in a crowdsourcing system. SOCCER can be trained in a self-supervised way as it does not require any ground-truth labels of policies. Moreover, in order to capture the relative discrepancies between policies, we propose a novel transformer-based architecture to learn effective pairwise policy representations. Experimental results show that SOCCER achieves consistently high accuracy in a variety of OPR tasks. Surprisingly, SOCCER even performs better than baselines trained in a supervised way using additional labeled data, which further demonstrates the superiority of SOCCER in OPR tasks.","off-policy ranking, policy representation learning, reinforcement learning" Probing for Correlations of Causal Facts: Large Language Models and Causality,https://openreview.net/forum?id=UPwzqPOs4-,https://openreview.net/pdf?id=UPwzqPOs4-,"We hypothesize that LLMs exploit correlations between the questions on causal relations with their expected (or ""right"") causal answers.","Large Language Models (LLMs) are subject to an ongoing heated debate, leaving open the question of progress towards AGI and dividing the community into two camps: the ones who see the arguably impressive results as evidence for the scaling hypothesis, and the others who are worried about the lack of interpretability and reasoning capabilities. By investigating to what extent causal representations might be captured by LLMs, we make a humble effort towards resolving the ongoing philosophical conflicts. We hypothesize that causal facts are part of the training data and that LLMs are capable of picking up correlations between questions on causal relations and their expected (or ``right'') causal answers. We study this hypothesis in two ways: (1) by analyzing the LLM's causal question answering capabilities and (2) by probing the LLM's embeddings for correlations on the causal facts. Our analyses suggest that LLMs are somewhat capable of answering causal queries the right way through memorization of the corresponding question-answer pairs. 
However, more importantly, the evidence suggests that LLMs do not perform causal reasoning to arrive at their answers.","large language models, empirical analysis, causal facts" Geometry Problem Solving based on Counterfactual Evolutionary Reasoning,https://openreview.net/forum?id=1BEoYnjZVV,https://openreview.net/pdf?id=1BEoYnjZVV,A new method using counterfactual evolutionary reasoning for geometry problem solving,"As a representative topic in natural language processing and automated theorem proving, geometry problem solving requires abstract problem understanding and symbolic reasoning. A major challenge here is to find a feasible reasoning sequence that is consistent with given axioms and the theorems already proved. Most recent methods have exploited neural network-based techniques to automatically discover eligible solving steps. Such methods, however, are greatly impacted by the expert solutions used for training. To improve the accuracy, this paper proposes a new method called counterfactual evolutionary reasoning, which uses a generative adversarial network to generate initial reasoning sequences and then introduces counterfactual reasoning to explore potential solutions. By directly exploring theorem candidates rather than relying on neural network selection, the new method can sufficiently extend the search space to find more appropriate reasoning steps. Through comparative experiments on the recently proposed Geometry3K, the largest geometry problem solving dataset, our method generally achieves higher accuracy than most previous methods, bringing an overall improvement of about 4.4% compared with the transformer models.","Counterfactual Reasoning, Geometry Problem Solving, Symbolic Reasoning" Few-Shot Domain Adaptation For End-to-End Communication,https://openreview.net/forum?id=4F1gvduDeL,https://openreview.net/pdf?id=4F1gvduDeL,We propose a sample-efficient domain adaptation method for the autoencoder based end-to-end communication problem,"End-to-end learning of a communication system using an autoencoder -- consisting of an encoder, channel, and decoder modeled using neural networks -- has recently been shown to be an effective approach. A challenge faced in the practical adoption of this learning approach is that under changing channel conditions (e.g. a wireless link), it requires frequent retraining of the autoencoder in order to maintain a low decoding error rate. Since retraining is both time consuming and requires a large number of samples, it becomes impractical when the channel distribution is changing quickly. We propose to address this problem using a fast and sample-efficient (few-shot) domain adaptation method that does not change the encoder and decoder networks. Different from conventional training-time unsupervised or semi-supervised domain adaptation, here we have a trained autoencoder from a source distribution that we want to adapt (at test time) to a target distribution using only a small labeled dataset, and no unlabeled data. We focus on a generative channel model based on the Gaussian mixture density network (MDN), and propose a regularized, parameter-efficient adaptation of the MDN using a set of affine transformations. The learned affine transformations are then used to design an optimal transformation at the decoder input to compensate for the distribution shift, effectively presenting the decoder with inputs close to the source distribution. 
Experiments on many simulated distribution changes common to the wireless setting, and on a real mmWave FPGA testbed, demonstrate the effectiveness of our method at adaptation using very few target domain samples.","domain adaptation, end-to-end communication, autoencoders, Gaussian mixtures, mixture density networks, few-shot, wireless channel" HyPHEN: A Hybrid Packing Method and Optimizations for Homomorphic Encryption-Based Neural Network ,https://openreview.net/forum?id=fyD8adDrXo,https://openreview.net/pdf?id=fyD8adDrXo,Efficient convolution algorithms for private inference based on fully homomorphic encryption,"Private Inference (PI) enables users to enjoy secure AI inference services while companies comply with regulations. Fully Homomorphic Encryption (FHE) based Convolutional Neural Network (CNN) inference is promising as users can offload the whole computation process to the server while protecting the privacy of sensitive data. Recent advances in AI research have enabled HE-friendly deep CNNs like ResNet. However, FHE-based CNN (HCNN) inference suffers from high computational overhead. Prior HCNN approaches rely on dense packing techniques that aggregate as many channels as possible into a ciphertext to reduce element-wise operations like multiplication and bootstrapping. However, these approaches require performing an excessive number of homomorphic rotations to accumulate channels and maintain dense data organization, which takes up most of the runtime. To overcome this limitation, we present HyPHEN, a deep HCNN implementation that drastically reduces the number of homomorphic rotations. HyPHEN utilizes a novel convolution algorithm, RAConv, built on replication-based data organization, which leads to a significant reduction in rotation count. Furthermore, we propose a hybrid gap packing method for HyPHEN, which gathers sparse convolution results into a dense data organization with a marginal increase in the number of rotations. HyPHEN explores the trade-off between the computational costs of rotations and other operations, and finds the optimal point minimizing the execution time. With these optimizations, HyPHEN takes 3.8-4.9$\times$ less execution time than the state-of-the-art HCNN implementation and brings the runtime of ResNet inference down to 1.38-14.86s using a GPU-accelerated HEAAN library.","Private Inference, Homomorphic Encryption, PPML" Causal Inference for Knowledge Graph Completion,https://openreview.net/forum?id=Y1J29OryQg,https://openreview.net/pdf?id=Y1J29OryQg,We propose causal KGC models to alleviate the issues by leveraging causal inference framework.,"The basis of existing knowledge graph completion (KGC) models is to learn the correlations in data, such as the correlation between entities or relations and scores of triplets. Since the world is driven by causality rather than correlation, correlation-driven KGC models are weak in interpretation and suffer from the data bias issue. In this paper, we propose causal KGC models that alleviate these issues by leveraging the causal inference framework. Our models are intuitive and interpretable through causal graphs, controllable via intervention techniques, and model-agnostic. Causal graphs allow us to explain the causal relationships between variables and the data generation process. Under the causal graph, data bias can be seen as confounders. We then block the harmful effect of confounders with intervention operators to mitigate the data bias issue. 
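One way to picture the affine adaptation in the few-shot communication paper above: fit an affine map from source-model outputs to a handful of target-channel observations by least squares, then invert it at the decoder input. The closed-form fit below is our simplification of the paper's regularized, parameter-efficient MDN procedure, not its actual algorithm.

```python
import numpy as np

def fit_affine_adaptation(z_source, z_target):
    """Fit y ~= A z + b from a few paired samples (least-squares sketch).

    z_source: (n, d) outputs of the source-trained channel model,
    z_target: (n, d) corresponding few-shot observations from the
              new (shifted) channel.
    """
    n, d = z_source.shape
    X = np.hstack([z_source, np.ones((n, 1))])      # append a bias column
    W, *_ = np.linalg.lstsq(X, z_target, rcond=None)
    return W[:d].T, W[d]                            # A: (d, d), b: (d,)

def compensate(y, A, b):
    # Map target-domain observations back towards the source domain
    # before they reach the decoder, by inverting the affine shift.
    return np.linalg.solve(A, (y - b).T).T
```

With very few labeled pairs, such a fit is only well-posed because it has O(d^2) parameters rather than a full network's worth, which is the spirit of the paper's parameter-efficient adaptation.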
Due to the difficulty of obtaining randomized data, causal KGC models pose unique challenges for evaluation. Thus, we present a method that makes evaluation feasible. Finally, we show a group-theoretic view of KGC, which is equivalent to the causal view but further reveals the relationships between causal graphs. Experimental results show that our causal KGC models achieve better performance than traditional KGC models.","Causal Inference, Knowledge Graph Completion" Formal Specifications from Natural Language,https://openreview.net/forum?id=ywAjQw-spmY,https://openreview.net/pdf?id=ywAjQw-spmY,We study the generalization abilities of language models when translating natural language into formal specifications with complex semantics.,"We study the generalization abilities of language models when translating natural language into formal specifications with complex semantics. In particular, we fine-tune language models on three datasets consisting of English sentences and their corresponding formal representation: 1) regular expressions (regex), frequently used in programming and search; 2) First-order logic (FOL), commonly used in software verification and theorem proving; and 3) linear-time temporal logic (LTL), which forms the basis for industrial hardware specification languages. Our experiments show that, in these diverse domains, the language models leverage their pre-trained knowledge of natural language to generalize, e.g., to new variable names or operator descriptions. Additionally, they achieve competitive performance, and even outperform the state-of-the-art for translating into regular expressions, with the benefits of being easy to access, efficient to fine-tune, and without a particular need for domain-specific reasoning.","language models, natural language, formal specifications, first-order logic, temporal logic, regular expressions" DELTA: Diverse Client Sampling for Fasting Federated Learning,https://openreview.net/forum?id=CcXTudu9bvu,https://openreview.net/pdf?id=CcXTudu9bvu,"We propose an unbiased sampling method that characterizes the impact of client diversity and local variance, and provide a complete theoretical proof and experimental verification.","Partial client participation has been widely adopted in Federated Learning (FL) to efficiently reduce the communication burden. However, an improper client sampling scheme will select unrepresentative subsets, which causes a large variance in the model update and slows down convergence. Existing sampling methods are either biased or can be further improved to accelerate convergence. In this paper, we propose an unbiased sampling scheme, termed DELTA, to alleviate this problem. In particular, DELTA characterizes the impact of client diversity and local variance, and samples the representative clients who carry valuable information for global model updates. Moreover, DELTA is a provably optimal unbiased sampling scheme that minimizes the variance caused by partial client participation and achieves better convergence than other unbiased sampling schemes. We corroborate our results with experiments on both synthetic and real data sets.","federated learning, client sampling" Incremental Predictive Coding: A Parallel and Fully Automatic Learning Algorithm,https://openreview.net/forum?id=rwetAifrs16,https://openreview.net/pdf?id=rwetAifrs16,,"Neuroscience-inspired models, such as predictive coding, have the potential to play an important role in the future of machine intelligence. 
However, they are not yet used in industrial applications due to some limitations, such as efficiency. In this work, we propose incremental predictive coding (iPC), a variation of the original model derived from the incremental expectation maximization algorithm, where every operation can be performed in parallel without external control. We show both theoretically and empirically that iPC is more efficient than the original algorithm by Rao and Ballard, with performance comparable to that of backpropagation in image classification tasks. This work impacts several areas, as it has general applications in computational neuroscience and machine learning, and specific applications in scenarios where automation and parallelization are important, such as distributed computing and implementations of deep learning models on analog and neuromorphic chips. ","Cognitive Science, deep learning, predictive coding" Rethinking Metric Based Contrastive Learning Method’s Generalization Capability,https://openreview.net/forum?id=wRdGz0ZWYHF,https://openreview.net/pdf?id=wRdGz0ZWYHF,,"In recent years, semi-supervised/self-supervised methods based on contrastive learning have made great empirical progress in various fields of deep learning, and even outperform supervised methods in some fields (such as NLP and CV). However, there are very few theoretical works that explain why models trained with contrastive learning-based methods can outperform models trained with general supervised methods on supervised tasks. Based on the manifold assumption about the input space, this work proposes three elements of metric-based contrastive learning: (1) an augmented neighborhood defined for every point in the input space; (2) a metric-based optimization loss on the output space; (3) a generalization error on the union of the augmented neighborhoods. Moreover, we propose an upper bound on (3), named UBGEAN (Upper Bound of Generalization Error on Augmented Neighborhood), which relates the labeled empirical loss to the unlabeled metric-based contrastive loss. We also explain the relationship between existing contrastive semi-supervised/self-supervised methods and our upper bound. Finally, we propose a supervised consistent contrastive learning method based on this upper bound. We verify the validity of UBGEAN's generalization capacity against empirical loss by conducting a series of experiments, achieving an 8.2275% improvement on average across 4 tasks. We also design another set of experiments on fine-tuning self-supervised contrastive models, showing that our upper bound provides a more stable effect that lets self-supervised pre-trained models match the performance of supervised pre-trained ones.", RISC-V MICROARCHITECTURE EXPLORATION VIA REINFORCEMENT LEARNING,https://openreview.net/forum?id=MW0hjtzYRkW,https://openreview.net/pdf?id=MW0hjtzYRkW,Microarchitecture design space exploration via reinforcement learning for RISC-V processors,"Microarchitecture determines a processor's detailed structure, affecting the processor's performance, power, and area (PPA). Deciding on a microarchitecture to achieve a good balance between the PPA values is a non-trivial problem. Prior art mainly requires expert knowledge, and such solutions become inefficient as processors grow increasingly complicated. Machine learning can solve such problems automatically with high-quality results and reduced reliance on domain knowledge. 
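A minimal sketch of the "update everything at every iteration" idea behind iPC (above), on a small two-latent-layer generative hierarchy; the sizes, nonlinearity and learning rates are illustrative assumptions. Because all latents and all weights take a gradient step on the same energy at every iteration, the updates need no external control and can run in parallel.

```python
import torch

def ipc_fit(x_data, dims=(784, 256, 64), steps=200, lr_x=0.1, lr_w=1e-3):
    """Incremental predictive coding on a small hierarchy (sketch).

    Energy: sum_l ||x_l - W_l tanh(x_{l+1})||^2, with x_0 clamped to data.
    """
    n = x_data.shape[0]
    xs = [x_data] + [torch.zeros(n, d, requires_grad=True) for d in dims[1:]]
    Ws = [(0.01 * torch.randn(dims[l], dims[l + 1])).requires_grad_(True)
          for l in range(len(dims) - 1)]
    for _ in range(steps):
        energy = sum(((xs[l] - torch.tanh(xs[l + 1]) @ Ws[l].T) ** 2).sum()
                     for l in range(len(Ws)))
        energy.backward()
        with torch.no_grad():
            # All value nodes and all weights step simultaneously.
            for x in xs[1:]:
                x -= lr_x * x.grad
                x.grad = None
            for W in Ws:
                W -= lr_w * W.grad
                W.grad = None
    return Ws

Ws = ipc_fit(torch.randn(32, 784))
```

An EM-style predictive coding schedule would instead run the inner latent updates to convergence before each weight update; iPC removes that sequencing.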
In this paper, we formulate the problem as a Markov decision process and propose an end-to-end solution framework via reinforcement learning. Firstly, a dynamically-weighted reward design is proposed to accommodate the optimization of multiple negatively-correlated objectives. Secondly, local heuristic search is adopted in the action design with prior knowledge of microarchitectures. Thirdly, lightweight calibrated PPA models are incorporated to accelerate the learning process. Experiments with electronic design automation (EDA) tools on well-known RISC-V processors demonstrate that our methodology can learn from experience and outperform human implementations and prior solutions in PPA and overall running time.","Design Space Exploration, Reinforcement Learning" Learning Geometric Representations of Interactive Objects,https://openreview.net/forum?id=HqVp0rNC8jn,https://openreview.net/pdf?id=HqVp0rNC8jn,We propose a representation learning framework that extracts from observations the geometric state of both an agent and an object the agent interacts with.,"We address the problem of learning geometric representations from observations perceived by an agent operating within an environment and interacting with an external object. To this end, we propose a representation learning framework that extracts the state of both the agent and the object from unstructured observations of arbitrary nature (e.g., images). Supervision comes from the performed actions alone, while the dynamics of the object is assumed to be unknown. We provide a theoretical foundation and formally prove that an ideal learner is guaranteed to infer an isometric representation, disentangling the agent from the object. Finally, we empirically investigate our framework on a variety of scenarios. Results show that our model reliably infers the correct representation and outperforms vision-based approaches such as a state-of-the-art keypoint extractor. ","Representation Learning, Interaction, Equivariance" Improve distance metric learning by learning positions of class centers,https://openreview.net/forum?id=7bcrAxy00Jw,https://openreview.net/pdf?id=7bcrAxy00Jw,,"Deep metric learning aims at learning a deep neural network by letting similar samples have small distances while dissimilar samples have large distances. To achieve this goal, current DML algorithms mainly focus on pulling similar samples in each class as close together as possible. However, pulling similar samples only considers the local distribution of the data samples and ignores the global distribution of the data set, i.e., the positions of the centers of different classes. The global distribution helps distance metric learning. For example, expanding the distance between centers can increase the discriminative ability of the extracted features. However, how to increase the distance between centers is a challenging task. In this paper, we design a function, named the skewed mean function, which only considers the largest distances in a set of samples. Maximizing the value of the skewed mean function therefore makes the largest distances larger. We also prove that the current energy functions used for uniformity regularization on centers are special cases of our skewed mean function. 
Finally, we conduct extensive experiments to illustrate the superiority of our methods.","distance metric learning, skewed mean function" The guide and the explorer: smart agents for resource-limited iterated batch reinforcement learning,https://openreview.net/forum?id=m3DmIL7wHDW,https://openreview.net/pdf?id=m3DmIL7wHDW,Smart agents for resource-limited iterated batch reinforcement learning,"Iterated (a.k.a. growing) batch reinforcement learning (RL) is a growing subfield fueled by the demand from systems engineers for intelligent control solutions that they can apply within their technical and organizational constraints. Model-based RL (MBRL) suits this scenario well for its sample efficiency and modularity. Recent MBRL techniques combine efficient neural system models with classical planning (like model predictive control; MPC). In this paper we add two components to this classical setup. The first is a Dyna-style policy learned on the system model using model-free techniques. We call it the guide since it guides the planner. The second component is the explorer, a strategy to expand the limited knowledge of the guide during planning. Through a rigorous ablation study we show that the combination of these two ingredients is crucial for optimal performance and better data efficiency. We apply this approach with an off-policy guide and a heating explorer to improve the state of the art on benchmark systems addressing both discrete and continuous action spaces.","Model-based reinforcement learning, Dyna, exploration, planning, offline, growing batch, iterated batch" FairGBM: Gradient Boosting with Fairness Constraints,https://openreview.net/forum?id=x-mXzBgCX3a,https://openreview.net/pdf?id=x-mXzBgCX3a,"A novel fairness-aware method based on constrained optimization for Gradient Boosting models, that can match state-of-the-art fairness and performance while training 10x faster.","Tabular data is prevalent in many high-stakes domains, such as financial services or public policy. Gradient boosted decision trees (GBDT) are popular in these settings due to performance guarantees and low cost. However, bias is a concern in these domains. Existing in-processing Fair ML methods are either inapplicable to GBDT, incur significant training-time overhead, or are inadequate for problems with high class imbalance -- a typical issue in high-stakes domains. We present FairGBM, a dual ascent learning framework for training GBDT under fairness constraints, with little to no impact on predictive performance when compared to unconstrained GBDT. Since observational fairness metrics are non-differentiable, we have to employ a ``proxy-Lagrangian'' formulation using smooth convex error-rate proxies to enable gradient-based optimization. Our implementation shows an order of magnitude speedup in training time when compared with related work, a pivotal aspect to foster the widespread adoption of FairGBM by real-world practitioners.","fairness, gradient boosting, constrained optimization, tabular data" Kinship Representation Learning with Face Componential Relation,https://openreview.net/forum?id=F8OUxtWEQRi,https://openreview.net/pdf?id=F8OUxtWEQRi,We achieve SOTA kinship recognition performance by learning the face componential relation with contrastive learning.,"Kinship recognition aims to determine whether the subjects in two facial images are kin or non-kin, which is an emerging and challenging problem. 
However, most previous methods focus on heuristic designs without considering the spatial correlation between face images. In this paper, we aim to learn discriminative kinship representations embedded with the relation information between face components (e.g., eyes, nose, etc.). To achieve this goal, we propose the Face Componential Relation Network (FaCoRNet), which learns the relationship between face components among images via a cross-attention mechanism that automatically learns the important facial regions for kinship recognition. Moreover, we propose Relation-Guided Contrastive Learning, which adapts the loss function using guidance from the cross-attention to learn more discriminative feature representations. The proposed FaCoRNet outperforms previous state-of-the-art methods by large margins on FIW, the largest public kinship recognition benchmark. The code will be publicly released upon acceptance.","kinship recognition, attention, contrastive learning" Pseudo-Differential Integral Operator for Learning Solution Operators of Partial Differential Equations,https://openreview.net/forum?id=rer10Bb-9Qn,https://openreview.net/pdf?id=rer10Bb-9Qn,,"Learning mappings between two function spaces has attracted considerable research attention. However, learning the solution operator of partial differential equations (PDEs) remains a challenge in scientific computing. The Fourier neural operator (FNO) was recently proposed to learn solution operators with excellent performance. In this study, we propose a novel pseudo-differential integral operator (PDIO) to analyze and generalize the Fourier integral operator in FNO. PDIO is inspired by a pseudo-differential operator, which is a generalization of a differential operator and is characterized by a certain symbol. We parameterize the symbol using a neural network and show that the neural-network-based symbol is contained in a smooth symbol class. Subsequently, we prove that the PDIO is a bounded linear operator, and thus is continuous in the Sobolev space. We combine the PDIO with the neural operator to develop a pseudo-differential neural operator (PDNO) to learn the nonlinear solution operator of PDEs. We experimentally validate the effectiveness of the proposed model by using Darcy flow and the Navier-Stokes equation. The results reveal that the proposed PDNO outperforms the existing neural operator approaches in most experiments.", How (Un)Fair is Text Summarization?,https://openreview.net/forum?id=-UsbRlXzMG,https://openreview.net/pdf?id=-UsbRlXzMG,We show that machine learning based summarizers exhibit bias toward different groups and are very sensitive to document structure.,"Creating a good summary requires carefully choosing details from the original text to accurately represent it in a limited space. If a summary contains biased information about a group, it risks passing this bias off to readers as fact. These risks increase if we consider not just one biased summary, but rather a biased summarization algorithm. Despite this, little work has measured whether these summarizers demonstrate biased performance. Rather, most work in summarization focuses on improving performance, ignoring questions of bias. In this paper we demonstrate that automatic summarizers both amplify and introduce bias towards information about under-represented groups. 
Additionally, we show that summarizers are highly sensitive to document structure, making the summaries they generate unstable under changes that are semantically meaningless to humans, which poses a further fairness risk. Given these results, and the large-scale potential for harm presented by biased summarization, we recommend that bias analysis be performed and reported on summarizers to ensure that new automatic summarization methods do not introduce bias to the summaries they generate.","Natural language processing, Summarization, Fairness" Simulating Task-Free Continual Learning Streams From Existing Datasets,https://openreview.net/forum?id=Wac06sAkHk,https://openreview.net/pdf?id=Wac06sAkHk,,"Task-free continual learning is the subfield of machine learning that focuses on learning online from a stream whose distribution changes continuously over time. However, previous works evaluate task-free continual learning using streams with distributions that change only at a few distinct points in time. In order to address the discrepancy between the definition and evaluation of task-free continual learning, we propose a principled algorithm that can permute any labeled dataset into a stream that is continuously nonstationary. We empirically show that the streams generated by our algorithm are less structured than the ones conventionally used in the literature. Moreover, we use our simulated task-free streams to benchmark multiple methods applicable to the task-free setting. We hope that our work will make it more likely that task-free continual learning methods generalize better to real-world problems.",Task-Free Continual Learning Online Bias Correction for Task-Free Continual Learning,https://openreview.net/forum?id=18XzeuYZh_,https://openreview.net/pdf?id=18XzeuYZh_,,"Task-free continual learning is the machine-learning setting where a model is trained online with data generated by a non-stationary stream. Conventional wisdom suggests that, in this setting, models are trained using an approach called experience replay, where the risk is computed both with respect to current stream observations and to a small subset of past observations. In this work, we show both theoretically and empirically how experience replay biases the outputs of the model towards recent stream observations. Moreover, we propose a simple approach to correct for this bias online, by changing the way the output layer of the model is optimized. We show that our approach significantly improves the learning performance of experience-replay approaches on a number of different datasets. Our findings suggest that, in contrast to stationary machine-learning problems, the output layer of a model should be optimized separately from its preceding layers when performing experience replay.",Task-Free Continual Learning A Simple Contrastive Learning Objective for Alleviating Neural Text Degeneration,https://openreview.net/forum?id=j8s-BRxXST,https://openreview.net/pdf?id=j8s-BRxXST,"To tackle the repetitive degeneration problem of neural autoregressive language models, we propose a token-level contrastive learning objective that penalizes incorrectly repeating tokens.","The cross-entropy objective has proved to be an all-purpose training objective for autoregressive language models (LMs). However, without distinguishing problematic tokens, LMs trained using cross-entropy exhibit text degeneration problems. 
To address this, unlikelihood training has been proposed to reduce the probability of unlikely tokens predicted by LMs. But unlikelihood training does not explicitly consider the relationship between the label tokens and unlikely token candidates, and thus shows only marginal improvements in degeneration. We propose a new contrastive token learning objective that inherits the advantages of cross-entropy and unlikelihood training and avoids their limitations. The key idea is to teach an LM to generate high probabilities for label tokens and low probabilities for negative candidates. Comprehensive experiments on language modeling and open-domain dialogue generation tasks show that the proposed contrastive token objective yields much less repetitive text, with higher generation quality than baseline approaches, achieving new state-of-the-art performance on mitigating text degeneration.","language model, contrastive learning, repetition, degeneration" Enriching Online Knowledge Distillation with Specialist Ensemble,https://openreview.net/forum?id=L6CKiPH3hI,https://openreview.net/pdf?id=L6CKiPH3hI,Online knowledge distillation with an ensemble of specialized teachers that are explicitly estimated for each imbalanced label prior.,"Online Knowledge Distillation (KD) has an advantage over traditional KD works in that it removes the necessity for a pre-trained teacher. Indeed, an ensemble of small teachers has become typical guidance for a student's learning trajectory. Previous works emphasized diversity to create helpful ensemble knowledge and further argued that diversity should be significant to prevent homogenization. This paper proposes a well-founded online KD framework with naturally derived specialists. In supervised learning, the parameters of a classifier are optimized by stochastic gradient descent based on a training dataset distribution. If the training dataset is shifted, the optimal point and corresponding parameters change accordingly, which is natural and explicit. We first introduce a label prior shift to induce evident diversity among the same teachers, which assigns a skewed label distribution to each teacher and simultaneously specializes them through importance sampling. Compared to previous works, our specialization achieves the highest level of diversity and maintains it throughout training. Second, we propose a new aggregation that uses post-compensation in specialist outputs and conventional model averaging. The aggregation empirically exhibits the advantage of ensemble calibration even if applied to previous diversity-eliciting methods. Finally, through extensive experiments, we demonstrate the efficacy of our framework on top-1 error rate, negative log-likelihood, and notably expected calibration error.","Online knowledge distillation, Label prior shift, Ensemble learning" Improved Training of Physics-Informed Neural Networks with Model Ensembles,https://openreview.net/forum?id=FEAIArDldTA,https://openreview.net/pdf?id=FEAIArDldTA,,"Learning the solution of partial differential equations (PDEs) with a neural network is an attractive alternative to traditional solvers due to its elegance, greater flexibility and the ease of incorporating observed data. However, training such physics-informed neural networks (PINNs) is notoriously difficult in practice since PINNs often converge to wrong solutions. 
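One plausible instantiation of the contrastive token objective above, taking tokens from the recent prefix as negative candidates (the typical source of repetition); the exact candidate set and loss form in the paper may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_token_loss(logits, targets, prefix_window=32):
    """Cross-entropy plus a token-level contrast term (sketch).

    Pushes the label token's logit above the logits of tokens that
    occurred in the recent prefix, so incorrect repetition is penalized
    relative to the correct continuation.
    logits: (T, V), targets: (T,) for a single sequence.
    """
    ce = F.cross_entropy(logits, targets)
    contrast = 0.0
    T = targets.shape[0]
    for t in range(1, T):
        negs = targets[max(0, t - prefix_window):t]   # preceding tokens
        negs = negs[negs != targets[t]]               # keep true repeats allowed
        if negs.numel() == 0:
            continue
        gap = logits[t, negs] - logits[t, targets[t]]
        contrast = contrast + F.softplus(gap).mean()  # pairwise logistic penalty
    return ce + contrast / max(T - 1, 1)
```

Unlike unlikelihood training, which suppresses negatives in isolation, the pairwise term here explicitly relates each negative candidate to the label token, which is the distinction the abstract draws.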
In this paper, we propose a training algorithm that starts approximating the PDE solution in the neighborhood of the initial conditions and gradually expands the solution domain based on the agreement of an ensemble of PINNs. PINNs in the ensemble find similar solutions in the vicinity of points with targets (e.g., observed data or initial conditions), while the found solutions may differ substantially farther away from the observations. Therefore, we propose to use the ensemble agreement as the criterion for gradual expansion of the solution interval, that is, for including new points for computing the loss derived from the differential equations. Due to the flexibility of the domain expansion, our algorithm can easily incorporate measurements in arbitrary locations. In contrast to existing PINN algorithms with time-adaptive strategies, the proposed algorithm does not need a pre-defined schedule of interval expansion, and it treats time and space equally. We experimentally show that the proposed algorithm can stabilize PINN training and yield performance competitive with the recent variants of PINNs trained with time adaptation.","Label propagation, Model ensembles, Partial differential equations, Physics-informed neural networks" Improved Gradient Descent Optimization Algorithm based on Inverse Model-Parameter Difference,https://openreview.net/forum?id=lKXcMB9tOFD,https://openreview.net/pdf?id=lKXcMB9tOFD,A new approach to gradient descent optimization in which the learning-rate for each model-parameter is adjusted inversely proportional to the displacement of the corresponding model-parameter from the preceding iteration.,"A majority of deep learning models implement first-order optimization algorithms like stochastic gradient descent (SGD) or its adaptive variants for training large neural networks. However, slow convergence due to the complicated geometry of the loss function is one of the major challenges faced by SGD. The currently popular optimization algorithms incorporate an accumulation of past gradients to improve gradient descent convergence via either the accelerated gradient scheme (including Momentum, NAG, etc.) or the adaptive learning-rate scheme (including Adam, AdaGrad, etc.). Despite their general popularity, these algorithms often display suboptimal convergence owing to extreme scaling of the learning-rate caused by the accumulation of past gradients. In this paper, a novel approach to gradient descent optimization is proposed which utilizes the difference in the model-parameter values from the preceding iterations to adjust the learning-rate of the algorithm. More specifically, the learning-rate for each model-parameter is adapted inversely proportional to the displacement of the model-parameter from the previous iterations. As the algorithm utilizes the displacement of model-parameters, poor convergence caused by the accumulation of past gradients is avoided. A convergence analysis based on the regret bound approach is performed and the theoretical bounds for a stable convergence are determined. An empirical analysis evaluates the proposed algorithm on the CIFAR 10/100 and ImageNet datasets and compares it with the currently popular optimizers. 
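A toy sketch of the ensemble-agreement expansion in the PINN-ensemble paper above, with the ODE u' = -u, u(0) = 1 standing in for a PDE; the ensemble size, agreement threshold, and expansion schedule are illustrative.

```python
import torch
import torch.nn as nn

def make_pinn():
    return nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

def residual(net, t):                         # toy "PDE": u' + u = 0
    t = t.requires_grad_(True)
    u = net(t)
    du = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    return du + u

ensemble = [make_pinn() for _ in range(5)]
opts = [torch.optim.Adam(n.parameters(), lr=1e-3) for n in ensemble]
active = torch.linspace(0, 0.1, 8).view(-1, 1)        # start near the IC
candidates = torch.linspace(0, 1, 100).view(-1, 1)

for step in range(2000):
    for net, opt in zip(ensemble, opts):
        loss = (residual(net, active.clone()) ** 2).mean() \
             + ((net(torch.zeros(1, 1)) - 1.0) ** 2).sum()  # u(0) = 1
        opt.zero_grad()
        loss.backward()
        opt.step()
    if step % 200 == 0:   # expand the loss domain where the ensemble agrees
        with torch.no_grad():
            preds = torch.stack([net(candidates) for net in ensemble])
            agree = preds.std(dim=0).squeeze() < 0.01
        if agree.any():
            active = candidates[agree].view(-1, 1)
```

Because agreement, not time, drives the expansion, the same loop applies unchanged when observations at arbitrary interior locations are added to the target term.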
The experimental results demonstrate that the proposed algorithm outperforms the currently popular optimization algorithms.","Deep learning, Neural Networks, Optimization algorithm, Adaptive learning-rate, Stochastic Gradient Descent" Variational Learning ISTA,https://openreview.net/forum?id=47DzlkyH3dM,https://openreview.net/pdf?id=47DzlkyH3dM,,"Compressed sensing combines the power of convex optimization techniques with a sparsity inducing prior on the signal space to solve an underdetermined system of equations. For many problems, the sparsifying dictionary is not directly given, nor can its existence be assumed. Besides, the sensing matrix can change across different scenarios. Addressing these issues requires solving a sparse representation learning problem, namely dictionary learning, taking into account the epistemic uncertainty on the learned dictionaries and, finally, jointly learning sparse representations and reconstructions under varying sensing matrix conditions. We propose a variant of the LISTA architecture that incorporates the sensing matrix into the architecture. In particular, we propose to learn a distribution over dictionaries via a variational approach, dubbed VLISTA, which approximates a posterior distribution over the dictionaries as part of an unfolded LISTA-based recovery network. Such a variational posterior distribution is updated after each iteration, and thereby adapts the dictionary according to the optimization dynamics. As a result, VLISTA provides a probabilistic way to jointly learn the dictionary distribution and the reconstruction algorithm with varying sensing matrices. We provide theoretical and experimental support for our architecture and show that it learns calibrated uncertainties.","compressed sensing, LISTA, variational models, inverse problems" Moment Distributionally Robust Probabilistic Supervised Learning,https://openreview.net/forum?id=mN43JdXmYMs,https://openreview.net/pdf?id=mN43JdXmYMs,We propose a distributionally robust learning approach for predicting conditional label distributions in probabilistic supervised learning.,"Probabilistic supervised learning assumes the ground truth itself is a distribution instead of a single label, as in classic settings. Common approaches learn with a proper composite loss and obtain probability estimates via an invertible link function. Typical links such as the softmax yield restrictive and problematic uncertainty certificates. In this paper, we propose to make direct predictions of conditional label distributions from first principles in distributionally robust optimization, based on an ambiguity set defined by feature moment divergence. We derive its generalization bounds under mild assumptions. We illustrate how to manipulate penalties for underestimation and overestimation. Our method can be easily incorporated into neural networks for end-to-end representation learning. Experimental results on datasets with probabilistic labels illustrate the flexibility, effectiveness, and efficiency of this learning paradigm.","probabilistic supervised learning, distributionally robust optimization, proper scoring rules" CLEP: Exploiting Edge Partitioning for Graph Contrastive Learning,https://openreview.net/forum?id=r3-aLHxn2nB,https://openreview.net/pdf?id=r3-aLHxn2nB,,"Generative and contrastive are two fundamental unsupervised approaches to model graph information. 
The graph generative models extract intra-graph information, whereas the graph contrastive learning methods focus on inter-graph information. Combining these complementary sources of information can potentially enhance the expressiveness of graph representations, which, nevertheless, is underinvestigated by existing methods. In this work, we introduce a probabilistic framework called contrastive learning with edge partitioning (CLEP) that integrates generative modeling and graph contrastive learning. CLEP models edge generation by cumulative latent node interactions over multiple mutually independent hidden communities. Inspired by the ``assembly'' behavior of communities in graph generation, CLEP learns community-specific graph embeddings and assembles them to represent the entire graph; these embeddings are further used to predict the graph's identity via a contrastive objective. To relate each embedding to one hidden community, we define a set of community-specific weighted edges for node feature aggregation by partitioning the observed edges according to the latent node interactions associated with the corresponding hidden community. With these unique designs, CLEP is able to model the statistical dependency among hidden communities, graph structures, as well as the identity of each graph; it can also be trained end-to-end via variational inference. We evaluate CLEP on real-world benchmarks under self-supervised and semi-supervised settings and achieve promising results, which demonstrate the effectiveness of our method. Various exploratory studies are also conducted to highlight the characteristics of the inferred hidden communities and the potential benefits they bring to representation learning.",Graph Contrastive Learning Meta-Learning the Inductive Biases of Simple Neural Circuits,https://openreview.net/forum?id=dpuAkczrTOt,https://openreview.net/pdf?id=dpuAkczrTOt,"We meta-learn functions that networks of interest find easy to generalise, characterising their inductive bias; we suggest this as a method for interpreting and understanding network function.","Animals receive noisy and incomplete information, from which they must learn how to react in novel situations. A fundamental problem is that training data is always finite, making it unclear how to generalise to unseen data. But animals do react appropriately to unseen data, wielding Occam's razor to select a parsimonious explanation of the observations. How they do this is called their inductive bias, and it is implicitly built into the operation of animals' neural circuits. This relationship between an observed circuit and its inductive bias is a useful explanatory window for neuroscience, allowing design choices to be understood normatively. However, it is generally very difficult to map circuit structure to inductive bias. In this work we present a neural network tool to bridge this gap. The tool allows us to meta-learn the inductive bias of neural circuits by learning functions that a neural circuit finds easy to generalise, since easy-to-generalise functions are exactly those the circuit chooses to explain incomplete data. We show that in systems where the inductive bias is known analytically, i.e. linear and kernel regression, our tool recovers it. Then, we show it is able to flexibly extract inductive biases from differentiable circuits, including spiking neural networks. 
This illustrates the intended use case of our tool: understanding the role of otherwise opaque pieces of neural functionality, such as non-linearities, learning rules, or connectomic data, through the inductive bias they induce.","Inductive Bias, Generalisation, Meta-Learning, Spiking Neural Network, Neuroscience" Enabling Equation Learning with the Bayesian Model Evidence via systematic $R^2$-elimination,https://openreview.net/forum?id=w47MhmAsbzs,https://openreview.net/pdf?id=w47MhmAsbzs,A pseudo-brute-force model selection strategy using R-squared and Bayesian model evidence that efficiently works for Equation Learning.,"Deep learning is a powerful method for tasks like prediction and classification but lacks interpretability and analytic access. Instead of fitting up to millions of parameters, an intriguing alternative for a wide range of problems would be to learn the governing equations from data. The resulting models would be concise, parameters could be interpreted, the model could adjust to shifts in data, and analytical treatment would allow for extra insights. Common challenges are model complexity identification, stable feature selection, expressivity, computational feasibility, and scarce data. In our work, the mentioned challenges are addressed by combining existing methods in a novel way. We choose multiple regression as a framework and argue that a large space of model equations is captured. For feature selection, we exploit the computationally cheap coefficient of determination ($R^2$) for a model elimination process in a semi-comprehensive search. Final model selection is achieved by exact values of the Bayesian model evidence with empirical priors, which is known to identify suitable model complexity without relying on large amounts of data. Random polynomials, an epidemiological model, and the Lorenz system are used as examples. For the Lorenz system, which is particularly challenging due to its chaotic nature, we demonstrate the favourable performance of our approach compared to existing state-of-the-art methods like SINDy. ","coefficient of determination, Bayesian model evidence, model selection, Equation Learning" Curvature Informed Furthest Point Sampling,https://openreview.net/forum?id=diOVflNRZnG,https://openreview.net/pdf?id=diOVflNRZnG,An extension of the furthest point sampling algorithm that takes curvature information into consideration,"Point cloud representation is becoming increasingly popular due to its low memory footprint, ease of creation, collection and modification. As the size of the point cloud increases, we need to incorporate a down-sampling operation to meet the computational demands of the tasks. Classical approaches such as furthest point sampling perform exceedingly well over downstream tasks. The major drawback is that furthest point sampling is a mere heuristic and does not take geometric priors such as curvature into consideration. We propose a novel sampling procedure that conditions the output of furthest point sampling with curvature information. We create a joint rank by multiplying the soft furthest point rank with corresponding curvature scores obtained via a deep neural network and exchange a percentage of low-ranking points in the furthest set with the high-ranking points in the left-out set. Previous differentiable sampling approaches have failed to conform to the end-to-end learning paradigm due to instability while training. We demonstrate that our algorithm is compatible with end-to-end learning. 
Our sampling scheme consistently outperforms previous baselines on various downstream geometry processing tasks. Finally, we present detailed ablation studies with qualitative and quantitative analysis of the role of different features used in the proposed algorithm.","downsampling, point cloud, curvature informed, shape completion, segmentation, furthest point sampling" Accelerating spiking neural network training using the $d$-block model,https://openreview.net/forum?id=70-hEqC4Wo8,https://openreview.net/pdf?id=70-hEqC4Wo8,We propose a new SNN model which obtains accelerated training and state-of-the-art performance across various neuromorphic datasets without the need for any regularisation and using fewer spikes compared to standard SNNs.,"There is a growing interest in using spiking neural networks (SNNs) to study the brain \textit{in silico} and in emulating them on neuromorphic computers due to their lower energy consumption compared to artificial neural networks (ANNs). Significant progress has been made in directly training SNNs to perform on a par with ANNs in terms of accuracy. However, these methods are slow due to their sequential nature and require careful network regularisation to avoid overfitting. We propose a new SNN model, the $d$-block model, with stochastic absolute refractory periods and recurrent conductance latencies, which reduces the number of sequential computations using fast vectorised operations. Our model obtains accelerated training speeds and state-of-the-art performance across various neuromorphic datasets without the need for any regularisation and using fewer spikes compared to standard SNNs.","spiking neural networks, accelerated training, stochastic refractory period, stochastic recurrent conductance latency" RG: OUT-OF-DISTRIBUTION DETECTION WITH REACTIVATE GRADNORM,https://openreview.net/forum?id=-hMNEMgT8Wd,https://openreview.net/pdf?id=-hMNEMgT8Wd,The information of the joint feature space and output space improves the performance of OOD detection.,"Detecting out-of-distribution (OOD) data is critical to building reliable machine learning systems in the open world. Previous works mainly perform OOD detection in feature space or output space. Recently, researchers have achieved promising results using gradient information, which combines the information in both feature and output space for OOD detection. However, existing works still suffer from the problem of overconfidence. To address this problem, we propose a novel method called ``Reactivate Gradnorm (RG)'', which exploits the norm of the clipped feature vector and the energy in the output space for OOD detection. To verify the effectiveness of our method, we conduct experiments on four benchmark datasets. Experimental results demonstrate that our RG outperforms existing state-of-the-art approaches by 2.06\% in average AUROC. Meanwhile, RG is easy to implement and does not require additional OOD data or a fine-tuning process. We can realize OOD detection in only one forward pass of any pretrained model.","OOD detection, Uncertainty Learning" Don’t fear the unlabelled: safe semi-supervised learning via debiasing,https://openreview.net/forum?id=TN9gQ4x0Ep3,https://openreview.net/pdf?id=TN9gQ4x0Ep3,"We propose a slight modification of the most common semi-supervised learning methods to make them safe by debiasing their risk estimate. 
In particular, we apply it successfully to Fixmatch.","Semi-supervised learning (SSL) provides an effective means of leveraging unlabelled data to improve a model’s performance. Even though the domain has received a considerable amount of attention in the past years, most methods present the common drawback of lacking theoretical guarantees. Our starting point is to notice that the estimate of the risk that most discriminative SSL methods minimise is biased, even asymptotically. This bias impedes the use of standard statistical learning theory and can hurt empirical performance. We propose a simple way of removing the bias. Our debiasing approach is straightforward to implement and applicable to most deep SSL methods. We provide simple theoretical guarantees on the trustworthiness of these modified methods, without having to rely on the strong assumptions on the data distribution that SSL theory usually requires. In particular, we provide generalisation error bounds for the proposed methods. We evaluate debiased versions of different existing SSL methods, such as the Pseudo-label method and Fixmatch, and show that debiasing can compete with classic deep SSL techniques in various settings by providing better calibrated models. Additionally, we provide a theoretical explanation of the intuition behind popular SSL methods. ","Semi-supervised learning, deep learning, empirical risk minimisation, control variate, variance reduction, asymptotic statistics" "Learn Together, Stop Apart: An Inclusive Approach To Early Stopping",https://openreview.net/forum?id=7HSHJQwkna0,https://openreview.net/pdf?id=7HSHJQwkna0,We propose a new scheme for GB pruning based on adaptive stops for different data regions,"Gradient Boosting is the most popular method of constructing ensembles that allows one to obtain state-of-the-art results on many tasks. One of the critical parameters affecting the quality of the learned model is the number of members in the ensemble or the number of boosting iterations. Unfortunately, the problem of selecting the optimal number of models still remains open and understudied. This paper proposes a new look at the optimal stop selection problem in Gradient Boosting. In contrast to the classical approaches that select a universal ensemble size using a hold-out validation set, our algorithm takes into account the heterogeneity of data in the feature space and adaptively sets different numbers of models for different regions of data, but it still uses the same common ensemble trained for the whole task. Experiments on SOTA implementations of Gradient Boosting show that the proposed method does not affect the complexity of learning algorithms and significantly increases quality on most standard benchmarks by up to 2%.","ensemble, boosting, regularization, clusterization" Gandalf : Data Augmentation is all you need for Extreme Classification,https://openreview.net/forum?id=05ff9BRSMzE,https://openreview.net/pdf?id=05ff9BRSMzE,,"Extreme Multi-label Text Classification (XMC) involves learning a classifier that can assign to an input a subset of the most relevant labels from millions of label choices. Recent works in this domain have increasingly focused on the problem setting with short-text input data, and labels endowed with short textual descriptions called label features. Short-text XMC with label features has found numerous applications in areas such as prediction of related searches, title-based product recommendation, bid-phrase suggestion, amongst others. 
In this paper, we propose Gandalf, a graph-induced data augmentation based on label features, such that the generated data-points can supplement the training distribution. By exploiting the characteristics of the short-text XMC problem, it leverages the label features to construct valid training instances, and uses the label graph for generating the corresponding soft-label targets, hence effectively capturing the label-label correlations. While most recent advances (such as SiameseXML and ECLARE) in XMC have been algorithmic, mainly aimed towards developing novel deep-learning architectures, our data-centric augmentation approach is orthogonal to these methodologies. We demonstrate the generality and effectiveness of Gandalf by showing up to 30% relative improvements for 5 state-of-the-art algorithms across 4 benchmark datasets consisting of up to 1.3 million labels. ","Extreme Classification, Data Augmentation, Search and Recommendation" Learning a Data-Driven Policy Network for Pre-Training Automated Feature Engineering,https://openreview.net/forum?id=688hNNMigVX,https://openreview.net/pdf?id=688hNNMigVX,We propose a data-driven automated feature engineering framework, FETCH.,"Feature engineering is widely acknowledged to be pivotal in tabular data analysis and prediction. Automated feature engineering (AutoFE) emerged to automate this process, which is conventionally managed by experienced data scientists and engineers. In this area, most — if not all — prior work adopted an identical framework from the neural architecture search (NAS) method. While feasible, we posit that the NAS framework very much contradicts the way human experts cope with the data, since the inherent Markov decision process (MDP) setup differs. We point out that its data-unobserved setup consequently results in an inability to generalize across different datasets as well as high computational cost. This paper proposes a novel AutoFE framework Feature Set Data-Driven Search (FETCH), a pipeline mainly for feature generation and selection. Notably, FETCH is built on a brand-new data-driven MDP setup using the tabular dataset as the state fed into the policy network. Further, we posit that the crucial merit of FETCH is its transferability: the yielded policy network, trained on a variety of datasets, is indeed capable of performing feature engineering on unseen data without requiring additional exploration. To the best of our knowledge, this is a pioneering attempt to build a tabular data pre-training paradigm via AutoFE. Extensive experiments show that FETCH systematically surpasses the current state-of-the-art AutoFE methods and validates the transferability of AutoFE pre-training.","Automated Feature Engineering, Reinforcement Learning, Tabular Data, Data-Driven, Pre-Training" Attention Flows for General Transformers,https://openreview.net/forum?id=pcBJT4bgbpH,https://openreview.net/pdf?id=pcBJT4bgbpH,We formalize and generalize a method to construct a flow network out of the attention values of Transformer models to compute how much an input token influences a model's prediction.,"In this paper, we study the computation of how much an input token in a Transformer model influences its prediction. We formalize a method to construct a flow network out of the attention values of encoder-only Transformer models and extend it to general Transformer architectures, including an auto-regressive decoder. 
We show that running a maxflow algorithm on the flow network construction yields Shapley values, which determine a player's impact in cooperative game theory. By interpreting the input tokens in the flow network as players, we can compute their influence on the total attention flow leading to the decoder's decision. Additionally, we provide a library that computes and visualizes the attention flow of arbitrary Transformer models. We show the usefulness of our implementation on various models trained on natural language processing and reasoning tasks.","transformer, explanations, attention flow, shapley value" Grounded Contrastive Learning for Open-world Semantic Segmentation,https://openreview.net/forum?id=ngV1BPp6xDc,https://openreview.net/pdf?id=ngV1BPp6xDc,"We propose a novel open-world segmentation framework using image-text pairs, which optimizes text-region alignment explicitly.","Contrastive learning (CL) with large-scale image-text paired data has made great strides in open-world image recognition. The progress has raised interest in open-world semantic segmentation---aiming at learning to segment arbitrary visual concepts in images. Existing open-world segmentation methods adopt CL to learn diverse visual concepts and adapt its image-level understanding to the segmentation task. However, while existing CL-based methods have shown impressive results, conventional CL considers only image-text level alignment without explicit optimization of region-text level alignment, thus leading to a sub-optimal solution for the segmentation task. In this paper, we propose a novel Grounded Contrastive Learning (GCL) framework to directly align a text and regions described by the text. Our method generates a segmentation mask associated with a given text, extracts grounded image embedding from the masked image region, and aligns it with text embedding via GCL. The framework encourages a model to directly improve the quality of generated segmentation masks. In addition, for a rigorous and fair comparison, we present a unified evaluation protocol with 8 widely used semantic segmentation datasets. GCL achieves state-of-the-art zero-shot segmentation performance with large margins in all datasets. The code will be made publicly available.","open-world semantic segmentation, zero-shot segmentation" Making Substitute Models More Bayesian Can Enhance Transferability of Adversarial Examples,https://openreview.net/forum?id=bjPPypbLre,https://openreview.net/pdf?id=bjPPypbLre,,"The transferability of adversarial examples across deep neural networks (DNNs) is the crux of many black-box attacks. Many prior efforts have been devoted to improving the transferability via increasing the diversity in inputs of some substitute models. In this paper, by contrast, we opt for the diversity in substitute models and advocate attacking a Bayesian model for achieving desirable transferability. Deriving from the Bayesian formulation, we develop a principled strategy for possible finetuning, which can be combined with many off-the-shelf Gaussian posterior approximations over DNN parameters. Extensive experiments have been conducted to verify the effectiveness of our method, on common benchmark datasets, and the results demonstrate that our method outperforms recent state-of-the-art methods by large margins (roughly $16.8\%$ absolute increase in average attack success rate on ImageNet), and, by combining with these recent methods, further performance gain can be obtained. Our code will be publicly available. 
","Adversarial Examples, Black-box Attacks, Adversarial Transferability" Learning Group Importance using the Differentiable Hypergeometric Distribution,https://openreview.net/forum?id=75O7S_L4oY,https://openreview.net/pdf?id=75O7S_L4oY,We propose the differentiable hypergeometric distribution and show the advantage of explicitly learning subset sizes.,"Partitioning a set of elements into subsets of a priori unknown sizes is essential in many applications. These subset sizes are rarely explicitly learned - be it the cluster sizes in clustering applications or the number of shared versus independent generative latent factors in weakly-supervised learning. Probability distributions over correct combinations of subset sizes are non-differentiable due to hard constraints, which prohibit gradient-based optimization. In this work, we propose the differentiable hypergeometric distribution. The hypergeometric distribution models the probability of different group sizes based on their relative importance. We introduce reparameterizable gradients to learn the importance between groups and highlight the advantage of explicitly learning the size of subsets in two typical applications: weakly-supervised learning and clustering. In both applications, we outperform previous approaches, which rely on suboptimal heuristics to model the unknown size of groups.","hypergeometric distribution, weakly-supervised learning, reparameterization trick, group importance, variational clustering, gumbel softmax" Convergence Rate of Primal-Dual Approach to Constrained Reinforcement Learning with Softmax Policy,https://openreview.net/forum?id=WP0zFLrO01,https://openreview.net/pdf?id=WP0zFLrO01,"We propose a primal-dual policy gradient approach to solve constrained reinforcement learning problems, and show it needs iteration complexity.","In this paper, we consider primal-dual approach to solve constrained reinforcement learning (RL) problems, where we formulate constrained reinforcement learning under constrained Markov decision process (CMDP). We propose the primal-dual policy gradient (PD-PG) algorithm with softmax policy. Although the constrained RL involves a non-concave maximization problem over the policy parameter space, we show that for both exact policy gradient and model-free learning, the proposed PD-PG needs iteration complexity of $\mathcal{O}\left(\epsilon^{-2}\right)$ to achieve its optimal policy for both constraint and reward performance. Such an iteration complexity outperforms or matches most constrained RL algorithms. For the learning with exact policy gradient, the main challenge is to show the positivity of deterministic optimal policy (at the optimal action) is independent on both state space and iteration times. For the model-free learning, since we consider the discounted infinite-horizon setting, and the simulator can not rollout with an infinite-horizon sequence; thus one of the main challenges lies in how to design unbiased value function estimators with finite-horizon trajectories. 
We construct unbiased estimators from finite-horizon trajectories whose horizons follow a geometric distribution, which is the key technique for obtaining the theoretical results for model-free learning.","Constrained Reinforcement Learning, Constrained Markov Decision Process" Cross-Layer Retrospective Retrieving via Layer Attention,https://openreview.net/forum?id=pvgEL1yS3Ql,https://openreview.net/pdf?id=pvgEL1yS3Ql,A multi-head recurrent layer attention mechanism is proposed to retrieve query-related information from previous layers.,"More and more evidence has shown that strengthening layer interactions can enhance the representation power of a deep neural network, while self-attention excels at learning interdependencies by retrieving query-activated information. Motivated by this, we devise a cross-layer attention mechanism, called multi-head recurrent layer attention (MRLA), that sends a query representation of the current layer to all previous layers to retrieve query-related information from different levels of receptive fields. A lightweight version of MRLA is also proposed to reduce the quadratic computation cost. The proposed layer attention mechanism can enrich the representation power of many state-of-the-art vision networks, including CNNs and vision transformers. Its effectiveness has been extensively evaluated in image classification, object detection and instance segmentation tasks, where improvements can be consistently observed. For example, our MRLA improves the Top-1 accuracy of ResNet-50 by 1.6%, while introducing only 0.16M parameters and 0.07B FLOPs. Surprisingly, it can boost performance by a large margin of 3-4% box AP and mask AP in dense prediction tasks. ","Layer Attention, Recurrent Layer Attention, Layer Interaction, CNNs, Vision Transformers, Vision Networks" RephraseTTS: Dynamic Length Text based Speech Insertion with Speaker Style Transfer,https://openreview.net/forum?id=8zsK9lbna9L,https://openreview.net/pdf?id=8zsK9lbna9L,,"We propose a method for the task of text-conditioned speech insertion, i.e.\ inserting a speech sample in an input speech sample, conditioned on the corresponding complete text transcript. An example use case of the task would be to update the speech audio when corrections are made to the corresponding text transcript. The proposed method follows a transformer-based non-autoregressive approach that allows speech insertions of variable lengths, which are dynamically determined during inference, based on the text transcript and tempo of the available partial input. It is capable of maintaining the speaker's voice characteristics, prosody and other spectral properties of the available speech input. Results from our experiments and user study on LibriTTS show that our method outperforms baselines based on an existing adaptive text to speech method. We also provide numerous qualitative results to appreciate the quality of the output from the proposed method.", Decision S4: Efficient Sequence-Based RL via State Spaces Layers,https://openreview.net/forum?id=kqHkCVS7wbj,https://openreview.net/pdf?id=kqHkCVS7wbj,Replacing transformers with state-space layers for RL modeling. Also extended to on-policy training.,"Recently, sequence learning methods have been applied to the problem of off-policy Reinforcement Learning, including the seminal work on Decision Transformers, which employs transformers for this task. 
Since transformers are parameter-heavy, cannot benefit from history longer than a fixed window size, and are not computed using recurrence, we set out to investigate the suitability of the S4 family of models, which are based on state-space layers and have been shown to outperform transformers, especially in modeling long-range dependencies. In this work, we present two main algorithms: (i) an off-policy training procedure that works with trajectories, while still maintaining the training efficiency of the S4 model; (ii) an on-policy training procedure that is trained in a recurrent manner, benefits from long-range dependencies, and is based on a novel stable actor-critic mechanism. Our results indicate that our method outperforms multiple variants of decision transformers, as well as the other baseline methods on most tasks, while reducing the latency, number of parameters, and training time by several orders of magnitude, making our approach more suitable for real-world RL.","Sequential RL, S4, Decision transformers" Deep autoregressive density nets vs neural ensembles for model-based offline reinforcement learning,https://openreview.net/forum?id=gvOSQjGTtxj,https://openreview.net/pdf?id=gvOSQjGTtxj,We show that in model-based offline reinforcement learning a better performance can be obtained with a single well-calibrated autoregressive system model than with the usual ensembles.,"We consider the problem of offline reinforcement learning where only a set of system transitions is made available for policy optimization. Following recent advances in the field, we consider a model-based reinforcement learning algorithm that infers the system dynamics from the available data and performs policy optimization on imaginary model rollouts. This approach is vulnerable to exploiting model errors, which can lead to catastrophic failures on the real system. The standard solution is to rely on ensembles for uncertainty heuristics and to avoid exploiting the model where it is too uncertain. We challenge the popular belief that we must resort to ensembles by showing that better performance can be obtained with a single well-calibrated autoregressive model on the D4RL benchmark. We also analyze static metrics of model-learning and draw conclusions about the model properties that are important for the final performance of the agent.","Offline reinforcement learning, batch reinforcement learning, ensemble, autoregressive, D4RL, model-based" Light and Accurate: Neural Architecture Search via Two Constant Shared Weights Initialisations,https://openreview.net/forum?id=Z2oC6xxaWPL,https://openreview.net/pdf?id=Z2oC6xxaWPL,"Zero-cost neural architecture search with two forward passes: no gradients, no labels, any architecture type, high accuracy.","In recent years, zero-cost proxies have been gaining ground in the field of neural architecture search (NAS). These methods allow finding the optimal neural network for a given task faster and with a lower computational load than conventional NAS methods. Equally important is the fact that they also shed some light on the internal workings of neural architectures. In this paper we present a zero-cost metric that is highly correlated with the trained accuracy across the NAS-Bench-101, NAS-Bench-201 and NAS-Bench-NLP benchmark datasets. Architectures are initialised with two distinct constant shared weights, one at a time. Then, a fixed random mini-batch of data is passed forward through each initialisation. 
We observe that the dispersion of the outputs between the two initialisations is positively correlated with the trained accuracy. The correlation further improves when the dispersion is normalised by the average output magnitude. Our metric requires neither gradient computation nor true labels. It thus unbinds the NAS procedure from training hyperparameters, the loss metric, and human-labelled data. Our method is easy to integrate within existing NAS algorithms and takes a fraction of a second to evaluate a single network.","neural architecture search, zero-cost, machine learning" Unveiling the sampling density in non-uniform geometric graphs,https://openreview.net/forum?id=mnVf1W6ipGm,https://openreview.net/pdf?id=mnVf1W6ipGm,"We introduce geometric graphs with hubs, an effective model for real-world graphs, and retrieve the sampling density by which those graphs are sampled from continuous latent spaces, to achieve various tasks.","A powerful framework for studying graphs is to consider them as geometric graphs: nodes are randomly sampled from an underlying metric space, and any pair of nodes is connected if their distance is less than a specified neighborhood radius. Currently, the literature mostly focuses on uniform sampling and constant neighborhood radius. However, real-world graphs are likely to be better represented by a model in which the sampling density and the neighborhood radius can both vary over the latent space. For instance, in a social network, communities can be modeled as densely sampled areas, and hubs as nodes with larger neighborhood radius. In this work, we first perform a rigorous mathematical analysis of this (more general) class of models, including derivations of the resulting graph shift operators. The key insight is that graph shift operators should be corrected in order to avoid potential distortions introduced by the non-uniform sampling. Then, we develop methods to estimate the unknown sampling density in a self-supervised fashion. Finally, we present exemplary applications in which the learnt density is used to 1) correct the graph shift operator and improve performance on a variety of tasks, 2) improve pooling, and 3) extract knowledge from networks. Our experimental findings support our theory and provide strong evidence for our model.","graph neural network, graph representation learning, spectral method, non-uniform sampling, geometric graph, graphon" Smooth image-to-image translations with latent space interpolations,https://openreview.net/forum?id=Lgp4Y2Tor34,https://openreview.net/pdf?id=Lgp4Y2Tor34,We regularize the latent spaces to obtain smooth image-to-image translations. We also create a metric to quantitatively measure how smooth the translations are.,"Multi-domain image-to-image (I2I) translations can transform a source image according to the style of a target domain. One important, desired characteristic of these transformations is their graduality, which corresponds to a smooth change between the source and the target image when their respective latent-space representations are linearly interpolated. However, state-of-the-art methods usually perform poorly when evaluated using inter-domain interpolations, often producing abrupt changes in the appearance or non-realistic intermediate images. 
In this paper, we argue that one of the main reasons behind this problem is the lack of sufficient inter-domain training data, and we propose two different regularization methods to alleviate this issue: a new shrinkage loss, which compacts the latent space, and a Mixup data-augmentation strategy, which flattens the style representations between domains. We also propose a new metric to quantitatively evaluate the degree of interpolation smoothness, an aspect which is not sufficiently covered by the existing I2I translation metrics. Using both our proposed metric and standard evaluation protocols, we show that our regularization techniques can improve the state-of-the-art multi-domain I2I translations by a large margin. Our code will be made publicly available upon the acceptance of this article.","image-to-image translation, GANs, mixup, latent spaces interpolation" Boosting Causal Discovery via Adaptive Sample Reweighting,https://openreview.net/forum?id=LNpMtk15AS4,https://openreview.net/pdf?id=LNpMtk15AS4,Automatically learn the adaptive weights for each observation to boost score-based causal discovery performance. ,"Under stringent model type and variable distribution assumptions, score-based causal discovery methods learn the directed acyclic graph (DAG) from observational data by evaluating candidate graphs over an averaged score function. Despite the great success in low-dimensional linear systems, it has been observed that these approaches overly exploit easier-to-fit samples, thus inevitably learning spurious edges. Worse still, the common homogeneity assumption of most causal discovery methods can be easily violated due to the widespread existence of heterogeneous data in the real world, resulting in performance vulnerability when noise distributions vary. We propose a simple yet effective model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore for short, where the learned weights are quantitatively tailored to the importance of each sample. Intuitively, we leverage the bilevel optimization scheme to alternately train a standard DAG learner, then upweight the samples that the DAG learner fails to fit well and downweight the samples that the DAG learner easily extracts the causation information from. Extensive experiments on both synthetic and real-world datasets are carried out to validate the effectiveness of ReScore. We observe consistent and significant boosts in structure learning performance. We further visualize that ReScore concurrently mitigates the influence of spurious edges and generalizes to heterogeneous data. Finally, we perform theoretical analysis to guarantee the structure identifiability and the weight adaptive properties of ReScore. Our codes are available at https://anonymous.4open.science/r/ReScore-7631.","Causal Structure Learning, Score-based Causal Discovery, Adaptive Sample Reweighting" Robust Training through Adversarially Selected Data Subsets,https://openreview.net/forum?id=BdcfKgE9dhF,https://openreview.net/pdf?id=BdcfKgE9dhF,Develops a robust learning strategy where a subset of instances is selectively chosen for perturbation and the selection strategy is never revealed to the learner.,"Robustness to adversarial perturbations often comes at the cost of a drop in accuracy on unperturbed or clean instances. 
Most existing defense mechanisms attempt to defend the learner from attacks on all possible instances, which often degrades the accuracy on clean instances significantly. However, in practice, an attacker might only select a small subset of instances to attack, e.g., in facial recognition systems an adversary might aim to target specific faces. Moreover, the subset selection strategy of the attacker is seldom known to the defense mechanism a priori, making it challenging to attune the mechanism beforehand. This motivates designing defense mechanisms which can (i) defend against attacks on subsets instead of all instances to prevent degradation of clean accuracy and (ii) ensure good overall performance for attacks on any selected subset. In this work, we take a step towards solving this problem. We cast the training problem as a min-max game involving worst-case subset selection along with optimization of model parameters, rendering the problem NP-hard. To tackle this, we first show that, for a given learner's model, the objective can be expressed as a difference between a $\gamma$-weakly submodular and a modular function. We use this property to propose ROGET, an iterative algorithm, which admits approximation guarantees for a class of loss functions. Our experiments show that ROGET obtains better overall accuracy compared to several state-of-the-art defense methods for different adversarial subset selection techniques.","Subset selection, Robust learning" Beyond Reward: Offline Preference-guided Policy Optimization,https://openreview.net/forum?id=i8AnfJYMvz,https://openreview.net/pdf?id=i8AnfJYMvz,We propose an end-to-end offline preference-based reinforcement learning formulation that directly optimizes the policy by preference supervision without learning a separate reward function.,"In this work, we study offline preference-based reinforcement learning (PbRL), which relaxes the two fundamental supervisory signals in standard reinforcement learning (online-accessible transition dynamics and rewards). In other words, the agent is provided with fixed offline trajectory transitions and human preferences between pairs of trajectories. Due to the orthogonality property of rewards and dynamics, one common practice is combining prior PbRL-based reward learning objectives with off-the-shelf offline RL algorithms to bridge preference modeling and offline learning. However, this two-stage optimization requires learning a separate reward function and thus places an information bottleneck on reward learning (the bridge). As an alternative, we propose offline preference-guided policy optimization (OPPO), an end-to-end offline PbRL formulation, which jointly learns to model the preference (for finding the optimal task policy) and the offline data (for eliminating OOD). In particular, OPPO introduces an offline hindsight information matching objective and a preference modeling objective. Then, by iterating over the two objectives, we can directly extract a well-performing decision policy, avoiding separate reward learning. 
We empirically show that OPPO effectively models offline preferences and outperforms prior competing baselines (including offline RL algorithms that operate on the true reward function).","offline reinforcement learning, preference-based reinforcement learning, hindsight information matching, preference-guided policy optimization" Iterative Circuit Repair Against Formal Specifications,https://openreview.net/forum?id=SEcSahl0Ql,https://openreview.net/pdf?id=SEcSahl0Ql,We present a deep learning approach for repairing sequential circuits against formal specifications given in linear-time temporal logic (LTL).,"We present a deep learning approach for repairing sequential circuits against formal specifications given in linear-time temporal logic (LTL). Given a defective circuit and its formal specification, we train Transformer models to output circuits that satisfy the corresponding specification. We propose a separated hierarchical Transformer for multimodal representation learning of the formal specification and the circuit. We introduce a data generation algorithm that enables generalization to more complex specifications and out-of-distribution datasets. In addition, our proposed repair mechanism significantly improves the automated synthesis of circuits from LTL specifications with Transformers. It improves the state-of-the-art by 6.8 percentage points on held-out instances and 11.8 percentage points on an out-of-distribution dataset from the annual reactive synthesis competition.","sequential circuits, repair, synthesis, transformer" Neural Probabilistic Logic Programming in Discrete-Continuous Domains,https://openreview.net/forum?id=dyifcA9UuRo,https://openreview.net/pdf?id=dyifcA9UuRo,DeepSeaProbLog: a neural probabilistic logic programming language with discrete and continuous random variables.,"Neural-symbolic AI (NeSy) methods allow neural networks to exploit symbolic background knowledge. NeSy has been shown to aid learning in the limited data regime and to facilitate inference on out-of-distribution data. Neural probabilistic logic programming (NPLP) is a popular NeSy approach that integrates probabilistic models with neural networks and logic programming. A major limitation of current NPLP systems, such as DeepProbLog, is their restriction to discrete and finite probability distributions, e.g., binary random variables. To overcome this limitation, we introduce DeepSeaProbLog, an NPLP language that supports discrete and continuous random variables on (possibly) infinite and even uncountable domains. Our main contributions are 1) the introduction of DeepSeaProbLog and its semantics, 2) an implementation of DeepSeaProbLog that supports inference and gradient-based learning, and 3) an experimental evaluation of our approach.","neural-symbolic AI, logic, probability, neural networks, probabilistic logic programming, neuro-symbolic integration, learning and reasoning" Can BERT Refrain from Forgetting on Sequential Tasks? A Probing Study,https://openreview.net/forum?id=UazgYBMS9-W,https://openreview.net/pdf?id=UazgYBMS9-W,,"Large pre-trained language models have helped to achieve state-of-the-art results on a variety of NLP tasks; nevertheless, they still suffer from forgetting when incrementally learning a series of sequential tasks. To alleviate this problem, recent works propose several models enhanced by sparse experience replay and local adaptation, which yield satisfactory performance. 
However, in this paper we find that pre-trained language models like BERT have a potential ability to learn sequentially, even without any sparse memory replay. To verify the ability of BERT to maintain old knowledge, we adopt and re-finetune single-layer probe networks with the parameters of BERT fixed. We investigate the models on two typical kinds of NLP tasks, text classification and extractive question answering. Our experiments reveal that BERT can actually generate high-quality representations for previous tasks over the long term, under extremely sparse replay or even no replay. We further introduce a series of methods to interpret the mechanism of forgetting and how memory rehearsal plays a significant role in task incremental learning, which bridges the gap between our new discovery and previous studies about catastrophic forgetting. Additionally, we provide both quantitative and visualized results demonstrating that the representation space of BERT is always topologically organised, which guarantees its performance.","Natural Language Processing, Probing Study" Behavior Proximal Policy Optimization ,https://openreview.net/forum?id=3c13LptpIph,https://openreview.net/pdf?id=3c13LptpIph," We propose Behavior Proximal Policy Optimization (BPPO), which is based on the on-policy method PPO and effectively solves offline RL without introducing any extra constraint or regularization. ","Offline reinforcement learning (RL) is a challenging setting where existing off-policy actor-critic methods perform poorly due to the overestimation of out-of-distribution actions. Thus, various additional augmentations are proposed to keep the learned policy close to the offline dataset (or behavior policy). In this work, starting from the analysis of offline monotonic policy improvement, we arrive at the surprising finding that some online on-policy algorithms are naturally able to solve offline RL. Specifically, the inherent conservatism of these on-policy algorithms is exactly what the offline RL method needs to accomplish this closeness. Based on this, we design an algorithm called Behavior Proximal Policy Optimization (BPPO), which successfully solves offline RL without any extra constraint or regularization introduced. Extensive experiments on the D4RL benchmark indicate this extremely succinct method outperforms state-of-the-art offline RL algorithms.","Offline Reinforcement Learning, Monotonic Policy Improvement" UiTTa: Online Test-Time Adaptation by User Interaction,https://openreview.net/forum?id=j7MnZOwaQ__,https://openreview.net/pdf?id=j7MnZOwaQ__,We explore test-time adaptation from model-user interaction. ,"We explore user interaction-based test-time adaptation (UITTA), which adapts a model to shifted test distributions with supervision signals from model-user interactions. Model adaptation in TTA can fail since models learn from the noisy pseudo-labels of the test data. UITTA achieves better adaptation from user feedback on top-K predictions within two rounds of simulated interactions. To enable real-time adaptation, we further accelerate model optimization by reducing the cost of gradient backpropagation through random dropping of backward paths. Simulation experiments on cross-lingual transfer, domain generalization, and corruption robustness show that low-cost user feedback can significantly boost TTA in performance, even competing with online active learning, which, however, needs expensive human annotation. 
By accelerating pre-trained language models, we reduce backpropagation cost by 70-90% with only a small drop in performance.","Test-time Adaptation, Human-AI Interaction, Model Robustness, Distribution Shift" FedGC: An Accurate and Efficient Federated Learning under Gradient Constraint for Heterogeneous Data,https://openreview.net/forum?id=eZN8nUXAVO7,https://openreview.net/pdf?id=eZN8nUXAVO7,"An accurate and efficient Federated Learning method is proposed to improve the performance on Non-IID data by mitigating catastrophic forgetting at clients and effectively aggregating clients’ knowledge at server, while reducing local training time.","Federated Learning (FL) is an important paradigm in large-scale distributed machine learning, which enables multiple clients to jointly learn a unified global model without transmitting their local data to a central server. FL has attracted growing attention in many real-world applications, such as multi-center cardiovascular disease diagnosis and autonomous driving. Practically, the data across clients are always heterogeneous, i.e., not independently and identically distributed (Non-IID), making the local models suffer from catastrophic forgetting of the initial (or global) model. To mitigate this forgetting issue, existing FL methods may require additional regularization terms or generate pseudo data, resulting in 1) limited accuracy; 2) long training time and slow convergence rate for real-time applications; and 3) high communication cost. In this work, an accurate and efficient Federated Learning algorithm under Gradient Constraints (FedGC) is proposed, which provides three advantages: i) High accuracy is achieved by the proposed Client-Gradient-Constraint based projection method (CGC) to alleviate the forgetting issue that occurs in clients, and the proposed Server-Gradient-Constraint based projection method (SGC) to effectively aggregate the gradients of clients; ii) Short training time and fast convergence rate are enabled by the proposed fast Pseudo-gradient-based mini-batch Gradient Descent (PGD) method and SGC; iii) Low communication cost is required due to the fast convergence rate, and only gradients need to be transmitted between the server and clients. In the experiments, four real-world image datasets with three Non-IID types are evaluated, and five popular FL methods are used for comparison. The experimental results demonstrate that our FedGC not only significantly improves the accuracy and convergence rate on Non-IID data, but also drastically decreases the training time. Compared to the state-of-the-art FedReg, our FedGC improves accuracy by up to 14.28% and speeds up local training by 15.5 times while decreasing communication cost by 23%.","Federated Learning, Non-IID data" Actionable Neural Representations: Grid Cells from Minimal Constraints,https://openreview.net/forum?id=xfqDe72zh41,https://openreview.net/pdf?id=xfqDe72zh41,"We study a novel definition of an optimal representation of structured spaces, and show that it can be used to derive the brain's grid cells and their perturbations normatively. ","To afford flexible behaviour, the brain must build internal representations that mirror the structure of variables in the external world. For example, 2D space obeys rules: the same set of actions combine in the same way everywhere (step north, then south, and you won't have moved, wherever you start). 
We suggest the brain must represent this consistent meaning of actions across space, as it allows you to find new short-cuts and navigate in unfamiliar settings. We term this representation an `actionable representation'. We formulate actionable representations using group and representation theory, and show that, when combined with biological and functional constraints - non-negative firing, bounded neural activity, and precise coding - multiple modules of hexagonal grid cells are the optimal representation of 2D space. We support this claim with intuition, analytic justification, and simulations. Our analytic results normatively explain a set of surprising grid cell phenomena, and make testable predictions for future experiments. Lastly, we highlight the generality of our approach beyond just understanding 2D space. Our work characterises a new principle for understanding and designing flexible internal representations: they should be actionable, allowing animals and machines to predict the consequences of their actions, rather than just encode.","Grid Cells, Representation Theory, Theoretical Neuroscience, Normative Models" xTrimoDock: Cross-Modal Transformer for Multi-Chain Protein Docking,https://openreview.net/forum?id=KL6i1IdwQ6z,https://openreview.net/pdf?id=KL6i1IdwQ6z,,"The structure of a protein–protein complex plays a critical role in understanding the dynamics of binding, delineating biological mechanisms, and developing intervention strategies. Rigid protein-protein docking, assuming no conformational change within proteins, predicts the 3D structure of protein complexes from unbound chains. According to the number of chains, rigid docking is divided into the binary complex setting, which contains only two chains, and the more ubiquitous multi-chain complex setting. Most existing docking methods are tailored for binary complexes, and are computationally expensive or not guaranteed to find accurate complex structures. In this paper, we propose a novel model xTrimoDock for the docking of multi-chain complexes, which can simultaneously employ information from both the sequence modality and the structure modality of the involved protein chains. Specifically, xTrimoDock leverages a cross-modal transformer to integrate representations from protein sequences and structures, and conducts a multi-step prediction of rotations and translations to accomplish the multi-chain docking. Extensive experiments reflect the promising results of the proposed model in the harder multi-chain complex setting.", Compression-aware Training of Neural Networks using Frank-Wolfe,https://openreview.net/forum?id=ueEMZjY9WiM,https://openreview.net/pdf?id=ueEMZjY9WiM,,"Many existing Neural Network pruning approaches either rely on retraining to compensate for pruning-caused performance degradation or they induce strong biases to converge to a specific sparse solution throughout training. A third paradigm, 'compression-aware' training, obtains state-of-the-art dense models which are robust to a wide range of compression ratios using a single dense training run while also avoiding retraining. In that vein, we propose a constrained optimization framework centered around a versatile family of norm constraints and the Stochastic Frank-Wolfe (SFW) algorithm which together encourage convergence to well-performing solutions while inducing robustness towards convolutional filter pruning and low-rank matrix decomposition. 
Comparing our novel approaches to compression methods in these domains on benchmark image-classification architectures and datasets, we find that our proposed scheme is able to yield competitive results, often outperforming existing compression-aware approaches. In the case of low-rank matrix decomposition, our approach can require far fewer computational resources than nuclear-norm regularization based approaches by requiring only a fraction of the singular values in each iteration. As a special case, our proposed constraints can be extended to include the unstructured sparsity-inducing constraint proposed by Pokutta et al. (2020) and Miao et al. (2022), which we improve upon. Our findings also indicate that the robustness of SFW-trained models largely depends on the gradient rescaling of the learning rate, and we establish a theoretical foundation for that practice.","compression aware, neural network, frank-wolfe, pruning" Modeling content creator incentives on algorithm-curated platforms,https://openreview.net/forum?id=l6CpxixmUg,https://openreview.net/pdf?id=l6CpxixmUg,Algorithmic choices in modern recommenders may have significant and unexpected effects on content creator incentives.,"Content creators compete for user attention. Their reach crucially depends on algorithmic choices made by developers on online platforms. To maximize exposure, many creators adapt strategically, as evidenced by examples like the sprawling search engine optimization industry. This begets competition for the finite user attention pool. We formalize these dynamics in what we call an exposure game, a model of incentives induced by modern algorithms including factorization and (deep) two-tower architectures. We prove that seemingly innocuous algorithmic choices—e.g., non-negative vs. unconstrained factorization—significantly affect the existence and character of (Nash) equilibria in exposure games. We proffer use of creator behavior models like ours for an (ex-ante) pre-deployment audit. Such an audit can identify misalignment between desirable and incentivized content, and thus complement post-hoc measures like content filtering and moderation. To this end, we propose tools for numerically finding equilibria in exposure games, and illustrate results of an audit on the MovieLens and LastFM datasets. Among other things, we find that the strategically produced content exhibits strong dependence between algorithmic exploration and content diversity, and between model expressivity and bias towards gender-based user and creator groups.","Nash equilibria, producer incentives, attention monetizing platforms, recommenders, differentiable games, exposure game" MBrain: A Multi-channel Self-Supervised Learning Framework for Brain Signals,https://openreview.net/forum?id=ashgrQnYsm,https://openreview.net/pdf?id=ashgrQnYsm,,"Brain signals are important quantitative data for understanding physiological activities and diseases of the human brain. Meanwhile, rapidly developing deep learning methods offer a wide range of opportunities for better modeling brain signals, which has attracted considerable research efforts recently. Most existing studies pay attention to supervised learning methods, which, however, require high-cost clinical labels. In addition, the huge difference in the clinical patterns of brain signals measured by invasive (e.g., SEEG) and non-invasive (e.g., EEG) methods leads to the lack of a unified method.
To handle the above issues, in this paper, we propose to study a self-supervised learning (SSL) framework for brain signals that can be applied to pre-train either SEEG or EEG data. Intuitively, brain signals, generated by the firing of neurons, are transmitted among different connecting structures in the human brain. Inspired by this, we propose to learn implicit spatial and temporal correlations between different channels (i.e., contacts of the electrode, corresponding to different brain areas) as the cornerstone for uniformly modeling different types of brain signals. Specifically, we capture the temporal correlation by designing the delayed-time-shift prediction task; we represent the spatial correlation by a graph structure, which is built with the goal of maximizing the mutual information of each channel and its correlated ones. We further theoretically prove that our design can lead to a better predictive representation. Extensive experiments of seizure detection on both EEG and SEEG large-scale real-world datasets demonstrate that our model outperforms several state-of-the-art time series SSL and unsupervised models.","brain signals, self-supervised learning, multi-channel time series, seizure detection" Group-Disentangling Conditional Shift,https://openreview.net/forum?id=d5U-bPKPde,https://openreview.net/pdf?id=d5U-bPKPde,"A VAE-based model that can group-disentangle data under conditional shift, evaluated on fair comparisons between student test scores.","We propose a novel group disentanglement method called the Context-Aware Variational Autoencoder (CxVAE). Our model can learn disentangled representations on datasets with conditional shift. This phenomenon occurs when the conditional distribution of the instance-level latent variable $\mathbf{z}$ given the input observation $\mathbf{x}$ changes from one group to another (i.e. $p_i(\mathbf{z}|\mathbf{x}) \neq p_j(\mathbf{z}|\mathbf{x})$, where $i,j$ are two different groups). We show that existing methods fail to learn disentangled representations under this scenario because they infer the group $\mathbf{u}$ and instance $\mathbf{z}$ variables separately. CxVAE overcomes this limitation by conditioning the instance inference on the group variable $q(\mathbf{z}|\mathbf{x},\mathbf{u})$. Our model has the novel ability to disentangle ambiguous observations (those with incomplete information about the generative factors), which we evaluate on the task of fair comparisons between student test scores. Additionally, we demonstrate empirically that conditional shift is the cause of our model's improved performance.","group disentanglement, variational autoencoders, conditional shift" When and Why Is Pretraining Object-Centric Representations Good for Reinforcement Learning?,https://openreview.net/forum?id=oL2uVCVlyf,https://openreview.net/pdf?id=oL2uVCVlyf,,"Unsupervised object-centric representation (OCR) learning has recently been drawing a lot of attention as a new paradigm of visual representation. This is because of its potential as an effective pretraining technique for various downstream tasks in terms of sample efficiency, systematic generalization, and reasoning. Although image-based reinforcement learning (RL) is one of the most important and thus frequently mentioned such downstream tasks, the benefit in RL has surprisingly not been investigated systematically thus far. Instead, most of the evaluations have focused on rather indirect metrics such as segmentation quality and object property prediction accuracy.
In this paper, we investigate the effectiveness of OCR pretraining for image-based reinforcement learning via empirical experiments. For systematic evaluation, we introduce a simple object-centric visual RL benchmark and verify a series of hypotheses answering questions such as ""Does OCR pretraining provide better sample efficiency?"", ""Which types of RL tasks benefit most from OCR pretraining?"", and ""Can OCR pretraining help with out-of-distribution generalization?"". The results suggest that OCR pretraining is particularly effective in tasks where the relationship between objects is important, improving both task performance and sample efficiency when compared to single-vector representations. Furthermore, OCR models facilitate generalization to out-of-distribution tasks such as changing the number of objects or the appearance of the objects in the scene.","object-centric representation, reinforcement learning" Face reconstruction from facial templates by learning latent space of a generator network,https://openreview.net/forum?id=j1HyTEWHTT,https://openreview.net/pdf?id=j1HyTEWHTT,,"Face recognition systems are increasingly deployed in different applications. In these systems, a feature vector (also called a facial embedding or template) is typically extracted from each face image and is stored in the system's database during the enrollment stage, which is later used for comparison during the recognition stage. In this paper, we focus on the template inversion attack against face recognition systems and propose a new method to reconstruct face images from facial templates. Within a generative adversarial network (GAN)-based framework, we learn a mapping from facial templates to the intermediate latent space of a pre-trained face generation network, from which we can generate high-resolution realistic reconstructed face images. We show that our proposed method can be applied in whitebox and blackbox attacks against face recognition systems. Furthermore, we evaluate the transferability of our attack when the adversary uses the reconstructed face image to impersonate the underlying subject in an attack against another face recognition system. Considering the adversary's knowledge and the target face recognition system, we define five different attacks and evaluate the vulnerability of state-of-the-art face recognition systems. Our experiments show that our proposed method achieves high attack success rates in whitebox and blackbox scenarios. Furthermore, the reconstructed face images are transferable and can be used to enter target face recognition systems with a different feature extractor model. ", Mole-BERT: Rethinking Pre-training Graph Neural Networks for Molecules,https://openreview.net/forum?id=jevY-DtiZTR,https://openreview.net/pdf?id=jevY-DtiZTR,We explain the negative transfer in molecular graph pre-training and develop two novel pre-training strategies to alleviate this issue.,"Recent years have witnessed the prosperity of pre-training graph neural networks (GNNs) for molecules. Typically, following the Masked Language Modeling (MLM) task of BERT~\citep{devlin2019bert}, \cite{hu2020strategies} first randomly mask the atom types and then pre-train the GNNs to predict them. However, unlike MLM, this pre-training task, named AttrMask, is too simple to learn informative molecular representations due to the extremely small and unbalanced atom vocabulary.
As a remedy, we adopt the encoder of a variant of VQ-VAE~\citep{van2017neural} as a context-aware tokenizer to encode atoms as meaningful discrete values, which can enlarge the atom vocabulary size and mitigate the quantitative divergence between dominant atoms (e.g., carbon) and rare atoms (e.g., phosphorus). With the enlarged atom vocabulary, we propose a novel node-level pre-training task, dubbed Masked Atoms Modeling (\textbf{MAM}), to randomly mask the discrete values and pre-train GNNs to predict them. MAM mitigates the negative transfer issue of AttrMask and can be combined with various pre-training tasks to advance their performance. Furthermore, for graph-level pre-training, we propose triplet masked contrastive learning (\textbf{TMCL}) to model varying degrees of semantic similarity between molecules, which is especially effective for molecule retrieval. MAM and TMCL constitute a novel pre-training framework, \textbf{Mole-BERT}, which can match or outperform state-of-the-art methods that require expensive domain knowledge as guidance. The code, the tokenizer, and the pre-trained models will be released. ",graph neural networks "A sparse, fast, and stable representation for multiparameter topological data analysis",https://openreview.net/forum?id=5T80c_5NSbV,https://openreview.net/pdf?id=5T80c_5NSbV,"In this article, we provide a general framework for representing multiparameter persistent homology with stability guarantees.","Topological data analysis (TDA) is a new area of geometric data analysis that focuses on using invariants from algebraic topology to provide multiscale shape descriptors for point clouds. One of the most important shape descriptors is persistent homology, which studies the topological variations as a filtration parameter changes; a typical parameter is the feature scale. For many data sets, it is useful to consider varying multiple filtration parameters at once, for example scale and density. While the theoretical properties of one-parameter persistent homology are well understood, less is known about the multiparameter case. Of particular interest is the problem of representing multiparameter persistent homology by elements of a vector space for integration with traditional machine learning. Existing approaches to this problem either ignore most of the multiparameter information to reduce to the one-parameter case or are heuristic and potentially unstable in the face of noise. In this article, we introduce a general representation framework for multiparameter persistent homology that encompasses previous approaches. We establish theoretical stability guarantees under this framework as well as efficient algorithms for practical computation, making this framework an applicable and versatile tool for TDA practitioners. We validate our stability results and algorithms with numerical experiments that demonstrate statistical convergence, prediction accuracy, and fast running times on several real data sets. ","Topological Data Analysis, Algebraic Topology, Persistent Homology, Kernel Methods" What's in a name? The Influence of Personal Names on Spatial Reasoning in BLOOM Large Language Models,https://openreview.net/forum?id=hZ2H2Ps5dp6,https://openreview.net/pdf?id=hZ2H2Ps5dp6,BLOOM models are susceptible to undesirable variations in reasoning ability depending on the choice of personal names even though the reasoning task does not depend on the choice of names.,"Large language models have been shown to exhibit reasoning capability.
But the ability of these models to truly comprehend the reasoning task is not yet clear. An ideal model capable of reasoning would not be affected by the names of the entities over which the relations are defined. In this paper, we consider an algorithmically generated spatial reasoning task over the names of persons. We show that the choice of names has a significant impact on the reasoning accuracy of BLOOM large language models. Using popular names from different countries of the world, we show that BLOOM large language models are susceptible to undesirable variations in reasoning ability even though the underlying logical reasoning challenge does not depend on these names. We further identify that the conditional log probability scores characterizing the uncertainty in prediction produced by BLOOM models are not well-calibrated and cannot be used to detect such reasoning errors. We then suggest a new approach based on model self-explanations and iterative model introspection that performs better than BLOOM conditional log probability scores in detecting such errors and may help alleviate the bias exhibited by these models.","BLOOM, Bias, Large Language Model" Contrastive Representation Learning for Multi-scale Spatial Scenes,https://openreview.net/forum?id=dXmWWc7GHVU,https://openreview.net/pdf?id=dXmWWc7GHVU,,"Spatial scenes, which are composed of spatial objects and their spatial relations, are the basis of geographic information retrieval, spatial cognition, and spatial search. Despite the wide usage of spatial scenes, representation learning on spatial scenes that contain complex compositions of spatial objects remains a challenge, since the spatial data types of geographic objects (e.g., points, polylines, and polygons) and the geographical scales vary across different spatial scenes. Inspired by recently proposed multi-scale location encoding models such as Space2Vec, we propose a multi-scale spatial scene encoding model called Scene2Vec to solve these representational challenges. In Scene2Vec, a location encoder is used to model the spatial relationships among spatial objects and a feature encoder is used for objects' semantic feature encoding. A scene encoder is developed to integrate the representations of spatial objects into a single scene embedding. Moreover, we propose a spatial scene augmentation method to sample additional points based on the shapes of polyline/polygon-based spatial objects in all scales of spatial scenes. The whole model is trained in a self-supervised manner with a contrastive loss. We conduct experiments on real-world datasets for the spatial scene retrieval task 1) purely based on points, e.g., points of interest (POIs), and 2) based on multi-structured spatial objects. Results show that Scene2Vec outperforms well-established encoding methods such as Space2Vec and multi-layer perceptrons due to the advantages of the integrated multi-scale representations and the proposed spatial scene augmentation method. Moreover, detailed analysis shows that Scene2Vec has the ability to generate representations of all three types of spatial objects in a multi-scale manner.", Improving Protein Interaction Prediction using Pretrained Structure Embedding,https://openreview.net/forum?id=KNSRDB-clPX,https://openreview.net/pdf?id=KNSRDB-clPX,,"The prediction of protein-protein interactions (PPIs) is a critical problem because the knowledge of PPIs unravels cellular behavior and functionality.
So far, most previous work on PPI prediction has mainly focused on sequence and network information and ignored the structural information of protein physical binding. We design a novel method, called xxx, which can leverage pretrained structure embeddings and can be transferred to new PPI predictions. Experimental results on PPI prediction show that our pretrained structure embedding leads to significant improvement compared to sequence- and network-based methods. Furthermore, we show that embeddings pretrained on PPIs from different species can be transferred to improve the prediction for human proteins. ","pretraining, protein, PPI" Batch Normalization and Bounded Activation Functions,https://openreview.net/forum?id=FLr9RRqbwB-,https://openreview.net/pdf?id=FLr9RRqbwB-,"With bounded activation functions, using batch normalization after activation functions is better because of asymmetric saturation and sparsity. ","Since Batch Normalization was proposed, it has been commonly located in front of activation functions, as proposed by the original paper. Swapping the order, i.e., using Batch Normalization after activation functions, has also been attempted, but it is generally not much different from the conventional order when ReLU is used. However, in the case of bounded activation functions like Tanh, we discovered that the swapped order achieves considerably better performance on various benchmarks and architectures than the conventional order. We report this remarkable phenomenon and closely examine what contributes to this performance improvement in this paper. One noteworthy thing about swapped models is the extreme saturation of activation values, which is usually considered harmful. Looking at the output distribution of individual activation functions, we found that many of them are highly asymmetrically saturated. The experiments inducing different degrees of asymmetric saturation support the hypothesis that asymmetric saturation helps improve performance. In addition, we found that Batch Normalization after bounded activation functions has another important effect: it relocates the asymmetrically saturated output of activation functions near zero. This enables the swapped model to have higher sparsity, further improving performance. Extensive experiments with Tanh, LeCun Tanh, and Softsign show that the swapped models achieve improved performance with a high degree of asymmetric saturation. ","Batch Normalization, Activation Functions, Saturation, Sparsity" Versatile Energy-Based Models for High Energy Physics,https://openreview.net/forum?id=dC31wEs-hsV,https://openreview.net/pdf?id=dC31wEs-hsV,,"Energy-Based Models (EBMs) have the natural advantage of flexibility in the form of the energy function. Recently, EBMs have achieved great success in modeling high-dimensional data in computer vision and natural language processing. In accordance with these signs of progress, we build a versatile energy-based model for High Energy Physics events at the Large Hadron Collider. This framework builds on a powerful generative model and describes higher-order inter-particle interactions. It suits different encoding architectures and decomposes clearly. As for applications, it can serve as a powerful parameterized event generator, a generic anomalous signal detector, and an augmented event classifier. 
","Generative modeling, Energy-based models, Out-of-distribution detection" MEDOE: A Multi-Expert Decoder and Output Ensemble Framework for Long-tailed Semantic Segmentation,https://openreview.net/forum?id=tcHwiu6CJ_B,https://openreview.net/pdf?id=tcHwiu6CJ_B,We proposed MEDOE framework to address the long-tailed distribution in semantic segmentation,"Long-tailed distribution of semantic categories, which has been often ignored in conventional methods, causes unsatisfactory performance in semantic segmentation on tail categories. In this paper, we focus on the problem of long-tailed semantic segmentation. Although some long-tailed recognition methods (e.g., re-sampling/re-weighting) have been proposed in other problems, they are likely to compromise crucial contextual information in semantic segmentation. Therefore, these methods are hardly adaptable to the problem of long-tailed semantic segmentation. To address this problem, we propose a novel method, named MEDOE, by ensembling and grouping contextual information. Specifically, our MEDOE is a two-sage framework comprising a multi-expert decoder (MED) and a multi-expert output ensemble (MOE). The MED includes several ``experts"", each of which takes as input the dataset masked according to the specific categories based on frequency distribution and generates contextual information self-adaptively for classification. The MOE then ensembles the experts' outputs with learnable decision weights. As a model-agnostic framework, MEDOE can be flexibly and efficiently coupled with various popular deep neural networks (e.g., Deeplabv3+, OCRNet, and PSPNet) to improve the performance in long-tailed semantic segmentation. Experimental results show that the proposed framework outperforms the current methods on both Cityscapes and ADE20K datasets by up to 2\% in mIoU and 6\% in mAcc.","semantic segmentation, long-tailed distribution" Concept-level Debugging of Part-Prototype Networks,https://openreview.net/forum?id=oiwXWPDTyNk,https://openreview.net/pdf?id=oiwXWPDTyNk,A novel and human-friendly concept-level debugger for part-prototype networks.,"Part-prototype Networks (ProtoPNets) are concept-based classifiers designed to achieve the same performance as black-box models without compromising transparency. ProtoPNets compute predictions based on similarity to class-specific part-prototypes learned to recognize parts of training examples, making it easy to faithfully determine what examples are responsible for any target prediction and why. However, like other models, they are prone to picking up confounders and shortcuts from the data, thus suffering from compromised prediction accuracy and limited generalization. We propose ProtoPDebug, an effective concept-level debugger for ProtoPNets in which a human supervisor, guided by the model’s explanations, supplies feedback in the form of what part-prototypes must be forgotten or kept, and the model is fine-tuned to align with this supervision. Our experimental evaluation shows that ProtoPDebug outperforms state-of-the-art debuggers for a fraction of the annotation cost. An online experiment with laypeople confirms the simplicity of the feedback requested to the users and the effectiveness of the collected feedback for learning confounder-free part-prototypes. 
ProtoPDebug is a promising tool for trustworthy interactive learning in critical applications, as suggested by a preliminary evaluation on a medical decision-making task.","explainability, debugging, self-explainable networks, part-prototype networks, concept-based models" Geometrically regularized autoencoders for non-Euclidean data,https://openreview.net/forum?id=_q7A0m3vXH0,https://openreview.net/pdf?id=_q7A0m3vXH0,We propose geometrically regularized autoencoders for non-Euclidean data and discuss their various use cases.,"Regularization is almost {\it de rigueur} when designing autoencoders that are sparse and robust to noise. Given the recent surge of interest in machine learning problems involving non-Euclidean data, in this paper we address the regularization of autoencoders on curved spaces. We show that by ignoring the underlying geometry of the data and applying standard vector space regularization techniques, autoencoder performance can be severely degraded, or worse, training can fail to converge. Assuming that both the data space and latent space can be modeled as Riemannian manifolds, we show how to construct regularization terms in a coordinate-invariant way, and develop geometric generalizations of the denoising autoencoder and reconstruction contractive autoencoder such that the essential properties that enable the estimation of the derivative of the log-probability density are preserved. Drawing upon various non-Euclidean data sets, we show that our geometric autoencoder regularization techniques can have important performance advantages over vector-space methods while avoiding other breakdowns that can result from failing to account for the underlying geometry.","autoencoders, Riemannian geometry, non-Euclidean data, regularization, score estimation" Model-based Unknown Input Estimation via Partially Observable Markov Decision Processes,https://openreview.net/forum?id=FmpRQpQLs5J,https://openreview.net/pdf?id=FmpRQpQLs5J,,"In the context of condition monitoring for structures and industrial assets, the estimation of unknown inputs, usually referring to acting loads, is of salient importance for guaranteeing safe and performant engineered systems. In this work, we propose a novel method for estimating unknown inputs from measured outputs, particularly for the case of dynamical systems with known or learned dynamics. The objective is to search for those system inputs that will reproduce the actual measured outputs, which can be reformulated as a Partially Observable Markov Decision Process (POMDP) problem and solved with well-established planning algorithms for POMDPs. The cross-entropy method is adopted in this paper for solving the POMDP due to its efficiency and robustness. 
The proposed method is demonstrated using simulated dynamical systems for structures with known dynamics, as well as a real wind turbine with learned dynamics, which are inferred via a Replay Overshooting (RO) scheme, a deep learning-based method for learning stochastic dynamics.","unknown input estimation, partially observable markov decision process, model-based reinforcement learning, model predictive control, cross-entropy method, dynamics modeling" TransFool: An Adversarial Attack against Neural Machine Translation Models,https://openreview.net/forum?id=P63GxgD7LIl,https://openreview.net/pdf?id=P63GxgD7LIl,"We propose TransFool to build adversarial attacks against neural machine translation systems, which are fluent sentences and semantically similar to the original sentence, but highly degrade the translation quality. ","Deep neural networks have been shown to be vulnerable to small perturbations of their inputs known as adversarial attacks. In this paper, we consider the particular task of Neural Machine Translation (NMT), where security is often critical. We investigate the vulnerability of NMT models to adversarial attacks and propose a new attack algorithm called TransFool. It builds on a multi-term optimization problem and a gradient projection step to compute adversarial examples that fool NMT models. By integrating the embedding representation of a language model in the proposed attack, we generate fluent adversarial examples in the source language that maintain a high level of semantic similarity with the clean samples and render the attack largely undetectable. Experimental results demonstrate that, for multiple translation tasks and different NMT architectures, our white-box attack can severely degrade the translation quality for more than 60% of the sentences while the semantic similarity between the original sentence and the adversarial example stays very high. Moreover, we show that the proposed attack is transferable to unknown target models and can fool those quite easily. Finally, our method leads to improvement in terms of success rate, semantic similarity, and fluency compared to the existing attack strategies both in white-box and black-box settings. Hence, TransFool permits better characterization of the vulnerability of NMT systems and outlines the necessity to design strong defense mechanisms and more robust NMT systems for real-life applications.","Adversarial attack, deep neural network, language model, natural language processing, neural machine translation, robustness." Protein Sequence Design in a Latent Space via Model-based Reinforcement Learning,https://openreview.net/forum?id=OhjGzRE5N6o,https://openreview.net/pdf?id=OhjGzRE5N6o,This study investigates why many model-based biological sequence design methods produce results that empirically fail and proposes a novel optimization process that can efficiently traverse a latent representation space instead of the sequence space.,"Proteins are complex molecules responsible for different functions in the human body. Enhancing the functionality of a protein and/or cellular fitness can significantly impact various industries. However, their optimization remains challenging, and sequences generated by data-driven methods often fail in wet lab experiments. This study investigates the limitations of existing model-based sequence design methods and presents a novel optimization framework that can efficiently traverse the latent representation space instead of the protein sequence space.
Our framework generates proteins with higher functionality and cellular fitness by modeling the sequence design task as a Markov decision process and applying model-based reinforcement learning. We discuss the results in a comprehensive evaluation of two distinct proteins, GFP and His3, along with the predicted structure of optimized sequences using deep learning-based structure prediction.","Biological sequence design, Model-based reinforcement learning, Protein design, Representation learning" Breaking Large Language Model-based Code Generation,https://openreview.net/forum?id=IlkQffBxiC7,https://openreview.net/pdf?id=IlkQffBxiC7,"We present BreaC, a novel method for breaking large language model-based code generators such that they excessively generate erroneous code.","We propose BreaC, a new method for attacking large language models (LLMs) to excessively generate erroneous code. BreaC works by training a class-conditional language model (CCLM) that conditions code generation on a binary attribute specifying whether the output code should contain errors. The CCLM is not only able to generate erroneous programs but can also control other, much larger LLMs to do so without access to their weights. The training of the CCLM leverages unlikelihood training, as well as reinforcement learning that treats the two generation branches of the CCLM as adversaries. We instantiate BreaC on the task of generating code with compilation and parsing errors. Our extensive evaluation demonstrates that BreaC is effective in both adversarial and benign scenarios. For the adversarial scenario, BreaC greatly reduces the compilation rate of various LLMs while maintaining the perplexity of generated programs. For the benign scenario, BreaC is able to produce realistic erroneous programs from correct programs, enabling one to construct parallel training datasets. We demonstrate the high utility of these datasets by training neural bug fixers that significantly surpass the state-of-the-art.","large language models, code generation, controlled generation, attacks, reliability, reinforcement learning" The GANfather: Controllable generation of malicious activity to expose detection weaknesses and improve defence systems.,https://openreview.net/forum?id=9Y0P3YoERSy,https://openreview.net/pdf?id=9Y0P3YoERSy,,"Criminal activities are typically adversarial in nature, where an attacker and a defence system are constantly adapting to each other's behaviour. If the defence systems are helped by automated detection methods, then those methods need to be updated frequently. In practice, this means that the defence systems are always one step behind the attackers. For example, in anti-money laundering systems, new labels representing suspicious activity are frequently delayed by weeks or months and some money laundering activity may never be found, leading to detection systems that are inaccurate and resulting in an estimated undetected €0.7-3 trillion being laundered annually. To tackle the problem of missing or delayed labels in adversarial settings, we propose The GANfather, an adversarial and label-free method to both (1) generate a variety of meaningful attacks, as guided by a custom, user-defined objective function; and (2) train a defence system to detect such attacks. Optionally, we can ensure that the generated attacks escape an existing detection system, revealing current weaknesses which the new defence system actively corrects. 
Our method is inspired by generative adversarial networks (GANs), but unlike GANs we nudge our generator to produce out-of-distribution data using a loss function that characterises criminal activity. Importantly, our method does not require any labelled examples. We test our framework in two real-world use-cases, namely injection attacks in recommendation systems and anti-money laundering. In the former, we show how an injection attack with a limited number of generated fake profiles is sufficient to successfully recommend an item to a large number of users. These generated injection attacks are more effective in recommending the target item than naive ‘bombing’ strategies and harder to detect. In the latter, the generated attacks are able to simulate money laundering and move cumulative amounts close to 250 thousand dollars through a network of accounts without being detected by existing systems. We also show how we can train a new defence system that captures all these synthetic attacks, potentially saving millions of dollars in detected criminal activity. Our method is generic and applicable in a variety of adversarial domains, exposing current liabilities with the generated data and strengthening the defence systems against current and future malicious attacks.", Proximal Validation Protocol,https://openreview.net/forum?id=HlRfoQDDj-V,https://openreview.net/pdf?id=HlRfoQDDj-V,,"Modern machine learning algorithms are generally built upon a train/validation/test split protocol. In particular, with the absence of an accessible testing set in real-world ML development, how to split out a validation set becomes crucial for reliable model evaluation, selection, etc. Concretely, under a randomized splitting setup, the split ratio of the validation set generally acts as a vital meta-parameter; that is, picking more data for validation costs model performance due to less training data, and vice versa. Unfortunately, this implies a vexing trade-off between performance enhancement and trustworthy model evaluation. However, to date, research on this problem remains scarce. We reason this could be due to a workflow gap between academia and ML production, which we may attribute to a form of technical debt in ML. In this article, we propose a novel scheme --- dubbed Proximal Validation Protocol (PVP) --- which is targeted to resolve this problem of validation set construction. Core to PVP is to assemble a \emph{proximal set} as a substitute for the traditional validation set while avoiding wasting valuable data that the training procedure could use. The construction of the proximal validation set is established with dense data augmentation followed by a novel distributional-consistent sampling algorithm. With extensive empirical findings, we show that PVP works (much) better than all the other existing validation protocols on three data modalities (images, text, and tabular data), demonstrating its feasibility towards ML production.", A Message Passing Perspective on Learning Dynamics of Contrastive Learning,https://openreview.net/forum?id=VBTJqqWjxMv,https://openreview.net/pdf?id=VBTJqqWjxMv,,"In recent years, contrastive learning has achieved impressive results on self-supervised visual representation learning, but a rigorous understanding of its learning dynamics is still lacking. In this paper, we show that if we cast a contrastive objective equivalently into the function space, then its learning dynamics admits an interpretable form.
Specifically, we show that its gradient descent corresponds to a specific message passing scheme on the corresponding augmentation graph. Based on this perspective, we theoretically characterize how contrastive learning gradually learns discriminative features with the alignment update and the uniformity update. Meanwhile, this perspective also establishes an intriguing connection between contrastive learning and Message Passing Graph Neural Networks (MP-GNNs). This connection not only provides a unified understanding of many techniques independently developed in each community, but also enables us to borrow techniques from MP-GNNs to design new contrastive learning variants, such as graph attention, graph rewiring, jumping knowledge techniques, etc. We believe that our message passing perspective not only provides a new theoretical understanding of contrastive learning dynamics, but also bridges the two seemingly independent areas together, which could inspire more interleaving studies to benefit from each other. ", Farsighter: Efficient Multi-step Exploration for Deep Reinforcement Learning,https://openreview.net/forum?id=yBOlJNIX0jN,https://openreview.net/pdf?id=yBOlJNIX0jN,,"Exploration in deep reinforcement learning (RL), especially uncertainty-based exploration, plays a key role in improving sample efficiency and boosting total reward. Uncertainty-based exploration methods often measure the uncertainty (variance) of the value function; however, existing exploration strategies either only consider the uncertain impact of the next ``one-step'' or propagate the uncertainty for all the remaining steps in an episode. Neither approach can explicitly control the bias-variance trade-off of the value function. In this paper, we propose Farsighter, an explicit multi-step uncertainty exploration framework in DRL. Specifically, Farsighter considers the uncertainty of exactly k future steps and can adaptively adjust k. In practice, we learn a Bayesian posterior over the Q-function to approximate the uncertainty in each step. In model-free cases, we recursively deploy Thompson sampling on the learned posterior distribution for k steps, and in model-based cases, we solve a joint optimization problem of higher dimension for a tree-based model. Our method can work on general tasks with high/low-dimensional states, discrete/continuous actions, and sparse/dense rewards. Empirical evaluations show that Farsighter outperforms SOTA exploration methods on a wide range of Atari games, robotic manipulation tasks, and general RL tasks.", Help Me Explore: Combining Autotelic and Social Learning via Active Goal Queries,https://openreview.net/forum?id=iEE0MadUaZh,https://openreview.net/pdf?id=iEE0MadUaZh,We propose to blend individual and socially-guided skill learning in multi-goal environments with a new interaction protocol named Help Me Explore in which the agent chooses to seek external help for exploration or autonomously learn to master goals.,"Most approaches to open-ended skill learning train a single agent in a purely sensorimotor environment. But because no human child learns everything on their own, we argue that sociality will be a key component of open-ended learning systems. This paper enables learning agents to blend individual and socially-guided skill learning through a new interaction protocol named Help Me Explore (HME). 
In social episodes triggered at the agent's demand, a social partner suggests a goal at the frontier of the agent's capabilities and, when the goal is reached, follows up with a new adjacent goal just beyond. In individual episodes, the agent practices skills autonomously by pursuing goals it already discovered through either its own experience or social suggestions. The idea of augmenting an individual goal exploration with social goal suggestions is simple, general and powerful. We demonstrate its efficiency on two notoriously hard exploration benchmarks: continuous mazes and a 5-block robotic manipulation task. With minimal social interventions, an HME-agent outperforms the purely social agent deprived of its autonomy, and the purely individual agent which fails to solve hard exploration problems.","Exploration, Teachability, Autotelism, Social Interaction, Curriculum Learning" AUTOMATIC CURRICULUM FOR UNSUPERVISED REINFORCEMENT LEARNING,https://openreview.net/forum?id=dqZ_GFn7Nuh,https://openreview.net/pdf?id=dqZ_GFn7Nuh,,"Recent unsupervised reinforcement learning (URL) methods can learn meaningful skills without task rewards via carefully designed training objectives. However, most existing works lack quantitative evaluation metrics for URL and mainly rely on visualizations of trajectories to compare the performance. Moreover, each URL method only focuses on a single training objective, which can hinder further learning progress and the development of new skills. To bridge these gaps, we first propose multiple evaluation metrics for URL that can cover different preferred properties. We show that balancing these metrics leads to what a “good” trajectory visualization embodies. Next, we use these metrics to develop an automatic curriculum that can change the URL objective across different learning stages in order to improve and balance all metrics. Specifically, we apply a non-stationary multi-armed bandit algorithm to select an existing URL objective for each episode according to the metrics evaluated in previous episodes. Extensive experiments in different environments demonstrate the advantages of our method in achieving promising and balanced performance over all URL metrics.", Exploiting Personalized Invariance for Better Out-of-distribution Generalization in Federated Learning,https://openreview.net/forum?id=krcFYSfcbhx,https://openreview.net/pdf?id=krcFYSfcbhx,We are the first to consider the challenging out-of-distribution generalization problem under Non-IID federated learning setting and propose the novel concept of personalized invariance and method to handle it.,"Recently, data heterogeneity among the training datasets on the local clients (a.k.a., Non-IID data) has attracted intense interest in Federated Learning (FL), and many personalized federated learning methods have been proposed to handle it. However, the distribution shift between the training dataset and testing dataset on each client is never considered in FL, despite being common in real-world scenarios. We notice that the distribution shift (a.k.a., out-of-distribution generalization) problem under the Non-IID federated setting becomes rather challenging due to the entanglement between personalized and spurious information. To tackle the above problem, we elaborate a general dual-regularized learning framework to explore 'personalized invariance', in contrast to existing personalized federated learning methods, which are regularized by a single baseline (usually the global model). 
Utilizing the personalized invariant features, the developed personalized models can efficiently exploit the most relevant information and meanwhile eliminate spurious information so as to enhance the out-of-distribution generalization performance for each client. Both the theoretical analysis of convergence and OOD generalization performance and the results of extensive experiments demonstrate the superiority of our method over existing federated learning and invariant learning methods in diverse out-of-distribution and Non-IID data cases.","Federated Learning, Out-of-distribution Generalization, Non-IID, Personalization, Invariant Learning" Filtered Semi-Markov CRF,https://openreview.net/forum?id=vNrmEgfGIg3,https://openreview.net/pdf?id=vNrmEgfGIg3,,"Semi-Markov CRF \citep{semicrf} has been proposed as an alternative to the traditional Linear Chain CRF \citep{crf} for text segmentation tasks such as Named Entity Recognition. In contrast to CRF, which treats text segmentation as token-level prediction, Semi-CRF considers spans as the task's basic unit, which makes it more expressive. However, Semi-CRF has two major drawbacks: (1) it has quadratic complexity over sequence length as it operates on every span of the input sequence, and (2) empirically, it performs worse than classical CRF for sequence labeling tasks such as NER. In our work, we propose Filtered Semi-Markov CRF, a Semi-CRF variant that addresses the aforementioned issues. Our model extends Semi-CRF by incorporating a filtering step that eliminates irrelevant segments, which helps reduce the complexity and allows us to dramatically reduce the search space. On a variety of NER benchmarks, we find that our approach outperforms both CRF and Semi-CRF models while being significantly faster. We will make our code available to the public. ","Structured prediction, Text segmentation, Named Entity Recognition" Zeroth-Order Optimization with Trajectory-Informed Derivative Estimation,https://openreview.net/forum?id=n1bLgxHW6jW,https://openreview.net/pdf?id=n1bLgxHW6jW,,"Zeroth-order (ZO) optimization, in which the derivative is unavailable, has recently succeeded in many important machine learning applications. Existing algorithms rely on finite difference (FD) methods for derivative estimation and gradient descent (GD)-based approaches for optimization. However, these algorithms suffer from query inefficiency because additional function queries are required for derivative estimation in every GD update, which typically hinders their deployment in applications where every function query is expensive. To this end, we propose a trajectory-informed derivative estimation method which only uses the optimization trajectory (i.e., the history of function queries during optimization) and hence eliminates the need for additional function queries to estimate a derivative. Moreover, based on our derivative estimation, we propose the technique of dynamic virtual updates, which allows us to reliably perform multiple steps of GD updates without reapplying derivative estimation. Based on these two contributions, we introduce the zeroth-order optimization with trajectory-informed derivative estimation (ZoRD) algorithm for query-efficient ZO optimization. 
We theoretically demonstrate that our trajectory-informed derivative estimation and our ZoRD algorithm improve over existing approaches, which is then supported by our real-world experiments on black-box adversarial attacks, non-differentiable metric optimization, and derivative-free reinforcement learning.","zeroth-order optimization, derivative estimation, finite difference" Distance VS. Coordinate: Distance Based Embedding Improves Model Generalization for Routing Problems,https://openreview.net/forum?id=6apN9AQ-3fN,https://openreview.net/pdf?id=6apN9AQ-3fN,"Distance based embedding is a better choice for routing problems, compared to coordinate based embedding.","Routing problems, such as traveling salesman problem (TSP) and vehicle routing problem, are among the most classic research topics in combinatorial optimization and operations research (OR). In recent years, with the rapid development of online service platforms, there has been renewed interest in applying this study to facilitate emerging industrial applications, such as food delivery and logistics services. While OR methods remain the mainstream technique, increasing efforts have been put into exploiting deep learning (DL) models for tackling routing problems. The existing ML methods often consider the embedding of the route point coordinate as a key model input and are capable of delivering competitive performance in synthetic or simplified settings. However, it is empirically noted that this line of work appears to lack the robustness and generalization ability that are crucial for real-world applications. In this paper, we demonstrate that the coordinate can unexpectedly lead to these problems. There are two factors that make the coordinate rather `poisonous' for DL models: i) the definition of distance between route points is far more complex than what coordinates can depict; ii) the coordinate can hardly be sufficiently `traversed' by the training data. To circumvent these limitations, we propose to abandon the coordinate and instead use the relative distance for route point embedding. We show in both the synthetic TSP and a real-world food pickup and delivery route prediction problem that our design can significantly improve the model's generalization ability, and deliver competitive or better performance compared with existing models. ","routing problems, travelling salesman problem, combinatorial optimization, pickup and delivery, embedding" Towards biologically plausible Dreaming and Planning,https://openreview.net/forum?id=9pA3oXBwYh7,https://openreview.net/pdf?id=9pA3oXBwYh7,,"Humans and animals can learn new skills after practicing for a few hours, while current reinforcement learning algorithms require a large amount of data to achieve good performance. Recent model-based approaches show promising results by reducing the number of necessary interactions with the environment to learn a desirable policy. However, these methods require biologically implausible ingredients, such as the detailed storage of older experiences, and long periods of offline learning. The optimal way to learn and exploit world-models is still an open question. Taking inspiration from biology, we suggest that dreaming might be an efficient expedient for using an inner model. We propose a two-module (agent and model) neural network in which ""dreaming"" (living new experiences in a model-based simulated environment) significantly boosts learning. We also explore ""planning"", an online alternative to dreaming, which shows comparable performance. 
Importantly, our model does not require the detailed storage of experiences, and learns the world-model online. This is a key ingredient for biological plausibility and implementability (e.g., in neuromorphic hardware).","Reinforcement Learning, Model based, Biologically Plausible" Mixture of Basis for Interpretable Continual Learning with Distribution Shifts,https://openreview.net/forum?id=5iibKv7Wk8W,https://openreview.net/pdf?id=5iibKv7Wk8W,"We develop a novel continual learning algorithm, Mixture of Basis models (MoB), that constructs a dynamic, task-dependent, mixture of interpretable models that outperforms other continual learning algorithms on several, diverse problem domains.","Continual learning in environments with shifting data distributions is a challenging problem with several real-world applications. In this paper we consider settings in which the data distribution (i.e., task) shifts abruptly and the timing of these shifts is not known. Furthermore, we consider a $\textit{semi-supervised task-agnostic}$ setting in which the learning algorithm has access to both task-segmented and unsegmented data for offline training. We propose a novel approach called $\textit{Mixture of Basis}$ models $\textit{(MoB)}$ for addressing this problem setting. The core idea is to learn a small set of $\textit{basis models}$ and to construct a dynamic, task-dependent mixture of the models to predict for the current task. We also propose a new methodology to detect observations that are out-of-distribution with respect to the existing basis models and to instantiate new models as needed. We develop novel problem domains for regression tasks, evaluate MoB and other continual learning algorithms on these, and show that MoB attains better prediction error in nearly every case while using fewer models than other multiple-model approaches. We analyze latent task representations learned by MoB alongside the tasks themselves, using both qualitative and quantitative measures, to show that the learned latent task representations can be interpretably linked to the structure of the task space.","continual learning, lifelong learning, distribution shift, interpretable learning, semi-supervised learning" Extracting Meaningful Attention on Source Code: An Empirical Study of Developer and Neural Model Code Exploration,https://openreview.net/forum?id=6s5HaPx6ndR,https://openreview.net/pdf?id=6s5HaPx6ndR,We compare how developers and GPT-like language models navigate snippets of source code.,"The high effectiveness of neural models of code, such as OpenAI Codex and AlphaCode, suggests coding capabilities of models that are at least comparable to those of humans. However, previous work has only used these models for their raw completion, ignoring how the model's reasoning, in the form of attention weights, can be used for other downstream tasks. Disregarding the attention weights means discarding a considerable portion of what those models compute when queried. To profit more from the knowledge embedded in these large pre-trained models, this work compares multiple approaches to post-process these valuable attention weights for supporting code exploration. Specifically, we compare the extent to which the transformed attention signal of CodeGen, a large and publicly available pre-trained neural model, agrees with how developers look at and explore code when answering the same sense-making questions about code. 
At the core of our experimental evaluation, we collect, manually annotate, and open-source a novel eye-tracking dataset comprising 25 developers answering sense-making questions on code over 92 sessions. We empirically evaluate five attention-agnostic heuristics and ten attention-based post-processing approaches of the attention signal against our ground truth of developers exploring code, including the novel concept of follow-up attention, which exhibits the highest agreement. Beyond the dataset contribution and the empirical study, we also introduce a novel practical application of the attention signal of pre-trained models with completely analytical solutions, going beyond how neural models’ attention mechanisms have traditionally been used. ","eye-tracking, transformers, self-attention, code exploration, source code, neural models of code" Denoising Differential Privacy in Split Learning,https://openreview.net/forum?id=T6NIgvKRb7b,https://openreview.net/pdf?id=T6NIgvKRb7b,,"Differential Privacy (DP) is applied in split learning to address privacy concerns about data leakage. Previous work combines split neural network (SplitNN) training with DP by adding noise to the intermediate results during the forward pass. Unfortunately, DP noise injection significantly degrades the training accuracy of SplitNN. This paper focuses on improving the training accuracy of DP-protected SplitNNs without sacrificing the privacy guarantee. We propose two denoising techniques, namely scaling and random masking. Our theoretical investigation shows that both of our techniques achieve accurate estimation of the intermediate variables during the forward pass of SplitNN training. Our experiments with real networks demonstrate that our denoising approach allows SplitNN training that can tolerate high levels of DP noise while achieving almost the same accuracy as the non-private (i.e., non-DP protected) baseline. Interestingly, we show that after applying our techniques, the resultant network is more resilient against a state-of-the-art attack, compared to the plain DP-protected baseline.", Neuroevolution is a Competitive Alternative to Reinforcement Learning for Skill Discovery,https://openreview.net/forum?id=6BHlZgyPOZY,https://openreview.net/pdf?id=6BHlZgyPOZY,,"Deep Reinforcement Learning (RL) has emerged as a powerful paradigm for training neural policies to solve complex control tasks. However, these policies tend to be overfit to the exact specifications of the task and environment they were trained on, and thus do not perform well when conditions deviate slightly or when composed hierarchically to solve even more complex tasks. Recent work has shown that training a mixture of policies, as opposed to a single one, that are driven to explore different regions of the state-action space can address this shortcoming by generating a diverse set of behaviors, referred to as skills, that can be collectively used to great effect in adaptation tasks or for hierarchical planning. This is typically realized by including a diversity term - often derived from information theory - in the objective function optimized by RL. However, these approaches often require careful hyperparameter tuning to be effective. In this work, we demonstrate that less widely-used neuroevolution methods, specifically Quality Diversity (QD), are a competitive alternative to information-theory-augmented RL for skill discovery. 
Through an extensive empirical evaluation comparing eight state-of-the-art methods on the basis of (i) metrics directly evaluating the skills' diversity, (ii) the skills' performance on adaptation tasks, and (iii) the skills' performance when used as primitives for hierarchical planning, QD methods are found to provide equal, and sometimes improved, performance whilst being less sensitive to hyperparameters and more scalable. As no single method is found to provide near-optimal performance across all environments, there is a rich scope for further research which we support by proposing future directions and providing optimized open-source implementations.", On Representation Learning in the First Layer of Deep CNNs and the Dynamics of Gradient Descent,https://openreview.net/forum?id=FhYkgzYNMQ7,https://openreview.net/pdf?id=FhYkgzYNMQ7,,"It has previously been reported that the representation that is learned in the first layer of deep CNNs is very different from the initial representation and highly consistent across initialization and architecture. In this work, we quantify this consistency by considering the set of filters as a filter bank and measuring its energy distribution. We find that the energy distribution is remarkably consistent and try to determine the source of this consistency. We show that this consistency cannot be explained by the fact that CNNs learn a representation that is useful for recognition and that CNNs trained with fixed, random filters in the first layer yield comparable recognition performance to full learning. We then show that similar behavior occurs in simple, linear CNNs and obtain an analytical characterization of the energy profile of linear CNNs trained with gradient descent. Our analysis shows that the energy profile is determined by two factors (1) the correlation between the average patch and the class label and (2) an implicit bias arising from the dynamics of gradient descent. Finally, we show that in commonly used image recognition datasets the correlation between the average patch and the class label is very low and it is the implicit bias that best explains the consistency of representations observed in real-world CNNs.", Learning Layered Implicit Model for 3D Avatar Clothing Representation,https://openreview.net/forum?id=FzKeidp3qnB,https://openreview.net/pdf?id=FzKeidp3qnB,"We present a novel 3D cloth representation, i.e., a neural implicit surface model conditioned on a volumetric SMPL prior, to capture realistic clothes from raw scans.","Modeling 3D clothed avatars is a popular topic in the computer graphics and vision area. Due to the complicated nature of realistic garments, the most pressing issue is how to represent 3D cloth shapes efficiently and effectively. A desirable cloth model is expected to preserve high-quality geometric details while establishing essential correspondences between clothes and animation-ready templates. However, so far there is no such 3D cloth representation that can simultaneously satisfy these two requirements. In this work, we thus formulate a novel 3D cloth representation that integrates the neural implicit surface with a statistical body prior. Different from previous methods using explicit cloth primitives conditioned on the SMPL surface, we adopt a two-layer implicit function to capture the coarse and fine levels of cloth displacements, based on a parametric SMPL volume space. Our approach is aware of the underlying statistical minimal body shapes, and is also capable of modeling challenging clothes like skirts. 
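A sketch of how the energy distribution of a first-layer filter bank, as studied above, can be measured in practice (our own construction for illustration; the paper's exact profile definition may differ):

    import numpy as np

    def energy_profile(filters):
        # filters: (num_filters, k, k) first-layer kernels for one input channel
        spec = np.abs(np.fft.fft2(filters)) ** 2   # per-filter power spectra
        energy = spec.sum(axis=0)                  # aggregate over the whole bank
        return energy / energy.sum()               # fraction of energy per frequency bin

    bank = np.random.randn(64, 7, 7)               # e.g., 64 learned 7x7 filters
    profile = energy_profile(bank)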
To evaluate the geometric modeling capacity of our 3D cloth representation, we conduct both qualitative and quantitative experiments on raw scans, which indicate superior performance over existing 3D cloth representations. The effectiveness and flexibility of our 3D cloth representation are further validated in downstream applications, e.g., 3D virtual try-on.","Geometric Representation, 3D Cloth, Human Body, Implicit Surface" Scrunch: Preventing sensitive property inference through privacy-preserving representation learning,https://openreview.net/forum?id=mNk7mgWZcJa,https://openreview.net/pdf?id=mNk7mgWZcJa,"A system, based on the center loss function, for private machine learning data exchange in machine-learning-as-a-service scenarios","Many tasks that are commonly performed by devices attached to the Internet are currently being offloaded to the cloud, using the Machine Learning as a Service (MLaaS) paradigm. While this paradigm is motivated by the reduced capacity of mobile terminals, it also endangers the privacy of the data exchanged over the network. Thus, the data exchanged among parties should be suitably anonymized to prevent possible confidentiality and privacy issues. While many privacy-enhancing algorithms have been proposed in the past, they usually rely on very complex models that hinder their applicability to real-world systems, or they envision overly friendly attacker models. In this paper, we propose a deep learning system that creates anonymized representations for the data, while keeping the accuracy for the targeted MLaaS task high, assuming that the attacker can re-train an adversarial model. Our results show that the proposed algorithm i) is effective yet uses a lighter approach than the state of the art, ii) considers less friendly attacker models, and iii) outperforms the benchmark under different privacy metrics.","privacy, machine learning as a service, center loss" Uniform-in-time propagation of chaos for the mean field gradient Langevin dynamics,https://openreview.net/forum?id=_JScUk9TBUn,https://openreview.net/pdf?id=_JScUk9TBUn,,"The mean-field Langevin dynamics is characterized by a stochastic differential equation that arises from (noisy) gradient descent on an infinite-width two-layer neural network, which can be viewed as an interacting particle system. In this work, we establish a quantitative weak propagation of chaos result for the system, with a finite-particle discretization error of $\mathcal{O}(1/N)$ \textit{uniformly over time}, where $N$ is the width of the neural network. This allows us to directly transfer the optimization guarantee for infinite-width networks to practical finite-width models without excessive overparameterization. On the technical side, our analysis differs from most existing studies on similar mean field dynamics in that we do not require the interaction between particles to be sufficiently weak to obtain a uniform propagation of chaos, because such assumptions may not be satisfied in neural network optimization. Instead, we make use of a logarithmic Sobolev-type condition which can be verified in appropriate regularized risk minimization settings. 
","Neural network optimization, mean-field regime, interacting particle system, propagation of chaos" GM-VAE: Representation Learning with VAE on Gaussian Manifold,https://openreview.net/forum?id=uV1A7jemwS8,https://openreview.net/pdf?id=uV1A7jemwS8,"We propose a Gaussian manifold VAE whose latent space consists of diagonal Gaussians and a stable distribution over the space, by using information geometry. ","We propose a Gaussian manifold variational auto-encoder (GM-VAE) whose latent space consists of a set of diagonal Gaussian distributions. It is known that the set of the diagonal Gaussian distributions with the Fisher information metric forms a product hyperbolic space, which we call a Gaussian manifold. To learn the VAE endowed with the Gaussian manifold, we first propose a pseudo Gaussian manifold normal distribution based on the Kullback-Leibler divergence, a local approximation of the squared Fisher-Rao distance, to define a density over the latent space. With the newly proposed distribution, we introduce geometric transformations at the last layer of the encoder and the first layer of the decoder of the VAE, respectively, to help the transition between the Euclidean and Gaussian manifolds. Through empirical experiments, we show the competitive generalization performance of GM-VAE against other variants of hyperbolic- and Euclidean-VAEs. Our model achieves strong numerical stability, which is a common limitation reported with previous hyperbolic-VAEs.","VAE, Representation learning, Hyperbolic space, Distribution on Riemannian manifold, Information geometry" Improving Adversarial Robustness by Putting More Regularizations on Less Robust Samples,https://openreview.net/forum?id=-SBZ8c356Oc,https://openreview.net/pdf?id=-SBZ8c356Oc,," Adversarial training, which aims to enhance robustness against adversarial attacks, has received much attention because it is easy to generate human-imperceptible perturbations of data to deceive a given deep neural network. In this paper, we propose a new adversarial training algorithm that is theoretically well motivated and empirically superior to other existing algorithms. A novel feature of the proposed algorithm is to apply more regularization to data vulnerable to adversarial attacks than other existing regularization algorithms do. Theoretically, we show that our algorithm can be understood as an algorithm that minimizes a newly derived upper bound of the robust risk. Numerical experiments illustrate that our proposed algorithm improves the generalization (accuracy on clean examples) and robustness (accuracy on adversarial attacks) simultaneously, achieving state-of-the-art performance.","Adversarial Training, Adversarial Attack, Robust Learning" Generalizable Multi-Relational Graph Representation Learning: A Message Intervention Approach,https://openreview.net/forum?id=xOPd5QO_5RT,https://openreview.net/pdf?id=xOPd5QO_5RT,A message intervention method for learning generalizable multi-relational graph representations,"With the edges associated with labels and directions, the so-called multi-relational graph possesses powerful expressiveness, which is beneficial to many applications. However, as the heterogeneity brought by the higher cardinality of edges and relations grows, more trivial relations are taken into account for the downstream task since they are often highly correlated with the target. 
As a result, being forced to fit non-causal relational patterns on the training set, the downstream model, such as a graph neural network (GNN), may suffer from poor generalizability on the testing set, since its inferences are mainly made according to misleading clues. In this paper, under the paradigm of graph convolution, we probe the multi-relational message passing process from the perspective of causality and then propose a Message Intervention method for learning generalizable muLtirElational gRaph representations, coined MILER. In particular, MILER first encodes the vertices and relations into embeddings with relational and directional awareness, then a message diverter is employed to split the original message flow into two flows of interest, i.e., the causal and trivial message flows. Afterward, the message intervention is carried out with the guidance of the backdoor adjustment rule. Extensive experiments on several knowledge graph benchmarks validate the effectiveness as well as the superior generalization ability of MILER.","Multi-Relational Graph, Causal Inference, Representation Learning, Graph Neural Network" Causal Explanations of Structural Causal Models,https://openreview.net/forum?id=Opcegzztjay,https://openreview.net/pdf?id=Opcegzztjay,"As a step towards causal XIL, we propose a solution to the lack of truly causal explanations from existing methods.","In explanatory interactive learning (XIL) the user queries the learner, then the learner explains its answer to the user and finally the loop repeats. XIL is attractive for two reasons: (1) the learner becomes better and (2) the user's trust increases. For both reasons to hold, the learner's explanations must be useful to the user and the user must be allowed to ask useful questions. Ideally, both questions and explanations should be grounded in a causal model, since this avoids spurious fallacies. Ultimately, we seem to seek a causal variant of XIL. We believe the question part, on the user's end, to be solved, since the user's mental model can provide the causal model. But how would the learner provide causal explanations? In this work we show that existing explanation methods are not guaranteed to be causal even when provided with a Structural Causal Model (SCM). Specifically, we use the popular, proclaimed causal explanation method CXPlain to illustrate how the generated explanations leave open the question of truly causal explanations. Thus as a step towards causal XIL, we propose a solution to the lack of causal explanations. We solve this problem by deriving from first principles an explanation method that makes full use of a given SCM, which we refer to as SC$\textbf{E}$ ($\textbf{E}$ standing for explanation). Since SCEs make use of structural information, any causal graph learner can now provide human-readable explanations. We conduct several experiments including a user study with 22 participants to investigate the virtue of SCE as causal explanations of SCMs.","explanatory interactive learning, explainable artificial intelligence, causal explanations, structural causal models, user study" Asynchronous Distributed Bilevel Optimization,https://openreview.net/forum?id=_i0-12XqVJZ,https://openreview.net/pdf?id=_i0-12XqVJZ,,"Bilevel optimization plays an essential role in many machine learning tasks, ranging from hyperparameter optimization to meta-learning. Existing studies on bilevel optimization, however, focus on either centralized or synchronous distributed settings. 
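The backdoor adjustment rule invoked by MILER above has the generic form (a standard causal-inference identity; the paper's instantiation over message flows is not reproduced here): $$P\big(Y \mid do(X=x)\big) = \sum_{z} P\big(Y \mid X=x, Z=z\big)\,P(Z=z),$$ where the confounder $Z$ blocks all backdoor paths from $X$ to $Y$.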
The centralized bilevel optimization approaches require collecting massive amounts of data at a single server, which inevitably incurs significant communication expenses and may give rise to data privacy risks. Synchronous distributed bilevel optimization algorithms, on the other hand, often face the straggler problem and will immediately stop working if a few workers fail to respond. As a remedy, we propose the Asynchronous Distributed Bilevel Optimization (ADBO) algorithm. The proposed ADBO can tackle bilevel optimization problems with both nonconvex upper-level and lower-level objective functions, and its convergence is theoretically guaranteed. Furthermore, it is revealed through theoretical analysis that the iteration complexity of ADBO to obtain an $\epsilon$-stationary point is upper bounded by $\mathcal{O}(1/\epsilon^2)$. Thorough empirical studies on public datasets have been conducted to elucidate the effectiveness and efficiency of the proposed ADBO.", Multi-Agent Reinforcement Learning with Shared Resources for Inventory Management,https://openreview.net/forum?id=PfPrnKDtvIG,https://openreview.net/pdf?id=PfPrnKDtvIG,We propose a scalable and effective method to control a large number of agents for inventory management.,"In this paper, we consider the inventory management (IM) problem where we need to make replenishment decisions for a large number of stock keeping units (SKUs) to balance their supply and demand. In our setting, the constraint on the shared resources (such as the inventory capacity) couples the otherwise independent control for each SKU. We formulate the problem with this structure as a Shared-Resource Stochastic Game (SRSG) and propose an efficient algorithm called Context-aware Decentralized PPO (CD-PPO). Through extensive experiments, we demonstrate that CD-PPO can accelerate the learning procedure compared with standard MARL algorithms.","Multi-Agent Reinforcement Learning, Inventory Management" Confidence-Based Feature Imputation for Graphs with Partially Known Features,https://openreview.net/forum?id=YPKBIILy-Kt,https://openreview.net/pdf?id=YPKBIILy-Kt,"For graphs with missing features, we define a novel concept of confidence and propose a pseudo-confidence-based feature imputation (PCFI) scheme.","This paper investigates a missing feature imputation problem for graph learning tasks. Several methods have previously addressed learning tasks on graphs with missing features. However, in cases of high rates of missing features, they were unable to avoid significant performance degradation. To overcome this limitation, we introduce a novel concept of channel-wise confidence in a node feature, which is assigned to each imputed channel feature of a node to reflect the certainty of the imputation. We then design pseudo-confidence using the channel-wise shortest path distance between a missing-feature node and its nearest known-feature node to replace unavailable true confidence in an actual learning process. Based on the pseudo-confidence, we propose a novel feature imputation scheme that performs channel-wise inter-node diffusion and node-wise inter-channel propagation. 
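A sketch of distance-decayed pseudo-confidence as we read the PCFI description above; the exponential decay form, the rate, and all names are our assumptions:

    from collections import deque

    def pseudo_confidence(adj, known, rho=0.5):
        # adj: adjacency list {node: [neighbors]}; known: nodes whose feature
        # (for one channel) is observed; confidence decays as rho**d, with d the
        # shortest-path distance to the nearest known-feature node
        dist = {v: 0 for v in known}
        queue = deque(known)
        while queue:                          # multi-source BFS
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return {v: rho ** d for v, d in dist.items()}

    adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    print(pseudo_confidence(adj, known={0}))  # {0: 1.0, 1: 0.5, 2: 0.25, 3: 0.125}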
The scheme remains effective even at an exceedingly high missing rate (e.g., 99.5\%) and achieves state-of-the-art accuracy for both semi-supervised node classification and link prediction on various datasets containing a high rate of missing features.","Graph neural networks, Graphs, Missing features" Explicitly Maintaining Diverse Playing Styles in Self-Play,https://openreview.net/forum?id=Gkbxt7ThQxU,https://openreview.net/pdf?id=Gkbxt7ThQxU,,"Self-play has proven to be an effective training schema to obtain a high-level agent in complex games through iteratively playing against an opponent from its historical versions. However, its training process may prevent it from generating a well-generalised policy since the trained agent rarely encounters diversely-behaving opponents along its own historical path. In this paper, we aim to improve the generalisation of the policy by maintaining a population of agents with diverse playing styles and high skill levels throughout the training process. Specifically, we propose a bi-objective optimisation model to simultaneously optimise the agents' skill level and playing style. A feature of this model is that we do not regard the skill level and playing style as two objectives to maximise directly since they are not equally important (i.e., agents with diverse playing styles but low skill levels are meaningless). Instead, we create a meta bi-objective model to make high-level agents with diverse playing styles more likely to be incomparable (i.e., Pareto non-dominated), so that they play against each other throughout the training process. We then present an evolutionary algorithm working with the proposed model. Experiments in the classic table tennis game Pong and the commercial role-playing game Justice Online show that our algorithm can learn a well-generalised policy and at the same time is able to provide a set of high-level policies with various playing styles.","Reinforcement learning, evolutionary algorithm, self-play, diverse playing styles, high skill levels" Toward Learning Geometric Eigen-Lengths Crucial for Robotic Fitting Tasks,https://openreview.net/forum?id=9gRIOMVLCiH,https://openreview.net/pdf?id=9gRIOMVLCiH,We formulate a novel learning problem and explore learning frameworks to discover useful low-dimensional yet sufficient geometric eigen-lengths for fitting tasks.,"Some extremely low-dimensional yet crucial geometric eigen-lengths often determine whether an object can be fitted in the environment or not. For example, the {\em height} of an object is important to measure to check if it can fit between the shelves of a cabinet, while the {\em width} of a couch is crucial when trying to move it through a doorway. Humans have materialized such crucial geometric eigen-lengths in common sense since they are very useful in serving as succinct yet effective, highly interpretable, and universal object representations. However, it remains obscure and underexplored whether learning systems can be equipped with similar capabilities of automatically discovering such key geometric quantities when performing robotic fitting tasks. In this work, we therefore for the first time formulate and propose a novel learning problem on this question and set up a benchmark suite including the tasks, the data, and the evaluation metrics for studying the problem. We explore potential solutions and demonstrate the feasibility of learning such eigen-lengths from simply observing successful and failed fitting trials. 
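A minimal check of the Pareto non-domination relation underlying the meta bi-objective model above (objective names and values are placeholders):

    def dominates(a, b):
        # a, b: tuples of objectives to maximize, e.g. (skill, style_novelty);
        # a dominates b if it is no worse everywhere and strictly better somewhere
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    def pareto_front(pop):
        return [p for p in pop if not any(dominates(q, p) for q in pop if q != p)]

    pop = [(0.9, 0.2), (0.7, 0.8), (0.6, 0.6), (0.9, 0.8)]
    print(pareto_front(pop))  # [(0.9, 0.8)]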
We also attempt geometric grounding for more accurate eigen-length measurement and study the reusability of the learned geometric eigen-lengths across multiple tasks. Our work marks the first exploratory step toward learning crucial geometric eigen-lengths and we hope it can inspire future research in tackling this important yet underexplored problem. ","Visual Representation Learning, Shape Understanding" Text2Model: Model Induction for Zero-shot Generalization Using Task Descriptions,https://openreview.net/forum?id=by_KjUv-7YC,https://openreview.net/pdf?id=by_KjUv-7YC,"Produce a model for a classification task described in text at inference time without training, using equivariant hypernetworks","We study the problem of generating a training-free task-dependent visual classifier from text descriptions without visual samples. This Text-to-Model (T2M) problem is closely related to zero-shot learning, but unlike previous work, a T2M model infers a model tailored to a task, taking into account all classes in the task. We analyze the symmetries of T2M, and characterize the equivariance and invariance properties of corresponding models. In light of these properties, we design an architecture based on hypernetworks that, given a set of new class descriptions, predicts the weights for an object recognition model which classifies images from those zero-shot classes. We demonstrate the benefits of our approach compared to zero-shot learning from text descriptions in image and point-cloud classification using various types of text descriptions: from single words to rich text descriptions.","zero-shot learning, vision and language" LiftedCL: Lifting Contrastive Learning for Human-Centric Perception,https://openreview.net/forum?id=WHlt5tLz12T,https://openreview.net/pdf?id=WHlt5tLz12T,"We present LiftedCL for self-supervised learning, which improves contrastive learning by leveraging 3D human structure information to learn 3D-aware human-centric representations.","Human-centric perception targets the understanding of human body pose, shape, and segmentation. Pre-training the model on large-scale datasets and fine-tuning it on specific tasks has become a well-established paradigm in human-centric perception. Recently, self-supervised learning methods have re-investigated contrastive learning to achieve superior performance on various downstream tasks. When handling human-centric perception, there still remains untapped potential, since 3D human structure information is neglected during the task-agnostic pre-training. In this paper, we propose Lifting Contrastive Learning (LiftedCL) to obtain 3D-aware human-centric representations which absorb 3D human structure information. In particular, to induce the learning process, a set of 3D skeletons is randomly sampled by resorting to a 3D human kinematic prior. With this set of generic 3D samples, 3D human structure information can be learned into 3D-aware representations through adversarial learning. 
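A toy hypernetwork in the spirit of the T2M architecture above: per-class text embeddings are mapped to per-class classifier weights, so permuting the class descriptions permutes the predicted logits accordingly (all sizes and module choices are our own placeholders):

    import torch
    import torch.nn as nn

    class TinyT2M(nn.Module):
        def __init__(self, txt_dim=32, img_dim=64):
            super().__init__()
            # maps one class-description embedding to one classifier weight vector
            self.hyper = nn.Sequential(nn.Linear(txt_dim, 128), nn.ReLU(),
                                       nn.Linear(128, img_dim))

        def forward(self, class_txt, img_feat):
            # class_txt: (num_classes, txt_dim); img_feat: (batch, img_dim)
            W = self.hyper(class_txt)       # (num_classes, img_dim)
            return img_feat @ W.t()         # (batch, num_classes) logits

    model = TinyT2M()
    logits = model(torch.randn(5, 32), torch.randn(8, 64))  # 5 zero-shot classes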
Empirical results demonstrate that LiftedCL outperforms state-of-the-art self-supervised methods on four human-centric downstream tasks, including 2D and 3D human pose estimation (0.4% mAP and 1.8 mm MPJPE improvement on COCO 2D pose estimation and Human3.6M 3D pose estimation), human shape recovery, and human parsing.","contrastive learning, human-centric perception" Individual Privacy Accounting with Gaussian Differential Privacy,https://openreview.net/forum?id=JmC_Tld3v-f,https://openreview.net/pdf?id=JmC_Tld3v-f,Accurate privacy analysis of fully adaptive compositions using Gaussian differential privacy,"Individual privacy accounting enables bounding differential privacy (DP) loss individually for each participant involved in the analysis. This can be informative as often the individual privacy losses are considerably smaller than those indicated by the DP bounds that are based on considering worst-case bounds at each data access. In order to account for the individual losses in a principled manner, we need a privacy accountant for adaptive compositions of mechanisms, where the loss incurred at a given data access is allowed to be smaller than the worst-case loss. This kind of analysis has been carried out for the R\'enyi differential privacy by Feldman and Zrnic (2021); however, it has not yet been carried out for the so-called optimal privacy accountants. We take first steps in this direction by providing a careful analysis using Gaussian differential privacy, which gives optimal bounds for the Gaussian mechanism, one of the most versatile DP mechanisms. This approach is based on determining a certain supermartingale for the hockey-stick divergence and on extending the R\'enyi divergence-based fully adaptive composition results by Feldman and Zrnic (2021). We also consider measuring the individual $(\varepsilon,\delta)$-privacy losses using the so-called privacy loss distributions. Using the Blackwell theorem, we can then use the results of Feldman and Zrnic (2021) to construct an approximate individual $(\varepsilon,\delta)$-accountant. We also show how to speed up the FFT-based individual DP accounting using the Plancherel theorem.","differential privacy, gaussian differential privacy, fully adaptive compositions, privacy accounting, individual privacy loss" Evolving Populations of Diverse RL Agents with MAP-Elites,https://openreview.net/forum?id=CBfYffLqWqb,https://openreview.net/pdf?id=CBfYffLqWqb,,"Quality Diversity (QD) has emerged as a powerful alternative optimization paradigm that aims at generating and maintaining large and diverse collections of solutions, notably with its flagship algorithm MAP-ELITES (ME), which evolves solutions through mutations and crossovers. While very effective for some unstructured problems, early ME implementations relied exclusively on random search to evolve the population of solutions, rendering them notoriously sample-inefficient for high-dimensional problems, for instance when evolving neural networks. Follow-up works considered exploiting gradient information to guide the search in order to address these shortcomings through techniques borrowed from either Black-Box Optimization (BBO) or Reinforcement Learning (RL). Both lines of work demonstrated great promise, but approaches based on BBO tend to be less sample-efficient as they often empirically evaluate gradients through random sampling in the parameter space. 
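For context, the hockey-stick divergence mentioned above is defined as (standard definition, not specific to this paper) $$H_\alpha(P \,\|\, Q) = \int \max\{0,\; p(x) - \alpha\, q(x)\}\,dx, \qquad \alpha \ge 1,$$ and a mechanism is $(\varepsilon,\delta)$-DP exactly when $H_{e^\varepsilon}$ between its output distributions on any pair of neighboring datasets is at most $\delta$.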
While mixing RL techniques with ME unlocked state-of-the-art performance for robotics control problems that require a good amount of exploration, it also plagued these ME variants with limitations common among RL algorithms that ME was so far free of, such as hyperparameter sensitivity, high stochasticity, and training instability; these issues can worsen as the population size increases, since some components are shared across the population in recent approaches. Furthermore, existing approaches mixing ME with RL tend to be tied to a specific RL algorithm, which effectively prevents their use on problems where the corresponding RL algorithm fails. To address these shortcomings, we introduce a flexible framework that allows the use of any RL algorithm within a population update and alleviates the aforementioned limitations by evolving populations of agents (whose definition includes hyperparameters and all learnable parameters) instead of just policies. We demonstrate the benefits brought about by our framework through extensive numerical experiments on a number of robotics control problems, some of which have deceptive rewards, taken from the QD-RL literature. We also open-source an efficient JAX-based implementation of our algorithm. ", Deconfounded Noisy Labels Learning,https://openreview.net/forum?id=aMbzoO5go4r,https://openreview.net/pdf?id=aMbzoO5go4r,"Explicitly deconfound noisy label learning with causal adjustment, which eliminates the spurious correlation between labels and background representation and preserves the true causal effect between labels and foreground representation.","Noisy labels are pervasive in real-world applications and cause severe performance degradation. In this paper, we first challenge the validity of the small-loss trick that many noisy-label methods rely on. We then study an empirical phenomenon named malignant bias, which results from the spurious correlation between noisy labels and background representations. To address this problem, unlike previous works based on statistical and regularization methods, we revisit the task from a causal perspective. A causal intervention model named deconfounded noisy labels learning (DeNLL) is applied to explicitly deconfound noisy label learning with causal adjustment, which eliminates the spurious correlation between labels and background representation and preserves the true causal effect between labels and foreground representation. DeNLL implements the derived adjustment by a localization module (LM) and a debiased interaction module (DIM). LM adaptively discriminates foreground from background, and DIM dynamically encourages the interaction between the original representation and a debiased factor of the representation, which accords with the causal intervention. Experiments are carried out on five public noisy datasets including synthetic label noise, human label noise and real-world label noise. The proposed method achieves state-of-the-art accuracy and exhibits clear improvements. Also, the proposed method is model-agnostic and improves performance consistently across different backbones.","Noisy labels learning, image classification, causal inference." Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data,https://openreview.net/forum?id=JpbLyEI5EwW,https://openreview.net/pdf?id=JpbLyEI5EwW,,"The implicit biases of gradient-based optimization algorithms are conjectured to be a major factor in the success of modern deep learning. 
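For readers new to ME, the core archive update referenced in the MAP-ELITES entry above fits in a few lines (generic textbook form, not the paper's implementation):

    import random

    def map_elites_step(archive, evaluate, mutate):
        # archive: {behavior_cell: (solution, fitness)}; keep the best per cell
        parent, _ = random.choice(list(archive.values()))
        child = mutate(parent)
        cell, fit = evaluate(child)
        if cell not in archive or fit > archive[cell][1]:
            archive[cell] = (child, fit)
        return archive

    # toy usage: scalar solutions, descriptor = sign, fitness = -|x - 3|
    archive = {0: (0.0, -3.0)}
    for _ in range(100):
        map_elites_step(archive,
                        evaluate=lambda s: (int(s > 0), -abs(s - 3)),
                        mutate=lambda s: s + random.gauss(0, 1))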
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations when the training data are nearly orthogonal, a common property of high-dimensional data. For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two. Moreover, this network is an $\ell_2$-max-margin solution (in parameter space), and has a linear decision boundary that corresponds to an approximate-max-margin linear predictor. For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training. We provide experiments which suggest that a small initialization scale is important for finding low-rank neural networks with gradient descent. ","implicit bias, gradient descent, gradient flow, neural networks" Learning Test Time Augmentation with Cascade Loss Prediction,https://openreview.net/forum?id=MIy9IfYlecR,https://openreview.net/pdf?id=MIy9IfYlecR,,"Data augmentation has been a successful common practice for improving the performance of deep neural networks during the training stage. In recent years, studies on test time augmentation (TTA) have also been promising due to its effectiveness in improving the robustness against out-of-distribution data at inference. Instead of simply adopting pre-defined handcrafted geometric operations such as cropping and flipping, recent TTA methods learn predictive transformations which are supposed to provide the best performance gain on each test sample. However, the predictor's inference time grows in proportion to the desired number of transformation iterations, and the gain from ensembling multiple augmented inputs still requires additional forward passes of the target model. In this paper, we propose a cascade method for test time augmentation prediction. It only requires a single forward pass of the transformation predictor, while it can output multiple desirable transformations iteratively. These transformations are then applied sequentially to the test sample before a single inference pass of the target model. The experimental results show that our method provides a better trade-off between computational cost and overall performance at test time, and shows significant improvement compared to existing methods.", Adaptive Computation with Elastic Input Sequence,https://openreview.net/forum?id=FkRMv-mlSTy,https://openreview.net/pdf?id=FkRMv-mlSTy,We present a new perspective for enabling dynamic allocation of computation budget to different inputs by introducing elasticity to the input length.,"When solving a problem, human beings have an adaptive ability in terms of the type of information they use, the procedure they take, and the amount of time they spend approaching and solving the problem. However, most standard neural networks have the same function type and fixed computation budget on different samples regardless of their nature and difficulty. Adaptivity is a powerful paradigm as it not only imbues practitioners with flexibility pertaining to the downstream usage of these models but can also serve as a powerful inductive bias for solving certain challenging classes of problems. In this work, we propose a new strategy, AdaTape, that enables dynamic computation in neural networks via adaptive tape tokens. 
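A sketch of the cascade TTA idea described above: one predictor pass emits a sequence of transformation choices, which are applied in order before a single target-model inference (every component here is a stand-in of our own, not the paper's code):

    import torch

    def cascade_tta(x, predictor, target_model, transforms):
        # predictor(x) -> (K, num_transforms) scores in ONE forward pass;
        # the K argmax transforms are then applied sequentially to x
        with torch.no_grad():
            scores = predictor(x)
            for idx in scores.argmax(dim=-1).tolist():
                x = transforms[idx](x)
            return target_model(x)

    transforms = [lambda t: t, lambda t: torch.flip(t, dims=[-1])]
    out = cascade_tta(torch.randn(1, 3, 8, 8),
                      predictor=lambda t: torch.randn(3, len(transforms)),
                      target_model=lambda t: t.mean(),
                      transforms=transforms)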
AdaTape employs an elastic input sequence by equipping an existing architecture with a dynamic read-and-write tape. Specifically, we adaptively generate input sequences using tape tokens obtained from a tape bank that can either be trainable or generated from input data. We analyze the challenges and requirements to obtain dynamic sequence content and length, and propose the Adaptive Tape Reader (ATR) algorithm to achieve both objectives. Via extensive experiments on image recognition tasks, we show that AdaTape can achieve better performance while maintaining a comparable computational cost.","Adaptive computation, dynamic allocation of computation budget." Opportunistic Actor-Critic (OPAC) with Clipped Triple Q-learning,https://openreview.net/forum?id=FHZUqgxIBYn,https://openreview.net/pdf?id=FHZUqgxIBYn,OPAC achieves higher average rewards than relevant baselines and mitigates the underestimation bias with the help of Clipped Triple Q-learning.,"Despite being the most successful model-free deep reinforcement learning (RL) algorithms in recent years, Soft Actor-Critic (SAC) and Twin Delayed Deep Deterministic Policy Gradient (TD3) have their respective downsides: TD3 performs well in simple tasks, while SAC does so in relatively complicated ones. However, they also suffer from underestimation due to Clipped Double Q-learning, i.e., taking the minimum of two Q-values. This paper introduces Opportunistic Actor-Critic (OPAC), an ensemble model-free deep RL algorithm that performs well in simple and complex tasks. OPAC combines the features of TD3 and SAC under one roof to retain their respective benefits. It also employs three critics and takes the mean of the two smallest Q-values for updating the shared target, a scheme dubbed Clipped Triple Q-learning. Our analytical results establish that Clipped Triple Q-learning incurs less underestimation than Clipped Double Q-learning. Furthermore, we have systematically evaluated OPAC in MuJoCo environments, and the empirical results indicate that OPAC attains higher average rewards than the current baselines.","Model-free Deep RL, Actor-Critic, Estimation Bias, Continuous Control" Optimizing Data-Flow in Binary Neural Networks,https://openreview.net/forum?id=odI2OpMFq-D,https://openreview.net/pdf?id=odI2OpMFq-D,A method to speed up Binary Neural Networks by removing floating-point computation,"Binary Neural Networks (BNNs) can significantly accelerate the inference of a neural network by replacing its expensive floating-point arithmetic with bit-wise operations. Most existing solutions, however, do not fully optimize data flow through the BNN layers, and intermediate conversions from 1 to 16/32 bits often further hinder efficiency. We propose a novel training scheme that can increase data flow and parallelism in the BNN pipeline; specifically, we introduce a clipping block that decreases the data-width from 32 bits to 8. Furthermore, we reduce the internal accumulator size of a binary layer, usually kept at 32 bits to prevent data overflow, without losing accuracy. Additionally, we provide an optimization of the Batch Normalization layer that both reduces latency and simplifies deployment. Finally, we present an optimized implementation of the Binary Direct Convolution for ARM instruction sets. 
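The Clipped Triple Q-learning target described above is easy to state precisely; a sketch (tensor shapes and the done-mask handling are our assumptions):

    import torch

    def clipped_triple_target(r, q1, q2, q3, done, gamma=0.99):
        # Clipped Double Q uses min(q1, q2); Clipped Triple Q instead averages
        # the two smallest of three target critics' next-state values, which
        # the OPAC analysis argues underestimates less.
        stacked = torch.stack([q1, q2, q3], dim=0)            # (3, batch)
        two_smallest, _ = torch.topk(stacked, k=2, dim=0, largest=False)
        q_next = two_smallest.mean(dim=0)
        return r + gamma * (1.0 - done.float()) * q_next

    batch = 4
    target = clipped_triple_target(torch.zeros(batch), torch.randn(batch),
                                   torch.randn(batch), torch.randn(batch),
                                   done=torch.zeros(batch))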
Our experiments show a consistent improvement of the inference speed (up to $1.77\times$ and $1.9\times$ compared to two state-of-the-art BNN frameworks) with no drop in accuracy for at least one full-precision model.","Binary, Networks, Quantization, 1-bit" Gray-Box Gaussian Processes for Automated Reinforcement Learning,https://openreview.net/forum?id=rmoMvptXK7M,https://openreview.net/pdf?id=rmoMvptXK7M,,"Despite having achieved spectacular milestones in an array of important real-world applications, most Reinforcement Learning (RL) methods are very brittle concerning their hyperparameters. Notwithstanding the crucial importance of setting the hyperparameters in training state-of-the-art agents, the task of hyperparameter optimization (HPO) in RL is understudied. In this paper, we propose a novel gray-box Bayesian Optimization technique for HPO in RL that enriches Gaussian Processes with reward curve estimations based on generalized logistic functions. In a very large-scale experimental protocol, comprising 5 popular RL methods (DDPG, A2C, PPO, SAC, TD3), dozens of environments (Atari, Mujoco), and 7 HPO baselines, we demonstrate that our method significantly outperforms current HPO practices in RL.", Protein Sequence and Structure Co-Design with Equivariant Translation,https://openreview.net/forum?id=pRCMXcfdihq,https://openreview.net/pdf?id=pRCMXcfdihq,"A novel framework for protein sequence and structure co-design, which translates proteins in the joint sequence-structure space in an iterative and end-to-end manner.","Proteins are macromolecules that perform essential functions in all living organisms. Designing novel proteins with specific structures and desired functions has been a long-standing challenge in the field of bioengineering. Existing approaches generate both protein sequence and structure using either autoregressive models or diffusion models, both of which suffer from high inference costs. In this paper, we propose a new approach capable of protein sequence and structure co-design, which iteratively translates both protein sequence and structure into the desired state from random initialization, based on context features given a priori. Our model consists of a trigonometry-aware encoder that reasons about geometrical constraints and interactions from context features, and a roto-translation equivariant decoder that translates protein sequence and structure interdependently. Notably, all protein amino acids are updated in one shot in each translation step, which significantly accelerates the inference process. Experimental results across multiple tasks show that our model outperforms previous state-of-the-art baselines by a large margin, and is able to design proteins of high fidelity as regards both sequence and structure, with running time orders of magnitude less than sampling-based methods.","protein design, sequence structure co-design, equivariant translation, geometric deep learning" Deep Equilibrium Non-Autoregressive Sequence Learning,https://openreview.net/forum?id=bnkvnbGEXnc,https://openreview.net/pdf?id=bnkvnbGEXnc,,"In this work, we argue that non-autoregressive (NAR) sequence generative models can equivalently be regarded as an iterative refinement process towards the target sequence, implying an underlying dynamical system of NAR models: $\mathbf{z} = f(\mathbf{z}, \mathbf{x}) \rightarrow \mathbf{y}$. Viewed this way, the optimal prediction of a NAR model should be the equilibrium state of its dynamics if given infinitely many iterations. 
However, this is infeasible in practice due to limited computational and memory budgets. To this end, we propose DeqNAR to directly solve for the equilibrium state of NAR models based on deep equilibrium networks (Bai et al., 2019) with black-box root-finding solvers and back-propagate through the equilibrium point via implicit differentiation with constant memory. We conduct extensive experiments on four WMT machine translation benchmarks. Our main findings show that DeqNAR can indeed converge to a more accurate prediction and is a general-purpose framework that consistently yields substantial improvements for several strong NAR backbones.","deep equilibrium model, non-autoregressive sequence-to-sequence" PTUnifier: Pseudo Tokens as Paradigm Unifiers in Medical Vision-and-Language Pre-training,https://openreview.net/forum?id=qSCHRL8b96S,https://openreview.net/pdf?id=qSCHRL8b96S,"This paper proposes a simple approach to extract generic representations from medical images and texts, which can be applied to a broad range of medical tasks.","Medical vision-and-language pre-training (Med-VLP) has shown promising improvements on many downstream medical tasks owing to its applicability to extracting generic representations from medical images and texts. Practically, there exist two typical paradigms, i.e., the \textbf{fusion-encoder paradigm} and the \textbf{dual-encoder paradigm}, depending on whether a heavy fusion module is used. The former outperforms on multi-modal tasks owing to the sufficient interaction between modalities; the latter outperforms on uni-modal and cross-modal tasks due to the single-modality encoding ability. To take advantage of these two paradigms, we propose an effective yet straightforward scheme named PTUnifier that unifies the two paradigms by introducing visual and textual pseudo tokens, which give the two paradigms an identical input format and serve as a feature bank storing the most representative images/texts. By doing so, a single model can process various tasks adopting different input formats (i.e., image-only, text-only, and image-text-pair). Furthermore, we construct a pool of pseudo tokens (instead of static ones) to improve diversity and scalability. Experimental results show that our approach achieves state-of-the-art results on a broad range of tasks, spanning uni-modal tasks (i.e., image/text classification and text summarization), cross-modal tasks (i.e., image-to-text generation and image-text/text-image retrieval), and multi-modal tasks (i.e., visual question answering), demonstrating the effectiveness of our approach. Note that the adoption of pseudo tokens is orthogonal to most existing Med-VLP approaches, and we believe that our approach could be a beneficial and complementary extension to these approaches.","medical analysis, vision-and-language, multi-modal learning" SGD Through the Lens of Kolmogorov Complexity,https://openreview.net/forum?id=5YHaMHg2Bfa,https://openreview.net/pdf?id=5YHaMHg2Bfa,,"We initiate a thorough study of the dynamics of stochastic gradient descent (SGD) under minimal assumptions using the tools of entropy compression. Specifically, we characterize a quantity of interest which we refer to as the \emph{accuracy discrepancy}. Roughly speaking, this measures the average discrepancy between the model accuracy on batches and large subsets of the entire dataset. We show that if this quantity is sufficiently large, then SGD finds a model which achieves perfect accuracy on the data in $O(1)$ epochs. 
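The constant-memory backward pass above follows from the implicit function theorem at the equilibrium: differentiating $\mathbf{z}^\ast = f(\mathbf{z}^\ast, \mathbf{x}; \theta)$ gives (the standard deep-equilibrium gradient) $$\frac{\partial \mathbf{z}^\ast}{\partial \theta} = \Big(I - \frac{\partial f}{\partial \mathbf{z}}\Big|_{\mathbf{z}^\ast}\Big)^{-1}\frac{\partial f}{\partial \theta}\Big|_{\mathbf{z}^\ast},$$ so gradients are obtained by solving one linear system at the equilibrium point instead of storing all the forward iterations.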
Conversely, if the model cannot perfectly fit the data, this quantity must remain below a \emph{global} threshold, which only depends on the size of the dataset and batch. We use the above framework to lower bound the amount of randomness required to allow (non-stochastic) gradient descent to escape from local minima using perturbations. We show that even if the model is \emph{extremely overparameterized}, at least a linear (in the size of the dataset) number of random bits are required to guarantee that GD escapes local minima in polynomial time.", Offline imitation learning by controlling the effective planning horizon,https://openreview.net/forum?id=TZixgYj-oqI,https://openreview.net/pdf?id=TZixgYj-oqI,"We fix the problem that previous IL algorithms don't work with a low discount factor, and show that offline IL can be solved with the proposed fix and lowering the discount factor.","In offline imitation learning (IL), we generally assume only a handful of expert trajectories and a supplementary offline dataset from suboptimal behaviors to learn the expert policy. While it is now common to minimize the divergence between state-action visitation distributions so that the agent also considers the future consequences of an action, a sampling error in an offline dataset may lead to erroneous estimates of state-action visitations in the offline case. In this paper, we investigate the effect of controlling the effective planning horizon (i.e., reducing the discount factor) as opposed to imposing an explicit regularizer, as previously studied. Unfortunately, it turns out that the existing algorithms suffer from magnified approximation errors when the effective planning horizon is shortened, which results in a significant degradation in performance. We analyze the main cause of the problem and provide the right remedies to correct the algorithm. We show that the corrected algorithm improves on popular imitation learning benchmarks by controlling the effective planning horizon rather than imposing an explicit regularizer.","imitation learning, offline imitation learning, supplementary offline dataset" Learning in temporally structured environments,https://openreview.net/forum?id=z0_V5O9cmNw,https://openreview.net/pdf?id=z0_V5O9cmNw,Models that learn at multiple timescales perform well in tasks with complex temporal structure,"Natural environments have temporal structure at multiple timescales, a property that is reflected in biological learning and memory but typically not in machine learning systems. This paper advances a multiscale learning model in which each weight in a neural network is decomposed into a sum of subweights learning independently with different learning and decay rates. Thus knowledge becomes distributed across different timescales, enabling rapid adaptation to task changes while avoiding catastrophic interference with older knowledge. First, we prove that previous models that learn at multiple timescales, but with complex coupling between timescales, are formally equivalent to the multiscale learner via a reparameterization that eliminates this coupling. Thus the multiscale learner offers a unifying framework that is conceptually and computationally simpler than past work. The same analysis also offers a new characterization of momentum learning, as a fast weight with a negative learning rate. 
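The weight decomposition above admits a very small implementation; a sketch with learning/decay schedules of our own choosing (faster scales both learn and forget more aggressively):

    import numpy as np

    class MultiscaleWeight:
        # one effective weight stored as a sum of subweights, each learning
        # independently with its own learning rate and decay rate
        def __init__(self, n_scales=4):
            self.sub = np.zeros(n_scales)
            self.lr = 0.5 ** np.arange(n_scales)           # 1, .5, .25, .125
            self.decay = 0.1 * 0.5 ** np.arange(n_scales)  # fast scales forget fastest

        @property
        def value(self):
            return self.sub.sum()

        def update(self, grad):
            # every subweight sees the same gradient signal
            self.sub = (1.0 - self.decay) * self.sub - self.lr * grad

    w = MultiscaleWeight()
    for g in [1.0, -0.5, 0.2]:
        w.update(g)
    print(w.value)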
Second, we derive a model of Bayesian inference in environments governed by $1/f$ noise, a common pattern in both natural and human-generated environments that involves long-range (power law) autocorrelations. The model works by applying a Kalman filter to jointly infer dynamics at multiple timescales. We then derive a variational approximation to the Bayesian model and show that it is equivalent to the multiscale learner. Third, we evaluate the models in synthetic online prediction tasks characterized by $1/f$ noise in the latent parameters of the environment. We find that the Bayesian model significantly outperforms stochastic gradient descent (which effectively learns at only one timescale) and a batch heuristic that predicts each timestep based on a fixed horizon of past observations (motivated by the idea that older data have gone stale). Moreover, the multiscale learner with parameters obtained from the variational approximation performs nearly as well as the full Bayesian model, and with memory requirements that are linear in the size of the network (vs. quadratic for the Bayesian model). Future work will incorporate the multiscale learner as an optimizer in deep networks to explore their ability to learn in rich temporally structured environments. ","1/f noise, Kalman filter, neural network, learning theory, optimizers" Identifying Phase Transition Thresholds of Permuted Linear Regression via Message Passing,https://openreview.net/forum?id=-4DiyBMgv9m,https://openreview.net/pdf?id=-4DiyBMgv9m,,"This paper considers permuted linear regression, i.e., ${\mathbf{Y}} = {\mathbf{\Pi}}^{\natural}{\mathbf{X}}{\mathbf{B}}^{\natural} + {\mathbf{W}}$, where ${\mathbf{Y}} \in \mathbb{R}^{n\times m}, {\mathbf{\Pi}}^{\natural}\in\mathbb{R}^{n\times n}, {\mathbf{X}} \in \mathbb{R}^{n\times p}, {\mathbf{B}}^{\natural}\in \mathbb{R}^{p\times m}$, and ${\mathbf{W}}\in \mathbb{R}^{n\times m}$ represent the observations, missing (or incomplete) information about ordering, sensing matrix, signals of interest, and additive sensing noise, respectively. As is shown in the previous work, there exist phase transition phenomena in terms of the \emph{signal-to-noise ratio} ($\mathsf{snr}$), number of permuted rows, etc. While all existing works only concern the convergence rates without specifying the associated constants in front of them, we give a precise identification of the phase transition thresholds via the message passing algorithm. Depending on whether the signal ${\mathbf{B}}^{\natural}$ is known or not, we separately identify the corresponding critical points around the phase transition regimes. Moreover, we provide numerical experiments and show the empirical phase transition points are well aligned with theoretical predictions.", RandProx: Primal-Dual Optimization Algorithms with Randomized Proximal Updates,https://openreview.net/forum?id=cB4N3G5udUS,https://openreview.net/pdf?id=cB4N3G5udUS,,"Proximal splitting algorithms are well suited to solving large-scale nonsmooth optimization problems, in particular those arising in machine learning. We propose a new primal–dual algorithm, in which the dual update is randomized; equivalently, the proximity operator of one of the functions in the problem is replaced by a stochastic oracle. For instance, some randomly chosen dual variables, instead of all, are updated at each iteration. Or, the proximity operator of a function is called with some small probability only. 
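As a concrete instance of calling a proximity operator with small probability, here is soft-thresholding (the prox of the $\ell_1$ norm) invoked probabilistically; the scaling below is a naive placeholder, and RandProx's actual variance-reduction correction is not reproduced:

    import numpy as np

    rng = np.random.default_rng(0)

    def prox_l1(x, gamma):
        # prox of gamma*||.||_1: soft-thresholding
        return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

    def randomized_prox_step(x, grad, step, gamma, p=0.1):
        x = x - step * grad(x)                # gradient step on the smooth part
        if rng.random() < p:                  # prox called with small probability
            x = prox_l1(x, step * gamma / p)  # naive 1/p scaling (placeholder)
        return x

    x = rng.normal(size=5)
    for _ in range(50):
        x = randomized_prox_step(x, grad=lambda v: v, step=0.1, gamma=0.5)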
A nonsmooth variance-reduction technique is implemented so that the algorithm finds an exact minimizer of the general problem involving smooth and nonsmooth functions, possibly composed with linear operators. We derive linear convergence results in the presence of strong convexity; these results are new even in the deterministic case, when our algorithm reverts to the recently proposed Primal-Dual Davis-Yin algorithm. Some randomized algorithms from the literature are also recovered as particular cases (e.g., Point-SAGA). But our randomization technique is general and encompasses many unbiased mechanisms beyond sampling and probabilistic updates, including compression. Since the convergence speed depends on the slowest among the primal and dual contraction mechanisms, the iteration complexity might remain the same when randomness is used. On the other hand, the computation complexity can be significantly reduced. Overall, randomness helps in obtaining faster algorithms. This has long been known for stochastic-gradient-type algorithms, and our work shows that this fully applies in the more general primal-dual setting as well.","optimization, randomized algorithm, stochastic algorithm, proximal splitting, proximity operator" Improving the Calibration of Fine-tuned Language Models via Denoising Variational Auto-Encoders,https://openreview.net/forum?id=NI7StoWHJPT,https://openreview.net/pdf?id=NI7StoWHJPT,,"Large pre-trained language models have demonstrated strong performance on natural language understanding (NLU) tasks through fine-tuning. However, fine-tuned models still suffer from overconfident predictions, especially in out-of-domain settings. In this paper, we tackle the problem of calibrating fine-tuned language models. Based on the observation that pre-trained language models are well-calibrated yet fine-tuned models distort pre-trained features, we propose a simple fine-tuning method that enforces the consistency between the pre-trained and fine-tuned features. Specifically, we improve calibration on downstream tasks by introducing an auxiliary generative language modeling objective in the fine-tuning phase. The auxiliary objective is defined with a denoising variational auto-encoding framework, which encourages the model to learn generative representations. Empirical results show that our method achieves competitive accuracy and the lowest expected calibration error compared to several strong baselines under both in-domain and out-of-domain settings on three downstream NLU tasks. Furthermore, our fine-tuned models preserve their language modeling ability. They can generate text given labels on downstream datasets and thus have better interpretability compared to vanilla fine-tuning.", A Hierarchical Bayesian Approach to Federated Learning,https://openreview.net/forum?id=2jcvy1htS_r,https://openreview.net/pdf?id=2jcvy1htS_r,We propose a novel hierarchical Bayesian approach to Federated learning (FL) where the block-coordinate descent solution to the variational inference leads to a viable algorithm for FL with proved convergence and generalisation guarantee.,"We propose a novel hierarchical Bayesian approach to Federated learning (FL), where our models reasonably describe the generative process of clients' local data via hierarchical Bayesian modeling: positing random variables of local models for clients that are governed by a higher-level global variate. 
Interestingly, the variational inference in our Bayesian model leads to an optimization problem whose block-coordinate descent solution becomes a distributed algorithm that is separable over clients and allows them not to reveal their own private data at all, and is thus fully compatible with FL. We also highlight that our block-coordinate algorithm has particular forms that subsume the well-known FL algorithms including Fed-Avg and Fed-Prox as special cases. That is, we not only justify the previous Fed-Avg and Fed-Prox algorithms whose learning protocols look intuitive but are theoretically less well underpinned, but also generalise them even further via principled Bayesian approaches. Beyond introducing novel modeling and derivations, we also offer convergence analysis showing that our block-coordinate FL algorithm converges to a (local) optimum of the objective at the rate of $O(1/\sqrt{t})$, the same rate as regular (centralised) SGD, as well as the generalisation error analysis where we prove that the test error of our model on unseen data is guaranteed to vanish as we increase the training data size, and is thus asymptotically optimal.","Federated Learning, Bayesian Methods, Probabilistic Models" Neural Representations in Multi-Task Learning guided by Task-Dependent Contexts,https://openreview.net/forum?id=p48GR3rwtxf,https://openreview.net/pdf?id=p48GR3rwtxf,"We investigate neural representations learned by multi-task architectures, focusing on task-switching networks that use task-dependent contexts.","The ability to switch between tasks effectively in response to external stimuli is a hallmark of cognitive control. Our brain is able to filter and to integrate external information to accomplish goal-directed behavior. Task switching occurs rapidly and efficiently, allowing us to perform multiple tasks with ease. In a similar way, deep learning models can be tailored to exhibit multi-task capabilities and achieve high performance across domains. Still, understanding how neural networks make predictions is crucial in many real-world applications. In this study, we delve into neural representations learned by multi-tasking architectures. Concretely, we compare individual and parallel networks with task-switching networks. Task-switching networks leverage task-dependent contexts to learn disentangled representations without hurting the overall task accuracy. We show that task-switching networks operate in an intermediate regime between individual and parallel networks. In addition, we show that shared representations are produced by the emergence of neurons encoding multiple tasks. Furthermore, we study the role of contexts across network processing and show their role in aligning the task with the relevant features. Finally, we investigate how the magnitude of contexts affects the performance in task-switching networks.","Learning Representations, Neural Geometry, Context-Dependent Decision Making, Attention Mechanisms" MCTransformer: Combining Transformers And Monte-Carlo Tree Search For Offline Reinforcement Learning,https://openreview.net/forum?id=-94tJCOo7OM,https://openreview.net/pdf?id=-94tJCOo7OM,A novel approach for sequential decision making using reinforcement learning by combining MCTS and transformers.,"Recent studies have explored framing reinforcement learning as a sequence modeling problem and then using Transformers to generate effective solutions. In this study, we introduce MCTransformer, a framework that combines Monte-Carlo Tree Search (MCTS) with Transformers. 
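For orientation, the local subproblem through which Fed-Prox arises as a special case of such block-coordinate updates is its standard objective (how the paper's Bayesian derivation recovers it, e.g. via a Gaussian prior linking local and global variables, is our reading, not a quotation): $$\min_{w_i}\; f_i(w_i) + \frac{\mu}{2}\,\|w_i - w_{\text{global}}\|^2,$$ with Fed-Avg corresponding to plain local minimization of $f_i$ followed by server-side averaging of the $w_i$.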
Our approach uses an actor-critic setup, where the MCTS component is responsible for navigating previously-explored states, aided by input from the Transformer. The Transformer controls the exploration and evaluation of new states, enabling an effective and efficient evaluation of various strategies. In addition to the development of highly effective strategies, our setup enables the use of more efficient sampling compared to existing MCTS-based solutions. MCTransformer is therefore able to perform a small number of evaluations for each newly-explored node, and to do so without degrading its performance. Our evaluation, conducted on the challenging and well-known problem of SameGame, shows that our approach outperforms both Transformer-based and MCTS-based solutions.","Transformer, Monte Carlo Tree Search, Offline Reinforcement Learning, SameGame" One-Step Estimator for Permuted Sparse Recovery,https://openreview.net/forum?id=l2vPa8gwBuA,https://openreview.net/pdf?id=l2vPa8gwBuA,,"This paper considers the unlabeled sparse recovery under multiple measurements, i.e., ${\mathbf{Y}} = {\mathbf{\Pi}}^{\natural}{\mathbf{X}} {\mathbf{B}}^{\natural} + {\mathbf{W}}$, where ${\mathbf{Y}} \in \mathbb{R}^{n\times m}, {\mathbf{\Pi}}^{\natural}\in \mathbb{R}^{n\times n}, {\mathbf{X}} \in \mathbb{R}^{n\times p}, {\mathbf{B}}^{\natural}\in \mathbb{R}^{p\times m}, {\mathbf{W}}\in \mathbb{R}^{n\times m}$ represent the observations, missing (or incomplete) correspondence information, sensing matrix, sparse signals, and additive sensing noise, respectively. Different from previous works on multiple measurements ($m > 1$), which all focus on the sufficient samples regime, namely, $n > p$, we consider a sparse matrix $\mathbf{B}^{\natural}$ and investigate the insufficient samples regime (i.e., $n \ll p$) for the first time. To begin with, we establish the lower bound on the sample number and \emph{signal-to-noise ratio} (${\mathsf{SNR}}$) for the correct permutation recovery. Moreover, we present a simple yet effective estimator. Under mild conditions, we show that our estimator can restore the correct correspondence information with high probability. Numerical experiments are presented to corroborate our theoretical claims.", Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?,https://openreview.net/forum?id=GGItImF9oG5,https://openreview.net/pdf?id=GGItImF9oG5,"Your model is pretty cool, but does it scale? Let's find out. ","There has been a lot of interest in the scaling properties of Transformer models \citep{kaplan2020scaling}. However, little has been done to investigate the scaling properties of different inductive biases and model architectures. Do model architectures scale differently? If so, how does inductive bias affect scaling behaviour? How does this influence upstream (pretraining) and downstream (transfer) performance? This paper conducts a systematic study of the scaling behaviour of ten diverse model architectures such as Transformers, Switch Transformers, Universal Transformers, Dynamic convolutions, Performers, and recently proposed MLP-Mixers. Via extensive experiments, we show that (1) architecture is indeed an important consideration when performing scaling and (2) the best performing model can fluctuate at different scales.
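(Editorial aside: scaling behaviour of this kind is typically summarized by fitting a power law, loss ≈ a·N^(−b), per architecture; the sketch below shows the fitting procedure on made-up numbers.)

```python
import numpy as np

# Synthetic (model size, loss) points following loss = a * N**(-b);
# the numbers are fabricated purely to illustrate the fitting procedure.
N = np.array([1e6, 1e7, 1e8, 1e9])
loss = 5.0 * N ** -0.07

# Fit on log-log axes; architectures that "scale differently" would show up
# as different fitted exponents b.
slope, intercept = np.polyfit(np.log(N), np.log(loss), 1)
print(f"fitted exponent b = {-slope:.3f}, prefactor a = {np.exp(intercept):.3f}")
```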
We believe that the findings outlined in this work have significant implications for how model architectures are currently evaluated in the community.", Guarded Policy Optimization with Imperfect Online Demonstrations,https://openreview.net/forum?id=O5rKg7IRQIO,https://openreview.net/pdf?id=O5rKg7IRQIO,Introducing a new policy optimization method exploiting imperfect online demonstrations from a guardian policy.,"Teacher-Student Framework (TSF) is a reinforcement learning setting where a teacher agent or human expert guards the training of a student agent by intervening and providing online demonstrations. Assuming the teacher policy is optimal, it has the perfect timing and capability to intervene in the control of the student agent, providing safety guarantees and exploration guidance. Nevertheless, in many real-world settings it is expensive or even impossible to obtain a well-performing teacher policy. In this work we relax the assumption of a well-performing teacher and develop a new method that can incorporate arbitrary teacher policies with modest or inferior performance. We instantiate an off-policy Reinforcement Learning algorithm, termed Teacher-Student Shared Control (TS2C), which incorporates teacher intervention based on trajectory-based value estimation. Theoretical analysis validates that the proposed TS2C algorithm attains efficient exploration and a lower-bound safety guarantee without being affected by the teacher's own performance. Experiments on autonomous driving simulation show that our method can exploit teacher policies at any performance level and maintain a low training cost. Moreover, the student policy surpasses the imperfect teacher policy, achieving higher accumulated reward in held-out testing environments.","reinforcement learning, guarded policy optimization, imperfect demonstrations, shared control, metadrive simulator" Fast Nonlinear Vector Quantile Regression,https://openreview.net/forum?id=UxqUgchwXkK,https://openreview.net/pdf?id=UxqUgchwXkK,"We extend Vector Quantile Regression to support non-linear specification, while ensuring monotonicity and scaling to millions of samples.","$$ \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} $$ Quantile regression (QR) is a powerful tool for estimating one or more conditional quantiles of a target variable $\rvar{Y}$ given explanatory features $\rvec{X}$. A limitation of QR is that it is only defined for scalar target variables, due to the formulation of its objective function, and since the notion of quantiles has no standard definition for multivariate distributions. Recently, vector quantile regression (VQR) was proposed as an extension of QR for vector-valued target variables, thanks to a meaningful generalization of the notion of quantiles to multivariate distributions via optimal transport. Despite its elegance, VQR is arguably not applicable in practice due to several limitations: (i) it assumes a linear model for the quantiles of the target $\rvec{Y}$ given the features $\rvec{X}$; (ii) its exact formulation is intractable even for modestly-sized problems in terms of target dimensions, number of regressed quantile levels, or number of features, and its relaxed dual formulation may violate the monotonicity of the estimated quantiles; (iii) no fast or scalable solvers for VQR currently exist.
In this work we fully address these limitations, namely: (i) We extend VQR to the non-linear case, showing substantial improvement over linear VQR; (ii) We propose {vector monotone rearrangement}, a method which ensures the quantile functions estimated by VQR are monotone functions; (iii) We provide fast, GPU-accelerated solvers for linear and nonlinear VQR which maintain a fixed memory footprint, and demonstrate that they scale to millions of samples and thousands of quantile levels; (iv) We release an optimized Python package of our solvers to promote the widespread use of VQR in real-world applications.","optimal transport, quantile regression, vector quantiles, uncertainty quantification, multi-output regression, conformal prediction, software" Multi Task Learning of Different Class Label Representations for Stronger Models,https://openreview.net/forum?id=jZfksUBb3Zz,https://openreview.net/pdf?id=jZfksUBb3Zz,"We present a new way of representing class labels, and train a network to recognize them as an auxiliary task, leading to stronger models.","We find that the way in which class labels are represented can have a powerful effect on how well models trained on them learn. In classification, the standard way of representing class labels is as one-hot vectors. We present a new way of representing class labels called Binary Labels, where each class label is a large binary vector. We further introduce a new paradigm, multi-task learning on different label representations. We train a network on two tasks. The main task is to classify images based on their one-hot label, and the auxiliary task is to classify images based on their Binary Label. We show that networks trained on both tasks have many advantages, including higher accuracy across a wide variety of datasets and architectures, both when trained from scratch and when using transfer learning. Networks trained on both tasks are also much more effective when training data is limited, and seem to do especially well on more challenging problems. ","Label Representation, Image Classification, Representation Learning, Multi-Task Learning" On the Existence of a Trojaned Twin Model,https://openreview.net/forum?id=w48XN5HwpV8,https://openreview.net/pdf?id=w48XN5HwpV8,"A mathematical model for backdoor attacks, showing the existence of a trojaned twin model of a clean model","We study the Trojan Attack problem, where malicious attackers sabotage deep neural network models with poisoned training data. In most existing works, the effectiveness of the attack is largely overlooked; many attacks can be ineffective or inefficient for certain training schemes, e.g., adversarial training. In this paper, we adopt a novel perspective and look into the quantitative relationship between a clean model and its Trojaned counterpart. We formulate a successful attack using classic machine learning language. Under mild assumptions, we show theoretically that there exists a Trojaned model, named Trojaned Twin, that is very close to the clean model. This attack can be achieved by simply using a universal Trojan trigger intrinsic to the data distribution. This has powerful implications in practice; the Trojaned twin model has enhanced attack efficacy and strong resiliency against detection. Empirically, we show that our method achieves consistent attack efficacy across different training schemes, including the challenging adversarial training scheme.
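(Editorial aside: for readers unfamiliar with data-poisoning attacks, the sketch below shows a generic patch-trigger poisoning routine. The paper's point is stronger: a universal trigger intrinsic to the data distribution suffices, which this toy corner patch does not capture.)

```python
import numpy as np

def poison(images, labels, target_class, rate=0.1, seed=0):
    """Generic patch-trigger poisoning (illustration only, not the paper's
    distribution-intrinsic trigger): stamp a corner patch and flip the label."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(rate * len(images)), replace=False)
    images[idx, -3:, -3:] = 1.0      # 3x3 white patch in the corner
    labels[idx] = target_class      # attacker-chosen target label
    return images, labels

x = np.zeros((100, 28, 28))
y = np.random.default_rng(1).integers(0, 10, 100)
x_p, y_p = poison(x, y, target_class=7)
print((y_p == 7).sum() >= 10)  # at least the poisoned 10% now carry label 7
```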
Furthermore, this Trojaned twin model is robust against SoTA detection methods.","Backdoor Attack, Trojan Attack" On Information Maximisation in Multi-View Self-Supervised Learning,https://openreview.net/forum?id=UHPva3PuKLN,https://openreview.net/pdf?id=UHPva3PuKLN, ,"The strong performance of multi-view self-supervised learning (SSL) prompted the development of many different approaches (e.g. SimCLR, BYOL, and DINO). A unified understanding of how each of these methods achieves its performance has been limited by apparent differences across objectives and algorithmic details. Through the lens of information theory, we show that many of these approaches are maximising an approximate lower bound on the mutual information between the representations of multiple views of the same datum. Further, we show that this bound decomposes into a ``reconstruction'' term, treated identically by all SSL methods, and an ``entropy'' term, where existing SSL methods differ in their treatment. We prove that an exact optimisation of both terms of this lower bound encompasses and unifies current theoretical properties such as recovering the true latent variables of the underlying generative process (Zimmermann et al., 2021) or isolating content from style in such true latent variables (Von Kügelgen et al., 2021). This theoretical analysis motivates a naive but principled objective (EntRec) that exactly optimises both the reconstruction and entropy terms, thus benefiting from said theoretical properties unlike other SSL frameworks. Finally, we show EntRec achieves a downstream performance on par with existing SSL methods on ImageNet (69.7% after 400 epochs) and on an array of transfer tasks when pre-trained on ImageNet. Furthermore, EntRec is more robust to modifying the batch size, a sensitive hyperparameter in other SSL methods.","multi-view Self-supervised Learning, Information Theory" Leveraging Large Language Models for Multiple Choice Question Answering,https://openreview.net/forum?id=yKbprarjc5B,https://openreview.net/pdf?id=yKbprarjc5B,Large language models that can effectively associate multiple choice answer options with symbols can be prompted in a way that yields dramatically improved performance on multiple choice question answering tasks.,"While large language models (LLMs) like GPT-3 have achieved impressive results on multiple choice question answering (MCQA) tasks in the zero, one, and few-shot settings, they generally lag behind the MCQA state of the art (SOTA). MCQA tasks have traditionally been presented to LLMs as cloze tasks. An LLM is conditioned on a question (without the associated answer options) and its chosen option is the one assigned the highest probability after normalization (for length, etc.). A more natural prompting approach is to present the question and answer options to the LLM jointly and have it output the symbol (e.g., “A”) associated with its chosen answer option. This approach allows the model to explicitly compare answer options, reduces computational costs, and mitigates the effects of tokenization scheme and answer option lengths on answer selection. For the natural approach to be effective, the LLM it is used with must be able to associate answer options with the symbols that represent them. The LLM needs what we term multiple choice symbol binding (MCSB) ability. This ability varies greatly by model.
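(Editorial aside: the two prompting styles contrasted above can be made concrete as follows; the prompt formats below are illustrative assumptions, and scoring is left to any LM log-probability function.)

```python
# Two ways of posing the same multiple-choice question to an LLM (illustrative
# prompt formats; scoring would be done by any LM log-probability function).

question = "Which gas do plants absorb during photosynthesis?"
options = {"A": "Oxygen", "B": "Carbon dioxide", "C": "Nitrogen"}

# Cloze-style: score each option text as a continuation, normalize for length.
cloze_prompts = [f"Q: {question}\nA: {text}" for text in options.values()]

# Symbol-binding style: show all options, ask only for the symbol ("A"/"B"/"C").
mcp_prompt = f"Q: {question}\n" + "\n".join(
    f"{sym}. {text}" for sym, text in options.items()
) + "\nAnswer:"
print(mcp_prompt)
```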
We show that a model with high MCSB ability performs much better with the natural approach than with the traditional approach across 20 diverse tasks and largely closes the gap with the SOTA, suggesting that the MCQA ability of LLMs has been previously underestimated.","NLP, language models, multiple choice question answering, symbol binding, GPT-3, Codex" SELCOR: Self-Correction for Weakly Supervised Learning,https://openreview.net/forum?id=No6QvMxdQMo,https://openreview.net/pdf?id=No6QvMxdQMo,We propose a self-training based method to reduce noise in weak labels for weakly supervised learning.,"Powerful machine learning models often require training with large amounts of labeled data. Collecting precise labels from human supervision is expensive and time-consuming. Instead, it is much easier to obtain large-scale weak labels from multiple weak supervision sources such as rules or knowledge graphs, whereas weak labels could be noisy and make models prone to overfitting. We propose a self-training method for weakly supervised learning without using any true label. Our method learns a joint model with a corrector and a predictor, where the predictor generates pseudo clean labels and the corrector revises weak labels to reproduce the pseudo labels generated by the predictor. The joint model is trained by encouraging consistency between label generation and label correction, such that the predictor and corrector can iteratively improve each other to generate more reliable pseudo labels for self-training. In this way, our method makes full use of weak labels and effectively suppresses label noise in weak labels and pseudo labels. Experiments on 8 benchmark datasets show that our method outperforms existing weakly supervised methods by large margins.","Weakly Supervised Learning, Label Noise, Self-Training" Efficiently Meta-Learning for Robust Deep Networks without Prior Unbiased Set,https://openreview.net/forum?id=-zw8zmeIt7M,https://openreview.net/pdf?id=-zw8zmeIt7M,"We present an efficient meta-learning approach, which eliminates the dependence on additional unbiased data and reduces the optimization complexity of recent meta-learning based methods","Learning with noisy labels is a practically challenging problem in robust deep learning. Recent efforts to improve the robustness are made by meta-learning the sample weights or transition matrix from a prior unbiased set. Thus, previous meta-learning based approaches generally assume the existence of such a prior unbiased set. Unfortunately, this assumption unrealistically simplifies the task of learning with noisy labels in real-world scenarios; even worse, the updating iterations in previous meta-learning algorithms typically demand prohibitive computational cost. This paper proposes an efficient meta-learning approach for robust deep learning to address these challenges. Specifically, without relying on a prior unbiased validation set, our method dynamically estimates unbiased samples in training data and leverages meta-learning to refine the deep networks. Furthermore, to significantly reduce the optimization cost of the updating iterations, we carefully design the inner-loop adaptation and outer-loop optimization of the meta-learning paradigm, respectively. Experimental results demonstrate that our approach is able to save about 6 times the training time while achieving comparable or even better generalization performance.
In particular, we improve accuracy on the CIFAR100 benchmark at 40% instance-dependent noise by more than 13% in absolute terms.","Robust deep learning, Noisy Label, Meta-learning, KD" Learning with Logical Constraints but without Shortcut Satisfaction,https://openreview.net/forum?id=M2unceRvqhh,https://openreview.net/pdf?id=M2unceRvqhh,,"Recent studies have started to explore the integration of logical knowledge into deep learning via encoding logical constraints as an additional loss function. However, existing approaches tend to vacuously satisfy logical constraints through shortcuts, failing to fully exploit the knowledge. In this paper, we present a new framework for learning with logical constraints. Specifically, we address the shortcut satisfaction issue by introducing dual variables for logical connectives, encoding how the constraint is satisfied. We further propose a variational framework where the encoded logical constraint is expressed as a distributional loss that is compatible with the model's original training loss. The theoretical analysis shows that the proposed approach has several desirable properties, and the experimental evaluations demonstrate its superior performance in both model generalizability and constraint satisfaction.","training with logical constraints, logical formula encoding, variational learning, stochastic gradient descent ascent" Certified Training: Small Boxes are All You Need,https://openreview.net/forum?id=7oFuxtJtUMH,https://openreview.net/pdf?id=7oFuxtJtUMH,"We propose a novel certified training method based on propagating small input regions, establishing a new state of the art for certified accuracy.","We propose the novel certified training method, SABR, which outperforms existing methods across perturbation magnitudes on MNIST, CIFAR-10, and TinyImageNet, in terms of both standard and certifiable accuracies. The key insight behind SABR is that propagating interval bounds for a small but carefully selected subset of the adversarial input region is sufficient to approximate the worst-case loss over the whole region while significantly reducing approximation errors. SABR not only establishes a new state of the art in all commonly used benchmarks but, more importantly, points to a new class of certified training methods promising to overcome the robustness-accuracy trade-off.","Certified Training, Certified Robustness, Adversarial Robustness, Robustness Verification" Label Similarity Aware Contrastive Learning,https://openreview.net/forum?id=eCIVFDVxvAx,https://openreview.net/pdf?id=eCIVFDVxvAx,Label similarity aware contrastive learning builds better representation space and improves downstream performance via optimizing alignment and uniformity.,"Contrastive learning dramatically improves the performance in self-supervised learning by maximizing the alignment between two representations obtained from the same sample while distributing all representations uniformly. Extended supervised contrastive learning boosts downstream performance by pulling embedding vectors belonging to the same class together, even when the vectors are obtained from different samples. In this work, we generalize the supervised contrastive learning approach to a universal framework that allows us to fully utilize the ground-truth similarities between samples. All pairs of representations are pulled together in proportion to their label similarity, rather than equally pulling together representations that share the same class label.
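(Editorial aside: a similarity-weighted contrastive loss of this flavor can be sketched as below; the exact weighting used by the paper may differ.)

```python
import torch
import torch.nn.functional as F

def label_similarity_contrastive(z, sim, temperature=0.1):
    # Pull pairs together in proportion to label similarity sim[i, j] in [0, 1];
    # a sketch only -- the paper's exact weighting may differ.
    z = F.normalize(z, dim=1)
    logits = z @ z.t() / temperature
    eye = torch.eye(z.size(0), dtype=torch.bool)
    logits = logits.masked_fill(eye, -1e9)       # exclude self-pairs
    log_p = F.log_softmax(logits, dim=1)
    w = sim.clone()
    w.fill_diagonal_(0.0)
    w = w / w.sum(dim=1, keepdim=True).clamp_min(1e-8)  # per-row weights
    return -(w * log_p).sum(dim=1).mean()

z = torch.randn(8, 16, requires_grad=True)
sim = torch.rand(8, 8)
sim = (sim + sim.t()) / 2                        # symmetric label similarity
print(label_similarity_contrastive(z, sim).item())
```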
To quantitatively interpret the feature space after contrastive learning, we propose label similarity aware alignment and uniformity, which measure how well genuinely similar samples are aligned and how well the feature distribution preserves maximal information. We show asymptotically and empirically that our proposed contrastive loss optimizes these two properties, and that optimizing them positively affects task performance. Comprehensive experiments on NLP, Vision, Graph, and Multimodal benchmark datasets using BERT, ResNet, GIN, and LSTM encoders consistently showed that our loss outperforms the previous self-supervised and supervised contrastive losses across a wide range of data types and corresponding encoder architectures. Introducing a task-specific label similarity function further facilitates downstream performance.","Contrastive Learning, Supervised Learning, Representation Learning" Counterfactual Generation Under Confounding,https://openreview.net/forum?id=mcJvCys7DX7,https://openreview.net/pdf?id=mcJvCys7DX7,We propose a counterfactual data augmentation to improve the performance of a classifier when the data is spuriously correlated,"A machine learning model, under the influence of observed or unobserved confounders in the training data, can learn spurious correlations and fail to generalize when deployed. For image classifiers, augmenting a training dataset using counterfactual examples has been empirically shown to break spurious correlations. However, the counterfactual generation task itself becomes more difficult as the level of confounding increases. Existing methods for counterfactual generation under confounding consider a fixed set of interventions (e.g., texture, rotation) and are not flexible enough to capture diverse data-generating processes. Given a causal generative process, we formally characterize the adverse effects of confounding on any downstream tasks and show that the correlation between generative factors (attributes) can be used to quantitatively measure confounding between generative factors. To minimize such correlation, we propose a counterfactual generation method that learns to modify the value of any attribute in an image and generate new images given a set of observed attributes, even when the dataset is highly confounded. These counterfactual images are then used to regularize the downstream classifier such that the learned representations are the same across various generative factors conditioned on the class label. Our method is computationally efficient, simple to implement, and works well for any number of generative factors and confounding variables. Our experimental results on both synthetic (MNIST variants) and real-world (CelebA) datasets show the usefulness of our approach.","counterfactual, confounding, cycleGAN, classification" Regression with Label Differential Privacy,https://openreview.net/forum?id=h9O0wsmL-cT,https://openreview.net/pdf?id=h9O0wsmL-cT,We present a new label differentially private algorithm for training regression models.,"We study the task of training regression models with the guarantee of label differential privacy (DP). Based on a global prior distribution of label values, which could be obtained privately, we derive a label DP randomization mechanism that is optimal under a given regression loss function. We prove that the optimal mechanism takes the form of a ""randomized response on bins"", and propose an efficient algorithm for finding the optimal bin values.
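(Editorial aside: the standard "randomized response on bins" mechanism can be sketched as follows, with fixed equally spaced bins; finding the optimal bin values is precisely the paper's contribution and is not attempted here.)

```python
import numpy as np

def rr_on_bins(y, bins, epsilon, rng):
    """Label-DP randomized response on bins (sketch with fixed bins).
    Each label is snapped to its nearest bin, then randomized response is
    applied over bin indices with the standard epsilon-DP probabilities."""
    k = len(bins)
    true_idx = np.abs(y[:, None] - bins[None, :]).argmin(axis=1)
    p_keep = np.exp(epsilon) / (np.exp(epsilon) + k - 1)
    keep = rng.random(len(y)) < p_keep
    rand_idx = rng.integers(0, k - 1, size=len(y))
    rand_idx = rand_idx + (rand_idx >= true_idx)  # uniform over the other bins
    out_idx = np.where(keep, true_idx, rand_idx)
    return bins[out_idx]

rng = np.random.default_rng(0)
y = rng.uniform(0, 10, size=1000)
bins = np.linspace(0, 10, 5)
y_priv = rr_on_bins(y, bins, epsilon=2.0, rng=rng)
print(y_priv[:5])
```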
We carry out a thorough experimental evaluation on several datasets, demonstrating the efficacy of our algorithm.","label differential privacy, regression" Hierarchical Abstraction for Combinatorial Generalization in Object Rearrangement,https://openreview.net/forum?id=fGG6vHp3W9W,https://openreview.net/pdf?id=fGG6vHp3W9W,We demonstrate how to generalize over a combinatorially large space of rearrangement tasks from only pixel observations by constructing from video demonstrations a factorized transition graph over entity state transitions that we use for control.,"Object rearrangement is a challenge for embodied agents because solving these tasks requires generalizing across a combinatorially large set of underlying entities that take the value of object states. Worse, these entities are often unknown and must be inferred from sensory percepts. We present a hierarchical abstraction approach to uncover these underlying entities and achieve combinatorial generalization from unstructured inputs. By constructing a factorized transition graph over clusters of object representations inferred from pixels, we show how to learn a correspondence between intervening on states of entities in the agent's model and acting on objects in the environment. We use this correspondence to develop a method for control that generalizes to different numbers and configurations of objects, which outperforms current offline deep RL methods when evaluated on a set of simulated rearrangement and stacking tasks.","objects, combinatorial generalization, abstraction, rearrangement, slots, binding, hierarchy, compositionality, symmetry, independence, graph search" SRBGCN: Tangent space-Free Lorentz Transformations for Graph Feature Learning,https://openreview.net/forum?id=BLsM6WymMo6,https://openreview.net/pdf?id=BLsM6WymMo6,This work introduces a fully hyperbolic network that uses direct Lorentz transformations to learn the features directly on the manifold.,"Hyperbolic graph convolutional networks have been successfully applied to represent complex graph data structures. However, optimization on Riemannian manifolds is nontrivial; thus, most of the existing hyperbolic networks build the network operations on the tangent space of the manifold, which is a Euclidean local approximation. This distorts the learnt features, limits the representation capacity of the network and makes it hard to optimize the network. In this work, we introduce a fully hyperbolic graph convolutional network (GCN), referred to as SRBGCN, which performs neural computations such as feature transformation and aggregation directly on the manifold, using manifold-preserving Lorentz transformations that include spatial rotation (SR) and boost (B) operations. Experiments conducted on static graph datasets for node classification and link prediction tasks validate the performance of the proposed method.","fully hyperbolic network, Lorentz transformations, boost and rotation, graph convolutional networks, hyperbolic rotations" Transfer NAS with Meta-learned Bayesian Surrogates,https://openreview.net/forum?id=paGvsrl4Ntr,https://openreview.net/pdf?id=paGvsrl4Ntr,,"While neural architecture search (NAS) is an intensely researched area, approaches typically still suffer from either (i) high computational costs or (ii) lack of robustness across datasets and experiments. Furthermore, most methods start searching for an optimal architecture from scratch, ignoring prior knowledge.
This is in contrast to the manual design process of researchers and engineers, who leverage previous deep learning experience by, e.g., transferring architectures from previously solved, related problems. We propose to adopt this human design strategy and introduce a novel surrogate for NAS that is meta-learned from prior architecture evaluations on different datasets. We utilize Bayesian Optimization (BO) with deep-kernel Gaussian Processes, graph neural networks for the architecture embeddings, and a transformer-based set encoder for datasets. As a result, our method consistently achieves state-of-the-art results on six computer vision datasets, while being as fast as one-shot NAS methods.", Mitigating the Limitations of Multimodal VAEs with Coordination-Based Approach,https://openreview.net/forum?id=Rn8u4MYgeNJ,https://openreview.net/pdf?id=Rn8u4MYgeNJ,,"One of the key challenges in multimodal variational autoencoders (VAEs) is inferring a joint representation from arbitrary subsets of modalities. The state-of-the-art approach to achieving this is to sub-sample the modality subsets and learn to generate all modalities from them. However, this sub-sampling in the mixture-based approach has been shown to degrade other important features of multimodal VAEs, such as quality of generation, and furthermore, this degradation is theoretically unavoidable. In this study, we focus on another approach to learning the joint representation by bringing unimodal inferences closer to joint inference from all modalities, which does not have the above limitation. Although there have been models that can be categorized under this approach, they were derived from different backgrounds; therefore, the relations among them and their relative merits were unclear. To take a unified view, we first categorize them as coordination-based multimodal VAEs and show that these can be derived from the same multimodal evidence lower bound (ELBO) and that the difference in their performance is related to whether they are more tightly lower bounded. Next, we point out that these existing coordination-based models perform poorly on cross-modal generation (or cross-coherence) because they do not learn to reconstruct modalities from unimodal inferences. Therefore, we propose a novel coordination-based model that incorporates these unimodal reconstructions, which avoids the limitations of both mixture and coordination-based models. Experiments with diverse and challenging datasets show that the proposed model mitigates the limitations in multimodal VAEs and performs well in both cross-coherence and generation quality.","multimodal learning, deep generative models" Incompatibility between Deterministic Policy and Generative Adversarial Imitation Learning,https://openreview.net/forum?id=3i_7H3phuy3,https://openreview.net/pdf?id=3i_7H3phuy3,,"Deterministic policies are widely applied in generative adversarial imitation learning (GAIL). When adopting these policies, some GAIL variants modify the reward function to avoid training instability. However, the mechanism behind this instability is still largely unknown. In this paper, we theoretically trace the instability to exploding gradients in the updating process.
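(Editorial aside: the exploding-gradient claim can be seen in a one-line computation for a Gaussian policy with shrinking covariance, a standard score-function identity consistent with the setup described above.)

```latex
% Score of a Gaussian policy with small covariance: as \sigma \to 0 the policy
% approaches a deterministic one and its log-density gradients blow up.
\[
\pi_\theta(a \mid s) = \mathcal{N}\!\left(a;\, \mu_\theta(s),\, \sigma^2 I\right),
\qquad
\nabla_{\mu} \log \pi_\theta(a \mid s) = \frac{a - \mu_\theta(s)}{\sigma^{2}},
\]
\[
\text{so for any fixed } a \neq \mu_\theta(s):\quad
\left\lVert \nabla_{\mu} \log \pi_\theta(a \mid s) \right\rVert
= \frac{\lVert a - \mu_\theta(s) \rVert}{\sigma^{2}}
\xrightarrow[\sigma \to 0]{} \infty .
\]
```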
Our novelties lie in the following aspects: 1) By employing a multivariate Gaussian policy with small covariance to approximate a deterministic policy, we establish and prove a probabilistic lower bound for the exploding gradients, which universally describes the degree of instability; stochastic policies, in contrast, never suffer from this pathology. 2) We also prove that the modified reward function of adversarial inverse reinforcement learning (AIRL) can relieve the exploding gradients, but at the expense of ``non-confidence''. Experiments and a toy demo support our analysis.", FiD-Light: Efficient and Effective Retrieval-Augmented Text Generation,https://openreview.net/forum?id=4bCsX2K0KuR,https://openreview.net/pdf?id=4bCsX2K0KuR,We increase the efficiency of FiD with FiD-Light including a source pointing workflow for effective retrieval augmented generation.,"Retrieval-augmented generation models offer many benefits over standalone language models: besides a textual answer to a given query, they provide provenance items retrieved from an updateable knowledge base. However, they are also more complex systems and need to handle long inputs. In this work, we introduce FiD-Light to strongly increase the efficiency of the state-of-the-art retrieval-augmented FiD model, while maintaining the same level of effectiveness. Our FiD-Light model constrains the information flow from the encoder (which encodes passages separately) to the decoder (using concatenated encoded representations). Furthermore, we adapt FiD-Light with re-ranking capabilities through textual source pointers, to improve the top-ranked provenance precision. Our experiments on a diverse set of seven knowledge intensive tasks (KILT) show FiD-Light consistently improves the Pareto frontier between query latency and effectiveness. FiD-Light with source pointing sets substantial new state-of-the-art results on six KILT tasks for combined text generation and provenance retrieval evaluation, while maintaining reasonable efficiency.","retrieval augmented generation, KILT, Fusion-in-Decoder, efficiency" Contrastive Learning of Molecular Representation with Fragmented Views,https://openreview.net/forum?id=wZiE_S2362V,https://openreview.net/pdf?id=wZiE_S2362V,We propose a contrastive learning method for molecular representation with fragmented views.,"Molecular representation learning is a fundamental task for AI-based drug design and discovery. Contrastive learning is an attractive framework for this task, as also evidenced in various domains of representation learning, e.g., image, language, and speech. However, molecule-specific ways of constructing good positive or negative views for contrastive training, taking their chemical semantics into account, have been relatively under-explored. In this paper, we consider a molecule as a bag of meaningful fragments, e.g., functional groups, by disconnecting a non-ring single bond as the semantics-preserving transformation. Then, we suggest constructing a complete (or incomplete) bag of fragments as the positive (or negative) view of a molecule: each fragment loses chemical substructures from the original molecule, while the union of the fragments does not. Namely, this provides easy positive and hard negative views simultaneously for contrastive representation learning, so that the model can selectively learn useful features and ignore nuisance features.
Furthermore, we suggest optimizing a torsional-angle reconstruction loss around the fragmented bond to incorporate the 3D geometric structure available in the pre-training dataset. Our experiments demonstrate that our scheme outperforms prior state-of-the-art molecular representation learning methods across various downstream molecule property prediction tasks.",Molecule representation learning Theoretical Study of Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward,https://openreview.net/forum?id=jYv81Ai6ztO,https://openreview.net/pdf?id=jYv81Ai6ztO,This paper studies the theory of offline RL with trajectory-wise reward,"The remarkable success of reinforcement learning (RL) heavily relies on observing the reward of every visited state-action pair. In many real world applications, however, an agent can observe only a score that represents the quality of the whole trajectory, which is referred to as the {\em trajectory-wise reward}. In such a situation, it is difficult for standard RL methods to make good use of the trajectory-wise reward, and large bias and variance errors can be incurred in policy evaluation. In this work, we propose a novel offline RL algorithm, called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED), which decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward. To ensure the value functions constructed by PARTED are always pessimistic with respect to the optimal ones, we design a new penalty term to offset the uncertainty of the proxy reward. For general episodic MDPs with large state space, we show that PARTED with overparameterized neural network function approximation achieves an $\tilde{\mathcal{O}}(D_{\text{eff}}H^2/\sqrt{N})$ suboptimality, where $H$ is the length of episode, $N$ is the total number of samples, and $D_{\text{eff}}$ is the effective dimension of the neural tangent kernel matrix. To further illustrate the result, we show that PARTED achieves an $\tilde{\mathcal{O}}(dH^3/\sqrt{N})$ suboptimality with linear MDPs, where $d$ is the feature dimension, which matches the neural network result when $D_{\text{eff}}=dH$. To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDPs with trajectory-wise reward.","RL theory, offline RL, trajectory-wise reward" Sharp Convergence Analysis of Gradient Descent for Deep Linear Neural Networks,https://openreview.net/forum?id=lmumJ2pC0JB,https://openreview.net/pdf?id=lmumJ2pC0JB,"In deep linear neural networks, we obtain sharp rates for gradient descent to converge to a global optimum.","This paper provides sharp rates of convergence of the gradient descent (GD) method for deep linear neural networks with different random initialization. This study touches upon one major open theoretical problem in machine learning: why deep neural networks trained with GD methods are efficient in many practical applications. While the solution of this problem is still beyond reach for general nonlinear deep neural networks, there have been extensive efforts in the literature to study related questions for deep linear neural networks, and there are many interesting results in this research direction. For example, recent results on the loss landscape show that even though the loss function of deep linear neural networks is non-convex, every local minimizer is also a global minimizer.
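(Editorial aside: the deep-linear setting discussed above is easy to reproduce; the numpy sketch below trains a two-layer linear network with plain GD on a noiseless least-squares problem and shows the non-convex loss being driven to the global optimum. Hyperparameters are arbitrary illustrative choices.)

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 50, 10, 32                       # samples, input dim, hidden width
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)                 # noiseless linear targets

# Two-layer *linear* network f(x) = W2 W1 x: the loss is non-convex in
# (W1, W2), yet plain GD from a generic random init drives it to zero.
W1 = rng.normal(size=(h, d)) / np.sqrt(d)
W2 = rng.normal(size=(1, h)) / np.sqrt(h)
lr = 0.02
for step in range(3001):
    err = X @ (W2 @ W1).T - y[:, None]     # (n, 1) residuals
    g_eff = err.T @ X / n                  # gradient w.r.t. W2 @ W1, shape (1, d)
    g1, g2 = W2.T @ g_eff, g_eff @ W1.T    # chain rule for each factor
    W1 -= lr * g1
    W2 -= lr * g2
    if step % 1000 == 0:
        print(f"step {step:5d}  loss {0.5 * float(np.mean(err**2)):.6f}")
```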
When the GD method is applied to train deep linear networks, it has been shown in the literature that its convergence behavior depends on the initialization. In this paper, we obtain the sharp rate of convergence of GD for deep linear networks, and we show that this rate does not depend on the type of random initialization. Furthermore, we show that the depth of the network does not affect the optimal rate of convergence, provided that the width of each hidden layer is appropriately large. ","deep linear neural networks, non-convex optimization, gradient descent, initialization" Selective Frequency Network for Image Restoration,https://openreview.net/forum?id=tyZ1ChGZIKO,https://openreview.net/pdf?id=tyZ1ChGZIKO,We propose a novel network to recover the most useful frequency component for image restoration via frequency selection.,"Image restoration aims to reconstruct the latent sharp image from the corrupted observation. Besides dealing with this long-standing task in the spatial domain, a few approaches seek solutions in the frequency domain based on the large discrepancy between spectra of sharp/degraded image pairs. However, these works utilize existing transformation tools, e.g., Wavelet Transform, to split the feature into several parts, which is not flexible enough to select the most informative frequency component to recover. In this paper, we exploit a multi-branch and content-aware module to decompose the feature into separate frequency subbands dynamically and locally, and then accentuate the useful ones via the channel-wise attention mechanism. In addition, aiming to cope with the large-scale degradation kernel, we propose an extremely simple decoupling and modulation module to enlarge the receptive field based on global and window-based average pooling. Integrating the two developed modules into a U-Net backbone, the proposed Selective Frequency Network (SFNet) performs favorably against state-of-the-art algorithms on five image restoration tasks, including image dehazing, image motion/defocus deblurring, image desnowing, and image deraining.","Image restoration, Frequency domain, Frequency selection" Contextualized Generative Retrieval,https://openreview.net/forum?id=3TduOwfFNoy,https://openreview.net/pdf?id=3TduOwfFNoy,"By utilizing contextualized token embeddings in generative retrieval, it can utilize both the parametric space of the model and the non-parametric space of contextualized embeddings.","The text retrieval task is mainly performed in two ways: the bi-encoder approach and the generative approach. The bi-encoder approach maps the document and query embeddings to a common vector space and performs a nearest neighbor search. It consistently shows high performance and efficiency across different domains but has an embedding space bottleneck as it interacts in L2 or inner product space. The generative retrieval model retrieves by generating a target sequence and overcomes the embedding space bottleneck by interacting in the parametric space. However, it fails to retrieve information it has not seen during training, as it depends solely on the information encoded in its own model parameters. To leverage the advantages of both approaches, we propose the Contextualized Generative Retrieval model, which uses contextualized embeddings (output embeddings of a language model encoder) as vocab embeddings at the decoding step of generative retrieval.
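(Editorial aside: the core decoding change just described can be pictured in a few lines of torch; shapes and names below are illustrative assumptions, not the paper's implementation.)

```python
import torch

# Sketch: scoring next tokens against *contextualized* token embeddings
# (encoder outputs) instead of a static vocab matrix. We simplify to one
# contextualized embedding per vocab entry for brevity.
hidden = torch.randn(1, 512)                 # decoder state at one step
static_vocab = torch.randn(32000, 512)       # vanilla vocab embeddings
contextual_vocab = torch.randn(32000, 512)   # encoder-produced embeddings

logits_vanilla = hidden @ static_vocab.t()      # parametric space only
logits_context = hidden @ contextual_vocab.t()  # non-parametric, context-aware
next_token = logits_context.argmax(dim=-1)
print(next_token.shape)  # torch.Size([1])
```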
The model uses information encoded in both the non-parametric space of contextualized token embeddings and the parametric space of the generative retrieval model. Our approach of generative retrieval with contextualized vocab embeddings shows higher performance than generative retrieval with only vanilla vocab embeddings in the document retrieval task, an average of 6% higher performance on KILT (NQ, TQA) and 2X higher on NQ-320k, suggesting the benefits of using contextualized embeddings in generative retrieval models.","NLP, Information Retrieval" Mirror Training for Input Convex Neural Network,https://openreview.net/forum?id=bxtzk6Wfrpo,https://openreview.net/pdf?id=bxtzk6Wfrpo,," The input convex neural network (ICNN) aims to learn a convex function from the input to the output by using non-decreasing convex activation functions and non-negativity constraints on the weight parameters of some layers. However, in practice, it loses some representation power because of these non-negativity constraints on the hidden units, even though the design of the ``passthrough'' layer can partially address this problem. To solve issues caused by these non-negativity constraints, we use an input-duplication trick, i.e., we take the negation of the original input as part of the new input in our structure. This method preserves the convexity of the function from the original input to the output and tackles the representation problem in training. Additionally, we design a mirror unit to address this problem further, making the network a Mirror ICNN. Moreover, we propose a recurrent input convex neural network (RICNN) structure to deal with time-series problems. The recurrent unit of the structure can be an ICNN or any other convex variant of the ICNN. This structure can maintain convexity by constraining the mapping from the hidden output at time step $t$ to the input of the next time step $t+1$. Experiments support our design, including simple numerical curve fitting, power system hosting capacity dataset regression, and MNIST dataset classification.","input convex neural network, convex optimization, representation, recurrent neural network, hosting capacity" Scaling Up Probabilistic Circuits by Latent Variable Distillation,https://openreview.net/forum?id=067CGykiZTS,https://openreview.net/pdf?id=067CGykiZTS,,"Probabilistic Circuits (PCs) are a unified framework for tractable probabilistic models that support efficient computation of various probabilistic queries (e.g., marginal probabilities). One key challenge is to scale PCs to model large and high-dimensional real-world datasets: we observe that as the number of parameters in PCs increases, their performance immediately plateaus. This phenomenon suggests that the existing optimizers fail to exploit the full expressive power of large PCs. We propose to overcome such bottleneck by latent variable distillation: we leverage the less tractable but more expressive deep generative models to provide extra supervision over the latent variables of PCs. Specifically, we extract information from Transformer-based generative models to assign values to latent variables of PCs, providing guidance to PC optimizers. Experiments on both image and language modeling benchmarks (e.g., ImageNet and WikiText-2) show that latent variable distillation substantially boosts the performance of large PCs compared to their counterparts without latent variable distillation.
In particular, on the image modeling benchmarks, PCs achieve competitive performance against some of the widely-used deep generative models, including variational autoencoders and flow-based models, opening up new avenues for tractable generative modeling.", Oscillation Neural Ordinary Differential Equations,https://openreview.net/forum?id=afrUI9hkUJM,https://openreview.net/pdf?id=afrUI9hkUJM,Oscillation Neural Ordinary Differential Equations,"Neural ordinary differential equations (NODEs) have received a lot of attention in recent years due to their memory efficiency. Different from traditional deep learning, they define a continuous deep learning architecture based on the theory of ordinary differential equations (ODEs), which also improves the interpretability of deep learning. However, NODEs have several obvious limitations: a NODE is not a universal approximator, it requires a large number of function evaluations (NFEs), and it has a slow convergence rate. We address these drawbacks by modeling and adding an oscillator to the framework of the NODEs. The oscillator enables the trajectories of our model to cross each other. We prove that our model is a universal approximator, even in the original input space. Due to the presence of oscillators, the flows learned by the model are simpler; thus our model needs fewer NFEs and has a faster convergence speed. We apply our model to various tasks including classification and time series extrapolation, then compare several metrics including accuracy, NFEs, and convergence speed. The experiments show that our model can achieve better results compared to the existing baselines.","Neural Ordinary Differential Equations, Continuous Deep Learning" Improving Differentiable Neural Architecture Search by Encouraging Transferability,https://openreview.net/forum?id=Tl8OmiibP99,https://openreview.net/pdf?id=Tl8OmiibP99,,"Differentiable neural architecture search methods are increasingly popular due to their computational efficiency. However, these methods have unsatisfactory generalizability and stability. Their searched architectures are often degenerate, with a dominant number of skip connections, and perform unsatisfactorily on test data. Existing methods for solving this problem have a variety of limitations, such as being unable to prevent architecture degeneration or being excessively restrictive in setting the number of skip connections. To address these limitations, we propose a new approach for improving the generalizability and stability of differentiable NAS, by developing a transferability-encouraging tri-level optimization framework which improves the architecture of a main model by encouraging good transferability to an auxiliary model. Our framework involves three stages performed end-to-end: 1) train network weights of a main model; 2) transfer knowledge from the main model to an auxiliary model; 3) optimize the architecture of the main model by maximizing its transferability to the auxiliary model. We propose a new knowledge transfer approach based on matching quadruple relative similarities.
Experiments on several datasets demonstrate the effectiveness of our method.", MA-BERT: Towards Matrix Arithmetic-only BERT Inference by Eliminating Complex Non-linear Functions,https://openreview.net/forum?id=HtAfbHa7LAL,https://openreview.net/pdf?id=HtAfbHa7LAL,"MA-BERT completely eliminates the complex non-linear functions in BERT and achieves matrix arithmetic-only operation with trivial ReLU, which could benefit inference on both general computing units and accelerator designs for edge applications","Due to their superior results, Transformer-based models such as BERT have become de facto standards in many Natural Language Processing (NLP) applications. However, the intensive use of complex non-linear functions within the Transformer architecture impairs its computing efficiency and complicates corresponding accelerator designs, because non-linear functions are generally computation-intensive and require special hardware support. In light of this, we propose MA-BERT, which allows matrix arithmetic-only operations in Transformer-based NLP models and achieves efficient inference with negligible accuracy loss. Specifically, we propose four correlated techniques that include approximating softmax with a two-layer neural network, replacing GELU with ReLU, fusing normalization layers with adjacent linear layers, and leveraging knowledge transfer from baseline models. Through these techniques, we are able to eliminate the major non-linear functions in Transformer-based models and obtain MA-BERT with only matrix arithmetic and trivial ReLU operations without compromising on accuracy. With mainly regular matrix arithmetic operations, MA-BERT enables hardware-friendly processing on various computing engines, including CPUs, GPUs, and customized neural network accelerators. Our experimental results on CPUs show that MA-BERT achieves up to 27% reduction in inference time with comparable accuracy on many downstream tasks compared to the baseline BERT models. ","BERT, Efficient inference, Matrix arithmetic-only, Eliminate non-linear functions" Automatically Answering and Generating Machine Learning Final Exams,https://openreview.net/forum?id=MT1Pcdo8sGG,https://openreview.net/pdf?id=MT1Pcdo8sGG,,"Can a machine learn machine learning? We propose to answer this question using the same criteria we use to answer a similar question: can a human learn machine learning? We automatically answer final exams in MIT's recent large machine learning course and generate new questions at a human level. Recently, program synthesis and few-shot learning solved university-level problem set questions in mathematics and STEM courses at a human level. In this work, we solve questions from final exams that differ from problem sets in several ways: the questions are longer, have multiple parts, are more complicated, and span a broader set of topics. We provide a new dataset and benchmark of questions from machine learning final exams and code for automatically answering these questions and generating new questions. To make our dataset a reproducible benchmark, we use automatic checkers for multiple choice questions, questions with numeric answers, and questions with expression answers, and evaluate a large free language model, Meta’s OPT, and compare the results with OpenAI’s GPT-3 and Codex.
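(Editorial aside: automatic checkers of this kind are commonly built from numeric tolerance tests and symbolic equivalence; the sketch below is our guess at the flavor, not the paper's actual checker code.)

```python
import math
import sympy as sp

def check_numeric(pred: str, gold: float, rel_tol=1e-4) -> bool:
    """Numeric checker: parse the prediction and compare within tolerance."""
    try:
        return math.isclose(float(pred), gold, rel_tol=rel_tol)
    except ValueError:
        return False

def check_expression(pred: str, gold: str) -> bool:
    """Expression checker: symbolic equivalence via simplification."""
    try:
        return sp.simplify(sp.sympify(pred) - sp.sympify(gold)) == 0
    except (sp.SympifyError, TypeError):
        return False

print(check_numeric("3.14159", math.pi))           # True
print(check_expression("(x+1)**2", "x**2+2*x+1"))  # True
```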
A student survey comparing the quality, appropriateness, and difficulty of machine-generated questions with human-written questions shows that across multiple aspects, machine-generated questions are indistinguishable from human-generated questions and are suitable for final exams. We perform ablation studies comparing zero-shot learning with few-shot learning and chain-of-thought prompting, and comparing GPT-3 and OPT pre-trained on text with Codex fine-tuned on code, on a range of machine learning topics, and find that few-shot learning methods perform best. We make our data and code publicly available for the machine learning community.", CAT: Collaborative Adversarial Training,https://openreview.net/forum?id=u6KhE9fapjX,https://openreview.net/pdf?id=u6KhE9fapjX,,"Adversarial training can improve the robustness of neural networks. Previous adversarial training methods focus on a single training strategy and do not consider the collaboration between different training strategies. In this paper, we find that different adversarial training methods have distinct robustness on individual sample instances. For example, an instance can be correctly classified by a model trained using standard adversarial training (AT) but not by a model trained using TRADES, and vice versa. Based on this phenomenon, we propose a collaborative adversarial training framework to improve the robustness of neural networks. Specifically, we simultaneously use different adversarial training methods to train two robust models from scratch. We input the adversarial examples generated by each network to the peer network and use the logits of the peer network to guide each network's training. Collaborative Adversarial Training (CAT) can improve both robustness and accuracy. Finally, extensive experiments on CIFAR-10 and CIFAR-100 validate the effectiveness of our method. CAT achieves new state-of-the-art robustness without using any additional data on CIFAR-10 under the AutoAttack benchmark.", Efficient Certified Training and Robustness Verification of Neural ODEs,https://openreview.net/forum?id=KyoVpYvWWnK,https://openreview.net/pdf?id=KyoVpYvWWnK,We enable certified training and scalable robustness verification of neural ODEs.,"Neural Ordinary Differential Equations (NODEs) are a novel neural architecture, built around initial value problems with learned dynamics which are solved during inference. Thought to be inherently more robust against adversarial perturbations, they were recently shown to be vulnerable to strong adversarial attacks, highlighting the need for formal guarantees. However, despite significant progress in robustness verification for standard feed-forward architectures, the verification of high dimensional NODEs remains an open problem. In this work we address this challenge and propose GAINS, an analysis framework for NODEs combining three key ideas: (i) a novel class of ODE solvers, based on variable but discrete time steps, (ii) an efficient graph representation of solver trajectories, and (iii) a novel abstraction algorithm operating on this graph representation. Together, these advances enable the efficient analysis and certified training of high-dimensional NODEs, by reducing the runtime from an intractable $\mathcal{O}(\exp(d)+\exp(T))$ to $\mathcal{O}(d+T^2\log^2T)$ in the dimensionality $d$ and integration time $T$.
In an extensive evaluation on computer vision (MNIST and Fashion-MNIST) and time-series forecasting (PhysioNet) problems, we demonstrate the effectiveness of both our certified training and verification methods.","Neural ODEs, Adversarial Robustness, Certified Robustness, Robustness Verification, Certified Training" Arbitrary Virtual Try-On Network: Characteristics Representation and Trade-off between Body and Clothing,https://openreview.net/forum?id=d8mr8lKIZ3n,https://openreview.net/pdf?id=d8mr8lKIZ3n,"We develop a special 2D virtual try-on network for the cross-category try-on task, e.g., long sleeves <-> short sleeves or long pants <-> skirts, since limbs may be exposed or hidden in such cases.","Deep learning based virtual try-on systems have achieved some encouraging progress recently, but there still remain several big challenges that need to be solved, such as trying on arbitrary clothes of all types, trying on clothes from one category to another, and generating image-realistic results with few artifacts. To handle these issues, we in this paper first collect a new dataset with all types of clothes, i.e., tops, bottoms, and whole clothes, each with multiple categories and rich information on clothing characteristics such as patterns, logos, and other details. Based on this dataset, we then propose the Arbitrary Virtual Try-On Network (AVTON), which handles all types of clothes and can synthesize realistic try-on images by preserving and trading off the characteristics of the target clothes and the reference person. Our approach includes three modules: 1) Limbs Prediction Module, which is utilized for predicting the human body parts by preserving the characteristics of the reference person. This is especially good for handling cross-category try-on tasks (e.g., long sleeves <-> short sleeves or long pants <-> skirts, etc.), where the exposed arms or legs, with their skin colors and details, can be reasonably predicted; 2) Improved Geometric Matching Module, which is designed to warp clothes according to the geometry of the target person. We improve the TPS-based warping method with a compactly supported radial function (Wendland's $\Psi$-function); 3) Trade-Off Fusion Module, which trades off the characteristics of the warped clothes and the reference person. This module makes the generated try-on images look more natural and realistic based on a fine-tuned symmetry of the network structure. Extensive simulations are conducted, and our approach achieves better performance compared with state-of-the-art virtual try-on methods.","Deep Learning, Virtual Try-on, Generative Adversarial Networks, Artificial Intelligence in Fashion" A Benchmark Dataset for Learning from Label Proportions,https://openreview.net/forum?id=Vf2DK1Ol0ed,https://openreview.net/pdf?id=Vf2DK1Ol0ed,A Benchmark based on Criteo Kaggle CTR dataset for Learning from Label Proportions,"Learning from label proportions (LLP) has recently emerged as an important technique for weakly supervised learning on aggregated labels. In LLP, a model is trained on groups (a.k.a. bags) of feature-vectors and their corresponding label proportions to predict labels for individual feature-vectors.
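(Editorial aside: a common bag-level objective for LLP matches each bag's mean predicted probability to its label proportion; the sketch below illustrates this generic loss, not LLP-Bench's specific evaluation.)

```python
import torch

def proportion_loss(logits, bag_ids, bag_proportions):
    # A common bag-level LLP objective (a sketch, not LLP-Bench's exact loss):
    # match each bag's mean predicted probability to its label proportion.
    p = torch.sigmoid(logits)
    loss = logits.new_zeros(())
    for b, target in bag_proportions.items():
        p_bag = p[bag_ids == b].mean().clamp(1e-6, 1 - 1e-6)
        t = torch.tensor(float(target))
        loss = loss - (t * p_bag.log() + (1 - t) * (1 - p_bag).log())
    return loss / len(bag_proportions)

logits = torch.randn(12, requires_grad=True)
bag_ids = torch.tensor([0] * 4 + [1] * 4 + [2] * 4)
loss = proportion_loss(logits, bag_ids, {0: 0.25, 1: 0.5, 2: 0.75})
loss.backward()                 # instance-level gradients from bag-level labels
print(float(loss))
```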
While previous works have developed a variety of techniques for LLP, including novel loss functions, model architectures and their optimization, they typically evaluated their methods on pseudo-synthetically generated LLP training data using common small-scale supervised learning datasets by randomly sampling or partitioning their instances into bags. Despite growing interest in this important task, there are no large scale open source LLP benchmarks to compare various approaches. Construction of such a benchmark is hindered by two challenges: a) the lack of natural large scale LLP-like data, and b) the large number of mostly artificial methods of forming bags from instance-level datasets. In this paper we propose LLP-Bench: a large scale LLP benchmark constructed from the Criteo Kaggle CTR dataset. We do an in-depth, systematic study of the Criteo dataset and propose a methodology to create a benchmark as a collection of diverse and large scale LLP datasets. We choose the Criteo dataset since it admits multiple natural collections of bags formed by grouping subsets of its 26 categorical features. We analyze all bag collections obtained through grouping by one or two categorical features, in terms of their bag-level statistics as well as embedding based distance metrics quantifying the geometric separation of bags. We then propose to include in LLP-Bench a few groupings to fairly represent real world bag distributions. We also measure the performance of state of the art models, loss functions (adapted to LLP) and optimizers on LLP-Bench. We perform a series of ablations and explain the performance of various techniques on LLP-Bench. To the best of our knowledge LLP-Bench is the first open source benchmark for the LLP task. We hope that the proposed benchmark and the evaluation methodology will be used by ML researchers and practitioners to better understand and hence devise state-of-the-art LLP algorithms. ","Learning from Label Proportions, Benchmark Dataset, LLP" UL2: Unifying Language Learning Paradigms,https://openreview.net/forum?id=6ruVLB727MC,https://openreview.net/pdf?id=6ruVLB727MC,How to train a language model properly,"Existing pre-trained models are generally geared towards a particular class of problems. To date, there still seems to be no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes with pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto-frontier by outperforming T5 and/or GPT-like models across multiple diverse setups. 
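The Mixture-of-Denoisers idea described above can be pictured as sampling a denoising configuration per example, with a mode token signaling the chosen paradigm. The sketch below is a rough approximation in which the span lengths, corruption rates, and sentinel formatting are illustrative assumptions rather than the paper's exact settings.

```python
import random

# Illustrative denoiser configurations (mode token, mean span, corruption rate);
# the real UL2 settings may differ.
DENOISERS = [
    ("[R]", 3,    0.15),  # regular span corruption
    ("[X]", 32,   0.50),  # extreme: long spans / heavy corruption
    ("[S]", None, None),  # sequential: prefix language modeling
]

def mixture_of_denoisers(tokens):
    mode, mean_span, rate = random.choice(DENOISERS)
    if mode == "[S]":  # split into prefix (input) and continuation (target)
        cut = random.randint(1, len(tokens) - 1)
        return mode, tokens[:cut], tokens[cut:]
    inputs, targets, i = [mode], [], 0
    while i < len(tokens):
        # Start a corrupted span with probability rate / mean_span so that
        # roughly `rate` of all tokens end up corrupted on average.
        if random.random() < rate / mean_span:
            span = max(1, round(random.expovariate(1.0 / mean_span)))
            targets.append(tokens[i:i + span])
            inputs.append(f"<extra_id_{len(targets) - 1}>")
            i += span
        else:
            inputs.append(tokens[i])
            i += 1
    return mode, inputs, targets
```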
Finally, by scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised NLP tasks ranging from language generation (with automated and human evaluation), language understanding, text classification, question answering, commonsense reasoning, long text reasoning, structured knowledge grounding and information retrieval. Our model also achieves strong results in in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. Additionally, we show that UL2 20B works well with chain-of-thought prompting and reasoning, making it an appealing choice for research into reasoning at a small to medium scale of 20B parameters. We release Flax-based T5X model checkpoints for the 20B model publicly. ","language models, pretraining, transformers" Emergence of Exploration in Policy Gradient Reinforcement Learning via Resetting,https://openreview.net/forum?id=GKsNIC_mQRG,https://openreview.net/pdf?id=GKsNIC_mQRG,,"In reinforcement learning (RL), many exploration methods explicitly promote stochastic policies, e.g., by adding an entropy bonus. We argue that exploration only matters in RL because the agent repeatedly encounters the same or similar states, so that it is beneficial to gradually improve the performance over the encounters; otherwise, the greedy policy would be optimal. Based on this intuition, we propose ReMax, an objective for RL whereby stochastic exploration arises as an emergent property, without adding any explicit exploration bonus. In ReMax, an episode is modified so that the agent can reset to previous states in the trajectory, and the agent’s goal is to maximize the best return in the trajectory tree. We show that this ReMax objective can be directly optimized with an unbiased policy gradient method. Experiments confirm that ReMax leads to the emergence of a stochastic exploration policy, and improves the performance compared to RL with no exploration bonus.", CASR: Generating Complex Sequences with Autoregressive Self-Boost Refinement,https://openreview.net/forum?id=SVl1w1u3InX,https://openreview.net/pdf?id=SVl1w1u3InX,CASR improves left-to-right autoregressive generation without heuristic intermediate sequences for complex answers via self-boost refinement,"There are sequence generation tasks where the best order to generate the target sequence is not left-to-right. Examples include an answer to the Sudoku game, structured code like an s-expression, and even a logical natural language answer where the analysis may be generated after the decision. We define the target sequences of those tasks as complex sequences. Obviously, a complex sequence should be constructed with multiple logical steps, and has dependencies among its parts (e.g. decisions depend on analyses). Generating complex sequences is a great challenge for classic left-to-right autoregressive generation systems. Current approaches improve one-pass left-to-right generation on NLG tasks by generating different heuristic intermediate sequences in multiple stages. However, for complex sequences, the heuristic rules to break them down may hurt performance and introduce additional exposure bias. To tackle these challenges, we propose a PLM-friendly autoregressive self-boost refinement framework, CASR. When training, CASR inputs the predictions generated by the model itself at the previous refinement step (instead of those produced by heuristic rules). 
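A minimal sketch of the self-boost refinement loop just described: each pass conditions on the model's own previous prediction rather than a heuristic intermediate sequence. The `model.generate` interface, the separator token, and the fixed number of steps are assumptions for illustration, not the paper's actual API.

```python
def self_boost_refine(model, source, n_steps=3):
    # First pass conditions on an empty draft; later passes condition on
    # the model's own previous prediction.
    prediction = ""
    for _ in range(n_steps):
        prediction = model.generate(source + " <sep> " + prediction)
    return prediction
```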
To find an optimal design, we also discuss model architecture, parameter efficiency and initialization strategy. By evaluating CASR on Sudoku, WebQSP, MTOP and KVRET through controlled experiments and empirical studies, we find that CASR produces high-quality outputs. CASR also improves accuracy on Sudoku (70.93% --> 97.28%) and achieves state-of-the-art performance on KVRET with Micro F1 score (67.88% --> 70.00%).","self-boost refinement, complex answers, autoregressive generation" SciRepEval: A Multi-Format Benchmark for Scientific Document Representations,https://openreview.net/forum?id=zfiYcbeQkH,https://openreview.net/pdf?id=zfiYcbeQkH,,"Learned representations of scientific documents can serve as valuable input features for downstream tasks, without the need for further fine-tuning. However, existing benchmarks for evaluating these representations fail to capture the diversity of relevant tasks. In response, we introduce SciRepEval, the first comprehensive benchmark for training and evaluating scientific document representations. It includes 25 challenging and realistic tasks across four formats: classification, regression, ranking and search. We then use the benchmark to study and improve the generalization ability of scientific document representation models. We show how state-of-the-art models struggle to generalize across task formats, and that simple multi-task training fails to improve them. However, a new approach that learns multiple embeddings per document, each tailored to a different task format, can improve performance. We experiment with task-format-specific control codes and adapters in a multi-task setting and find that they outperform the existing single-embedding state-of-the-art by up to 1.5 points absolute. ", On the convergence of SGD under the over-parameter setting,https://openreview.net/forum?id=raSbs1AFoX3,https://openreview.net/pdf?id=raSbs1AFoX3,We show that SGD converges to the global optimum with probability 1 and provide an asymptotic convergence rate,"With the improvement of computing power, over-parameterized models have become increasingly popular in machine learning. This type of model usually has a complicated, non-smooth, and non-convex loss landscape. However, when we train the model, simply using a first-order optimization algorithm like stochastic gradient descent (SGD) can achieve good results, in both training and testing, even though SGD is known not to guarantee convergence for non-smooth and non-convex cases. Theoretically, it was previously proved that in training, SGD converges to the global optimum with probability $1 - \epsilon$, but only for certain models, where $\epsilon$ depends on the model complexity. It was also observed that SGD tends to choose a flat minimum, which preserves its training performance in testing. In this paper, we first prove that SGD converges to the global optimum almost surely under an arbitrary initial value and some mild assumptions on the loss function. Then, we prove that if the learning rate is larger than a value depending on the structure of a global minimum, the probability of converging to this global optimum is zero. Finally, we derive the asymptotic convergence rate based on the local structure of the global optimum. 
","SGD, over-parameter, almost surely convergence, global optimum convergence" MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers,https://openreview.net/forum?id=xKYlWJaLFi,https://openreview.net/pdf?id=xKYlWJaLFi,,"Dense retrieval aims to map queries and passages into low-dimensional vector space for efficient similarity measuring, showing promising effectiveness in various large-scale retrieval tasks. Since most existing methods commonly adopt pre-trained Transformers (\eg BERT) for parameter initialization, some work focuses on proposing new pre-training tasks for compressing the useful semantic information from passages into dense vectors, achieving remarkable performances. However, it is still challenging to effectively capture the rich semantic information and relations about passages into the dense vectors via one single particular pre-training task. In this work, we propose a multi-task pre-trained model, MASTER, that unifies and integrates multiple pre-training tasks with different learning objectives under the bottlenecked masked autoencoder architecture. Concretely, MASTER utilizes a multi-decoder architecture to integrate three types of pre-training tasks: corrupted passages recovering, related passage recovering and PLMs outputs recovering. By incorporating a shared deep encoder, we construct a representation bottleneck in our architecture, compressing the abundant semantic information across tasks into dense vectors. The first two types of tasks concentrate on the semantic information of passages and capturing relationships among them within the pre-training corpus. The third can capture the knowledge beyond the corpus from external PLMs (\eg GPT-2). Extensive experiments on several large-scale passage retrieval datasets have shown that our approach outperforms the previous state-of-the-art dense retrieval methods.","Multi-task Pre-training, Dense Retrieval" Offline Reinforcement Learning via Weighted $f$-divergence,https://openreview.net/forum?id=vJVIUTwohv,https://openreview.net/pdf?id=vJVIUTwohv,"We propose a DICE algorithm with weighted f-divergence regularization for offline RL, which enables state-action dependent regularization.","One of the major challenges of offline reinforcement learning (RL) is dealing with distribution shifts that stem from the mismatch between the trained policy and the data collection policy. Prior offline RL algorithms have addressed this issue by regularizing the policy optimization with $f$-divergence between the state-action visitation distributions of the data collection policy and the optimized policy. While such regularization provides a theoretical lower bound on performance and has had some practical success, it is not affected by the optimality of state-actions and can be overly pessimistic, especially when valuable state-actions are rare in the dataset. To mitigate the problem, we introduce and analyze a weighted $f$-divergence regularized RL framework that can less regularize valuable but rare state-actions as long as sampling error allows. This leads to an offline RL algorithm with iterative stationary distribution correction estimation while jointly re-adjusting the regularization for each state-action. 
We show that the presented algorithm with weighted $f$-divergence performs competitively with the state-of-the-art methods.","Offline Reinforcement Learning, Stationary Distribution Correction Estimation" Bitrate-Constrained DRO: Beyond Worst Case Robustness To Unknown Group Shifts,https://openreview.net/forum?id=2QzNuaRHn4Z,https://openreview.net/pdf?id=2QzNuaRHn4Z,Robustness to group shifts without training group annotations can be achieved with a constrained form of DRO.,"Although training machine learning models for robustness is critical for real-world adoption, determining how to best ensure robustness remains an open problem. Some methods (e.g., DRO) are overly conservative, while others (e.g., Group DRO) require domain knowledge that may be hard to obtain. In this work, we address limitations in prior approaches by assuming a more nuanced form of group shift: conditioned on the label, we assume that the true group function is simple. For example, we may expect that group shifts occur along high-level features (e.g., image background, lighting). Thus, we aim to learn a model that maintains high accuracy on simple group functions realized by these features, but need not spend valuable model capacity achieving high accuracy on contrived groups of examples. Based on this idea, we formulate a two-player game where conditioned on the label the adversary can only separate datapoints into potential groups using simple features, which corresponds to a bitrate constraint on the adversary’s capacity. Our resulting practical algorithm, Bitrate-Constrained DRO (BR-DRO), does not require group annotations on training data yet matches the performance of Group DRO on datasets that have them or are long-tailed. Our theoretical analysis reveals that in some settings the BR-DRO objective can provably yield statistically efficient and less pessimistic solutions than unconstrained DRO.","Robustness, Distribution shift, Group Shift" Some Practical Concerns and Solutions for Using Pretrained Representation in Industrial Systems,https://openreview.net/forum?id=IC8LwiOLKFr,https://openreview.net/pdf?id=IC8LwiOLKFr,We investigate some practical concerns and solutions for using pretrained representation in industrial systems.,"Deep learning has dramatically changed the way data scientists and engineers craft features -- the once tedious process of measuring and constructing can now be achieved by training learnable representations. Recent work shows pretraining can endow representations with relevant signals, and in practice they are often used as feature vectors in downstream models. In real-world production, however, we have encountered key problems that cannot be explained by existing knowledge. They raise concerns that the naive use of pretrained representations as feature vectors could lead to unwarranted and suboptimal solutions. Our investigation reveals critical insights into the gap of uniform convergence for analyzing pretrained representations, their stochastic nature under gradient descent optimization, what model convergence means to them, and how they might interact with downstream tasks. Inspired by our analysis, we explore a simple yet powerful approach that can refine pretrained representations in multiple ways, which we call ""Featurizing Pretrained Representations"". Our work balances practicality and rigor, and contributes to both applied and theoretical research of representation learning. 
","Representation Learning, Stability, Generalization, Convergence, Predictability, Industry Application" Exphormer: Scaling Graph Transformers with Expander Graphs,https://openreview.net/forum?id=8Tr3v4ueNd7,https://openreview.net/pdf?id=8Tr3v4ueNd7,We show how to use expander graphs to devise sparse graph transformers that are powerful and scalable.,"Graph transformers have emerged as a promising architecture for a variety of graph learning and representation tasks. Despite their successes, it remains challenging to scale graph transformers to large graphs while maintaining accuracy competitive with message-passing networks. In this paper, we introduce Exphormer, a framework for building powerful and scalable graph transformers. Exphormer consists of a sparse attention mechanism based on expander graphs, whose mathematical characteristics, such as spectral expansion, and sparsity, yield graph transformers with complexity only linear in the size of the graph, while allowing us to prove desirable theoretical properties of the resulting transformer models. We show that incorporating Exphormer into the recently-proposed GraphGPS framework produces models with competitive empirical results on a wide variety of graph datasets, including state-of-the-art results on three datasets. We also show that Exphormer can scale to datasets on larger graphs than shown in previous graph transformer architectures.","Graph neural networks, Transformers" Generalization to translation shifts in object detection: a study in architectures and augmentations,https://openreview.net/forum?id=mdERENskoo1,https://openreview.net/pdf?id=mdERENskoo1,Data augmentations and architecture are complementary ways of incorporating inductive bias about desired robustness/invariances,"We provide a detailed evaluation of data augmentations and model architectures (convolutional, vision transformer, and fully connected MLP networks) on generalization to large translation shifts in image data. We make the following observations: (a) In the absence of data augmentation, all architectures, including convolutional networks suffer degradation in performance when evaluated on spatially translated test datasets. Understandably, both the in-distribution accuracy and degradation to shifts are significantly worse for non-convolutional architectures. (b) Across all architectures, even a minimal random crop augmentation (e.g., at most $4$ pixel in CIFAR and TINYIMAGENET datasets) improves the robustness of model performance to much larger magnitude shifts of up to $1/4$ of image size ($8$-$16$ pixels) in the test data -- suggesting a form of meta generalization from augmentation. For non-convolutional architectures, while the absolute accuracy is still low, we see dramatic improvements in relative robustness to large translation shifts. We further observe that the robustness gains are maintained with even more minimal $1-2$ pixel random crop augmentation. 
(c) With a sufficiently advanced augmentation pipeline (RandomCrop+RandFlip+RandAugmentation+Erasing+MixUp), all architectures can be trained to have competitive performance, in terms of absolute in-distribution accuracy as well as relative generalization to large translation shifts.","OOD generalization, beyond accuracy, empirical study" Feature selection and low test error in shallow low-rotation ReLU networks,https://openreview.net/forum?id=swEskiem99,https://openreview.net/pdf?id=swEskiem99,"This work establishes low test error of gradient methods on two-layer ReLU networks with standard initialization, in three regimes where key sets of weights rotate little, making use of margins as the core analysis technique.","This work establishes low test error of gradient flow (GF) and stochastic gradient descent (SGD) on two-layer ReLU networks with standard initialization scale, in three regimes where key sets of weights rotate little (either naturally due to GF and SGD, or due to an artificial constraint), making use of margins as the core analysis technique. The first regime is near initialization, specifically until the weights have moved by $\mathcal{O}(\sqrt m)$, where $m$ denotes the network width, which is in sharp contrast to the $\mathcal{O}(1)$ weight motion allowed by the Neural Tangent Kernel (NTK); here it is shown that GF and SGD only need a network width and number of samples inversely proportional to the NTK margin, and moreover that GF attains at least the NTK margin itself and in particular escapes bad KKT points of the margin objective, whereas prior work could only establish nondecreasing but arbitrarily small margins. The second regime is the Neural Collapse (NC) setting, where data lies in well-separated groups, and the sample complexity scales with the number of groups; here the contribution over prior work is an analysis of the entire GF trajectory from initialization. Lastly, if the inner layer weights are constrained to change in norm only and cannot rotate, then GF with large widths achieves globally maximal margins, and its sample complexity scales with their inverse; this is in contrast to prior work, which required infinite width and a tricky dual convergence assumption. ","gradient descent, gradient flow, margin maximization, test error, neural collapse, generalization" Backpropagation through Combinatorial Algorithms: Identity with Projection Works,https://openreview.net/forum?id=JZMR727O29,https://openreview.net/pdf?id=JZMR727O29,"We propose a simple alternative for differentiating through combinatorial solvers with linear objectives, that is on par with SoTA, has no hyperparameters, and is more robust to perturbations.","Embedding discrete solvers as differentiable layers has given modern deep learning architectures combinatorial expressivity and discrete reasoning capabilities. The derivative of these solvers is zero or undefined; therefore, a meaningful replacement is crucial for effective gradient-based learning. Prior works rely on smoothing the solver with input perturbations, relaxing the solver to continuous problems, or interpolating the loss landscape with techniques that typically require additional solver calls, introduce extra hyper-parameters, or compromise performance. We propose a principled approach to exploit the geometry of the discrete solution space to treat the solver as a negative identity on the backward pass and further provide a theoretical justification. 
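The negative-identity backward pass just described is straightforward to express as a custom autograd function; this is a hedged sketch that omits the projection component the paper also discusses, assuming `solver` maps a cost tensor to a solution tensor.

```python
import torch

class NegativeIdentitySolver(torch.autograd.Function):
    @staticmethod
    def forward(ctx, cost, solver):
        # Run the black-box combinatorial solver on the detached costs.
        return solver(cost.detach())

    @staticmethod
    def backward(ctx, grad_output):
        # Treat the solver as a negative identity: pass the incoming
        # gradient through with a sign flip, with no extra solver calls.
        return -grad_output, None

# usage sketch: solution = NegativeIdentitySolver.apply(cost, my_solver)
```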
Our experiments demonstrate that such a straightforward hyper-parameter-free approach is able to compete with previous, more complex methods on numerous tasks, such as backpropagation through discrete samplers, deep graph matching, and image retrieval. Furthermore, we substitute the previously proposed problem-specific and label-dependent margin with a generic regularization procedure that prevents cost collapse and increases robustness.","combinatorial optimization, deep learning, representation learning, gradient descent, backpropagation, argmin differentiation, deep graph matching, retrieval" Therbligs in Action: Video Understanding through Motion Primitives,https://openreview.net/forum?id=rpWw6Ki2b5s,https://openreview.net/pdf?id=rpWw6Ki2b5s,,"In this paper we introduce a rule-based, compositional, and hierarchical modelling of action using Therbligs as our atoms - a consistent, expressive, contact-centered representation of action. Over these atoms we introduce a differentiable method of rule-based reasoning to regularize for logical consistency. Our approach is complementary to other approaches in that the Therblig-based representations produced by our architecture augment rather than replace existing architectures' representations. We release the first Therblig-centered annotations over two popular video datasets - EPIC Kitchens 100 and 50-Salads. We evaluate our system for the task of action segmentation, demonstrating with a base GRU architecture a substantial improvement over the baseline: a 5.6% and 4.1% (14.4% and 6.5% relative) increase in accuracy (and increases with respect to all other metrics as well) on EPIC Kitchens and 50-Salads, respectively. We also demonstrate benefits to adopting Therblig representations for two state-of-the-art approaches - MSTCN++ and ASFormer - observing a 10.3%/10.7% relative improvement, respectively, over EPIC Kitchens and 9.3%/6.1% relative improvement, respectively, over 50 Salads. All code and data are to be released upon paper acceptance.", On the Adversarial Robustness against Natural Weather Perturbations,https://openreview.net/forum?id=Pk_di2bPAop,https://openreview.net/pdf?id=Pk_di2bPAop,,"Several algorithms have been proposed to improve the robustness of deep neural networks against adversarial perturbations beyond $\ell_p$ cases, i.e. weather perturbations. However, evaluations of existing robust training algorithms are over-optimistic. This is in part due to the lack of a standardized evaluation protocol across various robust training algorithms, leading to ad-hoc methods that test robustness on either random perturbations or the adversarial samples from generative models that are used for robust training, which is either uninformative of the worst case, or is heavily biased. In this paper, we identify such evaluation bias in these existing works and propose the first standardized and fair evaluation that compares various robust training algorithms by using physics simulators for common adverse weather effects, i.e., rain and snow. Additionally, our framework identifies the lack of diversity in existing robust training algorithms. As a step to address this, we propose a light-weight generative adversarial network (GAN) with improved diverse weather effects controlled by latent codes that can be used in robust training. 
The proposed robust training algorithm is evaluated on two streetview classification datasets (BIC\_GSV, Places365), where it outperforms other robust training approaches based on generative models for worst-case adversarial rain and snow attacks.", Coupled Multiwavelet Operator Learning for Coupled Differential Equations,https://openreview.net/forum?id=kIo_C6QmMOM,https://openreview.net/pdf?id=kIo_C6QmMOM,We propose a novel coupled multiwavelet operator learning scheme for efficiently solving coupled differential equations.,"Coupled partial differential equations (PDEs) are key tasks in modeling the complex dynamics of many physical processes. Recently, neural operators have shown the ability to solve PDEs by learning the integral kernel directly in Fourier/Wavelet space, so the difficulty of solving the coupled PDEs depends on dealing with the coupled mappings between the functions. Towards this end, we propose a \textit{coupled multiwavelets neural operator} (CMWNO) learning scheme by decoupling the coupled integral kernels during the multiwavelet decomposition and reconstruction procedures in the Wavelet space. The proposed model achieves significantly higher accuracy compared to previous learning-based solvers in solving the coupled PDEs including Gray-Scott (GS) equations and the non-local mean field game (MFG) problem. According to our experimental results, the proposed model exhibits a $2\times$-$4\times$ improvement in relative $L_2$ error compared to the best results from the state-of-the-art models.","Neural operators, coupled differential equations, multiwavelet transform, partial differential equations" Don’t Bet on Sparsity: Designing Brain-inspired Distance-preserving Encoder,https://openreview.net/forum?id=JKFSUPa70W6M,https://openreview.net/pdf?id=JKFSUPa70W6M,,"Multi-headed self-attention-based Transformers have been a central area of research for quite some time. Despite showing a significant improvement in understanding short-term and long-term contexts from sequences, encoders of Transformer and its variants fail to preserve layer-wise contextual information. Further, text representations learned by Transformer-based encoders are usually of low entropy with low variance, which contradicts typical human brain functions. In this work, we propose TransJect, an encoder model that guarantees a theoretical bound for layer-wise distance preservation between any pair of tokens. We propose a simple alternative to dot product attention to ensure Lipschitz continuity that allows TransJect to learn injective mappings to transform token representations to different manifolds and preserve Euclidean distance between every pair of tokens in subsequent layers. Our evaluation on several benchmark short- and long-sequence classification tasks shows a remarkable improvement of 3.1% and 11%, on average, respectively. Furthermore, empirical results suggest that TransJect is layer-agnostic; in fact, it prefers shallower architectures over deeper ones and prevents layer-wise incremental learning beyond a threshold. Our empirical analyses also show the generalization capabilities of TransJect and its robustness under different hyperparameter configurations. We conduct detailed statistical analysis to confirm the necessity of high-entropic representations to achieve human-like cognition. 
","Orthogonal attention, Lipschitz, Entropic Transformer" Mid-Vision Feedback for Convolutional Neural Networks,https://openreview.net/forum?id=4oLK1_k71Tz,https://openreview.net/pdf?id=4oLK1_k71Tz,,"Feedback plays a prominent role in biological vision, where perception is modulated based on agents' continuous interactions with the world, and evolving expectations and world model. We introduce a novel mechanism which modulates perception in Convolutional Neural Networks (CNNs) based on high level categorical expectations: Mid-Vision Feedback (MVF). MVF associates high level contexts with linear transformations. When a context is ""expected"" its associated linear transformation is applied over feature vectors in a mid level of a CNN. The result is that mid-level network representations are biased towards conformance with high level expectations, improving overall accuracy and contextual consistency. Additionally, during training mid-level feature vectors are biased through introduction of a loss term which increases the distance between feature vectors associated with different contexts. MVF is agnostic as to the source of contextual expectations, and can serve as a mechanism for top down integration of symbolic systems with deep vision architectures - applications range from image and video understanding to explainable AI and robotics. We show the superior performance of MVF to post-hoc filtering for incorporation of contextual knowledge, and show superior performance of configurations using predicted context (when no context is known a priori) over configurations with no context awareness.", Cross-Window Self-Training via Context Variations from Sparsely-Labeled Time Series,https://openreview.net/forum?id=whsWWPAUkwR,https://openreview.net/pdf?id=whsWWPAUkwR,,"A real-world time series is often sparsely labeled due to the expensive annotation cost. Recently, self-training methods have been applied to a dataset with few labels to infer the labels of unlabeled augmented instances. Accelerating this trend for time-series data, fully taking advantage of its sequential nature, we propose a novel data augmentation approach called context-additive augmentation, which allows a target instance to be augmented easily by adding preceding and succeeding instances to form an augmented instance. Unlike the existing augmentation techniques which may alter the target instance by directly perturbing its features, it preserves a target instance as is but still gives various augmented instances with varying contexts. Additionally, we propose a cross-window self-training framework based on the context-additive augmentation. The framework first augments target instances by applying context-varying windows over a given time series. Then, the framework derives reliability-based cross-window labels and uses them to maintain consistency among augmented instances across the windows. Extensive experiments using real datasets show that the framework outperforms the existing state-of-the-art self-training methods.","semi-supervised learning, time series, pseudo labeling" Revisiting and Improving FGSM Adversarial Training,https://openreview.net/forum?id=aNiem36virV,https://openreview.net/pdf?id=aNiem36virV,,"FGSM adversarial training often fails to obtain a robust model, and the derived model often suffers from catastrophic overfitting, e.g., it is difficult to resist PGD attacks. 
In this paper, we find that models obtained by FGSM adversarial training tend to rely on small-scale features, such as fine details and high-frequency features whose semantics are difficult for humans to recognize, while PGD adversarial training can effectively regularize the model's utilization of small-scale features. We argue that excessive use of small-scale features will increase the local non-linearity of the model, making it difficult for the FGSM attack to generalize to the PGD attack. To address this issue, we propose to adjust the training set data, including removing small-scale features in the sample and adding random noise in the direction of small-scale features, so as to prevent the model from over-exploiting the small-scale features. Standard FGSM adversarial training on the adjusted training set is expected to circumvent the catastrophic overfitting problem. Experiments on real data validate the effectiveness of our method: FGSM adversarially trained models on CIFAR-10 and CIFAR-100 achieve robustness comparable to PGD adversarial training.","Small-scale features, FGSM adversarial training, Catastrophic overfitting" Safe Reinforcement Learning From Pixels Using a Stochastic Latent Representation,https://openreview.net/forum?id=b39dQt_uffW,https://openreview.net/pdf?id=b39dQt_uffW,"This paper proposes Safe SLAC, a safety-constrained RL approach for partially observable settings, which uses a stochastic latent variable model combined with a safety critic.","We address the problem of safe reinforcement learning from pixel observations. Inherent challenges in such settings are (1) a trade-off between reward optimization and adhering to safety constraints, (2) partial observability, and (3) high-dimensional observations. We formalize the problem in a constrained, partially observable Markov decision process framework, where an agent obtains distinct reward and safety signals. To address the curse of dimensionality, we employ a novel safety critic using the stochastic latent actor-critic (SLAC) approach. The latent variable model predicts rewards and safety violations, and we use the safety critic to train safe policies. Using well-known benchmark environments, we demonstrate competitive performance over existing approaches regarding computational requirements, final reward return, and satisfying the safety constraints. ","safety, reinforcement learning, safe reinforcement learning, constrained Markov decision process, partially observable Markov decision process, MDP, POMDP" TrojText: Test-time Invisible Textual Trojan Insertion,https://openreview.net/forum?id=ja4Lpp5mqc2,https://openreview.net/pdf?id=ja4Lpp5mqc2,"TrojText is a more realistic, efficient, test-time invisible textual Trojan Insertion method against intelligent neuron models","Intelligent neuron models in Natural Language Processing (NLP) are known to be vulnerable to textual Trojan attacks, i.e., Trojan models behave normally for normal inputs, yet produce malicious output for input with a trigger. Invisible textual triggers, e.g., syntactic-structure triggers, are becoming popular since they require more effort for detection and defense than triggers based on content insertion and replacement. Despite their high stealthiness and attack effectiveness, current Trojan attacks with syntactic-structure triggers are highly dependent on a large corpus of training data to generate poisoned samples with the specific syntactic structure for Trojan insertion. 
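Returning to the FGSM adversarial training entry above: its training-set adjustment can be pictured as low-pass filtering plus noise injected along the removed high-frequency directions. The sketch below uses a simple Fourier mask; the cutoff, the noise scale, and the Fourier notion of small-scale features are illustrative assumptions, and the paper's actual procedure may differ.

```python
import torch

def adjust_sample(x, cutoff=8, noise_std=0.1):
    # x: (..., H, W) image tensor. Keep large-scale (low-frequency) content
    # and add random noise where the small-scale features used to be.
    freq = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    h, w = x.shape[-2:]
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    mask = (((yy - h // 2).abs() <= cutoff) &
            ((xx - w // 2).abs() <= cutoff)).float()
    low_pass = freq * mask                                   # drop small scales
    noise = noise_std * torch.randn_like(freq) * (1 - mask)  # noise along them
    out = torch.fft.ifft2(torch.fft.ifftshift(low_pass + noise, dim=(-2, -1)))
    return out.real
```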
Moreover, accessing training data is not always realistic for attackers, and training-time attacks, current syntactic poisoned-trigger generation, and Trojan insertion by updating all the parameters are extremely time-consuming. In this paper, we propose TrojText to study whether the invisible textual Trojan attack can be efficiently performed without the presence of training data in a more realistic and cost-efficient manner. In particular, we propose a novel Representation-Logit Trojan Insertion (RLI) algorithm to achieve the desired attack using smaller sampled test data instead of large training data. We further propose accumulated gradient ranking (AGR) and Trojan Weights Pruning (TWP) to reduce the number of tuned parameters and the attack overhead. We perform extensive experiments on AG's News, SST-2, and OLID with BERT and XLNet. Our TrojText could classify 98.35\% of test sentences into the target class on the BERT model for AG's News data. ","Textual, Trojan, Backdoor, Syntactic, Trigger, Invisible, Attack, Defense, Test-time" An Experiment Design Paradigm using Joint Feature Selection and Task Optimization,https://openreview.net/forum?id=ytNEuwH1yeL,https://openreview.net/pdf?id=ytNEuwH1yeL,,"This paper presents a subsampling-task paradigm for data-driven task-specific experiment design (ED) and a novel method in populationwide supervised feature selection (FS). Optimal ED, the choice of sampling points under constraints of limited acquisition-time, arises in a wide variety of scientific and engineering contexts. However, the continuous optimization used in classical approaches depends on a-priori parameter choices and challenging non-convex optimization landscapes. This paper proposes to replace this strategy with a subsampling-task paradigm, analogous to populationwide supervised FS. In particular, we introduce JOFSTO, which performs JOint Feature Selection and Task Optimization. JOFSTO jointly optimizes two coupled networks: one for feature scoring, which provides the ED, the other for execution of a downstream task or process. Unlike most FS problems, e.g. selecting protein expressions for classification, ED problems typically select from highly correlated globally informative candidates rather than seeking a small number of highly informative features among many uninformative features. JOFSTO's construction efficiently identifies potentially correlated, but effective subsets and returns a trained task network. We demonstrate the approach using parameter estimation and mapping problems in quantitative MRI, where economical ED is crucial for clinical application. Results from simulations and empirical data show the subsampling-task paradigm strongly outperforms classical ED, and within our paradigm, JOFSTO outperforms state-of-the-art supervised FS techniques. JOFSTO extends immediately to wider image-based ED problems and other scenarios where the design must be specified globally across large numbers of acquisitions. Our code is available for reviewers https://www.dropbox.com/scl/fo/qe6vb1w6fuf869hx4ht0k/h?dl=0&rlkey=og8czcorurl57jbiixio7hcjt ","Experiment Design, Populationwide Supervised Feature Selection, Quantitative Magnetic Resonance Imaging, Deep Learning" Multi-Objective Online Learning,https://openreview.net/forum?id=dKkMnCWfVmm,https://openreview.net/pdf?id=dKkMnCWfVmm,,"This paper presents a systematic study of multi-objective online learning. We first formulate the framework of Multi-Objective Online Convex Optimization, which encompasses a novel multi-objective regret. 
This regret is built upon a sequence-wise extension of the Pareto suboptimality gap, a discrepancy metric commonly used in zero-order multi-objective bandits. We then derive an equivalent form of the regret, making it amenable to optimization via first-order iterative methods. To motivate the algorithm design, we give an explicit example in which equipping online mirror descent (OMD) with the vanilla min-norm solver for gradient composition incurs linear regret, which shows that merely regularizing the iterates, as in single-objective online learning, is not enough to guarantee sublinear regret in the multi-objective setting. To resolve this issue, we propose a novel min-regularized-norm solver that regularizes the composite weights. Combining min-regularized-norm with OMD results in the Doubly Regularized Online Mirror Multiple Descent algorithm. We further derive the multi-objective regret bound for the proposed algorithm, which matches the optimal bound in the single-objective setting. Extensive experiments on real-world datasets verify the effectiveness of the proposed algorithm.", Improved Training of Physics-Informed Neural Networks Using Energy-Based Priors: a Study on Electrical Impedance Tomography,https://openreview.net/forum?id=zqkfJA6R1-r,https://openreview.net/pdf?id=zqkfJA6R1-r,,"Physics-informed neural networks (PINNs) are attracting significant attention for solving partial differential equation (PDE) based inverse problems, including electrical impedance tomography (EIT). EIT is non-linear, and its inverse problem in particular is highly ill-posed. Therefore, successful training of PINNs is extremely sensitive to the interplay between different loss terms and hyper-parameters, including the learning rate. In this work, we propose a Bayesian approach through a data-driven energy-based model (EBM) as a prior, to improve the overall accuracy and quality of tomographic reconstruction. In particular, the EBM is trained over the possible solutions of the PDEs with different boundary conditions. By imparting such a prior onto physics-based training, convergence of the PINN to the PDE's solution is expedited by more than ten times. The evaluation outcome shows that our proposed method is more robust for solving the EIT problem.","Physics-informed neural networks, electrical impedance tomography, energy-based models" Efficient Bayesian Optimization with Deep Kernel Learning and Transformer Pre-trained on Multiple Heterogeneous Datasets,https://openreview.net/forum?id=0aAd19ZQp11,https://openreview.net/pdf?id=0aAd19ZQp11,,"Bayesian optimization (BO) is widely adopted in black-box optimization problems and it relies on a surrogate model to approximate the black-box response function. With the increasing number of black-box optimization tasks solved and even more to solve, the ability to learn from multiple prior tasks to jointly pre-train a surrogate model is long-awaited to further boost optimization efficiency. In this paper, we propose a simple approach to pre-train a surrogate, which is a Gaussian process (GP) with a kernel defined on deep features learned from a Transformer-based encoder, using datasets from prior tasks with possibly heterogeneous input spaces. In addition, we provide a simple yet effective mix-up initialization strategy for input tokens corresponding to unseen input variables, thereby accelerating the convergence of new tasks. 
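A hedged sketch of the pre-trained surrogate described in the Bayesian optimization entry just above: a GP kernel evaluated on features from a Transformer-based encoder. The RBF form, the single learned lengthscale, and the encoder interface are our assumptions for illustration; GP posterior inference would be built on top of the resulting Gram matrix.

```python
import torch
import torch.nn as nn

class DeepKernelGPSurrogate(nn.Module):
    def __init__(self, encoder, lengthscale=1.0):
        super().__init__()
        self.encoder = encoder  # e.g., a Transformer encoder over input tokens
        self.log_ls = nn.Parameter(torch.tensor(float(lengthscale)).log())

    def kernel(self, x1, x2):
        # RBF kernel on deep features learned by the encoder.
        z1, z2 = self.encoder(x1), self.encoder(x2)
        sq_dist = torch.cdist(z1, z2).pow(2)
        return torch.exp(-0.5 * sq_dist / self.log_ls.exp() ** 2)
```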
Experiments on both synthetic and real benchmark problems demonstrate the effectiveness of our proposed pre-training and transfer BO strategy over existing methods.","Pre-training, Bayesian optimization, Transformer, Transfer learning" Robustness Guarantees for Adversarially Trained Neural Networks,https://openreview.net/forum?id=9_cba-ImPGb,https://openreview.net/pdf?id=9_cba-ImPGb,,"We study robust adversarial training of two-layer neural networks with Leaky ReLU activation function as a bi-level optimization problem. In particular, for the inner-loop that implements the PGD attack, we propose maximizing a lower bound on the 0/1-loss by reflecting a surrogate loss about the origin. This allows us to give a convergence guarantee for the inner-loop PGD attack and precise iteration complexity results for end-to-end adversarial training, which hold for any width and initialization in a realizable setting.", Fast-PINN for Complex Geometry: Solving PDEs with Boundary Connectivity Loss,https://openreview.net/forum?id=IIyox3dwad0,https://openreview.net/pdf?id=IIyox3dwad0,"We present a fast-PINN method based on the incorporation of boundary connectivity constraints into the training loss, which can efficiently produce accurate solutions with an order of magnitude fewer training samples, across multiple fluid dynamics problems.","We present a novel loss formulation for efficient learning of complex dynamics from governing physics, typically described by partial differential equations (PDEs), using physics-informed neural networks (PINNs). In our experiments, existing versions of PINNs are seen to learn poorly in many problems, especially for complex geometries, as it becomes increasingly difficult to establish an appropriate sampling strategy in the near-boundary region. Overly dense sampling can adversely impede training convergence if the local gradient behaviors are too complex to be adequately modelled by PINNs. On the other hand, if the samples are too sparse, PINNs may over-fit the near-boundary region, leading to incorrect solutions. To prevent such issues, we propose a new Boundary Connectivity (BCXN) loss function which provides local structure approximation at the boundary. Our BCXN-loss can implicitly or explicitly impose such approximations during training, thus facilitating fast physics-informed learning across entire problem domains with an order of magnitude fewer training samples. This method shows a few orders of magnitude smaller errors than existing methods in terms of the standard L2-norm metric, while using dramatically fewer training samples and iterations. Our proposed Fast-PINN method does not impose any requirement on the differentiability of the networks, and we demonstrate its benefits and ease of implementation on both multi-layer perceptron and convolutional neural network versions as commonly used in current physics-informed neural network literature.","Physics-informed neural networks, physics-informed loss formulation, multi-layer perceptron, convolutional neural network, fluid dynamics" Noise Transforms Feed-Forward Networks into Sparse Coding Networks,https://openreview.net/forum?id=P9yXPbfqbvC,https://openreview.net/pdf?id=P9yXPbfqbvC,"We find that noise alone induces networks to become Top-K, sparse coding networks. This resolves a difference between biological and artificial neural networks with regard to how sparse they are and how this sparsity is implemented. 
","A hallmark of biological neural networks, which distinguishes them from their artificial counterparts, is the high degree of sparsity in their activations. Here, we show that by simply injecting symmetric, random, noise during training in reconstruction or classification tasks, artificial neural networks with ReLU activation functions eliminate this difference; the neurons converge to a sparse coding solution where only a small fraction are active for any input. The resulting network learns receptive fields like those of primary visual cortex and remains sparse even when noise is removed in later stages of learning.","Sparse Coding, Sparsity, Top-K Activation, Noise, Biologically Inspired" DEFENDING BACKDOOR ATTACKS VIA ROBUSTNESS AGAINST NOISY LABEL,https://openreview.net/forum?id=osIppnySBTV,https://openreview.net/pdf?id=osIppnySBTV,We propose a principled approach to defend the backdoor attack by using existed robust algorithm against label noise.,"Many deep neural networks are vulnerable to backdoor poisoning attacks, in which an adversary strategically injects a backdoor trigger into a small fraction of the training data. The trigger can later be applied during inference to manipulate prediction labels. While the data label could be changed to arbitrary values by an adversary, the extent of corruption injected into the feature values is strictly limited to keep the backdoor attack in disguise, which leads to a resemblance between the backdoor attack and a milder attack that involves only noisy labels. This paper investigates an intriguing question: \textit{Can we leverage algorithms that defend against noisy label corruptions to defend against general backdoor attacks?} We first discuss the limitations of directly using current noisy-label defense algorithms to defend against backdoor attacks. We then propose a meta-algorithm for both supervised and semi-supervised settings that transforms an existing noisy label defense algorithm into one that protects against backdoor attacks. Extensive experiments on different settings show that, by introducing a lightweight alteration for minimax optimization to the existing noisy-label defense algorithms, the robustness against backdoor attacks can be substantially improved, while the initial form of those algorithms would fail in the presence of a backdoor attack.","Backdoor Attack, Deep Learning, Data Poisoning, Noisy Label" A Kernel Perspective of Skip Connections in Convolutional Networks,https://openreview.net/forum?id=6H_uOfcwiVh,https://openreview.net/pdf?id=6H_uOfcwiVh,,"Over-parameterized residual networks (ResNets) are amongst the most successful convolutional neural architectures for image processing. Here we study their properties through their Gaussian Process and Neural Tangent kernels. We derive explicit formulas for these kernels, analyze their spectra, and provide bounds on their implied condition numbers. Our results indicate that (1) with ReLU activation, the eigenvalues of these residual kernels decay polynomially at a similar rate compared to the same kernels when skip connections are not used, thus maintaining a similar frequency bias; (2) however, residual kernels are more locally biased. 
Our analysis further shows that the matrices obtained by these residual kernels yield more favorable condition numbers at finite depths than those obtained without skip connections, therefore enabling faster convergence of training with gradient descent.", SlothBomb: Efficiency Poisoning Attack against Dynamic Neural Networks,https://openreview.net/forum?id=Zsl54T6OLBH,https://openreview.net/pdf?id=Zsl54T6OLBH,,"Recent increases in deploying deep neural networks (DNNs) on resource-constrained devices, combined with the observation that not all input samples require the same amount of computation, have sparked interest in input-adaptive dynamic neural networks (DyNNs). These DyNNs bring in more efficient inference and enable deploying DNNs on resource-constrained devices, e.g., mobile devices. In this work, we study a new vulnerability of DyNNs: can adversaries manipulate a DyNN's efficiency to provide a false sense of efficiency? To answer this question, we design SlothBomb, an adversarial attack that injects efficiency backdoors into DyNNs. SlothBomb can poison just a minimal percentage of a DyNN's training data to inject a backdoor trigger into the DyNN. At inference time, SlothBomb can use the backdoor trigger to slow down DyNNs and abuse their computational resources - an availability threat analogous to denial-of-service attacks. We evaluate SlothBomb on three DNN backbone architectures (based on VGG16, MobileNet, and ResNet56) on two popular datasets (CIFAR-10 and Tiny ImageNet). We show that SlothBomb reduces the efficacy of DyNNs on triggered input samples while keeping almost the same efficiency on clean samples.","Efficient ML, Poisoning Attack" Ordered GNN: Ordering Message Passing to Deal with Heterophily and Over-smoothing,https://openreview.net/forum?id=wKPmPBHSnT6,https://openreview.net/pdf?id=wKPmPBHSnT6,"In this paper, we propose a novel GNN model to tackle heterophily and over-smoothing simultaneously by aligning the rooted-tree hierarchy with node embedding structure.","Most graph neural networks follow the message passing mechanism. However, it faces the over-smoothing problem when message passing is applied multiple times to a graph, causing indistinguishable node representations and preventing the model from effectively learning dependencies between farther-away nodes. On the other hand, features of neighboring nodes with different labels are likely to be falsely mixed, resulting in the heterophily problem. In this work, we propose to order the messages passed into the node representation, with specific blocks of neurons targeted for message passing within specific hops. This is achieved by aligning the hierarchy of the rooted-tree of a central node with the ordered neurons in its node representation. Experimental results on an extensive set of datasets show that our model can simultaneously achieve the state-of-the-art in both homophily and heterophily settings, without any targeted design. Moreover, its performance holds up well as the model becomes very deep, effectively preventing the over-smoothing problem. 
Finally, visualizing the gating vectors shows that our model learns to behave differently in homophily and heterophily settings, providing an explainable graph neural model.","GNN, heterophily, over-smoothing" Sparse Distributed Memory is a Continual Learner,https://openreview.net/forum?id=JknGeelZJpHP,https://openreview.net/pdf?id=JknGeelZJpHP,"Improving Sparse Distributed Memory via additional neurobiology results in a deep learning model with strong, organic continual learning and insights into sparse models more broadly.","Continual learning is a problem for artificial neural networks that their biological counterparts are adept at solving. Building on work using Sparse Distributed Memory (SDM) to connect a core neural circuit with the powerful Transformer model, we create a modified Multi-Layered Perceptron (MLP) that is a strong continual learner. We find that every component of our MLP variant translated from biology is necessary for continual learning. Our solution is also free from any memory replay or task information, and introduces novel methods to train sparse networks that may be broadly applicable.","Sparse Distributed Memory, Sparsity, Top-K Activation, Continual Learning, Biologically Inspired" Optimistic Exploration in Reinforcement Learning Using Symbolic Model Estimates,https://openreview.net/forum?id=Ji1_32XWMxK,https://openreview.net/pdf?id=Ji1_32XWMxK,,"There has been increasing interest in using symbolic models alongside reinforcement learning (RL), where these coarser abstract models are used as a way to provide higher level guidance to the RL agent. However, most of these works are limited by their assumption that they have access to a symbolic approximation of the underlying problem. To address this problem, we introduce a new method for learning optimistic symbolic approximations of the underlying world model. We show how these representations, coupled with fast diverse planners developed by the automated planning community, provide us with a new paradigm for optimistic exploration in sparse reward settings. We also investigate how we could speed up the learning process by generalizing learned model dynamics across similar actions with minimal human input. We evaluate the method by testing it on multiple benchmark domains and comparing it with other RL strategies for sparse reward settings, including hierarchical RL and intrinsic reward based exploration.", FLIP: A Provable Defense Framework for Backdoor Mitigation in Federated Learning,https://openreview.net/forum?id=Xo2E217_M4n,https://openreview.net/pdf?id=Xo2E217_M4n,,"Federated Learning (FL) is a distributed learning paradigm that enables different parties to train a model together for high quality and strong privacy protection. In this scenario, individual participants may get compromised and perform backdoor attacks by poisoning the data (or gradients). Existing work on robust aggregation and certified FL robustness does not study how hardening benign clients can affect the global model (and the malicious clients). In this work, we theoretically analyze the connection among cross-entropy loss, attack success rate, and clean accuracy in this setting. Moreover, we propose a trigger reverse engineering based defense and show that our method can achieve robustness improvement with guarantee (i.e., reducing the attack success rate) without affecting benign accuracy. We conduct comprehensive experiments across different datasets and attack settings. 
Our results on eight competing SOTA defense methods show the empirical superiority of our method on both single-shot and continuous FL backdoor attacks. We will release our code upon publication.","Federated learning, backdoor mitigation" Towards Automatic Generation of Advanced Shift Networks,https://openreview.net/forum?id=u8IcZZORLuq,https://openreview.net/pdf?id=u8IcZZORLuq,"We propose AutoShiftNet, the first framework tailoring Neural Architecture Search (NAS) to substantially reduce the accuracy gap between bit-shift neural networks and their real-valued counterparts","Multiplication-less neural networks significantly reduce the time and energy cost on the hardware platform, as the compute-intensive multiplications are replaced with lightweight bit-shift operations. However, existing bit-shift networks are all directly transferred from state-of-the-art convolutional neural networks (CNNs), which leads to a non-negligible accuracy drop or even failure of model convergence. To combat this, we propose AutoShiftNet, the first framework tailoring Neural Architecture Search (NAS) to substantially reduce the accuracy gap between bit-shift neural networks and their real-valued counterparts. Specifically, we pioneer dragging NAS into a shift-oriented search space and endow it with a robust topology-related search strategy and custom regularization and stabilization. As a result, our AutoShiftNet breaks through the incompatibility of traditional NAS methods for bit-shift neural networks and achieves more desirable performance in terms of accuracy and convergence. Extensive experiments demonstrate that AutoShiftNet sets a new state-of-the-art for bit-shift neural networks, where the accuracy increases (1.69∼8.07)% on CIFAR10, (5.71∼18.09)% on CIFAR100 and > 4.36% on ImageNet, especially when many conventional CNNs fail to converge on ImageNet with bit-shift weights.","Shift Neural Networks, Neural Architecture Search" Robust attributions require rethinking robustness metrics,https://openreview.net/forum?id=j8Ygylt1DYJ,https://openreview.net/pdf?id=j8Ygylt1DYJ,The existing metrics for robustness of attributions to small imperceptible perturbations don't capture local sensitivity. We propose new locality-sensitive metrics and show their usefulness.,"For machine learning models to be reliable and trustworthy, their decisions must be interpretable. As these models find increasing use in safety-critical applications, it is important that not just the model predictions but also their explanations (as feature attributions) be robust to small human-imperceptible input perturbations. Recent works have shown that many attribution methods are fragile and have proposed improvements in either the attribution methods or the model training. Existing works measure attributional robustness by metrics such as top-$k$ intersection, Spearman's rank-order correlation (or Spearman's $\rho$) or Kendall's rank-order correlation (or Kendall's $\tau$) to quantify the change in feature attributions under input perturbation. However, we show that these metrics are fragile. That is, under such metrics, a simple random perturbation attack can seem to be as significant as more principled attributional attacks. We instead propose Locality-sENSitive (LENS) improvements of the above metrics, namely, LENS-top-$k$, LENS-Spearman and LENS-Kendall, that incorporate the locality of attributions along with their rank order.
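For the bit-shift networks targeted by the AutoShiftNet entry above, the underlying weight format can be made concrete. A small sketch (not the paper's NAS procedure, which is its actual contribution): rounding weights to signed powers of two so that every multiplication reduces to a sign flip plus a bit shift; the exponent range is an illustrative assumption.

```python
import torch

def to_shift_weights(w: torch.Tensor, min_exp: int = -7, max_exp: int = 0) -> torch.Tensor:
    """Round real-valued weights to the nearest signed power of two."""
    sign = torch.sign(w)
    exp = torch.round(torch.log2(w.abs().clamp_min(1e-12))).clamp(min_exp, max_exp)
    return sign * torch.pow(2.0, exp)

w = torch.tensor([0.3, -0.9, 0.06])
print(to_shift_weights(w))  # tensor([ 0.2500, -1.0000,  0.0625]) -- each entry is +/- 2^k
```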
Our locality-sensitive metrics provide tighter bounds on attributional robustness and do not disproportionately penalize attribution methods for reasonable local changes. We show that the robust attribution methods proposed in recent works also reflect this premise of locality, thus highlighting the need for a locality-sensitive metric for progress in the field. Our empirical results on well-known benchmark datasets using well-known models and attribution methods support our observations and conclusions in this work.","Robustness, Attribution, Interpretable, Metrics" Learned Nearest-Class-Mean for Biased Representations in Long-Tailed Recognition,https://openreview.net/forum?id=ByaNEZdnx2O,https://openreview.net/pdf?id=ByaNEZdnx2O,Representations in long-tailed recognition exhibit high tail variance; propose Learned NCM to mitigate representation bias.,"The problem of long-tailed recognition (LTR) has received attention in recent years due to the fundamental power-law distribution of objects in the real world. While classifier bias in LTR has been addressed by many works, representation bias has not yet been researched. At the same time, most recent works use softmax classifiers that are unable to cope with representation bias. In this work, we address these shortcomings by firstly making the key observation that intra-class variance in representation space is negatively correlated to class frequency, leading to biased representations; our analysis reveals that high tail variance is due to spurious correlations learned by deep models. Secondly, to counter representation bias, we propose the Learned Nearest-Class-Mean (NCM), which overcomes uncertainty in empirical centroid estimates and jointly learns centroids minimizing average class-distance normalized variance. Further, we adapt the logit adjustment technique in the NCM framework to achieve higher tail-class margins. Our Learned NCM with Logit Adjustment achieves a 6\% gain over state-of-the-art in tail accuracy on the benchmark CIFAR100-LT and ImageNet-LT datasets. ","Long-Tailed Recognition, Representation bias, Nearest-Class-Mean" GradientMix: A Simple yet Effective Regularization for Large Batch Training,https://openreview.net/forum?id=5c_nxk-dX1J,https://openreview.net/pdf?id=5c_nxk-dX1J,,"Stochastic gradient descent (SGD) is the core tool for training deep neural networks. As modern deep learning tasks become more complex and state-of-the-art architectures grow larger, network training with SGD takes a huge amount of time; for example, training ResNet on the ImageNet dataset or BERT pre-training can take days to dozens of days. To reduce the network training time, distributed learning using a large batch size for SGD has been one of the main active research areas in recent years, but this approach entails a significant degradation in generalization. To address this issue, in this paper, we propose a simple yet effective regularization technique, GradientMix, for large-scale distributed learning. GradientMix can enhance the generalization in large batch regimes by injecting appropriate noise through a mixup of local gradients computed at multiple devices, contrary to the convention of simply averaging local gradients. Furthermore, GradientMix is optimizer-agnostic and hence can be applied to any popular optimization algorithm as long as the overall loss is expressed as the sum of the subgroup losses.
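A minimal sketch of a nearest-class-mean head with learned centroids and logit adjustment, as described in the Learned NCM entry above. Treating negative distances as logits and the temperature tau are common choices assumed here, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedNCM(nn.Module):
    """NCM head whose class centroids are learned jointly with the network.
    Logits are negative Euclidean distances to the centroids, shifted by the
    log class priors (logit adjustment, applied at training time) so that
    tail classes receive a larger margin."""

    def __init__(self, feat_dim: int, num_classes: int, class_counts, tau: float = 1.0):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_classes, feat_dim))
        priors = torch.as_tensor(class_counts, dtype=torch.float)
        self.register_buffer("log_prior", torch.log(priors / priors.sum()))
        self.tau = tau

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        d = torch.cdist(feats, self.centroids)   # (B, C) distances to class means
        return -d + self.tau * self.log_prior    # logit-adjusted NCM scores

head = LearnedNCM(feat_dim=64, num_classes=10, class_counts=[500] * 5 + [50] * 5)
loss = F.cross_entropy(head(torch.randn(8, 64)), torch.randint(0, 10, (8,)))
```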
Our extensive experiments show its effectiveness on both small- and large-scale problems; in particular, we consistently achieve state-of-the-art performance for various optimizers when training ResNet-50 on the ImageNet dataset with a 32K batch size.","Large Batch Training, Deep Learning Optimization" UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining,https://openreview.net/forum?id=kXwdL1cWOAi,https://openreview.net/pdf?id=kXwdL1cWOAi,We propose a novel language sampling method that is close to being uniform across languages without introducing harmful repetition and that outperforms the temperature-based sampling.,"Pretrained multilingual large language models have typically used heuristic temperature-based sampling to balance between different languages. However, previous work has not systematically evaluated the efficacy of different pretraining language distributions across model scales. In this paper, we propose a new sampling method, UniMax, that delivers more uniform coverage of head languages while mitigating overfitting on tail languages by explicitly capping the number of repeats over each language's corpus. We perform an extensive series of ablations testing a range of sampling strategies on a suite of multilingual benchmarks, while varying model scale. We find that UniMax outperforms standard temperature-based sampling, and the benefits persist as scale increases. As part of our contribution, we release an improved and refreshed variant of the mC4 multilingual corpus consisting of 29 trillion characters across 107 languages. In addition, we release full code to reproduce our experiments.","multilingual, pretraining, language models, language sampling, language distribution, low-resource languages, overfitting" Discrete State-Action Abstraction via the Successor Representation,https://openreview.net/forum?id=Krk0Gnft2Zc,https://openreview.net/pdf?id=Krk0Gnft2Zc,"We give a max-entropy regularized model for clustering states based on their successor representation, then train options to navigate between clusters.","While the difficulty of reinforcement learning problems is typically related to the complexity of their state spaces, abstraction proposes that solutions often lie in simpler underlying latent spaces. Prior works have focused on learning either a continuous or dense abstraction, or require a human to provide one. Information-dense representations capture features irrelevant for solving tasks, and continuous spaces can struggle to represent discrete objects. In this work, we automatically learn a sparse discrete abstraction of the underlying environment. We do so using a simple end-to-end trainable model based on the successor representation and max-entropy regularization. We describe an algorithm to apply our model, named Discrete State-Action Abstraction (DSAA), which computes an action abstraction in the form of temporally extended actions, i.e., Options, to transition between discrete abstract states.
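The GradientMix entry above replaces plain gradient averaging with a mixup of per-device gradients. A toy sketch under the assumption that the mixing weights are random convex coefficients, here Dirichlet-distributed so the aggregate stays unbiased; the paper's exact mixing scheme may differ.

```python
import torch

def gradient_mix(local_grads, alpha: float = 0.2) -> torch.Tensor:
    """Combine K local gradients with random convex weights instead of the
    plain mean. E[w_i] = 1/K for Dirichlet(alpha) weights, so the mixed
    gradient is an unbiased, noisier estimate of the average."""
    k = len(local_grads)
    w = torch.distributions.Dirichlet(torch.full((k,), alpha)).sample()
    return sum(wi * g for wi, g in zip(w, local_grads))

# toy usage: 4 workers, each holding a gradient for the same tensor
grads = [torch.randn(10) for _ in range(4)]
mixed = gradient_mix(grads)  # noisy surrogate for torch.stack(grads).mean(0)
```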
Empirically, we demonstrate the effects of different exploration schemes on our resulting abstraction, and show that it is efficient for solving downstream tasks.","reinforcement learning, abstraction, successor representation, options, discrete, sparse reward, representation learning, intrinsic motivation" Hyper-parameter Tuning for Fair Classification without Sensitive Attribute Access,https://openreview.net/forum?id=wR08RrAsLz5,https://openreview.net/pdf?id=wR08RrAsLz5,,"Fair machine learning methods seek to train models that balance model performance across demographic subgroups defined over sensitive attributes like race and gender. Although sensitive attributes are typically assumed to be known during training, they may not be available in practice due to privacy and other logistical concerns. Recent work has sought to train fair models without sensitive attributes on training data. However, these methods need extensive hyper-parameter tuning to achieve good results, and hence assume that sensitive attributes are known on validation data. Yet this assumption, too, might not be practical. Here, we propose a framework to train fair classifiers without access to sensitive attributes on either training or validation data. Instead, we generate pseudo sensitive attributes on the validation data by training a biased classifier and using the classifier’s incorrectly (correctly) labeled examples as proxies for minority (majority) groups. Since fairness metrics like demographic parity, equal opportunity and subgroup accuracy can be estimated to within a proportionality constant even with noisy sensitive attribute information, we show theoretically and empirically that these proxy labels can be used to maximize fairness under average accuracy constraints. Key to our results is a principled approach for selecting the hyper-parameters of the biased classifier in a completely unsupervised fashion (meaning without access to ground-truth sensitive attributes) that minimizes the gap between fairness estimated using noisy versus ground-truth sensitive labels.", Towards Learning Implicit Symbolic Representation for Visual Reasoning,https://openreview.net/forum?id=V8isglQkt74,https://openreview.net/pdf?id=V8isglQkt74,Implicit symbolic representation emerges from self-supervised pretrained neural networks.,"Visual reasoning tasks are designed to test a learning algorithm's capability to infer causal relationships, discover object interactions, and understand temporal dynamics, all from visual cues. It is commonly believed that to achieve compositional generalization on visual reasoning, an explicit abstraction of the visual scene must be constructed; for example, object detection can be applied to the visual input to produce representations that are then processed by a neural network or a neuro-symbolic framework. We demonstrate that a simple and general self-supervised approach is able to learn implicit symbolic representations with general-purpose neural networks, enabling the end-to-end learning of visual reasoning directly from raw visual inputs. Our proposed approach ``compresses'' each frame of a video into a small set of tokens with a transformer network. The self-supervised learning objective is to reconstruct each image based on the compressed temporal context. To minimize the reconstruction loss, the network must learn a compact representation for each image, as well as capture temporal dynamics and object permanence from temporal context.
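A small sketch of the pseudo-sensitive-attribute construction from the hyper-parameter-tuning entry above: fit a deliberately biased classifier and use its mistakes (correct predictions) on validation data as proxies for the minority (majority) group. Logistic regression and the toy data are assumptions for illustration; the paper additionally selects the biased classifier's hyper-parameters in an unsupervised fashion.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def proxy_groups(X_val: np.ndarray, y_val: np.ndarray) -> np.ndarray:
    """Return pseudo group labels: 0 = proxy minority (misclassified by the
    biased model), 1 = proxy majority (correctly classified). Fitting and
    predicting on the same split keeps the sketch self-contained."""
    biased_clf = LogisticRegression(max_iter=200).fit(X_val, y_val)
    wrong = biased_clf.predict(X_val) != y_val
    return np.where(wrong, 0, 1)

X = np.random.randn(200, 8)
y = (X[:, 0] > 0).astype(int)
groups = proxy_groups(X, y)  # stand-in for unavailable sensitive attributes
```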
We evaluate the proposed approach on two visual reasoning benchmarks, CATER and ACRE. We observe that self-supervised pretraining is essential to achieve compositional generalization for our end-to-end trained neural network, and our proposed method achieves performance on par with or better than recent neuro-symbolic approaches that often require additional object-level supervision.","visual reasoning, self-supervised learning, implicit symbolic representation" GNNInterpreter: A Probabilistic Generative Model-Level Explanation for Graph Neural Networks,https://openreview.net/forum?id=rqq6Dh8t4d,https://openreview.net/pdf?id=rqq6Dh8t4d,"We propose a model-level explanation method for GNNs, which is more general, flexible, and computationally efficient than the current SOTA.","Recently, Graph Neural Networks (GNNs) have significantly advanced the performance of machine learning tasks on graphs. However, this technological breakthrough makes people wonder: how does a GNN make such decisions, and can we trust its prediction with high confidence? When it comes to some critical fields, such as biomedicine, where making wrong decisions can have severe consequences, it is crucial to interpret the inner working mechanisms of GNNs before applying them. In this paper, we propose a model-agnostic model-level explanation method for different GNNs that follow the message passing scheme, GNNInterpreter, to explain the high-level decision-making process of the GNN model. More specifically, GNNInterpreter learns a probabilistic generative graph distribution that produces the most discriminative graph pattern the GNN tries to detect when making a certain prediction by optimizing a novel objective function specifically designed for the model-level explanation for GNNs. Compared with the existing work, GNNInterpreter is more computationally efficient and more flexible in generating explanation graphs with different types of node features and edge features, without introducing another blackbox to explain the GNN and without requiring manually specified domain-specific knowledge. Additionally, the experimental studies conducted on four different datasets demonstrate that the explanation graph generated by GNNInterpreter can match the desired graph pattern when the model is ideal and reveal potential model pitfalls, if any exist.","AI Interpretability, Graph Neural Networks, Model-Level Explanation of Neural Networks" Intra-Instance VICReg: Bag of Self-Supervised Image Patch Embedding Explains the Performance,https://openreview.net/forum?id=J923QzIz8Sh,https://openreview.net/pdf?id=J923QzIz8Sh,We show that Siamese-network-based SSL methods essentially learn a distributed representation of image patches and aggregate them to form the instance representation.,"Recently, self-supervised learning (SSL) has achieved tremendous empirical advancements in learning image representation. However, our understanding and knowledge of the representation are still limited. This work shows that the success of the SOTA Siamese-network-based SSL approaches is primarily based on learning a distributed representation of image patches. In particular, we show that when we learn a representation only for fixed-scale image patches and aggregate different patch representations for an image (instance), it achieves results on par with or even better than the baseline methods on several benchmarks. Further, we show that the patch representation aggregation can also improve various SOTA baseline methods by a large margin.
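A rough sketch of the model-level explanation objective in the GNNInterpreter entry above: learn independent Bernoulli edge probabilities whose soft adjacency maximizes a frozen GNN's logit for the target class. The `gnn(adj)` interface and the bare logit objective are assumptions; the paper optimizes a more elaborate objective with additional regularizers and feature distributions.

```python
import torch

def explain_class(gnn, num_nodes: int, target_class: int, steps: int = 500) -> torch.Tensor:
    """Learn edge logits so that the induced soft adjacency excites the
    frozen `gnn` (a callable mapping an (N, N) adjacency to class logits)
    on the target class, then sample a discrete explanation graph."""
    edge_logits = torch.zeros(num_nodes, num_nodes, requires_grad=True)
    opt = torch.optim.Adam([edge_logits], lr=0.1)
    for _ in range(steps):
        adj = torch.sigmoid(edge_logits)       # soft adjacency in [0, 1]
        loss = -gnn(adj)[target_class]         # ascend the target-class logit
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.bernoulli(torch.sigmoid(edge_logits))  # sampled explanation graph
```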
We also establish a formal connection between the Siamese-network-based SSL objective and the modeling of image-patch co-occurrence statistics, which supplements the prevailing invariance perspective. By visualizing the nearest neighbors of different image patches in the embedding space and projection space, we show that while the projection has more invariance, the embedding space tends to preserve more equivariance and locality. While it is important to push the SOTA engineering frontier, we show that simplifying the SOTA methods to build better understanding is also a promising direction.","self-supervised learning, explainable machine learning, co-occurrence statistics modeling" Rethinking Symbolic Regression: Morphology and Adaptability in the Context of Evolutionary Algorithms,https://openreview.net/forum?id=OPGy07PojsZ,https://openreview.net/pdf?id=OPGy07PojsZ,,"Symbolic Regression (SR) is the well-studied problem of finding closed-form analytical expressions that describe the relationship between variables in a measurement dataset. In this paper, we rethink SR from two perspectives: morphology and adaptability. Morphology: Current SR algorithms typically use several man-made heuristics to influence the morphology (or structure) of the expressions in the search space. These man-made heuristics may introduce unintentional bias and data leakage, especially with the relatively few equation-recovery benchmark problems available for evaluating SR approaches. To address this, we formulate a novel minimalistic approach, based on constructing a depth-aware mathematical language model trained on terminal walks of expression trees, as a replacement for these heuristics. Adaptability: Current SR algorithms tend to select expressions based on only a single fitness function (e.g., MSE on the training set). We promote the use of an adaptability framework in evolutionary SR which uses fitness functions that alternate across generations. This leads to robust expressions that perform well on the training set and are close to the true functional form. We demonstrate this by alternating fitness functions that quantify faithfulness to values (via MSE) and empirical derivatives (via a novel theoretically justified fitness metric coined MSEDI). Proof-of-concept: We combine these ideas into a minimalistic evolutionary SR algorithm that outperforms all benchmark and state-of-the-art SR algorithms in problems with unknown constants added, which we claim are more reflective of SR performance for real-world applications. Our claim is then strengthened by reproducing the superior performance on real-world regression datasets from SRBench. For researchers interested in equation-recovery problems, we also propose a set of conventions that can be used to promote fairness in comparison across SR methods and to reduce unintentional bias.", "Efficient, probabilistic analysis of combinatorial neural codes",https://openreview.net/forum?id=NAuVe6pQ7Jb,https://openreview.net/pdf?id=NAuVe6pQ7Jb,"We improve the computational complexity of previous methods and introduce a hypothesis-checking procedure to study algebraic, geometric, and topological features of neural codes.","Artificial and biological neural networks (ANNs and BNNs) can encode inputs in the form of combinations of individual neurons' activities. These combinatorial neural codes present a computational challenge for direct and efficient analysis due to their high dimensionality and often large volumes of data.
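The bag-of-patches claim in the Intra-Instance VICReg entry above reduces to a simple recipe: embed fixed-scale patches independently and average them into the instance representation. A sketch, with the patch size/stride and the `encoder` interface as assumptions.

```python
import torch

def instance_from_patches(encoder, images: torch.Tensor,
                          patch: int = 32, stride: int = 16) -> torch.Tensor:
    """Embed fixed-scale patches of each image independently via `encoder`
    (a callable mapping (N, C, patch, patch) -> (N, D)) and mean-pool the
    patch embeddings to obtain one representation per image."""
    b, c = images.shape[:2]
    patches = images.unfold(2, patch, stride).unfold(3, patch, stride)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, c, patch, patch)
    z = encoder(patches)                                # one embedding per patch
    return z.reshape(b, -1, z.shape[-1]).mean(dim=1)    # aggregate per image

toy_encoder = lambda p: p.flatten(1)  # stand-in for a trained patch encoder
feats = instance_from_patches(toy_encoder, torch.randn(2, 3, 64, 64))
```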
Here we improve the computational complexity -- from factorial to quadratic time -- of direct algebraic methods previously applied to small examples and apply them to large neural codes generated by experiments. These methods provide a novel and efficient way of probing algebraic, geometric, and topological characteristics of combinatorial neural codes and provide insights into how such characteristics are related to learning and experience in neural networks. We introduce a procedure to perform hypothesis testing on the intrinsic features of neural codes using information geometry. We then apply these methods to neural activities from an ANN for image classification and a BNN for 2D navigation to, without observing any inputs or outputs, estimate the structure and dimensionality of the stimulus or task space. Additionally, we demonstrate how an ANN varies its internal representations across network depth and during learning.","neural code, topology, algebra, information geometry" On Pre-training Language Model for Antibody,https://openreview.net/forum?id=zaq4LV55xHl,https://openreview.net/pdf?id=zaq4LV55xHl,,"Antibodies are vital proteins offering robust protection for the human body from pathogens. The development of both general protein and antibody-specific pre-trained language models has facilitated antibody prediction tasks. However, few studies comprehensively explore the representation capability of distinct pre-trained language models on different antibody problems. Here, to investigate the problem, we aim to answer the following key questions: (1) How do pre-trained language models perform in antibody tasks with different specificity? (2) How much benefit will the model gain if we introduce specific biological mechanisms into the pretraining process? (3) Do the learned antibody pre-trained representations make sense in real-world antibody problems, like drug discovery and immune process understanding? Previously, the lack of an available benchmark largely hindered studies aiming to answer these questions. To facilitate the investigation, we provide an AnTibody Understanding Evaluation (ATUE) benchmark. We comprehensively evaluate the performance of protein pretrained language models through empirical studies, along with conclusions and new insights.", Challenging Common Assumptions about Catastrophic Forgetting,https://openreview.net/forum?id=LoOd40EaGA8,https://openreview.net/pdf?id=LoOd40EaGA8,We propose a framework SCoLe (Scaling Continual Learning) to study knowledge accumulation in continual learning with SGD training.,"Standard gradient descent algorithms applied to sequences of tasks are known to induce catastrophic forgetting in deep neural networks. When trained on a new task, the model's parameters are updated in a way that degrades performance on past tasks. This article explores continual learning (CL) on long sequences of tasks sampled from a finite environment. \textbf{We show that in this setting, learning with stochastic gradient descent (SGD) results in knowledge retention and accumulation without specific memorization mechanisms.} This is in contrast to the current notion of forgetting from the CL literature, which shows that training on new tasks with such an approach results in forgetting previous tasks, especially in class-incremental settings. To study this phenomenon, we propose an experimental framework, \Scole{} (Scaling Continual Learning), which allows generating arbitrarily long task sequences.
Our experiments show that the previous results obtained on relatively short task sequences may not reveal certain phenomena that emerge in longer ones.","Continual Learning, Knowledge Accumulation, Scaling" Learning to reason over visual objects,https://openreview.net/forum?id=uR6x8Be7o_M,https://openreview.net/pdf?id=uR6x8Be7o_M,,"A core component of human intelligence is the ability to identify abstract patterns governing complex, high-dimensional perceptual data, as exemplified by visual reasoning tasks such as Raven’s Progressive Matrices (RPM). Motivated by the goal of designing AI systems with this capacity, recent work has focused on evaluating whether neural networks can learn to solve RPM-like problems. This work has generally found that strong performance on these problems requires the incorporation of inductive biases that are specific to the RPM problem format, raising the question of whether such models might be more broadly useful. Here, we investigated the extent to which a general-purpose mechanism for processing visual scenes in terms of objects might enable abstract visual reasoning. We found that a simple model, consisting only of an object-centric encoder and a transformer reasoning module, displayed performance approaching state-of-the-art methods on two challenging RPM-like benchmarks (PGM and I-RAVEN), suggesting that an inductive bias for object-centric processing is a key component of visual reasoning, and may supplant some problem-specific inductive biases. ", Imitating Graph-Based Planning with Goal-Conditioned Policies,https://openreview.net/forum?id=6lUEy1J5R7p,https://openreview.net/pdf?id=6lUEy1J5R7p,We train goal-conditioned policies guided by decisions from graph-based planning.,"Recently, graph-based planning algorithms have gained much attention for solving goal-conditioned reinforcement learning (RL) tasks: they provide a sequence of subgoals to reach the target-goal, and the agents learn to execute subgoal-conditioned policies. However, the sample-efficiency of such RL schemes still remains a challenge, particularly for long-horizon tasks. To address this issue, we present a simple yet effective self-imitation scheme which distills a subgoal-conditioned policy into the target-goal-conditioned policy. Our intuition here is that to reach a target-goal, an agent should pass through a subgoal, so target-goal- and subgoal-conditioned policies should be similar to each other. We also propose a novel scheme of stochastically skipping executed subgoals in a planned path, which further improves performance. Unlike prior methods that only utilize graph-based planning in an execution phase, our method transfers knowledge from a planner along with a graph into policy learning. We empirically show that our method can significantly boost the sample-efficiency of the existing goal-conditioned RL methods under various long-horizon control tasks.","Reinforcement Learning, Goal-Conditioned Reinforcement Learning" Prefer to Classify: Improving Text Classifier via Pair-wise Preference Learning,https://openreview.net/forum?id=CTX5JcDaUX9,https://openreview.net/pdf?id=CTX5JcDaUX9,,"The development of large human-annotated benchmarks has driven the success of deep neural networks in various NLP tasks. These benchmarks are collected by aggregating decisions made by different annotators on the target task. Aggregating the annotated decisions via majority vote is still common practice, despite the inevitable limitations of such simple aggregation.
In this paper, we establish a novel classification framework, based on task-specific human preference between a pair of samples, which provides an informative training signal to capture fine-grained and complementary task information through pair-wise comparison. Hence, it improves the existing instance-wise annotation system by enabling better task modeling from learning the relation between samples. Specifically, we propose a new multi-task learning framework, called prefer-to-classify (P2C), to effectively learn human preferences in addition to the given classification task. We collect human preference signals in two ways: (1) extracting relative preferences implicitly from annotation records (for free) or (2) collecting subjective preferences explicitly from (paid) crowd workers. In various text classification tasks, we demonstrate that both extractive and subjective preferences are effective in improving the classifier with our preference learning framework. Interestingly, we found that subjective preferences show more significant improvements than extractive preferences, revealing the effectiveness of explicit modeling of human preferences. Our code and preference dataset will be publicly available upon acceptance.","NLP, text classification, annotation, disagreement, preference" "Seeing Differently, Acting Similarly: Heterogeneously Observable Imitation Learning",https://openreview.net/forum?id=3ULaIHxn9u7,https://openreview.net/pdf?id=3ULaIHxn9u7,,"In many real-world imitation learning tasks, the demonstrator and the learner have to act under different observation spaces. This situation brings significant obstacles to existing imitation learning approaches, since most of them learn policies under homogeneous observation spaces. On the other hand, previous studies under different observation spaces make the strong assumption that these two observation spaces coexist during the entire learning process. However, in reality, observation coexistence is limited due to the high cost of acquiring expert observations. In this work, we study this challenging problem with limited observation coexistence under heterogeneous observations: Heterogeneously Observable Imitation Learning (HOIL). We identify two underlying issues in HOIL: the dynamics mismatch and the support mismatch, and further propose the Importance Weighting with REjection (IWRE) algorithm based on importance weighting and learning with rejection to solve HOIL problems. Experimental results show that IWRE can solve various HOIL tasks, including the challenging tasks of transforming vision-based demonstrations into random access memory (RAM)-based policies in the Atari domain, even with limited visual observations.","Imitation Learning, Heterogeneous Observation, Importance Weighting, Learning with Rejection" Simple and Deep Graph Attention Networks,https://openreview.net/forum?id=uzwakrSKHyT,https://openreview.net/pdf?id=uzwakrSKHyT,,"Graph Attention Networks (GATs) and Graph Convolutional Neural Networks (GCNs) are two state-of-the-art architectures in Graph Neural Networks (GNNs). It is well known that both models suffer from performance degradation when more GNN layers are stacked, and many works have been devoted to addressing this problem. We notice that the main research efforts in this line focus on GCN models, and their techniques do not fit GAT models well due to the inherent differences between the two architectures.
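A minimal sketch of the multi-task idea in the prefer-to-classify (P2C) entry above: a standard classification loss plus a Bradley-Terry-style logistic loss on scalar preference scores for annotated pairs. The separate preference head and the unit task weighting are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn.functional as F

def p2c_style_loss(logits_a, logits_b, labels_a, labels_b,
                   pref_score_a, pref_score_b, pref):
    """Joint loss over a sample pair (a, b): cross-entropy on both samples'
    class logits, plus a pairwise preference term. `pref` is 1 when
    annotators preferred a over b, 0 otherwise; the preference term is a
    logistic (Bradley-Terry) loss on the score difference."""
    cls_loss = F.cross_entropy(logits_a, labels_a) + F.cross_entropy(logits_b, labels_b)
    pref_loss = F.binary_cross_entropy_with_logits(pref_score_a - pref_score_b, pref.float())
    return cls_loss + pref_loss

la, lb = torch.randn(4, 3), torch.randn(4, 3)
loss = p2c_style_loss(la, lb, torch.randint(0, 3, (4,)), torch.randint(0, 3, (4,)),
                      torch.randn(4), torch.randn(4), torch.randint(0, 2, (4,)))
```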
In GATs, the attention mechanism fails to account for the overwhelming propagation from certain nodes as the number of layers increases. To sufficiently utilize the expressive power of GATs, we propose a new version of GAT named Layer-wise Self-adaptive GAT (LSGAT), which can effectively alleviate the oversmoothing issue in deep GATs. We redesign the attention coefficient computation mechanism so that it is adaptively adjusted by layer depth, considering both immediate neighboring and non-adjacent nodes from a global view. The experimental evaluation confirms that LSGAT consistently achieves better results on node classification tasks than relevant counterparts.","Graph Attention Networks, Deep Graph Neural Networks" A theoretical study of inductive biases in contrastive learning,https://openreview.net/forum?id=AuEgNlEAmed,https://openreview.net/pdf?id=AuEgNlEAmed,We provide the first theoretical analysis of self-supervised learning that incorporates the effect of inductive biases of model classes.,"Understanding self-supervised learning is important but challenging. Previous theoretical works study the role of pretraining losses, and view neural networks as general black boxes. However, the recent work of [Saunshi et al.] argues that the model architecture --- a component largely ignored by previous works --- also has significant influences on the downstream performance of self-supervised learning. In this work, we provide the first theoretical analysis of self-supervised learning that incorporates the effect of inductive biases originating from the model class. In particular, we focus on contrastive learning --- a popular self-supervised learning method that is widely used in the vision domain. We show that when the model has limited capacity, contrastive representations would recover certain special clustering structures that are compatible with the model architecture, but ignore many other clustering structures in the data distribution. As a result, our theory can capture the more realistic setting where contrastive representations have much lower dimensionality than the number of clusters in the data distribution. We instantiate our theory on several synthetic data distributions, and provide empirical evidence to support the theory.","theory of self-supervised learning, theory of contrastive learning" Combinatorial Pure Exploration of Causal Bandits,https://openreview.net/forum?id=pBBsrPzq7aF,https://openreview.net/pdf?id=pBBsrPzq7aF,Combinatorial pure exploration algorithm of causal bandits on two different models,"The combinatorial pure exploration of causal bandits is the following online learning task: given a causal graph with unknown causal inference distributions, in each round we choose a subset of variables to intervene on, or do no intervention, and observe the random outcomes of all random variables, with the goal that using as few rounds as possible, we can output an intervention that gives the best (or almost best) expected outcome on the reward variable $Y$ with probability at least $1-\delta$, where $\delta$ is a given confidence level. We provide the first gap-dependent and fully adaptive pure exploration algorithms on two types of causal models --- the binary generalized linear model (BGLM) and general graphs. For BGLM, our algorithm is the first to be designed specifically for this setting and achieves polynomial sample complexity, while all existing algorithms for general graphs either have sample complexity exponential in the graph size or rely on unreasonable assumptions.
For general graphs, our algorithm provides a significant improvement in sample complexity, and it nearly matches the lower bound we prove. Our algorithms achieve this improvement by a novel integration of prior causal bandit algorithms and prior adaptive pure exploration algorithms, the former of which utilize the rich observational feedback in causal bandits but are not adaptive to reward gaps, while the latter are adaptive to reward gaps but do not exploit such feedback.","Bandit, causal bandit, pure exploration" How to fine-tune vision models with SGD,https://openreview.net/forum?id=rnFOPhTMB0Y,https://openreview.net/pdf?id=rnFOPhTMB0Y,"SGD can do worse than AdamW under distribution shifts, but simple changes make SGD competitive","SGD (with momentum) and AdamW are the two most commonly used optimizers for fine-tuning large neural networks in computer vision. When the two methods perform the same, SGD is preferable because it uses less memory and is more efficient than AdamW. However, when evaluating on downstream tasks that differ significantly from pretraining, we find that across five popular benchmarks SGD fine-tuning gets substantially lower accuracies than AdamW on many modern vision models such as Vision Transformers and ConvNeXts---especially out-of-distribution (OOD). We find that such large gaps arise in instances where the fine-tuning gradients in the first (``embedding'') layer are much larger than in the rest of the model. Our analysis suggests an easy fix: if we simply freeze the embedding layer (0.7\% of the parameters), SGD performs competitively with AdamW while using less memory across a suite of benchmarks. Our insights lead to state-of-the-art accuracies on popular distribution shift benchmarks: WILDS-FMoW, WILDS-Camelyon, BREEDS-Living-17, Waterbirds, and DomainNet.","fine-tuning, SGD, freezing layers, distribution shift" Computational Language Acquisition with Theory of Mind,https://openreview.net/forum?id=C2ulri4duIs,https://openreview.net/pdf?id=C2ulri4duIs,Analyzing the effects of Theory of Mind and environment complexity on language acquisition models.,"Unlike current state-of-the-art language models, young children actively acquire language through interactions with their surrounding environment and caretakers. One mechanism that has been argued to be critical to language learning is the ability to infer the mental states of other agents in social environments, coined Theory of Mind (ToM) by Premack & Woodruff (1978). Drawing inspiration from the modern operationalized versions of ToM implemented in Rabinowitz et al. (2018) and Zhu et al. (2021), we build language-learning agents equipped with ToM, and measure its effects on the learning process. We model ToM by giving the speaker agent an internal listener model that is trained alongside the speaker and using this ToM model to rerank potential utterances. We also experiment with varying task difficulty, with the hypothesis that stronger environmental pressures will promote the development of more complex language. We find that speakers trained with a ToM listener component have higher accuracies than those trained without one in our image referential game setting. We also find that increasing task difficulty in the training process results in more fluent, higher-quality utterances in evaluation.
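The fix suggested in the fine-tuning-with-SGD entry above is directly actionable. A sketch using torchvision's ViT-B/16, where `conv_proj` serves as the patch-embedding layer; which module counts as the "embedding layer" in other architectures is model-specific, and the hyper-parameters are illustrative.

```python
import torch
import torchvision

# Freeze the patch-embedding layer (a small fraction of the parameters),
# then fine-tune everything else with plain SGD + momentum.
model = torchvision.models.vit_b_16(weights="IMAGENET1K_V1")
for p in model.conv_proj.parameters():  # conv_proj = ViT patch embedding
    p.requires_grad_(False)

opt = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=1e-3,
    momentum=0.9,
)
```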
This suggests the utility of incorporating ToM, as well as other insights from child language acquisition, into computational models thereof.","language acquisition, theory of mind, referential games, natural language processing" Rényi Supervised Contrastive Learning for Transferable Representation,https://openreview.net/forum?id=kpXJYIoMlho,https://openreview.net/pdf?id=kpXJYIoMlho,We present an effective and robust method to learn transferable representation by Rényi supervised contrastive learning.,"A central goal of representation learning is to train a feature that can transfer to various tasks or datasets. A conventional approach is to pre-train a neural network on a large-scale labeled dataset, e.g., ImageNet, and use its feature for downstream tasks. However, the feature often lacks transferability due to the class-collapse issue; existing supervised losses (such as cross-entropy) restrain the intra-class variation and limit the capability of learning rich representations. This issue becomes more severe when pre-training datasets are class-imbalanced or coarse-labeled. To address the problem, we propose a new representation learning method, named R\'enyi supervised contrastive learning~(R\'enyiSCL), which can effectively learn transferable representation using a labeled dataset. Our main idea is to use the recently proposed self-supervised R\'enyi contrastive learning in the supervised setup. We show that R\'enyiSCL can mitigate the class-collapse problem by contrasting features with both instance-wise and class-wise information. Through experiments on the ImageNet dataset, we show that R\'enyiSCL outperforms all supervised and self-supervised methods under various transfer learning tasks. In particular, we also validate the effectiveness of R\'enyiSCL under class-imbalanced or coarse-labeled datasets.","Supervised Learning, Representation Learning, Contrastive Learning, Transfer Learning" MiDAS: Multi-integrated Domain Adaptive Supervision for Fake News Detection,https://openreview.net/forum?id=DaYt6DAA-JK,https://openreview.net/pdf?id=DaYt6DAA-JK,We use Lipschitz smoothness and probabilistic Lipschitzness to build a theoretical foundation for effective multi-domain adaptation using randomized perturbations on unseen data.,"COVID-19-related misinformation and fake news, coined an 'infodemic', has dramatically increased over the past few years. This misinformation exhibits concept drift, where the distribution of fake news changes over time, reducing the effectiveness of previously trained models for fake news detection. Given a set of fake news models trained on multiple domains, we propose an adaptive decision module to select the best-fit model for a new sample. We propose MiDAS, a multi-domain adaptive approach for fake news detection that ranks the relevancy of existing models to new samples. MiDAS contains two components: a domain-invariant encoder and an adaptive model selector. MiDAS integrates multiple pre-trained and fine-tuned models with their training data to create a domain-invariant representation. Then, MiDAS uses local Lipschitz smoothness of the invariant embedding space to estimate each model's relevance to a new sample. Higher-ranked models provide predictions, and lower-ranked models abstain. We evaluate MiDAS on generalization to drifted data with nine fake news datasets, each obtained from different domains and modalities. MiDAS achieves new state-of-the-art performance on multi-domain adaptation for out-of-distribution fake news classification. 
","multi-domain adaptation, lipschitz continuity, text classification, weak supervision, team-of-experts" Walking the Tightrope: An Investigation of the Convolutional Autoencoder Bottleneck,https://openreview.net/forum?id=7zv_wSgP-LN,https://openreview.net/pdf?id=7zv_wSgP-LN,We investigate the effect of feature map size vs. number of channels in the bottleneck of convolutional autoecoders and find that tuning the former is significantly more important than the latter.,"In this paper, we present an in-depth investigation of the convolutional autoencoder (CAE) bottleneck. Autoencoders (AE), and especially their convolutional variants, play a vital role in the current deep learning toolbox. Researchers and practitioners employ CAEs for various tasks, ranging from outlier detection and compression to transfer and representation learning. Despite their widespread adoption, we have limited insight into how the bottleneck shape impacts the CAE's emergent properties. We demonstrate that increased bottleneck area (i.e., height $\times$ width) drastically improves generalization in terms of reconstruction error while also speeding up training. The number of channels in the bottleneck, on the other hand, is of secondary importance. Furthermore, we show empirically that CAEs do not learn to copy their input, even when all layers have the same number of neurons as there are pixels in the input (i.e. there is no bottleneck). Besides raising important questions for further research, our findings are directly applicable to two of the most common use-cases for CAEs: In image compression, it is advantageous to increase the feature map size in the bottleneck as this greatly improves reconstruction quality. For reconstruction-based outlier detection, we recommend decreasing the feature map size so that out-of-distribution samples will yield a higher reconstruction error.","autoencoders, unsupervised learning, representation learning, investigation" A Closer Look at Model Adaptation using Feature Distortion and Simplicity Bias,https://openreview.net/forum?id=wkg_b4-IwTZ,https://openreview.net/pdf?id=wkg_b4-IwTZ,"Mitigating feature distortion is not enough to ensure that transfer learning from large-scale, pretrained models leads to better safety and generalization on downstream tasks.","Advances in the expressivity of large-scale pretrained models have increased interest in the design of adaptation protocols which enable safe and effective transfer learning. Going beyond conventional linear probing (LP) and fine tuning (FT) strategies, protocols that can effectively control feature distortion, i.e., the failure to update features orthogonal to the in-distribution, during FT have been found to achieve improved out-of-distribution generalization. A popular example is the recent LP+FT protocol which first learns a linear probe and then uses that initialization during FT. However, in this paper, we find that when adaptation protocols are also evaluated on a variety of safety objectives (e.g., calibration, robustness etc.), that a complementary perspective to feature distortion is required explain protocol behavior. To this end, we study the susceptibility of protocols to simplicity bias (SB), i.e. the well-known propensity of neural networks to rely upon simple features, as SB has recently been shown to underlie several problems in robust generalization. 
Using a synthetic dominoes dataset obtained by pairing (complex) CIFAR10 with (simple) MNIST samples, we demonstrate the susceptibility of existing protocols to SB. Given the strong effectiveness of LP+FT, we propose incorporating hardness-promoting perturbations during LP to obtain initializations for FT that further decrease SB. We verify the effectiveness of these modified LP+FT protocols, showing that they decrease SB on the dominoes dataset and jointly improve OOD generalization and safety on standard adaptation benchmarks.","Transfer Learning, Robustness, Adaptation, Data Augmentation" Pareto Invariant Risk Minimization,https://openreview.net/forum?id=esFxSb_0pSL,https://openreview.net/pdf?id=esFxSb_0pSL,We introduce a novel Multi-Objective Optimization perspective to understand and alleviate the optimization dilemma in Out-of-Distribution generalization.,"Recently, there has been a growing surge of interest in enabling machine learning systems to generalize well to Out-of-Distribution (OOD) data. Most efforts are devoted to advancing optimization objectives that regularize models to capture the underlying invariance; however, there often are compromises in the optimization process of these OOD objectives: i) Many OOD objectives have to be relaxed as penalty terms of Empirical Risk Minimization (ERM) for the ease of optimization, while the relaxed forms can weaken the robustness of the original objective; ii) The penalty terms also require careful tuning of the penalty weights due to the intrinsic conflicts between ERM and OOD objectives. Consequently, these compromises could easily lead to suboptimal performance of either the ERM or OOD objective. To address these issues, we introduce a multi-objective optimization (MOO) perspective to understand the OOD optimization process, and propose a new optimization scheme called PAreto Invariant Risk Minimization (PAIR). PAIR improves the robustness of OOD objectives by cooperatively optimizing with other OOD objectives, thereby bridging the gaps caused by the relaxations. Then PAIR approaches a Pareto optimal solution that trades off the ERM and OOD objectives properly. Extensive experiments on challenging benchmarks, WILDS, show that PAIR alleviates the compromises and yields top OOD performances.","Out-of-Distribution Generalization, Optimization, Multi-Objective Optimization, Causal Invariance" Understanding and Adopting Rational Behavior by Bellman Score Estimation,https://openreview.net/forum?id=WzGdBqcBicl,https://openreview.net/pdf?id=WzGdBqcBicl,"We estimate the Bellman score in order to solve IRL, reward transfer, and counterfactual prediction problems","We are interested in solving a class of problems that seek to understand and adopt rational behavior from demonstrations. We may broadly classify these problems into four categories of reward identification, counterfactual analysis, behavior imitation, and behavior transfer. In this work, we make a key observation that knowing how changes in the underlying rewards affect the optimal behavior allows one to solve a variety of the aforementioned problems. To a local approximation, this quantity is precisely captured by what we term the Bellman score, i.e., the gradient of the log probabilities of the optimal policy with respect to the reward. We introduce the Bellman score operator which provably converges to the gradient of the infinite-horizon optimal Q-values with respect to the reward, which can then be used to directly estimate the score.
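A sketch of one "dominoes" sample of the kind used in the model-adaptation entry above: a simple MNIST digit stacked on a complex CIFAR-10 image, with the two halves' labels correlated during training so that a model can shortcut via the simple half. Padding and layout choices are illustrative, not necessarily the paper's exact construction.

```python
import torch
import torch.nn.functional as F

def make_domino(mnist_img: torch.Tensor, cifar_img: torch.Tensor) -> torch.Tensor:
    """Build a (3, 64, 32) domino from a (1, 28, 28) MNIST digit and a
    (3, 32, 32) CIFAR image: the digit is a 'simple' feature, the CIFAR
    content a 'complex' one; correlating their labels induces SB."""
    m = mnist_img.expand(3, 28, 28)        # grayscale -> 3 channels
    m = F.pad(m, (2, 2, 2, 2))             # 28x28 -> 32x32
    return torch.cat([m, cifar_img], dim=1)  # stack vertically

domino = make_domino(torch.rand(1, 28, 28), torch.rand(3, 32, 32))
```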
Guided by our theory, we derive a practical score-learning algorithm which can be used for score estimation in high-dimensional state-action spaces. We show that score-learning can be used to reliably identify rewards, perform counterfactual predictions, achieve state-of-the-art behavior imitation, and transfer policies across environments. ",Inverse Reinforcement Learning Meta-Learning for Bootstrapping Medical Image Segmentation from Imperfect Supervision,https://openreview.net/forum?id=yd5kGP5_VVE,https://openreview.net/pdf?id=yd5kGP5_VVE,A meta-based learning method for medical image segmentation under imperfect supervision,"Medical imaging has witnessed remarkable progress but usually requires a large amount of high-quality annotated data which is time-consuming and costly to obtain. To alleviate the annotation burden, learning from imperfect supervision (scarce or noisy annotations) has received much attention recently. In this paper, we present Meta-Learning for Bootstrapping Medical Image Segmentation (MLB-Seg), a unified meta-learning framework to sufficiently exploit the potential of imperfect supervision for medical image segmentation. In the face of noisy labeled data and unlabeled data, we first learn a segmentation model from a small clean set to generate initial labels for the unlabeled data and then gradually leverage the learner’s own predictions (i.e., the online pseudo labels) to bootstrap itself up via meta-learning. Specifically, MLB-Seg learns to dynamically assign per-pixel weight maps to both the imperfect labels (including both the generated labels and the noisy labels), as well as the pseudo labels commensurately to facilitate the bootstrapping procedure, where the weights are determined in a meta-process. To further improve the quality of the pseudo labels, we apply a consistency-based Pseudo Label Enhancement (PLE) scheme by ensembling predictions from various augmented versions of the same input. Noticing that the weight maps from these augmented variants can be extremely noisy from the meta-update, mean teacher is introduced into PLE to stabilize the weight map generation from the student (target) meta-learning model. Extensive experimental results on the public atrial and prostate segmentation datasets demonstrate that our method 1) achieves the state-of-the-art result under both semi- and noisy-supervision; 2) is robust against various types of imperfect supervision. Code is publicly available at https://anonymous.4open.science/r/MLB-Seg-C80E.","Semi-supervised learning, Meta-learning, Noisy labeling, Medical image segmentation" L2B: Learning to Bootstrap for Combating Label Noise,https://openreview.net/forum?id=qVI1MqX52Xm,https://openreview.net/pdf?id=qVI1MqX52Xm,A simple and effective method for combating the label noise via joint instance and label reweighting,"Deep neural networks are powerful tools for representation learning, but can easily overfit to noisy labels which are prevalent in many real-world scenarios. Generally, noisy supervision could stem from variation among labelers, label corruption by adversaries, etc. To combat such label noise, one popular approach is to apply customized weights to the training instances, so that the corrupted examples contribute less to the model learning. However, such learning mechanisms potentially erase important information about the data distribution and therefore yield suboptimal results.
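For the Bellman-score entry above, the definition (gradient of optimal-policy log-probabilities with respect to the reward) takes a convenient closed form if one assumes a soft-optimal policy $\pi^*_r(a \mid s) \propto \exp Q^*_r(s,a)$, an assumption made here purely for illustration:

```latex
\nabla_r \log \pi^*_r(a \mid s)
  = \nabla_r Q^*_r(s,a) - \sum_{a'} \pi^*_r(a' \mid s)\, \nabla_r Q^*_r(s,a')
```

Under this reading, an estimate of $\nabla_r Q^*_r$ (the quantity the score operator converges to) immediately yields the score.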
To leverage useful information from the corrupted instances, an alternative is the bootstrapping loss, which reconstructs new training targets on-the-fly by reweighting the real labels and the network's own predictions (i.e., pseudo labels). In this paper, we propose a more generic learnable loss objective which enables a joint reweighting of instances and labels at once. Specifically, our method dynamically adjusts the $\textit{per-sample importance weight}$ between the real observed labels and pseudo-labels, where the weights are efficiently determined in a meta process. Compared to the previous instance reweighting methods, our approach concurrently conducts implicit relabeling, and thereby yields substantial improvements with almost no extra cost. Extensive experimental results demonstrate the strengths of our approach over existing methods on multiple natural and medical image benchmark datasets, including CIFAR-10, CIFAR-100, ISIC2019 and Clothing1M. Code will be made publicly available.","learning with noisy labels, bootstrapping, meta-learning, medical image analysis" What Makes Convolutional Models Great on Long Sequence Modeling?,https://openreview.net/forum?id=TGJSPbRpJX-,https://openreview.net/pdf?id=TGJSPbRpJX-,We propose a simple Structured Global Convolution kernel for long-range dependencies.,"Convolutional models have been widely used in multiple domains. However, most existing models only use local convolution, making the model unable to handle long-range dependencies efficiently. Attention overcomes this problem by aggregating global information based on the pair-wise attention score but also makes the computational complexity quadratic in the sequence length. Recently, Gu et al. proposed a model called S4 inspired by the state space model. S4 can be efficiently implemented as a global convolutional model whose kernel size equals the input sequence length. With the Fast Fourier Transform, S4 can model much longer sequences than Transformers and achieve significant gains over SoTA on several long-range tasks. Despite its empirical success, S4 is involved: it requires sophisticated parameterization and initialization schemes that combine the wisdom from several prior works. As a result, S4 is less intuitive and hard to use for researchers with limited prior knowledge. Here we aim to demystify S4 and extract basic principles that contribute to the success of S4 as a global convolutional model. We focus on the structure of the convolution kernel and identify two critical but intuitive principles enjoyed by S4 that are sufficient to make up an effective global convolutional model: 1) The parameterization of the convolutional kernel needs to be efficient in the sense that the number of parameters should scale sub-linearly with sequence length. 2) The kernel needs to satisfy a decaying structure such that the weights for convolving with closer neighbors are larger than those for more distant ones. Based on the two principles, we propose a simple yet effective convolutional model called Structured Global Convolution (SGConv). SGConv exhibits strong empirical performance over several tasks: 1) With faster speed, SGConv surpasses the previous SoTA on Long Range Arena and Speech Command datasets.
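A minimal sketch of the learnable bootstrapping target in the L2B entry above: mix the observed label with the model's own soft prediction using per-sample weights. In the paper the weights come from a meta-process; here they are simply passed in.

```python
import torch
import torch.nn.functional as F

def bootstrap_loss(logits: torch.Tensor, y_obs: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against a per-sample mixture of the observed (possibly
    noisy) one-hot label and the network's own soft prediction; w in [0, 1]
    is the per-sample importance weight on the observed label."""
    pseudo = F.softmax(logits.detach(), dim=-1)
    onehot = F.one_hot(y_obs, logits.shape[-1]).float()
    target = w[:, None] * onehot + (1 - w[:, None]) * pseudo
    return -(target * F.log_softmax(logits, dim=-1)).sum(-1).mean()

logits = torch.randn(4, 10, requires_grad=True)
loss = bootstrap_loss(logits, torch.randint(0, 10, (4,)), torch.rand(4))
loss.backward()
```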
2) When plugging SGConv into standard language and vision models, it shows the potential to improve both efficiency and performance.","Convolutional Neural Network, Deep Learning Architectures, Long-range dependence, Reparameterization" Progressive Mixup Augmented Teacher-Student Learning for Unsupervised Domain Adaptation,https://openreview.net/forum?id=CUOhDJGy3Mn,https://openreview.net/pdf?id=CUOhDJGy3Mn,,"Unsupervised Domain Adaptation (UDA) aims to transfer knowledge learned from a labeled source domain to an unlabeled target domain, mostly through learning a domain-invariant feature representation. Currently, the best performing UDA methods use category-level domain alignment to capture fine-grained information, resulting in significantly improved performance over global alignment. While successful, category-level UDA methods suffer from unreliable pseudo-labels for target data. Additionally, most UDA methods directly adapt from source to target domain without regard for the large domain discrepancy. In this paper, we propose a UDA approach with teacher-student learning where the teacher network is used to provide more reliable target pseudo-labels for the student network to train with. Furthermore, we use a progressive mixup augmentation strategy which generates intermediate samples that become increasingly target-dominant as training progresses. Aligning the source and intermediate domains allows the model to gradually transfer fine-grained domain knowledge from the source to the target domain while minimizing the negative impact of noisy target pseudo-labels. This progressive mixup augmented teacher-student (PMATS) training strategy along with class subset sampling and clustering-based pseudo-label refinement achieves state-of-the-art performance on two public UDA benchmark datasets: Office-31 and Office-Home.","Unsupervised Domain Adaptation, Progressive Mixup Augmentation, Teacher-Student Learning" M$^3$SAT: A Sparsely Activated Transformer for Efficient Multi-Task Learning from Multiple Modalities,https://openreview.net/forum?id=_QkHfB07QMN,https://openreview.net/pdf?id=_QkHfB07QMN,Adapt the mixture-of-experts (MoEs) into both the self-attention and the feed-forward networks (FFN) of a transformer backbone for efficient multi-task learning from multiple modalities.,"Multi-modal multi-task learning (M$^2$TL) aims to discover the implicit correspondences among heterogeneous modalities and tasks, which is common in real-world applications like autonomous driving and robotics control. Current single-model solutions for M$^2$TL usually fall short in several aspects. The shared backbone between the modalities is prone to overfitting the simpler modality, while jointly optimizing the tasks suffers from unstable training due to the gradient conflicts across tasks. On the other hand, designing a separate model for each task and modality can avoid the above problems but leads to prohibitively expensive computation and memory consumption, rendering this approach unrealistic. In this work, we propose M$^3$SAT, a sparsely activated transformer for efficient M$^2$TL. The proposed framework tailors the mixture-of-experts (MoEs) into both the self-attention and the feed-forward networks (FFN) of a transformer backbone. It adopts a routing policy to assign attention-heads and FFN experts during training, which effectively disentangles the parameter space to prevent training conflicts among diverse modalities and tasks.
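The two principles extracted in the SGConv entry above (sub-linear parameter count, decaying weights) can be made concrete. A sketch in which a fixed number of small parameter blocks are upsampled to doubling widths and geometrically damped, then applied via FFT convolution; the interpolation mode and decay schedule are illustrative assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn.functional as F

def decaying_global_kernel(blocks: torch.Tensor, length: int, decay: float = 0.5) -> torch.Tensor:
    """blocks: (S, w) small learnable blocks. Block i is upsampled to width
    w * 2**i (parameters scale sub-linearly in kernel length) and scaled by
    decay**i (distant positions get smaller weights)."""
    width = blocks.shape[-1]
    parts = []
    for i, b in enumerate(blocks):
        up = F.interpolate(b[None, None], size=width * 2 ** i, mode="linear")
        parts.append((decay ** i) * up[0, 0])
    return torch.cat(parts)[:length]

k = decaying_global_kernel(torch.randn(5, 16), length=256)
x = torch.randn(256)
# zero-padded FFT convolution: O(L log L) instead of O(L^2)
y = torch.fft.irfft(torch.fft.rfft(x, 512) * torch.fft.rfft(k, 512), 512)[:256]
```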
Meanwhile, the disentangled parameter space also mitigates the tendency of the simpler modality to overfit. Sparsely activating the transformer also enables efficient computation for each input sample. Through comprehensive evaluation, we demonstrate the effectiveness of our M$^3$SAT: a remarkable performance margin (\textit{e.g.}, $\ge 1.37\%$) is achieved over dense models with the same computation cost. More importantly, M$^3$SAT can achieve the above performance improvements with a fraction of the computation cost -- our computation is only $1.38\% \sim 53.51\%$ of that of the SOTA methods. Our code will be released upon acceptance.","multi-task learning, multimodal learning, transformer, mixture of experts" Editing models with task arithmetic,https://openreview.net/forum?id=6t0Kwf8-jrj,https://openreview.net/pdf?id=6t0Kwf8-jrj,"We study a new paradigm for editing pre-trained models, where weight vectors obtained via fine-tuning can be combined to efficiently and effectively steer model behavior.","Changing how pre-trained models behave---e.g., improving their performance on a downstream task or mitigating biases learned during pre-training---is a common practice when developing machine learning systems. In this work, we propose a new paradigm for steering the behavior of neural networks, centered around task vectors. A task vector specifies a direction in the weight space of a pre-trained model, such that movement in that direction improves performance on the task. We build task vectors by subtracting the weights of a pre-trained model from the weights of the same model after fine-tuning on a task. We show that these task vectors can be modified and combined together through arithmetic operations such as negation and addition, and the behavior of the resulting model is steered accordingly. Moreover, task vectors can be added together to improve performance on multiple tasks at once. Finally, when tasks are linked by an analogy relationship of the form ``A is to B as C is to D"", combining task vectors from three of the tasks can improve performance on the fourth, even when no data from the fourth task is used for training.","pre-trained models, model editing, model patching, fine-tuning, transfer learning, weight interpolation, merging models" Structured World Representations via Block-Slot Attention,https://openreview.net/forum?id=ZPHE4fht19t,https://openreview.net/pdf?id=ZPHE4fht19t,"We propose a novel object-centric representation called block-slots, which unlike the conventional slots, provides within-slot disentanglement via vector-formed factor representations. ","In this paper, we propose a novel object-centric representation, called Block-Slot Representation, which, unlike the conventional slot representation, provides concept-level disentanglement within a slot. A block-slot is constructed by composing a set of modular concept representations, called blocks, generated from a memory of concept prototypes. We call this slot construction process Block-Slot Attention. Block-Slot Attention facilitates the emergence of abstract concept blocks within a slot, such as color, position, and texture, without any supervision. This brings the benefits of disentanglement into slots and makes the representation more interpretable. Similar to Slot Attention, this mechanism can be used as a drop-in module in any arbitrary neural architecture. In experiments, we show that our model disentangles object properties significantly better than previous methods, including on complex textured scenes.
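Because task vectors in "Editing models with task arithmetic" (above) are just element-wise differences of weights, the negation and addition operations reduce to a few lines over state dicts. A minimal sketch; the constant perturbations below are toy stand-ins for real fine-tuning runs.

```python
import copy
import torch
import torch.nn as nn

def task_vector(pretrained: nn.Module, finetuned: nn.Module):
    """tau = theta_finetuned - theta_pretrained, computed tensor by tensor."""
    ft = finetuned.state_dict()
    return {k: ft[k] - v for k, v in pretrained.state_dict().items()}

def apply_task_vectors(pretrained: nn.Module, vectors, coeffs):
    """Steer the model: theta = theta_pre + sum_i coeff_i * tau_i.
    A negative coefficient negates a task (forgetting); adding several
    vectors targets multiple tasks at once."""
    edited = copy.deepcopy(pretrained)
    state = edited.state_dict()
    for tau, lam in zip(vectors, coeffs):
        for k in state:
            state[k] = state[k] + lam * tau[k]
    edited.load_state_dict(state)
    return edited

# Toy stand-ins: the constant shifts play the role of fine-tuning.
pre = nn.Linear(8, 2)
ft_a, ft_b = copy.deepcopy(pre), copy.deepcopy(pre)
with torch.no_grad():
    for p in ft_a.parameters():
        p += 0.10   # pretend fine-tuning on task A
    for p in ft_b.parameters():
        p -= 0.05   # pretend fine-tuning on task B
tau_a, tau_b = task_vector(pre, ft_a), task_vector(pre, ft_b)
multi = apply_task_vectors(pre, [tau_a, tau_b], coeffs=[1.0, 1.0])  # add tasks
forget_a = apply_task_vectors(pre, [tau_a], coeffs=[-1.0])          # negate task A
```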
We also demonstrate the ability to compose novel scenes by recombining slots at the block level.","object-centric learning, unsupervised representation learning, disentanglement, concept learning" Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis,https://openreview.net/forum?id=PUIqjT4rzq7,https://openreview.net/pdf?id=PUIqjT4rzq7,We propose a training-free approach to incorporate language structures for compositional text-to-image synthesis,"Large-scale diffusion models have demonstrated remarkable performance on text-to-image synthesis (T2I). Despite their ability to generate high-quality and creative images, users still observe images that do not align well with the text input, especially when involving multiple objects. In this work, we strive to improve the compositional skills of existing large-scale T2I models, specifically more accurate attribute binding and better image compositions. We propose to incorporate language structures into the cross-attention layers based on a recently discovered property of diffusion-based T2I models. Our method is implemented on a state-of-the-art model, Stable Diffusion, and achieves better compositional skills in both qualitative and quantitative results. Our structured cross-attention design is also efficient, requiring no additional training samples. Lastly, we conduct an in-depth analysis to reveal potential causes of incorrect image compositions and justify the properties of cross-attention layers in the generation process. ","Text-to-Image Synthesis, Diffusion Models, Compositional Generation" Atomized Deep Learning Models,https://openreview.net/forum?id=_lPNXhQ4uvS,https://openreview.net/pdf?id=_lPNXhQ4uvS,,"Deep learning models often tackle the intra-sample structure, such as the order of words in a sentence and pixels in an image, but have not paid much attention to the inter-sample relationship. In this paper, we show that explicitly modeling the inter-sample structure to be more discretized can potentially improve a model's expressivity. We propose a novel method, Atom Modeling, that can discretize a continuous latent space by drawing an analogy between a data point and an {\it atom}, which is naturally spaced away from other atoms with distances depending on their internal structures. Specifically, we model each data point as an atom composed of electrons, protons, and neutrons and minimize the potential energy caused by the interatomic force among data points. Through experiments with qualitative analysis of our proposed Atom Modeling on synthetic and real datasets, we find that Atom Modeling can improve performance by maintaining the inter-sample relation and can capture an interpretable intra-sample relation by mapping each component in a data point to an electron/proton/neutron.", Topology Matters in Fair Graph Learning: a Theoretical Pilot Study,https://openreview.net/forum?id=TVMjn0RpLHf,https://openreview.net/pdf?id=TVMjn0RpLHf,A theoretical pilot study to show why GNN amplifies prediction bias,"Recent advances in fair graph learning observe that graph neural networks (GNNs) further amplify prediction bias compared with multilayer perceptrons (MLPs), while the reason behind this is unknown. In this paper, we conduct a theoretical analysis of the bias amplification mechanism in GNNs. This is a challenging task since GNNs are difficult to interpret, and real-world networks are complex.
To bridge the gap, we theoretically and experimentally demonstrate that the aggregation operation in representative GNNs accumulates bias in node representations due to the topology bias induced by the graph structure. We provide a sufficient condition, in terms of the statistical properties of the graph data, under which graph aggregation amplifies prediction bias in GNNs. Motivated by this data-centric finding, we propose a fair graph refinement algorithm, named \textit{FairGR}, to rewire the graph topology to reduce the sensitive homophily coefficient while preserving useful graph topology. Experiments on node classification tasks demonstrate that \textit{FairGR} can mitigate prediction bias with comparable performance on three real-world datasets. Additionally, \textit{FairGR} is compatible with many state-of-the-art methods, such as regularization, adversarial debiasing, and fair mixup, by refining the graph topology. Therefore, \textit{FairGR} is a plug-in fairness method and can be adapted to improve existing fair graph learning strategies. ","Graph Neural Networks, Fairness, Topology" Context-Aware Image Completion,https://openreview.net/forum?id=YlmzborbHTy,https://openreview.net/pdf?id=YlmzborbHTy,,"Image completion is a task that aims to fill in the missing region of a masked image with plausible contents. However, existing image completion methods tend to fill in the missing region with the surrounding texture instead of hallucinating a visual instance that is suitable in accordance with the context of the scene. In this work, we propose a novel image completion model, dubbed Refill, that hallucinates the missing instance that harmonizes well with - and thus preserves - the original context. Refill first adopts a transformer architecture that considers the types and locations of the visible instances, and the location of the missing region. Then, Refill completes the missing foreground and background semantic segmentation masks within the missing region, providing pixel-level semantic and structural guidance to generate missing contents with seamless boundaries. Finally, we condition the image synthesis blocks using the completed segmentation mask to generate photo-realistic contents to fill in the missing region. Experimental results show the superiority of Refill over state-of-the-art image completion approaches on various natural images.","Image Completion, Image Inpainting" Speech denoising by listening to noise,https://openreview.net/forum?id=wNnaozRwl5O,https://openreview.net/pdf?id=wNnaozRwl5O,,"Speech denoising is the task of obtaining clean speech from a speech signal corrupted by background noise. Except in high-end recording studios, we do not get a clean speech signal, as some background noise, or noise due to the recording device, is always present. We propose an approach to denoise a noisy speech signal by modeling the noise explicitly. Existing approaches model speech, potentially of multiple speakers, for denoising. Such approaches have an inherent drawback, as a separate model is required for each speaker. We show that instead of modeling speaker(s), modeling the noise helps obtain a unified speaker-independent denoiser, cf.\ speaker-dependent ones in existing popular approaches. In addition to a novel speech denoising network, we also propose a large-scale noise dataset, \texttt{AudioNoiseSet}, derived from the AudioSet dataset, to train our model.
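The quantity that FairGR (above) rewires against can be made concrete. A small sketch of the sensitive homophily coefficient, assuming the standard edge-homophily definition applied to a sensitive attribute rather than a class label; the degree-preserving swap is hand-picked for illustration and is not the paper's rewiring algorithm.

```python
import numpy as np

def sensitive_homophily(edges, sens):
    """Fraction of edges joining nodes that share the sensitive attribute
    (the usual edge-homophily definition, applied to a sensitive attribute
    instead of the class label)."""
    edges = np.asarray(edges)
    return float(np.mean(sens[edges[:, 0]] == sens[edges[:, 1]]))

sens = np.array([0, 0, 0, 1, 1, 1])                        # binary attribute
edges = np.array([[0, 1], [1, 2], [3, 4], [4, 5], [2, 3]])
print(sensitive_homophily(edges, sens))  # 0.8: highly homophilous

# A FairGR-style rewiring swaps endpoints between same-group edges to create
# cross-group ones while preserving every node's degree:
rewired = edges.copy()
rewired[1], rewired[2] = [1, 4], [3, 2]  # hand-picked swap for illustration
print(sensitive_homophily(rewired, sens))  # 0.4
```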
We show that our model outperforms prior approaches by a significant margin on a large-scale, in-the-wild speech dataset, i.e., AVSpeech, with standard quantitative metrics. In addition, we show with multiple human ratings that the method is preferred over state-of-the-art approaches. The user study also points towards limitations of the metrics used, which we discuss. We also provide many qualitative results to demonstrate the improvements. ", Can Agents Run Relay Race with Strangers? Generalization of RL to Out-of-Distribution Trajectories,https://openreview.net/forum?id=ipflrGaf7ry,https://openreview.net/pdf?id=ipflrGaf7ry,,"In this paper, we evaluate and improve the generalization performance for reinforcement learning (RL) agents on the set of “controllable” states, in which good policies exist to achieve high rewards. An RL agent that generally masters a task should reach its goal starting from any controllable state of the environment, without memorizing actions specialized for a small set of states. To practically evaluate generalization performance in these states, we propose relay-evaluation, which involves starting the test agent from the middle of trajectories of other independently trained, high-reward stranger agents. With extensive experimental evaluation, we show the prevalence of generalization failure on controllable states from stranger agents. For example, in the Humanoid environment, we observed that a well-trained Proximal Policy Optimization (PPO) agent, with only a 3.9% failure rate during regular testing, failed on 81.6% of the states generated by well-trained stranger PPO agents. To improve generalization, we propose a novel method called Self-Trajectory Augmentation (STA), which does not rely on training multiple agents and does not noticeably increase training costs. After applying STA to the Soft Actor-Critic (SAC) training procedure, we reduced the failure rate of SAC under relay-evaluation by more than three times in most settings without impacting agent performance or increasing the number of environment interactions needed.","Generalization, Reinforcement Learning" DYNAMIC BATCH NORM STATISTICS UPDATE FOR NATURAL ROBUSTNESS,https://openreview.net/forum?id=q9Tv6sR3jp2,https://openreview.net/pdf?id=q9Tv6sR3jp2,,"DNNs trained on natural clean samples have been shown to perform poorly on corrupted samples, such as noisy or blurry images. Various data augmentation methods have been recently proposed to improve DNNs' robustness against common corruptions. Despite their success, they require computationally expensive training and cannot be applied to off-the-shelf trained models. Recently, updating only the BatchNorm (BN) statistics of a model on a single corruption has been shown to significantly improve its accuracy on that corruption. However, adopting the idea at inference time when the type of corruption changes decreases the effectiveness of this method. In this paper, we harness the Fourier domain to detect the corruption type, a challenging task in the image domain. We propose a unified framework consisting of a corruption-detection model and a BN statistics update that can improve the corruption accuracy of any off-the-shelf trained model. We benchmark our framework on different models and datasets. Our results demonstrate about 8% and 4% accuracy improvement on CIFAR10-C and ImageNet-C, respectively.
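The BN-statistics update at the heart of the corruption-robustness framework above takes only a few lines of PyTorch: run unlabeled corrupted batches through the frozen model in train mode so that only the BatchNorm running statistics adapt. The Fourier-domain corruption detector is omitted; this sketch assumes the corruption type has already been identified.

```python
import torch
import torch.nn as nn

def update_bn_stats(model, corrupted_batches, steps=50):
    """Refresh only the BatchNorm running statistics on (unlabeled) corrupted
    batches; all learned weights stay frozen. In the framework above, this
    would run once per corruption type flagged by the detector (not shown)."""
    was_training = model.training
    model.train()  # BN layers update running stats only in train mode
    with torch.no_grad():  # no gradients, so weights cannot change
        for i, x in enumerate(corrupted_batches):
            model(x)
            if i + 1 >= steps:
                break
    model.train(was_training)
    return model

# Toy demonstration with a stand-in model and Gaussian-noise "corruption".
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
clean = torch.randn(16, 3, 32, 32)
corrupted_batches = [clean + 0.5 * torch.randn_like(clean) for _ in range(10)]
bn = model[1]
before = bn.running_var.clone()
update_bn_stats(model, corrupted_batches)
print((bn.running_var - before).abs().mean())  # stats moved toward corruption
```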
Furthermore, our framework can improve the accuracy of state-of-the-art robust models, such as AugMix and DeepAug.", SKTformer: A Skeleton Transformer for Long Sequence Data,https://openreview.net/forum?id=NBES8BZ5wnZ,https://openreview.net/pdf?id=NBES8BZ5wnZ,We design an efficient Transformer model for long sequence data,"Transformers have become a preferred tool for modeling sequential data. Many studies of using Transformers for long sequence modeling focus on reducing computational complexity. They usually exploit the low-rank structure of data and approximate a long sequence by a sub-sequence. One challenge with such approaches is how to make an appropriate tradeoff between information preservation and noise reduction: the longer the sub-sequence used to approximate the long sequence, the better the information is preserved, but at the price of introducing more noise into the model and, of course, more computational cost. We propose the skeleton transformer, SKTformer for short, an efficient transformer architecture that effectively addresses this tradeoff. It introduces two mechanisms to effectively reduce the impact of noise while still keeping the computation linear in the sequence length: a smoothing block to mix information over long sequences and a matrix sketch method that simultaneously selects columns and rows from the input matrix. We verify the effectiveness of SKTformer both theoretically and empirically. Extensive studies on both the Long Range Arena (LRA) datasets and six time-series forecasting tasks show that SKTformer significantly outperforms both the vanilla Transformer and other state-of-the-art variants of the Transformer. Code is available at https://anonymous.4open.science/r/SKTFormer-B33B/","Efficient Transformer, Long Sequence Data, CUR decomposition, Robustness, matrix sketching" CktGNN: Circuit Graph Neural Network for Electronic Design Automation,https://openreview.net/forum?id=NE2911Kq1sp,https://openreview.net/pdf?id=NE2911Kq1sp,,"The electronic design automation of analog circuits has been a longstanding challenge in the integrated circuit field due to the huge design space and complex design trade-offs among circuit specifications. In the past decades, intensive research efforts have mostly been devoted to automating transistor sizing with a given circuit topology. By recognizing the graph nature of circuits, this paper presents a Circuit Graph Neural Network (CktGNN) that simultaneously automates the circuit topology generation and device sizing based on encoder-dependent optimization subroutines. Particularly, CktGNN encodes circuit graphs using a two-level GNN framework (of nested GNN) where circuits are represented as combinations of subgraphs in a known subgraph basis. In this way, it significantly improves efficiency by reducing the number of subgraphs over which to perform message passing. Nonetheless, another critical roadblock to advancing learning-assisted circuit design automation is the lack of public benchmarks to perform canonical assessment and reproducible research. To tackle the challenge, we introduce Open Circuit Benchmark (OCB), an open-sourced dataset that contains $10$K distinct operational amplifiers with carefully-extracted circuit specifications from physical implementations. OCB also comes with circuit generation and evaluation capabilities such that it can be used to generalize the applicability of CktGNN to design various analog circuits by efficiently producing corresponding datasets.
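SKTformer's simultaneous row/column selection (above) can be illustrated with a CUR-style skeleton approximation of the input matrix. The norm-based top-k selection below is a simplified stand-in for the paper's learned sketching, and the smoothing block is omitted.

```python
import torch

def skeleton_sketch(X, r):
    """CUR-style skeleton of X (seq_len x dim): keep the r highest-norm rows
    and columns and reconstruct X ~= C @ pinv(U) @ R. A simplified stand-in
    for SKTformer's simultaneous, learned row/column selection."""
    row_idx = X.norm(dim=1).topk(r).indices
    col_idx = X.norm(dim=0).topk(r).indices
    C = X[:, col_idx]                    # selected columns   (L x r)
    R = X[row_idx, :]                    # selected rows      (r x d)
    U = X[row_idx][:, col_idx]           # their intersection (r x r)
    return C @ torch.linalg.pinv(U) @ R  # rank-<=r surrogate for X

# Long, approximately low-rank input plus noise: the skeleton preserves the
# signal with cost linear in the sequence length L = 4096.
X = torch.randn(4096, 8) @ torch.randn(8, 64) + 0.01 * torch.randn(4096, 64)
X_hat = skeleton_sketch(X, r=16)
print((X - X_hat).norm() / X.norm())  # small relative error
```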
Experiments on OCB show the clear advantages of CktGNN in representation-based optimization frameworks over other recent powerful GNN baselines and over manual designs from human experts. Our work paves the way toward a learning-based, open-sourced design automation flow for analog circuits.","Graph Neural Networks, Electronic Design Automation, Benchmark Graph Dataset" How Should I Plan? A Performance Comparison of Decision-Time vs. Background Planning,https://openreview.net/forum?id=UYsYdOn-A7e,https://openreview.net/pdf?id=UYsYdOn-A7e,Understanding under what conditions and in which settings decision-time planning will perform better than background planning and vice versa.,"In model-based reinforcement learning, an agent can leverage a learned model to improve its way of behaving in different ways. Two prevalent approaches are decision-time planning and background planning. In this study, we are interested in understanding under what conditions and in which settings one of these two planning styles will perform better than the other in domains that require fast responses. After viewing them through the lens of dynamic programming, we first consider the classical instantiations of these planning styles and provide theoretical results and hypotheses on which one will perform better in the pure planning, planning & learning, and transfer learning settings. We then consider the modern instantiations of these planning styles and provide hypotheses on which one will perform better in the last two of the considered settings. Lastly, we perform several illustrative experiments to empirically validate both our theoretical results and hypotheses. Overall, our findings suggest that even though decision-time planning does not perform as well as background planning in their classical instantiations, in their modern instantiations, it can perform on par with or better than background planning in both the planning & learning and transfer learning settings.","Reinforcement learning, model-based reinforcement learning, dynamic programming, transfer learning" Substructure-Atom Cross Attention for Molecular Representation Learning,https://openreview.net/forum?id=WFewvIEb0aT,https://openreview.net/pdf?id=WFewvIEb0aT,This paper proposes a novel network that utilizes molecular substructures along with local atom embeddings. ,"Designing a neural network architecture for molecular representation is crucial for AI-driven drug discovery and molecule design. In this work, we propose a new framework for molecular representation learning. Our contribution is threefold: (a) demonstrating the usefulness of incorporating substructures into node-wise features from molecules, (b) designing two branch networks consisting of a transformer and a graph neural network so that the networks are fused with asymmetric attention, and (c) not requiring heuristic features or computationally-expensive information from molecules. Using 1.8 million molecules collected from the ChEMBL and PubChem databases, we pretrain our network to learn a general representation of molecules with minimal supervision. The experimental results show that our pretrained network achieves competitive performance on 11 downstream tasks for molecular property prediction.
","Molecular representation learning, Molecular Substructure, Cross-attention" Differentially Private Algorithms for Smooth Nonconvex ERM,https://openreview.net/forum?id=cBNfRYPtvFY,https://openreview.net/pdf?id=cBNfRYPtvFY,We develop simple differentially private optimization algorithms that move along directions of (expected) descent to find approximate second-order necessary solutions for non-convex ERM problems.,"We develop simple differentially private optimization algorithms that move along directions of (expected) descent to find approximate second-order necessary solution for non-convex ERM problems. We use line search, mini-batching, and a two-phase strategy to improve the speed and practicality of the algorithm. Numerical experiments demonstrate the effectiveness of these approaches.","differential privacy, optimization, machine learning, ERM" Untangling Effect and Side Effect: Consistent Causal Inference in Non-Targeted Trials,https://openreview.net/forum?id=vxln_lFKkfc,https://openreview.net/pdf?id=vxln_lFKkfc,We propose an algorithm that provably recovers hidden effect groups in causal studies,"A treatment is usually appropriate for some group (the ``sick"" group) on whom it has an effect, but it can also have a side-effect when given to subjects from another group (the ``healthy"" group). In a non-targeted trial both sick and healthy subjects may be treated, producing heterogeneous effects within the treated group. Inferring the correct treatment effect on the sick population is then difficult, because the effect and side-effect are tangled. We propose an efficient nonparametric approach to untangling the effect and side-effect, called PCM (pre-cluster and merge). We prove its asymptotic consistency in a general setting and show, on synthetic data, more than a 10x improvement in accuracy over existing state-of-the-art. ","Causal Inference, Non Targeted Trials, Machine Learning, Heterogeneous Treatment Effects" AMA: Asymptotic Midpoint Augmentation for Margin Balancing and Moderate Broadening,https://openreview.net/forum?id=CwFcw5DBVOR,https://openreview.net/pdf?id=CwFcw5DBVOR,,"Margin plays an important role like alignment and uniformity for regularization, as shown in contrastive learning literature. However, feature augmentation has been rarely analyzed in this framework, despite their effective regularization. In this paper, we focus on the analysis framework for feature augmentations and propose a novel method to gradually push a decision boundary to the midpoint of related representations via their augmentation, called $\textit{asymptotic midpoint augmentation}$ (AMA). The method induces two effects: 1) balancing the margin for all classes and 2) only moderately broadening the margin until it holds maximal confidence. Each effect addresses the low uniformity of feature augmentations and representation collapse by excessively low alignment of contrastive learning, respectively. 
We empirically analyze the effects in a toy task for clear visualization and validate the impact on original, long-tailed, and coarse-to-fine transfer tasks on CIFAR-10 and CIFAR-100.", STUNT: Few-shot Tabular Learning with Self-generated Tasks from Unlabeled Tables,https://openreview.net/forum?id=_xlsjehDvlY,https://openreview.net/pdf?id=_xlsjehDvlY,We propose a few-shot tabular learning framework that meta-learns over the self-generated tasks from unlabeled tables.,"Learning with few labeled tabular samples is often an essential requirement for industrial machine learning applications, as varieties of tabular data suffer from high annotation costs or have difficulties in collecting new samples for novel tasks. Despite its importance, such a problem is quite under-explored in the field of tabular learning, and existing few-shot learning schemes from other domains are not straightforward to apply, mainly due to the heterogeneous characteristics of tabular data. In this paper, we propose a simple yet effective framework for few-shot tabular learning, coined Self-generated Tasks from UNlabeled Tables (STUNT). Our key idea is to self-generate diverse few-shot tasks by treating randomly chosen columns as a target label. We then employ a meta-learning scheme to learn generalizable knowledge with the constructed tasks. Moreover, we introduce an unsupervised validation scheme for hyperparameter search (and early stopping) by generating a pseudo-validation set using STUNT from unlabeled data. Our experimental results demonstrate that our simple framework brings significant performance gains under various tabular few-shot learning benchmarks, compared to prior semi- and self-supervised baselines.","Tabular representation learning, Few-shot learning, Unsupervised meta-learning" MEDIC: Model Backdoor Removal by Importance Driven Cloning,https://openreview.net/forum?id=qHcR93949op,https://openreview.net/pdf?id=qHcR93949op,We propose importance driven cloning to remove backdoors in machine learning models.,"We develop a novel method to remove injected backdoors in Deep Learning models. It works by cloning the benign behaviors of a trojaned model to a new model of the same structure. It trains the clone model from scratch on a very small subset of samples and aims to minimize a cloning loss that measures the differences between the activations of important neurons across the two models. The set of important neurons varies for each input, depending on their magnitude of activations and their impact on the classification result. Our experiments show that our technique can effectively remove nine different types of backdoors with minor benign accuracy degradation, outperforming state-of-the-art backdoor removal techniques that are based on fine-tuning, knowledge distillation, and neuron pruning.","Backdoor Removal, Cloning" The Role of Pre-training Data in Transfer Learning,https://openreview.net/forum?id=q_PkAzGFrmq,https://openreview.net/pdf?id=q_PkAzGFrmq,"We investigate the role of pretraining distribution, data curation, size, and loss on downstream transfer learning","The transfer learning paradigm of model pre-training and subsequent fine-tuning produces high-accuracy models. However, a question remains: what data and method should be used for pre-training? We study the effect of the pre-training distribution on transfer learning in the context of image classification. Through controlled experiments, we find that the pre-training dataset is initially important for low-shot transfer.
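STUNT's task self-generation (above) is straightforward to sketch: treat a randomly chosen column as the target, discretize it into pseudo-classes, and sample a few-shot episode with the column masked out. The k-means discretization and episode sizes are illustrative assumptions, not necessarily the paper's choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def stunt_task(X_unlabeled, n_way=3, k_shot=5, n_query=10, seed=0):
    """Self-generate one few-shot episode from an unlabeled table: pick a
    random column, discretize it into n_way pseudo-classes (k-means is one
    reasonable choice), and sample support/query sets. The chosen column is
    masked out of the features to avoid label leakage."""
    rng = np.random.default_rng(seed)
    col = rng.integers(X_unlabeled.shape[1])
    pseudo_y = KMeans(n_clusters=n_way, n_init=10, random_state=seed).fit_predict(
        X_unlabeled[:, [col]])
    X = np.delete(X_unlabeled, col, axis=1)  # mask the pseudo-target column
    support, query = [], []
    for c in range(n_way):
        idx = rng.permutation(np.where(pseudo_y == c)[0])
        support += list(idx[:k_shot])
        query += list(idx[k_shot:k_shot + n_query])
    return X[support], pseudo_y[support], X[query], pseudo_y[query]

X_unlab = np.random.randn(500, 12)
Xs, ys, Xq, yq = stunt_task(X_unlab)  # one episode for the meta-learner
print(Xs.shape, Xq.shape)  # (15, 11) (30, 11)
```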
However, the differences between distributions diminish as more data is made available for fine-tuning. Still, fine-tuning outperforms training from scratch. We also investigate dataset size and observe that larger pre-training datasets lead to better accuracy; however, the absolute accuracy difference is largest in the few-shot regime. Beyond data, we study the effect of the pre-training method, language-image contrastive vs. image-image contrastive, finding that the latter usually leads to better transfer accuracy.","pretraining, transfer learning, supervised training, contrastive learning, clip, simclr" Compressed Predictive Information Coding,https://openreview.net/forum?id=rde9B5ue32F,https://openreview.net/pdf?id=rde9B5ue32F,"This work proposes a novel information-theoretic framework, Compressed Predictive Information Coding (CPIC), to extract predictive latent representations from dynamic data","Unsupervised learning plays an important role in many fields, such as machine learning, data compression, and neuroscience. Compared to static data, methods for extracting low-dimensional structure from dynamic data are lagging. We developed a novel information-theoretic framework, Compressed Predictive Information Coding (CPIC), to extract predictive latent representations from dynamic data. Predictive information quantifies the ability to predict the future of a time series from its past. CPIC selectively projects the past (input) into a low-dimensional space that is predictive of the compressed data projected from the future (output). The key insight of our framework is to learn representations by balancing the minimization of compression complexity with the maximization of the predictive information in the latent space. We derive tractable variational bounds of the CPIC loss by leveraging bounds on mutual information. The CPIC loss induces the latent space to capture information that is maximally predictive of the future of the data from the past. We demonstrate that introducing stochasticity in the encoder and maximizing the predictive information in latent space contributes to learning more robust latent representations. Furthermore, our variational approaches perform better in mutual information estimation compared with the commonly used estimates under the Gaussian assumption. We show numerically on synthetic data that CPIC can recover dynamical systems embedded in noisy observation data with a low signal-to-noise ratio. Finally, we demonstrate that CPIC extracts features more predictive of forecasting exogenous variables as well as auto-forecasting in various real datasets compared with other state-of-the-art representation learning models. Together, these results indicate that CPIC will be broadly useful for extracting low-dimensional dynamic structure from high-dimensional, noisy time-series data.","Predictive information, time series, variational inference" Importance of Class Selectivity in Early Epochs of Training,https://openreview.net/forum?id=hcLpFslHraT,https://openreview.net/pdf?id=hcLpFslHraT,,"Deep networks trained for classification exhibit class-selective neurons in intermediate layers. Intriguingly, recent studies have shown that class-selective neurons are not strictly necessary for network function. But if class-selective neurons are not necessary, why do they exist? We attempt to answer this question in a series of experiments on ResNet-50 trained on ImageNet.
We begin by showing that class-selective neurons emerge in the first few epochs of training before receding rapidly. Single-neuron ablation experiments show that class-selective neurons are important for network function during this early phase of training. The network is close to a linear regime during this early training phase, which may explain the emergence of these class-selective neurons in intermediate layers. Finally, by regularizing against class selectivity at different points in training, we show that the emergence of these class-selective neurons during the first few epochs of training is essential to the successful training of the network. Altogether, our results indicate that class-selective neurons in intermediate layers are vestigial remains of the early epochs of training, during which they serve as quasi-linear shortcut solutions to the classification task that are essential to successful training of the network.", Mechanistic Mode Connectivity,https://openreview.net/forum?id=NZZoABNZECq,https://openreview.net/pdf?id=NZZoABNZECq,,"With the rise of pretrained models, fine-tuning has become of central importance in deep learning. However, unlike retraining from scratch, fine-tuning can fail to qualitatively change the behavior of a pre-trained network. For instance, we find in practice that naive fine-tuning does not eliminate a model’s sensitivity to spurious features. To understand and address this limitation, we study the geometry of neural network loss landscapes through the lens of mode-connectivity. Our work addresses two questions about mode-connectivity: 1) Are models trained on different data distributions mode-connected? 2) Can we fine-tune a pre-trained model to switch modes? We define a notion of mechanistic mode-connectivity, and find that only models that already share the same invariances (which we call “mechanistically similar”) are mechanistically mode-connected. We hypothesize this property explains the inability of naive fine-tuning methods to induce invariance to spurious features. Based on our analysis, we propose and validate a method of “mechanistic fine-tuning” called connectivity-based fine-tuning (CBFT).","Loss landscapes, Mechanisms, Mode Connectivity" CLASSIFICATION OF INCOMPLETE DATA USING AUGMENTED MLP,https://openreview.net/forum?id=z-lQRTEOpxV,https://openreview.net/pdf?id=z-lQRTEOpxV,A new way to train a multi layer perceptron such that it can classify incomplete data properly,"We introduce a new way to train a Multi-Layer Perceptron (MLP) to classify incomplete data. To achieve this, we train an MLP using a two-phased approach. In the first phase, we train an MLP using complete data. Before the second phase of training, we create an augmented dataset. For this, we use the non-missing data, delete each feature once, and then fill it using some predefined points. After that, in the second phase, we retrain the network using the augmented dataset. The aim of this type of training is to predict the class label of an incomplete dataset. At the time of testing, when a feature vector with a missing value appears, we initially impute it using the predefined points and find the class label of the feature vector using the trained network. We compare the proposed method with an original MLP on twelve datasets using four imputation strategies. The proposed method performs better than the originally trained MLP.","Classification, imputation, multi layer perceptron, missing data."
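The augmentation step of the two-phase MLP training above is easy to make concrete. A minimal sketch, assuming the per-feature training mean as the "predefined point" (the paper compares several imputation strategies):

```python
import numpy as np

def build_augmented_dataset(X, y, fill_values):
    """For each complete sample, delete each feature once and fill it with a
    predefined point (here: the per-feature training mean, one plausible
    choice), keeping the original label."""
    X_aug, y_aug = [X], [y]
    for j in range(X.shape[1]):
        Xj = X.copy()
        Xj[:, j] = fill_values[j]  # simulate feature j being missing
        X_aug.append(Xj)
        y_aug.append(y)
    return np.vstack(X_aug), np.concatenate(y_aug)

X = np.random.randn(200, 6)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
fill = X.mean(axis=0)
X2, y2 = build_augmented_dataset(X, y, fill)  # phase-2 training set
# Phase 1: fit an MLP on (X, y); phase 2: continue training on (X2, y2).
# At test time, impute a missing feature j with fill[j] before predicting.
```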
On the Convergence of Federated Deep AUC Maximization,https://openreview.net/forum?id=WVYJ0BaytpF,https://openreview.net/pdf?id=WVYJ0BaytpF,,"In many real-world applications, the distribution of data is skewed. Standard models, which are designed to optimize accuracy, have poor prediction performance when they are applied to imbalanced data tasks because the model could be dramatically biased toward its major class. Therefore, the area under the ROC curve (AUROC) was proposed as a useful metric to assess how well prediction models perform on imbalanced datasets. On the other hand, federated learning (FL) has attracted increasing attention with the emergence of distributed data due to its communication efficiency. To address the challenge of distributed imbalanced data, research on Federated Deep AUC Maximization (FDAM) is necessary. However, the FDAM problem is currently understudied and is more complex than traditional federated learning (FL) techniques since its minimization objective is non-decomposable over individual examples. In this study, we address FDAM for heterogeneous data by reformulating it as the popular non-convex strongly-concave min-max problem and propose the federated stochastic recursive momentum gradient ascent (FMGDA) algorithm, which can also be applied to general federated non-convex-strongly-concave minimax problems. Importantly, our method does not rely on strict assumptions, such as the PL condition, and we prove that it can achieve the $O(\epsilon^{-3})$ sample complexity, which reaches the best-known sample complexity of centralized methods. It also achieves the $O(\epsilon^{-2})$ communication complexity and a linear speedup in terms of the number of clients. Additionally, extensive experimental results show that our algorithm (i.e., FMGDA) empirically outperforms other algorithms, supporting its effectiveness.", Towards A Unified Neural Architecture for Visual Recognition and Reasoning,https://openreview.net/forum?id=uARGOm09Vnr,https://openreview.net/pdf?id=uARGOm09Vnr,"We propose a neural architecture that can unify visual recognition and spatiotemporal reasoning tasks, and use it to derive insights into how inductive biases, architectural choices, and recognition tasks can help enable reasoning capabilities.","Recognition and reasoning are two pillars of visual understanding. However, these tasks have an imbalance in focus; whereas recent advances in neural networks have shown strong empirical performance in visual recognition, there has been comparably much less success in solving visual reasoning. Intuitively, unifying these two tasks under a singular framework is desirable, as they are mutually dependent and beneficial. Motivated by the recent success of multi-task transformers for visual recognition and language understanding, we propose a unified neural architecture for visual recognition and reasoning tasks with a generic interface (e.g., tokens) for all tasks. Our framework enables the principled investigation of how different visual recognition tasks, datasets, and inductive biases can help enable spatiotemporal reasoning capabilities. Noticeably, we find that object detection, which requires spatial localization of individual objects, is the most beneficial recognition task for reasoning. We further demonstrate via probing that implicit object-centric representations emerge automatically inside our framework.
We also discover that visual reasoning and object detection respond to drastically different model components; certain architectural choices, such as the backbone model of the visual encoder, have a significant impact on visual reasoning but little on object detection. Given the results of our experiments, we believe that a fruitful direction forward is to consider visual reasoning a first-class citizen alongside visual recognition, as they are strongly correlated but benefit from potentially different design choices.","object-centric representation, detection, reasoning, pix2seq, transformers" BLOOM Large Language Models and the Chomsky Hierarchy,https://openreview.net/forum?id=jczReTpeJ0N,https://openreview.net/pdf?id=jczReTpeJ0N,The performance of the BLOOM large language models cannot be explained by the complexity of the languages in the Chomsky hierarchy.,"We study the performance of BLOOM large language models of different sizes on understanding 12 different languages from the Chomsky hierarchy using few-shot prompts. We investigate whether an increase in the complexity of the languages learned by the larger models can be characterized using the Chomsky hierarchy. We first show that prompting in BLOOM models enables reasoning with good accuracy on language tasks as diverse as stack manipulation, string reversal, odds first, and interlocked pairing, when the queries are over short strings, that is, small-bitwidth bit-vectors from the language. Second, we discover that the two largest models have the highest accuracy on such tasks for prompts with a fixed length, but smaller models are able to achieve similar accuracies with longer prompts. Unlike classical automata- or grammar-based approaches, where algorithms for more complex languages in the Chomsky hierarchy can also recognize simpler languages, we find that the performance of the BLOOM large language models cannot be explained by the complexity of the languages in the Chomsky hierarchy.","Large Language Models, Chomsky Hierarchy" WebBrain: Learning to Generate Factually Correct Articles for Queries by Grounding on Large Web Corpus,https://openreview.net/forum?id=eiuj6cNv4iI,https://openreview.net/pdf?id=eiuj6cNv4iI,,"In this paper, we introduce a new NLP task – generating short factual articles for queries by mining supporting evidence from the Web. In this task, called WebBrain, the ultimate goal is to generate a fluent, informative, and factually-correct short article (e.g., a Wiki article) for a factual query unseen in Wikipedia. To enable experiments on WebBrain, we construct a large-scale dataset, WebBrain-Raw, by extracting English Wikipedia articles and their crawlable Wiki references. WebBrain-Raw is ten times larger than the previous biggest peer dataset, which can greatly benefit the research community. We also empirically analyze the performance of current state-of-the-art NLP techniques on WebBrain and introduce a new framework, ReGen, which enhances generation factualness through improved evidence retrieval and task-specific pre-training for generation.
Experimental results show that ReGen outperforms all baselines in both automatic and human evaluations.","factual generation, retrieval-augmented generation, new large-scale dataset" HloEnv: A Graph Rewrite Environment for Deep Learning Compiler Optimization Research,https://openreview.net/forum?id=ZVnH2suWKRu,https://openreview.net/pdf?id=ZVnH2suWKRu,,"We introduce HloEnv, an environment based on Accelerated Linear Algebra (XLA) for deep learning (DL) compiler optimization research. HloEnv transforms all graph rewrites into a common representation, providing a flexible interface to control and modify existing graph optimization passes. In this representation, an XLA pass is converted into a set of sequential rewrite decisions, which control when and if the rewrites are applied. Along with HloEnv, we present a dataset with broad coverage of computation graphs drawn from modern real-world machine learning models. We select two XLA passes with the largest impact on the runtime of the compiled program and explore the potential for further improvement over XLA in this decision space. We show that using simple heuristics for decision-making can achieve on-par or better performance than XLA. Using search algorithms further boosts performance. We intend for HloEnv and our dataset to be an open-source, community-driven effort that helps spur advances in DL compiler optimization research.","XLA, Compiler Optimization, Graph Rewrite, Reinforcement Learning Environment" Towards Diverse Perspective Learning with Switch over Multiple Temporal Pooling,https://openreview.net/forum?id=B2ww5cqWq14,https://openreview.net/pdf?id=B2ww5cqWq14,,"Pooling is a widely used method for classification problems. In particular, poolings that consider temporal relationships have been proposed in the time series classification (TSC) domain. However, we find that temporal poolings are data-dependent: since each pooling has only one perspective, existing temporal poolings cannot solve the data-dependency problem with fixed-perspective learning. In this paper, we propose a novel pooling architecture for diverse perspective learning: switch over multiple pooling (SoM-TP). An extensive case study using layer-wise relevance propagation (LRP) reveals the distinct view that each pooling has and ultimately emphasizes the necessity of diverse perspective learning. Therefore, SoM-TP dynamically selects temporal poolings according to the characteristics of the time series data. An ablation study on SoM-TP shows how diverse perspective learning is achieved. Furthermore, pooling classification is investigated through input attribution by LRP. Extensive experiments are conducted on the UCR/UEA repository.","timeseries classification, temporal pooling, temporal relationship, perspective learning" Deep Latent State Space Models for Time-Series Generation,https://openreview.net/forum?id=ukWZS73ccwk,https://openreview.net/pdf?id=ukWZS73ccwk,,"Methods based on ordinary differential equations (ODEs) are widely used to build generative models of time-series. In addition to the high computational overhead of explicitly computing the hidden-state recurrence, existing ODE-based models fall short in learning sequence data with sharp transitions - common in many real-world systems - due to numerical challenges during optimization. In this work, we propose LS4, a generative model for sequences with latent variables evolving according to a state space ODE to increase modeling capacity.
Inspired by recent deep state space models (S4), we achieve speedups by leveraging a convolutional representation of LS4 which bypasses the explicit evaluation of hidden states. We show that LS4 significantly outperforms previous continuous-time generative models in terms of marginal distribution, classification, and prediction scores on real-world datasets in the Monash Forecasting Repository, and is capable of modeling highly stochastic data with sharp temporal transitions. LS4 sets the state of the art for continuous-time latent generative models, with significant improvements in mean squared error and tighter variational lower bounds on irregularly-sampled datasets, while also being 100x faster than other baselines on long sequences.", Specformer: Spectral Graph Neural Networks Meet Transformers,https://openreview.net/forum?id=0pdSt3oyJa1,https://openreview.net/pdf?id=0pdSt3oyJa1,We propose a novel set-to-set spectral graph filter by using a spectral domain Transformer.,"Spectral graph neural networks learn graph representations via spectral-domain graph convolutions. However, most existing spectral graph filters are scalar-to-scalar functions, i.e., mapping a single eigenvalue to a single filtered value, thus ignoring the global pattern of the spectrum. Furthermore, these filters are often constructed based on some fixed-order polynomials, which have limited expressiveness and flexibility. To tackle these issues, we introduce Specformer, which effectively encodes the set of all eigenvalues and performs self-attention in the spectral domain, leading to a learnable set-to-set spectral filter. We also design a decoder with learnable bases to enable non-local graph convolution. Importantly, Specformer is equivariant to permutation. By stacking multiple Specformer layers, one can build a powerful spectral graph neural network. On synthetic datasets, we show that our Specformer can better recover ground-truth spectral filters than other spectral GNNs. Extensive experiments on both node-level and graph-level tasks on real-world graph datasets show that our Specformer outperforms state-of-the-art GNNs and learns meaningful spectrum patterns.","Spectral Graph Neural Networks, Transformer" MetaP: How to Transfer Your Knowledge on Learning Hidden Physics,https://openreview.net/forum?id=MuoduaZpQxE,https://openreview.net/pdf?id=MuoduaZpQxE,Meta-learning method to transfer hidden physics,"Gradient-based meta-learning methods have primarily focused on classical machine learning tasks such as image classification and function regression, where they were found to perform well by recovering the underlying common representation among a set of given tasks. Recently, PDE-solving deep learning methods, such as neural operators, are starting to make an important impact on learning and predicting the response of a complex physical system directly from observational data. Since data acquisition in this context is commonly challenging and costly, the need to utilize and transfer existing knowledge to new and unseen physical systems is even more acute. Herein, we propose a novel meta-learning approach for transferring knowledge between neural operators, which can be seen as transferring the knowledge of solution operators between governing (unknown) PDEs with varying parameter fields.
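The set-to-set spectral filtering idea in the Specformer entry above can be sketched in a few lines: embed all Laplacian eigenvalues, let them interact through self-attention, and decode filtered eigenvalues. The paper's eigenvalue encoding and learnable-basis decoder are omitted; this is a bare-bones stand-in.

```python
import torch
import torch.nn as nn

class SetToSetSpectralFilter(nn.Module):
    """Filter all Laplacian eigenvalues jointly with self-attention, instead
    of mapping each eigenvalue independently as polynomial filters do. A
    bare-bones sketch of the Specformer idea."""

    def __init__(self, d_model=32, nhead=4):
        super().__init__()
        self.embed = nn.Linear(1, d_model)
        self.attn = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=64, batch_first=True)
        self.out = nn.Linear(d_model, 1)

    def forward(self, eigvals):                   # eigvals: (n,)
        h = self.embed(eigvals[None, :, None])    # (1, n, d_model)
        return self.out(self.attn(h)).squeeze()   # filtered eigenvalues (n,)

# Toy graph: filter the spectrum of a normalized Laplacian.
A = (torch.rand(8, 8) > 0.5).float()
A = ((A + A.T) > 0).float().fill_diagonal_(0)            # symmetric, no loops
d = A.sum(1).clamp(min=1)
L = torch.eye(8) - A / torch.sqrt(d[:, None] * d[None, :])
lam, U = torch.linalg.eigh(L)
lam_f = SetToSetSpectralFilter()(lam)
L_filtered = U @ torch.diag(lam_f) @ U.T  # learned graph convolution operator
print(L_filtered.shape)  # torch.Size([8, 8])
```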
With the key theoretical observation that the underlying parameter field can be captured in the first layer of the neural operator model, in contrast to the typical final-layer transfer in existing meta-learning methods, our approach yields a provably universal solution operator for multiple PDE solving tasks. As applications, we demonstrate the efficacy of our proposed approach on heterogeneous material modeling tasks, which shows that our method can handle complex and nonlinear physical response learning tasks while greatly improving the sampling efficiency in new and unseen materials.","meta-learning, neural operator, parametric PDEs" CommsVAE: Learning the brain's macroscale communication dynamics using coupled sequential VAEs,https://openreview.net/forum?id=5H9_FUPA9r8,https://openreview.net/pdf?id=5H9_FUPA9r8,"We address three issues with common connectivity approaches by explicitly modeling the directionality of communication, finding communication at each timestep, and encouraging sparsity.","Communication within or between complex systems is commonplace in the natural sciences and fields such as graph neural networks. The brain is a perfect example of such a complex system, where communication between brain regions is constantly being orchestrated. To analyze communication, the brain is often split up into anatomical regions that each perform certain computations. These regions must interact and communicate with each other to perform tasks and support higher-level cognition. On a macroscale, these regions communicate through signal propagation along the cortex and along white matter tracts over longer distances. When and what types of signals are communicated over time is an unsolved problem and is often studied using either functional or structural data. In this paper, we propose a non-linear generative approach to communication from functional data. We address three issues with common connectivity approaches by explicitly modeling the directionality of communication, finding communication at each timestep, and encouraging sparsity. To evaluate our model, we simulate temporal data that has sparse communication between nodes embedded in it and show that our model can uncover the expected communication dynamics. Subsequently, we apply our model to temporal neural data from multiple tasks and show that our approach models communication that is more specific to each task. The specificity of our method means it can have an impact on the understanding of psychiatric disorders, which are believed to be related to highly specific communication between brain regions compared to controls. In sum, we propose a general model for dynamic communication learning on graphs, and show its applicability to a subfield of the natural sciences, with potential widespread scientific impact.","variational autoencoder, computational neuroscience, graphs, fMRI, sequential variational autoencoder, graph learning, communications" Beyond the injective assumption in causal representation learning,https://openreview.net/forum?id=22Hsbl8twlY,https://openreview.net/pdf?id=22Hsbl8twlY,A hierarchy of generative functions for causal representation learning to consider that relaxes the injective assumption.,"Causal representation learning aims to take some entangled observation, $x$, and recover the latent causal variables $z$ from which the observation was generated via a generative function $g(\cdot): \mathcal{Z}\rightarrow \mathcal{X}$.
While this problem is impossible in its full generality, there has been considerable recent progress in showing a variety of conditions under which the latents are identifiable. All of these approaches share the assumption that $g(\cdot)$ is injective: i.e., for any two observations $x_1$ and $x_2$, if $x_1 = x_2$, then the corresponding latent variables, $z_1$ and $z_2$, are equal. This assumption is restrictive, but dropping it entirely would allow pathological examples that we could never hope to identify, so in order to make progress beyond injectivity, we need to make explicit the important classes of non-injective functions. In this paper, we present a formal hierarchy over generative functions that includes injective functions and two non-trivial classes of non-injective functions---occlusion and observable effects---that we argue are important for causal representation learning to consider. We demonstrate that the injectivity assumption is not necessary by proving the first identifiability results in settings with occluded variables. ","Representation learning, identifiability, ica, causal representation learning" Answer Me if You Can: Debiasing Video Question Answering via Answering Unanswerable Questions,https://openreview.net/forum?id=hdkdCk6xm48,https://openreview.net/pdf?id=hdkdCk6xm48,We propose a novel framework for VideoQA that can learn confounders existing in the dataset even when they are unobserved and can effectively remove the effects of the learned confounders.,"Video Question Answering (VideoQA) is a task to predict a correct answer given a question-video pair. Recent studies have shown that most VideoQA models rely on spurious correlations induced by various biases when predicting an answer. For instance, VideoQA models tend to predict `two’ as an answer without considering the video if a question starts with ``How many’' since the majority of answers to such type of questions are `two’. In causal inference, such bias ($\textit{question type}$), which simultaneously affects the input $X$ ($\textit{How many...}$) and the answer $Y$ ($\textit{two}$), is referred to as a confounder $Z$ that hinders a model from learning the true relationship between the input and the answer. The effect of the confounders $Z$ can be removed with a causal intervention $P(Y|do(X))$ when $Z$ is observed. However, there exist many unobserved confounders affecting questions and videos, $\textit{e.g.}$, dataset bias induced by annotators who mainly focus on human activities and salient objects, resulting in a spurious correlation between videos and questions. To address this problem, we propose a novel framework that learns unobserved confounders by capturing the bias using $\textit{unanswerable}$ questions, which refer to artificially constructed VQA samples with a video and a question from two different samples, and leverages the confounders for debiasing a VQA model through causal intervention. We demonstrate that our confounders successfully capture the dataset bias by investigating which parts of a video or question the confounders pay attention to.
Our experiments on multiple VideoQA benchmark datasets show the effectiveness of the proposed debiasing framework, resulting in an even larger performance gap compared to biased models under distribution shift.","Video Question Answering, debiasing, causal inference" Language Models Can (kind of) Reason: A Systematic Formal Analysis of Chain-of-Thought,https://openreview.net/forum?id=qFVVBzXxR2V,https://openreview.net/pdf?id=qFVVBzXxR2V,"We present a new synthetic QA dataset called PrOntoQA to systematically explore the reasoning ability of language models via formal analysis, and find that while they can produce valid proof steps, they have difficulty with proof planning.","Large language models (LLMs) have shown remarkable reasoning capabilities given chain-of-thought prompts (examples with intermediate reasoning steps). Existing benchmarks measure reasoning ability indirectly, by evaluating accuracy on downstream tasks such as mathematical reasoning. However, it is unclear how these models obtain the answers and whether they rely on simple heuristics rather than the generated chain-of-thought. To enable systematic exploration of the reasoning ability of LLMs, we present a new synthetic question-answering dataset called PrOntoQA, where each example is generated from a synthetic world model represented in first-order logic. This allows us to parse the generated chain-of-thought into symbolic proofs for formal analysis. Our analysis on InstructGPT and GPT-3 shows that LLMs are quite capable of making correct individual deduction steps, and so are generally capable of reasoning, even in fictional contexts. However, they have difficulty with proof planning: When multiple valid deduction steps are available, they are not able to systematically explore the different options.","large language models, reasoning, question answering, chain-of-thought, in-context learning" Approximation ability of Transformer networks for functions with various smoothness of Besov spaces: error analysis and token extraction,https://openreview.net/forum?id=uOAerdjbEZy,https://openreview.net/pdf?id=uOAerdjbEZy,"This paper studies the approximation ability of Transformers for functions with various smoothness and, from the viewpoint of approximation, proves the token extraction property of Transformers.","Although Transformer networks perform well on various natural language processing tasks, many aspects of their theoretical nature are still unclear. On the other hand, fully connected neural networks have been extensively studied in terms of their approximation and estimation capability, where the target function is included in such function classes as the H\""older class and the Besov class. In this paper, we study the approximation and estimation error of Transformer networks in a setting where the target function takes a fixed-length sentence as input and belongs to two variants of Besov spaces known as anisotropic Besov and mixed smooth Besov spaces, in which it is shown that Transformer networks can avoid the curse of dimensionality. By overcoming the difficulties of limited interactions among tokens, we prove that Transformer networks can achieve the minimax optimal rate. Our result also shows that token-wise parameter sharing in Transformer networks decreases the dependence of the network width on the input length. Moreover, we prove that, under suitable conditions, Transformer networks dynamically select tokens to pay careful attention to.
This phenomenon matches the attention mechanism on which Transformer networks are based. Our analyses provide theoretical support for why Transformer networks have excelled at various natural language processing tasks.","Transformer, approximation error, estimation error, minimax optimal rate, besov spaces, B-Splines, adaptive sampling recovery, token extraction" Teacher Intervention: Improving Convergence of Quantization Aware Training for Ultra-Low Precision Transformers,https://openreview.net/forum?id=692oJ-QFuMC,https://openreview.net/pdf?id=692oJ-QFuMC,Efficient and accurate two-step Quantization Aware Training method of Finetuned Transformers,"Pre-trained Transformer models such as BERT have shown great success in a wide range of applications, but at the cost of substantial increases in model complexity. Quantization-aware training (QAT) is a promising way to lower the implementation cost and energy consumption. However, aggressive quantization below 2 bits causes considerable accuracy degradation, especially when the downstream dataset is not abundant. This work proposes a proactive knowledge distillation method called Teacher Intervention (TI) for fast-converging QAT of ultra-low precision pre-trained Transformers. TI intervenes in layer-wise signal propagation with the intact signal from the teacher to remove the interference of propagated quantization errors, smoothing the loss surface and expediting the convergence. We further propose a gradual intervention mechanism to stabilize the tuning of the feed-forward network and recover the self-attention map in steps. The proposed scheme enables fast convergence of QAT and improves the model accuracy regardless of the diverse characteristics of downstream fine-tuning tasks. We demonstrate that TI consistently achieves superior accuracy with a lower fine-tuning budget. ","Deep Learning, Quantization, QAT, Self-Attention, Transformer, BERT" "Clustering Embedding Tables, Without First Learning Them",https://openreview.net/forum?id=T-DKAYt6BMk,https://openreview.net/pdf?id=T-DKAYt6BMk,"We train recommendation systems using less memory than previous work. This is achieved using clustering of a ""pseudo embedding table"" trained via hashing.","Machine learning systems use embedding tables to work with categorical features. These tables may get extremely large in modern recommendation systems, and various methods have been suggested to fit them in memory. Product- and Residual Vector Quantization are some of the most successful methods for table compression. They function by substituting table rows with references to ``codewords'' picked by k-means clustering. Unfortunately, this means that they must first know the table before compressing it; thus, they can only save memory at inference time, not training time. Recent work has employed hashing-based approaches to minimize memory usage during training; however, the compression obtained is poorer than that achieved by ``post-training'' quantization. We demonstrate that combining hashing- and clustering-based algorithms provides the best of both worlds. By first training a hashing-based ``sketch'', then clustering it, and then training the clustered quantization, our method may achieve compression ratios close to those of post-training quantization with the training-time memory reductions of hashing-based methods. 
We rigorously prove that this technique works in the least-squares setting.","Clustering, Sketching, Recommendation Systems, Embeddings, Sparse Matrices" Architecture Matters in Continual Learning,https://openreview.net/forum?id=CAsH4Z_Xzj7,https://openreview.net/pdf?id=CAsH4Z_Xzj7,The choice of architecture can significantly impact continual learning performance.,"A large body of research in continual learning is devoted to overcoming the catastrophic forgetting of neural networks by designing new algorithms that are robust to distribution shifts. However, the majority of these works are strictly focused on the ""algorithmic"" part of continual learning for a ""fixed neural network architecture"", and the implications of using different architectures are not clearly understood. The few existing continual learning methods that expand the model also assume a fixed architecture and develop algorithms that can efficiently use the model throughout the learning experience. In contrast, in this work, we build on existing works that study continual learning from a neural network's architecture perspective and provide new insights into how the architecture choice, for the same learning algorithm, can impact the stability-plasticity trade-off, resulting in markedly different continual learning performance. We empirically analyze the impact of various architectural components, providing best practices and recommendations that can improve continual learning performance irrespective of the learning algorithm.","Continual Learning, Catastrophic Forgetting, Neural Network Architecture" Machine Learning Force Fields with Data Cost Aware Training,https://openreview.net/forum?id=5C5ZcWvtI7S,https://openreview.net/pdf?id=5C5ZcWvtI7S,"We propose ASTEROID, a computational framework to reduce the data generation cost of training machine learning force fields.","Machine learning force fields (MLFF) have been proposed to accelerate molecular dynamics (MD) simulation, which finds widespread applications in chemistry and biomedical research. Even for the most data-efficient MLFF models, reaching chemical accuracy can require hundreds of frames of force and energy labels generated by expensive quantum mechanical algorithms, which may scale as $O(n^3)$ to $O(n^7)$, with $n$ being the number of basis functions used and typically proportional to the number of atoms. To address this issue, we propose a multi-stage computational framework -- ASTEROID, which enjoys a low training data generation cost without significantly sacrificing MLFFs' accuracy. Specifically, ASTEROID leverages a combination of both large cheap inaccurate data and small expensive accurate data. The motivation behind ASTEROID is that inaccurate data, though incurring large bias, can help capture the sophisticated structures of the underlying force field. Therefore, we first train an MLFF model on a large amount of inaccurate training data, employing a bias-aware loss function to prevent the model from overfitting the potential bias of the inaccurate training data. We then fine-tune the obtained model using a small amount of accurate training data, which preserves the knowledge learned from the inaccurate training data while significantly improving the model's accuracy. Moreover, we propose a variant of ASTEROID based on score matching for the setting where the inaccurate training data are unlabelled. 
Extensive experiments on MD simulation datasets show that ASTEROID can significantly reduce data generation costs while improving the accuracy of MLFFs.","Machine Learning Force Fields, Data-Cost Aware Training, AI for Science" Covariance Matrix Adaptation MAP-Annealing,https://openreview.net/forum?id=q_tgo-hvgPd,https://openreview.net/pdf?id=q_tgo-hvgPd,We propose a new variant of the quality diversity algorithm CMA-ME that addresses three major limitations affecting performance and robustness.,"Single-objective optimization algorithms search for the single highest-quality solution with respect to an objective. Quality diversity (QD) algorithms, such as Covariance Matrix Adaptation MAP-Elites (CMA-ME), search for a collection of solutions that are both high-quality with respect to an objective and diverse with respect to specified measure functions. However, CMA-ME suffers from three major limitations highlighted by the QD community: prematurely abandoning the objective in favor of exploration, struggling to explore flat objectives, and having poor performance for low-resolution archives. We propose a new quality diversity algorithm, Covariance Matrix Adaptation MAP-Annealing (CMA-MAE), that addresses all three limitations. We provide theoretical justifications for the new algorithm with respect to each limitation. Our theory informs our experiments, which support the theory and show that CMA-MAE achieves state-of-the-art performance.","quality diversity optimization, derivative-free optimization, latent space exploration" Learning Rewards and Skills to Follow Commands with a Data Efficient Visual-Audio Representation,https://openreview.net/forum?id=HumfPzF2yeI,https://openreview.net/pdf?id=HumfPzF2yeI,We learn a representation to generate a reward function to train command-following robots with reinforcement learning,"Based on the recent advancements in representation learning, we propose a novel framework for command-following robots with raw sensor inputs. Previous RL-based methods are either difficult to improve continuously after deployment or require a large number of new labels during fine-tuning. Motivated by the (self-)supervised contrastive learning literature, we propose a novel representation, named VAR++, that generates an intrinsic reward function for command-following robot tasks by associating images with sound commands. After the robot is deployed in a new domain, the representation can be updated intuitively and data-efficiently by non-experts, and the robots are able to fulfill sound commands without any hand-crafted reward functions. We demonstrate our approach on various sound types and robotic tasks, including navigation and manipulation with raw sensor inputs. In the simulated experiments, we show that our system can continually self-improve in previously unseen scenarios given less newly labeled data, yet achieves better performance compared with previous methods. ","Robotics, Representation Learning, Reinforcement Learning" Reinforcement Learning-Based Estimation for Partial Differential Equations,https://openreview.net/forum?id=ZADNbI_3sbS,https://openreview.net/pdf?id=ZADNbI_3sbS,,"In systems governed by nonlinear partial differential equations such as fluid flows, the design of state estimators such as Kalman filters relies on a reduced-order model (ROM) that projects the original high-dimensional dynamics onto a computationally tractable low-dimensional space. 
However, ROMs are prone to large errors, which negatively affect the performance of the estimator. Here, we introduce the reinforcement learning reduced-order estimator (RL-ROE), a ROM-based estimator in which the correction term that takes in the measurements is given by a nonlinear policy trained through reinforcement learning. The nonlinearity of the policy enables the RL-ROE to compensate efficiently for errors of the ROM, while still taking advantage of the imperfect knowledge of the dynamics. Using examples involving the Burgers and Navier-Stokes equations, we show that in the limit of very few sensors, the trained RL-ROE outperforms a Kalman filter designed using the same ROM. Moreover, it yields accurate high-dimensional state estimates for reference trajectories corresponding to various physical parameter values, without direct knowledge of the latter.", Heterogeneous-Agent Mirror Learning,https://openreview.net/forum?id=OxBl7cSgo6_,https://openreview.net/pdf?id=OxBl7cSgo6_,A general theoretical framework for development of multi-agent reinforcement learning algorithms.,"The necessity for cooperation among intelligent machines has popularised cooperative multi-agent reinforcement learning (MARL) in the artificial intelligence (AI) research community. However, many research endeavours have been focused on developing practical MARL algorithms whose effectiveness has been studied only empirically, thereby lacking theoretical guarantees. As recent studies have revealed, MARL methods often achieve performance that is unstable in terms of reward monotonicity or suboptimal at convergence. To resolve these issues, in this paper, we introduce a novel framework named Heterogeneous-Agent Mirror Learning (HAML) that provides a general template for MARL algorithmic designs. We prove that algorithms derived from the HAML template satisfy the desired properties of the monotonic improvement of the joint reward and the convergence to Nash equilibrium. We verify the practicality of HAML by proving that the current state-of-the-art cooperative MARL algorithms, HATRPO and HAPPO, are in fact HAML instances. Next, as a natural outcome of our theory, we propose HAML extensions of two well-known RL algorithms, HAA2C (for A2C) and HADDPG (for DDPG), and demonstrate their effectiveness against strong baselines on StarCraft II and Multi-Agent MuJoCo tasks.","deep multi-agent reinforcement learning, multi-agent reinforcement learning theory" ADELT: Unsupervised Transpilation Between Deep Learning Frameworks,https://openreview.net/forum?id=vbRRydfyFc,https://openreview.net/pdf?id=vbRRydfyFc,We propose Adversarial DEep Learning Transpiler (ADELT) for source-to-source transpilation between deep learning frameworks.,"We propose Adversarial DEep Learning Transpiler (ADELT) for source-to-source transpilation between deep learning frameworks. Unlike prior approaches, ADELT formulates the transpilation problem as mapping API keywords (API function names or parameter names). Based on contextual embeddings extracted by a BERT model for code, we train aligned API embeddings in a domain-adversarial setting, upon which we generate a dictionary for keyword translation. The model is trained on our unlabeled DL corpus from web crawl data, without using any hand-crafted rules or parallel data. Our method outperforms state-of-the-art transpilers on multiple transpilation pairs including PyTorch-Keras and PyTorch-MXNet. 
We make our code, corpus, and evaluation benchmark publicly available.","Applications, Programming Languages, Deep Learning, Unsupervised Learning, Adversarial Training" Recursive Time Series Data Augmentation,https://openreview.net/forum?id=5lgD4vU-l24s,https://openreview.net/pdf?id=5lgD4vU-l24s,,"Time series observations can be seen as realizations of an underlying dynamical system governed by rules that we typically do not know. For time series learning tasks, we create our model using the available data. Training on available realizations, where data is limited, often induces severe over-fitting, thereby preventing generalization. To address this issue, we introduce a general recursive framework for time series augmentation, which we call the Recursive Interpolation Method (RIM). New augmented time series are generated from the original time series using a recursive interpolation function, for use in training. We perform theoretical analysis to characterize the proposed RIM and to guarantee its performance under certain conditions. We apply RIM to diverse synthetic and real-world time series cases, achieving strong performance over non-augmented data on a variety of learning tasks. Our method is also computationally more efficient and leads to better performance when compared to state-of-the-art time series data augmentation methods. ","Time Series, Data augmentation, Representation Learning, Deep Learning, Reinforcement Learning" Auto-Encoding Goodness of Fit,https://openreview.net/forum?id=JjCAdMUlu9v,https://openreview.net/pdf?id=JjCAdMUlu9v,,"For generative autoencoders to learn a meaningful latent representation for data generation, a careful balance must be achieved between reconstruction error and how close the distribution in the latent space is to the prior. However, this balance is challenging to achieve due to a lack of criteria that work both at the mini-batch (local) and aggregated posterior (global) level. In this work, we develop the Goodness of Fit Autoencoder (GoFAE), which incorporates hypothesis tests at two levels. At the mini-batch level, it uses GoF test statistics as regularization objectives. At a more global level, it selects a regularization coefficient based on higher criticism, i.e., a test on the uniformity of the local GoF p-values. We justify the use of GoF tests by providing a relaxed $L_2$-Wasserstein bound on the distance between the latent distribution and the target prior. We propose to use GoF tests and prove that optimization based on these tests can be done with stochastic gradient descent (SGD) on a compact Riemannian manifold. Empirically, we show that our higher criticism parameter selection procedure balances reconstruction and generation using mutual information and uniformity of p-values, respectively. Finally, we show that GoFAE achieves FID scores and mean squared errors comparable to those of competing deep generative models while retaining statistical indistinguishability from Gaussian in the latent space based on a variety of hypothesis tests.", VER: Learning Natural Language Representations for Verbalizing Entities and Relations,https://openreview.net/forum?id=wIzVS-RJjCB,https://openreview.net/pdf?id=wIzVS-RJjCB,We propose VER: A Unified Model for Verbalizing Entities and Relations.,"Entities and relationships between entities are vital in the real world. Essentially, we understand the world by understanding entities and relations. 
For instance, to understand a field, e.g., computer science, we need to understand the relevant concepts, e.g., machine learning, and the relationships between concepts, e.g., the relationship between machine learning and artificial intelligence. To understand a person, we should first know who he/she is and how he/she is related to others. To understand entities and relations, humans may refer to natural language descriptions. For instance, when learning a new scientific term, people usually start by reading its definition in dictionaries or encyclopedias. To know the relationship between two entities, humans tend to create a sentence to connect them. In this paper, we propose VER: A Unified Model for Verbalizing Entities and Relations. Specifically, we attempt to build a system that takes any entity or entity set as input and generates a sentence representing the entities and their relations, which we call a ``natural language representation''. Extensive experiments demonstrate that our model can generate high-quality sentences describing entities and entity relationships and facilitate various tasks on entities and relations, including definition modeling, relation modeling, and generative commonsense reasoning.", Adaptive IMLE for Few-shot Image Synthesis,https://openreview.net/forum?id=8RExG-EKC22,https://openreview.net/pdf?id=8RExG-EKC22,,"Despite their success on large datasets, GANs have been difficult to apply in the few-shot setting, where only a limited number of training examples are provided. Due to mode collapse, GANs tend to ignore some training examples, causing overfitting to a subset of the training dataset, which is small to begin with. A recent method called Implicit Maximum Likelihood Estimation (IMLE) is an alternative to GANs that tries to address this issue. It uses the same kind of generator as GANs but trains it with a different objective that encourages mode coverage. However, the theoretical guarantees of IMLE hold under restrictive conditions, such as the requirement for the optimal likelihood at all data points to be the same. In this paper, we present a more generalized formulation of IMLE which includes the original formulation as a special case, and we prove that the theoretical guarantees hold under weaker conditions. Using this generalized formulation, we further derive a new algorithm, dubbed Adaptive IMLE, which can adapt to the varying difficulty of different training examples. We demonstrate on multiple few-shot image synthesis datasets that our method significantly outperforms existing methods. ", Understanding the Covariance Structure of Convolutional Filters,https://openreview.net/forum?id=WGApODQvwRg,https://openreview.net/pdf?id=WGApODQvwRg,"If you initialize depthwise convolutional filters from the right multivariate Gaussian distribution, they work so well that you may not even have to train them; we provide such Gaussians in closed form.","Neural network weights are typically initialized at random from univariate distributions, controlling just the variance of individual weights even in highly-structured operations like convolutions. Recent ViT-inspired convolutional networks such as ConvMixer and ConvNeXt use large-kernel depthwise convolutions whose learned filters have notable structure; this presents an opportunity to study their empirical covariances. 
In this work, we first observe that such learned filters have highly-structured covariance matrices. Moreover, we find that covariances calculated from small networks may be used to effectively initialize a variety of larger networks of different depths, widths, patch sizes, and kernel sizes, indicating a degree of model-independence in the covariance structure. Motivated by these findings, we then propose a learning-free multivariate initialization scheme for convolutional filters using a simple, closed-form construction of their covariance. Models using our initialization outperform those using traditional univariate initializations, and typically meet or exceed the performance of those initialized from the covariances of learned filters; in some cases, this improvement can be achieved without training the depthwise convolutional filters at all.","initialization, init, covariance, gaussian, convolutional neural network, convmixer, convnext, transfer learning, spatial mixing, computer vision, convolution" Reinforcement Logic Rule Learning for Temporal Point Processes ,https://openreview.net/forum?id=ynD_LAMwar2,https://openreview.net/pdf?id=ynD_LAMwar2,We aim to learn a set of temporal logic rules to explain the temporal point processes. ,"We aim to learn a set of temporal logic rules to explain the occurrence of temporal events. Leveraging the temporal point process modeling and learning framework, the rule content and rule weights are jointly learned by maximizing the likelihood of the observed noisy event sequences. The proposed algorithm alternates between a master problem, where the rule weights are updated, and a subproblem, where a new rule is searched for and included. The formulated master problem is convex and relatively easy to solve, whereas the subproblem requires searching the huge combinatorial rule predicate and relationship space. To tackle this challenge, we propose a neural search policy that learns to generate new rule content as a sequence of actions. The policy parameters are trained end-to-end using the reinforcement learning framework, where the reward signals can be efficiently queried by evaluating the subproblem objective. The trained policy can be used to generate new rules; moreover, well-trained policies can be directly transferred to other tasks to speed up the rule searching procedure in the new task. We evaluate our methods on both synthetic and real-world datasets, obtaining promising results.","temporal point processes, explainable models, rule learning" "On Making Graph Continual Learning Easy, Fool-Proof, and Extensive: a Benchmark Framework and Scenarios",https://openreview.net/forum?id=doShL95X0hd,https://openreview.net/pdf?id=doShL95X0hd,"We present BEGIN, an easy-to-use, fool-proof, and extensive benchmark framework for graph continual learning","Continual Learning (CL) is the process of ceaselessly learning a sequence of tasks. Most existing CL methods deal with independent data (e.g., images and text) for which many benchmark frameworks and results under standard experimental settings are available. CL methods for graph data, however, are surprisingly underexplored because of (a) the lack of standard experimental settings, especially regarding how to deal with the dependency between instances, (b) the lack of benchmark datasets and scenarios, and (c) high complexity in implementation and evaluation due to the dependency. 
In this paper, regarding (a), we define four standard incremental settings (task-, class-, domain-, and time-incremental settings) for graph data, which apply naturally to many node-, edge-, and graph-level problems. Regarding (b), we provide 17 benchmark scenarios based on nine real-world graphs. Regarding (c), we develop BEGIN, an easy-to-use and fool-proof framework for graph CL. BEGIN is easily extended since it is modularized with reusable modules for data processing, algorithm design, validation, and evaluation. In particular, the evaluation module is completely separated from user code to eliminate potential mistakes in evaluation. Using all of the above, we report extensive benchmark results of seven graph CL methods. Compared to the latest benchmark for graph CL, using BEGIN, we can cover 2.75× more combinations of incremental settings and levels of problems, and we can implement the same graph CL methods with about 30% fewer lines of code.","Graph Continual Learning, Continual Learning Benchmark Framework" Masked Distillation with Receptive Tokens,https://openreview.net/forum?id=mWRngkvIki3,https://openreview.net/pdf?id=mWRngkvIki3,,"Distilling from feature maps can be fairly effective for dense prediction tasks since both the feature discriminability and localization information can be well transferred. However, not every pixel contributes equally to the performance, and a good student should learn from what really matters to the teacher. In this paper, we introduce a learnable embedding dubbed the receptive token to locate the pixels of interest (PoIs) in the feature map, with a distillation mask generated via pixel-wise attention. The masked distillation is then performed via pixel-wise reconstruction. In this way, a distillation mask refers to a pattern of pixel dependencies. We thus adopt multiple receptive tokens to investigate more sophisticated and informative pixel dependencies within feature maps to enhance the distillation. To obtain a group of masks, the receptive tokens are learned via the regular task loss but with the teacher fixed, and we also leverage a Dice loss to enrich the diversity of the obtained masks. Our method, dubbed MasKD, is simple and practical, needs no priors on ground-truth labels, and can be applied to various dense prediction tasks. Experiments show that our MasKD can achieve state-of-the-art performance consistently on object detection and semantic segmentation benchmarks.", Robust Multivariate Time-Series Forecasting: Adversarial Attacks and Defense Mechanisms,https://openreview.net/forum?id=ctmLBs8lITa,https://openreview.net/pdf?id=ctmLBs8lITa,Designs of Adversarial Attacks and Defense Mechanisms for Multivariate Forecasting Models,"This work studies the threats of adversarial attacks on multivariate probabilistic forecasting models and viable defense mechanisms. Our studies discover a new attack pattern that negatively impacts the forecasting of a target time series by making strategic, sparse (imperceptible) modifications to the past observations of a small number of other time series. To mitigate the impact of such attacks, we have developed two defense strategies. First, we extend a previously developed randomized smoothing technique from classification to multivariate forecasting scenarios. Second, we develop an adversarial training algorithm that learns to create adversarial examples and at the same time optimizes the forecasting model to improve its robustness against such adversarial simulation. 
Extensive experiments on real-world datasets confirm that our attack schemes are powerful and our defense algorithms are more effective than baseline defense mechanisms. ",Multivariate Timeseries Forecasting TextShield: Beyond Successfully Detecting Adversarial Sentences in NLP,https://openreview.net/forum?id=xIWfWvKM7aQ,https://openreview.net/pdf?id=xIWfWvKM7aQ,A defense that extends the adversarial detection paradigm in NLP,"Adversarial attacks pose a major challenge for neural network models in NLP, precluding their deployment in safety-critical applications. A recent line of work, detection-based defense, aims to distinguish adversarial sentences from benign ones. However, there remains a significant gap between successfully detecting adversarial sentences and correctly classifying adversarial sentences for a specific application. To close this gap, this paper proposes TextShield: (1) we discover a link between text attacks and saliency information, and then we propose a saliency-based detector, which can effectively detect whether an input sentence is adversarial or not. (2) We design a saliency-based corrector, which converts detected adversarial sentences into benign ones. By combining the saliency-based detector and corrector, TextShield extends the detection-only paradigm to a detection-correction paradigm, thus filling the gap in existing detection-based defenses. Comprehensive experiments show that (a) TextShield consistently achieves higher or comparable performance than state-of-the-art defense methods across various attacks on different benchmarks. (b) Our saliency-based detector outperforms existing detectors for detecting adversarial sentences.","Natural language processing, Adversarial defense, Adversarial attack, Text Classification" Efficient Deep Reinforcement Learning Requires Regulating Statistical Overfitting,https://openreview.net/forum?id=14-kr46GvP-,https://openreview.net/pdf?id=14-kr46GvP-,,"Deep reinforcement learning algorithms that learn policies by trial-and-error must learn from limited amounts of data collected by actively interacting with the environment. While many prior works have shown that proper regularization techniques are crucial for enabling data-efficient RL, a general understanding of the bottlenecks in data-efficient RL has remained elusive. Consequently, it has been difficult to devise a universal technique that works well across all domains. In this paper, we attempt to understand the primary bottleneck in sample-efficient deep RL by examining several potential hypotheses such as non-stationarity, excessive action distribution shift, and overfitting. We perform a thorough empirical analysis on state-based DeepMind Control Suite (DMC) tasks in a controlled and systematic way to show that statistical overfitting on the temporal-difference (TD) error is the main culprit that severely affects the performance of deep RL algorithms, and that prior methods that lead to good performance do, in fact, control the amount of statistical overfitting. This observation gives us a robust principle for making deep RL efficient: we can hill-climb on a notion of validation temporal-difference error by utilizing any form of regularization techniques from supervised learning. 
We show that a simple online model selection method that targets the statistical overfitting issue is effective across state-based DMC and Gym tasks.","Reinforcement Learning, Sample Efficient RL, Statistical Overfitting" Nuisances via Negativa: Adjusting for Spurious Correlations via Data Augmentation,https://openreview.net/forum?id=eZr_xEPesc7,https://openreview.net/pdf?id=eZr_xEPesc7,Corrupt semantic features with data augmentations and use their output to build models robust to spurious correlations,"There exist features that are related to the label in the same way across different settings for that task; these are semantic features or semantics. Features with varying relationships to the label are nuisances. For example, in detecting cows from natural images, the shape of the head is a semantic, and because images of cows often, but not always, have grass backgrounds, the background is a nuisance. Relationships between a nuisance and the label are unstable across settings and, consequently, models that exploit nuisance-label relationships face performance degradation when these relationships change. Direct knowledge of a nuisance helps build models that are robust to such changes, but knowledge of a nuisance requires extra annotations beyond the label and the covariates. In this paper, we develop an alternative way to produce robust models via data augmentation. These data augmentations corrupt semantic information to produce models that identify and adjust for where nuisances drive predictions. We study semantic corruptions in powering different robust-modeling methods for multiple out-of-distribution (OOD) tasks like classifying waterbirds, natural language inference, and detecting Cardiomegaly in chest X-rays.","spurious correlations, out of distribution generalization, shortcuts, bias mitigation, data augmentation" GNN Domain Adaptation using Optimal Transport,https://openreview.net/forum?id=nGyWzq-703u,https://openreview.net/pdf?id=nGyWzq-703u,We analyze the OOD generalization and consequent domain adaptation limits of Graph Convolution Networks. An optimal transport-based DA method is proposed for consistent improvement with a better transferability metric.,"While Graph Convolutional Networks (GCNs) have recently grown in popularity due to their excellent performance on graph data, their performance under domain shift has not been studied extensively. In this work, we first explore the ability of GCNs to generalize to out-of-distribution data using contextual stochastic block models (CSBMs) on the node classification task. Our results in this area provide the first generalization criteria for GCNs under feature distribution and structure changes. Next, we examine a popular Unsupervised Domain Adaptation (UDA) covariate shift assumption and demonstrate that it rarely holds for graph data. Motivated by these results, we propose addressing bias in graph models using domain adaptation with optimal transport (GDOT), which features a transportation plan that minimizes the cost between the joint feature and estimated label distributions $P(X,\hat{Y})$ of the source and target domains. Additionally, we demonstrate that such a transportation cost serves as a good proxy for estimating transferability between source and target graphs, and is a better transferability metric than other common metrics like maximum mean discrepancy (MMD). 
In our controlled CSBM experiments, GDOT demonstrates robustness to distributional shift, achieving 90\% ROC AUC under feature shift (vs.\ $<80$\% for the second-best algorithm). Comprehensive experiments on both semi-supervised and supervised real-world node classification problems show that our method is the only one that performs consistently better than baseline GNNs in the cross-domain adaptation setting.","domain adaptation, graph neural network, optimal transport" Ask Me Anything: A simple strategy for prompting language models,https://openreview.net/forum?id=bhUPJnS2g0X,https://openreview.net/pdf?id=bhUPJnS2g0X,"We propose a prompting strategy based on aggregating the predictions of multiple prompts, which enables a 6B parameter model to exceed the few-shot performance of GPT3-175B on 15/20 popular benchmarks.","Large language models (LLMs) transfer well to new tasks out-of-the-box simply given a natural language prompt that demonstrates how to perform the task and no additional training. Prompting is a brittle process wherein small modifications to the prompt can cause large variations in the model predictions, and therefore significant effort is dedicated towards designing a painstakingly ""perfect prompt"" for a task. To mitigate the high degree of effort involved in prompting, we instead ask whether collecting multiple ""imperfect prompts"" and aggregating them can lead to a high-quality prompting strategy. Our observations motivate our proposed prompting method, ASK ME ANYTHING (AMA). We first develop an understanding of effective prompt formats, finding that question-answering (QA) prompts, which encourage open-ended generation (""Who went to the park?""), tend to outperform those that restrict the model outputs (""Output True or False""). Our approach recursively uses the LLM itself to transform task inputs into the effective QA format. We apply these prompts to collect several noisy votes for the input's true label. We find that these prompts can have very different accuracies and complex dependencies and thus propose to use weak supervision, a classical procedure for combining noisy predictions, to produce the final predictions. We evaluate AMA across open-source GPT-model families (e.g., Neo, BLOOM, OPT, and T0), demonstrating an average performance lift of 10.2% over the few-shot baseline across both small and large language models. This simple strategy enables the open-source GPT-Neo-6B model to match and exceed the performance of few-shot ($k \geq 32$) GPT3-175B on 15 of 20 popular benchmarks. Averaged across these tasks, the GPT-Neo-6B model outperforms few-shot GPT3-175B.","large language models, prompt-engineering, in-context learning" MixBin: Towards Budgeted Binarization,https://openreview.net/forum?id=pRUW8BuTEFI,https://openreview.net/pdf?id=pRUW8BuTEFI,"We present $\texttt{MixBin}$, an iterative search-based strategy that constructs B2NN through optimized mixing of the binary and full-precision components.","Binarization has proven to be amongst the most effective ways of neural network compression, reducing the FLOPs of the original model by a large extent. However, such levels of compression are often accompanied by a significant drop in the performance of the model. There exist some approaches that reduce this performance drop by facilitating partial binarization of the network; however, a systematic approach to mixing binary and full-precision parameters in a single network is still missing. 
In this paper, we propose a paradigm for performing partial binarization of neural networks in a controlled sense, thereby constructing budgeted binary neural networks (B2NNs). We present $\texttt{MixBin}$, an iterative search-based strategy that constructs B2NNs through optimized mixing of the binary and full-precision components. $\texttt{MixBin}$ allows one to explicitly choose the approximate fraction of the network to be kept binary, thereby providing the flexibility to adapt the inference cost to a prescribed budget. We demonstrate through numerical experiments that B2NNs obtained from our $\texttt{MixBin}$ strategy are significantly better than those obtained from random selection of the network layers. To perform partial binarization in an effective manner, it is important that both the full-precision and the binary components of the B2NN are appropriately optimized. We also demonstrate that the choice of the activation function can have a significant effect on this process, and to circumvent this issue, we present BinReLU, an integral component of $\texttt{MixBin}$ that can be used as an effective activation function for the full-precision as well as the binary components of any B2NN. Experimental investigations reveal that BinReLU outperforms the other activation functions in all possible B2NN scenarios: zero-, partial-, as well as full binarization. Finally, we demonstrate the efficacy of $\texttt{MixBin}$ on the tasks of classification and object tracking using benchmark datasets.","Budgeted Binarization, Model Compression, Efficient deep learning" Limits of Algorithmic Stability for Distributional Generalization,https://openreview.net/forum?id=PoU_NgCStE5,https://openreview.net/pdf?id=PoU_NgCStE5,"In this paper, we empirically show that the more stable a learning algorithm is, the more robust the resulting model is to covariate, label, and subpopulation shifts. ","As machine learning models become widely considered in safety-critical settings, it is important to understand when models may fail after deployment. One cause of model failure is distribution shift, where the training and test data distributions differ. In this paper, we investigate the benefits of training models using methods that are algorithmically stable towards improving model robustness, motivated by recent theoretical developments that show a connection between the two. We use techniques from differentially private stochastic gradient descent (DP-SGD) to control the level of algorithmic stability during training. We compare the performance of algorithmically stable training procedures to stochastic gradient descent (SGD) across a variety of possible distribution shifts - specifically covariate, label, and subpopulation shifts. We find that algorithmically stable training procedures yield models with consistently lower generalization gaps across various types of shifts and shift severities, as well as higher absolute test performance under label shift. 
Finally, we demonstrate that there is a tradeoff between distributional robustness, stability, and performance.","Distribution Shift, Robustness, Evaluation" WikiWhy: Answering and Explaining Cause-and-Effect Questions,https://openreview.net/forum?id=vaxnu-Utr4l,https://openreview.net/pdf?id=vaxnu-Utr4l,"We propose WikiWhy, a dataset containing 9000+ ""why"" question-answer-rationale triplets to assess Large Language Models' cause-effect reasoning capability.","As large language models (LLMs) grow larger and more sophisticated, assessing their ""reasoning"" capabilities in natural language grows more challenging. Recent question answering (QA) benchmarks that attempt to assess reasoning are often limited by a narrow scope of covered situations and subject matters. We introduce WikiWhy, a QA dataset built around a novel auxiliary task: explaining why an answer is true in natural language. WikiWhy contains over 9,000 ""why"" question-answer-rationale triples, grounded on Wikipedia facts across a diverse set of topics. Each rationale is a set of supporting statements connecting the question to the answer. WikiWhy serves as a benchmark for the reasoning capabilities of LLMs because it demands rigorous explicit rationales for each answer to demonstrate the acquisition of implicit commonsense knowledge, which is unlikely to be easily memorized. GPT-3 baselines achieve only 38.7% human-evaluated correctness in the end-to-end answer & explain condition, leaving significant room for future improvements.","NLP, Question Answering, LLM, Dataset, Explanation" Offline Reinforcement Learning with Differentiable Function Approximation is Provably Efficient,https://openreview.net/forum?id=6jfbOWzWTcE,https://openreview.net/pdf?id=6jfbOWzWTcE,,"\emph{Offline reinforcement learning}, which aims at optimizing sequential decision-making strategies with historical data, has been extensively applied in real-life settings. \emph{State-of-the-art} algorithms usually leverage powerful function approximators (\emph{e.g.,} neural networks) to alleviate the sample complexity hurdle for better empirical performance. Despite all that, a more systematic understanding of the statistical complexity of function approximation remains lacking. Towards bridging the gap, we take a step by considering offline reinforcement learning with \emph{differentiable function class approximation} (DFA). This function class naturally incorporates a wide range of models with nonlinear/nonconvex structures. Most importantly, we show offline RL with differentiable function approximation is provably efficient by analyzing the \emph{pessimistic fitted Q-learning} (PFQL) algorithm, and our results provide the theoretical basis for understanding a variety of practical heuristics that rely on Fitted Q-Iteration style design. In addition, we further improve our guarantee with a tighter instance-dependent characterization. We hope our work draws interest to studying reinforcement learning with differentiable function approximation beyond the scope of current research.",Reinforcement Learning Theory Do We Really Need Graph Models for Skeleton-Based Action Recognition? A Topology-Agnostic Approach with Fully-Connected Networks,https://openreview.net/forum?id=PbXfwJEyKXT,https://openreview.net/pdf?id=PbXfwJEyKXT,,"Graph Convolutional Networks (GCNs) have been dominating skeleton-based action recognition in recent years. 
While GCN-based approaches keep establishing new state-of-the-art results, the proposed architectures are getting increasingly sophisticated with a variety of add-ons. Many recent works attempt to relax the topology restrictions imposed by the GCN framework, such as local/sparse connections and permutation invariance. However, the room for further innovation is extremely limited under such a framework. In this work, we present Topology-Agnostic Network (ToANet), a simple architecture based merely on Fully-Connected (FC) layers, as opposed to GCNs, for skeleton-based action recognition. It is constructed by chaining FC layers applied across joints (aggregating joint information) and within each joint (transforming joint features) in an alternating manner. Moreover, it contains a novel design of parallel paths for multi-relational modeling. ToANet proves to be a powerful architecture for learning the joint co-occurrence of human skeleton data. ToANet achieves results better than or comparable to state-of-the-art GCNs on the NTU RGB+D, NTU RGB+D 120 and Northwestern-UCLA datasets. These results challenge the convention of choosing GCNs as the de-facto option for skeleton-based action recognition. We hope that our work stimulates further research on non-GCN-based methods, eliminating the restriction of topology.","skeleton-based, action recognition, topology-agnostic, fully-connected" An Integrated Multi-Label Multi-Modal Framework in Deep Metric Learning,https://openreview.net/forum?id=uFC0HBseZxK,https://openreview.net/pdf?id=uFC0HBseZxK,,"Domains such as healthcare demand machine learning models that provide representations for complex relationships between heterogeneous modes of data and multiple co-occurring labels. Previous works have tackled representation learning in the multi-label, multi-modal setting, but have neglected to consider the common requirement of generalization to novel, and unknown, tasks at test time. In this work, we propose an integrated multi-modal multi-label framework for deep metric learning, which we term 3ML--DML. Our framework extends existing proxy learning losses to the multi-label domain, and provides a novel method for enforcing label correlations via these proxies. The multi-modal component builds a standard fusion model but draws from deep metric learning criteria in order to incorporate auxiliary, high-dimensional embedding and feature spaces from each mode of data as context to match with the output of the fusion model. We explore our method in a variety of settings, including on healthcare data, and demonstrate improvement over constructed baselines both in the context of multi-label multi-modal learning and, most notably, in zero-shot generalization to new labels.","deep metric learning, healthcare, multimodal, multilabel" Proto-Value Networks: Scaling Representation Learning with Auxiliary Tasks,https://openreview.net/forum?id=oGDKSt9JrZi,https://openreview.net/pdf?id=oGDKSt9JrZi,,"Auxiliary tasks improve the representations learned by deep reinforcement learning agents. Analytically, their effect is reasonably well-understood; in practice, however, their primary use remains in support of a main learning objective, rather than as a method for learning representations. This is perhaps surprising given that many auxiliary tasks are defined procedurally, and hence can be treated as an essentially infinite source of information about the environment. 
Based on this observation, we study the effectiveness of auxiliary tasks for learning rich representations, focusing on the setting where the number of tasks and the size of the agent’s network are simultaneously increased. For this purpose, we derive a new family of auxiliary tasks based on the successor measure. These tasks are easy to implement and have appealing theoretical properties. Combined with a suitable off-policy learning rule, the result is a representation learning algorithm that can be understood as extending Mahadevan & Maggioni (2007)’s proto-value functions to deep reinforcement learning – accordingly, we call the resulting object proto-value networks. Through a series of experiments on the Arcade Learning Environment, we demonstrate that proto-value networks produce rich features that may be used to obtain performance comparable to established algorithms, using only linear approximation and a small number (~4M) of interactions with the environment’s reward function.","reinforcement learning, representation learning" Conservative Exploration in Linear MDPs under Episode-wise Constraints,https://openreview.net/forum?id=RHWAEeEYmwW,https://openreview.net/pdf?id=RHWAEeEYmwW,We study conservative exploration with an offline dataset during online learning for linear MDPs and prove that the regret of our algorithms matches that of the constraint-free counterpart.,"This paper investigates conservative exploration in reinforcement learning, where the performance of the learning agent is guaranteed to stay above a certain threshold throughout the learning process. It focuses on the episodic linear Markov Decision Process (MDP) setting, where the transition kernels and the reward functions are assumed to be linear. With the knowledge of an existing safe baseline policy, two algorithms based on Least-Squares Value Iteration (LSVI) (Bradtke and Barto, 1996; Osband et al., 2016), coined StepMix-LSVI and EpsMix-LSVI, are proposed to balance exploitation and exploration while ensuring, with high probability, that the conservative constraint is never violated in any episode. Theoretical analysis shows that both algorithms achieve the same regret order as LSVI-UCB, their constraint-free counterpart from Jin et al. (2020), indicating that obeying the stringent episode-wise conservative constraint does not compromise the learning performance of these algorithms. We further extend the analysis to the setting where the baseline policy is not given a priori but must be learned from an offline dataset, and prove that a similar safety guarantee and regret can be achieved if the offline dataset is sufficiently large. Experimental results corroborate the theoretical analysis and demonstrate the effectiveness of the proposed conservative exploration strategies.","Conservative Exploration, Sample Complexity, Linear MDP, Offline and Online RL" Pseudometric guided online query and update for offline reinforcement learning,https://openreview.net/forum?id=gTph9AD_gx1,https://openreview.net/pdf?id=gTph9AD_gx1,We propose to utilize a pseudometric to guide online queries toward optimality and to enable efficient policy updates.,"Offline Reinforcement Learning (RL) extracts effective policies from historical data without the need to interact with the environment. However, the learned policy often suffers from large generalization errors in the online environment due to distributional shift. 
While existing work mostly focuses on learning a generalizable policy, we propose to adapt the learned policy to fit the online environment with limited queries. The goals are to query reasonable actions with limited chances and to modify the policy efficiently. Our insight is to unify these two goals via a proper pseudometric. Intuitively, the metric can compare online and offline states to infer optimal query actions. Additionally, efficient policy updates require good knowledge of the similarity between query results and historical data. Therefore, we propose a unified framework, denoted Pseudometric Guided Offline-to-Online RL (PGO2). Specifically, in deep Q-learning, PGO2 has a structural design connecting the Q-network and the Siamese network, which guarantees simultaneous Q-network updating and pseudometric learning, promoting Q-network fine-tuning. In the inference phase, PGO2 solves convex optimization problems to identify optimal query actions. We also show that PGO2 training converges to the so-called bisimulation metric with strong theoretical guarantees. Finally, we demonstrate the superiority of PGO2 on diverse datasets. ","Offline Reinforcement Learning, online query, optimal query, policy update" Efficient Data Subset Selection to Generalize Training Across Models: Transductive and Inductive Networks,https://openreview.net/forum?id=GKpwIa9wgwR,https://openreview.net/pdf?id=GKpwIa9wgwR,Trainable non-adaptive data subset selection that generalizes across different model training,"Subset selection, in recent times, has emerged as a successful approach toward efficient training of models by significantly reducing the amount of data and computational resources required. However, existing methods employ discrete combinatorial and model-specific approaches which lack generalizability--- for each new model, the algorithm has to be executed from the beginning. Therefore, for data subset selection for an unseen architecture, one cannot use the subset chosen for a different model. In this work, we propose SubSelNet, a non-adaptive subset selection framework, which tackles these problems with two main components. First, we introduce an attention-based neural gadget that leverages the graph structure of architectures and acts as a surrogate to trained deep neural networks for quick model prediction. Then, we use these predictions to build subset samplers. This leads us to develop two variants of SubSelNet. The first variant is transductive (called Transductive-SubSelNet), which computes the subset separately for each model by solving a small optimization problem. Such an optimization is still very fast, thanks to the replacement of explicit model training by the model approximator. The second variant is inductive (called Inductive-SubSelNet), which computes the subset using a trained subset selector, without any optimization. Most state-of-the-art data subset selection approaches are adaptive, in that the subset selection adapts as the training progresses, and as a result, they require access to the entire data at training time. Our approach, in contrast, is non-adaptive and performs the subset selection only once in the beginning, thereby achieving resource and memory efficiency along with compute-efficiency at training time. Our experiments show that both transductive and inductive variants of our models outperform several methods on the quality of the subset chosen and further demonstrate that our method can be used for choosing the best architecture from a set of architectures. 
","Data Subset Selection, Efficient Learning" Probe Into Multi-agent Adversarial Reinforcement Learning through Mean-Field Optimal Control,https://openreview.net/forum?id=dkLQ9dl4vcY,https://openreview.net/pdf?id=dkLQ9dl4vcY,,"Multi-agent adversarial reinforcement learning (MaARL) has shown promise in solving adversarial games. However, the theoretical tools for MaARL's analysis is still elusive. In this paper, we take the first step to theoretically understanding MaARL through mean-field optimal control. Specifically, we model MaARL as a mean-field quantitative differential game between two dynamical systems with implicit terminal constraints. Based on the game, we respectively study the optimal solution and the generalization of the fore-mentioned game. First of all, a two-sided extremism principle (TSEP) is then established as a necessary condition for the optimal solution of the game. We further show that TSEP is also sufficient given that the terminal time is sufficiently small. Secondly, based on the TSEP, a generalization bound for MaARL is proposed. The bound does not explicitly rely on the dimensions, norms, or other capacity measures of the model, which are usually prohibitively large in deep learning.", Robust Algorithms on Adaptive Inputs from Bounded Adversaries,https://openreview.net/forum?id=I29Kt0RwChs,https://openreview.net/pdf?id=I29Kt0RwChs,We give algorithms robust to adaptive input from adversaries with bounded capabilities and a general framework for achieving it.,"We study dynamic algorithms robust to adaptive inputs generated from sources with bounded capabilities, such as sparsity or limited interaction. For example, we consider robust linear algebraic algorithms when the updates to the inputs are sparse but given by an adversary with access to a query oracle. We also study robust algorithms in the standard centralized setting, where an adversary queries an algorithm in an adaptive manner, but the number of interactions between the adversary and the algorithm is bounded. Together, we provide a unified framework for answering $Q$ adaptive queries that incurs $\widetilde{\mathcal{O}}(\sqrt{Q})$ overhead in space, which is roughly a quadratic improvement over the na\""{i}ve implementation, and only incurs a logarithmic overhead in query time. Our general framework has diverse applications in machine learning and data science, such as adaptive distance estimation, kernel density estimation, linear regression, range queries, and point queries. Surprisingly, we show that these novel subroutines for each of these problems can be generally combined with the elegant use of differential privacy to hide the internal randomness of various subroutines, leading to robust algorithms across these different settings. 
In addition, we demonstrate even better algorithmic improvements for (1) reducing the pre-processing time for adaptive distance estimation and (2) permitting an unlimited number of adaptive queries for kernel density estimation.","streaming algorithms, adversarial robustness, sketching, kernel density estimation" "Chasing All-Round Graph Representation Robustness: Model, Training, and Optimization",https://openreview.net/forum?id=7jk5gWjC18M,https://openreview.net/pdf?id=7jk5gWjC18M,We identify a fundamental issue in graph adversarial learning and then propose a novel method to enlarge the model capacity and enrich the representation diversity of adversarial samples.,"Graph Neural Networks (GNNs) have achieved state-of-the-art results on a variety of graph learning tasks, however, it has been demonstrated that they are vulnerable to adversarial attacks, raising serious security concerns. Many studies have been developed to train GNNs in noisy environments and increase their robustness against adversarial attacks. However, existing methods have not uncovered a principled difficulty: the convoluted mixture distribution between clean and attacked data samples, which leads to sub-optimal model design and limits their frameworks’ robustness. In this work, we begin by identifying the root cause of the mixture distribution; then, to tackle it, we propose a novel method, GAME (Graph Adversarial Mixture of Experts), to enlarge the model capacity and enrich the representation diversity of adversarial samples, from the three perspectives of model, training, and optimization. Specifically, we first propose a plug-and-play GAME layer that can be easily incorporated into any GNNs and enhance their adversarial learning capabilities. Second, we design a decoupling-based graph adversarial training in which the component of the model used to generate adversarial graphs is separated from the component used to update weights. Third, we introduce a graph diversity regularization that enables the model to learn diverse representations and further improves model performance. Extensive experiments demonstrate the effectiveness and advantages of GAME over the state-of-the-art adversarial training methods across various datasets given different attacks.","Graph neural networks, Mixture of experts, Graph adversarial learning" Training Neural Networks with Low-Precision Model Memory,https://openreview.net/forum?id=cs3n00FQ7OI,https://openreview.net/pdf?id=cs3n00FQ7OI,"We propose memory efficient optimizers for deep learning which keep model parameters, momentum and gradient accumulators in low numerical precision.","The demand for memory to store model-related statistics (""model memory"") is a major bottleneck for training large neural networks. A promising solution is low-precision optimizers, which reduce the numerical precision of the model memory. However, existing work only compresses the momentum, resulting in suboptimal memory efficiency. This paper proposes Low-Precision Model Memory (LPMM), an optimization framework with the entire model memory kept in low precision. LPMM compresses not only the momentum but also model parameters and gradient accumulators. We identify arithmetic underflow as the main problem in building low-precision optimizers and propose a stochastic quantization method and a microbatching technique to overcome this problem. We analyze the convergence behavior of LPMM and theoretically show how the proposed techniques could affect underflowing, which in turn affects the convergence. 
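The LPMM entry above attributes the difficulty of fully low-precision model memory to arithmetic underflow and proposes stochastic quantization as a remedy. A generic stochastic-rounding quantizer of the kind that addresses underflow looks roughly as follows; this is a sketch of the standard technique, not LPMM's exact quantizer.

```python
import torch

def stochastic_round(x, num_bits=8, scale=None):
    """Stochastically round x to a num_bits fixed-point grid.

    Rounding up with probability equal to the fractional part keeps the
    quantizer unbiased (E[q(x)] = x), so tiny updates to parameters or
    gradient accumulators are preserved on average instead of always
    underflowing to zero, as they would under round-to-nearest.
    """
    if scale is None:
        scale = x.abs().max().clamp(min=1e-12) / (2 ** (num_bits - 1) - 1)
    y = x / scale
    floor = y.floor()
    prob_up = y - floor                                # fractional part in [0, 1)
    rounded = floor + (torch.rand_like(y) < prob_up).float()
    return rounded * scale
```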
We apply LPMM to the SGD optimizer with momentum (SGDM). On several realistic benchmarks, LPMM-SGDM can train neural networks with negligible loss of accuracy while reducing over 70% of the model memory compared to the full-precision SGDM.","memory efficient deep learning, stochastic gradient descent, quantization" Raisin: Residual Algorithms for Versatile Offline Reinforcement Learning,https://openreview.net/forum?id=d1oQqDvB7GQ,https://openreview.net/pdf?id=d1oQqDvB7GQ,,"The residual gradient algorithm (RG), gradient descent of the Mean Squared Bellman Error, brings robust convergence guarantees to bootstrapped value estimation. Meanwhile, the far more common semi-gradient algorithm (SG) suffers from well-known instabilities and divergence. Unfortunately, RG often converges slowly in practice. Baird (1995) proposed residual algorithms (RA), weighted averaging of RG and SG, to combine RG's robust convergence and SG's speed. RA works moderately well in the online setting. We find, however, that RA works disproportionately well in the offline setting. Concretely, we find that merely adding a variable residual component to SAC increases its score on D4RL gym tasks by a median factor of 54. We further show that using the minimum of ten critics lets our algorithm match SAC-$N$'s state-of-the-art returns using 50$\times$ less compute and no additional hyperparameters. In contrast, TD3+BC with the same minimum-of-ten-critics trick does not match SAC-$N$'s returns on a handful of environments.","reinforcement learning, offline RL, residual algorithms, residual gradient" VQR: Automated Software Vulnerability Repair Through Vulnerability Queries,https://openreview.net/forum?id=YKoRmMhcpWk,https://openreview.net/pdf?id=YKoRmMhcpWk,,"Recently, automated vulnerability repair (AVR) approaches have been widely adopted to combat the increasing number of software security issues. In particular, transformer-based models achieve competitive results. While existing AVR models learn to generate vulnerability repairs, they lack a mechanism to provide the precise location of vulnerable code (i.e., models may generate repairs for non-vulnerable areas). To address this problem, we base our framework on ViT-based approaches for object detection that learn to locate bounding boxes via the cross-matching between object queries and image patches. We cross-match vulnerability queries and their corresponding vulnerable code areas through the cross-attention mechanism to generate more accurate repairs. To strengthen our cross-matching, we propose to learn a novel vulnerability query mask that greatly focuses on vulnerable code areas and integrate it into the cross-attention. Moreover, we also incorporate the vulnerability query mask into the self-attention to learn embeddings that emphasize the vulnerable areas of a program. Through an extensive evaluation using 5,417 real-world vulnerabilities, our approach outperforms all of the baseline methods by 3.39%-33.21%. The training code and pre-trained models are available at https://github.com/AVR-VQR/VQR.","Automated Vulnerability Repair, Cross-Attention Mechanism, Transformers-based Models" Corruption-free Single-view Self-supervised Learning on Graphs,https://openreview.net/forum?id=2rzFscFzJ0B,https://openreview.net/pdf?id=2rzFscFzJ0B,,"Self-supervised learning (SSL) for graphs is an essential problem since graph data are ubiquitous and data labeling is costly. 
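For the Raisin entry above: in autodiff frameworks, Baird-style residual algorithms are commonly implemented by partially detaching the bootstrap target, so a single weight interpolates between semi-gradient and residual-gradient updates. A minimal sketch under that assumption follows (Raisin itself applies the idea inside SAC, which this sketch does not reproduce).

```python
import torch

def residual_algorithm_loss(q_net, s, a, r, s_next, gamma=0.99, eta=0.5):
    """Mean squared Bellman error with a variable residual component.

    eta = 0 recovers the semi-gradient update (bootstrap fully detached);
    eta = 1 recovers the pure residual gradient (differentiate through
    the bootstrap term). Intermediate values mix the two, as in Baird's
    residual algorithms.
    """
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    q_next = q_net(s_next).max(dim=1).values
    # Convex mix of a differentiable and a detached bootstrap term.
    bootstrap = eta * q_next + (1.0 - eta) * q_next.detach()
    td_error = q - (r + gamma * bootstrap)
    return 0.5 * (td_error ** 2).mean()
```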
We argue that existing SSL approaches for graphs have two limitations. First, they rely on corruption techniques such as node attribute perturbation and edge dropping to generate graph views for contrastive learning. These unnatural corruption techniques require extensive tuning efforts and provide marginal improvements. Second, the current approaches require the computation of multiple graph views, which is memory and computationally inefficient. These shortcomings of graph SSL call for a corruption-free single-view learning approach, but the strawman approach of using neighboring nodes as positive examples suffers from two problems: it ignores the strength of connections between nodes implied by the graph structure on a macro level, and cannot deal with the high level of noise in real-world graphs. We propose CURSIVE, a corruption-free single-view graph SSL approach that overcomes these problems by leveraging graph diffusion to measure connection strength and denoise. With extensive experiments, we show that CURSIVE achieves up to $4.55\%$ absolute improvement in ROC-AUC on graph SSL tasks over state-of-the-art approaches while being more memory efficient. Moreover, CURSIVE even outperforms supervised training on the node classification task of the ogbn-proteins dataset.", Fully Online Meta Learning,https://openreview.net/forum?id=eLxADkHrBcR,https://openreview.net/pdf?id=eLxADkHrBcR,"We propose a Fully Online Meta-Learning (FOML) algorithm, which does not require any ground truth knowledge about the task boundaries and learns fully online.","While deep networks can learn complex functions such as classifiers, detectors, and trackers, many applications require models that continually adapt to changing input distributions, changing tasks, and changing environmental conditions. Indeed, this ability to continuously accrue knowledge and use past experience to learn new tasks quickly in continual settings is one of the key properties of an intelligent system. For complex and high-dimensional problems, simply updating the model continually with standard learning algorithms such as gradient descent may result in slow adaptation. Meta-learning can provide a powerful tool to accelerate adaptation yet is conventionally studied in batch settings. In this paper, we study how meta-learning can be applied to tackle online problems of this nature, simultaneously adapting to changing tasks and input distributions and meta-training the model in order to adapt more quickly in the future. Extending meta-learning into the online setting presents its own challenges, and although several prior methods have studied related problems, they generally require a discrete notion of tasks, with known ground-truth task boundaries. Such methods typically adapt to each task in sequence, resetting the model between tasks, rather than adapting continuously across tasks. In many real-world settings, such discrete boundaries are unavailable, and may not even exist. To address these settings, we propose a Fully Online Meta-Learning (FOML) algorithm, which does not require any ground truth knowledge about the task boundaries and stays fully online without resetting back to pre-trained weights. 
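The CURSIVE entry above leverages graph diffusion to measure connection strength and denoise. A common closed-form choice is personalized-PageRank diffusion; the dense sketch below illustrates that operator for small graphs and is our assumption about the kind of diffusion meant, not the paper's exact construction.

```python
import numpy as np

def ppr_diffusion(adj, alpha=0.15):
    """Personalized-PageRank diffusion S = alpha * (I - (1 - alpha) * T)^-1.

    Entry S[i, j] gives a denoised, multi-hop connection strength between
    nodes i and j, which a corruption-free SSL objective can use to weight
    positive pairs instead of relying on raw edges. Dense inverse: suitable
    for small graphs only; sparse approximations are used at scale.
    """
    n = adj.shape[0]
    deg = adj.sum(axis=1).clip(min=1.0)
    trans = adj / deg[:, None]                 # row-stochastic transition matrix
    return alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * trans)
```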
Our experiments show that FOML learns new tasks faster than state-of-the-art online learning methods on the Rainbow-MNIST, CIFAR100, and CELEBA datasets.","Meta learning, Online learning" Learning Globally Smooth Functions on Manifolds,https://openreview.net/forum?id=Kot3IIgXGbb,https://openreview.net/pdf?id=Kot3IIgXGbb,We present a constrained learning approach to learn smooth functions over manifold data. ,"Smoothness and low dimensional structures play central roles in improving generalization and stability in learning and statistics. The combination of these properties has led to many advances in semi-supervised learning, generative modeling, and control of dynamical systems. However, learning smooth functions is generally challenging, except in simple cases such as learning linear or kernel models. Typical methods are either too conservative, relying on crude upper bounds such as spectral normalization; too lax, penalizing smoothness only on average; or too computationally intensive, requiring the solution of large-scale semi-definite programs. These issues are only exacerbated when trying to simultaneously exploit low dimensionality using, e.g., manifolds. This work proposes to overcome these obstacles by combining techniques from semi-infinite constrained learning and manifold regularization. To do so, it shows that, under typical conditions, the problem of learning a Lipschitz continuous function on a manifold is equivalent to a dynamically weighted manifold regularization problem. This observation leads to a practical algorithm based on a weighted Laplacian penalty whose weights are adapted using stochastic gradient techniques. We prove that, under mild conditions, this method estimates the Lipschitz constant of the solution, learning a globally smooth solution as a byproduct. Numerical examples illustrate the advantages of using this method to impose global smoothness on manifolds as opposed to imposing smoothness on average.","Lipschitz functions, Manifolds, Machine Learning" On Representing Mixed-Integer Linear Programs by Graph Neural Networks,https://openreview.net/forum?id=4gc3MGZra1d,https://openreview.net/pdf?id=4gc3MGZra1d,,"While mixed-integer linear programming (MILP) is NP-hard in general, practical MILP solvers have achieved a roughly 100-fold speedup in the past twenty years. Still, many classes of MILPs quickly become unsolvable as their sizes increase, motivating researchers to seek new acceleration techniques for MILPs. With deep learning, researchers have obtained strong empirical results, many of which were obtained by applying graph neural networks (GNNs) to make decisions at various stages of the MILP solution process. This work discovers a fundamental limitation: there exist feasible and infeasible MILPs that all GNNs treat identically, indicating that GNNs lack the power to express general MILPs. Then, we show that, by restricting the MILPs to unfoldable ones or by adding random features, there exist GNNs that can reliably predict MILP feasibility, optimal objective values, and optimal solutions up to prescribed precision. We conducted small-scale numerical experiments to validate our theoretical findings.", LEARNING DYNAMIC ABSTRACT REPRESENTATIONS FOR SAMPLE-EFFICIENT REINFORCEMENT LEARNING,https://openreview.net/forum?id=Lb8ZnWW_In6,https://openreview.net/pdf?id=Lb8ZnWW_In6,,"In many real-world problems, the learning agent needs to learn a problem’s abstractions and solution simultaneously. 
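To make the weighted Laplacian penalty from the "Learning Globally Smooth Functions on Manifolds" entry concrete, here is a minimal mini-batch sketch: a pairwise smoothness penalty whose per-pair weights would be adapted by stochastic gradient steps in the full method. Names and normalizations are illustrative assumptions, not the paper's implementation.

```python
import torch

def weighted_laplacian_penalty(f_x, x, weights):
    """Pairwise penalty sum_ij w_ij * ||f(x_i)-f(x_j)||^2 / ||x_i-x_j||^2.

    With fixed uniform weights this is plain manifold regularization
    (smoothness on average); the key idea in the paper is to adapt the
    weights so the penalty concentrates on pairs where the Lipschitz
    constraint is most violated, yielding global rather than average
    smoothness. f_x: (n, k) outputs, x: (n, d) inputs, weights: (n, n).
    """
    df = torch.cdist(f_x, f_x) ** 2                  # output distances
    dx = (torch.cdist(x, x) ** 2).clamp(min=1e-8)    # input distances
    return (weights * df / dx).mean()
```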
However, most such abstractions need to be designed and refined by hand for different problems and domains of application. This paper presents a novel top-down approach for constructing state abstractions while carrying out reinforcement learning. Starting with state variables and a simulator, it dynamically computes a domain-independent abstraction based on the dispersion of Q-values in abstract states as the agent continues acting and learning. Extensive empirical evaluation on multiple domains and problems shows that this approach automatically learns abstractions that are finely tuned to the problem, yield strong sample efficiency, and result in the RL agent significantly outperforming existing approaches.","Sequential Decision-Making, Reinforcement Learning, Learning Abstract Representations" Fighting Fire with Fire: Contrastive Debiasing without Bias-free Data via Generative Bias-transformation,https://openreview.net/forum?id=rieUBLynDqm,https://openreview.net/pdf?id=rieUBLynDqm,"In this paper, we propose Contrastive Debiasing via Generative Bias-transformation (CDvG) which is capable of operating without exploiting bias labels and bias-free samples explicitly.","Despite their remarkable ability to generalize with over-capacity networks, deep neural networks often abuse bias instead of using the actual task-related information for discriminative tasks. Since such shortcuts are only effective within the collected dataset, the resulting biased model underperforms on real-world inputs. To counteract the influence of bias, existing methods either exploit auxiliary information, which is rarely obtainable in practice, or sift out bias-free samples to exploit them for debiasing. However, the availability of such auxiliary information or bias-free samples is not always guaranteed, and existing methods can break down when these presumptions are unmet. In this paper, we propose Contrastive Debiasing via Generative Bias-transformation (CDvG), which is capable of operating without exploiting bias labels and bias-free samples explicitly. Motivated by our observation that not only discriminative models but also image translation models tend to focus on the easy-to-learn bias, CDvG employs an image translation model to transform the bias to another mode of bias while preserving task-relevant information. Through contrastive learning, we set transformed biased views against one another, learning bias-invariant representations. In particular, as the bias has a stronger correlation or is easier to perceive compared to the signal, the translation model is more likely to be a bias translation model, resulting in a better debiasing effect. Experimental results demonstrate that CDvG outperforms the state of the art, especially when bias-free samples are extremely scarce.","debiasing, contrastive learning, image-to-image translation" On Representing Linear Programs by Graph Neural Networks,https://openreview.net/forum?id=cP2QVK-uygd,https://openreview.net/pdf?id=cP2QVK-uygd,,"Learning to optimize is a rapidly growing area that aims to solve optimization problems or improve existing optimization algorithms using machine learning (ML). In particular, the graph neural network (GNN) is considered a suitable ML model for optimization problems whose variables and constraints are permutation-invariant, for example, the linear program (LP). 
While the literature has reported encouraging numerical results, this paper establishes the theoretical foundation of applying GNNs to solving LPs. Given any size limit of LPs, we construct a GNN that maps different LPs to different outputs. We show that properly built GNNs can reliably predict feasibility, boundedness, and an optimal solution for each LP in a broad class. Our proofs are based upon the recently discovered connections between the Weisfeiler-Lehman isomorphism test and the GNN. To validate our results, we train a simple GNN and present its accuracy in mapping LPs to their feasibility and solutions.", On the Importance and Applicability of Pre-Training for Federated Learning,https://openreview.net/forum?id=fWWFv--P0xP,https://openreview.net/pdf?id=fWWFv--P0xP,,"Pre-training is prevalent in modern deep learning for improving the learned model's performance. However, in the literature on federated learning (FL), neural networks are mostly initialized with random weights. This motivated us to conduct a systematic study of pre-training for FL. Across multiple visual recognition benchmarks, we found that pre-training can not only improve FL, but also close its accuracy gap to centralized learning, especially in the challenging cases of non-IID clients' data. To make our findings applicable to situations where pre-trained models are not directly available, we explore pre-training with synthetic data or even with clients' data in a decentralized manner, and found that they can already improve FL notably. Interestingly, many of the techniques we explore are complementary to each other to further boost the performance, and we view this as a critical result toward scaling up deep FL for real-world applications. We conclude our paper with an attempt to understand the effect of pre-training on FL. We found that pre-training enables the learned global models under different clients' data conditions to converge to the same loss basin, and makes global aggregation in FL more stable. Nevertheless, pre-training does not seem to alleviate local model drifting, a fundamental problem in FL under non-IID data.","federated learning, pre-training" Scale-invariant Bayesian Neural Networks with Connectivity Tangent Kernel,https://openreview.net/forum?id=VZ5EaTI6dqa,https://openreview.net/pdf?id=VZ5EaTI6dqa,,"Explaining generalization and preventing over-confident predictions are central goals of studies on the loss landscape of neural networks. Flatness, defined as the invariance of the loss under perturbations of a pre-trained solution, is widely accepted as a predictor of generalization in this context. However, it has been pointed out that flatness and generalization bounds can be changed arbitrarily according to the scale of a parameter, and previous studies only partially solved this problem, with restrictions: counter-intuitively, their generalization bounds either remained variant under function-preserving parameter scaling transformations or were limited to impractical network structures. As a more fundamental solution, we propose new prior and posterior distributions invariant to scaling transformations by \textit{decomposing} the scale and connectivity of parameters, thereby allowing the resulting generalization bound to describe the generalizability of a broad class of networks with the more practical class of transformations such as weight decay with batch normalization. 
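Both GNN-representation entries above (for MILPs and for LPs) encode a program as a bipartite graph between constraint nodes and variable nodes, with the constraint matrix defining the edges. A minimal message-passing sketch over that graph is shown below; the layer sizes and readout are hypothetical and are not the constructions used in the papers' proofs.

```python
import torch
import torch.nn as nn

class LPBipartiteGNN(nn.Module):
    """One round of constraint<->variable message passing for an LP/MILP.

    var_feat: (n_var, var_dim) per-variable features (e.g., objective coeffs),
    con_feat: (n_con, con_dim) per-constraint features (e.g., right-hand sides),
    A:        (n_con, n_var) constraint matrix, acting as the weighted edges.
    Real models stack several such rounds; one is shown for brevity.
    """
    def __init__(self, var_dim, con_dim, hidden=64):
        super().__init__()
        self.var_in = nn.Linear(var_dim, hidden)
        self.con_in = nn.Linear(con_dim, hidden)
        self.con_up = nn.Linear(2 * hidden, hidden)
        self.var_up = nn.Linear(2 * hidden, hidden)
        self.readout = nn.Linear(hidden, 1)      # e.g., a feasibility logit

    def forward(self, var_feat, con_feat, A):
        v = torch.relu(self.var_in(var_feat))
        c = torch.relu(self.con_in(con_feat))
        c = torch.relu(self.con_up(torch.cat([c, A @ v], dim=-1)))
        v = torch.relu(self.var_up(torch.cat([v, A.t() @ c], dim=-1)))
        return self.readout(v.mean(dim=0))       # permutation-invariant pooling
```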
We also show that the above issue adversely affects the uncertainty calibration of the Laplace approximation and propose a solution using our invariant posterior. We empirically demonstrate that our posterior provides effective flatness and calibration measures with low complexity under such practical parameter transformations, supporting its practical effectiveness in line with our rationale.", Autoregressive Graph Network for Learning Multi-step Physics,https://openreview.net/forum?id=TjxCJ1DK-dm,https://openreview.net/pdf?id=TjxCJ1DK-dm,An Autoregressive Graph Network (GN) that learns forward particle-based physics using inductive biases.,"In this work, we propose an Autoregressive Graph Network (AGN) that learns forward physics using a temporal inductive bias. Currently, temporal state space information is provided as additional input to a GN when generating roll-out physics simulations. While this somewhat increases the network's predictive performance over multiple time steps, a temporal model enables the network to induce and learn temporal biases. In dynamical systems, the arrow of time simplifies possible interactions in the sense that we can assume current observations to be dependent on preceding states. The autoregressive property naturally induces the arrow of time and can further constrain physics-induced GNs to conserve symmetries over long time-steps. Our proposed GN encodes temporal state information using an autoregressive encoder that can compute latent temporal embeddings over multiple time steps in parallel during a single forward pass. We perform case studies that compare multi-step forward predictions against baseline data-driven GNs across diverse datasets that feature different particle interactions. Our approach outperforms the state-of-the-art GN and physics-induced GNs in 9 out of 10 and 7 out of 10 particle physics datasets, respectively, when conditioned on optimal historical states. ","graph network, autoregressive model, physics simulation, forward model, inverse model" Simple initialization and parametrization of sinusoidal networks via their kernel bandwidth,https://openreview.net/forum?id=yVqC6gCNf4d,https://openreview.net/pdf?id=yVqC6gCNf4d,We perform a theoretical analysis of a simplified sinusoidal network and use this to propose an informed initialization scheme.,"Neural networks with sinusoidal activations have been proposed as an alternative to networks with traditional activation functions. Despite their promise, particularly for learning implicit models, their training behavior is not yet fully understood, leading to a number of empirical design choices that are not well justified. In this work, we first propose a simplified version of such sinusoidal neural networks, which allows both for easier practical implementation and simpler theoretical analysis. We then analyze the behavior of these networks from the neural tangent kernel perspective and demonstrate that their kernel approximates a low-pass filter with an adjustable bandwidth. 
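For the sinusoidal-network entry above: the NTK bandwidth view suggests that a single frequency scale on the first layer acts as the bandwidth knob. The sketch below follows common practice for sinusoidal networks (a SIREN-style first layer); the exact constants of the paper's proposed initialization may differ.

```python
import torch
import torch.nn as nn

class SinusoidalLayer(nn.Module):
    """First layer of a sinusoidal network: x -> sin(w0 * (W x + b)).

    The frequency scale w0 controls the effective kernel bandwidth: a small
    w0 biases the network toward low-frequency (smooth) solutions, while a
    large w0 admits higher frequencies, so w0 can be tuned per task (e.g.,
    implicit models vs. differential equations). The uniform weight init is
    the common choice for sinusoidal first layers.
    """
    def __init__(self, in_dim, out_dim, w0=30.0):
        super().__init__()
        self.w0 = w0
        self.linear = nn.Linear(in_dim, out_dim)
        bound = 1.0 / in_dim
        nn.init.uniform_(self.linear.weight, -bound, bound)

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))
```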
Finally, we utilize these insights to inform the sinusoidal network initialization, optimizing performance for each of a series of tasks, including learning implicit models and solving differential equations.","sinusoidal, periodic, neural tangent kernel, implicit models, physics informed" Who are playing the games?,https://openreview.net/forum?id=afhc1a2x00V,https://openreview.net/pdf?id=afhc1a2x00V,"We show that one cannot get ""efficient"" Shapley values without correctly identifying the players (features), and propose a solution to this conundrum","The Shapley value has been widely used as a measure of feature importance of a predictive model, by treating a model as a cooperative game $(N, v)$. There have been many discussions on what the correct characteristic function $v$ should be, but almost all of the literature takes the player set $N$ to be the set of features. While in classical cooperative game scenarios, players are obvious and well defined, it is not clear whether we should treat each feature individually as a player in machine learning. In fact, adding or deleting a feature, even a redundant one, will change every feature's Shapley value and its rank among all features in a non-intuitive way. To address this problem, we introduce a new axiom called ``Consistency"", which characterizes the ``robustness"" of computed Shapley-like values against different player set identifications, and is specific to the machine learning setup. We show that while one can achieve Efficiency and Consistency in special cases, such as inessential games and 2-player games, they are contradictory to each other in general. This impossibility theorem signifies a conundrum of applying Shapley values in the feature selection process: The Shapley value is only axiomatically desirable if the players (features) are correctly identified; however, this prerequisite is exactly the purpose of the feature selection task. We then introduce the GroupShapley value to help address this dilemma, and as an additional bonus, GroupShapley values have a computational advantage over the classical Shapley values.","shapley values, model explainability" Quasiconvex Shallow Neural Network,https://openreview.net/forum?id=XFWLkEcLqDf,https://openreview.net/pdf?id=XFWLkEcLqDf,,"Deep neural networks generally have highly non-convex structures, resulting in multiple local optima of network weights. The non-convex network is likely to fail, i.e., to be trapped in bad local optima with large errors, especially when the task involves convexity (e.g., linearly separable classification). While convexity is essential in training neural networks, designing a convex network structure without strong assumptions (e.g., linearity) on the activation or loss function is challenging. To extract and utilize convexity, this paper presents the QuasiConvex shallow Neural Network (QCNN) architecture with mild assumptions. We first decompose the network into building blocks where quasiconvexity is thoroughly studied. Then, we design additional layers to preserve quasiconvexity when such building blocks are integrated into general networks. The proposed QCNN, interpreted as a quasiconvex optimization problem, allows for efficient training with theoretical guarantees. Specifically, we construct equivalent convex feasibility problems to solve the quasiconvex optimization problem. Our theoretical results are verified via extensive experiments on common machine learning tasks. 
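The GroupShapley value from the "Who are playing the games?" entry treats feature groups, rather than individual features, as the players. Under that reading, an exact (exponential-in-groups) computation looks like the following; `value_fn` is a hypothetical user-supplied coalition evaluator, e.g., validation performance using only the selected groups' features.

```python
from itertools import combinations
from math import factorial

def group_shapley(value_fn, n_groups):
    """Exact Shapley values where the players are feature groups.

    value_fn(frozenset_of_group_indices) -> float. Cost is exponential in
    n_groups, which is exactly why grouping makes Shapley computation far
    cheaper than per-feature attribution when n_groups << n_features.
    """
    players = range(n_groups)
    shap = [0.0] * n_groups
    for i in players:
        others = [p for p in players if p != i]
        for k in range(n_groups):
            for coal in combinations(others, k):
                # Classic Shapley weight |S|! (n - |S| - 1)! / n!
                weight = factorial(k) * factorial(n_groups - k - 1) / factorial(n_groups)
                marginal = value_fn(frozenset(coal) | {i}) - value_fn(frozenset(coal))
                shap[i] += weight * marginal
    return shap
```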
The quasiconvex structure in QCNN demonstrates even better learning ability than non-convex deep networks on some tasks.", The Best of Both Worlds: Accurate Global and Personalized Models through Federated Learning with Data-Free Hyper-Knowledge Distillation,https://openreview.net/forum?id=29V3AWjVAFi,https://openreview.net/pdf?id=29V3AWjVAFi,,"Heterogeneity of data distributed across clients limits the performance of global models trained through federated learning, especially in settings with highly imbalanced class distributions in local datasets. In recent years, personalized federated learning (pFL) has emerged as a potential solution to the challenges presented by heterogeneous data. However, existing pFL methods typically enhance the performance of local models at the expense of the global model's accuracy. We propose FedHKD (Federated Hyper-Knowledge Distillation), a novel FL algorithm in which clients rely on knowledge distillation (KD) to train local models. In particular, each client extracts and sends to the server the means of local data representations and the corresponding soft predictions -- information that we refer to as ``hyper-knowledge"". The server aggregates this information and broadcasts it to the clients in support of local training. Notably, unlike other KD-based pFL methods, FedHKD does not rely on a public dataset nor does it deploy a generative model at the server. We analyze the convergence of FedHKD and conduct extensive experiments on visual datasets in a variety of scenarios, demonstrating that FedHKD provides significant improvement in both personalized and global model performance compared to state-of-the-art FL methods designed for heterogeneous data settings.","Federated Learning, Representation Learning, Knowledge Distillation" Minimalistic Unsupervised Learning with the Sparse Manifold Transform,https://openreview.net/forum?id=nN_nBVKAhhD,https://openreview.net/pdf?id=nN_nBVKAhhD,"We build a ""white-box"" unsupervised learning model with two parsimonious principles: sparsity and low-rankness; the model can be viewed as the simplest form of VICReg.","We describe a minimalistic and interpretable method for unsupervised learning, without resorting to data augmentation, hyperparameter tuning, or other engineering designs, that achieves performance close to the SOTA SSL methods. Our approach leverages the sparse manifold transform, which unifies sparse coding, manifold learning, and slow feature analysis. With a one-layer deterministic sparse manifold transform, one can achieve $99.3\%$ KNN top-1 accuracy on MNIST, $81.1\%$ KNN top-1 accuracy on CIFAR-10 and $53.2\%$ on CIFAR-100. With a simple gray-scale augmentation, the model gets $83.2\%$ KNN top-1 accuracy on CIFAR-10 and $57\%$ on CIFAR-100. These results significantly close the gap between simplistic ``white-box'' methods and the SOTA methods. Additionally, we provide visualization to explain how an unsupervised representation transform is formed. The proposed method is closely connected to latent-embedding self-supervised methods and can be treated as the simplest form of VICReg. 
Though there remains a small performance gap between our simple constructive model and SOTA methods, the evidence points to this as a promising direction for achieving a principled and white-box approach to unsupervised learning.","Unsupervised Learning, Sparsity, Low-rank, Manifold learning, Spectral Embedding" Rewarding Episodic Visitation Discrepancy for Exploration in Reinforcement Learning,https://openreview.net/forum?id=Cj4aar5X65H,https://openreview.net/pdf?id=Cj4aar5X65H,The paper proposes a quantified and computation-efficient intrinsic reward method for improving exploration in reinforcement learning.,"Exploration is critical for deep reinforcement learning in complex environments with high-dimensional observations and sparse rewards. To address this problem, recent approaches have proposed leveraging intrinsic rewards to improve exploration, such as novelty-based exploration and prediction-based exploration. However, many intrinsic reward modules require sophisticated structures and representation learning, resulting in prohibitive computational complexity and unstable performance. In this paper, we propose Rewarding Episodic Visitation Discrepancy (REVD), a computation-efficient and quantified exploration method. More specifically, REVD provides intrinsic rewards by evaluating the Rényi divergence-based visitation discrepancy between episodes. To estimate the divergence efficiently, a $k$-nearest neighbor estimator is utilized with a randomly initialized state encoder. Finally, REVD is tested on Atari games and PyBullet robotics environments. Extensive experiments demonstrate that REVD can significantly improve the sample efficiency of reinforcement learning algorithms and outperform the benchmark methods.","reinforcement learning, exploration, intrinsic reward, computation-efficient" Over-Training with Mixup May Hurt Generalization,https://openreview.net/forum?id=JmkjrlVE-DG,https://openreview.net/pdf?id=JmkjrlVE-DG,We empirically discovered a U-shaped generalization curve of Mixup training.,"Mixup, which creates synthetic training instances by linearly interpolating random sample pairs, is a simple yet effective regularization technique to boost the performance of deep models trained with SGD. In this work, we report a previously unobserved phenomenon in Mixup training: on a number of standard datasets, the performance of Mixup-trained models starts to decay after training for a large number of epochs, giving rise to a U-shaped generalization curve. This behavior is further aggravated when the size of the original dataset is reduced. To help understand this behavior of Mixup, we show theoretically that Mixup training may introduce undesired data-dependent label noise into the synthesized data. By analyzing a least-squares regression problem with a random feature model, we explain why noisy labels may cause the U-shaped curve to occur: Mixup improves generalization by fitting the clean patterns in the early training stage, but as training progresses, Mixup overfits the noise in the synthetic data. 
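For the REVD entry above: the abstract specifies a k-nearest-neighbor estimator on top of a randomly initialized state encoder. The following is a generic kNN-novelty sketch consistent with that description, not the paper's exact Rényi-divergence estimator.

```python
import torch

def knn_visitation_reward(ep_embed, ref_embed, k=5):
    """k-NN based episodic visitation discrepancy, in the spirit of REVD.

    ep_embed:  (n, d) random-encoder embeddings of the current episode.
    ref_embed: (m, d) embeddings of a reference (earlier) episode, m >= k.
    A larger distance to the k-th nearest reference state means the current
    states are more novel relative to past visitation, so they receive a
    larger per-state intrinsic reward.
    """
    dists = torch.cdist(ep_embed, ref_embed)                 # (n, m)
    kth = dists.topk(k, dim=1, largest=False).values[:, -1]  # k-th smallest
    return torch.log(kth + 1.0)
```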
Extensive experiments are performed on a variety of benchmark datasets, validating this explanation.","Mixup, Generalization, Overfitting, Regularization" HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention,https://openreview.net/forum?id=0eTTKOOOQkV,https://openreview.net/pdf?id=0eTTKOOOQkV,,"The success of large-scale contrastive vision-language pretraining (CLIP) has benefited both visual recognition and multi-modality content understanding. The concise design gives CLIP an advantage in inference efficiency over other vision-language models with heavier cross-attention fusion layers, making it a popular choice for a wide spectrum of downstream tasks. However, CLIP features can hardly reflect the hierarchical nature of the high-level and fine-grained semantics conveyed in images and texts, which is arguably critical to vision-language understanding and reasoning. To this end, we equip both the visual and language branches in CLIP with hierarchy-aware attention, namely Hierarchy-aware CLIP (HiCLIP), to progressively discover semantic hierarchies layer-by-layer from both images and texts in an unsupervised manner. As a result, such hierarchical aggregation significantly improves the cross-modal alignment. To demonstrate the advantages of HiCLIP, we conduct qualitative analysis on its unsupervised hierarchy induction during inference, as well as extensive quantitative experiments on both visual recognition and vision-language downstream tasks.", Quantile Risk Control: A Flexible Framework for Bounding the Probability of High-Loss Predictions,https://openreview.net/forum?id=p6jsTidUkPx,https://openreview.net/pdf?id=p6jsTidUkPx,We propose a framework to rigorously and flexibly control the quantiles of the loss distribution incurred by a predictor or set of predictors.,"Rigorous guarantees about the performance of predictive algorithms are necessary in order to ensure their responsible use. Previous work has largely focused on bounding the expected loss of a predictor, but this is not sufficient in many risk-sensitive applications where the distribution of errors is important. In this work, we propose a flexible framework to produce a variety of bounds on quantiles of the loss distribution incurred by a predictor. Our method takes advantage of the order statistics of the observed loss values rather than relying on the sample mean alone. We show that a quantile is an informative way of quantifying predictive performance, and that our framework applies to a variety of quantile-based metrics, each targeting important subsets of the data distribution. We analyze the theoretical properties of our proposed method and demonstrate its ability to rigorously control loss quantiles on several real-world datasets.",distribution-free uncertainty quantification Text-Conditioned Graph Generation Using Discrete Graph Variational Autoencoders,https://openreview.net/forum?id=4UbhxQIjeSH,https://openreview.net/pdf?id=4UbhxQIjeSH,,"Inspired by recent progress in text-conditioned image generation, we propose a model for the novel problem of text-conditioned graph generation. In this paper, we introduce the Vector Quantized Text To Graph generator (VQ-T2G), a discrete graph variational autoencoder and autoregressive transformer for generating general graphs conditioned on text. We curate two multimodal datasets of graphs paired with text: a real-world dataset of 8000 subgraphs from the Wikipedia link network and a dataset of over 5000 synthetic graphs. 
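For the Mixup entry above, the standard Mixup construction makes the discussed data-dependent label noise visible: the interpolated label is only correct in expectation, and when two classes' inputs overlap, the mixed label can mislabel the synthetic point. A minimal sketch of the construction:

```python
import numpy as np
import torch

def mixup_batch(x, y_onehot, alpha=1.0):
    """Standard Mixup: convex combinations of random sample pairs.

    lam ~ Beta(alpha, alpha) mixes each example with a random partner.
    The mixed label lam * y_i + (1 - lam) * y_j is the data-dependent
    label that, per the analysis above, behaves like clean signal early
    in training but like label noise once the model starts to overfit it.
    """
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix
```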
Experimental results on these datasets demonstrate that VQ-T2G synthesises novel graphs with structure aligned with the text conditioning. Additional experiments in the unconditioned graph generation setting show VQ-T2G is competitive with existing unconditioned graph generation methods across various graph metrics. Code will be released on GitHub following paper acceptance.", Dynamic Neural Network is All You Need: Understanding the Robustness of Dynamic Mechanisms in Neural Networks,https://openreview.net/forum?id=qYO0f9WnUup,https://openreview.net/pdf?id=qYO0f9WnUup,,"Deep Neural Network (DNN) based solutions are being used to solve different day-to-day problems. Recently, DNNs have been deployed in real-time systems, and lowering energy consumption and response time has become a pressing need. To address this scenario, researchers have proposed early-exit Dynamic Neural Networks (DyNNs), where the amount of computation varies with input complexity. DyNNs are generally designed based on larger static DNNs (SDNNs). As DyNNs decrease energy consumption, it also becomes important to evaluate their robustness to ensure safety. However, few works have focused on the robustness of DyNNs. To address this issue, we conduct systematic studies to evaluate the robustness of DyNNs, formulating four research questions. These studies are performed on three models and two datasets. Through the studies, we find that DyNNs are more robust than SDNNs, and DyNNs can be used to generate adversarial samples efficiently. We also provide insight into design choices through these studies. Finally, we propose a novel attack that can decrease the effectiveness of DyNNs and can be used to evaluate design choices in DyNNs.", AutoMoE: Neural Architecture Search for Efficient Sparsely Activated Transformers,https://openreview.net/forum?id=3yEIFSMwKBC,https://openreview.net/pdf?id=3yEIFSMwKBC,AutoMoE: a flexible Neural Architecture Search framework to design efficient sparse models under latency constraints.,"Neural architecture search (NAS) has demonstrated promising results on identifying efficient Transformer architectures which outperform manually designed ones for natural language tasks like neural machine translation (NMT). Existing NAS methods operate on a space of dense architectures, where all of the sub-architecture weights are activated for every input. Motivated by the recent advances in sparsely activated models like the Mixture-of-Experts (MoE) model, we introduce sparse architectures with conditional computation into the NAS search space. Given this expressive search space which subsumes prior densely activated architectures, we develop a new framework AutoMoE to search for efficient sparsely activated sub-Transformers. AutoMoE sparse models obtain (i) 3x FLOPs reduction over manually designed dense Transformers and (ii) 23% FLOPs reduction over state-of-the-art NAS-generated dense sub-Transformers with parity in BLEU score on benchmark datasets for NMT. AutoMoE consists of three training phases: (a) Heterogeneous search space design with dense and sparsely activated Transformer modules (e.g., how many experts? where to place them? 
what should their sizes be?); (b) SuperNet training that jointly trains several subnetworks sampled from the large search space by weight-sharing; (c) Evolutionary search for the architecture with the optimal trade-off between task performance and computational constraints like FLOPs and latency.","Mixture-of-expert models, Neural architecture search, Efficiency" Learning Shareable Bases for Personalized Federated Image Classification,https://openreview.net/forum?id=MHgYMtHpKsC,https://openreview.net/pdf?id=MHgYMtHpKsC,,"Personalized federated learning (PFL) aims to leverage the collective wisdom of clients' data while constructing customized models that are tailored to individual clients' data distributions. Existing work on PFL mostly aims to personalize for participating clients. In this paper, we focus on a less studied but practically important scenario---generating a personalized model for a new client efficiently. Different from most previous approaches that learn a whole or partial network for each client, we explicitly model the clients' overall meta distribution and embed each client into a low-dimensional space. We propose FedBasis, a novel PFL algorithm that learns a set of few, shareable basis models, upon which each client only needs to learn the coefficients for combining them into a personalized network. FedBasis is parameter-efficient, robust, and more accurate compared to other competitive PFL baselines, especially in a low data regime, without increasing the inference cost. To demonstrate its applicability, we further present a PFL evaluation protocol for image classification, featuring larger data discrepancies across clients in both the image and label spaces as well as more faithful training and test splits.","Federated learning, Computer vision" Curriculum-inspired Training for Selective Neural Networks,https://openreview.net/forum?id=Hcq7zGgcsOg,https://openreview.net/pdf?id=Hcq7zGgcsOg,We propose a curriculum-inspired method to train selective neural network models by leveraging example difficulty scores.,"We consider the problem of training neural network models for selective classification, where the models have a reject option to abstain from predicting on certain examples as needed. Recent advances in curriculum learning have demonstrated the benefit of leveraging example difficulty scores in training deep neural networks for typical classification settings. Example difficulty scores are even more important in selective classification, as a lower prediction error rate can be achieved by rejecting hard examples and accepting easy ones. In this paper, we propose a curriculum-inspired method to train selective neural network models by leveraging example difficulty scores. Our method tailors the curriculum idea to selective neural network training by calibrating the ratio of easy and hard examples in each mini-batch, and exploiting difficulty ordering at the mini-batch level. 
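For the FedBasis entry above: the described personalization amounts to keeping a small set of shared basis networks fixed and learning only per-client combination coefficients. A minimal sketch under that assumption (the architecture details here are ours, not the paper's):

```python
import torch
import torch.nn as nn

class BasisCombinationClassifier(nn.Module):
    """Personalized model as a learned mixture of shared basis models.

    The basis models are learned federatedly and shared across clients;
    for a new client, they are frozen and only the low-dimensional `coef`
    is fitted, which is why personalization needs very little data.
    """
    def __init__(self, bases):
        super().__init__()
        self.bases = nn.ModuleList(bases)
        for p in self.bases.parameters():      # freeze shared bases
            p.requires_grad_(False)
        self.coef = nn.Parameter(torch.zeros(len(bases)))

    def forward(self, x):
        w = torch.softmax(self.coef, dim=0)    # combination weights
        return sum(w[i] * base(x) for i, base in enumerate(self.bases))
```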
Our experimental results demonstrate that our method outperforms both the state-of-the-art and alternative methods using vanilla curriculum techniques for training selective neural network models.","curriculum learning, selective classification" Layer-wise Balanced Activation Mechanism,https://openreview.net/forum?id=sqPEs1wEizU,https://openreview.net/pdf?id=sqPEs1wEizU,Layer-wise Balanced Activation Mechanism,"We propose a novel activation mechanism, called the LayerAct mechanism, to develop layer-wise balanced activation functions that converge faster and perform better than existing activation functions. During backpropagation in neural networks, the scale of the activation determines how much the parameters are trained on each sample. This indicates that the training of a neural network can be biased against certain samples when the distribution of activations is imbalanced among samples. With a simple experiment on an unnormalized network with rectified linear units (ReLUs) for activation, we show that there is a relationship between the sum of the activation scale and the training loss, which indicates that an imbalanced activation scale among samples can result in a bias in learning. Layer normalization (LayerNorm) can be used to avoid such problems of bias in learning by balancing the layer-wise distribution of inputs for activation functions. However, LayerNorm loses the mean and variance statistics of activated instances among samples during re-scaling and re-centering. Our proposed LayerAct mechanism balances the layer-wise distribution of activation outputs for all samples without re-scaling and re-centering; this way, LayerAct functions avoid not only the problem of bias in learning, but also the dilution problem of key statistics. LayerAct functions allow negative activation outputs when the activated signals have to be negative; thus, the machine can avoid bias shifts during learning, enabling rich representations. Moreover, the proposed LayerAct mechanism can be used with batch normalization (BatchNorm). Experiments show that LayerAct functions outperform the unbalanced element-level activation functions on two benchmark image classification datasets, CIFAR10 and CIFAR100. Given the essential role of activation in traditional multi-layer perceptrons (MLPs), convolutional neural networks (CNNs), and modern deep learning frameworks, our work on layer-wise activation, which fundamentally addresses the core mechanism of learning through multiple layers, will contribute to developing high-performance machine learning frameworks.","activation, imbalanced activation, normalization, layer-wise balanced activation, layer-level activation, LayerAct" A Probabilistic Framework For Modular Continual Learning,https://openreview.net/forum?id=SZYXyhE2c6f,https://openreview.net/pdf?id=SZYXyhE2c6f,We introduce a scalable modular continual learning algorithm that is capable of forward knowledge transfer across similar and dissimilar input domains.,"Continual learning (CL) algorithms seek to accumulate and transfer knowledge across a sequence of tasks and achieve better performance on each successive task. Modular approaches, which use a different composition of modules for each task and avoid forgetting by design, have been shown to be a promising direction for CL. However, searching through the large space of possible module compositions remains a challenge. In this work, we develop a scalable probabilistic search framework as a solution to this challenge. 
Our framework has two distinct components. The first is designed to transfer knowledge across similar input domains. To this end, it models each module’s training input distribution and uses a Bayesian model to find the most promising module compositions for a new task. The second component targets transfer across tasks with disparate input distributions or different input spaces and uses Bayesian optimisation to explore the space of module compositions. We show that these two methods can be easily combined and evaluate the resulting approach on two benchmark suites designed to capture different desiderata of CL techniques. The experiments show that our framework offers superior performance compared to state-of-the-art CL baselines.","continual learning, modular, Bayesian networks, Bayesian optimisation" Knowledge-Grounded Reinforcement Learning,https://openreview.net/forum?id=XYUaprBSDjp,https://openreview.net/pdf?id=XYUaprBSDjp,,"Receiving knowledge, abiding by laws, and being aware of regulations are common behaviors in human society. Bearing in mind that reinforcement learning (RL) algorithms benefit from mimicking human behavior, in this work, we propose that an RL agent can act on external guidance in both its learning process and model deployment, making the agent more socially acceptable. We introduce the concept of Knowledge-Grounded RL (KGRL), with a formal definition in which an agent learns to follow external guidelines while developing its own policy. Moving towards the goal of KGRL, we propose a novel actor model with an embedding-based attention mechanism that can attend to either a learnable internal policy or external knowledge. The proposed method is orthogonal to training algorithms, and the external knowledge can be flexibly recomposed, rearranged, and reused in both training and inference stages. Through experiments on tasks with discrete and continuous action spaces, our KGRL agent is shown to be more sample efficient and generalizable, and it has flexibly rearrangeable knowledge embeddings and interpretable behaviors.", Git Re-Basin: Merging Models modulo Permutation Symmetries,https://openreview.net/forum?id=CQsmMYmlP5T,https://openreview.net/pdf?id=CQsmMYmlP5T,,"The success of deep learning is due in large part to our ability to solve certain massive non-convex optimization problems with relative ease. Though non-convex optimization is NP-hard, simple algorithms -- often variants of stochastic gradient descent -- exhibit surprising effectiveness in fitting large neural networks in practice. We argue that neural network loss landscapes contain (nearly) a single basin after accounting for all possible permutation symmetries of hidden units, a la Entezari et al. 2021. We introduce three algorithms to permute the units of one model to bring them into alignment with a reference model in order to merge the two models in weight space. This transformation produces a functionally equivalent set of weights that lie in an approximately convex basin near the reference model. Experimentally, we demonstrate the single basin phenomenon across a variety of model architectures and datasets, including the first (to our knowledge) demonstration of zero-barrier linear mode connectivity between independently trained ResNet models on CIFAR-10 and CIFAR-100. Additionally, we identify intriguing phenomena relating model width and training time to mode connectivity. 
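For the Git Re-Basin entry above: weight matching between two models reduces, per layer, to a linear assignment problem over hidden units. The sketch below shows that single-layer building block; a full implementation must also propagate each permutation into the following layer's input dimension, which we omit here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_layer(w_a, w_b):
    """Permute the output units of one layer of model B to match model A.

    w_a, w_b: (out_dim, in_dim) weight matrices of the same layer in two
    independently trained models. We solve max_P <w_a, P w_b> with the
    Hungarian algorithm; the returned matrix is B's layer with its rows
    (hidden units) reordered to best match A's, one standard building
    block of weight-matching approaches.
    """
    cost = w_a @ w_b.T                          # similarity between unit pairs
    rows, cols = linear_sum_assignment(-cost)   # maximize total similarity
    perm = np.zeros_like(cost)
    perm[rows, cols] = 1.0
    return perm @ w_b
```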
Finally, we discuss shortcomings of the linear mode connectivity hypothesis, including a counterexample to the single basin theory.", The Tilted Variational Autoencoder: Improving Out-of-Distribution Detection,https://openreview.net/forum?id=YlGsTZODyjz,https://openreview.net/pdf?id=YlGsTZODyjz,A generalization of the Gaussian distribution improves the performance of out-of-distribution detection with variational autoencoders.,"A problem with using the Gaussian distribution as a prior for the variational autoencoder (VAE) is that the set on which Gaussians have high probability density shrinks as the latent dimension increases. This is an issue because VAEs try to attain both a high likelihood with respect to a prior distribution and, at the same time, separation between points for better reconstruction. Therefore, a small volume in the high-density region of the prior is problematic because it restricts the separation of latent points. To ameliorate this, we propose a simple generalization of the Gaussian distribution, called the tilted Gaussian, which attains its maximum probability density on a sphere instead of at a single point. The tilted Gaussian has exponentially more volume in high-density regions than the standard Gaussian as a function of the distribution dimension. We empirically demonstrate that this simple change in the prior distribution improves VAE performance on the task of detecting unsupervised out-of-distribution (OOD) samples. We also introduce a new OOD testing procedure, called the Will-It-Move test, where the tilted Gaussian achieves remarkable OOD performance.","Variational Autoencoder, OOD, Unsupervised" The Role of Coverage in Online Reinforcement Learning,https://openreview.net/forum?id=LQIjzPdDt3q,https://openreview.net/pdf?id=LQIjzPdDt3q,"This paper shows surprising connections between online and offline learnability, in particular, how coverage in offline RL enables exploration in online RL.","Coverage conditions---which assert that the data logging distribution adequately covers the state space---play a fundamental role in determining the sample complexity of offline reinforcement learning. While such conditions might seem irrelevant to online reinforcement learning at first glance, we establish a new connection by showing---somewhat surprisingly---that the mere existence of a data distribution with good coverage can enable sample-efficient online RL. Concretely, we show that coverability---that is, the existence of a data distribution that satisfies a ubiquitous coverage condition called concentrability---can be viewed as a structural property of the underlying MDP, and can be exploited by standard algorithms for sample-efficient exploration, even when the agent does not know said distribution. We complement this result by proving that several weaker notions of coverage, despite being sufficient for offline RL, are insufficient for online RL. 
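One way to write a density with the property described in the Tilted VAE entry (maximum on a sphere rather than at a point) is to penalize the deviation of the latent norm from a target radius. A sketch consistent with that description, with illustrative constants:

```python
import torch

def tilted_gaussian_log_density_unnorm(z, tau=5.0, sigma=1.0):
    """Unnormalized log-density of a 'tilted' Gaussian-style prior.

    The density depends on z only through its norm and peaks on the sphere
    ||z|| = tau rather than at the origin, so in high dimensions the
    high-density region has far more volume than a standard Gaussian's
    (which is recovered in shape as tau -> 0). tau and sigma are
    illustrative, not the paper's chosen values.
    """
    return -0.5 * ((z.norm(dim=-1) - tau) / sigma) ** 2
```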
We also show that existing complexity measures for online RL, including Bellman rank and Bellman-Eluder dimension, fail to optimally capture coverability, and propose a new complexity measure, the self-normalized coefficient, to provide a unification.","reinforcement learning theory, online RL, offline RL, learnability, general function approximation" Learning Mixture Models with Simultaneous Data Partitioning and Parameter Estimation,https://openreview.net/forum?id=36g8Ept_CCj,https://openreview.net/pdf?id=36g8Ept_CCj,PRESTO learns mixture models such that each model performs well on a data partition,"We study a new framework for learning mixture models via data partitioning, called PRESTO, wherein we optimize a joint objective function on the model parameters and the partitioning, with each model tailored to perform well on its specific partition. We connect PRESTO to a number of past works in data partitioning, mixture models, and clustering, and show that PRESTO generalizes several loss functions including the k-means and Bregman clustering objectives, the Gaussian mixture model objective, mixtures of support vector machines, and mixtures of linear regressions. We then propose a new joint discrete-continuous optimization algorithm which achieves a bounded approximation guarantee for any general loss function, thereby achieving guarantees for the aforementioned problems as well. We study PRESTO in the context of resource-efficient deep learning, where we train smaller, resource-constrained models on each partition and show that it outperforms existing data partitioning and model pruning/knowledge distillation approaches, which, in contrast to PRESTO, require large initial (teacher) models.","mixture models, resource constrained learning" Estimating Treatment Effects using Neurosymbolic Program Synthesis,https://openreview.net/forum?id=GVWySHBD3Cl,https://openreview.net/pdf?id=GVWySHBD3Cl,We estimate treatment effects/causal effects using neurosymbolic program synthesis by designing a domain specific language,"Estimating treatment effects from observational data is a central problem in causal inference. Methods to solve this problem exploit inductive biases and heuristics from causal inference to design multi-head neural network architectures and regularizers. In this work, we propose to use neurosymbolic program synthesis, a data-efficient and interpretable technique, to solve the treatment effect estimation problem. We theoretically show that neurosymbolic programming can solve the treatment effect estimation problem. By designing a Domain Specific Language (DSL) for treatment effect estimation based on the inductive biases used in the literature, we argue that neurosymbolic programming is a better alternative for treatment effect estimation than traditional models. Our empirical study reveals that our model, which implicitly encodes inductive biases in a DSL, achieves better performance on benchmark datasets than the state-of-the-art models.","Causal effect, treatment effect, neurosymbolic programming, domain specific language" Stateful Active Facilitator: Coordination and Environmental Heterogeneity in Cooperative Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=B4maZQLLW0_,https://openreview.net/pdf?id=B4maZQLLW0_,,"In cooperative multi-agent reinforcement learning, a team of agents works together to achieve a common goal. Different environments or tasks may require varying degrees of coordination among agents in order to achieve the goal in an optimal way. 
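The PRESTO entry above optimizes model parameters and the data partition jointly, generalizing k-means-style alternating minimization. A naive alternating loop of that shape is sketched below; the paper's actual algorithm adds approximation guarantees that this loop does not provide. `fit_fn` and `loss_fn` are hypothetical user-supplied callables.

```python
import numpy as np

def alternating_partition_fit(models, fit_fn, loss_fn, X, y, n_rounds=10):
    """Alternate data partitioning and per-partition parameter estimation.

    Each round (1) assigns every example to the model with the lowest loss
    on it, then (2) refits each model on its own partition. loss_fn(model,
    X, y) returns a per-example loss array; fit_fn(model, X, y) returns a
    refit model. With squared-error losses and mean estimators, this loop
    reduces to Lloyd's k-means algorithm.
    """
    n, m = len(X), len(models)
    assign = np.random.randint(0, m, size=n)           # random initial partition
    for _ in range(n_rounds):
        losses = np.stack([loss_fn(mod, X, y) for mod in models], axis=1)
        assign = losses.argmin(axis=1)                 # (1) re-partition
        for j in range(m):
            mask = assign == j
            if mask.any():
                models[j] = fit_fn(models[j], X[mask], y[mask])  # (2) refit
    return models, assign
```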
The nature of coordination will depend on the properties of the environment—its spatial layout, distribution of obstacles, dynamics, etc. We term this variation of properties within an environment heterogeneity. Existing literature has not sufficiently addressed the fact that different environments may have different levels of heterogeneity. We formalize the notions of the coordination level and heterogeneity level of an environment and present HECOGrid, a suite of multi-agent RL environments that facilitates empirical evaluation of different MARL approaches across different levels of coordination and environmental heterogeneity by providing quantitative control over the coordination and heterogeneity levels of the environment. Further, we propose a Centralized Training Decentralized Execution learning approach called Stateful Active Facilitator (SAF) that enables agents to work efficiently in high-coordination and high-heterogeneity environments through a differentiable and shared knowledge source used during training and dynamic selection from a shared pool of policies. We evaluate SAF and compare its performance against the baselines IPPO and MAPPO on HECOGrid. Our results show that SAF consistently outperforms the baselines across different tasks and different heterogeneity and coordination levels.", UNDERSTANDING HTML WITH LARGE LANGUAGE MODELS,https://openreview.net/forum?id=GVMwL15UrZO,https://openreview.net/pdf?id=GVMwL15UrZO,"Large language models are very effective at understanding HTML including navigating web pages, classifying elements, and generating descriptions of elements.","Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks. Yet, their capabilities for HTML understanding – i.e., parsing the raw HTML of a webpage, with applications to automation of web-based tasks, crawling, and browser-assisted retrieval – have not been fully explored. We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks: (i) Semantic Classification of HTML elements, (ii) Description Generation for HTML inputs, and (iii) Autonomous Web Navigation of HTML pages. While previous work has developed dedicated architectures and training procedures for HTML understanding, we show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks. For instance, fine-tuned LLMs are 12% more accurate at semantic classification compared to models trained exclusively on the task dataset. Moreover, when fine-tuned on data from the MiniWoB benchmark, LLMs successfully complete 50% more tasks using 192x less data compared to the previous best supervised model. To promote further research on LLMs for HTML understanding, we create and open-source a large-scale HTML dataset distilled and auto-labeled from CommonCrawl. 
We show evidence that T5-based models, owing to their bidirectional encoder-decoder architecture, are the best choice, and that for practitioners larger models are not necessarily better.","html understanding, web navigation, large language models, semantic classification, description generation" "KALM: Knowledge-Aware Integration of Local, Document, and Global Contexts for Long Document Understanding",https://openreview.net/forum?id=NxPQ3QOGTWl,https://openreview.net/pdf?id=NxPQ3QOGTWl,,"With the advent of pre-trained language models (LMs), increasing research efforts have been focusing on infusing commonsense and domain-specific knowledge to prepare LMs for downstream tasks. These works attempt to leverage knowledge graphs, the \textit{de facto} standard of symbolic knowledge representation, along with pre-trained LMs. While existing approaches leverage external knowledge, it remains an open question how to jointly incorporate knowledge graphs represented in varying contexts --- from local (e.g., sentence), document-level, to global knowledge, to enable knowledge-rich and interpretable exchange across contexts. In addition, incorporating varying contexts can especially benefit long document understanding tasks that leverage pre-trained LMs, typically bounded by the input sequence length. In light of these challenges, we propose \textbf{KALM}, a language model that jointly leverages knowledge in local, document-level, and global contexts for long document understanding. KALM first encodes long documents and knowledge graphs into the three knowledge-aware context representations. KALM then processes each context with context-specific layers. These context-specific layers are followed by a ContextFusion layer that facilitates knowledge exchange to derive an overarching document representation. Extensive experiments demonstrate that KALM achieves state-of-the-art performance on three long document understanding tasks across 6 datasets/settings. Further analyses reveal that the three knowledge-aware contexts are complementary and they all contribute to model performance, while the importance and information exchange patterns of different contexts vary on different tasks and datasets.","natural language processing, long document understanding, knowledge graphs" Kuiper: Moderated Asynchronous Federated Learning on Heterogeneous Mobile Devices with Non-IID Data,https://openreview.net/forum?id=AqiB_Tqqc8z,https://openreview.net/pdf?id=AqiB_Tqqc8z,We develop a moderated asynchronous algorithm for training on a video action recognition task on embedded devices with mobile GPUs.,"Federated learning allows multiple clients to jointly learn an ML model while keeping their data private. While synchronous federated learning (Sync-FL) requires the devices to share local gradients synchronously to provide better guarantees, it suffers from the problem of stragglers. This is the scenario where the faster clients have to wait for the slower ones, slowing the entire training process. Conventional techniques completely drop the updates from the stragglers and lose the opportunity to learn from the data they hold, which is especially important in a non-IID setting. Asynchronous learning (Async-FL) provides a potential solution to allow the clients to function at their own pace, which typically achieves faster convergence. Since edge devices have low compute, it is hard to train a video action recognition model on them. 
We present Kuiper, a variant of Async-FL, to help heterogeneous edge devices with limited resources learn a heavy model on video-action-recognition tasks with data distributed non-IID. Kuiper introduces a novel aggregation scheme, which solves the straggler problem while considering the different data distributions at different clients. Kuiper shows 11% faster convergence compared to Oort [OSDI-21], up to 12% and 9% improvement in test accuracy compared to FedBuff [AISTATS-22] and Oort [OSDI-21] on HMDB51, and 10% and 9% on UCF101.","Federated Learning, Edge devices, Non-IID, Video action recognition" Learning Achievement Structure for Structured Exploration in Domains with Sparse Reward,https://openreview.net/forum?id=NDWl9qcUpvy,https://openreview.net/pdf?id=NDWl9qcUpvy,,"We propose Structured Exploration with Achievements (SEA), a multi-stage reinforcement learning algorithm that learns the environment structure with offline data and uses the learned structure to learn different skills and improve overall exploration with online environment interactions in a particular type of environment that has an internal achievement system. SEA first uses a contrast-based loss function to learn the achievement representations and build an achievement classifier. It then tries to recover the environment achievement structure with a heuristic algorithm. Finally, SEA builds a meta-controller with the recovered structure to learn sub-policies and explore new tasks. While exploration in a procedurally generated environment with high-dimensional input like images is extremely hard for reinforcement learning agents, we demonstrate that SEA is still able to recover the underlying structure and explore new tasks in different domains.","deep reinforcement learning, structured exploration" Semi-Autoregressive Energy Flows: Towards Determinant-Free Training of Normalizing Flows,https://openreview.net/forum?id=GBU1mm8_WkV,https://openreview.net/pdf?id=GBU1mm8_WkV,,"Normalizing flows are a popular approach for constructing probabilistic and generative models. However, maximum likelihood training of flows is challenging due to the need to calculate computationally expensive determinants of Jacobians. This paper takes steps towards addressing this challenge by introducing objectives and model architectures for determinant-free training of flows. Central to our framework is the energy objective, a multidimensional extension of proper scoring rules that admits efficient estimators based on random projections. The energy objective does not require calculating determinants and therefore supports general flow architectures that are not well-suited to maximum likelihood training. In particular, we introduce semi-autoregressive flows, an architecture that can be trained with the energy loss, and that interpolates between fully autoregressive and non-autoregressive models, capturing the benefits of both. We empirically demonstrate that energy flows achieve competitive generative modeling performance while maintaining fast generation and posterior inference.", PINTO: Faithful Language Reasoning Using Prompted-Generated Rationales,https://openreview.net/forum?id=WBXbRs63oVu,https://openreview.net/pdf?id=WBXbRs63oVu,,"Neural language models (LMs) have achieved impressive results on various language-based reasoning tasks by utilizing latent knowledge encoded in their own pretrained parameters. 
To make this reasoning process more explicit, recent works retrieve a rationalizing LM's internal knowledge by training/prompting it to generate free-text rationales, which can be used to guide task predictions made by either the same LM or a separate reasoning LM. However, rationalizing LMs require expensive rationale annotation, without any assurance that the generated rationales improve LM task performance or faithfully reflect LM decision-making. In this paper, we propose PINTO, an LM pipeline that rationalizes via prompt-based learning, and learns to faithfully reason over rationales via counterfactual regularization. First, PINTO maps out a suitable reasoning process for the task input by prompting a frozen rationalizing LM to generate a free-text rationale. Second, PINTO's reasoning LM is fine-tuned to solve the task using the generated rationale as context, while regularized to output less confident predictions when the rationale is perturbed. Across four datasets, we show that PINTO significantly improves the generalization ability of the reasoning LM, yielding higher performance on both in-distribution and out-of-distribution test sets. Also, PINTO leverages the rationales more faithfully than competitive baselines do.","Commonsense reasoning, free-text rationale, rationale generation, faithful reasoning" State Decomposition for Model-free Partially observable Markov Decision Process,https://openreview.net/forum?id=ewYNygoJpa6,https://openreview.net/pdf?id=ewYNygoJpa6,This paper proposes a novel theory of state decomposition in POMDP and a simple algorithm to estimate the gap between state and observation.,"As an essential part of partially observable Markov theory, the measurement of the gap between states and observations is an important issue. In this paper, we propose a novel theory of state decomposition and a simple model-free metric algorithm ($\lambda$-algorithm) for estimating the gap between states and observations in a partially observable Markov decision process with a stationary environment and some missing state conditions. To verify our idea, we design a dimension ablation method to simulate different gaps in the cliff-walking experiment with Q-learning and Sarsa. The simulation results show that $\lambda$ increases steadily as more dimensions are ablated. This proves that $\lambda$ can adequately measure the gap.","POMDP, Reinforcement Learning, Decomposition, Shannon Entropy" Game Theoretic Mixed Experts for Combinational Adversarial Machine Learning,https://openreview.net/forum?id=ZBMpG7fWwOP,https://openreview.net/pdf?id=ZBMpG7fWwOP,,"Recent advances in adversarial machine learning have shown that defenses considered to be robust are actually susceptible to adversarial attacks which are specifically tailored to target their weaknesses. These defenses include Barrage of Random Transforms (BaRT), Friendly Adversarial Training (FAT), Trash is Treasure (TiT) and ensemble models made up of Vision Transformers (ViTs), Big Transfer models and Spiking Neural Networks (SNNs). It remains an open question, however, whether the adversarial examples designed to target one defense will be similarly misclassified by another defense. In this paper, we provide the first adversarial defense transferability study, as well as a game theoretic framework for ensemble adversarial attacks and defenses. Our framework is called Game theoretic Mixed Experts (GaME) and is designed to find the Mixed-Nash strategy for an attacker that can employ compositional adversarial attacks. 
We show that this framework creates an ensemble of defenses with greater robustness than a combinational defense with a uniform or random probability distribution. Overall, our framework and analyses advance the field of adversarial machine learning by yielding new insights into compositional attack and defense formulations.","Adversarial Machine Learning, Security" Return Augmentation gives Supervised RL Temporal Compositionality,https://openreview.net/forum?id=BKuboEUJd8u,https://openreview.net/pdf?id=BKuboEUJd8u,We propose a new data augmentation algorithm that enables RL via supervised methods to extrapolate beyond the best-performing trajectories in the offline dataset using bootstrapping.,"Offline Reinforcement Learning (RL) methods that use supervised learning or sequence modeling (e.g., Decision Transformer) work by training a return-conditioned policy. A fundamental limitation of these approaches, as compared to value-based methods, is that they have trouble generalizing to behaviors with a higher return than those seen during training. Value-based offline-RL algorithms like CQL use bootstrapping to combine training data from multiple trajectories to learn strong behaviors from sub-optimal data. We set out to endow RL via Supervised Learning (RvS) methods with this form of temporal compositionality. To do this, we introduce SuperB, a dynamic programming algorithm for data augmentation that augments the returns in the offline dataset by combining rewards from intersecting trajectories. We show theoretically that SuperB can improve sample complexity and enable RvS to find optimal policies in cases where it previously fell behind the performance of value-based methods. Empirically, we find that SuperB improves the performance of RvS in several offline RL environments, surpassing the prior state-of-the-art RvS agents in AntMaze by orders of magnitude and offering performance competitive with value-based algorithms on the D4RL-gym tasks.","reinforcement learning, offline reinforcement learning, decision transformer, behavioral cloning, dynamic programming, data augmentation" Neural Integral Equations,https://openreview.net/forum?id=-3br92QL76O,https://openreview.net/pdf?id=-3br92QL76O,Neural Integral Equations are a novel method that allows modeling non-local dynamics with complex spatio-temporal relations through neural networks,"Integral equations (IEs) are functional equations defined through integral operators, where the unknown function is integrated over a possibly multidimensional space. Important applications of IEs have been found throughout theoretical and applied sciences, including in physics, chemistry, biology, and engineering; often in the form of inverse problems. IEs are especially useful since differential equations, e.g., ordinary differential equations (ODEs) and partial differential equations (PDEs), can be formulated in an integral version, which is often more convenient to solve. Moreover, unlike ODEs and PDEs, IEs can model inherently non-local dynamical systems, such as ones with long distance spatio-temporal relations. While efficient algorithms exist for solving given IEs, no method exists that can learn an integral equation and its associated dynamics from data alone. In this article, we introduce Neural Integral Equations (NIE), a method that learns an unknown integral operator from data through a solver. 
We also introduce an attentional version of NIE, called Attentional Neural Integral Equations (ANIE), where the integral is replaced by self-attention, which improves scalability and provides interpretability. We show that learning dynamics via integral equations is faster than doing so via other continuous methods, such as Neural ODEs. Finally, we show that ANIE outperforms other methods on several benchmark tasks in ODE, PDE, and IE systems of synthetic and real-world data.","integral equations, dynamical systems, non-local equations, self-attention, brain dynamics" Excess Risk of Two-Layer ReLU Neural Networks in Teacher-Student Settings and its Superiority to Kernel Methods,https://openreview.net/forum?id=6doXHqwMayf,https://openreview.net/pdf?id=6doXHqwMayf,,"While deep learning has outperformed other methods for various tasks, theoretical frameworks that explain the reasons for its success have not been fully established. We investigate the excess risk of two-layer ReLU neural networks in a teacher-student regression model, in which a student network learns an unknown teacher network through its outputs. In particular, we consider a student network that has the same width as the teacher network and is trained in two phases: first by noisy gradient descent and then by the vanilla gradient descent. Our result shows that the student network provably reaches a near-global optimal solution and outperforms any kernel methods estimator (more generally, linear estimators), including the neural tangent kernel approach, random feature model, and other kernel methods, in the sense of the minimax optimal rate. The key concept inducing this superiority is the non-convexity of the neural network models. Even though the loss landscape is highly non-convex, the student network adaptively learns the teacher neurons.","Deep learning theory, optimization, learning theory, excess risk" Automatic Data Augmentation via Invariance-Constrained Learning,https://openreview.net/forum?id=4hhtHQLGDQO,https://openreview.net/pdf?id=4hhtHQLGDQO,Imposing Invariance Constraints to models enables learning Data augmentation policies that can improve generalization.,"Underlying data structures, such as symmetries or invariances to transformations, are often exploited to improve the solution of learning tasks. However, embedding these properties in models or learning algorithms can be challenging and computationally intensive. Data augmentation, on the other hand, induces these symmetries during training by applying multiple transformations to the input data. Despite its ubiquity, its effectiveness depends on the choices of which transformations to apply, when to do so, and how often. In fact, there is both empirical and theoretical evidence that the indiscriminate use of data augmentation can introduce biases that outweigh its benefits. This work tackles these issues by automatically adapting the data augmentation while solving the learning task. To do so, it formulates data augmentation as an invariance-constrained learning problem and leverages Markov Chain Monte Carlo (MCMC) sampling to solve it. The result is a practical algorithm that not only does away with a priori searches for augmentation distributions, but also dynamically controls if and when data augmentation is applied. Our experiments illustrate the performance of this method, which achieves state-of-the-art results in automatic data augmentation benchmarks for CIFAR datasets. 
Furthermore, this approach can be used to gather insights on the actual symmetries underlying a learning task.","Automatic data augmentation, Invariance, Constrained Learning, Image classification" GEASS: Neural causal feature selection for high-dimensional biological data,https://openreview.net/forum?id=aKcS3xojnwY,https://openreview.net/pdf?id=aKcS3xojnwY,"We propose a new method (GEASS) to identify causally interacting features for high-dimensional spatial/temporal structured data, and apply it to several biological datasets to infer causal regulatory patterns.","Identifying nonlinear causal relationships in high-dimensional biological data is an important task. However, current neural network based causality detection approaches for such data suffer from poor interpretability and cannot scale well to the high-dimensional regime. Here we present GEASS (Granger fEAture Selection of Spatiotemporal data), which identifies sparse Granger causality mechanisms of high-dimensional spatiotemporal data with a single neural network. GEASS maximizes sparsity-regularized multi-dimensional transfer entropy with a theoretical guarantee of recovering features with spatial/temporal Granger causal relationships. The sparsity regularization is achieved by a novel combinatorial stochastic gate layer to select sparse non-overlapping feature subsets. We demonstrate the efficacy of GEASS in several synthetic datasets and real biological data from single-cell RNA sequencing and spatial transcriptomics.","Granger causality, feature selection, neural networks, single-cell genomics, spatial transcriptomics" Unsupervised 3D Scene Representation Learning via Movable Object Inference,https://openreview.net/forum?id=BgoOPulkznY,https://openreview.net/pdf?id=BgoOPulkznY,"Unsupervised, category-agnostic, object-centric 3D representation learning for complex scenes","Unsupervised, category-agnostic, object-centric 3D representation learning for complex scenes remains an open problem in computer vision. While a few recent methods can now discover 3D object radiance fields from a single image without supervision, they are limited to simplistic scenes with objects of a single category, often with a uniform color. This is because they discover objects purely based on appearance cues—objects are made of pixels that look alike. In this work, we propose Movable Object Radiance Fields (MORF), aiming at scaling to complex scenes with diverse categories of objects. Inspired by the cognitive science of object learning in babies, MORF learns 3D object representations via movable object inference. During training, MORF first obtains 2D masks of movable objects via a self-supervised movable object segmentation method; it then bridges the gap to 3D object representations via conditional neural rendering in multiple views. During testing, MORF can discover, reconstruct, and move unseen objects from novel categories, all from a single image. 
Experiments show that MORF extracts accurate object geometry and supports realistic object and scene reconstruction and editing, significantly outperforming the state-of-the-art.","3D representation learning, self-supervised learning, object-discovery, neural rendering" FoveaTer: Foveated Transformer for Image Classification,https://openreview.net/forum?id=Gy8vD-zGQqH,https://openreview.net/pdf?id=Gy8vD-zGQqH,,"Many animals and humans process the visual field with varying spatial resolution (foveated vision) and use peripheral processing to make eye movements and point the fovea to acquire high-resolution information about objects of interest. This architecture results in computationally efficient rapid scene exploration. Recent progress in self-attention-based vision Transformers, an alternative to the traditionally convolution-reliant computer vision systems, allows global interactions between feature locations and increases robustness to adversarial attacks. However, the Transformer models do not explicitly model the foveated properties of the visual system nor the interaction between eye movements and the classification task. We propose the Foveated Transformer (FoveaTer) model, which uses pooling regions and eye movements to perform object classification tasks using a Vision Transformer architecture. Using square pooling regions or biologically-inspired radial-polar pooling regions, our proposed model pools the image features from the convolution backbone and uses the pooled features as an input to transformer layers. It decides on the subsequent fixation location based on the attention assigned by the Transformer to various locations from past and present fixations. The model uses a confidence threshold to stop scene exploration. It dynamically allocates more fixation/computational resources to more challenging images before making the final image category decision. We construct a Foveated model using our proposed approach and compare it against a Baseline model, which does not contain any pooling. Using five ablation studies, we evaluate the contribution of different components of the Foveated model. We perform a psychophysics scene categorization task and use the experimental data to find a suitable radial-polar pooling region combination. We also show that the Foveated model better explains the human decisions in a scene categorization task than a Baseline model. On the ImageNet dataset, the Foveated model with Dynamic-stop achieves an accuracy $8\%$ below the Baseline model with a throughput gain of $76\%$. An ensemble of the Foveated model with Dynamic-stop and the Baseline model achieves an accuracy only $0.7\%$ below the Baseline at the same throughput. We demonstrate our model's robustness against PGD adversarial attacks with both types of pooling regions, where we see the Foveated model outperform the Baseline model.", Linearly Mapping from Image to Text Space,https://openreview.net/forum?id=8tYRqb05pVn,https://openreview.net/pdf?id=8tYRqb05pVn,"Language models (LMs) can 'understand' images through a single tuned linear layer between a frozen image encoder and the LM input, showcasing the similarities in their conceptual representation spaces.","The extent to which text-only language models (LMs) learn to represent the physical, non-linguistic world is an open question. Prior work has shown that pretrained LMs can be taught to ``understand'' visual inputs when the models' parameters are updated on image captioning tasks. 
We test a stronger hypothesis: that the conceptual representations learned by text-only models are functionally equivalent (up to a linear transformation) to those learned by models trained on vision tasks. Specifically, we show that the image representations from vision models can be transferred as continuous prompts to frozen LMs by training only a single linear projection. Using these to prompt the LM achieves competitive performance on captioning and visual question answering tasks compared to models that tune both the image encoder and text decoder (such as the MAGMA model). We compare three image encoders with increasing amounts of linguistic supervision seen during pretraining: BEIT (no linguistic information), NF-ResNET (lexical category information), and CLIP (full natural language descriptions). We find that all three encoders perform equally well at transferring visual property information to the language model (e.g., whether an animal is large or small), but that image encoders pretrained with linguistic supervision more saliently encode category information (e.g., distinguishing hippo vs.\ elephant) and thus perform significantly better on benchmark language-and-vision tasks. Our results indicate that LMs encode conceptual information structurally similarly to vision-based models, even those that are solely trained on images.","representation learning, deep learning, grounded language learning, nlp, dl, image, image captioning, language grounding, grounded" Actor-Critic Alignment for Offline-to-Online Reinforcement Learning,https://openreview.net/forum?id=z70d8UBFDKF,https://openreview.net/pdf?id=z70d8UBFDKF,We propose a new actor-critic alignment method that allows safe offline-to-online reinforcement learning and achieves strong empirical performance.,"Deep offline reinforcement learning has recently demonstrated considerable promise in leveraging offline datasets, providing high-quality models that significantly reduce the online interactions required for fine-tuning. However, such a benefit is often diminished due to the marked state-action distribution shift, which causes significant bootstrap error and wipes out the good initial policy. Existing solutions resort to constraining the policy shift or balancing the sample replay based on their online-ness. However, they require online estimation of distribution divergence or density ratio. To avoid such complications, we propose deviating from existing actor-critic approaches that directly transfer the state-action value functions. Instead, we post-process them by aligning with the offline learned policy, so that the Q-values for actions *outside* the offline policy are also tamed. As a result, the online fine-tuning can be simply performed as in the standard actor-critic algorithms. 
We show empirically that the proposed method improves the performance of fine-tuned robotic agents on various simulated tasks.","Offline Reinforcement Learning, Offline-to-Online" Characterizing intrinsic compositionality in transformers with Tree Projections,https://openreview.net/forum?id=sAOOeI878Ns,https://openreview.net/pdf?id=sAOOeI878Ns,We provide a method to functionally project a transformer into the space of tree structured models and use it to uncover intrinsic compositionality of transformers trained on language data.,"When trained on language data, do transformers learn some arbitrary computation that utilizes the full capacity of the architecture or do they learn a simpler, tree-like computation, hypothesized to underlie compositional meaning systems like human languages? There is an apparent tension between compositional accounts of human language understanding, which are based on a restricted bottom-up computational process, and the enormous success of neural models like transformers, which can route information arbitrarily between different parts of their input. One possibility is that these models, while extremely flexible in principle, in practice learn to interpret language hierarchically, ultimately building sentence representations close to those predictable by a bottom-up, tree-structured model. To evaluate this possibility, we describe an unsupervised and parameter-free method to \emph{functionally project} the behavior of any transformer into the space of tree-structured networks. Given an input sentence, we produce a binary tree that approximates the transformer's representation-building process and a score that captures how ``tree-like'' the transformer's behavior is on the input. While calculation of this score does not require training any additional models, it provably upper-bounds the fit between a transformer and any tree-structured approximation. Using this method, we show that transformers for three different tasks become more tree-like over the course of training, in some cases recovering, without supervision, the same trees as supervised parsers. These trees, in turn, are predictive of model behavior, with more tree-like models generalizing better on tests of compositional generalization.","Transformers, Unsupervised syntax, hierarchical computation, compositionality" What Do We Maximize in Self-Supervised Learning And Why Does Generalization Emerge?,https://openreview.net/forum?id=tuE-MnjN7DV,https://openreview.net/pdf?id=tuE-MnjN7DV,Analyzing self-supervised learning from an information-theoretic perspective,"In this paper, we provide an information-theoretic (IT) understanding of self-supervised learning methods, their construction, and optimality. First, we demonstrate how IT quantities can be obtained for deterministic networks, as an alternative to the commonly used but unrealistic assumption of stochastic networks. Second, we demonstrate how different SSL models can be (re)discovered from first principles and highlight the underlying assumptions of different SSL variants. Third, we derive a novel generalization bound based on our IT understanding of SSL methods, providing generalization guarantees for the downstream supervised learning task. As a result of this bound, along with our unified view of SSL, we can compare the different approaches and provide general guidelines to practitioners. 
Consequently, our derivation and insights can contribute to a better understanding of SSL and transfer learning from a theoretical and practical perspective.","Self Supervised Learning, Neural Networks, Information Theory, Generalization Bounds" SmartFRZ: An Efficient Training Framework using Attention-Based Layer Freezing,https://openreview.net/forum?id=i9UlAr1T_xl,https://openreview.net/pdf?id=i9UlAr1T_xl,,"There has been a proliferation of artificial intelligence applications, where model training is key to providing high-quality services for these applications. However, the model training process is both time-intensive and energy-intensive, inevitably affecting the user's demand for application efficiency. Layer freezing, an efficient model training technique, has been proposed to improve training efficiency. Although existing layer freezing methods demonstrate the great potential to reduce model training costs, they still have shortcomings such as lacking generalizability and compromised accuracy. For instance, existing layer freezing methods either require freeze configurations manually defined via pre-training, which do not apply to different networks, or use heuristic freezing criteria that make it hard to guarantee decent accuracy in different scenarios. Therefore, there is still no generic and smart layer freezing method that can automatically perform ``in-situ'' layer freezing for different networks during training processes. To this end, we propose a generic and efficient training framework (SmartFRZ). The core proposed technique in SmartFRZ is attention-guided layer freezing, which can automatically select the appropriate layers to freeze without compromising accuracy. Experimental results show that SmartFRZ effectively reduces the amount of computation in training and achieves significant training acceleration, and outperforms the state-of-the-art layer freezing approaches.", Similarity-Based Cooperation,https://openreview.net/forum?id=r0LQFDOwfbU,https://openreview.net/pdf?id=r0LQFDOwfbU,"If ML agents observe how similar they are to each other, they can cooperate in the one-shot Prisoner's Dilemma.","As ML agents act more autonomously in the world, they will increasingly interact with each other. Unfortunately, in many social dilemmas like the one-shot Prisoner’s Dilemma, standard game theory predicts that ML agents will fail to cooperate with each other. Prior work has shown that one way to enable cooperative outcomes in the one-shot Prisoner’s Dilemma is to make the agents mutually transparent to each other, i.e., to allow them to access one another’s source code (Rubinstein 1997; Tennenholtz 2004) or weights in the case of ML agents. However, full transparency is often unrealistic, whereas partial transparency is commonplace. Moreover, it is difficult to machine-learn to cooperate in the full transparency setting. In this paper, we introduce a more realistic setting in which agents only observe a single number indicating how similar they are to each other. We prove that this allows for the same set of cooperative outcomes as the full transparency setting. 
We also demonstrate experimentally that cooperation can be learned using simple ML methods.","multi-agent reinforcement learning, cooperative AI, program equilibrium" Consistent Data Distribution Sampling for Large-scale Retrieval,https://openreview.net/forum?id=NUU2tFxUjRa,https://openreview.net/pdf?id=NUU2tFxUjRa,A novel negative sampling strategy to tackle training-inference inconsistency of data distribution for large-scale retrieval.,"Retrieving candidate items with low latency and computational cost is important for large-scale advertising systems. Negative sampling is a general approach to model million-scale items with rich features in retrieval. The training-inference inconsistency of data distribution brought about by sampling negatives is a key challenge. In this work, we propose a novel negative sampling strategy, Consistent Data Distribution Sampling (CDDS), to solve such an issue. Specifically, we employ a relatively large set of uniform training negatives and batch negatives to adequately train long-tail and hot items respectively, and employ high-divergence negatives to improve the learning convergence. To make the above training samples approximate the serving item data distribution, we introduce an auxiliary loss based on an asynchronous item embedding matrix over the entire item pool. Offline experiments on real datasets achieve SOTA performance. Online experiments with multiple advertising scenarios show that our method has achieved significant increases in GMV. The source code will be released in the future.","Retrieval, Neural Networks, Deep Learning, Recommender Systems, Information Systems" NOVEL FEATURE REPRESENTATION STRATEGIES FOR TIME SERIES FORECASTING WITH PREDICTED FUTURE COVARIATES,https://openreview.net/forum?id=sdlplaOsLdw,https://openreview.net/pdf?id=sdlplaOsLdw,We propose two novel feature representation strategies for time series forecasting with predicted future covariates.,"Accurate time series forecasting is a fundamental challenge in data science. Unlike traditional statistical methods, conventional machine learning models, such as RNNs and CNNs, use historical data consisting of previously measured variables including the forecast variable and all its covariates. However, in many applications, some of the covariates can be predicted with reasonable accuracy for the immediate future. Note that the input may also contain some covariates that cannot be accurately predicted. We consider the problem of predicting water levels at a given location in a river or canal system using historical data and future covariates, some of which (e.g., precipitation, tide) may be predictable. In many applications, for some covariates of interest, it may be possible to use historical data or accurate predictions for the near future. Traditional methods to incorporate future predictable covariates have major limitations. The strategy of simply concatenating the future predicted covariates to the input vector is highly likely to miss the past-future connection. Another strategy that iteratively predicts one step at a time can end up with prediction error accumulation. We propose two novel feature representation strategies to solve those limitations -- shifting and padding, which create a framework for contextually linking the past with the predicted future, while avoiding any accumulation of prediction errors. Extensive experiments on three well-known datasets revealed that our strategies, when applied to RNN and CNN backbones, outperform existing methods. 
Our experiments also suggest a relationship between the amount of shifting and padding and the periodicity of the time series.","time series forecasting, future covariates, shifting, padding, periodicity, deep learning" Augmentation Component Analysis: Modeling Similarity via the Augmentation Overlaps,https://openreview.net/forum?id=5vM51iamNeL,https://openreview.net/pdf?id=5vM51iamNeL,,"Self-supervised learning aims to learn embeddings with which semantically similar samples are close. Contrastive learning methods pull views of samples together and push different samples away, which utilizes the semantic invariance of augmentations but ignores the relationships between samples. To better exploit the power of augmentation, we observe that semantically similar samples are more likely to have similar augmented views. So the augmentation feature, composed of the distribution of augmentations, can act as the ideal embedding, and similarity over it reveals how much the augmentations of two samples overlap. To avoid the computational burden of explicitly estimating its value, we propose Augmentation Component Analysis (ACA) with a contrastive-like loss to learn principal components and an on-the-fly projection loss to embed data. ACA is equivalent to an efficient dimension reduction by PCA and extracts low-dimensional embeddings, theoretically preserving the similarity of augmentation distributions between samples. Empirical results show our method can achieve competitive results against various traditional contrastive learning methods on different benchmarks.", Reproducible Bandits,https://openreview.net/forum?id=gcD2UtCGMc2,https://openreview.net/pdf?id=gcD2UtCGMc2,We provide a definition of reproducibility in the context of stochastic bandit problems and we develop algorithms with low regret in various environments.,"In this paper, we introduce the notion of reproducible policies in the context of stochastic bandits, one of the canonical problems in interactive learning. A policy in the bandit environment is called reproducible if it pulls, with high probability, the \emph{exact} same sequence of arms in two different and independent executions (i.e., under independent reward realizations). We show that not only do reproducible policies exist, but they also achieve almost the same optimal (non-reproducible) regret bounds in terms of the time horizon. More specifically, in the stochastic multi-armed bandits setting, we develop a policy with an optimal problem-dependent regret bound whose dependence on the reproducibility parameter is also optimal. Similarly, for stochastic linear bandits (with finitely and infinitely many arms) we develop reproducible policies that achieve the best-known problem-independent regret bounds with an optimal dependency on the reproducibility parameter. Our results show that even though randomization is crucial for the exploration-exploitation trade-off, an optimal balance can still be achieved while pulling the exact same arms in two different rounds of executions. ","Interactive Learning, Reproducible Learning, Bandit Algorithms" Persistence-based Contrastive Learning with Graph Neural Recurrent Networks for Time-series Forecasting,https://openreview.net/forum?id=2RjnzZqax1J,https://openreview.net/pdf?id=2RjnzZqax1J,,"In recent years, combinations of graph convolution and recurrent architectures have emerged as a new powerful alternative for multivariate spatio-temporal forecasting, with applications ranging from biosurveillance to traffic monitoring. 
However, such methods often suffer from vulnerability to noise and limited generalization abilities, especially when semantics and structural properties of time series evolve over time. To address these limitations, we propose a simple yet flexible and highly effective framework, i.e., Persistence-based Contrastive Learning with Graph Neural Recurrent Networks (PCL-GCRN). The key idea behind PCL-GCRN is the notion of topological invariance that we introduce to contrastive graph learning for multivariate spatio-temporal processes. PCL-GCRN allows us to simultaneously focus on the most important data shape characteristics at different granularities, which play a key role in learning performance. As a result, PCL-GCRN leads to richer data augmentation, improved performance, and enhanced robustness. Our extensive experiments on a broad range of real-world datasets, from spatio-temporal forecasting of traffic to monkeypox surveillance, suggest that PCL-GCRN yields competitive results in terms of both prediction accuracy and robustness, outperforming 19 competing approaches.","Spatio-temporal forecasting, graph neural network, topological data analysis, contrastive learning" ACE-EM: Boosted ab initio Cryo-EM 3D Reconstruction with Asymmetric Complementary Autoencoder,https://openreview.net/forum?id=wwHOYTpfRqH,https://openreview.net/pdf?id=wwHOYTpfRqH,3D cryo-EM reconstruction with ACE-EM,"Cryo-electron microscopy (cryo-EM) is an imaging technique for obtaining high-resolution biomolecular structures. The central problem in cryo-EM is to recover the underlying 3-dimensional (3D) objects from 2-dimensional (2D) projection images. Aside from signal corruptions and extremely low signal-to-noise ratio, a major challenge in cryo-EM 3D reconstruction is to estimate the poses of 3D objects during the projection image formation, which are missing from experimental measurements. Recent methods attempted to solve the pose estimation problem using the autoencoder architecture. A key issue with this approach is that the latent vector is only indirectly updated through the decoder. The encoder's learning of the pose space can be easily trapped in a local subspace, resulting in suboptimal pose inferences and inferior 3D reconstruction quality. Here we present a modified autoencoder architecture called ACE (asymmetric complementary autoencoder) and design the ACE-EM method, which consists of two tasks, to solve this issue. The first task takes projection images and outputs predicted images using an image encoder followed by a pose decoder. The second task reverses the order of encoder and decoder, which takes randomly sampled poses and outputs predicted poses. The two tasks complement each other and can achieve a more balanced training of the encoder-decoder parameters. Compared to other methods, ACE-EM can reach higher pose space coverage within the same training time and has achieved state-of-the-art 3D reconstruction results for several benchmark datasets. ","autoencoder, electron microscopy, cryo-EM, 3D reconstruction, pose inference, asymmetric complementary autoencoder" Diffusion-based point cloud generation with smoothness constraints,https://openreview.net/forum?id=ZDpSoddiLRR,https://openreview.net/pdf?id=ZDpSoddiLRR,,"Diffusion models have been popular for point cloud generation tasks. 
Existing works utilize the forward diffusion process as a discrete Markov Chain to convert the original point distribution into a noise distribution (e.g., standard Gaussian distribution) and learn the reverse diffusion process to recover the target point distribution from the noise distribution. However, the diffusion process can produce samples with non-uniform points on the surface without considering the geometric features of the point cloud. To alleviate the problem, we propose a novel diffusion-based framework for point cloud generation and incorporate the local smoothness constraint into the generation process. Experiments demonstrate that the proposed model is capable not only of generating realistic shapes but also of generating more uniform point clouds, outperforming multiple state-of-the-art methods. ", NEURAL HAMILTONIAN FLOWS IN GRAPH NEURAL NETWORKS,https://openreview.net/forum?id=lhPLT5gnBrH,https://openreview.net/pdf?id=lhPLT5gnBrH,,"Graph neural networks (GNNs) suffer from oversmoothing and oversquashing problems when node features are updated over too many layers. Embedding spaces can also vary significantly for different data types, leading to the need for different GNN model types. In this paper, we model the embedding of a node feature as a Hamiltonian flow over time. As in physics, where the Hamiltonian flow conserves energy over time, the induced GNNs enable a more stable feature updating mechanism. Moreover, since the Hamiltonian flows are defined on a general symplectic manifold, this approach allows us to learn the underlying manifold of the graph in training, in contrast to most of the existing literature that assumes a fixed graph embedding manifold. We test Hamiltonian flows of different forms and demonstrate empirically that our approach achieves better node classification accuracy than popular state-of-the-art GNNs.", Convergence Analysis of Split Learning on Non-IID Data,https://openreview.net/forum?id=SNONkz5zEUF,https://openreview.net/pdf?id=SNONkz5zEUF,Convergence Analysis of Split Learning on Non-IID Data,"Split Learning (SL) is one promising variant of Federated Learning (FL), where the AI model is split and trained at the clients and the server collaboratively. By offloading the computation-intensive portions to the server, SL enables efficient model training on resource-constrained clients. Despite its booming applications, SL still lacks rigorous convergence analysis on non-IID data, which is critical for hyperparameter selection. In this paper, we first prove that SL exhibits an $\mathcal{O}(1/\sqrt{T})$ convergence rate for non-convex objectives on non-IID data, where $T$ is the number of total steps. Comparing the convergence analysis with experimental results shows that SL can outperform FL in terms of convergence rate (w.r.t. per-client training/communication rounds, and hence, the computation efficiency) and exhibit comparable accuracy to FL on mildly non-IID data. In contrast, FL prevails on highly non-IID data.","Federated Learning, Split Learning, Convergence analysis" Principal Trade-off Analysis,https://openreview.net/forum?id=Bvnjqe3ZroD,https://openreview.net/pdf?id=Bvnjqe3ZroD,A decomposition method that represents a game as a sum of planar embeddings,"The focus on equilibrium solutions in games underemphasizes the importance of understanding their overall structure. A different set of tools is needed for learning and representing the general structure of a game. 
In this paper we illustrate ""Principal Trade-off Analysis"" (PTA), a decomposition method that embeds games into a low-dimensional feature space, and argue that the embeddings are more revealing than previously demonstrated. Here, we develop an analogy to Principal Component Analysis (PCA). PTA represents an arbitrary two-player zero-sum game as the weighted sum of pairs of orthogonal 2D feature planes. We show that each of the feature planes represents unique strategic trade-offs (cyclic modes) and that truncation of the sequence provides insightful model reduction. We demonstrate the validity of PTA on a pair of games (Blotto, Pokemon). In Blotto, PTA identifies game symmetries and specifies strategic trade-offs associated with distinct win conditions. These symmetries reveal limitations of PTA unaddressed in previous work. For Pokemon, PTA recovers clusters that naturally correspond to Pokemon types, correctly identifies the designed tradeoff between those types, and discovers a rock-paper-scissor (RPS) cycle in the Pokemon generation type - all without any specific information except game outcomes. ","Learning theory, Representation Learning, algorithmic game theory, Functional form games, matrix decomposition" Neural Bregman Divergences for Distance Learning,https://openreview.net/forum?id=nJ3Vx78Nf7p,https://openreview.net/pdf?id=nJ3Vx78Nf7p,We develop the first effective deep learning tooling for learning arbitrary Bregman divergences,"Many metric learning tasks, such as triplet learning, nearest neighbor retrieval, and visualization, are treated primarily as embedding tasks where the ultimate metric is some variant of the Euclidean distance (e.g., cosine or Mahalanobis), and the algorithm must learn to embed points into the pre-chosen space. The study of non-Euclidean geometries is often neglected, which we believe is due to a lack of tools for learning non-Euclidean measures of distance. Recent work has shown that Bregman divergences can be learned from data, opening a promising approach to learning asymmetric distances. We propose a new approach to learning arbitrary Bregman divergences in a differentiable manner via input convex neural networks and show that it overcomes significant limitations of previous works. We also demonstrate that our method more faithfully learns divergences over a set of both new and previously studied tasks, including asymmetric regression, ranking, and clustering. Our tests further extend to known asymmetric, but non-Bregman tasks, where our method still performs competitively despite misspecification, showing the general utility of our approach for asymmetric learning. ","metric learning, Bregman divergences, distance learning, embedding" Neural Autoregressive Refinement for Self-Supervised Outlier Detection beyond Images,https://openreview.net/forum?id=8mQSpCL36Lg,https://openreview.net/pdf?id=8mQSpCL36Lg,,"Many self-supervised methods have been proposed with the target of image anomaly detection. These methods often rely on the paradigm of data augmentation with predefined transformations. However, it is not straightforward to apply these techniques to non-image data, such as time series or tabular data. Here we propose a novel data refinement (DR) scheme that relies on neural autoregressive flows (NAF) for self-supervised anomaly detection. Flow-based models allow explicit learning of the probability density and thus can assign accurate likelihoods to normal data, which makes them usable for detecting anomalies. 
The proposed NAF-DR method works by efficiently generating random samples from the latent space and transforming them, along with their likelihoods, into the feature space via an invertible mapping. The samples with lower likelihoods are selected and further checked by outlier detection using Mahalanobis distance. The augmented samples, combined with normal samples, are used to train a better detector to approach decision boundaries. Compared with random transformations, NAF-DR can be interpreted as a likelihood-oriented data augmentation that is more efficient and robust. Extensive experiments show that our approach outperforms existing baselines on multiple tabular and time series datasets, and one real-world application, significantly improving accuracy and robustness over the state-of-the-art baselines. ", Offline Reinforcement Learning from Heteroskedastic Data Via Support Constraints,https://openreview.net/forum?id=Rg1LG7wtd2D,https://openreview.net/pdf?id=Rg1LG7wtd2D,We show that conventional distributional constraint RL algorithms fail with heteroskedastic datasets. We propose an offline RL method to handle such settings.,"Offline reinforcement learning (RL) learns policies entirely from static datasets, thereby avoiding the challenges associated with online data collection. Practical applications of offline RL will inevitably require learning from datasets where the variability of demonstrated behaviors changes non-uniformly across the state space. For example, at a red light, nearly all human drivers behave similarly by stopping, but when merging onto a highway, some drivers merge quickly, efficiently, and safely, while many hesitate or merge dangerously. We show that existing popular offline RL methods based on distribution constraints fail to learn from data with such non-uniform change in the variability of demonstrated behaviors, often due to the requirement to stay close to the behavior policy to the same extent across the state space. We demonstrate this failure mode both theoretically and experimentally. Ideally, the learned policy should be free to choose per-state how closely to follow the behavior policy to maximize long-term return, as long as the learned policy stays within the support of the behavior policy. To instantiate this principle, we reweight the data distribution in conservative Q-learning and show that support constraints emerge when doing so. The reweighted distribution is a mixture of the current policy and an additional policy trained to mine poor actions that are likely under the behavior policy. Our method CQL (ReDS) is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation. ","offline RL, support constraints, heteroskedastic data" Finding Private Bugs: Debugging Implementations of Differentially Private Stochastic Gradient Descent ,https://openreview.net/forum?id=gKKUZ4fTEqh,https://openreview.net/pdf?id=gKKUZ4fTEqh,We proposed an easy method to detect common implementation errors in DP-SGD for practitioners.,"It is important to learn with privacy-preserving algorithms when training data contains sensitive information. Differential privacy (DP) proposes to bound the worst-case privacy leakage of a training algorithm. However, the analytic nature of these algorithmic guarantees makes it difficult to verify that an implementation of a differentially private learner is correct. 
Research in the field focuses on empirically approximating the analytic bound, which only assesses whether or not an implementation provides the claimed guarantee for a particular dataset. It is also typically costly. In this paper, we take a first step towards providing a simple and lightweight methodology for practitioners to identify common implementation mistakes without imposing any changes to their scripts. Our approach stems from measuring distances between models outputted by the training algorithm. We demonstrate that our method successfully identifies specific mistakes made in the implementation of DP-SGD, the de facto algorithm for differentially private deep learning. These mistakes include improper gradient computations or noise miscalibration. Both mistakes invalidate assumptions that are essential to obtaining a rigorous privacy guarantee. ","DP, DP-SGD, debugging, model distance" Robust Generative Flows on Reliable Image Reconstruction without Training Data,https://openreview.net/forum?id=QK1R-vPGsop,https://openreview.net/pdf?id=QK1R-vPGsop,,"A key application of computational imaging is to determine the hidden information from a set of observed but sparse measurements. To fully characterize the uncertainty naturally induced by the sparse measurements, a robust inverse solver that is able to estimate the complete posterior of the unrecoverable targets is therefore important, with a potential to probabilistically interpret the observational data for decision making. In this work, we propose a deep variational framework that leverages a deep generative model to learn an approximate posterior distribution for quantifying image reconstruction uncertainty without training data. This is achieved by parameterizing the target posterior using a flow-based model and minimizing their KL divergence. To perform accurate uncertainty estimation, we propose a robust flow-based model where the stability is enhanced by adding bi-directional regularization and the expressivity is improved by using gradient boosting. We also found that the statistics of the latent distribution are conservatively propagated to the posterior distribution through an invertible transformation and therefore introduce a space-filling design to achieve significant variance reduction on both latent prior space and target posterior space. We demonstrate our method on several benchmark tasks and two real-world applications (fastMRI and black hole image reconstruction) and show that it achieves a reliable and high-quality image reconstruction with robust uncertainty estimation. ", A Computationally Efficient Sparsified Online Newton Method,https://openreview.net/forum?id=cOOQruYU7Bh,https://openreview.net/pdf?id=cOOQruYU7Bh,,"Second-order methods have huge potential in improving the convergence of deep neural network (DNN) training, but are prohibitive due to their large memory and compute requirements. Furthermore, computing the matrix inverse or the Newton direction, which is needed in second-order methods, requires high precision computation for stable training as the preconditioner could have a large condition number. This paper provides a first attempt at developing computationally efficient sparse preconditioners for DNN training which can also tolerate low precision computation. Our new Sparsified Online Newton (SONew) algorithm emerges from the novel use of the so-called LogDet matrix divergence measure; we combine it with sparsity constraints to minimize the regret in the online convex optimization framework. 
Our mathematical analysis allows us to reduce the condition number of our sparse preconditioning matrix, thus improving the stability of training with low precision. We conduct experiments on a feed-forward neural-network autoencoder benchmark, where we compare the training loss of optimizers when run for a fixed number of epochs. In the float32 experiments, our methods outperform the best-performing first-order optimizers and perform comparably to Shampoo, a state-of-the-art second-order optimizer. However, our method is even more effective in low precision, where SONew finishes training considerably faster while performing comparably with Shampoo on training loss.","Optimization, Second order methods" TG-Gen: A Deep Generative Model Framework for Temporal Graphs,https://openreview.net/forum?id=5H1MT1RuWP4,https://openreview.net/pdf?id=5H1MT1RuWP4,"We propose TG-Gen, a generic framework for generating synthetic temporal graph data. ","Graph Neural Networks (GNNs) have recently emerged as popular methods for learning representations of non-Euclidean data often encountered in diverse areas ranging from chemistry and biology to social and financial networks. More recently, research has focused specifically on learning on temporal graphs, wherein the nodes and edges of a graph, and their respective features, may change over time. However, existing work in the temporal graph space has largely focused on discriminative models. In this work, we present TG-Gen, a generic generative framework for temporal graph data, which combines an encoder module that creates temporal embeddings of nodes from raw interaction data, with a decoder module that uses the learned temporal embeddings to create a deep probabilistic model of interaction data. We show that TG-Gen is able to generate robust and accurate synthetic data for temporal graphs on two traditional benchmark datasets and a novel dataset. Additionally, we demonstrate that TG-Gen is able to learn generalizable representations of temporal graphs and outperforms the previous state-of-the-art method in the discriminative regime, such as for dynamic link prediction. Finally, we perform comprehensive ablation studies which show the effects of specific modules and configurations of our model.","graph neural networks, generative models, temporal graphs" Solving Continual Learning via Problem Decomposition,https://openreview.net/forum?id=SnBDX5k-KuJ,https://openreview.net/pdf?id=SnBDX5k-KuJ,,"This paper is concerned with class incremental learning (CIL) in continual learning (CL). CIL is a popular continual learning paradigm in which a system receives a sequence of tasks with different classes in each task and is expected to learn to predict the class of each test instance without being given any task-related information for the instance. Although many techniques have been proposed to solve CIL, it remains highly challenging due to the difficulty of dealing with catastrophic forgetting (CF). This paper starts from first principles and proposes a novel method to solve the problem. The definition of CIL reveals that the problem can be decomposed into two probabilities: the within-task prediction probability and the task-id prediction probability. This paper proposes an effective technique to estimate these two probabilities based on the estimation of feature distributions in the latent space using incremental PCA and the Mahalanobis distance. 
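A minimal sketch of the two-probability decomposition from the continual-learning row above, with Mahalanobis-based task-id scoring; the per-task statistics and the exponential conversion of distances into probabilities are illustrative assumptions:

```python
import numpy as np

def mahalanobis(z, mean, cov_inv):
    d = z - mean
    return float(d @ cov_inv @ d)

def cil_predict(z, task_stats, within_task_probs):
    """task_stats: per-task (mean, inverse covariance) of latent features,
    e.g. maintained with incremental PCA; within_task_probs[t]: array of
    P(class | z, task=t). Returns the class maximizing
    P(class | z) = P(class | z, task) * P(task | z)."""
    # Convert Mahalanobis distances into task-id probabilities.
    dists = np.array([mahalanobis(z, m, ci) for m, ci in task_stats])
    task_probs = np.exp(-dists) / np.exp(-dists).sum()
    scores = [p_within * p_task
              for p_within, p_task in zip(within_task_probs, task_probs)]
    flat = np.concatenate(scores)
    return int(np.argmax(flat))
```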
The proposed method does not require a memory buffer to save replay data, and it outperforms strong baselines, including replay-based methods.","Continual learning, lifelong learning" Long Term Fairness via Performative Distributionally Robust Optimization,https://openreview.net/forum?id=YvrAyFZq0ID,https://openreview.net/pdf?id=YvrAyFZq0ID,We develop a model for long-term fairness by considering a distributionally robust optimization objective in the performative prediction framework.,"Fairness researchers in machine learning (ML) have coalesced around several fairness criteria which provide formal definitions of what it means for an ML model to be fair. However, these criteria have some serious limitations. We identify four key shortcomings of these formal fairness criteria and address them by extending performative prediction to include a distributionally robust objective. Performative prediction is a recent framework developed to understand the effects that arise when the deployment of a model influences the distribution on which it makes predictions. We prove a convergence result for our proposed repeated distributionally robust optimization (RDRO). We further verify our results empirically and develop experiments to demonstrate the impact of using RDRO on learning fair ML models.","fairness, distributionally robust optimization, performative prediction, distribution shift" The In-Sample Softmax for Offline Reinforcement Learning,https://openreview.net/forum?id=u-RuvyDYqCM,https://openreview.net/pdf?id=u-RuvyDYqCM,A novel Bellman operator that avoids bootstrapping on out-of-sample actions. ,"Reinforcement learning (RL) agents can leverage batches of previously collected data to extract a reasonable control policy. An emerging issue in this offline RL setting, however, is that the bootstrapping update underlying many of our methods suffers from insufficient action coverage: the standard max operator may select a maximal action that has not been seen in the dataset. Bootstrapping from these inaccurate values can lead to overestimation and even divergence. There are a growing number of methods that attempt to approximate an in-sample max that only uses actions well covered by the dataset. We highlight a simple fact: it is more straightforward to approximate an in-sample softmax using only actions in the dataset. We show that policy iteration based on the in-sample softmax converges, and that for decreasing temperatures it approaches the in-sample max. We derive an In-Sample Actor-Critic (AC), using this in-sample softmax, and show that it is consistently better or comparable to existing offline RL methods, and is also well-suited to fine-tuning. ",Offline Reinforcement Learning LUNA: Language as Continuing Anchors for Referring Expression Comprehension,https://openreview.net/forum?id=VoplHXsPKLE,https://openreview.net/pdf?id=VoplHXsPKLE,,"Referring expression comprehension aims to localize, in an image, the object described by a natural language expression. Using location priors to remedy inaccuracies in cross-modal alignments is the state of the art for CNN-based methods tackling this problem. Recent Transformer-based models cast aside this idea, making the case for steering away from hand-designed components. In this work, we propose LUNA, which uses language as continuing anchors to guide box prediction in a Transformer decoder, and show that language-guided location priors can be effectively exploited in a Transformer-based architecture. 
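A minimal tabular sketch of the in-sample softmax backup from the offline-RL row above; the uniform weighting over observed actions is a simplifying assumption (the paper's estimator is more general), but the limiting behavior is as described: as the temperature shrinks, the backup approaches the in-sample max.

```python
import numpy as np

def in_sample_softmax_backup(q_row, seen_actions, tau=0.1):
    """Softmax backup restricted to actions actually observed in the dataset
    at this state, avoiding bootstrapping on out-of-sample actions.
    q_row: [num_actions] Q(s, .); seen_actions: indices seen in the data."""
    q_seen = q_row[list(seen_actions)]
    # Numerically stable log-mean-exp; tau -> 0 recovers max(q_seen).
    m = q_seen.max()
    return m + tau * np.log(np.mean(np.exp((q_seen - m) / tau)))
```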
Specifically, we first initialize an anchor box from the input expression via a small “proto-decoder”, and then use this anchor as a location prior in a modified Transformer decoder for predicting the bounding box. In each decoder layer, the anchor box is first used as a query for pooling multi-modal context and then updated based on the pooled context. This approach allows the decoder to focus selectively on one part of the scene at a time, which reduces noise in the multi-modal context and leads to more accurate box predictions. Our method outperforms existing state-of-the-art methods on the challenging datasets of ReferIt Game, RefCOCO/+/g, and Flickr30K Entities.", Bias Propagation in Federated Learning,https://openreview.net/forum?id=V7CYzdruWdm,https://openreview.net/pdf?id=V7CYzdruWdm,,"We show that participating in federated learning can be detrimental to group fairness. In fact, the bias of a few biased parties against under-represented groups (identified by sensitive attributes such as gender or race) propagates through the network to all parties. On naturally partitioned real-world datasets, we analyze and explain bias propagation in federated learning. Our analysis reveals that biased parties unintentionally yet stealthily encode their bias in a small number of model parameters, and throughout the training, they steadily increase the dependence of the global model on sensitive attributes. Importantly, the bias experienced in federated learning is higher than what parties would otherwise encounter in centralized training with a model trained on the union of all their data, indicating that the bias is due to the algorithm itself. Our work calls for auditing group fairness in federated learning, and designing learning algorithms that are robust to bias propagation. ","Fairness, Algorithmic Bias, Federated Learning" A Study of Causal Confusion in Preference-Based Reward Learning,https://openreview.net/forum?id=R0Xxvr_X3ZA,https://openreview.net/pdf?id=R0Xxvr_X3ZA,We identify and analyze important factors that influence causal confusion when learning rewards from human preference labels.,"Learning policies via preference-based reward learning is an increasingly popular method for customizing agent behavior, but has been shown anecdotally to be prone to spurious correlations and reward hacking behaviors. While much prior work focuses on causal confusion in reinforcement learning and behavioral cloning, we aim to study it in the context of reward learning. To study causal confusion, we perform a series of sensitivity and ablation analyses on three benchmark domains where rewards learned from preferences achieve minimal test error but fail to generalize to out-of-distribution states---resulting in poor policy performance when optimized. We find that the presence of non-causal distractor features, noise in the stated preferences, partial state observability, and larger model capacity can all exacerbate causal confusion. We also identify a set of methods with which to interpret causally confused learned rewards: we observe that optimizing causally confused rewards drives the policy off the reward's training distribution, resulting in high predicted (learned) rewards but low true rewards. These findings illuminate the susceptibility of reward learning to causal confusion, especially in high-dimensional environments---failure to consider even one of many factors (data coverage, state definition, etc.) 
can quickly result in unexpected, undesirable behavior.","reward learning, robustness, preference-based learning" UniKGQA: Unified Retrieval and Reasoning for Solving Multi-hop Question Answering Over Knowledge Graph,https://openreview.net/forum?id=Z63RvyAZ2Vh,https://openreview.net/pdf?id=Z63RvyAZ2Vh,,"Multi-hop Question Answering over Knowledge Graph (KGQA) aims to find the answer entities that are multiple hops away from the topic entities in a natural language question on a large-scale Knowledge Graph (KG). To cope with the vast search space, existing work usually adopts a two-stage approach: it first retrieves a relatively small subgraph related to the question and then performs reasoning on the subgraph to accurately find the answer entities. Although these two stages are highly related, previous work employs very different technical solutions for developing the retrieval and reasoning models, neglecting how closely the two stages are related in task essence. In this paper, we propose UniKGQA, a novel approach for the multi-hop KGQA task, which unifies retrieval and reasoning in both model architecture and parameter learning. For model architecture, UniKGQA consists of a semantic matching module based on a PLM for question-relation semantic matching, and a matching information propagation module to propagate the matching information along the edges on KGs. For parameter learning, we design a shared pre-training task based on question-relation matching for both retrieval and reasoning models, and then propose retrieval- and reasoning-oriented fine-tuning strategies. Compared with previous studies, our approach is more unified, tightly relating the retrieval and reasoning stages. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our method on the multi-hop KGQA task. ", Comparing Human and Machine Bias in Face Recognition,https://openreview.net/forum?id=wtQxtWC9bra,https://openreview.net/pdf?id=wtQxtWC9bra,,"Much recent research has uncovered and discussed serious concerns of bias in facial analysis technologies, finding performance disparities between groups of people based on perceived gender, skin type, lighting condition, etc. These audits are immensely important and successful at measuring algorithmic bias but have two major challenges: the audits (1) use facial recognition datasets which lack quality metadata, like LFW and CelebA, and (2) do not compare their observed algorithmic bias to the biases of their human alternatives. In this paper, we release improvements to the LFW and CelebA datasets which will enable future researchers to obtain measurements of algorithmic bias that are not tainted by major flaws in the dataset (e.g. identical images appearing in both the gallery and test set). We also use these new data to develop a series of challenging facial identification and verification questions that we administered to various algorithms and a large, balanced sample of human reviewers. We find that both computer models and human survey participants perform significantly better at the verification task, generally obtain lower accuracy rates on dark-skinned or female subjects for both tasks, and obtain higher accuracy rates when their demographics match those of the question. 
Academic models exhibit comparable levels of gender bias to humans, but are significantly more biased against darker skin types than humans.", Sufficient Subgraph Embedding Memory for Continual Graph Representation Learning,https://openreview.net/forum?id=SJjvXfape5U,https://openreview.net/pdf?id=SJjvXfape5U,,"Memory replay, which constructs a buffer to store representative samples and retrains the model over the buffer to maintain its performance on existing tasks, has shown great success for continual learning with Euclidean data. Directly applying it to graph data, however, can lead to the memory explosion problem due to the necessity of considering explicit topological connections of representative nodes. To this end, we present Parameter Decoupled Graph Neural Networks (PDGNNs) with Sufficient Subgraph Embedding Memory (SSEM) to fully utilize the explicit topological information for memory replay and reduce the memory space complexity from $\mathcal{O}(nd^L)$ to $\mathcal{O}(n)$, where $n$ is the memory buffer size, $d$ is the average node degree, and $L$ is the range of neighborhood aggregation. Specifically, PDGNNs decouple trainable parameters from the computation subgraphs via $\textit{Sufficient Subgraph Embeddings}$ (SSEs), which compress subgraphs into vectors ($\textit{i.e.}$, SSEs) to reduce the memory consumption. In addition, we discover a $\textit{pseudo-training effect}$ in memory-based continual graph learning, which does not exist in continual learning on Euclidean data without topological connections ($\textit{e.g.}$, individual images). Based on this discovery, we develop a novel $\textit{coverage maximization sampling}$ strategy to enhance performance when the memory budget is tight. Thorough empirical studies demonstrate that PDGNNs with SSEM outperform state-of-the-art techniques in both class-incremental and task-incremental settings. ","Graph, Class-incremental learning, continual learning, network" One cannot stand for everyone! Leveraging Multiple User Simulators to train Task-oriented Dialogue Systems,https://openreview.net/forum?id=Y2E5-_HL0DV,https://openreview.net/pdf?id=Y2E5-_HL0DV,"We propose to leverage multiple user simulators simultaneously to optimize ToD systems, leading to a framework called MUST.","User simulators are agents designed to imitate human users; recent advances have found that Task-oriented Dialogue (ToD) systems optimized toward a user simulator could better satisfy the needs of human users. However, this might result in a sub-optimal ToD system if it is tailored to only one \textit{ad hoc} user simulator, since human users can behave differently. In this paper, we propose a framework called MUST to optimize ToD systems via leveraging \textbf{m}ultiple \textbf{u}ser \textbf{s}imula\textbf{t}ors. The main challenges of MUST lie in 1) how to adaptively specify which user simulator should interact with the ToD system at each optimization step, since the ToD system might be over-fitted to some specific user simulators and simultaneously under-fitted to others; 2) how to avoid catastrophic forgetting of the adaptation for a simulator that is not selected for several consecutive optimization steps. 
To tackle these challenges, we formulate MUST as a multi-armed bandit (MAB) problem and provide a method called MUST$_{\mathrm{adaptive}}$ that balances \textit{i}) the \textit{boosting adaptation} for adaptive interactions between different user simulators and the ToD system and \textit{ii}) the \textit{uniform adaptation} to avoid the catastrophic forgetting issue. With both automatic evaluations and human evaluations, our extensive experimental results on the restaurant search task from MultiWOZ show that the dialogue system trained by our proposed MUST achieves a better performance than those trained by any single user simulator. It also has a better generalization ability when testing with unseen user simulators. Moreover, our visualization analysis shows that MUST$_{\mathrm{adaptive}}$ is indeed more efficient and effective at leveraging multiple user simulators.","User simulators, Task-oriented Dialogue Systems." Towards Out-of-Distribution Adversarial Robustness,https://openreview.net/forum?id=XYTwCOoKkLY,https://openreview.net/pdf?id=XYTwCOoKkLY,"We use the out-of-distribution generalisation approach of Risk Extrapolation (REx) to obtain superior robustness against multiple adversarial attacks, which generalises to attacks not seen during training.","Adversarial robustness continues to be a major challenge for deep learning. A core issue is that robustness to one type of attack often fails to transfer to other attacks. While prior work establishes a theoretical trade-off in robustness against different $L_p$ norms, we show that there is potential for improvement against many commonly used attacks by adopting a domain generalisation approach. Concretely, we treat each type of attack as a domain, and apply the Risk Extrapolation method (REx), which promotes similar levels of robustness against all training attacks. Compared to existing methods, we obtain similar or superior worst-case adversarial robustness on attacks seen during training. Moreover, we achieve superior performance on families or tunings of attacks only encountered at test time. On ensembles of attacks, our approach improves the accuracy from 3.4\% for the best existing baseline to 25.9\% on MNIST, and from 16.9\% to 23.5\% on CIFAR10.","adversarial, robustness, REx, OOD" Is the Performance of My Deep Network Too Good to Be True? A Direct Approach to Estimating the Bayes Error in Binary Classification,https://openreview.net/forum?id=FZdJQgy05rz,https://openreview.net/pdf?id=FZdJQgy05rz,A simple and direct Bayes error estimator that just takes the mean of the labels that show uncertainty of the classes.,"There is a fundamental limitation in the prediction performance that a machine learning model can achieve due to the inevitable uncertainty of the prediction target. In classification problems, this can be characterized by the Bayes error, which is the best achievable error with any classifier. The Bayes error can be used as a criterion to evaluate classifiers with state-of-the-art performance and to detect test set overfitting. We propose a simple and direct Bayes error estimator, where we just take the mean of the labels that show \emph{uncertainty} of the classes. Our flexible approach enables us to perform Bayes error estimation even for weakly supervised data. In contrast to others, our method is model-free and even instance-free. Moreover, it has no hyperparameters and empirically gives a more accurate estimate of the Bayes error than several baselines. 
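A minimal sketch of the direct estimator from the Bayes-error row above for the binary case: with soft labels p_i = P(y=1 | x_i) (e.g., annotator vote fractions as in CIFAR-10H), the best achievable error is E[min(p, 1-p)], estimated by a simple mean. The example values below are hypothetical.

```python
import numpy as np

def bayes_error_estimate(soft_labels):
    """Model-free, hyperparameter-free plug-in estimate of the Bayes error:
    average the 'uncertain part' min(p, 1-p) of each soft label."""
    p = np.asarray(soft_labels, dtype=float)
    return float(np.mean(np.minimum(p, 1.0 - p)))

# Usage with hypothetical vote fractions from a 10-annotator labeling:
print(bayes_error_estimate([0.0, 0.1, 0.5, 0.9, 1.0]))  # -> 0.14
```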
Experiments using our method suggest that a recently proposed classifier, the Vision Transformer, may have reached (or is about to reach) the Bayes error for the CIFAR-10H dataset.","Bayes error, best achievable error, irreducible error" Learning Deep Operator Networks: The Benefits of Over-Parameterization,https://openreview.net/forum?id=ZMO7nETTWg9,https://openreview.net/pdf?id=ZMO7nETTWg9,We show that stochastic gradient descent converges to the global minimum for a DeepONet model.,"Neural Operators that directly learn mappings between function spaces have received considerable recent attention. Deep Operator Networks (DeepONets), a popular recent class of neural operators, have shown promising preliminary results in approximating solution operators of parametric differential equations. Despite the universal approximation guarantees, there is yet no optimization convergence guarantee for DeepONets based on gradient descent (GD). In this paper, we establish such guarantees and show that over-parameterization based on wide layers provably helps. In particular, we present two types of optimization convergence analysis: first, for smooth activations, we bound the spectral norm of the Hessian of DeepONets and use the bound to show geometric convergence of GD based on restricted strong convexity (RSC); and second, for ReLU activations, we show the neural tangent kernel (NTK) of DeepONets at initialization is positive definite, which can be used with the standard NTK analysis to imply geometric convergence. Further, we present empirical results on three canonical operator learning problems: the antiderivative, the diffusion-reaction equation, and Burgers' equation, and show that wider DeepONets lead to lower training loss on all the problems, thereby supporting the theoretical results.","Deep Operator Networks, Optimization, Neural Tangent Kernel" How Useful are Gradients for OOD Detection Really?,https://openreview.net/forum?id=s0ceCGfcIKb,https://openreview.net/pdf?id=s0ceCGfcIKb,,"One critical challenge in deploying machine learning models in real-life applications is out-of-distribution (OOD) detection. Given a predictive model which is accurate on in-distribution (ID) data, an OOD detection system can further equip the model with the option to defer prediction when the input is novel and the model has low confidence. Notably, there has been some recent interest in utilizing gradient information in pretrained models for OOD detection. While these methods are competitive, we argue that previous works conflate their performance with the necessity of gradients. In this work, we provide an in-depth analysis of gradient-based methods and elucidate the key components that warrant their OOD detection performance. We further demonstrate that a general, non-gradient-based family of OOD detection methods is just as competitive, casting doubt on the usefulness of gradients for OOD detection.", Many-Body Approximation for Tensors,https://openreview.net/forum?id=vl9TIwbQ_jg,https://openreview.net/pdf?id=vl9TIwbQ_jg,We formulate rank-free tensor decomposition focusing on interactions between tensor modes. We also illustrate the relationship between our model and existing low-rank approximation models using tensor networks.,"We propose a nonnegative tensor decomposition focusing on the relationship between the modes of tensors. Traditional decomposition methods assume low-rankness in the representation, resulting in difficulties in global optimization and target rank selection. 
To address these problems, we present an alternative way to decompose tensors, a many-body approximation for tensors, based on an information-geometric formulation. A tensor is treated via an energy-based model, where the tensor and its modes correspond to a probability distribution and random variables, respectively, and the many-body approximation is performed by taking the interactions between variables into account. Our model can be globally optimized in polynomial time in terms of KL divergence minimization, and it is empirically faster than low-rank approximations while keeping comparable reconstruction error. Furthermore, we visualize interactions between modes as tensor networks and reveal a nontrivial relationship between many-body approximation and low-rank approximation.","Tensor decomposition, Energy based model, Tensor networks" Faster Last-iterate Convergence of Policy Optimization in Zero-Sum Markov Games,https://openreview.net/forum?id=bRwBpKrNzF7,https://openreview.net/pdf?id=bRwBpKrNzF7,"We achieve better last-iterate convergence result of policy optimization for two-player zero-sum Markov games, with single-loop and symmetric update rules.","Multi-Agent Reinforcement Learning (MARL)---where multiple agents learn to interact in a shared dynamic environment---permeates across a wide range of critical applications. While there has been substantial progress on understanding the global convergence of policy optimization methods in single-agent RL, the design and analysis of efficient policy optimization algorithms in the MARL setting present significant challenges and new desiderata, which, unfortunately, remain inadequately addressed by existing theory. In this paper, we focus on the most basic setting of competitive multi-agent RL, namely two-player zero-sum Markov games, and study equilibrium finding algorithms in both the infinite-horizon discounted setting and the finite-horizon episodic setting. We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method and the value is updated on a slower timescale. We show that, in the full-information tabular setting, the proposed method achieves a finite-time last-iterate linear convergence to the quantal response equilibrium of the regularized problem, which translates to a sublinear convergence to the Nash equilibrium by controlling the amount of regularization. Our convergence results improve upon the best known iteration complexities, and lead to a better understanding of policy optimization in competitive Markov games.","zero-sum Markov game, entropy regularization, policy optimization, global convergence, multiplicative updates" Memorization Capacity of Neural Networks with Conditional Computation,https://openreview.net/forum?id=rB3zRN0lBYr,https://openreview.net/pdf?id=rB3zRN0lBYr,"In classical ""unconditional"" ReLU nets, one needs $O(\sqrt{n})$ arithmetic operations to recall any one of $n$ stored patterns. Conditional computation reduces this to $O(\log n)$, and this is the best possible. ","Many empirical studies have demonstrated the performance benefits of conditional computation in neural networks, including reduced inference time and power consumption. We study the fundamental limits of neural conditional computation from the perspective of memorization capacity. 
For Rectified Linear Unit (ReLU) networks without conditional computation, it is known that memorizing a collection of $n$ input-output relationships can be accomplished via a neural network with $O(\sqrt{n})$ neurons. Calculating the output of this neural network can be accomplished using $O(\sqrt{n})$ elementary arithmetic operations (additions, multiplications, and comparisons) for each input. Using a conditional ReLU network, we show that the same task can be accomplished using only $O(\log n)$ operations per input. This represents an almost exponential improvement compared to networks without conditional computation. We also show that the $O(\log n)$ rate is the best possible. ","Memorization capacity, conditional computation." On the Power of Pre-training for Generalization in RL: Provable Benefits and Hardness,https://openreview.net/forum?id=2G-vUJ7XcSB,https://openreview.net/pdf?id=2G-vUJ7XcSB,A theoretical study of how much pre-training in reinforcement learning can help improve performance in the target environment.,"Generalization in Reinforcement Learning (RL) aims to train, over a set of training environments, an agent that generalizes to the target environment. In this work, we first point out that RL generalization is fundamentally different from generalization in supervised learning, and fine-tuning on the target environment is necessary for good test performance. Therefore, we seek to answer the following question: how much can we expect pre-training over training environments to be helpful for efficient and effective fine-tuning? On one hand, we give a surprising result showing that asymptotically, the improvement from pre-training is at most a constant factor. On the other hand, we show that pre-training can indeed be helpful in the non-asymptotic regime by designing a policy collection-elimination (PCE) algorithm and proving a distribution-dependent regret bound that is independent of the state-action space. We hope our theoretical results can provide insight towards understanding pre-training and generalization in RL.","Reinforcement Learning, Generalization, Learning Theory" "A Fast, Well-Founded Approximation to the Empirical Neural Tangent Kernel",https://openreview.net/forum?id=HN0ehX-ov5Q,https://openreview.net/pdf?id=HN0ehX-ov5Q,A fast and provable approximation to the empirical Neural Tangent Kernel,"Empirical neural tangent kernels (eNTKs) can provide a good understanding of a given network's representation: they are often far less expensive to compute and applicable more broadly than infinite-width NTKs. For networks with $O$ output units (e.g. an $O$-class classifier), however, the eNTK on $N$ inputs is of size $NO \times NO$, taking $\mathcal{O}\big( (N O)^2\big)$ memory and up to $\mathcal{O}\big( (N O)^3 \big)$ computation. Most existing applications have therefore used one of a handful of approximations yielding $N \times N$ kernel matrices, saving orders of magnitude of computation, but with limited to no justification. We prove that one such approximation, which we call ``sum of logits,'' converges to the true eNTK at initialization. Our experiments demonstrate the quality of this approximation for various uses across a range of settings.","neural tangent kernels, deep learning theory" Boosting Drug-Target Affinity Prediction from Nearest Neighbors,https://openreview.net/forum?id=4K2SRejNGEI,https://openreview.net/pdf?id=4K2SRejNGEI,,"Precisely predicting Drug-Target binding Affinity (DTA) is essential for drug discovery. 
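A minimal sketch of the ``sum of logits'' approximation from the empirical-NTK row above: instead of the full $NO \times NO$ eNTK, compute the $N \times N$ kernel of gradients of the summed logits. Written for small $N$ and plain autograd; batching and memory optimizations are left out.

```python
import torch

def sum_of_logits_entk(model, xs):
    """K[i, j] = <grad_theta sum_o f_o(x_i), grad_theta sum_o f_o(x_j)>.
    xs: tensor of N inputs; returns an [N, N] kernel matrix."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = []
    for x in xs:
        out = model(x.unsqueeze(0)).sum()  # sum over the O output logits
        g = torch.autograd.grad(out, params)
        grads.append(torch.cat([gi.flatten() for gi in g]))
    G = torch.stack(grads)   # [N, num_params]
    return G @ G.T           # [N, N]
```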
Recently, deep learning methods have become popular for DTA prediction. However, the prediction accuracy is still far from satisfactory. In this work, inspired by the recent success of retrieval methods, we propose $k$NN-DTA, a non-parametric embedding-based retrieval method built on a pre-trained DTA prediction model, which can extend the power of the neural DTA model at no or negligible cost. Compared to traditional chemical similarity retrieval, our embedding-based retrieval is far more efficient. Different from existing methods, we introduce two neighbor aggregation schemes, over the embedding space and the label space, integrated into a unified framework. Specifically, we propose a \emph{label aggregation} with \emph{pair-wise retrieval} and a \emph{representation aggregation} with \emph{point-wise retrieval} of the nearest neighbors. This method executes at inference time and can efficiently boost DTA prediction performance with no training cost. In addition, we propose an extension, Ada-$k$NN-DTA, an instance-wise and adaptive aggregation with lightweight learning. Results on four benchmark datasets show that $k$NN-DTA brings significant improvements, outperforming previous state-of-the-art (SOTA) results, e.g., on the BindingDB IC$_{50}$ and $K_i$ testbeds, $k$NN-DTA obtains new record RMSE scores of $\bf{0.687}$ and $\bf{0.748}$, both a $\bf{4}$-point improvement. The extended Ada-$k$NN-DTA can further improve the performance, e.g., another $\bf{1}$-point gain on BindingDB. These results strongly demonstrate the effectiveness and efficiency of our method. Results in other settings and comprehensive studies/analyses also show the great potential of our $k$NN-DTA approach.", Weighted Clock Logic Point Process,https://openreview.net/forum?id=YfUICnZMwk7,https://openreview.net/pdf?id=YfUICnZMwk7,A novel neuro-symbolic framework for modeling temporal point processes with interpretability and high computation efficiency.,"Datasets involving multivariate event streams are prevalent in numerous applications. We present a novel framework for modeling temporal point processes called clock logic neural networks (CLNN), which learn weighted clock logic (wCL) formulas as interpretable temporal rules by which some events promote or inhibit other events. Specifically, CLNN models temporal relations between events using conditional intensity rates informed by a set of wCL formulas, which are more expressive than related prior work. Unlike conventional approaches of searching for generative rules through expensive combinatorial optimization, we design smooth activation functions for components of wCL formulas that enable a continuous relaxation of the discrete search space and efficient learning of wCL formulas using gradient-based methods. Experiments on synthetic datasets demonstrate our model's ability to recover the ground-truth rules and improve computational efficiency. In addition, experiments on real-world datasets show that our models perform competitively when compared with state-of-the-art models. 
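A minimal sketch of the label-aggregation branch of the $k$NN-DTA row above, assuming precomputed pair embeddings; the softmax-over-distance weighting and temperature are illustrative assumptions, not the paper's exact aggregation:

```python
import numpy as np

def knn_label_aggregation(query_emb, bank_embs, bank_labels, k=8, temp=10.0):
    """Pair-wise retrieval: find the k nearest drug-target pairs in embedding
    space and average their affinity labels with distance-based weights.
    bank_embs: [M, D]; bank_labels: [M] affinities from the training set."""
    dists = np.linalg.norm(bank_embs - query_emb, axis=1)
    idx = np.argsort(dists)[:k]
    weights = np.exp(-dists[idx] / temp)
    weights /= weights.sum()
    return float(weights @ bank_labels[idx])
```

At inference time this retrieved estimate would be interpolated with the pre-trained model's own prediction, which is what makes the boost training-free.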
","Multivariate event data, Neuro-symbolic models, Temporal point process, Propositional logic" Simple Emergent Action Representations from Multi-Task Policy Training,https://openreview.net/forum?id=NUl0ylt7SM,https://openreview.net/pdf?id=NUl0ylt7SM,We discover emergent action representations from multi-task training and further use them to perform task generalization.,"Low-level sensory and motor signals in the high-dimensional spaces~(e.g., image observations or motor torques) in deep reinforcement learning are complicated to understand or harness for downstream tasks directly. While sensory representations have been widely studied, the representations of actions that form motor skills are yet under exploration. In this work, we find that when a multi-task policy network takes as input states and task embeddings, a space based on the task embeddings emerges to contain meaningful action representations with moderate constraints.Within this space, interpolated or composed embeddings can serve as a high-level interface to instruct the agent to perform meaningful action sequences. Empirical results not only show that the proposed action representations have efficacy for intra-action interpolation and inter-action composition with limited or no learning, but also demonstrate their superior ability in task adaptation to strong baselines in Mujoco locomotion tasks. The evidence elucidates that learning the action representations is a promising direction toward efficient, adaptable, and composable RL, forming the basis of abstract action planning and the understanding of motor signal space. Anonymous project page: https://sites.google.com/view/emergent-action-representation/","action representation, reinforcement learning, representation learning" Iterative Task-adaptive Pretraining for Unsupervised Word Alignment,https://openreview.net/forum?id=Yp_dRGS-TlC,https://openreview.net/pdf?id=Yp_dRGS-TlC,,"How to establish a closer relationship between pre-training and downstream task is a valuable question. We argue that task-adaptive pretraining should not just performed before task. For word alignment task, we propose an iterative self-supervised task-adaptive pretraining paradigm, tying together word alignment and self-supervised pretraining by code-switching data augmentation. When we get the aligned pairs predicted by the multilingual contextualized word embeddings, we employ these pairs and origin parallel sentences to synthesize code-switched sentences. Then multilingual models will be continuously finetuned on the augmented code-switched dataset. Finally, finetuned models will be used to produce new aligned pairs. This process will be executed iteratively. Our paradigm is suitable for almost all unsupervised word alignment methods based on multilingual pre-trained LMs and doesn't need gold labeled data, extra parallel data or any other external resources. Experimental results on six language pairs demonstrate that our paradigm can consistently improve baseline method. Compared to resource-rich languages, the improvements on relatively low-resource or different morphological languages are more significant. For example, the AER scores of three different alignment methods based on XLM-R are reduced by about $4 \sim 5$ percentage points on language pair En-Hi. 
", Open-Set 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning,https://openreview.net/forum?id=1yclzf1DWsf,https://openreview.net/pdf?id=1yclzf1DWsf,We propose an open-set 3D detection method that detects unseen categories without corresponding 3D labels,"Current point-cloud detection methods have difficulty detecting the open-set objects in the real world, due to their limited generalization capability. Moreover, it is extremely laborious and expensive to collect and fully annotate a point-cloud detection dataset with numerous classes of objects, leading to the limited classes of existing point-cloud datasets and hindering the model to learn general representations to achieve open-set point-cloud detection. Instead of seeking a point-cloud dataset with full labels, we resort to ImageNet1K to broaden the vocabulary of the point-cloud detector. We propose OS-3DETIC, an Open-Set 3D DETector using Image-level Class supervision. Specifically, we take advantage of two modalities, the image modality for recognition and the point-cloud modality for localization, to generate pseudo labels for unseen classes. Then we propose a novel debiased cross-modal cross-task contrastive learning method to transfer the knowledge from image modality to point-cloud modality during training. Without hurting the latency during inference, OS-3DETIC makes the well-known point-cloud detector capable of achieving open-set detection. Extensive experiments demonstrate that the proposed OS-3DETIC achieves at least 10.77 % mAP improvement (absolute value) and 9.56 % mAP improvement (absolute value) by a wide range of baselines on the SUN-RGBD dataset and ScanNet dataset, respectively. Besides, we conduct sufficient experiments to shed light on why the proposed OS-3DETIC works.","open vocabulary, 3d detection, contrastive learning" Tight Non-asymptotic Inference via Sub-Gaussian Intrinsic Moment Norm,https://openreview.net/forum?id=c9QTkDGJ_cB,https://openreview.net/pdf?id=c9QTkDGJ_cB,Tight Non-asymptotic Inference,"In non-asymptotic statistical inferences, variance-type parameters of sub-Gaussian distributions play a crucial role. However, direct estimation of these parameters based on the empirical moment generating function (MGF) is infeasible. To this end, we recommend using a sub-Gaussian intrinsic moment norm [Buldygin and Kozachenko (2000), Theorem 1.3] through maximizing a series of normalized moments. Importantly, the recommended norm can not only recover the exponential moment bounds for the corresponding MGFs, but also lead to tighter Hoeffiding's sub-Gaussian concentration inequalities. In practice, intrinsic moment norm can be robustly and consistently estimated via a simple plug-in approach. Our theoretical results are applied to non-asymptotic analysis, including the multi-armed bandit.","non-asymptotic inference, uncertainty quantification, concentration inequality, multi-armed bandit" Interaction-Based Disentanglement of Entities for Object-Centric World Models,https://openreview.net/forum?id=JQc2VowqCzz,https://openreview.net/pdf?id=JQc2VowqCzz,"We present a structured, action-conditioned probabilistic model that learns to disentangle object representations based on interactions and demonstrate its ability to solve downstream tasks.","Perceiving the world compositionally in terms of space and time is essential to understanding object dynamics and solving downstream tasks. 
Object-centric learning using generative models has improved its ability to learn distinct representations of individual objects and predict their interactions, and a focal question is how to utilize the learned representations to solve untrained downstream tasks. However, as models struggle to predict object interactions and track objects accurately, especially in unseen configurations, using object-centric representations in downstream tasks remains a challenge. This paper proposes STEDIE, a new model that disentangles object representations, based on interactions, into interaction-relevant relational features and interaction-irrelevant global features, without supervision. Empirical evaluation shows that the proposed model factorizes global features unaffected by interactions from relational features that are necessary to predict the outcome of interactions. We also show that STEDIE, by excluding features irrelevant in predicting interactions, achieves better performance in planning tasks and understanding causal relationships. In both tasks, our model not only achieves better performance in terms of reconstruction ability but also utilizes the disentangled representations to solve the tasks in a structured manner.","object-centric, object-oriented, world models, self-supervised learning, probabilistic deep learning, structured models, video prediction, physics prediction, planning, variational autoencoders, model-based reinforcement learning, VAEs, unsupervised" CodeT5Mix: A Pretrained Mixture of Encoder-decoder Transformers for Code Understanding and Generation,https://openreview.net/forum?id=VPCi3STZcaO,https://openreview.net/pdf?id=VPCi3STZcaO,"We propose a new pretrained mixture of encoder-decoder Transformers for code and achieve new SoTA results on a wide range of code understanding tasks, like code retrieval, and generation tasks, such as code synthesis and math programming.","Pretrained language models (LMs) trained on vast amounts of source code have achieved prominent progress in a wide range of code intelligence tasks. Despite their success, they either adopt specific types of network architectures (encoder-only or decoder-only) for different downstream tasks or rely on a single architecture (encoder-decoder or UniLM-style encoder) for all tasks. The latter approach usually results in sub-optimal performance on a subset of tasks. To address these limitations, we propose “CodeT5Mix”, a mixture of encoder-decoder Transformers for code whose components can be flexibly combined based on the target tasks during finetuning, while still enjoying the mutual benefits from the joint pretraining. To endow the model with both code understanding and generation capabilities, we pretrain CodeT5Mix using a mixture of denoising, contrastive learning, matching, and Causal Language Modeling (CLM) tasks on large-scale multilingual code corpora in nine programming languages. Additionally, we design a weight-sharing strategy in the decoders, except for the feedforward layers, which act as task-specific experts to reduce interference across tasks of various types. We extensively evaluate CodeT5Mix on seven tasks in four different modes and achieve state-of-the-art (SoTA) performance on most tasks, such as text-to-code retrieval, code completion and generation, and math programming. Particularly, we demonstrate that CodeT5Mix can be used as a unified semi-parametric retrieval-augmented generator with SoTA code generation performance. 
","Language model pretraining, multimodal learning, code understanding and generation" Neural Image-based Avatars: Generalizable Radiance Fields for Human Avatar Modeling,https://openreview.net/forum?id=-ng-FXFlzgK,https://openreview.net/pdf?id=-ng-FXFlzgK,,"We present a method that enables synthesizing novel views and novel poses of arbitrary human performers from sparse multi-view images. A key ingredient of our method is a hybrid appearance blending module that combines the advantages of the implicit body NeRF representation and image-based rendering. Existing generalizable human NeRF methods that are conditioned on the body model have shown robustness against the geometric variation of arbitrary human performers. Yet they often exhibit blurry results when generalized onto unseen identities. Meanwhile, image-based rendering shows high-quality results when sufficient observations are available, whereas it suffers artifacts in sparse-view settings. We propose Neural Image-based Avatars (NIA) that exploits the best of those two methods: to maintain robustness under new articulations and self-occlusions while directly leveraging the available (sparse) source view colors to preserve appearance details of new subject identities. Our hybrid design outperforms recent methods on both in-domain identity generalization as well as challenging cross-dataset generalization settings. Also, in terms of the pose generalization, our method outperforms even the per-subject optimized animatable NeRF methods.","Generalizable human radiance fields, Human performance capture, Human NeRF, Neural radiance fields" Federated Neural Bandits,https://openreview.net/forum?id=38m4h8HcNRL,https://openreview.net/pdf?id=38m4h8HcNRL,"We introduce federated neural-UCB, which uses a weighted combination of two UCBs that respectively, (a) accelerates exploration using observations from other agents and (b) improves reward prediction using a neural network with aggregated parameters.","Recent works on neural contextual bandits have achieved compelling performances due to their ability to leverage the strong representation power of neural networks (NNs) for reward prediction. Many applications of contextual bandits involve multiple agents who collaborate without sharing raw observations, thus giving rise to the setting of federated contextual bandits}. Existing works on federated contextual bandits rely on linear or kernelized bandits, which may fall short when modeling complex real-world reward functions. So, this paper introduces the federated neural-upper confidence bound (FN-UCB) algorithm. To better exploit the federated setting, FN-UCB adopts a weighted combination of two UCBs: $\text{UCB}^{a}$ allows every agent to additionally use the observations from the other agents to accelerate exploration (without sharing raw observations), while $\text{UCB}^{b}$ uses an NN with aggregated parameters for reward prediction in a similar way to federated averaging for supervised learning. Notably, the weight between the two UCBs required by our theoretical analysis is amenable to an interesting interpretation, which emphasizes $\text{UCB}^{a}$ initially for accelerated exploration and relies more on $\text{UCB}^{b}$ later after enough observations have been collected to train the NNs for accurate reward prediction (i.e., reliable exploitation). 
We prove sub-linear upper bounds on both the cumulative regret and the number of communication rounds of FN-UCB, and empirically demonstrate its competitive performance.","neural contextual bandits, federated bandits" Compositional Task Representations for Large Language Models,https://openreview.net/forum?id=6axIMJA7ME3,https://openreview.net/pdf?id=6axIMJA7ME3,,"Large language models have shown a remarkable cross-task generalization ability. Most prior work assumed that prompts effectively extract knowledge from language models to facilitate generalization to new tasks. This perspective led to numerous studies on improving prompts. In contrast, we introduce a new perspective, compositional generalization, that views each task as a composition of latent codes and generalizes to test tasks by a new composition of seen codes. To this end, we propose a novel prompt-free approach, Compositional Task Representations (CTR), that employs multi-task training to learn a discrete, compositional codebook. Empirically, our CTR substantially outperforms prompt-based methods in zero-label learning on average. According to our analysis, some of the learned CTR codes are interpretable to humans and demonstrate a certain degree of controllability. ", What do large networks memorize?,https://openreview.net/forum?id=QcA9iGaLpH4,https://openreview.net/pdf?id=QcA9iGaLpH4,"Increasing model size may increase memorisation of certain training samples, while distillation inhibits memorisation","The success of modern neural models has prompted renewed study of the connection between memorisation and generalisation: such models typically generalise well, despite being able to perfectly fit (""memorise"") completely random labels. To more carefully study this issue, Feldman (2019) and Feldman & Zhang (2020) provided a simple metric to quantify the degree of memorisation of a specific training example, and empirically quantified the corresponding memorisation profile of a ResNet model on image classification benchmarks. While an exciting first glimpse into how real-world models memorise, these studies leave open several questions about memorisation of practical networks. In particular, how is memorisation affected by increasing model size, and by distilling a large model into a smaller one? We present a systematic empirical analysis of these questions. On standard image classification benchmarks, we find that training examples exhibit a diverse set of memorisation trajectories across model sizes, with some samples having increased memorisation under larger models. Further, we find that distillation tends to inhibit memorisation of the student model, while also improving generalisation. Finally, we show that computationally tractable measures of memorisation do not capture the properties we identify for memorisation in the sense of Feldman (2019), despite correlating highly with the latter. ","memorization, overparameterization, example difficulty" TILDE-Q: a Transformation Invariant Loss Function for Time-Series Forecasting,https://openreview.net/forum?id=D1Sawu2y1QG,https://openreview.net/pdf?id=D1Sawu2y1QG,"We design a novel, lightweight, and shape-aware loss function for time-series forecasting.","Time-series forecasting has attracted increasing attention in the AI research field due to its importance in solving real-world problems across different domains, such as energy, weather, traffic, and economics. 
Across various types of data, dealing with the drastic changes, temporal patterns, and shapes of sequential data, which previous models predict poorly, has been a pressing issue. This is because most time-series forecasting models aim to minimize $L_p$-norm distances as loss functions, such as mean absolute error (MAE) or mean square error (MSE). These loss functions struggle both to model temporal dynamics and to capture the shape of signals. In addition, these functions often make models misbehave and return results uncorrelated with the original time series. To be effective, a loss function should be invariant to a set of distortions between two time series instead of just comparing exact values. In this paper, we propose a novel loss function, called TILDE-Q (Transformation Invariant Loss function with Distance EQuilibrium), that not only considers distortions in amplitude and phase but also allows models to capture the shape of time-series sequences. In addition, TILDE-Q supports modeling periodic and non-periodic temporal dynamics at the same time. We evaluate the effectiveness of TILDE-Q by conducting extensive experiments with respect to periodic and non-periodic conditions of data, from naive models to state-of-the-art models. The experiment results indicate that the models trained with TILDE-Q outperform those trained with other training metrics (e.g., MSE, dynamic time warping (DTW), temporal distortion index (TDI), and longest common subsequence (LCSS)).","Time-Series Forecasting, Deep Learning, Loss functions, Time-series similarity" Pretraining One Language Model for All With the Text-To-Text Framework Using Model-Generated Signals,https://openreview.net/forum?id=us3brYx_ZBZ,https://openreview.net/pdf?id=us3brYx_ZBZ,Improve the performance of encoder-decoder language models (like T5) in unifying NLP tasks by pretraining with ELECTRA-style model-generated signals.,"Pretrained encoder-decoder language models provide the flexibility to unify various language scenarios into one text-to-text framework, but various recent studies raised concerns about their inferior pretraining efficiency and effectiveness compared to encoder-only and decoder-only models. In this paper, we improve the performance of encoder-decoder language models in unifying NLP tasks by pretraining with ELECTRA-style model-generated signals. We first show the challenges of pretraining encoder-decoder models (such as T5) using model-generated signals, including ill-formed targets, label leakage, and training instability. We then propose Metro-T5, a new formulation of the denoising pretraining task and multi-task learning loss for encoder-decoder models to incorporate ELECTRA-style pretraining. Metro-T5 outperforms T5 on a variety of language tasks in standard fine-tuning and prompt-based zero/few-shot scenarios. Our analysis shows Metro-T5 achieves similar generalization ability with much better efficiency, outperforming T0 (3B) in prompt-based learning with only 8% of its parameters, and T5 on all tasks with fewer GPU hours. 
Our pretraining code and model checkpoints will be open-sourced.","natural language understanding, natural language generation, sequence-to-sequence, language models, language pretraining, prompting, zero-shot prompting" A Picture of the Space of Typical Learning Tasks,https://openreview.net/forum?id=RlxNpChToM_,https://openreview.net/pdf?id=RlxNpChToM_,"We develop a technique to analyze the learned representation on a task, and its relationship to other tasks. We identify several surprising phenomena, e.g., the manifold of probabilistic models learned on different tasks is low-dimensional.","We develop a technique to analyze representations learned by deep networks when they are trained on different tasks using supervised, multi-task, meta- and contrastive learning. We visualize such representations using an isometric embedding of the space of probabilistic models into a lower-dimensional space, i.e., one that preserves pairwise distances. We discover the following surprising phenomena that shed light upon the structure in the space of learning tasks: (1) the manifold of probabilistic models trained on different tasks using different representation learning methods is effectively low-dimensional; (2) supervised learning on one task results in a surprising amount of progress on seemingly dissimilar tasks; progress on other tasks is larger if the training task has diverse classes; (3) the structure of the space of tasks indicated by our visualization technique is consistent with parts of the Wordnet phylogenetic tree; (4) fine-tuning a model on a sub-task does not change the representation much if the model was trained for a large number of epochs; (5) episodic meta-learning algorithms eventually fit models similar to those of supervised learning, even if the two traverse different trajectories during training; (6) contrastive learning methods trained on different datasets learn similar representations. We use classification tasks constructed from the CIFAR-10 and ImageNet datasets to study these phenomena.","Information Geometry, Space of learning tasks" Linear Mode Connectivity of Deep Neural Networks via Permutation Invariance and Renormalization,https://openreview.net/forum?id=gU5sJ6ZggcX,https://openreview.net/pdf?id=gU5sJ6ZggcX,"In this paper we empirically investigate the conjecture from Entezari et al. 2021 which states that if permutation invariance is taken into account, then there should be no barrier in the linear interpolation between SGD solutions.","In this paper we empirically investigate the conjecture from Entezari et al. (2021) which states that if permutation invariance is taken into account, then there should be no loss barrier to the linear interpolation between SGD solutions. We conduct our investigation using standard computer vision architectures trained on CIFAR-10 and ImageNet. First, we observe a general phenomenon in which interpolated deep networks suffer a collapse in the variance of their activations. We demonstrate that an appropriate rescaling of the pre-activations of the interpolated networks ameliorates this problem and significantly reduces the barrier. Second, by combining this with an algorithm for finding permutations based on maximizing correlations between the activations of matched neurons, we are able to reduce the interpolation barrier for a standard ResNet18 trained on CIFAR-10 to 1.5% absolute test error. 
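A minimal single-layer sketch of the correlation-based neuron matching from the linear-mode-connectivity row above: normalize activations on shared inputs, compute pairwise correlations, and solve the assignment problem to find the permutation. Extending this layer-by-layer (and permuting the corresponding weights) is left out.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_neurons(acts_a, acts_b):
    """acts_a, acts_b: [num_samples, num_neurons] activations of the same
    layer in two independently trained models, on the same inputs.
    Returns the permutation of model B's neurons that maximizes the total
    correlation with model A's neurons."""
    a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    corr = a.T @ b / len(a)                  # [n, n] pairwise correlations
    row, col = linear_sum_assignment(-corr)  # maximize total correlation
    return col  # apply this permutation to model B's neurons and weights
```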
We explore the interaction between our method and the choice of normalization layer, and demonstrate its robustness across a variety of architectures and training sets.","Permutation, Invariance, Mode Connectivity, Barrier, Loss landscape, Deep Learning" Multi-View Masked Autoencoders for Visual Control,https://openreview.net/forum?id=OE4uriQtuDJ,https://openreview.net/pdf?id=OE4uriQtuDJ,We present a framework for multi-view representation learning via masked view reconstruction.,"This paper investigates how to leverage data from multiple cameras to learn representations beneficial for visual control. To this end, we present the Multi-View Masked Autoencoder (MV-MAE), a simple and scalable framework for multi-view representation learning. Our main idea is to mask multiple viewpoints from video frames at random and train a video autoencoder to reconstruct pixels of both masked and unmasked viewpoints. This allows the model to learn representations that capture not only useful information about the current viewpoint but also cross-view information from different viewpoints. We evaluate MV-MAE on challenging RLBench visual manipulation tasks by training a reinforcement learning agent on top of frozen representations. Our experiments demonstrate that MV-MAE significantly outperforms other multi-view representation learning approaches. Moreover, we show that the number of cameras can differ between the representation learning phase and the behavior learning phase. By training a single-view control agent on top of multi-view representations from MV-MAE, we achieve a 62.3% success rate while the single-view representation learning baseline achieves 42.3%.","visual control, masked autoencoder, representation learning, world model" Boosting Adversarial Training with Masked Adaptive Ensemble,https://openreview.net/forum?id=fCO0_zsXk3j,https://openreview.net/pdf?id=fCO0_zsXk3j,"This paper boosts adversarial training by proposing a novel framework, including a detector and a classifier, making the DNN model withstand both dense attacks and sparse attacks and maintain high standard accuracy.","Adversarial training (AT) can help improve the robustness of a deep neural network (DNN) against potential adversarial attacks by intentionally injecting adversarial examples into the training data, but doing so inevitably degrades standard accuracy to some extent, calling for a trade-off between standard accuracy and robustness. Moreover, prominent AT solutions are vulnerable to sparse attacks, due to “robustness overfitting” on the dense attacks often adopted by AT to produce a threat model. To tackle such shortcomings, this paper proposes a novel framework, including a detector and a classifier bridged by our newly developed adaptive ensemble. Specifically, a Guided Backpropagation-based detector is designed to detect adversarial examples, motivated by our empirical observations. Meanwhile, a classifier with two encoders is employed to extract visual representations from clean images and adversarial examples, respectively. The adaptive ensemble approach also enables us to mask off a random subset of image patches within input data, eliminating potential adversarial effects when encountering malicious inputs with negligible standard accuracy degradation. As such, our approach enjoys improved robustness, able to withstand both dense and sparse attacks, while maintaining high standard accuracy. 
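A hedged sketch of the random patch masking that the adaptive-ensemble abstract above describes; the patch size and drop fraction are illustrative choices, not the paper's:

```python
import torch

def mask_random_patches(x, patch=16, drop_frac=0.3):
    """Zero out a random subset of non-overlapping patches per image so
    that localized adversarial perturbations are erased with some
    probability. Assumes H and W are divisible by `patch`."""
    B, C, H, W = x.shape
    gh, gw = H // patch, W // patch
    keep = (torch.rand(B, 1, gh, gw, device=x.device) > drop_frac).to(x.dtype)
    # upsample the patch-level keep/drop grid to pixel resolution
    mask = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return x * mask
```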
Experimental results show that our detector and classifier outperform their state-of-the-art counterparts, in terms of detection accuracy, standard accuracy, and adversarial robustness. For example, on CIFAR-10, our detector achieves the best detection accuracy of 99.6% under dense attacks and 98.5% under sparse attacks. Our classifier achieves the best standard accuracy of 91.2% and the best robustness against dense (or sparse) attacks of 57.5% (or 54.8%).","Deep Learning Security, Adversarial Example Detection, Adversarial Training" MILE: Memory-Interactive Learning Engine for Solving Mathematical Problems,https://openreview.net/forum?id=nQtcJ24_45K,https://openreview.net/pdf?id=nQtcJ24_45K,A new learning framework interacting with memory embeddings for solving mathematical problems,"Mathematical problem solving is a task that examines the capacity of machine learning models for performing logical reasoning. Existing work employed formulas as intermediate labels in this task to formulate the computing and reasoning processes and achieved remarkable performance. However, we question the limitations of existing methods from two perspectives: the expressive capacity of formulas and the learning capacity of existing models. In this work, we propose the Memory-Interactive Learning Engine (MILE), a new framework for solving mathematical problems. Our main contributions include a new formula representation technique and a new decoding method. In our experiments on the Math23K dataset, MILE outperforms existing methods in not only question-answering accuracy but also robustness and generalization capacity.","mathematical reasoning, symbolic reasoning, neural networks with memory" UNICO: Efficient Unified Hardware-Software Co-Optimization For Deep Neural Networks,https://openreview.net/forum?id=E2KNgQVJMiP,https://openreview.net/pdf?id=E2KNgQVJMiP,UNICO is a fast and high-fidelity neural accelerator HW-SW co-search solution that can find high-quality HW configurations that are generalizable to unseen DNN applications at the time of co-search.,"Specialized hardware has become an indispensable component of deep neural network acceleration. To keep up with the rapid evolution of neural networks, recently, holistic and automated solutions for jointly optimizing both hardware architectures and software mapping have been proposed. In this paper, we propose UNICO, a Unified Co-Optimization framework for hardware-software co-design, aimed at addressing the inefficiency of vast design-space exploration and the overfitting to specific input neural network workloads that current approaches face. UNICO employs multi-objective Bayesian optimization to sample hardware, and performs parallel and adaptive software mapping search for hardware samples with a customized successive halving algorithm. To reduce overfitting, UNICO incorporates quantitative robustness measures to guide the proposed search and evaluation procedure. 
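UNICO's mapping search builds on successive halving; the generic form of that algorithm, which the abstract above says UNICO customizes, looks roughly like this (the `evaluate` cost function and its budget semantics are placeholders):

```python
def successive_halving(candidates, evaluate, min_budget=1, eta=2):
    """Generic successive-halving sketch: score every candidate at a small
    budget, keep the best 1/eta fraction, multiply the budget by eta, and
    repeat until one survivor remains. `evaluate(candidate, budget)`
    returns a cost (lower is better)."""
    pool, budget = list(candidates), min_budget
    while len(pool) > 1:
        pool.sort(key=lambda c: evaluate(c, budget))
        pool = pool[: max(1, len(pool) // eta)]  # keep the best fraction
        budget *= eta                            # spend more on survivors
    return pool[0]

# toy usage: evaluation noise shrinks as the budget grows
# best = successive_halving(range(32), lambda c, b: (c - 7) ** 2 + 1.0 / b)
```

The appeal is that poor candidates are discarded after only cheap, low-budget evaluations.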
Experiments performed for both open-source spatial accelerators and a real-world commercial environment show that UNICO significantly improves over its counterparts, finding not only superior but also more robust hardware configurations, yet at drastically lower search cost.","Neural accelerator optimization, Hardware-Software co-design, Hardware optimization, HW design robustness, HW design generalizability, Successive halving, Holistic time-efficient search, Multi-objective Bayesian optimization, High-fidelity search, Tensor computation" Diffusion-GAN: Training GANs with Diffusion,https://openreview.net/forum?id=HZf7UbpWHuA,https://openreview.net/pdf?id=HZf7UbpWHuA,,"Generative adversarial networks (GANs) are challenging to train stably, and a promising remedy of injecting instance noise into the discriminator input has not been very effective in practice. In this paper, we propose Diffusion-GAN, a novel GAN framework that leverages a forward diffusion chain to generate Gaussian-mixture distributed instance noise. Diffusion-GAN consists of three components, including an adaptive diffusion process, a diffusion timestep-dependent discriminator, and a generator. Both the observed and generated data are diffused by the adaptive diffusion process via different noise-to-data ratios at each timestep. The timestep-dependent discriminator learns to distinguish the diffused real data from the diffused generated data at each diffusion timestep. The generator learns from the discriminator's feedback by backpropagating through the forward diffusion chain, whose length is adaptively adjusted to balance the noise and data levels. We theoretically show that the discriminator's timestep-dependent strategy gives consistent and helpful guidance to the generator, enabling it to match the true data distribution. We demonstrate the advantages of Diffusion-GAN over strong GAN baselines on various datasets, showing that it can produce more realistic images with higher stability and data efficiency than state-of-the-art GANs.","deep generative models, diffusion models, data-efficient stable GAN training, adaptive data augmentation" Contextual Subspace Approximation with Neural Householder Transforms,https://openreview.net/forum?id=Io0mSpdqnHJ,https://openreview.net/pdf?id=Io0mSpdqnHJ,We propose a method that trains a neural network to compute a context-dependent basis for high dimensional actuation commands. ,"Choosing an appropriate action representation is an integral part of solving robotic manipulation problems. Published approaches include latent action models which compress the control space into a low dimensional manifold. These involve training a conditional autoencoder, where the current observation and a low-dimensional action are passed through a neural network decoder to compute high dimensional actuation commands. Such models can have a large number of parameters, and can be difficult to interpret from a user perspective. In this work, we propose that similar performance gains in robotics tasks can be achieved by restructuring the neural network to map observations to a basis for a context-dependent linear actuation subspace. This results in an action interface wherein a user’s actions determine a linear combination of a state-conditioned actuation basis. We introduce the Neural Householder Transform (NHT) as a method for computing this basis.
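A sketch of the construction named in the NHT abstract above: a single network-predicted vector defines a Householder reflection, an orthogonal matrix whose leading columns form an orthonormal, observation-conditioned actuation basis. This is my reading of the idea, not the authors' code:

```python
import torch

def householder_basis(v, k):
    """From a predicted vector v, build H = I - 2 v v^T / ||v||^2 (an
    orthogonal Householder reflection) and return its first k columns,
    which form an orthonormal basis for a k-dim actuation subspace."""
    v = v / v.norm()
    H = torch.eye(v.numel(), dtype=v.dtype) - 2.0 * torch.outer(v, v)
    return H[:, :k]  # columns satisfy basis.T @ basis == I_k

# a low-dim user action a (k,) then maps to a high-dim command (n,):
# u = householder_basis(net(observation), k) @ a
```

Because the basis is orthonormal by construction, the interface stays well-conditioned regardless of the network's output.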
Our results show that reinforcement learning agents trained with NHT in kinematic manipulation and locomotion environments tend to be more robust to hyperparameter choice and achieve higher final success rates compared to agents trained with alternative action representations. NHT agents outperformed agents trained with joint velocity/torque actions, agents trained with an SVD actuation basis, and agents trained with a LASER action interface in the WAMWipe, WAMGrasp, and HalfCheetah environments.","robotics, RL, representation learning" Mind the Pool: Convolutional Neural Networks Can Overfit Input Size,https://openreview.net/forum?id=cWmtUcsYC3V,https://openreview.net/pdf?id=cWmtUcsYC3V,Standard pooling arithmetic can cause CNNs to overfit the input size used during training; an adjustment improves generalization to arbitrary sizes and robustness to translation shifts.,"We demonstrate how convolutional neural networks can overfit the input size: The accuracy drops significantly when using certain sizes, compared with favorable ones. This issue is inherent to pooling arithmetic, with standard downsampling layers playing a major role in favoring certain input sizes and skewing the weights accordingly. We present a solution to this problem by depriving these layers of the arithmetic cues they use to overfit the input size. Through various examples, we show how our proposed spatially-balanced pooling improves the generalization of the network to arbitrary input sizes and its robustness to translational shifts.","Convolutional Neural Networks, Pooling, Input Size, Overfitting" Towards Unsupervised Time Series Representation Learning: A Decomposition Perspective,https://openreview.net/forum?id=8IMz713Bxcq,https://openreview.net/pdf?id=8IMz713Bxcq,An unsupervised time series representation learning approach with the help of time series decomposition and contrastive learning,"Existing contrastive methods of universal time series representation learning mainly rely on distilling invariant patterns at varying scales and building contrastive loss with the help of negative sampling. However, the invariance assumptions may not hold in real-world time-series data, and the infamous negative sampling could bring in new biases for representation learning. In this work, we propose a novel contrastive learning approach toward time series representation learning on top of trend-seasonality decomposition, namely TS-DC. TS-DC differentiates itself from prior methods in three ways: 1) a time series decomposition approach is devised to distill different aspects/components of a complex time series; 2) a novel component-wise contrastive loss is proposed in which negative sampling is not necessary; 3) the informative signals of time series can be captured comprehensively by means of adaptive contrasting. Extensive experiments on different public benchmark datasets validate the superior performance of our proposed representation learning method. ","Time Series, Representation Learning, Contrastive Learning" Reparameterization through Spatial Gradient Scaling,https://openreview.net/forum?id=Kpdewuy7RU6,https://openreview.net/pdf?id=Kpdewuy7RU6,,"Reparameterization aims to improve the generalization of deep neural networks by transforming a convolution operation into equivalent multi-branched structures during training. However, there exists a gap in understanding how reparameterization may change and benefit learning processes for neural networks. 
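The TS-DC abstract above builds its contrastive objective on top of trend-seasonality decomposition. A hedged NumPy sketch of one standard way to perform such a split (moving-average trend plus per-phase seasonal means); the window handling is illustrative, not the paper's exact procedure:

```python
import numpy as np

def trend_seasonal_residual(x, period, trend_window=25):
    """Split a 1-D series x into trend (centered moving average),
    seasonality (per-phase means of the detrended series), and residual."""
    pad = trend_window // 2
    kernel = np.ones(trend_window) / trend_window
    trend = np.convolve(np.pad(x, pad, mode="edge"), kernel, mode="valid")[: len(x)]
    detrended = x - trend
    phase_means = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal = np.tile(phase_means, len(x) // period + 1)[: len(x)]
    return trend, seasonal, x - trend - seasonal
```

Each component can then be contrasted separately, which is what makes a component-wise loss possible.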
In this paper, we present a novel spatial gradient scaling method to redistribute learning focus among weights in convolutional neural networks. We prove that spatial gradient scaling achieves the same learning dynamics as a branched reparameterization yet without introducing structural changes into the network. We further propose an analytical approach that dynamically learns scalings for each convolutional layer based on the spatial characteristics of its input feature map gauged by mutual information. Experiments on CIFAR-10, CIFAR-100, and ImageNet show that without searching for reparameterized structures, our proposed scaling method outperforms the state-of-the-art reparameterization methods at a lower computational cost.","reparameterization, deep learning, convolutional neural networks, neural architectures" Boomerang: Local sampling on image manifolds using diffusion models,https://openreview.net/forum?id=ObWiIiKihBf,https://openreview.net/pdf?id=ObWiIiKihBf,,"Diffusion models can be viewed as mapping points in a high-dimensional latent space onto a low-dimensional learned manifold, typically an image manifold. The intermediate values between the latent space and image manifold can be interpreted as noisy images which are determined by the noise scheduling scheme employed during pre-training. We exploit this interpretation to introduce Boomerang, a local image manifold sampling approach using the dynamics of diffusion models. We call it Boomerang because we first add noise to an input image, moving it closer to the latent space, then bring it back to the image space through diffusion dynamics. We use this method to generate images which are similar, but nonidentical, to the original input images on the image manifold. We are able to set how close the generated image is to the original based on how much noise we add. Additionally, the generated images have a degree of stochasticity, allowing us to locally sample as many times as we want without repetition. We show three applications for which Boomerang can be used. First, we provide a framework for constructing privacy-preserving datasets having controllable degrees of anonymity. Second, we show how to use Boomerang for data augmentation while staying on the image manifold. Third, we introduce a framework for image super-resolution with 8x upsampling. Boomerang does not require any modification to the training of diffusion models and can be used with pretrained models on a single, inexpensive GPU.","Diffusion models, local sampling, image manifolds" TOWARD RELIABLE NEURAL SPECIFICATIONS,https://openreview.net/forum?id=RPVgoRFYWHB,https://openreview.net/pdf?id=RPVgoRFYWHB,We propose a new family of specifications based on neural activation patterns and evaluate its effectiveness through both statistical analysis and formal verification.,"Having reliable specifications is an unavoidable challenge in achieving verifiable correctness, robustness, and interpretability of AI systems. Existing specifications for neural networks are in the flavor of “data as specification”, that is, the local neighborhood centering around a reference input is considered to be correct (or robust). However, our empirical study shows that such specifications fail to certify any test data points, making it impractical for real-world applications. 
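Returning to the Boomerang abstract above, its add-noise-then-denoise procedure can be sketched in a few lines; `reverse_step` is a placeholder for one reverse step of any pretrained diffusion model, not a real library call:

```python
import torch

@torch.no_grad()
def boomerang_sample(x0, reverse_step, alphas_cumprod, t_boom):
    """Sketch of Boomerang-style local sampling: noise a real image part
    of the way up the forward diffusion (to step t_boom), then run the
    pretrained reverse chain back down. Larger t_boom gives samples that
    land farther from x0 on the image manifold."""
    a_bar = alphas_cumprod[t_boom]
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * torch.randn_like(x0)
    for t in range(t_boom, -1, -1):  # partial reverse chain only
        x_t = reverse_step(x_t, t)
    return x_t  # similar, but not identical, to x0
```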
We propose a new family of specifications called “neural representation as specification”, which uses intrinsic information of neural networks, namely neural activation patterns (NAPs), rather than input data to specify the correctness and/or robustness of neural network predictions. We present a simple statistical approach to extracting dominant neural activation patterns. We analyze NAPs from a statistical point of view and find that a single NAP can cover a large number of training and testing data points whereas ad hoc data-as-specification can only cover a single training data point and often zero testing data points. To show the effectiveness of discovered NAPs, we formally verify several important properties, such as that a particular type of misclassification never happens for a given NAP and that there is no ambiguity among different NAPs. We show that by using NAPs, we can verify the prediction of the entire input space, while still recalling 84% of the data. Thus, we argue that NAPs constitute a more reliable and extensible specification for neural network verification.","formal verification, specification, neural network verification, trustworthy AI, interpretability" A second order regression model shows edge of stability behavior,https://openreview.net/forum?id=R2M14I9LEwW,https://openreview.net/pdf?id=R2M14I9LEwW,"Recently observed non-linear learning effects like progressive sharpening and edge of stability occur generically in a simple, second order regression model.","Recent studies of learning algorithms have shown that there is a regime with an initial increase in the largest eigenvalue of the loss Hessian (progressive sharpening), followed by a stabilization of the eigenvalue near the maximum value which allows convergence (edge of stability). We consider a class of predictive models that are quadratic in the parameters, which we call second-order regression models. This is in contrast with the neural tangent kernel regime, where the predictive function is linear in the parameters. For quadratic objectives in two dimensions, we prove that this second order regression model exhibits both progressive sharpening and edge of stability behavior. We then show that in higher dimensions, the model shows this behavior generically without the structure of a neural network, due to a non-linearity induced in the learning dynamics. Finally, we show that edge of stability behavior in neural networks is correlated with the behavior in quadratic regression models.","deep learning theory, non-linear dynamics, optimization" Learning Frequency-aware Network for Continual Learning,https://openreview.net/forum?id=k1lUZZzE6b-,https://openreview.net/pdf?id=k1lUZZzE6b-,,"Continual learning aims to ensure that, when learning new tasks, a model forgets as little as possible of previously acquired knowledge. Most current algorithms perform the same processing on each pixel of the image. Just as people remember image details and the whole image with different strengths, a neural network also forgets different parts of an image asynchronously. In this paper, we discuss the problem of asynchronous forgetting of images at different frequencies. To solve this problem, we propose a solution from the perspective of network structure design and feature preservation. 
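A hedged sketch of how one might mine a dominant NAP in the statistical spirit of the specification abstract above; the threshold and the assumption that the loader yields one class's examples are illustrative choices:

```python
import torch

@torch.no_grad()
def dominant_nap(model, layer, loader, keep=0.95):
    """Record which units of a chosen ReLU layer fire on a class's inputs,
    then constrain only the units whose on/off state is stable on at least
    `keep` of them; the rest stay unconstrained."""
    recorded = []
    hook = layer.register_forward_hook(
        lambda mod, inp, out: recorded.append((out > 0).float().flatten(1)))
    for x, _ in loader:  # assumed: loader yields one class's examples
        model(x)
    hook.remove()
    on_rate = torch.cat(recorded).mean(dim=0)     # fraction of inputs firing
    nap = torch.full_like(on_rate, float("nan"))  # NaN = unconstrained unit
    nap[on_rate >= keep] = 1.0                    # reliably active
    nap[on_rate <= 1.0 - keep] = 0.0              # reliably inactive
    return nap
```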
In terms of network structure, we design a dual-stream network with high/low-frequency separation, using the respective characteristics of a CNN and a transformer to process the high-frequency and low-frequency information of images; for feature preservation, we design a dynamic distillation loss function that adjusts the preserved weight of high-frequency and low-frequency information according to the training stage of the network. We have verified the effectiveness of our scheme through a series of experiments.","Continual Learning, Incremental Learning, Vision Transformer" Unsupervised Learning for Combinatorial Optimization Needs Meta Learning,https://openreview.net/forum?id=-ENYHCE8zBp,https://openreview.net/pdf?id=-ENYHCE8zBp,,"A general framework of unsupervised learning for combinatorial optimization (CO) is to train a neural network (NN) whose output gives a problem solution through directly optimizing the CO objective. Albeit with some advantages over traditional solvers, the current framework optimizes an averaged performance over the distribution of historical problem instances, which misaligns with the actual goal of CO that looks for a good solution to every future encountered instance. With this observation, we propose a new objective of unsupervised learning for CO where the goal of learning is to search for good initializations for future problem instances rather than give direct solutions. We propose a meta-learning-based training pipeline for this new objective. Our method achieves good empirical performance. We observe that even just the initial solution given by our model before fine-tuning can significantly outperform the baselines under various evaluation settings including over the same dataset, across multiple datasets, and with a shift in the problem scale. We conjecture that this is because meta-learning-based training may help find valleys of the optimization landscape with good local optima for CO problems, which often contain many bad local optima.","combinatorial optimization, unsupervised learning, meta learning, graph neural networks" Latent Topology Induction for Understanding Contextualized Representations,https://openreview.net/forum?id=YrxOdjYd1j8,https://openreview.net/pdf?id=YrxOdjYd1j8,We discover the hidden topology within the representation space of contextualized representations,"Recently, there has been considerable interest in understanding pretrained language models. This work studies the hidden geometry of the representation space of language models from a unique topological perspective. We hypothesize that there exists a network of latent anchor states summarizing the topology (neighbors and connectivity) of the representation space. We infer this latent network in a fully unsupervised way using a structured variational autoencoder. We show that such a network exists in pretrained representations, but not in baseline random or positional embeddings. We connect the discovered topological structure to their linguistic interpretations. In this latent network, leaf nodes can be grounded to word surface forms, anchor states can be grounded to linguistic categories, and connections between nodes and states can be grounded to phrase constructions and syntactic templates. We further show how such a network evolves as the embeddings become more contextualized, with observational and statistical evidence demonstrating how contextualization helps words “receive meaning” from their topological neighbors via the anchor states. 
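A hedged sketch of the kind of high/low-frequency separation that could feed the dual-stream design in the continual-learning abstract above; the FFT box filter and cutoff are illustrative stand-ins for whatever filter the paper uses:

```python
import torch

def split_frequencies(img, cutoff=0.25):
    """Split an image tensor [..., H, W] into low- and high-frequency
    components: keep a centered box of the 2-D spectrum for the low
    branch, and the remainder for the high branch."""
    freq = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    H, W = img.shape[-2:]
    h, w = int(H * cutoff), int(W * cutoff)
    mask = torch.zeros(H, W, dtype=torch.bool, device=img.device)
    mask[H // 2 - h : H // 2 + h, W // 2 - w : W // 2 + w] = True
    low = torch.fft.ifft2(
        torch.fft.ifftshift(freq * mask.to(freq.dtype), dim=(-2, -1))).real
    return low, img - low  # low-frequency and high-frequency components
```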
We demonstrate these insights with extensive experiments and visualizations.", DyG2Vec: Representation Learning for Dynamic Graphs With Self-supervision,https://openreview.net/forum?id=cC0VNCNCqpK,https://openreview.net/pdf?id=cC0VNCNCqpK,,"The challenge in learning from dynamic graphs for predictive tasks lies in extracting fine-grained temporal motifs from an ever-evolving graph. Moreover, task labels are often scarce, costly to obtain, and highly imbalanced for large dynamic graphs. Recent advances in self-supervised learning on graphs demonstrate great potential, but focus on static graphs. State-of-the-art (SoTA) models for dynamic graphs are not only incompatible with the self-supervised learning (SSL) paradigm but also fail to forecast interactions beyond the very near future. To address these limitations, we present DyG2Vec, an SSL-compatible, efficient model for representation learning on dynamic graphs. DyG2Vec uses a window-based mechanism to generate task-agnostic node embeddings that can be used to forecast future interactions. DyG2Vec significantly outperforms SoTA baselines on benchmark datasets for downstream tasks while only requiring a fraction of the training/inference time. We adapt two SSL evaluation mechanisms to make them applicable to dynamic graphs and thus show that SSL pre-training helps learn more robust temporal node representations, especially for scenarios with few labels. ", Unsupervised Meta-learning via Few-shot Pseudo-supervised Contrastive Learning,https://openreview.net/forum?id=TdTGGj7fYYJ,https://openreview.net/pdf?id=TdTGGj7fYYJ,,"Unsupervised meta-learning aims to learn generalizable knowledge across a distribution of tasks constructed from unlabeled data. Here, the main challenge is how to construct diverse tasks for meta-learning without label information; recent works have proposed, e.g., pseudo-labeling via pretrained representations or creating synthetic samples via generative models. However, such a task construction strategy is fundamentally limited due to heavy reliance on the immutable pseudo-labels during meta-learning and the quality of the representations or the generated samples. To overcome the limitations, we propose a simple yet effective unsupervised meta-learning framework, coined Pseudo-supervised Contrast (PsCo), for few-shot classification. We are inspired by the recent self-supervised learning literature; PsCo utilizes a momentum network and a queue of previous batches to improve pseudo-labeling and construct diverse tasks in a progressive manner. Our extensive experiments demonstrate that PsCo outperforms existing unsupervised meta-learning methods under various in-domain and cross-domain few-shot classification benchmarks. We also validate that PsCo is easily scalable to a large-scale benchmark, while recent prior-art meta-schemes are not.","unsupervised meta-learning, supervised contrastive learning, self-supervised learning" PromptBoosting: Black-Box Text Classification with Ten Forward Passes,https://openreview.net/forum?id=01LMSeReNvY,https://openreview.net/pdf?id=01LMSeReNvY,,"We describe PromptBoosting, a query-efficient procedure for building a text classifier from a neural language model (LM) without access to the LM’s parameters, gradients, or hidden representations. This form of “black-box” classifier training has become increasingly important as the cost of training and inference in large-scale LMs grows. 
But existing black-box LM classifier learning approaches are themselves computationally inefficient, typically specializing LMs to the target task by searching in a large space of (discrete or continuous) prompts using zeroth-order optimization methods. Instead of directly optimizing in prompt space, PromptBoosting obtains a small pool of prompts via a gradient-free approach and then constructs a large pool of weak learners by pairing these prompts with different elements of the LM’s output distribution. These weak learners are then ensembled using the AdaBoost algorithm. The entire learning process requires only a small number of forward passes and no backward pass. Experiments show that PromptBoosting achieves state-of-the-art performance in multiple black-box few-shot classification tasks, and matches or outperforms full fine-tuning in both few-shot and standard learning paradigms, while training 10x faster than existing black-box methods.", Decepticons: Corrupted Transformers Breach Privacy in Federated Learning for Language Models,https://openreview.net/forum?id=r0BrY4BiEXO,https://openreview.net/pdf?id=r0BrY4BiEXO,,"A central tenet of Federated learning (FL), which trains models without centralizing user data, is privacy. However, previous work has shown that the gradient updates used in FL can leak user information. While most industrial uses of FL are for text applications (e.g. keystroke prediction), nearly all attacks on FL privacy have focused on simple image classifiers. We propose a novel attack that reveals private user text by deploying malicious parameter vectors, and which succeeds even with mini-batches, multiple users, and long sequences. Unlike previous attacks on FL, the attack exploits characteristics of both the Transformer architecture and the token embedding, separately extracting tokens and positional embeddings to retrieve high-fidelity text. This work suggests that FL on text, which has historically been resistant to privacy attacks, is far more vulnerable than previously thought. ","Federated Learning, Attack, Privacy, Transformers, Gradient Inversion" Adaptive Optimization in the $\infty$-Width Limit,https://openreview.net/forum?id=zgVDqw9ZUES,https://openreview.net/pdf?id=zgVDqw9ZUES,We derive the infinite width limits of neural networks trained with adaptive optimizers,"Recent works have developed detailed understanding of large neural networks' behaviors via their infinite-width limits, e.g., the neural tangent kernel (NTK) and the feature learning ($\mu$) limits. These theories were developed for stochastic gradient descent. Yet, in practice, all large NNs are trained using Adam or other adaptive gradient optimizers (AGO), which are not covered by such previous works. Here, we close this gap via the Tensor Programs framework. Specifically, for deep MLPs, we derive the NTK and $\mu$ parametrizations as well as their infinite-width limits. We find that 1) the NTK limit of AGO, in contrast to that of SGD, now depends nonlinearly on the loss derivative but nevertheless still fails to learn features; 2) this is fixed by the $\mu$ limit of AGO (as in the case of SGD). 
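The PromptBoosting abstract above ensembles prompt-based weak learners with AdaBoost; a sketch of that ensembling step for the binary case, assuming the weak learners' {-1,+1} training predictions were collected elsewhere (the weak learners themselves, prompt plus output-distribution pairing, are not shown):

```python
import numpy as np

def adaboost_prompts(weak_preds, y, T=10):
    """Standard binary AdaBoost over precomputed weak-learner predictions.
    weak_preds: list of arrays in {-1,+1}, one per weak learner;
    y: array of true labels in {-1,+1}. Returns an ensemble predictor."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    picks, alphas = [], []
    for _ in range(T):
        errors = np.array([w[p != y].sum() for p in weak_preds])
        j = int(errors.argmin())
        eps = float(np.clip(errors[j], 1e-10, 1 - 1e-10))
        alpha = 0.5 * np.log((1 - eps) / eps)
        w *= np.exp(-alpha * y * weak_preds[j])  # upweight current mistakes
        w /= w.sum()
        picks.append(j)
        alphas.append(alpha)
    return lambda preds: np.sign(
        sum(a * preds[j] for a, j in zip(alphas, picks)))
```

Since each round only re-reads cached predictions, the LM itself is queried just once per prompt, which is where the query efficiency comes from.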
To obtain these results, we extend the Tensor Programs language with a new instruction that allows one to express the gradient processing done by AGOs.","Infinite width, neural tangent kernels, feature learning, theory, adaptive optimization, tensor programs" Pyramidal Denoising Diffusion Probabilistic Models,https://openreview.net/forum?id=MMKqOJgRiw4,https://openreview.net/pdf?id=MMKqOJgRiw4,,"Recently, diffusion models have demonstrated impressive image generation performance, and have been extensively studied in various computer vision tasks. Unfortunately, training and evaluating diffusion models consume a lot of time and computational resources. To address this problem, here we present a novel pyramidal diffusion model that can generate high resolution images starting from much coarser resolution images using a {\em single} score function trained with a positional embedding. This enables a neural network to be much lighter and also enables time-efficient image generation without compromising performance. Furthermore, we show that the proposed approach can also be used efficiently for the multi-scale super-resolution problem using a single score function.","Diffusion Model, Image Generation, Super Resolution" Guiding Energy-based Models via Contrastive Latent Variables,https://openreview.net/forum?id=CZmHHj9MgkP,https://openreview.net/pdf?id=CZmHHj9MgkP,We propose a simple yet effective framework for improving energy-based models (EBMs) via contrastive representation learning.,"An energy-based model (EBM) is a popular generative framework that offers both explicit density and architectural flexibility, but EBMs are difficult to train since training is often unstable and time-consuming. In recent years, various training techniques have been developed, e.g., better divergence measures or stabilization in MCMC sampling, but there often exists a large gap between EBMs and other generative frameworks like GANs in terms of generation quality. In this paper, we propose a novel and effective framework for improving EBMs via contrastive representation learning (CRL). To be specific, we consider representations learned by contrastive methods as the true underlying latent variable. This contrastive latent variable can guide EBMs to understand the data structure better, significantly improving and accelerating EBM training. To enable the joint training of EBM and CRL, we also design a new class of latent-variable EBMs for learning the joint density of data and the contrastive latent variable. Our experimental results demonstrate that our scheme achieves lower FID scores, compared to prior-art EBM methods (e.g., additionally using variational autoencoders or diffusion techniques), even with significantly faster and more memory-efficient training. We also show conditional and compositional generation abilities of our latent-variable EBMs as their additional benefits, even without explicit conditional training.","energy-based model, contrastive representation learning" Deep Watermarks for Attributing Generative Models,https://openreview.net/forum?id=43nOUI4VHUw,https://openreview.net/pdf?id=43nOUI4VHUw,,"Generative models have enabled the creation of content that is indistinguishable from that found in nature. Open-source development of such models has raised concerns about the risk of their misuse for malicious purposes. One potential risk mitigation strategy is to attribute generative models via watermarking. 
Current watermarking methods exhibit a significant tradeoff between robust attribution accuracy and generation quality, and also lack principles for designing watermarks to improve this tradeoff. This paper investigates the use of latent semantic dimensions as watermarks, from which we can analyze the effects of design variables, including the choice of watermarking dimensions, watermarking strength, and the capacity of watermarks, on the accuracy-quality tradeoff. Compared with previous SOTA, our method requires minimal computation and is more applicable to large-scale models. We use StyleGAN2 and the latent diffusion model to demonstrate the efficacy of our method.","Model Attribution, Watermarking, Generative Models" Steerable Equivariant Representation Learning,https://openreview.net/forum?id=NFzHAognkpQ,https://openreview.net/pdf?id=NFzHAognkpQ,,"Pre-trained deep image representations are useful for post-training tasks such as classification through transfer learning, image retrieval, and object detection. Data augmentations are a crucial aspect of pre-training robust representations in both supervised and self-supervised settings. Data augmentations explicitly or implicitly promote \emph{invariance} in the embedding space to the input image transformations. This invariance reduces generalization to those downstream tasks which rely on sensitivity to these particular data augmentations. In this paper, we propose a method of learning representations that are instead \emph{equivariant} to data augmentations. We achieve this equivariance through the use of \emph{steerable} representations. Our representations can be manipulated directly in embedding space via learned linear maps. We demonstrate that our resulting steerable and equivariant representations lead to better performance on transfer learning and robustness: e.g. we improve linear probe top-1 accuracy by 1\% to 3\% for transfer; and ImageNet-C accuracy by up to 3.4\%. We further show that the steerability of our representations provides significant speedup (nearly $50\times$) for test-time augmentations; by applying a large number of augmentations for out-of-distribution detection, we significantly improve OOD AUC on the ImageNet-C dataset over an invariant representation.","representation, visual, equivariance, equivariant" Differentially Private Diffusion Models,https://openreview.net/forum?id=pX21pH4CsNB,https://openreview.net/pdf?id=pX21pH4CsNB,Training diffusion models with differential privacy achieves state-of-the-art performance on image generation benchmarks.,"While modern machine learning models rely on increasingly large training datasets, data is often limited in privacy-sensitive domains. Generative models trained with differential privacy (DP) on sensitive data can sidestep this challenge, providing access to synthetic data instead. However, training DP generative models is highly challenging due to the noise injected into training to enforce DP. We propose to leverage diffusion models (DMs), an emerging class of deep generative models, and introduce Differentially Private Diffusion Models (DPDMs), which enforce privacy using differentially private stochastic gradient descent (DP-SGD). We motivate why DP-SGD is well suited for training DPDMs, and thoroughly investigate the DM parameterization and the sampling algorithm, which turn out to be crucial ingredients in DPDMs. 
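A deliberately loose sketch of watermarking through latent semantic dimensions, as the attribution abstract above proposes; the direction matrix, bit encoding, and strength here are assumptions for illustration, not the paper's design:

```python
import torch

def embed_watermark(w, directions, bits, strength=0.5):
    """Shift a generator latent w along k unit-norm semantic directions,
    the sign of each shift encoding one bit. `directions` ([k, dim],
    orthonormal rows) would come from a separate semantic-factorization
    step; all names here are hypothetical."""
    signs = 2.0 * torch.as_tensor(bits, dtype=w.dtype) - 1.0  # {0,1} -> {-1,+1}
    return w + strength * (signs @ directions)

def decode_watermark(w_marked, w_ref, directions):
    """Recover the bits by projecting the latent shift onto each direction."""
    return (directions @ (w_marked - w_ref) > 0).long().tolist()
```

Orthonormal directions keep the bits independent, and the strength parameter is exactly the accuracy-versus-quality knob the abstract analyzes.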
Furthermore, we propose noise multiplicity, a simple yet powerful modification of the DM training objective tailored to the DP setting to boost performance. We validate our novel DPDMs on widely-used image generation benchmarks and achieve state-of-the-art (SOTA) performance by large margins. For example, on MNIST we improve the SOTA FID from 48.4 to 5.01 and downstream classification accuracy from 83.2% to 98.1% for the privacy setting DP-$(\varepsilon=10, \delta=10^{-5})$. Moreover, on standard benchmarks, classifiers trained on DPDM-generated synthetic data perform on par with task-specific DP-SGD-trained classifiers, which has not been demonstrated before for DP generative models.","Diffusion models, Differential Privacy, Generative Modeling" Outlier-Robust Group Inference via Gradient Space Clustering,https://openreview.net/forum?id=czL6NLxJsx,https://openreview.net/pdf?id=czL6NLxJsx,"We propose to perform clustering in the gradient space for outlier-robust group identification, thereby learning distributionally and outlier robust models when group labels are unavailable.","Traditional machine learning models focus on achieving good performance on the overall training distribution, but they often underperform on minority groups. Existing methods can improve the worst-group performance, but they can have several limitations: (i) they require group annotations, which are often expensive and sometimes infeasible to obtain, and/or (ii) they are sensitive to outliers. Most related works fail to solve these two issues simultaneously as they focus on conflicting perspectives of minority groups and outliers. We address the problem of learning group annotations in the presence of outliers by clustering the data in the space of gradients of the model parameters. We show that data in the gradient space has a simpler structure while preserving information about minority groups and outliers, making it suitable for standard clustering methods like DBSCAN. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art both in terms of group identification and downstream worst-group performance.","Distributionally Robust Optimization, Outlier Robust Optimization, Group Identification, Subpopulation Shift" Broken Neural Scaling Laws,https://openreview.net/forum?id=sckjveqlCZ,https://openreview.net/pdf?id=sckjveqlCZ,"We present a functional form that accurately models the scaling behaviors for each task from a very large and diverse set of downstream (and upstream) tasks, even scaling behaviors that were previously believed to be ""unpredictable"".","We present a smoothly broken power law functional form that accurately models the scaling behaviors (of artificial neural networks) (i.e. how the evaluation metric of interest varies as the amount of compute used for training, number of model parameters, or training dataset size varies) for each task from a very large and diverse set of upstream and downstream (i.e. zero-shot, prompted, and fine-tuned) tasks. These tasks include large-scale vision tasks, large-scale unsupervised language tasks, arithmetic, and reinforcement learning. This functional form yields extrapolations of scaling behavior that often are an order of magnitude more accurate than previous functional forms for modeling the scaling behavior of artificial neural networks. 
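For concreteness, a smoothly broken power law of the kind the BNSL abstract above describes can be written as below; the parameterization follows my reading of the paper and may differ in detail:

```python
import numpy as np

def bnsl(x, a, b, c0, breaks):
    """Hedged sketch of a smoothly broken power law: a base power law
    b * x**(-c0) whose exponent changes by c_i around scale d_i, with f_i
    setting how sharp each break is (x > 0, e.g. compute or model size).
    breaks: list of (c_i, d_i, f_i) tuples."""
    y = b * x ** (-c0)
    for c, d, f in breaks:
        y *= (1.0 + (x / d) ** (1.0 / f)) ** (-c * f)
    return a + y

# e.g. fit parameters with scipy.optimize.curve_fit on (scale, error)
# pairs and extrapolate; one break already captures a bend in log-log space.
```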
Moreover, this functional form accurately models the non-monotonic transitions present in the scaling behavior of phenomena such as double descent and the delayed, sharp transitions present in the scaling behavior of tasks such as arithmetic.","Scaling Laws, Scaling, Scale, Big Learning, Deep Learning, Artificial Neural Networks" Learning to perceive objects by prediction,https://openreview.net/forum?id=VILHmvACcR,https://openreview.net/pdf?id=VILHmvACcR,Object representations arise by predicting the future,"The representation of objects is the building block of higher-level concepts. Infants develop the notion of objects without supervision, for which the prediction error of future sensory input is likely a major teaching signal. We assume that the goal of representing objects distinctly is to allow the prediction of the coherent motion of all parts of an object independently from the background while keeping track of relatively fewer parameters of the object's motion. To realize this, we propose a framework to extract object-centric representations from single 2D images by learning to predict future scenes containing moving objects. The model learns to explicitly infer objects' locations in a 3D environment, generate 2D segmentation masks of objects, and perceive depth. Importantly, the model requires no supervision or pre-training but assumes rigid-body motion and only needs the observer's self-motion at training time. Further, by evaluating on a new synthetic dataset with more complex textures of objects and background, we find that our model overcomes the reliance on color clustering for segmenting objects, a limitation of previous models that do not use motion information. Our work demonstrates a new approach to learning symbolic representation grounded in sensation and action.","self supervised learning, predictive learning, object-centric representation, 3D perception, sensory grounding" Avoiding spurious correlations via logit correction,https://openreview.net/forum?id=5BaqCFVh5qL,https://openreview.net/pdf?id=5BaqCFVh5qL," We propose the logit correction (LC) loss, a simple yet effective improvement on the softmax cross-entropy loss, to mitigate spurious correlations","Empirical studies suggest that machine learning models trained with empirical risk minimization (ERM) often rely on attributes that may be spuriously correlated with the class labels. Such models typically perform poorly at inference time on data lacking such correlations, and generalize even worse when a larger fraction of the training data presents spurious correlations. In this work, we explicitly consider the setting where potential spurious correlations are present in the majority of the training data. Unlike existing approaches which use the ERM model outputs to detect the samples without spurious correlations, and heuristically upweight or upsample those samples, we propose the logit correction (LC) loss, a simple yet effective improvement on the softmax cross-entropy loss, to correct the sample logit. We demonstrate that minimizing the LC loss is equivalent to maximizing the group-balanced accuracy, so the proposed LC can mitigate the negative impacts of spurious correlations in the majority of samples. Our extensive experimental results further reveal that the proposed LC loss outperforms the SoTA solutions on multiple popular benchmarks by a noticeably large margin (an average 5.5% absolute improvement), without access to spurious attribute labels. 
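A hedged sketch of a logit-correction loss in the spirit of the LC abstract above; the form shown is the closely related logit-adjustment recipe, with `log_prior` standing in for statistics the method would estimate from (pseudo-)group frequencies:

```python
import torch.nn.functional as F

def logit_corrected_ce(logits, y, log_prior):
    """Adding a per-class log prior to the logits before cross-entropy
    makes the minimizer target balanced rather than raw accuracy:
    over-represented (class, group) combinations get a handicap.
    `log_prior` is a [num_classes] tensor estimated elsewhere."""
    return F.cross_entropy(logits + log_prior, y)
```

The appeal is that the correction is one added tensor per batch, so it drops into any existing training loop.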
LC is also competitive with oracle methods that make use of the attribute labels. ","spurious correlation, robust learning, empirical risk minimization" LEARNING CONTEXT-AWARE ADAPTIVE SOLVERS TO ACCELERATE QUADRATIC PROGRAMMING,https://openreview.net/forum?id=p5cvsNww5dB,https://openreview.net/pdf?id=p5cvsNww5dB,,"Quadratic programming (QP) is an important sub-field of mathematical optimization. The alternating direction method of multipliers (ADMM) is a successful method to solve QP. Even though ADMM shows promising results in solving various types of QP, its convergence speed is known to be highly dependent on the step-size parameter $\rho$. Due to the absence of a general rule for setting $\rho$, it is often tuned manually or heuristically. In this paper, we propose CA-ADMM (Context-aware Adaptive ADMM), which learns to adaptively adjust $\rho$ to accelerate ADMM. CA-ADMM extracts the spatio-temporal context, which captures the dependency of the primal and dual variables of QP and their temporal evolution during the ADMM iterations. CA-ADMM chooses $\rho$ based on the extracted context. Through extensive numerical experiments, we validated that CA-ADMM effectively generalizes to unseen QP problems with different sizes and classes (i.e., having different QP parameter structures). Furthermore, we verified that CA-ADMM could dynamically adjust $\rho$ considering the stage of the optimization process to accelerate the convergence speed further.","quadratic optimization, convex optimization, reinforcement learning for optimization, graph neural network, contextual learning" Learning Latent Structural Causal Models,https://openreview.net/forum?id=w2mDq-p9EEf,https://openreview.net/pdf?id=w2mDq-p9EEf,"Bayesian inference over latent structural causal models from low-level data and random, known interventions for linear Gaussian additive noise SCMs. Such a model also performs image generation from unseen interventions.","Causal learning has long concerned itself with the accurate recovery of underlying causal mechanisms. Such causal modelling enables better explanations of out-of-distribution data. Prior works on causal learning assume that the high-level causal variables are given. However, in machine learning tasks, one often operates on low-level data like image pixels or high-dimensional vectors. In such settings, the entire Structural Causal Model (SCM) -- structure, parameters, \textit{and} high-level causal variables -- is unobserved and needs to be learnt from low-level data. We treat this problem as Bayesian inference of the latent SCM, given low-level data. For linear Gaussian additive noise SCMs, we present a tractable approximate inference method which performs joint inference over the causal variables, structure and parameters of the latent SCM from random, known interventions. Experiments are performed on synthetic datasets and a causally generated image dataset to demonstrate the efficacy of our approach. 
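The CA-ADMM abstract above learns the ADMM step size $\rho$ with a context-aware policy. For contrast, the classical residual-balancing heuristic that such a learned policy replaces is only a few lines:

```python
def residual_balanced_rho(rho, r_primal, r_dual, mu=10.0, tau=2.0):
    """Classical residual-balancing update for the ADMM step size:
    increase rho when the primal residual dominates, decrease it when the
    dual residual dominates, so neither residual stalls convergence.
    mu and tau are the customary heuristic constants."""
    if r_primal > mu * r_dual:
        return rho * tau
    if r_dual > mu * r_primal:
        return rho / tau
    return rho
```

The heuristic reacts only to the two residual magnitudes; the point of CA-ADMM is that a policy conditioned on richer spatio-temporal context can do better.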
We also perform image generation from unseen interventions, thereby verifying out-of-distribution generalization for the proposed causal model.","Causal discovery, Bayesian inference" Pre-Training for Robots: Leveraging Diverse Multitask Data via Offline Reinforcement Learning,https://openreview.net/forum?id=lVdvYoIxsXm,https://openreview.net/pdf?id=lVdvYoIxsXm,,"Recent progress in deep learning highlights the tremendous potential of utilizing diverse datasets for achieving effective generalization and makes it enticing to consider leveraging broad datasets for attaining more robust generalization in robotic learning as well. However, in practice we will likely want to learn a new skill in a new environment that is unlikely to be contained in the prior data. Therefore we ask: how can we leverage existing diverse offline datasets in combination with small amounts of task-specific data to solve new tasks, while still enjoying the generalization benefits of training on large amounts of data? In this paper, we demonstrate that end-to-end offline RL can be an effective approach for doing this, without the need for any representation learning or vision-based pre-training. We present pre-training for robots (PTR), a framework based on offline RL that attempts to effectively learn new tasks by combining pre-training on existing robotic datasets with rapid fine-tuning on a new task, with as few as 10 demonstrations. At its core, PTR applies an existing offline RL method such as conservative Q-learning (CQL), but extends it to include several crucial design decisions that enable PTR to actually work and outperform a variety of prior methods. To the best of our knowledge, PTR is the first offline RL method that succeeds at learning new tasks in a new domain on a real WidowX robot with as few as 10 task demonstrations, by effectively leveraging an existing dataset of diverse multi-task robot data collected in a variety of toy kitchens. We present an accompanying overview video at https://www.youtube.com/watch?v=yAWgyLJD5lY&ab_channel=PTRICLR","pre-training, robotics, finetuning, offline RL" Safe Exploration Incurs Nearly No Additional Sample Complexity for Reward-Free RL,https://openreview.net/forum?id=wNUgn1n6esQ,https://openreview.net/pdf?id=wNUgn1n6esQ,"We develop a novel reward-free RL framework with a safety constraint, and provide a unified provably efficient algorithm.","Reward-free reinforcement learning (RF-RL), a recently introduced RL paradigm, relies on random action-taking to explore the unknown environment without any reward feedback information. While the primary goal of the exploration phase in RF-RL is to reduce the uncertainty in the estimated model with a minimum number of trajectories, in practice, the agent often needs to abide by a certain safety constraint at the same time. It remains unclear how such a safe exploration requirement would affect the corresponding sample complexity in order to achieve the desired optimality of the obtained policy in planning. In this work, we make a first attempt to answer this question. In particular, we consider the scenario where a safe baseline policy is known beforehand, and propose a unified Safe reWard-frEe ExploraTion (SWEET) framework. We then particularize the SWEET framework to the tabular and the low-rank MDP settings, and develop algorithms coined Tabular-SWEET and Low-rank-SWEET, respectively. 
Both algorithms leverage the concavity and continuity of the newly introduced truncated value functions, and are guaranteed to achieve zero constraint violation during exploration with high probability. Furthermore, both algorithms can provably find a near-optimal policy subject to any constraint in the planning phase. Remarkably, the sample complexities under both algorithms match or even outperform the state of the art in their constraint-free counterparts up to some constant factors, proving that the safety constraint hardly increases the sample complexity for RF-RL.","Reward-free RL, Safety constraint, Sample complexity, Pure exploration" S$^6$-DAMON: Bridging Self-Supervised Speech Models and Real-time Speech Recognition,https://openreview.net/forum?id=3dH2aqKGzZe,https://openreview.net/pdf?id=3dH2aqKGzZe,We propose a data-model co-compression framework dubbed S$^6$-DAMON for bridging self-supervised speech models with real-time speech recognition.,"There has been a growing demand for deep neural network (DNN) powered automatic speech recognition (ASR) on mobile platforms for real-time speech recognition. However, ubiquitous on-device ASR systems are still hindered by two bottlenecks: (1) the lack of large-scale transcribed speech data especially for low-resource spoken languages and (2) the large gap between DNNs' prohibitive complexity and mobiles' limited resources. In parallel, speech models pretrained via self-supervised learning (SSL) have emerged to reduce the reliance on the availability of transcribed speech data, which however further enlarges the efficiency gap because they often adopt large transformers to ensure expressive speech representations. Thus, it is highly desired to trim down the complexity of speech SSL models to enable real-time on-device ASR. This is particularly challenging since only structured sparsity can favor hardware efficiency in commercial devices, under which the speech representation learned by SSL could easily be demolished. To this end, we develop a framework dubbed S$^6$-DAMON to pursue structured sparsity in speech SSL models via data-model co-compression. On the data side, leveraging both the duration of each phoneme and the pauses between the words/phonemes of human utterances, we propose a salient audio token detector, dubbed SALAD, to remove input audio tokens that are redundant; on the model side, we identify that the failure of the SOTA ASR pruning method under structured sparsity is caused by the sparsity discrepancy between finetuning/deployment and their limited learnability of sparsity distributions, and then tackle it via a new ASR pruning pipeline dubbed SAFARI, which adopts a three-step pipeline: sparsify, finetune, and adjust sparsity. Extensive experiments validate that S$^6$-DAMON can enable real-time ASR with limited transcribed speech data requirements while maintaining decent recognition performance. All source codes will be released upon acceptance.","automated speech recognition, self-supervised learning, model compression" Teaching Algorithmic Reasoning via In-context Learning,https://openreview.net/forum?id=6dlC7E1H_9,https://openreview.net/pdf?id=6dlC7E1H_9,We study how to teach algorithmic reasoning to LLMs via in-context learning. We show that algorithmic reasoning can be taught by increasing specificity in the way we explain the steps of an algorithm along with running examples.,"Large language models (LLMs) have shown increasing in-context learning capabilities through scaling up model and data size. 
Despite this progress, LLMs are still unable to solve algorithmic reasoning problems. While providing a rationale with the final answer has led to further improvements in multi-step reasoning problems, Anil et al. (2022) showed that even simple algorithmic reasoning tasks such as parity are far from solved. In this work, we identify and study four key stages for successfully teaching algorithmic reasoning to LLMs: (1) formulating algorithms as skills, (2) teaching multiple skills simultaneously (skill accumulation), (3) teaching how to combine skills (skill composition), and (4) teaching how to use skills as tools. We show that it is possible to teach algorithmic reasoning to LLMs via in-context learning, which we refer to as Algorithmic Prompting. We evaluate our approach on a variety of arithmetic and quantitative reasoning tasks, and demonstrate significant boosts in performance over existing prompting techniques. In particular, for long parity, addition, multiplication, and subtraction tasks, we achieve an error reduction of approximately 10x, 9x, 5x, and 2x, respectively, compared to the best available baselines. ","in-context learning, algorithmic reasoning, LLMs, prompting" Offline Q-learning on Diverse Multi-Task Data Both Scales And Generalizes,https://openreview.net/forum?id=4-k7kUavAj,https://openreview.net/pdf?id=4-k7kUavAj,,"The potential of offline reinforcement learning (RL) is that high-capacity models trained on large, heterogeneous datasets can lead to agents that generalize broadly, analogously to similar advances in vision and NLP. However, recent works argue that offline RL methods encounter unique challenges to scaling up model capacity. Drawing on the lessons from these works, we re-examine previous design choices and find that with appropriate choices (ResNets, cross-entropy-based distributional backups, and feature normalization), offline Q-learning algorithms exhibit strong performance that scales with model capacity. Using multi-task Atari as a testbed for scaling and generalization, we train a single policy on 40 games with near-human performance using up to 80-million-parameter networks, finding that model performance scales favorably with capacity. In contrast to prior work, we extrapolate beyond dataset performance even when trained entirely on a large (400M transitions) but highly suboptimal dataset (51% human-level performance). Compared to return-conditioned supervised approaches, offline Q-learning scales similarly with model capacity and has better performance, especially when the dataset is suboptimal. Finally, we show that offline Q-learning with a diverse dataset is sufficient to learn powerful representations that facilitate rapid transfer to novel games and fast online learning on new variations of a training game, improving over existing state-of-the-art representation learning approaches.","offline RL, multi-task Atari, large models" Disentangled Conditional Variational Autoencoder for Unsupervised Anomaly Detection,https://openreview.net/forum?id=D__ipVB0Z7,https://openreview.net/pdf?id=D__ipVB0Z7,"unsupervised anomaly detection architecture incorporating disentangled learning, information theory and conditional variational modeling. ","Recently, generative models have shown promising performance in anomaly detection tasks. Specifically, autoencoders learn representations of high-dimensional data, and their reconstruction ability can be used to assess whether a new instance is likely to be anomalous. 
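A toy illustration of the algorithmic-prompting idea in the abstract above: the addition algorithm is demonstrated with explicit intermediate state (digit sums and carries) rather than bare input-output pairs. The exact prompt wording is invented for illustration:

```python
# One in-context example spelling out the addition algorithm step by step;
# the model is then asked to continue the pattern on a new problem.
ALGORITHMIC_PROMPT = """Problem: 128 + 367.
Explanation: add digit by digit from right to left, tracking the carry.
8 + 7 = 15, write 5, carry 1.
2 + 6 + 1 (carry) = 9, write 9, carry 0.
1 + 3 + 0 (carry) = 4, write 4, carry 0.
Answer: 495.

Problem: 254 + 189.
Explanation:"""
```

The contrast with chain-of-thought is the level of specificity: every rule of the algorithm is stated and exercised, leaving no step for the model to infer.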
However, the primary challenge of unsupervised anomaly detection (UAD) is in learning appropriate disentangled features and avoiding information loss, while incorporating known sources of variation to improve the reconstruction. In this paper, we propose a novel generative autoencoder architecture by combining the frameworks of $\beta$-VAE, conditional variational autoencoder (CVAE), and the principle of total correlation (TC). We show that our architecture improves the disentanglement of latent features, optimizes TC loss more efficiently, and improves the ability to detect anomalies in an unsupervised manner with respect to high-dimensional instances, such as in imaging datasets. Through both qualitative and quantitative experiments on several benchmark datasets, we demonstrate that our proposed method excels in terms of both anomaly detection and capturing disentangled features. Our analysis underlines the importance of learning disentangled features for UAD tasks.","unsupervised anomaly detection, autoencoder, disentanglement learning, representation learning, information theory" Diffusion-based Image Translation using disentangled style and content representation,https://openreview.net/forum?id=Nayau9fwXU,https://openreview.net/pdf?id=Nayau9fwXU,We propose a new method which enables image translation using Denoising Diffusion Probabilistic Model.,"Modern diffusion-based image translation guided by semantic texts or a single target image has enabled flexible style transfer which is not limited to specific domains. Unfortunately, due to the stochastic nature of diffusion models, it is often difficult to maintain the original content of the image during the reverse diffusion. To address this, here we present a novel diffusion-based image translation method using disentangled style and content representation. Specifically, inspired by the observation that a slicing Vision Transformer can convert the semantic appearance of a given image into the target domain while maintaining the structure of the input image, in our method we extract intermediate keys of the multihead self-attention layer from a ViT model and use them for the content preservation loss. More specifically, to preserve the structure information we use the contrastive loss between intermediate keys of the input image and the estimated denoised output during the reverse diffusion sampling. Then, an image guided style transfer is performed by matching the [CLS] token between the denoised diffusion output and target domain, whereas additional CLIP loss is used for the text-driven style transfer.","DDPM, CLIP, Image Translation, ViT" An Analytic Framework for Robust Training of Differentiable Hypothesis,https://openreview.net/forum?id=ttnf-Wibn2R,https://openreview.net/pdf?id=ttnf-Wibn2R,,"The reliability of a learning model is key to the successful deployment of machine learning in various industries. Creating a robust model, particularly one unaffected by adversarial attacks, requires a comprehensive understanding of the adversarial examples phenomenon. However, it is difficult to describe the phenomenon due to the complicated nature of the problems in machine learning. Consequently, many studies investigate the phenomenon by proposing a simplified model of how adversarial examples occur and validate it by predicting some aspect of the phenomenon. While these studies cover many different characteristics of the adversarial examples, they have not reached a holistic approach to the geometric and analytic modeling of the phenomenon. 
We observe the phenomenon in many applications of machine learning, and its effects seem to be independent of the choice of the hypothesis class. In this paper, we propose a formalization of robustness in learning theoretic terms and give a geometrical description of the phenomenon in analytic classifiers. We then utilize the proposal to devise a robust classification learning rule for differentiable hypothesis classes and showcase our framework on synthetic and real-world data.", Federated Learning with Heterogeneous Label Noise: A Dual Structure Approach,https://openreview.net/forum?id=e4qmg9HQJPr,https://openreview.net/pdf?id=e4qmg9HQJPr,,"The performance of federated learning relies heavily on the label quality of each distributed client. In this paper, we consider a federated learning setting with heterogeneous label noise, where each local client might observe training labels with heterogeneous noise rates, which may even be drawn from different subsets of the label space. The above high heterogeneity poses challenges for applying the existing label noise learning approaches to each client locally. We formalize the study of federated learning from heterogeneous label noise by first identifying two promising label noise generation models. Then, we propose a dual structure approach named FedDual. Intuitively, if there exists a model that filters out the wrongly labeled instances from the local dataset, the effect of label noise can be mitigated. Considering the heterogeneity of local datasets, in addition to the globally shared model, each client in FedDual maintains a local and personalized denoising model. The personalized denoising models can combine information from the global model or other pre-trained models to ensure the performance of denoising. Under this framework, we instantiate our approach with several local sample cleaning methods. We present substantial experiments on MNIST, CIFAR10, and CIFAR100 to demonstrate that FedDual can effectively recognize heterogeneous label noise in different clients and improve the performance of the aggregated model.","Federated learning, Heterogeneous label noise" Correspondences between word learning in children and captioning models ,https://openreview.net/forum?id=R6zTkW_w_PV,https://openreview.net/pdf?id=R6zTkW_w_PV,We show that image captioning systems' performance correlates with the age at which children acquire words from a variety of word categories.,"For human children as well as machine learning systems, a key challenge in learning a word is linking the word to the visual phenomena it describes. By organizing model output into word categories used to analyze child language learning data, we show a correspondence between word learning in children and the performance of image captioning models. Although captioning models are trained only on standard machine learning data, we find that their performance in producing words from a variety of word categories correlates with the age at which children acquire words from each of those categories. 
To explain why this correspondence exists, we show that the performance of captioning models is correlated with human judgments of the concreteness of words, suggesting that these models are capturing the complex real-world association between words and visual phenomena.","cognitive science, child development, language, image captioning, computer vision" Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness,https://openreview.net/forum?id=G_D6xThdQe4,https://openreview.net/pdf?id=G_D6xThdQe4,"We investigate the robustness of MoE experts, and apply ultra low-bit quantization to them for achieving more efficient MoE model inference.","Large Mixture of Experts (MoE) models can achieve state-of-the-art quality on various language tasks, including the machine translation task, thanks to the efficient model scaling capability with expert parallelism. However, this has brought the fundamental issue of larger memory consumption at deployment time. Furthermore, this results in significant inference speed degradation at auto-regressive decoding steps due to the increased memory transfers. In this paper, we propose a simple weight-only quantization method using ultra-low bitwidths such as 2, 3, and 4 bits to effectively mitigate the increased memory and latency issues of MoE models. We show that low-bit quantization together with the MoE architecture delivers reliable model performance while reducing the memory size significantly, even without any additional training. In particular, expert layers in MoE models are much more robust to quantization than conventional feedforward network (FFN) layers. In our comprehensive analysis, we show that MoE models with 2-bit and 80\% sparse expert weights can deliver better model performance than the dense model trained on the same dataset. We present how quantization of different parts of the model affects performance, with various experiments using a large MoE model (5.3B). As a result of low-bit quantization, we show that the model size can be reduced to 4.9X smaller than the original half precision floating point (fp16) MoE model. This cuts the size of the 5.3B-parameter model from 8.4x of the dense model down to only 1.7x after 2-bit quantization, while still preserving 1.88\% higher accuracy than the dense model. Combined with an optimized GPU runtime implementation, it also achieves a 2.7X speed-up, which is even slightly faster than the FLOPs-equivalent dense model.","MoE, Quantization, Mixture of Experts, Sparse Model, Machine Translation" Implicit Regularization for Group Sparsity,https://openreview.net/forum?id=d7Q0vVfJ0wO,https://openreview.net/pdf?id=d7Q0vVfJ0wO,We study the implicit regularization of gradient descent towards structured sparsity via a novel neural reparameterization.,"We study the implicit regularization of gradient descent towards structured sparsity via a novel neural reparameterization, which we call a diagonally grouped linear neural network. We show the following intriguing property of our reparameterization: gradient descent over the squared regression loss, without any explicit regularization, biases towards solutions with a group sparsity structure. In contrast to many existing works in understanding implicit regularization, we prove that our training trajectory cannot be simulated by mirror descent. We analyze the gradient dynamics of the corresponding regression problem in the general noise setting and obtain minimax-optimal error rates. 
Compared to existing bounds for implicit sparse regularization using diagonal linear networks, our analysis with the new reparameterization shows improved sample complexity. In the degenerate case of size-one groups, our approach gives rise to a new algorithm for sparse linear regression. Finally, we demonstrate the efficacy of our approach with several numerical experiments.","gradient descent, implicit regularization, structured/group sparsity, linear neural network" Why do Models with Conditional Computation Learn Suboptimal Solutions?,https://openreview.net/forum?id=4O4eoAVEdIs,https://openreview.net/pdf?id=4O4eoAVEdIs,,"Sparsely-activated neural networks with conditional computation learn to route their inputs through different subnetworks, providing a strong structural prior and reducing computational costs. Despite their possible benefits, models with learned routing often underperform their parameter-matched densely-activated counterparts as well as models that use non-learned heuristic routing strategies. In this paper, we hypothesize that these shortcomings stem from the gradient estimation techniques used to train sparsely-activated models with non-differentiable discrete routing decisions. To test this hypothesis, we evaluate the performance of sparsely-activated models trained with various gradient estimation techniques in three settings where a high-quality heuristic routing strategy can be designed. Our experiments reveal that learned routing reaches substantially different (and worse) solutions than heuristic routing in various settings. As a first step towards remedying this gap, we demonstrate that supervising the routing decision on a small fraction of the examples is sufficient to help the model to learn better routing strategies. Our results shed light on the difficulties of learning effective routing and set the stage for future work on conditional computation mechanisms and training techniques.","neural networks, conditional computation, gradient estimation" Stabilized training of joint energy-based models and its practical applications,https://openreview.net/forum?id=hayd_QIsu1,https://openreview.net/pdf?id=hayd_QIsu1,JEM with stabilized training using SGLD samples; it enables us to apply it to speech,"The recently proposed Joint Energy-based Model (JEM) interprets discriminatively trained classifier p(y|x) as an energy model, which is also trained as a generative model describing the distribution of the input observations p(x). The JEM training relies on ""positive examples"" (i.e. examples from the training data set) as well as on ""negative examples"", which are samples from the modeled distribution p(x) generated by means of Stochastic Gradient Langevin Dynamics (SGLD). Unfortunately, SGLD often fails to deliver negative samples of sufficient quality during the standard JEM training, which causes a very unbalanced contribution from the positive and negative examples when calculating gradients for JEM updates. As a consequence, the standard JEM training is quite unstable requiring careful tuning of hyper-parameters and frequent restarts when the training starts diverging. This makes it difficult to apply JEM to different neural network architectures, modalities, and tasks. In this work, we propose a training procedure that stabilizes SGLD-based JEM training (ST-JEM) by balancing the contribution from the positive and negative examples. 
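For readers unfamiliar with the SGLD step that this stabilization targets, below is a minimal PyTorch sketch of drawing negative examples from the energy E(x) = -logsumexp_y f(x)[y] induced by a classifier; the step size, noise scale, and iteration count are illustrative choices, not the tuned values from the paper.

```python
# Minimal sketch of SGLD negative sampling for a JEM-style energy model.
import torch

def sgld_negative_samples(classifier, x_init, n_steps=20,
                          step_size=1.0, noise_std=0.01):
    x = x_init.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        # Energy E(x) = -logsumexp_y f(x)[y]; low energy = high p(x).
        energy = -torch.logsumexp(classifier(x), dim=1).sum()
        grad, = torch.autograd.grad(energy, x)
        with torch.no_grad():
            x -= (step_size / 2) * grad           # descend the energy
            x += noise_std * torch.randn_like(x)  # Langevin noise
    return x.detach()

classifier = torch.nn.Linear(10, 3)   # stand-in for a trained classifier
negatives = sgld_negative_samples(classifier, torch.randn(8, 10))
```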
We also propose adding a ""regularization"" term to the training objective -- the mutual information (MI) between the input observations x and output labels y -- which encourages the JEM classifier to make more certain decisions about output labels. We demonstrate the effectiveness of our approach on the CIFAR10 and CIFAR100 tasks. We also consider the task of classifying phonemes in a speech signal, for which we were not able to train JEM without the proposed stabilization. We show that convincing speech can be generated from the trained model. Alternatively, corrupted speech can be de-noised by bringing it closer to the modeled speech distribution using a few SGLD iterations. We also propose and discuss additional applications of the trained model.", HesScale: Scalable Computation of Hessian Diagonals,https://openreview.net/forum?id=mKILD5MLR2C,https://openreview.net/pdf?id=mKILD5MLR2C,,"Second-order optimization uses curvature information about the objective function, which can help in faster convergence. However, such methods typically require expensive computation of the Hessian matrix, preventing their usage in a scalable way. The absence of efficient ways of computation drove the most widely used methods to focus on first-order approximations that do not capture the curvature information. In this paper, we develop \textit{HesScale}, a scalable approach to approximating the diagonal of the Hessian matrix, to incorporate second-order information in a computationally efficient manner. We show that HesScale has the same computational complexity as backpropagation. Our results on supervised classification show that HesScale achieves high approximation accuracy, allowing for scalable and efficient second-order optimization.", Adaptive Anchor for Robust Keypoint Localization,https://openreview.net/forum?id=vdhco_34qV8,https://openreview.net/pdf?id=vdhco_34qV8,,"Existing keypoint localization methods mostly select pre-defined points like the image center as anchors, then infer keypoint locations referring to anchors. Pre-defined anchors are sensitive to occlusions and crowded scenes, leading to degraded robustness. This paper proposes to detect Adaptive Anchor (AdaAnchor) for keypoint localization. Instead of relying on pre-defined rules, AdaAnchor is adaptively selected by maximizing both the keypoint localization confidence and accuracy. This strategy leads to more robust keypoint localization even with the existence of occlusions and truncations. AdaAnchor can be flexibly integrated into different methods by replacing their anchor point selection strategies. Experiments show that it surpasses previous anchor selection methods on both single and multiple keypoint localization tasks. For instance, replacing the heatmap-anchor with AdaAnchor reduces the localization error of invisible keypoints by 6%, while improving the confidence by 41.7+% on COCO in single keypoint localization. 
This advantage is sustained on the multiple keypoint localization task, e.g., AdaAnchor outperforms heatmap-anchor by 4.8% AP on bottom-up multi-person pose estimation.","keypoint localization, human pose estimation" Divide-and-Cluster: Spatial Decomposition Based Hierarchical Clustering,https://openreview.net/forum?id=TDUMUFa5zz,https://openreview.net/pdf?id=TDUMUFa5zz,"This paper clusters n points located in a D-dimensional space by detecting their mutual clustering affinity within local neighborhoods, using more efficient local computations, and then hierarchically growing the local clusters outward.","This paper is about increasing the computational efficiency of clustering algorithms. Many clustering algorithms are based on properties of relative locations of points, globally or locally, e.g., interpoint distances and nearest neighbor distances. This amounts to using a lower dimensional space than the full dimensionality $D$ of the space in which the points are embedded. We present a clustering algorithm, Divide-and-Cluster (DAC), which detects local clusters in small neighborhoods obtained by recursive tessellation of space, and then merges them hierarchically, following the Divide-and-Conquer paradigm. This significantly reduces computation time, which may otherwise grow nonlinearly with the number $n$ of points. We define locality as hypercubical neighborhoods in a recursive hypercubical decomposition of space, represented by a tree. Clusters are detected within each hypercube, and merged with those from neighboring hypercubes while traversing up the tree. We expect DAC to perform better than many other algorithms because (a) as clusters merge into larger clusters (components), their number steadily decreases relative to the number of points, and (b) we cluster only neighboring components. The ordering of component appearances also simultaneously yields a cluster hierarchy (tree). Further, our use of small neighborhoods allows piecewise uniform approximation of large, nonuniform, arbitrary shaped clusters, thus avoiding the need for global cluster models. We experimentally verify the correctness of detected clusters on a variety of datasets, posing a variety of challenges, as well as show that DAC’s runtime is significantly better than representative algorithms of other types, particularly for increasing values of $n$. ","Unsupervised Learning, High-dimensional features, World Centered Clustering, Points Centered Clustering, Hierarchical Clustering, Complexity, Minimal Spanning Tree" Implicit regularization in Heavy-ball momentum accelerated stochastic gradient descent,https://openreview.net/forum?id=ZzdBhtEH9yB,https://openreview.net/pdf?id=ZzdBhtEH9yB,,"It is well known that the finite step-size ($h$) in Gradient descent (GD) implicitly regularizes solutions to flatter minima. A natural question to ask is \textit{Does the momentum parameter $\beta$ (say) play a role in implicit regularization in Heavy-ball (H.B) momentum accelerated gradient descent (GD+M)?}. To answer this question, first, we show that the trajectory traced by the discrete H.B momentum update (GD+M) is $O(h^2)$ close to a continuous trajectory induced by a modified loss, which consists of an original loss and an implicit regularizer. This implicit regularizer for (GD+M) is indeed stronger than that of (GD) by a factor of $(\frac{1+\beta}{1-\beta})$, thus explaining why (GD+M) shows better generalization performance and higher test accuracy than (GD). 
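The update analyzed here is short enough to state in code; the quadratic example below is a toy illustration of the heavy-ball recursion, with the implicit-regularization factor noted in a comment.

```python
# Toy sketch (NumPy) of the heavy-ball (GD+M) update:
#   x_{k+1} = x_k - h * grad_f(x_k) + beta * (x_k - x_{k-1}).
# Per the analysis above, its implicit regularizer is stronger than plain
# GD's by a factor of (1 + beta) / (1 - beta).
import numpy as np

def heavy_ball(grad_f, x0, h=0.1, beta=0.9, n_steps=100):
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(n_steps):
        x_next = x - h * grad_f(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Example: minimize f(x) = 0.5 * ||x||^2, whose gradient is x.
x_final = heavy_ball(lambda x: x, np.ones(5))
```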
Furthermore, we extend our analysis to the stochastic version of gradient descent with momentum (SGD+M) and propose a deterministic continuous trajectory that is $O(h^2)$ close to the discrete update of (SGD+M) in a strong approximation sense. We explore the implicit regularization in (SGD+M) and (GD+M) through a series of experiments validating our theory. ", ORCA: Interpreting Prompted Language Models via Locating Supporting Evidence in the Ocean of Pretraining Data,https://openreview.net/forum?id=F0UQv_MNWCt,https://openreview.net/pdf?id=F0UQv_MNWCt,We find supporting data evidence from pretraining data to interpret prompted language models.,"Prompting large pretrained language models leads to strong performance in a variety of downstream tasks. However, it is still unclear from where the model learns task-specific knowledge, especially in zero-shot setups. In this work, we propose a novel method ORCA to identify evidence of the model's task-specific competence in prompt-based learning. Through an instance attribution approach to model interpretability, by iteratively using gradient information related to the downstream task, ORCA locates a very small subset of pretraining data that directly supports the model's predictions in a given task; we call this subset supporting data evidence. We show that supporting data evidence offers new insights about prompted language models. For example, in the tasks of sentiment analysis and textual entailment, BERT shows a substantial reliance on BookCorpus---the smaller corpus of BERT's two pretraining corpora---as well as on pretraining examples that mask out synonyms to the task labels used in prompts.","interpretability, prompting language models, pretraining data as evidence" Getting away with more network pruning: From sparsity to geometry and linear regions,https://openreview.net/forum?id=Itn7dH7muI,https://openreview.net/pdf?id=Itn7dH7muI,"If we prune with the maximum number of linear regions in mind, we can improve accuracy considerably","One surprising trait of neural networks is the extent to which their connections can be pruned with little to no effect on accuracy. But when we cross a critical level of parameter sparsity, pruning any further leads to a sudden drop in accuracy. What could explain such a drop? In this work, we explore how sparsity may affect the geometry of the linear regions defined by a neural network and consequently reduce its expected maximum number of linear regions. We observe that sparsity affects accuracy in pruned neural networks similarly to how it affects the number of linear regions as well as - and more so - our proposed upper bound on that number. Conversely, we find that selecting the sparsity on each layer to maximize the bound very often improves accuracy in comparison to using the same sparsity across all layers, thereby providing us guidance on where to prune. ", Real-time variational method for learning neural trajectory and its dynamics,https://openreview.net/forum?id=M_MvkWgQSt,https://openreview.net/pdf?id=M_MvkWgQSt,A real-time variational Bayesian method aimed at uncovering latent neural trajectories and their dynamical systems.,"Latent variable models have become instrumental in computational neuroscience for reasoning about neural computation. This has fostered the development of powerful offline algorithms for extracting latent neural trajectories from neural recordings. 
However, despite the potential of real-time alternatives to give immediate feedback to experimentalists and enhance experimental design, they have received markedly less attention. In this work, we introduce the exponential family variational Kalman filter (eVKF), an online recursive Bayesian method aimed at inferring latent trajectories while simultaneously learning the dynamical system generating them. eVKF works for arbitrary likelihoods and utilizes the constant base measure exponential family to model the latent state stochasticity. We derive a closed-form variational analog to the predict step of the Kalman filter, which leads to a provably tighter bound on the ELBO compared to another online variational method. We validate our method on synthetic and real-world data, and, notably, show that it achieves competitive performance.","neural dynamics, neural trajectory, online variational inference" Supervised Metric Learning for Retrieval via Contextual Similarity Optimization,https://openreview.net/forum?id=N5gn1KjCWW,https://openreview.net/pdf?id=N5gn1KjCWW,,"Existing deep metric learning approaches fall into three general categories: contrastive learning, average precision (AP) maximization, and classification. We propose a novel alternative approach, contextual similarity optimization, inspired by work in unsupervised metric learning. Contextual similarity is a discrete similarity measure based on relationships between neighborhood sets, and is widely used in the unsupervised setting as pseudo-supervision. Inspired by this success, we propose a framework which optimizes a combination of contextual and cosine similarities. Contextual similarity calculation involves several non-differentiable operations, including the Heaviside function and intersection of sets. We show how to circumvent non-differentiability to explicitly optimize contextual similarity, and we further incorporate appropriate similarity regularization to yield our novel metric learning loss. The resulting loss function achieves state-of-the-art Recall @ 1 accuracy on standard supervised image retrieval benchmarks when combined with the standard contrastive loss.","Image Retrieval, Metric Learning, Contextual Similarity" Large Language Models are Human-Level Prompt Engineers,https://openreview.net/forum?id=92gvk82DE-,https://openreview.net/pdf?id=92gvk82DE-,We propose an algorithm for automatic instruction generation and selection for large language models with human level performance.,"By conditioning on natural language instructions, large language models (LLMs) have displayed impressive capabilities as general-purpose computers. However, task performance depends significantly on the quality of the prompt used to steer the model, and most effective prompts have been handcrafted by humans. Inspired by classical program synthesis and the human approach to prompt engineering, we propose Automatic Prompt Engineer (APE) for automatic instruction generation and selection. In our method, we treat the instruction as the ""program,"" optimized by searching over a pool of instruction candidates proposed by an LLM in order to maximize a chosen score function. To evaluate the quality of the selected instruction, we evaluate the zero-shot performance of another LLM following the selected instruction. 
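The APE loop described above can be summarized in a few lines; in this sketch, llm_propose_instructions and llm_zero_shot_accuracy are hypothetical stand-ins for real LLM API calls, and only the propose-score-select structure is taken from the abstract.

```python
# Sketch of an APE-style propose/score/select search over instructions.
def automatic_prompt_engineer(demos, eval_set, n_candidates=50):
    # 1) An LLM proposes candidate instructions from input/output demos
    #    (llm_propose_instructions is a hypothetical helper).
    candidates = llm_propose_instructions(demos, n=n_candidates)
    # 2) Score each candidate by the zero-shot performance of an LLM that
    #    follows it (llm_zero_shot_accuracy is likewise hypothetical).
    scored = [(llm_zero_shot_accuracy(instr, eval_set), instr)
              for instr in candidates]
    # 3) Keep the highest-scoring instruction as the selected "program".
    return max(scored)[1]
```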
Experiments on 24 NLP tasks show that our automatically generated instructions outperform the prior LLM baseline by a large margin and achieve performance better than or comparable to that of instructions generated by human annotators on 21/24 tasks. We conduct extensive qualitative and quantitative analyses to explore the performance of APE. We show that APE-engineered prompts can be applied to steer models toward truthfulness and/or informativeness, as well as to improve few-shot learning performance by simply prepending them to standard in-context learning prompts.","few-shot learning, automated reasoning, large language models" Do Not Blindly Imitate the Teacher: Loss Perturbation for Knowledge Distillation,https://openreview.net/forum?id=FILleBqk31S,https://openreview.net/pdf?id=FILleBqk31S,We propose a perturbed loss function for the knowledge distillation task which outperforms the underlying KL loss and other perturbation methods.,"Knowledge distillation (KD) is a popular model compression technique to transfer knowledge from large teacher models to a small student model. Typically, the student learns to imitate the teacher by minimizing the KL divergence of its output distribution with the teacher's output distribution. We argue that such a learning objective is sub-optimal because there exists a discrepancy between the teacher's output distribution and the ground truth label distribution, and forcing the student to blindly imitate the unreliable teacher output distribution leads to inferior performance. To this end, we propose a novel knowledge distillation objective PTLoss by first representing the vanilla KL-based distillation loss function via a Maclaurin series and then perturbing the leading-order terms in this series. This perturbed loss improves the student generalizability by effectively distilling knowledge from a shifted distribution closer to the ground truth data. We also propose a method to compute this shifted teacher distribution, named Proxy Teacher, which enables us to select the perturbation coefficients in PTLoss. We theoretically show the perturbed loss reduces the deviation from the true population risk compared to the vanilla KL-based distillation loss functions. Experiments on three tasks with teachers of different scales show that our method significantly outperforms vanilla distillation loss functions and other perturbation methods.","distillation, loss function, natural language processing" Fast Yet Effective Graph Unlearning through Influence Analysis,https://openreview.net/forum?id=er_nz4Q9Km7,https://openreview.net/pdf?id=er_nz4Q9Km7,,"Recent evolving data privacy policies and regulations have led to increasing interest in the machine unlearning problem. In this paper, we consider Graph Neural Networks (GNNs) as the target model, and study the problem of edge unlearning in GNNs, i.e., learning a new GNN model as if a specified set of edges never existed in the original training graph. Despite its practical importance, the problem remains elusive due to the non-convex nature of GNNs. Our main technical contribution is three-fold: 1) we cast the problem of edge unlearning as estimating the influence functions of the edges to be removed; 2) we design a computationally and memory efficient algorithm named EraEdge for edge influence estimation and unlearning; 3) under standard regularity conditions, we prove that the sequence of iterates produced by our algorithm converges to the desired model. 
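Returning to the distillation objective discussed above: for reference, this is a minimal PyTorch sketch of the vanilla temperature-scaled KL loss that PTLoss starts from; the Maclaurin-series perturbation that defines PTLoss itself is not reproduced here.

```python
# Vanilla KL-based knowledge distillation loss (the starting point PTLoss perturbs).
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, T=2.0):
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # KL(teacher || student), scaled by T^2 as is customary.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```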
A comprehensive set of experiments on three prominent GNN models and four benchmark graph datasets demonstrates that our algorithm achieves significant speed-up gains over retraining from scratch without sacrificing much model accuracy. Furthermore, our algorithm outperforms the existing GNN unlearning approach in terms of both training time and accuracy of the target GNN model.", Faster Hyperparameter Search for GNNs via Calibrated Dataset Condensation,https://openreview.net/forum?id=ohQPU2G3r3C,https://openreview.net/pdf?id=ohQPU2G3r3C,"We propose a novel hyperparameter-calibrated dataset condensation (HCDC) algorithm, which can be applied to speed up hyperparameter optimization on graphs.","Dataset condensation aims to reduce the computational cost of training multiple models on a large dataset by condensing the training dataset into a small synthetic set. State-of-the-art approaches rely on matching the model gradients for the real and synthetic data and have recently been applied to condense large-scale graphs for node classification tasks. Although dataset condensation may be efficient when training multiple models for hyperparameter optimization, there is no theoretical guarantee on the generalizability of the condensed data: data condensation often generalizes poorly across hyperparameters/architectures in practice, and we find and prove that this overfitting is much more severe on graphs. In this paper, we consider a different condensation objective specifically geared towards hyperparameter search. We aim to generate the synthetic dataset so that the validation-performance rankings of the models, with different hyperparameters, on the condensed and original datasets are comparable. We propose a novel hyperparameter-calibrated dataset condensation (HCDC) algorithm, which obtains the synthetic validation data by matching the hyperparameter gradients computed via implicit differentiation and efficient inverse Hessian approximation. HCDC employs a supernet with differentiable hyperparameters, making it suitable for modeling GNNs with widely different convolution filters. Experiments demonstrate that the proposed framework effectively maintains the validation-performance rankings of GNNs and speeds up hyperparameter/architecture search on graphs.","Graph Condensation, Dataset Condensation, Hyperparameter Optimization, Graph Neural Networks, Graph Compression" FedTiny: Pruned Federated Learning Towards Specialized Tiny Models,https://openreview.net/forum?id=2WmBMrCZSx,https://openreview.net/pdf?id=2WmBMrCZSx,,"Neural network pruning has been a well-established compression technique to enable deep learning models on resource-constrained devices. The pruned model is usually specialized to meet specific hardware platforms and training tasks (defined as deployment scenarios). However, existing pruning approaches rely heavily on training data to trade off model size, efficiency, and accuracy, which becomes ineffective for federated learning (FL) over distributed and confidential datasets. Moreover, the memory- and compute-intensive pruning process of most existing approaches cannot be handled by most FL devices with resource limitations. In this paper, we develop FedTiny, a novel distributed pruning framework for FL, to obtain specialized tiny models for memory- and computing-constrained participating devices with confidential local data. 
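As an aside on the HCDC objective stated above, preserving validation-performance rankings is easy to quantify; the sketch below uses Spearman rank correlation on made-up placeholder numbers.

```python
# Sketch: measuring whether condensed data preserves hyperparameter rankings.
from scipy.stats import spearmanr

val_acc_original  = [0.71, 0.83, 0.78, 0.65, 0.80]  # per config, full data (placeholders)
val_acc_condensed = [0.55, 0.70, 0.66, 0.48, 0.69]  # same configs, condensed data
rho, _ = spearmanr(val_acc_original, val_acc_condensed)
print(rho)  # 1.0 here: the two rankings agree perfectly
```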
To alleviate biased pruning due to unseen heterogeneous data over devices, FedTiny introduces an adaptive batch normalization (BN) selection module to adaptively obtain an initially pruned model to fit deployment scenarios. In addition, to further improve the initial pruning, FedTiny develops a lightweight progressive pruning module for local finer pruning under tight memory and computational budgets, where the pruning policy for each layer is gradually determined rather than evaluating the overall deep model structure. Extensive experimental results demonstrate the effectiveness of FedTiny, which outperforms state-of-the-art baseline approaches, especially when compressing deep models to extremely sparse tiny models.", Spatiotemporal Modeling of Multivariate Signals with Graph Neural Networks and Structured State Space Models,https://openreview.net/forum?id=zV3Q0a8--A,https://openreview.net/pdf?id=zV3Q0a8--A,Graph neural networks for spatiotemporal modeling of multivariate signals,"Multivariate signals are prevalent in various domains, such as healthcare, transportation systems, and space sciences. Modeling spatiotemporal dependencies in multivariate signals is challenging due to (1) long-range temporal dependencies and (2) complex spatial correlations between sensors. To address these challenges, we propose representing multivariate signals as graphs and introduce GraphS4mer, a general graph neural network (GNN) architecture that captures both spatial and temporal dependencies in multivariate signals. Specifically, (1) we leverage the Structured State Spaces model (S4), a state-of-the-art sequence model, to capture long-term temporal dependencies and (2) we propose a graph structure learning layer in GraphS4mer to automatically learn the underlying graph structures in the data. We evaluate our proposed model on three distinct tasks and show that GraphS4mer consistently improves over existing models, including (1) seizure detection from electroencephalography signals, outperforming a previous GNN with self-supervised pretraining by 3.1 points in AUROC; (2) sleep staging from polysomnography signals, a 4.1-point improvement in macro-F1 score compared to existing sleep staging models; and (3) traffic forecasting, reducing MAE by 8.8% compared to existing GNNs and by 1.4% compared to transformer-based models.","Multivariate Signals, Graph Neural Network, Graph Structure Learning, Structured State Spaces, Time Series" TI-VAE: A temporally independent VAE with applications to latent factor learning in neuroimaging,https://openreview.net/forum?id=6lUU0QaTOC,https://openreview.net/pdf?id=6lUU0QaTOC, Our approach extends temporal ICA to the non-linear case and generalizes weight sharing to non-Euclidean neuroimaging data. ,"Functional magnetic resonance imaging (fMRI) data contain complex spatiotemporal dynamics, thus researchers have developed approaches that reduce the dimensionality of the signal while extracting relevant and interpretable dynamics. Recently, the feasibility of latent factor analysis, which can identify the lower-dimensional trajectory of neuronal population activations, has been demonstrated on both spiking and calcium imaging data. In this work, we propose a new framework inspired by latent factor analysis and apply it to functional MRI data from the human somatomotor cortex. Models of fMRI data that can perform whole-brain discovery of dynamical latent factors are understudied. 
The benefits of approaches such as linear independent component analysis models have been widely appreciated; however, nonlinear extensions are rare and present challenges in terms of identification. Deep learning methods are potentially well-suited, but without adequate inductive biases, such as spatial weight sharing, they may heavily overparameterize the model for the dataset size. Due to the underspecification of neuroimaging approaches, this increases the chances of overfitting and picking up on spurious correlations. Our approach extends temporal ICA to the non-linear case and generalizes weight sharing to non-Euclidean neuroimaging data. We evaluate our model on data with multiple motor sub-tasks to assess whether the model captures disentangled latent factors corresponding to each sub-task. Then, to further evaluate the latent factors we find, we compare the spatial location of each latent factor to the known motor homunculus. Finally, we show that our latent factors correlate better with the task than the current gold standard of source signal separation for neuroimaging data, independent component analysis (ICA). ","variational autoencoder, computational neuroscience, latent factor analysis, latent factor learning, fMRI, sequential variational autoencoder, somatomotor cortex, weight sharing, inductive bias" Pruning Deep Neural Networks from a Sparsity Perspective,https://openreview.net/forum?id=i-DleYh34BM,https://openreview.net/pdf?id=i-DleYh34BM,This work develops PQ Index (PQI) as a new measure of sparsity and proposes a Sparsity-informed Adaptive Pruning (SAP) algorithm. ,"In recent years, deep network pruning has attracted significant attention in order to enable the rapid deployment of AI into small devices with computation and memory constraints. Pruning is often achieved by dropping redundant weights, neurons, or layers of a deep network while attempting to retain a comparable test performance. Many deep pruning algorithms have been proposed with impressive empirical success. However, existing approaches lack a quantifiable measure to estimate the compressibility of a sub-network during each pruning iteration and thus may under-prune or over-prune the model. In this work, we propose PQ Index (PQI) to measure the potential compressibility of deep neural networks and use this to develop a Sparsity-informed Adaptive Pruning (SAP) algorithm. Our extensive experiments corroborate the hypothesis that for a generic pruning procedure, the sparsity decreases first when a large model is being effectively regularized and then increases when its compressibility reaches a limit that appears to correspond to the beginning of underfitting. Subsequently, PQI decreases again when model collapse and significant deterioration in the model's performance start to occur. 
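To make the idea of tracking a sparsity measure across pruning iterations concrete, here is an illustrative NumPy sketch pairing a classical normalized sparsity measure (Hoyer's, based on the l1/l2 ratio) with one step of global magnitude pruning; this is not the paper's PQ Index or SAP algorithm.

```python
# Sketch: a normalized sparsity measure plus one magnitude-pruning step.
import numpy as np

def hoyer_sparsity(w):
    # 0 for a dense vector with equal magnitudes, 1 for a 1-hot vector.
    d = w.size
    return (np.sqrt(d) - np.abs(w).sum() / np.linalg.norm(w)) / (np.sqrt(d) - 1)

def magnitude_prune(w, ratio):
    threshold = np.quantile(np.abs(w), ratio)
    return np.where(np.abs(w) >= threshold, w, 0.0)

w = np.random.randn(1000)
print(hoyer_sparsity(w))                  # roughly 0.2 for Gaussian weights
w_pruned = magnitude_prune(w, ratio=0.9)  # keep the top-10% magnitudes
print(hoyer_sparsity(w_pruned))           # higher after pruning
```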
Additionally, our experiments demonstrate that the proposed adaptive pruning algorithm is superior to state-of-the-art algorithms, such as lottery ticket-based pruning methods, in terms of both compression efficiency and robustness.","Adaptive Pruning, Model Collapse, Sparsity, Model Compression, Deep Learning" High-dimensional Continuum Armed and High-dimensional Contextual Bandit: with Applications to Assortment and Pricing,https://openreview.net/forum?id=nareqzplSc9,https://openreview.net/pdf?id=nareqzplSc9,"We propose a new model and an efficient theoretically guaranteed algorithm for high-dimensional continuum armed and high-dimensional contextual bandit, with applications to the joint assortment and pricing problem.","The bandit problem with high-dimensional continuum arms and high-dimensional contextual covariates is often faced by decision-makers but remains unsolved. Existing bandit algorithms are impractical due to the complexity of the double-layer high dimensionality. We formulate this problem as a high-dimensional continuum armed contextual bandit with high-dimensional covariates and propose a novel model that captures the effect of the arm and the context on the reward with a low-rank representation matrix. The representation matrix is endowed with interpretability and predictive power. We further propose an efficient bandit algorithm based on a low-rank matrix estimator with theoretical justifications. The generality of our model allows wide applications including business and healthcare. In particular, we apply our method to assortment and pricing, both of which are important decisions for firms such as online retailers. Our method can solve the assortment and pricing problems simultaneously, while most existing methods address them separately. We demonstrate the effectiveness of our method to jointly optimize assortment and pricing for revenue maximization for a giant online retailer.","bandit, high-dimensional statistics, assortment, pricing, reinforcement learning" What learning algorithm is in-context learning? Investigations with linear models,https://openreview.net/forum?id=0g0X4H8yN4I,https://openreview.net/pdf?id=0g0X4H8yN4I,"We prove that transformers can implement learning algorithms for linear models based on, e.g., gradient descent, then observe that they closely match the predictors of known algorithms, transitioning between different predictors as transformer depth varies.","Neural sequence models, especially transformers, exhibit a remarkable capacity for in-context learning. They can construct new predictors from sequences of labeled examples $(x, f(x))$ presented in the input without further parameter updates. We investigate the hypothesis that transformer-based in-context learners implement standard learning algorithms implicitly, by encoding context-specific parametric models in their hidden representations, and updating these implicit models as new examples appear in the context. Using linear regression as a model problem, we offer three sources of evidence for this hypothesis. First, we prove by construction that transformers can implement learning algorithms for linear models based on gradient descent and closed-form computation of regression parameters. Second, we show that trained in-context learners closely match the predictors computed by gradient descent, ridge regression, and exact least-squares regression, transitioning between different predictors as transformer depth and dataset noise vary. 
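As a reference point for the comparison just described, the baseline predictors themselves are a few lines of NumPy; the sizes, noise level, and learning rate below are toy assumptions.

```python
# Sketch: the reference predictors for in-context linear regression.
import numpy as np

rng = np.random.default_rng(0)
X, w_true = rng.normal(size=(32, 8)), rng.normal(size=8)
y = X @ w_true + 0.1 * rng.normal(size=32)

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def gradient_descent(X, y, lr=0.01, n_steps=500):
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w -= lr * X.T @ (X @ w - y)   # gradient of 0.5 * ||Xw - y||^2
    return w

w_ols = ridge(X, y, lam=0.0)          # exact least squares
w_ridge = ridge(X, y, lam=1.0)        # ridge regression
w_gd = gradient_descent(X, y)         # approaches w_ols as steps grow
```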
Third, we present preliminary evidence that in-context learners share algorithmic features with these predictors: learners' late layers encode weight vectors and moment matrices. These results suggest that in-context learning is understandable in algorithmic terms, and that (at least in the linear case) learners may work by rediscovering standard estimation algorithms.","in-context learning, transformers, sequence models, deep learning, meta learning" Learning to represent and predict evolving visual signals via polar straightening,https://openreview.net/forum?id=9d13HEFFaea,https://openreview.net/pdf?id=9d13HEFFaea,,"Observer motion and continuous deformations of objects and textures imbue natural videos with distinct temporal structures, enabling the prediction of future frames from past ones. Conventional methods proceed by estimating local motion, or optic flow, and then using this to predict future frames by warping and copying content. Here, we explore a more direct methodology, in which frames are transformed into an alternative representation where temporal structure and evolution are more readily accessible. As a base case, a rigidly translating pattern can be described in the frequency domain as a linear combination of sinusoids, each with constant amplitude and phase that cycles at a rate proportional to its frequency. This fundamental property of Fourier representation reduces prediction to angular extrapolation. Motivated by the geometry of this well-known case, we formulate a self-supervised learning problem which seeks a transformation of video frames to facilitate next-frame prediction in these natural polar coordinates. We construct a network architecture in which pairs of convolutional channels are used to factorize signals into slowly evolving amplitudes and linearly advancing phases. We train this network to predict future frames, and compare its performance with that of conventional methods using optic flow, and other learned predictive neural networks, evaluated on natural videos from the DAVIS dataset. We find that the polar predictor achieves high prediction performance while remaining interpretable and fast, thereby demonstrating the potential of a flow-free video processing methodology that is trained end-to-end to predict natural video content.","Video prediction, self-supervised representation learning, phase prediction, invariance / equivariance factorization" Protecting Bidder Information in Neural Auctions,https://openreview.net/forum?id=b5RD94lXu2j,https://openreview.net/pdf?id=b5RD94lXu2j,Neural auctions often reveal private bidder information; we apply stochasticity to make them private.,"Single-shot auctions take place all the time, for example when selling ad space or allocating radio frequencies. Devising mechanisms for auctions with many bidders and multiple items can be complicated. It has been shown that neural networks can be used to approximate these mechanisms by satisfying the constraints that an auction be strategyproof and revenue maximizing. We show that despite such auctions maximizing revenue, they do so at the cost of revealing private bidder information. While randomness is often used to build in privacy, in this context it comes with complications if done without care. Specifically, it can violate rationality and feasibility constraints and can fundamentally change the incentive structure of the mechanism, and/or harm top-level metrics such as revenue or social welfare. 
We propose a method based on stochasticity that ensures privacy and meets the requirements for auction mechanisms. Furthermore, we analyze the cost to the auction house in expected revenue that comes with introducing privacy of various degrees.","Mechanism design, neural auctions, privacy" On Representation Learning Under Class Imbalance,https://openreview.net/forum?id=CPDtGLmXEfy,https://openreview.net/pdf?id=CPDtGLmXEfy,We study foundational questions regarding representation learning under imbalanced data for a variety of model classes and across a wide range of domains ,"Unlike carefully curated academic benchmarks, real-world datasets are often highly class-imbalanced, involving training and test sets which contain few examples from certain minority classes. While there is a common understanding that neural network generalization is negatively impacted by imbalance, the source of this problem and its resolution are unclear. Through extensive empirical investigation, we study foundational learning behaviors for various models such as neural networks, gradient-boosted decision trees, and SVMs across a range of domains and find that (1) contrary to conventional wisdom, re-balancing the training set to include a higher proportion of minority samples degrades performance on imbalanced test sets; (2) minority samples are hard to fit, yet algorithms which fit them, such as oversampling, do not improve generalization. Motivated by the observation that re-balancing class-imbalanced training data is ineffective, we show that several existing techniques for improving representation learning are effective in this setting: (3) self-supervised pre-training is insensitive to imbalance and can be used for feature learning before fine-tuning on labels; (4) Bayesian inference is effective because neural networks are especially underspecified under class imbalance; (5) flatness-seeking regularization pulls decision boundaries away from minority samples, especially when we seek minima that are particularly flat on the minority samples’ loss.","Class Imbalance, Neural Networks, Representation Learning, Flatness, Self-Supervised Learning, Bayesian Learning" Gradient Descent Converges Linearly for Logistic Regression on Separable Data,https://openreview.net/forum?id=CKATCkQFcdJ,https://openreview.net/pdf?id=CKATCkQFcdJ,We theoretically show that gradient descent with increasing learning rate obtains favorable rates on logistic regression.,"We show that running gradient descent on the logistic regression objective guarantees loss $f(x) \leq 1.1 \cdot f(x^*) + \epsilon$, where the error $\epsilon$ decays exponentially with the number of iterations. This is in contrast to the common intuition that the absence of strong convexity precludes linear convergence of first-order methods, and highlights the importance of variable learning rates for gradient descent. For separable data, our analysis proves that the error between the predictor returned by gradient descent and the hard SVM predictor decays as $\mathrm{poly}(1/t)$, exponentially faster than the previously known bound of $O(\log\log t / \log t)$. Our key observation is a property of the logistic loss that we call multiplicative smoothness and is (surprisingly) little-explored: As the loss decreases, the objective becomes (locally) smoother and therefore the learning rate can increase. Our results also extend to sparse logistic regression, where they lead to an exponential improvement of the sparsity-error tradeoff. 
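The schedule suggested by this multiplicative-smoothness observation is easy to illustrate: as the logistic loss shrinks, the step size can grow. In the NumPy toy below, taking the step size proportional to 1/f(w) is an illustrative choice, not the paper's exact tuning.

```python
# Toy sketch: gradient descent on separable logistic regression with a
# step size that grows as the loss decays.
import numpy as np

rng = np.random.default_rng(1)
pos = rng.normal(size=(100, 5)) + 2.0     # shifted cluster
X = np.vstack([pos, -pos])                # mirrored => (almost surely) separable
y = np.concatenate([np.ones(100), -np.ones(100)])

def loss_and_grad(w):
    margins = y * (X @ w)
    f = np.mean(np.logaddexp(0.0, -margins))           # logistic loss
    g = X.T @ (-y / (1.0 + np.exp(margins))) / len(y)  # its gradient
    return f, g

w = np.zeros(5)
for _ in range(100):
    f, g = loss_and_grad(w)
    w -= (0.5 / f) * g    # larger steps as the loss gets smaller
```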
","logistic regression, gradient descent, sparse optimization" Interpretable (meta)factorization of clinical questionnaires to identify general dimensions of psychopathology,https://openreview.net/forum?id=c5-qKzTbP2O,https://openreview.net/pdf?id=c5-qKzTbP2O,"We propose an interpretable factorization for multiple, partially-responded clinical questionnaires","Psychiatry research aims at understanding manifestations of psychopathology in behavior, in terms of a small number of latent constructs. These are usually inferred from questionnaire data using factor analysis. The resulting factors and relationship to the original questions are not necessarily interpretable. Furthermore, this approach does not provide a way to separate the effect of confounds from those of constructs, and requires explicit imputation for missing data. Finally, there is no clear way to integrate multiple sets of constructs estimated from different questionnaires. An important question is whether there is a universal, compact set of constructs that would span all the psychopathology issues listed across those questionnaires. We propose a new matrix factorization method designed for questionnaires aimed at promoting interpretability, through bound and sparsity constraints. We provide an optimization procedure with theoretical convergence guarantees, and validate automated methods to detect latent dimensionality on synthetic data. We first demonstrate the method on a commonly used general-purpose questionnaire. We then show it can be used to extract a broad set of 15 psychopathology factors spanning 21 questionnaires from the Healthy Brain Network study. We show that our method preserves diagnostic information against competing methods, even as it imposes more constraints. Finally, we demonstrate that it can be used for defining a short, general questionnaire that allows recovery of those 15 meta-factors, using data more efficiently than other methods.","Factor analysis, matrix factorization, meta-factors, latent constructs, Healthy Brain Network Study" Enhancing Meta Learning via Multi-Objective Soft Improvement Functions,https://openreview.net/forum?id=hCmjBJeGXcu,https://openreview.net/pdf?id=hCmjBJeGXcu,,"Meta-learning tries to leverage information from similar learning tasks. In the commonly-used bilevel optimization formulation, the shared parameter is learned in the outer loop by minimizing the average loss over all tasks. However, the converged solution may be comprised in that it only focuses on optimizing on a small subset of tasks. To alleviate this problem, we consider meta-learning as a multi-objective optimization (MOO) problem, in which each task is an objective. However, existing MOO solvers need to access all the objectives’ gradients in each iteration, and cannot scale to the huge number of tasks in typical meta-learning settings. To alleviate this problem, we propose a scalable gradient-based solver with the use of mini-batch. We provide theoretical guarantees on the Pareto optimality or Pareto stationarity of the converged solution. 
Empirical studies on various machine learning settings demonstrate that the proposed method is efficient and achieves better performance than the baselines, particularly in improving the performance of poorly-performing tasks, thus alleviating the aforementioned compromising phenomenon.","Meta Learning, Multi-Objective Optimization" Discrete Predictor-Corrector Diffusion Models for Image Synthesis,https://openreview.net/forum?id=VM8batVBWvg,https://openreview.net/pdf?id=VM8batVBWvg,We propose a learned predictor-corrector sampler for discrete diffusion models and empirically demonstrate its effectiveness for image generation.,"We introduce Discrete Predictor-Corrector diffusion models (DPC), extending predictor-corrector samplers in Gaussian diffusion models to the discrete case. Predictor-corrector samplers are a class of samplers for diffusion models, which improve on ancestral samplers by correcting the sampling distribution of intermediate diffusion states using MCMC methods. In DPC, the Langevin corrector, which does not have a direct counterpart in discrete space, is replaced with a discrete MCMC transition defined by a learned corrector kernel. The corrector kernel is trained to make the correction steps achieve asymptotic convergence, in distribution, to the correct marginal of the intermediate diffusion states. Equipped with DPC, we revisit recent transformer-based non-autoregressive generative models through the lens of discrete diffusion, and find that DPC can alleviate the compounding decoding error due to the parallel sampling of visual tokens. Our experiments show that DPC improves upon existing discrete latent space models for class-conditional image generation on ImageNet, and outperforms continuous diffusion models and GANs, according to standard metrics and user preference studies.","discrete diffusion models, image synthesis" Instruction-Following Agents with Jointly Pre-Trained Vision-Language Models,https://openreview.net/forum?id=U0jfsqmoV-4,https://openreview.net/pdf?id=U0jfsqmoV-4,A simple model consisting of a pretrained multimodal transformer and a policy transformer for instruction following that significantly improves performance,"Humans are excellent at understanding language and vision to accomplish a wide range of tasks. In contrast, creating general instruction-following embodied agents remains a difficult challenge. Prior work that uses pure language-only models lacks visual grounding, making it difficult to connect language instructions with visual observations. On the other hand, methods that use pre-trained vision-language models typically come with divided language and visual representations, requiring specially designed network architectures to fuse them together. We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments. Our InstructRL method consists of a multimodal transformer that encodes visual observations and language instructions, and a policy transformer that predicts actions based on encoded representations. The multimodal transformer is pre-trained on millions of image-text pairs and natural language text, thereby producing generic cross-modal representations of observations and instructions. The policy transformer keeps track of the full history of observations and actions, and predicts actions autoregressively. We show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings. 
Our model also shows better scalability and generalization ability than prior work.","reinforcement learning, pre-training, multimodal representation, representation learning, transformer" Infusing Lattice Symmetry Priors in Neural Networks Using Soft Attention Masks,https://openreview.net/forum?id=G7E_K3WaLpK,https://openreview.net/pdf?id=G7E_K3WaLpK,,"Infusing inductive biases and knowledge priors in artificial neural networks is a promising approach for achieving sample efficiency in current deep learning models. Core knowledge priors of human intelligence have been studied extensively in developmental science and recent work has postulated the idea that research on artificial intelligence should revolve around the same basic priors. As a step towards this direction, in this paper, we introduce LatFormer, a model that incorporates lattice geometry and topology priors in attention masks. Our study of the properties of these masks motivates a modification to the standard attention mechanism, where attention weights are scaled using soft attention masks generated by a convolutional neural network. Our experiments on ARC and on synthetic visual reasoning tasks show that LatFormer requires two orders of magnitude less data than standard attention and transformers on these tasks. Moreover, our results on ARC tasks that incorporate geometric priors provide preliminary evidence that deep learning can tackle this complex dataset, which is widely viewed as an important open challenge for AI research.", Counterfactual Vision-Language Data Synthesis with Intra-Sample Contrast Learning,https://openreview.net/forum?id=K1NKDaNM9i,https://openreview.net/pdf?id=K1NKDaNM9i,Counterfactual Vision-Language Data Synthesis with Intra-Sample Contrast Learning for Visual Commonsense Reasoning,"Existing Visual Learning (VL) benchmarks often contain exploitative biases. Most prior works only attempted to mitigate biases in semantically low-level, conventional visual-question-answering-style datasets like VQA and GQA. However, these methods cannot generalize to recently emerging, highly semantic VL datasets like VCR and are also difficult to scale due to several severe problems, such as high labeling costs and drastic disruption of the data distribution. To resolve those problems and also address other unique biases in VCR-like datasets, we first conduct an in-depth analysis and identify important biases in the VCR dataset. We further propose a generalized solution that synthesizes counterfactual image and text data based on the original query's semantic focus while producing less distortion to the data distribution. To utilize our synthesized data, we also design an innovative intra-sample contrastive training strategy to assist QA learning in Visual Commonsense Reasoning (VCR). Moreover, our synthesized VL data also serve as a highly-semantic debiased benchmark for evaluating future VL models' robustness.
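A minimal sketch of the mask-scaled attention described in the LatFormer entry above: standard attention weights are multiplied elementwise by a soft mask produced by a small CNN. The shapes, the mask network, and its input below are illustrative assumptions (in LatFormer the mask input would encode lattice geometry, not random noise).

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, soft_mask):
    """q, k, v: (batch, tokens, dim); soft_mask: (batch, tokens, tokens) in [0, 1]."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    weights = F.softmax(scores, dim=-1) * soft_mask  # scale weights, don't hard-mask
    weights = weights / weights.sum(dim=-1, keepdim=True).clamp_min(1e-9)
    return weights @ v

# a small CNN over the (tokens x tokens) grid produces the soft mask
mask_net = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(8, 1, 3, padding=1), torch.nn.Sigmoid(),
)
q = k = v = torch.randn(2, 16, 32)
lattice_features = torch.randn(2, 1, 16, 16)       # placeholder for lattice priors
soft_mask = mask_net(lattice_features).squeeze(1)  # (2, 16, 16)
out = masked_attention(q, k, v, soft_mask)
```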
Extensive experiments show that our proposed synthesized data and training strategy improve existing VL models' performances on both the original VCR dataset and our proposed debiased benchmark.","counterfactual, data augmentation, vision language, knowledge distillation, vcr, vqa, visual question answering, commonsense reasoning, multimodal, robust, domain-shift, debiased" META-LEARNING FOR UNSUPERVISED OUTLIER DETECTION WITH OPTIMAL TRANSPORT,https://openreview.net/forum?id=6G5DwFLYRM,https://openreview.net/pdf?id=6G5DwFLYRM,A new meta-learning approach for unsupervised machine learning problems with optimal transport.,"Automated machine learning has been widely researched and adopted in the field of supervised classification and regression, but progress in unsupervised settings has been limited. We propose a novel approach to automate outlier detection based on meta-learning from previous datasets with outliers. Our premise is that the selection of the optimal outlier detection technique depends on inherent properties of the data distribution. We leverage the Gromov-Wasserstein distance, in particular, to find the dataset with the most similar underlying distribution, and then apply the outlier detection techniques that proved to work best for that data distribution. We evaluate the robustness of our approach and find that it outperforms state-of-the-art methods in unsupervised outlier detection. This approach can also be easily generalized to automate other unsupervised settings.","unsupervised learning, automl, optimal transport" GPTQ: Accurate Quantization for Generative Pre-trained Transformers,https://openreview.net/forum?id=tcbBPnfwxS,https://openreview.net/pdf?id=tcbBPnfwxS,"We show that Generative Pre-trained Transformer (GPT) models can be quantized down to 3-4 bits without significant loss of accuracy, which leads to significant computational and usability improvements. ","Generative Pre-trained Transformer (GPT) models have set themselves apart by breakthrough performance across complex language modelling tasks, but also by their extremely high computational costs. Specifically, due to memory costs, even inference for large, highly-accurate GPT models may require multiple performant GPUs to execute, which limits their usability. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques is limited by the scale and complexity of GPT models. In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains relative to previously-proposed one-shot quantization methods, preserving accuracy, allowing us for the first time to execute a 175-billion-parameter model on a single GPU.
We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16 of around 2x when using high-end GPUs (NVIDIA A100) and 4x when using more cost-effective ones (NVIDIA A6000).","compression, quantization, generative pre-trained transformers, GPT, second-order methods" Domain-Invariant Auxiliary Learning for Robust Few-Shot Predictions from Noisy Data,https://openreview.net/forum?id=Bo-1bxmCrrA,https://openreview.net/pdf?id=Bo-1bxmCrrA,We propose a novel MetaAux framework using auxiliary tasks to effectively learn a robust representation for better generalization and adaptation in unseen few-shot tasks.,"Modern meta-learning approaches produce state-of-the-art performance by imitating the test condition for few-shot learning (FSL) using episodic training. However, overfitting and memorizing corrupted labels have been long-standing issues. Data cleansing offers a promising solution for dealing with noisy labels. Nevertheless, in FSL, data cleansing exacerbates the severity of the problem as the available training data becomes much more limited and the model is typically inadequately trained. In this work, we address overfitting in a noisy setting by exploiting auxiliary tasks to learn a better shared representation. Unsupervised auxiliary tasks are designed with no extra labeling overhead, and the Wasserstein distance is leveraged to align the primary and auxiliary distributions, ensuring the learned knowledge is domain-invariant. Building upon theoretical advances in PAC-Bayesian analysis, we derive novel generalization bounds for meta-learning with auxiliary tasks under the effect of noisy corruptions. Extensive experiments on FSL tasks with noisy labels are conducted to show the effectiveness and robustness of our proposed method. ","meta-learning, few-shot learning, auxiliary task" Attentive MLP for Non-Autoregressive Generation,https://openreview.net/forum?id=hA7XDfCD1y2,https://openreview.net/pdf?id=hA7XDfCD1y2,We propose Attentive Multi-Layer Perceptron (AMLP) to integrate the attention mechanism with the multi-layer perceptron (MLP) in non-autoregressive architecture.,"Autoregressive~(AR) generation largely dominates sequence generation thanks to its efficacy. Recently, non-autoregressive~(NAR) generation has gained increasing popularity for its efficiency and growing efficacy. However, its efficiency is still bottlenecked by softmax attention, with quadratic complexity in computation time and memory cost. This bottleneck prevents non-autoregressive models from scaling to long-sequence generation, and little work has been done to mitigate this problem. In this paper, we propose a novel MLP variant, \textbf{A}ttentive \textbf{M}ulti-\textbf{L}ayer \textbf{P}erceptron~(AMLP), to produce a generation model with linear time and space complexity. Different from classic MLPs with static and learnable projection matrices, AMLP leverages adaptive projections computed from inputs in an attentive mode. And different from softmax attention, AMLP uses sample-aware adaptive projections to enable communications among tokens in a sequence, and models the measurement between the query and key space. Furthermore, we marry AMLP with popular NAR models, deriving a highly efficient NAR-AMLP architecture with linear time and space complexity. The empirical results show that this married architecture, NAR-AMLP, surpasses competitive efficient NAR models by a significant margin on text-to-speech synthesis and machine translation.
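For context on the GPTQ entry above, here is a minimal sketch of plain round-to-nearest (RTN) quantization, the simple one-shot scheme that methods like GPTQ improve upon with approximate second-order information. The per-row symmetric scaling below is our assumption for illustration, not GPTQ's actual procedure.

```python
import numpy as np

def quantize_rtn(W, bits=4):
    """Round-to-nearest quantization of a weight matrix W (rows, cols)
    to signed `bits`-bit integer codes with one scale per row."""
    qmax = 2 ** (bits - 1) - 1                        # e.g. 7 for 4-bit signed
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    codes = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return codes.astype(np.int8), scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

W = np.random.randn(8, 16).astype(np.float32)
codes, scale = quantize_rtn(W, bits=4)
print(np.abs(W - dequantize(codes, scale)).max())     # per-weight rounding error
```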
We also test AMLP's self- and cross-attention abilities separately with extensive ablation experiments, and find them comparable or even superior to those of other efficient models. The efficiency analysis further shows that AMLP speeds up inference and dramatically reduces the memory cost relative to vanilla non-autoregressive models. All the experiments reveal that NAR-AMLP is a promising architecture in terms of both efficiency and efficacy.","Non-autoregressive, AMLP, linear complexity" ConserWeightive Behavioral Cloning for Reliable Offline Reinforcement Learning,https://openreview.net/forum?id=q2vsXnsjNB_,https://openreview.net/pdf?id=q2vsXnsjNB_,Simple weighted sampling + conservative regularization based on l2 penalty improves robustness of conditional BC when conditioning on large out-of-distribution returns.,"The goal of offline reinforcement learning (RL) is to learn near-optimal policies from static logged datasets, thus sidestepping expensive online interactions. Behavioral cloning (BC) provides a straightforward solution to offline RL by mimicking offline trajectories via supervised learning. Recent advances~\cite{chen2021decision, janner2021offline, emmons2021rvs} have shown that by conditioning on desired future returns, BC can perform competitively with its value-based counterparts, while enjoying much more simplicity and training stability. However, the distribution of returns in the offline dataset can be arbitrarily skewed and suboptimal, which poses a unique challenge for conditioning BC on expert returns at test time. We propose ConserWeightive Behavioral Cloning (\name), a simple and effective method for improving the performance of conditional BC for offline RL with two key components: trajectory weighting and conservative regularization. Trajectory weighting addresses the bias-variance tradeoff in conditional BC and provides a principled mechanism to learn from both low-return trajectories (typically plentiful) and high-return trajectories (typically few). Further, we analyze the notion of conservatism in existing BC methods, and propose a novel conservative regularizer that explicitly encourages the policy to stay close to the data distribution. The regularizer helps achieve more reliable performance, and removes the need for ad-hoc tuning of the conditioning value during evaluation. We instantiate \name{} in the context of Reinforcement Learning via Supervised Learning (RvS)~\cite{emmons2021rvs} and Decision Transformer (DT)~\citep{chen2021decision}, and empirically show that it significantly boosts the performance and stability of prior methods on various offline RL benchmarks.","offline RL, behavioral cloning, conservatism" Dynamics Model Based Adversarial Training For Competitive Reinforcement Learning,https://openreview.net/forum?id=oxIbD0j-GGo,https://openreview.net/pdf?id=oxIbD0j-GGo,We propose a dynamics model based adversarial training framework to train DRL agents robust against adversarial perturbations in two-agent games.,"Adversarial perturbations substantially degrade the performance of Deep Reinforcement Learning (DRL) agents, reducing the applicability of DRL in practice. Existing adversarial training for robustifying DRL uses the information of the agent at the current step to minimize the loss upper bound introduced by adversarial input perturbations. However, it only works well for single-agent tasks. The intensified adversarial interaction in two-agent games introduces richer dynamics and makes existing methods less effective.
Inspired by model-based RL, which builds a model of the environment transition probability, we propose a dynamics model-based adversarial training framework for modeling multi-step state transitions. Our dynamics model transitively predicts future states, which can provide more precise back-propagated future information during adversarial perturbation generation, and hence improve the agent’s empirical robustness substantially under different attacks. Our experiments on four two-agent competitive MuJoCo games show that our method consistently outperforms state-of-the-art adversarial training techniques in terms of empirical robustness and normal functionalities of DRL agents.","Adversarial Training, Competitive Reinforcement Learning, Adversarial Robustness" ADVL: Adaptive Distillation for Vision-Language Tasks,https://openreview.net/forum?id=8-2sjUPp_YD,https://openreview.net/pdf?id=8-2sjUPp_YD,Leveraging Pretrained Unimodal Encoders for Vision-Language Tasks via Adaptive Knowledge Distillation,"Large-scale image-text pairs, such as image-captions and image-phrases, enable the strong representation of vision-language (VL) models. Nevertheless, they lose diversity and complexity due to the constraints in collecting data. Meanwhile, models pre-trained with image-only or text-only data (we call them unimodal pretrained models) continue to flourish and impress the community. Compared to image-text pairs, unimodal data has fewer constraints during the collection process, resulting in more diverse styles. A natural question is how to leverage unimodal pretrained models to benefit downstream VL tasks. Most existing works focus on fusing VL information in the expensive pre-training stage. They directly plug unimodal pre-trained encoders into a VL framework and redo an additional pre-training step on paired image-text data. This causes additional computational expense, and the unimodal pretrained knowledge might be forgotten. In this paper, we take a different route and investigate how to fuse VL information in the finetuning stage only. To directly transfer pretrained knowledge from unimodal models to help downstream VL tasks, we propose $\mathrm{ADVL}$, which avoids redoing any pre-training step and is generalizable enough to be applied on top of various VL models. To comprehensively demonstrate the effectiveness of ADVL, we conduct evaluations across three widely recognized, highly semantic VL benchmarks: VCR, VQA, and SNLI-VE, under three settings: low-shot, full-shot, and domain-shifted. Results show that ADVL consistently improves the performance of different VL base models across all settings.
It even achieves state-of-the-art (SOTA) performance on VCR among models pre-trained with image-text data and delivers competitive results on VQA and SNLI-VE. Based on our analysis, we also discover that ADVL can improve the robustness of VL models and regulate them to better use vision information.","vision language, knowledge distillation, vcr, vqa, snli-ve, visual question answering, commonsense reasoning, pretraining, multimodal, robust, low-shot, zero-shot, domain-shift, debiased" A new characterization of the edge of stability based on a sharpness measure aware of batch gradient distribution,https://openreview.net/forum?id=bH-kCY6LdKg,https://openreview.net/pdf?id=bH-kCY6LdKg,,"For full-batch gradient descent (GD), it has been empirically shown that the sharpness, the top eigenvalue of the Hessian, increases and then hovers above $2/\text{(learning rate)}$, and this is called ``the edge of stability'' phenomenon. However, it is unclear why the sharpness is somewhat larger than $2/\text{(learning rate)}$ and how this can be extended to general mini-batch stochastic gradient descent (SGD). We propose a new sharpness measure (interaction-aware-sharpness) aware of the \emph{interaction} between the batch gradient distribution and the loss landscape geometry. This leads to a more refined and general characterization of the edge of stability for SGD. Moreover, based on the analysis of a concentration measure of the batch gradient, we propose a more accurate scaling rule, the Linear and Saturation Scaling Rule (LSSR), between batch size and learning rate.","edge of stability, SGD, learning rate, batch size, optimization, generalization, implicit bias, implicit regularization, sharpness, scaling rule" Finding the smallest tree in the forest: Monte Carlo Forest Search for UNSAT solving,https://openreview.net/forum?id=8MneBPDxV9L,https://openreview.net/pdf?id=8MneBPDxV9L,"We develop Monte Carlo Forest Search (MCFS), an algorithm for finding small search trees within a forest that retains the benefits of the best MCTS approaches.","Monte Carlo Tree Search (MCTS) is an effective approach for finding low-cost paths through any large combinatorial space that can naturally be structured as a search tree. However, some combinatorial problems do not have a natural interpretation as searches for a good path. For example, solving a CSP can be represented as a path (assign variables sequentially and check the solution); however, proving that no solution exists (via existing methods) requires enumerating multiple paths to build out a “proof tree” demonstrating that every possible variable assignment leads to a conflict. Rather than finding a good path (solution) within a tree, the search problem becomes searching for a small proof tree within a forest of candidate trees. In this paper we develop Monte Carlo Forest Search (MCFS), an algorithm for finding small search trees. Our method leverages the benefits of the best MCTS approaches and further introduces two key ideas. First, we estimate tree size via the linear (i.e., path-based) and unbiased approximation from Knuth (1975). Second, we query a strong solver at a user-defined depth rather than learning a policy across the whole tree, in order to (1) reduce the variance of our tree-size estimates and (2) focus our policy search on early decisions, which offer the greatest potential for reducing tree size.
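Relatedly, the sharpness discussed in the edge-of-stability entry above (the top eigenvalue of the Hessian) can be estimated without ever forming the Hessian, via power iteration on Hessian-vector products. A minimal sketch with a placeholder model and data (both assumptions for illustration):

```python
import torch

# placeholder model and data
model = torch.nn.Linear(10, 1)
x, y = torch.randn(64, 10), torch.randn(64, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
params = list(model.parameters())

def top_hessian_eigenvalue(loss, params, iters=50):
    """Estimate sharpness by power iteration on Hessian-vector products."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / norm for h in hv]                     # normalized eigenvector estimate
    hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
    return sum((h * u).sum() for h, u in zip(hv, v)).item()  # Rayleigh quotient

# Edge of stability: for full-batch GD this value hovers above 2 / learning_rate.
print(top_hessian_eigenvalue(loss, params))
```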
We evaluated our approach on the Boolean satisfiability (SAT) problem, and found that it matched or improved performance over a strong baseline on two well-known distributions (\texttt{sgen}, \texttt{random}). Notably, we improved walltime by 9\% on \texttt{sgen} over the \texttt{kcnfs} solver and even further over the strongest UNSAT solver from the 2021 SAT competition.","Monte Carlo Tree Search, Reinforcement learning, Combinatorial optimization, SAT" $\mathrm{SE}(3)$-Equivariant Attention Networks for Shape Reconstruction in Function Space,https://openreview.net/forum?id=RDy3IbvjMqT,https://openreview.net/pdf?id=RDy3IbvjMqT,,"We propose a method for 3D shape reconstruction from unoriented point clouds. Our method consists of a novel SE(3)-equivariant coordinate-based network that parametrizes the occupancy field of the shape and respects the inherent symmetries of the problem. In contrast to previous shape reconstruction methods that align the input to a regular grid, we operate directly on the irregular point cloud. Our architecture leverages equivariant attention layers that operate on local tokens. This mechanism enables local shape modelling, a crucial property for scalability to large scenes. Given an unoriented, sparse, noisy point cloud as input, we produce equivariant features for each point. These serve as keys and values for the subsequent equivariant cross-attention blocks that parametrize the occupancy field. By querying an arbitrary point in space, we predict its occupancy score. We show that our method outperforms previous SO(3)-equivariant methods, as well as non-equivariant methods trained on SO(3)-augmented datasets. More importantly, local modelling together with SE(3)-equivariance create an ideal setting for SE(3) scene reconstruction. We show that by training only on single, aligned objects and without any pre-segmentation, we can reconstruct novel scenes containing arbitrarily many objects in random poses without any performance loss. ","shape reconstruction, equivariance, neural fields, attention, 3D vision, point clouds" PBES: PCA Based Exemplar Sampling Algorithm for Continual Learning,https://openreview.net/forum?id=D8ulVmpjzYX,https://openreview.net/pdf?id=D8ulVmpjzYX,,"Traditional machine learning is both data and computation intensive. The most powerful models require huge quantities of data to train, and the training is highly time-consuming. In the streaming or incremental model of machine learning, the data is received and processed in a streaming manner, i.e., the entire data stream is not stored, and the models are updated incrementally. While this is closer to the learning process of humans, a common problem associated with this is “catastrophic forgetting” (CF), i.e., because the entire data is not stored, but just a sketch of it, as more and more data arrives, the older data invariably has a smaller representation in the stored sketch, and this causes models to perform badly on tasks that are closer to older data. One of the approaches to solve this problem stores an “exemplar set” of data items from the stream – but this raises the central question: how to choose which items to store? Current approaches to solve this are based on herding, which is a way to select a random-looking sample via a deterministic algorithm. We propose a novel selection approach based on Principal Component Analysis and median sampling.
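For readers of the MCFS entry above, a minimal sketch of the path-based, unbiased tree-size estimator from Knuth (1975) that it builds on: follow a single random root-to-leaf path and accumulate products of branching factors. The toy tree below is illustrative only.

```python
import random

def knuth_tree_size_estimate(root, children):
    """Unbiased estimate of the number of nodes in the tree rooted at `root`,
    from one random root-to-leaf path (Knuth, 1975).
    `children(node)` returns the list of children of `node`."""
    estimate, weight, node = 1, 1, root
    while True:
        kids = children(node)
        if not kids:
            return estimate
        weight *= len(kids)   # inverse probability of the sampled path so far
        estimate += weight    # unbiased contribution of this level's node count
        node = random.choice(kids)

# toy tree with 5 nodes; averaging many estimates reduces the (large) variance
tree = {'r': ['a', 'b'], 'a': ['c', 'd'], 'b': [], 'c': [], 'd': []}
est = sum(knuth_tree_size_estimate('r', lambda n: tree[n]) for _ in range(10000)) / 10000
print(est)  # -> close to 5, the true node count
```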
This approach avoids the pitfalls due to outliers and is both simple to implement and easy to use across various incremental machine learning models. It can also be used as a standalone sampling algorithm. We achieve better performance than state-of-the-art methods. ","Continual Learning, Incremental Learning, Machine Learning, PCA, principal directions, principal component analysis, Class-incremental learning" "3D-IntPhys: Learning 3D Visual Intuitive Physics for Fluids, Rigid Bodies, and Granular Materials",https://openreview.net/forum?id=15lSKp0wBnm,https://openreview.net/pdf?id=15lSKp0wBnm,"An intuitive physics model with explicit 3D and compositional structures learned from multi-view videos. The learned model can handle complicated objects (e.g., fluid, rigid objects, granular materials) and perform extrapolated generalization.","Given a visual scene, humans have strong intuitions about how a scene can evolve over time under given actions. The intuition, often termed visual intuitive physics, is a critical ability that allows us to make effective plans to manipulate the scene to achieve desired outcomes without relying on extensive trial and error. In this paper, we present a framework capable of learning 3D-grounded visual intuitive physics models purely from unlabeled images. Our method is composed of a conditional Neural Radiance Field (NeRF)-style visual frontend and a 3D point-based dynamics prediction backend, in which we impose strong relational and structural inductive bias to capture the structure of the underlying environment. Unlike existing intuitive point-based dynamics works that rely on the supervision of dense point trajectories from simulators, we relax the requirements and only assume access to multi-view RGB images and (imperfect) instance masks. This enables the proposed model to handle scenarios where accurate point estimation and tracking are hard or impossible. We evaluate the models on three challenging scenarios involving fluid, granular materials, and rigid objects, where standard detection and tracking methods are not applicable. We show our model can make long-horizon future predictions by learning from raw images and significantly outperforms models that do not employ an explicit 3D representation space. We also show that, once trained, our model can achieve strong generalization in complex scenarios under extrapolated settings.","Visual Intuitive Physics, Neural Implicit Representations, Graph Neural Networks, Learning-Based Dynamics Modeling, Particle-Based Dynamics" Continual Post-Training of Language Models,https://openreview.net/forum?id=m_GDIItaI3o,https://openreview.net/pdf?id=m_GDIItaI3o,This paper proposes a continual post-training method based on soft-masking to learn a sequence of unlabeled domain corpora to adapt a language model to improve the end-task performances in these domains.,"Language models (LMs) have been instrumental in the recent rapid advance of natural language processing. Existing research has shown that post-training or adapting an LM using an unlabeled topical/domain corpus can improve the end-task performance in the domain. This paper proposes a novel method to continually post-train an LM with a sequence of unlabeled domain corpora, adapting the LM to these domains to improve their end-task performances. The key novelty of our method is a soft-masking mechanism that directly controls the update to the LM. A novel proxy is also proposed to preserve the general knowledge in the original LM.
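A speculative sketch in the spirit of the PBES entry above. The specific rule below, projecting onto the first principal component and keeping the items nearest the median projection (which outliers, by definition, are not), is our guess at one outlier-robust instantiation of "PCA and median sampling", not the paper's exact algorithm.

```python
import numpy as np

def pca_median_exemplars(X, m):
    """X: (n, d) feature matrix; returns the indices of m exemplars."""
    Xc = X - X.mean(axis=0)
    # first principal direction via SVD of the centered data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ Vt[0]                       # 1-D projections onto that direction
    dist = np.abs(proj - np.median(proj))   # median is robust to outliers
    return np.argsort(dist)[:m]             # extreme points are never selected

X = np.random.randn(100, 16)
exemplar_ids = pca_median_exemplars(X, m=10)
```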
Additionally, it contrasts the representations of the previously learned domain knowledge (including the general knowledge in the pre-trained LM) and the knowledge from the current full network to achieve knowledge integration. The method not only overcomes catastrophic forgetting, but also achieves knowledge transfer to improve end-task performances compared to post-training on each domain separately. Empirical evaluation demonstrates the effectiveness of the proposed method.","Continual learning, Domain-adaptive Pretraining, Post-training" Min-Max Multi-objective Bilevel Optimization with Applications in Robust Machine Learning,https://openreview.net/forum?id=PvDY71zKsvP,https://openreview.net/pdf?id=PvDY71zKsvP,We study a generic min-max bilevel multi-objective optimization framework with novel theoretical analysis and applications in representation learning and hyperparameter optimization,"We consider a generic min-max multi-objective bilevel optimization problem with applications in robust machine learning such as representation learning and hyperparameter optimization. We design MORBiT, a novel single-loop gradient descent-ascent bilevel optimization algorithm, to solve the generic problem and present a novel analysis showing that MORBiT converges to a first-order stationary point at a rate of $\widetilde{\mathcal{O}}(n^{1/2} K^{-2/5})$ for a class of weakly convex problems with $n$ objectives upon $K$ iterations of the algorithm. Our analysis utilizes novel results to handle the non-smooth min-max multi-objective setup and to obtain a sublinear dependence on the number of objectives $n$. Experimental results on robust representation learning and robust hyperparameter optimization showcase (i) the advantages of considering the min-max multi-objective setup, and (ii) the convergence properties of the proposed MORBiT.","robust optimization, bilevel optimization, multi-objective optimization" IAE: Implicit Autoencoder for Point Cloud Self-supervised Representation Learning,https://openreview.net/forum?id=00kPgkoahtO,https://openreview.net/pdf?id=00kPgkoahtO,We propose a simple yet effective non-symmetric autoencoder for point cloud self-supervised learning which leverages implicit function.,"Autoencoding has been a popular topic across many fields and recently emerged in the 3D domain. However, many 3D representations (e.g., point clouds) are discrete samples of the underlying continuous 3D surface, which makes them different from other data modalities. This process inevitably introduces sampling variations on the underlying 3D shapes. In learning 3D representation, a desirable goal is to disregard such sampling variations while focusing on capturing transferable knowledge of the underlying 3D shape. This aim poses a grand challenge to existing representation learning paradigms. For example, the standard autoencoding paradigm forces the encoder to capture such sampling variations as the decoder has to reconstruct the original point cloud. In this paper, we introduce the Implicit Autoencoder (IAE). This simple yet effective method addresses this challenge by replacing the point cloud decoder with an implicit decoder. The implicit decoder can output a continuous representation that is shared among different point cloud samplings of the same model. Reconstructing under the implicit representation encourages the encoder to discard sampling variations, introducing appropriate inductive bias to learn more generalizable feature representations.
We validate this claim experimentally and provide a theoretical analysis for a simple linear autoencoder. Moreover, our implicit decoder offers excellent flexibility in designing suitable implicit representations for different tasks. We demonstrate the usefulness of IAE across various self-supervised learning tasks for both 3D objects and 3D scenes. Experimental results show that IAE consistently outperforms the state-of-the-art in each task. ","point cloud, self-supervised learning, representation learning, autoencoder, implicit function" The Plug and Play of Language Models for Text-to-image Generation,https://openreview.net/forum?id=1n1c7cHl3Zc,https://openreview.net/pdf?id=1n1c7cHl3Zc,"This paper introduces a new method to efficiently plug new language models into existing text-to-image generation models, enhancing scalability.","Text-to-image (T2I) models enable controllable image generation through user-provided captions. A text encoder is typically used to map captions to a latent space, and it has been shown to be critical for the model's performance. However, replacing or upgrading the text encoder in a T2I model is challenging due to the tight bond between the current encoder and the image decoder. It requires training the model from scratch, which can be prohibitively expensive. To address this problem, we introduce a more efficient approach to align a pre-trained language model with the latent space of an existing T2I model. We propose a Model Translation Network (MTN) and a new training objective to align the representation spaces of the two text encoders using only a corpus of unlabeled text. We empirically find that MTN can be trained efficiently and can boost the performance of existing T2I models by upgrading their text encoder. Moreover, we find that MTN can align multilingual language models such as XLM-Roberta, thus allowing existing T2I models to generate high-quality images from captions beyond English. ","Text-to-Image Generation, Language Models, Efficiency" Learning Arborescence with An Efficient Inference Algorithm,https://openreview.net/forum?id=V2BQvSIWnYD,https://openreview.net/pdf?id=V2BQvSIWnYD,"We propose an efficient algorithm for predicting minimum weight arborescences, speeding up training and inference for arborescence learning tasks.","We consider a class of structured learning problems on arborescences (i.e., directed spanning trees) of the input graph. The key step involved in this problem is predicting the minimum weight arborescence (MWA) from the learned model. In the literature, there are two lines of research for predicting the MWA: the Chu-Liu Edmonds (CLE) and the Lovasz methods. The CLE method is easy to implement, but it takes $\mathcal{O}(n)$ cycle contractions, where $n$ is the graph size. The Lovasz method reduces to the multi-pair shortest path (MPSP) problem and takes only $\mathcal{O}(\log n)$ contractions. Nevertheless, in the CPU setting, MPSP has the same time complexity as finding the MWA; the Lovasz method attains time efficiency only with sufficient GPU resources. Both of the aforementioned methods are painfully slow for large-scale learning tasks. In this research, we find that the general MPSP problem can be simplified when working with machine learning models. This is because the learning model predicts edge weights for all pairs of vertices, so the graph we process is always complete.
Therefore, we only need to handle those paths that directly enter every weakly connected component (WCC), while the classic Lovasz method needs to handle all possible paths. This allows us to propose the LAzy LoVAz (Lava) method, which enjoys $\mathcal{O}(\log n)$ contractions as well as efficient performance in both CPU and GPU settings. In experiments, we consider synthetic datasets and two real-world learning tasks, i.e., graph-based dependency parsing and unsupervised parsing on ListOps. The empirical results show significant gains of our Lava method over the classic CLE and Lovasz methods, with Lava substantially reducing the training time for arborescence learning tasks.","minimum weight arborescence, arborescence Learning" "Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning",https://openreview.net/forum?id=Uuf2q9TfXGA,https://openreview.net/pdf?id=Uuf2q9TfXGA,"We provide a theory to explain why ensemble and knowledge distillation work for Deep Learning. It matches practice well, while traditional theory such as boosting, random feature mappings or NTKs, cannot explain the same phenomena for DL.","(this is a theory paper) We formally study how \emph{ensemble} of deep learning models can improve test accuracy, and how the superior performance of ensemble can be distilled into a single model using \emph{knowledge distillation}. We consider the challenging case where the ensemble is simply an average of the outputs of a few independently trained neural networks with the \emph{same} architecture, trained using the \emph{same} algorithm on the \emph{same} data set, and they only differ by the random seeds used in the initialization. We show that ensemble/knowledge distillation in \emph{deep learning} works very differently from traditional learning theory (such as boosting or NTKs). We develop a theory showing that when data has a structure we refer to as ``multi-view'', then ensemble of independently trained neural networks can provably improve test accuracy, and such superior test accuracy can also be provably distilled into a single model. Our result sheds light on how ensemble works in deep learning in a way that is completely different from traditional theorems, and how the ``dark knowledge'' is hidden in the outputs of the ensemble and can be used in distillation.", A Score-Based Model for Learning Neural Wavefunctions,https://openreview.net/forum?id=rMQ1Wme3S0c,https://openreview.net/pdf?id=rMQ1Wme3S0c,We propose a score-based optimization framework for Quantum Monte Carlo which does not require explicit probability distribution.,"Quantum Monte Carlo coupled with neural network wavefunctions has shown success in finding the ground state of quantum many-body systems. The existing optimization approaches compute the energy by sampling local energy from an explicit probability distribution given by the wavefunction. In this work, we provide a new optimization framework for obtaining properties of quantum many-body ground states using score-based neural networks. This new framework does not require an explicit probability distribution and performs the sampling via Langevin dynamics. Our method is based on the key observation that the local energy is directly related to the score, defined as the gradient of the logarithmic wavefunction. Inspired by the score matching and diffusion Monte Carlo methods, we derive a weighted score matching objective, which guides our score-based models to correctly converge to the ground state.
We first validate our approach with experiments on quantum harmonic traps, and further results show that it can accurately learn the ground states of atomic systems. By implicitly modeling the high-dimensional data distribution, our work paves the way toward a more efficient representation of quantum systems.","Neural wavefunction, Quantum Monte Carlo, Score-based method" Benchmarking Algorithms for Domain Generalization in Federated Learning,https://openreview.net/forum?id=IsCg7qoy8i9,https://openreview.net/pdf?id=IsCg7qoy8i9,Benchmarking algorithms for domain generalization in federated learning on multiple realistic datasets.,"In this paper, we present a unified platform to study domain generalization in the federated learning (FL) context and conduct extensive empirical evaluations of the current state-of-the-art domain generalization algorithms adapted to FL. In particular, we perform a fair comparison of nine existing algorithms for domain generalization (either centralized domain generalization algorithms adapted to the FL context or existing FL domain generalization algorithms) to comprehensively explore the challenges introduced by FL. These challenges include statistical heterogeneity among clients, the number of clients, the number of communication rounds, etc. The evaluations are conducted on three diverse datasets including PACS (image dataset covering photo, sketch, cartoon, and painting domains), iWildCam (image dataset with 323 domains), and Py150 (natural language processing dataset with 8421 domains). The experiments show that the challenges brought by federated learning remain unsolved in realistic experimental settings. Furthermore, the code base supports fair and reproducible evaluation of new algorithms with easy implementation.","Domain Generalization, Federated Learning, Benchmark." The Vendi Score: A Diversity Evaluation Metric for Machine Learning,https://openreview.net/forum?id=dF0g-5k05h_,https://openreview.net/pdf?id=dF0g-5k05h_,,"Diversity is an important criterion for many areas of machine learning (ML), including generative modeling and dataset curation. Yet little work has gone into understanding, formalizing, and measuring diversity in ML. In this paper we address the diversity evaluation problem by proposing the Vendi Score, which connects and extends ideas from ecology and quantum statistical mechanics to ML. The Vendi Score is defined as the exponential of the Shannon entropy of the eigenvalues of a similarity matrix. This matrix is induced by a user-defined similarity function applied to the sample to be evaluated for diversity. In taking a similarity function as input, the Vendi Score enables its user to specify any desired form of diversity. Importantly, unlike many existing metrics in ML, the Vendi Score doesn't require a reference dataset or distribution over samples or labels; it is therefore general and applicable to any generative model, decoding algorithm, and dataset from any domain where similarity can be defined. We showcase the Vendi Score on molecular generative modeling, where we found it addresses shortcomings of the current diversity metric of choice in that domain. We also applied the Vendi Score to generative models of images and decoding algorithms of text, where we found it confirms known results about diversity in those domains. Furthermore, we used the Vendi Score to measure mode collapse, a known shortcoming of generative adversarial networks (GANs).
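A minimal sketch of the Vendi Score as just defined: the exponential of the Shannon entropy of the eigenvalues of a similarity matrix. Normalizing the matrix by n, so its eigenvalues sum to 1, is our assumption for making the entropy well defined; the similarity function is user-supplied, as the abstract states.

```python
import numpy as np

def vendi_score(samples, similarity):
    """samples: list of items; similarity: callable k(x, y) with k(x, x) = 1."""
    n = len(samples)
    K = np.array([[similarity(a, b) for b in samples] for a in samples])
    eigvals = np.linalg.eigvalsh(K / n)           # eigenvalues sum to 1
    eigvals = eigvals[eigvals > 1e-12]            # drop numerical zeros
    entropy = -np.sum(eigvals * np.log(eigvals))  # Shannon entropy
    return float(np.exp(entropy))

# n identical items score 1; n completely dissimilar items score n:
print(vendi_score([0, 1, 2], lambda a, b: float(a == b)))  # -> 3.0
```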
In particular, the Vendi Score revealed that even GANs that capture all the modes of a labelled dataset can be less diverse than the original dataset. Finally, the interpretability of the Vendi Score allowed us to diagnose several benchmark ML datasets for diversity, opening the door for diversity-informed data augmentation.", How Can GANs Learn Hierarchical Generative Models for Real-World Distributions,https://openreview.net/forum?id=7h5KSs2PCRi,https://openreview.net/pdf?id=7h5KSs2PCRi,We provide a theory to study how generative adversarial networks (GANs) can efficiently learn certain hierarchically generated distributions that are close to the distribution of images in practice.,"(this is a theory paper) Generative adversarial networks (GANs) are among the most successful models for learning high-complexity, real-world distributions. However, in theory, due to the highly non-convex, non-concave landscape of the minmax training objective, GAN remains one of the least understood deep learning models. In this work, we formally study how GANs can efficiently learn certain hierarchically generated distributions that are close to the distribution of real-life images. We prove that when a distribution has a structure that we refer to as \emph{forward super-resolution}, then simply training generative adversarial networks using stochastic gradient descent ascent (SGDA) can learn this distribution efficiently, both in sample and time complexities. We also provide empirical evidence that our assumption ``forward super-resolution'' is very natural in practice, and the underlying learning mechanisms that we study in this paper (that allow us to efficiently train GANs via SGDA in theory) simulate the actual learning process of GANs on real-world problems. ", Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus,https://openreview.net/forum?id=9yE2xEj0BH7,https://openreview.net/pdf?id=9yE2xEj0BH7,We propose an enhanced vision-language model for UI tasks that achieves SoTA on representative UI tasks and supports few-shot and multitask learning.,"Mobile UI understanding is important for enabling various interaction tasks such as UI automation and accessibility. Previous mobile UI modeling often depends on the view hierarchy information of a screen, which directly provides the structural data of the UI, with the hope of bypassing challenging tasks of visual modeling from screen pixels. However, the view hierarchy is not always available, and is often corrupted with missing object descriptions or misaligned bounding box positions. As a result, although using view hierarchy offers some short-term gains, it may ultimately hinder the applicability and performance of the model. In this paper, we propose Spotlight, a vision-only approach for mobile UI understanding. Specifically, we enhance a vision-language model that only takes the screenshot of the UI and a region of interest on the screen---the focus---as the input. This general architecture is easily scalable and capable of performing a range of UI modeling tasks. Our experiments show that our model obtains SoTA results on several representative UI tasks and outperforms previous methods that use both screenshots and view hierarchies as input.
Furthermore, we explore the multi-task learning and few-shot prompting capacity of the proposed models, demonstrating promising results in the multi-task learning direction.","vision-language, UI, few-shot, finetuning, multi-task" A Control-Centric Benchmark for Video Prediction,https://openreview.net/forum?id=rimcq1oIFeR,https://openreview.net/pdf?id=rimcq1oIFeR,"We find that existing video evaluation metrics are not always indicative of a model's performance during control, and propose a benchmark that directly evaluates video prediction models on simulated manipulation tasks by using them for planning.","Video is a promising source of knowledge for embodied agents to learn models of the world's dynamics. Large deep networks have become increasingly effective at modeling complex video data in a self-supervised manner, as evaluated by metrics based on human perceptual similarity or pixel-wise comparison. However, it remains unclear whether current metrics are accurate indicators of performance on downstream tasks. We find empirically that for planning robotic manipulation, existing metrics can be unreliable at predicting execution success. To address this, we propose a benchmark for action-conditioned video prediction in the form of a control benchmark that evaluates a given model for simulated robotic manipulation through sampling-based planning. Our benchmark, Video Prediction for Visual Planning ($\text{VP}^2$), includes simulated environments with $11$ task categories and $310$ task instance definitions, a full planning implementation, and training datasets containing scripted interaction trajectories for each task category. A central design goal of our benchmark is to expose a simple interface -- a single forward prediction call -- so it is straightforward to evaluate almost any action-conditioned video prediction model. We then leverage our benchmark to study the effects of scaling model size, quantity of training data, and model ensembling by analyzing three highly-performant video prediction models, finding that while scale can improve perceptual quality when modelling visually diverse settings, other attributes such as uncertainty awareness can also aid planning performance.","benchmarking, video prediction, visual MPC, manipulation" Continual Learning Based on Sub-Networks and Task Similarity,https://openreview.net/forum?id=ncQCD9M8SwT,https://openreview.net/pdf?id=ncQCD9M8SwT,"A continual learning method based on sub-networks and task similarity is proposed and evaluated on NLP classification, generation, and extraction problems.","Continual learning (CL) has two main objectives: preventing catastrophic forgetting (CF) and encouraging knowledge transfer (KT) across tasks. The existing literature mainly tries to overcome CF. Although some papers have focused on both CF and KT, they may still suffer from CF because of their ineffective handling of previous tasks and/or poor task similarity detection mechanisms to achieve KT. This work presents a new CL method that addresses the above issues. First, it overcomes CF by isolating the knowledge of each task via a learned mask that indicates a sub-network. Second, it proposes a novel technique to compute how important each mask is to the new task, which indicates how the new task is similar to an underlying old task. Similar tasks can share the same mask/subnetwork for KT, while dissimilar tasks use different masks/sub-networks for CF prevention. 
Comprehensive experiments have been conducted using a range of NLP problems, including classification, generation, and extraction, to show that the proposed method consistently outperforms prior state-of-the-art baselines.","Continual learning, NLP tasks, Task Similarity, Sub-network" A Stable and Scalable Method for Solving Initial Value PDEs with Neural Networks,https://openreview.net/forum?id=vsMyHUq_C1c,https://openreview.net/pdf?id=vsMyHUq_C1c,"We develop Neural-IVP, a new method for solving initial value PDEs with Neural Networks that is both stable and scalable.","Unlike conventional grid- and mesh-based methods for solving PDEs, neural networks have the potential to break the curse of dimensionality, providing approximate solutions to high-dimensional PDEs. While global minimization of the PDE residual over the network parameters works well for boundary value problems, catastrophic forgetting limits its applicability to initial value problems. In an alternative local-in-time approach, the optimization problem can be converted into an ODE on the network parameters and the solution propagated forward in time; however, we demonstrate that current methods utilizing this idea suffer from two key issues. First, following the ODE produces an uncontrolled growth in the conditioning of the problem, ultimately leading to unacceptably large numerical errors. Second, as the ODE methods scale cubically with the number of model parameters, they are restricted to small neural networks, significantly limiting their ability to represent intricate PDE initial conditions and solutions. Building on these insights, we develop Neural-IVP, an ODE-based IVP solver which prevents the network from getting ill-conditioned and runs in time linear in the number of parameters, enabling us to evolve the dynamics of challenging high-dimensional PDEs with neural networks.","PDE, Neural PDE solvers, Initial value problems" Shallow Learning In Materio.,https://openreview.net/forum?id=npwbjVljAEU,https://openreview.net/pdf?id=npwbjVljAEU,,"We introduce Shallow Learning In Materio (SLIM) as a resource-efficient method to realize closed-loop higher-order perceptrons. Our SLIM method provides a rebuttal to the Minsky school's disputes with the Rosenblatt school about the efficacy of learning representations in shallow perceptrons. As a proof-of-concept, here we devise a physically-scalable realization of the parity function. Our findings are relevant to artificial intelligence engineers, as well as neuroscientists and biologists. ", How Can Deep Learning Performs Deep (Hierarchical) Learning,https://openreview.net/forum?id=j2ymLjCr-Sj,https://openreview.net/pdf?id=j2ymLjCr-Sj,"We present a theory to study *how* deep neural networks (of even super-constantly many layers) can perform hierarchical feature learning, on tasks that are not known to be efficiently solvable by non-hierarchical methods (such as kernel methods).","(this is a theory paper) Deep learning is also known as hierarchical learning, where the learner $\textit{learns}$ to represent a complex target function by decomposing it into a sequence of simpler functions to reduce sample and time complexity. This paper formally analyzes how multi-layer neural networks can perform such hierarchical learning $\textit{efficiently}$ and $\textit{automatically}$ by applying stochastic gradient descent (SGD) or its variants. On the conceptual side, we present a characterization of how certain deep (i.e.
super-constantly many layers) neural networks can still be trained sample- and time-efficiently on hierarchical learning tasks, when no known existing algorithm (including layer-wise training, kernel methods, etc.) is efficient. We establish a new principle called ``backward feature correction'', where \emph{the errors in the lower-level features can be automatically corrected when training together with the higher-level layers}. We believe this is a key reason why deep learning performs deep (hierarchical) learning, as opposed to layer-wise learning or simulating some known non-hierarchical method.", Data Subset Selection via Machine Teaching,https://openreview.net/forum?id=tGHi1HFNBx1,https://openreview.net/pdf?id=tGHi1HFNBx1,"We propose, analyze, and evaluate a machine teaching approach to data subset selection.","We study the problem of data subset selection: given a fully labeled dataset and a training procedure, select a subset such that training on that subset yields approximately the same test performance as training on the full dataset. We propose an algorithm, inspired by recent work in machine teaching, that has theoretical guarantees, compelling empirical performance, and is model-agnostic, meaning the algorithm's only information comes from the predictions of models trained on subsets. Furthermore, we prove lower bounds showing that our algorithm achieves a subset with near-optimal size (under computational hardness assumptions) while training on a number of subsets that is optimal up to extraneous log factors. We then empirically compare our algorithm, machine teaching algorithms, and coreset techniques on six common image datasets with convolutional neural networks. We find that our machine teaching algorithm can find a subset of CIFAR10 of size less than 16k that yields the same performance (5-6% error) as training on the full dataset of size 50k.","Data pruning, data selection, machine teaching" Do Summarization Models Synthesize?,https://openreview.net/forum?id=1PTeB4MWCfU,https://openreview.net/pdf?id=1PTeB4MWCfU,"We measure if multidocument summarization models can effectively synthesize contrasting inputs, and explore methods to change synthesis performance.","Multi-document summarization entails producing concise synopses of collections of inputs. For some applications, the synopsis should accurately \emph{synthesize} inputs with respect to a key property or aspect. For example, a synopsis of film reviews all written about a particular movie should reflect the average critic consensus. As a more consequential example, consider narrative summaries that accompany biomedical \emph{systematic reviews} of clinical trial results. These narratives should fairly summarize the potentially conflicting results from individual trials. In this paper we ask: To what extent do modern multi-document summarization models implicitly perform this type of synthesis? To assess this we perform a suite of experiments that probe the degree to which conditional generation models trained for summarization using standard methods yield outputs that appropriately synthesize inputs. We find that existing models do partially perform synthesis, but do so imperfectly. In particular, they are over-sensitive to changes in input ordering and under-sensitive to changes in input compositions (e.g., the ratio of positive to negative movie reviews).
We propose a simple, general method for improving model synthesis capabilities by generating an explicitly diverse set of candidate outputs, and then selecting from these the string best aligned with the expected aggregate measure for the inputs, or \emph{abstaining} when the model produces no good candidate. This approach improves model synthesis performance. Our hope is that by highlighting the need for synthesis (in some summarization settings), this work motivates further research into multi-document summarization methods and learning objectives that explicitly account for the need to synthesize. ","Summarization, Factuality, Sentiment, Systematic Reviews, Evidence Synthesis" CHiLS: Zero-Shot Image Classification with Hierarchical Label Sets,https://openreview.net/forum?id=NEEtm5laNK1,https://openreview.net/pdf?id=NEEtm5laNK1,,"Open vocabulary models (e.g. CLIP) have shown strong performance on zero-shot classification through their ability to generate embeddings for each class based on their (natural language) names. Prior work has focused on improving the accuracy of these models through prompt engineering or by incorporating a small amount of labeled downstream data (via finetuning). In this paper, we propose Classification with Hierarchical Label Sets (or CHiLS), an alternative strategy that proceeds in three steps: (i) for each class, produce a set of subclasses, using either existing label hierarchies or by querying GPT-3; (ii) perform the standard zero-shot CLIP procedure as though these subclasses were the labels of interest; (iii) map the predicted subclass back to its parent to produce the final prediction. Across numerous datasets, CHiLS leads to improved accuracy, yielding gains of over 30% in situations where known hierarchies are available and more modest gains when they are not. CHiLS is simple to implement within existing CLIP pipelines and requires no additional training cost.","open vocabulary models, CLIP, zero-shot learning, zero-shot image classification" Multi-Grid Tensorized Fourier Neural Operator for High Resolution PDEs,https://openreview.net/forum?id=po-oqRst4Xm,https://openreview.net/pdf?id=po-oqRst4Xm,"An efficient neural operator that leverages a novel multi-grid approach as well as a tensorized architecture for better performance, generalization and scalability.","Memory complexity and data scarcity are two main pressing challenges in learning solution operators of partial differential equations (PDE) at high resolutions. These challenges limited prior neural operator models to low/mid-resolution problems rather than full-scale real-world problems. Yet, these problems possess spatially local structures that are not exploited by previous approaches. We propose to exploit this natural structure of real-world phenomena to predict solutions locally and unite them into a global solution. Specifically, we introduce a neural operator that scales to large resolutions by leveraging local and global structures through decomposition of both the input domain and the operator's parameter space.
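The three-step CHiLS procedure above maps directly to code. A minimal sketch follows; the helper zero_shot_scores (returning, e.g., CLIP image-text similarities) and the subclass lists are assumed inputs, with the subclasses coming from an existing hierarchy or GPT-3 queries as the abstract describes.

```python
def chils_predict(image, subclasses, zero_shot_scores):
    """subclasses: dict mapping each parent class to a list of subclass names.
    zero_shot_scores(image, labels) -> dict[label, float] is an assumed helper,
    e.g. CLIP cosine similarities between the image and each label prompt."""
    # (i) + (ii): score every subclass as though it were the label of interest
    flat = [(parent, sub) for parent, subs in subclasses.items() for sub in subs]
    scores = zero_shot_scores(image, [sub for _, sub in flat])
    # (iii): map the best-scoring subclass back to its parent class
    best_parent, _ = max(flat, key=lambda ps: scores[ps[1]])
    return best_parent

# toy usage with a dummy scorer (real use would call an open-vocabulary model)
subclasses = {"dog": ["beagle", "poodle"], "cat": ["siamese", "tabby"]}
dummy = lambda image, labels: {l: len(l) for l in labels}  # stand-in scores
print(chils_predict(None, subclasses, dummy))  # -> "cat" ("siamese" scores highest)
```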
It consists of a multi-grid tensorized neural operator (MG-TFNO), a new data-efficient and highly parallelizable operator learning approach with reduced memory requirements and better generalization. MG-TFNO employs a novel multi-grid-based domain decomposition approach to exploit the spatially local structure in the data. Using the FNO as a backbone, MG-TFNO represents its parameters in a high-order latent subspace of the Fourier domain, through a global tensor factorization, resulting in an extreme reduction in the number of parameters and improved generalization. In addition, the low-rank regularization it applies to the parameters enables efficient learning in low-data regimes, which is particularly relevant for solving PDEs where obtaining ground-truth predictions is extremely costly and samples, therefore, are limited. We empirically verify the efficiency of our method on the turbulent Navier-Stokes equations where we demonstrate superior performance, with 2.5 times lower error, 10X compression of the model parameters, and 1.8X compression of the input domain size. Our tensorization approach yields up to a 400x reduction in the number of parameters without loss in accuracy. Similarly, our domain decomposition method gives a 7x reduction in the domain size while slightly improving accuracy. Furthermore, our method can be trained with far fewer samples than previous approaches, outperforming the FNO when trained with just half the samples.","Fourier-Neural-Operators, Tensorization, Multi-Grid" $\beta$-Stochastic Sign SGD: A Byzantine Resilient and Differentially Private Gradient Compressor for Federated Learning,https://openreview.net/forum?id=oVPqFCI1g7q,https://openreview.net/pdf?id=oVPqFCI1g7q,"Clients send stochastic sign-bits gradient estimates to a server, which aggregates updates based on majority vote. We show that this algorithm is provably differentially private, convergent, communication efficient, and Byzantine fault tolerant.","Federated Learning (FL) is a nascent privacy-preserving learning framework under which the local data of participating clients is kept locally throughout model training. Scarce communication resources and data heterogeneity are two defining characteristics of FL. Besides, an FL system is often implemented in a harsh environment -- leaving the clients vulnerable to Byzantine attacks. To the best of our knowledge, no gradient compressors simultaneously achieve quantitative Byzantine resilience and privacy preservation. In this paper, we fill this gap by revisiting the stochastic sign SGD \cite{Jin2020}. We propose $\beta$-stochastic sign SGD, which contains a gradient compressor that encodes a client's gradient information in sign bits subject to the privacy budget $\beta>0$. We show that as long as $\beta>0$, $\beta$-stochastic sign SGD converges in the presence of partial client participation and mobile Byzantine faults, showing that it achieves quantifiable Byzantine resilience and differential privacy simultaneously. In sharp contrast, when $\beta=0$, the compressor is not differentially private. Notably, for the special case when each of the stochastic gradients involved is bounded with known bounds, our gradient compressor with $\beta=0$ coincides with the compressor proposed in \cite{Jin2020}. As a byproduct, we show that when the clients report sign messages, the popular information aggregation rules (simple mean, trimmed mean, median, and majority vote) are identical in terms of the output signs.
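The global tensor factorization described above can be illustrated with a Tucker-style decomposition: all weights live in a small core plus one factor per mode, and the dense tensor is only materialized on demand. A toy sketch with invented shapes, not the paper's actual parameterization:

```python
import numpy as np

rank = (4, 4, 4, 4)
shape = (16, 32, 32, 8)   # illustrative: e.g. layers x channels x channels x modes
core = np.random.randn(*rank)
factors = [np.random.randn(s, r) for s, r in zip(shape, rank)]

# Reconstruct the full weight tensor: W = core x_0 F0 x_1 F1 x_2 F2 x_3 F3
W = np.einsum('abcd,ia,jb,kc,ld->ijkl', core, *factors)

n_full = W.size
n_factored = core.size + sum(f.size for f in factors)
print(f"parameter compression: {n_full / n_factored:.0f}x")
```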
Our theories are corroborated by experiments on the MNIST and CIFAR-10 datasets.","distributed systems, differential privacy, communication efficiency, convergence rate analysis, robust optimization" Sequential Brick Assembly with Efficient Constraint Satisfaction,https://openreview.net/forum?id=MRfbe7VAoqu,https://openreview.net/pdf?id=MRfbe7VAoqu,"We address the problem of generating a sequence of LEGO brick assembly with high-fidelity structures, satisfying physical constraints between bricks.","We address the problem of generating a sequence of LEGO brick assembly with high-fidelity structures, satisfying physical constraints between bricks. The assembly problem is challenging since the number of possible structures increases exponentially with the number of available bricks, which complicates satisfying the physical constraints across bricks. To tackle this problem, our method performs a brick structure assessment to predict the next brick position and its confidence by employing a U-shaped sparse 3D convolutional network. The convolution filter efficiently validates physical constraints in a parallelizable and scalable manner, allowing it to process different brick types. To generate a novel structure, we devise a sampling strategy to determine the next brick position by considering attachable positions under physical constraints. Instead of using handcrafted brick assembly datasets, our model is trained with a large number of 3D objects that allow it to create new high-fidelity structures. We demonstrate that our method successfully generates diverse brick structures while handling two different brick types and outperforms existing methods based on Bayesian optimization, graph generative model, and reinforcement learning, all of which are limited to a single brick type.","combinatorial problem, brick assembly" Cross-Domain Self-Supervised Deep Learning for Robust Alzheimer's Disease Progression Modeling,https://openreview.net/forum?id=VCyfx4aghT3,https://openreview.net/pdf?id=VCyfx4aghT3,,"Developing successful artificial intelligence systems in practice depends on both robust deep learning models and large, high-quality data. Acquiring and labeling data can become prohibitively expensive and time-consuming in many real-world applications such as clinical disease models. Self-supervised learning has demonstrated great potential in increasing model accuracy and robustness in small data regimes. In addition, many clinical imaging and disease modeling applications rely heavily on regression of continuous quantities. However, the applicability of self-supervised learning for these medical-imaging regression tasks has not been extensively studied. In this study, we develop a cross-domain self-supervised learning approach for disease prognostic modeling as a regression problem using 3D images as input. We demonstrate that self-supervised pre-training can improve the prediction of Alzheimer's Disease progression from brain MRI. We also show that pre-training on extended (but not labeled) brain MRI data outperforms pre-training on natural images.
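The claim above, that mean, trimmed mean, median, and majority vote agree on the output signs when clients report sign messages, is easy to sanity-check numerically. A toy check with an odd number of clients so that no ties occur:

```python
import numpy as np

rng = np.random.default_rng(0)
signs = rng.choice([-1.0, 1.0], size=(11, 5))   # 11 clients x 5 coordinates

majority = np.sign(signs.sum(axis=0))
mean_rule = np.sign(signs.mean(axis=0))
median_rule = np.sign(np.median(signs, axis=0))
trimmed = np.sign(np.sort(signs, axis=0)[2:-2].mean(axis=0))  # trim 2 per side

assert (majority == mean_rule).all()
assert (majority == median_rule).all()
assert (majority == trimmed).all()
```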
We further observe that the highest performance is achieved when both natural images and extended brain-MRI data are used for pre-training.","Self-supervision, regression, 3D imaging, transfer learning, predictive modeling" Data-Efficient Finetuning Using Cross-Task Nearest Neighbors,https://openreview.net/forum?id=WtW_s7EDWPe,https://openreview.net/pdf?id=WtW_s7EDWPe,We use unlabelled task-specific data to select subsets of massive multitask datasets and show that language models fine-tuned on these subsets outperform models trained on all available data for unseen tasks in zero and few-shot settings.,"Language models trained on massive prompted multitask datasets like T0 (Sanh et al., 2021) or FLAN (Wei et al., 2021) can generalize to tasks unseen during training. We show that training on a carefully chosen subset of instances can outperform training on all available data on a variety of datasets. We assume access to a small number (250-1000) of unlabeled target task instances, select their nearest neighbors from a pool of multitask data, and use the retrieved data to train target task specific models. Our method is more data-efficient than training a single multitask model, while still outperforming it by large margins. We evaluate across a diverse set of tasks not in the multitask pool we retrieve from, including those used to evaluate T0 and, in addition, more complex tasks including legal and scientific document QA. We retrieve small subsets of P3 (the collection of prompted datasets from which T0’s training data was sampled) and finetune T5 models that outperform the 3-billion parameter variant of T0 (T0-3B) by 8-30% on 11 out of 12 evaluation datasets while using at most 2% of the data used to train T0-3B. These models also provide a better initialization than T0-3B for few-shot finetuning on target-task data, as shown by a 3-23% relative improvement over few-shot finetuned T0-3B models on 8 datasets.","multitasking, retrieval, few-shot, efficiency, nlp" "Heavy-tailed Noise Does Not Explain the Gap Between SGD and Adam, but Sign Descent Might",https://openreview.net/forum?id=a65YK0cqH8g,https://openreview.net/pdf?id=a65YK0cqH8g,"A hypothesis for Adam's success is that it handles heavy-tailed noise better than SGD, but it works even better without noise; with big batch sizes, it performs very similarly to sign descent, which might help explain why it works.","The success of Adam on a wide array of architectures has made it the default in settings where stochastic gradient descent (SGD) performs poorly. However, our theoretical understanding of this discrepancy is lagging, preventing significant improvements on either algorithm. Recent work advances the hypothesis that Adam and other heuristics like gradient clipping outperform SGD on language tasks because the distribution of the error induced by stochasticity has heavy tails. This hypothesis suggests that the underlying mechanism causing the gap is a more robust estimator of the gradient. We evaluate this hypothesis by varying the batch size, up to the entire dataset, controlling for stochasticity. We find evidence that stochasticity and heavy-tailed noise are not major factors in the performance gap between SGD and Adam. Rather, SGD does not leverage reductions in noise due to larger batches as well as Adam. This raises the question as to why Adam outperforms SGD in the full-batch setting.
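A minimal sketch of the retrieval step in the data-selection recipe above, assuming precomputed, L2-normalized embeddings for the multitask pool and the unlabeled target instances (the paper's actual encoder and index may differ):

```python
import numpy as np

def cross_task_neighbors(pool_emb, target_emb, k=500):
    """Select a finetuning subset: the union of the k nearest pool instances
    for each unlabeled target-task instance, by cosine similarity."""
    sims = target_emb @ pool_emb.T             # cosine similarity (unit vectors)
    topk = np.argsort(-sims, axis=1)[:, :k]    # k neighbours per target instance
    return np.unique(topk)                     # union -> indices into the pool
```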
Checking simple normalized variants of SGD, we find that the behavior of Adam with increasing batch sizes is most consistent with sign descent.","optimization for deep learning, adaptive methods, adam, rmsprop, sgd, sign descent, noise, stochasticity, full batch" BiAdam: Fast Adaptive Bilevel Optimization Methods,https://openreview.net/forum?id=sBfTc3SD9gp,https://openreview.net/pdf?id=sBfTc3SD9gp,,"Bilevel optimization has recently attracted increased interest in machine learning due to its many applications such as hyper-parameter optimization and meta-learning. Although many bilevel optimization methods have recently been proposed, these methods do not consider using adaptive learning rates. It is well known that adaptive learning rates can accelerate many optimization algorithms, including (stochastic) gradient-based algorithms. To fill this gap, in this paper we propose a novel fast adaptive bilevel framework for solving bilevel optimization problems in which the outer problem is possibly nonconvex and the inner problem is strongly convex. Our framework uses unified adaptive matrices including many types of adaptive learning rates, and can flexibly use momentum and variance-reduction techniques. In particular, we provide a useful convergence analysis framework for bilevel optimization. Specifically, we propose a fast single-loop adaptive bilevel optimization (BiAdam) algorithm based on the basic momentum technique, which achieves a sample complexity of $\tilde{O}(\epsilon^{-4})$ for finding an $\epsilon$-stationary point. Meanwhile, we propose an accelerated version of the BiAdam algorithm (VR-BiAdam) using a variance-reduction technique, which reaches the best known sample complexity of $\tilde{O}(\epsilon^{-3})$ without relying on large batch sizes. To the best of our knowledge, we are the first to study adaptive bilevel optimization methods with adaptive learning rates. Some experimental results on data hyper-cleaning and hyper-representation learning tasks demonstrate the efficiency of the proposed algorithms.","Bilevel Optimization, Momentum, Adaptive Learning Rate, Variance Reduced, Hyper-Parameter Learning" Building Normalizing Flows with Stochastic Interpolants,https://openreview.net/forum?id=li7qeBbCR1t,https://openreview.net/pdf?id=li7qeBbCR1t,"A method is proposed to construct normalizing flows based on stochastic interpolants, yielding an efficient training algorithm compared to equivalent ODE methods, and providing a theoretical framework to map score based diffusions to ODEs.","A simple generative model based on a continuous-time normalizing flow between any pair of base and target distributions is proposed. The velocity field of this flow is inferred from the probability current of a time-dependent distribution that interpolates between the base and the target in finite time. Unlike conventional normalizing flow inference methods based on the maximum likelihood principle, which require costly backpropagation through ODE solvers, our interpolant approach leads to a simple quadratic loss for the velocity itself, which is expressed in terms of expectations that are readily amenable to empirical estimation. The flow can be used to generate samples from either the base or target, and can be used to estimate the likelihood at any time along the interpolant. The approach is contextualized in its relation to diffusions.
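For reference, the sign descent rule mentioned above discards gradient magnitudes entirely; a one-line sketch:

```python
import numpy as np

def sign_descent_step(params, grads, lr=1e-3):
    """Sign descent: move each parameter by a fixed step against the gradient
    sign. Per the abstract above, full-batch Adam behaves most like this rule."""
    return [p - lr * np.sign(g) for p, g in zip(params, grads)]
```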
In particular, in situations where the base is a Gaussian distribution, we show that the velocity of our normalizing flow can also be used to construct a diffusion model to sample the target as well as estimate its score. This allows one to map methods based on stochastic differential equations to those of ordinary differential equations, simplifying the mechanics of the model while capturing equivalent dynamics. Benchmarking on density estimation tasks illustrates that the learned flow can match and surpass maximum likelihood continuous flows at a fraction of the conventional ODE training costs.","normalizing flow, ODE, generative model" Elicitation Inference Optimization for Multi-Principal-Agent Alignment,https://openreview.net/forum?id=dz8i-yzXeVg,https://openreview.net/pdf?id=dz8i-yzXeVg,We integrate an LLM with a latent factor model to predict individuals’ agreement on text perspectives with increasing efficiency at scale,"In multi-principal-agent alignment scenarios spanning governance, markets, diplomacy, and AI, it is infeasible to elicit every principal's view on all perspectives relevant to agent decisions. Elicitation inference optimization (EIO) aims to minimize the $n$ elicitations needed to approximate $N$ principals' views across $K$ perspectives. In this work, we demonstrate an EIO approach where data efficiency ($NK/n$) increases with scale. We introduce STUMP: an elicitation inference model which integrates an LLM with a latent factor model to enable learning transfer across samples, contexts, and languages. Then, we characterize STUMP's performance on a set of elicitation primitives from which scalable elicitation (sampling) protocols can be constructed. Building from these results, we design and demonstrate two scalable elicitation protocols for STUMP where data efficiency grows boundlessly, scaling like $O(n)$ in the number of elicitations $n$. This makes it possible to obtain complex, high-dimensional preference signals spanning principal populations at any scale.","alignment, large language models, LLMs, NLP, transfer learning, human-centered AI, LLMs, preference modeling" Dual Student Networks for Data-Free Model Stealing,https://openreview.net/forum?id=VE1s3e5xriA,https://openreview.net/pdf?id=VE1s3e5xriA,,"Data-free model stealing aims to replicate a target model without direct access to either the training data or the target model. To accomplish this, existing methods use a generator to produce samples in order to train a student model to match the target model outputs. To this end, the two main challenges are estimating gradients of the target model without access to its parameters, and generating a diverse set of images that thoroughly explores the input space. We propose a Dual Student method where two students are symmetrically trained in order to provide the generator a criterion to generate samples that the two students disagree on. On one hand, disagreement on a sample implies at least one student has classified the sample incorrectly when compared with the target model. This push towards disagreeing samples implicitly encourages exploring a more diverse region of input space. On the other hand, our method utilizes gradients of student models to indirectly estimate gradients of the target model. We show that this novel training objective for the generator network is equivalent to optimizing a lower bound on the generator’s loss if we had access to the target model gradients.
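The quadratic velocity loss described above is straightforward to write down. A minimal PyTorch sketch, assuming the simple linear interpolant $x_t = (1-t)x_0 + t x_1$ and flat `(batch, dim)` inputs (the paper covers more general interpolants):

```python
import torch

def velocity_matching_loss(v_net, x0, x1):
    """Quadratic loss for the velocity field: no ODE solves and no backprop
    through a solver, just regression of v(x_t, t) onto the interpolant's
    time derivative, which is x1 - x0 in the linear case."""
    t = torch.rand(x0.shape[0], 1)      # one time per sample in [0, 1)
    xt = (1 - t) * x0 + t * x1          # stochastic interpolant (linear case)
    target = x1 - x0                    # d x_t / dt for the linear interpolant
    return ((v_net(xt, t) - target) ** 2).mean()
```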
In other words, our method alters the standard data-free model stealing paradigm by substituting the target model with a separate student model, thereby creating a lower bound which can be directly optimized without additional target model queries or separate synthetic datasets. We show that our new optimization framework provides more accurate gradient estimation of the target model and better accuracies on benchmark classification datasets. Additionally, our approach balances improved query efficiency with training computation cost. Finally, we demonstrate that our method serves as a better proxy model for transfer-based adversarial attacks than existing data-free model stealing methods.", Augmentation Curriculum Learning For Generalization in RL,https://openreview.net/forum?id=Fj1S0SV8p3U,https://openreview.net/pdf?id=Fj1S0SV8p3U,"Combining data augmentation, reinforcement learning and curriculum learning for generalization in reinforcement learning","Many Reinforcement Learning tasks rely solely on pixel-based observations of the environment. During deployment, these observations can fall victim to visual perturbations and distortions, causing the agent’s policy to significantly degrade in performance. This motivates the need for robust agents that can generalize in the face of visual distribution shift. One common technique for doing this is to apply augmentations during training; however, it comes at the cost of performance. We propose Augmentation Curriculum Learning, a novel curriculum learning approach that schedules augmentation during training into a weak augmentation phase and a strong augmentation phase. We also introduce a novel visual augmentation strategy that aids performance on the benchmarks we evaluate. Our method achieves state-of-the-art performance on the DeepMind Control Generalization Benchmark.","reinforcement learning, generalization, pixel-based RL, embodied learning" Composite Slice Transformer: An Efficient Transformer with Composition of Multi-Scale Multi-Range Attentions,https://openreview.net/forum?id=nWTzIsgrYNN,https://openreview.net/pdf?id=nWTzIsgrYNN,We propose an efficient Transformer based on composition of multi-scale attention with stacked slice representation and show that it outperforms the state-of-the-art efficient transformers in multiple benchmarks.,"Since the introduction of Transformers, researchers have tackled the notoriously expensive quadratic complexity problem. While significant computational efficiency improvements have been achieved, they often come at the cost of reduced accuracy. In this paper, we propose Composite Slice Transformer (CST), a Transformer-based network equipped with a composition of multi-scale multi-range attentions, boosting both efficiency and modeling capability. After stacking fixed-length slices of the input sequence, each layer in CST performs a pair of fine-and-coarse-grained attentions with short-long ranges in a sequential manner, coupled with volatile instant positional embedding, enabling efficient token interactions {\em and} improving expressiveness of the model. In addition to significantly reduced $O(NL+N^2/L^2)$ complexity for sequence length $N$ and slice length $L$, CST achieves superior performance on a variety of tasks. We show that CST surpasses recently published efficient Transformers on the Long Range Arena benchmark, demonstrating the bidirectional long-range dependency modeling capability of our model.
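A sketch of the dual-student generator criterion described above; the network interfaces (classifier logits in, L1 disagreement out) are assumptions, not the paper's exact loss:

```python
import torch

def generator_disagreement_loss(generator, student_a, student_b, z):
    """The generator is rewarded for samples on which the two symmetrically
    trained students diverge, so its loss is the negative disagreement."""
    x = generator(z)
    pa = torch.softmax(student_a(x), dim=-1)
    pb = torch.softmax(student_b(x), dim=-1)
    disagreement = (pa - pb).abs().sum(dim=-1).mean()  # L1 gap between students
    return -disagreement                               # minimise => maximise gap
```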
It outperforms the standard Transformer by a margin of $6.9$\% in average accuracy across the five classification tasks of the benchmark, while being of complexity comparable to other efficient transformers. Furthermore, on the word-level autoregressive language modeling task with the WikiText-103 dataset, CST performs competitively against the Transformer model with only a $2$\% gap in test perplexity while outperforming other efficient Transformers.","transformer, efficient transformer, efficient attention" Graph Fourier MMD for signals on data graphs,https://openreview.net/forum?id=jH6pg6JaSP2,https://openreview.net/pdf?id=jH6pg6JaSP2,We introduce a new efficient MMD measure computed analytically with an explicit feature map for signals on data graphs. ," While numerous methods have been proposed for computing distances between probability distributions in Euclidean space, relatively little attention has been given to computing such distances for distributions on graphs. However, there has been a marked increase in data that either lies on a graph (such as protein interaction networks) or can be modeled as a graph (single-cell data), particularly in the biomedical sciences. Thus, it becomes important to find ways to compare signals defined on such graphs. Here, we propose Graph Fourier MMD (GFMMD), a novel distance between distributions, or non-negative signals, on graphs. GFMMD is defined via an optimal witness function that is both smooth on the graph and maximizes the difference in expectation between the pair of distributions on the graph. We find an analytical solution to this optimization problem as well as an embedding of distributions that results from this method. We also prove several properties of this method including scale invariance and applicability to disconnected graphs. We showcase it on graph benchmark datasets as well as on single-cell RNA-sequencing data analysis. In the latter, we use the GFMMD-based gene embeddings to find meaningful gene clusters. We also propose a novel type of score for gene selection called {\em gene localization score} which helps select genes for cellular state space characterization.","Optimal transport, Data Geometry, Graph Signal Processing" Equal Improvability: A New Fairness Notion Considering the Long-term Impact,https://openreview.net/forum?id=dhYUMMy0_Eg,https://openreview.net/pdf?id=dhYUMMy0_Eg,We propose a new group fairness notion called Equal Improvability that equalizes the improvement of the rejected individuals across different groups.,"Devising a fair classifier that does not discriminate against different groups is an important problem in machine learning. Although researchers have proposed various ways of defining group fairness, most of them focus only on immediate fairness, ignoring the long-term impact of a fair classifier under the dynamic scenario where each individual can improve its features over time. Such dynamic scenarios happen in the real world, e.g., college admissions and credit loaning, where each rejected sample makes an effort to change its features to get accepted afterwards. In this dynamic setting, long-term fairness should equalize the samples’ feature distribution across different groups after the rejected samples make some effort to improve.
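To give a flavor of the graph-spectral construction above, here is a heavily hedged numpy sketch: it measures the difference of two graph signals after a smoothing filter in the Laplacian eigenbasis. The filter choice is an illustrative assumption, not the paper's derived optimal witness function.

```python
import numpy as np

def spectral_signal_distance(L, p, q,
                             filt=lambda lam: 1.0 / np.sqrt(np.maximum(lam, 1e-12))):
    """Distance between graph signals p and q via spectral filtering of p - q.
    L is a symmetric graph Laplacian; zero (constant) eigenmodes are dropped."""
    lam, U = np.linalg.eigh(L)
    diff_hat = U.T @ (p - q)        # graph Fourier transform of the difference
    weights = filt(lam)
    weights[lam < 1e-9] = 0.0       # ignore the constant mode(s)
    return float(np.linalg.norm(weights * diff_hat))
```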
In order to promote long-term fairness, we propose a new fairness notion called Equal Improvability (EI), which equalizes the potential acceptance rate of the rejected samples across different groups, assuming a bounded level of effort will be spent by each rejected sample. We analyze the properties of EI and its connections with existing fairness notions. To find a classifier that satisfies the EI requirement, we propose and study three different approaches that solve EI-regularized optimization problems. Through experiments on both synthetic and real datasets, we demonstrate that the proposed EI-regularized algorithms help us find a classifier that is fair in terms of EI. Finally, we provide experimental results on dynamic scenarios, which highlight the advantages of our EI metric in achieving long-term fairness. Code is available in an anonymous GitHub repository.","Fairness and Bias in Artificial Intelligence, Machine Learning" Does progress on ImageNet transfer to real world datasets?,https://openreview.net/forum?id=7T2XgpklLDA,https://openreview.net/pdf?id=7T2XgpklLDA,,"Does progress on ImageNet transfer to real world datasets? We investigate this question by evaluating ImageNet pre-trained models with varying accuracy (57% - 83%) on six practical image classification datasets. In particular, we study datasets collected with the goal of solving real world tasks (e.g., classifying images from camera traps or satellites), as opposed to web-scraped benchmarks collected for comparing models. On multiple datasets, models with higher ImageNet accuracy do not consistently yield performance improvements. For certain tasks, interventions such as data augmentation improve performance even when architectures do not. We hope that future benchmarks will include more diverse datasets to encourage a more comprehensive approach to improving learning algorithms.", Competitive Physics Informed Networks ,https://openreview.net/forum?id=z9SIj-IM7tn,https://openreview.net/pdf?id=z9SIj-IM7tn,We introduce competitive physics informed networks where two neural networks solve a partial differential equation by playing a zero-sum game.,"Neural networks can be trained to solve partial differential equations (PDEs) by using the PDE residual as the loss function. This strategy is called ""physics-informed neural networks"" (PINNs), but it currently cannot produce high-accuracy solutions, typically attaining about $0.1\%$ relative error. We present an adversarial approach that overcomes this limitation, which we call competitive PINNs (CPINNs). CPINNs train a discriminator that is rewarded for predicting mistakes the PINN makes. The discriminator and PINN participate in a zero-sum game with the exact PDE solution as an optimal strategy. This approach avoids squaring the large condition numbers of PDE discretizations, which is the likely reason for failures of previous attempts to decrease PINN errors even on benign problems. Numerical experiments on a Poisson problem show that CPINNs achieve errors four orders of magnitude smaller than the best-performing PINN. We observe relative errors on the order of single-precision accuracy, consistently decreasing with each epoch. To the authors' knowledge, this is the first time this level of accuracy and convergence behavior has been achieved.
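A toy version of the Equal Improvability quantity above for a linear scorer: a rejected sample counts as improvable if a bounded effort along the score gradient could push it over the threshold. Illustrative assumptions throughout; the paper defines effort more generally.

```python
import numpy as np

def ei_gap(scores, groups, effort, w_norm, thresh=0.0):
    """Gap in improvability rates across groups for a linear model.
    A rejected sample (score < thresh) is improvable if spending `effort`
    along the weight direction suffices: score + effort * ||w|| >= thresh."""
    rates = []
    for g in np.unique(groups):
        rejected = (groups == g) & (scores < thresh)
        if rejected.any():
            rates.append((scores[rejected] + effort * w_norm >= thresh).mean())
        else:
            rates.append(np.nan)
    return np.nanmax(rates) - np.nanmin(rates)   # 0 means EI is satisfied
```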
Additional experiments on the nonlinear Schr{\""o}dinger, Burgers', and Allen--Cahn equations show that the benefits of CPINNs are not limited to linear problems.","Physics informed learning, multi-agent games, Lagrange multipliers, partial differential equations" Decomposed Prompting: A Modular Approach for Solving Complex Tasks,https://openreview.net/forum?id=_nGgzQjzaRy,https://openreview.net/pdf?id=_nGgzQjzaRy,A new few-shot prompting approach to solve complex tasks by decomposing them into a shared library of prompts.,"Few-shot prompting is a surprisingly powerful way to use Large Language Models (LLMs) to solve various tasks. However, this approach struggles as the task complexity increases or when the individual reasoning steps of the task themselves are hard to learn, especially when embedded in more complex tasks. To address this, we propose Decomposed Prompting, a new approach to solve complex tasks by decomposing them (via prompting) into simpler sub-tasks that can be delegated to a library of prompting-based LLMs dedicated to these sub-tasks. This modular structure allows each prompt to be optimized for its specific sub-task, further decomposed if necessary, and even easily replaced with more effective prompts, trained models, or symbolic functions if desired. We show that the flexibility and modularity of Decomposed Prompting allows it to outperform prior work on few-shot prompting using GPT3. On symbolic reasoning tasks, we can further decompose sub-tasks that are hard for LLMs into even simpler solvable sub-tasks. When the complexity comes from the input length, we can recursively decompose the task into the same task but with smaller inputs. We also evaluate our approach on textual multi-step reasoning tasks: on a long-context multi-hop QA task, we can more effectively teach the sub-tasks via separate sub-task prompts; and on open-domain multi-hop QA, we can incorporate symbolic information retrieval within our decomposition framework, leading to improved performance on both tasks.","prompting, decomposition, in-context learning, reasoning, few-shot prompts, multi-step reasoning" Designing and Using Goal-Conditioned Tools,https://openreview.net/forum?id=i0lHs3ji9xT,https://openreview.net/pdf?id=i0lHs3ji9xT,We propose a framework for learning goal-conditioned design and manipulation policies for robotic tool use. ,"When limited by their own morphologies, humans and some species of animals have the remarkable ability to use objects from the environment towards accomplishing otherwise impossible tasks. Embodied agents might similarly unlock a range of additional capabilities through tool use. Recent techniques for jointly optimizing morphology and control via deep learning output effective solutions for tasks such as designing locomotion agents. But while designing a single-goal morphology makes sense for locomotion, manipulation involves a wide variety of strategies depending on the task goals at hand. An agent must be capable of rapidly prototyping specialized tools for different goals. Therefore, we propose the idea of learning a designer policy, rather than a single design. A designer policy is conditioned on task goals, and outputs a design for a tool that helps solve the task. A design-agnostic controller policy can then perform manipulation using these tools. In this work, we introduce a reinforcement learning framework for learning these policies.
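A minimal skeleton of the decomposed-prompting control flow described above. `llm(prompt)` is a hypothetical few-shot-prompted LLM call, and the prompts are invented placeholders rather than the paper's prompt library:

```python
def decomposed_qa(question, llm):
    """Decompose a complex question, delegate sub-questions to dedicated
    prompts, then compose the sub-answers into a final answer."""
    # Step 1: a decomposer prompt splits the task into simpler sub-questions.
    subqs = llm(f"Decompose into sub-questions:\n{question}").splitlines()
    # Step 2: each sub-question goes to its own sub-task prompt, accumulating
    # answers as context for later steps.
    context = ""
    for sq in subqs:
        answer = llm(f"Answer using context:{context}\nQ: {sq}")
        context += f"\nQ: {sq}\nA: {answer}"
    # Step 3: a final prompt composes the sub-answers into the overall answer.
    return llm(f"Given:{context}\nAnswer the original question: {question}")
```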
Through simulated manipulation tasks, we show that this framework is more sample-efficient than black-box optimization methods in multi-goal settings. It can also perform zero-shot interpolation or finetuning to tackle previously unseen goals. Finally, we demonstrate that our framework allows tradeoffs between the complexity of design and control policies when required by practical constraints.","Design, Manipulation, RL, Tool use" Post-mortem on a deep learning contest: a Simpson’s paradox and the complementary roles of scale metrics versus shape metrics,https://openreview.net/forum?id=qDQRvlFfz-K,https://openreview.net/pdf?id=qDQRvlFfz-K,diagnostics for neural network models to understand better their performance,"To better understand good generalization performance in state-of-the-art neural network (NN) models, and in particular the success of the AlphaHat metric based on Heavy-Tailed Self-Regularization (HT-SR) theory, we analyze a corpus of models that was made publicly available for a contest to predict the generalization accuracy of NNs. These models include a wide range of qualities and were trained with a range of architectures and regularization hyperparameters. We break AlphaHat into its two subcomponent metrics: a scale-based metric and a shape-based metric. We identify what amounts to a Simpson’s paradox: where “scale” metrics (from traditional statistical learning theory) perform well in aggregate, but can perform poorly on subpartitions of the data of a given depth, when regularization hyperparameters are varied; and where “shape” metrics (from HT-SR theory) perform well on each subpartition of the data, when hyperparameters are varied for models of a given depth, but can perform poorly overall when models with varying depths are aggregated. Our results highlight the subtlety of comparing models when both architectures and hyperparameters are varied; the complementary role of implicit scale versus implicit shape parameters in understanding NN model quality; and the need to go beyond one-size-fits-all metrics based on upper bounds from generalization theory to describe the performance of NN models. Our results also clarify further why the AlphaHat metric from HT-SR theory works so well at predicting generalization across a broad range of CV and NLP models.","model diagnostics, heavy-tailed self regularization, Simpson's paradox" ProtFIM: Fill-in-Middle Protein Sequence Design via Protein Language Models,https://openreview.net/forum?id=9XAZBUfnefS,https://openreview.net/pdf?id=9XAZBUfnefS, We propose a new evaluation scheme and protein language model for fill-in-middle protein sequence design.,"Following the finding that a protein's sequence determines its structure and function, engineering protein sequences allows us to optimize the functions of proteins for specific purposes, such as enhancement of catalytic activity or binding affinity maturation. In protein engineering, there are many cases where the amino acids in the middle of a protein sequence are changed while maintaining the remaining residues, to avoid unwanted functional changes. However, existing research on protein sequence design via protein language models (PLMs) has focused on modifying suffix residues by prompting prefix residues to the model or mutating the overall sequence residues. This is unsuitable for scenarios where the residues located in the middle of the sequence are to be optimized.
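The two-policy structure described above (goal-conditioned designer, design-agnostic controller) reduces to a simple episode loop; all interfaces below are assumptions for illustration:

```python
def rollout(designer, controller, env, goal):
    """One episode: the designer commits to a tool for this goal, then the
    controller manipulates with it; the return can train both policies with RL."""
    tool = designer(goal)                     # one design decision per goal
    obs = env.reset(goal=goal, tool=tool)
    done, total_reward = False, 0.0
    while not done:
        action = controller(obs, tool)        # controller is design-agnostic
        obs, reward, done = env.step(action)  # assumed 3-tuple step interface
        total_reward += reward
    return total_reward
```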
In this work, we suggest a PLM-based framework to solve fill-in-middle (FIM) protein engineering tasks. To evaluate the performance of PLMs on FIM tasks, we design a novel evaluation scheme where PLMs are tasked to generate new sequences while maintaining the secondary structures. Also, we propose a new PROTein language model specialized for the Fill-In-Middle task, ProtFIM. Experiments confirm that ProtFIM performs FIM engineering efficiently, especially for alpha-helix structures, and provides decent protein representations of sequence-function relationships. Finally, we demonstrate an artificial protein sequence design framework composed of ProtFIM and a high-quality structure predictor as a novel tool to optimize protein sequences.","Protein language modeling, Protein engineering, Text infilling" Beyond Deep Learning: An Evolutionary Feature Engineering Approach to Tabular Data Classification,https://openreview.net/forum?id=3C9Eqd0hCrr,https://openreview.net/pdf?id=3C9Eqd0hCrr,,"In recent years, deep learning has achieved impressive performance in the computer vision and natural language processing domains. In the tabular data classification scenario, with the emergence of the transformer architecture, a number of algorithms have been reported to yield better results than conventional tree-based models. Most of these methods attribute the success of deep learning methods to the expressive feature construction capability of neural networks. Nonetheless, in practice, manually designed high-order features with traditional machine learning methods are still widely used because neural-network-based features are prone to overfitting. In this paper, we propose an evolution-based feature engineering algorithm to imitate the manual feature construction process through trial and improvement. Importantly, the evolutionary method provides an opportunity to optimize cross-validation loss, which gradient methods cannot do. On a large-scale classification benchmark of 119 datasets, the experimental results demonstrate that the proposed method outperforms existing fine-tuned state-of-the-art tree-based and deep-learning-based classification algorithms.","Automated Feature Construction, Automated Machine Learning, Genetic Programming, Evolutionary Algorithm" Proportional Multicalibration,https://openreview.net/forum?id=lw1WKaIL3LR,https://openreview.net/pdf?id=lw1WKaIL3LR,"We study a fairness criterion called proportional multicalibration that unites two fairness measures, multicalibration and differential calibration, under one simple post-processing strategy.","Multicalibration is a desirable fairness criterion that constrains calibration error among flexibly-defined groups in the data while maintaining overall calibration. However, when outcome probabilities are correlated with group membership, multicalibrated models can exhibit a higher percent calibration error among groups with lower base rates than groups with higher base rates. As a result, it remains possible for a decision-maker to learn to trust or distrust model predictions for specific groups. To alleviate this, we propose proportional multicalibration, a criterion that constrains the percent calibration error among groups and within prediction bins. We prove that satisfying proportional multicalibration bounds a model's multicalibration as well as its differential calibration, a stronger fairness criterion inspired by the fairness notion of sufficiency.
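Fill-in-middle training is usually implemented as a simple sequence reordering so a left-to-right model can condition on both prefix and suffix; a generic sketch (whether ProtFIM uses exactly these sentinel tokens is an assumption):

```python
def fim_transform(tokens, lo, hi, PRE="<PRE>", SUF="<SUF>", MID="<MID>"):
    """Standard FIM reordering: the span tokens[lo:hi] becomes the generation
    target, conditioned on the fixed prefix and suffix residues."""
    prefix, middle, suffix = tokens[:lo], tokens[lo:hi], tokens[hi:]
    return [PRE, *prefix, SUF, *suffix, MID, *middle]

# Example: redesign residues 3..5 of a toy sequence while fixing the rest.
# fim_transform(list("MKTAYIAKQR"), 3, 6)
# -> ['<PRE>', 'M', 'K', 'T', '<SUF>', 'A', 'K', 'Q', 'R', '<MID>', 'A', 'Y', 'I']
```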
We provide an efficient algorithm for post-processing risk prediction models for proportional multicalibration and evaluate it empirically. We conduct simulation studies and investigate a real-world application of PMC post-processing to the prediction of emergency department patient admissions. We observe that proportional multicalibration is a promising criterion for controlling simultaneous measures of calibration fairness of a model over intersectional groups with virtually no cost in terms of classification performance. ","fairness, calibration, risk prediction, multicalibration, clinical decision support" On The Impact of Machine Learning Randomness on Group Fairness,https://openreview.net/forum?id=n_SwMH9o-oT,https://openreview.net/pdf?id=n_SwMH9o-oT,,"Statistical measures for group fairness in machine learning reflect the gap in performance of algorithms across different groups. These measures, however, exhibit a high variance between different training instances, which makes them unreliable for the empirical evaluation of fairness. What is the cause of this variance, and how can we reduce it? We investigate the impact of different sources of randomness in machine learning on group fairness. We show that the variance in group fairness measures is mainly due to the high volatility of the learning process on under-represented groups, which itself is largely caused by the stochasticity of data order during training. Based on these findings, we show how to manipulate group-level accuracy (i.e., model fairness), with high efficiency and negligible impact on the overall predictive power of the model, by changing the data order.", Using the Training History to Detect and Prevent Overfitting in Deep Learning Models,https://openreview.net/forum?id=mzrNhoaHRDc,https://openreview.net/pdf?id=mzrNhoaHRDc,"We propose a time series based method to: (1) detect overfitting in a trained model, and (2) prevent overfitting from happening in training. ","Overfitting of deep learning models on training data leads to poor generalizability on unseen data. Overfitting can be (1) prevented (e.g., using dropout or early stopping) or (2) detected in a trained model (e.g., using correlation-based methods). We propose a method that can both detect and prevent overfitting based on the training history (i.e., validation losses). Our method first trains a time series classifier on training histories of overfit models. This classifier is then used to detect if a trained model is overfit. In addition, our trained classifier can be used to prevent overfitting by identifying the optimal point to stop a model's training. We evaluate our method on its ability to identify and prevent overfitting in real-world samples (collected from papers published in the last 5 years at top AI venues). We compare our method against correlation-based detection methods and the most commonly used prevention method (i.e., early stopping). Our method achieves an F1 score of 0.91 which is at least 5% higher than the current best-performing non-intrusive overfitting detection method.
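The per-group, per-bin percent calibration error that proportional multicalibration constrains can be computed directly; a sketch with an assumed equal-width binning:

```python
import numpy as np

def percent_calibration_error(y, p, groups, bins=10):
    """|mean(y) - mean(p)| / mean(p) within each (group, prediction-bin) cell,
    i.e. calibration error expressed as a fraction of the predicted risk."""
    edges = np.linspace(0, 1, bins + 1)
    errors = {}
    for g in np.unique(groups):
        for b in range(bins):
            cell = (groups == g) & (p >= edges[b]) & (p < edges[b + 1])
            if cell.sum() > 0:
                mean_p = max(p[cell].mean(), 1e-9)
                errors[(g, b)] = abs(y[cell].mean() - p[cell].mean()) / mean_p
    return errors
```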
In addition, our method can find the optimal stopping point and avoid overfitting at least 32% earlier than early stopping, while achieving accuracy at least as good as (and often better than) early stopping.","overfitting, early stopping, deep learning" Multi-scale Sinusoidal Embeddings Enable Learning on High Resolution Mass Spectrometry Data,https://openreview.net/forum?id=WY0g8Gu58at,https://openreview.net/pdf?id=WY0g8Gu58at,,"Small molecules in biological samples are studied to provide information about disease states, environmental toxins, natural product drug discovery, and many other applications. The primary window into the composition of small molecule mixtures is tandem mass spectrometry (MS2), which produces data that are of high sensitivity and part per million resolution. We adopt multi-scale sinusoidal embeddings of the mass data in MS2 designed to meet the challenge of learning from the full resolution of MS2 data. Using these embeddings, we provide a new state of the art model for spectral library search, the standard task for initial evaluation of MS2 data. We also investigate the task of chemical property prediction from MS2 data, which has natural applications in high-throughput MS2 experiments, and show that an average $R^2$ of 80\% for novel compounds can be achieved across 10 chemical properties prioritized by medicinal chemists. We use dimensionality reduction techniques and experiments with different floating point resolutions to show the essential role multi-scale sinusoidal embeddings play in learning from MS2 data.","Cheminformatics for Life Sciences, Tandem Mass Spectrometry, Transformers" Self-Ensemble Protection: Training Checkpoints Are Good Data Protectors,https://openreview.net/forum?id=9MO7bjoAfIA,https://openreview.net/pdf?id=9MO7bjoAfIA,"We protect proprietary datasets by using intermediate checkpoints in a self-ensemble way, which more than halves the testing accuracy in unauthorized training compared to the best baselines.","As data become increasingly vital for deep learning, a company would be very cautious about releasing data. This is because the competitors could use the released data to train high-performance models, thereby posing a tremendous threat to the company's commercial competence. To protect the dataset from unauthorized use for training, imperceptible perturbations crafted with a deep model are added to data so that other deep neural networks trained on it all have poor generalization. In this paper, we propose a self-ensemble protection (SEP) method to take advantage of intermediate checkpoints in a single training process for data protection. Contrary to the popular belief on the similarity of checkpoints, we are surprised to find that their cross-model gradients are close to orthogonal, and thus diverse enough to produce very effective protective perturbations. Besides, we further improve the performance of SEP by developing a novel feature alignment technique to induce feature collapse into the mean of incorrect-class features. Extensive experiments verify the consistent superiority of SEP over 7 state-of-the-art data protection baselines. SEP perturbations on CIFAR-10 with an $\ell_\infty$ bound as small as $2/255$ could reduce the testing accuracy of a ResNet18 from 94.56% to 14.68%, and the average accuracy reduction from the best-known results is 27.63%.
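A minimal sketch of the training-history idea above: summarize each validation-loss curve with a few features and fit an off-the-shelf classifier. The featurization is our illustrative choice; the paper trains a time-series classifier on the histories directly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def history_features(val_losses):
    """Summary features of one validation-loss curve."""
    v = np.asarray(val_losses, dtype=float)
    best = v.argmin()
    return [v[-1] - v.min(),                         # rebound above the best loss
            (len(v) - 1 - best) / len(v),            # fraction of epochs past the minimum
            np.polyfit(np.arange(len(v)), v, 1)[0]]  # overall slope of the curve

def fit_detector(histories, labels):
    """histories: list of loss curves; labels: 1 = overfit, 0 = not overfit."""
    X = np.array([history_features(h) for h in histories])
    return RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
```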
Under the $\ell_\infty=8/255$ bound, SEP perturbations lead DNNs with 5 architectures to have less than 5.7% / 3.2% / 0.6% accuracy on CIFAR-10 / CIFAR-100 / an ImageNet subset.","data protection, poisoning attack, self-ensemble, deep neural network" Efficient parametric approximations of neural net function space distance,https://openreview.net/forum?id=3l36EPLnPzA,https://openreview.net/pdf?id=3l36EPLnPzA,We propose an efficient parametric approximation of neural network function space distance that is memory-efficient and can be successfully applied to continual learning and influence function estimation tasks.,"It is often useful to compactly summarize important properties of a training dataset so that they can be used later without storing and/or iterating over the entire dataset. We consider a specific case of this: approximating the function space distance (FSD) over the training set, i.e. the average distance between the outputs of two neural networks. We propose an efficient approximation to FSD for ReLU neural networks based on approximating the architecture as a linear network with stochastic gating. Despite requiring only one parameter per unit of the network, our approach outcompetes other parametric approximations with larger memory requirements. Applied to continual learning, our parametric approximation is competitive with state-of-the-art nonparametric approximations which require storing many training examples. Furthermore, we show its efficacy in influence function estimation, allowing influence functions to be accurately estimated without iterating over the full dataset.","Function space distance, memory-efficiency, continual learning, influence function estimation" Systematic Generalization and Emergent Structures in Transformers Trained on Structured Tasks,https://openreview.net/forum?id=pXDmbfVL_SB,https://openreview.net/pdf?id=pXDmbfVL_SB,"We present a causal transformer that learns multiple algorithmic tasks and generalizes to longer sequences, and show that it develops signs of task decomposition and exploits shared task structures.","Transformer networks have seen great success in natural language processing and machine vision, where task objectives such as next word prediction and image classification benefit from nuanced context sensitivity across high-dimensional inputs. However, there is an ongoing debate about how and when transformers can acquire highly structured behavior and achieve systematic generalization. Here, we explore how well a causal transformer can perform a set of algorithmic tasks, including copying, sorting, and hierarchical compositions of these operations. We demonstrate strong generalization to sequences longer than those used in training by replacing the standard positional encoding typically used in transformers with labels arbitrarily paired with items in the sequence. By finding the layer and head configuration sufficient to solve the task, then performing ablation experiments and representation analysis, we show that two-layer transformers learn generalizable solutions to multi-level problems and develop signs of systematic task decomposition. They also exploit shared computation across related tasks.
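The checkpoint-ensembling idea above amounts to averaging input gradients across snapshots of a single training run; a PyTorch sketch of one perturbation step (interfaces are assumptions):

```python
import torch

def self_ensemble_grad_sign(x, y, checkpoints, loss_fn):
    """Average the input gradient over training checkpoints of one run and
    return its sign: a step direction for an l_inf-bounded protective
    perturbation that exploits the checkpoints' near-orthogonal gradients."""
    x = x.clone().detach().requires_grad_(True)
    loss = sum(loss_fn(model(x), y) for model in checkpoints) / len(checkpoints)
    loss.backward()
    return x.grad.sign()
```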
These results provide key insights into how transformer models may be capable of decomposing complex decisions into reusable, multi-level policies in tasks requiring structured behavior.","systematic generalization, transformers, representation, multi-task learning" Energy-Inspired Self-Supervised Pretraining for Vision Models,https://openreview.net/forum?id=ZMz-sW6gCLF,https://openreview.net/pdf?id=ZMz-sW6gCLF,,"Motivated by the fact that forward and backward passes of a deep network naturally form symmetric mappings between input and output representations, we introduce a simple yet effective self-supervised vision model pretraining framework inspired by energy-based models (EBMs). In the proposed framework, we model energy estimation and data restoration as the forward and backward passes of a single network without any auxiliary components, e.g., an extra decoder. For the forward pass, we fit a network to an energy function that assigns low energy scores to samples that belong to an unlabeled dataset, and high energy otherwise. For the backward pass, we restore data from corrupted versions iteratively using gradient-based optimization along the direction of energy minimization. In this way, we naturally fold the encoder-decoder architecture widely used in masked image modeling into the forward and backward passes of a single vision model. Our framework accepts a wide range of pretext tasks with different data corruption methods, and permits models to be pretrained from masked image modeling, patch sorting, and image restoration, including super-resolution, denoising, and colorization. We support our findings with extensive experiments, and show the proposed method delivers comparable or even better performance with remarkably fewer epochs of training compared to state-of-the-art self-supervised vision model pretraining methods. Our findings shed light on further exploring self-supervised vision model pretraining and pretext tasks beyond masked image modeling. ", Effectively Modeling Time Series with Simple Discrete State Spaces,https://openreview.net/forum?id=2EpjkjzdCAa,https://openreview.net/pdf?id=2EpjkjzdCAa,"We propose SpaceTime, a deep state space time series model that achieves state-of-the-art results on forecasting and classification benchmarks, by improving expressiveness, forecasting flexibility, and training efficiency over prior approaches. ","Time series modeling is a well-established problem, which often requires that methods (1) expressively represent complicated dependencies, (2) forecast long horizons, and (3) efficiently train over long sequences. State-space models (SSMs) are classical models for time series, and prior works combine SSMs with deep learning layers for efficient sequence modeling. However, we find fundamental limitations with these prior approaches, proving their SSM representations cannot express autoregressive time series processes. We thus introduce SpaceTime, a new state-space time series architecture that improves all three criteria. For expressivity, we propose a new SSM parameterization based on the companion matrix---a canonical representation for discrete-time processes---which enables SpaceTime's SSM layers to learn desirable autoregressive processes. For long horizon forecasting, we introduce a ""closed-loop"" variation of the companion SSM, which enables SpaceTime to predict many future time-steps by generating its own layer-wise inputs.
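The label-based positional scheme described above can be sketched in a few lines: sample distinct position labels from a range larger than any training length and sort them, so relative order is kept while absolute positions become arbitrary (details of the downstream embedding table are assumptions):

```python
import torch

def random_position_labels(batch, seq_len, max_label=1024):
    """Sorted random labels replace 0..seq_len-1 positions. Because training
    already exposes the model to large label values, longer test sequences
    stay within the label range the model has seen."""
    return torch.stack([
        torch.sort(torch.randperm(max_label)[:seq_len]).values
        for _ in range(batch)
    ])
```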
For efficient training and inference, we introduce an algorithm that reduces the memory and compute of a forward pass with the companion matrix. With sequence length $\ell$ and state-space size $d$, we go from $\tilde{O}(d \ell)$ naïvely to $\tilde{O}(d + \ell)$. In experiments, our contributions lead to state-of-the-art results on extensive and diverse benchmarks, with best or second-best AUROC on 6 / 7 ECG and speech time series classification, and best MSE on 14 / 16 Informer forecasting tasks. Furthermore, we find SpaceTime (1) fits AR($p$) processes that prior deep SSMs fail on, (2) forecasts notably more accurately on longer horizons than prior state-of-the-art, and (3) speeds up training on real-world ETTh1 data by 73% and 80% relative wall-clock time over Transformers and LSTMs.","time series, forecasting, state-space models, time series classification" Forgetful causal masking makes causal language models better zero-shot learners,https://openreview.net/forum?id=YrZEKNLWhlp,https://openreview.net/pdf?id=YrZEKNLWhlp,A simple masking strategy for causal language models that significantly improves few-shot and finetuning performance.,"Large language models (LLM) trained using the next-token-prediction objective, such as GPT3 and PaLM, have revolutionized natural language processing in recent years by showing impressive zero-shot and few-shot capabilities across a wide range of tasks. In this work, we propose a simple technique that significantly boosts the performance of LLMs without adding computational cost. Our key observation is that, by performing the next token prediction task with randomly selected past tokens masked out, we can improve the quality of the learned representations for downstream language understanding tasks. We hypothesize that randomly masking past tokens prevents over-attending to recent tokens and encourages attention to tokens in the distant past. By randomly masking input tokens in the PaLM model, we show that we can significantly improve PaLM's zero-shot performance on the SuperGLUE benchmark from 55.7 to 59.2. Experimental results show that FCM also improves PaLM's zero- and few-shot performance on a diverse suite of tasks, including commonsense reasoning, natural language inference and cloze completion. Moreover, we show that our technique also helps representation learning, significantly improving PaLM's finetuning results on SuperGLUE. ","Language modeling, causal language model, few-shot language models" "When and why Vision-Language Models behave like Bags-of-Words, and what to do about it?",https://openreview.net/forum?id=KRLUvxh8uaX,https://openreview.net/pdf?id=KRLUvxh8uaX,,"Despite the success of large vision and language models (VLMs) in many downstream applications, it is unclear how well they encode the compositional relationships between objects and attributes. Here, we create the Attribution, Relation, and Order (ARO) benchmark to systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order information. ARO consists of \emph{Visual Genome Attribution}, to test the understanding of objects' properties; \emph{Visual Genome Relation}, to test for relational understanding; and \emph{COCO-Order \& Flickr30k-Order}, to test for order sensitivity in VLMs. ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases. We present the settings where state-of-the-art VLMs behave like bags-of-words---i.e.
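The companion matrix mentioned above is the canonical state-transition matrix for an AR($p$) process; a small sketch of the correspondence $x_{t+1} = A x_t + B u_t$, $y_t = C x_t$:

```python
import numpy as np

def companion_ssm(ar_coeffs):
    """State-space form of y_t = a1*y_{t-1} + ... + ap*y_{t-p} + u_t:
    the first row of A holds the AR coefficients and the sub-diagonal
    shifts the state, so the state stacks the p most recent outputs."""
    p = len(ar_coeffs)
    A = np.zeros((p, p))
    A[0, :] = ar_coeffs
    A[1:, :p - 1] = np.eye(p - 1)
    B = np.zeros((p, 1)); B[0, 0] = 1.0   # input enters the newest state slot
    C = np.zeros((1, p)); C[0, 0] = 1.0   # read out the newest value
    return A, B, C
```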
when they have poor relational understanding, can blunder when linking objects to their attributes, and demonstrate a severe lack of order sensitivity. VLMs are predominantly trained and evaluated on large-scale datasets with rich compositional structure in the images and captions. Yet, training on these datasets has not been enough to address the lack of compositional understanding, and evaluating on these datasets has failed to surface this deficiency. To understand why these limitations emerge and are not represented in the standard tests, we zoom into the evaluation and training procedures. We demonstrate that it is possible to perform well on image-text retrieval over existing datasets without using the composition and order information. This further motivates the value of using ARO to benchmark VLMs. Given that contrastive pretraining optimizes for retrieval on large datasets with similar shortcuts, we hypothesize that this can explain why the models do not need to learn to represent compositional information. This finding suggests a natural solution: composition-aware hard negative mining. We show that a simple-to-implement modification of contrastive learning significantly improves the performance on tasks requiring understanding of order and compositionality. ","vision-language models, clip, contrastive learning, retrieval, vision-language pretraining, multimodal representation learning" A Time Series is Worth 64 Words: Long-term Forecasting with Transformers,https://openreview.net/forum?id=Jbdc0vTOcol,https://openreview.net/pdf?id=Jbdc0vTOcol,Channel-independent patch time series transformer works very well for long-term forecasting and representation learning.,"We propose an efficient design of Transformer-based models for multivariate time series forecasting and self-supervised representation learning. It is based on two key components: (i) segmentation of time series into subseries-level patches, which serve as input tokens to the Transformer; (ii) channel-independence, where each channel contains a single univariate time series that shares the same embedding and Transformer weights across all the series. The patching design naturally has a three-fold benefit: local semantic information is retained in the embedding; computation and memory usage of the attention maps are quadratically reduced given the same look-back window; and the model can attend to a longer history. Our channel-independent patch time series Transformer (PatchTST) can improve the long-term forecasting accuracy significantly when compared with that of SOTA Transformer-based models. We also apply our model to self-supervised pre-training tasks and attain excellent fine-tuning performance, which outperforms supervised training on large datasets. Transferring masked pre-training from one dataset to other datasets also produces SOTA forecasting accuracy.","time series, transformer, forecasting, channel-independence, self-supervised learning, representation learning" Protecting DNN from Evasion Attacks using Ensemble of High Focal Diversity,https://openreview.net/forum?id=65P6pfeT3eg,https://openreview.net/pdf?id=65P6pfeT3eg,,"Edge AI continues to attract emerging applications that deploy well-trained DNN models on heterogeneous edge clients for real-time object detection. Recent studies have shown that evasion attacks on DNN object detection models at the test time are on the rise.
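One simple instantiation of the composition-aware hard negatives motivated above: shuffled captions preserve the bag of words but break order and composition, so a contrastive model must use structure to reject them (the published fix also swaps phrases; this sketch only shuffles):

```python
import random

def order_hard_negatives(caption, n=3, seed=0):
    """Generate order-perturbed caption negatives for contrastive training."""
    rng = random.Random(seed)
    words = caption.split()
    negatives = []
    for _ in range(n):
        shuffled = words[:]
        rng.shuffle(shuffled)
        negatives.append(" ".join(shuffled))
    return negatives
```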
Such evasion attacks generate deceptive queries using maliciously manipulated or out-of-distribution data, aiming to mislead high-quality object detectors during edge inference. This paper introduces ODEN, a novel approach to object detection ensemble, which combines a detection inconsistency solver with focal diversity-optimized ensemble pruning to defend against evasion attacks. The focal diversity ranking techniques enable ODEN to compose an ensemble from a pool of base object detectors with high failure independence, which strengthens the generalization performance of the ODEN ensemble in the presence of irregular query data and evasion attacks. The ODEN inconsistency solver can detect and resolve three types of inconsistency by combining detection results from multiple DNN object detectors: the inconsistency of the object existence, the size and location inconsistency of the bounding boxes of detected objects, and the classification inconsistency of detected objects and their confidence. Extensive experiments on three benchmark vision datasets (OpenImages, COCO, and VOC) show that under no attack, ODEN can outperform existing ensemble methods by up to 9.33% in mAP. Compared to the low mAP of 2.64–18.07% under four evasion attacks, ODEN can maintain a high mAP of 58.97–86.00%, achieving up to an 82.44% increase in AI safety.", Fantastic Rewards and How to Tame Them: A Case Study on Reward Learning for Task-Oriented Dialogue Systems,https://openreview.net/forum?id=086pmarAris,https://openreview.net/pdf?id=086pmarAris,we propose techniques for learning and utilizing reward functions that can be used for training task-oriented dialogue agents,"When learning task-oriented dialogue (TOD) agents, one can naturally utilize reinforcement learning (RL) techniques to train dialogue strategies to achieve user-specific goals. Prior works mainly focus on adopting advanced RL techniques to train the TOD agents, while the design of the reward function is not well studied. This paper aims at answering the question of how to efficiently learn and leverage a reward function for training end-to-end TOD agents. Specifically, we introduce two generalized objectives for reward-function learning, inspired by the classical learning-to-rank literature. Further, we utilize the learned reward function to guide the training of the end-to-end TOD agent. With the proposed techniques, we achieve competitive results on the end-to-end response-generation task on the Multiwoz 2.0 dataset. ","task-oriented dialogue, reinforcement learning, reward learning" Efficient Stochastic Optimization for Attacking Randomness Involved Inference,https://openreview.net/forum?id=faPdyjayCRi,https://openreview.net/pdf?id=faPdyjayCRi,,"Recent years witnessed a surging interest in test-time defense against adversarial attacks by introducing randomness during model inference. Notable examples include randomized smoothing equipped with probabilistic certified robustness and adversarial purification that leverages score-based generative models. Specifically, the adversarial purification achieves state-of-the-art adversarial robustness under the strongest existing attack. Perhaps the most important component in developing and validating adversarial robustness is the design of efficient attacks. Stochastic Projected Gradient Descent (S-PGD), which combines Expectation over Transformation (EOT) and PGD attacks, has become a common strategy to attack inference randomness and validate defense strategies.
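The S-PGD baseline just described is standard enough to sketch. This is a minimal illustration, assuming a `model` whose forward pass is stochastic (e.g., a randomized-smoothing or purification defense) and inputs in [0, 1]; it is not the paper's proposed accelerated estimator:

```python
import torch

def spgd_step(x0, x_adv, y, model, loss_fn, eps=8/255, alpha=2/255, n_eot=8):
    """One S-PGD step: average the loss over the model's randomness (EOT), then PGD."""
    x_adv = x_adv.clone().detach().requires_grad_(True)
    loss = sum(loss_fn(model(x_adv), y) for _ in range(n_eot)) / n_eot
    grad, = torch.autograd.grad(loss, x_adv)
    x_adv = x_adv.detach() + alpha * grad.sign()   # ascent step on the averaged loss
    x_adv = x0 + (x_adv - x0).clamp(-eps, eps)     # project back into the eps-ball
    return x_adv.clamp(0, 1)
```

The n_eot inner evaluations per step are exactly the cost the paper's variance-reduction framework targets.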
However, it often has severe efficiency issues that make it prohibitive for complete verification. For example, one step of S-PGD requires multiple runs of score-based purification models for each data point. This work revisits the techniques attacking randomness-involved inference and subsumes them into a unified stochastic optimization framework, which enables us to use acceleration and variance reduction techniques to largely improve the convergence and thus reduce the cost of attack. In other words, the proposed work can significantly improve attack performance, given a fixed budget for attacking. ", Supervision Complexity and its Role in Knowledge Distillation,https://openreview.net/forum?id=8jU7wy7N7mA,https://openreview.net/pdf?id=8jU7wy7N7mA,We provide a new theoretical perspective on knowledge distillation through the lens of supervision complexity -- a measure of alignment between the teacher-provided supervision and the student's neural tangent kernel.,"Despite the popularity and efficacy of knowledge distillation, there is limited understanding of why it helps. In order to study the generalization behavior of a distilled student, we propose a new theoretical framework that leverages supervision complexity: a measure of alignment between teacher-provided supervision and the student's neural tangent kernel. The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher predictions, and the complexity of the teacher predictions. Specifically, it provides a rigorous justification for the utility of various techniques that are prevalent in the context of distillation, such as early stopping and temperature scaling. Our analysis further suggests the use of online distillation, where a student receives increasingly more complex supervision from teachers in different stages of their training. We demonstrate the efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.","distillation, kernel methods, neural tangent kernel" GLINKX: A Scalable Unified Framework For Homophilous and Heterophilous Graphs,https://openreview.net/forum?id=ZaQrYalGFIh,https://openreview.net/pdf?id=ZaQrYalGFIh,Scalable method for node classification for homophilous and heterophilous graphs,"In graph learning, there have been two predominant inductive biases regarding graph-inspired architectures: On the one hand, higher-order interactions and message passing work well on homophilous graphs and are leveraged by GCNs and GATs. Such architectures, however, cannot easily scale to large real-world graphs. On the other hand, shallow (or node-level) models using ego features and adjacency embeddings work well in heterophilous graphs. In this work, we propose a novel scalable shallow method -- GLINKX -- that can work both on homophilous and heterophilous graphs. GLINKX leverages (i) novel monophilous label propagations, (ii) ego/node features, (iii) knowledge graph embeddings as positional embeddings, (iv) node-level training, and (v) low-dimensional message passing. Formally, we prove novel error bounds and justify the components of GLINKX.
Experimentally, we show its effectiveness on several homophilous and heterophilous datasets.","graph learning, node classification, homophily, heterophily, positional embeddings, knowledge graph embeddings, monophily, label propagation" Marich: A Query-efficient & Online Model Extraction Attack using Public Data,https://openreview.net/forum?id=kocBczDfBeT,https://openreview.net/pdf?id=kocBczDfBeT,,"In this paper, we study black-box model stealing attacks where the attacker is only able to query a machine learning model through publicly available APIs. Specifically, our aim is to design a black-box model stealing attack that uses a minimal number of queries to create an informative replica of the target model. First, we reduce this problem to an online variational optimization problem. At every step, the attacker solves this problem to select the most informative query that maximizes the entropy of the selected queries and simultaneously reduces the mismatch between the target and the stolen models. We propose an online and adaptive algorithm, Marich, that leverages active learning to select the queries. We demonstrate the efficiency of our attack against different models, including logistic regression, BERT and ResNet18, trained on different text and image datasets. Marich is able to steal a model that can achieve 70-96$\%$ of the true model's accuracy using 0.8-10$\%$ samples from the attack datasets, which are publicly available and different from the training datasets. Our stolen models also achieve 75-98$\%$ membership-inference accuracy and show 70-90$\%$ agreement with direct membership inference on the target models. Our experiments validate that Marich is query-efficient and capable of creating an informative replica of the target model.","Privacy attacks, Model extraction, Membership inference, Black-box attack, Query efficiency, Active learning" CORE-PERIPHERY PRINCIPLE GUIDED REDESIGN OF SELF-ATTENTION IN TRANSFORMERS,https://openreview.net/forum?id=tOG2kU6h57B,https://openreview.net/pdf?id=tOG2kU6h57B,,"Designing more efficient, reliable, and explainable neural network architectures is a crucial topic in the artificial intelligence (AI) field. Numerous efforts have been devoted to exploring the best structures, or structural signatures, of well-performing artificial neural networks (ANNs). Previous studies, by post-hoc analysis, have found that the best-performing ANNs surprisingly resemble biological neural networks (BNNs), which indicates that ANNs and BNNs may share some common principles to achieve optimal performance in either machine learning tasks or cognitive/behavior processes. Inspired by this phenomenon, rather than relying on post-hoc schemes, we proactively instill organizational principles of BNNs to guide the redesign of ANNs by infusing an efficient information communication mechanism of BNNs into ANNs. Specifically, we quantified the typical Core-Periphery (CP) organization of the human brain networks, infused the Core-Periphery principle into the redesign of the vision transformer (ViT), and proposed a novel CP-ViT architecture: the pairwise densely interconnected self-attention architecture of ViT was upgraded by a sparse Core-Periphery architecture.
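The sparse Core-Periphery attention pattern just described (and elaborated in the next sentence) can be pictured with a small mask-building sketch. Treating the first `n_core` patch indices as the core is our simplifying assumption, not the paper's construction:

```python
import torch

def core_periphery_mask(n_tokens: int, n_core: int) -> torch.Tensor:
    """Boolean attention mask for a Core-Periphery graph (sketch).

    Core tokens reach every token and are reachable by every token; the
    periphery-periphery edges of dense self-attention are pruned away.
    """
    mask = torch.zeros(n_tokens, n_tokens, dtype=torch.bool)
    mask[:n_core, :] = True                         # core attends to all tokens
    mask[:, :n_core] = True                         # all tokens attend to the core
    mask |= torch.eye(n_tokens, dtype=torch.bool)   # keep self-attention
    return mask
```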
In CP-ViT, the attention operation between nodes (image patches) is defined by a sparse graph with a Core-Periphery structure (CP graph), where the core nodes are redesigned and reorganized to play an integrative role and serve as a center for other periphery nodes to exchange information. We evaluated the proposed CP-ViT on multiple public datasets, including medical image datasets (INbreast) and natural image datasets (CIFAR-100). We show that there exist sweet spots of CP graphs that lead to CP-ViTs with significantly improved performance. In general, our work advances the state of the art in three aspects: 1) This work provides novel insights for brain-inspired AI: we can instill the efficient information communication mechanism of BNNs into ANNs by infusing similar organizational principles of BNNs into ANNs; 2) The optimized CP-ViT can significantly improve its predictive performance while dramatically reducing computational cost by benefiting from the infused efficient information communication mechanism existing in BNNs; and 3) The core nodes in CP-ViT can identify task-related meaningful and important image patches, which can significantly enhance the interpretability of the trained deep model. (Code is ready for release).","Core-periphery Structure, Self-Attention" Lovasz Theta Contrastive Learning,https://openreview.net/forum?id=-hWhz9xfrB9,https://openreview.net/pdf?id=-hWhz9xfrB9,,"We establish a connection between the Lovasz theta function of a graph and the widely used InfoNCE loss. We show that under certain conditions, the minima of the InfoNCE loss are related to minimizing the Lovasz theta function on the empty similarity graph between the samples. Building on this connection, we generalize contrastive learning on weighted similarity graphs between samples. Our Lovasz theta contrastive loss uses a weighted graph that can be learned to take into account similarities between our data. We evaluate our method on image classification tasks, demonstrating an improvement of $1 \%$ in the supervised case and up to $4 \%$ in the unsupervised case.","Lovasz theta, Contrastive learning, Similarity graph, Graph Theory" Transferable Unlearnable Examples,https://openreview.net/forum?id=-htnolWDLvP,https://openreview.net/pdf?id=-htnolWDLvP,,"With more people publishing their personal data online, unauthorized data usage has become a serious concern. The unlearnable strategies have been introduced to prevent third parties from training on the data without permission. They add perturbations to the users' data before publishing, which aims to invalidate models trained on the perturbed published dataset. These perturbations have been generated for a specific training setting and a target dataset. However, their unlearnable effects significantly decrease when used in other training settings and datasets. To tackle this issue, we propose a novel unlearnable strategy based on Clustering Separability Discriminant (CSD), which aims to better transfer the unlearnable effects to other training settings and datasets by enhancing the linear separability.
Extensive experiments demonstrate the transferability of the proposed unlearnable examples across training settings and datasets.","Unlearnable Examples, Data Protection" MUG: Interactive Multimodal Grounding on User Interfaces,https://openreview.net/forum?id=bbf_lxmcpTQ,https://openreview.net/pdf?id=bbf_lxmcpTQ,"We present MUG, a novel interactive task for multimodal grounding where a user and an agent work collaboratively on an interface screen.","We present MUG, a novel interactive task for multimodal grounding where a user and an agent work collaboratively on an interface screen. Prior works modeled multimodal UI grounding in one round: the user gives a command and the agent responds to the command. Yet, in a realistic scenario, a user command can be ambiguous when the target action is inherently difficult to articulate in natural language. MUG allows multiple rounds of interactions such that upon seeing the agent responses, the user can give further commands for the agent to refine or even correct its actions. Such interaction is critical for improving grounding performance in real-world use cases. To investigate the problem, we create a new dataset that consists of 77,820 sequences of human user-agent interaction on mobile interfaces in which 20% involves multiple rounds of interactions. To establish our benchmark, we experiment with a range of modeling variants and evaluation strategies, including both offline and online evaluation—the online strategy consists of both human evaluation and automatic evaluation with simulators. Our experiments show that allowing iterative interaction significantly improves the absolute task completion by 18% over the entire test dataset and 31% over the challenging subset. Our results lay the foundation for further investigation of the problem.","Multimodal Grounding, UI, Mobile, Interaction" Tabular Deep Learning when $d \gg n$ by Using an Auxiliary Knowledge Graph,https://openreview.net/forum?id=b1F-_7dUo0w,https://openreview.net/pdf?id=b1F-_7dUo0w,PLATO uses an auxiliary KG about input features to enable tabular deep learning prediction when $d \gg n$.,"Machine learning models exhibit strong performance on datasets with abundant labeled samples. However, for tabular datasets with extremely high $d$-dimensional features but limited $n$ samples (i.e. $d \gg n$), machine learning models struggle to achieve strong performance. Here, our key insight is that even in tabular datasets with limited labeled data, input features often represent real-world entities about which there is abundant prior information that can be structured as an auxiliary knowledge graph (KG). For example, in a tabular medical dataset where every input feature is the amount of a gene in a patient's tumor and the label is the patient's survival, there is an auxiliary knowledge graph connecting gene names with drug, disease, and human anatomy nodes. We therefore propose PLATO, a machine learning model for tabular data with $d \gg n$ and an auxiliary KG with input features as nodes. PLATO uses a multilayer perceptron (MLP) to predict the output labels from the tabular data and the auxiliary KG with two methodological components. First, PLATO predicts the parameters in the first layer of the MLP from the auxiliary KG. PLATO thereby reduces the number of trainable parameters in the MLP and integrates auxiliary information about the input features.
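PLATO's first component, predicting the MLP's first-layer weights from KG node embeddings, can be sketched compactly. This is a hedged illustration under assumed shapes, not the paper's architecture; the point it shows is that the trainable parameter count scales with the embedding dimension rather than with $d$:

```python
import torch
import torch.nn as nn

class KGPredictedFirstLayer(nn.Module):
    """First MLP layer whose weights are predicted from KG embeddings (PLATO-style sketch).

    kg_emb: (d_features, emb_dim) frozen embedding of each input feature's KG node.
    Trainable parameters live in `proj` (emb_dim x hidden), not in a d x hidden
    matrix, which is what matters when d >> n.
    """
    def __init__(self, kg_emb: torch.Tensor, hidden: int):
        super().__init__()
        self.register_buffer("kg_emb", kg_emb)
        self.proj = nn.Linear(kg_emb.shape[1], hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, d_features)
        w = self.proj(self.kg_emb)                        # (d_features, hidden)
        return x @ w
```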
Second, PLATO predicts different parameters in the first layer of the MLP for every input sample, thereby increasing the MLP's representational capacity by allowing it to use different prior information for each sample. Across 10 state-of-the-art baselines and 6 $d \gg n$ datasets, PLATO exceeds or matches the prior state-of-the-art, achieving performance improvements of up to 10.19%. Overall, PLATO uses an auxiliary KG about input features to enable tabular deep learning prediction when $d \gg n$.","Tabular Deep Learning, Knowledge Graph, High Dimensional, Low Sample" Random Laplacian Features for Learning with Hyperbolic Space,https://openreview.net/forum?id=3pfNb4pZBNp,https://openreview.net/pdf?id=3pfNb4pZBNp,,"Due to its geometric properties, hyperbolic space can support high-fidelity embeddings of tree- and graph-structured data, upon which various hyperbolic networks have been developed. Existing hyperbolic networks encode geometric priors not only for the input, but also at every layer of the network. This approach involves repeatedly mapping to and from hyperbolic space, which makes these networks complicated to implement, computationally expensive to scale, and numerically unstable to train. In this paper, we propose a simpler approach: learn a hyperbolic embedding of the input, then map once from it to Euclidean space using a mapping that encodes geometric priors by respecting the isometries of hyperbolic space, and finish with a standard Euclidean network. The key insight is to use a random feature mapping via the eigenfunctions of the Laplace operator, which we show can approximate any isometry-invariant kernel on hyperbolic space. Our method can be used together with any graph neural network: using even a linear graph model yields significant improvements in both efficiency and performance over other hyperbolic baselines in both transductive and inductive tasks. ","hyperbolic space, random features, kernel approximation" Replay Memory as An Empirical MDP: Combining Conservative Estimation with Experience Replay,https://openreview.net/forum?id=SjzFVSJUt8S,https://openreview.net/pdf?id=SjzFVSJUt8S,,"Experience replay, which stores transitions in a replay memory for repeated use, plays an important role in improving sample efficiency in reinforcement learning. Existing techniques such as reweighted sampling, episodic learning and reverse sweep update further process the information in the replay memory to make experience replay more efficient. In this work, we further exploit the information in the replay memory by treating it as an empirical \emph{Replay Memory MDP (RM-MDP)}. By solving it with dynamic programming, we learn a conservative value estimate that \emph{only} considers transitions observed in the replay memory. Both value and policy regularizers based on this conservative estimate are developed and integrated with model-free learning algorithms. We design the metric \textit{memory density} to measure the quality of RM-MDP. Our empirical studies quantitatively find a strong correlation between performance improvement and memory density. Our method combines \emph{Conservative Estimation with Experience Replay (CEER)}, improving sample efficiency by a large margin, especially when the memory density is high.
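The RM-MDP construction above is easy to sketch for discrete state-action spaces. This is a hedged toy illustration, assuming hashable states and our own simplistic coverage ratio as a stand-in for the paper's memory density metric:

```python
from collections import defaultdict

def build_rm_mdp(replay):
    """Group replay transitions into an empirical MDP over observed (s, a) pairs.

    replay: iterable of (s, a, r, s_next, done) with hashable s and a.
    Dynamic programming over this dict backs up only observed transitions,
    yielding a conservative value estimate.
    """
    model = defaultdict(list)
    for s, a, r, s_next, done in replay:
        model[(s, a)].append((r, s_next, done))
    return model

def memory_density(model, n_states, n_actions):
    """Fraction of the (s, a) space covered by the replay memory (illustrative only)."""
    return len(model) / (n_states * n_actions)
```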
Even when the memory density is low, such a conservative estimate can still help to avoid suicidal actions and thereby improve performance.", Neural Causal Models for Counterfactual Identification and Estimation,https://openreview.net/forum?id=vouQcZS8KfW,https://openreview.net/pdf?id=vouQcZS8KfW,We solve the two problems of counterfactual identification and estimation from arbitrary surrogate experiments using a Generative Adversarial Network implementation of the Neural Causal Model.,"Evaluating hypothetical statements about how the world would be had a different course of action been taken is arguably one key capability expected from modern AI systems. Counterfactual reasoning underpins discussions in fairness, the determination of blame and responsibility, credit assignment, and regret. In this paper, we study the evaluation of counterfactual statements through neural models. Specifically, we tackle two causal problems required to make such evaluations, i.e., counterfactual identification and estimation from an arbitrary combination of observational and experimental data. First, we show that neural causal models (NCMs) are expressive enough and encode the structural constraints necessary for performing counterfactual reasoning. Second, we develop an algorithm for simultaneously identifying and estimating counterfactual distributions. We show that this algorithm is sound and complete for deciding counterfactual identification in general settings. Third, considering the practical implications of these results, we introduce a new strategy for modeling NCMs using generative adversarial networks. Simulations corroborate the proposed methodology.","causal inference, deep learning, neural models, neural causal models, causal identification, causal estimation, counterfactual" Connecting representation and generation via masked vision-language transformer,https://openreview.net/forum?id=cRCEabpC5XQ,https://openreview.net/pdf?id=cRCEabpC5XQ,Unified vision-language Transformer trained with masked token prediction for both representation learning and generation of image and text.,"Recently, there has been great progress in the self-supervised pre-training of multimodal representation models that understand image and language jointly. One particularly popular application of such models is text-to-image generation, which is typically obtained via a two-stage process: in the first stage, a representation model is trained via self-supervised objectives; then in the second stage, a conditional generative decoder is trained on top of the representation to generate natural images. In this work, we aim at bringing representation learning and conditional generation together by unifying the two stages into a single model and training objective. We present UPGen, a unified pre-trained model for both representation learning and generation. UPGen is trained with a simple masked token prediction objective on a flexible mixture of image and language data. We use a pre-trained VQGAN image tokenizer to convert images into discrete tokens, then train a masked token prediction model on both paired image-text datasets and unpaired language datasets, using randomly sampled mask ratios. We show that this masked token prediction model can be directly used to generate images and language by iteratively re-masking and predicting the masked tokens.
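The iterative re-masking generation loop described for UPGen follows a common pattern for masked-token generative models. Below is a hedged single-sequence sketch, assuming a `model` that maps (1, seq_len) token ids to (1, seq_len, vocab) logits; the confidence-based re-masking schedule is our illustrative choice, not necessarily the paper's:

```python
import torch

@torch.no_grad()
def iterative_decode(model, tokens, mask_id, steps=8):
    """Generate by repeatedly filling and re-masking tokens (UPGen-style sketch).

    tokens: (seq_len,) LongTensor, initially all mask_id.
    Each step fills every masked position, then re-masks the least confident
    predictions; the mask ratio anneals to zero over `steps`.
    """
    seq_len = tokens.numel()
    for t in range(1, steps + 1):
        logits = model(tokens.unsqueeze(0)).squeeze(0)   # (seq_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)
        masked = tokens == mask_id
        tokens = torch.where(masked, pred, tokens)
        n_remask = int(seq_len * (1 - t / steps))
        if n_remask > 0:
            # never re-mask tokens that were already committed in earlier steps
            conf = torch.where(masked, conf, torch.full_like(conf, float("inf")))
            tokens[conf.topk(n_remask, largest=False).indices] = mask_id
    return tokens
```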
We demonstrate empirically that UPGen serves as both a good representation learning model and a generative model for both image and language.","Representation Learning, Pre-training, Generative Model, Conditional Generation" Is margin all you need? An extensive empirical study of active learning on tabular data,https://openreview.net/forum?id=wXdEKf5mV6N,https://openreview.net/pdf?id=wXdEKf5mV6N,,"Given a labeled training set and a collection of unlabeled data, the goal of active learning (AL) is to identify the best unlabeled points to label. In this comprehensive study, we analyze the performance of a variety of AL algorithms on deep neural networks trained on 69 real-world tabular classification datasets from the OpenML-CC18 benchmark. We consider different data regimes and the effect of self-supervised model pre-training. Surprisingly, we find that the classical margin sampling technique matches or outperforms all others, including the current state-of-the-art, in a wide range of experimental settings. To researchers, we hope to encourage rigorous benchmarking against margin; to practitioners facing tabular data labeling constraints, we suggest that hyper-parameter-free margin may often be all they need.", "Momentum Stiefel Optimizer, with Applications to Suitably-Orthogonal Attention, and Optimal Transport",https://openreview.net/forum?id=vCJ9-Ri-6xU,https://openreview.net/pdf?id=vCJ9-Ri-6xU,This paper proposes accurate and efficient optimizers on Stiefel manifold based on a new variational principle and its careful discretization.,"The problem of optimization on Stiefel manifold, i.e., minimizing functions of (not necessarily square) matrices that satisfy orthogonality constraints, has been extensively studied. Yet, a new approach is proposed based on, for the first time, an interplay between thoughtfully designed continuous and discrete dynamics. It leads to a gradient-based optimizer with intrinsically added momentum. This method exactly preserves the manifold structure but does not require additional operation to keep momentum in the changing (co)tangent space, and thus has low computational cost and pleasant accuracy. Its generalization to adaptive learning rates is also demonstrated. Notable performances are observed in practical tasks. For instance, we found that placing orthogonal constraints on attention heads of trained-from-scratch Vision Transformer (Dosovitskiy et al., 2020) could markedly improve its performance, when our optimizer is used, and it is better to make each head orthogonal within itself but not necessarily to other heads. This optimizer also makes the useful notion of Projection Robust Wasserstein Distance (Paty and Cuturi, 2019; Lin et al., 2020) for high-dim. optimal transport even more effective.", Target Conditioned Representation Independence (TCRI); from Domain-Invariant to Domain-General Representations,https://openreview.net/forum?id=ZZCJv2biATn,https://openreview.net/pdf?id=ZZCJv2biATn,We propose a Target Conditioned Representation Independence (TCRI) objective to learn domain-general representations and predictors.,"We propose a Target Conditioned Representation Independence (TCRI) objective for domain generalization. TCRI addresses the limitations of existing domain generalization methods due to incomplete constraints. Specifically, TCRI implements regularizers motivated by conditional independence constraints that are sufficient to strictly learn complete sets of invariant mechanisms, which we show are necessary and sufficient for domain generalization.
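The classical margin sampling rule highlighted in the tabular active-learning study above is a one-liner in spirit. A minimal sketch under assumed shapes (not tied to any particular model):

```python
import numpy as np

def margin_query(probs: np.ndarray, k: int) -> np.ndarray:
    """Classical margin sampling: request labels for the k most ambiguous points.

    probs: (n_unlabeled, n_classes) predicted class probabilities.
    The margin is top-1 minus top-2 probability; smaller means more ambiguous.
    """
    sorted_p = np.sort(probs, axis=1)
    margins = sorted_p[:, -1] - sorted_p[:, -2]
    return np.argsort(margins)[:k]
```

Its appeal, as the study notes, is that it is hyper-parameter-free.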
Empirically, we show that TCRI is effective on both synthetic and real-world data. TCRI is competitive with baselines in average accuracy while outperforming them in worst-domain accuracy, indicating desired cross-domain stability.","Domain Generalization, Out-of-distribution Generalization, Transfer Learning, Distribution Shift, Covariate Shift" Multi-Task Option Learning and Discovery for Stochastic Path Planning,https://openreview.net/forum?id=gwPea8qxdzz,https://openreview.net/pdf?id=gwPea8qxdzz,This paper presents a novel approach for learning reusable multi-options to compute solutions for stochastic path planning problems using a hierarchical approach that combines planning and learning. ,"This paper addresses the problem of reliably and efficiently solving broad classes of long-horizon stochastic path planning problems. Starting with a vanilla RL formulation with a stochastic dynamics simulator and an occupancy matrix of the environment, our approach computes useful options with policies as well as high-level paths that compose the discovered options. Our main contributions are (1) data-driven methods for creating abstract states that serve as endpoints for helpful options, (2) methods for computing option policies using auto-generated option guides in the form of dense pseudo-reward functions, and (3) an overarching algorithm for composing the computed options. We show that this approach yields strong guarantees of executability and solvability: under fairly general conditions, the computed option guides lead to composable option policies and consequently ensure downward refinability. Empirical evaluation on a range of robots, environments, and tasks shows that this approach effectively transfers knowledge across related tasks and that it outperforms existing approaches by a significant margin.","Option discovery, learning abstractions, planning and learning, reinforcement learning, RL for robotics, hierarchical methods, stochastic path planning" MolEBM: Molecule Generation and Design by Latent Space Energy-Based Modeling,https://openreview.net/forum?id=u_pS0sDr95-,https://openreview.net/pdf?id=u_pS0sDr95-,We propose a probabilistic generative model to model molecule and molecular properties jointly and naturally achieve molecule design by posterior sampling conditional on desired properties. ,"Generation of molecules with desired chemical and biological properties, such as high drug-likeness and high binding affinity to target proteins, is critical in drug discovery. In this paper, we propose a probabilistic generative model to capture the joint distribution of molecules and their properties. Our model assumes an energy-based model (EBM) in the latent space. Given the latent vector sampled from the latent space EBM, both molecules and molecular properties are conditionally sampled via a top-down molecule generator model and a property regression model respectively. The EBM in a low dimensional latent space allows our model to capture complex chemical rules implicitly but efficiently and effectively. Due to the joint modeling with chemical properties, molecule design can be conveniently and naturally achieved by conditional sampling from our learned model given desired properties, in both single-objective and multi-objective optimization settings.
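Conditional sampling from a latent-space EBM of the kind MolEBM describes is typically done with Langevin dynamics. A hedged sketch, assuming an `energy(z)` network for the latent EBM and a `regressor(z)` property head, with the squared property error acting as the conditioning term; the exact sampler and tilting are the paper's design choices, not reproduced here:

```python
import torch

def property_guided_langevin(z, energy, regressor, y_target, steps=50, lr=0.1):
    """Langevin sampling in latent space, tilted toward a desired property (sketch)."""
    for _ in range(steps):
        z = z.detach().requires_grad_(True)
        u = energy(z).sum() + ((regressor(z) - y_target) ** 2).sum()
        grad, = torch.autograd.grad(u, z)
        z = z - 0.5 * lr * grad + (lr ** 0.5) * torch.randn_like(z)
    return z.detach()
```

Sampled latents are then decoded by the top-down molecule generator.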
The latent space EBM, top-down molecule generator, and property regression model are learned jointly by maximum likelihood, while optimization of properties is accomplished by gradual shifting of the model distribution towards the region supported by molecules with high property values. Our experiments show that our model outperforms state-of-the-art models on various molecule design tasks. ","molecule design, energy-based model" Information-Theoretic Diffusion,https://openreview.net/forum?id=UvmDCdSPDOW,https://openreview.net/pdf?id=UvmDCdSPDOW,A new information-theoretic foundation for diffusion models leads to simpler and more computationally efficient density modeling. ,"Denoising diffusion models have spurred significant gains in density modeling and image generation, precipitating an industrial revolution in text-guided AI art generation. Whether interpreted through the lens of variational models or differential equations, diffusion models require many steps of expensive computation to give accurate density estimates. We introduce a new mathematical foundation for diffusion models inspired by classic results in information theory that connects Information with Minimum Mean Square Error estimators, the so-called I-MMSE relations. We generalize the I-MMSE relations to \emph{exactly} relate the data distribution and optimal denoising, leading to an elegant refinement of existing diffusion bounds. This new insight improves density estimation for diffusion models and enables simultaneous modeling of both continuous and discrete probabilities with no additional cost. ","diffusion, density models, information theory" Bandwidth Enables Generalization in Quantum Kernel Models,https://openreview.net/forum?id=Ry-cTiH_cus,https://openreview.net/pdf?id=Ry-cTiH_cus,,"Quantum computers are known to provide speedups over classical state-of-the-art machine learning methods in some specialized settings. For example, quantum kernel methods have been shown to provide an exponential speedup on a learning version of the discrete logarithm problem. Understanding the generalization of quantum models is essential to realizing similar speedups on practically interesting problems. Recent results demonstrate that generalization is hindered by the exponential size of the quantum feature space. Although these results suggest that quantum models cannot generalize when the number of qubits is large, in this paper we show that these results rely on overly restrictive assumptions. We consider a wider class of models by varying a hyperparameter that we call quantum kernel bandwidth. We analyze the large-qubit limit and provide explicit formulas for the generalization of a quantum model that can be solved in closed form. Specifically, we show that changing the value of bandwidth can take a model from provably not being able to generalize on any target function to good generalization for well-aligned targets. Our analysis shows how the bandwidth controls the spectrum of the kernel integral operator, and thereby the inductive bias of the model. We demonstrate empirically that our theory correctly predicts how varying the bandwidth affects generalization of quantum models on challenging datasets, including those far outside our theoretical assumptions.
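To make the bandwidth hyperparameter concrete, consider the simplest angle-encoded kernel. This toy is our own illustration, not the paper's circuits: with each feature encoded as RY(c*x) on its own qubit, the state fidelity factorizes, and the bandwidth c rescales the input before encoding:

```python
import numpy as np

def bandwidth_product_kernel(x1: np.ndarray, x2: np.ndarray, c: float = 0.1) -> float:
    """Toy product-state quantum kernel with bandwidth c (illustrative sketch).

    Per-qubit fidelity of RY(c*x) states is cos^2(c*(x1 - x2)/2), so the full
    kernel is the product over features. Shrinking c widens the kernel,
    which is the bandwidth knob the paper analyzes.
    """
    return float(np.prod(np.cos(c * (x1 - x2) / 2.0) ** 2))
```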
We discuss the implications of our results for quantum advantage in machine learning.","kernel methods, generalization error, quantum machine learning, spectral bias" Giving Robots a Hand: Broadening Generalization via Hand-Centric Human Video Demonstrations,https://openreview.net/forum?id=Uo3usD5FFSR,https://openreview.net/pdf?id=Uo3usD5FFSR,"We leverage hand-centric human video demonstrations to learn generalizable robotic manipulation policies via imitation learning, introducing a simple framework that allows one to avoid using explicit human-robot domain adaptation methods.","Videos of humans performing tasks are a promising data source for robotic manipulation, because they are easy to collect in a wide range of scenarios and thus have the potential to significantly expand the generalization capabilities of vision-based robotic manipulators. Prior approaches to learning from human video demonstrations typically use third-person or egocentric data, but a central challenge that must be overcome is the domain shift caused by the difference in appearance between human and robot morphologies. In this work, we largely reduce this domain gap by collecting hand-centric human video data (i.e., videos captured by a human demonstrator wearing a camera on their arm). To further close the gap, we simply crop out a portion of every visual observation such that the hand is no longer visible. We propose a framework for broadening the generalization of deep robotic imitation learning policies by incorporating unlabeled data in this format---without needing to employ any domain adaptation method, as the human embodiment is not visible in the frame. On a suite of six real robot manipulation tasks, our method substantially improves the generalization performance of manipulation policies acting on hand-centric image observations. Moreover, our method enables robots to generalize to both new environment configurations and new tasks that are unseen in the expert robot imitation data.","imitation learning, robotics, manipulation, learning from human demonstrations, learning from observations, generalization, visuomotor control" SpENCNN: Orchestrating Encoding and Sparsity for Fast Homomorphically Encrypted Neural Network Inference,https://openreview.net/forum?id=-syx4GzWdTM,https://openreview.net/pdf?id=-syx4GzWdTM,,"Homomorphic Encryption (HE) is a promising technology for protecting user's data privacy for Machine Learning as a Service (MLaaS) on public clouds. However, the computation overheads associated with the HE operations, which can be orders of magnitude slower than their counterparts for plaintexts, can lead to extremely high latency in neural network inference, seriously hindering its application in practice. While extensive neural network optimization techniques have been proposed, such as sparsification and pruning for plaintext domain, they cannot address this problem effectively. In this paper, we propose an HE-based CNN inference framework, i.e., SpENCNN, that can effectively exploit the single-instruction-multiple-data (SIMD) feature of the HE scheme to reduce CNN inference latency. In particular, we first develop an HE-group convolution technique that can partition channels among different groups based on the data size and ciphertext size, and then encode them into the same ciphertext in an interleaved manner, so as to dramatically reduce the bottlenecked operations in HE convolution.
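The interleaved channel packing just described can be pictured in plain NumPy before any encryption enters the picture. A hedged sketch of the slot layout only; group sizing against the ciphertext slot count and the actual HE rotations are the paper's contribution and are not modeled here:

```python
import numpy as np

def interleave_pack(channels):
    """Pack g equally sized flattened channels into one SIMD slot vector, interleaved.

    channels: list of g 1-D arrays of length m.
    Resulting slot layout: c0[0], c1[0], ..., c_{g-1}[0], c0[1], ...
    so a whole channel group shares one ciphertext.
    """
    return np.stack(channels, axis=1).reshape(-1)
```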
We further develop a sub-block weight pruning technique that can further reduce costly HE operations in CNN convolutions. Our experimental results show that the SpENCNN-optimized CNN models can achieve overall speedups of 8.37x, 12.11x, and 19.26x for LeNet, VGG-5, and HEFNet, respectively, with negligible accuracy loss.","Cryptographic inference, model sparsity, data encoding" No Pairs Left Behind: Improving Metric Learning with Regularized Triplet Objective,https://openreview.net/forum?id=_tfLpF9mFiq,https://openreview.net/pdf?id=_tfLpF9mFiq,We propose a novel triplet objective that improves representation learning on a variety of applications without requiring additional sample mining or overhead costs.,"We propose a novel formulation of the triplet objective function that improves metric learning without additional sample mining or overhead costs. Our approach aims to explicitly regularize the distance between the positive and negative samples in a triplet with respect to the anchor-negative distance. As an initial validation, we show that our method (called No Pairs Left Behind [NPLB]) improves upon the traditional and current state-of-the-art triplet objective formulations on standard benchmark datasets. To show the effectiveness and potential of NPLB on real-world complex data, we evaluate our approach on a large-scale healthcare dataset (UK Biobank), demonstrating that the embeddings learned by our model significantly outperform all other current representations on tested downstream tasks. Additionally, we provide a new model-agnostic single-time health risk definition that, when used in tandem with the learned representations, achieves the most accurate prediction of a patient's future health complications. Our results indicate that NPLB is a simple, yet effective framework for improving existing deep metric learning models, showcasing the potential implications of deep metric learning in more complex applications, especially in the biological and healthcare domains.","Deep Metric Learning, Representation Learning, Machine Learning for Healthcare, Triplet Loss" Minimal Value-Equivalent Partial Models for Scalable and Robust Planning in Lifelong Reinforcement Learning,https://openreview.net/forum?id=m1f8XUs-RQP,https://openreview.net/pdf?id=m1f8XUs-RQP,We propose new kinds of models to perform scalable and robust planning in lifelong reinforcement learning.,"Learning models of the environment from pure interaction is often considered an essential component of building lifelong reinforcement learning agents. However, the common practice in model-based reinforcement learning is to learn models that model every aspect of the agent's environment, regardless of whether they are important in coming up with optimal decisions or not. In this paper, we argue that such models are not particularly well-suited for performing scalable and robust planning in lifelong reinforcement learning scenarios and we propose new kinds of models that only model the relevant aspects of the environment, which we call minimal value-equivalent partial models. After providing the formal definitions of these models, we provide theoretical results demonstrating the scalability advantages of performing planning with minimal value-equivalent partial models and then perform experiments to empirically illustrate our theoretical results.
Finally, we provide some useful heuristics on how to learn such models with deep learning architectures and empirically demonstrate that models learned in such a way can allow for performing planning that is robust to distribution shifts and compounding model errors. Overall, both our theoretical and empirical results suggest that minimal value-equivalent partial models can provide significant benefits to performing scalable and robust planning in lifelong reinforcement learning scenarios. ","reinforcement learning, lifelong learning, transfer learning, model-based reinforcement learning" Gradient Preconditioning for Non-Lipschitz smooth Nonconvex Optimization,https://openreview.net/forum?id=PQk-8VyP-dv,https://openreview.net/pdf?id=PQk-8VyP-dv,,"First-order optimization methods often perform poorly on non-Lipschitz smooth and ill-conditioned problems. Recent work introduced the dual preconditioned gradient descent algorithm, which applies a nonlinear preconditioning to the gradient map to improve performance on convex functions satisfying relative smoothness -- a generalized version of Lipschitz gradient smoothness. In this paper, we significantly extend this prior work by providing a convergence analysis of this algorithm for non-Lipschitz smooth nonconvex problems. To this end, we exploit recent connections with generalized versions of convexity and smoothness, referred to as anisotropic convexity/smoothness, which guarantee convergence to a first-order stationary point. Further, we show that some recently proposed preconditioners based on power functions or relativistic dynamics are well-suited for a broad class of objectives. Our experiments demonstrate improved performance using these preconditioners on a variety of non-Lipschitz smooth, nonconvex optimization objectives, including large-scale deep learning tasks.", Predictive Coding with Approximate Laplace Monte Carlo,https://openreview.net/forum?id=m2OeuIGTJW-,https://openreview.net/pdf?id=m2OeuIGTJW-,A novel method that improves the performance of predictive coding by incorporating information about the curvature of the energy landscape.,"Predictive coding (PC) accounts of perception now form one of the dominant computational theories of the brain. Despite this, they have enjoyed little export to the broader field of machine learning, where comparative generative models have flourished. In part, this has been due to the poor performance of models trained with standard implementations of PC when evaluated by both sample quality and marginal likelihood. By adopting the perspective of PC as a variational Bayes algorithm under the Laplace approximation, we identify the source of these deficits to lie in the exclusion of an associated Hessian term in the standard PC objective function. To remedy this, we make three primary contributions: we begin by suggesting a simple Monte Carlo estimated evidence lower bound which relies on sampling from the Hessian-parameterised variational posterior. We then derive a novel block diagonal approximation to the full Hessian matrix that has lower memory requirements and favourable mathematical properties. Lastly, we present an algorithm that combines our method with standard PC to reduce memory complexity further. We evaluate models trained with our approach against the standard PC framework on image benchmark datasets. Our approach produces higher log-likelihoods and qualitatively better samples that more closely capture the diversity of the data-generating distribution. 
","predictive coding, variational Bayes, Laplace approximation, generative modelling, free energy" What Spurious Features Can Pretrained Language Models Combat?,https://openreview.net/forum?id=BcbwGQWB-Kd,https://openreview.net/pdf?id=BcbwGQWB-Kd,,"Machine learning models are known to exploit spurious features: features that are predictive during training (e.g., the exclamation mark) but are not useful in general (e.g., the exclamation mark does not imply sentiment). Relying on such features may result in significant performance drops under distribution shifts. Recent work has found that Pretrained Language Models (PLMs) improve robustness against spurious features. However, existing evaluation of PLMs only focuses on a small set of spurious features, painting a limited picture of the inductive bias in PLMs. In this work, we conduct a comprehensive empirical analysis to compare the generalization patterns of PLMs on diverse categories of spurious features as a way to analyze the inductive biases of PLMs. We find systematic patterns when finetuning BERT and few-shot prompting GPT-3: they exploit certain types of spurious features (e.g., content words) to a much larger extent than others (e.g., function words). Our findings inform the kinds of settings where pretraining alone can be expected to confer robustness, and the kinds of spurious features where other mitigation methods are necessary, for which we also study how different finetuning and prompting methods affect the robustness of PLMs.","spurious correlation, pretrained language models" SIMPLE: A Gradient Estimator for k-Subset Sampling,https://openreview.net/forum?id=GPJVuyX4p_h,https://openreview.net/pdf?id=GPJVuyX4p_h,,"$k$-subset sampling is ubiquitous in machine learning, enabling regularization and interpretability through sparsity. The challenge lies in rendering $k$-subset sampling amenable to end-to-end learning. This has typically involved relaxing the reparameterized samples to allow for backpropagation, but introduces both bias and variance. In this work, we fall back to discrete $k$-subset sampling on the forward pass. This is coupled with using the gradient with respect to the exact marginals, computed efficiently, as a proxy for the true gradient. We show that our gradient estimator exhibits lower bias and variance compared to state-of-the-art estimators. Empirical results show improved performance on learning to explain and sparse models benchmarks. We provide an algorithm for computing the exact ELBO for the $k$-subset distribution, obtaining significantly lower loss compared to state-of-the-art discrete sparse VAEs. All of our algorithms are exact and efficient.", Transformers Implement First-Order Logic with Majority Quantifiers,https://openreview.net/forum?id=W668diqwp4l,https://openreview.net/pdf?id=W668diqwp4l,Transformers can be translated to formulae in first-order logic with majority quantifiers that compute the same function.,"Characterizing the implicit structure of the computation within neural networks is a foundational problem in the area of deep learning interpretability. Can their inner decision process be captured symbolically in some familiar logic? We show that any transformer neural network can be translated into an equivalent fixed-size first-order logic formula which may also use majority quantifiers. The idea is to simulate transformers with highly uniform threshold circuits and leverage known theoretical connections between circuits and logic. 
Our findings also reveal the surprising fact that the entire transformer computation can be reduced merely to the division of two (large) integers. While our results are most pertinent for transformers, they apply equally to a broader class of neural network architectures, namely those with a fixed-depth uniform computation graph made up of standard neural net components, which includes feedforward and convolutional networks.","transformers, complexity theory, logic, interpretability" Robustness Evaluation Using Local Substitute Networks,https://openreview.net/forum?id=yPMsKyrn5A-,https://openreview.net/pdf?id=yPMsKyrn5A-,,"Robustness of a neural network against adversarial examples is an important topic when a deep classifier is applied in safety critical use cases like health care or autonomous driving. In order to assess the robustness, practitioners use a range of tools, from adversarial attacks to the exact computation of the distance to the decision boundary. We use the fact that robustness of a neural network is a local property and empirically show that computing the same metrics for the smaller local substitute networks yields good estimates of the robustness for lower cost. To construct the substitute network we develop two pruning techniques that preserve the local properties of the initial network around a given anchor point. Our experiments on CIFAR10 and MNIST datasets show that this approach saves a significant amount of computing time and is especially beneficial for the larger models.","Robustness, verification, pruning, neural networks" Learning Iterative Neural Optimizers for Image Steganography,https://openreview.net/forum?id=gLPkzWjdhBN,https://openreview.net/pdf?id=gLPkzWjdhBN,,"Image steganography is the process of concealing secret information in images through imperceptible changes. Recent work has formulated this task as a classical constrained optimization problem. In this paper, we argue that image steganography is inherently performed on the (elusive) manifold of natural images, and propose to train an iterative neural network to perform the optimization steps. In contrast to classical optimization methods like L-BFGS or projected gradient descent, we train a neural network to stay close to the manifold of natural images throughout the optimization. We show that our learned neural optimization is faster and more reliable than classical optimization approaches. In comparison to the previous state-of-the-art encoder-decoder based steganography approaches, it reduces the recovery error rate by multiple orders of magnitude and achieves zero error at up to 3 bits per pixel (bpp) without the need for error correcting codes. ", Graph Neural Networks as Multi-View Learning,https://openreview.net/forum?id=r_4nJuPpCh-,https://openreview.net/pdf?id=r_4nJuPpCh-,"A new Multi-View Learning perspective on GNN, which achieves better computation and memory efficiency.","Graph Neural Networks (GNNs) have demonstrated powerful representation capability in semi-supervised node classification. In this task, there are often three types of information -- graph structure, node features, and node labels. Existing GNNs usually leverage both node features and graph structure by feature transformation and aggregation, following end-to-end training via node labels. In this paper, we change our perspective by considering these three types of information as three views of nodes.
This perspective motivates us to design a new GNN framework as multi-view learning, which enables alternating optimization training instead of end-to-end training, resulting in significantly improved computation and memory efficiency. Extensive experiments with different settings demonstrate the effectiveness and efficiency of the proposed method. ","Graph Neural Networks, Alternating Optimization, Semi-Supervised Learning, Multi-View Learning" Cramming: Training a language model on a single GPU in one day,https://openreview.net/forum?id=gUL6zYN4Uaf,https://openreview.net/pdf?id=gUL6zYN4Uaf,"Cramming transformer-based language model pretraining into less compute, what happens?","Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers and practitioners. While most in the community are asking how to push the limits of extreme computation, we ask the opposite question: How far can we get with a single GPU in just one day? We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings. Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting.","Transformers, Scaling, Pretraining, Language" BertNet: Harvesting Knowledge Graphs from Pretrained Language Models,https://openreview.net/forum?id=ntIq8Wm79G-,https://openreview.net/pdf?id=ntIq8Wm79G-,"In this work, we aim at harvesting symbolic KGs from the LMs, and propose a new framework for automatic KG construction empowered by the neural LMs' flexibility and scalability.","Symbolic knowledge graphs (KGs) have been constructed either by expensive human crowdsourcing or with complex text mining pipelines. The emerging large pretrained language models (LMs), such as BERT, have been shown to implicitly encode massive knowledge which can be queried with properly designed prompts. However, compared to the explicit KGs, the implicit knowledge in the black-box LMs is often difficult to access or edit and lacks explainability. In this work, we aim at harvesting symbolic KGs from the LMs, and propose a new framework for automatic KG construction empowered by the neural LMs' flexibility and scalability. Compared to prior works that often rely on large human-annotated data or existing massive KGs, our approach requires only the minimal definition of relations as inputs, and hence is suitable for extracting knowledge of rich new relations that are instantly assigned and were not available before. The framework automatically generates diverse prompts, and performs efficient knowledge search within a given LM for consistent outputs. The knowledge harvested with our approach shows competitive quality, diversity, and novelty.
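The prompt-then-search loop in the BertNet entry can be caricatured with the Hugging Face fill-mask pipeline. This is a hedged toy, using simple score aggregation across paraphrased prompts as a crude stand-in for the paper's consistency-based knowledge search; the model name and prompts are illustrative:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def harvest_tails(prompts, top_k=5):
    """Aggregate fill-mask answers over paraphrased prompts for one relation (sketch).

    prompts: e.g. ["[MASK] is the capital of France.",
                   "France's capital city is [MASK]."]
    Candidates that score consistently across prompts rise to the top.
    """
    votes = {}
    for prompt in prompts:
        for cand in fill(prompt, top_k=top_k):
            votes[cand["token_str"]] = votes.get(cand["token_str"], 0.0) + cand["score"]
    return sorted(votes.items(), key=lambda kv: -kv[1])
```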
As a result, we derive from diverse LMs a family of new KGs (e.g., BERTNET and ROBERTANET) that contain a richer set of relations, including some complex ones (e.g., ""A is capable of but not good at B"") that cannot be extracted with previous methods. Moreover, the resulting KGs also serve as a vehicle to interpret the respective source LMs, leading to new insights into the varying knowledge capability of different LMs. ","knowledge graph, pretrained language models" How Hard is Trojan Detection in DNNs? Fooling Detectors With Evasive Trojans,https://openreview.net/forum?id=V-RDBWYf0go,https://openreview.net/pdf?id=V-RDBWYf0go,We design hard-to-detect trojan attacks for deep neural networks.,"As AI systems become more capable and widely used, a growing concern is the possibility for trojan attacks in which adversaries inject deep neural networks with hidden functionality. Recently, methods for detecting trojans have proven surprisingly effective against existing attacks. However, there is comparatively little work on whether trojans themselves could be rendered hard to detect. To fill this gap, we develop a general method for making trojans more evasive based on several novel techniques and observations. Our method combines distribution-matching, specificity, and randomization to eliminate distinguishing features of trojaned networks. Importantly, our method can be applied to various existing trojan attacks and is detector-agnostic. In experiments, we find that our evasive trojans reduce the efficacy of a wide range of detectors across numerous evaluation settings while maintaining high attack success rates. Moreover, we find that evasive trojans are also harder to reverse-engineer, underscoring the importance of developing more robust monitoring mechanisms for neural networks and clarifying the offense-defense balance of trojan detection.","trojan detection, neural trojans, trojans, hidden functionality, monitoring" Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding,https://openreview.net/forum?id=UERcQuXlwy,https://openreview.net/pdf?id=UERcQuXlwy,"We propose general-purpose pixel-to-text models that can be finetuned on tasks with visually-situated language, such as UIs, charts, figures, tables, documents, etc.","Visually-situated language is ubiquitous---sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, and image captioning. In addition to the novel pretraining strategy, we introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, where language prompts such as questions are rendered directly on top of the input image.
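Rendering the language prompt onto the image, as Pix2Struct does, takes only a few lines with PIL. A hedged sketch of the idea only; the actual preprocessing (fonts, placement, variable-resolution patching) is the paper's:

```python
from PIL import Image, ImageDraw

def render_prompt(image: Image.Image, question: str, header_h: int = 48) -> Image.Image:
    """Render a language prompt directly onto the input image (Pix2Struct-style sketch).

    Adds a white header strip above the screenshot and draws the question there,
    so a pixels-only encoder sees both the prompt and the visual content.
    """
    w, h = image.size
    canvas = Image.new("RGB", (w, h + header_h), "white")
    canvas.paste(image, (0, header_h))
    ImageDraw.Draw(canvas).text((4, 4), question, fill="black")
    return canvas
```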
For the first time, we show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images. ","self-supervised, pretraining, screenshots, parsing, language, vision, transformers, interfaces, charts, figures, tables, documents" Label-Free Synthetic Pretraining of Object Detectors,https://openreview.net/forum?id=LdUByi1hN3,https://openreview.net/pdf?id=LdUByi1hN3,,"We propose a new approach, Synthetic Optimized Layout with Instance Detection (SOLID), to pretrain object detectors with synthetic images. Our ``SOLID'' approach consists of two main components: (1) generating synthetic images using a collection of unlabelled 3D models with optimized scene arrangement; (2) pretraining an object detector on the ""instance detection"" task - given a query image depicting an object, detecting all instances of the exact same object in a target image. Our approach does not need any semantic labels for pretraining and allows the use of arbitrary, diverse 3D models. Experiments on COCO show that with optimized data generation and a proper pretraining task, synthetic data can be highly effective for pretraining object detectors. In particular, pretraining on rendered images achieves performance competitive with pretraining on real images while using significantly fewer computing resources.", Confidence-Conditioned Value Functions for Offline Reinforcement Learning,https://openreview.net/forum?id=Zeb5mTuqT5,https://openreview.net/pdf?id=Zeb5mTuqT5,We propose a new offline reinforcement learning algorithm that adapts how conservative its behavior will be.,"Offline reinforcement learning (RL) promises the ability to learn effective policies solely using existing, static datasets, without any costly online interaction. To do so, offline RL methods must handle distributional shift between the dataset and the learned policy. The most common approach is to learn conservative, or lower-bound, value functions, which underestimate the return of OOD actions. However, such methods exhibit one notable drawback: policies optimized on such value functions can only behave according to a fixed, possibly suboptimal, degree of conservatism. This can be alleviated if we instead learn policies for varying degrees of conservatism at training time and devise a method to dynamically choose one of them during evaluation. To do so, in this work, we propose learning value functions that additionally condition on the degree of conservatism, which we dub confidence-conditioned value functions. We derive a new form of a Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability. By conditioning on confidence, our value functions enable adaptive strategies during online evaluation by controlling for confidence level using the history of observations thus far. This approach can be implemented in practice by conditioning the Q-function from existing conservative algorithms on the confidence. We theoretically show that our learned value functions produce conservative estimates of the true value at any desired confidence. Finally, we empirically show that our algorithm outperforms existing conservative offline RL algorithms on multiple discrete control domains. 
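To make the conditioning in the confidence-conditioned value functions entry above concrete, here is a hedged sketch in which the Q-network takes the confidence level as an extra input, and the bootstrapped target is penalized more heavily, at high confidence, for actions the behavior data supports poorly. The penalty form and the behavior-density estimate are illustrative stand-ins for the paper's derived Bellman backup.

```python
import torch
import torch.nn as nn

class ConfidenceConditionedQ(nn.Module):
    def __init__(self, state_dim, num_actions, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),  # +1 input for confidence delta
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state, delta):
        # delta in (0, 1): larger delta -> more conservative value estimates.
        return self.net(torch.cat([state, delta], dim=-1))

def conservative_target(reward, q_next, delta, behavior_prob, gamma=0.99):
    # Sketch only: shrink bootstrapped values more for poorly supported (OOD)
    # actions when a high confidence level is requested. behavior_prob is an
    # assumed estimate of the dataset policy's action probability.
    penalty = delta * (1.0 - behavior_prob)
    return reward + gamma * (q_next - penalty)
```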
","reinforcement learning, offline reinforcement learning, ensembles, adaptation" Current Anomaly Detectors are Anomalous: On Semantic Treatment of OOD Inputs,https://openreview.net/forum?id=7NUTyhyQt9x,https://openreview.net/pdf?id=7NUTyhyQt9x,"We propose that in-distribution should not be tied to the training distribution but to the distribution of semantic information in training data, and therefore OOD detection should be performed on the semantic information extracted from training data","Machine learning models have achieved impressive performance across different modalities. It is well known that these models are prone to making mistakes on out-of-distribution inputs. OOD detection has, therefore, gained a lot of attention recently. We observe that most existing detectors use the distribution estimated by the training dataset for OOD detection. This can be a serious impediment since faulty OOD detectors can potentially restrict utility of the model. Such detectors, tied to the bias in data collection process, can be impermeable to inputs lying outside the training distribution but with the same semantic information (e.g., class labels) as the training data. We argue that in-distribution should not be tied to just the training distribution but to the distribution of the semantic information contained in the training data. To support our argument, we perform OOD detection on semantic information extracted from the training data of MNIST and COCO datasets, and show that it not only reduces false alarms but also significantly improves detection of OOD inputs with spurious features from training data. ","machine learning, training distribution, out-of-distribution, OODs, detection, semantic information" FedX: Federated Learning for Compositional Pairwise Risk Optimization,https://openreview.net/forum?id=lwVwTjNwNl,https://openreview.net/pdf?id=lwVwTjNwNl,A communication-efficient algorithm to optimize pairwise,"In this paper, we tackle a novel federated learning (FL) problem for optimizing a family of compositional pairwise risks, to which no existing FL algorithms are applicable. In particular, the objective has the form of $\E_{\z\sim \mathcal S_1} f(\E_{\z'\sim\mathcal S_2} \ell(\w; \z, \z'))$, where two sets of data $\mathcal S_1, \mathcal S_2$ are distributed over multiple machines, $\ell(\cdot; \cdot,\cdot)$ is a pairwise loss that only depends on the prediction outputs of the input data pairs $(\z, \z')$, and $f(\cdot)$ is possibly a non-linear non-convex function. This problem has important applications in machine learning, e.g., AUROC maximization with a pairwise loss, and partial AUROC maximization with a compositional loss, etc. The challenges for designing a FL algorithm lie at the non-decomposability of the objective over multiple machines and the interdependency between different machines. We propose two provable FL algorithms (FedX) for handling linear and nolinear $f$, respectively. To tackle the challenges, we decouple the gradient's components with two types namely active parts and lazy parts, where the {\it active} parts depend on local data that can be computed with the local model and the {\it lazy} parts depend on other machines that are communicated/computed based on historical models. We develop a novel theoretical analysis to address the issue of latency of lazy parts and interdependency between the local gradient estimators and the involved data. 
We establish both iteration and communication complexities and show that using the historical models for computing the lazy parts does not degrade the complexity results. We conduct empirical studies of FedX for AUROC and partial AUROC maximization, and demonstrate their performance compared with multiple baselines.","Federated Learning, Pairwise Loss, Compositional Optimization" On the Sensitivity of Reward Inference to Misspecified Human Models,https://openreview.net/forum?id=hJqGbUpDGV,https://openreview.net/pdf?id=hJqGbUpDGV,We investigate the impact of assuming wrong human models on reward learning.,"Inferring reward functions from human behavior is at the center of value alignment – aligning AI objectives with what we, humans, actually want. But doing so relies on models of how humans behave given their objectives. After decades of research in cognitive science, neuroscience, and behavioral economics, obtaining accurate human models remains an open research topic. This raises the question: how accurate do these models need to be in order for the reward inference to be accurate? On the one hand, if small errors in the model can lead to catastrophic error in inference, the entire framework of reward learning seems ill-fated, as we will never have perfect models of human behavior. On the other hand, if as our models improve, we can have a guarantee that reward accuracy also improves, this would show the benefit of more work on the modeling side. We study this question both theoretically and empirically. We do show that it is unfortunately possible to construct small adversarial biases in behavior that lead to arbitrarily large errors in the inferred reward. However, and arguably more importantly, we are also able to identify reasonable assumptions under which the reward inference error can be bounded linearly in the error in the human model. Finally, we verify our theoretical insights in discrete and continuous control tasks with both simulated biases and real human data.","reward learning, inverse reinforcement learning, misspecification" DeepDFA: Dataflow Analysis-Guided Efficient Graph Learning for Vulnerability Detection,https://openreview.net/forum?id=HVvqbDQdhW2,https://openreview.net/pdf?id=HVvqbDQdhW2,"We present DeepDFA, a dataflow analysis-guided graph learning framework and embedding technique for vulnerability detection.","Deep learning-based vulnerability detection models have recently been shown to be effective and, in some cases, outperform static analysis tools. However, the highest-performing approaches use token-based transformer models, which do not leverage domain knowledge. Classical program analysis techniques such as dataflow analysis can detect many types of bugs and are the most commonly used methods in practice. Motivated by the causal relationship between bugs and dataflow analysis, we present DeepDFA, a dataflow analysis-guided graph learning framework and embedding that uses program semantic features for vulnerability detection. We show that DeepDFA is performant and efficient. DeepDFA ranked first in recall, first in generalizing over unseen projects, and second in F1 among all the state-of-the-art models we experimented with. It is also the smallest model in terms of the number of parameters, and was trained in 9 minutes, 69x faster than the highest-performing baseline. DeepDFA can be used with other models. 
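For readers unfamiliar with the dataflow analysis that motivates DeepDFA, the classical reaching-definitions fixpoint looks like the sketch below; the resulting per-node sets are the kind of program-semantic features that can be encoded as GNN node embeddings. The CFG encoding here (preds, gen, kill as dicts of sets) is an assumption of this sketch, not DeepDFA's actual input format.

```python
def reaching_definitions(nodes, preds, gen, kill):
    """Classic gen/kill fixpoint: OUT[n] = gen[n] | (IN[n] - kill[n])."""
    out = {n: set() for n in nodes}
    changed = True
    while changed:                      # iterate transfer functions to a fixpoint
        changed = False
        for n in nodes:
            in_n = set().union(*(out[p] for p in preds[n])) if preds[n] else set()
            new_out = gen[n] | (in_n - kill[n])
            if new_out != out[n]:
                out[n], changed = new_out, True
    return out  # out[n] can be one-hot encoded as a node feature for a GNN
```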
By integrating LineVul and DeepDFA, we achieved the best vulnerability detection performance: a 96.4 F1 score, 98.69 precision, and 94.22 recall.","deep learning, vulnerability detection, dataflow analysis, program analysis" Probability flow solution of the Fokker-Planck equation,https://openreview.net/forum?id=x9tAJ3_N0k,https://openreview.net/pdf?id=x9tAJ3_N0k,We develop an efficient method to solve the Fokker-Planck equation in high-dimension by learning the score of the solution.,"The method of choice for integrating the time-dependent Fokker-Planck equation in high dimension is to generate samples from the solution via integration of the associated stochastic differential equation. Here, we introduce an alternative scheme based on integrating an ordinary differential equation that describes the flow of probability. Acting as a transport map, this equation deterministically pushes samples from the initial density onto samples from the solution at any later time. Unlike integration of the stochastic dynamics, the method has the advantage of giving direct access to quantities that are challenging to estimate from trajectories alone, such as the probability current, the density itself, and its entropy. The probability flow equation depends on the gradient of the logarithm of the solution (its ""score""), and so is a priori unknown. To resolve this dependence, we model the score with a deep neural network that is learned on the fly by propagating a set of samples according to the instantaneous probability current. We consider several high-dimensional examples from the physics of interacting particle systems to highlight the efficiency and precision of the approach; we find that the method accurately matches analytical solutions computed by hand and moments computed via Monte Carlo.","score-based diffusion, high-dimensional scientific computing" Binding Language Models in Symbolic Languages,https://openreview.net/forum?id=lH1PV42cbF,https://openreview.net/pdf?id=lH1PV42cbF,binding language models in symbolic languages,"Though end-to-end neural approaches have recently been dominating NLP tasks in both performance and ease-of-use, they lack interpretability and robustness. We propose Binder, a training-free neural-symbolic framework that maps the task input to a program, which (1) allows binding a unified API of language model (LM) functionalities to a programming language (e.g., SQL, Python) to extend its grammar coverage and thus tackle more diverse questions, (2) adopts an LM as both the program parser and the underlying model called by the API during execution, and (3) requires only a few in-context exemplar annotations. Specifically, we employ GPT-3 Codex as the LM. In the parsing stage, with only a few in-context exemplars, Codex is able to identify the part of the task input that cannot be answered by the original programming language, correctly generate API calls to prompt Codex to solve the unanswerable part, and identify where to place the API calls while being compatible with the original grammar. In the execution stage, Codex can perform versatile functionalities (e.g., commonsense QA, information extraction) given proper prompts in the API calls. Binder achieves state-of-the-art results on the WikiTableQuestions and TabFact datasets, with explicit output programs that benefit human debugging. Note that previous best systems are all finetuned on tens of thousands of task-specific samples, while Binder only uses dozens of annotations as in-context exemplars without any training. 
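The probability-flow scheme in the Fokker-Planck entry above reduces, once the score is in hand, to integrating an ordinary differential equation, roughly dx/dt = b(x, t) - D * score(x, t). Below is a minimal Euler integrator under assumed callables drift and score_net; in the paper the score network is fit on the fly rather than given.

```python
import torch

def transport_samples(x, drift, score_net, D=1.0, t0=0.0, t1=1.0, steps=1000):
    """Push samples x from the density at t0 to the density at t1 via the
    probability-flow ODE dx/dt = drift(x, t) - D * score_net(x, t)."""
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        with torch.no_grad():
            x = x + dt * (drift(x, t) - D * score_net(x, t))
        t += dt
    return x  # samples from the (approximate) solution density at time t1
```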
Our code is available at anonymized.","semantic parsing, large language model, neural symbolic, code generation" Probabilistic Categorical Adversarial Attack and Adversarial Training,https://openreview.net/forum?id=lgIPsrxrU7,https://openreview.net/pdf?id=lgIPsrxrU7,,"The existence of adversarial examples raises serious concerns about applying Deep Neural Networks (DNNs) to safety-critical tasks. However, generating adversarial examples for categorical data is an important problem that lacks extensive exploration. Previously established methods leverage greedy search, which can be very time-consuming when conducting a successful attack. This also limits the development of adversarial training and potential defenses for categorical data. To tackle this problem, we propose a Probabilistic Categorical Adversarial Attack (PCAA), which transforms the discrete optimization problem into a continuous problem that can be solved efficiently by Projected Gradient Descent. In our paper, we theoretically analyze its optimality and time complexity to demonstrate its significant advantage over current greedy-based attacks. Moreover, based on our attack, we propose an efficient adversarial training framework. Through a comprehensive empirical study, we justify the effectiveness of our proposed attack and defense algorithms. ","adversarial attacks, robustness, discrete input model" Multi-Sample Contrastive Neural Topic Model as Multi-Task Learning,https://openreview.net/forum?id=fFAV-_MCTd,https://openreview.net/pdf?id=fFAV-_MCTd,,"Recent representation learning approaches that polish the global semantics of neural topic models optimize a weighted linear combination of the evidence lower bound (ELBO) of the log-likelihood and a discriminative objective that contrasts instance pairings. However, contrastive learning on the individual level might capture noisy mutual information that is irrelevant to the topic modeling task. Moreover, there is a potential conflict between the ELBO loss that memorizes input details for better reconstruction quality, and the contrastive term which attempts to generalize representations among inputs. To address the issues, we first hypothesize that useful features should be shared among multiple input samples. For that reason, we propose a novel set-based contrastive learning method for neural topic models to employ the concept of multi-sample representation learning. Second, because the solution of the linear combination approach might not satisfy all objectives when they compete, we explicitly cast contrastive topic modeling as gradient-based multi-objective optimization, with the goal of achieving a Pareto stationary solution. Extensive experiments demonstrate that our framework consistently produces higher-performing neural topic models in terms of topic coherence and downstream performance. ", Time Will Tell: New Outlooks and A Baseline for Temporal Multi-View 3D Object Detection,https://openreview.net/forum?id=H3HcEJA2Um,https://openreview.net/pdf?id=H3HcEJA2Um,"We leverage complementary coarse, long-term and fine-grained, short-term multi-view stereo for camera-only 3D object detection.","While recent camera-only 3D detection methods leverage multiple timesteps, the limited history they use significantly hampers the extent to which temporal fusion can improve object perception. 
Observing that existing works' fusion of multi-frame images is an instance of temporal stereo matching, we find that performance is hindered by the interplay between 1) the low granularity of matching resolution and 2) the sub-optimal multi-view setup produced by limited history usage. Our theoretical and empirical analysis demonstrates that the optimal temporal difference between views varies significantly for different pixels and depths, making it necessary to fuse many timesteps over long-term history. Building on our investigation, we propose to generate a cost volume from a long history of image observations, compensating for the coarse but efficient matching resolution with a more optimal multi-view matching setup. Further, we augment the per-frame monocular depth predictions used for long-term, coarse matching with short-term, fine-grained matching and find that long- and short-term temporal fusion are highly complementary. While maintaining high efficiency, our framework sets a new state-of-the-art on nuScenes, achieving first place on the test set and outperforming the previous best method by 5.2% mAP and 3.7% NDS on the validation set.","Computer Vision, 3D Object Detection, Stereo Matching" Less Is More: Training on Low-Fidelity Images Improves Robustness to Adversarial Attacks,https://openreview.net/forum?id=8mQrCW_JO3,https://openreview.net/pdf?id=8mQrCW_JO3,,"Since adversarial attacks are defined relative to human perception, it may be fruitful to investigate why human perception (and biological perception in general) is robust to the types of perturbations that DNNs are convincingly deceived by. In the context of vision, we hypothesize that a factor contributing to the robustness of human visual perception is our constant exposure to low-fidelity visual stimuli. To investigate the impact, vis-à-vis adversarial robustness, of exposure to low-fidelity visual stimuli, we train and evaluate object recognition DNNs on images which have been blurred and have had their color saturation reduced. We find that DNNs trained on such images can achieve high classification accuracy over a small number of classes, while becoming significantly more robust to low-magnitude adversarial attacks. Furthermore, we design a blurring module that simulates the loss of visual acuity with increasing eccentricity by selecting the intensity of Gaussian blur at each pixel based on its distance from a given fixation point. Our results indicate that using this retina-inspired blurring mechanism, instead of blurring the entire image with the same Gaussian kernel, yields better robustness while keeping the accuracy on clean data unchanged.", Greedy Information Maximization for Online Feature Selection,https://openreview.net/forum?id=hAcApnx50F,https://openreview.net/pdf?id=hAcApnx50F,A greedy procedure for performing online feature selection by maximizing mutual information,"Feature selection is commonly used to reduce feature acquisition costs, but the standard approach is to train models with static feature subsets. Here, we consider the online feature selection problem, where the model can adaptively query features based on the presently available information. Online feature selection has mainly been viewed as a reinforcement learning problem, but we propose a simpler approach of greedily selecting features that maximize mutual information with the response variable. 
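As a toy, offline illustration of the greedy objective just described, one can rank features by estimated mutual information with the response and repeatedly add the best one. The plug-in estimator below ignores conditioning on already-selected features, which is exactly the gap the paper's learned, adaptive procedure is designed to close.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def greedy_mi_selection(X, y, budget):
    """Greedily pick `budget` feature indices by estimated MI with y."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(budget):
        # Score each remaining candidate; a conditional-MI estimator would be
        # the more faithful (and far more expensive) choice here.
        scores = mutual_info_classif(X[:, remaining], y)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```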
This intuitive idea is difficult to implement without perfect knowledge of the joint data distribution, so we propose a deep learning approach that recovers the greedy procedure when perfectly optimized. We apply our approach to numerous datasets and observe better performance than both RL-based and offline feature selection methods.","Online learning, feature selection, greedy optimization, mutual information" Towards Fair Classification against Poisoning Attacks,https://openreview.net/forum?id=l1tSyx67_J,https://openreview.net/pdf?id=l1tSyx67_J,We propose a new poisoning attack and defense for fair classification methods.,"Fair classification aims to train classification models that achieve equality (of treatment or prediction quality) among different sensitive groups. However, fair classification can be at risk from poisoning attacks, which deliberately insert malicious training samples to manipulate the trained classifiers' performance. In this work, we study the poisoning scenario where the attacker can insert a small fraction of samples into training data, with arbitrary sensitive attributes as well as other predictive features. We demonstrate that fairly trained classifiers can be highly vulnerable to such poisoning attacks, with a much worse accuracy & fairness trade-off, even when we apply some of the most effective defenses (originally proposed to defend traditional classification tasks). As a countermeasure for fair classification tasks, we propose a general and theoretically guaranteed framework which adapts traditional defense methods to fair classification against poisoning attacks. Through extensive experiments, we validate that the proposed defense framework obtains better robustness in terms of accuracy and fairness than baseline methods.","Poisoning Attacks, Fairness, Robustness" Unveiling Transformers with LEGO: A Synthetic Reasoning Task,https://openreview.net/forum?id=1jDN-RfQfrb,https://openreview.net/pdf?id=1jDN-RfQfrb,We propose a synthetic task for logical reasoning on which we study transformer models' intriguing behaviors regarding generalization and the role of pre-training; we gain insights leading to large-scale practical improvements.,"We propose a synthetic reasoning task, LEGO (Learning Equality and Group Operations), that encapsulates the problem of following a chain of reasoning, and we study how Transformer architectures learn this task. We pay special attention to data effects such as pretraining (on seemingly unrelated NLP tasks) and dataset composition (e.g., differing chain length at training and test time), as well as architectural variants such as weight-tied layers or adding convolutional components. We study how the trained models eventually succeed at the task, and in particular, we are able to understand (to some extent) some of the attention heads as well as how the information flows in the network. Based on these observations, we hypothesize that pretraining helps on LEGO tasks due to certain structured attention patterns, and we experimentally verify this hypothesis. We also observe that in some data regimes the trained transformer finds ``shortcut"" solutions to follow the chain of reasoning, which impedes the model's robustness, and moreover we propose ways to prevent it. Motivated by our findings on structured attention patterns, we propose to replace certain attention heads with hardcoded patterns. 
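A minimal version of the head-replacement idea that closes the LEGO entry above: drop the query-key product entirely and hardcode an attention pattern, here a previous-token pattern, one of the structured patterns such analyses typically surface. Integration into a full Transformer block is omitted; this is a sketch, not the paper's implementation.

```python
import torch

def previous_token_attention(v):
    """v: (batch, seq_len, head_dim). Each position attends to its predecessor."""
    _, seq_len, _ = v.shape
    attn = torch.zeros(seq_len, seq_len, device=v.device)
    idx = torch.arange(1, seq_len)
    attn[idx, idx - 1] = 1.0   # hardcoded pattern: no learned query/key product
    attn[0, 0] = 1.0           # first token attends to itself
    return attn @ v            # stands in for softmax(QK^T / sqrt(d)) @ V
```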
This architectural change significantly reduces FLOPs and maintains or even improves the model's performance at large-scale pretraining.","transformers, logical reasoning, role of pretraining, attention pattern" "How Much Data Are Augmentations Worth? An Investigation into Scaling Laws, Invariance, and Implicit Regularization",https://openreview.net/forum?id=3aQs3MCSexD,https://openreview.net/pdf?id=3aQs3MCSexD,"We uncover mechanisms by which data augmentations regularize training and inform the relationship between augmentations and extra data, invariance, stochasticity, and flatness.","Despite the clear performance benefits of data augmentations, little is known about why they are so effective. In this paper, we disentangle several key mechanisms through which data augmentations operate. Establishing an exchange rate between augmented and additional real data, we find that in out-of-distribution testing scenarios, augmentations which yield samples that are diverse but inconsistent with the data distribution can be even more valuable than additional training data. Moreover, we find that data augmentations which encourage invariances can be more valuable than invariance alone, especially on small and medium sized training sets. Following this observation, we show that augmentations induce additional stochasticity during training, effectively flattening the loss landscape.","Data Augmentations, Stochasticity, Flatness, Neural Networks, Invariance" Fast exploration and learning of latent graphs with aliased observations,https://openreview.net/forum?id=4phxC1MmcfN,https://openreview.net/pdf?id=4phxC1MmcfN,,"We consider the problem of quickly recovering the structure of a latent graph by navigating in it, when the agent can only perform stochastic actions and---crucially---different nodes may emit the same observation. This corresponds to learning the transition function of a partially observable Markov decision process (POMDP) in which observations are deterministic. This is highly relevant for partially observed reinforcement learning, where the agent needs to swiftly learn how to navigate new environments from sensory observations. The challenge involves solving two related problems: exploring the graph as fast as possible, and learning it from the obtained aliased observations, where the learning helps to explore faster. Our approach leverages a recently proposed model, the Clone Structured Cognitive Graph (CSCG), which can handle aliasing, and guide exploration. We provide empirical evidence that our model-based algorithm can recover graphs from a wide range of challenging topologies, and shows linear scaling with graph size even for severely aliased and loopy graph structures where model-free methods require an exponential number of steps.","graph learning, fast exploration, aliased environments, POMDPs" Spatial Reasoning Network for Zero-shot Constrained Scene Generation,https://openreview.net/forum?id=ABqIh51jNQm,https://openreview.net/pdf?id=ABqIh51jNQm,This paper introduces the Spatial Reasoning Network for zero-shot constrained scene generation.,"Constrained scene generation (CSG) generates images satisfying a given set of constraints. Zero-shot CSG generates images satisfying constraints not present in the training set without retraining. Recent neural-based models generate images with excellent details, but largely cannot satisfy constraints, especially in complex scenes involving multiple objects. 
Such difficulty is due to the lack of effective approaches combining low-level visual element generation with high-level spatial reasoning. We introduce a Spatial Reasoning Network for constrained scene generation (SPREN). SPREN adds a spatial reasoning module (for high-level spatial reasoning) to state-of-the-art image generation networks (for low-level visual element generation). The spatial reasoning module decides objects' positions following the output of a Recursive Neural Network (RNN), which is trained to learn implicit spatial knowledge (such as trees growing from the ground) from an image dataset. During inference, explicit constraints can be enforced by a forward-checking algorithm, which blocks invalid decisions from the RNN in a zero-shot manner. In experiments, we demonstrate SPREN is able to generate images with excellent detail while satisfying complex spatial constraints. SPREN also transfers good-quality scene generation to unseen constraints without retraining. ","Spatial Reasoning Network, Constrained Scene Generation" Robust Graph Dictionary Learning,https://openreview.net/forum?id=qxRscesArBZ,https://openreview.net/pdf?id=qxRscesArBZ,This paper proposes a robust graph dictionary learning method based on a novel robust variant of GWD.,"Traditional Dictionary Learning (DL) aims to approximate data vectors as sparse linear combinations of basis elements (atoms) and is widely used in machine learning, computer vision, and signal processing. To extend DL to graphs, Vincent-Cuaz et al. (2021) propose a method, called GDL, which describes the topology of each graph with a pairwise relation matrix (PRM) and compares PRMs via the Gromov-Wasserstein Discrepancy (GWD). However, the lack of robustness often excludes GDL from a variety of real-world applications since GWD is sensitive to the structural noise in graphs. This paper proposes an improved graph dictionary learning algorithm based on a robust Gromov-Wasserstein discrepancy (RGWD) which has theoretically sound properties and an efficient numerical scheme. Based on such a discrepancy, our dictionary learning algorithm can learn atoms from noisy graph data. Experimental results demonstrate that our algorithm achieves good performance on both simulated and real-world datasets.","Graph Learning, Optimal Transport" Matrix factorization under the constraint of connectivity between observed and source data ~ Muscle synergy analysis based on connectivity between muscle and brain activities ~,https://openreview.net/forum?id=5XrQ2mskPQz,https://openreview.net/pdf?id=5XrQ2mskPQz,,"Matrix factorization is a popular method to investigate the hidden elements in observed data for tasks such as speech separation and muscle synergy analysis. The hidden elements may be closely related to the source phenomenon that causes the observed phenomenon. However, conventional methods do not always factorize the observed phenomenon elements with the connectivity between the observed and source phenomena because they only use the observed phenomenon. This paper proposes a matrix decomposition method that constrains the connectivity between observed and source data by using the representations from a decoding model from source data to observed data. We applied our method to the corticomuscular system, which is made up of corticospinal pathways between the primary motor cortex and muscles in the body and creates muscle synergies that enable efficient connections between the brain and muscles. 
In this context, muscle activities are the observed phenomenon and brain activities are the source. Many previous studies have analyzed muscle synergies using only observed muscle activity, but there may be unrevealed muscle synergies under the constraint of the connectivity between brain and muscle activity. We therefore simultaneously recorded the brain activity from multiple regions of an extensive cortical area and the activity of multiple muscles of a monkey's forelimb while it performed a reach and grasp task throughout the course of recovery from a partial spinal cord injury (SCI). Analysis of a dataset recorded before SCI showed that some of the muscle synergies calculated by the proposed method using brain and muscle activities did not exhibit a high degree of similarity to synergies obtained from the conventional method. Results of the proposed method on the monkey after SCI showed an adaptive change in the number of muscle synergies associated with the degree of functional recovery. Specifically, the number of muscle synergies obtained by the proposed method initially increased immediately after SCI and then gradually decreased, while those obtained by a conventional method remained the same in number before and after SCI. These results suggest that our method is able to capture the unrevealed connectivity in the corticomuscular system that contributes to functional recovery: in other words, that it can factorize the observed data under the constraint of the connectivity between the observed and source data. Our work thus demonstrates the importance of using not only observed data but also source data to reveal unknown hidden elements.","Matrix factorization, Muscle synergy" Fundamental limits on the robustness of image classifiers,https://openreview.net/forum?id=gpmL0D4VjN4,https://openreview.net/pdf?id=gpmL0D4VjN4,Image classifiers are fundamentally sensitive to small perturbations in their inputs.,"We prove that image classifiers are fundamentally sensitive to small perturbations in their inputs. Specifically, we show that given some image space of $n$-by-$n$ images, all but a tiny fraction of images in any image class induced over that space can be moved outside that class by adding some perturbation whose $p$-norm is $O(n^{1/\max(p,1)})$, as long as that image class takes up at most half of the image space. We then show that $O(n^{1/\max(p,1)})$ is asymptotically optimal. Finally, we show that an increase in the bit depth of the image space leads to a loss in robustness. We supplement our results with a discussion of their implications for vision systems.","Theory, Computer vision, Isoperimetry" Stochastic Constrained DRO with a Complexity Independent of Sample Size,https://openreview.net/forum?id=vep-Hlmn0tc,https://openreview.net/pdf?id=vep-Hlmn0tc,,"Distributionally Robust Optimization (DRO), as a popular method to train robust models against distribution shifts between training and test sets, has received tremendous attention in recent years. In this paper, we propose and analyze stochastic algorithms that apply to both non-convex and convex losses for solving the Kullback–Leibler divergence constrained DRO problem. Compared with existing methods solving this problem, such as primal-dual methods and large mini-batch methods, our stochastic algorithms not only enjoy a competitive, if not better, complexity independent of the sample size but also require only a constant batch size at every iteration, which is more practical for broad applications. 
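For intuition about the KL-constrained objective in the entry above: the worst-case distribution inside a KL ball reweights samples in proportion to exp(loss / lambda), so a mini-batch robust loss can be sketched as below. Treating the temperature lambda as a fixed constant is a simplification of this sketch; relating it to the KL radius and handling it with constant batch sizes is where the paper's algorithms and analysis come in.

```python
import torch

def kl_dro_loss(per_sample_losses, lam=1.0):
    """Mini-batch KL-DRO surrogate: softmax(l / lam) gives the adversarial
    sample weights; the weighted sum is the robust objective for this batch."""
    w = torch.softmax(per_sample_losses.detach() / lam, dim=0)
    return (w * per_sample_losses).sum()
```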
We establish a nearly optimal complexity bound for finding an $\epsilon$-stationary solution for non-convex losses and an optimal complexity for finding an $\epsilon$-optimal solution for convex losses. Empirical studies demonstrate the effectiveness of the proposed algorithms for solving non-convex and convex constrained DRO problems. ", "Evolve Smoothly, Fit Consistently: Learning Smooth Latent Dynamics For Advection-Dominated Systems",https://openreview.net/forum?id=Z4s73sJYQM,https://openreview.net/pdf?id=Z4s73sJYQM,,"We present a data-driven, space-time continuous framework to learn surrogate models for complex physical systems described by advection-dominated partial differential equations. Those systems have slow-decaying Kolmogorov n-width that hinders standard methods, including reduced order modeling, from producing high-fidelity simulations at low cost. In this work, we construct hypernetwork-based latent dynamical models directly on the parameter space of a compact representation network. We leverage the expressive power of the network and a specially designed consistency-inducing regularization to obtain latent trajectories that are both low-dimensional and smooth. These properties render our surrogate models highly efficient at inference time. We show the efficacy of our framework by learning models that generate accurate multi-step rollout predictions at much faster inference speed compared to competitors, for several challenging examples.", Dissecting adaptive methods in GANs,https://openreview.net/forum?id=hfaNXjEQB47,https://openreview.net/pdf?id=hfaNXjEQB47,,"Adaptive methods are a crucial component widely used for training generative adversarial networks (GANs). While there has been some work to pinpoint the “marginal value of adaptive methods” in standard tasks, it remains unclear why they are still critical for GAN training. In this paper, we formally study how adaptive methods help train GANs; inspired by the grafting method proposed in (Agarwal et al. 2021), we separate the magnitude and direction components of the Adam updates, and graft them to the direction and magnitude of SGDA updates respectively. By considering an update rule with the magnitude of the Adam update and the normalized direction of SGD, we empirically show that the adaptive magnitude of Adam is key for GAN training. This motivates us to have a closer look at the class of normalized stochastic gradient descent ascent (nSGDA) methods in the context of GAN training. We propose a synthetic theoretical framework to compare the performance of nSGDA and SGDA for GAN training with neural networks. We prove that in that setting, GANs trained with nSGDA recover all the modes of the true distribution, whereas the same networks trained with SGDA (and any learning rate configuration) suffer from mode collapse. The critical insight in our analysis is that normalizing the gradients forces the discriminator and generator to be updated at the same pace. We also experimentally show that for several datasets, Adam's performance can be recovered with nSGDA methods.","deep learning theory, generative adversarial networks, adaptive methods" Recycling Scraps: Improving Private Learning by Leveraging Intermediate Checkpoints,https://openreview.net/forum?id=IskSBCo0-0,https://openreview.net/pdf?id=IskSBCo0-0,"DP-ML benchmarks and deployments typically use only the final model to make predictions. 
In this work, for the first time, we comprehensively explore various methods that aggregate intermediate checkpoints to improve the utility of DP training.","All state-of-the-art (SOTA) differentially private machine learning (DP ML) methods are iterative in nature, and their privacy analyses allow publicly releasing the intermediate training checkpoints. However, DP ML benchmarks, and even practical deployments, typically use only the final training checkpoint to make predictions. In this work, for the first time, we comprehensively explore various methods that aggregate intermediate checkpoints to improve the utility of DP training. Empirically, we demonstrate that checkpoint aggregations provide significant gains in the prediction accuracy over the existing SOTA for CIFAR10 and StackOverflow datasets, and that these gains get magnified in settings with periodically varying training data distributions. For instance, we improve SOTA StackOverflow accuracies to 22.7\% (+0.43\% absolute) for $\epsilon=8.2$, and 23.84\% (+0.43\%) for $\epsilon=18.9$. Theoretically, we show that uniform tail averaging of checkpoints improves the empirical risk minimization bound compared to the last checkpoint of DP-SGD. Lastly, we initiate an exploration into estimating the uncertainty that DP noise adds to the predictions of DP ML models. We prove that, under standard assumptions on the loss function, the sample variance from the last few checkpoints provides a good approximation of the variance of the final model of a DP run. Empirically, we show that the last few checkpoints can provide a reasonable lower bound for the variance of a converged DP model. ","Differential privacy, training checkpoints, confidence intervals, uncertainty" Understanding Influence Functions and Datamodels via Harmonic Analysis,https://openreview.net/forum?id=cxCEOSF99f,https://openreview.net/pdf?id=cxCEOSF99f,"This paper establishes connections between datamodels, influence functions and Fourier coefficients using theoretical tools from harmonic analysis of Boolean functions","Influence functions estimate the effect of individual data points on the predictions of the model on test data and were adapted to deep learning in \cite{koh2017understanding}. They have been used for detecting data poisoning, detecting helpful and harmful examples, measuring the influence of groups of datapoints, etc. Recently, \cite{ilyas2022datamodels} introduced a linear regression method they termed {\em datamodels} to predict the effect of training points on outputs on test data. The current paper seeks to provide a better theoretical understanding of such interesting empirical phenomena. The primary tool is harmonic analysis and the idea of {\em noise stability}. Contributions include: (a) Exact characterization of the learnt datamodel in terms of Fourier coefficients. (b) An efficient method to estimate the residual error and quality of the optimum linear datamodel without having to train the datamodel. (c) New insights into when influences of groups of datapoints may or may not add up linearly.","theory, datamodels, influence functions, fourier analysis, harmonic analysis" BC-IRL: Learning Generalizable Reward Functions from Demonstrations,https://openreview.net/forum?id=Ovnwe_sDQW,https://openreview.net/pdf?id=Ovnwe_sDQW,,"How well do reward functions learned with inverse reinforcement learning (IRL) generalize? We illustrate that state-of-the-art IRL algorithms, which maximize a maximum-entropy objective, learn rewards that overfit to the demonstrations. 
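The uniform tail averaging analyzed in the Recycling Scraps entry above is simple to state in code: average the parameters of the last k checkpoints and predict with the averaged model. The sketch assumes checkpoints is a list of state_dicts from a single architecture with floating-point parameters.

```python
import copy
import torch

def tail_average(model, checkpoints, k):
    """Load model with the uniform average of the last k checkpoint state_dicts."""
    tail = checkpoints[-k:]
    avg = copy.deepcopy(tail[0])
    for key in avg:
        vals = [sd[key] for sd in tail]
        if avg[key].dtype.is_floating_point:
            avg[key] = torch.stack(vals).mean(dim=0)  # average float params/buffers
        else:
            avg[key] = vals[-1]                       # keep integer buffers as-is
    model.load_state_dict(avg)
    return model
```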
Such rewards struggle to provide meaningful rewards for states not covered by the demonstrations, a major detriment when using the reward to learn policies in new situations. We introduce BC-IRL, a new inverse reinforcement learning method that learns reward functions that generalize better when compared to maximum-entropy IRL approaches. In contrast to the MaxEnt framework, which learns to maximize rewards around demonstrations, BC-IRL updates reward parameters such that the policy trained with the new reward matches the expert demonstrations better. We show that BC-IRL learns rewards that generalize better on an illustrative simple task and two continuous robotic control tasks, achieving over twice the success rate of baselines in challenging generalization settings.","inverse reinforcement learning, reward learning, reinforcement learning, imitation learning" TextGrad: Advancing Robustness Evaluation in NLP by Gradient-Driven Optimization,https://openreview.net/forum?id=5tKXUZil3X,https://openreview.net/pdf?id=5tKXUZil3X,,"Robustness evaluation against adversarial examples has become increasingly important to unveil the trustworthiness of the prevailing deep models in natural language processing (NLP). However, in contrast to the computer vision domain, where first-order projected gradient descent (PGD) is used as the benchmark approach to generate adversarial examples for robustness evaluation, a principled first-order gradient-based robustness evaluation framework is lacking in NLP. The emerging optimization challenges lie in 1) the discrete nature of textual inputs together with the strong coupling between the perturbation location and the actual content, and 2) the additional constraint that the perturbed text should be fluent and achieve a low perplexity under a language model. These challenges make the development of PGD-like NLP attacks difficult. To bridge the gap, we propose TextGrad, a new attack generator using gradient-driven optimization, supporting high-accuracy and high-quality assessment of adversarial robustness in NLP. Specifically, we address the aforementioned challenges in a unified optimization framework. We develop an effective convex relaxation method to co-optimize the continuously-relaxed site selection and perturbation variables and leverage an effective sampling method to establish an accurate mapping from the continuous optimization variables to the discrete textual perturbations. Moreover, as a first-order attack generation method, TextGrad can be baked into adversarial training to further improve the robustness of NLP models. Extensive experiments are provided to demonstrate the effectiveness of TextGrad not only in attack generation for robustness evaluation but also in adversarial defense. From the attack perspective, we show that TextGrad achieves remarkable improvements in both the attack success rate and the perplexity score over five state-of-the-art baselines. From the defense perspective, TextGrad-enabled adversarial training yields the most robust NLP model against a wide spectrum of NLP attacks. ", Robustness for Free: Adversarially Robust Anomaly Detection Through Diffusion Model,https://openreview.net/forum?id=imIlOpuEsi,https://openreview.net/pdf?id=imIlOpuEsi,,"Deep learning-based anomaly detection models have achieved remarkably high accuracy on commonly used benchmark datasets. 
However, the robustness of those models may not be satisfactory due to the existence of adversarial examples, which pose significant threats to the practical deployment of deep anomaly detectors. To tackle this issue, we propose an adversarially robust anomaly detector based on the diffusion model. There are two things that make diffusion models a perfect match for our task: 1) the diffusion model itself is a reconstruction-based modeling method whose reconstruction error can serve as a natural indicator of the anomaly score; 2) previous studies have shown that diffusion models can help purify the data for better adversarial robustness. In this work, we highlight that our diffusion-model-based method gains adversarial robustness for free: the diffusion model will act both as an anomaly detector and an adversarial defender, thus no extra adversarial training or data purification is needed as in standard robust image classification tasks. We also extend our proposed method for certified robustness to $l_2$ norm bounded perturbations. Through extensive experiments, we show that our proposed method exhibits outstanding (certified) adversarial robustness while also maintaining equally strong anomaly detection performance on par with the state-of-the-art anomaly detectors on benchmark datasets.", Optimal control neural networks for data-driven discovery of gradient flows.,https://openreview.net/forum?id=W0-FISdkHtZ,https://openreview.net/pdf?id=W0-FISdkHtZ,This paper proposes an optimal control neural network to discover nonlinear dynamical systems from time series data sampled from solution trajectories.,"This work aims to discover nonlinear dynamical systems from only a set of time series data on solution trajectories. To tackle this problem, we propose Optimal Control Networks (OCN) to learn the unknown vector field accurately and efficiently. The OCN consists of a neural network representation of the system, coupled with an optimal control formulation. Specifically, we formulate the parameter learning problem as a data-driven optimal control problem. This allows for the use of powerful optimal control tools. We derive generalization error bounds for both the solution and the vector field, and the bounds are shown to depend on both the training error and the time gaps between neighboring data. We also provide several numerical examples to demonstrate the viability of OCN, as well as its good generalization ability.","neural network, data-driven, dynamical system, gradient flow, optimal control" ErrorAug: Making Errors to Find Errors in Semantic Segmentation,https://openreview.net/forum?id=cPhfgGIbVfZ,https://openreview.net/pdf?id=cPhfgGIbVfZ,Introduces Error Augmentation as a framework for reliably producing error detectors for semantic segmentation with less architectural and computational complexity.,"In order to develop trustworthy downstream applications for semantic segmentation models, it is important not only to understand the performance of a model on datasets, but also to localize areas where the model may produce errors. Pixel-wise error prediction of semantic segmentation maps is a challenging problem for which prior work relies on complicated image resynthesis pipelines. We introduce \textit{error augmentation}, a framework which enables us to learn robust error detectors by applying data transformations independently on the predicted segmentation maps. 
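A hedged sketch of the "robustness for free" recipe in the diffusion-based anomaly detection entry above: diffuse the input to a fixed timestep (which also purifies adversarial perturbations), denoise with a pretrained diffusion model, and use the reconstruction error as the anomaly score. The denoise callable and the noise-schedule value alpha_bar_t are assumed inputs of this sketch.

```python
import torch

def diffusion_anomaly_score(x, denoise, alpha_bar_t, t):
    """Score inputs by reconstruction error through a diffusion round-trip."""
    noise = torch.randn_like(x)
    # Forward diffusion to timestep t (also acts as adversarial purification).
    x_t = (alpha_bar_t ** 0.5) * x + ((1 - alpha_bar_t) ** 0.5) * noise
    x_hat = denoise(x_t, t)                   # assumed reverse process back to t=0
    # Large reconstruction error -> likely anomalous input.
    return ((x - x_hat) ** 2).flatten(1).mean(dim=1)
```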
This enables direct prediction of pixel-wise errors in semantic segmentation maps (an approach treated only as a naive baseline in prior work) to achieve state-of-the-art performance. As a proof of concept, we propose a series of three simple transformations that generate challenging segmentation errors by swapping pixel predictions within a segmentation map. Our approach outperforms previous methods of error detection for semantic segmentation across all metrics and improves performance by over $7.8\%$ on AUPR-Error. Additionally, we show that our approach not only generalizes to unseen test examples, but remains reliable despite significant shifts in the target domain. ","Semantic Segmentation, Uncertainty Quantification, Error Detection" Kernel Regression with Infinite-Width Neural Networks on Millions of Examples,https://openreview.net/forum?id=ED3WvUgu09,https://openreview.net/pdf?id=ED3WvUgu09,We enable kernel regression with infinite-width neural networks at a larger scale than was previously possible to calculate scaling laws across many orders of magnitude and achieve SotA results on protein and small molecule prediction benchmarks.,"While kernel regression remains an important practical method, its connection to neural networks as their width becomes large has initiated fresh research. These neural kernels have drastically increased performance on diverse and nonstandard data modalities but require significantly more compute, which previously limited their application to smaller datasets. We address this by massively parallelizing their computation across many GPUs. We combine this with a distributed, preconditioned conjugate gradients algorithm to enable kernel regression at a large scale (i.e., up to 5 million examples). Using this approach, we study scaling laws of several neural kernels across many orders of magnitude for the CIFAR-5m dataset. Using data augmentation to expand the original CIFAR-10 training dataset by a factor of 20, we obtain a test accuracy of 91.2\% (SotA for a pure kernel method). Finally, we explore other data modalities, obtaining results on protein and small molecule prediction tasks that are competitive with SotA methods. ","gaussian processes, neural tangent kernel, infinite-width neural networks" Information Plane Analysis for Dropout Neural Networks,https://openreview.net/forum?id=bQB6qozaBw,https://openreview.net/pdf?id=bQB6qozaBw,"Information plane analysis is a promising tool for neural network analysis, for which mutual information can be measured more reliably with continuous dropout","The information-theoretic framework promises to explain the predictive power of neural networks. In particular, the information plane analysis, which measures mutual information (MI) between input and representation as well as representation and output, should give rich insights into the training process. This approach, however, was shown to strongly depend on the choice of estimator of the MI: measuring discrete MI does not capture the nature of deterministic neural networks and continuous data distributions, and different approaches for discretization arbitrarily change results. On the other hand, measuring continuous MI for a deterministic network is not mathematically meaningful. In this work, we show how the stochasticity induced by dropout layers can be utilized to estimate MI in a theoretically sound manner. 
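A toy version of the error-synthesis transformations in the ErrorAug entry above: corrupt a predicted segmentation map by swapping two class labels and emit the resulting pixel-wise error mask as supervision for an error detector. The swap-two-random-classes rule is an illustrative stand-in for the paper's three transformations.

```python
import numpy as np

def swap_labels(seg_map, rng=np.random):
    """Return (corrupted_map, error_mask) from a predicted segmentation map."""
    labels = np.unique(seg_map)
    corrupted = seg_map.copy()
    if len(labels) >= 2:
        a, b = rng.choice(labels, size=2, replace=False)
        corrupted[seg_map == a] = b          # pixels of class a now predict b
        corrupted[seg_map == b] = a
    error_mask = (corrupted != seg_map).astype(np.uint8)
    return corrupted, error_mask             # train the detector: corrupted -> mask
```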
We demonstrate in a range of experiments that this approach enables a meaningful information plane analysis for the large class of dropout neural networks that is widely used in practice.","information plane, deep learning, mutual information, dropout, continuous distributions" Fed-Cor: Federated Correlation Test with Secure Aggregation,https://openreview.net/forum?id=BgLTe3a4FO,https://openreview.net/pdf?id=BgLTe3a4FO,"We propose the first secure federated correlation test protocol Fed-Cor, which minimizes both privacy leakage and communication cost.","In this paper, we propose the first federated correlation test framework compatible with secure aggregation, namely Fed-Cor. In Fed-Cor, correlation tests are recast as frequency moment estimation problems. To estimate the frequency moments, the clients collaboratively generate a shared projection matrix and then use stable projection to encode the local information in a compact vector. As such encodings can be linearly aggregated, secure aggregation can be applied to conceal the individual updates. We formally establish the security guarantee of Fed-Cor by proving that only the minimum necessary information (i.e., the correlation statistics) is revealed to the server. The evaluation results show that Fed-Cor achieves good accuracy with small client-side computation overhead and performs comparably to the centralized correlation test in several real-world case studies.","Federated Analytics, Privacy and Security" Feasible Adversarial Robust Reinforcement Learning for Underspecified Environments,https://openreview.net/forum?id=Su_HbZ0Sdz,https://openreview.net/pdf?id=Su_HbZ0Sdz,We propose a method to train a robust RL agent to feasible parameters even when the adversary has access to infeasible parameters.,"Robust reinforcement learning (RL) considers the problem of learning policies that perform well in the worst case among a set of possible environment parameter values. In real-world environments, choosing the set of possible values for robust RL can be a difficult task. When that set is specified too narrowly, the agent will be left vulnerable to reasonable parameter values unaccounted for. When specified too broadly, the agent will be too cautious. In this paper, we propose Feasible Adversarial Robust RL (FARR), a novel problem formulation and objective for automatically determining the set of environment parameter values over which to be robust. FARR implicitly defines the set of feasible parameter values as those on which an agent could achieve a benchmark reward given enough training resources. By formulating this problem as a two-player zero-sum game, optimizing the FARR objective jointly produces an adversarial distribution over parameter values with feasible support and a policy robust over this feasible parameter set. We demonstrate that approximate Nash equilibria for this objective can be found using a variation of the PSRO algorithm. Furthermore, we show that an optimal agent trained with FARR is more robust to feasible adversarial parameter selection than with existing minimax, domain-randomization, and regret objectives in a parameterized gridworld and three MuJoCo control environments.","reinforcement learning, robust rl, sim-to-real, game-theory, psro" Dynamical systems embedding with a physics-informed convolutional network,https://openreview.net/forum?id=z9C5dGip90,https://openreview.net/pdf?id=z9C5dGip90,"Unsupervised framework for learning high-quality, physically-meaningful embeddings of dynamical systems. 
","Dynamical systems are found in innumerable forms across the physical and biological sciences, yet all these systems fall naturally into universal equivalence classes: conservative or dissipative, stable or unstable, compressible or incompressible. Predicting these classes from data remains an essential open challenge in computational physics at which existing time-series classification methods struggle. Here, we propose, \texttt{phase2vec}, an embedding method that learns high-quality, physically-meaningful representations of dynamical systems without supervision. Our embeddings are produced by a convolutional backbone that extracts geometric features from flow data and is trained to minimize a physically-informed vector field reconstruction loss. The trained architecture can not only predict the equations of unseen data, but also, crucially, learns embeddings that respect the underlying semantics of the embedded physical systems. We first validate the quality of these learned embeddings by showing that the dynamical features they encode can be used to denoise corrupted testing data. Next, we examine the extent to which the underlying physical categories of input data can be decoded from embeddings compared to standard blackbox classifiers and state-of-the-art time series classification techniques. We find that our embeddings encode important physical properties of the underlying data, including the stability of fixed points, conservation of energy, and the incompressibility of flows, with greater fidelity than competing methods. We finally apply our embeddings to the analysis of meteorological data, showing we can detect climatically meaningful features. Collectively, our results demonstrate the viability of embedding approaches for the discovery of dynamical features in physical systems.","dynamical systems, convolutional networks, computational physics, physics-informed" Learning Harmonic Molecular Representations on Riemannian Manifold,https://openreview.net/forum?id=ySCL-NG_I3,https://openreview.net/pdf?id=ySCL-NG_I3,We propose a harmonic molecular representation learning framework to achieve multi-resolution molecular encoding on 2D Riemannian manifold.,"Molecular representation learning plays a crucial role in AI-assisted drug discovery research. Encoding 3D molecular structures through Euclidean neural networks has become the prevailing method in the geometric deep learning community. However, the equivariance constraints and message passing in Euclidean space may limit the network expressive power. In this work, we propose a Harmonic Molecular Representation learning (HMR) framework, which represents a molecule using the Laplace-Beltrami eigenfunctions of the molecular surface. HMR offers a multi-resolution representation of molecular geometric and chemical properties on 2D Riemannian manifold. We also introduce a harmonic message passing method to realize efficient spectral message passing over the surface manifold for better molecular encoding. 
Our proposed method shows comparable predictive power to current models in small molecule property prediction, and outperforms the state-of-the-art deep learning models for the rigid protein docking challenge, demonstrating its versatility in molecular representation learning.","Riemannian manifold, molecular surface, harmonic analysis, functional map, binding site prediction, rigid protein docking" When is Offline Hyperparameter Selection Feasible for Reinforcement Learning?,https://openreview.net/forum?id=Hvcmr6FSIX8,https://openreview.net/pdf?id=Hvcmr6FSIX8,,"Hyperparameter selection is a critical procedure before deploying reinforcement learning algorithms in real-world applications. However, hyperparameter selection prior to deployment requires selecting policies offline without online execution, which is a significant challenge known as offline policy selection. As yet, there is little understanding of the fundamental limitations of the offline policy selection problem. To contribute to our understanding of this problem, in this paper, we investigate when sample efficient offline policy selection is possible. As off-policy policy evaluation (OPE) is a natural approach for policy selection, the sample complexity of offline policy selection is therefore upper-bounded by the number of samples needed to perform OPE. In addition, we prove that the sample complexity of offline policy selection is also lower-bounded by the sample complexity of OPE. These results imply not only that offline policy selection is effective when OPE is effective, but also that sample efficient policy selection is not possible without additional assumptions that make OPE effective. Moreover, we theoretically study the conditions under which offline policy selection using Fitted Q evaluation (FQE) and the Bellman error is sample efficient. We conclude with an empirical study comparing FQE and Bellman errors for offline policy selection.","Offline policy selection, offline reinforcement learning, off-policy policy evaluation" Plansformer: Generating Multi-Domain Symbolic Plans using Transformers,https://openreview.net/forum?id=uvSQ8WhWHQ,https://openreview.net/pdf?id=uvSQ8WhWHQ,"Plansformer, an LLM fine-tuned on planning problems and capable of generating plans with favorable behavior in terms of correctness and length with minimal knowledge-engineering efforts. ","Large Language Models (LLMs) have been the subject of active research, significantly advancing the field of Natural Language Processing (NLP). From BERT to BLOOM, LLMs have surpassed state-of-the-art results in various natural language tasks such as question answering, summarization, and text generation. Many ongoing efforts are focused on understanding LLMs' capabilities, including their knowledge of the world, syntax, and semantics. However, extending the textual prowess of LLMs to symbolic reasoning has been slow and predominantly focused on tackling problems related to the mathematical field. In this paper, we explore the use of LLMs for automated planning - a branch of AI concerned with the realization of action sequences (plans) to achieve a goal, typically for execution by intelligent agents, autonomous robots, and unmanned vehicles. We introduce Plansformer, an LLM fine-tuned on planning problems and capable of generating plans with favorable behavior in terms of correctness and length with minimal knowledge-engineering efforts.
We also demonstrate the adaptability of Plansformer in solving different planning domains with varying complexities, owing to the transfer learning abilities of LLMs. For one configuration of Plansformer, we achieve ~97\% valid plans, out of which ~95\% are optimal for Towers of Hanoi - a puzzle-solving domain.","automated planning, language models, transfer learning" Greedy Actor-Critic: A New Conditional Cross-Entropy Method for Policy Improvement,https://openreview.net/forum?id=eSQh8rG8Oa,https://openreview.net/pdf?id=eSQh8rG8Oa,We propose an alternative update for the actor in actor-critic algorithms that does not rely on entropy-regularization,"Many policy gradient methods are variants of Actor-Critic (AC), where a value function (critic) is learned to facilitate updating the parameterized policy (actor). The update to the actor involves a log-likelihood update weighted by the action-values, with the addition of entropy regularization for soft variants. In this work, we explore an alternative update for the actor, based on an extension of the cross entropy method (CEM) to condition on inputs (states). The idea is to start with a broader policy and slowly concentrate around maximal actions, using a maximum likelihood update towards actions in the top percentile per state. The speed of this concentration is controlled by a proposal policy that concentrates at a slower rate than the actor. We first provide a policy improvement result in an idealized setting, and then prove that our conditional CEM (CCEM) strategy tracks a CEM update per state, even with changing action-values. We empirically show that our Greedy AC algorithm, which uses CCEM for the actor update, performs better than Soft AC and is much less sensitive to entropy-regularization.","actor-critic, policy gradient, entropy, cross-entropy method, greedy actor-critic, policy optimization" VISION TRANSFORMER FOR MULTIVARIATE TIME-SERIES CLASSIFICATION (VITMTSC),https://openreview.net/forum?id=IJn-rxhkZsN,https://openreview.net/pdf?id=IJn-rxhkZsN,A Vision Transformer based Multivariate Time-Series Classification model that significantly outperforms current SOTA on commercial datasets.,"Multivariate Time-Series Classification (MTSC) is an important issue in many disciplines because of the proliferation of disparate data sources and sensors (economics, retail, health, etc.). Nonetheless, it remains difficult due to the high dimensionality and richness of data that is regularly updated. We present a Vision Transformer for Multivariate Time-Series Classification (VitMTSC) model that learns latent features from raw time-series data for classification tasks and is applicable to large-scale time-series data with millions of data samples of variable lengths. To the best of our knowledge, this is the first implementation of the Vision Transformer (ViT) for MTSC. We demonstrate that our approach works on datasets ranging from a few thousand to millions of samples and achieves close to state-of-the-art (SOTA) results on open datasets. Using click-stream data from a major retail website, we demonstrate that our model can scale to millions of samples and vastly outperform previous neural net-based MTSC models in real-world applications. Our source code is publicly accessible at https://github.com/mtsc-research/vitmtsc to facilitate further research.
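To make the ViT-for-time-series idea above concrete, here is a minimal, hypothetical PyTorch sketch (not the VitMTSC repository code linked above): a multivariate series is sliced into temporal patches, each patch is embedded as a token, and classification is read off a transformer encoder. All layer sizes, names, and hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of a ViT-style classifier over temporal patches of a
# multivariate time series; dimensions are illustrative, not the paper's.
import torch
import torch.nn as nn

class TinyViTForMTSC(nn.Module):
    def __init__(self, n_channels=8, seq_len=128, patch_len=16,
                 d_model=64, n_heads=4, n_layers=2, n_classes=5):
        super().__init__()
        assert seq_len % patch_len == 0, "series length must tile into patches"
        self.patch_len = patch_len
        n_patches = seq_len // patch_len
        # Each patch = all channels over one time window, flattened and projected.
        self.patch_proj = nn.Linear(n_channels * patch_len, d_model)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_emb = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):  # x: (batch, n_channels, seq_len)
        b = x.size(0)
        # (batch, channels, n_patches, patch_len) -> (batch, n_patches, channels*patch_len)
        p = x.unfold(2, self.patch_len, self.patch_len).permute(0, 2, 1, 3).flatten(2)
        tokens = self.patch_proj(p)
        h = torch.cat([self.cls_token.expand(b, -1, -1), tokens], dim=1) + self.pos_emb
        return self.head(self.encoder(h)[:, 0])  # classify from the [CLS] token

logits = TinyViTForMTSC()(torch.randn(4, 8, 128))  # -> shape (4, 5)
```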
","time-series classification, vision-transformer, transformer" Multi-Environment Pretraining Enables Transfer to Action Limited Datasets,https://openreview.net/forum?id=-kAWfaLkPT3,https://openreview.net/pdf?id=-kAWfaLkPT3,,"Using massive datasets to train large-scale models has emerged as a dominant approach for broad generalization in natural language and vision applications. In reinforcement learning, however, a key challenge is that available data of sequential decision making is often not annotated with actions - for example, videos of game-play are much more available than sequences of frames paired with the logged game controls. We propose to circumvent this challenge by combining large but sparsely-annotated datasets from a \emph{target} environment of interest with fully-annotated datasets from various other \emph{source} environments. Our method, Action Limited PreTraining (ALPT), leverages the generalization capabilities of inverse dynamics modelling (IDM) to label missing action data in the target environment. We show that utilizing even one additional environment dataset of labelled data during IDM pretraining gives rise to substantial improvements in generating action labels for unannotated sequences. We evaluate our method on benchmark game-playing environments and show that we can significantly improve game performance and generalization capability compared to other approaches, even when using annotated datasets equivalent to only $12$ minutes of gameplay. ", Preserving In-Context Learning Ability in Large Language Model Fine-tuning,https://openreview.net/forum?id=sVV0KK3COzD,https://openreview.net/pdf?id=sVV0KK3COzD,,"Pretrained large language models (LLMs) are strong in-context learners that are able to perform few-shot learning without changing model parameters. However, as we show, fine-tuning an LLM on any specific task generally destroys its in-context ability. We discover an important cause of this loss, format specialization, where the model overfits to the format of the fine-tuned task and is unable to output anything beyond this format. We further show that format specialization happens at the beginning of fine-tuning. To solve this problem, we propose Prompt Tuning with MOdel Tuning (ProMoT), a simple yet effective two-stage fine-tuning framework that preserves in-context abilities of the pretrained model substantially better than vanilla fine-tuning. ProMoT first trains a soft prompt for the fine-tuning target task, and then fine-tunes the model itself with this soft prompt attached. ProMoT offloads task-specific formats into the soft prompt that can be easily removed when doing other in-context tasks. We fine-tune mT5 XXL with ProMoT on natural language inference (NLI) and English-French translation and evaluate the in-context abilities of the resulting models on 8 different NLP datasets including classification, summarization, translation and question answering. ProMoT achieves similar performance on the fine-tuned tasks compared with vanilla fine-tuning, but with much less reduction of in-context learning performances across the board. More importantly, ProMoT shows remarkable generalization ability on tasks that have different formats, e.g. 
fine-tuning on an NLI binary classification task improves the model's in-context ability to do summarization (+0.53 Rouge-2 score compared to the pretrained model), making ProMoT a promising method to build general-purpose capabilities such as grounding and reasoning into LLMs with small but high-quality datasets.","in-context learning, large language models" Efficiently Controlling Multiple Risks with Pareto Testing,https://openreview.net/forum?id=cyg2YXn_BqF,https://openreview.net/pdf?id=cyg2YXn_BqF,This paper presents a statistically efficient strategy for performing multiple hypothesis testing in order to find risk-controlling model configurations that are also useful with respect to arbitrary auxiliary objectives. ,"Machine learning applications frequently come with multiple diverse objectives and constraints that can change over time. Accordingly, trained models can be tuned with sets of hyper-parameters that affect their predictive behavior (e.g., their run-time efficiency versus error rate). As the number of constraints and hyper-parameter dimensions grow, naively selected settings may lead to sub-optimal and/or unreliable results. We develop an efficient method for calibrating models such that their predictions provably satisfy multiple explicit and simultaneous statistical guarantees (e.g., upper-bounded error rates), while also optimizing any number of additional, unconstrained objectives (e.g., total run-time cost). Building on recent results in distribution-free, finite-sample risk control for general losses, we propose Pareto Testing: a two-stage process which combines multi-objective optimization with multiple hypothesis testing. The optimization stage constructs a set of promising combinations on the Pareto frontier. We then apply statistical testing only to this frontier to identify configurations that have (a) high utility with respect to our objectives, and (b) guaranteed risk levels with respect to our constraints, with specifiably high probability. We demonstrate the effectiveness of our approach to reliably accelerate the execution of large-scale Transformer models in natural language processing (NLP) applications. In particular, we show how Pareto Testing can be used to dynamically configure multiple inter-dependent model attributes—including the number of layers computed before exiting, number of attention heads pruned, or number of text tokens considered—to simultaneously control and optimize various accuracy and cost metrics.","conformal prediction, risk control, multi-objective optimization, hypothesis testing" Graph Mixup with Soft Alignments,https://openreview.net/forum?id=UntYZBCdypc,https://openreview.net/pdf?id=UntYZBCdypc,,"We study graph data augmentation by mixup, which has been used successfully on images. A key operation of mixup is to compute a convex combination of a pair of inputs. This operation is straightforward for grid-like data, such as images, but challenging for graph data. The key difficulty lies in the fact that different graphs typically have different numbers of nodes, and thus a node-level correspondence between graphs is lacking. In this work, we propose a simple yet effective mixup method for graph classification by soft alignments. Specifically, given a pair of graphs, we explicitly obtain node-level correspondence via computing a soft assignment matrix to match the nodes between two graphs.
Based on the soft assignments, we transform the adjacency and node feature matrices of one graph, so that the transformed graph is aligned with the other graph. In this way, any pair of graphs can be mixed directly to generate an augmented graph. We conduct systematic experiments to show that our method can improve the performance and generalization of graph neural networks (GNNs) on various graph classification tasks. In addition, we show that our method can increase the robustness of GNNs against noisy labels.", CNN Compression and Search Using Set Transformations with Width Modifiers on Network Architectures,https://openreview.net/forum?id=p3UGLrWofT,https://openreview.net/pdf?id=p3UGLrWofT,"convnet compression that is fast, not resource-hungry and uses width modifiers applied with a new twist.","We propose a new approach, based on discrete filter pruning, to adapt off-the-shelf models into an embedded environment. Importantly, we circumvent the usually prohibitive costs of model compression. Our method, Structured Coarse Block Pruning (SCBP), prunes whole CNN kernels using width modifiers applied to a novel transformation of convlayers into superblocks. SCBP uses set representations to construct a rudimentary search to provide candidate networks. To test our approach, the original ResNet architectures serve as the baseline and also provide the 'seeds' for our candidate search. The search produces a configurable number of compressed (derived) models. These derived models are often ~20\% faster and ~50\% smaller than their unmodified counterparts. At the expense of accuracy, the size can be made even smaller and the inference latency lowered even further. The unique SCBP transformations yield many new model variants, each with their own trade-offs, and do not require GPU clusters or expert humans for training and design.","cnn, compression, efficient search, sets, embedded systems" Event-former: A Self-supervised Learning Paradigm for Temporal Point Processes,https://openreview.net/forum?id=DbLtChzghG,https://openreview.net/pdf?id=DbLtChzghG,We propose a new paradigm for self-supervised learning for multivariate temporal point processes. Our approach demonstrates a performance boost of up to 16% compared to state-of-the-art models for next-event prediction. ,"Self-supervision is one of the hallmarks of representation learning in the increasingly popular suite of foundation models including large language models such as BERT and GPT-3, but it has not been pursued in the context of multivariate event streams, to the best of our knowledge. We introduce a new paradigm for self-supervised learning for temporal point processes using a transformer encoder. Specifically, we design a novel pre-training strategy for the encoder where we not only mask random event epochs but also insert randomly sampled ‘void’ epochs where an event does not occur; this differs from the typical discrete-time pretext tasks such as word-masking in BERT but expands the effectiveness of masking to better capture continuous-time dynamics. The pre-trained model can subsequently be fine-tuned on a potentially much smaller event dataset, similar to other foundation models.
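The Event-former pretext task just described (mask random event epochs, insert sampled void epochs) can be sketched concretely. The NumPy snippet below is our reading of that data construction, not the authors' code; the function name, masking fraction, and void fraction are assumptions.

```python
# Sketch of the pretext-data construction: insert randomly sampled 'void'
# epochs into an event sequence, then mask a random subset of epochs for the
# encoder to reconstruct. Names and fractions are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def build_pretext_sequence(event_times, mask_frac=0.15, void_frac=0.15):
    """Return (times, is_void, is_masked) for one pre-training sequence."""
    n_void = max(1, int(void_frac * len(event_times)))
    # Void epochs: times in the observation window at which no event occurs.
    void_times = rng.uniform(event_times[0], event_times[-1], size=n_void)
    times = np.concatenate([event_times, void_times])
    is_void = np.concatenate([np.zeros(len(event_times), dtype=bool),
                              np.ones(n_void, dtype=bool)])
    order = np.argsort(times)
    times, is_void = times[order], is_void[order]
    # Mask random epochs (events and voids alike), as in BERT-style masking.
    is_masked = rng.random(len(times)) < mask_frac
    return times, is_void, is_masked

arrivals = np.cumsum(rng.exponential(1.0, size=50))  # toy Poisson-process events
times, is_void, is_masked = build_pretext_sequence(arrivals)
```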
We demonstrate the effectiveness of our proposed paradigm on the next-event prediction task using synthetic datasets and 3 real applications, observing a relative performance boost of up to 15% compared to state-of-the-art models.","event sequences, self-supervised learning, point process, transformer" Learning Interpretable Dynamics from Images of a Freely Rotating 3D Rigid Body,https://openreview.net/forum?id=VBB4fh45HF,https://openreview.net/pdf?id=VBB4fh45HF,Using Hamiltonian structure to learn interpretable dynamics from images of rotating 3D objects,"In many real-world settings, image observations of freely rotating 3D rigid bodies, such as satellites, may be available when low-dimensional measurements are not. However, the high dimensionality of image data precludes the use of classical estimation techniques to learn the dynamics and a lack of interpretability reduces the usefulness of standard deep learning methods. In this work, we present a physics-informed neural network model to estimate and predict 3D rotational dynamics from image sequences. We achieve this using a multi-stage prediction pipeline that maps individual images to a latent representation homeomorphic to $\mathbf{SO}(3)$, computes angular velocities from latent pairs, and predicts future latent states using the Hamiltonian equations of motion with a learned representation of the Hamiltonian. We demonstrate the efficacy of our approach on a new rotating rigid-body dataset with sequences of rotating cubes and rectangular prisms with uniform and non-uniform density.","Deep Learning, Dynamics, 3D, Rigid Body, Images, Physics informed" NOTELA: A Generalizable Method for Source Free Domain Adaptation,https://openreview.net/forum?id=aOBs18ycBr,https://openreview.net/pdf?id=aOBs18ycBr,"We evaluate popular source-free domain adaptation methods on a new realistic set of distribution shifts in audio, and design a more robust method.","Source-free domain adaptation (SFDA) is a compelling problem as it allows one to leverage any off-the-shelf model without requiring access to its original training set, adapting it using only unlabelled data. While several SFDA approaches have recently been proposed, their evaluation focuses on a narrow set of distribution shifts for vision tasks, and their generalizability outside of that scope has not yet been investigated. We put those recent approaches to the test by evaluating them on a new set of challenging---due to extreme covariate and label shift---and naturally-occurring distribution shifts in the audio domain. We study the task of adapting a bird species classifier trained on focalized recordings of bird songs to datasets of passive recordings for various geographical locations. Interestingly, we find that some recent SFDA methods perform worse than doing no adaptation at all. Drawing inspiration from those findings and insights, we propose a new method that improves on noisy student approaches by adjusting the teacher's pseudo-labels through Laplacian regularization. Our approach enjoys increased stability and significantly better performance on several of our proposed distribution shifts. We then look back at SFDA benchmarks in the vision domain and find that our approach is competitive with the state-of-the-art there as well.
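As an illustration of the Laplacian pseudo-label adjustment NOTELA describes, the sketch below smooths a teacher's pseudo-labels so that each example agrees with its nearest neighbours before the student update. This is one plausible instantiation, not the authors' implementation; the kNN graph construction, `alpha`, and the fixed-point iteration are all assumptions.

```python
# Plausible sketch: pull each example's pseudo-label towards the average
# label of its k nearest neighbours (a Laplacian smoothing term), then
# renormalize. All design choices here are assumptions of this sketch.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def adjust_pseudo_labels(teacher_probs, features, k=5, alpha=1.0, n_iter=10):
    """teacher_probs: (n, c) class probabilities; features: (n, d) embeddings."""
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nn_idx = np.argsort(d2, axis=1)[:, :k]            # k nearest neighbours
    probs = teacher_probs.copy()
    for _ in range(n_iter):
        neighbour_avg = probs[nn_idx].mean(axis=1)    # Laplacian smoothing term
        probs = softmax(np.log(teacher_probs + 1e-9) + alpha * neighbour_avg)
    return probs  # adjusted pseudo-labels for the student update

rng = np.random.default_rng(0)
teacher = softmax(rng.normal(size=(32, 4)))
refined = adjust_pseudo_labels(teacher, rng.normal(size=(32, 8)))
```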
","source-free domain adaptation, robustness to distribution shifts, bioacoustics" Characteristic Neural Ordinary Differential Equation,https://openreview.net/forum?id=loIfC8WHevK,https://openreview.net/pdf?id=loIfC8WHevK,,"We propose Characteristic-Neural Ordinary Differential Equations (C-NODEs), a framework for extending Neural Ordinary Differential Equations (NODEs) beyond ODEs. While NODE models the evolution of latent variables as the solution to an ODE, C-NODE models the evolution of the latent variables as the solution of a family of first-order partial differential equations (PDEs) along curves on which the PDEs reduce to ODEs, referred to as characteristic curves. This reduction along characteristic curves allows for analyzing PDEs through standard techniques used for ODEs, in particular the adjoint sensitivity method. We also derive C-NODE-based continuous normalizing flows, which describe the density evolution of latent variables along multiple dimensions. Empirical results demonstrate the improvements provided by the proposed method for irregularly sampled time series prediction on MuJoCo, Physionet, and Human Activity datasets and classification and density estimation on CIFAR-10, SVHN, and MNIST datasets given a similar computational budget as the existing NODE methods. The results also provide empirical evidence that the learned curves improve the system efficiency using a lower number of parameters and function evaluations compared with those of the baselines.","Neural ODE, Differential Equation, Method of characteristics" Fast Sampling of Diffusion Models with Exponential Integrator,https://openreview.net/forum?id=Loek7hfb46P,https://openreview.net/pdf?id=Loek7hfb46P,"Training-free acceleration for diffusion model, 4.17 FID with 10 NFEs on CIFAR10","The past few years have witnessed the great success of Diffusion models~(DMs) in generating high-fidelity samples in generative modeling tasks. A major limitation of the DM is its notoriously slow sampling procedure which normally requires hundreds to thousands of time discretization steps of the learned diffusion process to reach the desired accuracy. Our goal is to develop a fast sampling method for DMs with a much less number of steps while retaining high sample quality. To this end, we systematically analyze the sampling procedure in DMs and identify key factors that affect the sample quality, among which the method of discretization is most crucial. By carefully examining the learned diffusion process, we propose Diffusion Exponential Integrator Sampler~(DEIS). It is based on the Exponential Integrator designed for discretizing ordinary differential equations (ODEs) and leverages a semilinear structure of the learned diffusion process to reduce the discretization error. The proposed method can be applied to any DMs and can generate high-fidelity samples in as few as 10 steps. Moreover, by directly using pre-trained DMs, we achieve state-of-art sampling performance when the number of score function evaluation~(NFE) is limited, e.g., 4.17 FID with 10 NFEs, 2.86 FID with only 20 NFEs on CIFAR10.","Fast diffusion model, generative model" STay-On-the-Ridge (STON'R): Guaranteed Convergence to Local Minimax Equilibrium in Nonconvex-Nonconcave Games,https://openreview.net/forum?id=6dZqGFB8g-O,https://openreview.net/pdf?id=6dZqGFB8g-O,,"Min-max optimization problems involving nonconvex-nonconcave objectives have found important applications in adversarial training and other multi-agent learning settings. 
Yet, no known gradient descent-based method is guaranteed to converge to (even local notions of) min-max equilibrium in the nonconvex-nonconcave setting. For all known methods, there exist relatively simple objectives for which they cycle or exhibit other undesirable behavior different from converging to a point, let alone to some game-theoretically meaningful one [Flokas et al. '19, Hsieh et al. '21]. The only known convergence guarantees hold under the strong assumption that the initialization is very close to a local min-max equilibrium [Wang et al. '19]. Moreover, the aforementioned challenges are not just theoretical curiosities. All known methods are unstable in practice, even in simple settings. We propose the first method that is guaranteed to converge to a local min-max equilibrium for smooth nonconvex-nonconcave objectives. Our method is second-order and provably escapes limit cycles as long as it is initialized at an easy-to-find initial point. Both the definition of our method and its convergence analysis are motivated by the topological nature of the problem. In particular, our method is not designed to decrease some potential function, such as the distance of its iterate from the set of local min-max equilibria or the projected gradient of the objective, but is designed to satisfy a topological property that guarantees the avoidance of cycles and implies its convergence. ", Federated Representation Learning via Maximal Coding Rate Reduction,https://openreview.net/forum?id=Rpo9dvNlEYW,https://openreview.net/pdf?id=Rpo9dvNlEYW,We propose a federated way of learning low dimensional representations. ,"We propose a federated methodology to learn low-dimensional representations from a dataset that is distributed among several clients. In particular, we move away from the commonly-used cross-entropy loss in federated learning, and seek to learn shared low-dimensional representations of the data in a decentralized manner via the principle of maximal coding rate reduction (MCR2). Our proposed method, which we refer to as FLOW, utilizes MCR2 as the objective of choice, hence resulting in representations that are both between-class discriminative and within-class compressible. We theoretically show that our distributed algorithm achieves a first-order stationary point. Moreover, we demonstrate, via numerical experiments, the utility of the learned low-dimensional representations.","Federated Learning, Representation Learning, Information Theory" 3D Surface Reconstruction in the Wild by Deforming Shape Priors from Synthetic Data,https://openreview.net/forum?id=Kn43SKplAn,https://openreview.net/pdf?id=Kn43SKplAn,A method for single view 3D reconstruction without camera pose supervision,"We present a new method for category-specific 3D reconstruction from a single image. A limitation of current color image-based 3D reconstruction models is that they do not generalize across datasets due to domain shift. In contrast, we show that one can learn to reconstruct objects across datasets using shape priors learned from synthetic 3D data and a point cloud pose canonicalization method. Given a single depth image at test time, we first place this partial point cloud in a canonical pose. Then, we use a neural deformation field in the canonical coordinate frame to reconstruct the 3D surface of the object. Finally, we jointly optimize object pose and 3D shape to fit the partial depth observation.
Our approach achieves state-of-the-art reconstruction performance across several real-world datasets, even when trained without ground truth camera poses (which are required by some of the state-of-the-art methods). We further show that our method generalizes to different input modalities, from dense depth images to sparse and noisy LIDAR scans. ","3D reconstruction, pose estimation, shape deformation" gDDIM: Generalized denoising diffusion implicit models,https://openreview.net/forum?id=1hKE9qjvz-,https://openreview.net/pdf?id=1hKE9qjvz-,a small but delicate modification in parameterization to accelerate general diffusion models,"Our goal is to extend the denoising diffusion implicit model (DDIM) to general diffusion models~(DMs) besides isotropic diffusions. Instead of constructing a non-Markov noising process as in the original DDIM, we examine the mechanism of DDIM from a numerical perspective. We discover that the DDIM can be obtained by using some specific approximations of the score when solving the corresponding stochastic differential equation. We present an interpretation of the accelerating effects of DDIM that also explains the advantages of a deterministic sampling scheme over the stochastic one for fast sampling. Building on this insight, we extend DDIM to general DMs, coined generalized DDIM (gDDIM), with a small but delicate modification in parameterizing the score network. We validate gDDIM in two non-isotropic DMs: Blurring diffusion model (BDM) and Critically-damped Langevin diffusion model (CLD). We observe a more than 20-fold acceleration in BDM. In CLD, a diffusion model that augments the diffusion process with velocity, our algorithm achieves an FID score of 2.26 on CIFAR10 with only 50 score function evaluations~(NFEs) and an FID score of 2.86 with only 27 NFEs.","Fast sampling, diffusion model" Panning for Gold in Federated Learning: Targeted Text Extraction under Arbitrarily Large-Scale Aggregation,https://openreview.net/forum?id=A9WQaxYsfx,https://openreview.net/pdf?id=A9WQaxYsfx,We propose a method that extracts target sequences by keywords under extremely large-scale aggregation in federated learning.,"As federated learning (FL) matures, privacy attacks against FL systems in turn become more numerous and complex. Attacks on language models have progressed from recovering single sentences in simple classification tasks to recovering larger parts of user data. Current attacks against federated language models are sequence-agnostic and aim to extract as much data as possible from an FL update - often at the expense of fidelity for any particular sequence. Because of this, current attacks fail to extract any meaningful data under large-scale aggregation. In realistic settings, an attacker cares most about a small portion of user data that contains sensitive personal information, for example sequences containing the phrase ""my credit card number is ..."". In this work, we propose the first attack on FL that achieves targeted extraction of sequences that contain privacy-critical phrases, whereby we employ maliciously modified parameters to allow the transformer itself to filter relevant sequences from aggregated user data and encode them in the gradient update.
Our attack can effectively extract sequences of interest even against extremely large-scale aggregation.","Federated Learning, Privacy, Security, Privacy attack" Artificial Neuronal Ensembles with Learned Context Dependent Gating,https://openreview.net/forum?id=dBk3hsg-n6,https://openreview.net/pdf?id=dBk3hsg-n6,A method to alleviate catastrophic forgetting in artificial neural networks using learned context dependent activity gates,"Biological neural networks are capable of recruiting different sets of neurons to encode different memories. However, when training artificial neural networks on a set of tasks, typically, no mechanism is employed for selectively producing anything analogous to these neuronal ensembles. Further, artificial neural networks suffer from catastrophic forgetting, where the network's performance rapidly deteriorates as tasks are learned sequentially. By contrast, sequential learning is possible for a range of biological organisms. We introduce Learned Context Dependent Gating (LXDG), a method to flexibly allocate and recall `artificial neuronal ensembles', using a particular network structure and a new set of regularization terms. Activities in the hidden layers of the network are modulated by gates, which are dynamically produced during training. The gates are outputs of networks themselves, trained with a sigmoid output activation. The regularization terms we have introduced correspond to properties exhibited by biological neuronal ensembles. The first term penalizes low gate sparsity, ensuring that only a specified fraction of the network is used. The second term ensures that previously learned gates are recalled when the network is presented with input from previously learned tasks. Finally, there is a regularization term responsible for ensuring that new tasks are encoded in gates that are as orthogonal as possible to previously used ones. We demonstrate the ability of this method to alleviate catastrophic forgetting on continual learning benchmarks. When the new regularization terms are included in the model along with Elastic Weight Consolidation (EWC), it achieves better performance on the benchmark `permuted MNIST' than with EWC alone. The benchmark `rotated MNIST' demonstrates how similar tasks recruit similar neurons to the artificial neuronal ensemble. ","Continual Learning, Catastrophic Forgetting" Linkless Link Prediction via Relational Distillation,https://openreview.net/forum?id=He7UIpiEq_O,https://openreview.net/pdf?id=He7UIpiEq_O,,"Graph Neural Networks (GNNs) have been widely used on graph data and have shown exceptional performance in the task of link prediction. Despite their effectiveness, GNNs often suffer from high latency due to non-trivial neighborhood data dependency in practical deployments. To address this issue, researchers have proposed methods based on knowledge distillation (KD) to transfer the knowledge from teacher GNNs to student MLPs, which are known to be efficient even with industrial scale data, and have shown promising results on node classification. Nonetheless, using KD to accelerate link prediction is still unexplored. In this work, we start by exploring two direct analogs of traditional KD for link prediction, i.e., predicted logit-based matching and node representation-based matching. Upon observing that direct KD analogs do not perform well for link prediction, we propose a relational KD framework, Linkless Link Prediction (LLP).
Unlike simple KD methods that match independent link logits or node representations, LLP distills relational knowledge that is centered around each (anchor) node to the student MLP. Specifically, we propose two matching strategies that complement each other: rank-based matching and distribution-based matching. Extensive experiments demonstrate that LLP boosts the link prediction performance of MLPs by significant margins, and even outperforms the teacher GNNs on 6 out of 9 benchmarks. LLP also achieves a 776.37x speedup in link prediction inference compared to GNNs on the large-scale Citation2 dataset. ","link prediction, knowledge distillation" Controllable Concept Transfer of Intermediate Representations,https://openreview.net/forum?id=UvziTI2JGP7,https://openreview.net/pdf?id=UvziTI2JGP7,We propose a novel approach for controlling the transfer of user-determined semantic concepts from source to target task,"With the proliferation of large pre-trained models in various domains, transfer learning has gained prominence where intermediate representations from these models can be leveraged to train better (target) task-specific models, with possibly limited labeled data. Although transfer learning can be beneficial in many cases, it can also transfer undesirable information to target tasks that may severely curtail its performance in the target domain or raise ethical concerns related to privacy and/or fairness. In this paper, we propose a novel approach for controlling the transfer of user-determined semantic concepts (viz. color, glasses, etc.) in intermediate source representations to target tasks without the need to retrain the source model, which can otherwise be expensive or even infeasible. Notably, this is also a bigger challenge than blocking concepts in the input representation, as a given intermediate source representation is biased towards the source task it was originally trained to solve, thus possibly further entangling the desired concepts. We qualitatively and quantitatively evaluate our approach in the visual domain showcasing its efficacy for classification and generative source models. ","concept transfer, transfer learning, transferrable representations" A Differentiable Loss Function for Learning Heuristics in A*,https://openreview.net/forum?id=2_B-eiVbgBs,https://openreview.net/pdf?id=2_B-eiVbgBs,A novel loss function,"Optimization of heuristic functions for the A* algorithm, realized by deep neural networks, is usually done by minimizing the squared error of the estimated cost-to-goal values. This paper argues that this does not necessarily lead to a faster search for the A* algorithm since its execution relies on relative values instead of absolute ones. As a mitigation, we propose an L* loss, which upper-bounds the number of excessively expanded states inside the A* search. The L* loss, when used in the optimization of state-of-the-art deep neural networks for automated planning in maze domains like Sokoban and maze with teleports, significantly improves the fraction of solved problems and the quality of found plans, and reduces the number of expanded states to approximately 50%.","Differentiable Loss Function, a star, heuristic search" Understanding Multi-Task Scaling in Machine Translation,https://openreview.net/forum?id=k09v6oRxQPq,https://openreview.net/pdf?id=k09v6oRxQPq,"We study the scaling behavior of multilingual, multi-task neural machine translation models. 
","In this work, we provide a large-scale empirical study of the scaling properties of multilingual (multitask) neural machine translation models. We examine how increases in the model size affect the model performance and investigate the role of the individual task weights on the scaling behavior. We find that these weights only affect the multiplicative factor of the scaling law and in particular, the scaling exponent is unaffected by them. Through a novel joint scaling law formulation, we compute the effective number of parameters allocated to each task and examine the role of language similarity in the scaling behavior of our models. We find minimal evidence that language similarity has any impact. In contrast, ``direction'' of the multilinguality plays a big role, with models translating from multiple languages into English having a larger number of effective parameters per task than their reversed counterparts. Finally, we leverage our observations to predict the performance of multilingual models trained with any language weighting at any scale, greatly reducing efforts required for task balancing in large multitask models. Our findings apply to both in-domain and out-of-domain test sets and to multiple evaluation metrics, such as ChrF and BLEURT.","scaling laws, machine translation, multilinguality, multi-task optimization" Learning Language Representations with Logical Inductive Bias,https://openreview.net/forum?id=rGeZuBRahju,https://openreview.net/pdf?id=rGeZuBRahju,We develop a novel neural architecture for learning language representations.,"Transformer architectures have achieved great success in solving natural language tasks, which learn strong language representations from large-scale unlabeled texts. In this paper, we seek to go further beyond and explore a new logical inductive bias for better language representation learning. Logic reasoning is known as a formal methodology to reach answers from given knowledge and facts. Inspired by such a view, we develop a novel neural architecture named FOLNet (First-Order Logic Network), to encode this new inductive bias. We devise and compose several neural logic operators into a set of learnable Horn clauses, which are further forward-chained into a fully differentiable neural architecture (FOLNet). Interestingly, we find that the self-attention module in transformers can be composed by two of our neural logic operators, which probably explains their strong reasoning performance. Our proposed FOLNet has the same input and output interfaces as other pretrained models (e.g., BERT) and thus could be pretrained/finetuned by using similar losses. It also allows FOLNet to be used in a plug-and-play manner when replacing other pretrained models. With our logical inductive bias, the same set of ``logic deduction skills'' learned through pretraining are expected to be equally capable of solving diverse downstream tasks. For this reason, FOLNet learns language representations that have much stronger transfer capabilities. 
Experimental results on several language understanding tasks show that our pretrained FOLNet model outperforms existing strong transformer-based approaches.","Pretraining, model architecture, language representation learning, inductive bias" AsymQ: Asymmetric Q-loss to mitigate overestimation bias in off-policy reinforcement learning,https://openreview.net/forum?id=UXPrt1ffxYD,https://openreview.net/pdf?id=UXPrt1ffxYD,a lightweight approach to mitigate estimation bias without extra computational costs,"It is well-known that off-policy deep reinforcement learning algorithms suffer from overestimation bias in value function approximation. Existing methods to reduce overestimation bias often utilize multiple value function estimators. Consequently, these methods incur higher time and memory consumption. In this work, we propose a new class of policy evaluation algorithms dubbed \textbf{AsymQ}, which use asymmetric loss functions to train the Q-value network. Departing from symmetric loss functions such as mean squared error~(MSE) and Huber loss on the temporal difference~(TD) error, we adopt asymmetric loss functions of the TD-error to impose a higher penalty on overestimation error. We present one such AsymQ loss called \textbf{Softmax MSE~(SMSE)} that can be implemented with minimal modifications to the standard policy evaluation. Empirically, we show that using SMSE loss helps reduce estimation bias, and subsequently improves policy performance when combined with standard reinforcement learning algorithms. With SMSE, even the Deep Deterministic Policy Gradients~(DDPG) algorithm can achieve performance comparable to that of state-of-the-art methods such as the Twin-Delayed DDPG (TD3) and Soft Actor Critic~(SAC) on challenging environments in the OpenAI Gym MuJoCo benchmark. We additionally demonstrate that the proposed SMSE loss can also boost the performance of Deep Q learning (DQN) in Atari games with discrete action spaces.","reinforcement learning, estimation bias" Movement-to-Action Transformer Networks for Temporal Action Proposal Generation,https://openreview.net/forum?id=BxXXPvrL1Pg,https://openreview.net/pdf?id=BxXXPvrL1Pg,,"The task of generating temporal action proposals is aimed at identifying temporal intervals containing human actions in untrimmed videos. For arbitrary actions, this requires learning long-range interactions. We propose an end-to-end Movement-and-Action Transformer Network (MatNet) that uses results of human movement studies to encode actions ranging from localized, atomic body-part movements to longer-range, semantic ones involving movements of subsets of body parts. In particular, we make direct use of the results of Laban Movement Analysis (LMA). We use LMA-based measures of movements as computational definitions of actions. We input RGB + Flow (I3D) features and 3D pose, compute LMA-based low-to-high-level movement features from them, and learn the action proposals by applying two heads on the boundary Transformer and three heads on the proposal Transformer, and using five losses with different weights. We visualize and explain relations between the movement descriptors and the attention map of the action proposals.
We report results from extensive experiments on the Thumos14, ActivityNet and PKU-MMD datasets, showing that MatNet matches or exceeds SOTA performance on the temporal action proposal generation task.","Temporal Action Proposal Generation, Video Action Segmentation" INSPIRE: A Framework for Integrating Individual User Preferences in Recourse,https://openreview.net/forum?id=LUQ2Csy_LUm,https://openreview.net/pdf?id=LUQ2Csy_LUm,,"Most recourse generation approaches optimize for indirect distance-based metrics like diversity, proximity, and sparsity, or a shared cost function across all users to generate recourse. The latter is an unrealistic assumption because users can have diverse feature preferences that they might be willing to act upon, and changes to an undesirable feature might lead to an impractical recourse. In this work, we propose a novel framework to incorporate the individuality of users in both recourse generation and evaluation procedure by focusing on the cost incurred by a user when opting for a recourse. To achieve this, we first propose an objective function, Expected Minimum Cost (EMC) that is based on two key ideas: (1) the user should be comfortable adopting at least one solution when presented with multiple options, and (2) we can approximately optimize for users' satisfaction even when their true cost functions (i.e., costs associated with feature changes) are unknown. EMC samples multiple plausible cost functions based on diverse feature preferences in the population and then finds a recourse set with one good solution for each category of user preferences. We optimize EMC with a novel discrete optimization algorithm, Cost-Optimized Local Search (COLS), that is guaranteed to improve the quality of the recourse set over iterations. Our evaluation framework computes the fraction of satisfied users by simulating each user's cost function and then computing the incurred cost for the provided recourse set. Experimental evaluation on popular real-world datasets demonstrates that our method satisfies up to 25.9% more users compared to strong baselines. Moreover, the human evaluation shows that our recourses are preferred more than twice as often as the strongest baseline.", How Does Self-supervised Learning Work? A Representation Learning Perspective,https://openreview.net/forum?id=Dzmd-Cc8OI,https://openreview.net/pdf?id=Dzmd-Cc8OI,,"Self-supervised learning (SSL) is a popular machine learning paradigm that utilizes a large amount of unlabeled data to facilitate learning from a small amount of labeled data. While SSL has achieved great success in different tasks, its theoretical understanding remains largely open. In this paper, we aim to theoretically understand a special kind of SSL approach based on pre-training and fine-tuning. In particular, the SSL approach we consider first trains a neural network on the unlabeled data with the help of pseudo-labelers. Then it fine-tunes the pre-trained network on a small amount of labeled data. We prove that, under certain data and neural network models, SSL can achieve nearly zero test loss, while a neural network directly trained by supervised learning on the same amount of labeled data can only achieve constant test loss. Our theoretical result demonstrates a separation between SSL and supervised learning on the same amount of labeled data and sheds light on the essence of representation learning for the success of SSL. 
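The two-stage pipeline analysed in the SSL paper above (pre-train on pseudo-labeled unlabeled data, then fine-tune on a small labeled set) can be summarised schematically. In this PyTorch sketch, random labels stand in for whatever pseudo-labeler is actually used, and the tiny MLP and hyperparameters are assumptions made only for illustration.

```python
# Schematic sketch of pre-training on pseudo-labels followed by fine-tuning.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
loss_fn = nn.CrossEntropyLoss()

def train(model, xs, ys, epochs=5, lr=1e-2):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(xs), ys).backward()
        opt.step()

x_unlab = torch.randn(1024, 32)
pseudo_y = torch.randint(0, 10, (1024,))   # placeholder pseudo-labeler output
train(model, x_unlab, pseudo_y)            # stage 1: pre-train on pseudo-labels

x_lab, y_lab = torch.randn(64, 32), torch.randint(0, 10, (64,))
train(model, x_lab, y_lab)                 # stage 2: fine-tune on labeled data
```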
", Empowering Graph Representation Learning with Test-Time Graph Transformation,https://openreview.net/forum?id=Lnxl5pr018,https://openreview.net/pdf?id=Lnxl5pr018,Transforming the test graph data can enhance the generalization and robustness of graph neural networks.,"As powerful tools for representation learning on graphs, graph neural networks (GNNs) have facilitated various applications from drug discovery to recommender systems. Nevertheless, the effectiveness of GNNs is immensely challenged by issues related to data quality, such as distribution shift, abnormal features and adversarial attacks. Recent efforts have been made on tackling these issues from a modeling perspective which requires additional cost of changing model architectures or re-training model parameters. In this work, we provide a data-centric view to tackle these issues and propose a graph transformation framework named GTrans which adapts and refines graph data at test time to achieve better performance. We provide theoretical analysis on the design of the framework and discuss why adapting graph data works better than adapting the model. Extensive experiments have demonstrated the effectiveness of GTrans on three distinct scenarios for eight benchmark datasets where suboptimal data is presented. Remarkably, GTrans performs the best in most cases with improvements up to 2.8%, 8.2% and 3.8% over the best baselines on three experimental settings.","graph neural networks, out-of-distribution generalization, adversarial robustness" Provable Robustness against Wasserstein Distribution Shifts via Input Randomization,https://openreview.net/forum?id=HJFVrpCaGE,https://openreview.net/pdf?id=HJFVrpCaGE,We present provable robustness guarantees on the accuracy of a model under Wasserstein shifts of the input distribution.,"Certified robustness in machine learning has primarily focused on adversarial perturbations with a fixed attack budget for each sample in the input distribution. In this work, we present provable robustness guarantees on the accuracy of a model under bounded Wasserstein shifts of the data distribution. We show that a simple procedure that randomizes the input of the model within a transformation space is provably robust to distributional shifts under that transformation. Our framework allows the datum-specific perturbation size to vary across different points in the input distribution and is general enough to include fixed-sized perturbations as well. Our certificates produce guaranteed lower bounds on the performance of the model for any shift (natural or adversarial) of the input distribution within a Wasserstein ball around the original distribution. We apply our technique to certify robustness against natural (non-adversarial) transformations of images such as color shifts, hue shifts, and changes in brightness and saturation. We obtain strong performance guarantees for the robust model under clearly visible shifts in the input images. Our experiments establish the non-vacuousness of our certificates by showing that the certified lower bound on a robust model's accuracy is higher than the empirical accuracy of an undefended model under a distribution shift. Moreover, our results also imply guaranteed lower bounds (hardness result) on the performance of models trained on so-called ""unlearnable"" datasets that have been poisoned to interfere with model training. 
We show that the performance of a robust model is guaranteed to remain above a certain threshold on the test distribution even when the base model is trained on the poisoned dataset.","Distributional Robustness, Wasserstein Distance, Certified Robustness" GROOT: Corrective Reward Optimization for Generative Sequential Labeling,https://openreview.net/forum?id=EFTpmFg9cb,https://openreview.net/pdf?id=EFTpmFg9cb,"This paper proposes a novel framework for iteratively training Seq2Seq models to directly optimize a given blackbox reward metric, showing its effectiveness on sequential labeling tasks.","Sequential labeling is a fundamental NLP task, forming the backbone of many applications. Supervised learning of Seq2Seq models (like T5) has shown great success on these problems. However, there remains a significant disconnect between the training objectives of these models and the metrics and desiderata we care about in practical applications. For example, a practical sequence tagging application may want to optimize for a certain precision-recall trade-off (of the top-k predictions) which is quite different from the standard objective of maximizing the likelihood of the gold labeled sequence. Thus, to bridge this gap, we propose GROOT -- a simple yet effective framework for Generative Reward Optimization Of Text sequences. GROOT works by training a generative sequential labeling model to match the decoder output distribution with that of the (black-box) reward function. Using an iterative training regime, we first generate prediction candidates, then correct errors in them, and finally contrast those candidates (based on their reward values). As demonstrated via extensive experiments on four public benchmarks, GROOT significantly improves all reward metrics. Furthermore, GROOT also leads to improvements of the overall decoder distribution as evidenced by the quality gains of the top-k candidates.","sequential labeling, reward optimization, natural language processing" Interpretations of Domain Adaptations via Layer Variational Analysis,https://openreview.net/forum?id=YtntjusJV6,https://openreview.net/pdf?id=YtntjusJV6,Interpretations of Domain Adaptations via Layer Variational Analysis,"Transfer learning is known to perform efficiently in many applications empirically, yet little literature reports on the mechanism behind the scenes. This study establishes both formal derivations and heuristic analysis to formulate the theory of transfer learning in deep learning. Our framework, utilizing layer variational analysis, proves that the success of transfer learning can be guaranteed with corresponding data conditions. Moreover, our theoretical calculation yields intuitive interpretations of the knowledge transfer process. Subsequently, an alternative method for network-based transfer learning is derived. The method shows an increase in efficiency and accuracy for domain adaptation. It is particularly advantageous when new domain data is sufficiently sparse during adaptation. Numerical experiments over diverse tasks validated our theory and verified that our analytic expression achieved better performance in knowledge adaptation than the gradient descent method. 
","deep learning theory, domain adaptation, transfer learning, variational analysis, knowledge transfer" Forget Unlearning: Towards True Data-Deletion in Machine Learning,https://openreview.net/forum?id=goLFJ0ZNwl,https://openreview.net/pdf?id=goLFJ0ZNwl,"We show that unlearning guarantees do not ensure the ""right to be forgotten"" in the online setting, and we certify noisy GD as a trustworthy and utility-preserving deletion algorithm under our improved notion of data deletion.","Unlearning has emerged as a technique to efficiently erase information of deleted records from learned models. We show, however, that the influence created by the original presence of a data point in the training set can still be detected after running certified unlearning algorithms (which can result in its reconstruction by an adversary). Thus, under realistic assumptions about the dynamics of model releases over time and in the presence of adaptive adversaries, we show that unlearning is not equivalent to data deletion and does not guarantee the ""right to be forgotten."" We then propose a more robust data-deletion guarantee and show that it is necessary to satisfy differential privacy to ensure true data deletion. Under our notion, we propose an accurate, computationally efficient, and secure data-deletion machine learning algorithm in the online setting based on noisy gradient descent algorithm.","Unlearning, Differential Privacy, Data Deletion, Noisy Gradient Descent" Meta-Learning with Explicit Task Information,https://openreview.net/forum?id=XHjFakRjPsk,https://openreview.net/pdf?id=XHjFakRjPsk,A meta-learning algorithm which incorporates task-specific metadata to learn context across tasks,"A common approach in few-shot learning is to adapt to a new task after learning a variety of similar tasks. When the diversity of the tasks is high, however, it can be challenging for models to generalize effectively. Prior work has approached this problem by inferring task information implicitly from the data in order to better adapt to each new task. However, in some cases, explicit information about tasks is available that can inform task adaptation to improve performance, especially in the context of few-shot learning. In this work, we introduce task-informed meta-learning (TIML), an algorithm which modulates a model based on explicit task metadata. We evaluated TIML for a range of classification and regression tasks and found that TIML significantly improves performance in both regimes across a diversity of model architectures. In particular, we show the power of TIML in remote sensing for agriculture---an area of high societal impact where traditional methods have failed due to limited and imbalanced data.","meta-learning, climate change, agriculture, remote sensing" Evaluating Unsupervised Denoising Requires Unsupervised Metrics,https://openreview.net/forum?id=xTWoeTdHgH-,https://openreview.net/pdf?id=xTWoeTdHgH-,"We introduce two novel unsupervised metrics, uMSE and uPSNR, computed exclusively from noisy data, which are asymptotically consistent estimators of the corresponding supervised metrics, MSE and PSNR, and yield accurate approximations in practice","Unsupervised denoising is a crucial challenge in real-world imaging applications. Unsupervised deep-learning methods have demonstrated impressive performance on benchmarks based on synthetic noise. However, no metrics are available to evaluate these methods in an unsupervised fashion. 
This is highly problematic for the many practical applications where ground-truth clean images are not available. In this work, we propose two novel metrics: the unsupervised mean squared error (MSE) and the unsupervised peak signal-to-noise ratio (PSNR), which are computed using only noisy data. We provide a theoretical analysis of these metrics, showing that they are asymptotically consistent estimators of the supervised MSE and PSNR. Controlled numerical experiments with synthetic noise confirm that they provide accurate approximations in practice. We validate our approach on real-world data from two imaging modalities: videos in raw format and transmission electron microscopy. Our results demonstrate that the proposed metrics enable unsupervised evaluation of denoising methods based exclusively on noisy data.","Denoising, Unsupervised Learning, Evaluation Metrics, Statistical Estimation, Imaging, Electron Microscopy" Denoising Diffusion Samplers,https://openreview.net/forum?id=8pvnfTAbu1f,https://openreview.net/pdf?id=8pvnfTAbu1f,How to use denoising diffusion model ideas to sample unnormalized target densities and estimate their normalizing constants,"Denoising diffusion models are a popular class of generative models providing state-of-the-art results in many domains. One gradually adds noise to the data using a diffusion to transform the data distribution into a Gaussian distribution. Samples from the generative model are then obtained by simulating an approximation of the time-reversal of this diffusion initialized by Gaussian samples. Practically, the intractable score terms appearing in the time-reversed process are approximated using score matching techniques. Here we explore a similar idea to sample approximately from unnormalized probability density functions and estimate their normalizing constants. We consider a process where the target density diffuses towards a Gaussian. Denoising Diffusion Samplers (DDS) are obtained by approximating the corresponding time-reversal. While score matching is not applicable in this context, we can leverage many of the ideas introduced in generative modeling for Monte Carlo sampling. Existing theoretical results from denoising diffusion models also provide theoretical guarantees for DDS. We discuss the connections between DDS, optimal control and Schr\""odinger bridges and finally demonstrate DDS experimentally on a variety of challenging sampling tasks.","diffusion models, importance sampling, monte carlo, variational inference" How I Learned to Stop Worrying and Love Retraining,https://openreview.net/forum?id=_nF5imFKQI,https://openreview.net/pdf?id=_nF5imFKQI,,"Many Neural Network Pruning approaches consist of several iterative training and pruning steps, seemingly losing a significant amount of their performance after pruning and then recovering it in the subsequent retraining phase. The recent works of Renda et al. (2020) and Le & Hua (2021) demonstrate the significance of the learning rate schedule during the retraining phase and propose specific heuristics for choosing such a schedule for IMP (Han et al., 2015). We place these findings in the context of the results of Li et al. (2020) regarding the training of models within a fixed training budget and demonstrate that, consequently, the retraining phase can be massively shortened using a simple linear learning rate schedule. Improving on existing retraining approaches, we additionally propose a method to adaptively select the initial value of the linear schedule.
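As an illustration of the linear retraining schedule described in the retraining abstract above, here is a minimal sketch in Python; the budget and initial value are illustrative assumptions, and the paper's adaptive selection of the initial value is not reproduced.

```python
# Minimal sketch of a linearly decaying learning-rate schedule for a
# shortened retraining phase after pruning. `lr0` and the budget are
# hypothetical stand-ins, not the authors' exact recipe.
def linear_retraining_lr(step: int, total_steps: int,
                         lr0: float = 0.01, lr_final: float = 0.0) -> float:
    """Interpolate linearly from lr0 down to lr_final over the retrain budget."""
    frac = min(step / max(total_steps, 1), 1.0)
    return lr0 + (lr_final - lr0) * frac

# Example: a 1000-step retraining budget after pruning.
schedule = [linear_retraining_lr(s, total_steps=1000) for s in range(1000)]
```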
Going a step further, we propose similarly imposing a budget on the initial dense training phase and show that the resulting simple and efficient method is capable of outperforming significantly more complex or heavily parameterized state-of-the-art approaches that attempt to sparsify the network during training. These findings not only advance our understanding of the retraining phase, but more broadly question the belief that one should aim to avoid the need for retraining and reduce the negative effects of ‘hard’ pruning by incorporating the sparsification process into the standard training.", The Value of Out-of-distribution Data,https://openreview.net/forum?id=ZS8L3Fbv-L,https://openreview.net/pdf?id=ZS8L3Fbv-L,"We study the distribution shifts that could occur within datasets and demonstrate that under such shifts, the generalization error of the desired target task can be a non-monotonic function of the number of OOD samples.","More data is expected to help us generalize to a task. But real datasets can contain out-of-distribution (OOD) data; this can come in the form of heterogeneity such as intra-class variability but also in the form of temporal shifts or concept drifts. We demonstrate a counter-intuitive phenomenon for such problems: the generalization error of the task can be a non-monotonic function of the number of OOD samples; a small number of OOD samples can improve generalization, but if the number of OOD samples is beyond a threshold, then the generalization error can deteriorate. We also show that if we know which samples are OOD, then using a weighted objective between the target and OOD samples ensures that the generalization error decreases monotonically. We demonstrate and analyze this phenomenon using linear classifiers on synthetic datasets and medium-sized neural networks on vision benchmarks such as MNIST, CIFAR-10, CINIC-10, PACS, and DomainNet, and observe the effect data augmentation, hyperparameter optimization, and pre-training have on this behavior.","Distribution Shift, Learning Theory" Recursive Neural Programs: Variational Learning of Image Grammars and Part-Whole Hierarchies,https://openreview.net/forum?id=qMK1Zd49IS,https://openreview.net/pdf?id=qMK1Zd49IS,"We introduce a neural generative model that addresses the part-whole hierarchy learning problem by modeling images as recursive hierarchical trees of probabilistic sensory-motor programs, enabling intuitive composition and learning of image grammars.","Human vision involves parsing and representing objects and scenes using structured representations based on part-whole hierarchies. Computer vision and machine learning researchers have recently sought to emulate this capability using capsule networks, object reference frames and active predictive coding, but a generative model formulation has been lacking. We introduce Recursive Neural Programs (RNPs), a neural generative model that addresses the part-whole hierarchy learning problem by modeling images as hierarchical trees of probabilistic sensory-motor programs. These programs recursively reuse learned sensory-motor primitives to model an image within different reference frames, enabling intuitive and explainable composition and allowing recursive image grammars to be formed. We express RNPs as structured variational autoencoders (sVAEs) for inference and sampling, and demonstrate parts-based parsing, sampling and one-shot transfer learning for MNIST, Omniglot and ETH-80 datasets.
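To make concrete the weighted target/OOD objective mentioned in "The Value of Out-of-distribution Data" above, here is a toy sketch; the mixing weight and per-sample losses are hypothetical, not values from the paper.

```python
import numpy as np

# Toy sketch of a weighted objective between target samples and samples
# known to be OOD. `alpha` is a hypothetical mixing weight.
def weighted_risk(loss_target: np.ndarray, loss_ood: np.ndarray,
                  alpha: float = 0.2) -> float:
    """Convex combination of mean target loss and mean OOD loss."""
    return float((1 - alpha) * loss_target.mean() + alpha * loss_ood.mean())

# Example with random stand-in per-sample losses.
risk = weighted_risk(np.random.rand(64), np.random.rand(16), alpha=0.1)
```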
Our results show that RNPs provide an intuitive and explainable way of composing objects and scenes, allowing rich compositionality and intuitive interpretations of objects in terms of part-whole hierarchies.","computer vision, generative models, composing representations, image grammar" SaiT: Sparse Vision Transformers through Adaptive Token Pruning,https://openreview.net/forum?id=u9o4qgwJlj,https://openreview.net/pdf?id=u9o4qgwJlj,This work proposes a general dense/sparse training framework and adaptive token pruning strategies for efficient vision transformer model acceleration.,"While vision transformers have achieved impressive results, effectively and efficiently accelerating these models can further boost performance. In this work, we propose a dense/sparse training framework to obtain a unified model, enabling weight sharing across various token densities. Thus one model offers a range of accuracy and throughput tradeoffs for different applications. Besides, we introduce adaptive token pruning to optimize the patch token sparsity based on the input image. In addition, we investigate knowledge distillation to enhance token selection capability in early transformer modules. Sparse adaptive image Transformer (SaiT) offers varying levels of model acceleration by merely changing the token sparsity on the fly. Specifically, SaiT reduces the computation complexity (FLOPs) by 39% - 43% and increases the throughput by 67% - 91% with less than 0.5% accuracy loss for various vision transformer models. Meanwhile, the same model also provides the zero accuracy drop option by skipping the sparsification step. SaiT achieves better accuracy and computation tradeoffs than state-of-the-art transformer and convolutional models.","pruning, vision transformer, computer vision, deep learning" Cooperation or Competition: Avoiding Player Domination for Multi-target Robustness by Adaptive Budgets,https://openreview.net/forum?id=Lmff9URfo5,https://openreview.net/pdf?id=Lmff9URfo5,"For multi-target adversarial training, we identify a phenomenon named player domination which leads to the non-convergence of previous algorithms and further design a novel adaptive budget method to achieve better robustness.","Despite incredible advances, deep learning has been shown to be susceptible to adversarial attacks. Numerous approaches have been proposed to train robust networks both empirically and certifiably. However, most of them defend against only a single type of attack, while recent work steps toward defending against multiple attacks. In this paper, to understand multi-target robustness, we view this problem as a bargaining game in which different players (adversaries) negotiate to reach an agreement on a joint direction of parameter updating. We identify a phenomenon named \emph{player domination} in the bargaining game, and show that with this phenomenon, some of the existing max-based approaches such as MAX and MSD do not converge. Based on our theoretical results, we design a novel framework that adjusts the budgets of different adversaries to avoid player domination.
Experiments on two benchmarks show that applying the proposed framework to existing approaches significantly advances multi-target robustness.","Multi-target Adversarial Training, Bargaining Game" Image Classification by Throwing Quantum Kitchen Sinks at Tensor Networks,https://openreview.net/forum?id=6BLZcpw1sh,https://openreview.net/pdf?id=6BLZcpw1sh,,"Several variational quantum circuit approaches to machine learning have been proposed in recent years, with one promising class of variational algorithms involving tensor networks operating on states resulting from local feature maps. In contrast, a random feature approach known as quantum kitchen sinks provides comparable performance, but leverages non-local feature maps. Here we combine these two approaches by proposing a new circuit ansatz where a tree tensor network coherently processes the non-local feature maps of quantum kitchen sinks, and we run numerical experiments to empirically evaluate the performance of image classification with the new ansatz. From the perspective of classification performance, we find that simply combining quantum kitchen sinks with tensor networks yields no qualitative improvements. However, the addition of feature optimization greatly boosts performance, leading to state-of-the-art quantum circuits for image classification, requiring only shallow circuits and a small number of qubits -- both well within reach of near-term quantum devices.", Cross-Domain Few-Shot Relation Extraction via Representation Learning and Domain Adaptation,https://openreview.net/forum?id=MzQEMwIzlL,https://openreview.net/pdf?id=MzQEMwIzlL,,"Few-shot relation extraction aims to recognize novel relations with few labeled sentences for each relation. Previous metric-based few-shot relation extraction methods classify by comparing the query sentence embedding with prototypes generated from the embeddings of the few labeled sentences, using a learned metric function. However, the generalization ability of these methods on unseen relations in different domains is limited, since these domains always have significant discrepancies from those in the training dataset, and the prototype is essential for extracting relations between entities in the latent space. To extract new relations in various domains more effectively, we propose to learn more interpretable and robust prototypes by learning from prior knowledge and intrinsic semantics of relations. We improve the prototype representation of relations more efficiently by using prior knowledge to explore the connections between relations. The geometric interpretability of the prototype is improved by making the classification margins between sentence embeddings clearer through contrastive learning. Besides, to better extract relations in different domains, we use a cross-domain approach so that the prototype generation process takes into account the gap between domains, which makes the prototype more robust.
The experimental results on the benchmark FewRel dataset demonstrate the advantages of the proposed method over some state-of-the-art methods.","few-shot, domain adaptation, relation extraction" Factors Influencing Generalization in Chaotic Dynamical Systems,https://openreview.net/forum?id=GAGpLgWAWX,https://openreview.net/pdf?id=GAGpLgWAWX,We explore factors influencing in- and out-of-distribution generalisation in forecasting chaotic dynamics.,"Many real-world systems exhibit chaotic behaviour, for example: weather, fluid dynamics, stock markets, natural ecosystems, and disease transmission. While chaotic systems are often thought to be completely unpredictable, in fact there are patterns within and across such systems that experts frequently describe and contrast qualitatively. We hypothesize that given the right supervision / task definition, representation learning systems will be able to pick up on these patterns, and successfully generalize both in- and out-of-distribution (OOD). Thus, this work explores and identifies key factors which lead to good generalization. We observe a variety of interesting phenomena, including: learned representations transfer much better when fine-tuned vs. frozen; forecasting appears to be the best pre-training task; OOD robustness falls off very quickly outside the training distribution; recurrent architectures generally outperform others on OOD generalization. Our findings are of interest to any domain of prediction where chaotic dynamics play a role.","generalization, multi-task learning, chaos, dynamical systems, representation learning" Interpretable Geometric Deep Learning via Learnable Randomness Injection,https://openreview.net/forum?id=6u7mf9s2A9,https://openreview.net/pdf?id=6u7mf9s2A9,,"Point cloud data is ubiquitous in scientific fields. Recently, geometric deep learning (GDL) has been widely applied to solve prediction tasks with point cloud data. However, GDL models are often complicated and hardly interpretable, which poses concerns to scientists when deploying these models in scientific analysis and experiments. This work proposes a general mechanism based on learnable randomness injection (LRI) that allows building inherently interpretable models with general GDL backbones. Once trained, LRI-induced models can detect the points within the point cloud data that carry information indicative of the prediction labels. Such indicative information may be reflected by either the existence of these points in the data or the geometric locations of these points. We also propose four scientific datasets in the domains of high energy physics and biochemistry to evaluate LRI. Compared with previous post-hoc interpretation methods, the points detected by LRI align much better and more stably with the ground-truth patterns that have actual scientific meaning. LRI-induced models are also more robust to the distribution shifts between training and test scenarios.","Geometric Deep Learning, Interpretation, Graph Neural Networks" Koopman Operator Learning for Accelerating Quantum Optimization and Machine Learning,https://openreview.net/forum?id=wyjAf9GPD_,https://openreview.net/pdf?id=wyjAf9GPD_,Koopman operator learning for accelerating quantum optimization and quantum machine learning.,"Finding efficient optimization methods plays an important role in quantum optimization and quantum machine learning on near-term quantum computers.
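The metric-based few-shot classification scheme summarized in the relation extraction abstract above (mean-embedding prototypes, nearest-prototype assignment) can be sketched in a few lines; the embeddings below are random stand-ins for an actual relation encoder, and the distance choice is an assumption.

```python
import numpy as np

# Sketch of prototype-based few-shot classification: one prototype per
# relation class (mean of the support embeddings), queries assigned to the
# nearest prototype by Euclidean distance.
def prototypes(support_emb: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """One mean-embedding prototype per class, in sorted label order."""
    return np.stack([support_emb[labels == c].mean(axis=0)
                     for c in np.unique(labels)])

def classify(query_emb: np.ndarray, protos: np.ndarray) -> np.ndarray:
    """Index of the nearest prototype for each query embedding."""
    dists = np.linalg.norm(query_emb[:, None, :] - protos[None, :, :], axis=-1)
    return dists.argmin(axis=1)

support = np.random.randn(10, 128)          # 5 relations x 2 shots (toy)
y = np.repeat(np.arange(5), 2)
pred = classify(np.random.randn(3, 128), prototypes(support, y))
```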
While backpropagation on classical computers is computationally efficient, obtaining gradients on quantum computers is not, because the computational complexity scales linearly with the number of parameters and measurements. In this paper, we connect Koopman operator theory, which has been successful in predicting nonlinear dynamics, with natural gradient methods in quantum optimization. We propose a data-driven approach using Koopman operator learning to accelerate quantum optimization and quantum machine learning. We develop two new families of methods: the sliding window dynamic mode decomposition (DMD) and the neural DMD for efficiently updating parameters on quantum computers. We show that our methods can predict gradient dynamics on quantum computers and accelerate the quantum variational eigensolver used in quantum optimization, as well as quantum machine learning. We further implement the learning algorithms on a real quantum computer and demonstrate their practical effectiveness.","Koopman operators, quantum optimization, machine learning" GOGGLE: Generative Modelling for Tabular Data by Learning Relational Structure,https://openreview.net/forum?id=fPVRcJqspu,https://openreview.net/pdf?id=fPVRcJqspu,,"Deep generative models learn highly complex and non-linear representations to generate realistic synthetic data. While they have achieved notable success in computer vision and natural language processing, similar advances have been less demonstrable in the tabular domain. This is partially because generative modelling of tabular data entails a particular set of challenges, including heterogeneous relationships, a limited number of samples, and difficulties in incorporating prior knowledge. Additionally, unlike their counterparts in the image and sequence domains, deep generative models for tabular data almost exclusively employ fully-connected layers, which encode weak inductive biases about relationships between inputs. Real-world data generating processes can often be represented using relational structures, which encode sparse, heterogeneous relationships between variables. In this work, we learn and exploit relational structure underlying tabular data to better model variable dependence, and as a natural means to introduce regularization on relationships and include prior knowledge. Specifically, we introduce GOGGLE, an end-to-end message passing scheme that jointly learns the relational structure and corresponding functional relationships as the basis of generating synthetic samples. Using real-world datasets, we provide empirical evidence that the proposed method is effective in generating realistic synthetic data and exploiting domain knowledge for downstream tasks. ","tabular data, synthetic data, generative model" Query by Self,https://openreview.net/forum?id=dWhS55KGSKy,https://openreview.net/pdf?id=dWhS55KGSKy,,"Training with hard-to-obtain and therefore valuable data, improving generalization performance, and accelerating training speed are all challenging problems in the machine learning community. Active learning, whose performance depends on its query strategy, is a powerful tool for these challenges. Unlike the famous Query by Committee strategy, which is aimed at classification problems and depends on a committee of students, our proposed query by self suits both classification and regression problems and requires only a single student, which benefits from the estimated output variance, an intermediate product of Kalman filtering optimization.
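The sliding-window DMD idea in the Koopman abstract above, fitting a linear operator to a recent window of parameter iterates and extrapolating the optimization trajectory, admits a compact sketch; the window size and toy trajectory below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Sketch of sliding-window dynamic mode decomposition (DMD) applied to an
# optimization trajectory: fit a linear map between consecutive iterates in
# the window, then roll it forward to predict future iterates (saving
# expensive gradient evaluations).
def dmd_extrapolate(theta_hist: np.ndarray, n_ahead: int = 5) -> np.ndarray:
    """theta_hist: (window, dim) recent iterates; returns predicted iterates."""
    X, Y = theta_hist[:-1].T, theta_hist[1:].T   # (dim, window-1) snapshot pairs
    A = Y @ np.linalg.pinv(X)                    # least-squares DMD operator
    preds, theta = [], theta_hist[-1]
    for _ in range(n_ahead):
        theta = A @ theta
        preds.append(theta)
    return np.array(preds)

hist = np.cumsum(np.random.randn(20, 8) * 0.1, axis=0)  # toy trajectory
future = dmd_extrapolate(hist)
```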
This means a broader scope of application and lower requirements on computation and data. Besides, this strategy reduces training time and improves accuracy by filtering out similar data and generalizing better. We theoretically explain the query by self strategy from the perspective of entropy. To verify the effectiveness of query by self empirically, we conduct several experiments on two classical models in machine learning.","Active learning, Kalman Filter, Variance" A Reproducible and Realistic Evaluation of Partial Domain Adaptation Methods,https://openreview.net/forum?id=_TbyZ0OxvC,https://openreview.net/pdf?id=_TbyZ0OxvC,,"Unsupervised Domain Adaptation (UDA) aims at classifying unlabeled target images leveraging source labeled ones. In this work, we consider the Partial Domain Adaptation (PDA) variant, where we have extra source classes not present in the target domain. Most successful algorithms use model selection strategies that rely on target labels to find the best hyper-parameters and/or models during training. However, these strategies violate the main assumption in PDA: only unlabeled target domain samples are available. Moreover, there are also inconsistencies in the experimental settings - architecture, hyper-parameter tuning, number of runs - yielding unfair comparisons. The main goal of this work is to provide a realistic evaluation of PDA methods with the different model selection strategies under a consistent evaluation protocol. We evaluate 7 representative PDA algorithms on 2 different real-world datasets using 7 different model selection strategies. Our two main findings are: (i) without target labels for model selection, the accuracy of the methods decreases by up to 30 percentage points; (ii) only one method and model selection pair performs well on both datasets. Experiments were performed with our PyTorch framework, BenchmarkPDA, which we open source.","Domain adaptation, benchmark, reproducible research, model selection" Progressive Prompts: Continual Learning for Language Models without Forgetting,https://openreview.net/forum?id=UJTgQBc91_,https://openreview.net/pdf?id=UJTgQBc91_,,"We introduce Progressive Prompts - a simple and efficient approach for continual learning in language models. Our method does not require data replay, and alleviates catastrophic forgetting without using a large number of task-specific parameters. Progressive Prompts learns a new soft prompt for each task and sequentially concatenates it with the previously learned prompts, while keeping the base model frozen. Experiments on standard continual learning benchmarks show that our approach outperforms state-of-the-art methods, with an improvement >20% in average test accuracy over the previous best-performing method on the T5 model. We also explore a more challenging continual learning setup with longer sequences of tasks and show that Progressive Prompts significantly outperforms prior methods.","natural language processing, continual learning, prompt tuning" Differentiable Rendering with Reparameterized Volume Sampling,https://openreview.net/forum?id=zHSaBQtj-l,https://openreview.net/pdf?id=zHSaBQtj-l,An importance sampling-based rendering algorithm for neural radiance fields that alleviates the costs of redundant radiance computation.,"We propose an alternative rendering algorithm for neural radiance fields based on importance sampling. In view synthesis, a neural radiance field approximates underlying density and radiance fields based on a sparse set of views of a scene.
To generate a pixel of a novel view, it marches a ray through the pixel and computes a weighted sum of radiance emitted from a dense set of ray points. This rendering algorithm is fully differentiable and facilitates gradient-based optimization of the fields. However, in practice, only a tiny opaque portion of the ray contributes most of the radiance to the sum. Therefore, we can avoid computing radiance along the rest of the ray. In this work, we use importance sampling to pick non-transparent points on the ray. Specifically, we generate samples according to the probability distribution induced by the density field. Our main contribution is the reparameterization of the sampling algorithm. It allows end-to-end learning with gradient descent as in the original rendering algorithm. With our approach, we can optimize a neural radiance field with just a few radiance field evaluations per ray. As a result, we alleviate the costs associated with the color component of the neural radiance field.","neural radiance fields, differentiable rendering, importance sampling, reparameterization trick" "Deep Learning From Crowdsourced Labels: Coupled Cross-Entropy Minimization, Identifiability, and Regularization",https://openreview.net/forum?id=_qVhsWyWB9,https://openreview.net/pdf?id=_qVhsWyWB9,,"Using noisy crowdsourced labels from multiple annotators, a deep learning-based end-to-end (E2E) system aims to learn the label correction mechanism and the neural classifier simultaneously. To this end, many E2E systems concatenate the neural classifier with multiple annotator-specific label confusion layers and co-train the two parts in a parameter-coupled manner. The formulated coupled cross-entropy minimization (CCEM)-type criteria are intuitive and work well in practice. Nonetheless, theoretical understanding of the CCEM criterion has been limited. The contribution of this work is twofold: First, performance guarantees of the CCEM criterion are presented. Our analysis reveals for the first time that the CCEM can indeed correctly identify the annotators' confusion characteristics and the desired ``ground-truth'' neural classifier under realistic conditions, e.g., when only incomplete annotator labeling and finite samples are available. Second, based on the insights learned from our analysis, two regularized variants of the CCEM are proposed. The regularization terms provably enhance the identifiability of the target model parameters in various more challenging cases. A series of synthetic and real data experiments are presented to showcase the effectiveness of our approach.","deep learning, learning under noisy labels, neural classifier, end-to-end learning, crowdsourcing" Maximum Likelihood Learning of Energy-Based Models for Simulation-Based Inference,https://openreview.net/forum?id=gL68u5UuWa,https://openreview.net/pdf?id=gL68u5UuWa,We introduce two Synthetic Likelihood methods for Simulation-Based Inference using Conditional Energy-Based Models,"We introduce two Synthetic Likelihood methods for Simulation-Based Inference (SBI), to conduct either amortized or targeted inference from experimental observations when a high-fidelity simulator is available. Both methods learn a Conditional Energy-Based Model (EBM) of the likelihood using synthetic data generated by the simulator, conditioned on parameters drawn from a proposal distribution. The learned likelihood can then be combined with any prior to obtain a posterior estimate, from which samples can be drawn using MCMC.
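The density-guided sampling step in the volume rendering abstract above can be sketched with plain inverse-CDF sampling; note the paper's actual contribution is making this step reparameterized and differentiable, which the toy below does not attempt, and the bins, weights, and shapes are illustrative assumptions.

```python
import numpy as np

# Sketch of importance sampling along a camera ray: draw depths with
# probability proportional to the (density-induced) weights, so radiance is
# only evaluated where the ray is opaque. Weights here are a toy stand-in
# for a trained density field.
def sample_ray_points(t_bins: np.ndarray, weights: np.ndarray,
                      n_samples: int, rng=np.random) -> np.ndarray:
    """Inverse-CDF sampling of depths t proportional to per-bin weights."""
    pdf = weights / weights.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    u = rng.uniform(size=n_samples)
    idx = np.searchsorted(cdf, u, side="right") - 1
    left, right = t_bins[idx], t_bins[idx + 1]       # jitter inside the bin
    return left + (right - left) * rng.uniform(size=n_samples)

t = np.linspace(0.0, 1.0, 65)                        # 64 bins along the ray
w = np.exp(-((0.5 * (t[:-1] + t[1:]) - 0.7) ** 2) / 0.001)  # opaque near t=0.7
pts = sample_ray_points(t, w, n_samples=8)
```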
Our methods uniquely combine a flexible Energy-Based Model and the minimization of a KL loss: this is in contrast to other synthetic likelihood methods, which either rely on normalizing flows or minimize score-based objectives, choices that come with known pitfalls. Our first method, Amortized Unnormalized Neural Likelihood Estimation (AUNLE), introduces a tilting trick during training that allows inference to be performed using efficient MCMC techniques. Our second method, Sequential UNLE (SUNLE), employs a doubly intractable approach in order to reuse simulation data and improve posterior accuracy for a specific observation. We demonstrate the properties of both methods on a range of synthetic datasets, and apply them to a neuroscience model of the pyloric network in the crab, matching the performance of other synthetic likelihood methods at a fraction of the simulation budget.","Simulation Based Inference, Energy Based Models, Maximum Likelihood" Provable Re-Identification Privacy,https://openreview.net/forum?id=apxFz3xWhF,https://openreview.net/pdf?id=apxFz3xWhF,We propose a novel privacy notion with easily interpretable guarantees and a practical algorithm for achieving it on many machine learning tasks. We precisely characterize the relationship with differential privacy and experiments confirm our theory.,"In applications involving sensitive data, such as finance and healthcare, the necessity for preserving data privacy can be a significant barrier to machine learning model development. Differential privacy (DP) has emerged as one canonical standard for provable privacy. However, DP's strong theoretical guarantees often come at the cost of a large drop in its utility for machine learning, and DP guarantees themselves can be difficult to interpret. As a result, standard DP has encountered deployment challenges in practice. In this work, we propose a different privacy notion, re-identification privacy (RIP), to address these challenges. RIP guarantees are easily interpretable in terms of the success rate of membership inference attacks. We give a precise characterization of the relationship between RIP and DP, and show that RIP can be achieved using less randomness compared to the amount required for guaranteeing DP, leading to a smaller drop in utility. Our theoretical results also give rise to a simple algorithm for guaranteeing RIP which can be used as a wrapper around any algorithm with a continuous output, including parametric model training.","privacy, membership inference, re-identification" Just Avoid Robust Inaccuracy: Boosting Robustness Without Sacrificing Accuracy,https://openreview.net/forum?id=ctLFd1-gHJ,https://openreview.net/pdf?id=ctLFd1-gHJ,,"While current methods for training robust deep learning models optimize robust accuracy, they significantly reduce natural accuracy, hindering their adoption in practice. Further, the resulting models are often both robust and inaccurate on numerous samples, providing a false sense of safety for those. In this work, we extend prior works in three main directions. First, we explicitly train the models to jointly maximize robust accuracy and minimize robust inaccuracy. Second, since the resulting models are trained to be robust only if they are accurate, we leverage robustness as a principled abstain mechanism. Finally, this abstain mechanism allows us to combine models in a compositional architecture that significantly boosts overall robustness without sacrificing accuracy.
We demonstrate the effectiveness of our approach for empirical and certified robustness on six recent state-of-the-art models and four datasets. For example, on CIFAR-10 with $\epsilon_\infty = 1/255$, we successfully enhanced the robust accuracy of a pre-trained model from 26.2% to 87.8% while even slightly increasing its natural accuracy from 97.8% to 98.0%.",robustness Projective Proximal Gradient Descent for Nonconvex Nonsmooth Optimization: Fast Convergence Without Kurdyka-Lojasiewicz (KL) Property,https://openreview.net/forum?id=yEsj8pGNl1,https://openreview.net/pdf?id=yEsj8pGNl1,We propose Projected Proximal Gradient Descent (PPGD) which solves a class of non-convex and non-smooth optimization problems with Nesterov's optimal convergence rate.,"Nonconvex and nonsmooth optimization problems are important and challenging for statistics and machine learning. In this paper, we propose Projected Proximal Gradient Descent (PPGD) which solves a class of nonconvex and nonsmooth optimization problems, where the nonconvexity and nonsmoothness come from a nonsmooth regularization term which is nonconvex but piecewise convex. In contrast with existing convergence analysis of accelerated PGD methods for nonconvex and nonsmooth problems based on the Kurdyka-Lojasiewicz (KL) property, we provide a new theoretical analysis showing that PPGD achieves an optimal convergence rate on a class of nonconvex and nonsmooth problems under mild assumptions, namely Nesterov's optimal convergence rate of first-order methods on smooth and convex objective functions with Lipschitz continuous gradients. Experimental results demonstrate the effectiveness of PPGD.","Nonconvex Nonsmooth Optimization, Projective Proximal Gradient Descent, Kurdyka-Lojasiewicz Property, Optimal Convergence Rate." First Steps Toward Understanding the Extrapolation of Nonlinear Models to Unseen Domains,https://openreview.net/forum?id=7wrq3vHcMM,https://openreview.net/pdf?id=7wrq3vHcMM,We prove the first result for understanding the extrapolation of a nonlinear model class with structured domain shifts.,"Real-world machine learning applications often involve deploying neural networks to domains that are not seen at training time. Hence, we need to understand the extrapolation of \textit{nonlinear} models---under what conditions on the distributions and function class, models can be guaranteed to extrapolate to new test distributions. The question is very challenging because even two-layer neural networks cannot be guaranteed to extrapolate outside the support of the training distribution without further assumptions on the domain shift. This paper makes some initial steps towards analyzing the extrapolation of nonlinear models for structured domain shift. We primarily consider settings where the \textit{marginal} distribution of each coordinate of the data (or a subset of coordinates) does not shift significantly across the training and test distributions, but the joint distribution may have a much bigger shift. We prove that the family of nonlinear models of the form $f(x)=\sum f_i(x_i)$, where $f_i$ is an \emph{arbitrary} function on the subset of features $x_i$, can extrapolate to unseen distributions, if the covariance of the features is well-conditioned.
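The abstain-then-fall-back composition in the "Just Avoid Robust Inaccuracy" abstract above can be sketched as a simple selector; the robustness check, models, and margin rule below are hypothetical stand-ins for an actual certification or attack-based test.

```python
from typing import Callable, List, Optional

# Sketch of a compositional architecture with robustness as an abstain
# mechanism: a model's prediction is trusted only when a robustness check
# passes; otherwise control falls through to the next model.
def compositional_predict(x: float,
                          models: List[Callable],
                          is_robust: Callable) -> Optional[int]:
    """Return the first prediction certified robust at x, else None."""
    for model in models:
        y = model(x)
        if is_robust(model, x, y):   # abstain on non-robust outputs
            return y
    return None                      # every model abstained

# Toy usage with stand-in components.
models = [lambda x: int(x > 0), lambda x: int(x > -1)]
robust = lambda m, x, y: abs(x) > 0.5            # hypothetical margin check
label = compositional_predict(0.7, models, robust)
```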
To the best of our knowledge, this is the first result that goes beyond linear models and the bounded density ratio assumption, even though the assumptions on the distribution shift and function class are stylized.","extrapolation of nonlinear models, theory, structured domain shift, gaussian kernel" A Kernel-Based View of Language Model Fine-Tuning,https://openreview.net/forum?id=erHaiO9gz3m,https://openreview.net/pdf?id=erHaiO9gz3m,"We show when language model fine-tuning can be in the kernel regime and derive a new kernel formula to describe Adam, and we empirically validate that this kernel can solve many downstream tasks.","It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of this empirical success, e.g., why fine-tuning a model with $10^8$ or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural Tangent Kernel (NTK)--which originated as a model to study the gradient descent dynamics of infinitely wide networks with suitable random initialization--describes fine-tuning of pre-trained LMs. This study was inspired by the decent performance of NTK for computer vision tasks (Wei et al., 2022). We also extend the NTK formalism to fine-tuning with Adam. We present extensive experiments that suggest that once the task is formulated as a masked language modeling problem through prompting, the NTK lens can often reasonably describe the model updates during fine-tuning with both SGD and Adam. This kernel view also suggests an explanation for the success of parameter-efficient subspace-based fine-tuning methods. Finally, we suggest a path toward a formal explanation for our findings via Tensor Programs (Yang, 2020).","language models, fine-tuning, neural tangent kernels, tensor programs" Variable Compositionality Reliably Emerges in Neural Networks,https://openreview.net/forum?id=-Yzz6vlX7V-,https://openreview.net/pdf?id=-Yzz6vlX7V-,Compositional systems reliably emerge between neural networks -- just with natural-language-like variation.,"Human languages are pervasively compositional: because the meaning of an expression is composed from the meaning of its parts, we can productively leverage our prior experience in order to communicate about novel meanings. Work looking at the languages that emerge when neural networks solve communicative tasks has shown that networks regularly develop languages that allow them to communicate and generalize to novel examples; surprisingly, that work has struggled to show that compositional systems reliably develop, leading to claims that a language's degree of compositionality has little bearing on how well it can generalise. We argue that the languages that emerge between networks are in fact straightforwardly compositional, just with variation. We introduce a variation-based framework for interpreting the mappings produced by neural networks in emergent communication games and find that they reliably exhibit straightforward compositional structure, with a degree of natural language-like variation that obscures their compositionality under measures used in previous work. We show that early in training, measures of variation correlate with generalization performance, but that this effect goes away later in training as the languages become regular enough to compositionally generalize.
In an effort to decrease the variability of the emergent languages, we show how reducing a model's capacity results in greater regularity, in line with claims about factors shaping the emergence of regularity in human language.","compositionality, emergence, generalization, regularity" Systematic Rectification of Language Models via Dead-end Analysis,https://openreview.net/forum?id=k8_yVW3Wqln,https://openreview.net/pdf?id=k8_yVW3Wqln,,"With adversarial or even otherwise normal prompts, existing large language models (LLMs) can be pushed to generate toxic discourses. One way to reduce the risk of LLMs generating undesired discourses is to alter the training of the LLM. This can be very restrictive due to demanding computation requirements. Other methods rely on rule-based or prompt-based token elimination, which are limited as they dismiss future tokens and the overall meaning of the complete discourse. Here, we center detoxification on the probability that the finished discourse is ultimately considered toxic. That is, at each point, we advise against token selections proportional to how likely a finished text from this point will be toxic. To this end, we formally extend the dead-end theory from the recent reinforcement learning (RL) literature to also cover uncertain outcomes. Our approach, called rectification, utilizes a separate but significantly smaller model for detoxification, which can be applied to diverse LLMs as long as they share the same vocabulary. Importantly, our method does not require access to the internal representations of the LLM, but only the token probability distribution at each decoding step. We believe this is important since many LLMs today are hosted in servers and only accessible through APIs. When applied to various LLMs, including GPT-3, our approach generates notably better results compared to the base LLMs and other techniques in terms of the overall language and detoxification performance.","Language Models, Detoxification, Dead-end Theory, Reinforcement Learning." Model-free Reinforcement Learning that Transfers Using Random Reward Features,https://openreview.net/forum?id=1P8eOmWgdk,https://openreview.net/pdf?id=1P8eOmWgdk,We develop model-free reinforcement learning algorithms that transfer across tasks using random features to approximate reward functions.,"Favorable reinforcement learning (RL) algorithms should not only be able to synthesize controllers for complex tasks, but also transfer across various such tasks. Classical model-free RL algorithms like Q-learning can be made stable and have the potential to solve complicated tasks individually. However, rewards are key supervision signals in model-free approaches, making it challenging in general to transfer across multiple tasks with different reward functions. On the other hand, model-based RL algorithms naturally transfer to various reward functions if the transition dynamics are learned well. Unfortunately, model learning usually suffers from high dimensional observations and/or long horizons due to the challenges of compounding error. In this work, we propose a new way to transfer behaviors across problems with different reward functions that enjoys the best of both worlds. Specifically, we develop a model-free approach that implicitly learns the model without constructing the transition dynamics. This is achieved by using random features to generate reward functions in training, and incorporating model predictive control with open-loop policies in online planning.
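The decoding-time rectification described in the dead-end analysis abstract above can be sketched as an adjustment of the base LM's token distribution; the hard threshold below is a simplification of the proportional advice rule, and the probabilities and risk scores are toy stand-ins for a trained rectification model.

```python
import numpy as np

# Sketch of decoding-time rectification: a small external model estimates,
# per candidate token, how likely a completion through it ends up toxic,
# and risky tokens are suppressed before sampling. Only the base LM's token
# probabilities are needed, not its internals.
def rectify(token_probs: np.ndarray, toxicity_value: np.ndarray,
            threshold: float = 0.8) -> np.ndarray:
    """Zero out tokens whose estimated dead-end toxicity exceeds threshold."""
    probs = token_probs * (toxicity_value < threshold)
    total = probs.sum()
    return probs / total if total > 0 else token_probs  # fall back if all vetoed

vocab_probs = np.array([0.5, 0.3, 0.2])
risk = np.array([0.1, 0.95, 0.4])     # token 1 likely leads to toxic text
safe_probs = rectify(vocab_probs, risk)
```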
We show that the approach enables fast adaptation to problems with completely new reward functions, while scaling to high dimensional observations and long horizons. Moreover, our method can easily be trained on large offline datasets and quickly deployed on new tasks with good performance, making it more widely applicable than typical model-free and model-based RL methods. We evaluate the superior performance of our algorithm in a variety of RL and robotics domains.","Reinforcement Learning, Model-free, Transfer, Random Features" Differentiable Channel Selection for Self-Attention,https://openreview.net/forum?id=ZHyTtEd4lXz,https://openreview.net/pdf?id=ZHyTtEd4lXz,We propose Differentiable Channel Selection (DCS) which searches for informative channels so as to compute semantic attention weights in a self-attention module.,"Self-attention has been widely used in deep learning, and recent efforts have been devoted to incorporating self-attention modules into convolutional neural networks for computer vision. In this paper, we propose a novel attention module termed Differentiable Channel Selection (DCS). In contrast with conventional self-attention, DCS searches for the locations and key dimensions of channels in a continuous space by a novel differentiable searching method. Our DCS module is compatible with either a fixed neural network backbone or a learnable backbone with Differentiable Neural Architecture Search (DNAS), leading to DCS with Fixed Backbone (DCS-FB) and DCS-DNAS, respectively. We apply DCS-FB and DCS-DNAS to three computer vision tasks: person Re-IDentification (Re-ID), object detection, and image classification, with state-of-the-art results on standard benchmarks and more compact architectures than competing methods, revealing the advantage of DCS.","Differentiable Channel Selection, Self-Attention, Differentiable Neural Architecture Search, Re-IDentification, Object Detection, Image Classification" Membership Inference Attacks Against Text-to-image Generation Models,https://openreview.net/forum?id=J41IW8Z7mE,https://openreview.net/pdf?id=J41IW8Z7mE,We perform the first privacy analysis of text-to-image generation models through the lens of membership inference.,"Text-to-image generation models have recently attracted unprecedented attention as they unlock imaginative applications in all areas of life. However, developing such models requires huge amounts of data that might contain privacy-sensitive information, e.g., face identity. While privacy risks have been extensively demonstrated in the image classification and GAN generation domains, privacy risks in the text-to-image generation domain are largely unexplored. In this paper, we perform the first privacy analysis of text-to-image generation models through the lens of membership inference. Specifically, we propose three key intuitions about membership information and design four attack methodologies accordingly. We conduct comprehensive evaluations on two mainstream text-to-image generation models including sequence-to-sequence modeling and diffusion-based modeling. The empirical results show that all of the proposed attacks can achieve significant performance, in some cases even close to an accuracy of 1, and thus the corresponding risk is much more severe than that shown by existing membership inference attacks.
We further conduct an extensive ablation study to analyze the factors that may affect the attack performance, which can guide developers and researchers to be alert to vulnerabilities in text-to-image generation models. All these findings indicate that our proposed attacks pose a realistic privacy threat to the text-to-image generation models.","Text-to-image Generation Model, Membership inference attacks" Multiple sequence alignment as a sequence-to-sequence learning problem,https://openreview.net/forum?id=8efJYMBrNb,https://openreview.net/pdf?id=8efJYMBrNb,,"The sequence alignment problem is one of the most fundamental problems in bioinformatics and a plethora of methods have been devised to tackle it. Here we introduce BetaAlign, a novel methodology for aligning sequences using a natural language processing (NLP) approach. BetaAlign accounts for the possible variability of the evolutionary process among different datasets by using an ensemble of transformers, each trained on millions of samples generated from a different evolutionary model. Our approach leads to outstanding alignment accuracy, often outperforming commonly used methods, such as MAFFT, DIALIGN, ClustalW, T-Coffee, and MUSCLE.","sequence alignment, molecular evolution, natural language processing, bioinformatics" Fair Graph Message Passing with Transparency,https://openreview.net/forum?id=NGv_ui-1wz,https://openreview.net/pdf?id=NGv_ui-1wz,We aim to achieve fair message passing with transparency to explicitly use sensitive attributes in forward propagation instead of backward propagation.,"Recent works achieve fair representations and predictions through regularization, adversarial debiasing, and contrastive learning in graph neural networks (GNNs). These methods \textit{implicitly} encode the sensitive attribute information in the well-trained model weights via \textit{backward propagation}. In practice, we not only pursue a fair machine learning model but also seek to lend such a fairness perception to the public. For current fairness methods, how the usage of sensitive attribute information leads the model to fair predictions still remains a black box. In this work, we first propose the concept of \textit{transparency} to describe \textit{whether or not} the model is able to lend fairness perception to the public. Motivated by the fact that current fairness models lack transparency, we aim to pursue a fair machine learning model with transparency via \textit{explicitly} rendering sensitive attribute usage for fair prediction in \textit{forward propagation}. Specifically, we develop an effective and transparent \textsf{F}air \textsf{M}essage \textsf{P}assing (FMP) scheme adopting sensitive attribute information in forward propagation. In this way, FMP explicitly uncovers how sensitive attributes influence the final prediction. Additionally, the FMP scheme can aggregate useful information from neighbors and mitigate bias in a unified framework to simultaneously achieve graph smoothness and fairness objectives. An acceleration approach is also adopted to improve the efficiency of FMP. Experiments on node classification tasks demonstrate that the proposed FMP outperforms the state-of-the-art baselines in terms of fairness and accuracy on three real-world datasets.
The code is available in {\color{blue}\url{https://anonymous.4open.science/r/FMP-AD84}}.","Fairness, Transparency, Graph Neural Networks" FedExP: Speeding up Federated Averaging via Extrapolation,https://openreview.net/forum?id=IPrzNbddXV,https://openreview.net/pdf?id=IPrzNbddXV,"We propose FedExP, a method to adaptively determine the server step size in FedAvg for faster convergence. ","Federated Averaging (FedAvg) remains the most popular algorithm for Federated Learning (FL) optimization due to its simple implementation, stateless nature, and privacy guarantees combined with secure aggregation. Recent work has sought to generalize the vanilla averaging in FedAvg to a generalized gradient descent step by treating client updates as pseudo-gradients and using a server step size. While the use of a server step size has been shown to provide performance improvement theoretically, the practical benefit of the server step size has not been seen in most existing works. In this work, we present FedExP, a method to adaptively determine the server step size in FL based on dynamically varying pseudo-gradients throughout the FL process. We begin by considering the overparameterized convex regime, where we reveal an interesting similarity between FedAvg and the Projection Onto Convex Sets (POCS) algorithm. We then show how FedExP can be motivated as a novel extension to the extrapolation mechanism that is used to speed up POCS. Our theoretical analysis later also discusses the implications of FedExP in underparameterized and non-convex settings. Experimental results show that FedExP consistently converges faster than FedAvg and competing baselines on a range of realistic FL datasets. ","Federated Learning, Optimization, Step Size" Graph Neural Networks Are More Powerful Than we Think,https://openreview.net/forum?id=lgYzzQ0fX5D,https://openreview.net/pdf?id=lgYzzQ0fX5D,,"Graph Neural Networks (GNNs) are powerful convolutional architectures that have shown remarkable performance in various node-level and graph-level tasks. Despite their success, the common belief is that the expressive power of standard GNNs is limited and that they are at most as discriminative as the Weisfeiler-Lehman (WL) algorithm. In this paper we argue the opposite and show that the WL algorithm is the upper bound only when the input to the GNN is the vector of all ones. In this direction, we derive an alternative analysis that employs linear algebraic tools and characterize the representational power of GNNs with respect to the eigenvalue decomposition of the graph operators. We show that GNNs can distinguish between any graphs that differ in at least one eigenvalue and design simple GNN architectures that are provably more expressive than the WL algorithm. Thorough experimental analysis on graph isomorphism and graph classification datasets corroborates our theoretical results and demonstrates the effectiveness of the proposed architectures.","graph neural networks, expressive power, representation, graph isomorphism, classification, spectral decomposition" A Mixture-of-Expert Approach to RL-based Dialogue Management,https://openreview.net/forum?id=4FBUihxz5nm,https://openreview.net/pdf?id=4FBUihxz5nm,A mixture-of-expert based dialogue manager that is amenable to sequential decision making techniques,"Despite recent advancements in language models (LMs), their application to dialogue management (DM) problems and ability to carry on rich conversations remain a challenge. 
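As a schematic of the FedExP idea summarized above, server-side extrapolation of the averaged client pseudo-gradients via an adaptive step size, here is a same-flavor sketch; the exact step-size rule is the paper's, and the expression, client updates, and dimensions below are illustrative stand-ins rather than its specification.

```python
import numpy as np

# Schematic of a FedAvg server update with an extrapolated (adaptive) step
# size in the spirit of FedExP. Client updates here are toy pseudo-gradients.
def server_update(w: np.ndarray, client_updates: list,
                  eps: float = 1e-3) -> np.ndarray:
    deltas = np.stack(client_updates)            # pseudo-gradients Delta_i
    delta_bar = deltas.mean(axis=0)
    # Stand-in extrapolated step size: at least 1 (vanilla FedAvg), larger
    # when individual updates disagree with their average (cf. POCS).
    eta = max(1.0, (deltas ** 2).sum(axis=1).mean()
              / (2 * (delta_bar ** 2).sum() + eps))
    return w + eta * delta_bar

w = np.zeros(4)
updates = [np.random.randn(4) * 0.1 for _ in range(8)]
w_next = server_update(w, updates)
```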
We use reinforcement learning (RL) to develop a dialogue agent that avoids being short-sighted (outputting generic utterances) and maximizes overall user satisfaction. Most existing RL approaches to DM train the agent at the word level and thus have to deal with a combinatorially complex action space even for a medium-size vocabulary. As a result, they struggle to produce a successful and engaging dialogue even if they are warm-started with a pre-trained LM. To address this issue, we develop an RL-based DM using a novel mixture-of-expert language model (MoE-LM) that consists of (i) an LM capable of learning diverse semantics for conversation histories, (ii) a number of specialized LMs (or experts) capable of generating utterances corresponding to a particular attribute or personality, and (iii) an RL-based DM that performs dialogue planning with the utterances generated by the experts. Our MoE approach provides greater flexibility to generate sensible utterances with different intents and allows RL to focus on conversational-level DM. We compare it with SOTA baselines on open-domain dialogues and demonstrate its effectiveness both in terms of the diversity and sensibility of the generated utterances and the overall DM performance. ",Reinforcement learning A Retrieve-and-Read Framework for Knowledge Graph Reasoning,https://openreview.net/forum?id=tCsdRTELrZs,https://openreview.net/pdf?id=tCsdRTELrZs,We introduce a Retrieve-and-Read Framework for Knowledge Graph Reasoning,"Knowledge graph (KG) reasoning aims to infer new facts based on existing facts in the KG. Recent studies have shown that using the graph neighborhood of a node via graph neural networks (GNNs) provides more useful information compared to just using the query information. Conventional GNNs for KG reasoning follow the standard message-passing paradigm on the entire KG, which leads to over-smoothing of representations and also limits their scalability. At a large scale, it becomes computationally expensive to aggregate useful information from the entire KG for inference. To address limitations of existing KG reasoning frameworks, we propose a novel retrieve-and-read framework, which first retrieves a relevant subgraph context for the query and then jointly reasons over the context and the query with a high-capacity reader. As part of our exemplar instantiation for the new framework, we propose a novel Transformer-based GNN as the reader, which incorporates graph-based attention structure and cross-attention for deep fusion between query and context. This design enables the model to focus on salient subgraph information that is relevant to the query. Empirical experiments on two standard KG reasoning datasets demonstrate the competitive performance of the proposed method.","Knowledge Graph Reasoning, Graph Neural Networks, Transformer" f-DM: A Multi-stage Diffusion Model via Progressive Signal Transformation,https://openreview.net/forum?id=iBdwKIsg4m,https://openreview.net/pdf?id=iBdwKIsg4m,"We propose a generalized family of diffusion models that incorporates progressive signal transformation including downsampling, blurring and VAEs.","Diffusion models (DMs) have recently emerged as SoTA tools for generative modeling in various domains. Standard DMs can be viewed as an instantiation of hierarchical variational autoencoders (VAEs) where the latent variables are inferred from input-centered Gaussian distributions with fixed scales and variances.
Unlike VAEs, this formulation prevents DMs from changing the latent spaces and learning abstract representations. In this work, we propose f-DM, a generalized family of DMs which allows progressive signal transformation. More precisely, we extend DMs to incorporate a set of (hand-designed or learned) transformations, where the transformed input is the mean of each diffusion step. We propose a generalized formulation and derive the corresponding denoising objective with a modified sampling algorithm. As a demonstration, we apply f-DM to image generation tasks with a range of functions, including down-sampling, blurring, and learned transformations based on the encoder of pretrained VAEs. In addition, we identify the importance of adjusting the noise levels whenever the signal is sub-sampled and propose a simple rescaling recipe. f-DM can produce high-quality samples on standard image generation benchmarks like FFHQ, AFHQ, LSUN, and ImageNet with better efficiency and semantic interpretation.","diffusion models, progressive signal transformation" An Empirical Study of the Neural Contextual Bandit Algorithms,https://openreview.net/forum?id=p4X5ZrM2AY,https://openreview.net/pdf?id=p4X5ZrM2AY,,"Recent advances in representation learning have significantly influenced solutions to contextual bandit problems. Neural bandit algorithms have been actively developed and reported in numerous papers to achieve extraordinary performance improvements over classical bandit algorithms. However, a comprehensive comparison among the existing neural bandit algorithms is lacking, and it is still not clear whether or when they can succeed in complex real-world problems. In this work, we present an inclusive empirical study on three different categories of existing neural bandit algorithms on several real-world datasets. The results show that such algorithms are highly competitive against their classical counterparts in most cases; however, the advantage is not consistent. The results also reveal crucial challenges for future research in neural bandit algorithms. ","Contextual Bandits, Neural Network, Neural Bandits" "Backpropagation at the Infinitesimal Inference Limit of Energy-Based Models: Unifying Predictive Coding, Equilibrium Propagation, and Contrastive Hebbian Learning",https://openreview.net/forum?id=nIMifqu2EO,https://openreview.net/pdf?id=nIMifqu2EO,"We unify and provide a single limit for many papers in the literature concerning when energy based models approximate backprop, typically in the context of biologically plausible learning algorithms","How the brain performs credit assignment is a fundamental unsolved problem in neuroscience. Many `biologically plausible' algorithms have been proposed, which compute gradients that approximate those computed by backpropagation (BP), and which operate in ways that more closely satisfy the constraints imposed by neural circuitry. Many such algorithms utilize the framework of energy-based models (EBMs), in which all free variables in the model are optimized to minimize a global energy function. However, in the literature, these algorithms exist in isolation and no unified theory exists linking them together.
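To make the f-DM forward process above concrete, here is a toy sketch of one noising step whose mean is a transformed signal (block-average downsampling) rather than the input itself; the transformation, noise schedule, and shapes are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# Toy sketch of an f-DM-style forward step: sample
# x_t ~ N(f_t(x0), sigma_t^2 I), where f_t downsamples the signal more
# aggressively later in diffusion time.
def fdm_forward(x0: np.ndarray, t: float, rng=np.random) -> np.ndarray:
    factor = 2 if t > 0.5 else 1                 # coarser signal later in time
    fx = x0.reshape(x0.shape[0] // factor, factor,
                    x0.shape[1] // factor, factor).mean(axis=(1, 3))
    sigma_t = t                                  # toy noise schedule
    return fx + sigma_t * rng.standard_normal(fx.shape)

x0 = np.random.rand(8, 8)
x_early, x_late = fdm_forward(x0, t=0.2), fdm_forward(x0, t=0.8)
```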
Here, we provide a comprehensive theory of the conditions under which EBMs can approximate BP, which lets us unify many of the BP approximation results in the literature (namely, predictive coding, equilibrium propagation, and contrastive Hebbian learning) and demonstrate that their approximation to BP arises from a simple and general mathematical property of EBMs at free-phase equilibrium. This property can then be exploited in different ways with different energy functions, and these specific choices yield a family of BP-approximating algorithms, which both includes the known results in the literature and can be used to derive new ones.","predictive coding, equilibrium propagation, contrastive hebbian learning, backprop, machine learning, computational neuroscience" A Theoretical Framework for Inference and Learning in Predictive Coding Networks,https://openreview.net/forum?id=ZCTvSF_uVM4,https://openreview.net/pdf?id=ZCTvSF_uVM4,We provide a comprehensive mathematical framework for understanding predictive coding networks including novel links with target propagation and expectation maximisation and prove that they converge to the same minima as backprop,"Predictive coding (PC) is an influential theory in computational neuroscience, which argues that the cortex forms unsupervised world models by implementing a hierarchical process of prediction error minimization. PC networks (PCNs) are trained in two phases. First, neural activities are updated to optimize the network's response to external stimuli. Second, synaptic weights are updated to consolidate this change in activity --- an algorithm called \emph{prospective configuration}. While previous work has shown how, in various limits, PCNs can approximate backpropagation (BP), recent work has demonstrated that PCNs operating in this standard regime, which does not approximate BP, nevertheless obtain training and generalization performance competitive with BP-trained networks while outperforming them on tasks such as online, few-shot, and continual learning, where brains are known to excel. Despite this promising empirical performance, little is understood theoretically about the properties and dynamics of PCNs in this regime. In this paper, we provide a comprehensive theoretical analysis of the properties of PCNs trained with prospective configuration. We first derive analytical results concerning the inference equilibrium for PCNs and a previously unknown close connection to target propagation (TP). Secondly, we provide a theoretical analysis of learning in PCNs as a variant of generalized expectation-maximization and use that to prove the convergence of PCNs to critical points of the BP loss function, thus showing that deep PCNs can, in theory, achieve the same generalization performance as BP, while maintaining their unique advantages.","predictive coding, backpropagation, target propagation, machine learning, neuroscience" Causally-guided Regularization of Graph Attention improves Generalizability,https://openreview.net/forum?id=U086TJFWy4p,https://openreview.net/pdf?id=U086TJFWy4p,We introduce a general regularization framework for graph attention networks that aligns attention weights with the causal effects of interventions on graph connectivity.,"Graph attention networks estimate the relational importance of node neighbors to aggregate relevant information over local neighborhoods for a prediction task. 
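The attention-weighted neighborhood aggregation just described can be sketched in a single simplified head; real architectures add learned projections, multiple heads, and LeakyReLU scoring, and `score_fn` here (e.g., a dot product) is our placeholder:

```python
import torch
import torch.nn.functional as F

def attention_aggregate(h, edge_index, score_fn):
    # h: [num_nodes, dim] node features; edge_index: (src, dst) index tensors.
    src, dst = edge_index
    logits = score_fn(h[src], h[dst])          # relational importance per edge
    alpha = torch.zeros_like(logits)
    for node in torch.unique(dst):             # softmax over each node's incoming edges
        mask = dst == node
        alpha[mask] = F.softmax(logits[mask], dim=0)
    out = torch.zeros_like(h)
    out.index_add_(0, dst, alpha.unsqueeze(-1) * h[src])
    return out
```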
However, the inferred attentions are vulnerable to spurious correlations and connectivity in the training data, hampering the generalizability of the model. We introduce CAR, a general-purpose regularization framework for graph attention networks. Embodying a causal inference approach, CAR aligns the attention mechanism with the causal effects of active interventions on graph connectivity in a scalable manner. CAR is compatible with a variety of graph attention architectures, and we show that it systematically improves generalizability on various node classification tasks. Our ablation studies indicate that CAR homes in on the aspects of graph structure most pertinent to the prediction (e.g., homophily), and does so more effectively than alternative approaches. Finally, we also show that CAR enhances interpretability of attention weights by accentuating node-neighbor relations that point to causal hypotheses. For social media network-sized graphs, a CAR-guided graph rewiring approach could allow us to combine the scalability of graph convolutional methods with the higher performance of graph attention.","graph neural network, attention, generalization, regularization, causal effect, causal interventions, interpretability" On a Benefit of Masked Language Model Pretraining: Robustness to Simplicity Bias,https://openreview.net/forum?id=MGnPyYQ2QAA,https://openreview.net/pdf?id=MGnPyYQ2QAA,We theoretically and empirically show that MLM pretraining makes models robust to lexicon-level spurious features.,"Despite the success of pretrained masked language models (MLM), why MLM pretraining is useful is still not fully answered. In this work we theoretically and empirically show that MLM pretraining makes models robust to lexicon-level spurious features, partly answering the question. Our explanation is that MLM pretraining may alleviate problems brought by simplicity bias (Shah et al., 2020), which refers to the phenomenon that a deep model tends to rely excessively on simple features. In NLP tasks, those simple features could be token-level features whose spurious association with the label can be learned easily. We show that MLM pretraining makes learning from the context easier. Thus, pretrained models are less likely to rely excessively on a single token. We also explore the theoretical explanations of MLM’s efficacy in causal settings. Compared with Wei et al. (2021), we achieve similar results with milder assumptions. Finally, we close the gap between our theories and real-world practices by conducting experiments on real-world tasks.","language model, robustness, pretraining" FLGAME: A Game-theoretic Defense against Backdoor Attacks In Federated Learning,https://openreview.net/forum?id=TwCGI3rVddj,https://openreview.net/pdf?id=TwCGI3rVddj,,"Federated learning enables a distributed training paradigm, where multiple local clients jointly train a global model without needing to share their local training data. However, recent studies have shown that federated learning provides an additional surface for backdoor attacks. For instance, an attacker can compromise a subset of clients and thus corrupt the global model to incorrectly predict an attacker-chosen target class given any input embedded with the backdoor trigger. Existing defenses for federated learning against backdoor attacks usually detect and exclude the corrupted information from the compromised clients based on a $\textit{static}$ attacker model. 
Such defenses, however, are less effective when faced with $\textit{dynamic}$ attackers who can strategically adapt their attack strategies. In this work, we model the strategic interaction between the (global) defender and attacker as a minimax game. Based on the analysis of our model, we design an interactive defense mechanism that we call FLGAME. Theoretically, we prove that under mild assumptions, the global model trained with FLGAME under backdoor attacks is close to that trained without attacks. Empirically, we perform extensive evaluations on benchmark datasets and compare FLGAME with multiple state-of-the-art baselines. Our experimental results show that FLGAME can effectively defend against strategic attackers and achieves significantly higher robustness than baselines. ", DeepReShape: Redesigning Neural Networks for Private Inference,https://openreview.net/forum?id=-AEYAk13n_a,https://openreview.net/pdf?id=-AEYAk13n_a,"Redesigning the neural network by distributing the network's ReLUs in their order of criticality for higher ReLU-efficiency, and enabling FLOPs-ReLU-Accuracy balance for fast Private Inference. ","The increased demand for privacy and security has given rise to private inference (PI), where inferences are made on encrypted data using cryptographic techniques. A challenge with deploying PI is its computational and storage overheads, which make it impractical. Unlike plaintext inference, PI's overheads stem from non-linear operations, i.e., ReLU. Despite this inversion of operator overheads, all previous ReLU optimizations for PI still leverage classic networks optimized for plaintext. This paper investigates what PI-optimized network architectures should look like, and through thorough experimentation, we find that wider networks are more ReLU efficient and that how ReLUs are allocated between layers has a significant impact. The insights are compiled into a set of design principles (DeepReShape) and used to synthesize specific architectures (HybReNet) for efficient PI. We further develop a novel channel-wise ReLU dropping mechanism, ReLU-reuse, and achieve up to a 3\% accuracy boost. Compared to the state-of-the-art (SNL on CIFAR-100), we achieve a 2.35\% accuracy gain at 180K ReLUs. For ResNet50 on TinyImageNet, our method saves 4.2$\times$ ReLUs at iso-accuracy. ","Private Inference, Neural network design, ReLU efficiency" The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich Regimes,https://openreview.net/forum?id=JLINxPOVTh7,https://openreview.net/pdf?id=JLINxPOVTh7,Empirical study of neural networks in the overparameterized regime shows how finite-width effects are brought on by initialization variance as sample size grows.,"For small training set sizes $P$, the generalization error of wide neural networks is well-approximated by the error of an infinite width neural network (NN), either in the kernel or mean-field/feature learning regime. However, at a critical sample size $P^*$, the finite-width network generalization begins to worsen compared to the infinite width performance. In this work, we empirically study the transition from the infinite width behavior to this variance-limited regime as a function of sample size $P$ and network width $N$. We find that finite size effects can become relevant for very small dataset sizes going as $P^* \sim \sqrt{N}$ for polynomial regression with ReLU networks. We discuss the source of this finite size behavior based on the variance of the NN's final neural tangent kernel (NTK). 
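The initialization variance of the NTK can be probed empirically. A minimal sketch for a scalar-output network is below; one then estimates the variance of an entry by repeating the computation over an ensemble of independently initialized models (the sketch assumes every parameter participates in the forward pass):

```python
import torch

def empirical_ntk_entry(model, x1, x2):
    # K(x1, x2) = <grad_theta f(x1), grad_theta f(x2)> for a scalar-output model.
    def grad_vec(x):
        out = model(x.unsqueeze(0)).squeeze()
        grads = torch.autograd.grad(out, list(model.parameters()))
        return torch.cat([g.reshape(-1) for g in grads])
    return torch.dot(grad_vec(x1), grad_vec(x2)).item()
```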
We then show how this transition can be pushed to larger $P$ by enhancing feature learning or by ensemble averaging the networks. We find that the learning curve for regression with the final NTK is an accurate approximation of the NN learning curve. Using this, we provide a toy model which also exhibits $P^* \sim \sqrt{N}$ scaling and has $P$-dependent benefits from feature learning. ","Feature Learning, Neural Tangent Kernel, Scaling Laws, Deep Ensembles" Semi-Supervised Single Domain Generalization with Label-Free Adversarial Data Augmentation,https://openreview.net/forum?id=86_enbV-pNB,https://openreview.net/pdf?id=86_enbV-pNB,,"Domain generalization (DG) has attracted increasing attention recently, as it seeks to improve the generalization ability of visual recognition models to unseen target domains. DG leverages multiple source domains for model training, while single domain generalization (SDG) further restricts this setting by exploiting only a single source domain. Nevertheless, both DG and SDG assume that the source domains are fully labeled, which might not be practical in many real-world scenarios. In this paper, we present a new problem, i.e., semi-supervised single domain generalization (SS-SDG), which aims to train a model with a partially labeled single source domain to generalize to multiple unseen testing domains. We propose an effective framework to address this problem. In particular, we design a label-free adversarial data augmentation strategy to diversify the source domain, and propose a novel multi-pair FixMatch loss to generalize classifiers to unseen testing domains. Extensive experiments on OfficeHome, PACS and DomainNet20 datasets show that our method surpasses the latest SDG and semi-supervised methods. Moreover, on PACS and DomainNet20, our method approaches the fully supervised ERM upper bound within a $5\%$ gap, while using less than $8\%$ of the labels.","Representation Learning, Domain Generalization" A Simple Approach for Visual Room Rearrangement: 3D Mapping and Semantic Search,https://openreview.net/forum?id=1C6nCCaRe6p,https://openreview.net/pdf?id=1C6nCCaRe6p,"A System For Exploring A Scene, Mapping Objects, and Rearranging Objects To A Visual Goal","Physically rearranging objects is an important capability for embodied agents. Visual room rearrangement evaluates an agent's ability to rearrange objects in a room to a desired goal based solely on visual input. We propose a simple yet effective method for this problem: (1) search for and map which objects need to be rearranged, and (2) rearrange each object until the task is complete. Our approach consists of an off-the-shelf semantic segmentation model, voxel-based semantic map, and semantic search policy to efficiently find objects that need to be rearranged. On the AI2-THOR Rearrangement Challenge, our method improves on current state-of-the-art end-to-end reinforcement learning-based methods that learn visual rearrangement policies from 0.53\% correct rearrangement to 16.56\%, using only 2.7\% as many samples from the environment.","Embodied AI, Deep Learning, Object Rearrangement" Memory Efficient Dynamic Sparse Training,https://openreview.net/forum?id=RZT4uwbZ5qr,https://openreview.net/pdf?id=RZT4uwbZ5qr,,"The excessive memory and energy consumption of modern Artificial Neural Networks (ANNs) poses limitations on the machines that can run these models. Sparsification of ANNs is often motivated by time, memory and energy savings only during model inference, yielding no benefits during training. 
A growing body of work is now focusing on providing the benefits of model sparsification also during training. While these methods improve the energy efficiency during training, the algorithms yielding the most accurate models still have a peak memory usage on the same order as the dense model. We propose a Dynamic Sparse Training (DST) algorithm that reduces the peak memory usage during training while preserving the energy advantages of sparsely trained models. We evaluate our algorithm on CIFAR-10/100 using ResNet-56 and VGG-16 and compare it against a range of sparsification methods. The benefits of our method are twofold: first, it allows for a given model to be trained to an accuracy on par with the dense model while requiring significantly less memory and energy; second, the savings in memory and energy can be allocated towards training an even larger sparse model on the same machine, generally improving the accuracy of the model.","Dynamic Sparse Training, Sparse Neural Networks" Accelerated Training via Principled Methods for Incrementally Growing Neural Networks,https://openreview.net/forum?id=yRkNJh5WgRE,https://openreview.net/pdf?id=yRkNJh5WgRE,,"We develop an approach to efficiently grow neural networks, within which parameterization and optimization strategies are designed by considering their effects on the training dynamics. Unlike existing growing methods, which follow simple replication heuristics or utilize auxiliary gradient-based local optimization, we craft a parameterization scheme which dynamically stabilizes weight, activation, and gradient scaling as the architecture evolves, and maintains the inference functionality of the network. To address the optimization difficulty resulting from imbalanced training effort distributed to subnetworks fading in at different growth phases, we propose a learning rate adaption mechanism that rebalances the gradient contribution of these separate subcomponents. Experimental results show that our method achieves comparable or better accuracy than training large fixed-size models, while saving a substantial portion of the original computation budget for training. We demonstrate that these gains translate into real wall-clock training speedups.", Progressive Mix-Up for Few-Shot Supervised Multi-Source Domain Transfer,https://openreview.net/forum?id=H7M_5K5qKJV,https://openreview.net/pdf?id=H7M_5K5qKJV,,"This paper targets a new and challenging setting of knowledge transfer from multiple source domains to a single target domain, where labeled target data is few-shot or even one-shot. Traditional domain generalization or adaptation methods cannot work directly, since there is no sufficient target domain distribution to serve as the object of transfer. The multi-source setting further complicates the transfer task, owing to the excessive domain gaps introduced by all the source domains. To tackle this problem, we propose a novel progressive mix-up (P-Mixup) mechanism to introduce an intermediate mix-up domain, pushing both the source domains and the few-shot target domain to align to this mix-up domain. Further, by enforcing the mix-up domain to progressively move towards the source domains, we achieve the domain transfer from multi-source domains to the single one-shot target domain. Our P-Mixup differs from traditional mix-up in that it uses a progressive and adaptive mix-up ratio, following the spirit of curriculum learning to better align the source and target domains. 
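The progressive-ratio mechanism can be sketched minimally as below. The linear schedule is our assumption for illustration; the paper's ratio is adaptive rather than fixed in advance:

```python
def progressive_mixup(x_src, x_tgt, step, total_steps):
    # The mix-up ratio progresses over training, so the mix-up domain starts
    # near the few-shot target domain and moves toward the source domains.
    lam = step / total_steps      # assumed linear schedule in [0, 1]
    return lam * x_src + (1.0 - lam) * x_tgt
```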
Moreover, our P-Mixup combines both pixel-level and feature-level mix-up to better enrich the data diversity. Experiments on two benchmarks show that our P-Mixup significantly outperforms the state-of-the-art methods, achieving 6.0\% and 6.8\% improvements on Office-Home and DomainNet.","Representation Learning, Domain Adaptation" Mitigating Propagation Failures in PINNs using Evolutionary Sampling,https://openreview.net/forum?id=Jzliv-bxZla,https://openreview.net/pdf?id=Jzliv-bxZla,,"Despite the success of physics-informed neural networks (PINNs) in approximating partial differential equations (PDEs), it is known that PINNs can sometimes fail to converge to the correct solution in problems involving complicated PDEs. This is reflected in several recent studies on characterizing and mitigating the ``failure modes'' of PINNs. While most of these studies have focused on balancing loss functions or adaptively tuning PDE coefficients, what is missing is a thorough understanding of the connection between failure modes of PINNs and sampling strategies used for training PINNs. In this paper, we provide a novel perspective of failure modes of PINNs by hypothesizing that the training of PINNs relies on successful ``propagation'' of the solution from initial and/or boundary condition points to interior points. We show that PINNs with poor sampling strategies can get stuck at trivial solutions if there are propagation failures. We additionally demonstrate that propagation failures are characterized by highly imbalanced PDE residual fields where very high residuals are observed over very narrow regions. To mitigate propagation failures, we propose a novel evolutionary sampling (Evo) method that can incrementally accumulate collocation points in regions of high PDE residuals with little to no computational overhead. We provide an extension of Evo to respect the principle of causality while solving time-dependent PDEs. We theoretically analyze the behavior of Evo and empirically demonstrate its efficacy and efficiency in comparison with baselines on a variety of PDE problems.","Physics-informed Neural Networks, Adaptive Sampling, Failure Modes of PINNs." Revisiting Information-Based Clustering with Pseudo-Posterior Models,https://openreview.net/forum?id=2PI2EKASh_Z,https://openreview.net/pdf?id=2PI2EKASh_Z,,"Maximization of mutual information (MI) between the network's input and output motivates standard losses for unsupervised discriminative clustering enforcing ""decisiveness"" and ""fairness"". In the context of common softmax models, we clarify several general properties of such discriminative losses that were previously not well understood: the relation to K-means, or lack thereof, and ""margin-maximization"". In particular, we show that ""decisiveness"" without the extra regularization term can lead to poor classification margins. Also, non-convexity of information-based losses motivates us to focus on self-supervised approaches introducing effective higher-order optimization algorithms with auxiliary variables. Addressing limitations of existing formulations, we propose a new self-supervised loss with soft auxiliary variables, or ""pseudo-confidence"" estimates. In particular, we introduce ""strong"" fairness and motivate the ""reverse"" cross-entropy as a robust loss for network training from noisy pseudo-confidence estimates. The latter is efficiently computed using variational inference---we derive a new EM algorithm with closed-form solutions for E and M steps. 
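The "reverse" cross-entropy named above swaps the roles of prediction and target relative to the standard loss, which is what confers robustness to noisy targets. A hedged sketch of that generic swap (not the paper's full pseudo-posterior objective):

```python
import torch

def reverse_cross_entropy(pred_probs, pseudo_conf, eps=1e-6):
    # Standard CE:  -sum( target * log(pred) )
    # Reverse CE:   -sum( pred * log(target) )
    # With noisy targets (pseudo-confidence estimates), placing them inside
    # the log down-weights the influence of unreliable entries.
    return -(pred_probs * (pseudo_conf + eps).log()).sum(dim=-1).mean()
```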
Empirically, our algorithm improves the performance of earlier methods for information-based clustering.", Neural Compositional Rule Learning for Knowledge Graph Reasoning,https://openreview.net/forum?id=F8VKQyDgRVj,https://openreview.net/pdf?id=F8VKQyDgRVj,"In this paper, we propose an end-to-end neural model for learning compositional logic rules called NCRL. NCRL treats logic rules as a hierarchical tree, and breaks the rule body into small atomic compositions during inference.","Learning logic rules is critical to improve reasoning in knowledge graphs. This is due to their ability to provide interpretable logical explanations when used for predictions, as well as their ability to generalize to other tasks, domains, and data. While recent methods have been proposed to learn logic rules, the majority of these methods are either restricted by their computational complexity to manage the large search space, or are particularly designed for the inductive relational reasoning task. In this paper, we propose an end-to-end neural model for learning compositional logic rules called NCRL. NCRL treats logic rules as a hierarchical tree, and breaks the rule body into small atomic compositions during inference. By recurrently merging compositions in the rule body with a recurrent attention unit, NCRL finally predicts a single rule head. Experimental results show that NCRL learns high-quality rules, as well as being generalizable across multiple tasks. Specifically, we show that NCRL is scalable and yields state-of-the-art results for link prediction on large-scale KGs. Moreover, we test NCRL for systematic generalization by learning to reason on small-scale observed graphs and evaluating on larger ones.","Logical Rule, Knowledge Graph, Reasoning, Compositionality, Systematicity" Temporal Change Sensitive Representation for Reinforcement Learning,https://openreview.net/forum?id=YnPpdxEcZbi,https://openreview.net/pdf?id=YnPpdxEcZbi,TCSR is a self-supervised representation learning auxiliary task designed for reinforcement learning methods with a latent dynamic model.,"Image-based deep reinforcement learning has improved greatly in recent years by combining state-of-the-art reinforcement learning algorithms with self-supervised representation learning algorithms. However, these self-supervised representation learning algorithms are designed to preserve global visual information, which may miss changes in visual information that are important for performing the task. To resolve this problem, self-supervised representation learning specifically designed for better preserving task-relevant information is necessary. Following this idea, we introduce Temporal Change Sensitive Representation (TCSR), which is designed for reinforcement learning algorithms that have a latent dynamic model. TCSR encourages the latent state representation of the reinforcement agent to put more emphasis on the parts of the observation that could potentially change in the future. 
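The abstract does not spell out the exact objective, so the following is only a hedged illustration of the "emphasize what changes" idea: upweight a prediction loss wherever consecutive frames actually differ. The weighting scheme here is entirely our assumption:

```python
import torch

def change_weighted_mse(pred_frame, target_frame, prev_frame):
    # Upweight prediction error wherever the observation changes between
    # consecutive frames, so static background contributes less.
    change = (target_frame - prev_frame).abs()
    weight = 1.0 + change / (change.mean() + 1e-8)   # assumed weighting scheme
    return (weight * (pred_frame - target_frame) ** 2).mean()
```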
Our method achieves SoTA performance on the Atari100K benchmark.","Reinforcement Learning, Representation Learning" Provably Efficient Reinforcement Learning for Online Adaptive Influence Maximization,https://openreview.net/forum?id=49N06mWPFUm,https://openreview.net/pdf?id=49N06mWPFUm,We propose a model-based optimistic RL approach to solve the content-dependent online adaptive influence maximization problem.,"Online influence maximization aims to maximize the influence spread of content in a social network with an unknown network model by selecting a few seed nodes. Recent studies followed a non-adaptive setting, where the seed nodes are selected before the start of the diffusion process and network parameters are updated when the diffusion stops. We consider an adaptive version of the content-dependent online influence maximization problem, where the seed nodes are sequentially activated based on real-time feedback. In this paper, we formulate the problem as an infinite-horizon discounted MDP under a linear diffusion process and present a model-based reinforcement learning solution. Our algorithm maintains a network model estimate and selects seed users adaptively, exploring the social network while improving the optimal policy optimistically. We establish a $\widetilde O(\sqrt{T})$ regret bound for our algorithm. Empirical evaluations on synthetic and real-world networks demonstrate the efficiency of our algorithm. ","influence maximization, reinforcement learning" Fairness via Adversarial Attribute Neighbourhood Robust Learning,https://openreview.net/forum?id=ceDaKReXBdY,https://openreview.net/pdf?id=ceDaKReXBdY,We propose a principled Robust Adversarial Attribute Neighbourhood (RAAN) loss to debias the classification head and to promote a fairer representation distribution across different sensitive attribute groups with a theoretical guarantee,"Improving fairness between privileged and less-privileged sensitive attribute groups (e.g., {race, gender}) has attracted much attention. To ensure that the model performs uniformly well across different sensitive attributes, we propose a principled \underline{R}obust \underline{A}dversarial \underline{A}ttribute \underline{N}eighbourhood (RAAN) loss to debias the classification head and to promote a fairer representation distribution across different sensitive attribute groups. The key idea of RAAN is to mitigate the differences of biased representations between different sensitive attribute groups by assigning each sample an adversarial robust weight, which is defined on the representations of adversarial attribute neighbors, i.e., the samples from different protected groups. To provide efficient optimization algorithms, we cast the RAAN into a sum of coupled compositional functions and propose a stochastic adaptive (Adam-style) and non-adaptive (SGD-style) algorithm framework SCRAAN with provable theoretical guarantees. Extensive empirical studies on fairness-related benchmark datasets verify the effectiveness of the proposed method. ", Efficient approximation of neural population structure and correlations with probabilistic circuits,https://openreview.net/forum?id=XC_yGI-0j9,https://openreview.net/pdf?id=XC_yGI-0j9,We present a computationally efficient generative model for a wide range of population structures with higher order correlations and a large number of neurons. ," We present a computationally efficient framework to model a wide range of population structures with high order correlations and a large number of neurons. 
Our method is based on a special type of Bayesian network that has linear inference time and is founded upon the concept of contextual independence. Moreover, we use an efficient architecture learning method for network selection to model large neural populations even with a small amount of data. Our framework is both fast and accurate in approximating neural population structures. Furthermore, our approach enables us to reliably quantify higher order neural correlations. We test our method on simulated neural populations commonly used to generate higher order correlations, as well as on publicly available large-scale neural recordings from the Allen Brain Observatory. Our approach significantly outperforms other models both in terms of statistical measures and alignment with experimental evidence.","Computational neuroscience, Probabilistic circuits, Sum Product networks, AI for science" Exploring perceptual straightness in learned visual representations,https://openreview.net/forum?id=4cOfD2qL6T,https://openreview.net/pdf?id=4cOfD2qL6T,,"Humans have been shown to use a ''straightened'' encoding to represent the natural visual world as it evolves in time (Henaff et al., 2019). In the context of discrete video sequences, ''straightened'' means that changes between frames follow a more linear path in representation space at progressively deeper levels of processing. While deep convolutional networks are often proposed as models of human visual processing, many do not straighten natural videos. In this paper, we explore the relationship between network architecture, robustness, biologically-inspired filtering mechanisms, and representational straightness in response to time-varying input; we identify curvature as a useful way of evaluating neural network representations. We find that (1) adversarial training leads to straighter representations in both CNN and transformer-based architectures but (2) this effect is task-dependent, not generalizing to tasks such as segmentation and frame-prediction, where straight representations are not favorable for predictions. Finally, (3) biologically-inspired elements increase straightness in the early stages of a network, but do not guarantee increased straightness in downstream layers of CNNs. Our results suggest that the ability of a model to straighten is a useful and easily computed measure of representational robustness and stability, as well as a marker of similarity between human and machine representations.","adversarial robustness, deep learning, representation learning, computer vision, neuroscience, human vision" Improving Subgraph Representation Learning via Multi-View Augmentation,https://openreview.net/forum?id=X4Jj-SmWX_i,https://openreview.net/pdf?id=X4Jj-SmWX_i,We develop a novel multi-view augmentation mechanism to improve subgraph representation learning models and thus the accuracy of downstream prediction tasks.,"Subgraph representation learning based on Graph Neural Networks (GNNs) has exhibited broad applications in scientific advancements, such as predictions of molecular structure-property relationships and collective cellular function. In particular, graph augmentation techniques have shown promising results in improving graph-based and node-based classification tasks. Still, they have rarely been explored in the existing GNN-based subgraph representation learning studies. 
In this study, we develop a novel multi-view augmentation mechanism to improve subgraph representation learning models and thus the accuracy of downstream prediction tasks. Our augmentation technique creates multiple variants of subgraphs and embeds these variants into the original graph to achieve highly improved training efficiency, scalability, and accuracy. Benchmark experiments on several real-world biological and physiological datasets demonstrate the superiority of our proposed multi-view augmentation techniques in subgraph representation learning.","Graph Learning, Subgraph Representation Learning, Graph Data Augmentation, Multi-view Augmentation" Efficient Proxy for NAS is Extensible Now,https://openreview.net/forum?id=eGtS3Cuj1Zo,https://openreview.net/pdf?id=eGtS3Cuj1Zo,We propose a configurable and extensible efficient proxy task for evaluating neural architectures with a search method to extend the proxy to various downstream tasks and search spaces.,"Neural Architecture Search (NAS) has become a de facto approach in the recent trend of AutoML to design deep neural networks (DNNs). Efficient or near-zero-cost NAS proxies are further proposed to address the demanding computational issues of NAS, where each candidate architecture network only requires one iteration of backpropagation. The values obtained from the proxies are considered the predictions of architecture performance on downstream tasks. However, two significant drawbacks hinder the extended usage of efficient NAS proxies. (1) Efficient proxies are not adaptive to various search spaces. (2) Efficient proxies are not extensible to multi-modality downstream tasks. Based on these observations, we design an Extensible proxy (Eproxy) that utilizes self-supervised, few-shot training (i.e., 10 iterations of backpropagation), which yields near-zero costs. The key component that makes Eproxy efficient is an untrainable convolution layer, termed the barrier layer, that adds non-linearities to the optimization space so that Eproxy can discriminate the performance of architectures at an early stage. Furthermore, to make Eproxy adaptive to different downstream tasks/search spaces, we propose a Discrete Proxy Search (DPS) to find the optimized training settings for Eproxy with only a handful of benchmarked architectures on the target tasks. Our extensive experiments confirm the effectiveness of both Eproxy and Eproxy+DPS. On NAS-Bench-101 (~423k architectures), Eproxy achieves a Spearman $\rho$ of 0.65. In contrast, the previous best zero-cost method achieves 0.45. On NDS-ImageNet search spaces, Eproxy+DPS delivers 0.73 Spearman $\rho$ average ranking correlation while the previous efficient proxy only achieves 0.47. On the NAS-Bench-Trans-Micro search space (7 tasks), Eproxy+DPS delivers performance comparable with early-stop methods, which require 660 GPU hours per task. For end-to-end tasks such as DARTS-ImageNet-1k, our method delivers better results than NAS performed on CIFAR-10 while requiring only one GPU hour with a single batch of CIFAR-10 images.","Neural Architecture Search, Zero Cost Neural Architecture Search, Efficient NAS, Self-supervised Learning, Computer Vision" "System identification of neural systems: If we got it right, would we know?",https://openreview.net/forum?id=nBi2tQ_Wba,https://openreview.net/pdf?id=nBi2tQ_Wba,,"Various artificial neural networks developed by engineers are now proposed as models of parts of the brain, such as the ventral stream in the primate visual cortex. 
The network activations are compared to recordings of biological neurons, and good performance in reproducing neural responses is considered to support the model's validity. This system identification approach, however, is only part of the traditional ways to develop and test models in the natural sciences. A key question is how much the ability to predict neural responses tells us. In particular, do these functional tests of neuron activation allow us to distinguish between different model architectures? We benchmark existing techniques to correctly identify a model by replacing brain recordings with known ground truth models. We evaluate the most commonly used identification approaches, such as a linear encoding model and centered kernel alignment. Even in the setting where the correct model is among the candidates, system identification performance is quite variable; it also depends significantly on factors independent of the ground truth architecture, such as the stimulus images. In addition, we show the limitations of using functional similarity scores in identifying higher-level architectural motifs. ","Computational Neuroscience, Neural Networks" TKIL: Tangent Kernel Optimization for Class Balanced Incremental Learning,https://openreview.net/forum?id=nJYAl6pVlc,https://openreview.net/pdf?id=nJYAl6pVlc,Tangent Kernel optimization for class balanced Incremental Learning that addresses the imbalances in memory-based incremental learning.,"When sequentially learning multiple tasks, deep neural networks tend to lose accuracy on tasks learned in the past while gaining accuracy on the current task. This phenomenon is called catastrophic forgetting. Class Incremental Learning (CIL) methods address this problem by keeping a memory of exemplars from previous tasks, which are supposed to assist with the overall accuracy of all tasks. However, existing methods struggle to balance the accuracy across all seen tasks since there is still overfitting to the current task due to data imbalance between the complete training data points for the current task and the limited exemplars in the memory buffer for previous tasks. Here, we propose to avoid the data imbalance by learning a set of generalized non-task-specific parameters. In particular, we propose a novel methodology of Tangent Kernel for Incremental Learning (TKIL) that seeks an equilibrium between current and previous representations. We achieve such equilibrium by computing and optimizing for a new Gradient Tangent Kernel (GTK). Specifically, TKIL tunes task-specific parameters for all tasks with the GTK loss. Therefore, when representing previous tasks, task-specific models are not influenced by the samples of the current task and are able to retain learned representations. As a result, TKIL equally considers the contribution from all task models. The generalized parameters that TKIL obtains allow our method to automatically identify which task is being considered and to adapt to it during inference. Extensive experiments on 5 CIL benchmark datasets with 10 incremental learning settings show that TKIL outperforms existing state-of-the-art methods, e.g., a 9.4% boost on CIFAR100 with 25 incremental stages. 
Furthermore, TKIL attains strong state-of-the-art accuracy on the large-scale dataset, with a much smaller model size (36%) than other approaches.",Incremental Learning Is Forgetting Less a Good Inductive Bias for Forward Transfer?,https://openreview.net/forum?id=dL35lx-mTEs,https://openreview.net/pdf?id=dL35lx-mTEs,Study if forgetting less is a good inductive bias for forward transfer. ,"One of the main motivations for studying continual learning is that the problem setting allows a model to accrue knowledge from past tasks to learn new tasks more efficiently. However, recent studies suggest that the key metric that continual learning algorithms optimize, reduction in catastrophic forgetting, does not correlate well with the forward transfer of knowledge. We believe that the conclusion previous works reached is due to the way they measure forward transfer. We argue that the measure of forward transfer to a task should not be affected by the restricted updates on the task by the continual learner to preserve previous tasks. Instead, forward transfer should be measured by how easy it is to learn a new task given a set of representations produced by continual learning on previous tasks. Under this notion of forward transfer, we evaluate different continual learning algorithms on a variety of image classification benchmarks. Our results indicate that less forgetful representations lead to better forward transfer, suggesting a strong correlation between retaining past information and learning efficiency on new tasks. Further, we found less forgetful representations to be more diverse and discriminative compared to their forgetful counterparts. ","continual learning, transfer learning" High-Precision Regressors for Particle Physics,https://openreview.net/forum?id=Ke2uzCpFcP0,https://openreview.net/pdf?id=Ke2uzCpFcP0,We design and build high-precision regressors that speed up Monte Carlo simulations in particle physics by a thousand to a million times,"Monte Carlo simulations of physics processes at particle colliders like the Large Hadron Collider at CERN take up a major fraction of the computational budget. For some simulations, a single data point takes seconds, minutes, or even hours to compute from first principles. Since the necessary number of data points per simulation is on the order of $10^9$ -- $10^{12}$, machine learning regressors can be used in place of physics simulators to significantly reduce this computational burden. However, this task requires high-precision regressors that can deliver data with relative errors of less than 1\% or even 0.1\% over the entire domain of the function. In this paper, we develop optimal training strategies and tune various machine learning regressors to satisfy the high-precision requirement. We leverage symmetry arguments from particle physics to optimize the performance of the regressors. Inspired by ResNets, we design a Deep Neural Network with skip connections that outperforms fully connected Deep Neural Networks. We find that at lower dimensions, boosted decision trees far outperform neural networks while at higher dimensions neural networks perform better. We show that these regressors can speed up simulations by a factor of $10^3$ -- $10^6$ over the first-principles computations currently used in Monte Carlo simulations. Additionally, using symmetry arguments derived from particle physics, we reduce the number of regressors necessary for each simulation by an order of magnitude. 
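A minimal sketch of the ResNet-inspired regressor mentioned above is given below; the layer widths, depth, and activation are our assumptions for illustration, not the paper's exact configuration:

```python
import torch.nn as nn

class SkipBlock(nn.Module):
    """One hidden block with an identity skip connection, the ResNet-inspired
    ingredient credited with outperforming plain fully connected networks."""
    def __init__(self, width):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(width, width), nn.SiLU(),
                                 nn.Linear(width, width))

    def forward(self, x):
        return x + self.net(x)

class HighPrecisionRegressor(nn.Module):
    def __init__(self, in_dim, width=256, depth=6):
        super().__init__()
        self.inp = nn.Linear(in_dim, width)
        self.blocks = nn.Sequential(*[SkipBlock(width) for _ in range(depth)])
        self.out = nn.Linear(width, 1)   # scalar regression target

    def forward(self, x):
        return self.out(self.blocks(self.inp(x)))
```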
Our work can significantly reduce the training and storage burden of Monte Carlo simulations at current and future collider experiments.","Boosted Decision Trees, Skip Connections, Deep Neural Networks, Particle Physics, Monte Carlo Simulation, Symmetry" Learning Structured Representations by Embedding Class Hierarchy,https://openreview.net/forum?id=7J-30ilaUZM,https://openreview.net/pdf?id=7J-30ilaUZM,We propose to learn structured representations that preserve the hierarchy between label classes by using CPCC as a regularizer.,"Existing models for learning representations in supervised classification problems are permutation invariant with respect to class labels. However, structured knowledge about the classes, such as hierarchical label structures, widely exists in many real-world datasets, e.g., the ImageNet and CIFAR benchmarks. How to learn representations that can preserve such structures among the classes remains an open problem. To approach this problem, given a tree of class hierarchy, we first define a tree metric between any pair of nodes in the tree to be the length of the shortest path connecting them. We then provide a method to learn the hierarchical relationship of class labels by approximately embedding the tree metric in the Euclidean space of features. More concretely, during supervised training, we propose to use the Cophenetic Correlation Coefficient (CPCC) as a regularizer for the cross-entropy loss to correlate the tree metric of classes and the Euclidean distance in the class-conditioned representations. Our proposed regularizer is computationally lightweight and easy to implement. Empirically, we demonstrate that this approach can help to learn more interpretable representations due to the preservation of the tree metric, and leads to better in-distribution generalization as well as under sub-population shifts over six real-world datasets.","structured representations, representation learning, tree embedding" Promptagator: Few-shot Dense Retrieval From 8 Examples,https://openreview.net/forum?id=gmL46YMpu2J,https://openreview.net/pdf?id=gmL46YMpu2J,,"Much recent research on information retrieval has focused on how to transfer from one task (typically with abundant supervised data) to various other retrieval tasks where supervision is limited, with the implicit assumption that it is possible to generalize from one task to all the rest. However, this overlooks the fact that there are many diverse and unique retrieval problems, each targeting different search intents, queries, and search domains. In this paper, we propose to work on Few-shot Dense Retrieval, a setting where each task comes with a short description and a few examples. To address this, we introduce Prompt-based Query Generation for Retrieval (Promptagator): for each task, we feed the few-shot examples to a large language model (LLM) and prompt it to behave as a task-specific query generator. Using this, we can synthetically generate a large number of relevant queries for any document, yielding abundant data for training task-specific retrievers --- with no reliance on traditional resources such as Natural Questions (Kwiatkowski et al., 2019) or MS MARCO (Nguyen et al., 2016). Surprisingly, Promptagator with only 8 annotated examples enables efficient dual encoder retrievers to outperform computationally more expensive models trained on MS MARCO such as ColBERT v2 (Santhanam et al., 2022) by more than 1.2 points nDCG@10 on average on 11 retrieval sets. 
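The few-shot prompting setup is easy to picture; below is a hedged sketch of assembling such a prompt from (document, query) pairs. The prompt wording is illustrative only, not the paper's exact template:

```python
def build_query_generation_prompt(examples, document):
    # examples: a handful (typically ~8) of (document, query) pairs for the task.
    # The LLM is prompted to continue the pattern, acting as a task-specific
    # query generator for the new document.
    lines = []
    for doc, query in examples:
        lines.append(f"Document: {doc}\nQuery: {query}\n")
    lines.append(f"Document: {document}\nQuery:")
    return "\n".join(lines)
```

Sampling many completions per document then yields the synthetic (query, document) training pairs for the retriever.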
Further training standard-size re-rankers using the same generated data yields another 5.0 points nDCG@10 improvement. Our studies show that synthetic query generation can be far more effective than previously observed, especially when a small amount of task-specific knowledge is given.","large language model, few-shot prompting, information retrieval" Balance is Essence: Accelerating Sparse Training via Adaptive Gradient Correction,https://openreview.net/forum?id=Jbfd7BpQaa-,https://openreview.net/pdf?id=Jbfd7BpQaa-,,"Despite impressive performance on a wide variety of tasks, deep neural networks incur significant memory and computation costs, which prohibits their application in resource-constrained scenarios. Sparse training is one of the most common techniques to reduce these costs; however, the sparsity constraints add difficulty to the optimization, resulting in an increase in training time and instability. In this work, we aim to overcome this problem and achieve space-time co-efficiency. To accelerate and stabilize the convergence of sparse training, we analyze the gradient changes and develop an adaptive gradient correction method. Specifically, we approximate the correlation between the current and previous gradients, which is used to balance the two gradients to obtain a corrected gradient. Our method can be used with most popular sparse training pipelines under both standard and adversarial setups. Theoretically, we prove that our method can accelerate the convergence rate of sparse training. Extensive experiments on multiple datasets, model architectures, and sparsities demonstrate that our method outperforms leading sparse training methods by up to \textbf{5.0\%} in accuracy given the same number of training epochs, and reduces the number of training epochs by up to \textbf{52.1\%} to achieve the same accuracy.", Brain-like representational straightening of natural movies in robust feedforward neural networks,https://openreview.net/forum?id=mCmerkTCG2S,https://openreview.net/pdf?id=mCmerkTCG2S,Brain-like temporal straightening of natural movies emerges in robust neural networks trained on static images ,"Representational straightening refers to a decrease in curvature of visual feature representations of a sequence of frames taken from natural movies. Prior work established straightening in neural representations of the primate primary visual cortex (V1) and perceptual straightening in human behavior as a hallmark of biological vision, in contrast to artificial feedforward neural networks, which did not demonstrate this phenomenon as they were not explicitly optimized to produce temporally predictable movie representations. Here, we show that robustness to noise can produce representational straightening in feedforward neural networks. Both adversarial training (AT) and base classifiers for Random Smoothing (RS) induced remarkably straightened feature codes. Demonstrating their utility within the domain of natural movies, these codes could be inverted to generate intervening movie frames by linear interpolation in the feature space even though they were not trained on these trajectories. Demonstrating their biological utility, we found that RS was the best network for explaining neurons in primate V1, providing a parsimonious, bio-plausible mechanism (noise in the sensory input stages) for generating representations in the early visual cortex. 
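Curvature, the quantity both straightening abstracts revolve around, is conventionally measured as the average angle between successive difference vectors of a frame-by-frame feature trajectory; a minimal sketch under that standard definition:

```python
import torch
import torch.nn.functional as F

def mean_curvature(traj):
    # traj: [n_frames, dim] feature vectors for an ordered movie sequence.
    d = F.normalize(traj[1:] - traj[:-1], dim=-1)     # unit difference vectors
    cos = (d[1:] * d[:-1]).sum(-1).clamp(-1.0, 1.0)
    return torch.acos(cos).mean()                     # 0 for a perfectly straight path
```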
Finally, we compared the geometric properties of frame representations in these networks to better understand how they produced brain-like representations. Overall, this work elucidating emergent properties of robust neural networks demonstrates that it is not necessary to utilize predictive objectives or train directly on natural movie statistics to achieve models supporting more brain-like straightened movie representations that predict neural behavior. ","Perceptual straightening, invertibility, robustness, visual cortex" FunkNN: Neural Interpolation for Functional Generation,https://openreview.net/forum?id=BT4N_v7CLrk,https://openreview.net/pdf?id=BT4N_v7CLrk,,"Can we build continuous generative models which generalize across scales, can be evaluated at any coordinate, admit calculation of exact derivatives, and are conceptually simple? Existing MLP-based architectures generate worse samples than the grid-based generators with favorable convolutional inductive biases. Models that focus on generating images at different scales do better, but employ complex architectures not designed for continuous evaluation of images and derivatives. We take a signal-processing perspective and treat continuous signal generation as interpolation from samples. Indeed, correctly sampled discrete images contain all information about the low spatial frequencies. The question is then how to extrapolate the spectrum in a data-driven way while meeting the above design criteria. Our answer is FunkNN---a novel convolutional network which learns how to reconstruct continuous images at arbitrary coordinates and can be applied to any image dataset. Combined with a discrete generative model, it becomes a functional generator which can act as a prior in continuous ill-posed inverse problems. We show that FunkNN generates high-quality continuous images and exhibits strong out-of-distribution performance thanks to its patch-based design. We further showcase its performance in several stylized inverse problems with exact spatial derivatives.", A Framework for Comprehensive Evaluations of Graph Neural Network based Community Detection using Node Clustering,https://openreview.net/forum?id=qDgEEVS3TYM,https://openreview.net/pdf?id=qDgEEVS3TYM,,"Graph Neural Networks (GNNs) have shown promising performance across a number of tasks in recent years. Unsupervised community detection using GNNs involves the clustering of nodes of a graph given both the features of nodes as well as the structure of the graph, and has many applications to real-world tasks from social networks to genomics. Unfortunately, there has been relatively little research using GNNs for community detection, and even less that evaluates the systems rigorously and fairly. A comprehensive evaluation of the performance of GNNs requires a suitable environment within which they are evaluated. This is exacerbated by the fact that community detection is primarily an unsupervised task, and that the (graph) neural networks used contain many hyperparameters, discovered by inconsistent procedures. We argue that the literature currently lacks a sufficient benchmarking environment for the consistent evaluation of GNN-based community detection, thereby impeding progress in this nascent field. In this work we propose and evaluate an environment for the consistent evaluation of neural community detection. 
With this, we show the strong dependence of performance on the experimental settings, thereby motivating the use of this framework to facilitate research into GNN-based community detection. ","graph neural networks, clustering, community detection" TEXTCRAFT: ZERO-SHOT GENERATION OF HIGH FIDELITY AND DIVERSE SHAPES FROM TEXT,https://openreview.net/forum?id=SbEvg8qlasl,https://openreview.net/pdf?id=SbEvg8qlasl,,"Language is one of the primary means by which we describe the 3D world around us. While rapid progress has been made in text-to-2D-image synthesis, similar progress in text-to-3D-shape synthesis has been hindered by the lack of paired (text, shape) data. Moreover, extant methods for text-to-shape generation have limited shape diversity and fidelity. We introduce TextCraft, a method to address these limitations by producing high-fidelity and diverse 3D shapes without the need for (text, shape) pairs for training. TextCraft achieves this by using CLIP and a multi-resolution approach: first generating in a low-dimensional latent space and then upscaling to a higher resolution, improving the fidelity of the generated shape. To improve shape diversity, we use a discrete latent space which is modelled using a bidirectional transformer conditioned on the interchangeable image-text embedding space induced by CLIP. Moreover, we present a novel variant of classifier-free guidance, which further improves the accuracy-diversity trade-off. Finally, we perform extensive experiments that demonstrate that TextCraft outperforms state-of-the-art baselines.","Text to shape generation, 3D shape generation, Zero-Shot Method, CLIP, Vision-Text models" CrystalBox: Efficient Model-Agnostic Explanations for Deep RL Controllers,https://openreview.net/forum?id=K1CNgCJtLLr,https://openreview.net/pdf?id=K1CNgCJtLLr,,"Practical adoption of Reinforcement Learning (RL) controllers is hindered by a lack of explainability. Particularly, in input-driven environments such as computer systems where the state dynamics are affected by external processes, explainability can serve as a key enabler of increased real-world deployment of RL controllers. In this work, we propose a novel framework, CrystalBox, for generating black-box post-hoc explanations for RL controllers in input-driven environments. CrystalBox is built on the principle of separation between policy learning and explanation computation. As the explanations are generated completely outside the training loop, CrystalBox is generalizable to a large family of input-driven RL controllers. To generate explanations, CrystalBox combines the natural decomposability of reward functions in systems environments with the explanatory power of decomposed returns. CrystalBox predicts these decomposed future returns using on-policy Q-function approximations. Our design leverages two complementary approaches for this computation: sampling- and learning-based methods. We evaluate CrystalBox with RL controllers in real-world settings and demonstrate that it generates high-fidelity explanations. ","explainability, reinforcement learning" Label Propagation with Weak Supervision,https://openreview.net/forum?id=aCuFa-RRqtI,https://openreview.net/pdf?id=aCuFa-RRqtI,Theoretical analysis of label propagation with prior information and connection to weak supervision,"Semi-supervised learning and weakly supervised learning are important paradigms that aim to reduce the growing demand for labeled data in current machine learning applications. 
In this paper, we introduce a novel analysis of the classical label propagation algorithm (LPA) (Zhu & Ghahramani, 2002) that moreover takes advantage of useful prior information, specifically probabilistic hypothesized labels on the unlabeled data. We provide an error bound that exploits both the local geometric properties of the underlying graph and the quality of the prior information. We also propose a framework to incorporate multiple sources of noisy information. In particular, we consider the setting of weak supervision, where our sources of information are weak labelers. We demonstrate the ability of our approach on multiple benchmark weakly supervised classification tasks, showing improvements upon existing semi-supervised and weakly supervised methods. ","Semi-supervised learning, Weakly supervised learning, Label propagation" TypeT5: Seq2seq Type Inference using Static Analysis,https://openreview.net/forum?id=4TyNEhI2GdN,https://openreview.net/pdf?id=4TyNEhI2GdN,Combining the strengths of CodeT5 and static analysis to predict Python type annotations.,"There has been growing interest in automatically predicting missing type annotations in programs written in Python and JavaScript. Prior methods have achieved impressive accuracy when predicting the most common types, but they often perform poorly on rare or complex types (or do not support them at all). In this paper, we present a new type inference method that treats type prediction as a code completion task by leveraging CodeT5, a state-of-the-art seq2seq pre-trained language model for code. Our method uses static analysis to construct context for each code element whose type is to be predicted by the model. We also propose a sequential decoding scheme that incorporates previous type predictions in the model's input context, allowing information exchange between related code elements. Our evaluation shows that our proposed approach, TypeT5, not only achieves a higher overall accuracy (particularly on rare and complex types) but also produces more coherent results with fewer type errors, while enabling easy user intervention.","Type inference, Code completion, Static analysis, Transformers, Pre-training" Approximating any Function via Coreset for Radial Basis Functions: Towards Provable Data Subset Selection For Efficient Neural Networks training,https://openreview.net/forum?id=bth6XbnDmib,https://openreview.net/pdf?id=bth6XbnDmib,,"Radial basis function neural networks (\emph{RBFNN}) are well known for their capability to approximate any continuous function on a closed bounded set with arbitrary precision given enough hidden neurons. A coreset is a small weighted subset of an input set of items that provably approximates their loss function for a given set of queries (models, classifiers, etc.). In this paper, we suggest the first coreset construction algorithm for \emph{RBFNNs}, i.e., a small weighted subset which approximates the loss of the input data on any radial basis function network and thus approximates any function defined by an \emph{RBFNN} on the large input data. This is done by constructing coresets for radial basis and Laplacian loss functions. We use our coreset to suggest a provable data subset selection algorithm for training deep neural networks: since our coreset approximates every function, it should approximate the gradient of each weight in a neural network, as each gradient is defined as a function on the input. 
Experimental results on function approximation and dataset subset selection on popular network architectures and data sets are presented, demonstrating the efficacy and accuracy of our coreset construction.","Data subset selection, Coresets, Radial basis functions neural networks, deep learning" Axiomatic Explainer Locality With Optimal Transport,https://openreview.net/forum?id=nQBQByfLeSC,https://openreview.net/pdf?id=nQBQByfLeSC,,"Explainability methods have been notoriously difficult to evaluate and compare. Because of this, practitioners are often left guessing as to which explainer they should use for their task. Locality is one critical property of explainers which grants insight into the diversity of produced explanations. In this paper, we define a set of axioms which align with natural intuition regarding globalness, the inverse of locality. We then introduce a novel measure of globalness, Wasserstein Globalness, which uses optimal transport to quantify how local or global a given explainer is. Finally, we provide theoretical results describing the sample complexity of Wasserstein Globalness, and experimentally demonstrate how globalness can be used to effectively compare explainers. These results illustrate connections to both explainer fidelity and explainer robustness.","Interpretability, Explainability, Optimal Transport, Wasserstein" Fine-Tuning Offline Policies With Optimistic Action Selection,https://openreview.net/forum?id=2x8EKbGU51k,https://openreview.net/pdf?id=2x8EKbGU51k,,"Offline reinforcement learning algorithms can train performant policies for hard tasks using previously-collected datasets. However, the quality of the offline dataset often limits the levels of performance possible. We consider the problem of improving offline policies through online fine-tuning. Offline RL requires a pessimistic training objective to mitigate distributional shift between the trained policy and the offline behavior policy, which will make the trained policy averse to picking novel actions. In contrast, online RL requires exploration, or optimism. Thus, fine-tuning policies online with the offline training objective is not ideal. Additionally, loosening the fine-tuning objective to allow for more exploration can potentially destroy the behaviors learned in the offline phase because of the sudden and significant change in the optimization objective. To mitigate this challenge, we propose a method to facilitate exploration during online fine-tuning that maintains the same training objective throughout both offline and online phases, while encouraging exploration. We accomplish this by changing the action-selection method to be more optimistic with respect to the Q-function. By choosing to take actions in the environment with higher expected Q-values, our method is able to explore and improve behaviors more efficiently, obtaining 56% more returns on average than the alternative approaches on several locomotion, navigation, and manipulation tasks.", Improving the Strength of Human-Like Models in Chess,https://openreview.net/forum?id=fJY2iCssvIs,https://openreview.net/pdf?id=fJY2iCssvIs,"We efficiently train Human-like AI models to play chess at a stronger level, while retaining their human-like style, by extending the concept of curriculum learning to support multiple teachers.","Designing AI systems that capture human-like behavior has attracted growing attention in applications where humans may want to learn from, or need to collaborate with, these AI systems. 
Many existing works in designing human-like AI have taken a supervised learning approach that learns from data of human behavior, with the goal of creating models that can accurately predict human behavior. While this approach has shown success in capturing human behavior at different skill levels and even identifying individual behavioral styles, it also suffers from the drawback of mimicking human mistakes. Moreover, existing models only capture a snapshot of human behavior, leaving the question of how to improve them---e.g., from one human skill level to a stronger one---largely unanswered. Using chess as an experimental domain, we investigate the question of teaching an existing human-like model to be stronger using a data-efficient curriculum, while maintaining the model's human similarity. To achieve this goal, we extend the concept of curriculum learning to settings with multiple labeling strategies, allowing us to vary both the curriculum (dataset) and the teacher (labeling strategy). We find that the choice of teacher has a strong impact on both playing strength and human similarity; for example, a teacher that is too strong can be less effective at improving playing strength and degrade human similarity more rapidly. We also find that the choice of curriculum can impact these metrics, but to a smaller extent; for example, training on a curriculum of human mistakes provides only a marginal benefit over training on a random curriculum. Finally, we show that our strengthened models achieve human similarity on datasets corresponding to their strengthened level of play, suggesting that our curriculum training methodology is improving them in human-like steps.","Human-like AI, Curriculum Learning" Test-Time Training on Video Streams,https://openreview.net/forum?id=orbnZE-4UvD,https://openreview.net/pdf?id=orbnZE-4UvD,,"We investigate visual generalization on video streams instead of independent images, since the former are closer to the smoothly changing environments where natural agents operate. Traditionally, single-image models are tested on videos as collections of unordered frames. We instead test on each video in temporal order, making a prediction on the current frame before the next arrives, after training at test time on frames from the recent past. To perform test-time training without ground-truth labels, we leverage recent advances in masked autoencoders for self-supervision. We improve performance on various real-world applications. We also discover that forgetting can be beneficial for test-time training, in contrast to the common belief in the continual learning community that it is harmful.", AGRO: Adversarial discovery of error-prone Groups for Robust Optimization,https://openreview.net/forum?id=IrzkT99fDJH,https://openreview.net/pdf?id=IrzkT99fDJH,"AGRO is an end-to-end robust optimization technique that discovers error-prone groups and optimizes for their accuracy, resulting in improved robustness to test-time distributional shifts.","Models trained via empirical risk minimization (ERM) are known to rely on spurious correlations between labels and task-independent input features, resulting in poor generalization to distributional shifts. Group distributionally robust optimization (G-DRO) can alleviate this problem by minimizing the worst-case loss over a set of pre-defined groups over training data. G-DRO successfully improves the performance of the worst group, where the correlation does not hold. 
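For reference, the G-DRO objective that AGRO builds on fits in a few lines. This is a minimal sketch of the worst-group loss on one batch, assuming per-example losses and pre-defined group ids (the very ids AGRO argues are usually unavailable):

import numpy as np

def worst_group_loss(losses, groups):
    # G-DRO minimizes the mean loss of the worst-off group in each batch.
    ids = np.unique(groups)
    means = np.array([losses[groups == g].mean() for g in ids])
    return means.max(), ids[means.argmax()]

losses = np.array([0.2, 0.3, 1.5, 1.2, 0.1])   # per-example losses
groups = np.array([0, 0, 1, 1, 0])             # pre-defined group ids
print(worst_group_loss(losses, groups))        # (1.35, 1)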
However, G-DRO assumes that the spurious correlations and associated worst groups are known in advance, making it challenging to apply it to new tasks with potentially multiple unknown correlations. We propose AGRO---Adversarial Group discovery for Distributionally Robust Optimization---an end-to-end approach that jointly identifies error-prone groups and improves accuracy on them. AGRO equips G-DRO with an adversarial slicing model to find a group assignment for training examples which maximizes worst-case loss over the discovered groups. On the WILDS benchmark, AGRO results in 8\% higher model performance on average on known worst-groups, compared to prior group discovery approaches used with G-DRO. AGRO also improves out-of-distribution performance on SST2, QQP, and MS-COCO---datasets where potential spurious correlations are as yet uncharacterized. Human evaluation of AGRO groups shows that they contain well-defined, yet previously unstudied spurious correlations that lead to model errors.","robust optimization, distributionally robust, slice discovery, error analysis, adversarial learning" Learning Multiobjective Program Through Online Learning,https://openreview.net/forum?id=d8tJcOxnzF9,https://openreview.net/pdf?id=d8tJcOxnzF9,,"We investigate the problem of learning the parameters (i.e., objective functions or constraints) of a multiobjective decision making model, based on a set of sequentially arriving decisions. In particular, these decisions might not be exact: they may carry measurement noise or be generated with the bounded rationality of decision makers. In this paper, we propose a general online learning framework to deal with this learning problem using inverse multiobjective optimization, and prove that this framework converges at a rate of $\mathcal{O}(1/\sqrt{T})$ under certain regularity conditions. More precisely, we develop two online learning algorithms with implicit update rules which can handle noisy data. Numerical results with both synthetic and real world datasets show that both algorithms can learn the parameters of a multiobjective program with great accuracy and are robust to noise.","Learning Multiobjective Program, Multiobjective Optimization" Dichotomy of Control: Separating What You Can Control from What You Cannot,https://openreview.net/forum?id=DEGjDDV22pI,https://openreview.net/pdf?id=DEGjDDV22pI,We propose dichotomy of control (DoC) for supervised learning in stochastic environments by separating things within a policy's control (actions) from those outside of a policy’s control (env stochasticity) through a mutual information constraint.,"Future- or return-conditioned supervised learning is an emerging paradigm for offline reinforcement learning (RL), in which the future outcome (i.e., return) associated with a sequence of actions in an offline dataset is used as input to a policy trained to imitate those same actions. While return-conditioning is at the heart of popular algorithms such as decision transformer (DT), these methods tend to perform poorly in highly stochastic environments, where an occasional high return associated with a sequence of actions may be due more to the randomness of the environment than to the actions themselves. Such situations can lead to a learned policy that is inconsistent with its conditioning inputs; i.e., using the policy – while conditioned on a specific desired return – to act in the environment can lead to a distribution of real returns that is wildly different than desired. 
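The inconsistency is easy to reproduce numerically. The following toy two-armed bandit is our construction, not the paper's benchmark setup; it shows a return-conditioned imitator asking for a return of 10 and delivering 0.5 in expectation:

import numpy as np

rng = np.random.default_rng(0)

# Arm 0 always pays 1.0; arm 1 pays 10.0 with probability 0.05 (mean 0.5).
actions = rng.integers(0, 2, size=10_000)
returns = np.where(actions == 0, 1.0, (rng.random(10_000) < 0.05) * 10.0)

# Decision-transformer-style acting: condition on the best return in the
# data and copy the action that achieved it.
target = returns.max()                    # 10.0, pure luck of arm 1
chosen = actions[returns == target][0]    # arm 1
expected = 0.5 if chosen == 1 else 1.0
print(f"conditioned on {target}, expected real return {expected}")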
In this work, we propose the dichotomy of control (DoC), a future-conditioned supervised learning framework that separates mechanisms within a policy’s control (actions) from those outside of a policy’s control (environment stochasticity). We achieve this by conditioning the policy on a latent variable representation of the future and designing a mutual information constraint that removes any future information from the latent variable that is only due to randomness of the environment. Theoretically, we show that DoC yields policies that are consistent with their conditioning inputs, ensuring that conditioning a learned policy on a desired high-return future outcome will correctly induce high-return behavior. Empirically, we show that DoC is able to achieve significantly better performance than DT on environments with highly stochastic rewards (e.g., Bandit) and transitions (e.g., FrozenLake).","Offline reinforcement learning, return-conditioned supervised learning, stochastic environments, decision transformer" Progressive Knowledge Distillation: Constructing Ensembles for Efficient Inference,https://openreview.net/forum?id=sE-9hkZL5wV,https://openreview.net/pdf?id=sE-9hkZL5wV,,"Knowledge distillation is commonly used to compress an ensemble of models into a single model. In this work we study the problem of progressive distillation: Given a large, pretrained teacher model $g$, we seek to decompose the model into an ensemble of smaller, low-inference-cost student models $f_i$. The resulting ensemble allows for flexibly tuning accuracy vs. inference cost, which can be useful for a multitude of applications in efficient inference. Our method, B-DISTIL, uses a boosting procedure that allows function-composition-based aggregation rules to construct expressive ensembles with performance similar to that of $g$ using much smaller student models. We demonstrate the effectiveness of B-DISTIL by decomposing pretrained models across a variety of image, speech, and sensor datasets. Our method comes with strong theoretical guarantees in terms of convergence as well as generalization.","progressive distillation, anytime inference, efficient inference, resource constrained ML, devices" Efficient Approximations of Complete Interatomic Potentials for Crystal Property Prediction,https://openreview.net/forum?id=KaKXygtEGK,https://openreview.net/pdf?id=KaKXygtEGK,We propose to directly model complete interactions for crystals with potential summations,"We study the problem of crystal material property prediction. A crystal structure consists of a minimal unit cell that is repeated infinitely in 3D space. How to accurately represent such repetitive structures in machine learning models remains unresolved. Current methods construct graphs by establishing edges only between nearby nodes, thereby failing to faithfully capture infinite repeating patterns and distant interatomic interactions. In this work, we propose several innovations to overcome these limitations. First, we propose to model physics-principled interatomic potentials directly instead of only using distances as in existing methods. These potentials include the Coulomb potential, London dispersion potential, and Pauli repulsion potential. Second, we propose to model the complete set of potentials among all atoms, instead of only between nearby atoms as in prior methods. This is enabled by our approximations of infinite potential summations with provable error bounds. We further develop efficient algorithms to compute the approximations. 
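To see why a truncated infinite potential summation can be both cheap and accurate, consider this sketch for the absolutely convergent London-dispersion term (1/r^6) over the periodic images of a cubic cell. The paper's contribution is provable error bounds and handling the harder Coulomb case; simple truncation like this does not cover those.

import itertools
import numpy as np

def dispersion_sum(delta, cell, n_images):
    # Truncated sum of 1/r^6 over lattice translations of a cubic cell.
    total = 0.0
    shells = range(-n_images, n_images + 1)
    for i, j, k in itertools.product(shells, shells, shells):
        r = np.linalg.norm(delta + cell * np.array([i, j, k]))
        total += 1.0 / r**6
    return total

delta = np.array([0.3, 0.1, 0.2])  # displacement between two atoms (toy units)
for n in (1, 2, 4, 8):
    print(n, dispersion_sum(delta, cell=1.0, n_images=n))
# Partial sums stabilize quickly, so a finite cutoff already captures
# the complete interaction to high accuracy.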
Finally, we propose to incorporate our computations of complete interatomic potentials into message passing neural networks for representation learning. We perform experiments on the JARVIS and Materials Project benchmarks for evaluation. Results show that the use of complete interatomic potentials leads to consistent performance improvements with reasonable computational costs.","graph neural network, material property prediction, crystal property prediction, crystal structure modeling, interatomic potential" LogicDP: Creating Labels for Graph Data via Inductive Logic Programming,https://openreview.net/forum?id=2b2s9vd7wYv,https://openreview.net/pdf?id=2b2s9vd7wYv,A data programming framework for generating training labels for graph data,"Graph data, such as scene graphs and knowledge graphs, see wide use in AI systems. In real-world, large-scale applications, graph data are usually incomplete, motivating graph reasoning models for missing-fact or missing-relationship inference. While these models can achieve state-of-the-art performance, they require a large amount of training data. Recent years have witnessed the rising interest in label creation with data programming (DP) methods, which aim to generate training labels from heuristic labeling functions. However, existing methods typically focus on unstructured data and are not optimized for graphs. In this work, we propose LogicDP, a data programming framework for graph data. Unlike existing DP methods, (1) LogicDP utilizes the inductive logic programming (ILP) technique and automatically discovers the labeling functions from the graph data; (2) LogicDP employs a budget-aware framework to iteratively refine the functions by querying an oracle, which significantly reduces human effort in function creation. Experiments show that LogicDP achieves better data efficiency in both scene graph and knowledge graph reasoning tasks.","Data Programming, Graph Reasoning, Inductive Logic Programming" Simulating Environments for Evaluating Scarce Resource Allocation Policies,https://openreview.net/forum?id=vSsnEd0Jmou,https://openreview.net/pdf?id=vSsnEd0Jmou,We provide a principled method for evaluating organ-allocation policies before deployment.,"Consider the sequential decision problem of allocating a limited supply of resources to a pool of potential recipients: This scarce resource allocation problem arises in a variety of settings characterized by ""hard-to-make"" tradeoffs – such as assigning organs to transplant patients, or rationing ventilators in overstretched ICUs. Assisting human judgement in these choices are dynamic allocation policies that prescribe how to match available assets to an evolving pool of beneficiaries – such as clinical guidelines that stipulate selection criteria on the basis of recipient and organ attributes. However, while such policies have received increasing attention in recent years, a key challenge lies in pre-deployment evaluation: How might allocation policies behave in the real world? In particular, in addition to conventional backtesting, it is crucial that policies be evaluated on a variety of possible scenarios and sensitivities – such as distributions of recipients and organs that may diverge from historic patterns. In this work, we present AllSim, an open-source framework for performing data-driven simulation of scarce resource allocation policies for pre-deployment evaluation. Simulation environments are modular (i.e. parameterized componentwise), learnable (i.e. 
on historical data), and customizable (i.e. to unseen conditions), and, upon interaction with a policy, output a dataset of simulated outcomes for analysis and benchmarking. Compared to existing work, we believe this approach takes a step towards more methodical evaluation of scarce resource allocation policies.","Simulation, Evaluation, Validation" Domain Transfer with Large Dynamics Shift in Offline Reinforcement Learning,https://openreview.net/forum?id=2SXIFDczAJG,https://openreview.net/pdf?id=2SXIFDczAJG,,"The domain transfer problem with large dynamics shift commonly exists when using offline reinforcement learning (RL) in real-world applications, where the source dataset collected from one domain needs to be reused to accelerate training the target domain agent with offline RL. The large dynamics shift issue arises when there are unpredictable changes in the target domain’s environment. Existing works typically assume that each state-action pair in the target domain should be at least covered in the source domain, which is often unrealistic and limited to small dynamics shift transfers. To tackle the large dynamics shift problem, we propose to use the source domain data not only for offline policy training but also for safe and efficient data collection in the target domain, thus relaxing the above requirement. Specifically, the source data play two roles: one is to serve as augmentation data, compensating for the difference in dynamics with a modified reward; the other is to form prior knowledge for the behaviour policy to collect a small amount of new data in the target domain safely and efficiently. The target domain policy is trained using offline RL with the source data and modest amounts of newly collected target data. We justify our method in gridworld and autonomous driving environments. Results show that our method requires less target-domain data and collects the data more safely than prior methods.", Learning to reason with relational abstractions,https://openreview.net/forum?id=3BOwNcqM_Wq,https://openreview.net/pdf?id=3BOwNcqM_Wq,Sequences with abstract relations can help models solve mathematical reasoning tasks with a significantly higher accuracy compared to those that are trained with human-generated sequences and other baselines.,"Large language models have recently shown promising progress in mathematical reasoning when fine-tuned with human-generated sequences walking through a sequence of solution steps. However, the solution sequences are not formally structured and the resulting model-generated sequences may not reflect the kind of systematic reasoning we might expect an expert human to produce. In this paper, we study how to build stronger reasoning capability in language models using the idea of relational abstractions. We introduce new types of sequences that more explicitly provide an abstract characterization of the transitions through intermediate solution steps to the goal state. We found that models that are supplied with such sequences as prompts can solve tasks with a significantly higher accuracy, and models that are trained to produce such sequences solve problems better than those that are trained with previously used human-generated sequences and other baselines. 
Our work thus takes several steps toward elucidating and improving how language models perform on tasks requiring multi-step mathematical reasoning.","mathematical reasoning, language models, relational abstraction" A Simple Approach for State-Action Abstraction using a Learned MDP Homomorphism,https://openreview.net/forum?id=rtTHBnBx4vO,https://openreview.net/pdf?id=rtTHBnBx4vO,MDP Homomorphism with a forwards-backwards model,"Animals are able to rapidly infer from limited experience when sets of state-action pairs have equivalent reward and transition dynamics. On the other hand, modern reinforcement learning systems must painstakingly learn through trial and error that sets of state-action pairs are value equivalent---requiring an often prohibitively large amount of samples from their environment. MDP homomorphisms have been proposed that reduce the observed MDP of an environment to an abstract MDP, which can enable more sample efficient policy learning. Consequently, impressive improvements in sample efficiency have been achieved when a suitable MDP homomorphism can be constructed a priori---usually by exploiting a practitioner's knowledge of environment symmetries. We propose a novel approach to constructing a homomorphism in discrete action spaces, which uses a partial model of environment dynamics to infer which state-action pairs lead to the same state---reducing the size of the state-action space by a factor equal to the cardinality of the action space. We call this method equivalent effect abstraction. We demonstrate empirically that equivalent effect abstraction can improve sample efficiency in a model-free setting and planning efficiency for model-based approaches.", RankMe: Assessing the Downstream Performance of Pretrained Self-Supervised Representations by Their Rank,https://openreview.net/forum?id=uGEBxC8dnEh,https://openreview.net/pdf?id=uGEBxC8dnEh,"We show how the rank of representations can be used to measure their downstream performance, even on unseen datasets. We validate this simple metric with thorough experiments and show its power for hyperparameter selection.","Joint-Embedding Self-Supervised Learning (JE-SSL) has seen rapid development, with the emergence of many method variations but few principled guidelines to help practitioners successfully deploy those methods. The main reason for this pitfall comes from JE-SSL's core principle of not employing any input reconstruction. Without any visual clue, it becomes extremely difficult to judge the quality of a learned representation without access to a labelled dataset. We hope to correct these limitations by providing a single --theoretically motivated-- criterion that reflects the quality of learned JE-SSL representations: their effective rank. Albeit simple and computationally friendly, this method ---coined {\em RankMe}--- allows one to assess the performance of JE-SSL representations, even on different downstream datasets, without requiring any labels, training or parameters to tune. Through thorough empirical experiments involving hundreds of repeated training episodes, we demonstrate how RankMe can be used for hyperparameter selection with nearly no loss in final performance compared to current selection methods that involve dataset labels. 
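A minimal reading of the RankMe criterion, assuming the standard effective-rank definition (the exponential of the entropy of the normalized singular values); the exact epsilon handling here is our guess:

import numpy as np

def rankme(Z, eps=1e-7):
    # Effective rank: exp of the entropy of the normalized singular values.
    s = np.linalg.svd(Z, compute_uv=False)
    p = s / s.sum() + eps
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
spread = rng.normal(size=(1000, 64))                              # well-spread features
collapsed = np.outer(rng.normal(size=1000), rng.normal(size=64))  # rank-1 collapse
print(rankme(spread), rankme(collapsed))  # close to 64 vs. close to 1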
We hope that RankMe will facilitate the use of JE-SSL in domains with little or no labeled data.","self-supervised learning, evaluation, dimensional collapse, hyperparameter selection" Less is More: Task-aware Layer-wise Distillation for Language Model Compression,https://openreview.net/forum?id=-Ov808Vm7dw,https://openreview.net/pdf?id=-Ov808Vm7dw,We propose a task-aware layer-wise knowledge distillation method for language model compression.,"Layer-wise distillation is a powerful tool to compress large models (i.e., teacher models) into small ones (i.e., student models). The student distills knowledge from the teacher by mimicking the hidden representations of the teacher at every intermediate layer. However, layer-wise distillation is difficult: since the student has a smaller model capacity than the teacher, it is often under-fitted. Furthermore, the hidden representations of the teacher contain redundant information that the student does not necessarily need for the target task's learning. To address these challenges, we propose a novel Task-aware layEr-wise Distillation (TED). TED designs task-aware filters to align the hidden representations of the student and the teacher at each layer. The filters select the knowledge that is useful for the target task from the hidden representations. As such, TED reduces the knowledge gap between the two models and helps the student to fit better on the target task. We evaluate TED in two scenarios: continual pre-training and fine-tuning. TED demonstrates significant and consistent improvements over existing distillation methods in both scenarios. We will make our code publicly available.","Knowledge Distillation, Pre-trained Language Models, Model Compression" Revisiting Curiosity for Exploration in Procedurally Generated Environments,https://openreview.net/forum?id=j3GK3_xZydY,https://openreview.net/pdf?id=j3GK3_xZydY,,"Exploration under sparse rewards remains a key challenge in deep reinforcement learning. Recently, studying exploration in procedurally-generated environments has drawn increasing attention. Existing works generally combine lifelong curiosity and episodic curiosity as the intrinsic reward to encourage exploration. Though various lifelong and episodic curiosities have been proposed, the individual contributions of the two kinds of curiosities to improving exploration have barely been investigated. To bridge this gap, we disentangle these two parts and conduct extensive ablative experiments. We consider lifelong and episodic curiosities used in prior works, and compare the performance of all lifelong-episodic combinations on the commonly used MiniGrid benchmark. Experimental results show that only using episodic curiosity can match or surpass prior state-of-the-art methods. On the other hand, only using lifelong curiosity can hardly make progress in exploration. This demonstrates that episodic curiosity is more crucial than lifelong curiosity in boosting exploration. Moreover, we find through experimental analysis that the learned lifelong curiosity does not accurately reflect the novelty of states, which explains why it does not help much in improving exploration.", Online Learning for Obstacle Avoidance,https://openreview.net/forum?id=4hsI9zyNSfw,https://openreview.net/pdf?id=4hsI9zyNSfw,Regret bounds for online learning obstacle avoidance policies,"We approach the fundamental problem of obstacle avoidance for robotic systems via the lens of online learning. 
In contrast to prior work that either assumes worst-case realization of uncertainty in the environment or a given stochastic model of uncertainty, we propose a method that is efficient to implement and provably grants instance-optimality to perturbations of trajectories generated from an open-loop planner in the sense of minimizing worst-case regret. The resulting policy thus adapts online to realizations of uncertainty and provably compares well with the best obstacle avoidance policy in hindsight from a rich class of policies. The method is validated in simulation on a dynamical system environment and compared to baseline open-loop planning and robust Hamilton-Jacobi reachability techniques.","obstacle avoidance, online optimization, regret minimization" Transformer-based World Models Are Happy With 100k Interactions,https://openreview.net/forum?id=TdBaDGCpjly,https://openreview.net/pdf?id=TdBaDGCpjly,,"Deep neural networks have been successful in many reinforcement learning settings. However, compared to human learners they are overly data-hungry. To build a sample-efficient world model, we apply a transformer to real-world episodes in an autoregressive manner: not only the compact latent states and the taken actions but also the experienced or predicted rewards are fed into the transformer, so that it can attend flexibly to all three modalities at different time steps. The transformer allows our world model to access previous states directly, instead of viewing them through a compressed recurrent state. By utilizing the Transformer-XL architecture, it is able to learn long-term dependencies while staying computationally efficient. Our transformer-based world model (TWM) generates meaningful, new experience, which is used to train a policy that outperforms previous model-free and model-based reinforcement learning algorithms on the Atari 100k benchmark.","Model-based Reinforcement Learning, World Models, Transformers, Atari 100k benchmark" Can Neural Networks Learn Implicit Logic from Physical Reasoning?,https://openreview.net/forum?id=HVoJCRLByVk,https://openreview.net/pdf?id=HVoJCRLByVk,,"Despite the success of neural network models in a range of domains, it remains an open question whether they can learn to represent abstract logical operators such as negation and disjunction. We test the hypothesis that neural networks without inherent inductive biases for logical reasoning can acquire an implicit representation of negation and disjunction. Here, implicit refers to limited, domain-specific forms of these operators, and work in psychology suggests these operators may be a precursor (developmentally and evolutionarily) to the type of abstract, domain-general logic that is characteristic of adult humans. To test neural networks, we adapt a test designed to diagnose the presence of negation and disjunction in animals and pre-verbal children, which requires inferring the location of a hidden object using constraints of the physical environment as well as implicit logic: if a ball is hidden in A or B, and shown not to be in A, can the subject infer that it is in B? Our results show that, despite the neural networks learning otherwise good representations of the objects' physical dynamics and constraints, they are unable to generalize to a task that requires implicit logic. We further show that models are unable to generalize to the test task even when they are trained directly on a logically identical (though visually dissimilar) task. 
However, experiments using transfer learning reveal that the models do recognize structural similarity between tasks which invoke the same logical reasoning pattern, suggesting that some desirable abstractions are learned, even if they are not yet sufficient to pass established tests of logical reasoning.","logic, logical operators, logical reasoning, intuitive physics, physical reasoning, representation learning, developmental psychology, cognitive science" Blockwise self-supervised learning with Barlow Twins,https://openreview.net/forum?id=uXeEBgzILe5,https://openreview.net/pdf?id=uXeEBgzILe5,We extend current self-supervised learning methods to a blockwise training scheme,"Current state-of-the-art deep networks are all powered by backpropagation. In this paper, we explore alternatives to full backpropagation in the form of blockwise learning rules, leveraging the latest developments in self-supervised learning. Notably, we show that a blockwise pretraining procedure consisting of independently training the four main blocks of layers of a ResNet-50 with the Barlow Twins loss function at each block performs almost as well as end-to-end backpropagation on ImageNet: a linear probe trained on top of our blockwise pretrained model obtains a top-1 classification accuracy of 70.48\%, only 1.1\% below the accuracy of an end-to-end pretrained network (71.57\% accuracy). We perform extensive experiments to understand the impact of different components within our method and explore a variety of adaptations of self-supervised learning to the blockwise paradigm, building an exhaustive understanding of the critical avenues for scaling local learning rules to large networks, with implications ranging from hardware design to neuroscience.","Blockwise training, self-supervised learning, local learning" DIGEST: FAST AND COMMUNICATION EFFICIENT DECENTRALIZED LEARNING WITH LOCAL UPDATES,https://openreview.net/forum?id=OYKIo3ySkxA,https://openreview.net/pdf?id=OYKIo3ySkxA,,"Decentralized learning advocates the elimination of centralized parameter servers (aggregation points) for potentially better utilization of underlying resources, delay reduction, and resiliency against parameter server unavailability and catastrophic failures. Gossip-based decentralized algorithms, where each node in a network has its own locally kept model on which it effectuates the learning by talking to its neighbors, have received a lot of attention recently. Despite their potential, Gossip algorithms introduce huge communication costs. In this work, we show that nodes do not need to communicate as frequently as in Gossip for fast convergence; in fact, a sporadic exchange of a digest of a trained model is sufficient. Thus, we design a fast and communication-efficient decentralized learning mechanism, DIGEST, by particularly focusing on stochastic gradient descent (SGD). DIGEST is a decentralized algorithm building on local-SGD algorithms, which are originally designed for communication-efficient centralized learning. 
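A bare-bones sketch of the local-update idea DIGEST builds on: each node takes several communication-free SGD steps and only sporadically exchanges its parameters. Full averaging across nodes stands in here for DIGEST's sparser exchanges over a network topology, so this is an illustration of the principle, not the algorithm itself.

import numpy as np

def local_sgd(Xs, ys, rounds=50, local_steps=10, lr=0.1):
    # Each node runs `local_steps` communication-free SGD steps on its own
    # shard, then all nodes average parameters once (the sporadic exchange).
    w = [np.zeros(Xs[0].shape[1]) for _ in Xs]
    for _ in range(rounds):
        for i in range(len(Xs)):
            for _ in range(local_steps):
                w[i] -= lr * Xs[i].T @ (Xs[i] @ w[i] - ys[i]) / len(ys[i])
        avg = np.mean(w, axis=0)
        w = [avg.copy() for _ in Xs]
    return w[0]

rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
Xs = [rng.normal(size=(100, 5)) for _ in range(4)]  # four nodes, local shards
ys = [X @ w_true for X in Xs]
print(np.round(local_sgd(Xs, ys) - w_true, 4))      # near zero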
We show through analysis and experiments that DIGEST significantly reduces the communication cost without hurting convergence time for both iid and non-iid data.","Decentralized Learning, Distributed Optimization, Communication Efficient Learning, Local SGD, Federated Learning" Learning to Improve Code Efficiency,https://openreview.net/forum?id=935WW9F8ALr,https://openreview.net/pdf?id=935WW9F8ALr,We propose a generative model trained on programming competition submissions to transform programs into faster versions of those programs.,"Improvements in the performance of computing systems, driven by Moore’s Law, have transformed society. As such hardware-driven gains slow down, it becomes even more important for software developers to focus on performance and efficiency during development. While several studies have demonstrated the potential from such improved code efficiency (e.g., 2x better generational improvements compared to hardware), unlocking these gains in practice has been challenging. Reasoning about algorithmic complexity and the interaction of coding patterns on hardware can be challenging for the average programmer, especially when combined with pragmatic constraints around development velocity and multi-person development. This paper seeks to address this problem. We analyze a large competitive programming dataset from the Google Code Jam competition and find that efficient code is indeed rare, with a 2x runtime difference between the median and the 90th percentile of solutions. We propose using machine learning to automatically provide prescriptive feedback in the form of hints, to guide programmers towards writing high-performance code. To automatically learn these hints from the dataset, we propose a novel discrete variational auto-encoder, where each discrete latent variable represents a different learned category of code-edit that increases performance. We show that this method represents the multi-modal space of code efficiency edits better than a sequence-to-sequence baseline and generates a distribution of more efficient solutions.","Machine Learning for Code, Program Synthesis, Program Optimization" Real Data Distributions Prefer Simplicity and So Do Our Models: Why Machine Learning and Model Selection Are Possible,https://openreview.net/forum?id=0nI0G46i6kT,https://openreview.net/pdf?id=0nI0G46i6kT,"We demonstrate that neural networks, trained or randomly initialized, prefer the low-complexity data we observe in practice, and we explain how model selection can be automated.","No free lunch theorems for supervised learning state that no learner can solve all problems or that all learners achieve exactly the same accuracy on average over a uniform distribution on learning problems. Accordingly, these theorems are often referenced in support of the notion that individual problems require specially tailored inductive biases. While all but exponentially few uniformly sampled datasets have high complexity, we argue that neural network models share the same preference for low-complexity data that we observe on real-world problems. Notably, we show that architectures designed for a particular domain, such as computer vision, are compressors for labeling functions on a variety of seemingly unrelated domains. From our experiments, we see that pre-trained and even randomly initialized language models prefer to generate low-complexity sequences and can therefore be used for inference. 
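The "low-complexity" framing can be illustrated with any off-the-shelf compressor standing in for a model; zlib here is our stand-in, not the paper's methodology, which uses the models themselves as compressors:

import random
import zlib

def complexity(s):
    # Compressed size: a crude upper bound on Kolmogorov complexity.
    return len(zlib.compress(s.encode()))

random.seed(0)
structured = "01" * 4096                                   # low complexity
noise = "".join(random.choice("01") for _ in range(8192))  # high complexity
print(complexity(structured), complexity(noise))           # small vs. large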
In principle, human practitioners' expert knowledge and bias for simplicity could be folded into the learning algorithm, automating the design and selection of models. We explain how typical areas requiring human intervention, such as picking the appropriately sized model when labeled data is sparse or plentiful, can be automated into a single learning algorithm. These observations help justify the trend in deep learning of unifying seemingly disparate problems with an increasingly small set of machine learning models.","No Free Lunch, PAC-Bayes, Simplicity Bias, Model Selection, Meta-Learning" Backdoor Attacks in the Supply Chain of Masked Image Modeling,https://openreview.net/forum?id=Pb-SC2gFOO,https://openreview.net/pdf?id=Pb-SC2gFOO,"In this paper, we perform the first security risk quantification of MIM through the lens of backdoor attacks.","Masked image modeling (MIM) revolutionizes self-supervised learning (SSL) for image pre-training. In contrast to previous dominating self-supervised methods, i.e., contrastive learning, MIM attains state-of-the-art performance by masking and reconstructing random patches of the input image. However, the associated security and privacy risks of this novel generative method are unexplored. In this paper, we perform the first security risk quantification of MIM through the lens of backdoor attacks. Different from previous work, we are the first to systematically perform threat modeling on SSL in every phase of the model supply chain, i.e., the pre-training, release, and downstream phases. Our evaluation shows that models built with MIM are vulnerable to existing backdoor attacks in the release and downstream phases and are compromised by our proposed method in the pre-training phase. For instance, on the CIFAR10 dataset, the attack success rate can reach 99.62%, 96.48%, and 98.89% in the downstream phase, release phase, and pre-training phase, respectively. We also take the first step to investigate the success factors of backdoor attacks in the pre-training phase and find that the trigger number and trigger pattern play key roles in the success of backdoor attacks while the trigger location has only a tiny effect. In the end, our empirical study of the defense mechanisms across three detection levels on the model supply-chain phases indicates that different defenses are suitable for backdoor attacks in different phases. However, backdoor attacks in the release phase cannot be detected by any of the three detection-level methods, calling for more effective defenses in future research.","backdoor attack, masked image modeling" ESCHER: Eschewing Importance Sampling in Games by Computing a History Value Function to Estimate Regret,https://openreview.net/forum?id=35QyoZv8cKO,https://openreview.net/pdf?id=35QyoZv8cKO,We propose a principled deep CFR algorithm that can scale to large games by removing importance sampling,"Recent techniques for approximating Nash equilibria in very large games leverage neural networks to learn approximately optimal policies (strategies). One promising line of research uses neural networks to approximate counterfactual regret minimization (CFR) or its modern variants. DREAM, the only current CFR-based neural method that is model free and therefore scalable to very large games, trains a neural network on an estimated regret target that can have extremely high variance due to an importance sampling term inherited from Monte Carlo CFR (MCCFR). In this paper we propose an unbiased model-free method that does not require any importance sampling. 
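A toy caricature (ours) of why the importance term inherited from MCCFR inflates variance, and why replacing the reweighted sample with a queried value estimate, as ESCHER does with its history value function, removes the blow-up; the oracle value used below is hypothetical:

import numpy as np

rng = np.random.default_rng(0)
p, q, n = 0.5, 0.01, 10_000  # true reach prob. vs. sampling policy's reach prob.
sampled = rng.random(n) < q

# MCCFR-style target: reward reweighted by 1/q when the trajectory is sampled.
is_est = np.where(sampled, p / q, 0.0)
# ESCHER-style target: query a value function instead (oracle stand-in here).
val_est = np.full(n, p)

print("importance-sampled mean/std:", is_est.mean(), is_est.std())
print("value-function mean/std:   ", val_est.mean(), val_est.std())
# Both are centered near 0.5, but the importance-sampled target's variance
# explodes as q shrinks -- the regret-target noise ESCHER removes.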
Our method, ESCHER, is principled and is guaranteed to converge to an approximate Nash equilibrium with high probability. We show that the variance of the estimated regret of ESCHER is orders of magnitude lower than that of DREAM and other baselines. We then show that ESCHER outperforms the prior state of the art—DREAM and neural fictitious self-play (NFSP)—on a number of games and the difference becomes dramatic as game size increases. In the very large game of dark chess, ESCHER is able to beat DREAM and NFSP in a head-to-head competition over 90% of the time.","game theory, two-player zero-sum, CFR, Reinforcement learning" On Achieving Optimal Adversarial Test Error,https://openreview.net/forum?id=fVm3nZMZs9,https://openreview.net/pdf?id=fVm3nZMZs9,,"We first elucidate various fundamental properties of optimal adversarial predictors: the structure of optimal adversarial convex predictors in terms of optimal adversarial zero-one predictors, bounds relating the adversarial convex loss to the adversarial zero-one loss, and the fact that continuous predictors can get arbitrarily close to the optimal adversarial error for both convex and zero-one losses. Applying these results along with new Rademacher complexity bounds for adversarial training near initialization, we prove that for general data distributions and perturbation sets, adversarial training on shallow networks with early stopping and an idealized optimal adversary is able to achieve optimal adversarial test error. By contrast, prior theoretical work either considered specialized data distributions or only provided training error guarantees.", General Policy Evaluation and Improvement by Learning to Identify Few But Crucial States,https://openreview.net/forum?id=Frt6LTRFhui,https://openreview.net/pdf?id=Frt6LTRFhui,,"Learning to evaluate and improve policies is a core problem of Reinforcement Learning (RL). Traditional RL algorithms learn a value function defined for a single policy. A recently explored competitive alternative is to learn a single value function for many policies. Here we combine the actor-critic architecture of Parameter-Based Value Functions and the policy embedding of Policy Evaluation Networks to learn a single value function for evaluating (and thus helping to improve) any policy represented by a deep neural network (NN). The method yields competitive experimental results. In continuous control problems with infinitely many states, our value function minimizes its prediction error by simultaneously learning a small set of `probing states' and a mapping from actions produced in probing states to the policy's return. The method extracts crucial abstract knowledge about the environment in the form of very few states sufficient to fully specify the behavior of many policies. A policy improves solely by changing actions in probing states, following the gradient of the value function's predictions. Surprisingly, it is possible to clone the behavior of a near-optimal policy in the Swimmer-v3 and Hopper-v3 environments only by knowing how to act in 3 and 5 such learned states, respectively. 
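A bare-bones rendering of the probing-states idea: evaluate a policy only by its actions at a few fixed states and map that fingerprint to a predicted return. In the paper both the probing states and the head are learned jointly and policies are deep networks; everything named below is illustrative.

import numpy as np

class ProbingStateEvaluator:
    def __init__(self, probe_states, seed=0):
        # Linear head from the policy's probe-state actions to a return.
        rng = np.random.default_rng(seed)
        self.probe_states = probe_states
        self.head = rng.normal(scale=0.1, size=len(probe_states))

    def fingerprint(self, policy):
        # The policy is only ever queried at the K probing states.
        return np.array([policy(s) for s in self.probe_states])

    def predicted_return(self, policy):
        return float(self.head @ self.fingerprint(policy))

probes = np.eye(3)  # three probing states (toy; the paper learns them)
evaluator = ProbingStateEvaluator(probes)
policy = lambda s: float(s @ np.array([0.5, -1.0, 2.0]))  # 1-D actions assumed
print(evaluator.predicted_return(policy))
# A policy improves by adjusting only its probe-state actions along the
# gradient of this prediction.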
Remarkably, our value function trained to evaluate NN policies is also invariant to changes of the policy architecture: we show that it allows for zero-shot learning of linear policies competitive with the best policy seen during training.","Reinforcement Learning, Off-Policy Reinforcement Learning" Serving Graph Compression for Graph Neural Networks,https://openreview.net/forum?id=T-qVtA3pAxG,https://openreview.net/pdf?id=T-qVtA3pAxG,Compressing the graph for graph neural networks inference,"Serving a GNN model online is challenging --- in many applications when testing nodes are connected to training nodes, one has to propagate information from training nodes to testing nodes to achieve the best performance, and storing the whole training set (including training graph and node features) during the inference stage is prohibitive for large-scale problems. In this paper, we study graph compression to reduce the storage requirement for GNN in serving. Given a GNN model to be served, we propose to construct a compressed graph with a smaller number of nodes. At serving time, one just needs to replace the original training-set graph with this compressed graph, without changing the actual GNN model or the forward pass. We carefully analyze the error in the forward pass and derive simple ways to construct the compressed graph to minimize the approximation error. Experimental results on semi-supervised node classification demonstrate that the proposed method can significantly reduce the serving space requirement for GNN inference. ","Model compression, Graph Neural Networks" Optimal Data Sampling for Training Neural Surrogates of Programs,https://openreview.net/forum?id=UcKEodTPtfI,https://openreview.net/pdf?id=UcKEodTPtfI,We show how to optimally sample different paths of a program to construct a neural network surrogate of that program.,"Programmers and researchers are increasingly developing surrogates of programs, models of a subset of the observable behavior of a given program, to solve a variety of software development challenges. Programmers train surrogates from measurements of the behavior of a program on a dataset of input examples. We present a methodology for optimally sampling datasets to train neural network-based surrogates of programs. We first characterize the optimal proportion of data to sample from each path in a program based on the complexity of learning the path. We next provide a program analysis to determine the complexity of different paths in a program. We evaluate these results on a large-scale graphics program, demonstrating that theoretically optimal sampling results in empirical improvements in accuracy.","programming languages, surrogates, program analysis" Towards Understanding GD with Hard and Conjugate Pseudo-labels for Test-Time Adaptation,https://openreview.net/forum?id=FJXf1FXN8C,https://openreview.net/pdf?id=FJXf1FXN8C,Towards Understanding GD with Hard and Conjugate Pseudo-labels for Test-Time Adaptation,"We consider a setting in which a model needs to adapt to a new domain under distribution shifts, given that only unlabeled test samples from the new domain are accessible at test time. A common idea in most related works is constructing pseudo-labels for the unlabeled test samples and applying gradient descent (GD) to a loss function with the pseudo-labels. Recently, Goyal et al. (2022) propose conjugate labels, a new kind of pseudo-label for self-training at test time. 
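For concreteness, one GD step of the hard-pseudo-label baseline that the paper analyzes, written for a linear binary classifier under the logistic loss (our minimal sketch; conjugate labels would replace the argmax with a loss-specific soft target):

import numpy as np

def hard_pseudo_label_step(w, X, lr=0.1):
    # Self-training: label the unlabeled test batch with the model's own
    # argmax predictions, then take one GD step on the logistic loss.
    probs = 1.0 / (1.0 + np.exp(-(X @ w)))
    pseudo = (probs > 0.5).astype(float)  # hard pseudo-labels
    return w - lr * X.T @ (probs - pseudo) / len(X)

rng = np.random.default_rng(0)
w = rng.normal(size=3)
X_test = rng.normal(size=(64, 3))  # unlabeled batch from the shifted domain
for _ in range(10):
    w = hard_pseudo_label_step(w, X_test)
print(w)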
They empirically show that the conjugate label outperforms other ways of pseudo-labeling on many domain adaptation benchmarks. However, provably showing that GD with conjugate labels learns a good classifier for test-time adaptation remains open. In this work, we aim at theoretically understanding GD with hard and conjugate labels for a binary classification problem. We show that for square loss, GD with conjugate labels converges to a solution that minimizes the testing $0$-$1$ loss under a Gaussian model, while GD with hard pseudo-labels fails in this task. We also analyze them under different loss functions for the update. Our results shed light on understanding when and why GD with hard labels or conjugate labels works in test-time adaptation. ", Achieving Communication-Efficient Policy Evaluation for Multi-Agent Reinforcement Learning: Local TD-Steps or Batching?,https://openreview.net/forum?id=krFbWKl3Sz,https://openreview.net/pdf?id=krFbWKl3Sz,,"In many consensus-based actor-critic multi-agent reinforcement learning (MARL) strategies, one of the key components is the MARL policy evaluation (PE) problem, where a set of $N$ agents work cooperatively to evaluate the value function of the global states under a given policy only through communicating with their neighbors. In MARL-PE, a critical challenge is how to lower the communication complexity, which is defined as the rounds of communication between neighboring nodes in order to converge to some $\epsilon$-stationary point. To lower communication complexity in MARL-PE, there exist two ``natural'' ideas: i) using batching to reduce the variance of TD (temporal difference) errors, which in turn improves the convergence rate of MARL-PE; and ii) performing multiple local TD update steps between consecutive rounds of communication, so as to reduce the communication frequency. While the effectiveness of the batching approach has been verified and relatively well-understood, the validity of the local TD-steps approach remains unclear due to the potential ``agent-drift'' phenomenon resulting from various heterogeneity factors across agents. This leads to an interesting open question in MARL-PE: *Does the local TD-steps approach really work and how does it perform in comparison to the batching approach?* In this paper, we make the first attempt to answer this interesting and fundamental question. Our theoretical analysis and experimental results confirm that allowing multiple local TD steps is indeed a valid approach in lowering the communication complexity of MARL-PE compared to vanilla consensus-based MARL-PE algorithms. Specifically, the local TD steps between two consecutive communication rounds can be as large as $\mathcal{O}(\sqrt{1/\epsilon}\log{(1/\epsilon)})$ in order to converge to an $\epsilon$-stationary point of MARL-PE. Theoretically, we show that in order to reach the optimal sample complexity up to a log factor, the communication complexity is $\mathcal{O}(\sqrt{1/\epsilon}\log{(1/\epsilon)})$, which is *considerably worse* than TD learning with batching, whose communication complexity is $\mathcal{O}(\log (1/\epsilon))$. However, the experimental results show that allowing multiple local steps can be as good as the batching approach. ", Learning where and when to reason in neuro-symbolic inference,https://openreview.net/forum?id=en9V5F8PR-,https://openreview.net/pdf?id=en9V5F8PR-,,"The integration of hard constraints on neural network outputs is a very desirable capability. 
This makes it possible to instill trust in AI by guaranteeing that a neural network's predictions are sane with respect to domain knowledge. Recently, this topic has received a lot of attention. However, existing methods usually either impose the constraints in a ``weak'' form at training time, with no guarantees at inference, or fail to provide a general framework that supports different tasks and constraint types. We tackle this open problem from a neuro-symbolic perspective. Our pipeline enhances a conventional neural predictor with (1) a symbolic reasoning module capable of correcting structured prediction errors and (2) a neural attention module that learns to direct the reasoning effort to focus on potential prediction errors, while keeping other outputs unchanged. This framework provides an appealing trade-off between the efficiency of constraint-free neural inference and the prohibitive cost of exhaustive reasoning at inference time. We show that our method outperforms the state of the art on visual-Sudoku tasks and can further improve the performance of existing neuro-symbolic systems that lack our explicit reasoning during inference.", Aging with GRACE: Lifelong Model Editing with Key-Value Adaptors,https://openreview.net/forum?id=ngCT1EelZk,https://openreview.net/pdf?id=ngCT1EelZk,We continually fix large models' mistakes by caching learned activations in a codebook for a selected layer. The cached activations can be re-used to influence the model's behavior for future instances that are similar to previously-fixed errors.,"Large language models often err during deployment, either due to non-representative training data or distribution shift in the test set. Recently, model editors have been proposed to fix errors by adjusting a pre-trained model's weights. So far, however, existing model editors fail when making sequential edits, quickly decaying a model's performance on its upstream data. Further, when editing deployed online models, they quickly forget how to fix previously-seen mistakes. We advance beyond these existing methods by proposing and studying a novel Lifelong Model Editing setting, where errors stream into a deployed model and we update the model to correct its predictions without influencing its predictions for unrelated inputs. Towards effective methods in this challenging setting, we propose General Retrieval Adaptors for Continual Editing, or GRACE. GRACE is a new Key-Value framework that casts model editing as a codebook update problem. The proposed approach edits selected model layers by caching activations that are queried using embeddings from the previous layer. The cached activations are trained to correct a model's predictions, treating future layers as a decoder. As edits stream in, the keys and values of a GRACE layer are updated while the model weights remain frozen, ensuring similar edits are treated similarly without altering the model's performance on unrelated instances. Experimentally, we show that GRACE substantially improves over recent model editors.", A VAE for Transformers with Nonparametric Variational Information Bottleneck,https://openreview.net/forum?id=6QkjC_cs03X,https://openreview.net/pdf?id=6QkjC_cs03X,We propose a Variational AutoEncoder using Bayesian nonparametrics to regularise a Transformer encoder-decoder with latent mixture distributions.,"We propose a Variational AutoEncoder (VAE) for Transformers by developing a Variational Information Bottleneck (VIB) regulariser for Transformer embeddings. 
We formalise the embedding space of Transformer encoders as mixture distributions, and use Bayesian nonparametrics to develop a Nonparametric VIB (NVIB) for such attention-based representations. The variable number of mixture components supported by nonparametric methods captures the variable number of vectors supported by attention, and exchangeable distributions from nonparametric methods capture the permutation invariance of attention. We then propose our Transformer VAE (NVAE) using NVIB to regularise the information passing from the Transformer encoder to the Transformer decoder through cross-attention. Evaluations of an NVAE, trained on natural language text, demonstrate that NVIB can regularise the number of mixture components in the induced embedding whilst maintaining generation quality and reconstruction capacity. ","VAE, VIB, Bayesian nonparametrics, Transformers, natural language" "Learning MLPs on Graphs: A Unified View of Effectiveness, Robustness, and Efficiency",https://openreview.net/forum?id=Cs3r5KLdoj,https://openreview.net/pdf?id=Cs3r5KLdoj,"We propose NOSMOG, a novel method to learn noise-robust and structure-aware MLPs on graphs, with superior effectiveness, outstanding robustness, and exceptional efficiency.","While Graph Neural Networks (GNNs) have demonstrated their efficacy in dealing with non-Euclidean structural data, they are difficult to deploy in real applications due to the scalability constraint imposed by the multi-hop data dependency. Existing methods attempt to address this scalability issue by training student multi-layer perceptrons (MLPs) exclusively on node content features using labels derived from the teacher GNNs. However, the trained MLPs are neither effective nor robust. In this paper, we ascribe the lack of effectiveness and robustness to three significant challenges: 1) the misalignment between content feature and label spaces, 2) the strict hard matching to the teacher's output, and 3) the sensitivity to node feature noises. To address the challenges, we propose NOSMOG, a novel method to learn NOise-robust Structure-aware MLPs On Graphs, with remarkable effectiveness, robustness, and efficiency. Specifically, we first address the misalignment by complementing node content with position features to capture the graph structural information. We then design an innovative representational similarity distillation strategy to inject soft node similarities into MLPs. Finally, we introduce adversarial feature augmentation to ensure stable learning against feature noises. Extensive experiments and theoretical analyses demonstrate the superiority of NOSMOG by comparing it to GNNs and the state-of-the-art method in both transductive and inductive settings across seven datasets.","Graph Representation Learning, Knowledge Distillation" On The Specialization of Neural Modules,https://openreview.net/forum?id=Fh97BDaR6I,https://openreview.net/pdf?id=Fh97BDaR6I,We use the linear neural networks framework to mathematically study the ability of neural modules to specialize and facilitate systematic generalization in modular network architectures.,"A number of machine learning models have been proposed with the goal of achieving systematic generalization: the ability to reason about new situations by combining aspects of previous experiences. These models leverage compositional architectures which aim to learn specialized modules dedicated to structures in a task that can be composed to solve novel problems with similar structures. 
While the compositionality of these architectures is guaranteed by design, the specialization of the modules is not. Here we theoretically study the ability of network modules to specialize to useful structures in a dataset and achieve systematic generalization. To this end, we introduce a minimal space of datasets motivated by practical systematic generalization benchmarks. From this space of datasets we present a mathematical definition of systematicity and study the learning dynamics of linear neural modules when solving components of the task. Our results shed light on the difficulty of module specialization, what is required for modules to successfully specialize, and the necessity of modular architectures to achieve systematicity. Finally, we confirm that the theoretical results in our tractable setting generalize to more complex datasets and non-linear architectures.","Systematic Generalization, Linear Neural Networks, Neural Module Networks" HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers,https://openreview.net/forum?id=D7srTrGhAs,https://openreview.net/pdf?id=D7srTrGhAs,We propose a novel task-agnostic distillation method for Transformer-based language models equipped with iterative pruning.,"Knowledge distillation has been shown to be a powerful model compression approach to facilitate the deployment of pre-trained language models in practice. This paper focuses on task-agnostic distillation. It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints. Despite the practical benefits, task-agnostic distillation is challenging. Since the teacher model has a significantly larger capacity and stronger representation power than the student model, it is very difficult for the student to produce predictions that match the teacher's over a massive amount of open-domain training data. Such a large prediction discrepancy often diminishes the benefits of knowledge distillation. To address this challenge, we propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning. Specifically, we initialize the student model from the teacher model, and iteratively prune the student's neurons until the target width is reached. Such an approach maintains a small discrepancy between the teacher's and student's predictions throughout the distillation process, which ensures the effectiveness of knowledge transfer. Extensive experiments demonstrate that HomoDistil achieves significant improvements over existing baselines. Our codes will be released.","Knowledge Distillation, Structured Pruning, Pre-trained Transformer Language Models, Model Compression" Information-Theoretic Underpinnings of Generalization and Translation in Emergent Communication,https://openreview.net/forum?id=avNxfA4IXj,https://openreview.net/pdf?id=avNxfA4IXj,Controlling emergent communication complexity and informativeness allows agents to generalize better and understand translations of natural language,"Traditional emergent communication (EC) methods often fail to generalize to novel settings or align with representations of natural language. While these limitations may at first appear unrelated, in this work, we show how controlling the Information Bottleneck (IB) tradeoff between complexity and informativeness (a principle thought to guide human languages) helps to address both of these problems in EC. 
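For reference, the IB tradeoff invoked in this abstract is usually written in the following standard form (a textbook formulation under one common convention, not an equation taken from the paper): an encoding is chosen to maximize informativeness about the target while penalizing representational complexity.

```latex
% Standard Information Bottleneck objective (one common convention):
% informativeness I(Z;Y) traded off against complexity I(X;Z),
% with \beta \ge 0 setting the operating point on the tradeoff curve.
\max_{q(z \mid x)} \; I(Z;Y) \;-\; \beta \, I(X;Z)
```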
Specifically, we build on VQ-VIB, a recently proposed method for training EC agents while controlling the IB tradeoff, in addition to maximizing agents' utility. We find that increasing informativeness, which is a task-agnostic measure of how well a listener can reconstruct a speaker's meaning, allows EC agents to better generalize to novel settings and more challenging tasks. At the same time, in translation experiments between EC and English, we find that increasing EC informativeness only improves team performance up to a certain threshold, corresponding to the English informativeness-complexity tradeoff. Jointly, our results indicate the importance of training EC systems while controlling the informativeness-complexity tradeoff to simultaneously support improved self-play performance and human-agent interaction.","Emergent Communication, Information Theory" Optimal Transport-Based Supervised Graph Summarization,https://openreview.net/forum?id=Bq1-IOPKet,https://openreview.net/pdf?id=Bq1-IOPKet,,"Graph summarization is the problem of producing smaller graph representations of an input graph dataset, in such a way that the smaller ``compressed'' graphs capture relevant structural information for downstream tasks. One graph summarization method, recently proposed in Garg & Jaakkola (2019), formulates an optimal transport-based framework that allows prior information about node, edge, and attribute importance to be incorporated into the graph summarization process. We extend the optimal transport framework to a supervised graph summarization setting, wherein we seek to preserve relevant information about a class label. We first formulate the problem in terms of maximizing the mutual information between the summarized graph and the class label. We then propose a method that incorporates mutual information estimates between random variables associated with sample graphs and class labels into the optimal transport compression framework from Garg & Jaakkola (2019). We empirically show performance improvements over the previous work by Garg & Jaakkola (2019), in terms of classification and compression on synthetic and real datasets. We then theoretically show limitations of the optimal transport approach: e.g., that it fails to satisfy a certain desirable information monotonicity property. ","Graph Summarization, Optimal Transport, Supervised Learning, Mutual Information" Contrastive Vision Transformer for Self-supervised Out-of-distribution Detection,https://openreview.net/forum?id=UAmH4nDH4l,https://openreview.net/pdf?id=UAmH4nDH4l,,"Out-of-distribution (OOD) detection aims to detect abnormal samples that do not belong to the distribution of the training data (the in-distribution (ID) data). The technique has been applied to various image classification tasks to identify abnormal image samples for which the abnormality is caused by semantic shift (from different classes) or covariate shift (from different domains). However, disentangling OOD samples caused by different shifts remains a challenge in image OOD detection. This paper proposes Contrastive Vision Transformer (CVT), an attention-based contrastive learning model, for self-supervised OOD detection in image classification tasks. Specifically, a vision transformer architecture is integrated as a feature extracting module under a contrastive learning framework. 
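To make the contrastive component concrete, here is a minimal sketch of an NT-Xent (SimCLR-style) loss, the standard objective such a framework could build on; this is an illustrative assumption, not the paper's actual implementation.

```python
# Minimal NT-Xent (SimCLR-style) contrastive loss over two augmented views.
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """z1, z2: (N, D) embeddings of two views of the same N images."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D), unit norm
    sim = z @ z.t() / tau                                # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # exclude self-pairs
    n = z1.shape[0]
    # the positive for view-1 sample i is view-2 sample i, and vice versa
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))
```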
An empirical ensemble module is developed to extract representative ensemble features, from which a balance can be achieved between semantic and covariate OOD samples. The proposed CVT model is tested on various self-supervised OOD detection tasks, and our approach outperforms state-of-the-art methods by 5.5% AUROC on CIFAR-10 (ID) vs. CIFAR-100 (OOD), and by 10.7% AUROC on CIFAR-100 (ID) vs. CIFAR-10 (OOD).","Out-of-distribution, self-supervised learning, contrastive learning, vision transformer" Does the Half Adversarial Robustness Represent the Whole? It Depends... A Theoretical Perspective of Subnetwork Robustness,https://openreview.net/forum?id=8vJcsZ-3Ly,https://openreview.net/pdf?id=8vJcsZ-3Ly,"We prove with theory and experimental results that if a subnetwork is adversarially robust and highly correlated with the rest of the network, then the remaining layers are also robust.","Adversarial robustness of deep neural networks has been studied extensively and can bring security against adversarial attacks/examples. However, adversarially robust training approaches require a training mechanism on the entire deep network, which can come at the cost of efficiency and computational overhead, such as increased runtime. As a pilot study, we develop in this paper a novel theoretical framework that aims to answer the question: how can we make a whole model robust to adversarial examples by making only part of it robust? Toward promoting subnetwork robustness, we propose for the first time a new concept of semirobustness, which denotes the adversarial robustness of a part of the network. We provide a theoretical analysis to show that if a subnetwork is robust and highly correlated with the rest of the network, then the remaining layers are also guaranteed to be robust. To guide the empirical investigation of our theoretical findings, we implemented our method at multiple layer depths and across multiple common image classification datasets. Experiments demonstrate that our method, with sufficient dependency between subnetworks, successfully utilizes subnetwork robustness to match fully-robust models' performance across AlexNet, VGG16, and ResNet50 benchmarks, for attack types FGSM, I-FGSM, PGD, and C&W.","Adversarial Learning, Adversarial Robustness, Subnetworks, Semirobustness, Information-Theoretic Measures, Mutual Dependency" Using Both Demonstrations and Language Instructions to Efficiently Learn Robotic Tasks,https://openreview.net/forum?id=4u42KCQxCn8,https://openreview.net/pdf?id=4u42KCQxCn8,Conditioning robotic manipulation policies on both demonstrations and language instructions improves sample efficiency and generalization to novel tasks.,"Demonstrations and natural language instructions are two common ways to specify and teach robots novel tasks. However, for many complex tasks, a demonstration or language instruction alone contains ambiguities, preventing tasks from being specified clearly. In such cases, a combination of both a demonstration and an instruction more concisely and effectively conveys the task to the robot than either modality alone. To instantiate this problem setting, we train a single multi-task policy on a few hundred challenging robotic pick-and-place tasks and propose DeL-TaCo (Joint Demo-Language Task Conditioning), a method for conditioning a robotic policy on task embeddings comprising two components: a visual demonstration and a language instruction. 
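A minimal sketch of what such two-component task conditioning could look like; the module names, dimensions, and fusion scheme are illustrative assumptions, not DeL-TaCo's actual architecture.

```python
# Fuse a demonstration embedding and a language embedding into one task code
# that a policy can condition on. Shapes are placeholders.
import torch
import torch.nn as nn

class DemoLangTaskEncoder(nn.Module):
    def __init__(self, demo_dim: int = 512, lang_dim: int = 384, task_dim: int = 256):
        super().__init__()
        self.demo_proj = nn.Linear(demo_dim, task_dim)
        self.lang_proj = nn.Linear(lang_dim, task_dim)
        self.fuse = nn.Sequential(nn.Linear(2 * task_dim, task_dim), nn.ReLU())

    def forward(self, demo_feat: torch.Tensor, lang_feat: torch.Tensor) -> torch.Tensor:
        # project each modality, concatenate, and fuse into a single embedding
        return self.fuse(torch.cat([self.demo_proj(demo_feat),
                                    self.lang_proj(lang_feat)], dim=-1))

task_embedding = DemoLangTaskEncoder()(torch.randn(1, 512), torch.randn(1, 384))
```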
By allowing these two modalities to mutually disambiguate and clarify each other during novel task specification, DeL-TaCo (1) substantially decreases the teacher effort needed to specify a new task and (2) achieves better generalization performance on novel objects and instructions than previous task-conditioning methods. To our knowledge, this is the first work to show that simultaneously conditioning a multi-task robotic manipulation policy on both demonstration and language embeddings improves sample efficiency and generalization over conditioning on either modality alone.","natural language for robotics, instruction following, learning from demonstrations, multi-task learning, robotic manipulation" Improving Accuracy and Explainability of Online Handwriting Recognition,https://openreview.net/forum?id=20tAZh6Ut3,https://openreview.net/pdf?id=20tAZh6Ut3,,"Handwriting recognition technology makes it possible to recognize written text from given data. The recognition task can target letters, symbols, or words, and the input data can be a digital image or signals recorded by various sensors. A wide range of applications from signature verification to electronic document processing can be realized by implementing efficient and accurate handwriting recognition algorithms. Over the years, there has been an increasing interest in experimenting with different types of technology to collect handwriting data, create datasets, and develop algorithms to recognize characters and symbols. More recently, the OnHW-chars dataset was published, containing multivariate time series data of the English alphabet collected using a ballpoint pen fitted with sensors. The authors of OnHW-chars also provided some baseline results through their machine learning (ML) and deep learning (DL) classifiers. In this paper, we develop handwriting recognition models on the OnHW-chars dataset and improve the accuracy of previous models. More specifically, our ML models provide $11.3\%$-$23.56\%$ improvements over the previous ML models, and our optimized DL models with ensemble learning provide $3.08\%$-$7.01\%$ improvements over the previous DL models. In addition to our accuracy improvements across the spectrum, we aim to provide some level of explainability for our models, clarifying the rationale behind the chosen methods and why they suit the data type in the dataset. Our source codes, data, and models will be made publicly available for verifiability and reproducibility of our results.","Machine Learning, Deep Learning, Explainability, Computer vision, Ensemble Learning" On the duality between contrastive and non-contrastive self-supervised learning,https://openreview.net/forum?id=kDEL91Dufpa,https://openreview.net/pdf?id=kDEL91Dufpa,"We show that contrastive and non-contrastive self-supervised methods can be shown to be closely related, and then study how implementation details impact performance. We validate empirically our findings and significantly improve known behaviours.","Recent approaches in self-supervised learning of image representations can be categorized into different families of methods and, in particular, can be divided into contrastive and non-contrastive approaches. While differences between the two families have been thoroughly discussed to motivate new approaches, we focus more on the theoretical similarities between them. 
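For readers who want the two families side by side: the NT-Xent snippet above is a standard contrastive criterion, while the covariance-based non-contrastive family can be represented by VICReg's published variance/invariance/covariance terms, sketched below with illustrative coefficients (the generic VICReg form, not the exact criteria this paper constructs).

```python
# VICReg-style covariance-based criterion for one pair of embedding batches.
# Note: the published VICReg applies the variance and covariance terms to
# both branches; only one branch is shown here for brevity.
import torch

def vicreg_terms(z1: torch.Tensor, z2: torch.Tensor):
    """z1, z2: (N, D) embeddings of two views; returns the three loss terms."""
    inv = ((z1 - z2) ** 2).mean()                         # invariance (MSE) term
    std = torch.sqrt(z1.var(dim=0) + 1e-4)
    var = torch.relu(1.0 - std).mean()                    # hinge on per-dim std
    zc = z1 - z1.mean(dim=0)
    cov = (zc.t() @ zc) / (z1.shape[0] - 1)               # feature covariance
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_pen = (off_diag ** 2).sum() / z1.shape[1]         # decorrelation term
    return inv, var, cov_pen

inv, var, cov_pen = vicreg_terms(torch.randn(16, 32), torch.randn(16, 32))
loss = 25.0 * inv + 25.0 * var + 1.0 * cov_pen            # illustrative weights
```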
By designing contrastive and covariance-based non-contrastive criteria that can be related algebraically and shown to be equivalent under limited assumptions, we show how close those families can be. We further study popular methods and introduce variations of them, allowing us to relate this theoretical result to current practices and show the influence (or lack thereof) of design choices on downstream performance. Motivated by our equivalence result, we investigate the low performance of SimCLR and show how it can match VICReg's with careful hyperparameter tuning, improving significantly over known baselines. We also challenge the popular assumptions that contrastive and non-contrastive methods, respectively, need large batch sizes and output dimensions. Our theoretical and quantitative results suggest that the numerical gaps between contrastive and non-contrastive methods in certain regimes can be closed given better network design choices and hyperparameter tuning. The evidence shows that unifying different SOTA methods is an important direction to build a better understanding of self-supervised learning.","Self-supervised learning, contrastive, non-contrastive" Few-Shot Incremental Learning Using HyperTransformers,https://openreview.net/forum?id=nXmU89Rfmgg,https://openreview.net/pdf?id=nXmU89Rfmgg,An attention-based recurrent hypernetwork for incremental few-shot learning using prototypical loss,"Incremental few-shot learning methods make it possible to learn without forgetting from multiple few-shot tasks arriving sequentially. In this work we approach this problem using the recently published HyperTransformer (HT): a hypernetwork that generates task-specific CNN weights directly from the support set. We propose to re-use these generated weights as an input to the HT for the next task of the continual-learning sequence. Thus, the HT uses the weights themselves as the representation of the previously learned tasks. This approach is different from most continual learning algorithms, which typically rely on replay buffers, weight regularization, or task-dependent architectural changes. Instead, we show that the HT works akin to a recurrent model, relying on the weights from the previous task and a support set from a new task. We demonstrate that a single HT equipped with a prototypical loss is capable of learning and retaining knowledge about past tasks for two continual learning scenarios: incremental-task learning and incremental-class learning.","few-shot learning, incremental learning, continual learning, transformers, hypernetworks" The Brainy Student: Scalable Unlearning by Selectively Disobeying the Teacher,https://openreview.net/forum?id=f9eHl5mKx5i,https://openreview.net/pdf?id=f9eHl5mKx5i,"We propose a new approach for deep machine unlearning that breaks free of limiting assumptions made in previous work, scales significantly better and consistently outperforms previous methods across a wide range of scenarios","Deep machine unlearning is the problem of removing the influence of a cohort of data from the weights of a trained deep model. This challenge has enjoyed increasing attention recently, motivated by the widespread use of neural networks in applications involving user data: allowing users to exercise their `right to be forgotten' necessitates an effective unlearning algorithm. Deleting data from models is also of interest in practice for removing out-of-date examples, outliers or noisy labels. 
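A minimal sketch of the teacher-student unlearning idea the title alludes to (the student obeys the teacher on retained data and disobeys it on the forget set); this is a generic reconstruction under stated assumptions, not the authors' exact objective.

```python
# Teacher-student unlearning loss: minimize KL to the teacher on the retain
# set while maximizing it on the forget set.
import torch
import torch.nn.functional as F

def unlearn_loss(student_retain, teacher_retain, student_forget, teacher_forget,
                 alpha: float = 1.0) -> torch.Tensor:
    def kl(s, t):
        return F.kl_div(F.log_softmax(s, dim=-1), F.softmax(t, dim=-1),
                        reduction="batchmean")
    follow = kl(student_retain, teacher_retain)     # stay close on retained data
    disobey = kl(student_forget, teacher_forget)    # move away on forgotten data
    return follow - alpha * disobey

loss = unlearn_loss(torch.randn(4, 10), torch.randn(4, 10),
                    torch.randn(4, 10), torch.randn(4, 10))
```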
However, most previous unlearning methods consider simple scenarios where a theoretical treatment is possible. Consequently, not only do their guarantees not apply to deep neural networks, but they also scale poorly. In this paper, drawing inspiration from teacher-student methods, we propose a scalable deep unlearning method that breaks free of previous limiting assumptions. Our thorough empirical investigation reveals that our approach significantly improves upon previous methods: it is by far the most consistent in achieving unlearning across a wide range of scenarios, incurs only minimal performance degradation, if any, and scales significantly better.","deep machine unlearning, machine unlearning, scalable unlearning" FIGARO: Controllable Music Generation using Learned and Expert Features,https://openreview.net/forum?id=NyR8OZFHw6i,https://openreview.net/pdf?id=NyR8OZFHw6i,We achieve state-of-the-art results in symbolic music style transfer by enabling human-interpretable control over the generation process while improving sample quality at the same time.,"Recent symbolic music generative models have achieved significant improvements in the quality of the generated samples. Nevertheless, it remains hard for users to control the output in such a way that it matches their expectations. To address this limitation, high-level, human-interpretable conditioning is essential. In this work, we release FIGARO, a Transformer-based conditional model trained to generate symbolic music based on a sequence of high-level control codes. To this end, we propose description-to-sequence learning, which consists of automatically extracting fine-grained, human-interpretable features (the description) and training a sequence-to-sequence model to reconstruct the original sequence given only the description as input. FIGARO achieves state-of-the-art performance in multi-track symbolic music generation both in terms of style transfer and sample quality. We show that performance can be further improved by combining human-interpretable with learned features. Our extensive experimental evaluation shows that FIGARO is able to generate samples that closely adhere to the content of the input descriptions, even when they deviate significantly from the training distribution.","symbolic music, style transfer, music generation, controllable generation, human-interpretability, self-supervised learning" A Neural PDE Solver with Temporal Stencil Modeling,https://openreview.net/forum?id=Nvlqsofsc6-,https://openreview.net/pdf?id=Nvlqsofsc6-,We propose a novel Temporal Stencil Modeling (TSM) method for solving time-dependent PDEs in conservation form.,"Numerical simulation of non-linear partial differential equations plays a crucial role in modeling physical science and engineering phenomena, such as weather, climate, and aerodynamics. Recent Machine Learning (ML) models trained on low-resolution spatio-temporal signals have shown new promise in capturing important dynamics in high-resolution signals, under the condition that the models can effectively recover the missing details. However, this study shows that significant information is often lost in the low-resolution down-sampled features. To address such issues, we propose a new approach, namely Temporal Stencil Modeling (TSM), which combines the strengths of advanced time-series sequence modeling (with the HiPPO features) and state-of-the-art neural PDE solvers (with learnable stencil modeling). 
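To illustrate what "learnable stencil modeling" can mean in this setting, here is a toy 1-D module that predicts per-location stencil weights and applies them to a window of neighboring cell values, generalizing a fixed finite-volume stencil; all shapes and layer choices are assumptions for exposition, not TSM's architecture.

```python
# Toy learned-stencil layer: a small CNN predicts normalized stencil weights,
# which are applied to a sliding window of neighboring cell values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedStencil1D(nn.Module):
    def __init__(self, width: int = 5, hidden: int = 32):
        super().__init__()
        self.width = width
        self.net = nn.Sequential(nn.Conv1d(1, hidden, 3, padding=1), nn.GELU(),
                                 nn.Conv1d(hidden, width, 3, padding=1))

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        """u: (B, 1, N) cell averages -> (B, 1, N) stencil-weighted output."""
        w = torch.softmax(self.net(u), dim=1)                  # (B, width, N)
        patches = F.unfold(u.unsqueeze(-1), (self.width, 1),
                           padding=(self.width // 2, 0))       # (B, width, N)
        return (w * patches).sum(dim=1, keepdim=True)

out = LearnedStencil1D()(torch.randn(2, 1, 64))
```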
TSM aims to recover the lost information from the PDE trajectories and can be regarded as a temporal generalization of classic finite volume methods such as WENO. Our experimental results show that TSM achieves new state-of-the-art simulation accuracy for 2-D incompressible Navier-Stokes turbulent flows: it significantly outperforms the previously reported best results by 19.9% in terms of the highly-correlated duration time, and reduces the inference latency to 80% of the previous level. We also show a strong generalization ability of the proposed method to various out-of-distribution turbulent flow settings.","neural PDE solver, Navier-Stokes equation, turbulent flow, Computational Fluid Dynamics, CFD" The Right Losses for the Right Gains: Improving the Semantic Consistency of Deep Text-to-Image Generation with Distribution-Sensitive Losses,https://openreview.net/forum?id=jnpGR7xu_P_,https://openreview.net/pdf?id=jnpGR7xu_P_,,"One of the major challenges in training deep neural networks for text-to-image generation is the significant linguistic discrepancy between the ground-truth captions of each image in most popular datasets. The large difference in the choice of words in such captions results in synthesizing images that are semantically dissimilar to each other and to their ground-truth counterparts. Moreover, existing models either fail to generate the fine-grained details of the image or require a huge number of parameters that render them inefficient for text-to-image synthesis. To fill this gap in the literature, we propose using the contrastive learning approach with a novel combination of two loss functions: a fake-to-fake loss to increase the semantic consistency between generated images of the same caption, and a fake-to-real loss to reduce the gap between the distributions of real images and fake ones. We test this approach on two baseline models: SSAGAN and AttnGAN (with style blocks to enhance the fine-grained details of the images). Results show that our approach improves the qualitative results on AttnGAN with style blocks on the CUB dataset. Additionally, on the challenging COCO dataset, our approach achieves competitive results against the state-of-the-art Lafite model and outperforms the FID scores of the SSAGAN and DALL-E models by 44% and 66.83% respectively, yet with only around 1% of the model size and training data of the huge DALL-E model.","Generative Adversarial Networks, Attention, GAN, Text to Image, Contrastive learning" Selection Collider Bias in Large Language Models,https://openreview.net/forum?id=25VgHaPz0l4,https://openreview.net/pdf?id=25VgHaPz0l4,"Using causal inference methods, we explain and demonstrate how sample selection bias causes spurious correlations during training, and how those spurious correlations can be used to classify prediction tasks as underspecified during inference.","In this paper we motivate the causal mechanisms behind sample selection induced collider bias (selection collider bias) that can cause Large Language Models (LLMs) to learn unconditional dependence between entities that are unconditionally independent in the real world. We show that selection collider bias can become amplified in underspecified learning tasks, and although difficult to overcome, we describe a method to exploit the resulting spurious correlations to determine when a model may be uncertain about its prediction. 
We demonstrate an uncertainty metric that matches human uncertainty in tasks with gender pronoun underspecification on an extended version of the Winogender Schemas evaluation set, and we provide online demos where users can evaluate spurious correlations and apply our uncertainty metric to their own texts and models. Finally, we generalize our approach to address a wider range of prediction tasks.","large language models, causal inference, selection bias" CausalBench: A Large-scale Benchmark for Network Inference from Single-cell Perturbation Data,https://openreview.net/forum?id=fHT8kZcyyT,https://openreview.net/pdf?id=fHT8kZcyyT,We introduce CausalBench - a comprehensive benchmark suite for evaluating network inference methods on perturbational single-cell gene expression data.,"Mapping biological mechanisms in cellular systems is a fundamental step in early-stage drug discovery that serves to generate hypotheses on what disease-relevant molecular targets may effectively be modulated by pharmacological interventions. With the advent of high-throughput methods for measuring single-cell gene expression under genetic perturbations, we now have effective means for generating evidence for causal gene-gene interactions at scale. However, inferring graphical networks of the size typically encountered in real-world gene-gene interaction networks is difficult in terms of both achieving and evaluating faithfulness to the true underlying causal graph. Moreover, standardised benchmarks for comparing methods for causal discovery in perturbational single-cell data do not yet exist. Here, we introduce CausalBench - a comprehensive benchmark suite for evaluating network inference methods on large-scale perturbational single-cell gene expression data. CausalBench introduces several biologically meaningful performance metrics and operates on two large, curated and openly available benchmark datasets for evaluating methods on the inference of gene regulatory networks from single-cell data generated under perturbations. With real-world datasets consisting of over 200,000 training samples under interventions, CausalBench could help facilitate advances in causal network inference by providing what is - to the best of our knowledge - the largest openly available test bed for causal discovery from real-world perturbation data to date.","causal discovery, gene regulatory networks, gene-gene interaction networks, network inference, single cell RNA sequencing, scRNAseq" Language models are multilingual chain-of-thought reasoners,https://openreview.net/forum?id=fR3wGCk-IXp,https://openreview.net/pdf?id=fR3wGCk-IXp,"We introduce the Multilingual Grade School Math (MGSM) benchmark, and analyze the multilingual multi-step reasoning abilities of large language models. ","We evaluate the reasoning abilities of large language models in multilingual settings. We introduce the Multilingual Grade School Math (MGSM) benchmark by manually translating 250 grade-school math problems from the GSM8K dataset (Cobbe et al., 2021) into ten typologically diverse languages. We find that the ability to solve MGSM problems via chain-of-thought prompting emerges with increasing model scale, and that models have strikingly strong multilingual reasoning abilities, even in underrepresented languages such as Bengali and Swahili. Finally, we show that the multilingual reasoning abilities of language models extend to other tasks such as commonsense reasoning and word-in-context semantic judgment. 
The MGSM benchmark is publicly available at AnonymousLink and in the supplementary material. ","multilingual, reasoning, large language model" DreamFusion: Text-to-3D using 2D Diffusion,https://openreview.net/forum?id=FjNys5c7VyY,https://openreview.net/pdf?id=FjNys5c7VyY,DeepDream on a pretrained 2D diffusion model enables text-to-3D synthesis,"Recent breakthroughs in text-to-image synthesis have been driven by diffusion models trained on billions of image-text pairs. Adapting this approach to 3D synthesis would require large-scale datasets of labeled 3D or multiview data and efficient architectures for denoising 3D data, neither of which currently exist. In this work, we circumvent these limitations by using a pretrained 2D text-to-image diffusion model to perform text-to-3D synthesis. We introduce a loss based on probability density distillation that enables the use of a 2D diffusion model as a prior for optimization of a parametric image generator. Using this loss in a DeepDream-like procedure, we optimize a randomly-initialized 3D model (a Neural Radiance Field, or NeRF) via gradient descent such that its 2D renderings from random angles achieve a low loss. The resulting 3D model of the given text can be viewed from any angle, relit by arbitrary illumination, or composited into any 3D environment. Our approach requires no 3D training data and no modifications to the image diffusion model, demonstrating the effectiveness of pretrained image diffusion models as priors.","diffusion models, score-based generative models, NeRF, neural rendering, 3d synthesis" Recitation-Augmented Language Models,https://openreview.net/forum?id=-cqvvvb-NkI,https://openreview.net/pdf?id=-cqvvvb-NkI,We propose a novel recitation-augmented generation framework to improve language models’ performance in the closed-book question-answering setting.,"We propose a new paradigm to help Large Language Models (LLMs) generate more accurate factual knowledge without retrieving from an external corpus, called RECITation-augmented gEneration (RECITE). Different from retrieval-augmented language models that retrieve relevant documents before generating the outputs, given an input, RECITE first recites one or several relevant passages from LLMs’ own memory via sampling, and then produces the final answers. We show that RECITE is a powerful paradigm for knowledge-intensive NLP tasks. Specifically, we show that by utilizing recitation as the intermediate step, a recite-and-answer scheme can achieve new state-of-the-art performance in various closed-book question answering (CBQA) tasks. In experiments, we verify the effectiveness of RECITE on three pre-trained models (In-house LM, UL2, and OPT) and three CBQA tasks (Natural Questions, TriviaQA, and HotpotQA).","Large Language Models, In-context Learning, Memorization, Closed-book Question Answering, CBQA" Continual Active Learning,https://openreview.net/forum?id=GC5MsCxrU-,https://openreview.net/pdf?id=GC5MsCxrU-,We reduce Active Learning (AL) training time with the help of replay-based Continual Learning algorithms all while maintaining performance on par with standard AL. ,"While active learning (AL) improves the labeling efficiency of machine learning (by allowing models to query the labels of data samples), a major problem is that compute efficiency is decreased since models are typically retrained from scratch at each query round. 
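The abstract continues below; as a toy reference for the paradigm it describes, an active-learning loop that fine-tunes with replay instead of retraining from scratch could look like this (every component here is a simplified stand-in, not the paper's algorithm).

```python
# Toy compute-efficient AL loop: query informative points, then continue
# training on the new labels plus a replayed subset of past labels.
import random

def continual_active_learning(train_step, score, pool, label_fn,
                              rounds: int = 5, k: int = 32, replay_k: int = 64):
    history = []
    for _ in range(rounds):
        pool.sort(key=score, reverse=True)          # most informative first
        batch, pool = pool[:k], pool[k:]
        labeled = [(x, label_fn(x)) for x in batch]
        replay = random.sample(history, min(replay_k, len(history)))
        train_step(labeled + replay)                # warm update, no cold restart
        history.extend(labeled)
    return history

# toy usage: "informativeness" = closeness to a decision boundary at 0.5
hist = continual_active_learning(train_step=lambda data: None,
                                 score=lambda x: -abs(x - 0.5),
                                 pool=[i / 100 for i in range(100)],
                                 label_fn=lambda x: int(x > 0.5))
```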
In this work, we develop a new framework that circumvents this problem by biasing further training towards the recently labeled sets, thereby complementing existing work on AL acceleration. We employ existing and novel replay-based Continual Learning (CL) algorithms that are effective at quickly learning new samples without forgetting previously learned information, especially when data comes from a shifting or evolving distribution. We call this compute-efficient active learning paradigm $\textit{``Continual Active Learning"" (CAL)}$. We demonstrate that standard AL with warm starting fails to accelerate training, and that naive fine-tuning suffers from catastrophic forgetting due to distribution shifts over query rounds. We then show CAL achieves significant speedups using a plethora of replay schemes that use model distillation, and that select diverse/uncertain points from the history, all while maintaining performance on par with standard AL. We conduct experiments across many data domains, including natural language, vision, medical imaging, and computational biology, each with very different neural architectures (Transformers/CNNs/MLPs). CAL consistently provides a 2-6x reduction in training time, thus showing its applicability across differing modalities.","Active Learning, Deep Learning, Efficient Machine Learning, Continual Learning" KwikBucks: Correlation Clustering with Cheap-Weak and Expensive-Strong Signals,https://openreview.net/forum?id=p0JSSa1AuV,https://openreview.net/pdf?id=p0JSSa1AuV,"Inspired by text clustering, we study correlation clustering where similarities must be queried via an expensive model (e.g. a large language model) with additional help from a cheap but noisy model (e.g. an embedding based model).","The unprecedented rate at which the sizes of machine learning (ML) models are growing necessitates novel approaches to enable efficient and scalable solutions. We contribute to this line of work by studying a novel version of the Budgeted Correlation Clustering problem (BCC) where, along with a limited number of queries to an expensive oracle for node similarities (e.g. a large ML model), we have unlimited access to a cheaper but less accurate second oracle. Our formulation is inspired by many practical scenarios where coarse approximations of the expensive similarity metric can be efficiently obtained via weaker models. We develop a theoretically motivated algorithm in this setting that leverages the cheap oracle to judiciously query the strong oracle while maintaining high clustering quality. We empirically demonstrate gains in query minimization and clustering metrics on a variety of datasets with diverse strong and cheap oracles. Most notably, we demonstrate a practical application in text clustering based on expensive cross-attention language models by showing that cheaper (but weaker) embedding-based models can be leveraged to substantially reduce the number of inference calls to the former.","correlation clustering, text clustering, learning-augmented algorithms, weak and strong signals, query efficient clustering, query efficient, budgeted clustering" "Credible, Sealed-bid, Optimal Repeated Auctions With Differentiable Economics",https://openreview.net/forum?id=b-WNV1iPro,https://openreview.net/pdf?id=b-WNV1iPro,"We propose an approach to run computationally efficient, credible, revenue-maximizing repeated auctions with cryptographic tools.","Online advertisement auctions happen billions of times per day. 
Bidders in auctions strategize to improve their own utility, subject to the published auction rules. Yet, bidders may not know whether an auction has been run as promised. A credible auction is one in which bidders can trust the auctioneer to run its allocation and pricing mechanisms as promised. It is known that, assuming no communication between bidders, no credible, sealed-bid, and incentive compatible (aka ``truth-telling'' or otherwise truthful-participation-incentivizing) mechanism can exist. In reality, bidders can certainly communicate, so what happens if we relax this (typically unrealistic) constraint? In this work, we propose a framework incorporating cryptography to allow computationally-efficient, credible, revenue-maximizing (aka ``optimal'') auctions in a repeated auction setting. Our contribution is twofold: first, we introduce a protocol for running repeated auctions with a verification scheme, and we show such a protocol can eliminate the auctioneer's incentive to deviate while costing negligible additional computation. Second, we provide a method for training optimal auctions under uncertain bidder participation profiles, which generalizes our protocol to a much wider class of auctions in the online ad market. Our empirical results show strong support for both the theory and the effectiveness of the proposed method. ","Mechanism Design, Differentiable Economics, Deep Learning, Zero-Knowledge Proofs" The Power of Feel-Good Thompson Sampling: A Unified Framework for Linear Bandits,https://openreview.net/forum?id=Kk-kJl9fmm,https://openreview.net/pdf?id=Kk-kJl9fmm,,"Linear contextual bandit is one of the most popular models in online decision-making with bandit feedback. Prior work has studied different variants of this model, e.g., misspecified, non-stationary, and multi-task/life-long linear contextual bandits. However, there is no single framework that can unify the algorithm design and analysis for these variants. In this paper, we propose a unified framework for linear contextual bandits based on feel-good Thompson sampling (Zhang, 2021). The algorithm derived from our framework achieves nearly minimax optimal regret in various settings and resolves the respective open problem in each setting. Specifically, letting $d$ be the dimension of the context and $T$ be the length of the horizon, our algorithm achieves an $\widetilde{\mathcal{O}}(d\sqrt{ST})$ regret bound for non-stationary linear bandits with at most $S$ switches, $\widetilde{\mathcal{O}}(d^{\frac{5}{6}} T^{\frac{2}{3}} P^{\frac{1}{3}})$ regret for non-stationary linear bandits with bounded path length $P$, and $\widetilde{\mathcal{O}}(d\sqrt{kT} + \sqrt{dkMT})$ regret for (generalized) lifelong linear bandits over $M$ tasks that share an unknown representation of dimension $k$. We believe our framework will shed light on the design and analysis of other linear contextual bandit variants.", Two-Tailed Averaging: Anytime Adaptive Once-in-a-while Optimal Iterate Averaging for Stochastic Optimization,https://openreview.net/forum?id=W1cQ9FPFdb,https://openreview.net/pdf?id=W1cQ9FPFdb,New approximately optimal iterate averaging algorithm with no hyperparameters that approximates the optimal average at all optimization steps.,"Tail averaging improves on Polyak averaging's non-asymptotic behaviour by excluding a number of leading iterates of stochastic optimization from its calculations. 
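For reference, the classical fixed-burn-in tail average that this abstract improves on is simply the running mean of the iterates after a chosen step s; the two-tailed variant described next replaces the fixed s with two adaptively sized running averages. This sketch shows only the classical baseline.

```python
# Plain tail averaging: average iterates[s:] with a numerically stable
# running mean (each iterate is a list of parameter values).
def tail_average(iterates, s: int):
    avg, count = None, 0
    for t, w in enumerate(iterates):
        if t < s:
            continue                          # skip the first s iterates
        count += 1
        if avg is None:
            avg = list(w)
        else:
            avg = [a + (wi - a) / count for a, wi in zip(avg, w)]
    return avg

print(tail_average([[1.0], [2.0], [3.0], [4.0]], s=2))  # -> [3.5]
```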
In practice, with a finite number of optimization steps and a learning rate that cannot be annealed to zero, tail averaging can get much closer to a local minimum point of the training loss than either the individual iterates or the Polyak average. However, the number of leading iterates to ignore is an important hyperparameter, and starting averaging too early or too late leads to inefficient use of resources or suboptimal solutions. Setting this hyperparameter to improve generalization is even more difficult, especially in the presence of other hyperparameters and overfitting. Furthermore, before averaging starts, the loss is only weakly informative of the final performance, which makes early stopping unreliable. To alleviate these problems, we propose an anytime variant of tail averaging that has no hyperparameters and approximates the optimal tail at all optimization steps. Our algorithm is based on two running averages with adaptive lengths bounded in terms of the optimal tail length, one of which achieves approximate optimality with some regularity. Requiring only additional storage for two sets of weights and periodic evaluation of the loss, the proposed two-tailed averaging algorithm is a practical and widely applicable method for improving stochastic optimization. ","optimization, polyak, iterate averaging, anytime, adaptive, online" Reward Design with Language Models,https://openreview.net/forum?id=10uNUgI5Kl,https://openreview.net/pdf?id=10uNUgI5Kl,We make reward design easier by using large language models (like GPT-3) as a proxy for a user's reward function given that a user provides a few examples (or a description) of the desired behavior.,"Reward design in reinforcement learning (RL) is challenging since specifying human notions of desired behavior may be difficult via reward functions or require many expert demonstrations. Can we instead cheaply design rewards using a natural language interface? This paper explores how to simplify reward design by using a large language model (LLM) such as GPT-3 as a proxy reward function, where the user provides a textual prompt containing a few examples (few-shot) or a description (zero-shot) of desired behavior. Our approach leverages this proxy reward function in an RL framework. Specifically, users specify a prompt once at the beginning of training. During training, the LLM evaluates an RL agent's behavior against the desired behavior described by the prompt and outputs a corresponding reward signal. The RL agent then uses this reward to update its behavior. We evaluate whether our approach can train agents aligned with user objectives in the Ultimatum Game, matrix games, and the DealOrNoDeal negotiation task. In all three tasks, we show that RL agents trained with our framework are well-aligned with the user's objectives and outperform RL agents trained with reward functions learned via supervised learning. ","reward design, foundation models, gpt3, reward specification, reinforcement learning, human-ai interaction" Calibrating the Rigged Lottery: Making All Tickets Reliable,https://openreview.net/forum?id=KdwnGErdT6,https://openreview.net/pdf?id=KdwnGErdT6,,"Although sparse training has been successfully used in various deep learning tasks to save memory and reduce inference time, the reliability of the produced sparse models remains unexplored. Previous research has shown that deep neural networks tend to be over-confident, and we find that sparse training exacerbates this problem. 
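The calibration metric reported later in this abstract (ECE) has a standard equal-width-bin definition, sketched here in its common textbook form; this is generic reference code, not code from the paper.

```python
# Expected Calibration Error: frequency-weighted gap between accuracy and
# confidence across equal-width confidence bins.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap          # weight by fraction of samples in bin
    return ece

print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 0, 1, 1]))
```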
Therefore, calibrating the sparse models is crucial for reliable prediction and decision making. In this paper, we propose a new sparse training method to produce sparse models with improved confidence calibration. In contrast to previous research that uses only one mask to control the sparse topology, our method utilizes two masks: a deterministic mask and a random mask. The former efficiently searches and activates important weights by exploiting the magnitude of weights and gradients, while the latter brings better exploration and finds more appropriate weight values through random updates. Theoretically, we prove our method can be viewed as a hierarchical variational approximation of a probabilistic deep Gaussian process. Extensive experiments on multiple datasets, model architectures, and sparsities show that our method can reduce ECE values by up to 47.8\% and simultaneously maintain or even improve accuracy with only a slight increase in computational and storage burden.", Replay Buffer with Local Forgetting for Adaptive Deep Model-Based Reinforcement Learning,https://openreview.net/forum?id=uWpq1-rQbV,https://openreview.net/pdf?id=uWpq1-rQbV,,"One of the key behavioral characteristics used in neuroscience to determine whether the subject of study---be it a rodent or a human---exhibits model-based learning is effective adaptation to local changes in the environment. In reinforcement learning, however, recent work has shown that modern deep model-based reinforcement-learning (MBRL) methods adapt poorly to such changes. An explanation for this mismatch is that MBRL methods are typically designed with sample-efficiency on a single task in mind and the requirements for effective adaptation are substantially higher, both in terms of the learned world model and the planning routine. One particularly challenging requirement is that the learned world model has to be sufficiently accurate throughout relevant parts of the state-space. This is challenging for deep-learning-based world models due to catastrophic forgetting. While a replay buffer can mitigate the effects of catastrophic forgetting, the traditional first-in-first-out replay buffer precludes effective adaptation because it maintains stale data. In this work, we show that a conceptually simple variation of this traditional replay buffer is able to overcome this limitation. By removing from the buffer only those samples that lie in the local neighbourhood of the newly observed samples, deep world models can be built that maintain their accuracy across the state-space, while also being able to effectively adapt to changes in the reward function. We demonstrate this by applying our replay-buffer variation to the classical Dyna method, as well as to recent methods such as PlaNet and DreamerV2, showing for the first time that deep model-based methods are able to achieve effective adaptation.", Contrastive Audio-Visual Masked Autoencoder,https://openreview.net/forum?id=QPtMRyk5rb,https://openreview.net/pdf?id=QPtMRyk5rb,"We propose the Contrastive Audio-Visual Masked Auto-Encoder that combines contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation.","In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities. 
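As a pointer for the MAE extension just mentioned, the random patch masking at the core of MAE-style models can be sketched as follows; ratios and shapes are illustrative, not CAV-MAE's actual configuration.

```python
# MAE-style random masking of a token sequence (audio or visual patches):
# keep a random subset of tokens and record the permutation for unmasking.
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """tokens: (B, N, D) -> (visible (B, K, D), shuffle indices (B, N))."""
    b, n, d = tokens.shape
    k = int(n * (1 - mask_ratio))            # number of visible tokens
    noise = torch.rand(b, n)                 # random score per token
    ids_shuffle = noise.argsort(dim=1)       # random permutation per sample
    ids_keep = ids_shuffle[:, :k]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(b, k, d))
    return visible, ids_shuffle

visible, ids = random_masking(torch.randn(2, 196, 768))
```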
Subsequently, we propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) by combining contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation. Our experiments show that the contrastive audio-visual correspondence learning objective not only enables the model to perform audio-visual retrieval tasks, but also helps the model learn a better joint representation. As a result, our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound, and is comparable with the previous best supervised pretrained model on AudioSet in the audio-visual event classification task.","multi-modal learning, audio-visual learning, self-supervised learning, masked autoencoder, contrastive learning" Pessimistic Model-Based Actor-Critic for Offline Reinforcement Learning: Theory and Algorithms,https://openreview.net/forum?id=VQfSsOTrLIy,https://openreview.net/pdf?id=VQfSsOTrLIy,,"Model-based offline reinforcement learning (RL) has achieved superior performance to model-free RL in many decision-making problems due to its sample efficiency and generalizability. However, prior model-based offline RL methods in the literature either demonstrate their successes only through empirical studies, or provide algorithms that have theoretical guarantees but are hard to implement in practice. To date, a general computationally-tractable algorithm for model-based offline RL with PAC guarantees is still lacking. To fill this gap, we develop a pessimistic model-based actor-critic (PeMACO) algorithm with general function approximations assuming partial coverage of the offline dataset. Specifically, the critic provides a pessimistic Q-function by incorporating uncertainties of the learned transition model, and the actor updates policies by employing approximations of the pessimistic Q-function. Under some mild assumptions, we establish theoretical PAC guarantees of the proposed PeMACO algorithm by proving upper bounds on the suboptimality of the policy returned by PeMACO.","Actor-critic, Model-based offline RL, PAC guarantee, Pessimism" The Asymmetric Maximum Margin Bias of Quasi-Homogeneous Neural Networks,https://openreview.net/forum?id=IM4xp7kGI5V,https://openreview.net/pdf?id=IM4xp7kGI5V,"We generalize implicit max-margin bias to a class of models which describes nearly all networks, identifying a competition between maximizing margin and minimizing an asymmetric parameter norm, which can degrade robustness and explain Neural Collapse","In this work, we explore the maximum-margin bias of quasi-homogeneous neural networks trained with gradient flow on an exponential loss and past a point of separability. We introduce the class of quasi-homogeneous models, which is expressive enough to describe nearly all neural networks with homogeneous activations, even those with biases, residual connections, and normalization layers, while structured enough to enable geometric analysis of its gradient dynamics. Using this analysis, we generalize the existing results of maximum-margin bias for homogeneous networks to this richer class of models. We find that gradient flow implicitly favors a subset of the parameters, unlike in the case of a homogeneous model where all parameters are treated equally. We demonstrate through simple examples how this strong favoritism toward minimizing an asymmetric norm can degrade the robustness of quasi-homogeneous models. 
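For background on the result being generalized: for $L$-homogeneous networks, where $f(c\theta; x) = c^L f(\theta; x)$, gradient flow on exponential-type losses is known to converge in direction to KKT points of the symmetric margin-maximization program below. This is the standard prior statement, included here only for context; the paper's contribution is the asymmetric analogue for quasi-homogeneous models.

```latex
% Classical max-margin bias of homogeneous networks (background, not the
% paper's theorem): convergence in direction to KKT points of
\min_{\theta} \; \tfrac{1}{2}\|\theta\|_2^2
\quad \text{s.t.} \quad y_i \, f(\theta; x_i) \ge 1 \quad \forall i .
```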
On the other hand, we conjecture that this norm-minimization discards, when possible, unnecessary higher-order parameters, reducing the model to a sparser parameterization. Lastly, by applying our theorem to sufficiently expressive neural networks with normalization layers, we reveal a universal mechanism behind the empirical phenomenon of Neural Collapse.","margin, maximum-margin, implicit regularization, neural networks, neural collapse, gradient flow, implicit bias, robustness, homogeneous, symmetry, classification" Soft Diffusion: Score Matching For General Corruptions,https://openreview.net/forum?id=QsVditUhXR,https://openreview.net/pdf?id=QsVditUhXR,We define a broader family of corruption processes that generalizes previously known diffusion models and we show how to learn to reverse them.,"We define a broader family of corruption processes that generalizes previously known diffusion models. To reverse these general diffusions, we propose a new objective called Soft Score Matching that provably learns the score function for any linear corruption process and yields state-of-the-art results for CelebA. Soft Score Matching incorporates the degradation process in the network. Our new loss trains the model to predict a clean image that, after corruption, matches the diffused observation. We show that our objective learns the gradient of the likelihood under suitable regularity conditions for a family of corruption processes. We further develop a principled way to select the corruption levels for general diffusion processes and a novel sampling method that we call Momentum Sampler. We show experimentally that our framework works for general linear corruption processes, such as Gaussian blur and masking. We achieve a state-of-the-art FID score of $1.85$ on CelebA-64, outperforming all previous linear diffusion models. We also show significant computational benefits compared to vanilla denoising diffusion.","diffusion, score-based models, generative models" Open-Vocabulary Panoptic Segmentation MaskCLIP,https://openreview.net/forum?id=zWudXc9343,https://openreview.net/pdf?id=zWudXc9343,,"In this paper, we tackle an emerging computer vision task, open-vocabulary panoptic segmentation, which aims to perform panoptic segmentation (background semantic labeling + foreground instance segmentation) for arbitrary categories of text-based descriptions at inference time. We first build a baseline method by directly adopting pre-trained CLIP models without finetuning or distillation. We then develop MaskCLIP, a Transformer-based approach with a Relative Mask Attention (RMA) module. The RMA is an encoder-only module that seamlessly integrates mask tokens with a pre-trained ViT CLIP model for semantic/instance segmentation and class prediction. MaskCLIP learns to efficiently and effectively utilize pre-trained dense/local CLIP features within the RMA, which avoids the time-consuming student-teacher training process. We obtain encouraging results for open-vocabulary panoptic/instance segmentation and state-of-the-art results for semantic segmentation on the ADE20K and PASCAL datasets. 
We show qualitative illustrations of MaskCLIP with online custom categories.","open-vocabulary, panoptic segmentation, semantic segmentation, CLIP" Robust Federated Learning with Majority Adversaries via Projection-based Re-weighting,https://openreview.net/forum?id=adgYjVvm9Xy,https://openreview.net/pdf?id=adgYjVvm9Xy,This paper shows two methods for improving the adversarial robustness of federated learning under a majority adversary regime.,"Most robust aggregators for distributed or federated learning assume that adversarial clients are the minority in the system. In contrast, this paper considers the majority adversary setting. We first show that a filtering method using a few trusted clients can defend against many standard attacks. However, a new attack called Mimic-Shift can circumvent simple filtering. To this end, we develop a re-weighting strategy that identifies and down-weights the potential adversaries under the majority adversary regime. We show that our aggregator converges to a neighborhood around the optimum under the Mimic-Shift attack. Empirical results further show that our aggregator achieves negligible accuracy loss with a majority of adversarial clients, outperforming strong baselines.","Federated learning, robustness, adversarial attack, majority adversary" Double Wins: Boosting Accuracy and Efficiency of Graph Neural Networks by Reliable Knowledge Distillation,https://openreview.net/forum?id=NGIFt6BNvLe,https://openreview.net/pdf?id=NGIFt6BNvLe,,"The recent breakthrough achieved by graph neural networks (GNNs) with few labeled data accelerates the pace of deploying GNNs on real-world applications. While several efforts have been made to scale GNN training for large-scale graphs, GNNs still suffer from the scalability challenge of model inference, due to the graph dependency issue incurred by the message passing mechanism, thereby hindering their deployment in resource-constrained applications. A recent study~\citep{zhang2021graph} revealed that GNNs can be compressed to inference-friendly multi-layer perceptrons (MLPs), by training MLPs using the soft labels of labeled and unlabeled nodes from the teacher. However, blindly leveraging the soft labels of all unlabeled nodes may be suboptimal, since the teacher model would inevitably make wrong predictions. This intriguing observation motivates us to ask: \textit{Is it possible to train a stronger MLP student by making better use of the unlabeled data?} This paper studies cross-model knowledge distillation - from GNN teacher to MLP student in a semi-supervised setting, showing their strong promise in achieving a ``sweet spot'' in co-optimizing model accuracy and efficiency. Our proposed solution, dubbed \textit{Reliable Knowledge Distillation for MLP optimization} (\textbf{RKD-MLP}), is the first noise-aware knowledge distillation framework for GNN distillation. Its core idea is to use a meta-policy to filter out unreliable soft labels. To train the meta-policy, we design a reward-driven objective based on a meta-set and adopt policy gradient to optimize the expected reward. Then we apply the meta-policy to the unlabeled nodes and select the most reliable soft labels for distillation. Extensive experiments across various GNN backbones, on 7 small graphs and 2 large-scale datasets from the challenging Open Graph Benchmark, demonstrate the superiority of our proposal. Moreover, our RKD-MLP model shows good robustness w.r.t. graph topology and node feature noises. 
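A toy sketch of the noise-aware distillation step described above; here a simple confidence threshold stands in for the learned meta-policy, so this is an illustrative simplification rather than RKD-MLP itself.

```python
# GNN-to-MLP distillation with a reliability filter: only soft labels judged
# reliable (stand-in: high teacher confidence) contribute to the KL loss.
import torch
import torch.nn.functional as F

def reliable_distill_loss(student_logits, teacher_logits,
                          conf_thresh: float = 0.9) -> torch.Tensor:
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    reliable = teacher_probs.max(dim=-1).values >= conf_thresh
    if not reliable.any():                    # no reliable labels in this batch
        return student_logits.sum() * 0.0
    return F.kl_div(F.log_softmax(student_logits[reliable], dim=-1),
                    teacher_probs[reliable], reduction="batchmean")

loss = reliable_distill_loss(torch.randn(32, 7), torch.randn(32, 7) * 5)
```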
The code is available at \url{https://anonymous.4open.science/r/RKD-MLP-F2A6/}.","Graph Neural Networks, Reliable Knowledge Distillation, Model Inference Acceleration" "A Statistical Framework for Personalized Federated Learning and Estimation: Theory, Algorithms, and Privacy",https://openreview.net/forum?id=FUiDMCr_W4o,https://openreview.net/pdf?id=FUiDMCr_W4o,We utilize a statistical framework to enable our design of new personalized Federated Learning/Estimation algorithms with privacy guarantees.,"A distinguishing characteristic of federated learning is that the (local) client data could have statistical heterogeneity. This heterogeneity has motivated the design of personalized learning, where individual (personalized) models are trained through collaboration. There have been various personalization methods proposed in the literature, with seemingly very different forms, ranging from the use of a single global model for local regularization and model interpolation to the use of multiple global models for personalized clustering. In this work, we begin with a generative framework that could potentially unify several different algorithms as well as suggest new algorithms. We apply our generative framework to personalized estimation, and connect it to the classical empirical Bayes methodology. We develop private personalized estimation under this framework. We then use our generative framework to propose new personalized learning algorithms, including AdaPeD, which is based on knowledge distillation and numerically outperforms several known algorithms. We develop privacy guarantees for our personalized learning methods, covering user-level privacy and composition. We numerically evaluate the performance as well as the privacy of both the estimation and learning problems, demonstrating the advantages of our proposed methods. ","Personalized Federated Learning, Personalized Statistical Estimation, Differential Privacy, Empirical/Hierarchical Bayes" Invariant Aggregator for Defending against Federated Backdoor Attacks,https://openreview.net/forum?id=3ZHX6_Mydd7,https://openreview.net/pdf?id=3ZHX6_Mydd7,This paper shows how to defend against federated backdoor attacks by focusing on the invariant directions in the model optimization trajectory. ,"Federated learning is gaining popularity as it enables training of high-utility models across several clients without directly sharing their private data. As a downside, the federated setting makes the model vulnerable to various adversarial attacks in the presence of malicious clients. Specifically, an adversary can perform backdoor attacks to control model predictions via poisoning the training dataset with a trigger. In this work, we propose a mitigation for backdoor attacks in a federated learning setup. Our solution forces the model optimization trajectory to focus on the invariant directions that are generally useful for utility and avoids selecting directions that favor a few, possibly malicious, clients. Concretely, we consider the sign consistency of the pseudo-gradient (the client update) as an estimate of invariance. Following this, our approach performs dimension-wise filtering to remove pseudo-gradient elements with low sign consistency. Then, a robust mean estimator eliminates outliers among the remaining dimensions. Our theoretical analysis further shows the necessity of the defense combination and illustrates how our proposed solution defends the federated learning model. 
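A minimal sketch of the two-step aggregation recipe just described (dimension-wise sign-consistency filtering, then a robust mean); the threshold and trimming fraction are illustrative choices, not the paper's settings.

```python
# Aggregate client pseudo-gradients: drop dimensions whose update signs
# disagree across clients, then take a trimmed mean over the rest.
import numpy as np

def invariant_aggregate(updates: np.ndarray, sign_thresh: float = 0.6,
                        trim: float = 0.1) -> np.ndarray:
    """updates: (num_clients, dim) -> (dim,) aggregated update."""
    consistency = np.abs(np.mean(np.sign(updates), axis=0))  # per-dim, in [0, 1]
    keep = consistency >= sign_thresh                        # sign-consistent dims
    k = int(trim * updates.shape[0])
    sorted_updates = np.sort(updates, axis=0)
    trimmed_mean = sorted_updates[k: updates.shape[0] - k].mean(axis=0)
    return np.where(keep, trimmed_mean, 0.0)                 # zero filtered dims

agg = invariant_aggregate(np.random.randn(10, 5))
```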
Empirical results on three datasets with different modalities and varying numbers of clients show that our approach mitigates backdoor attacks with a negligible cost on the model utility. ","Federated learning, robustness, backdoor attack, invariant learning" Improving Adversarial Robustness of Deep Neural Networks via Self-adaptive Margin Defense,https://openreview.net/forum?id=a60Jo7_RUd2,https://openreview.net/pdf?id=a60Jo7_RUd2,,"Adversarial training has become the most popular and effective strategy to improve Deep Neural Network (DNN) robustness against adversarial noise. Many adversarial training methods have been proposed in the past few years. However, most adversarial training methods are highly susceptible to hyperparameters, especially the training noise upper bound. Tuning these parameters is expensive for large datasets and difficult for people not in the adversarial robustness research domain, which prevents adversarial training techniques from being used in many application fields. This paper introduces a new adversarial training method with a gradual expansion mechanism to generate adversarial training samples, and it is parameter-free for the user. By gradually expanding the exploration range with a self-adaptive and gradient-aware step size, adversarial training samples can be placed into the optimal locations in the input data space. Unlike other defense methods that usually need to fine-tune hyperparameters (e.g., the training noise upper bound) by grid search, our method has no hyperparameters for the user. We name our method Self-adaptive Margin Defense (SMD). We evaluate SMD on three publicly available datasets (CIFAR10, SVHN, and Fashion-MNIST) under the most popular adversarial attacks, AutoAttack and PGD. The results show that: (1) compared with all other competing defense methods, SMD has the best overall performance in robust accuracy on noisy data; (2) the accuracy degradation of SMD on clean data is among the smallest of all competing defense methods.","adversarial robustness, adversarial training, deep neural network" Laser: Latent Set Representations for 3D Generative Modeling,https://openreview.net/forum?id=uNHWPiNJBsV,https://openreview.net/pdf?id=uNHWPiNJBsV,Generative NeRF with fast inference that can handle large scenes and can inpaint unobserved parts of these scenes.,"NeRF provides unparalleled fidelity of novel view synthesis---rendering a 3D scene from an arbitrary viewpoint. NeRF requires training on a large number of views that fully cover a scene, which limits its applicability. While these issues can be addressed by learning a prior over scenes in various forms, previous approaches have either been applied to overly simple scenes or struggle to render unobserved parts. We introduce Laser-NV---a generative model which achieves high modelling capacity, and which is based on a set-valued latent representation modelled by normalizing flows. Similarly to previous amortized approaches, Laser-NV learns structure from multiple scenes and is capable of fast, feed-forward inference from few views. To encourage higher rendering fidelity and consistency with observed views, Laser-NV further incorporates a geometry-informed attention mechanism over the observed views. Laser-NV produces diverse and plausible completions of occluded parts of a scene while remaining consistent with observations.
Laser-NV shows state-of-the-art novel-view synthesis quality when evaluated on ShapeNet and on a novel simulated City dataset, which features high uncertainty in the unobserved regions of the scene.","generative models, nerf, computer vision, 3D scenes, novel view synthesis, variational auto-encoder" Towards Efficient Gradient-Based Meta-Learning in Heterogenous Environments,https://openreview.net/forum?id=xsNTv784iah,https://openreview.net/pdf?id=xsNTv784iah,"In our paper, we propose a nonparametric version of MAML which is able to solve problems in heterogeneous environments","A challenging problem for machine learning is few-shot learning, as its models usually require many training samples. Since meta-learning models have strong fine-tuning capabilities for the distribution of tasks, many of them have been applied to few-shot learning. Model-agnostic meta-learning (MAML) is one of the most popular ones. Recent studies showed that MAML-trained models tend to reuse learned features and do not perform strong adaptation, especially in the earlier layers. This paper presents a detailed analysis of this phenomenon by analyzing the components of different MAML variants. Our results show an interesting relationship between the importance of fine-tuning earlier layers and the difference in the distribution between training and testing. As a result, we identify a fundamental weakness of existing MAML variants when the task distribution is heterogeneous, e.g., when the numbers of classes do not match during testing and training. We propose a novel nonparametric version of MAML that overcomes these issues while still being able to perform cross-domain adaptation.","few-shot learning, heterogeneous datasets, cross-domain adaptation" Knowledge Cascade: Reverse Knowledge Distillation,https://openreview.net/forum?id=wHfVEDi_N9E,https://openreview.net/pdf?id=wHfVEDi_N9E,,"With the rapidly growing model complexity of state-of-the-art machine learning methods, the expensive model training process has rendered algorithm design and computational resource allocation challenging. To tackle these challenges, we propose the knowledge cascade (KCas), a strategy that reverses the idea of knowledge distillation (KD). While KD compresses and transfers the knowledge learned by a large-and-complex model (teacher model) to a small-and-simple model (student model), KCas inversely transfers the knowledge in a student model to a teacher model. Despite the fact that teacher models are more sophisticated and capable than student models, we show that in KCas, student models can effectively facilitate the building of teacher models by taking advantage of statistical asymptotic theories. We demonstrate the outstanding performance of KCas on nonparametric multivariate functional estimation in reproducing kernel Hilbert space. One of the crucial problems in accomplishing the estimation is the daunting computational cost of selecting smoothing parameters, whose number will increase exponentially as the number of predictors increases. KCas transfers the knowledge about the smoothing parameters of the target function learned from the student model to the teacher model based on empirical and asymptotic results. KCas significantly reduces the computational cost of the smoothing parameter selection process from $O(n^3)$ to $O(n^{3/4})$, while preserving excellent performance.
Theoretical analysis of asymptotic convergence rates and extensive empirical evaluations on simulated and real data validate the effectiveness of KCas.","Knowledge distillation, subsampling, large-scale data, nonparametric, reproducing kernel Hilbert space, asymptotic theory" Optimal Transport for Offline Imitation Learning,https://openreview.net/forum?id=MhuFzFsrfvH,https://openreview.net/pdf?id=MhuFzFsrfvH,We present an offline imitation learning based on optimal transport that demonstrates strong performance and sample efficiency,"With the advent of large datasets, offline reinforcement learning is a promising framework for learning good decision-making policies without the need to interact with the real environment. However, offline RL requires the dataset to be reward-annotated, which presents practical challenges when reward engineering is difficult or when obtaining reward annotations is labor-intensive. In this paper, we introduce Optimal Transport Relabeling (OTR), an imitation learning algorithm that can automatically relabel offline data of mixed and unknown quality with rewards from a few good demonstrations. OTR's key idea is to use optimal transport to compute an optimal alignment between an unlabeled trajectory in the dataset and an expert demonstration to obtain a similarity measure that can be interpreted as a reward, which can then be used by an offline RL algorithm to learn the policy. OTR is easy to implement and computationally efficient. On D4RL benchmarks, we demonstrate that OTR with a single demonstration can consistently match the performance of offline RL with ground-truth rewards. ","offline reinforcement learning, optimal transport, imitation learning" FedorAS: Federated Architecture Search under system heterogeneity,https://openreview.net/forum?id=t8Jk_Vo1jHS,https://openreview.net/pdf?id=t8Jk_Vo1jHS,FedorAS is a system that performs cross-device Federated Neural Architecture Search under heterogeneous system and data distributions.,"Federated learning (FL) has recently gained considerable attention due to its ability to learn on decentralised data while preserving client privacy. However, it also poses additional challenges related to the heterogeneity of the participating devices, both in terms of their computational capabilities and contributed data. Meanwhile, Neural Architecture Search (NAS) has been successfully used with centralised datasets, producing state-of-the-art results in constrained or unconstrained settings. However, such centralised datasets may not be always available for training. Most recent work at the intersection of NAS and FL attempts to alleviate this issue in a cross-silo federated setting, which assumes homogeneous compute environments with datacenter-grade hardware. In this paper we explore the question of whether we can design architectures of different footprints in a cross-device federated setting, where the device landscape, availability and scale are very different. To this end, we design our system, FedorAS, to discover and train promising architectures in a resource-aware manner when dealing with devices of varying capabilities holding non-IID distributed data. 
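For the OTR entry above, a simplified sketch of optimal-transport reward relabeling using the POT library; the uniform masses and the per-step cost scaling are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def otr_relabel(traj_obs, expert_obs, reg=0.01):
    """Align an unlabeled trajectory with one expert demonstration via
    entropic OT and turn the per-step transport cost into a reward."""
    C = ot.dist(traj_obs, expert_obs, metric="euclidean")  # pairwise costs
    a = np.full(len(traj_obs), 1.0 / len(traj_obs))        # uniform masses
    b = np.full(len(expert_obs), 1.0 / len(expert_obs))
    plan = ot.sinkhorn(a, b, C, reg)                       # optimal alignment
    step_cost = (plan * C).sum(axis=1) * len(traj_obs)     # cost per step
    return -step_cost                                      # reward labels
```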
We present empirical evidence of its effectiveness across different settings, spanning three different modalities (vision, speech, text), and showcase its superior performance compared to state-of-the-art federated solutions, while maintaining resource efficiency.","Federated Learning, Neural Architecture Search, Deep Learning, Efficient DNN Training" "Is Reinforcement Learning (Not) for Natural Language Processing?: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization",https://openreview.net/forum?id=8aHzds2uUyB,https://openreview.net/pdf?id=8aHzds2uUyB,"We provide an open-source framework, benchmark, and novel algorithm to train large language models to better align to automated measures of human preferences.","We tackle the problem of aligning pre-trained large language models (LMs) with human preferences. If we view text generation as a sequential decision-making problem, reinforcement learning (RL) appears to be a natural conceptual framework. However, using RL for LM-based generation faces empirical challenges, including training instability due to the combinatorial action space, as well as a lack of open-source libraries and benchmarks customized for LM alignment. Thus, a question arises in the research community: is RL a practical paradigm for NLP? To help answer this, we first introduce an open-source modular library, $RL4LMs$ (Reinforcement Learning for Language Models), for optimizing language generators with RL. The library consists of on-policy RL algorithms that can be used to train any encoder or encoder-decoder LM in the HuggingFace library (Wolf et al. 2020) with an arbitrary reward function. Next, we present the $GRUE$ (General Reinforced-language Understanding Evaluation) benchmark, a set of 6 language generation tasks which are supervised not by target strings, but by reward functions which capture automated measures of human preference. GRUE is the first leaderboard-style evaluation of RL algorithms for NLP tasks. Finally, we introduce an easy-to-use, performant RL algorithm, $NLPO$ (Natural Language Policy Optimization), that learns to effectively reduce the combinatorial action space in language generation. We show 1) that RL techniques are generally better than supervised methods at aligning LMs to human preferences; and 2) that NLPO exhibits greater stability and performance than previous policy gradient methods (e.g., PPO (Schulman et al. 2017)), based on both automatic and human evaluation.","natural language processing, reinforcement learning, language models, feedback learning" Towards A Unified View of Sparse Feed-Forward Network in Transformer,https://openreview.net/forum?id=lX478WYy0Up,https://openreview.net/pdf?id=lX478WYy0Up,"We present a unified framework for large and sparse feed-forward networks in transformer, and use it to arrive at a better method.","Large and sparse feed-forward networks (S-FFN) such as Mixture-of-Experts (MoE) have been demonstrated to be an efficient approach for scaling up Transformer model size for pretraining. By only activating part of the FFN parameters conditioned on the input, S-FFN improves generalization performance while keeping training and inference cost (in FLOPs) fixed. A growing body of work has focused on improving the S-FFN design, including routing and load balancing methods in the context of MoEs. Previously, another line of work, motivated by a neural memory perspective, developed sparse neural memory techniques for S-FFN.
This work merges the two seemingly different lines of work. We present a unified framework to categorize design choices along two axes: memory block size and memory block selection method. Using this unified framework, we compare several S-FFN architectures for language modeling and provide insights into their relative efficacy and efficiency. We show that a smaller memory block size leads to lower perplexity. Additionally, we find that selection through a gate, in general, improves the perplexity-FLOPs trade-off but has worse perplexity than selection using hidden states without a gate. Based on these insights, we propose a new selection method — Avg-K that selects blocks through their mean aggregated hidden states. With 1% additional FLOPs, Avg-K achieves 2.16 lower perplexity than a vanilla transformer (16.96), outperforming Switch Transformer (16.45).","Mixture of Expert, Neural Memory, Pre-trained Language Model, NLP" Learning multi-scale local conditional probability models of images,https://openreview.net/forum?id=VZX2I_VVJKH,https://openreview.net/pdf?id=VZX2I_VVJKH,"We develop a spatially Markov wavelet conditional probability model for images, and demonstrate (through, denoising, super-resolution and synthesis) its effectiveness in capturing global dependencies.","Deep neural networks can learn powerful prior probability models for images, as evidenced by high-quality synthesis results achieved with VAEs, GANs, and recent score-based diffusion methods. But these models are implicit, and the means by which these networks capture complex global statistical structure, apparently without suffering from the curse of dimensionality, remain a mystery. To study this, we generalize a multi-scale model class motivated by the renormalization group of theoretical physics. It circumvents the curse of dimensionality by assuming Markov structure of multi-scale wavelet coefficients conditioned on coarser scale coefficients. We parameterize the model using conditional convolutional neural networks with local receptive fields, which enforce stationary Markov properties. We test the capabilities of the model on a dataset of face images, which are highly non-stationary and contain long-range geometric structures. Remarkably, denoising, super-resolution, and image synthesis results demonstrate that these structures are well-captured, using significantly smaller neighborhoods than required by a CNN operating in the pixel domain.","Image priors, Markov wavelet conditional models, multi-scale score-based image synthesis, denoising, super-resolution" Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions,https://openreview.net/forum?id=zyLVMgsZ0U_,https://openreview.net/pdf?id=zyLVMgsZ0U_,"We prove that given an L2-accurate score estimate, diffusion models can sample from (essentially) any data distribution, even if it is highly non-log-concave and/or supported on a low dimensional manifold.","We provide theoretical convergence guarantees for score-based generative models (SGMs) such as denoising diffusion probabilistic models (DDPMs), which constitute the backbone of large-scale real-world generative models such as DALL$\cdot$E 2. Our main result is that, assuming accurate score estimates, such SGMs can efficiently sample from essentially any realistic data distribution. 
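A hedged sketch of the Avg-K selection described in the S-FFN entry above; the block-key parameterization and segment granularity are assumptions for illustration, not the paper's exact design:

```python
import torch

def avg_k_select(hidden, block_keys, k=2):
    """Score memory blocks against the mean-aggregated hidden states of a
    segment and activate the top-k blocks for all of its tokens.

    hidden: (tokens, d); block_keys: (num_blocks, d).
    """
    query = hidden.mean(dim=0)              # mean aggregation over tokens
    scores = block_keys @ query             # one score per memory block
    return torch.topk(scores, k).indices    # indices of activated blocks
```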
In contrast to prior works, our results (1) hold for an $L^2$-accurate score estimate (rather than $L^\infty$-accurate); (2) do not require restrictive functional inequality conditions that preclude substantial non-log-concavity; (3) scale polynomially in all relevant problem parameters; and (4) match state-of-the-art complexity guarantees for discretization of the Langevin diffusion, provided that the score error is sufficiently small. We view this as strong theoretical justification for the empirical success of SGMs. We also examine SGMs based on the critically damped Langevin diffusion (CLD). Contrary to conventional wisdom, we provide evidence that the use of the CLD does *not* reduce the complexity of SGMs.","diffusion models, score-based generative models, sampling, score estimation, Langevin, stochastic differential equations" Online Continual Learning with Feedforward Adaptation,https://openreview.net/forum?id=biGSK6L5JiT,https://openreview.net/pdf?id=biGSK6L5JiT,We propose an online adaptation method with feedforward compensation.,"Recently, deep learning has been widely used in time-series prediction tasks. Although a trained deep neural network model typically performs well on the training set, its performance can drop significantly on a test set under slight distribution shifts. This challenge motivates the adoption of online adaptation algorithms that update the prediction models in real time to improve prediction performance. Existing online adaptation methods optimize the prediction model by feeding back the latest prediction error computed with respect to the latest observation. However, such a feedback-based approach is prone to forgetting past information. In this work, we propose an online adaptation method with feedforward compensation, which uses critical data samples from a memory buffer, instead of the latest samples, to optimize the prediction model. We prove that the proposed feedforward approach has a smaller error bound than the feedback approach in slowly time-varying systems. The experiments on several time-series prediction tasks show that the proposed feedforward adaptation outperforms conventional feedback adaptation by more than 10%. In addition, the proposed feedforward adaptation method is able to estimate an uncertainty bound of the prediction that is agnostic to the specific optimizer, which existing feedback adaptation cannot. ","Online adaptation, Online Learning, Continual Learning" Understanding ReLU Network Robustness Through Test Set Certification Performance,https://openreview.net/forum?id=8cST_EWo9X,https://openreview.net/pdf?id=8cST_EWo9X,Robustness certificates for ReLU networks are strongly correlated with network accuracy for data in-distribution and are highly unreliable for data out-of-distribution.,"Neural networks might exhibit weak robustness against input perturbations within the learning distribution, and this weakness can become more severe under distributional shifts or for data outside the distribution. For their safer use, robustness certificates provide formal guarantees of the stability of the prediction in the vicinity of the input. However, the relationship between correctness and robustness remains unclear. In this work, we investigate the unexpected outcomes of verification methods applied to piecewise linear classifiers for clean, perturbed, in- and out-of-distribution samples.
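Returning to the feedforward adaptation entry above, a toy sketch for a linear predictor; the buffer format and the error-based criticality criterion are illustrative assumptions, not the paper's method in full:

```python
import numpy as np

def feedforward_adapt(w, buffer, lr=1e-2, k=8):
    """Update a linear predictor y ~ w @ x on the k most critical buffered
    samples instead of only the newest observation.

    buffer: list of (x, y, stored_error) tuples.
    """
    critical = sorted(buffer, key=lambda s: abs(s[2]), reverse=True)[:k]
    for x, y, _ in critical:
        err = w @ x - y
        w = w - lr * err * x            # gradient step on squared error
    return w
```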
In our experiments, we conduct a thorough analysis for image classification tasks and show that robustness certificates are strongly correlated with prediction correctness for in-distribution data. In addition, we provide a theoretical demonstration that formal verification methods robustly certify samples sufficiently far from the training distribution. These results are integrated with an experimental analysis and demonstrate their weakness compared to standard out-of-distribution detection methods.","Robustness Certificates, Robust Machine Learning, Out-Of-Distribution Detection" Mind the Privacy Budget: How Generative Models Spend their Privacy Budgets,https://openreview.net/forum?id=sMsShmoszg,https://openreview.net/pdf?id=sMsShmoszg,We analyze the specific steps in which different DP generative approaches ``spend'' their privacy budget and evaluate the effects on downstream tasks performance with increasingly wider and taller training datasets.,"Numerous Differentially Private (DP) generative models have been presented that aim to produce synthetic data while minimizing privacy risks. As there is no single model that works well in all settings, empirical analysis is needed to establish and optimize trade-offs vis-\`a-vis the intended use of the synthetic data. In this paper, we identify and address several challenges in the empirical evaluation of such models. First, we analyze the steps in which different algorithms ``spend'' their privacy budget. We evaluate the effects on the performance of downstream tasks to identify problem settings they are most likely to be successful at. Then, we experiment with increasingly wider and taller training sets with various features, decreasing privacy budgets, and different DP mechanisms and generative models. Our empirical evaluation, performed on both graphical and deep generative models, sheds light on the distinctive features of different models/mechanisms that make them well-suited for different settings and tasks. Graphical models distribute the privacy budget horizontally and cannot handle relatively wide datasets, while the performance on the task they were optimized for monotonically increases with more data. Deep generative models spend their budget per iteration, and their behavior is less predictable with varying dataset dimensions, but could perform better if trained on more features. Also, low levels of privacy ($\epsilon\geq100$) could help some models generalize, achieving better results than without applying DP.","synthetic data, differential privacy, generative models, graphical models, GANs" Resource Efficient Self-Supervised Learning for Speech Recognition,https://openreview.net/forum?id=L9pW5fknjO,https://openreview.net/pdf?id=L9pW5fknjO,,"Representation learning from sequential data using self-supervised learning (SSL) has proven to be a powerful technique and improved state-of-the-art (SOTA) results when fine tuned for various downstream tasks, including Automatic Speech Recognition (ASR). So far the success of SSL frameworks, e.g., Wav2Vec-2.0, for sequence-to-sequence (seq2seq) modeling is primarily carried out by masking intermediate features and then solving a contrastive task in an end-to-end manner. Although very successful, the overall training time (for example, days or weeks) and demanding resource requirements for achieving SOTA performance remain a significant barrier to further improving ASR solutions using such approaches. 
In this work we show that non-contrastive learning, such as an extension of the Barlow–Twins methodology, when applied to seq2seq SSL modeling improves convergence, while reducing training time. Our results show that Wav2Vec-2.0 architecture pre-training with a non-contrastive SSL approach reduces the GPU training hours by 2.3 times, compared to masking based SSL approaches, while achieving a significant improvement (i.e., up to 6% relative WER decrease) in the model performance for the ASR task. We further demonstrate that a combination of both masking based SSL and non-contrastive SSL improves the ASR performance, e.g., up to 12% relative WER decrease, for all splits of LibriSpeech evaluation dataset. ","SSL, ASR" Subsampling in Large Graphs Using Ricci Curvature,https://openreview.net/forum?id=w9WUQkBvpI,https://openreview.net/pdf?id=w9WUQkBvpI,,"In the past decades, many large graphs with millions of nodes have been collected/constructed. The high computational cost and significant visualization difficulty hinder the analysis of large graphs. To overcome the difficulties, researchers have developed many graph subsampling approaches to provide a rough sketch that preserves global properties. By selecting representative nodes, these graph subsampling methods can help researchers estimate the graph statistics, e.g., the number of communities, of the large graph from the subsample. However, the available subsampling methods, e.g., degree node sampler and random walk sampler, tend to leave out minority communities because nodes with high degrees are more likely to be sampled. To overcome the shortcomings of the existing methods, we are motivated to apply the community information hidden in the graph to the subsampling method. Though the community structure is unavailable, community structure information can be obtained by applying geometric methods to a graph. An analog of Ricci curvature in the manifold is defined for the graph, i.e., Ollivier Ricci curvature. Based on the asymptotic results about the within-community edge and between-community edge's OR curvature, we propose a subsampling algorithm based on our theoretical results, the Ollivier-Ricci curvature Gradient-based subsampling (ORG-sub) algorithm. The proposed ORG-sub algorithm has two main contributions: First, ORG-sub provides a rigorous theoretical guarantee that the probability of ORG-sub taking all communities into the final subgraph converges to one. Second, extensive experiments on synthetic and benchmark datasets demonstrate the advantages of our algorithm.","Graph subsampling, Ricci curvature" Membership Leakage in Pre-trained Language Models,https://openreview.net/forum?id=3vzguDiEOr,https://openreview.net/pdf?id=3vzguDiEOr,This paper evaluates membership leakage of pre-trained language models,"Pre-trained language models are becoming a dominating component in NLP domain and have achieved state-of-the-art in various downstream tasks. Recent research has shown that language models are vulnerable to privacy leakage of their training data, such as text extraction and membership leakage. However, existing works against NLP applications mainly focus on the privacy leakage of text generation and downstream classification, and the privacy leakage of pre-trained language models is largely unexplored. In this paper, we take the first step toward systematically auditing the privacy risks of pre-trained language models through the lens of membership leakage. 
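The Ollivier-Ricci curvature underlying ORG-sub above can be computed per edge with optimal transport; here is a minimal sketch using networkx and POT, where the lazy-walk parameter alpha is an assumption for illustration:

```python
import networkx as nx
import numpy as np
import ot

def ollivier_ricci(G, x, y, alpha=0.5):
    """Curvature of edge (x, y): kappa = 1 - W1(m_x, m_y) / d(x, y), where
    m_v keeps mass alpha at v and spreads the rest over its neighbors."""
    def measure(v):
        nbrs = list(G.neighbors(v))
        support = [v] + nbrs
        mass = np.array([alpha] + [(1 - alpha) / len(nbrs)] * len(nbrs))
        return support, mass

    sx, mx = measure(x)
    sy, my = measure(y)
    # Ground costs are shortest-path distances between the two supports.
    C = np.array([[nx.shortest_path_length(G, u, v) for v in sy] for u in sx],
                 dtype=float)
    w1 = ot.emd2(mx, my, C)
    return 1.0 - w1 / nx.shortest_path_length(G, x, y)
```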
In particular, we focus on membership leakage of pre-training data when downstream models adapted from pre-trained language models are exposed. We conduct extensive experiments on a variety of pre-trained model architectures and different types of downstream tasks. Our empirical evaluations demonstrate that membership leakage of pre-trained language models exists even when only the downstream model output is exposed, thereby posing a more severe risk than previously thought. We further conduct sophisticated ablation studies to analyze the relationship between membership leakage of pre-trained models and the characteristics of downstream tasks, which can guide developers or researchers to be vigilant about the vulnerability of pre-trained language models. Lastly, we explore possible defenses against membership leakage of PLMs and propose two promising defenses based on empirical evaluations.","membership leakage, pre-trained language models, natural language processing" DSI++: Updating Transformer Memory with New Documents,https://openreview.net/forum?id=XkwkFYPT6t,https://openreview.net/pdf?id=XkwkFYPT6t,"We introduce DSI++, a continual learning challenge for DSI that requires incrementally adding documents to the model and propose a two-fold approach focusing on training dynamics and data-based regularization to enable it.","Differentiable Search Indices (DSIs) encode a corpus of documents in the parameters of a model and use the same model to map queries directly to relevant document identifiers. Despite the solid performance of DSI models, successfully deploying them in scenarios where document corpora change with time is an open problem. In this work, we introduce DSI++, a continual learning challenge for DSI with the goal of continuously indexing new documents while being able to answer queries related to both previously and newly indexed documents. Across different model scales and document identifier representations, we show that continual indexing of new documents leads to considerable forgetting of previously indexed documents. We also hypothesize and verify that the model experiences forgetting events during training, leading to unstable learning. To mitigate these issues, we investigate two approaches. The first focuses on modifying the training dynamics. Flatter minima implicitly alleviate forgetting, so we explicitly optimize for flatter loss basins and show that the model stably memorizes more documents (+12\%). Next, we introduce a parametric memory to generate pseudo-queries for documents and supplement them during incremental indexing to prevent forgetting for the retrieval task. Extensive experiments on a novel continual indexing benchmark based on Natural Questions demonstrate that our proposed solution mitigates the forgetting in DSI++ by a significant margin and improves the average Hits@10 by $+21.1\%$ over competitive baselines.","Differentiable Search Index, Transformer Memory, Catastrophic Forgetting, Continual Learning, Lifelong Learning, Incremental Learning" Universal Few-shot Learning of Dense Prediction Tasks with Visual Token Matching,https://openreview.net/forum?id=88nT0j5jAn,https://openreview.net/pdf?id=88nT0j5jAn,a universal few-shot learner for general dense prediction tasks,"Dense prediction tasks are a fundamental class of problems in computer vision. As supervised methods suffer from high pixel-wise labeling cost, a few-shot learning solution that can learn any dense task from a few labeled images is desired.
Yet, current few-shot learning methods target a restricted set of tasks such as semantic segmentation, presumably due to challenges in designing a general and unified model that is able to flexibly and efficiently adapt to arbitrary tasks of unseen semantics. We propose Visual Token Matching (VTM), a universal few-shot learner for arbitrary dense prediction tasks. It employs non-parametric matching on patch-level embedded tokens of images and labels that encapsulates all tasks. Also, VTM flexibly adapts to any task with a tiny amount of task-specific parameters that modulate the matching algorithm. We implement VTM as a powerful hierarchical encoder-decoder architecture involving ViT backbones where token matching is performed at multiple feature hierarchies. We evaluate VTM on a challenging variant of the Taskonomy dataset and observe that it robustly few-shot learns various unseen dense prediction tasks. Surprisingly, it is competitive with fully supervised baselines using only 10 labeled examples of novel tasks ($0.004\%$ of full supervision) and sometimes outperforms them using $0.1\%$ of full supervision.","few-shot learning, dense prediction tasks" The Game of Hidden Rules: A New Challenge for Machine Learning,https://openreview.net/forum?id=YhKScHeK4Ed,https://openreview.net/pdf?id=YhKScHeK4Ed,We present a new learning environment allowing researchers to rigorously study how the characteristics of learning tasks affect difficulty.,"Systematic examination of learning tasks remains an important but understudied area of machine learning (ML) research. To date, most ML research has focused on measuring performance on new tasks or surpassing state-of-the-art performance on existing tasks. These efforts are vital but do not explain why some tasks are more difficult than others. Understanding how task characteristics affect difficulty is critical to formalizing ML's strengths and limitations; a rigorous assessment of which types of tasks are well-suited to a specific algorithm and, conversely, which algorithms are well-suited to a specific task would mark an important step forward for the field. To assist researchers in this effort, we introduce a novel learning environment designed to study how task characteristics affect measured difficulty for the learner. This tool frames learning tasks as a ``board-clearing game,'' which we call the Game of Hidden Rules (GOHR). In each instance of the game, the researcher encodes a specific rule, unknown to the learner, that determines which moves are allowed at each state of the game. The learner must infer the rule through play. We detail the game's expressive rule syntax and show how it gives researchers granular control over learning tasks. We present sample rules, a sample ML algorithm, and methods to assess algorithm performance. Separately, we provide additional benchmark rules, a public leaderboard for performance on these rules, and documentation for installing and using the GOHR environment.","benchmark, environment, rule learning" Motif-based Graph Representation Learning with Application to Chemical Molecules,https://openreview.net/forum?id=70_umOqc6_-,https://openreview.net/pdf?id=70_umOqc6_-,We propose a motif-based representation learning method to better capture local structure information and demonstrate the performance and explainability advantages on molecular benchmarks. ,"This work considers the task of representation learning on the attributed relational graph (ARG).
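For the Visual Token Matching entry above, a minimal sketch of the non-parametric token matching idea; tensor shapes and the temperature tau are illustrative assumptions, not the authors' architecture:

```python
import torch
import torch.nn.functional as F

def token_match(query_tokens, support_tokens, support_label_tokens, tau=0.1):
    """Each query patch token receives a similarity-weighted combination of
    the support patches' label embeddings.

    query_tokens: (Nq, d); support_tokens: (Ns, d);
    support_label_tokens: (Ns, d_out).
    """
    sim = F.normalize(query_tokens, dim=-1) @ F.normalize(support_tokens, dim=-1).T
    attn = torch.softmax(sim / tau, dim=-1)     # (Nq, Ns) matching weights
    return attn @ support_label_tokens          # predicted label tokens
```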
Both the nodes and edges in an ARG are associated with attributes/features allowing ARGs to encode rich structural information widely observed in real applications. Existing graph neural networks offer limited ability to capture complex interactions within local structural contexts, which hinders them from taking advantage of the expression power of ARGs. We propose Motif Convolution Module (MCM), a new motif-based graph representation learning technique to better utilize local structural information. The ability to handle continuous edge and node features is one of MCM's advantages over existing motif-based models. MCM builds a motif vocabulary in an unsupervised way and deploys a novel motif convolution operation to extract the local structural context of individual nodes, which is then used to learn higher-level node representations via multilayer perceptron and/or message passing in graph neural networks. When compared with other graph learning approaches to classifying synthetic graphs, our approach is substantially better in capturing structural context. We also demonstrate the performance and explainability advantages of our approach by applying it to several molecular benchmarks.","Graph Neural Networks, Molecular Graph Representation" "Graph schemas as abstractions for transfer learning, inference, and planning",https://openreview.net/forum?id=2T80ygeeWE0,https://openreview.net/pdf?id=2T80ygeeWE0,"We propose schemas in a higher order graph structures as a model for abstractions that can be used for rapid transfer learning, inference, and planning.","We propose schemas as a model for abstractions that can be used for rapid transfer learning, inference, and planning. Common structured representations of concepts and behaviors---schemas---have been proposed as a powerful way to encode abstractions. Latent graph learning is emerging as a new computational model of the hippocampus to explain map learning and transitive inference. We build on this work to show that learned latent graphs in these models have a slot structure---schemas---that allow for quick knowledge transfer across environments. In a new environment, an agent can rapidly learn new bindings between the sensory stream to multiple latent schemas and select the best fitting one to guide behavior. To evaluate these graph schemas, we use two previously published challenging tasks: the memory \& planning game and one-shot StreetLearn, that are designed to test rapid task solving in novel environments. Graph schemas can be learned in far fewer episodes than previous baselines, and can model and plan in a few steps in novel variations of these tasks. We further demonstrate learning, matching, and reusing graph schemas in navigation tasks in more challenging environments with aliased observations and size variations, and show how different schemas can be composed to model larger environments.","Schema learning, abstractions, higher order graphs, perceptual aliasing, aliased graphs, planning, spatial navigation, cognitive science" Conservative Bayesian Model-Based Value Expansion for Offline Policy Optimization,https://openreview.net/forum?id=dNqxZgyjcYA,https://openreview.net/pdf?id=dNqxZgyjcYA,,"Offline reinforcement learning (RL) addresses the problem of learning a performant policy from a fixed batch of data collected by following some behavior policy. 
Model-based approaches are particularly appealing in the offline setting since they can extract more learning signals from the logged dataset by learning a model of the environment. However, the performance of existing model-based approaches falls short of their model-free counterparts, due to the compounding of estimation errors in the learned model. Driven by this observation, we argue that it is critical for a model-based method to understand when to trust the model and when to rely on model-free estimates, and how to act conservatively w.r.t. both. To this end, we derive an elegant and simple methodology called conservative Bayesian model-based value expansion for offline policy optimization (CBOP), which trades off model-free and model-based estimates during the policy evaluation step according to their epistemic uncertainties, and facilitates conservatism by taking a lower bound on the Bayesian posterior value estimate. On the standard D4RL continuous control tasks, we find that our method significantly outperforms previous model-based approaches: e.g., MOPO by $116.4$%, MOReL by $23.2$% and COMBO by $23.7$%. Further, CBOP achieves state-of-the-art performance on $11$ out of $18$ benchmark datasets while performing on par on the remaining datasets.","Offline reinforcement learning, model-based reinforcement learning, model-based value expansion, Bayesian inference" Beam Tree Recursive Cells,https://openreview.net/forum?id=sKDtBKYOdIP,https://openreview.net/pdf?id=sKDtBKYOdIP,We apply beam search on an easy-first parsing strategy to simulate RvNNs without ground-truth tree supervision and experiment with different extensions.,"Recursive Neural Networks (RvNNs) generalize Recurrent Neural Networks (RNNs) by allowing sequential composition in a more flexible order, typically based on some tree structure. While initially user-annotated tree structures were used, in due time, several approaches were proposed to automatically induce tree structures from raw text to guide the recursive compositions in RvNNs. In this paper, we present an approach called Beam Tree Recursive Cell (or BT-Cell) based on a simple yet overlooked backpropagation-friendly framework. BT-Cell applies beam search on easy-first parsing for simulating RvNNs with automatic structure induction. Our results show that BT-Cell achieves near-perfect performance on several aspects of challenging structure-sensitive synthetic tasks like ListOps and also comparable performance to other RvNN-based models on realistic data. We further introduce and analyze several extensions of BT-Cell based on relaxations of the hard top-k operators in beam search. We evaluate the models on different out-of-distribution splits in both synthetic and realistic data. Additionally, we identify a previously unknown failure case for neural models in generalization to an unseen number of arguments in ListOps. We will release our code.","Recursive Neural Networks, RvNNs, length generalization, systematicity" The Ultimate Combo: Boosting Adversarial Example Transferability by Composing Data Augmentations,https://openreview.net/forum?id=6yaLHYv5L91,https://openreview.net/pdf?id=6yaLHYv5L91,"We comprehensively studied data-augmentation methods for enhancing the transferability of adversarial examples, finding compositions that work best, and advancing the state of the art.","Transferring adversarial examples from surrogate (ML) models to evade target models is a common method for evaluating adversarial robustness in black-box settings.
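A hedged sketch of the conservative Bayesian value target in the CBOP entry above; the precision-weighted fusion is one plausible reading of the abstract, not the authors' exact estimator:

```python
import numpy as np

def conservative_value_target(mb_values, mf_values, beta=1.0):
    """Precision-weighted fusion of model-based and model-free value
    ensembles, followed by a lower confidence bound for conservatism."""
    prec_mb = 1.0 / (np.var(mb_values) + 1e-8)   # inverse epistemic variance
    prec_mf = 1.0 / (np.var(mf_values) + 1e-8)
    mean = (prec_mb * np.mean(mb_values) + prec_mf * np.mean(mf_values)) \
        / (prec_mb + prec_mf)
    std = (prec_mb + prec_mf) ** -0.5
    return mean - beta * std                     # conservative lower bound
```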
Researchers have invested substantial efforts to enhance transferability. Chiefly, attacks leveraging data augmentation have been found to help adversarial examples generalize better from surrogate to target models. Still, prior work has explored a limited set of augmentation techniques and their composition. To fill the gap, we conducted a systematic, comprehensive study of how data augmentation affects transferability. Particularly, we explored ten augmentation techniques of six categories originally proposed to help ML models generalize to unseen benign samples, and assessed how they influence transferability, both when applied individually and when composed. Our extensive experiments with the ImageNet dataset showed that simple color-space augmentations (e.g., color to greyscale) outperform the state of the art when combined with standard augmentations, such as translation and scaling. Additionally, except for two methods that may harm transferability, we found that composing augmentation methods impacts transferability monotonically (i.e., more methods composed $\rightarrow$ $\ge$transferability)---the best composition we found significantly outperformed the state of the art (e.g., 95.6% vs. 90.9% average transferability from normally trained surrogates to other normally trained models). We provide intuitive, empirically supported explanations for why certain augmentations fail to improve transferability.","Adversarial machine learning, transferability, evasion, black-box attacks" In-Time Refining Optimization Trajectories Toward Improved Robust Generalization,https://openreview.net/forum?id=MdKAP5oHJ5l,https://openreview.net/pdf?id=MdKAP5oHJ5l,We propose a new method named weighted optimization trajectories (WOT) that refines the optimization trajectories of adversarial training in time to improve robust generalization.,"Despite the fact that adversarial training has become the de facto method for improving robustness of deep neural networks, it is well-known that vanilla adversarial training suffers from daunting robust overfitting, resulting in unsatisfactory robust generalization. A number of approaches have been proposed to address these drawbacks such as extra regularization, adversarial weights perturbation, and training with more data over the last few years. However, the robust generalization improvement is yet far from satisfactory. In this paper, we approach this challenge with a brand new perspective -- refining historical optimization trajectories. We propose a new method named \textbf{Weighted Optimization Trajectories (WOT)} that leverages the optimization trajectories of adversarial training in time. We have conducted extensive experiments to demonstrate the effectiveness of WOT under various state-of-the-art adversarial attacks. Our results show that WOT integrates seamlessly with the existing adversarial training methods and consistently overcomes the robust overfitting issue, resulting in better adversarial robustness. For example, WOT boosts the robust accuracy of AT-PGD under AA-$L_{\infty}$ attack by 1.53\% $\sim$ 6.11\% and meanwhile increases the clean accuracy by 0.55\%$\sim$5.47\% across SVHN, CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. 
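One way to realize the augmentation compositions studied in the transferability entry above is to take attack gradients through randomly applied transforms; a sketch with illustrative torchvision augmentations (not the paper's exact set):

```python
import torch
import torch.nn.functional as F
import torchvision.transforms as T

# Geometric transforms composed with a simple color-space augmentation.
augment = T.Compose([
    T.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    T.RandomGrayscale(p=0.5),
])

def transfer_attack_step(model, x_adv, y, eps_step=2 / 255):
    """One FGSM-style step with the gradient taken through a random
    augmentation of the current adversarial example."""
    x_adv = x_adv.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(augment(x_adv)), y)
    grad, = torch.autograd.grad(loss, x_adv)
    return (x_adv + eps_step * grad.sign()).clamp(0, 1).detach()
```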
Code is included in the supplementary material.","Adversarial Robustness, Optimization Trajectories, Robust overfitting" Scaling up and Stabilizing Differentiable Planning with Implicit Differentiation,https://openreview.net/forum?id=PYbe4MoHf32,https://openreview.net/pdf?id=PYbe4MoHf32,,"Differentiable planning promises end-to-end differentiability and adaptivity. However, an issue prevents it from scaling up to larger problems: such methods need to differentiate through forward iteration layers to compute gradients, which couples forward computation and backpropagation and needs to balance forward planner performance and the computational cost of the backward pass. To alleviate this issue, we propose to differentiate through the Bellman fixed-point equation to decouple forward and backward passes for the Value Iteration Network and its variants, which enables constant backward cost (in planning horizon) and a flexible forward budget and helps scale up to large tasks. We study the convergence stability, scalability, and efficiency of the proposed implicit version of VIN and its variants and demonstrate their superiority on a range of planning tasks: 2D navigation, visual navigation, and 2-DOF manipulation in configuration space and workspace.", Improving Aspect Ratio Distribution Fairness in Detector Pretraining via Cooperating RPN’s,https://openreview.net/forum?id=9BXSGPfRhX,https://openreview.net/pdf?id=9BXSGPfRhX,We propose Cooperating RPN’s for improving the fairness to object aspect ratio distribution in few-shot object detection.,"Region proposal networks (RPN) are a key component of modern object detectors. An RPN identifies image boxes that are likely to contain objects and so are worth further investigation. An RPN false negative is unrecoverable, so the performance of an object detector can be significantly affected by RPN behavior, particularly in low-data regimes. The RPN for a few-shot detector is trained on base classes. Our experiments demonstrate that, if the distribution of box aspect ratios for base classes is different from that for novel classes, errors caused by RPN failure to propose a good box become significant. This is predictable: for example, an RPN trained on base classes that are mostly square will tend to miss short wide boxes. It has not been noticed to date because the (relatively few) standard base/novel class splits on current datasets do not display this effect. But changing the base/novel split highlights the problem. We describe dataset splits with severe distribution shift, built from the PASCAL VOC, COCO, and LVIS datasets. We show that the effect can be mitigated by training multiple distinct but cooperating specialized RPNs. Each specializes in a different aspect ratio, but cooperation constraints reduce the extent to which the RPNs are tuned. This means that if a box is missed by one RPN, it has a good chance of being picked up by another. Experimental evaluation confirms this approach results in substantial improvements in performance on the ARShift benchmarks, while remaining comparable to SOTA on conventional splits.
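The decoupling of forward and backward passes in the differentiable-planning entry above can be approximated with a one-step implicit gradient; a sketch for tabular value iteration, where the shapes and the one-step approximation are assumptions:

```python
import torch

def implicit_value_iteration(rewards, P, gamma=0.95, iters=200):
    """rewards: (A, S); P: (A, S, S) transition tensor.

    The long forward solve runs without autograd; one differentiable
    Bellman step at the fixed point yields gradients at constant
    backward cost, regardless of how many forward iterations were used."""
    with torch.no_grad():
        v = torch.zeros(rewards.shape[1])
        for _ in range(iters):
            v = (rewards + gamma * (P @ v)).max(dim=0).values
    # Single differentiable application of the Bellman operator.
    return (rewards + gamma * (P @ v)).max(dim=0).values
```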
Our approach applies to any few-shot detector and consistently improves detector performance.","Few-Shot Learning, Object Detection, Distribution Shift" Learning parsimonious dynamics for generalization in reinforcement learning,https://openreview.net/forum?id=qyBgN5bLasw,https://openreview.net/pdf?id=qyBgN5bLasw,,"Humans are skillful navigators: We aptly maneuver through new places, realize when we are back at a location we have seen before, and can even conceive of shortcuts that go through parts of our environments we have never visited. Current methods in model-based reinforcement learning, on the other hand, struggle with generalizing about environment dynamics outside the training distribution. We argue that two principles can help bridge this gap: latent learning and parsimonious dynamics. Humans tend to think about environment dynamics in simple terms -- we reason about trajectories not in reference to what we expect to see along a path, but rather in an abstract latent space, containing information about the places' spatial coordinates. Moreover, we assume that moving around in novel parts of our environment works the same way as in parts we are familiar with. These two principles work in tandem: it is in the latent space that the dynamics show parsimonious characteristics. We develop a model that learns such parsimonious dynamics. Using a variational objective, our model is trained to reconstruct experienced transitions in a latent space using locally linear transformations, while being encouraged to invoke as few distinct transformations as possible. Using our framework, we demonstrate the utility of learning parsimonious latent dynamics models in a range of policy learning and planning tasks.", DECODING LAYER SALIENCY IN TRANSFORMERS,https://openreview.net/forum?id=5ycxwq2VFAX,https://openreview.net/pdf?id=5ycxwq2VFAX,,"In this paper, we introduce a strategy for identifying textual saliency in large-scale language models applied to classification tasks. In visual networks where saliency is better studied, saliency is naturally localized through the convolutional layers of the network; however, the same is not true in modern transformer-stack networks used to process natural language. We adapt gradient-based saliency methods for these networks, propose a method for evaluating the degree of semantic coherence of each layer, and demonstrate consistent improvement over numerous other methods for textual saliency on multiple benchmark classification datasets. Our approach requires no additional training or access to labelled data, and is comparatively computationally efficient.","saliency, explainability, transparency, transformers, NLP, feature attribution" UNDERSTANDING THE ROLE OF POSITIONAL ENCODINGS IN SENTENCE REPRESENTATIONS,https://openreview.net/forum?id=8taH4yjN62m,https://openreview.net/pdf?id=8taH4yjN62m,"In this work, we investigate the role of positional encodings systematically.","Positional encodings are used to inject word-order information into transformer-based language models. While they can significantly enhance the quality of sentence representations, their specific contribution to language models is not fully understood, especially given recent findings that building natural-language understanding from language models with positional encodings is insensitive to word order. In this work, we investigate the role of positional encodings systematically.
(1) By identifying two common properties, locality and symmetry, we uncover that the core function of existing positional encodings is to symmetrically combine local units. (2) We reveal that positional and contextual encodings play distinct roles in understanding sentences. (3) Based on these findings, we propose a simplified new method to inject positional information into such models. Empirical studies demonstrate that this method can improve the performance of the BERT-based model on 10 downstream tasks. We hope these new probing results and findings can shed light on how to design and inject positional encodings into language models. ","Positional Encodings, Sentence Representations, Pre-trained Language Models" Artificial Replay: A Meta-Algorithm for Harnessing Historical Data in Bandits,https://openreview.net/forum?id=vKXd1m74DkN,https://openreview.net/pdf?id=vKXd1m74DkN,"We propose a meta-algorithm for multi-armed bandits that most efficiently uses historical data, overcoming challenges of spurious data and imbalanced data coverage.","While standard bandit algorithms sometimes incur high regret, their performance can be greatly improved by ""warm starting"" with historical data. Unfortunately, how best to incorporate historical data is unclear: naively initializing reward estimates using all historical samples can suffer from spurious data and imbalanced data coverage, leading to computational and storage issues - particularly in continuous action spaces. We address these two challenges by proposing Artificial Replay, a meta-algorithm for incorporating historical data into any arbitrary base bandit algorithm. Artificial Replay uses only a subset of the historical data as needed to reduce computation and storage. We show that for a broad class of base algorithms that satisfy independence of irrelevant data (IIData), a novel property that we introduce, our method achieves the same regret as a full warm-start approach while potentially using only a fraction of historical data. We complement these theoretical results with a case study of $K$-armed and continuous combinatorial bandit algorithms, including on a green security domain using real poaching data, to show the practical benefits of Artificial Replay in achieving optimal regret alongside low computational and storage costs.","multi-armed bandits, historical data, adaptive discretization, online learning" Score-based Continuous-time Discrete Diffusion Models,https://openreview.net/forum?id=BYWWwSY2G5s,https://openreview.net/pdf?id=BYWWwSY2G5s,"a generalized discrete score matching for learning continuous-time diffusion in categorical spaces, with new parameterization and novel analytical sampling.","Score-based modeling through stochastic differential equations (SDEs) has provided a new perspective on diffusion models, and demonstrated superior performance on continuous data. However, the gradient of the log-likelihood function, i.e., the score function, is not properly defined for discrete spaces. This makes it non-trivial to adapt SDEs with score functions to categorical data. In this paper, we extend diffusion models to discrete variables by introducing a stochastic jump process where the reverse process denoises via a continuous-time Markov chain. This formulation admits an analytical simulation during backward sampling. To learn the reverse process, we extend score matching to general categorical data, and show that an unbiased estimator can be obtained via simple matching of the conditional marginal distributions.
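A sketch of the Artificial Replay meta-algorithm described above; base_alg (with select_arm/update) and pull are assumed interfaces, not the authors' code:

```python
def artificial_replay(base_alg, historical, pull, horizon):
    """base_alg needs select_arm() and update(arm, reward); historical is a
    list of (arm, reward) pairs; pull(arm) queries the real environment."""
    unused = list(historical)
    for _ in range(horizon):
        arm = base_alg.select_arm()
        hit = next((s for s in unused if s[0] == arm), None)
        if hit is not None:          # consume matching history instead of acting
            unused.remove(hit)
            base_alg.update(arm, hit[1])
        else:                        # no unused history for this arm: play for real
            base_alg.update(arm, pull(arm))
```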
We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.","discrete space diffusion, discrete score matching, continuous-time diffusion" Decision Transformer under Random Frame Dropping,https://openreview.net/forum?id=NmZXv4467ai,https://openreview.net/pdf?id=NmZXv4467ai,Learning to control against random frame dropping through three original modifications to the Decision Transformer.,"Controlling agents remotely with deep reinforcement learning (DRL) in the real world has yet to be realized. One crucial stepping stone is to devise RL algorithms that are robust in the face of dropped information from corrupted communication or malfunctioning sensors. Typical RL methods require considerable online interaction data that are costly and unsafe to collect in the real world. Furthermore, when they are applied to frame-dropping scenarios, they perform unsatisfactorily even with moderate drop rates. To devise a robust and deployable algorithm, we propose Decision Transformer under Random Frame Dropping (DeFog), an offline RL algorithm that enables agents to act robustly in frame dropping scenarios without online interaction. DeFog first randomly masks out data in the offline datasets and explicitly adds the timespan of frame dropping as inputs. After that, a finetuning stage on the same offline dataset with a higher mask rate further boosts the performance. Empirical results show that DeFog outperforms strong baselines under severe frame drop rates like 90\%, while maintaining similar returns under non-frame-dropping conditions in the regular MuJoCo control benchmarks and the Atari environments.","Decision Transformer, Reinforcement Learning, Frame Dropping" Semi-supervised consistency regularization for accurate cell type fraction and gene expression estimation,https://openreview.net/forum?id=gmufyyjyjnN,https://openreview.net/pdf?id=gmufyyjyjnN,,"Cell deconvolution is the estimation of cell type fractions and cell type-specific gene expression from mixed data with unknown composition. In biomedical research, cell deconvolution, which is a source separation task, is used to obtain mechanistic and diagnostic insights into human diseases. An unmet challenge in cell deconvolution, however, is the scarcity of realistic training data and the strong domain shift observed in synthetic training data that is used in contemporary methods. Here, we hypothesize that simultaneous consistency regularization of the target and training domains will improve deconvolution performance. By adding this biologically motivated consistency loss to two novel deep learning-based deconvolution algorithms, we achieve state-of-the-art performance on both cell fraction and gene expression estimation. Our method, DISSECT, outperforms competing algorithms across several biomedical gene expression datasets and can be easily adapted to deconvolve other biomedical data types, as exemplified by our spatial expression deconvolution experiments.","Cell deconvolution, consistency regularization" Adversarial Imitation Learning with Preferences,https://openreview.net/forum?id=bhfp5GlDtGe,https://openreview.net/pdf?id=bhfp5GlDtGe,We extend Adversarial Imitation Learning to simultaneously utilize both demonstrations and preferences.,"Designing an accurate and explainable reward function for many Reinforcement Learning tasks is a cumbersome and tedious process.
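For the DeFog entry above, a minimal sketch of the frame-dropping augmentation with timespan inputs; carrying the last visible frame forward is an illustrative choice, not necessarily the authors' exact masking scheme:

```python
import torch

def defog_mask(observations, drop_rate=0.5):
    """Randomly hide frames, carry the last visible frame forward, and
    record the timespan since the last real observation.

    observations: (T, d) tensor.
    """
    T = observations.shape[0]
    keep = torch.rand(T) > drop_rate
    keep[0] = True                           # always keep the first frame
    masked = observations.clone()
    timespan = torch.zeros(T)
    for t in range(1, T):
        if not keep[t]:
            masked[t] = masked[t - 1]        # repeat the last seen frame
            timespan[t] = timespan[t - 1] + 1
    return masked, timespan
```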
Instead, learning policies directly from the feedback of human teachers naturally integrates human domain knowledge into the policy optimization process. However, different feedback modalities, such as demonstrations and preferences, provide distinct benefits and disadvantages. For example, demonstrations convey a lot of information about the task but are often hard or costly to obtain from real experts, while preferences typically contain less information but are in most cases cheap to generate. However, existing methods centered around human feedback mostly focus on a single teaching modality, causing them to miss out on important training data while making them less intuitive to use. In this paper, we propose a novel method for policy learning that incorporates two different feedback types, namely \emph{demonstrations} and \emph{preferences}. To this end, we make use of the connection between discriminator training and density ratio estimation to incorporate preferences into the popular Adversarial Imitation Learning paradigm. This insight allows us to express loss functions over both demonstrations and preferences in a unified framework. Besides expert demonstrations, we are also able to learn from imperfect ones and combine them with preferences to achieve improved task performance. We experimentally validate the effectiveness of combining both preferences and demonstrations on common benchmarks and also show that our method can efficiently learn challenging robot manipulation tasks.","preference learning, learning from demonstration, adversarial imitation learning" How to Do a Vocab Swap? A Study of Embedding Replacement for Pre-trained Transformers,https://openreview.net/forum?id=MsjB2ohCJO1,https://openreview.net/pdf?id=MsjB2ohCJO1,We investigate strategies for swapping the vocabularies of transformer encoders using smart initializations.,"There is a wide range of different tokenizers and vocabularies that have been used to train language models, and training a language model on just one of these can be prohibitively expensive. The ability to swap the vocabulary of a model after it has been trained enables models to be adapted to different tokenizers, and even different languages, without the computational or data cost of from-scratch training. In this paper, we ask when such swaps are possible and how to perform them effectively. The major challenge of performing a vocab swap is re-learning the parameters of the embedding layer for the vocabulary. We observe that it is possible to re-learn the embedding for a vocabulary using a naive initialization, and we investigate strong initialization strategies that enable learning of new embeddings for swapped vocabularies, even when those vocabularies come from a different source language than the original language model.","transfer learning, transformers, language models" Attribution Scores are Redundant: Explaining Feature Contribution By Trajectories,https://openreview.net/forum?id=gY25_vAwX6G,https://openreview.net/pdf?id=gY25_vAwX6G,"We propose a novel form of explanation that not only outperforms attribution methods in the most commonly used insertion/deletion metric, but also is able to theoretically achieve the best possible explanations under such a metric.","Opening black boxes and revealing the inner mechanism of deep models is vital in applying them to real-world tasks. As one of the most intuitive and straightforward explanations for deep models, attributive explanation methods have been extensively studied.
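For the vocab-swap setting above, one plausible "strong initialization" (our guess at a simple baseline, not necessarily one of the authors' strategies) is to copy trained rows for tokens shared by both vocabularies and start the remaining tokens near the mean old embedding:

```python
import numpy as np

def init_swapped_embeddings(old_emb, old_vocab, new_vocab):
    """Initialize an embedding table for a new vocabulary: tokens that also
    exist in the old vocabulary copy their trained row; all other tokens
    start at the mean old embedding plus small noise (a common heuristic)."""
    new_emb = np.tile(old_emb.mean(axis=0), (len(new_vocab), 1))
    new_emb += 0.01 * np.random.default_rng(0).standard_normal(new_emb.shape)
    for tok, i in new_vocab.items():
        if tok in old_vocab:
            new_emb[i] = old_emb[old_vocab[tok]]
    return new_emb
```

Here `old_vocab` and `new_vocab` are token-to-index dicts; the transformer body is kept frozen or lightly tuned while the new table is re-learned.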
Existing attribution methods typically assign attribution scores to each individual feature as an explanation. However, when we use or evaluate the explanations in practice, what really matters is not the attribution scores, but the rank order of features (e.g., identifying the top-contributing features, or checking for changes in the model output by masking features in order). In other words, computing attribution scores is a redundant step in producing explanations. To address this, we propose a novel framework TRAjectory importanCE (TRACE) which directly provides feature ranking explanations. Our method introduces several improvements. First, TRACE greatly reduces the set of feasible explanations, allowing us to actually solve for the best explanation. Second, TRACE is able to achieve the theoretically-grounded best possible explanation in commonly used deletion evaluations. Third, we provide extensive experiments to validate that TRACE outperforms attribution methods by a significant margin.","Interpretability, Trajectory Importance, Combinatorial Optimization" SuperFed: Weight Shared Federated Learning,https://openreview.net/forum?id=9hp9PIFDhsK,https://openreview.net/pdf?id=9hp9PIFDhsK,Federated Training of K models in O(1) (amortized) communication and computation cost. ,"Federated Learning (FL) is a well-established technique for privacy preserving distributed training. Much attention has been given to various aspects of FL training. A growing number of applications that consume FL-trained models, however, increasingly operate under dynamically and unpredictably variable conditions, rendering a single model insufficient. We argue for training a global “family of models” cost efficiently in a federated fashion. Training them independently for different tradeoff points incurs ≈ O(k) cost for any k architectures of interest, however. Straightforward applications of FL techniques to recent weight-shared training approaches are either infeasible or prohibitively expensive. We propose SuperFed — an architectural framework that incurs O(1) cost to co-train a large family of models in a federated fashion by leveraging weight-shared learning. We achieve an order of magnitude cost savings on both communication and computation by proposing two novel training mechanisms: (a) distribution of weight-shared models to federated clients, (b) central aggregation of arbitrarily overlapping weight-shared model parameters. The combination of these mechanisms is shown to reach an order of magnitude (9.43x) reduction in computation and communication cost for training a 5*10^18-sized family of models, compared to independently training as few as k = 9 DNNs without any accuracy loss.","Weight Shared, Federated Learning" Is Model Ensemble Necessary? Model-based RL via a Single Model with Lipschitz Regularized Value Function,https://openreview.net/forum?id=hNyJBk3CwR,https://openreview.net/pdf?id=hNyJBk3CwR,,"Probabilistic dynamics model ensemble is widely used in existing model-based reinforcement learning methods as it outperforms a single dynamics model in both asymptotic performance and sample efficiency. In this paper, we provide both practical and theoretical insights on the empirical success of the probabilistic dynamics model ensemble through the lens of Lipschitz continuity.
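The deletion evaluation referenced in the TRACE abstract above is easy to state in code. A minimal sketch (our own helper), assuming `model` maps a feature vector to a scalar output and `ranking` lists feature indices from most to least important:

```python
import numpy as np

def deletion_curve(model, x, ranking, baseline=0.0):
    """Deletion evaluation for a feature-ranking explanation: mask features
    in rank order and record the model output after each deletion; a faster
    drop (smaller area under the curve) indicates a better ranking."""
    x = np.array(x, dtype=float)
    scores = [model(x)]
    for idx in ranking:
        x[idx] = baseline        # remove the next most important feature
        scores.append(model(x))
    return np.array(scores)
```

Because the metric only consumes the rank order, any per-feature scores computed along the way are indeed redundant, which is the paper's premise.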
We find that, for a value function, the stronger the Lipschitz condition is, the smaller the gap between the true dynamics- and learned dynamics-induced Bellman operators is, thus enabling the converged value function to be closer to the optimal value function. Hence, we hypothesize that the key functionality of the probabilistic dynamics model ensemble is to regularize the Lipschitz condition of the value function using generated samples. To validate this hypothesis, we devise two practical robust training mechanisms, computing adversarial noise and regularizing the value network’s spectral norm, to directly regularize the Lipschitz condition of the value functions. Empirical results show that, combined with our mechanisms, model-based RL algorithms with a single dynamics model outperform those with an ensemble of probabilistic dynamics models. These findings not only support the theoretical insight, but also provide a practical solution for developing computationally efficient model-based RL algorithms.","Model-based Reinforcement Learning, Probabilistic Dynamics Model Ensemble, Lipschitz Regularization" Recurrent Back-Projection Generative Adversarial Network for Video Super Resolution,https://openreview.net/forum?id=p5DeuCSE9q,https://openreview.net/pdf?id=p5DeuCSE9q,Enhancing video quality through the exploitation of Recurrent Back-Projection Generative Adversarial Networks,"In this paper, we propose a new Video Super Resolution algorithm in an attempt to generate videos that are temporally coherent, spatially detailed, and match human perception. To achieve this, we developed a new generative adversarial network named RBPGAN which is composed of two main components: a generator network that exceeds other models in producing very high-quality frames, and a discriminator which outperforms others in terms of temporal consistency. The generator of the model uses a reduced recurrent back-projection network that takes a set of neighboring frames and a target frame, applies SISR (Single Image Super Resolution) to each frame, and applies MISR (Multiple Image Super Resolution) through an encoder-decoder back-projection-based approach to concatenate them and produce a 4x-resolution version of the target frame. The spatio-temporal discriminator uses triplets of frames and penalizes the generator so that it generates the desired results. Our contribution results in a model that outperforms earlier work in terms of perceptual similarity and natural flow of frames, while maintaining temporal coherence and high-quality spatial details. The algorithm was tested on different datasets to eliminate bias.","Video Super Resolution, GANs, Temporal Coherence, Recurrent Projection." Neural Networks as Paths through the Space of Representations,https://openreview.net/forum?id=wAQU0Frxoa,https://openreview.net/pdf?id=wAQU0Frxoa,We visualize how information is transformed through a network using geometry derived from representational distance metrics.,"Deep neural networks implement a sequence of layer-by-layer operations that are each relatively easy to understand, but the resulting overall computation is generally difficult to understand. We develop a simple idea for interpreting the layer-by-layer construction of useful representations: the role of each layer is to reformat information to reduce the ""distance"" to the desired outputs. With this framework, the layer-wise computation implemented by a deep neural network can be viewed as a path through a high-dimensional representation space.
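Of the two mechanisms just described for the Lipschitz-regularized value function, the spectral-norm one is straightforward to sketch. A minimal PyTorch version (our own illustration, using a single power-iteration step per linear layer):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def spectral_norm_penalty(model, n_iter=1):
    """Estimate each linear layer's largest singular value by power
    iteration and sum the estimates into a scalar penalty, which
    regularizes the Lipschitz constant of the value network."""
    penalty = 0.0
    for m in model.modules():
        if isinstance(m, nn.Linear):
            W = m.weight
            with torch.no_grad():            # power iteration (no gradients)
                u = torch.randn(W.shape[0], device=W.device)
                for _ in range(n_iter):
                    v = F.normalize(W.t() @ u, dim=0)
                    u = F.normalize(W @ v, dim=0)
            penalty = penalty + u @ W @ v    # sigma_max estimate, differentiable in W
    return penalty

# usage sketch: loss = td_loss + lam * spectral_norm_penalty(value_net)
```

Adding the penalty to the TD loss discourages sharp value functions without needing an ensemble of dynamics models.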
We formalize this intuitive idea of a ""path"" by leveraging recent advances in metric representational similarity. We extend existing representational distance methods by computing geodesics, angles, and projections of representations, going beyond mere layer distances. We then demonstrate these tools by visualizing and comparing the paths taken by ResNet and VGG architectures on CIFAR-10. We conclude by sketching additional ways that this kind of representational geometry can be used to understand and interpret network training, to describe novel kinds of similarities between different models, and for representation-learning without backpropagation.","representational similarity, metrics, geodesic, neural network expressivity, visualization" From Points to Functions: Infinite-dimensional Representations in Diffusion Models,https://openreview.net/forum?id=0DwzMsUNIr,https://openreview.net/pdf?id=0DwzMsUNIr,We perform an analysis on the trajectory-based representation obtained from Diffusion Based Representation Learning to measure how different points of the trajectory encode semantically different information.,"Diffusion-based generative models learn to iteratively transfer unstructured noise to a complex target distribution as opposed to Generative Adversarial Networks (GANs) or the decoder of Variational Autoencoders (VAEs) which produce samples from the target distribution in a single step. Thus, in diffusion models every sample is naturally connected to a random trajectory which is a solution to a learned stochastic differential equation (SDE). Generative models are only concerned with the final state of this trajectory that delivers samples from the desired distribution. \cite{abstreiter2021diffusion} showed that these stochastic trajectories can be seen as continuous filters that wash out information along the way. Consequently, it is reasonable to ask if there is an intermediate time step at which the preserved information is optimal for a given downstream task. In this work, we show that a combination of information content from different time steps gives a strictly better representation for the downstream task. We introduce attention- and recurrence-based modules that ``learn to mix'' the information content of various time steps, such that the resultant representation leads to superior performance in downstream tasks. ","representation learning, diffusion models, score-based learning" Disentangling with Biological Constraints: A Theory of Functional Cell Types,https://openreview.net/forum?id=9Z_GfhZnGH,https://openreview.net/pdf?id=9Z_GfhZnGH,"We prove biological constraints of nonnegativity and energy efficiency lead to disentangled representations, and empirically demonstrate this in machine learning and neuroscience tasks.","Neurons in the brain are often finely tuned for specific task variables. Moreover, such disentangled representations are highly sought after in machine learning. Here we mathematically prove that simple biological constraints on neurons, namely nonnegativity and energy efficiency in both activity and weights, promote such sought after disentangled representations by enforcing neurons to become selective for single factors of task variation. We demonstrate these constraints lead to disentangling in a variety of tasks and architectures, including variational autoencoders.
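A toy version of the path view described above: measure the "distance" between consecutive layers with a representational similarity metric and sum along the network. We use 1 minus linear CKA purely for illustration; the paper builds on metric representational distances, which this sketch does not reproduce exactly:

```python
import numpy as np

def rep_distance(X, Y):
    """A simple representational distance between two layers' activation
    matrices (rows index the same inputs): 1 - linear CKA."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    hsic = np.linalg.norm(Yc.T @ Xc, "fro") ** 2
    denom = (np.linalg.norm(Xc.T @ Xc, "fro")
             * np.linalg.norm(Yc.T @ Yc, "fro"))
    return 1.0 - hsic / (denom + 1e-12)

def path_length(layer_acts):
    """Length of the 'path' a network traces through representation space:
    sum of distances between consecutive layers' activations."""
    return sum(rep_distance(a, b)
               for a, b in zip(layer_acts, layer_acts[1:]))
```

Comparing `path_length` (and the per-step distances) across architectures is one way to visualize how differently, say, ResNet and VGG reformat information.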
We also use this theory to explain why the brain partitions its cells into distinct cell types such as grid and object-vector cells, and also explain when the brain instead entangles representations in response to entangled task factors. Overall, this work provides a mathematical understanding of why, when, and how neurons represent factors in both brains and machines, and is a first step towards understanding how task demands structure neural representations.","Disentangling, neuroscience, representation learning, hippocampus, cortex" ESEAD: An Enhanced Simple Ensemble and Distillation Framework for Natural Language Processing,https://openreview.net/forum?id=_omKGvXq0oX,https://openreview.net/pdf?id=_omKGvXq0oX,A simple yet effective logits-based distillation method for natural language processing.,"Large-scale pre-trained language models (PLM) are today’s leading technology for a wide range of natural language processing tasks. However, the enormous size of these models may discourage their use in practice. To tackle this problem, some recent studies have used knowledge distillation (KD) to compress these large models into shallow ones. Despite the success of knowledge distillation, it remains unclear how students learn. We extend knowledge distillation in this paper and propose an enhanced version of the logits-based distillation method, ESEAD, to utilize the knowledge of multiple teachers to assist student learning. In extensive experiments with a total of 13 tasks on the GLUE and SuperGLUE benchmarks, ESEAD with different fine-tuning paradigms (e.g., delta tuning) obtained superior results over other KD methods and even outperformed the teacher model on some tasks. In addition, ESEAD remained the best-performing student model in the few-shot (e.g., 100 samples) settings. ","Natural Language Processing, Knowledge Distillation" Efficient One-Shot Neural Architecture Search With Progressive Choice Freezing Evolutionary Search,https://openreview.net/forum?id=XZRmNjUMj0c,https://openreview.net/pdf?id=XZRmNjUMj0c,,"Neural Architecture Search (NAS) is a fast-developing research field to promote automatic machine learning. Among recently proposed NAS methods, One-Shot NAS has attracted significant attention since it greatly reduces the training cost compared with previous NAS methods. In One-Shot NAS, the best network architecture is searched within a supernet, which is trained only once. In practice, the search process involves numerous inference passes for each use case, which causes high overhead in terms of latency and energy consumption. To tackle this problem, we first observe that the choices of the first few blocks that belong to different candidate networks become similar at the early search stage. Furthermore, these choices are already close to the optimal choices obtained at the end of the search. Leveraging this interesting feature, we propose a Progressive Choice Freezing Evolutionary Search (PCF-ES) method that gradually freezes block choices for all subnets at different search generations. This approach gives us an opportunity to reuse intermediate data produced by the frozen blocks instead of re-computing them.
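The reuse opportunity described in the PCF-ES abstract above amounts to caching the output of the frozen prefix. A minimal sketch (our own, names hypothetical):

```python
import torch

class FrozenPrefixCache:
    """Reuse activations of frozen prefix blocks during evolutionary search:
    once the first k block choices are frozen, every candidate subnet shares
    the same prefix computation, so it is computed once per input batch."""
    def __init__(self, prefix_blocks):
        self.prefix = prefix_blocks   # the chosen (frozen) blocks, in order
        self.cache = {}

    @torch.no_grad()
    def forward_prefix(self, batch_id, x):
        if batch_id not in self.cache:
            h = x
            for block in self.prefix:
                h = block(h)
            self.cache[batch_id] = h  # evaluated once, reused by all subnets
        return self.cache[batch_id]
```

Each candidate then only runs its unfrozen suffix on the cached activations, which is where the reported latency and energy savings would come from.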
The experimental results show that the proposed PCF-ES provides up to 55\% speedup and reduces energy consumption by 51\% during the search stage.", Synthetic Data Generation of Many-to-Many Datasets via Random Graph Generation,https://openreview.net/forum?id=Q120_4COf-K,https://openreview.net/pdf?id=Q120_4COf-K,We synthesise datasets with many-to-many relationships by first generating the relationships via random graph generation and then generating the data attributes.,"Synthetic data generation (SDG) has become a popular approach to release private datasets. In SDG, a generative model is fitted on the private real data, and samples drawn from the model are released as the protected synthetic data. While real-world datasets usually consist of multiple tables with potential \emph{many-to-many} relationships (i.e.~\emph{many-to-many datasets}), recent research in SDG mostly focuses on modeling tables \emph{independently} or only considers generating datasets with special cases of many-to-many relationships such as \emph{one-to-many}. In this paper, we first study the challenge of building faithful generative models for many-to-many datasets. We then present a novel, scalable generation framework based on recent results from random graph theory and representation learning. Finally, we extend the framework to establish the notion of $(\epsilon,\delta)$-differential privacy. Through a real-world dataset, we demonstrate that our method can generate synthetic datasets while preserving information within and across tables better than its closest competitor.","synthetic data generation, random graph generation, differential privacy" Learning rigid dynamics with face interaction graph networks,https://openreview.net/forum?id=J7Uh781A05p,https://openreview.net/pdf?id=J7Uh781A05p,"Face to face, multi-index collisions improve accuracy and efficiency of graph network models for rigid body dynamics","Simulating rigid collisions among arbitrary shapes is notoriously difficult due to complex geometry and the strong non-linearity of the interactions. While graph neural network (GNN)-based models are effective at learning to simulate complex physical dynamics, such as fluids, cloth and articulated bodies, they have been less effective and efficient on rigid-body physics, except with very simple shapes. Existing methods that model collisions through the meshes' nodes are often inaccurate because they struggle when collisions occur on faces far from nodes. Alternative approaches that represent the geometry densely with many particles are prohibitively expensive for complex shapes. Here we introduce the ``Face Interaction Graph Network'' (FIGNet) which extends beyond GNN-based methods, and computes interactions between mesh faces, rather than nodes. Compared to learned node- and particle-based methods, FIGNet is around 4x more accurate in simulating complex shape interactions, while also being 8x more computationally efficient on sparse, rigid meshes. Moreover, FIGNet can learn frictional dynamics directly from real-world data, and can be more accurate than analytical solvers given modest amounts of training data.
FIGNet represents a key step forward in one of the few remaining physical domains which have seen little competition from learned simulators, and offers allied fields such as robotics, graphics and mechanical design a new tool for simulation and model-based planning.","graph networks, rigid body dynamics, physics" On the Importance of Contrastive Loss in Multimodal Learning,https://openreview.net/forum?id=wfU0emciOcM,https://openreview.net/pdf?id=wfU0emciOcM,We show that contrastive pairs are important for models to learn aligned and balanced representations in multimodal learning. ,"Recently, contrastive learning approaches (e.g., CLIP (Radford et al., 2021)) have achieved huge success in multimodal learning, where the model tries to minimize the distance between the representations of different views (e.g., an image and its caption) of the same data point, while keeping the representations of different data points away from each other. However, from a theoretical perspective, it is unclear how contrastive learning can learn to align the representations from different views efficiently, especially in cases where the data is not isotropic. In this work, we analyze the training dynamics of a simple multimodal contrastive learning model, and show that contrastive pairs are important for the model to efficiently balance the learned representations. In particular, we reveal a stage-wise behavior of the learning process: In the first stage, the model aligns the feature representations using positive pairs and the condition number grows in this stage. Then, in the second stage, the model reduces the condition number of the learned representations using negative pairs.","multimodal learning, contrastive learning" MAD for Robust Reinforcement Learning in Machine Translation,https://openreview.net/forum?id=WhoOFXdnys6,https://openreview.net/pdf?id=WhoOFXdnys6,"New distributed policy gradient algorithm that outperforms existing reward-aware training procedures such as REINFORCE, minimum risk training (MRT) and proximal policy optimization (PPO) when optimizing machine translation models.","We introduce a new distributed policy gradient algorithm and show that it outperforms existing reward-aware training procedures such as REINFORCE, minimum risk training (MRT) and proximal policy optimization (PPO) in terms of training stability and generalization performance when optimizing machine translation models. Our algorithm, which we call MAD (on account of using the mean absolute deviation in the importance weighting calculation), has distributed data generators sampling multiple candidates per source sentence on worker nodes, while a central learner updates the policy. MAD depends crucially on two variance reduction strategies: (1) a conditional reward normalization method that ensures each source sentence has both positive and negative reward translation examples and (2) a new robust importance weighting scheme that acts as a conditional entropy regularizer.
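Both MAD variance-reduction strategies above can be sketched compactly. The following is our own rough reading of them (per-source reward standardization, plus MAD-based clipping of importance weights) and should be treated as illustrative rather than the paper's exact formulation:

```python
import numpy as np

def per_source_normalize(rewards):
    """Center and scale candidate rewards per source sentence, so each
    source contributes both positive and negative learning signals."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def mad_clipped_weights(log_p_new, log_p_old):
    """Importance weights clipped around their median using the mean
    absolute deviation (MAD), a robust alternative to PPO-style ratio
    clipping (our guess at the role the MAD plays in the weighting)."""
    w = np.exp(np.asarray(log_p_new) - np.asarray(log_p_old))
    med = np.median(w)
    mad = np.mean(np.abs(w - med))
    return np.clip(w, med - mad, med + mad)
```

The intuition: normalization guarantees a learning signal for every source sentence, and robust clipping keeps stale candidates from dominating the gradient.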
Experiments on a variety of translation tasks show that policies learned using the MAD algorithm perform very well when using both greedy decoding and beam search, and that the learned policies are sensitive to the specific reward used during training.","Machine Translation, Reinforcement Learning" An Exploration of Conditioning Methods in Graph Neural Networks,https://openreview.net/forum?id=m2A7e4fMvT,https://openreview.net/pdf?id=m2A7e4fMvT,We unify three conditioning methods in graph neural networks in a common formulation and compare their performance on several tasks in computational chemistry.,"The flexibility and effectiveness of message passing based graph neural networks (GNNs) induced considerable advances in deep learning on graph-structured data. In such approaches, GNNs recursively update node representations based on their neighbors, and they gain expressivity through the use of node and edge attribute vectors. For example, in computational tasks such as physics and chemistry, the use of edge attributes such as relative position or distance has proved essential. In this work, we address not what kind of attributes to use, but how to condition on this information to improve model performance. We consider three types of conditioning; weak, strong, and pure, which respectively relate to concatenation-based conditioning, gating, and transformations that are causally dependent on the attributes. This categorization provides a unifying viewpoint on different classes of GNNs, from separable convolutions to various forms of message passing networks. We provide an empirical study on the effect of conditioning methods in several tasks in computational chemistry.","graph neural networks, geometric deep learning, deep learning" Speed Up Iterative Non-Autoregressive Transformers by Distilling Multiple Steps,https://openreview.net/forum?id=wWg_Ee5q_W,https://openreview.net/pdf?id=wWg_Ee5q_W,,"The computational benefits of iterative non-autoregressive transformers decrease as the number of decoding steps increases. As a remedy, we introduce Distill Multiple Steps (DiMS), a simple yet effective distillation technique to decrease the number of required steps to reach a certain translation quality. The distilled model enjoys the computational benefits of early iterations while preserving the enhancements from several iterative steps. DiMS relies on two models, namely a student and a teacher. The student is optimized to predict the output of the teacher after multiple decoding steps while the teacher follows the student via a slow-moving average. The moving average keeps the teacher's knowledge updated and enhances the quality of the labels provided by the teacher. During inference, the student is used for translation and no additional computation is added. We verify the effectiveness of DiMS on various models, obtaining improvements of 7 and 12.9 BLEU points on distilled and raw versions of WMT'14 De-En, respectively.","non-autoregressive machine translation, knowledge distillation" Global View For GCN: Why Go Deep When You Can Be Shallow?,https://openreview.net/forum?id=2BruD7pa7E,https://openreview.net/pdf?id=2BruD7pa7E,,"Existing graph convolutional network (GCN) methods attempt to expand the receptive field of their convolutions by either stacking up more convolutional layers or accumulating multi-hop adjacency matrices. Either approach increases computational complexity while providing a limited view of the network topology.
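The weak/strong/pure taxonomy above maps naturally onto a single message module. A minimal PyTorch sketch (our own, simplified to batched edge tensors):

```python
import torch
import torch.nn as nn

class ConditionedMessage(nn.Module):
    """Three ways to condition a GNN message on edge attributes e_ij:
    'weak'   - concatenate [h_i, h_j, e_ij]            (concatenation),
    'strong' - gate the message with a function of e_ij (gating),
    'pure'   - the transformation itself is generated from e_ij."""
    def __init__(self, d, de, mode="weak"):
        super().__init__()
        self.mode, self.d = mode, d
        self.weak = nn.Linear(2 * d + de, d)
        self.msg = nn.Linear(2 * d, d)
        self.gate = nn.Sequential(nn.Linear(de, d), nn.Sigmoid())
        self.hyper = nn.Linear(de, d * 2 * d)   # emits a per-edge weight matrix

    def forward(self, h_i, h_j, e):
        pair = torch.cat([h_i, h_j], dim=-1)
        if self.mode == "weak":
            return self.weak(torch.cat([pair, e], dim=-1))
        if self.mode == "strong":
            return self.gate(e) * self.msg(pair)
        W = self.hyper(e).view(-1, self.d, 2 * self.d)  # pure conditioning
        return torch.einsum("bij,bj->bi", W, pair)
```

The three branches differ only in how strongly the edge attribute shapes the message, which is exactly the axis the paper compares empirically.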
We propose to extend k-hop adjacency matrices into one generalized exponential matrix to provide GCNs with a global overview of the network topology. This technique allows the GCNs to learn global topology without going deep and with far fewer parameters than most state-of-the-art GCNs, challenging the common assumption that deep GCNs are empirically better for learning global features. We show a significant improvement in performance in semi-supervised learning when this technique is used for common GCNs, while maintaining much shallower network architectures ($\leq4$ layers) than the existing ones.","GCN, GNN, Clustering, Semi-supervised Learning" Cross-Silo Training of Differentially Private Models with Secure Multiparty Computation,https://openreview.net/forum?id=lLu1Xel2qfh,https://openreview.net/pdf?id=lLu1Xel2qfh,,"We address the problem of learning a machine learning model from training data that originates at multiple data holders in a cross-silo federated setup, while providing formal privacy guarantees regarding the protection of each holder's data. Existing solutions based on Differential Privacy (DP) achieve this at the cost of a drop in accuracy. Solutions based on Secure Multiparty Computation (MPC) do not incur such accuracy loss but leak information when the trained model is made publicly available. We propose an MPC solution for training differentially private models. Our solution relies on an MPC protocol for model training, and an MPC protocol for perturbing the trained model coefficients with Laplace noise in a privacy-preserving manner. The resulting MPC+DP approach achieves higher accuracy than a pure DP approach, while providing the same formal privacy guarantees. ","Privacy-preserving machine learning, Secure multi party computation, Differential privacy" HyperTime: Implicit Neural Representations for Time Series Generation,https://openreview.net/forum?id=bvgHBkSBdcj,https://openreview.net/pdf?id=bvgHBkSBdcj,"We propose a time series specific implicit neural representation architecture, and use it to generate synthetic data. ","Implicit neural representations (INRs) have recently emerged as a powerful tool that provides an accurate and resolution-independent encoding of data. Their robustness as general approximators has been shown on a wide variety of data sources, with applications to image, sound, and 3D scene representation. However, little attention has been given to leveraging these architectures for the representation and analysis of time series data. In this paper, we propose a new INR architecture for time series (iSIREN) designed to perform an accurate reconstruction of univariate and multivariate data, while also providing an interpretable encoding of the signal. We compare our architecture against SIREN and INRs with different activations, in terms of training convergence, and the reconstruction accuracy of both the signal and its spectral distribution. To achieve generalization, we propose a hypernetwork architecture (HyperTime) that leverages iSIRENs to learn a latent representation of an entire time series dataset. In addition to the traditional reconstruction loss, we introduce an FFT-based loss that guides the training by enforcing a good match of the ground truth spectral distribution. We show how these architectures can be used for time series generation, and evaluate our method through fidelity metrics, presenting results that exceed the performance of state-of-the-art techniques.
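The generalized exponential matrix above can be approximated with a truncated power series of the normalized adjacency. A small NumPy sketch (our own illustration of the idea, not the paper's exact operator):

```python
import numpy as np

def exp_propagation(adj, k_max=10, beta=1.0):
    """Truncated series for exp(beta * A_hat): a single dense operator
    that aggregates all k-hop neighborhoods at once (weighted by
    beta^k / k!), instead of stacking k graph-convolution layers."""
    n = adj.shape[0]
    deg = adj.sum(1)
    a_hat = adj / np.sqrt(np.outer(deg, deg) + 1e-8)  # symmetric normalization
    out, term = np.eye(n), np.eye(n)
    for k in range(1, k_max + 1):
        term = term @ (beta * a_hat) / k              # beta^k A_hat^k / k!
        out += term
    return out
```

A shallow GCN can then use `exp_propagation(adj) @ X` in place of repeated one-hop aggregation, which is the "global view without going deep" trade-off.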
Finally, we propose an alternative hypernetwork architecture (iHyperTime) that incorporates interpretability into the latent representation, enabling the introduction of prior knowledge by imposing constraints on the generation process.","Time Series, Implicit Neural Representations, Time Series Generation" Generative Adversarial Federated Model,https://openreview.net/forum?id=94bybXmOLz-,https://openreview.net/pdf?id=94bybXmOLz-,,"As an emerging technique, vertical federated learning collaborates with different data sources to jointly train a machine learning model without data exchange. However, federated learning is computationally expensive and inefficient in modeling due to complex encryption algorithms or secure computation protocols. Split learning offers a solution to this high computational cost and low modeling efficiency. However, vanilla split learning still suffers from privacy leakage, especially label leakage from the active party. Here, we propose the Generative Adversarial Federated Model (GAFM), built upon the vanilla split learning framework and the Generative Adversarial Network (GAN), for improved label privacy protection against commonly used attacks. In our empirical studies on two publicly available datasets, GAFM showed significant performance improvements in prediction and label privacy protection compared to existing models, including Marvell and SplitNN, which is an application of split learning to neural networks. We provide intuition on why GAFM can improve over SplitNN and Marvell, and demonstrate that GAFM offers label protection through gradient perturbation compared to SplitNN. ", Unsupervised Pretraining for Neural Value Approximation,https://openreview.net/forum?id=AgQ4GpzzRT,https://openreview.net/pdf?id=AgQ4GpzzRT,The paper presents an unsupervised pretraining approach that learns initializations of the critic/value network which possess desirable generalization properties in the context of deep reinforcement learning. ,"Deep neural networks are powerful function approximators and have successfully been employed for the parameterization of value functions in deep reinforcement learning. Neural value approximation is a powerful paradigm for model-free control but it can often result in instability and divergence, especially when combined with off-policy learning and bootstrapping. Recent works have revealed some intrinsic connections between the unstable behavior of neural value approximation and the generalization properties of the value network/critic. Motivated by this, we propose a simple and computationally efficient unsupervised pretraining method to be performed before neural value learning. The method learns initializations of the critic parameters that correspond to Neural Tangent Kernels with desirable generalization structures. We demonstrate the merits of our approach by combining it with the Soft Actor-Critic algorithm and testing its performance on the continuous control environments of the DeepMind Control Suite. Our approach results in considerable improvements in reward accumulation, sample efficiency and stability for the majority of the domain environments.
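The FFT-based loss mentioned in the HyperTime abstract above is simple to write down. A minimal sketch that matches spectral magnitudes (our own; the paper's exact formulation may differ):

```python
import torch

def fft_loss(pred, target):
    """Spectral reconstruction loss: match the magnitudes of the real
    FFT of predicted and ground-truth series, used alongside the usual
    pointwise reconstruction loss."""
    fp = torch.fft.rfft(pred, dim=-1)
    ft = torch.fft.rfft(target, dim=-1)
    return torch.mean((fp.abs() - ft.abs()) ** 2)

# usage sketch: loss = mse(pred, target) + lam * fft_loss(pred, target)
```

Penalizing spectral mismatch pushes the generator toward series whose frequency content, not just pointwise values, matches the data.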
Furthermore, the use of the proposed pretraining enables us to retain the performance gains when changing the activation function used between layers of the critic architecture.","reinforcement learning, Neural Tangent Kernels, unsupervised pretraining, neural value approximation" Homotopy Learning of Parametric Solutions to Constrained Optimization Problems,https://openreview.net/forum?id=GdimRqV_S7,https://openreview.net/pdf?id=GdimRqV_S7,,"Building deep learning (DL) alternatives to constrained optimization problems has been proposed as a cheaper solution approach than classical constrained optimization solvers. However, these approximate learning-based solutions still suffer from constraint violations. From this perspective, reaching reliable convergence remains an open challenge for DL models, even with state-of-the-art methods to impose constraints, especially when facing a large set of nonlinear constraints forming a non-convex feasible set. In this paper, we propose the use of homotopy meta-optimization heuristics, which create a continuous transformation of the objective and constraints during training, to promote a more reliable convergence where the solution feasibility can be further improved. The method developed in this work includes 1) general-purpose homotopy heuristics based on the relaxation of objectives and constraint bounds to enlarge the basin of attraction and 2) physics-informed transformation of the domain problem leading to trivial starting points lying within the basin of attraction. Experimentally, we demonstrate the efficacy of the proposed method on a set of abstract constrained optimization problems and real-world power grid optimal power flow problems with increasing complexity. Results show that constrained deep learning models with homotopy heuristics can improve the feasibility of the resulting solutions while achieving near-optimal objective values when compared with non-homotopy counterparts.","homotopy, deep learning, constrained optimization, nonlinear programming, constrained deep learning, differentiable parametric programming" When Rigid Coherency Hurts: Distributional Coherency Regularization for Probabilistic Hierarchical Time Series Forecasting,https://openreview.net/forum?id=YsNlFsG-jj,https://openreview.net/pdf?id=YsNlFsG-jj,A novel probabilistic neural model for calibrated and accurate probabilistic forecasting on datasets of varying hierarchical consistency.,"Probabilistic hierarchical time-series forecasting is an important variant of time-series forecasting, where the goal is to model and forecast multivariate time-series that have hierarchical relations. Previous works all assume rigid consistency over the given hierarchies and do not adapt to real-world data that show deviation from this assumption. Moreover, recent state-of-the-art neural probabilistic methods also impose hierarchical relations on point predictions and samples of distribution. This does not account for full forecast distributions being coherent with the hierarchy and leads to poorly calibrated forecasts. We close both these gaps and propose PROFHIT, a probabilistic hierarchical forecasting model that jointly models forecast distributions over the entire hierarchy. PROFHIT (1) uses a flexible probabilistic Bayesian approach and (2) introduces soft distributional coherency regularization that enables end-to-end learning of the entire forecast distribution leveraging information from the underlying hierarchy.
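Soft distributional coherency, as opposed to hard coherency on point forecasts, can be illustrated for Gaussian forecasts: penalize the mismatch between each parent's forecast distribution and the aggregate of its children's. This sketch (our own) uses a squared 2-Wasserstein distance between Gaussians for concreteness; the paper's regularizer may use a different divergence:

```python
import numpy as np

def soft_coherency_penalty(mu, sigma, children):
    """For each parent p with child set children[p], compare the parent's
    forecast N(mu[p], sigma[p]^2) to the sum of its children's forecasts
    (Gaussian under independence), as a soft penalty rather than a hard
    constraint on point predictions."""
    penalty = 0.0
    for p, kids in children.items():
        m = sum(mu[c] for c in kids)
        v = sum(sigma[c] ** 2 for c in kids)
        # squared 2-Wasserstein distance between two univariate Gaussians
        penalty += (mu[p] - m) ** 2 + (sigma[p] - np.sqrt(v)) ** 2
    return penalty
```

Because the penalty is soft, the model can trade off coherency against fit on datasets whose hierarchies are only approximately consistent.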
This enables robust and calibrated forecasts as well as adaptation to real-life data with varied hierarchical consistency. PROFHIT provides 41-88% better performance in accuracy and 23-33% better calibration over a wide range of dataset consistency. Furthermore, PROFHIT can robustly provide reliable forecasts even if up to 10% of input time-series data is missing, whereas other methods’ performance severely degrades by over 70%.","Hierarchical Forecasting, Time Series Forecasting, Deep Probabilistic Models" EENet: Learning to Early Exit for Adaptive Inference,https://openreview.net/forum?id=_SQ-303Iu6G,https://openreview.net/pdf?id=_SQ-303Iu6G,"We introduce EENet, a novel, lightweight early exit policy optimization method for budgeted adaptive inference with early exits.","Budgeted adaptive inference with early exits is an emerging technique to improve the computational efficiency of deep neural networks (DNNs) for edge AI applications with limited resources at test time. This method leverages the fact that different test data samples may not require the same amount of computation for a correct prediction. By allowing early exiting from full layers of DNN inference for some test examples, we can reduce latency and improve throughput of edge inference while preserving performance. Although there have been numerous studies on designing specialized DNN architectures for training early-exit enabled DNN models, most existing work employs hand-tuned or manual rule-based early exit policies. In this study, we introduce a novel multi-exit DNN inference framework, coined EENet, which leverages multi-objective learning to optimize the early exit policy for a trained multi-exit DNN under a given inference budget. This paper makes two novel contributions. First, we introduce the concept of early exit utility scores by combining diverse confidence measures with class-wise prediction scores to better estimate the correctness of test-time predictions at a given exit. Second, we train a lightweight, budget-driven, multi-objective neural network over validation predictions to learn the exit assignment scheduling for query examples at test time. The EENet early exit scheduler optimizes both the distribution of test samples to different exits and the selection of the exit utility thresholds such that the given inference budget is satisfied while the performance metric is maximized. Extensive experiments are conducted on five benchmarks, including three image datasets (CIFAR-10, CIFAR-100, ImageNet) and two NLP datasets (SST-2, AgNews). The results demonstrate the performance improvements of EENet compared to existing representative early exit techniques. We also perform an ablation study and visual analysis to interpret the results.","deep learning, edge computing, computational efficiency" MALIBO: Meta-Learning for Likelihood-free Bayesian Optimization,https://openreview.net/forum?id=K2spEiswXVf,https://openreview.net/pdf?id=K2spEiswXVf,"A Meta-learning method for likelihood-free Bayesian optimization, scalable and robust to different scales across datasets.","Bayesian Optimization (BO) is a popular method to optimize expensive black-box functions. Typically, BO only uses observations from the current task. Recently proposed methods try to warm-start BO by exploiting knowledge from related tasks, yet suffer from scalability issues and sensitivity to heterogeneous scale across multiple datasets.
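A threshold-based early-exit scheduler of the kind EENet learns can be sketched as follows (our own simplification: the utility score is just the max softmax probability, whereas EENet combines several confidence measures with class-wise scores):

```python
import numpy as np

def early_exit_predict(exit_logits, thresholds):
    """Route one example through a multi-exit network: stop at the first
    exit whose utility (here, max softmax probability) clears its learned
    threshold; otherwise fall through to the final exit. In a real model
    the logits of later exits are only computed if earlier exits decline."""
    for logits, tau in zip(exit_logits, thresholds):
        p = np.exp(logits - logits.max())
        p /= p.sum()
        if p.max() >= tau:
            return int(p.argmax())
    return int(np.argmax(exit_logits[-1]))
```

EENet's contribution is learning the per-exit thresholds (and the assignment of samples to exits) so that the inference budget is met, rather than hand-tuning them.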
We propose a novel approach to solve these problems by combining a meta-learning technique and a likelihood-free acquisition function. The meta-learning model simultaneously learns the underlying (task-agnostic) data distribution and a latent feature representation for individual tasks. The likelihood-free BO technique makes less stringent assumptions about the problem and works with any classification algorithm, making it computationally efficient and robust to different scales across tasks. Finally, gradient boosting is used as a residual model on top to adapt to distribution drifts between new and prior tasks, which might otherwise weaken the usefulness of the meta-learned features. Experiments show that the meta-model learns an effective prior for warm-starting optimization algorithms, while being cheap to evaluate and invariant to changes of scale across different datasets.","Bayesian Optimization, Meta-learning" Finding and only finding local Nash equilibria by both pretending to be a follower,https://openreview.net/forum?id=8abnSMeFaqA,https://openreview.net/pdf?id=8abnSMeFaqA,"We propose double Follow-the-Ridge (double-FTR), an algorithm with local convergence guarantee to differential Nash equilibria in general-sum two-player differential games.","Finding (local) Nash equilibria in two-player differentiable games is a classical problem in game theory with important relevance in machine learning. We propose double Follow-the-Ridge (double-FTR), an algorithm that locally converges to and only to local Nash equilibria in general-sum two-player differentiable games. To our knowledge, double-FTR is the first algorithm with such guarantees for general-sum games. Furthermore, we show that by varying its preconditioner, double-FTR leads to a broader family of algorithms with the same convergence guarantee. In addition, double-FTR avoids oscillation near equilibria due to the real eigenvalues of its Jacobian at fixed points. Empirically, we validate the double-FTR algorithm on a range of simple zero-sum and general-sum games, as well as simple Generative Adversarial Network (GAN) tasks.","game theory, general-sum games, local Nash equilibrium, optimization" Learning Low Dimensional State Spaces with Overparameterized Recurrent Neural Networks,https://openreview.net/forum?id=k9CF4h3muD,https://openreview.net/pdf?id=k9CF4h3muD,"Linear RNNs optimized with gradient descent have an implicit bias toward solutions with low dimensional state spaces, leading to non-trivial extrapolation.","Overparameterization in deep learning refers to settings where a trained Neural Network (NN) has representational capacity to fit the training data in many ways, some of which generalize well, while others do not. In the case of Recurrent Neural Networks (RNNs) there exists an additional layer of overparameterization, in the sense that a model may exhibit many solutions that generalize well for sequence lengths seen in training, some of which \emph{extrapolate} to longer sequences, while others do not. Numerous works studied the tendency of Gradient Descent (GD) to fit overparameterized NNs with solutions that generalize well. On the other hand, its tendency to fit overparameterized RNNs with solutions that extrapolate has been discovered only lately, and is far less understood. In this paper, we analyze the extrapolation properties of GD when applied to overparameterized linear RNNs.
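The "works with any classification algorithm" property above echoes classifier-based likelihood-free BO. A minimal sketch of one such acquisition step (our own illustration, omitting the meta-learned prior and the residual model; it assumes quantile labeling yields both classes):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def lfbo_suggest(X, y, candidates, gamma=0.25):
    """Likelihood-free acquisition: label the best gamma-quantile of
    observed points as positives (minimization), fit any off-the-shelf
    classifier, and rank candidate points by the predicted probability
    of being 'good'. No likelihood or GP surrogate is required."""
    labels = (y <= np.quantile(y, gamma)).astype(int)
    clf = GradientBoostingClassifier().fit(X, labels)
    scores = clf.predict_proba(candidates)[:, 1]
    return candidates[np.argmax(scores)]
```

Because only class labels are used, the acquisition is invariant to monotone rescalings of the objective, which is one source of the robustness to heterogeneous scales claimed above.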
In contrast to recent arguments suggesting an implicit bias towards short-term memory, we provide theoretical evidence for learning low dimensional state spaces, which can also model long-term memory. Our result relies on a dynamical characterization showing that GD (with small step size and near zero initialization) strives to maintain a certain form of balancedness, as well as tools developed in the context of the \emph{moment problem} from statistics (recovery of a discrete probability distribution from its moments). Experiments corroborate our theory, demonstrating extrapolation via learning low dimensional state spaces with both linear and non-linear RNNs.","RNN, gradient descent, implicit bias, extrapolation" Images as Weight Matrices: Sequential Image Generation Through Synaptic Learning Rules,https://openreview.net/forum?id=ddad0PNUvV,https://openreview.net/pdf?id=ddad0PNUvV,We train neural nets to execute sequences of synaptic learning rules to sequentially generate natural images (instead of weight matrices).,"Work on fast weight programmers has demonstrated the effectiveness of key/value outer product-based learning rules for sequentially generating a weight matrix (WM) of a neural net (NN) by another NN or itself. However, the weight generation steps are typically not visually interpretable by humans, because the contents stored in the WM of an NN are not. Here we apply the same principle to generate natural images. The resulting fast weight painters (FPAs) learn to execute sequences of delta learning rules to sequentially generate images as sums of outer products of self-invented keys and values, one rank at a time, as if each image were a WM of an NN. We train our FPAs in the generative adversarial networks framework, and evaluate them on various image datasets. We show how these generic learning rules can generate images with respectable visual quality without any explicit inductive bias for images. While the performance largely lags behind that of specialized state-of-the-art image generators, our approach allows for visualising how synaptic learning rules iteratively produce complex connection patterns, yielding human-interpretable meaningful images. Finally, we also show that an additional convolutional U-Net (now popular in diffusion models) at the output of an FPA can learn one-step ''denoising'' of FPA-generated images to enhance their quality.","learning rules, Fast Weight Programmers, linear Transformers, image generation, GANs" SurCo: Learning Linear Surrogates for Combinatorial Nonlinear Optimization Problems,https://openreview.net/forum?id=5o8oFs5D9Z,https://openreview.net/pdf?id=5o8oFs5D9Z,SurCo learns linear surrogate problems for nonlinear combinatorial optimization by training high-quality linear surrogates using end-to-end gradient descent with better performance in two industrial domains,"Optimization problems with expensive nonlinear cost functions and combinatorial constraints appear in many real-world applications, but remain challenging to solve efficiently. Existing combinatorial solvers like Mixed Integer Linear Programming can be fast in practice but cannot readily optimize nonlinear cost functions, while general nonlinear optimizers like gradient descent often do not handle complex combinatorial structures, may require many queries of the cost function, and are prone to local optima.
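The "one rank at a time" generation described in the fast weight painter abstract above is a sequence of delta-rule updates applied to a canvas. A tiny NumPy sketch (our own; in the paper the keys, values, and learning rates are produced by a trained network rather than sampled at random):

```python
import numpy as np

def paint_by_delta_rule(keys, values, betas):
    """Compose an image as a sequence of delta-rule 'strokes':
    canvas += beta * (v - canvas @ k) k^T, i.e., exactly the update a
    fast weight programmer would apply to a weight matrix, here applied
    to an H x W canvas instead."""
    H, W = values.shape[1], keys.shape[1]
    canvas = np.zeros((H, W))
    for k, v, b in zip(keys, values, betas):
        canvas += b * np.outer(v - canvas @ k, k)   # rank-1 delta update
    return canvas

# ten random strokes on a 32x32 canvas
rng = np.random.default_rng(0)
img = paint_by_delta_rule(rng.standard_normal((10, 32)),
                          rng.standard_normal((10, 32)),
                          0.1 * np.ones(10))
```

Each stroke is rank-1, so after n steps the canvas has rank at most n, which is what makes the generation process visually inspectable step by step.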
To bridge this gap, we propose SurCo, which learns linear Surrogate costs that can be used by existing Combinatorial solvers to output good solutions to the original nonlinear combinatorial optimization problem, combining the flexibility of gradient-based methods with the structure of linear combinatorial optimization. We learn these linear surrogates end-to-end with the nonlinear loss by differentiating through the linear surrogate solver. Three variants of SurCo are proposed: SurCo-zero operates on individual nonlinear problems, SurCo-prior trains a linear surrogate predictor on distributions of problems, and SurCo-hybrid uses a model trained offline to warm start online solving for SurCo-zero. We analyze our method theoretically and empirically, showing smooth convergence and improved performance. Experiments show that compared to state-of-the-art approaches and expert-designed heuristics, SurCo obtains lower cost solutions with comparable or faster solve time for two real-world industry-level applications: embedding table sharding and inverse photonic design.","Differentiable Optimization, Machine Learning, Nonlinear Optimization, Combinatorial Optimization" DT+GNN: A Fully Explainable Graph Neural Network using Decision Trees,https://openreview.net/forum?id=9IlzJa5cAv,https://openreview.net/pdf?id=9IlzJa5cAv,A new GNN architecture that allows for full explanation not only of the important inputs but also of the full decision-making process of how the inputs are used.,"We propose a new Decision Tree Graph Neural Network (DT+GNN) architecture for Graph Neural Network (GNN) explanation. Existing post-hoc explanation methods highlight important inputs but fail to reveal how a GNN uses these inputs. In contrast, DT+GNN is fully explainable: Humans can inspect and understand the decision making of DT+GNN at every step. DT+GNN internally uses a novel GNN layer that is restricted to categorical state spaces for nodes and messages. After training with gradient descent we can easily distill these layers into decision trees. These trees are further pruned using our newly proposed method to ensure they are small and easy to interpret. DT+GNN can also compute node-level importance scores like the existing explanation methods. We demonstrate on real-world GNN benchmarks that DT+GNN has competitive classification accuracy and computes competitive explanations. Furthermore, we leverage DT+GNN's full explainability to inspect the decision processes in synthetic and real-world datasets with surprising results. We make this inspection accessible through an interactive web tool.", Why (and When) does Local SGD Generalize Better than SGD?,https://openreview.net/forum?id=svCcui6Drl,https://openreview.net/pdf?id=svCcui6Drl,We derive a Stochastic Differential Equation (SDE) that captures the long-term behavior of Local SGD and provide a theoretical explanation why Local SGD generalizes better than SGD.,"Local SGD is a communication-efficient variant of SGD for large-scale training, where multiple GPUs perform SGD independently and average the model parameters periodically. It has been recently observed that Local SGD can not only achieve the design goal of reducing the communication overhead but also lead to higher test accuracy than the corresponding SGD baseline (Lin et al., 2020b), though the training regimes for this to happen are still in debate (Ortiz et al., 2021). This paper aims to understand why (and when) Local SGD generalizes better based on Stochastic Differential Equation (SDE) approximation.
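For reference, one communication round of the Local SGD scheme described above looks like this (our own sketch; `grad_fn` stands in for a worker's stochastic gradient oracle):

```python
import numpy as np

def local_sgd_round(worker_grad_fns, w, lr=0.1, local_steps=8):
    """One communication round of Local SGD: each worker takes several
    independent SGD steps starting from the shared iterate w, then the
    workers' iterates are averaged into the new shared iterate."""
    local_iterates = []
    for grad_fn in worker_grad_fns:
        w_k = w.copy()
        for _ in range(local_steps):
            w_k -= lr * grad_fn(w_k)   # independent local SGD step
        local_iterates.append(w_k)
    return np.mean(local_iterates, axis=0)
```

The paper's analysis concerns exactly the interplay between `local_steps` (communication frequency), the learning rate, and the drift this averaging induces near the manifold of minima.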
The main contributions of this paper include (i) the derivation of an SDE that captures the long-term behavior of Local SGD with a small learning rate, after approaching the manifold of minima, (ii) a comparison between the SDEs of Local SGD and SGD, showing that Local SGD induces a stronger drift term that can result in a stronger effect of regularization, e.g., a faster reduction of sharpness, and (iii) empirical evidence validating that a small learning rate and a long enough training time enable the generalization improvement over SGD, but removing either of the two conditions leads to no improvement.","local SGD, SDE, regularization, implicit bias, deep learning theory, optimization, distributed training" Function-space regularized Rényi divergences,https://openreview.net/forum?id=89GT-S49mGd,https://openreview.net/pdf?id=89GT-S49mGd,,"We propose a new family of regularized Rényi divergences parametrized not only by the order $\alpha$ but also by a variational function space. These new objects are defined by taking the infimal convolution of the standard Rényi divergence with the integral probability metric (IPM) associated with the chosen function space. We derive a novel dual variational representation that can be used to construct numerically tractable divergence estimators. This representation avoids risk-sensitive terms and therefore exhibits lower variance, making it well-behaved when $\alpha>1$; this addresses a notable weakness of prior approaches. We prove several properties of these new divergences, showing that they interpolate between the classical Rényi divergences and IPMs. We also study the $\alpha\to\infty$ limit, which leads to a regularized worst-case-regret and a new variational representation in the classical case. Moreover, we show that the proposed regularized Rényi divergences inherit features from IPMs such as the ability to compare distributions that are not absolutely continuous, e.g., empirical measures and distributions with low-dimensional support. We present numerical results on both synthetic and real datasets, showing the utility of these new divergences in both estimation and GAN training applications; in particular, we demonstrate significantly reduced variance and improved training performance.","Rényi divergence, integral probability metrics, variational formulas, worst-case-regret" Constant-Factor Approximation Algorithms for Socially Fair $k$-Clustering,https://openreview.net/forum?id=nMAbvsQo5YY,https://openreview.net/pdf?id=nMAbvsQo5YY,We present a constant-factor approximation algorithm for the socially fair k-means and k-medians problems.,"We study approximation algorithms for the socially fair $(\ell_p, k)$-clustering problem with $m$ groups, which includes the socially fair $k$-median ($p=1$) and $k$-means ($p=2$). We present (1) a polynomial-time $(5+2\sqrt{6})^p$-approximation with at most $k+m$ centers, (2) a $(5+2\sqrt{6}+\epsilon)^p$-approximation with $k$ centers in time $(nk)^{{2^{O(p)} m^2}/\epsilon}$, and (3) a $(15+6\sqrt{6})^p$-approximation with $k$ centers in time $k^{m}\cdot\text{poly}(n)$. The former is obtained by a refinement of the iterative rounding method via a sequence of linear programs. The latter two are obtained by converting a solution with up to $k+m$ centers to one with $k$ centers by sparsification methods for (2) and via an exhaustive search for (3).
We also compare the performance of our algorithms with existing approximation algorithms on benchmark datasets, and find that our algorithms outperform existing methods.","Clustering, k-means, fairness, approximation algorithms, iterative rounding" Re-calibrated Wasserstein GAN for large-scale imputation with informative missing,https://openreview.net/forum?id=Jg-oXkENo2p,https://openreview.net/pdf?id=Jg-oXkENo2p,We develop a novel method for imputing missing data in large scale health records using a Wasserstein GAN whose loss function is reweighted by missingness probability estimates,"Missing data are pervasive in electronic health records (EHR) and oftentimes the missingness is informative (i.e., Missing Not At Random). Presently available imputation methods typically do not account for this informative missingness or are computationally infeasible to handle the scale of EHR data. We develop a deep learning imputation method based on \textit{recalibrating} a Wasserstein Generative Adversarial Network (WGAN) to account for informative missingness in high-dimensional quantitative medical data. We propose a new quantile re-weighting technique to ensure distributional equivariance under informative missingness and integrate it with WGAN to enable efficient imputations in large-scale observational data in the presence of informative missingness and covariate imbalance. Results from our proposed algorithm show better recovery compared to existing methods in both synthetic and real-world data from the Reactions to Acute Hospitalization (REACH) and laboratory test results of COVID-19 patients in the New York Metropolitan area from the INSIGHT dataset. ","deep learning, data imputation, missing data, neural networks, Wasserstein GAN, quantile regression" Implicit Bias of Large Depth Networks: a Notion of Rank for Nonlinear Functions,https://openreview.net/forum?id=6iDHce-0B-a,https://openreview.net/pdf?id=6iDHce-0B-a,The representation cost of DNNs converges to a notion of nonlinear rank as the depth grows to infinity. This bias towards low-rank functions extends to large but finite widths.,"We show that the representation cost of fully connected neural networks with homogeneous nonlinearities - which describes the implicit bias in function space of networks with $L_2$-regularization or with losses such as the cross-entropy - converges as the depth of the network goes to infinity to a notion of rank over nonlinear functions. We then inquire under which conditions the global minima of the loss recover the `true' rank of the data: we show that for too large depths the global minimum will be approximately rank 1 (underestimating the rank); we then argue that there is a range of depths which grows with the number of datapoints where the true rank is recovered. Finally, we discuss the effect of the rank of a classifier on the topology of the resulting class boundaries and show that autoencoders with optimal nonlinear rank are naturally denoising.","Deep Neural Networks, implicit bias, representation cost, sparsity" Depth Separation with Multilayer Mean-Field Networks,https://openreview.net/forum?id=uzFQpkEzOo,https://openreview.net/pdf?id=uzFQpkEzOo,"We show that, using gradient flow, 3-layer networks can efficiently learn a function that no 2-layer networks can efficiently approximate.","Depth separation—why a deeper network is more powerful than a shallow one—has been a major problem in deep learning theory. Previous results often focus on representation power, for example, Safran et al.
(2019) constructed a function that is easy to approximate using a 3-layer network but not approximable by any 2-layer network. In this paper, we show that this separation is in fact algorithmic: one can efficiently learn the function constructed by Safran et al. (2019) using an overparametrized network with polynomially many neurons. Our result relies on a new way of extending the mean-field limit to multilayer networks, and a decomposition of the loss that factors out the error introduced by the discretization of infinite-width mean-field networks.","depth separation, mean-field, nonconvex optimization" Robust Policy Optimization in Deep Reinforcement Learning,https://openreview.net/forum?id=HnLFY8F9uS,https://openreview.net/pdf?id=HnLFY8F9uS,,"Entropy can play an essential role in policy optimization by favoring a stochastic policy, which eventually helps better explore the environment in reinforcement learning (RL). A proper balance between exploration and exploitation is challenging and might depend on the particular RL task. However, stochasticity often decreases as training progresses; thus, the policy becomes less exploratory. Therefore, in many cases, the policy can converge to a sub-optimal solution due to a lack of representative data during training. Moreover, this issue can be even more severe in high-dimensional environments. This paper investigates whether keeping a certain entropy threshold throughout training can help better policy learning. In particular, we propose an algorithm, Robust Policy Optimization (RPO), which leverages a perturbed Gaussian distribution to encourage high-entropy actions. We evaluated our method on various continuous control tasks from DeepMind Control, OpenAI Gym, Pybullet, and IsaacGym. We observed that in many settings, RPO increases the policy entropy early in training and then maintains a certain level of entropy throughout the training period. Eventually, our RPO agent shows consistently improved performance compared to PPO and other techniques such as data augmentation and entropy regularization. Furthermore, in several settings, our method stays robust in performance, while other baseline mechanisms fail to improve and even worsen the performance.","Deep Reinforcement Learning, Policy Optimization" Analogical Networks for Memory-Modulated 3D Parsing,https://openreview.net/forum?id=SRIQZTh0IK,https://openreview.net/pdf?id=SRIQZTh0IK,,"Despite recent breakthroughs in the applications of deep neural networks in visual perception, one setting that presents a persistent challenge is that of “few-shot learning.” Works in the area of few-shot visual learning mostly address the task of coarse image classification. Fine-grained visual parsing is necessary for scene understanding and action recognition. Thus far, a separate neural model is trained to parse each semantic category, which hinders knowledge sharing across objects, let alone few-shot visual parsing. We present Analogical Networks, a model that casts fine-grained visual parsing into analogical inference: instead of mapping input scenes to part labels, which is hard to adapt in a few-shot manner to novel inputs, our model retrieves related scenes from memory and their corresponding part structures, and predicts analogous part structures in the input scene, via an end-to-end learnable modulation mechanism. By conditioning on more than one memory, compositions of structures are predicted that mix and match parts from different visual experiences.
We show Analogical Networks excel at few-shot learning, where instances of novel object categories are successfully parsed simply by expanding the model’s memory, without any weight updates. Analogical Networks outperform existing state-of-the-art detection transformer models at part segmentation, as well as paradigms of meta-learning and few-shot learning. We show part correspondences emerge across memory and input scenes by simply training for a label-free segmentation objective, as a byproduct of the analogical inductive bias.", Fake It Until You Make It: Towards Accurate Near-Distribution Novelty Detection,https://openreview.net/forum?id=QWQM0ZwZdRS,https://openreview.net/pdf?id=QWQM0ZwZdRS,,"We aim for image-based novelty detection. Despite considerable progress, existing models either fail or face a dramatic performance drop under the so-called ``near-distribution'' setup, where the differences between normal and anomalous samples are subtle. We first demonstrate that existing methods can experience up to a 20\% decrease in their AUCs in the near-distribution setting. Next, we propose to exploit a score-based generative model to produce synthetic near-distribution anomalous data. Our model is then fine-tuned to distinguish such data from the normal samples. We perform a quantitative as well as qualitative evaluation of this strategy, and compare the results with a variety of GAN-based models. The effectiveness of our method for both near-distribution and standard novelty detection is assessed through extensive experiments on datasets in diverse applications such as medical images, object classification, and quality control. This reveals that our method significantly improves upon existing models, and consistently decreases the gap between the near-distribution and standard novelty detection AUCs by a considerable amount.", Injecting knowledge into language generation: a case study in auto-charting after-visit care instructions from medical dialogue,https://openreview.net/forum?id=wkZcA4Lb7Mw,https://openreview.net/pdf?id=wkZcA4Lb7Mw,We propose an approach for injecting domain knowledge into neural autoregressive language models using marginal probability regularization during training and apply it to the care plan generation task.,"Factual correctness is often the limiting factor in practical applications of natural language generation in high-stake domains such as healthcare. An essential requirement for maintaining factuality is the ability to deal with rare tokens. This paper focuses on rare tokens that appear in both the source and reference sequences, and which, when missed during generation, can hamper the factual correctness of the generated text. Starting from our fundamental premise that high-stake domains are also knowledge-rich, we show how to use knowledge to (a) identify which rare tokens that appear in both source and reference are important and (b) uplift their conditional probability. We introduce the ``utilization rate'' that encodes knowledge and serves as a regularizer by maximizing the marginal probability of selected tokens. We present a study in the knowledge-rich domain of healthcare, where we tackle the problem of generating after-visit care instructions based on patient-doctor dialogues. We verify that, in our dataset, specific medical concepts with high utilization rates are underestimated by conventionally trained sequence-to-sequence models.
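One plausible reading of the marginal-probability regularizer just described, sketched below: add to the cross-entropy a term that pushes up the total probability mass the decoder assigns to a selected set of knowledge-flagged rare tokens at each step. The token set `selected_ids`, the weighting `lam`, and the step-wise application are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def loss_with_marginal_uplift(logits, targets, selected_ids, lam=0.1):
    """Cross-entropy plus a term that raises the marginal probability of a
    selected token set (simplified: applied uniformly at every position)."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    probs = logits.softmax(dim=-1)                # (batch, time, vocab)
    marginal = probs[..., selected_ids].sum(-1)   # mass on selected tokens
    uplift = -torch.log(marginal + 1e-8).mean()   # maximize that mass
    return ce + lam * uplift

# toy usage with random logits over a 50-token vocabulary
logits = torch.randn(2, 7, 50, requires_grad=True)
targets = torch.randint(0, 50, (2, 7))
selected_ids = torch.tensor([3, 17, 42])          # hypothetical rare concepts
loss_with_marginal_uplift(logits, targets, selected_ids).backward()
```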
We observe that correcting this with our approach to knowledge injection reduces the uncertainty of the model and improves factuality and coherence without negatively impacting fluency. ","neural language generation, sequence-to-sequence, knowledge injection, medical dialogue data, care plan generation, EHR auto-charting" DySR: Adaptive Super-Resolution via Algorithm and System Co-design,https://openreview.net/forum?id=Pgtn4l6eKjv,https://openreview.net/pdf?id=Pgtn4l6eKjv,"We present DySR, an algorithm and system co-design approach to maintain super-resolution streaming task QoS on mobile devices via fast model adaptation.","Super resolution (SR) is a promising approach for improving the quality of low-resolution streaming services on mobile devices. On mobile devices, the available computing and memory resources change dynamically depending on other running applications. Due to the high computation and memory demands of SR models, it is essential to adapt the model according to available resources to harvest the best possible model performance while maintaining quality of service (QoS), such as meeting a minimum framerate and avoiding interruptions. Nevertheless, no SR model or machine learning system supports adaptive SR, and enabling adaptive SR models on mobile devices is challenging because adapting the model can cause a significant framerate drop or even service interruption. To address this challenge, we take an algorithm and system co-design approach and propose DySR, which maintains QoS while maximizing model performance. During the training stage, DySR employs an adaptation-aware one-shot Neural Architecture Search to produce sub-graphs that share kernel operation weights for low model adaptation overhead while striking a balance between performance and framerate. During the inference stage, an incremental model adaptation method is developed to further reduce the model adaptation overhead. We evaluate DySR on a diverse set of hardware and datasets to show that it can generate models close to the Pareto frontier while maintaining a steady framerate throughput with a memory footprint around 40\% smaller than that of baseline methods.","super-resolution, quality of service, inference, deep learning, systems" Domain Invariant Q-Learning for model-free robust continuous control under visual distractions,https://openreview.net/forum?id=sVx6FKx1iv,https://openreview.net/pdf?id=sVx6FKx1iv,,"End-to-end reinforcement learning on images has shown significant progress in recent years, especially with regularization of value estimation brought by data augmentation \citep{yarats2020image}. At the same time, domain randomization and representation learning have helped push the limits of these algorithms in visually diverse environments, full of distractors and spurious noise, making RL more robust to unrelated visual features. We present DIQL, a method that combines risk-invariant regularization and domain randomization to reduce the out-of-distribution generalization gap for temporal-difference learning. In this work, we draw a link by framing domain randomization as a richer extension of data augmentation to RL and support its generalized use. Our model-free approach improves baseline performance without the need for additional representation learning objectives and with limited additional computational cost. We show that DIQL outperforms existing methods on complex visuo-motor control environments with high visual perturbation.
In particular, our approach achieves state-of-the-art performance on the Distracting Control Suite benchmark, where we evaluate robustness to a number of visual perturbations, as well as OOD generalization and extrapolation capabilities.", Continual Learning with Soft-Masking of Parameter-Level Gradient Flow,https://openreview.net/forum?id=Zz8_2A4iPS,https://openreview.net/pdf?id=Zz8_2A4iPS,"This work aims to (1) overcome catastrophic forgetting, (2) encourage knowledge transfer, and (3) tackle the capacity problem in continual learning.","Existing research on task incremental learning in continual learning has primarily focused on preventing catastrophic forgetting (CF). Several techniques have achieved learning with no CF. However, they attain it by letting each task monopolize a sub-network in a shared network, which seriously limits knowledge transfer (KT) and causes over-consumption of the network capacity, i.e., as more tasks are learned, the performance deteriorates. The goal of this paper is threefold: (1) overcoming CF, (2) encouraging KT, and (3) tackling the capacity problem. A novel and simple technique (called SPG) is proposed that soft-masks (partially blocks) parameter updating in training based on the importance of each parameter to old tasks. Each task still uses the full network, i.e., no monopoly of any part of the network by any task, which enables maximum KT and reduction of capacity usage. Extensive experiments demonstrate the effectiveness of SPG in achieving all three objectives. More notably, it attains significant transfer of knowledge not only among similar tasks (with shared knowledge) but also among dissimilar tasks (with little shared knowledge) while preventing CF.","continual learning, catastrophic forgetting, knowledge transfer" Asynchronous Message Passing: A new Framework for Learning in Graphs,https://openreview.net/forum?id=2_I3JQ70U2,https://openreview.net/pdf?id=2_I3JQ70U2,"A new framework for neural networks in graphs: messages are handled one at a time, giving benefits in expressiveness and long-range propagation.","This paper studies asynchronous message passing (AMP), a new framework for applying neural networks to graphs. Existing graph neural networks (GNNs) use the message passing framework which is based on the synchronous distributed computing model. In traditional GNNs, nodes aggregate their neighbors in each round, which causes problems such as oversmoothing and expressiveness limitations. On the other hand, our AMP framework is based on the \textit{asynchronous} model, where nodes react to messages of their neighbors individually. We prove (i) AMP is at least as powerful as the message passing framework, (ii) AMP is more powerful than the $1$-WL test for graph isomorphism, an important benchmark for message passing GNNs, and (iii) conceptually, AMP can even separate any pair of graphs and compute graph isomorphism. We experimentally validate the findings on AMP's expressiveness, and show that AMP might be better suited to propagate messages over large distances in graphs. We also demonstrate that AMP performs well on several graph classification benchmarks.", Integrating Symmetry into Differentiable Planning with Steerable Convolutions,https://openreview.net/forum?id=n7CPzMPKQl,https://openreview.net/pdf?id=n7CPzMPKQl,,"We study how group symmetry helps improve data efficiency and generalization for end-to-end differentiable planning algorithms, when symmetry appears in decision-making tasks.
Motivated by equivariant convolution networks, we treat the path planning problem as one over \textit{signals} on grids. We show that value iteration in this case is a \textit{linear equivariant operator}, which is a (steerable) \textit{convolution}. This extends Value Iteration Networks (VINs), which use convolutional networks for path planning, with additional \textit{rotation} and \textit{reflection} symmetry. Our implementation is based on VINs and uses steerable convolution networks to incorporate symmetry. The experiments are performed on four tasks: 2D navigation, visual navigation, and 2-degrees-of-freedom (2DOF) configuration-space and workspace manipulation. Our symmetric planning algorithms improve training efficiency and generalization by large margins compared to non-equivariant counterparts, VIN and GPPN.", MolJET: Multimodal Joint Embedding Transformer for Conditional de novo Molecular Design and Multi-Property Optimization,https://openreview.net/forum?id=7UudBVsIrr,https://openreview.net/pdf?id=7UudBVsIrr,MolJET is a foundational generative chemistry model for molecular design that uses joint embeddings learned from three chemistry-related modalities to perform conditional multi-property optimization.,"Multi-property constrained optimization of molecules using generative de novo design models is vital for the successful application of Artificial Intelligence (AI) towards materials and drug discovery. Yet there remains a gap between the reported performance of such models in the literature and their practical utility in real-world design scenarios. Furthermore, existing models are largely inaccessible to chemists without an extensive background in computer science. To address these challenges, we propose a generative foundation model, the Multimodal Joint Embedding Transformer (MolJET), which performs conditional generation of desired molecular distributions based on human-interpretable chemistry prompts in a zero-shot manner. We assess MolJET on the standard benchmarks available in the GuacaMol and MIMOSA evaluation frameworks. These include structure-based sampling tasks as well as a range of multi-property optimization tasks that probe a model's ability to design drug-like molecules given realistic property constraints. We demonstrate that with self-supervised pretraining, MolJET outperforms 80% of task-optimized models while using zero-shot inferences and beats all baselines after minimal supervision. Moreover, the performance of MolJET on text-only conditioning tasks improves with the inclusion of property modalities during training, highlighting the importance of a multimodal approach to molecular design. MolJET is the first example of text-based de novo molecular design using large-scale multimodal foundation models and should serve as a building block towards further improvements to accessible AI for chemists.","Transformers, Multimodal, Molecules, Generative, Drug-design, LLM" The Challenges of Exploration for Offline Reinforcement Learning,https://openreview.net/forum?id=oKTl_-4qLJ,https://openreview.net/pdf?id=oKTl_-4qLJ,We compare existing and new exploration methods as a new way to generate useful data for offline reinforcement learning.,"Offline Reinforcement Learning (ORL) enables us to separately study the two interlinked processes of reinforcement learning: collecting informative experience and inferring optimal behaviour.
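Returning to the planning paper above: a minimal sketch of its core observation that one value-iteration backup on a grid is a local (convolutional) operation, here written with one-hot move kernels for four deterministic actions. The toroidal boundary from `np.roll` and the reward structure are simplifying assumptions for illustration; the VIN/steerable-convolution machinery is omitted.

```python
import numpy as np

def vi_step(V, R, gamma=0.95):
    """One value-iteration backup on a grid: four shifted copies of V
    (a convolution with one-hot move kernels), then a max over actions."""
    shifts = [np.roll(V, 1, 0), np.roll(V, -1, 0),   # move down / up
              np.roll(V, 1, 1), np.roll(V, -1, 1)]   # move right / left
    Q = R[None] + gamma * np.stack(shifts)           # (4, H, W) action values
    return Q.max(axis=0)                             # greedy backup

V = np.zeros((8, 8))
R = np.full((8, 8), -1.0)
R[7, 7] = 0.0                    # goal cell with zero step cost
for _ in range(30):              # iterate until values roughly settle
    V = vi_step(V, R)
```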
The second step has been widely studied in the offline setting, but just as critical to data-efficient RL is the collection of informative data. The task-agnostic setting for data collection, where the task is not known a priori, is of particular interest due to the possibility of collecting a single dataset and using it to solve several downstream tasks as they arise. We investigate this setting via curiosity-based intrinsic motivation, a family of exploration methods which encourage the agent to explore those states or transitions it has not yet learned to model. With Explore2Offline, we propose to evaluate the quality of collected data by transferring the collected data and inferring policies with reward relabelling and standard offline RL algorithms. We evaluate a wide variety of data collection strategies, including a new exploration agent, Intrinsic Model Predictive Control (IMPC), using this scheme and demonstrate their performance on various tasks. We use this decoupled framework to strengthen intuitions about exploration and the data prerequisites for effective offline RL.","Offline Reinforcement Learning, Exploration, Robotics" SGD with large step sizes learns sparse features,https://openreview.net/forum?id=ipRGZ91NvG4,https://openreview.net/pdf?id=ipRGZ91NvG4,Loss stabilization achieved via SGD with large step sizes leads to a hidden dynamics that promotes sparse feature learning ,"We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks. We present empirical observations that commonly used large step sizes (i) lead the iterates to jump from one side of a valley to the other, causing \textit{loss stabilisation}, and (ii) that this stabilisation induces a hidden stochastic dynamics orthogonal to the bouncing directions that \textit{biases it implicitly} toward simple predictors. Furthermore, we show empirically that the longer large step sizes keep SGD high in the loss landscape valleys, the better the implicit regularization can operate and find sparse representations. Notably, no explicit regularization is used, so that the regularization effect comes solely from the SGD training dynamics influenced by the step size schedule. Therefore, these observations unveil how, through the step size schedule, both gradient and noise drive the SGD dynamics together through the loss landscape of neural networks. We justify these findings theoretically through the study of simple neural network models. Finally, we shed new light on some common practices and observed phenomena when training neural networks.","SGD, large step sizes, implicit regularization, feature learning" Synergies Between Disentanglement and Sparsity: a Multi-Task Learning Perspective,https://openreview.net/forum?id=7ZcyRF7Y3S,https://openreview.net/pdf?id=7ZcyRF7Y3S,"We show how disentangled representations combined with sparse base-predictors can improve generalization and how, in a multi-task learning setting, sparsity regularization on the task-specific predictors can induce disentanglement.","Although disentangled representations are often said to be beneficial for downstream tasks, current empirical and theoretical understanding is limited. In this work, we provide evidence that disentangled representations coupled with sparse base-predictors improve generalization. In the context of multi-task learning, we prove a new identifiability result that provides conditions under which maximally sparse base-predictors yield disentangled representations.
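A toy illustration of the sparse base-predictor idea just stated: a linear multiclass predictor on top of a frozen representation, trained with a group-Lasso penalty whose groups are representation coordinates, so entire features are switched off per task. The least-squares loss and proximal step are illustrative stand-ins for the paper's bi-level formulation.

```python
import numpy as np

def group_lasso_step(W, X, Y, lr=0.1, lam=0.05):
    """One proximal-gradient step for multiclass least squares with a
    group-Lasso penalty grouping the rows of W (one group per feature)."""
    grad = X.T @ (X @ W - Y) / len(X)
    W = W - lr * grad
    norms = np.linalg.norm(W, axis=1, keepdims=True)          # group norms
    shrink = np.maximum(0.0, 1.0 - lr * lam / (norms + 1e-12))
    return W * shrink                            # whole rows shrink to exactly 0

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                   # frozen 16-dim representation
Y = np.eye(3)[rng.integers(0, 3, 200)]           # one-hot labels, 3 classes
W = np.zeros((16, 3))
for _ in range(300):
    W = group_lasso_step(W, X, Y)
print("features used:", int((np.linalg.norm(W, axis=1) > 0).sum()), "of 16")
```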
Motivated by this theoretical result, we propose a practical approach to learn disentangled representations based on a sparsity-promoting bi-level optimization problem. Finally, we explore a meta-learning version of this algorithm based on group-Lasso multiclass SVM base-predictors, for which we derive a tractable dual formulation. It obtains competitive results on standard few-shot classification benchmarks, while each task uses only a fraction of the learned representations. ","Disentanglement, identifiability, multi-task learning, sparsity, transfer learning, meta-learning" Discerning Hydroclimatic Behavior with a Deep Convolutional Residual Regressive Neural Network,https://openreview.net/forum?id=pm4Wuso4da1,https://openreview.net/pdf?id=pm4Wuso4da1,We analyze and visualize the water cycles of four large US watersheds using representation learning techniques.,"Water impacts the globe daily in new and familiar ways, such as the ongoing western United States drought and the 2022 Pakistan flood. These events impose sustained uncertainty, risk, and loss on the global ecosystem. Better forecasting tools are mandatory to calibrate our response in an effort to mitigate such natural hazards in our watersheds and adapt to the planet’s dynamic environment. Here, we present a Deep Convolutional Residual Regressive Neural Net (DCRRNN - pronounced “discern”) platform for obtaining, visualizing, and analyzing the basin response of watersheds to water cycle fluxes. We examine four very large basins, simulating river response to the hydroclimatic fluxes they face. Experiments modulating the time lag between remotely sensed and ground-truth measurements are performed to assess the metrological limits of this forecasting device. The resultant grand mean Nash-Sutcliffe and Kling-Gupta efficiency values both exceed 90\%. Our results show that DCRRNN can become a powerful resource to simulate and forecast the impacts of hydroclimatic events as they relate to watershed response in a globally changing climate.","water, climate, sustainability, supervised representation learning, societal considerations" Causal Reasoning in the Presence of Latent Confounders via Neural ADMG Learning,https://openreview.net/forum?id=dcN0CaXQhT,https://openreview.net/pdf?id=dcN0CaXQhT,,"Latent confounding has been a long-standing obstacle for causal reasoning from observational data. One popular approach is to model the data using acyclic directed mixed graphs (ADMGs), which describe ancestral relations between variables using directed and bidirected edges. However, existing methods using ADMGs are based on either linear functional assumptions or a discrete search that is complicated to use and lacks computational tractability for large datasets. In this work, we further extend the existing body of work and develop a novel gradient-based approach to learning an ADMG with nonlinear functional relations from observational data. We first show that the presence of latent confounding is identifiable under the assumptions of bow-free ADMGs with nonlinear additive noise models. With this insight, we propose a novel neural causal model based on autoregressive flows. This not only enables us to model complex causal relationships behind the data, but also to estimate their functional relationships (hence treatment effects) simultaneously.
We further validate our approach via experiments on both synthetic and real-world datasets, and demonstrate competitive performance against relevant baselines.","causality, causal discovery, causal inference, structural equation model, latent confounders, variational inference" ESC: A Benchmark For Multi-Domain End-to-End Speech Recognition,https://openreview.net/forum?id=9OL2fIfDLK,https://openreview.net/pdf?id=9OL2fIfDLK,,"Speech recognition applications cover a range of different audio and text distributions, with different speaking styles, background noise, transcription punctuation and character casing. However, many speech recognition systems require dataset-specific tuning (audio filtering, punctuation removal and normalisation of casing), therefore assuming a priori knowledge of both the audio and text distributions. This tuning requirement can lead to systems failing to generalise to other datasets and domains. To promote the development of multi-domain speech systems, we introduce the End-to-end Speech Challenge (ESC) for evaluating the performance of a single automatic speech recognition (ASR) system across a broad set of speech datasets. Benchmarked systems must use the same data pre- and post-processing algorithm across datasets - assuming the audio and text data distributions are a priori unknown. We compare a series of state-of-the-art (SoTA) end-to-end (E2E) systems on this benchmark, demonstrating how a single speech system can be applied and evaluated on a wide range of data distributions. We find E2E systems to be effective across datasets: in a fair comparison, E2E systems achieve within 2.6% of SoTA systems tuned to a specific dataset. Our analysis reveals that transcription artefacts, such as punctuation and casing, pose difficulties for ASR systems and should be included in evaluation. We believe E2E benchmarking over a range of datasets promotes the research of multi-domain speech recognition systems.","speech, end-to-end, evaluation, benchmark" Mitigating Gradient Bias in Multi-objective Learning: A Provably Convergent Approach,https://openreview.net/forum?id=dLAYGdKTi2,https://openreview.net/pdf?id=dLAYGdKTi2,We propose a gradient based multi-objective optimization algorithm which provably converges to a Pareto stationary point in stochastic convex and non-convex settings.,"Many machine learning problems today have multiple objective functions. They appear either in learning with multiple criteria, where learning has to make a trade-off between multiple performance metrics such as fairness, safety and accuracy; or in multi-task learning, where multiple tasks are optimized jointly, sharing inductive bias between them. These problems are often tackled by the multi-objective optimization framework. However, existing stochastic multi-objective gradient methods and their variants (e.g., MGDA, PCGrad, CAGrad, etc.) all adopt a biased noisy gradient direction, which leads to degraded empirical performance. To this end, we develop a stochastic multi-objective gradient correction (MoCo) method for multi-objective optimization. The unique feature of our method is that it can guarantee convergence without increasing the batch size even in the nonconvex setting. Simulations on multi-task supervised and reinforcement learning demonstrate the effectiveness of our method relative to state-of-the-art methods.
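For two objectives, the MGDA-style common-descent direction referenced above has a closed form: the minimum-norm point in the convex hull of the two gradients. With stochastic gradients this combination is biased, which is the issue MoCo's correction targets; the sketch below shows only the deterministic two-objective case, with illustrative names.

```python
import numpy as np

def min_norm_direction(g1, g2):
    """Min-norm element of {a*g1 + (1-a)*g2 : a in [0, 1]},
    the two-objective MGDA common-descent direction."""
    diff = g1 - g2
    denom = diff @ diff
    a = 0.5 if denom == 0 else np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0)
    return a * g1 + (1 - a) * g2

g1, g2 = np.array([1.0, 0.0]), np.array([0.0, 2.0])
d = min_norm_direction(g1, g2)
print(d, d @ g1 >= 0, d @ g2 >= 0)   # a direction that descends both objectives
```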
","Multi-objective Optimization, Machine Learning" Pareto Rank-Preserving Supernetwork for HW-NAS,https://openreview.net/forum?id=dMsyUtZxj_,https://openreview.net/pdf?id=dMsyUtZxj_,,"In neural architecture search (NAS), training every sampled architecture is very time-consuming and should be avoided. Weight-sharing is a promising solution to speed up the evaluation process. However, a sampled subnetwork is not guaranteed to be estimated precisely unless a complete individual training process is done. Additionally, practical deep learning engineering processes require incorporating realistic hardware-performance metrics into the NAS evaluation process, also known as hardware-aware NAS (HW-NAS). HW-NAS results a Pareto front, a set of all architectures that optimize conflicting objectives, i.e. task-specific performance and hardware efficiency. This paper proposes a supernetwork training methodology that preserves the Pareto ranking between its different subnetworks resulting in more efficient and accurate neural networks for a variety of hardware platforms. The results show a 97% near Pareto front approximation in less than 2 GPU days of search, which provides x2 speed up compared to state-of-the-art methods. We validate our methodology on NAS-Bench-201, DARTS and ImageNet. Our optimal model achieves 77.2% accuracy (+1.7% compared to baseline) with an inference time of 3.68ms on Edge GPU for ImageNet.","Neural Architecture Search, Supernetwork, Computer Vision" ProSampler: Improving Contrastive Learning by Better Mini-batch Sampling,https://openreview.net/forum?id=H71l8_zALJ,https://openreview.net/pdf?id=H71l8_zALJ,,"In-batch contrastive learning has emerged as a state-of-the-art self-supervised learning solution, with the philosophy of bringing semantically similar instances closer while pushing dissimilar instances apart within a mini-batch. However, the in-batch negative sharing strategy is limited by the batch size and falls short of prioritizing the informative negatives (i.e., hard negatives) globally. In this paper, we propose to sample mini-batches with hard negatives on a proximity graph in which the instances (nodes) are connected according to the similarity measurement. Sampling on the proximity graph can better exploit the hard negatives globally by bridging in similar instances from the entire dataset. The proposed method can flexibly explore the negatives by modulating two parameters, and we show that such flexibility is the key to better exploit hard negative globally. We evaluate the proposed method on three representative contrastive learning algorithms, each of which corresponds to one modality: image, text, and graph. Besides, we also apply it to the variants of the InfoNCE objective to verify its generality. The results show that our method can consistently boost the performance of contrastive methods, with a relative improvement of 2.5% for SimCLR on ImageNet-100, 1.4% for SimCSE on the standard STS task, and 1.2% for GraphCL on the COLLAB dataset.","global hard negative sampling, contrastive learning" $O(T^{-1})$ Convergence of Optimistic-Follow-the-Regularized-Leader in Two-Player Zero-Sum Markov Games ,https://openreview.net/forum?id=VWqiPBB_EM,https://openreview.net/pdf?id=VWqiPBB_EM,,"We prove that the optimistic-follow-the-regularized-leader (OFTRL) algorithm, together with smooth value updates, finds an $O(T^{−1})$ approximate Nash equilibrium in $T$ iterations for two-player zero-sum Markov games with full information. 
This improves the $\tilde{O}(T^{-5/6})$ convergence rate recently shown by Zhang et al. (2022). The refined analysis hinges on two essential ingredients. First, the sum of the regrets of the two players, though not necessarily non-negative as in normal-form games, is approximately non-negative in Markov games. This property allows us to bound the second-order path lengths of the learning dynamics. Second, we prove a tighter algebraic inequality regarding the weights deployed by OFTRL that shaves an extra $\log T$ factor. This crucial improvement enables the inductive analysis that leads to the final $O(T^{-1})$ rate.","multi-agent reinforcement learning, policy optimization, Markov game" Bispectral Neural Networks,https://openreview.net/forum?id=xnsg4pfKb7,https://openreview.net/pdf?id=xnsg4pfKb7,,"We present a neural network architecture, Bispectral Neural Networks (BNNs), for learning representations that are invariant to the actions of compact commutative groups on the space over which a signal is defined. The model incorporates the ansatz of the bispectrum, an analytically defined group invariant that is complete -- that is, it preserves all signal structure while removing only the variation due to group actions. Here, we demonstrate that BNNs are able to simultaneously learn groups, their irreducible representations, and corresponding complete invariant maps purely from symmetries implicit in data. Further, we demonstrate that the completeness property endows these networks with strong adversarial robustness. This work establishes Bispectral Neural Networks as a powerful computational primitive for robust invariant representation learning.","invariance, group theory, representation theory, geometry, representation learning, symmetry" Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise,https://openreview.net/forum?id=slHNW9yRie0,https://openreview.net/pdf?id=slHNW9yRie0,,"Standard diffusion models involve an image transform -- adding Gaussian noise -- and an image restoration operator that inverts this degradation. We observe that the generative behavior of diffusion models is not strongly dependent on the choice of image degradation, and in fact an entire family of generative models can be constructed by varying this choice. Even when using completely deterministic degradations (e.g., blur, masking, and more), the training and test-time update rules that underlie diffusion models can be easily generalized to create generative models. The success of these fully deterministic models calls into question the community's understanding of diffusion models, which relies on noise in either gradient Langevin dynamics or variational inference, and paves the way for generalized diffusion models that invert arbitrary processes.", Beyond Lipschitz: Sharp Generalization and Excess Risk Bounds for Full-Batch GD,https://openreview.net/forum?id=pOyi9KqE56b,https://openreview.net/pdf?id=pOyi9KqE56b,We show generalization and excess risk guarantees for the full-batch Gradient Descent (GD) algorithm,"We provide sharp path-dependent generalization and excess risk guarantees for the full-batch Gradient Descent (GD) algorithm on smooth losses (possibly non-Lipschitz, possibly nonconvex). At the heart of our analysis is an upper bound on the generalization error, which implies that average output stability and a bounded expected optimization error at termination lead to generalization.
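Looking back at the Cold Diffusion entry above, one way to read its generalized test-time update rule is the following noise-free sampling loop for an arbitrary deterministic degradation D(x0, s) and learned restoration R(x, s): re-estimate the clean signal, then trade degradation severity s for s-1. The two-term update and the toy invertible degradation are illustrative assumptions, not the authors' released code.

```python
def cold_diffusion_sample(xT, D, R, T):
    """Generalized noise-free sampling: re-estimate the clean signal, then
    step the degradation severity down by one level at a time."""
    x = xT                                        # fully degraded input
    for s in range(T, 0, -1):
        x0_hat = R(x, s)                          # model's clean estimate
        x = x - D(x0_hat, s) + D(x0_hat, s - 1)   # swap severity s for s-1
    return x

T = 8
D = lambda x0, s: x0 + float(s)   # toy invertible degradation: shift by s
R = lambda x, s: x - float(s)     # exact restoration stands in for the model
print(cold_diffusion_sample(D(3.0, T), D, R, T))  # recovers 3.0
```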
This result shows that a small generalization error occurs along the optimization path, and allows us to bypass the Lipschitz or sub-Gaussian assumptions on the loss prevalent in previous works. For nonconvex, convex, and strongly convex losses, we show the explicit dependence of the generalization error on the accumulated path-dependent optimization error, terminal optimization error, number of samples, and number of iterations. For nonconvex smooth losses, we prove that full-batch GD efficiently generalizes close to any stationary point at termination, and recovers the generalization error guarantees of stochastic algorithms with fewer assumptions. For smooth convex losses, we show that the generalization error is tighter than existing bounds for SGD (by up to one order of error magnitude). Consequently, the excess risk matches that of SGD with quadratically fewer iterations. Lastly, for strongly convex smooth losses, we show that full-batch GD achieves essentially the same excess risk rate as the state of the art for SGD, but with an exponentially smaller number of iterations (logarithmic in the dataset size).","Gradient Descent, Generalization Error, Smooth Nonconvex/Convex Optimization" Zero-Shot Retrieval with Search Agents and Hybrid Environments,https://openreview.net/forum?id=C8by2OoY6Y2,https://openreview.net/pdf?id=C8by2OoY6Y2,A learning-to-search agent combined with a hybrid dense-sparse retrieval environment achieves SoTA on zero-shot retrieval (BEIR).,"Learning to search is the task of building artificial agents that learn to autonomously use a search box to find information. So far, it has been shown that current language models can learn symbolic query reformulation policies, in combination with traditional term-based retrieval, but fall short of outperforming neural retrievers. We extend the previous learning-to-search setup to a hybrid environment, which accepts discrete query refinement operations, after a first-pass retrieval step performed by a dual encoder. Experiments on the BEIR task show that search agents, trained via behavioral cloning, outperform the underlying search system based on a combined dual-encoder retriever and cross-encoder reranker. Furthermore, we find that simple heuristic Hybrid Retrieval Environments (HRE) can improve baseline performance by several nDCG points. The search agent based on the HRE environment (HaRE) produces state-of-the-art performance on both zero-shot and in-domain evaluations. We carry out an extensive qualitative analysis to shed light on the agents' policies.","learning to search, information retrieval, document ranking, relevance feedback, zero shot, language models, behavioral cloning" Hyper-Decision Transformer for Efficient Online Policy Adaptation,https://openreview.net/forum?id=AatUEvC-Wjv,https://openreview.net/pdf?id=AatUEvC-Wjv,"We propose Hyper-Decision Transformer (HDT), a transformer-based model which generalizes to novel unseen tasks while maintaining strong data and parameter efficiency.","Decision Transformers (DT) have demonstrated strong performance in offline reinforcement learning settings, but quickly adapting to unseen novel tasks remains challenging. To address this challenge, we propose a new framework, called Hyper-Decision Transformer (HDT), that can generalize to novel tasks from a handful of demonstrations in a data- and parameter-efficient manner. To achieve such a goal, we propose to augment the base DT with an adaptation module, whose parameters are initialized by a hyper-network.
When encountering unseen tasks, the hyper-network takes a handful of demonstrations as inputs and initializes the adaptation module accordingly. This initialization enables HDT to efficiently adapt to novel tasks by fine-tuning only the adaptation module. We validate HDT's generalization capability on object manipulation tasks. We find that with a single expert demonstration and fine-tuning only 0.5% of DT parameters, HDT adapts faster to unseen tasks than fine-tuning the whole DT model. Finally, we explore a more challenging setting where expert actions are not available, and we show that HDT outperforms state-of-the-art baselines in terms of task success rates by a large margin. Demos are available on our project page: https://sites.google.com/view/hdtforiclr2023/home.","Offline Reinforcement Learning, One-shot Imitation Learning, Parameter-efficient Fine-tuning" Deep Learning of Intrinsically Motivated Options in the Arcade Learning Environment,https://openreview.net/forum?id=7bns2VTdMAx,https://openreview.net/pdf?id=7bns2VTdMAx,,"In Reinforcement Learning, intrinsic motivation drives directed behaviors through a wide range of reward-generating methods. Depending on the task and environment, these rewards can be useful, might complement each other, but can also break down entirely, as seen with the noisy TV problem for curiosity. We therefore argue that scalability and robustness, among others, are key desirable properties of a method to incorporate intrinsic rewards, which a simple weighted sum of rewards lacks. In a tabular setting, Explore Options let the agent call an intrinsically motivated policy in order to learn from its trajectories. We introduce Deep Explore Options, revising Explore Options within the Deep Reinforcement Learning paradigm to tackle complex visual problems. Deep Explore Options can naturally learn from several unrelated intrinsic rewards, ignore harmful intrinsic rewards, learn to balance exploration, but also isolate exploitative and exploratory behaviors for independent usage. We test Deep Explore Options on hard and easy exploration games of the Atari Suite, following a benchmarking study to ensure fairness. Our empirical results show that they achieve results similar to weighted-sum baselines, while maintaining their key properties. ","reinforcement learning, intrinsic motivation, exploration, options, auxiliary task learning" Solving Continuous Control via Q-learning,https://openreview.net/forum?id=U5XOGxAgccS,https://openreview.net/pdf?id=U5XOGxAgccS,Decoupling action dimensions during optimization and exploration for DQN in combination with bang-bang action discretization achieves state-of-the-art performance on a variety of continuous control tasks.,"While there has been substantial success in applying actor-critic methods to continuous control, simpler critic-only methods such as Q-learning often remain intractable in the associated high-dimensional action spaces. However, most actor-critic methods come at the cost of added complexity: heuristics for stabilisation, compute requirements, and wider hyperparameter search spaces. We show that these issues can be largely alleviated via Q-learning by combining action discretization with value decomposition, framing single-agent control as cooperative multi-agent reinforcement learning (MARL). With bang-bang actions, the performance of this critic-only approach matches that of state-of-the-art continuous actor-critic methods when learning from features or pixels.
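A minimal sketch of that critic-only recipe: discretize each action dimension to bang-bang values {-1, +1}, give each dimension its own Q-head over a shared trunk, and combine per-dimension utilities by averaging (one value-decomposition ansatz). The architecture sizes and the mean-combination are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecoupledQ(nn.Module):
    """Per-action-dimension Q-heads over bang-bang actions {-1, +1};
    the joint value is the mean of the chosen per-dimension utilities."""
    def __init__(self, obs_dim, act_dims, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden, 2) for _ in range(act_dims))

    def forward(self, obs):
        h = self.trunk(obs)
        return torch.stack([head(h) for head in self.heads], dim=1)  # (B, A, 2)

q = DecoupledQ(obs_dim=17, act_dims=6)
obs = torch.randn(4, 17)
utils = q(obs)                                   # per-dimension action values
greedy_bins = utils.argmax(dim=-1)               # independent argmax per dim
actions = greedy_bins.float() * 2 - 1            # map {0, 1} -> {-1, +1}
joint_q = utils.max(dim=-1).values.mean(dim=1)   # decomposed joint value
```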
We extend classical bandit examples from cooperative MARL to provide intuition for how decoupled critics leverage state information to coordinate joint optimization, and demonstrate surprisingly strong performance across a wide variety of continuous control tasks.","reinforcement learning, continuous control, learning efficiency" Make-A-Video: Text-to-Video Generation without Text-Video Data,https://openreview.net/forum?id=nJfylDvgzlq,https://openreview.net/pdf?id=nJfylDvgzlq,,"We propose Make-A-Video -- an approach for directly translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V). Our intuition is simple: learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage. Make-A-Video has three advantages: (1) it accelerates training of the T2V model (it does not need to learn visual and multimodal representations from scratch), (2) it does not require paired text-video data, and (3) the generated videos inherit the vastness (diversity in aesthetic, fantastical depictions, etc.) of today's image generation models. We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules. First, we decompose the full temporal U-Net and attention tensors and approximate them in space and time. Second, we design a spatial-temporal pipeline to generate high-resolution and high-frame-rate videos with a video decoder, interpolation model and two super-resolution models that can enable various applications besides T2V. In all aspects (spatial and temporal resolution, faithfulness to text, and quality), Make-A-Video sets the new state-of-the-art in text-to-video generation, as determined by both qualitative and quantitative measures.", EiX-GNN: Concept-level eigencentrality explainer for graph neural networks,https://openreview.net/forum?id=k8hq5rkmsJ_,https://openreview.net/pdf?id=k8hq5rkmsJ_,,"Nowadays, deep prediction models, especially graph neural networks, have a major place in critical applications. In such contexts, these models need to be highly interpretable or explainable by humans, and at the societal scale, this understanding should also be feasible for humans who do not have strong prior knowledge of the models and contexts that need to be explained. In the literature, explaining is a human knowledge transfer process regarding a phenomenon between an explainer and an explainee. We propose EiX-GNN (Eigencentrality eXplainer for Graph Neural Networks), a new powerful method for explaining graph neural networks that computationally encodes the social explainer-to-explainee dependence underlying the explanation process. To handle this dependency, we introduce the notion of explainee concept assimibility, which allows the explainer to adapt its explanation to the explainee's background or expectations. We conduct a qualitative study to illustrate our explainee concept assimibility notion on real-world data, as well as a quantitative study that compares, according to objective metrics established in the literature, the fairness and compactness of our method with those of state-of-the-art methods. It turns out that our method achieves strong results in both aspects. ", Unsupervised Adaptation for Fairness under Covariate Shift,https://openreview.net/forum?id=9_VrvV7d-FK,https://openreview.net/pdf?id=9_VrvV7d-FK,We propose an unsupervised adaptation algorithm to address fairness under covariate shift.
Our proposed objective involves the standard training loss along with a novel min-max entropy formulation to handle shift and a Wasserstein loss for fairness.,"Training fair models typically involves optimizing a composite objective accounting for both prediction accuracy and some fairness measure. However, due to a shift in the distribution of the covariates at test time, the learnt fairness tradeoffs may no longer be valid, which we verify experimentally. To address this, we consider an unsupervised adaptation problem of training fair classifiers when only a small set of unlabeled test samples is available along with a large labeled training set. We propose a novel modification to the traditional composite objective by adding a weighted entropy objective on the unlabeled test dataset. This involves a min-max optimization where weights are optimized to mimic the importance weighting ratios, followed by classifier optimization. We demonstrate that our weighted entropy objective provides an upper bound on the standard importance-sampled training objective common in covariate shift formulations under some mild conditions. Experimentally, we demonstrate that a Wasserstein distance based penalty for representation matching across protected subgroups, together with the above loss, outperforms existing baselines. Our method achieves the best accuracy-equalized odds tradeoff under the covariate shift setup. We find that, for the same accuracy, we get up to a 2x improvement in equalized odds on notable benchmarks.","Out of Distribution, Fairness, Unsupervised, Adaptation" Pushing the limits of self-supervised learning: Can we outperform supervised learning without labels?,https://openreview.net/forum?id=wEP-3nECiUE,https://openreview.net/pdf?id=wEP-3nECiUE,,"Despite recent progress made by self-supervised methods in representation learning with residual networks, they still underperform supervised learning on the ImageNet classification benchmark, limiting their applicability in performance-critical settings. Building on prior theoretical insights from RELIC [Mitrovic et al., 2021], we include additional inductive biases into self-supervised learning. We propose a new self-supervised representation learning method, RELICv2, which combines an explicit invariance loss with a contrastive objective over a varied set of appropriately constructed data views to avoid learning spurious correlations and obtain more informative representations. RELICv2 achieves 77.1% top-1 classification accuracy on ImageNet using linear evaluation with a ResNet50 architecture and 80.6% with larger ResNet models, outperforming previous state-of-the-art self-supervised approaches by a wide margin. Most notably, RELICv2 is the first unsupervised representation learning method to consistently outperform the supervised baseline in a like-for-like comparison over a range of ResNet architectures. Finally, we show that despite using ResNet encoders, RELICv2 is comparable to state-of-the-art self-supervised vision transformers.","self-supervised learning, contrastive learning, ImageNet" Towards Dynamic Sparsification by Iterative Prune-Grow LookAheads,https://openreview.net/forum?id=mKNAOg7CLX,https://openreview.net/pdf?id=mKNAOg7CLX,Network pruning via an iterative Grow-and-Prune approach,"Model sparsification is a process of removing redundant connections in a neural network, making it more compact and faster. Most pruning methods start with a dense pretrained model, which is computationally intensive to train.
Other pruning approaches perform compression at initialization, which saves training time but costs final accuracy, as an unreliable architecture can be selected given weak feature representations. In this work, we re-formulate network sparsification as an exploitation-exploration process during initial training, enabling dynamic learning of the sparse architecture. The exploitation phase assumes architecture stability and trains it to maximize accuracy, whereas the exploration phase challenges the current architecture with a novel $\textit{LookAhead}$ step that reactivates pruned parameters, quickly updates them together with existing ones, and reconfigures the sparse architecture with a pruning-growing paradigm. We demonstrate that the $\textit{LookAhead}$ methodology can effectively and efficiently oversee both architecture and performance during training, enabling early pruning with a capability of future recovery to correct previous poor pruning selections. Extensive results on the ImageNet and CIFAR datasets show consistent improvements over the prior art by large margins, for varying networks and for both structured and unstructured sparsity. For example, our method surpasses recent work by $+1.3\%$ top-1 accuracy at the same compression ratio for ResNet50-ImageNet unstructured sparsity. Moreover, our structured sparsity results also improve upon the previous best hardware-aware pruning method by $+0.8\%$ top-1 accuracy for MobileNet-ImageNet sparsification, offering $+134$ hardware FPS (im/s), while halving the training cost.","Network pruning, Network growing, Efficient Networks" Learning Useful Representations for Shifting Tasks and Distributions ,https://openreview.net/forum?id=rRgLJ8TwXe,https://openreview.net/pdf?id=rRgLJ8TwXe,,"Representation learning in deep models usually happens as a side effect of minimizing the expected risk using back-propagation. However, one of the challenges of modern deep learning is the increasingly recognized need to deal with multiple tasks and varying data distributions, as illustrated, for instance, by the value of transfer learning and the risks of shortcut learning. Are the representations learned by back-propagation up to the task? This work presents and empirically evaluates two methods that combine the feature extractors learned during multiple training episodes and construct a representation that is richer than those usually obtained through a single expected risk minimization episode. Comprehensive experiments in supervised transfer learning, self-supervised transfer learning, few-shot learning, and out-of-distribution robust learning scenarios show that such rich representations can match and often exceed the performance of those obtained by training an equivalently sized network, usually with a far smaller computational burden.","rich representation learning, supervised transfer learning, self-supervised transfer learning, few shot learning, out-of-distribution robust learning" Personalized Reward Learning with Interaction-Grounded Learning (IGL),https://openreview.net/forum?id=wGvzQWFyUB,https://openreview.net/pdf?id=wGvzQWFyUB,Eliminating reward engineering for recommendation systems,"In an era of countless content offerings, recommender systems alleviate information overload by providing users with personalized content suggestions. Due to the scarcity of explicit user feedback, modern recommender systems typically optimize for a fixed combination of implicit feedback signals across all users.
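Back to the prune-grow LookAhead entry above, a minimal sketch of one such reconfiguration step: lift the mask, take a quick dense update, then re-pick the mask by magnitude at a fixed sparsity so earlier poor pruning choices can be corrected. The plain SGD update and magnitude criterion are illustrative stand-ins for the paper's procedure.

```python
import numpy as np

def prune_grow_lookahead(w, mask, grad, lr=0.01, sparsity=0.8):
    """Reactivate pruned weights for one dense update, then rebuild the
    mask by magnitude (pruned-then-regrown weights keep their new values)."""
    w_dense = w - lr * grad                        # quick update, mask lifted
    k = int(sparsity * w.size)                     # number of weights to prune
    thresh = np.sort(np.abs(w_dense), axis=None)[k]
    new_mask = (np.abs(w_dense) > thresh).astype(w.dtype)
    return w_dense * new_mask, new_mask

rng = np.random.default_rng(0)
w = rng.normal(size=(32, 32))
mask = np.ones_like(w)
grad = rng.normal(size=w.shape)                    # stand-in for a real gradient
w, mask = prune_grow_lookahead(w * mask, mask, grad)
print("density:", mask.mean())                     # ~0.2 after 80% pruning
```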
However, this approach disregards a growing body of work showing that (i) implicit signals can be used by users in diverse ways, signaling anything from satisfaction to active dislike, and (ii) different users communicate preferences in different ways. We propose applying the recent Interaction Grounded Learning (IGL) paradigm to address the challenge of learning representations of diverse user communication modalities. Rather than taking a fixed, human-designed reward function, IGL is able to learn personalized reward functions for different users and then optimize directly for the latent user satisfaction. We demonstrate the success of IGL with experiments using simulations as well as with real-world production traces.","interaction-grounded learning, recommendation systems, interactive machine learning, contextual bandits" From Adaptive Query Release to Machine Unlearning,https://openreview.net/forum?id=5Qt6ZqXSDEZ,https://openreview.net/pdf?id=5Qt6ZqXSDEZ,Efficient algorithms for exact machine unlearning for stochastic convex optimization,"We formalize the problem of machine unlearning as the design of efficient unlearning algorithms corresponding to learning algorithms which perform a selection of adaptive queries from structured query classes. We give efficient unlearning algorithms for linear and prefix-sum query classes. As applications, we show that unlearning in many problems, in particular, stochastic convex optimization (SCO), can be reduced to the above, yielding improved guarantees for the problem. In particular, for smooth Lipschitz losses and any $\rho>0$, our results yield an unlearning algorithm with excess population risk of $\tilde O\big(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\rho}\big)$ with unlearning query (gradient) complexity $\tilde O(\rho \cdot \text{Retraining Complexity})$, where $d$ is the model dimensionality and $n$ is the initial number of samples. For non-smooth Lipschitz losses, we give an unlearning algorithm with excess population risk $\tilde O\big(\frac{1}{\sqrt{n}}+\big(\frac{\sqrt{d}}{n\rho}\big)^{1/2}\big)$ with the same unlearning query (gradient) complexity. Furthermore, in the special case of Generalized Linear Models (GLMs), such as those in linear and logistic regression, we get dimension-independent rates of $\tilde O\big(\frac{1}{\sqrt{n}} +\frac{1}{(n\rho)^{2/3}}\big)$ and $\tilde O\big(\frac{1}{\sqrt{n}} +\frac{1}{(n\rho)^{1/3}}\big)$ for smooth Lipschitz and non-smooth Lipschitz losses respectively. Finally, we give generalizations of the above from one unlearning request to dynamic streams consisting of insertions and deletions.","machine unlearning, stochastic convex optimization" ReAct: Synergizing Reasoning and Acting in Language Models,https://openreview.net/forum?id=WE_vluYUL-X,https://openreview.net/pdf?id=WE_vluYUL-X,"We synergize reasoning and action taking in language models and make them more capable, versatile and interpretable.","While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics.
In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information. We apply our approach, named ReAct, to a diverse set of language and decision making tasks and demonstrate its effectiveness over state-of-the-art baselines, as well as improved human interpretability and trustworthiness over methods without reasoning or acting components. Concretely, on question answering (HotpotQA) and fact verification (Fever), ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generates human-like task-solving trajectories that are more interpretable than baselines without reasoning traces. On two interactive decision making benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10%, respectively, while being prompted with only one or two in-context examples.","Language model, agent, reasoning, decision making" Towards convergence to Nash equilibria in two-team zero-sum games,https://openreview.net/forum?id=4BPFwvKOvo5,https://openreview.net/pdf?id=4BPFwvKOvo5,Common no-regret algorithms fail to converge to a Nash equilibrium in two-team zero-sum games but a novel approach does converge locally.,"Contemporary applications of machine learning raise important and overlooked theoretical questions regarding optimization in two-team games. Formally, two-team zero-sum games are defined as multi-player games where players are split into two competing sets of agents, each experiencing a utility identical to that of their teammates and opposite to that of the opposing team. We focus on the solution concept of Nash equilibria and prove $\textrm{CLS}$-hardness of computing them in this class of games. To further examine the capabilities of online learning algorithms in games with full-information feedback, we propose a benchmark of a simple ---yet nontrivial--- family of such games. These games do not enjoy the properties used to prove convergence for relevant algorithms. In particular, we use a dynamical systems perspective to demonstrate that gradient descent-ascent, its optimistic variant, optimistic multiplicative weights update, and extra gradient fail to converge (even locally) to a Nash equilibrium. On a brighter note, we propose a first-order method that leverages control theory techniques and under some conditions enjoys last-iterate local convergence to a Nash equilibrium. We also believe our proposed method is of independent interest for general min-max optimization.","no-regret-learning, no-regret, optimization, learning-in-games, nash-equilibrium, game-theory, min-max-optimization, min-max" Ensemble Homomorphic Encrypted Data Classification,https://openreview.net/forum?id=LMuVjYmHNh4,https://openreview.net/pdf?id=LMuVjYmHNh4,,"Homomorphic encryption (HE) is encryption that permits users to perform computations on encrypted data without first decrypting it. HE can be used for privacy-preserving outsourced computation and analysis, allowing encrypted or sensitive data to be outsourced to commercial cloud environments for processing while remaining encrypted.
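A schematic of the interleaved reason-act loop that the ReAct entry above elicits via prompting: alternate free-text "Thought" turns with structured "Action" turns, feeding tool observations back into the context. Here `llm` and `wiki_search` are hypothetical callables, not a real API, and the action grammar is a simplified stand-in.

```python
def react_episode(question, llm, wiki_search, max_steps=8):
    """Alternate 'Thought' and 'Action' turns, grounding each new thought
    in the observation returned by the previous action (schematic only)."""
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        thought = llm(prompt + "Thought:")                    # reasoning trace
        action = llm(prompt + f"Thought: {thought}\nAction:") # structured act
        prompt += f"Thought: {thought}\nAction: {action}\n"
        if action.startswith("Finish["):
            return action[len("Finish["):-1]                  # final answer
        if action.startswith("Search["):
            obs = wiki_search(action[len("Search["):-1])
            prompt += f"Observation: {obs}\n"                 # ground next step
    return None
```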
HE enables new services by removing privacy barriers inhibiting data sharing or increasing the security of existing services. A convolutional neural network (CNN) with a shallow architecture can be homomorphically evaluated using addition and multiplication by replacing the activation function, such as ReLU, with a low-degree polynomial. To achieve the same performance as the ReLU activation function, we study the impact of applying ensemble techniques to solve the accuracy problem. Our experimental results empirically show that the ensemble approach can reduce bias and variance, increasing accuracy to match ReLU performance with both parallel and sequential techniques. We demonstrate the effectiveness and robustness of our method using three data sets: MNIST, FMNIST, and CIFAR-10 ","Machine Learning Privacy, Homomorphic Encrypted Data Classification, Ensemble Learning" Generative Pretraining for Black-Box Optimization,https://openreview.net/forum?id=eAR9bgWrUsa,https://openreview.net/pdf?id=eAR9bgWrUsa,,"Many problems in science and engineering involve optimizing an expensive black-box function over a high-dimensional space. For such black-box optimization (BBO) problems, we typically assume a small budget for online function evaluations, but also often have access to a fixed, offline dataset for pretraining. Prior approaches seek to utilize the offline data to approximate the function or its inverse but are not sufficiently accurate far from the data distribution. We propose BONET, a generative framework for pretraining a novel black-box optimizer using offline datasets. In BONET, we train an autoregressive model on fixed-length trajectories corresponding to runs of implicit black-box function optimizers. We design a sampling strategy to synthesize trajectories from offline data using a simple heuristic of rolling out monotonic transitions from low-fidelity to high-fidelity samples. Empirically, we instantiate BONET using a causally masked Transformer and evaluate it on Design-Bench, where it ranks best on average, outperforming state-of-the-art baselines.","decision making, generative modelling, transformers" Meta-Learning Black-Box Optimization via Black-Box Optimization,https://openreview.net/forum?id=mFDU0fP3EQH,https://openreview.net/pdf?id=mFDU0fP3EQH,"We meta-learn evolution strategies, which flexibly generalize to unseen optimization problems, population sizes and optimization horizons. ","Optimizing functions without access to gradients is the remit of black-box methods such as evolution strategies. While highly general, their learning dynamics are oftentimes heuristic and inflexible --- exactly the limitations that meta-learning can address. Hence, we propose to discover effective update rules for evolution strategies via meta-learning. Concretely, our approach employs a search strategy parametrized by a self-attention-based architecture, which guarantees the update rule is invariant to the ordering of the candidate solutions. We show that meta-evolving this system on a small set of representative low-dimensional analytic optimization problems is sufficient to discover new evolution strategies capable of generalizing to unseen optimization problems, population sizes and optimization horizons. Furthermore, the same learned evolution strategy can outperform established neuroevolution baselines on supervised and continuous control tasks.
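To make the ask/tell structure that evolution strategies share concrete, the following minimal sketch implements a hand-designed (mu, lambda) ES on a toy analytic objective. This is an illustrative baseline of the kind of heuristic update rule that such meta-learning work replaces with a learned, attention-based rule; it is not the paper's method, and all function names are ours.

```python
import numpy as np

def sphere(x):
    """Toy analytic objective to minimize."""
    return np.sum(x ** 2, axis=-1)

def simple_es(fn, dim=10, popsize=32, elite_frac=0.25, sigma=0.5, steps=200, seed=0):
    """Hand-designed (mu, lambda) evolution strategy with the standard
    ask/tell loop; a learned ES would replace the mean/sigma update rule
    below with a meta-trained, order-invariant network."""
    rng = np.random.default_rng(seed)
    mean = rng.normal(size=dim)
    n_elite = max(1, int(popsize * elite_frac))
    for _ in range(steps):
        # ask: sample a population of candidate solutions around the mean
        pop = mean + sigma * rng.normal(size=(popsize, dim))
        # tell: evaluate and rank (only the ordering of candidates matters)
        fitness = fn(pop)
        elite = pop[np.argsort(fitness)[:n_elite]]
        # heuristic update rule: move the mean toward the elite average
        mean = elite.mean(axis=0)
        sigma *= 0.995  # simple annealing heuristic
    return mean, fn(mean[None])[0]

best, value = simple_es(sphere)
print(f"best objective after search: {value:.6f}")
```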
As additional contributions, we ablate the individual neural network components of our method; reverse engineer the learned strategy into an explicit heuristic form, which remains highly competitive; and show that it is possible to self-referentially train an evolution strategy from scratch, with the learned update rule used to drive the outer meta-learning loop.","Meta-Learning, Evolution Strategies, Gradient-Free Optimization" The Use of Open-Source Boards for Data Collection and Machine Learning in Remote Deployments,https://openreview.net/forum?id=QjHSvrwkT8S,https://openreview.net/pdf?id=QjHSvrwkT8S,This paper describes how open-source hardware is used for data collection and machine learning tasks in off-grid setups. ,"Machine learning is being adopted in many walks of life to solve various problems. This is being driven by the development of robust machine learning algorithms, the availability of large datasets, and low-cost computation resources. Some machine learning applications require deployment of devices off the grid for data collection and real-time monitoring. Such applications require the development of systems that can operate autonomously during their deployment. Advances in technology have led to the development of low-cost, low-power open-source microcontrollers and single-board computers. These boards can be interfaced with a wide array of sensors and can perform computations on board. The boards are finding wide application in data collection and machine learning initiatives. This paper describes how the boards are leveraged for off-grid deployments.","Open-source hardware, single board computer, microcontroller, on-board processing, edge computing, field programmable gate array" Rhino: Deep Causal Temporal Relationship Learning with History-dependent Noise,https://openreview.net/forum?id=i_1rbq8yFWC,https://openreview.net/pdf?id=i_1rbq8yFWC,"We propose a causal discovery method for time series, which combines deep learning and variational inference to model instantaneous effects and history-dependent noise with a structural identifiability guarantee.","Discovering causal relationships between different variables from time series data has been a long-standing challenge for many domains. For example, in stock markets, the announcement of acquisitions from leading companies may have immediate effects on stock prices and increase the uncertainty of the future market due to this past action. To discover causal relations in such cases, the model needs to consider non-linear relations between variables, instantaneous effects and the change of noise distribution due to past actions. We refer to the latter as history-dependent noise. However, previous works do not offer a solution addressing all these problems together. In this paper, we propose a structural equation model, called Rhino, which combines vector auto-regression, deep learning and variational inference to model non-linear relationships with instantaneous effects while allowing the noise distribution to be modulated by history observations. Theoretically, we prove the structural identifiability of Rhino.
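As a concrete picture of the data-generating setting just described, here is a toy two-variable structural equation with a nonlinear lagged effect and a noise scale modulated by past observations (history-dependent noise). The specific functions and coefficients are invented stand-ins for exposition, not Rhino's learned networks.

```python
import numpy as np

def simulate_history_dependent_sem(T=500, seed=0):
    """Toy time series: x1 -> x2 with a nonlinear lagged effect, where the
    *noise scale* of x2 depends on past observations, the history-dependent
    noise setting described above."""
    rng = np.random.default_rng(seed)
    x1 = np.zeros(T)
    x2 = np.zeros(T)
    for t in range(1, T):
        x1[t] = 0.6 * x1[t - 1] + 0.3 * rng.normal()
        # nonlinear effect of the lagged parent
        mean_t = np.tanh(1.5 * x1[t - 1]) + 0.4 * x2[t - 1]
        # noise std modulated by history -> heteroscedastic noise
        scale_t = 0.1 + 0.5 * np.abs(x1[t - 1])
        x2[t] = mean_t + scale_t * rng.normal()
    return np.stack([x1, x2], axis=1)

data = simulate_history_dependent_sem()
print(data.shape)  # (500, 2)
```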
Our empirical results from extensive synthetic experiments and two real-world benchmarks demonstrate better discovery performance compared to relevant baselines, with ablation studies revealing its robustness under model misspecification.","Structure learning, Causal discovery, Time series, Structure equation model, deep generative model" DensePure: Understanding Diffusion Models towards Adversarial Robustness,https://openreview.net/forum?id=p7hvOJ6Gq0i,https://openreview.net/pdf?id=p7hvOJ6Gq0i,"We theoretically analyze the fundamental properties of diffusion models to understand why and how they enhance certified robustness. Inspired by the analysis, we propose a new method to improve the certified robustness of a clean classifier","Diffusion models have been recently employed to improve certified robustness through the process of denoising. However, the theoretical understanding of why diffusion models are able to improve the certified robustness is still lacking, preventing further improvement. In this study, we close this gap by analyzing the fundamental properties of diffusion models and establishing the conditions under which they can enhance certified robustness. This deeper understanding allows us to propose a new method DensePure, designed to improve the certified robustness of a pretrained model (i.e. classifier). Given an (adversarial) input, DensePure consists of multiple runs of denoising via the reverse process of the diffusion model (with different random seeds) to get multiple reversed samples, which are then passed through the classifier, followed by majority voting of inferred labels to make the final prediction. This design of using multiple runs of denoising is informed by our theoretical analysis of the conditional distribution of the reversed sample. Specifically, when the data density of a clean sample is high, its conditional density under the reverse process in a diffusion model is also high; thus sampling from the latter conditional distribution can purify the adversarial example and return the corresponding clean sample with a high probability. By using the highest density point in the conditional distribution as the reversed sample, we identify the robust region of a given instance under the diffusion model's reverse process. We show that this robust region is a union of multiple convex sets, and is potentially much larger than the robust regions identified in previous works. In practice, DensePure can approximate the label of the high density region in the conditional distribution so that it can enhance certified robustness. We conduct extensive experiments to demonstrate the effectiveness of DensePure by evaluating its certified robustness given a standard model via randomized smoothing. We show that DensePure is consistently better than existing methods on ImageNet, with 7% improvement on average. ","adversarial robustness, certified robustness, diffusion model" "Learn, Unlearn and Relearn: An Online Learning Paradigm for Deep Neural Networks",https://openreview.net/forum?id=gUTKBS34Q5c,https://openreview.net/pdf?id=gUTKBS34Q5c,An efficient online learning paradigm which interchanges between the unlearning phase (selective forgetting) and relearning phase (retraining) to improve generalization of the DNNs through weight reinitialization.,"Deep neural networks (DNNs) are often trained with the premise that the complete training data set is provided ahead of time. However, in real-world scenarios, data often arrive in chunks over time.
This leads to important considerations about the optimal strategy for training DNNs, such as whether to fine-tune them with each chunk of incoming data (warm-start) or to retrain them from scratch with the entire corpus of data whenever a new chunk is available. While employing the latter for training can be computationally inefficient, recent work has pointed out the lack of generalization in warm-start models. Therefore, to strike a balance between efficiency and generalization, we introduce Learn, Unlearn, and Relearn (LURE), an online learning paradigm for DNNs. LURE interchanges between the unlearning phase, which selectively forgets the undesirable information in the model through weight reinitialization in a data-dependent manner, and the relearning phase, which emphasizes learning on generalizable features. We show that our training paradigm provides consistent performance gains across datasets in both classification and few-shot settings. We further show that it leads to more robust and well-calibrated models.","warm-start, generalization, online learning, weight reinitialization, active forgetting, Anytime learning" Towards Understanding How Machines Can Learn Causal Overhypotheses ,https://openreview.net/forum?id=bGC7Ai125lR,https://openreview.net/pdf?id=bGC7Ai125lR,We present a new flexible environment which allows for the evaluation of existing techniques under variable causal overhypotheses,"Recent work in machine learning and cognitive science has suggested that understanding causal information is essential to the development of intelligence. One of the key challenges for current machine learning algorithms is modeling and understanding causal overhypotheses: transferable abstract hypotheses about sets of causal relationships. In contrast, even young children spontaneously learn causal overhypotheses, and use these to guide their exploration or to generalize to new situations. This has been demonstrated in a variety of cognitive science experiments using the “blicket detector” environment. We present a causal learning benchmark adapting the “blicket” environment for machine learning agents and evaluate a range of state-of-the-art methods in this environment. We find that although most agents have no problem learning causal structures seen during training, they are unable to learn causal overhypotheses from these experiences, and thus cannot generalize to new settings. ","causal reasoning, intervention, causal overhypotheses, Reinforcement learning, gpt-3" Grounding Graph Network Simulators using Physical Sensor Observations,https://openreview.net/forum?id=jsZsEd8VEY,https://openreview.net/pdf?id=jsZsEd8VEY,We ground Graph Network Simulators with physical sensor information to resolve uncertainties and improve long-term prediction quality.,"Physical simulations that accurately model reality are crucial for many engineering disciplines such as mechanical engineering and robotic motion planning. In recent years, learned Graph Network Simulators produced accurate mesh-based simulations while requiring only a fraction of the computational cost of traditional simulators. Yet, the resulting predictors are confined to learning from data generated by existing mesh-based simulators and thus cannot include real world sensory information such as point cloud data. As these predictors have to simulate complex physical systems from only an initial state, they exhibit a high error accumulation for long-term predictions.
In this work, we integrate sensory information to ground Graph Network Simulators on real world observations. In particular, we predict the mesh state of deformable objects by utilizing point cloud data. The resulting model allows for accurate predictions over longer time horizons, even under uncertainties in the simulation, such as unknown material properties. Since point clouds are usually not available for every time step, especially in online settings, we employ an imputation-based model. The model can make use of such additional information only when provided, and resorts to a standard Graph Network Simulator otherwise. We experimentally validate our approach on a suite of prediction tasks for mesh-based interactions between soft and rigid bodies. Our method utilizes the additional point cloud information to accurately predict stable simulations where existing Graph Network Simulators fail.","graph network simulators, deformable object simulation, point clouds" Skill Decision Transformer,https://openreview.net/forum?id=mb7PtrUbHa,https://openreview.net/pdf?id=mb7PtrUbHa,We introduce a skill conditioned Decision Transformer that utilizes learned offline behaviors,"Recent work has shown that Large Language Models (LLMs) can be incredibly effective for offline reinforcement learning (RL) by representing the traditional RL problem as a sequence modelling problem. However, many of these methods only optimize for high returns, and may not extract much information from a diverse dataset of trajectories. Generalized Decision Transformers (GDTs) have shown that utilizing future trajectory information, in the form of information statistics, can help extract more information from offline trajectory data. Building upon this, we propose Skill Decision Transformer (Skill DT). Skill DT draws inspiration from hindsight relabelling and skill discovery methods to discover a diverse set of \emph{primitive behaviors}, or skills. We show that Skill DT can not only perform offline state-marginal matching (SMM), but can also discover descriptive behaviors that can be easily sampled. Furthermore, we show that through purely reward-free optimization, Skill DT is still competitive with supervised offline RL approaches on the D4RL benchmark.","Transformer, Offline Reinforcement Learning" Architectural Backdoors in Neural Networks,https://openreview.net/forum?id=BLNZwf-9k09,https://openreview.net/pdf?id=BLNZwf-9k09,paper demonstrates a backdoor that can be planted at the architectural definition level of a neural network,"Machine learning is vulnerable to adversarial manipulation. Previous literature has demonstrated that at the training stage attackers can manipulate data (Gu et al.) and data sampling procedures (Shumailov et al.) to control model behaviour. A common attack goal is to plant backdoors, i.e., force the victim model to learn to recognise a trigger known only by the adversary. In this paper, we introduce a new class of backdoor attacks that hide inside model architectures, i.e., in the inductive bias of the functions used to train. These backdoors are simple to implement, for instance by publishing open-source code for a backdoored model architecture that others will reuse unknowingly. We demonstrate that model architectural backdoors represent a real threat and, unlike other approaches, can survive a complete re-training from scratch.
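As a deliberately simplified illustration of how a backdoor can live in an architecture rather than in learned weights, the toy model below contains a fixed, parameter-free pathway that fires on an attacker-chosen input pattern and biases one logit. The trigger and wiring here are our own invention for exposition and are not the construction from the paper.

```python
import torch
import torch.nn as nn

class BackdooredNet(nn.Module):
    """Toy classifier whose *architecture* contains a backdoor: the trigger
    detector has no trainable parameters, so re-initializing and retraining
    all learned weights from scratch does not remove it."""

    def __init__(self, num_classes=10, target_class=0):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, num_classes),
        )
        self.target_class = target_class

    def trigger_signal(self, x):
        # Weight-free detector: fires only when the top-left 2x2 patch
        # is saturated (the attacker-chosen trigger pattern).
        patch_mean = x[:, :, :2, :2].mean(dim=(1, 2, 3))
        return torch.relu(patch_mean - 0.99) * 1e4

    def forward(self, x):
        logits = self.classifier(x)
        bias = torch.zeros_like(logits)
        bias[:, self.target_class] = self.trigger_signal(x)
        return logits + bias

net = BackdooredNet()
clean = torch.rand(2, 1, 8, 8) * 0.5        # benign inputs, detector silent
triggered = clean.clone()
triggered[:, :, :2, :2] = 1.0               # stamp the trigger patch
print(net(clean).argmax(1), net(triggered).argmax(1))  # triggered -> class 0
```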
We formalise the main construction principles behind architectural backdoors, such as a link between the input and the output, and describe some possible protections against them. We evaluate our attacks on computer vision benchmarks of different scales and demonstrate the underlying vulnerability is pervasive in a variety of common training settings. ", In-distribution and Out-of-distribution Generalization for Graph Neural Networks,https://openreview.net/forum?id=JOix_wb4AeM,https://openreview.net/pdf?id=JOix_wb4AeM,,"Graph neural networks (GNNs) are models that allow learning with structured data of varying size. Despite their popularity, theoretical understanding of the generalization of GNNs is an under-explored topic. In this work, we expand the theoretical understanding of both in-distribution and out-of-distribution generalization of GNNs. Firstly, we improve upon the state-of-the-art PAC-Bayes (in-distribution) generalization bound primarily by reducing an exponential dependency on the node degree to a linear dependency. Secondly, utilizing tools from spectral graph theory, we prove some rigorous guarantees about the out-of-distribution (OOD) size generalization of GNNs, where graphs in the training set have different numbers of nodes and edges from those in the test set. To empirically verify our theoretical findings, we conduct experiments on both synthetic and real-world graph datasets. Our computed generalization gaps for the in-distribution case significantly improve the state-of-the-art PAC-Bayes results. For the OOD case, experiments on community classification tasks in large social networks show that GNNs achieve strong size generalization performance in cases guaranteed by our theory.","Graph Neural Networks, Generalization Bounds, Out-of-distribution generalization, Learning theory" "Where to Diffuse, How to Diffuse and How to get back: Learning in Multivariate Diffusions",https://openreview.net/forum?id=osei3IzUia,https://openreview.net/pdf?id=osei3IzUia,,"Diffusion-based generative models (DBGMs) perturb data to a target noise distribution and reverse this inference process to generate samples. The choice of inference diffusion affects both likelihoods and sample quality as it is tied to the generative model. Recent work in DBGMs has applied the principle of improving generative models with the use of auxiliary variables, leading to improved sample quality. While there are many such multivariate diffusions to explore, each new one requires significant model-specific analysis, hindering rapid prototyping and evaluation. In this work, we study linear Multivariate Diffusion Models (MDMs). First, for any number of auxiliary variables, we provide a recipe for maximizing a lower-bound on the MDM likelihood, without requiring any model-specific analysis. Next, we demonstrate how to parameterize the diffusion for a specified target noise distribution; these two points together enable optimizing the inference diffusion process. Optimizing the diffusion expands easy experimentation from just a few well-known processes to an automatic search over the set of linear diffusions. To demonstrate these ideas, we introduce two new specific diffusions as well as learn a diffusion process on the MNIST and CIFAR10 datasets. We achieve improved bits-per-dim bounds using the new diffusion, compared to the existing likelihood-trained VPSDE. We additionally connect the existing CLD objective to the likelihood lower-bound. 
","Diffusion models, score based generative model, generative models, variational inference" Contrastive Corpus Attribution for Explaining Representations,https://openreview.net/forum?id=eWKfMBL5to,https://openreview.net/pdf?id=eWKfMBL5to,A novel method to explain representations (from unsupervised and supervised models) in terms of input features.,"Despite the widespread use of unsupervised models, very few methods are designed to explain them. Most explanation methods explain a scalar model output. However, unsupervised models output representation vectors, the elements of which are not good candidates to explain because they lack semantic meaning. To bridge this gap, recent works defined a scalar explanation output: a dot product-based similarity in the representation space to the sample being explained (i.e., an explicand). Although this enabled explanations of unsupervised models, the interpretation of this approach can still be opaque because similarity to the explicand's representation may not be meaningful to humans. To address this, we propose contrastive corpus similarity, a novel and semantically meaningful scalar explanation output based on a reference corpus and a contrasting foil set of samples. We demonstrate that contrastive corpus similarity is compatible with many post-hoc feature attribution methods to generate COntrastive COrpus Attributions (COCOA) and quantitatively verify that features important to the corpus are identified. We showcase the utility of COCOA in two ways: (i) we draw insights by explaining augmentations of the same image in a contrastive learning setting (SimCLR); and (ii) we perform zero-shot object localization by explaining the similarity of image representations to jointly learned text representations (CLIP).","explainable artificial intelligence, interpretable machine learning, feature attributions, contrastive explanations" The ethical ambiguity of AI data enrichment: Measuring gaps in research ethics norms and practices,https://openreview.net/forum?id=MB_O268uCY,https://openreview.net/pdf?id=MB_O268uCY,"This paper shows how AI researchers engage with research ethics when employing crowdworkers. The work finds research ethics disclosures are infrequent in AI papers, inconsistently following venue publication policies.","The technical progression of artificial intelligence (AI) research has been built on breakthroughs in fields such as computer science, statistics, and mathematics. However, in the past decade AI researchers have increasingly looked to the social sciences, turning to human interactions to solve the challenges of model development. Paying crowdsourcing workers to generate or curate data, or ‘data enrichment’, has become indispensable for many areas of AI research, from natural language processing to inverse reinforcement learning. Other fields that routinely interact with crowdsourcing workers, such as Psychology, have developed common governance requirements and norms to ensure research is undertaken ethically. This study explores how, and to what extent, comparable research ethics requirements and norms have developed for AI research and data enrichment. We focus on the approach taken by two leading AI conferences: ICLR and NeurIPS. In a longitudinal study of accepted papers, and a comparison with Springer journal articles and Psychology papers, this work finds that ICLR and NeurIPS have established protocols for human data collection which are inconsistently followed by authors. 
Whilst Psychology papers engaging with crowdsourcing workers frequently disclose ethics reviews, payment data, demographic data and other information, such disclosures are far less common in leading AI conferences despite similar guidance. The work concludes with hypotheses to explain these gaps in research ethics practices and considerations of their implications.","ethics, disclosures, crowdsourcing, data enrichment" Spatio-temporal point processes with deep non-stationary kernels,https://openreview.net/forum?id=PsIk0kO3hKd,https://openreview.net/pdf?id=PsIk0kO3hKd,"Deep non-stationary kernel for spatio-temporal point process data modeling with low-rank structure, and a barrier method for constrained MLE optimization.","Deep neural networks, especially recurrent neural network (RNN) models, have become a popular tool for analyzing point process data. Despite the powerful expressiveness and memorizing ability of RNN models, they may not successfully model sophisticated non-stationary dependencies among data due to the recurrent structure. Meanwhile, another type of deep model for point process data was recently proposed, which represents the influence kernel rather than the intensity function by neural networks. This paper develops a deep non-stationary influence kernel for spatio-temporal point processes with a novel parameterization that enables us to approximate complicated kernels well in a low-rank form. A log-barrier penalty is introduced during network optimization to maintain the non-negativity of the conditional intensity. Our new method can also be extended to model high-dimensional marks, and we demonstrate outstanding performance gains on real police text data. The new approach significantly reduces the model and computational complexities, and the benefits of kernel recovery and event prediction are demonstrated using synthetic and real point process data. ","point process, neural network, non-stationary kernel, low-rank model" Federated Learning from Small Datasets,https://openreview.net/forum?id=hDDV1lsRV8,https://openreview.net/pdf?id=hDDV1lsRV8,"We propose federated daisy chaining to allow multiple parties to successfully train a joint model collaboratively from small local datasets, retaining the privacy benefits of federated learning.","Federated learning allows multiple parties to collaboratively train a joint model without having to share any local data. It enables applications of machine learning in settings where data is inherently distributed and undisclosable, such as in the medical domain. Joint training is usually achieved by aggregating local models. When local datasets are small, locally trained models can vary greatly from a globally good model. Bad local models can arbitrarily deteriorate the aggregate model quality, causing federated learning to fail in these settings. We propose a novel approach that avoids this problem by interleaving model aggregation and permutation steps. During a permutation step we redistribute local models across clients through the server, while preserving data privacy, to allow each local model to train on a daisy chain of local datasets. This enables successful training in data-sparse domains.
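The following is a high-level sketch of the interleaving just described: daisy-chain rounds that permute models across clients, alternating with FedAvg-style averaging rounds. The schedule, model, and training details are placeholder assumptions for illustration; the actual protocol may differ.

```python
import copy
import numpy as np
import torch
import torch.nn as nn

def local_step(model, data, target, lr=0.1):
    """One pass of plain SGD on a client's tiny local dataset."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss = nn.functional.mse_loss(model(data), target)
    opt.zero_grad()
    loss.backward()
    opt.step()

def average_models(models):
    """FedAvg-style aggregation: element-wise mean of parameters."""
    avg = copy.deepcopy(models[0])
    with torch.no_grad():
        for name, p in avg.named_parameters():
            stacked = torch.stack([dict(m.named_parameters())[name] for m in models])
            p.copy_(stacked.mean(0))
    return avg

# toy setup: 8 clients, each holding only 4 samples
torch.manual_seed(0)
clients = [(torch.randn(4, 5), torch.randn(4, 1)) for _ in range(8)]
models = [nn.Linear(5, 1) for _ in range(8)]

for rnd in range(20):
    for model, (x, y) in zip(models, clients):
        local_step(model, x, y)
    if rnd % 5 == 4:
        # aggregation round: every client receives the averaged model
        avg = average_models(models)
        models = [copy.deepcopy(avg) for _ in range(8)]
    else:
        # daisy-chain round: the server permutes models across clients,
        # so each model trains on a chain of small local datasets
        perm = np.random.permutation(8)
        models = [models[i] for i in perm]
```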
Combined with model aggregation, we thus achieve effective learning even if the local datasets are extremely small, while retaining the privacy benefits of federated learning.","federated learning, distributed, sparse data, daisy chain, small datasets" Zero-shot Human-Object Interaction Recognition by Bridging Generative and Contrastive Image-Language Models,https://openreview.net/forum?id=aZlmMFHaqPg,https://openreview.net/pdf?id=aZlmMFHaqPg,Our zero-shot HOI classifier outperforms supervised SOTAs by using a heterogeneous teacher-student framework which bridges generative and contrastive pre-trained image-language models through pseudo-label distillation.,"Existing studies in Human-Object Interaction (HOI) recognition rely heavily on costly human-annotated labels, limiting the application of HOI in real-world scenarios like retail and surveillance. To address this issue, this paper investigates a new zero-shot setup where no HOI labels are available for any image. We propose a novel heterogeneous teacher-student framework that bridges two types of pre-trained models, namely contrastive (e.g., CLIP) and generative (e.g., GIT) image-language models. To bridge their gap, we introduce pseudo-label distillation to extract HOI probabilities from image captions to train the student classifier. Our method leverages the complementary strengths of both models. As a result, the student model has ""the best of two worlds"", e.g., the compact backbone of a contrastive model and the fine-grained discriminability of a generative (captioning) model. It achieves 49.6 mAP on the HICO dataset without any ground-truth labels, becoming a new state-of-the-art that outperforms previous supervised approaches. Code will be released upon acceptance.","Zero-shot, knowledge distillation, Human-Object Interaction" Explainable Machine Learning Predictions for the Long-term Performance of Brain-Computer Interfaces,https://openreview.net/forum?id=YaPPldR6te,https://openreview.net/pdf?id=YaPPldR6te,Development of an explainable AI pipeline that can predict with high accuracy and elucidate the most important factors involved in the long-term stability of intracortical brain computer interfaces,"Brain computer interfaces (BCIs) can decode neural signals to control assistive technologies such as robotic limbs for people with paralysis. Neural recordings from intracortical microelectrodes offer the spatiotemporal resolution (e.g., sortable units) necessary for complex tasks, such as controlling a robotic arm with multiple degrees of freedom. However, the quality of these signals decays over time despite many attempts to prolong their longevity. This decrease in long-term performance limits the implementation of this potentially beneficial technology. Predicting whether a channel will have sortable units across time would mitigate this issue and increase the utility of these devices by reducing uncertainty, yet to date, no such methods exist. Similarly, it would be useful to understand how variables like time post-implantation, electrochemical characteristics, and electrode design impact the long-term quality of these signals. Here, we obtained longitudinal neural recordings and electrochemical data from freely behaving rats implanted with a custom designed microelectrode array with varying site areas, shank positions, and site depths.
This dataset was used to develop an explainable artificial intelligence pipeline that predicts with high accuracy the presence of sortable units on a given channel and elucidates the most important factors leading to these predictions. Our pipeline was able to predict whether a channel will be active with an AUC of 0.79 (95% C.I. 0.73–0.86) on unseen data. The most important features of the model were experimental subject, time post-implantation, and a channel’s previous spike metrics. Electrode site depth was the most important electrode design variable. Our results demonstrate the feasibility of implementing explainable artificial intelligence pipelines for longitudinal BCI studies and support previous reports on how factors like time, inter-animal variability, and cortical depth impact long-term performance of BCIs. These results are an important step forward in improving decoding performance and guiding device development, which stand to advance the field and benefit the lives of human BCI patients.","SHAP, Explainability, feature importance, BCI, Neural interfaces, longitudinal intracortical recordings, neural engineering" The Minimal Feature Removal Problem in Neural Networks,https://openreview.net/forum?id=rCtDTKgxyMz,https://openreview.net/pdf?id=rCtDTKgxyMz,,"We present the \emph{minimal feature removal problem} for neural networks, a combinatorial problem which has interesting potential applications for improving interpretability and robustness of neural network predictions. For a given input to a trained neural network, our aim is to compute a smallest set of input features so that the model prediction changes when these features are disregarded by setting them to a given uninformative baseline value. While computing such minimal subsets of features is computationally intractable in general for fully-connected neural networks, we show that the problem becomes solvable in polynomial time by a greedy algorithm under mild assumptions on the network's activation functions. We then show that our tractability result extends seamlessly to more advanced neural network architectures such as convolutional and graph neural networks. Our experiments on standard datasets show favourable performance of our greedy algorithm in practice.", Effectively using public data in privacy preserving Machine learning,https://openreview.net/forum?id=5R96mIU85IW,https://openreview.net/pdf?id=5R96mIU85IW,Improving how public data is used in DP-SGD to significantly improve accuracy,"A key challenge towards differentially private machine learning is balancing the trade-off between privacy and utility. A recent line of work has demonstrated that leveraging \emph{public data samples} can enhance the utility of DP-trained models (for the same privacy guarantees). In this work, we show that public data can be used to improve utility in DP models significantly more than shown in recent works. Towards this end, we introduce a modified DP-SGD algorithm that leverages public data during its training process. Our technique uses public data in two complementary ways: (1) it uses generative models trained on public data to produce synthetic data that is effectively embedded in multiple steps of the training pipeline; (2) it uses a new gradient clipping mechanism (required for achieving differential privacy) which changes the \emph{origin} of gradient vectors using information inferred from available public and synthesized data.
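A minimal sketch of the second ingredient follows: clipping per-example gradients around an origin estimated from public data instead of around zero, then adding Gaussian noise as in standard DP-SGD. The function name, the noise scaling, and the origin estimate are illustrative assumptions, not the paper's exact mechanism.

```python
import torch

def dp_update_with_public_origin(private_grads, public_grad,
                                 clip_norm=1.0, noise_mult=1.0):
    """private_grads: (n, d) per-example gradients from sensitive data.
    public_grad: (d,) gradient estimated from public/synthetic data,
    used as the clipping *origin* rather than the usual zero vector."""
    centered = private_grads - public_grad            # shift the origin
    norms = centered.norm(dim=1, keepdim=True)
    factors = (clip_norm / norms).clamp(max=1.0)
    clipped = centered * factors                      # per-example clipping
    noise = noise_mult * clip_norm * torch.randn_like(public_grad)
    # average, add calibrated noise, then shift back by the public origin
    return clipped.mean(0) + noise / len(private_grads) + public_grad

g_priv = torch.randn(32, 10)   # stand-in per-example gradients
g_pub = torch.randn(10)        # stand-in public-data gradient
print(dp_update_with_public_origin(g_priv, g_pub).shape)  # torch.Size([10])
```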
Our experimental results demonstrate the effectiveness of our approach in improving the state-of-the-art in differentially private machine learning across multiple datasets, network architectures, and application domains. Notably, we achieve a $75\%$ accuracy on CIFAR10 when using only $2,000$ public images; this is \emph{significantly higher} than the state of the art, which is $68\%$ for DP-SGD with the privacy budget of $\varepsilon=2,\delta=10^{-5}$ (given the same number of public data points).","Privacy preserving machine learning, dp-sgd, public data in privacy" Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation,https://openreview.net/forum?id=VD-AYtP0dve,https://openreview.net/pdf?id=VD-AYtP0dve,Semantic entropy is a novel uncertainty estimation method for natural language generation that captures uncertainty over meanings rather than sequences.,"We introduce a method to measure uncertainty in large language models. For tasks like question answering, it is essential to know when we can trust the natural language outputs of foundation models. We show that measuring uncertainty in natural language is challenging because of ""semantic equivalence""—different sentences can mean the same thing. To overcome these challenges, we introduce semantic entropy—an entropy which incorporates linguistic invariances created by shared meanings. Our method is unsupervised, uses only a single model, and requires no modifications to off-the-shelf language models. In comprehensive ablation studies we show that the semantic entropy is more predictive of model accuracy on question answering data sets than comparable baselines. ","uncertainty estimation, natural language generation" Illusory Adversarial Attacks on Sequential Decision-Makers and Countermeasures,https://openreview.net/forum?id=HB2HBIQKhp-,https://openreview.net/pdf?id=HB2HBIQKhp-,"We present illusory attacks on sequential decision-makers, which are undetectable.","Autonomous decision-making agents deployed in the real world need to be robust against possible adversarial attacks on sensory inputs. Existing work on adversarial attacks focuses on the notion of perceptual invariance popular in computer vision. We observe that such attacks can often be detected by victim agents, since they result in action-observation sequences that are not consistent with the dynamics of the environment. Furthermore, real-world agents, such as physical robots, commonly operate under human supervisors who are not susceptible to such attacks. We propose to instead focus on attacks that are statistically undetectable. Specifically, we propose illusory attacks, a novel class of adversarial attack that is consistent with the environment dynamics. We introduce a novel algorithm that can learn illusory attacks end-to-end. We empirically verify that our algorithm generates attacks that, in contrast to current methods, are undetectable to both AI agents with an environment dynamics model, as well as to humans. Furthermore, we show that existing robustification approaches are relatively ineffective against illusory attacks.
Our findings highlight the need to ensure that real-world AI, and human-AI, systems are designed to make it difficult to corrupt sensory observations in ways that are consistent with the environment dynamics.","reinforcement learning, adversarial attacks" Prompt Tuning with Prompt-aligned Gradient for Vision-Language Models ,https://openreview.net/forum?id=TSqKS0lQQA6,https://openreview.net/pdf?id=TSqKS0lQQA6,We present Prompt-aligned Gradient to prevent prompt tuning from forgetting the general knowledge learned from CLIP.,"Thanks to the large pre-trained vision-language models (VLMs) like CLIP, we can craft a zero-shot classifier by ``prompt'', e.g., using the model-provided similarity measure between an image and the prompt sentence ``$\texttt{a photo of a [CLASS]}$'', as the confidence score for predicting that the image is ``$\texttt{[CLASS]}$''. Therefore, prompting shows great potential for quickly adapting VLMs to downstream tasks if we fine-tune the prompt-based similarity measure. However, we find a common failure: improper fine-tuning may not only undermine the prompt's inherent prediction for the task-related classes, but also for other classes in the VLM vocabulary. Existing methods still address this problem by using traditional anti-overfitting techniques such as early stopping and data augmentation, which lack a principled solution specific to prompts. We present Prompt-aligned Gradient, dubbed $\texttt{ProGrad}$, to prevent prompt tuning from forgetting the general knowledge learned from VLMs. In particular, $\texttt{ProGrad}$ only updates the prompt whose gradient is aligned (or non-conflicting) to the ``general direction'', which is represented as the gradient of the KL loss of the pre-defined prompt prediction. Extensive experiments demonstrate the stronger few-shot generalization ability of $\texttt{ProGrad}$ over state-of-the-art prompt tuning methods. Codes are in Appendix.","prompt tuning, vision-language models, CLIP" Relative Behavioral Attributes: Filling the Gap between Symbolic Goal Specification and Reward Learning from Human Preferences,https://openreview.net/forum?id=lGz9u1ubUXE,https://openreview.net/pdf?id=lGz9u1ubUXE,,"Generating complex behaviors from goals specified by non-expert users is a crucial aspect of intelligent agents. Interactive reward learning from trajectory comparisons is one way to allow non-expert users to convey complex objectives by expressing preferences over short clips of agent behaviors. Even though this method can encode complex tacit knowledge present in the underlying tasks, it implicitly assumes that the human is unable to provide rich-form feedback other than binary preference labels, leading to extremely high feedback complexity and poor user experience. While providing a detailed symbolic specification of the objectives might be tempting, it is not always feasible even for an expert user. However, in most cases, humans are aware of how the agent should change its behavior along meaningful axes to fulfill the underlying purpose, even if they are not able to fully specify task objectives symbolically. Using this as motivation, we introduce the notion of Relative Behavioral Attributes, which acts as a middle ground, between exact goal specification and reward learning purely from preference labels, by enabling the users to tweak the agent's behavior through nameable concepts (e.g., decreasing the steering sharpness of an autonomous driving agent, or increasing the softness of the movement of a two-legged ""sneaky"" agent).
We propose two different parametric methods that can potentially encode any kind of behavioral attributes from ordered behavior clips. We demonstrate the effectiveness of our methods on 4 tasks with 9 different behavioral attributes and show that once the attributes are learned, end users can effortlessly produce desirable agent behaviors by providing feedback only around 10 times. The feedback complexity of our approach is over 10 times lower than that of the learning-from-human-preferences baseline, demonstrating that our approach is readily applicable in real-world applications.","Neuro-Symbolic, Human-AI Interaction" DINO as a von Mises-Fisher mixture model,https://openreview.net/forum?id=cMJo1FTwBTQ,https://openreview.net/pdf?id=cMJo1FTwBTQ,Improving DINO with unnormalized prototypes based on a flexible von Mises-Fisher mixture model interpretation.,"Self-distillation methods using Siamese networks are popular for self-supervised pre-training. DINO is one such method based on a cross-entropy loss between $K$-dimensional probability vectors, obtained by applying a softmax function to the dot product between representations and learnt prototypes. Given that the learned representations are $L^2$-normalized, we show that DINO can be interpreted as a mixture model of von Mises-Fisher components. With this interpretation, DINO assumes equal precision for all components when the prototypes are also $L^2$-normalized. Using this insight, we propose DINO-vMF, which adds appropriate normalization constants when computing the cluster assignment probabilities. Unlike DINO, DINO-vMF is stable also for the larger ViT-Base model with unnormalized prototypes. We show that the added flexibility of the mixture model is beneficial in terms of better image representations. The DINO-vMF pre-trained model consistently performs better than DINO on a range of downstream tasks. ","self-supervised learning, vision transformers, mixture models" Continuous Depth Recurrent Neural Differential Equations,https://openreview.net/forum?id=-p5ZEVGtojQ,https://openreview.net/pdf?id=-p5ZEVGtojQ,Proposing novel RNN models based on differential equations that continuously transform hidden states in both temporal and depth dimensions.,"Recurrent neural networks (RNNs) have brought many advances to sequence labeling tasks and sequence data. However, their effectiveness is limited when the observations in the sequence are irregularly sampled, i.e., arrive at irregular time intervals. To address this, continuous-time variants of RNNs were introduced based on neural ordinary differential equations (NODE). They learn a better representation of the data using the continuous transformation of hidden states over time, taking into account the time interval between the observations. However, they are still limited in their capability as they use discrete transformations and a discrete number of layers (depth) over an input in the sequence to produce the output observation. We intend to address this limitation by proposing RNNs based on differential equations which model continuous transformations over depth and time to predict an output for a given input in the sequence. Specifically, we propose continuous depth recurrent neural differential equations (CDR-NDE) which generalize RNN models by continuously evolving the hidden states in both the temporal and depth dimensions.
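To ground the idea of continuously evolving a hidden state between irregularly spaced observations, here is a toy fixed-step Euler discretization of an ODE on the hidden state along the time dimension only. The dynamics function is a stand-in of our own; CDR-NDE's actual formulation additionally evolves states along the depth dimension.

```python
import numpy as np

def hidden_dynamics(h, x, W_h, W_x):
    """Stand-in ODE right-hand side dh/dt = f(h, x)."""
    return np.tanh(W_h @ h + W_x @ x) - h

def ode_rnn_step(h, x, dt_total, W_h, W_x, n_euler=10):
    """Evolve the hidden state across an irregular gap of length dt_total
    using fixed-step Euler integration."""
    dt = dt_total / n_euler
    for _ in range(n_euler):
        h = h + dt * hidden_dynamics(h, x, W_h, W_x)
    return h

rng = np.random.default_rng(0)
d_h, d_x = 8, 3
W_h = 0.1 * rng.normal(size=(d_h, d_h))
W_x = 0.1 * rng.normal(size=(d_h, d_x))
h = np.zeros(d_h)
# irregularly sampled observations: (value, time since previous observation)
for x, gap in [(rng.normal(size=d_x), g) for g in (0.4, 1.7, 0.2)]:
    h = ode_rnn_step(h, x, gap, W_h, W_x)
print(h.round(3))
```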
CDR-NDE considers two separate differential equations over each of these dimensions and models the evolution in the temporal and depth directions alternately. We also propose the CDR-NDE-heat model based on partial differential equations, which treats the computation of hidden states as solving a heat equation over time. We demonstrate the effectiveness of the proposed models by comparing them against state-of-the-art RNN models on real-world sequence modeling problems and data sets.","neural ordinary differential equations, recurrent neural networks, sequence data" Optimal Membership Inference Bounds for Adaptive Composition of Sampled Gaussian Mechanisms,https://openreview.net/forum?id=6Lh_wgIaT9l,https://openreview.net/pdf?id=6Lh_wgIaT9l,"We prove optimal membership inference bounds for DP-SGD, beating previously known upper bounds on membership inference.","Given a trained model and a data sample, membership-inference (MI) attacks predict whether the sample was in the model’s training set. A common countermeasure against MI attacks is to utilize differential privacy (DP) during model training to mask the presence of individual examples. While this use of DP is a principled approach to limit the efficacy of MI attacks, there is a gap between the bounds provided by DP and the empirical performance of MI attacks. In this paper, we derive bounds for the advantage of an adversary mounting a MI attack, and demonstrate tightness for the widely-used Gaussian mechanism. Our analysis answers an open problem in the field of differential privacy, namely explaining why membership inference is not 100% successful even for relatively high budgets ($\epsilon> 10$). Finally, using our analysis, we provide MI metrics for models trained on the CIFAR10 dataset. To the best of our knowledge, our analysis provides the state-of-the-art membership inference bounds.","Membership inference, DP-SGD, Gaussian Mechanism" Advantage Constrained Proximal Policy Optimization in Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=0LJRS7B3r4_,https://openreview.net/pdf?id=0LJRS7B3r4_,A multi-agent policy gradient reinforcement learning algorithm based on local advantage constraints.,"We explore the combination of value-based and policy-gradient methods in multi-agent reinforcement learning (MARL). In value-based MARL, the {\itshape{Individual-Global-Max}} (IGM) principle plays an important role, maintaining the consistency between joint and local action values. At the same time, IGM is difficult to guarantee in multi-agent policy gradient methods due to stochastic exploration and conflicting gradient directions. In this paper, we propose a novel multi-agent policy gradient algorithm called {\itshape{Advantage Constrained Proximal Policy Optimization}} (ACPPO). Based on the {\itshape{multi-agent advantage decomposition lemma}}, ACPPO introduces an advantage network for each agent to estimate the current local state-action advantage. The coefficient of each agent constrains the joint-action advantage according to the consistency of the estimated joint-action advantage and the local advantage. Unlike previous policy gradient-based MARL algorithms, ACPPO does not need an extra sampled baseline to reduce variance. We evaluate the proposed method on a continuous matrix game and Multi-Agent MuJoCo tasks. Results show that ACPPO outperforms baselines such as MAPPO, MADDPG, and HAPPO.","Multi agent, reinforcement learning, neural network, deep learning, trust region."
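For context on the family of methods ACPPO belongs to, here is a sketch of the standard PPO clipped surrogate loss. The advantage-constraint coefficient the abstract describes would additionally reweight or constrain the per-agent advantages fed into this loss; that part is not reproduced here.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate (to be *minimized*). ACPPO-style
    methods additionally constrain each agent's local advantage estimate
    so that joint and local advantages stay consistent."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# toy usage with random action log-probabilities and advantages
logp_old = torch.randn(64)
logp_new = logp_old + 0.05 * torch.randn(64)
adv = torch.randn(64)
print(ppo_clip_loss(logp_new, logp_old, adv))
```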
Scalable Batch-Mode Deep Bayesian Active Learning via Equivalence Class Annealing,https://openreview.net/forum?id=GRZtigJljLY,https://openreview.net/pdf?id=GRZtigJljLY,We propose a new scalable batch-mode active learning algorithm,"Active learning has demonstrated data efficiency in many fields. Existing active learning algorithms, especially in the context of batch-mode deep Bayesian active models, rely heavily on the quality of uncertainty estimations of the model, and are often challenging to scale to large batches. In this paper, we propose Batch-BALanCe, a scalable batch-mode active learning algorithm, which combines insights from decision-theoretic active learning, combinatorial information measure, and diversity sampling. At its core, Batch-BALanCe relies on a novel decision-theoretic acquisition function that facilitates differentiation among different equivalence classes. Intuitively, each equivalence class consists of hypotheses (e.g., posterior samples of deep neural networks) with similar predictions, and Batch-BALanCe adaptively adjusts the size of the equivalence classes as learning progresses. To scale up the computation of queries to large batches, we further propose an efficient batch-mode acquisition procedure, which aims to maximize a novel combinatorial information measure defined through the acquisition function. We show that our algorithm can effectively handle realistic multi-class classification tasks, and achieves compelling performance on several benchmark datasets for active learning under both low- and large-batch regimes.","Bayesian Neural Network, Batch-Mode Active Learning, Decision-Centric Data Acquisition, Scalability" Neural multi-event forecasting on spatio-temporal point processes using probabilistically enriched transformers,https://openreview.net/forum?id=JUNKYmGGuEw,https://openreview.net/pdf?id=JUNKYmGGuEw," In this work, we introduce a novel neural network that is capable of simultaneous multi-event forecasting of spatio-temporal distributions associated with stochastic discrete events.","Predicting discrete events in time and space has many scientific applications, such as predicting hazardous earthquakes and outbreaks of infectious diseases. History-dependent spatio-temporal Hawkes processes are often used to mathematically model these point events. However, previous approaches have faced numerous challenges, particularly when attempting to forecast multiple future events. In this work, we propose a new neural architecture for multi-event forecasting of spatio-temporal point processes, utilizing transformers, augmented with normalizing flows and probabilistic layers. Our network makes batched predictions of complex history-dependent spatio-temporal distributions of future discrete events, achieving state-of-the-art performance on a variety of benchmark datasets including the South California Earthquakes, Citibike, Covid19, and Hawkes synthetic Pinwheel datasets. 
More generally, we illustrate how our network can be applied to any dataset of discrete events with associated markers, even when no underlying physics is known.","Stochastic Point Processes, Multi-event Prediction, Transformers, Normalizing Flows, Hawkes Process, Deep Learning, Generative Models" Associative Memory Augmented Asynchronous Spatiotemporal Representation Learning for Event-based Perception,https://openreview.net/forum?id=ZCStthyW-TD,https://openreview.net/pdf?id=ZCStthyW-TD,"We propose EventFormer, an asynchronous spatiotemporal representation learning framework augmented by an associative memory to efficiently perform event-based perception.","We propose $\textit{EventFormer}$, a computationally efficient event-based representation learning framework for asynchronously processing event camera data. EventFormer treats sparse input events as a spatially unordered set and models their spatial interactions using a self-attention mechanism. An associative memory-augmented recurrent module is used to correlate with the stored representation computed from past events. A memory addressing mechanism is proposed to store and retrieve the latent states only $\textit{where}$ these events occur and update them only $\textit{when}$ they occur. This shift of representation learning from the input space to the latent memory space results in reduced computation cost for processing each event. We show that EventFormer achieves 0.5$\%$ and 9$\%$ better accuracy with 30000$\times$ and 200$\times$ less computation compared to the state-of-the-art dense and event-based methods, respectively, on event-based object recognition datasets.","associative memory, memory augmented neural network, spatiotemporal representation, event-based camera, event-based perception, object recognition, attention, set processing" Detecting Small Query Graphs in A Large Graph via Neural Subgraph Search,https://openreview.net/forum?id=7XHiDnUb_hj,https://openreview.net/pdf?id=7XHiDnUb_hj,,"Recent advances have shown the success of using reinforcement learning and search to solve NP-hard graph-related tasks, such as Traveling Salesman Optimization, Graph Edit Distance computation, etc. However, it remains unclear how one can efficiently and accurately detect the occurrences of a small query graph in a large target graph, which is a core operation in graph database search, biomedical analysis, social group finding, etc. This task is called Subgraph Matching, which essentially performs a subgraph isomorphism check between a query graph and a large target graph. One promising approach to this classical problem is the “learning-to-search” paradigm, where a reinforcement learning (RL) agent is designed with a learned policy to guide a search algorithm to quickly find the solution without any solved instances for supervision. However, for the specific task of Subgraph Matching, though the query graph, given by the user as input, is usually small, the target graph is often orders of magnitude larger. This poses challenges to the neural network design and can lead to solution and reward sparsity. In this paper, we propose NSUBS with two innovations to tackle the challenges: (1) A novel encoder-decoder neural network architecture to dynamically compute the matching information between the query and the target graphs at each search state; (2) A novel look-ahead loss function for training the policy network.
Experiments on six large real-world target graphs show that NSUBS can significantly improve the subgraph matching performance.","subgraph matching, subgraph isomorphism search" Can we achieve robustness from data alone?,https://openreview.net/forum?id=HmdOxc8zIWx,https://openreview.net/pdf?id=HmdOxc8zIWx,,"In robust machine learning, there is a widespread belief that samples can be decomposed into robust features (parts of the data that withstand small perturbations) and non-robust ones, and it is the role of the robust algorithm (i.e. adversarial training) to amplify the former and erase the latter. In this work, we challenge this view and try to position adversarial robustness as a more model-dependent property: many approaches that assume this simplistic distinction in the features, optimizing the data directly, only give rise to superficial adversarial robustness. We revisit prior approaches in the literature that were believed to be robust, and proceed to devise a principled meta-learning algorithm that optimizes the dataset for robustness. Our method can be thought of as a non-parametric version of adversarial training, and it is of independent interest and potentially wider applicability. Specifically, we cast the bi-level optimization as a min-max procedure on kernel regression, with a class of kernels that describe infinitely wide neural nets (Neural Tangent Kernels). Through extensive experiments we analyse the properties of the models trained on the optimized datasets and identify their shortcomings: all of them come in a similar flavor.", Catastrophic overfitting is a bug but it is caused by features,https://openreview.net/forum?id=n0okuXMlI7V,https://openreview.net/pdf?id=n0okuXMlI7V,We analyse the phenomenon of catastrophic overfitting through active interventions and show it is a shortcut the network takes to avoid learning complex robust solutions.,"Adversarial training (AT) is the de facto method to build robust neural networks, but it is computationally expensive. To overcome this, fast single-step attacks can be used, but doing so is prone to catastrophic overfitting (CO). This is when networks gain non-trivial robustness during the first stages of AT, but then reach a breaking point where they become vulnerable in just a few iterations. Although some works have succeeded at preventing CO, the different mechanisms that lead to this failure mode are still poorly understood. In this work, we study the onset of CO in single-step AT methods through controlled modifications of typical datasets of natural images. In particular, we show that CO can be induced when injecting the images with seemingly innocuous features that are very useful for non-robust classification but need to be combined with other features to obtain a robust classifier. This new perspective provides important insights into the mechanisms that lead to CO and improves our understanding of the general dynamics of adversarial training.","Adversarial robustness, catastrophic overfitting, understanding deep learning, single-step adversarial training, FGSM, fast adversarial training" Semi Parametric Inducing Point Networks,https://openreview.net/forum?id=FE99-fDrWd5,https://openreview.net/pdf?id=FE99-fDrWd5,,"We introduce semi-parametric inducing point networks (SPIN), a general-purpose architecture that can query the training set at inference time in a compute-efficient manner. Semi-parametric architectures are typically more compact than parametric models, but their computational complexity is often quadratic.
In contrast, SPIN attains linear complexity via a cross-attention mechanism between datapoints inspired by inducing point methods. Querying large training sets can be particularly useful in meta-learning as it unlocks additional training signal, but often exceeds the scaling limits of existing models. We use SPIN as the basis of the Inducing Point Neural Process, a probabilistic model which supports large contexts in meta-learning and achieves high accuracy where existing models fail. In our experiments, SPIN reduces memory requirements, improves accuracy across a range of meta-learning tasks, and advances state-of-the-art performance on an important practical problem, genotype imputation.", Perceptual Grouping in Vision-Language Models,https://openreview.net/forum?id=3UDAU2unja,https://openreview.net/pdf?id=3UDAU2unja,We describe a minimal set of changes to vision-language models to endow these models with perceptual grouping and localization information.,"Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding {\it what} content resides within an image, but importantly, {\it where} that content resides. In this work we examine how well vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery. We demonstrate how contemporary vision and language representation learning models based on contrastive losses and large web-based data capture limited object localization information. We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information. We measure this performance in terms of zero-shot image recognition, unsupervised bottom-up and top-down semantic segmentations, as well as robustness analyses. We find that the resulting model achieves state-of-the-art results in terms of unsupervised segmentation, and demonstrate that the learned representations are uniquely robust to spurious correlations in datasets designed to probe the causal behavior of vision models.","vision-language models, multimodal learning, perceptual grouping, image segmentation" CADet: Fully Self-Supervised Anomaly Detection With Contrastive Learning,https://openreview.net/forum?id=Q9yT-pxvWn8,https://openreview.net/pdf?id=Q9yT-pxvWn8,"We leverage self-supervised contrastive learning to simultaneously perform adversarial and unseen label detection using a statistic inspired by MMD, and without seeing out-of-distribution examples.","Handling out-of-distribution (OOD) samples has become a major stake in the real-world deployment of machine learning systems. This work explores the application of self-supervised contrastive learning to the simultaneous detection of two types of OOD samples: unseen classes and adversarial perturbations. Since in practice the distribution of such samples is not known in advance, we do not assume access to OOD examples. We show that similarity functions trained with contrastive learning can be leveraged with the maximum mean discrepancy (MMD) two-sample test to verify whether two independent sets of samples are drawn from the same distribution. Inspired by this approach, we introduce CADet (Contrastive Anomaly Detection), a method based on image augmentations to perform anomaly detection on single samples.
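To make the MMD two-sample statistic referenced in the CADet abstract above concrete, here is a minimal numpy sketch of the standard unbiased squared-MMD estimate under an RBF kernel; the embedding matrices are synthetic stand-ins (in CADet's setting they would come from a contrastively trained encoder):

```python
import numpy as np

def mmd2_unbiased(X, Y, gamma=1.0):
    """Unbiased estimate of squared MMD between samples X (m,d) and Y (n,d)
    under the RBF kernel k(a,b) = exp(-gamma * ||a-b||^2)."""
    def rbf(A, B):
        sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * sq)

    Kxx, Kyy, Kxy = rbf(X, X), rbf(Y, Y), rbf(X, Y)
    m, n = len(X), len(Y)
    # Drop diagonal terms for the unbiased within-sample averages.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * Kxy.mean()

# Two synthetic "embedding" sets drawn from slightly different distributions.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(128, 16))
Y = rng.normal(0.3, 1.0, size=(128, 16))
print(mmd2_unbiased(X, Y))  # larger values suggest distinct distributions
```

Larger statistics (relative to a permutation-test null) indicate the two sets are unlikely to come from the same distribution, which is the building block CADet adapts to single samples via augmentations.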
CADet compares favorably to adversarial detection methods at detecting adversarially perturbed samples on ImageNet. Simultaneously, it achieves comparable performance to unseen label detection methods on two challenging benchmarks: ImageNet-O and iNaturalist. Additionally, CADet is fully self-supervised and requires neither labels for in-distribution samples nor access to out-of-distribution examples.","Contrastive learning, OOD detection, adversarial detection, MMD, ImageNet-O, Anomaly detection" SPRINT: Scalable Semantic Policy Pre-training via Language Instruction Relabeling,https://openreview.net/forum?id=tDG-zrQ8S1Q,https://openreview.net/pdf?id=tDG-zrQ8S1Q,We propose a scalable offline policy pre-training approach based on natural language instructions that automatically generates new pre-training tasks with language-model relabeling.,"We propose SPRINT, an approach for scalable offline policy pre-training based on natural language instructions. SPRINT pre-trains an agent’s policy to execute a diverse set of semantically meaningful skills that it can leverage to learn new tasks faster. Prior work on offline pre-training required tedious manual definition of pre-training tasks or learned semantically meaningless skills via random goal-reaching. Instead, our approach SPRINT (Scalable Pre-training via Relabeling Language INsTructions) leverages natural language instruction labels on offline agent experience, collected at scale (e.g., via crowd-sourcing), to define a rich set of tasks with minimal human effort. Furthermore, by using natural language to define tasks, SPRINT can use pre-trained large language models to automatically expand the initial task set. By relabeling and aggregating task instructions, even across multiple training trajectories, we can learn a large set of new skills during pre-training. In experiments using a realistic household simulator, we show that agents pre-trained with SPRINT learn new long-horizon household tasks substantially faster than with previous pre-training approaches.","reinforcement learning, language-guided RL, offline RL, policy pre-training" SMART: Self-supervised Multi-task pretrAining with contRol Transformers,https://openreview.net/forum?id=9piH3Hg8QEf,https://openreview.net/pdf?id=9piH3Hg8QEf,"We propose a pretraining framework for sequential decision making based on self-supervised objectives and a control transformer architecture, leading to significantly higher learning efficiency in various downstream control tasks.","Self-supervised pretraining has been extensively studied in language and vision domains, where a unified model can be easily adapted to various downstream tasks by pretraining representations without explicit labels. When it comes to sequential decision-making tasks, however, it is difficult to properly design such a pretraining approach that can cope with both high-dimensional perceptual information and the complexity of sequential control over long interaction horizons. The challenge becomes combinatorially more complex if we want to pretrain representations amenable to a large variety of tasks. To tackle this problem, in this work, we formulate a general pretraining-finetuning pipeline for sequential decision making, under which we propose a generic pretraining framework \textit{Self-supervised Multi-task pretrAining with contRol Transformer (SMART)}.
By systematically investigating pretraining regimes, we carefully design a Control Transformer (CT) coupled with a novel control-centric pretraining objective in a self-supervised manner. SMART encourages the representation to capture the common essential information relevant to short-term control and long-term control, which is transferable across tasks. We show by extensive experiments in DeepMind Control Suite that SMART significantly improves learning efficiency on seen and unseen downstream tasks and domains under different learning scenarios including Imitation Learning (IL) and Reinforcement Learning (RL). Benefiting from the proposed control-centric objective, SMART is resilient to distribution shift between pretraining and finetuning, and even works well with low-quality pretraining datasets that are randomly collected.","pretrain, transformer, multi-task reinforcement learning, sequential decision making, self-supervised" Evaluation of Active Feature Acquisition Methods under Missing Data,https://openreview.net/forum?id=pPUoahHadAX,https://openreview.net/pdf?id=pPUoahHadAX,Evaluation of active feature acquisition methods under missing data corresponds to a distribution shift which requires adjustment,"Machine learning (ML) methods generally assume the full set of features is available at no cost. If the acquisition of a certain feature is costly at run-time, one might want to balance the acquisition cost and the predictive value of the feature for the ML task. The task of training an AI agent to decide which features need to be acquired is called active feature acquisition (AFA). Current AFA methods, however, are challenged when the AFA agent has to be trained/tested with datasets that contain missing data. We formulate, for the first time, the problem of active feature acquisition performance evaluation (AFAPE) under missing data, i.e. the problem of adjusting for the inevitable missingness distribution shift between train/test time and run-time. We first propose a new causal graph, the AFA graph, that characterizes the AFAPE problem as an intervention on the reinforcement learning environment used to train AFA agents. Here, we argue that, for handling missing data in AFAPE, the conventional approaches (off-policy reinforcement learning, blocked feature acquisitions, imputation and inverse probability weighting (IPW)) often lead to biased results or are data inefficient. We then propose active feature acquisition importance sampling (AFAIS), a novel estimator that is more data efficient than IPW. We demonstrate the detrimental conclusions to which biased estimators can lead as well as the high data efficiency of AFAIS in multiple experiments using simulated and real-world data under induced MCAR, MAR and MNAR missingness.","Active feature acquisition, active sensing, missing data, reinforcement learning, distribution-shift, off-environment policy evaluation, AFA graph, AFAPE, AFAIS, MCAR, MAR, MNAR, causal inference" DAG Learning via Sparse Relaxations,https://openreview.net/forum?id=m9LCdYgN8-6,https://openreview.net/pdf?id=m9LCdYgN8-6,"We propose a continuous optimization framework over the Permutahedron for discovering DAGs from observations that does not relax acyclicity and accommodates any edge-optimization procedure, edge structural parameterization, and optimization loss","We propose a continuous optimization framework for discovering a latent directed acyclic graph (DAG) from observational data.
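As a reference point for the IPW baseline named in the AFAPE abstract above, here is a minimal sketch of inverse probability weighting for estimating a population mean under missingness; the observation propensities are assumed known here for clarity (in practice they would be estimated):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)                   # covariate, always observed
y = 2.0 * x + rng.normal(size=n)         # quantity of interest

# Missing-at-random: the chance of observing y depends on x only.
p_obs = 1.0 / (1.0 + np.exp(-x))         # known propensity in this sketch
observed = rng.random(n) < p_obs

naive = y[observed].mean()               # biased: over-represents large x
ipw = np.sum(y[observed] / p_obs[observed]) / n  # Horvitz-Thompson estimate

print(f"true mean ~ {y.mean():.3f}, naive {naive:.3f}, IPW {ipw:.3f}")
```

Reweighting each observed sample by the inverse of its observation probability corrects the selection bias, at the cost of high variance when propensities are small; this variance is the data-inefficiency that estimators like AFAIS aim to improve on.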
Our approach optimizes over the polytope of permutation vectors, the so-called Permutahedron, to learn a topological ordering. Edges can be optimized jointly, or learned conditional on the ordering via a non-differentiable subroutine. Compared to existing continuous optimization approaches, our formulation has a number of advantages including: 1. validity: optimizes over exact DAGs as opposed to other relaxations optimizing approximate DAGs; 2. modularity: accommodates any edge-optimization procedure, edge structural parameterization, and optimization loss; 3. end-to-end: either alternately iterates between node-ordering and edge-optimization, or optimizes them jointly. We demonstrate, on real-world data problems in protein-signaling and transcriptional network discovery, that our approach lies on the Pareto frontier of two key metrics, the SID and SHD.","directed acyclic graph, causal discovery, differentiable sorting" Explicitly Minimizing the Blur Error of Variational Autoencoders,https://openreview.net/forum?id=9krnQ-ue9M,https://openreview.net/pdf?id=9krnQ-ue9M,We propose a new reconstruction term for VAEs that explicitly focuses on minimizing the blur of generated/reconstructed images while still optimizing the ELBO.,"Variational autoencoders (VAEs) are powerful generative modelling methods; however, they suffer from blurry generated samples and reconstructions compared to the images they have been trained on. Significant research effort has been spent to increase the generative capabilities by creating more flexible models, but flexibility often comes at the cost of higher complexity and computational cost. Several works have focused on altering the reconstruction term of the evidence lower bound (ELBO), though often at the expense of losing the mathematical link to maximizing the likelihood of the samples under the modeled distribution. Here we propose a new formulation of the reconstruction term for the VAE that specifically penalizes the generation of blurry images while at the same time still maximizing the ELBO under the modeled distribution. We show the potential of the proposed loss on three different data sets, where it outperforms several recently proposed reconstruction losses for VAEs.","Variational Autoencoders, Generative Modelling, Blur" GraphEditor: An Efficient Graph Representation Learning and Unlearning Approach,https://openreview.net/forum?id=tyvshLxFUtP,https://openreview.net/pdf?id=tyvshLxFUtP,This paper proposes an efficient graph representation learning and unlearning method for linear-GNNs. The method could also be extended to non-linear GNNs under some assumptions on input data.," As graph representation learning has received much attention due to its widespread applications, removing the effect of a specific node from a pre-trained graph representation learning model due to privacy concerns has become equally important. However, due to the dependency between nodes in the graph, graph representation unlearning is notoriously challenging and remains less well explored. To fill in this gap, we propose \textsc{GraphEditor}, an efficient graph representation \textit{learning} and \textit{unlearning} approach that supports node/edge deletion, node/edge addition, and node feature update. Compared to existing unlearning approaches, \textsc{GraphEditor} requires neither retraining from scratch nor access to all of the data presented during unlearning, which is beneficial in settings where not all of the training data are available for retraining.
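Returning to the ordering-based DAG formulation above: the reason optimizing over orderings yields exact DAGs rather than approximate relaxations is that any weighted adjacency masked to respect a node ordering is acyclic by construction. A minimal numpy illustration of that masking step (function names are hypothetical):

```python
import numpy as np

def mask_to_dag(W, order):
    """Zero out every edge that goes against `order`, so edges only run
    from earlier to later nodes; the result is acyclic by construction."""
    d = len(order)
    pos = np.empty(d, dtype=int)
    pos[order] = np.arange(d)              # pos[i] = rank of node i
    allowed = pos[:, None] < pos[None, :]  # edge i -> j allowed iff i before j
    return W * allowed

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 5))                # dense candidate edge weights
order = np.array([2, 0, 4, 1, 3])          # a topological ordering
A = mask_to_dag(W, order)

# Sanity check: reordering rows/cols by `order` gives a strictly
# upper-triangular matrix, i.e. an exact DAG.
A_re = A[np.ix_(order, order)]
print(np.allclose(A_re, np.triu(A_re, k=1)))  # True
```

Searching over the ordering (a point on the Permutahedron) plus any edge-fitting routine under this mask is what gives the modularity advantage listed above.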
Moreover, since \textsc{GraphEditor} performs exact unlearning, the removal of all information associated with the deleted nodes/edges is guaranteed. Empirical results on real-world datasets illustrate the effectiveness of \textsc{GraphEditor} for both node and edge unlearning tasks.","graph representation learning, graph unlearning, machine unlearning, linear-GNNs" 3D Equivariant Diffusion for Target-Aware Molecule Generation and Affinity Prediction,https://openreview.net/forum?id=kJqXEPXMsE0,https://openreview.net/pdf?id=kJqXEPXMsE0,,"Rich data and powerful machine learning models allow us to design drugs for a specific protein target in silico. Recently, the inclusion of 3D structures during targeted drug design has shown superior performance compared to target-free models, as the atomic interactions in 3D space are explicitly modeled. However, current 3D target-aware models either rely on voxelized atom densities or on an autoregressive sampling process, which are not equivariant to rotation or easily violate geometric constraints, resulting in unrealistic structures. In this work, we develop a 3D equivariant diffusion model to solve the above challenges. To achieve target-aware molecule design, our method learns a joint generative process of both continuous atom coordinates and categorical atom types with an SE(3)-equivariant network. Moreover, we show that our model can serve as an unsupervised feature extractor to estimate the binding affinity under proper parameterization, which provides an effective way for drug screening. To evaluate our model, we propose a comprehensive framework to evaluate the quality of sampled molecules from different dimensions. Empirical studies show our model could generate molecules with more realistic 3D structures and better affinities towards the protein targets, and improve binding affinity ranking and prediction without retraining.", PGASL: Predictive and Generative Adversarial Semi-supervised Learning for imbalanced data,https://openreview.net/forum?id=_VTkMy81R3x,https://openreview.net/pdf?id=_VTkMy81R3x,,"Modern machine learning techniques often suffer from class imbalance, where only a small amount of data is available for minority classes. Classifiers trained on an imbalanced dataset, although they achieve high accuracy on majority classes, can perform poorly on minority classes. This is problematic when minority classes are also important. Generative Adversarial Networks (GANs) have been proposed for generating artificial minority examples to balance the training. We propose a class-imbalanced semi-supervised learning algorithm, PGASL, which can be efficiently trained on unlabeled and class-imbalanced data. In this work, we use a predictive network, trained adversarially with the discriminator, to correct predictions on the unlabeled dataset. Experiments on text datasets show that PGASL outperforms state-of-the-art class-imbalanced learning algorithms by including both the predictive network and the generator. ","semi-supervised learning, imbalanced learning, GAN model, adversarial learning" Towards a More Rigorous Science of Blindspot Discovery in Image Models,https://openreview.net/forum?id=UrzBg1Zz7ob,https://openreview.net/pdf?id=UrzBg1Zz7ob,,"A growing body of work studies Blindspot Discovery Methods (BDMs): methods for finding semantically meaningful subsets of the data where an image classifier performs significantly worse, without making strong assumptions.
Motivated by observed gaps in prior work, we introduce a new framework for evaluating BDMs, SpotCheck, that uses synthetic image datasets to train models with known blindspots, and a new BDM, PlaneSpot, that uses a 2D image representation. We use SpotCheck to run controlled experiments that identify factors that influence BDM performance (e.g., the number of blindspots in a model) and show that PlaneSpot outperforms existing BDMs. Importantly, we validate these findings using real data. Overall, we hope that the methodology and analyses presented in this work will serve as a guide for future work on blindspot discovery.", How gradient estimator variance and bias impact learning in neural networks,https://openreview.net/forum?id=EBC60mxBwyw,https://openreview.net/pdf?id=EBC60mxBwyw,We characterize the impact of variance and bias in gradient estimates on learning and generalization and study how network architecture properties modulate these effects.,"There is growing interest in understanding how real brains may approximate gradients and how gradients can be used to train neuromorphic chips. However, neither real brains nor neuromorphic chips can perfectly follow the loss gradient, so parameter updates would necessarily use gradient estimators that have some variance and/or bias. Therefore, there is a need to understand better how variance and bias in gradient estimators impact learning, depending on network and task properties. Here, we show that variance and bias can impair learning on the training data, but some degree of variance and bias in a gradient estimator can be beneficial for generalization. We find that the ideal amount of variance and bias in a gradient estimator is dependent on several properties of the network and task: the size and sparsity of the network, the norm of the gradient, and the curvature of the loss landscape. As such, whether considering biologically-plausible learning algorithms or algorithms for training neuromorphic chips, researchers can analyze these properties to determine whether their approximation to gradient descent will be effective for learning given their network and task properties.","Computational Neuroscience, learning and plasticity, Credit assignment, Imperfect gradient descent, Gradient approximation, Biologically-plausible learning, Neuromorphic computing, Neural networks" Automatically Auditing Large Language Models via Discrete Optimization,https://openreview.net/forum?id=Pkb5FA5AjqP,https://openreview.net/pdf?id=Pkb5FA5AjqP,"We cast auditing as a discrete optimization problem, and demonstrate how this reduction uncovers large language model failure modes. ","Auditing large language models for unexpected behaviors is critical to preempt catastrophic deployments, yet remains challenging. In this work, we cast auditing as a discrete optimization problem, where we automatically search for input-output pairs that match a desired target behavior. For example, we might aim to find non-toxic input that starts with ``Barack Obama'' and maps to a toxic output. Our optimization problem is difficult to solve as the set of feasible points is sparse, the space is discrete, and the language models we audit are non-linear and high-dimensional. To combat these challenges, we introduce a discrete optimization algorithm, ARCA, that is tailored to autoregressive language models. We demonstrate how our approach can: uncover derogatory completions about celebrities (e.g.
``Barack Obama is a legalized unborn'' $\rightarrow$ ``child murderer''), produce French inputs that complete to English outputs, and find inputs that generate a specific name. Our work offers a promising new tool to uncover models' failure modes before deployment. $\textbf{Trigger Warning: This paper contains model behavior that can be offensive in nature.}$","large language models, safety, auditing, robustness" Do We Really Need Complicated Model Architectures For Temporal Networks?,https://openreview.net/forum?id=ayPPc0SyLv1,https://openreview.net/pdf?id=ayPPc0SyLv1,This paper proposes a conceptually and technically simple method for temporal graph link prediction,"Recurrent neural network (RNN) and self-attention mechanism (SAM) are the de facto methods to extract spatial-temporal information for temporal graph learning. Interestingly, we found that although both RNN and SAM could lead to good performance, in practice neither of them is always necessary. In this paper, we propose \oure, a conceptually and technically simple architecture that consists of three components: \circled{1} a \emph{link-encoder} that is only based on multi-layer perceptrons (MLP) to summarize the information from temporal links, \circled{2} a \emph{node-encoder} that is only based on neighbor mean-pooling to summarize node information, and \circled{3} an MLP-based \emph{link classifier} that performs link prediction based on the outputs of the encoders. Despite its simplicity, \our attains outstanding performance on temporal link prediction benchmarks with faster convergence and better generalization performance. These results motivate us to rethink the importance of simpler model architectures. ","temporal graph, link prediction" Synthetic Pre-Training Tasks for Neural Machine Translation,https://openreview.net/forum?id=w22mvGyIysy,https://openreview.net/pdf?id=w22mvGyIysy,,"Pre-training is an effective technique for ensuring robust performance on a variety of machine learning tasks. It typically depends on large-scale crawled corpora that can result in toxic or biased models. Such data can also be problematic with respect to copyright, attribution, and privacy. Pre-training with synthetic tasks and data is a promising way of alleviating such concerns since no real-world information is ingested by the model. Our goal in this paper is to understand what makes for a good pre-trained model when using synthetic resources. We answer this question in the context of neural machine translation by considering two novel approaches to translation model pre-training. Our first approach studies the effect of pre-training on obfuscated data derived from a parallel corpus by mapping words to a vocabulary of `nonsense' tokens. Our second approach explores the effect of pre-training on procedurally generated synthetic parallel data that does not depend on any real human language corpus. Our empirical evaluation on multiple language pairs shows that, to a surprising degree, the benefits of pre-training can be realized even with obfuscated or purely synthetic parallel data.
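The word-obfuscation idea described just above has a simple mechanical core: deterministically remap each vocabulary item to an arbitrary `nonsense' token, so corpus statistics survive while real-world lexical content is destroyed. A minimal sketch (the mapping scheme is illustrative, not the paper's exact procedure):

```python
import random

def build_obfuscator(vocab, seed=0):
    """Map each real word to a meaningless placeholder token.
    A fixed random bijection preserves corpus statistics
    (frequencies, co-occurrences, alignments) exactly."""
    rng = random.Random(seed)
    targets = [f"tok{i:05d}" for i in range(len(vocab))]
    rng.shuffle(targets)
    return dict(zip(vocab, targets))

vocab = ["the", "cat", "sat", "on", "mat"]
obfuscate = build_obfuscator(vocab)
sentence = "the cat sat on the mat".split()
print([obfuscate[w] for w in sentence])
# Repeated words map to the same placeholder, so distributional
# structure (the signal pre-training exploits) is retained.
```

Applying the same bijection to both sides of a parallel corpus yields obfuscated translation data with no human-readable content, which is what makes it attractive for avoiding toxicity, bias, and copyright concerns.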
In our analysis, we consider the extent to which obfuscated and synthetic pre-training techniques can be used to mitigate the issue of hallucinated model toxicity.","machine translation, synthetic data pre-training, toxicity and bias" On the System-Level Effectiveness of Physical Object-Hiding Adversarial Attack in Autonomous Driving,https://openreview.net/forum?id=XKjz6mR3iqe,https://openreview.net/pdf?id=XKjz6mR3iqe,,"In Autonomous Driving (AD) systems, perception is crucial for both security and safety. Among the different attacks on AD perception, the physical object-hiding adversarial attacks are especially severe due to their direct impact on road safety. However, we find that all existing works so far only evaluate their attack effect at the targeted AI component level, without any evaluation \textit{at the system level}, i.e., with the entire system semantics and context such as the full AD system pipeline and closed-loop control. This inevitably raises a critical research question: can these existing research efforts actually achieve the desired system-level attack effects (e.g., causing vehicle collisions, traffic rule violations, etc.) in the real-world AD system context? In this paper, we thus perform the first measurement study on whether, and how effectively, existing designs can lead to system-level effects, where we take the STOP sign-hiding attack as our target. Our evaluation results show that all the representative prior works cannot achieve any system-level effect in a classical closed-loop AD setup at road speeds controlled by common STOP signs. With that, we then point out two limitation hypotheses that appear in all existing works: 1) the impractical STOP sign size distribution used in pixel sampling, and 2) the lack of consideration of the system-critical attack range. Experimental results demonstrate that after overcoming these two limitations, the system-level effects can be further improved, i.e., the violation rate can increase by around 70\%.", BIG-Graph: Brain Imaging Genetics by Graph Neural Network,https://openreview.net/forum?id=jxAle6zdyTI,https://openreview.net/pdf?id=jxAle6zdyTI,,"Imaging genetics is one of the foremost emerging fields in neuroscience research that aims to combine neuroimaging and genetic information with phenotypes to shed light on inherent underlying mechanisms. While significant progress has been made in integrating brain imaging, like functional magnetic resonance imaging (fMRI), with genetic data, such as single nucleotide polymorphisms (SNPs), little progress has been made in studying them jointly using graph structures. To raise a new perspective and overcome challenges in analyzing data with high dimensionality and inherently complex relationships, we developed a graph neural network model (BIG-Graph) that jointly learns to effectively represent both neuroimaging and genetic data in a nonlinear manner without any prior knowledge.
Here, we demonstrate that joint learning of imaging-genetics using BIG-Graph largely outperforms existing state-of-the-art imaging genetics models and networks trained separately on neuroimaging or genetic data in predicting a variety of phenotypes.","Imaging Genetics, Graph Neural Network, Brain Network, genome-wide association studies" Optimizing the Performance of Text Classification Models by Improving the Isotropy of the Embeddings using a Joint Loss Function,https://openreview.net/forum?id=i3DLC5xIAJN,https://openreview.net/pdf?id=i3DLC5xIAJN,,"Recent studies show that the spatial distribution of the sentence representations generated from pre-trained language models is highly anisotropic, meaning that the representations are not uniformly distributed among the directions of the embedding space. Thus, the expressiveness of the embedding space is limited, as the embeddings are less distinguishable and less diverse. This results in a degradation in the performance of the models on the downstream task. Most methods that define the state-of-the-art in this area proceed by improving the isotropy of the sentence embeddings by refining the corresponding contextual word representations, then deriving the sentence embeddings from these refined representations. In this study, we propose to improve the quality and distribution of the sentence embeddings extracted from the [CLS] token of the pre-trained language models by improving the isotropy of the embeddings. We add one feed-forward layer, referred to as the Isotropy Layer, between the model and the downstream task layers. We train this layer using a novel joint loss function that optimizes an isotropy quality measure and the downstream task loss. This joint loss pushes the embeddings outputted by the Isotropy Layer to be more isotropic, and it also retains the semantics needed to perform the downstream task. The proposed approach results in transformed embeddings with better isotropy that generalize better on the downstream task. Furthermore, the approach requires training one feed-forward layer, instead of retraining the whole network. We quantify and evaluate the isotropy through multiple metrics, mainly the Explained Variance and the IsoScore. Experimental results on 3 GLUE datasets with classification as the downstream task show that our proposed method is on par with the state-of-the-art, as it achieves performance gains of around 2-3% on the downstream tasks compared to the baseline. ", Data Feedback Loops: Model-driven Amplification of Dataset Biases,https://openreview.net/forum?id=dGdoZds9qAs,https://openreview.net/pdf?id=dGdoZds9qAs,We theoretically and experimentally characterize bias amplification when training a model on its own outputs.,"Datasets scraped from the internet have been critical to large-scale machine learning. Yet, this success puts the utility of future internet-derived datasets at potential risk, as model outputs begin to replace human annotations as a source of supervision. In this work, we formalize a system where interactions with one model are recorded as history and scraped as training data in the future. We then analyze its stability over time by tracking changes to a test-time bias statistic (e.g. gender bias of model predictions). We find that the degree of bias amplification is closely linked to whether the model’s outputs behave like samples from the training distribution, a behavior which we characterize and define as consistent calibration.
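To make the joint isotropy loss described in the abstract above concrete, one common way to quantify (an)isotropy is through the spread of the embedding covariance spectrum, in the spirit of the explained-variance metric the authors mention. A hedged PyTorch sketch of such a joint objective (the exact measure used by the paper may differ; all names here are illustrative):

```python
import torch

def anisotropy_penalty(embeddings):
    """Penalty that is 0 when the batch covariance spectrum is flat
    (isotropic) and grows as variance concentrates in few directions."""
    z = embeddings - embeddings.mean(0, keepdim=True)
    cov = (z.T @ z) / (len(z) - 1)
    eig = torch.linalg.eigvalsh(cov).clamp(min=1e-8)
    p = eig / eig.sum()
    entropy = -(p * p.log()).sum()       # maximal when spectrum is flat
    return 1.0 - entropy / torch.log(torch.tensor(float(len(p))))

def joint_loss(task_loss, embeddings, lam=0.1):
    # Total objective: downstream loss plus weighted isotropy term.
    return task_loss + lam * anisotropy_penalty(embeddings)

z = torch.randn(256, 64, requires_grad=True)   # [CLS]-style embeddings
loss = joint_loss(torch.tensor(0.7), z)
loss.backward()                                 # gradients flow into z
print(float(loss))
```

The weight lam trades off how strongly the Isotropy Layer's outputs are pushed towards a uniform spectrum against fitting the downstream task.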
Experiments in three conditional prediction scenarios – image classification, visual role-labeling, and language generation – demonstrate that models that exhibit a sampling-like behavior are more calibrated and thus more stable. Based on this insight, we propose an intervention to help calibrate and stabilize unstable feedback systems.","feedback loops, bias amplification, deep learning, self-supervised learning, CV, NLP" A $2$-parameter Persistence Layer for Learning,https://openreview.net/forum?id=WF7dU23lRCo,https://openreview.net/pdf?id=WF7dU23lRCo,A differentiable topological layer based on a novel vector representation on $2$-parameter persistence modules.,"$1$-parameter persistent homology, a cornerstone in Topological Data Analysis (TDA), studies the evolution of topological features such as cycle bases hidden in data. It has found application in strengthening the representation power of deep learning models like Graph Neural Networks (GNNs). To enrich the representations of topological features, here we propose to study $2$-parameter persistence modules induced by bi-filtration functions. In order to incorporate these representations into machine learning models, we introduce a novel vectorization on $2$-parameter persistence modules called Generalized Rank Invariant Landscape {\textsc{Gril}}. We show that this vector representation is stable and differentiable with respect to underlying filtration functions and can be easily integrated into machine learning models to augment the encoding of topological features. We present an algorithm to compute the vectorization and its gradients. We also test our methods on synthetic graph datasets and compare the results with some popular graph neural networks.","topological data analysis, graph representation, persistent homology, 2-parameter persistence, graph neural network" Is Conditional Generative Modeling all you need for Decision Making?,https://openreview.net/forum?id=sP1fo2K9DFG,https://openreview.net/pdf?id=sP1fo2K9DFG,Framing (offline) sequential decision making as conditional diffusion generative modeling,"Recent improvements in conditional generative modeling have made it possible to generate high-quality images from language descriptions alone. We investigate whether these methods can directly address the problem of sequential decision-making. We view decision-making not through the lens of reinforcement learning (RL), but rather through conditional generative modeling. To our surprise, we find that our formulation leads to policies that can outperform existing offline RL approaches across standard benchmarks. By modeling a policy as a return-conditional generative model, we avoid the need for dynamic programming and subsequently eliminate many of the complexities that come with traditional offline RL. We further demonstrate the advantages of modeling policies as conditional generative models by considering two other conditioning variables: constraints and skills. Conditioning on a single constraint or skill during training leads to behaviors at test-time that can satisfy several constraints together or demonstrate a composition of skills. Our results illustrate that conditional generative modeling is a powerful tool for decision-making.
","Offline Reinforcement Learning, Conditional Generative Modeling, Sequential Decision Making, Diffusion Models" META-STORM: Generalized Fully-Adaptive Variance Reduced SGD for Unbounded Functions,https://openreview.net/forum?id=1KtU2ya2zh5,https://openreview.net/pdf?id=1KtU2ya2zh5,We propose new fully adaptive variance reduced algorithms removing bounded function values and bounded gradients assumptions and improving upon previous work both in the theoretical convergence rate and empirical performance.,"We study the application of variance reduction (VR) techniques to general non-convex stochastic optimization problems. In this setting, the recent work STORM (Cutkosky & Orabona, 2019) overcomes the drawback of having to compute gradients of “mega-batches” that earlier VR methods rely on. There, STORM utilizes recursive momentum to achieve the VR effect and is then later made fully adaptive in STORM+ (Levy et al., 2021), where full-adaptivity removes the requirement for obtaining certain problem-specific parameters such as the smoothness of the objective and bounds on the variance and norm of the stochastic gradients in order to set the step size. However, STORM+ crucially relies on the assumption that the function values are bounded, excluding a large class of useful functions. In this work, we propose META-STORM, a generalized framework of STORM+ that removes this bounded function values assumption while still attaining the optimal convergence rate for non-convex optimization. META-STORM not only maintains full-adaptivity, removing the need to obtain problem specific parameters, but also improves the convergence rate’s dependency on the problem parameters. Furthermore, META-STORM can utilize a large range of parameter settings that subsumes previous methods allowing for more flexibility in a wider range of settings. Finally, we demonstrate the effectiveness of META-STORM through experiments across common deep learning tasks. Our algorithm improves upon the previous work STORM+ and is competitive with widely used algorithms after the addition of per-coordinate update and exponential moving average heuristics.","Nonconvex Optimization, Stochastic Optimization, Adaptive Algorithms, Variance Reduction" TEMPERA: Test-Time Prompt Editing via Reinforcement Learning,https://openreview.net/forum?id=gSHyqBijPFO,https://openreview.net/pdf?id=gSHyqBijPFO,,"Careful prompt design is critical to the use of large language models in zero-shot or few-shot learning. As a consequence, there is a growing interest in automated methods to design optimal prompts. In this work, we propose Test-time Prompt Editing using Reinforcement learning (TEMPERA). In contrast to prior prompt generation methods, TEMPERA can efficiently leverage prior knowledge, is adaptive to different queries and provides an interpretable prompt for every query. To achieve this, we design a novel action space that allows flexible editing of the initial prompts covering a wide set of commonly-used components like instructions, few-shot exemplars, and verbalizers. The proposed method achieves significant gains compared with recent SoTA approaches like prompt tuning, AutoPrompt, and RLPrompt, across a variety of tasks including sentiment analysis, topic classification, natural language inference, and reading comprehension. 
Our method achieves an average 5.33x improvement in sample efficiency compared to traditional fine-tuning methods.", Combining pretrained speech and text encoders for spoken language processing,https://openreview.net/forum?id=YlvyUw4wDgs,https://openreview.net/pdf?id=YlvyUw4wDgs,"We propose to combine pretrained speech and text encoders via cross-attention, and we show the application of the proposed architecture in multiple spoken language processing systems","Spoken Language Processing tasks that extract information from the speech signal have the advantage of using both speech and text modalities. In this paper, we propose to combine pretrained speech and text encoders via cross-attention, and we show the application of the proposed architecture in multiple spoken language processing systems. Our results indicate that it's more efficient to re-purpose previously trained independent modality encoders and learn only cross-attention from scratch. This resultant architecture captures both acoustic and lexical information, and performs text tagging while attending to the speech encoder for improved results. We use compact pretrained speech and text encoders, which are resource-efficient and can be trained on a single consumer GPU card.","Spoken language processing, Multi-modal SLU, Encoder fusion" A Large Scale Sample Complexity Analysis of Neural Policies in the Low-Data Regime,https://openreview.net/forum?id=kKNVu-2J89s,https://openreview.net/pdf?id=kKNVu-2J89s,,"Progress in reinforcement learning algorithm development is at one of its highest points, starting from the initial study that enabled sequential decision making from high-dimensional observations. Deep reinforcement learning research has recently seen breakthroughs ranging from learning without rewards to learning functioning policies without even knowing the rules of the game. In our paper we focus on the trends currently used in deep reinforcement learning algorithm development in the low-data regime. We theoretically show that the performance profiles of the algorithms developed for the high-data regime do not transfer to the low-data regime in the same order. We conduct extensive experiments in the Arcade Learning Environment and our results demonstrate that the baseline algorithms perform significantly better in the low-data regime compared to the set of algorithms that were initially designed and compared in the large-data regime.", Evaluating Representations with Readout Model Switching,https://openreview.net/forum?id=Fsd-6ax4T1m,https://openreview.net/pdf?id=Fsd-6ax4T1m,We propose an evaluation framework that is based on MDL and model switching for evaluating representations.,"Although much of the success of Deep Learning builds on learning good representations, a rigorous method to evaluate their quality is lacking. In this paper, we treat the evaluation of representations as a model selection problem and propose to use the Minimum Description Length (MDL) principle to devise an evaluation metric. Contrary to the established practice of limiting the capacity of the readout model, we design a hybrid discrete and continuous-valued model space for the readout models and employ a switching strategy to combine their predictions. The MDL score takes the model complexity, as well as the data efficiency into account. As a result, the most appropriate model for the specific task and representation will be chosen, making it a unified measure for comparison.
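The MDL evaluation described above is, in spirit, prequential coding: the description length of the labels accumulates as each example is predicted by a readout fit only on earlier data. A minimal sketch with a refit-at-checkpoints schedule (the logistic readout and checkpoint schedule are illustrative stand-ins, not the paper's model space):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def prequential_codelength(X, y, checkpoints=(50, 100, 300, 1000)):
    """Sum of -log2 p(y_t | x_t) where each block is predicted by a readout
    fit only on earlier data; smaller totals mean the representation X
    supports both data-efficient and accurate readout."""
    k = len(np.unique(y))
    start = checkpoints[0]
    total = start * np.log2(k)        # uniform code for the first block
    for end in list(checkpoints[1:]) + [len(X)]:
        model = LogisticRegression(max_iter=1000).fit(X[:start], y[:start])
        probs = model.predict_proba(X[start:end])
        idx = np.searchsorted(model.classes_, y[start:end])
        total += -np.log2(probs[np.arange(end - start), idx] + 1e-12).sum()
        start = end
    return total                      # bits; divide by len(X) for bits/example

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 32))       # frozen "encoder" features
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)
print(prequential_codelength(X, y))
```

Model switching extends this by letting different readout models encode different blocks, so the score automatically favors whichever readout is most appropriate at each data scale.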
The proposed metric can be efficiently computed with an online method, and we present results for pre-trained vision encoders of various architectures (ResNet and ViT) and objective functions (supervised and self-supervised) on a range of downstream tasks. Finally, we discuss important properties revealed by these evaluations such as model scaling, preferred readout model, and data efficiency.","Representation Learning, Evaluation, Expert Switching, Minimum Description Length" Provable Defense Against Geometric Transformations,https://openreview.net/forum?id=ThXqBsRI-cY,https://openreview.net/pdf?id=ThXqBsRI-cY,We present a training framework and verifier for deterministic certified robustness against geometric transformations.,"Geometric image transformations that arise in the real world, such as scaling and rotation, have been shown to easily deceive deep neural networks (DNNs). Hence, training DNNs to be certifiably robust to these perturbations is critical. However, no prior work has been able to incorporate the objective of deterministic certified robustness against geometric transformations into the training procedure, as existing verifiers are exceedingly slow. To address these challenges, we propose the first provable defense for deterministic certified geometric robustness. Our framework leverages a novel GPU-optimized verifier that can certify images between 60$\times$ and 42,600$\times$ faster than existing geometric robustness verifiers, and thus, unlike existing works, is fast enough for use in training. Our results across multiple datasets show that networks trained via our framework consistently achieve state-of-the-art deterministic certified geometric robustness and clean accuracy. Furthermore, for the first time, we verify the geometric robustness of a neural network for the challenging, real-world setting of autonomous driving.","Certified robustness, geometric transformations" Augmentation with Projection: Towards an Effective and Efficient Data Augmentation Paradigm for Distillation,https://openreview.net/forum?id=kPPVmUF6bM_,https://openreview.net/pdf?id=kPPVmUF6bM_,We propose an effective and efficient data augmentation paradigm for knowledge distillation,"Knowledge distillation is one of the primary methods of transferring knowledge from large to small models. However, it requires massive task-specific data, which may not be plausible in many real-world applications. Data augmentation methods such as representation interpolation, token replacement, or augmentation with models are applied to tackle this problem. However, these data augmentation methods either potentially cause shifts in decision boundaries (representation interpolation), are not expressive enough (token replacement), or introduce too much computational overhead (augmentation with models). To this end, we propose AugPro (Augmentation with Projection), an effective and efficient data augmentation method for distillation. Our method builds on top of representation interpolation augmentation methods to maintain the diversity of expressions and converts the augmented data to tokens to avoid shifting decision boundaries. It uses simple operations that come with little computational overhead.
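The interpolate-then-project recipe in the AugPro abstract above can be pictured as: mix two examples in embedding space, then snap each mixed vector back to its nearest vocabulary token so the augmented data is again discrete text. A hedged toy sketch (the embedding table and mixing rule are illustrative, not the paper's exact design):

```python
import torch

torch.manual_seed(0)
vocab_size, dim = 1000, 64
emb = torch.nn.Embedding(vocab_size, dim)   # stand-in embedding table

def augment_with_projection(ids_a, ids_b, lam=0.7):
    """Mix two token sequences in embedding space, then project each
    mixed vector to the closest token, so the result stays in the
    discrete token space (avoiding decision-boundary shifts)."""
    mixed = lam * emb(ids_a) + (1 - lam) * emb(ids_b)   # (seq, dim)
    dists = torch.cdist(mixed, emb.weight)              # (seq, vocab)
    return dists.argmin(dim=-1)                         # nearest tokens

ids_a = torch.randint(0, vocab_size, (8,))
ids_b = torch.randint(0, vocab_size, (8,))
print(augment_with_projection(ids_a, ids_b))
# With lam close to 1 the output mostly snaps back to ids_a's tokens.
```

The projection is what distinguishes this from plain representation interpolation: the student is always distilled on valid token sequences rather than on points between them.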
The results on multiple GLUE tasks show that our method can improve distillation performance by a large margin at a low time cost.","Knowledge Distillation, Data Augmentation, Natural Language Processing" Pseudoinverse-Guided Diffusion Models for Inverse Problems,https://openreview.net/forum?id=9_gsMA8MRKQ,https://openreview.net/pdf?id=9_gsMA8MRKQ,"We introduce pseudoinverse guidance, an approach to solve inverse problems with generative diffusion models.","Diffusion models have become competitive candidates for solving various inverse problems. Models trained for specific inverse problems work well but are limited to their particular use cases, whereas methods that use problem-agnostic models are general but often perform worse empirically. To address this dilemma, we introduce Pseudoinverse-guided Diffusion Models ($\Pi$GDM), an approach that uses problem-agnostic models to close the gap in performance. $\Pi$GDM directly estimates conditional scores from the measurement model of the inverse problem without additional training. It can address inverse problems with noisy, non-linear, or even non-differentiable measurements, in contrast to many existing approaches that are limited to noiseless linear ones. We illustrate the empirical effectiveness of $\Pi$GDM on several image restoration tasks, including super-resolution, inpainting and JPEG restoration. On ImageNet, $\Pi$GDM is competitive with state-of-the-art diffusion models trained on specific tasks, and is the first to achieve this with problem-agnostic diffusion models. $\Pi$GDM can also solve a wider set of inverse problems where the measurement processes are composed of several simpler ones.","diffusion models, inverse problems" Autoregressive Diffusion Model for Graph Generation,https://openreview.net/forum?id=98J48HZXxd5,https://openreview.net/pdf?id=98J48HZXxd5,A new autoregressive diffusion model for graph generation," Diffusion-based graph generative models have recently obtained promising results for graph generation. However, existing diffusion-based graph generative models are all one-shot generative models that apply Gaussian diffusion in the dequantized adjacency matrix space. Such a strategy can suffer from difficulty in model training, slow sampling speed, and an inability to incorporate constraints. We propose an \emph{autoregressive diffusion} model for graph generation. Unlike existing methods, we define a node-absorbing diffusion process that operates directly in the discrete graph space. For forward diffusion, we design a \emph{diffusion ordering network}, which learns an optimal node absorbing ordering from graph topology. For reverse generation, we design a \emph{denoising network} that uses the reverse node ordering to efficiently reconstruct the graph by predicting one row of the adjacency matrix at a time. Based on the permutation invariance of graph generation, we show that the two networks can be jointly trained by optimizing a simple lower bound of data likelihood.
Our experiments on six diverse datasets show that our model achieves generation performance better than or comparable to the previous state of the art, while enjoying fast generation speed.","graph generation, diffusion-based generative model" Contrastive introspection (ConSpec) to rapidly identify invariant steps for success,https://openreview.net/forum?id=yI_xwoYO6cF,https://openreview.net/pdf?id=yI_xwoYO6cF,"In ConSpec, a contrastive loss rapidly identifies invariances among successful episodes, even in tasks with sparse rewards and multiple contingencies.","Reinforcement learning (RL) algorithms have achieved notable success in recent years, but still struggle with fundamental issues in long-term credit assignment. It remains difficult to learn in situations where success is contingent upon multiple critical steps that are distant in time from each other and from a sparse reward, as is often the case in real life. Moreover, how RL algorithms assign credit in these difficult situations is typically not coded in a way that can rapidly generalize to new situations. Here, we present an approach using offline contrastive learning, which we call contrastive introspection (ConSpec), that can be added to any existing RL algorithm and addresses both issues. In ConSpec, a contrastive loss is used during offline replay to identify invariances among successful episodes. This takes advantage of the fact that it is easier to retrospectively identify the small set of steps that success is contingent upon than it is to prospectively predict reward at every step taken in the environment. ConSpec stores this knowledge in a collection of prototypes summarizing the intermediate states required for success. During training, arrival at any state that matches these prototypes generates an intrinsic reward that is added to any external rewards. Moreover, the reward shaping provided by ConSpec can be made to preserve the optimal policy of the underlying RL agent. The prototypes in ConSpec provide two key benefits for credit assignment: (1) They enable rapid identification of all the critical states. (2) They do so in a readily interpretable manner, enabling out-of-distribution generalization when sensory features are altered. In summary, ConSpec is a modular system that can be added to any existing RL algorithm to improve its long-term credit assignment.","Reinforcement learning, long term credit assignment, rapid credit assignment, contrastive learning, few-shot learning in RL" Self-supervised video pretraining yields strong image representations,https://openreview.net/forum?id=8onXkaNWLHA,https://openreview.net/pdf?id=8onXkaNWLHA,We achieve SoTA transfer to image scene understanding tasks using frame-based models pre-trained using contrastive learning on videos. ,"Videos contain far more information than still images, and hold the potential for learning rich representations of the visual world. Yet, pretraining on image datasets has remained the dominant paradigm for learning representations that capture spatial information, and previous attempts at video pretraining have fallen short on image understanding tasks. In this work we revisit self-supervised learning of image representations from the dynamic evolution of video frames. To that end, we propose a dataset curation procedure that addresses the domain mismatch between video and image datasets, and develop a contrastive learning framework which handles the complex transformations present in natural videos.
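The prototype-matching rule in the ConSpec abstract above can be summarized compactly: a state earns an intrinsic reward when its encoding is sufficiently similar to a stored success prototype. A minimal sketch of that retrieval rule (the encoder, threshold, and scaling are illustrative assumptions, not the paper's exact hyperparameters):

```python
import torch
import torch.nn.functional as F

def intrinsic_reward(state_embedding, prototypes, threshold=0.8, scale=1.0):
    """Return a bonus when the current state matches any stored prototype
    of success-critical states (cosine similarity above a threshold)."""
    sims = F.cosine_similarity(
        state_embedding.unsqueeze(0), prototypes, dim=-1)  # (num_prototypes,)
    best = sims.max()
    return scale * best if best > threshold else torch.tensor(0.0)

torch.manual_seed(0)
prototypes = F.normalize(torch.randn(8, 32), dim=-1)  # learned from successes
state = prototypes[3] + 0.05 * torch.randn(32)        # near prototype 3
print(float(intrinsic_reward(state, prototypes)))     # positive bonus
print(float(intrinsic_reward(torch.randn(32), prototypes)))  # likely 0
```

This bonus is simply added to the environment reward, which is why the mechanism can be bolted onto any existing RL algorithm.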
This simple paradigm for distilling knowledge from videos to image representations, called VITO, performs surprisingly well on a variety of image-based transfer learning tasks. For the first time, our video-pretrained model closes the gap with ImageNet pretraining on semantic segmentation on PASCAL and ADE20k, and on object detection on COCO and LVIS, raising the possibility of video-pretraining becoming the new default for learning image representations. ","self-supervised, contrastive, video, representation learning, image segmentation, object detection" Planning with Language Models through Iterative Energy Minimization,https://openreview.net/forum?id=cVFD6qE8gnY,https://openreview.net/pdf?id=cVFD6qE8gnY,Planning with Transformers through energy minimization (MCMC sampling),"Recent works have shown that language modeling can be effectively used to train reinforcement learning (RL) policies. However, the success of applying existing language models to planning, in which we wish to obtain a trajectory of actions to reach some goal, is less straightforward. The typical autoregressive generation procedures of language models preclude sequential refinement of earlier steps, which limits the effectiveness of a predicted plan. In this paper, we suggest an approach towards integrating planning with language models based on the idea of iterative energy minimization, and illustrate how such a procedure leads to improved RL performance across different tasks. We train a masked language model to capture an implicit energy function over trajectories of actions, and formulate planning as finding a trajectory of actions with minimum energy. We illustrate how this procedure enables improved performance over recent approaches across BabyAI and Atari environments. We further demonstrate unique benefits of our iterative optimization procedure, involving new task generalization, test-time constraints adaptation, and the ability to compose plans together.","Reinforcement Learning, Planning, Language Model, Decision Transformer" The Union of Manifolds Hypothesis,https://openreview.net/forum?id=Rvee9CAX4fi,https://openreview.net/pdf?id=Rvee9CAX4fi,"We show data of interest has varying intrinsic dimension, thus conforming to a union of manifolds hypothesis rather than the manifold hypothesis; and we study some implications in deep learning.","Deep learning has had tremendous success at learning low-dimensional representations of high-dimensional data. This success would be impossible if there were no hidden low-dimensional structure in data of interest; its existence is posited by the manifold hypothesis, which states that the data lies on an unknown manifold of low intrinsic dimension. In this paper, we argue that this hypothesis does not properly capture the low-dimensional structure typically present in data. Assuming that data lies on a single manifold implies intrinsic dimension is identical across the entire data space, and does not allow for subregions of this space to have a different number of factors of variation. To address this deficiency, we put forth the union of manifolds hypothesis, which states that data lies on a disjoint union of manifolds of varying intrinsic dimensionality. We empirically verify this hypothesis on commonly-used image datasets, finding that indeed, observed data lies on a disconnected set and that intrinsic dimension is not constant.
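The empirical verification mentioned just above rests on local intrinsic-dimension estimators; one standard choice is the Levina-Bickel maximum-likelihood estimator built from nearest-neighbor distances. A minimal sketch (this is the textbook estimator, not necessarily the exact one used in the paper):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mle_intrinsic_dimension(X, k=10):
    """Levina-Bickel MLE: per point, d_hat = [(1/(k-1)) * sum_j log(T_k/T_j)]^-1
    over the k nearest-neighbor distances T_1 < ... < T_k."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)
    dists = dists[:, 1:]                  # drop self-distance (zero)
    log_ratios = np.log(dists[:, -1][:, None] / dists[:, :-1])
    return (k - 1) / log_ratios.sum(axis=1)   # per-point estimates

rng = np.random.default_rng(0)
# 3-dimensional data linearly embedded in 20 ambient dimensions.
Z = rng.normal(size=(2000, 3))
X = Z @ rng.normal(size=(3, 20))
print(mle_intrinsic_dimension(X).mean())  # close to 3, not 20
```

Running such an estimator locally across a dataset, rather than once globally, is what reveals the varying intrinsic dimension that the union-of-manifolds hypothesis predicts.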
We also provide insights into the impact of the union of manifolds hypothesis in deep learning, both supervised and unsupervised, showing that designing models with an inductive bias towards this structure improves performance across classification and generative modelling tasks.","manifold hypothesis, geometry, generative models" Last Layer Re-Training is Sufficient for Robustness to Spurious Correlations,https://openreview.net/forum?id=Zb6c8A-Fghk,https://openreview.net/pdf?id=Zb6c8A-Fghk,We propose a simple method based on retraining the last layer of a neural network which achieves strong results on spurious correlation benchmarks.,"Neural network classifiers can largely rely on simple spurious features, such as image backgrounds, to make predictions. However, even in these cases, we show that they still often learn core features associated with the desired attributes of the data, contrary to recent findings. Inspired by this insight, we demonstrate that simple last layer retraining can match or outperform state-of-the-art approaches on spurious correlation benchmarks, but with profoundly lower complexity and computational expenses. Moreover, we show that last layer retraining on large ImageNet-trained models can also significantly reduce reliance on background and texture information, improving robustness to covariate shift, after only minutes of training on a single GPU.","spurious correlations, robustness" UniS-MMC: Learning Unimodality-supervised Multimodal Contrastive Representations,https://openreview.net/forum?id=4lw-X9jRi1c,https://openreview.net/pdf?id=4lw-X9jRi1c,This paper proposes a novel multi-task-based multimodal contrastive method for multimodal representation learning (multimodal classification task).,"Multimodal learning aims to imitate humans in acquiring complementary information from multiple modalities for final decisions. However, just like a human's final decision can be confused by specific erroneous information from the environment, current multimodal learning methods also suffer from uncertain unimodal predictions when learning multimodal representations. In this work, we propose to contrastively explore reliable representations and increase the agreement among the unimodal representations that alone make potentially correct predictions. Specifically, we first capture task-related representations by directly sharing representations between unimodal and multimodal learning tasks. With the unimodal representations and predictions from the multitask-based framework, we then propose a novel multimodal contrastive learning method to align the representations towards the relatively more reliable modality under the weak supervision of the unimodal predictions. Experimental results on two image-text benchmarks UPMC-Food-101 and N24News, and two medical benchmarks ROSMAP and BRCA, show that our proposed Unimodality-supervised Multimodal Contrastive (UniS-MMC) learning method outperforms current state-of-the-art multimodal learning methods.
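A concrete version of the last-layer retraining recipe described above: freeze the pretrained backbone, recompute features once, and refit only a linear head on a small group-balanced held-out set. A hedged sketch (the backbone features and balancing scheme are illustrative stand-ins):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def retrain_last_layer(features, labels, groups, per_group=100, seed=0):
    """Refit only the classifier head on a group-balanced subsample,
    so the head cannot lean on spuriously correlated majority groups."""
    rng = np.random.default_rng(seed)
    idx = np.concatenate([
        rng.choice(np.where(groups == g)[0], size=per_group, replace=False)
        for g in np.unique(groups)
    ])
    head = LogisticRegression(max_iter=1000, C=1.0)
    head.fit(features[idx], labels[idx])
    return head

# features = frozen_backbone(images); random stand-ins here.
rng = np.random.default_rng(0)
features = rng.normal(size=(5000, 128))
labels = rng.integers(0, 2, size=5000)
groups = rng.integers(0, 4, size=5000)   # e.g. (class, background) combos
head = retrain_last_layer(features, labels, groups)
print(head.score(features, labels))
```

Because only the head is refit, the whole procedure costs a single feature pass plus a convex fit, which is why it runs in minutes even on top of large pretrained models.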
Detailed ablation studies further demonstrate the advantage of our proposed method.","multimodal learning, contrastive learning, multi-task learning" Progressive Data Dropout: An Adaptive Training Strategy for Large-Scale Supervised Learning,https://openreview.net/forum?id=rUb1-7H4q6a,https://openreview.net/pdf?id=rUb1-7H4q6a,PDD is a model-agnostic strategy for removing class-level data as learned by the model during training.,"Common training strategies for deep neural networks are computationally expensive, continuing to redundantly train and evaluate on classes already well-understood by the model. A common strategy to diminish this cost is to reduce the data used in training; however, this often comes at the expense of the model's accuracy or incurs additional computational cost in training. We propose progressive data dropout (PDD), an adaptive training strategy which performs class-level data dropout from the training set as the network develops an understanding of each class. Our experiments on large-scale image classification demonstrate PDD reduces the total number of datapoints needed to train the network by a factor of 10, reducing the overall training time without significantly impacting accuracy or modifying the model architecture. We additionally demonstrate improvements via experiments and ablations on computer vision benchmarks, including MNIST, Fashion-MNIST, SVHN, CIFAR, and ImageNet datasets.","data dropout, training optimization, adaptive training, classification, large-scale" Error Sensitivity Modulation based Experience Replay: Mitigating Abrupt Representation Drift in Continual Learning,https://openreview.net/forum?id=zlbci7019Z3,https://openreview.net/pdf?id=zlbci7019Z3,A novel method that employs a principled mechanism for modulating the error sensitivity in a dual-memory rehearsal-based system for effective continual learning,"Humans excel at lifelong learning, as the brain has evolved to be robust to distribution shifts and noise in our ever-changing environment. Deep neural networks (DNNs), however, exhibit catastrophic forgetting and the learned representations drift drastically as they encounter a new task. This alludes to a different error-based learning mechanism in the brain. Unlike DNNs, where learning scales linearly with the magnitude of the error, the sensitivity to errors in the brain decreases as a function of their magnitude. To this end, we propose ESMER, which employs a principled mechanism for modulating the error sensitivity in a dual-memory rehearsal-based system. Concretely, it maintains a memory of past errors and utilizes it to modify the learning dynamics so that the model learns more from small consistent errors compared to large sudden errors. We also propose Error-Sensitive Reservoir Sampling to maintain episodic memory, which leverages the error history to pre-select low-loss samples as candidates for the buffer, which are better suited for retaining information. Empirical results show that ESMER effectively reduces forgetting and abrupt drift in representations at the task boundary by gradually adapting to the new task while consolidating knowledge.
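The error-sensitivity principle described above can be sketched as a running memory of loss magnitudes that downweights sudden large errors; a minimal illustration (the specific weighting rule is a hedged reading of the mechanism, not the paper's exact formulation):

```python
import torch

class ErrorSensitiveWeighter:
    """Keep a running average of sample losses and shrink the contribution
    of losses far above it, so learning is driven by small consistent
    errors rather than large sudden ones (e.g., at a task boundary)."""
    def __init__(self, momentum=0.99, margin=1.0):
        self.mu, self.momentum, self.margin = None, momentum, margin

    def __call__(self, losses):
        batch_mean = losses.detach().mean()
        self.mu = batch_mean if self.mu is None else (
            self.momentum * self.mu + (1 - self.momentum) * batch_mean)
        # Full weight for losses near/below the running mean; scaled down
        # proportionally for losses exceeding it by more than the margin.
        weights = torch.clamp(self.mu * self.margin / losses.detach(), max=1.0)
        return (weights * losses).mean()

weighter = ErrorSensitiveWeighter()
losses = torch.tensor([0.2, 0.3, 5.0])  # one abrupt, surprising error
print(weighter(losses))                 # the 5.0 term is heavily downweighted
```

Because the running mean adapts slowly, the large errors that appear right after a task switch contribute little at first, which is what dampens the abrupt representation drift at task boundaries.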
Remarkably, it also enables the model to learn under high levels of label noise, which is ubiquitous in real-world streams.","Continual Learning, Catastrophic Forgetting, Multi memory System, Experience Replay, Error-Sensitivity modulation, Brain inspired Algorithm, Representation Drift" Panoptically guided Image Inpainting with Image-level and Object-level Semantic Discriminators,https://openreview.net/forum?id=T9iojz-kOfU,https://openreview.net/pdf?id=T9iojz-kOfU,Guided image inpainting and image inpainting with a novel discriminator design.,"Recent image inpainting methods have made great progress. However, the existing approaches often struggle to hallucinate realistic object instances in natural scenes. Such a limitation is partially due to the lack of semantic-level constraints inside the hole as well as the lack of a mechanism to enforce the realism of local objects. To tackle the challenging object inpainting task, we propose a new panoptically guided image inpainting task that leverages a panoptic segmentation map to guide the completion of object instances. To enforce the realism of the generated objects, we propose a semantic discriminator that leverages pretrained visual features to improve the generated semantics. Furthermore, we propose object-level discriminators that take aligned instances as input to enforce the realism of individual objects. Experiments on the large-scale Places2 dataset demonstrate the significant improvement achieved by our method on object completion, verified in both quantitative and qualitative evaluations. Furthermore, our framework is flexible and can be generalized to other inpainting tasks including segmentation-guided inpainting, edge-guided inpainting, as well as standard image inpainting without guidance. Consequently, our approach achieves new state-of-the-art performance on the various inpainting tasks and impressive results on object completion. ","Generative model, image inpainting, image manipulation" Auditing Fairness Online through Interactive Refinement,https://openreview.net/forum?id=Gp91Et4LeRf,https://openreview.net/pdf?id=Gp91Et4LeRf,A visual inference-based optimization framework that facilitates the specification and auditing of fairness on black-box ML models efficiently.,"Machine learning algorithms are increasingly being deployed in high-stakes scenarios. A sizeable proportion of currently deployed models make their decisions in a black-box manner. Such decision-making procedures are susceptible to intrinsic biases, which has led to a call for accountability in deployed decision systems. In this work, we focus on user-specified accountability of decision-making processes of black-box systems. Previous work has formulated this problem as run-time fairness monitoring over decision functions. However, formulating appropriate specifications for situation-appropriate fairness metrics is challenging. We construct AVOIR, an automated inference-based optimization system that improves bounds for and generalizes prior work across a wide range of fairness metrics. AVOIR offers an interactive and iterative process for exploring fairness violations aligned with governance and regulatory requirements. Our bounds improve over previous probabilistic guarantees for such fairness grammars in online settings. We also construct a novel visualization mechanism that can be used to investigate the context of reported fairness violations and guide users towards meaningful and compliant fairness specifications. 
We then conduct case studies with fairness metrics on three different datasets and demonstrate how the visualization and improved optimization can detect fairness violations more efficiently and ameliorate the issues with faulty fairness metric design. ","fairness, metrics, verification, inference, online, monitoring" REM: Routing Entropy Minimization for Capsule Networks,https://openreview.net/forum?id=DUfpVGCXfwa,https://openreview.net/pdf?id=DUfpVGCXfwa," REM is a technique for Capsule Networks which combines pruning and quantization, driving these models towards higher interpretability. REM generates a significantly lower number of parse trees, with no performance loss.","Capsule Networks aim to build an interpretable and biologically-inspired neural network model. One of their main innovations relies on the routing mechanism, which extracts a parse tree: its main purpose is to explicitly build relationships between capsules. However, their true potential has not surfaced yet: these relationships are extremely heterogeneous and difficult to understand, as the intra-class extracted parse trees are very different from each other. One school of thought, giving up on this aspect, proposes less interpretable versions of Capsule Networks without routing. This paper proposes REM, a technique which minimizes the entropy of the parse tree-like structure. We accomplish this by driving the model's parameter distribution towards low-entropy configurations, using a pruning mechanism as a proxy. Thanks to REM, we generate a significantly lower number of parse trees, with essentially no performance loss, showing also that Capsule Networks build stronger and more stable relationships between capsules.","capsule networks, deep learning, entropy, parse tree, routing" Don’t forget the nullspace! Nullspace occupancy as a mechanism for out of distribution failure,https://openreview.net/forum?id=39z0zPZ0AvB,https://openreview.net/pdf?id=39z0zPZ0AvB,,"Out of distribution (OoD) generalization has received considerable interest in recent years. In this work, we identify a particular failure mode of OoD generalization for discriminative classifiers that is based on test data (from a new domain) lying in the nullspace of features learnt from source data. We demonstrate the existence of this failure mode across multiple networks trained on the RotatedMNIST, PACS, TerraIncognita, DomainNet and ImageNet-R datasets. We then study different choices for characterizing the feature space and show that projecting intermediate representations onto the span of directions that obtain maximum training accuracy provides consistent improvements in OoD performance. Finally, we show that such nullspace behavior also provides an insight into neural networks trained on poisoned data. We hope our work galvanizes interest in the relationship between the nullspace occupancy failure mode and generalization.", Variational Classification,https://openreview.net/forum?id=MCe881WzBr0,https://openreview.net/pdf?id=MCe881WzBr0,We show how we can view a classifier as a latent variable model and impose class-conditional priors on this latent space that render the classifier more robust to OOD and adversarial data,"Classification tasks, ubiquitous across machine learning, are commonly tackled by a suitably designed neural network with a softmax output layer, mapping each data point to a categorical distribution over class labels. 
We extend this familiar model from a latent variable perspective to variational classification (VC), analogous to how the variational auto-encoder relates to its deterministic counterpart. We derive a training objective based on the ELBO together with an \textit{adversarial} approach for optimising it. Within this framework, we identify design choices made implicitly in off-the-shelf softmax functions and can instead include domain-specific assumptions, such as class-conditional latent priors. We demonstrate benefits of the VC model in image classification. We show, on several standard datasets, that treating inputs to the softmax layer as latent variables under a mixture-of-Gaussians prior improves several desirable aspects of a classifier, such as prediction accuracy, calibration, out-of-domain calibration and adversarial robustness.","Latent priors, classification" ContraNorm: A Contrastive Learning Perspective on Oversmoothing and Beyond,https://openreview.net/forum?id=SM7XkJouWHm,https://openreview.net/pdf?id=SM7XkJouWHm,,"Oversmoothing is a common phenomenon in a wide range of Graph Neural Networks (GNNs) and Transformers, where performance degrades as layers go deeper. Instead of characterizing oversmoothing from the view of complete collapse, in which representations converge to a single point, we dive into a more general perspective, dimensional collapse, in which representations lie in a narrow cone. Accordingly, inspired by the power of contrastive learning in preventing dimensional collapse, we propose a novel normalization layer, ContraNorm. Intuitively, ContraNorm implicitly shatters representations in the embedding space, leading to a more uniform distribution and milder dimensional collapse. In our theoretical analysis, we prove that ContraNorm can alleviate both complete collapse and dimensional collapse under some conditions. Our proposed normalization layer can be easily inserted into GNNs and Transformers with negligible parameter overhead. Experiments on various real-world datasets verify the effectiveness of our method.", Accelerated Single-Call Methods for Constrained Min-Max Optimization,https://openreview.net/forum?id=HRwN7IQLUKA,https://openreview.net/pdf?id=HRwN7IQLUKA,We propose the first single-call single-projection algorithms with optimal convergence rate for constrained min-max optimization problems in the nonconvex-nonconcave setting.,"We study first-order methods for constrained min-max optimization. Existing methods either require two gradient calls or two projections in each iteration, which may be costly in applications. In this paper, we first show that the Optimistic Gradient (OG) method, a single-call single-projection algorithm, has an $O(\frac{1}{\sqrt{T}})$ convergence rate for inclusion problems with operators that satisfy the weak Minty variational inequality (MVI). Our second result is the first single-call single-projection algorithm -- the Accelerated Reflected Gradient (ARG) method -- that achieves the optimal $O(\frac{1}{T})$ convergence rate for inclusion problems that satisfy negative comonotonicity. Both the weak MVI and negative comonotonicity are well-studied assumptions and capture a rich set of non-convex non-concave min-max optimization problems. 
Finally, we show that the Reflected Gradient (RG) method, another single-call single-projection algorithm, has an $O(\frac{1}{\sqrt{T}})$ last-iterate convergence rate for constrained convex-concave min-max optimization, answering an open problem of Hsieh et al. (2019).","min-max optimization, nonconvex-nonconcave, variational inequalities, saddle point problem, first-order method" Towards Interpretable Deep Reinforcement Learning with Human-Friendly Prototypes,https://openreview.net/forum?id=hWwY_Jq0xsN,https://openreview.net/pdf?id=hWwY_Jq0xsN,"An ""interpretable-by-design"" deep reinforcement learning agent is proposed which uses prototypes for decision making.","Despite the recent success of deep learning models in research settings, their application in sensitive domains remains limited because of their opaque decision-making processes. Rising to this challenge, researchers have proposed various eXplainable AI (XAI) techniques designed to calibrate trust and understandability of black-box models, with the vast majority of work focused on supervised learning. Here, we focus on making an ""interpretable-by-design"" deep reinforcement learning agent which is forced to use human-friendly prototypes in its decisions, thus making its reasoning process clear. Our proposed method, dubbed Prototype-Wrapper Network (PW-Net), wraps around any neural agent backbone, and results indicate that it does not worsen performance relative to black-box models. Most importantly, we found in a user study that PW-Nets supported better trust calibration and task performance relative to standard interpretability approaches and black-boxes. ","Interpretable Machine Learning, Deep Reinforcement Learning, Prototypes, User Study" Distributed Extra-gradient with Optimal Complexity and Communication Guarantees,https://openreview.net/forum?id=b3itJyarLM0,https://openreview.net/pdf?id=b3itJyarLM0,"We propose quantized generalized extra-gradient, which is an unbiased and adaptive compression method tailored to a generic unifying framework for solving variational inequality problems.","We consider monotone variational inequality (VI) problems in multi-GPU settings where multiple processors/workers/clients have access to local stochastic dual vectors. This setting includes a broad range of important problems from distributed convex minimization to min-max and games. Extra-gradient, which is a de facto algorithm for monotone VI problems, has not been designed to be communication-efficient. To this end, we propose a quantized generalized extra-gradient (Q-GenX), which is an unbiased and adaptive compression method tailored to solve VIs. We provide an adaptive step-size rule that adapts to the respective noise profiles at hand and achieves a fast rate of ${\cal O}(1/T)$ under relative noise and an order-optimal ${\cal O}(1/\sqrt{T})$ under absolute noise, and we show that distributed training accelerates convergence. Finally, we validate our theoretical results by providing real-world experiments and training generative adversarial networks on multiple GPUs. ","Unbiased Quantization, Variational Inequality, Extra-gradient, Adaptive Step-size" 'I pick you choose': Joint human-algorithm decision making in multi-armed bandits,https://openreview.net/forum?id=yLCCfzv_8Yx,https://openreview.net/pdf?id=yLCCfzv_8Yx,Analysis of joint human-algorithm multi-armed bandit systems: the human picks the final arms from a subset the algorithm selects. 
,"Online learning in multi-armed bandits has been a rich area of research for decades, resulting in numerous \enquote{no-regret} algorithms that efficiently learn the arm with highest expected reward. However, in many settings the final decision of which arm to pull isn't under the control of the algorithm itself. For example, a driving app typically suggests a subset of routes (arms) to the driver, who ultimately makes the final choice about which to select. Typically, the human also wishes to learn the optimal arm based on historical reward information, but decides which arm to pull based on a potentially different objective function, such as being more or less myopic about exploiting near-term rewards. In this paper, we show when this joint human-algorithm system can achieve good performance. Specifically, we explore multiple possible frameworks for human objectives and give theoretical regret bounds for regret. Finally, we include experimental results exploring how regret varies with the human decision-maker's objective, as well as the number of arms presented. ","human algorithm collaboration, multi-armed bandits, complementarity" UnDiMix: Hard Negative Sampling Strategies for Contrastive Representation Learning,https://openreview.net/forum?id=ZpzkcSqsdmX,https://openreview.net/pdf?id=ZpzkcSqsdmX,"We introduce UnDimix, a hard negative sampling strategy that takes into account anchor similarity, model uncertainty and representativeness","One of the challenges in contrastive learning is the selection of appropriate \textit{hard negative} examples, in the absence of label information. Random sampling or importance sampling methods based on feature similarity often lead to sub-optimal performance. In this work, we introduce \modelname, a hard negative sampling strategy that takes into account anchor similarity, model uncertainty and diversity. Experimental results on several benchmarks show that \modelname improves negative sample selection, and subsequently downstream performance when compared to state-of-the-art contrastive learning methods. Code is available at \textit{anon. link","Contrastive Learning, Self-Supervised Learning, Hard Negative Sampling" What Matters In The Structured Pruning of Generative Language Models?,https://openreview.net/forum?id=Yg7ExbCxzt6,https://openreview.net/pdf?id=Yg7ExbCxzt6,,"Auto-regressive large language models such as GPT-3 require enormous computational resources to use, leading to huge financial cost and environmental impact. Structured pruning methods traditionally reduce resource usage, however, their application to and efficacy for generative language models is heavily under-explored. We analyze the effects of magnitude, random, and movement (Lagunas et al., 2021) pruning on MLP layers in GPT-like models. We find that movement can under-perform for these models while random pruning nearly matches the best methods. By examining neuron-level redundancy measures, we discover that movement does not select neurons based on how unique they are compared to other neurons, leaving behind excess redundancy. In view of this, we introduce Globally Unique Movement (GUM) to select neurons based on both uniqueness and sensitivity. 
We then examine the effects of our techniques on different redundancy metrics through careful comparisons and ablations.","Neural Network Pruning, Natural Language Generation" MaxMin-Novelty: Maximizing Novelty via Minimizing the State-Action Values in Deep Reinforcement Learning,https://openreview.net/forum?id=bNozP02z7XO,https://openreview.net/pdf?id=bNozP02z7XO,,"Reinforcement learning research has accelerated rapidly since deep neural networks were first adopted as function approximators to learn policies that make sequential decisions in MDPs with high-dimensional state representations. While several consecutive barriers have been broken in deep reinforcement learning research (e.g., learning from high-dimensional states, learning purely via self-play), several others still stand. Along this line, the question of how to explore in high-dimensional, complex MDPs remains an understudied open problem. To address this, in our paper we propose a unique exploration technique based on maximization of novelty via minimization of the state-action value function (MaxMin Novelty). Our method is theoretically well motivated, and comes with zero computational cost while leading to significant sample efficiency gains in deep reinforcement learning training. We conduct extensive experiments in the Arcade Learning Environment with high-dimensional state representation MDPs. We show that our technique improves the human-normalized median scores in the Arcade Learning Environment by 248% in the low-data regime.", Complete Likelihood Objective for Latent Variable Models,https://openreview.net/forum?id=hO8qWILpJ3J,https://openreview.net/pdf?id=hO8qWILpJ3J,"Use samples from the prior to construct a family of informative target distributions, and use the complete likelihood both to choose the target from the family and to tune the model.","In this work, we propose an alternative to the Marginal Likelihood (MaL) objective for training latent variable models, Complete Latent Likelihood (CoLLike). We analyze the objectives from the perspective of matching joint distributions. We show that MaL corresponds to a particular $KL$ divergence between some target \emph{joint} distribution and the model joint. Furthermore, the properties of the target joint explain such major malfunctions of MaL as uninformative latents (posterior collapse) and high deviation of the aggregated posterior from the prior. In the CoLLike approach, we use a sample from the prior to construct a family of target joint distributions whose properties prevent these drawbacks. We utilize the complete likelihood both to choose the target from this family and to learn the model. We confirm our analysis by experiments with expressive low-dimensional latent variable models, which also indicate that it is possible to achieve high-accuracy unsupervised classification using the CoLLike objective.","Posterior Collapse, Latent Variable Models, Complete Likelihood, Empirical Distribution, Assignment Problem" The Surprising Effectiveness of Equivariant Models in Domains with Latent Symmetry,https://openreview.net/forum?id=P4MUGRM4Acu,https://openreview.net/pdf?id=P4MUGRM4Acu,This paper discovers that equivariant models are surprisingly effective in domains with latent or partial symmetries. ,"Extensive work has demonstrated that equivariant neural networks can significantly improve sample efficiency and generalization by enforcing an inductive bias in the network architecture. 
These applications typically assume that the domain symmetry is fully described by explicit transformations of the model inputs and outputs. However, many real-life applications contain only latent or partial symmetries which cannot be easily described by simple transformations of the input. In these cases, it is necessary to learn symmetry in the environment instead of imposing it mathematically on the network architecture. We discover, surprisingly, that imposing equivariance constraints that do not exactly match the domain symmetry is very helpful in learning the true symmetry in the environment. We differentiate between extrinsic and incorrect symmetry constraints and show that while imposing incorrect symmetry can impede the model's performance, imposing extrinsic symmetry can actually improve performance. We demonstrate that an equivariant model can significantly outperform non-equivariant methods on domains with latent symmetries both in supervised learning and in reinforcement learning for robotic manipulation and control problems.","Equivariant Learning, Reinforcement Learning, Robotics" Performance Bounds for Model and Policy Transfer in Hidden-parameter MDPs,https://openreview.net/forum?id=20gBzEzgtiI,https://openreview.net/pdf?id=20gBzEzgtiI,,"In the Hidden-Parameter MDP (HiP-MDP) framework, a family of reinforcement learning tasks is generated by varying hidden parameters specifying the dynamics and reward function for each individual task. HiP-MDP is a natural model for families of tasks in which meta- and lifelong-reinforcement learning approaches can succeed. Given a learned context encoder that infers the hidden parameters from previous experience, most existing algorithms fall into two categories: $\textit{model transfer}$ and $\textit{policy transfer}$, depending on which function the hidden parameters are used to parameterize. We characterize the robustness of model and policy transfer algorithms with respect to hidden parameter estimation error. We first show that the value function of HiP-MDPs is Lipschitz continuous under certain conditions. We then derive regret bounds for both settings through the lens of Lipschitz continuity. Finally, we empirically corroborate our theoretical analysis by experimentally varying the hyper-parameters governing the Lipschitz constants of two continuous control problems; the resulting performance is consistent with our predictions.","Reinforcement learning, Meta learning, Transfer learning, Theory" Parallel $Q$-Learning: Scaling Off-policy Reinforcement Learning,https://openreview.net/forum?id=vFvw8EzQNLy,https://openreview.net/pdf?id=vFvw8EzQNLy,We present a parallel training framework that scales up $Q$-learning algorithms on a single workstation and achieves faster learning speed than PPO.,"Reinforcement learning algorithms typically require large amounts of training data, resulting in long training times, especially on challenging tasks. With recent advances in GPU-based simulation, such as Isaac Gym, data collection speed has been improved thousands of times on a commodity GPU. Most prior works have been using on-policy methods such as PPO to train policies in Isaac Gym due to its simplicity and effectiveness in scaling up. Off-policy methods are usually more sample-efficient but more challenging to scale up, resulting in a much longer wall-clock training time in practice. 
In this work, we present a novel parallel $Q$-learning framework that not only gains better sample efficiency but also reduces the training wall-clock time compared to PPO. Different from prior works on distributed off-policy learning, such as Apex, our framework is designed specifically for massively parallel GPU-based simulation and optimized to work on a single workstation. We demonstrate the capability of scaling up $Q$-learning methods to tens of thousands of parallel environments. We also investigate various factors that can affect the training speed of policy learning, including the number of parallel environments, exploration schemes, batch size, GPU models, etc.","GPU-based simulation, off-policy learning, distributed training, reinforcement learning" Emergence of shared sensory-motor graphical language from visual input,https://openreview.net/forum?id=VTYvxbr5E-A,https://openreview.net/pdf?id=VTYvxbr5E-A,We use contrastive multimodal learning to train artificial agents to communicate via a sensory-motor system producing drawings. We then show that the emerging graphical language has compositional properties,"The framework of Language Games studies the emergence of languages in populations of agents. Recent contributions relying on deep learning methods have focused on agents communicating via an idealized communication channel, where utterances produced by a speaker are directly perceived by a listener. This comes in contrast with human communication, which instead relies on a sensory-motor channel, where motor commands produced by the speaker (e.g. vocal or gestural articulators) result in sensory effects perceived by the listener (e.g. audio or visual). Here, we investigate if agents can evolve a shared language when they are equipped with a continuous sensory-motor system to produce and perceive signs, e.g. drawings. To this end, we introduce the Graphical Referential Game (GREG), where a speaker must produce a graphical utterance to name a visual referent object consisting of combinations of MNIST digits while a listener has to select the corresponding object among distractor referents, given the produced message. The utterances are drawing images produced using dynamical motor primitives combined with a sketching library. To tackle GREG, we present CURVES: a multimodal contrastive deep learning mechanism that represents the energy (alignment) between named referents and utterances generated through gradient ascent on the learned energy landscape. We then present a set of experiments and metrics based on a systematic compositional dataset to evaluate the resulting language. We show that our method allows the emergence of a shared, graphical language with compositional properties.","Emergent Communication, Visual Communication, Sensory-motor communication, Contrastive Learning, Language Game" Compositional Task Generalization with Discovered Successor Feature Modules,https://openreview.net/forum?id=DrtSx1z40Ib,https://openreview.net/pdf?id=DrtSx1z40Ib,"A modular neural network for discovering, composing, and transferring predictive knowledge and behavior via Successor Features & Generalized Policy Improvement.","Recently, the Successor Features and Generalized Policy Improvement (SF&GPI) framework has been proposed as a method for learning, composing and transferring predictive knowledge and behavior. SF&GPI works by having an agent learn predictive representations (SFs) that can be combined for transfer to new tasks with GPI. 
However, to be effective, this approach requires state features that are useful to predict, and these state features are typically hand-designed. In this work, we present a novel neural network architecture, “Modular Successor Feature Approximators” (MSFA), where modules both discover what is useful to predict and learn their own predictive representations. We show that MSFA is able to better generalize compared to baseline architectures for learning SFs and a modular network that discovers factored state representations. ","deep reinforcement learning, successor features, generalization, compositional generalization" DexDeform: Dexterous Deformable Object Manipulation with Human Demonstrations and Differentiable Physics,https://openreview.net/forum?id=LIV7-_7pYPl,https://openreview.net/pdf?id=LIV7-_7pYPl,We investigate the problem of learning dexterous manipulation of deformable objects using multi-fingered hands.,"In this work, we aim to learn dexterous manipulation of deformable objects using multi-fingered hands. Reinforcement learning approaches for dexterous rigid object manipulation would struggle in this setting due to the complexity of physics interaction with deformable objects. At the same time, previous trajectory optimization approaches with differentiable physics for deformable manipulation would suffer from local optima caused by the explosion of contact modes from hand-object interactions. To address these challenges, we propose DexDeform, a principled framework that abstracts dexterous manipulation skills from human demonstration, and refines the learned skills with differentiable physics. Concretely, we first collect a small set of human demonstrations using teleoperation. We then train a skill model on these demonstrations to plan over action abstractions in imagination. To explore the goal space, we further apply augmentations to the existing deformable shapes in demonstrations and use a gradient optimizer to refine the actions planned by the skill model. Finally, we adopt the refined trajectories as new demonstrations for finetuning the skill model. To evaluate the effectiveness of our approach, we introduce a suite of six challenging dexterous deformable object manipulation tasks. Compared with baselines, DexDeform is able to better explore and generalize across novel goals unseen in the initial human demonstrations. Additional materials can be found at our project website: https://sites.google.com/view/dexdeform.","Deformable Object Manipulation, Dexterous Manipulation, Differentiable Physics" "NAG-GS: semi-implicit, accelerated and robust stochastic optimizer.",https://openreview.net/forum?id=Oh5nigv45PI,https://openreview.net/pdf?id=Oh5nigv45PI,,"Classical machine learning models such as deep neural networks are usually trained by using Stochastic Gradient Descent-based (SGD) algorithms. The classical SGD can be interpreted as a discretization of the stochastic gradient flow. In this paper, we propose a novel, robust and accelerated stochastic optimizer that relies on two key elements: (1) an accelerated Nesterov-like Stochastic Differential Equation (SDE) and (2) its semi-implicit Gauss-Seidel type discretization. The convergence and stability of the obtained method, referred to as NAG-GS, are first studied extensively in the case of the minimization of a quadratic function. This analysis allows us to derive an optimal step size (or learning rate) in terms of the rate of convergence while ensuring the stability of NAG-GS. 
This is achieved through a careful analysis of the spectral radius of the iteration matrix and the covariance matrix at stationarity with respect to all hyperparameters of our method. We show that NAG-GS is competitive with state-of-the-art methods such as momentum SGD with weight decay and AdamW for the training of machine learning models such as logistic regression, residual network models on standard computer vision datasets, and Transformers on the GLUE benchmark.","Accelerated gradient methods, stochastic optimization, stochastic differential equations, semi-implicit solver, convergence analysis, deep neural networks" Loop Unrolled Shallow Equilibrium Regularizer (LUSER) - A Memory-Efficient Inverse Problem Solver,https://openreview.net/forum?id=cfkKMKnqCzb,https://openreview.net/pdf?id=cfkKMKnqCzb,This paper proposes a memory-efficient loop-unrolled architecture for solving inverse problems.,"In inverse problems, we aim to reconstruct some underlying signal of interest from potentially corrupted and often ill-posed measurements. Classical optimization-based techniques proceed by optimizing a data consistency metric together with a regularizer. Current state-of-the-art machine learning approaches draw inspiration from such techniques by unrolling the iterative updates for an optimization-based solver and then learning a regularizer from data. This \emph{loop unrolling} (LU) method has shown tremendous success, but often requires a deep model for the best performance, leading to high memory costs during training. Thus, to address the balance between computation cost and network expressiveness, we propose an LU algorithm with shallow equilibrium regularizers (LUSER). These implicit models are as expressive as deeper convolutional networks, but far more memory-efficient during training. The proposed method is evaluated on image deblurring, computed tomography (CT), as well as single-coil Magnetic Resonance Imaging (MRI) tasks and shows similar, or even better, performance while requiring up to $8 \times$ fewer computational resources during training when compared against a more typical LU architecture with feedforward convolutional regularizers.","Deep Equilibrium Models, Inverse Problems, Deep Learning, Loop Unrolled" Robust Universal Adversarial Perturbations,https://openreview.net/forum?id=VpYBxaPLaj-,https://openreview.net/pdf?id=VpYBxaPLaj-,"This paper introduces the concept of Robust Universal Adversarial Perturbations and a new algorithm, RobustUAP, which can be used to generate UAPs robust under human-interpretable, real-world transformations, such as rotation, contrast changes, etc.","Universal Adversarial Perturbations (UAPs) are imperceptible, image-agnostic vectors that cause deep neural networks (DNNs) to misclassify inputs from a data distribution with high probability. In practical attack scenarios, adversarial perturbations may undergo transformations such as changes in pixel intensity, rotation, etc. while being added to DNN inputs. Existing methods do not create UAPs robust to these real-world transformations, thereby limiting their applicability in attack scenarios. In this work, we introduce and formulate robust UAPs. We build an iterative algorithm using probabilistic robustness bounds and transformations generated by composing arbitrary sub-differentiable transformation functions to construct such robust UAPs. 
We perform an extensive evaluation on the popular CIFAR-10 and ILSVRC 2012 datasets, measuring our UAPs' robustness under a wide range of common, real-world transformations such as rotation, contrast changes, etc. Our results show that our method can generate UAPs up to 23% more robust than existing state-of-the-art baselines. ","Adversarial Machine Learning, Trustworthy Machine Learning, Universal Adversarial Perturbation, Expectation over Transformation, Robustness, Adversarial Perturbation" Understanding the Complexity Gains of Contextual Multi-task RL with Curricula,https://openreview.net/forum?id=IW3vvB8uggX,https://openreview.net/pdf?id=IW3vvB8uggX,,"Reinforcement learning (RL) problems can be challenging without well-shaped rewards. Prior work on provably efficient RL methods generally proposes to address this issue with dedicated exploration strategies, such as novelty-based bonuses. However, another way to tackle this challenge is to reformulate it as a multi-task RL problem, where the task space contains not only the challenging task of interest but also easier tasks that implicitly function as a curriculum. Such a reformulation opens up the possibility of running existing multi-task RL methods as a more efficient alternative to solving a single challenging task from scratch. In this work, we provide a theoretical framework that reformulates a single-task RL problem as a multi-task RL problem defined by a curriculum. Under mild regularity conditions on the curriculum, we show that sequentially solving each task in the multi-task RL problem is more computationally efficient than solving the original single-task problem, without any explicit exploration bonuses or other exploration strategies. We also show that our theoretical insights can be translated into an effective practical learning algorithm that can accelerate curriculum learning on simulated robotic goal-reaching tasks.","policy gradient methods, multi-task RL" The Lie Derivative for Measuring Learned Equivariance,https://openreview.net/forum?id=JL7Va5Vy15J,https://openreview.net/pdf?id=JL7Va5Vy15J,,"Equivariance guarantees that a model's predictions capture key symmetries in data. When an image is translated or rotated, an equivariant model's representation of that image will translate or rotate accordingly. The success of convolutional neural networks has historically been tied to their ability to directly encode translation equivariance in their architecture. The rising success of vision transformers, which have no explicit architectural bias towards equivariance, challenges this narrative and suggests that augmentations and training data might also play a significant role in their performance. In order to better understand the role of equivariance in recent vision models, we introduce the Lie derivative, a method for measuring equivariance with strong mathematical foundations and minimal hyperparameters. Using the Lie derivative, we study the equivariance properties of hundreds of pretrained models, spanning CNNs, transformers, and Mixer architectures. The scale of our analysis allows us to separate the impact of architecture from other factors like model size or training method. 
Surprisingly, we find that many violations of equivariance can be linked to spatial aliasing in ubiquitous network layers, such as pointwise non-linearities, and that as models get larger and more accurate they tend to display more equivariance, regardless of architecture.", Effective passive membership inference attacks in federated learning against overparameterized models,https://openreview.net/forum?id=QsCSLPP55Ku,https://openreview.net/pdf?id=QsCSLPP55Ku,"The observation that gradients of large overparameterized neural networks that generalize well behave like high-dimensional independent isotropic random vectors leads to a new class of passive membership inference attacks in federated learning.","This work considers the challenge of performing membership inference attacks in a federated learning setting ---for image classification--- where an adversary can only observe the communication between the central node and a single client (a passive white-box attack). Passive attacks are one of the hardest-to-detect attacks, since they can be performed without modifying the behavior of the central server or its clients, and assume *no access to private data instances*. The key insight of our method is empirically observing that, near parameters that generalize well at test time, the gradients of large overparameterized neural network models statistically behave like high-dimensional independent isotropic random vectors. Using this insight, we devise two attacks that are often little impacted by existing and proposed defenses. Finally, we validate the hypothesis that our attack depends on the overparametrization by showing that increasing the level of overparametrization (without changing the neural network architecture) positively correlates with our attack effectiveness.","membership inference attack, federated learning, overparameterization, neural networks, image classification" Optimizing Bi-Encoder for Named Entity Recognition via Contrastive Learning,https://openreview.net/forum?id=9EAQVEINuum,https://openreview.net/pdf?id=9EAQVEINuum,,"We present a bi-encoder framework for named entity recognition (NER), which applies contrastive learning to map candidate text spans and entity types into the same vector representation space. Prior work predominantly approaches NER as sequence labeling or span classification. We instead frame NER as a representation learning problem that maximizes the similarity between the vector representations of an entity mention and its type. This makes it easy to handle nested and flat NER alike, and can better leverage noisy self-supervision signals. A major challenge to this bi-encoder formulation for NER lies in separating non-entity spans from entity mentions. Instead of explicitly labeling all non-entity spans as the same class $\texttt{Outside}$ ($\texttt{O}$) as in most prior methods, we introduce a novel dynamic thresholding loss, learned in conjunction with the standard contrastive loss. 
Experiments show that our method performs well in both supervised and distantly supervised settings, for nested and flat NER alike, establishing new state of the art across standard datasets in the general domain (e.g., ACE2004, ACE2005) and high-value verticals such as biomedicine (e.g., GENIA, NCBI, BC5CDR, JNLPBA).","Named Entity Recognition, NER, Bi-Encoder, Contrastive Learning" Handling Covariate Shifts in Federated Learning with Generalization Guarantees,https://openreview.net/forum?id=Ho7W1yr8tV,https://openreview.net/pdf?id=Ho7W1yr8tV,We optimize a global FL model focusing on the overall generalization performance under both intra-client and inter-client covariate shifts.,"Covariate shift across clients is a major challenge for federated learning (FL). This work studies the generalization properties of FL under intra-client and inter-client covariate shifts. To this end, we propose Federated Importance-weighteD Empirical risk Minimization (FIDEM) to optimize a global FL model, along with new variants of density ratio matching methods, aiming to handle covariate shifts. These methods trade off some level of privacy for improving the overall generalization performance. We theoretically show that FIDEM achieves smaller generalization error than classical empirical risk minimization under certain settings. Experimental results demonstrate the superiority of FIDEM over federated averaging (McMahan et al., 2017) and other baselines, which would open the door to studying FL under distribution shifts more systematically. ","Federated Learning, Generalization, Covariate Shift, Importance Weighting" Agree to Disagree: Diversity through Disagreement for Better Transferability,https://openreview.net/forum?id=K7CbYQbyYhY,https://openreview.net/pdf?id=K7CbYQbyYhY,,"Gradient-based learning algorithms have an implicit \emph{simplicity bias} which in effect can limit the diversity of predictors being sampled by the learning procedure. This behavior can hinder the transferability of trained models by (i) favoring the learning of simpler but spurious features --- present in the training data but absent from the test data --- and (ii) only leveraging a small subset of predictive features. Such an effect is especially magnified when the test distribution does not exactly match the train distribution---referred to as the Out of Distribution (OOD) generalization problem. However, given only the training data, it is not always possible to a priori assess if a given feature is spurious or transferable. Instead, we advocate for learning an ensemble of models which capture a diverse set of predictive features. Towards this, we propose a new algorithm D-BAT (Diversity-By-disAgreement Training), which enforces agreement among the models on the training data, but disagreement on the OOD data. We show how D-BAT naturally emerges from the notion of generalized discrepancy, as well as demonstrate in multiple experiments how the proposed method can mitigate shortcut-learning, enhance uncertainty and OOD detection, as well as improve transferability.","OOD generalization, Diversity, Ensemble" Expected Probabilistic Hierarchies,https://openreview.net/forum?id=dPOLZ2u4SKV,https://openreview.net/pdf?id=dPOLZ2u4SKV,"A probabilistic model that learns hierarchies in data using gradient-descent-based optimizers, outperforming several baselines.","Hierarchical clustering has usually been addressed by discrete optimization using heuristics or continuous optimization of relaxed scores for hierarchies. 
In this work, we propose to optimize expected scores under a probabilistic model over hierarchies. (1) We show theoretically that the global optima of the expected Dasgupta cost and Tree-Sampling divergence (TSD), two unsupervised metrics for hierarchical clustering, are equal to the optima of their discrete counterparts, in contrast to some relaxed scores. (2) We propose Expected Probabilistic Hierarchies (EPH), a probabilistic model to learn hierarchies in data by optimizing expected scores. EPH uses differentiable hierarchy sampling, enabling end-to-end gradient-descent-based optimization, and an unbiased subgraph sampling approach to scale to large datasets. (3) We evaluate EPH on synthetic and real-world datasets, including vector and graph datasets. EPH outperforms all other approaches in quantitative results and provides meaningful hierarchies in qualitative evaluations.","Hierarchical Clustering, Graph Clustering, Clustering, Probabilistic Models" Taking a Step Back with KCal: Multi-Class Kernel-Based Calibration for Deep Neural Networks,https://openreview.net/forum?id=p_jIy5QFB7,https://openreview.net/pdf?id=p_jIy5QFB7,Provable full calibration for neural network classifiers using kernel density estimation.,"Deep neural network (DNN) classifiers are often overconfident, producing miscalibrated class probabilities. In high-risk applications like healthcare, practitioners require fully calibrated probability predictions for decision-making. That is, conditioned on the prediction vector, every class’ probability should be close to the predicted value. Most existing calibration methods either lack theoretical guarantees for producing calibrated outputs, reduce classification accuracy in the process, or only calibrate the predicted class. This paper proposes a new Kernel-based calibration method called KCal. Unlike existing calibration procedures, KCal does not operate directly on the logits or softmax outputs of the DNN. Instead, KCal learns a metric space on the penultimate-layer latent embedding and generates predictions using kernel density estimates on a calibration set. We first analyze KCal theoretically, showing that it enjoys a provable full calibration guarantee. Then, through extensive experiments across a variety of datasets, we show that KCal consistently outperforms baselines as measured by the calibration error and by proper scoring rules like the Brier Score.","Calibration, Kernel Density Estimation, Neural Networks, Healthcare, Classification" A distinct unsupervised reference model from the environment helps continual learning,https://openreview.net/forum?id=GK5m7a3Uy4,https://openreview.net/pdf?id=GK5m7a3Uy4,"In this paper, we introduce open-set semi-supervised continual learning as a realistic, practical scenario and propose a novel dual-structured method to perform in this scenario.","Existing continual learning methods are mainly focused on fully-supervised scenarios and are still not able to take advantage of unlabeled data available in the environment. Some recent works have investigated semi-supervised continual learning (SSCL) settings in which unlabeled data are available, but only from the same distribution as the labeled data. This assumption is still not general enough for real-world applications and restricts the utilization of unsupervised data. 
In this work, we introduce Open-Set Semi-Supervised Continual Learning (OSSCL), a more realistic semi-supervised continual learning setting in which out-of-distribution (OoD) unlabeled samples in the environment are assumed to coexist with the in-distribution ones. Under this configuration, we present a model with two distinct parts: (i) the reference network captures general-purpose and task-agnostic knowledge in the environment by using a broad spectrum of unlabeled samples, and (ii) the learner network is designed to learn task-specific representations by exploiting supervised samples. The reference model both provides a pivotal representation space and segregates unlabeled data to exploit them more efficiently. By performing a diverse range of experiments, we show the superior performance of our model compared with other competitors and demonstrate the effectiveness of each component of the proposed model.","continual Learning, open-set semi-supervised continual learning, knowledge distillation" The Crossword Puzzle: Simplifying Deep Neural Network Pruning with Fabulous Coordinates,https://openreview.net/forum?id=lYP6zG2I1i,https://openreview.net/pdf?id=lYP6zG2I1i,"Fabulous coordinates can make pruning simpler, more efficient, and more effective.","Pruning is a promising technique to shrink the size of Deep Neural Network models with only negligible accuracy loss. Recent efforts rely on experience-derived metrics to guide the pruning procedure, which severely hampers the generalization of pruning methods. We propose The Crossword Puzzle, a new method to simplify this procedure by automatically deriving pruning metrics. The key insight behind our method is that \textit{for Deep Neural Network models, a pruning-friendly distribution of the model's weights can be obtained, given a proper coordinate}. We experimentally confirm the above insight, and denote the new coordinate as the Fabulous Coordinates. Our quantitative evaluation results show that the Crossword Puzzle can find a simple yet effective metric, which outperforms the state-of-the-art pruning methods by delivering no accuracy degradation on ResNet-56 (CIFAR-10)/-101 (ImageNet), while the pruning rate is raised to 70\%/50\% for the respective models.",Pruning Learning the Visualness of Text Using Large Vision-Language Models,https://openreview.net/forum?id=djfoLX57p9L,https://openreview.net/pdf?id=djfoLX57p9L,"We propose the task of predicting sentence visualness, curate a human-annotated dataset, and develop a fine-tuning strategy to predict sentence visualness using large vision-language models. ","Visual text evokes an image in a person's mind, while non-visual text fails to do so. A method to automatically detect visual text will unlock the ability to augment text with relevant images, as neural text-to-image generation and retrieval models operate on the implicit assumption that the input text is visual in nature. We curate a dataset of 3,620 English sentences and their visualness scores provided by multiple human annotators. Additionally, we use documents that contain text and visual assets to create a distantly supervised corpus of document text and associated images. We also propose a fine-tuning strategy that adapts large vision-language models like CLIP that assume a one-to-one correspondence between text and image to the task of scoring text visualness from text input alone. 
Our strategy involves modifying the model's contrastive learning objective to map text identified as non-visual to a common NULL image while matching visual text to its corresponding images in the document. We evaluate the proposed approach on its ability to (i) classify visual and non-visual text accurately, and (ii) attend over words that are identified as visual in psycholinguistic studies. Empirical evaluation indicates that our approach performs better than several heuristics and baseline models for the proposed task. Furthermore, to highlight the importance of modeling the visualness of text, we conduct qualitative analyses of text-to-image generation systems like DALL-E. We release the curated dataset and code.","text visualness, vision-language models, multimodal learning, natural language processing, deep learning" SemPPL: Predicting Pseudo-Labels for Better Contrastive Representations,https://openreview.net/forum?id=TAVBJ4aHsWt,https://openreview.net/pdf?id=TAVBJ4aHsWt,,"Learning from large amounts of unsupervised data and a small amount of supervision is an important open problem in computer vision. We propose a new semi-supervised learning method, Semantic Positives via Pseudo-Labels (SEMPPL), that combines labelled and unlabelled data to learn informative representations. Our method extends self-supervised contrastive learning—where representations are shaped by distinguishing whether two samples represent the same underlying datum (positives) or not (negatives)—with a novel approach to selecting positives. To enrich the set of positives, we leverage the few existing ground-truth labels to predict the missing ones through a k-nearest neighbors classifier by using the learned embeddings of the labelled data. We thus extend the set of positives with datapoints having the same pseudo-label and call these semantic positives. We jointly learn the representation and predict bootstrapped pseudo-labels. This creates a reinforcing cycle. Strong initial representations enable better pseudo-label predictions, which then improve the selection of semantic positives and lead to even better representations. SEMPPL outperforms competing semi-supervised methods, setting new state-of-the-art performance of 76% and 68.5% top-1 accuracy when using a ResNet-50 and training on 10% and 1% of labels on ImageNet, respectively. Furthermore, when using selective kernels, SEMPPL significantly outperforms the previous state-of-the-art, achieving 72.3% and 78.3% top-1 accuracy on ImageNet with 1% and 10% labels, respectively, an absolute improvement of +7.8% and +6.2% over previous work. SEMPPL also exhibits state-of-the-art performance over larger ResNet models as well as strong robustness, out-of-distribution and transfer performance.","contrastive learning, representation learning, semi-supervised learning" Differentially Private Adaptive Optimization with Delayed Preconditioners,https://openreview.net/forum?id=j1zQGmQQOX1,https://openreview.net/pdf?id=j1zQGmQQOX1,"We propose a private adaptive optimization framework that constructs delayed but less noisy preconditioners, yielding improved privacy/utility trade-offs without the need to access auxiliary data.","Privacy costs may negate the benefits of using adaptive optimizers in differentially private model training. Prior works typically address this issue by using auxiliary information (e.g., public data) to boost the effectiveness of adaptive optimization. 
In this work, we explore techniques to estimate and efficiently adapt to gradient geometry in private adaptive optimization without auxiliary data. Motivated by the observation that adaptive methods can tolerate stale preconditioners, we propose differentially private adaptive training with delayed preconditioners (DP^2), a simple method that constructs delayed but less noisy preconditioners to better realize the benefits of adaptivity. Theoretically, we provide convergence guarantees for our method for both convex and non-convex problems, and analyze trade-offs between delay and privacy noise reduction. Empirically, we explore DP^2 across several real-world datasets, demonstrating that it can improve convergence speed by as much as 4× relative to non-adaptive baselines and match the performance of state-of-the-art optimization methods that require auxiliary data.","adaptive optimization, differential privacy" Towards a Mathematics Formalisation Assistant using Large Language Models,https://openreview.net/forum?id=pKu077C57fH,https://openreview.net/pdf?id=pKu077C57fH,Large language models have the potential to be useful for mathematics formalisation,"Mathematics formalisation is the task of translating mathematics (i.e., definitions, theorem statements, proofs) written in natural language, as found in books and papers, into a formal language that can then be checked for correctness by a program. It is a thriving activity today; however, formalisation remains cumbersome. In this paper, we explore the abilities of a large language model (Codex) to help with formalisation in the Lean theorem prover. We find that with careful input-dependent prompt selection and postprocessing, Codex is able to formalise short mathematical statements at undergrad level with nearly 75\% accuracy for $120$ theorem statements. For proofs, quantitative analysis is infeasible, and we instead undertake a detailed case study. We choose a diverse set of $13$ theorems at undergrad level with proofs that fit in two to three paragraphs. We show that, with a new prompting strategy, Codex can formalise these proofs from natural language, with at least one out of twelve Codex completions being easy to repair into a complete proof. This is surprising as essentially no aligned data exists for formalised mathematics, particularly for proofs. These results suggest that large language models are a promising avenue towards fully or partially automating formalisation. ","Large Language Models, Mathematics Formalisation" Learning Robust Representations via Nuisance-extended Information Bottleneck,https://openreview.net/forum?id=G2AA1eB1vVE,https://openreview.net/pdf?id=G2AA1eB1vVE,We propose to model the nuisance of information bottleneck for out-of-distribution generalization.,"The information bottleneck (IB) is a principled approach to obtain a succinct representation $\mathbf{x} \rightarrow \mathbf{z}$ for a given downstream task $\mathbf{x} \rightarrow \mathbf{y}$: namely, it finds $\mathbf{z}$ that (a) maximizes the (task-relevant) mutual information $I(\mathbf{z}; \mathbf{y})$, while (b) minimizing $I(\mathbf{x}; \mathbf{z})$ to constrain the capacity of $\mathbf{z}$ for better generalization. In practical scenarios where the training data is limited, however, many predictive-yet-compressible signals in the data can rather stem from biases in data acquisition (i.e., be less generalizable), so that even the IB objective cannot prevent $\mathbf{z}$ from co-adapting on such (so-called) ""shortcut"" signals. 
To bypass such a failure mode, we consider an adversarial threat model of $\mathbf{x}$ under constraint on the mutual information $I(\mathbf{x}; \mathbf{y})$. This motivates us to extend IB to additionally model the nuisance information against $\mathbf{z}$, namely $\mathbf{z}_n$, so that $(\mathbf{z}, \mathbf{z}_n)$ can reconstruct $\mathbf{x}$. To enable the idea, we propose an auto-encoder based training upon the variational IB framework, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training considering both convolutional- and Transformer-based architectures. Our experimental results show that the proposed scheme improves robustness of learned representations (remarkably without using any domain-specific knowledge), with respect to multiple challenging modern security measures including novelty detection, corruption (or natural) robustness and certified adversarial robustness.","out-of-distribution robustness, information bottleneck, representation learning, autoencoder" FedLite: Improving Communication Efficiency in Federated Split Learning,https://openreview.net/forum?id=VO-HUrkHSY,https://openreview.net/pdf?id=VO-HUrkHSY,,"In classical federated learning, clients contribute to the overall training by communicating local updates for the underlying model on their private data to a coordinating server. However, updating and communicating the entire model becomes prohibitively expensive when resource-constrained clients collectively aim to train a large machine learning model. Split learning provides a natural solution in such a setting, where only a (small) part of the model is stored and trained on clients while the remaining (large) part of the model only stays at the servers. Unfortunately, the model partitioning employed in split learning significantly increases the communication cost compared to the classical federated learning algorithms. This paper addresses this issue by compressing the additional communication cost associated with split learning via a novel clustering algorithm and a gradient correction technique. An extensive empirical evaluation on standard image and text benchmarks shows that the proposed method can achieve up to 490x communication cost reduction with minimal drop in accuracy, and enables a desirable performance vs. communication trade-off.", Adversarial Policies Beat Professional-Level Go AIs,https://openreview.net/forum?id=Kyz1SaAcnd,https://openreview.net/pdf?id=Kyz1SaAcnd,,"We attack the state-of-the-art Go-playing AI system, KataGo, by training an adversarial policy that plays against a frozen KataGo victim. Our attack achieves a >99% win-rate against KataGo without search, and a >80% win-rate when KataGo uses enough search to be near-superhuman. To the best of our knowledge, this is the first successful end-to-end attack against a Go AI playing at the level of a top human professional. Notably, the adversary does not win by learning to play Go better than KataGo---in fact, the adversary is easily beaten by human amateurs. Instead, the adversary wins by tricking KataGo into ending the game prematurely at a point that is favorable to the adversary. 
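An illustrative aside on the FedLite entry above: the extra communication in split learning comes from shipping cut-layer activations, and the compression can be pictured as plain k-means quantization of those feature vectors. The paper pairs its clustering scheme with a gradient correction that we omit; everything below is a toy sketch, not the authors' algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

def compress_activations(acts, n_clusters=16):
    """Quantize a batch of client-side activations by k-means: only the
    centroid table plus one small index per row crosses the network."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(acts)
    return km.cluster_centers_, km.labels_

def decompress(centers, labels):
    return centers[labels]

acts = np.random.randn(256, 64).astype(np.float32)   # cut-layer activations
centers, labels = compress_activations(acts)
recon = decompress(centers, labels)
# 256*64 floats reduced to 16*64 floats plus 256 small integers.
print(np.mean((acts - recon) ** 2))
```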
Our results demonstrate that even professional-level AI systems may harbor surprising failure modes.", Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions,https://openreview.net/forum?id=vOEXS39nOF,https://openreview.net/pdf?id=vOEXS39nOF,"Generating videos from text, with prompts that can change over time, and videos that can be as long as multiple minutes.","We present Phenaki, a model capable of realistic video synthesis given a sequence of textual prompts. Generating videos from text is particularly challenging due to the computational cost, limited quantities of high-quality text-video data, and the variable length of videos. To address these issues, we introduce a new causal model for learning video representations, which compresses the video into a small representation of discrete tokens. This tokenizer is auto-regressive in time, which allows it to work with video representations of different lengths. To generate video tokens from text we use a bidirectional masked transformer conditioned on pre-computed text tokens. The generated video tokens are subsequently de-tokenized to create the actual video. To address data issues, we demonstrate how joint training on a large corpus of image-text pairs as well as a smaller number of video-text examples can result in generalization beyond what is available in the video datasets. Compared to previous video generation methods, Phenaki can generate arbitrarily long videos conditioned on a sequence of prompts (i.e., time-variable text or story) in open domain. To the best of our knowledge, this is the first time a paper studies generating videos from time-variable prompts.","generative models, video generation, video prediction, text to video" Long Range Language Modeling via Gated State Spaces,https://openreview.net/forum?id=5MkYIYCbva,https://openreview.net/pdf?id=5MkYIYCbva,Explore and improve state space model family on long range language modeling tasks,"State space models have been shown to be effective at modeling long range dependencies, especially on sequence classification tasks. In this work, we focus on autoregressive sequence modeling over English books, Github source code and ArXiv mathematics articles. Based on recent developments around the effectiveness of gated activation functions, we propose a new layer named \textit{Gated State Space} (GSS) and show that it trains significantly faster than the diagonal version of S4 (i.e. DSS) on TPUs, is fairly competitive with several well-tuned Transformer-based baselines and exhibits zero-shot generalization to longer inputs while being straightforward to implement. Finally, we show that leveraging self-attention to model local dependencies improves the performance of GSS even further.","Long range language modeling, language modeling, state space models" On the (Non-)Robustness of Two-Layer Neural Networks in Different Learning Regimes,https://openreview.net/forum?id=guSxooOK9E,https://openreview.net/pdf?id=guSxooOK9E,"Tradeoffs between test-error and robustness for 2-layer neural networks in different learning regimes (RF, lazy training, SGD)","Neural networks are known to be highly sensitive to adversarial examples. These may arise due to different factors, such as random initialization, or spurious correlations in the learning problem.
To better understand these factors, we provide a precise study of adversarial robustness in different scenarios, from initialization to the end of training in different regimes, as well as intermediate scenarios where initialization still plays a role due to “lazy” training. We consider over-parameterized networks in high dimensions with quadratic targets and infinite samples. Our analysis allows us to identify new tradeoffs between approximation (as measured via test error) and robustness, whereby robustness can only get worse when test error improves, and vice versa. We also show how linearized lazy training regimes can worsen robustness, due to improperly scaled random initialization. Our theoretical results are illustrated with numerical experiments.","robustness, adversarial robustness, over-parametrization, lazy training, parent-student, regression" Task-customized Masked Autoencoder via Mixture of Cluster-conditional Experts,https://openreview.net/forum?id=j8IiQUM33s,https://openreview.net/pdf?id=j8IiQUM33s,,"Masked Autoencoder (MAE) is a prevailing self-supervised learning method that achieves promising results in model pre-training. However, when the various downstream tasks have data distributions different from the pre-training data, the semantically irrelevant pre-training information might result in negative transfer, impeding MAE's scalability. To address this issue, we propose a novel MAE-based pre-training paradigm, named Mixture of Cluster-conditional Experts (MoCE), which can be trained once but provide customized pre-training models for diverse downstream tasks. Different from the mixture of experts (MoE), our MoCE trains each expert only with semantically relevant images by using cluster-conditional gates, which can automatically form the hierarchical cluster structure of the pre-training data. Thus, each downstream task can be allocated to its customized model pre-trained with data most similar to the downstream data. Experimental results show that our MoCE achieves a 3.49\% average performance improvement compared to the vanilla MAE on a collection of 11 different downstream tasks.", A Deep Dive into Dataset Imbalance and Bias in Face Identification,https://openreview.net/forum?id=gOoONbY02OUz,https://openreview.net/pdf?id=gOoONbY02OUz,"In this work, we explore the effects of different kinds of data imbalance on bias in the face identification problem. ","As the deployment of automated face recognition (FR) systems proliferates, bias in these systems is not just an academic question, but a matter of public concern. Media portrayals often center imbalance as the main source of bias, i.e., that FR models perform worse on images of non-white people or women because these demographic groups are underrepresented in training data. Recent academic research paints a more nuanced picture of this relationship. However, previous studies of data imbalance in FR have focused exclusively on the face verification setting, while the face identification setting has been largely ignored, despite being deployed in sensitive applications such as law enforcement. This is an unfortunate omission, as 'imbalance' is a more complex matter in identification; imbalance may arise in not only the training data, but also the testing data, and furthermore may affect the proportion of identities belonging to each demographic group or the number of images belonging to each identity.
In this work, we address this gap in the research by thoroughly exploring the effects of each kind of imbalance possible in face identification, and discussing other factors that may impact bias in this setting.","face recognition, dataset imbalance, bias, fairness" Causally Constrained Data Synthesis For Private Data Release,https://openreview.net/forum?id=RIJM-pJF_3K,https://openreview.net/pdf?id=RIJM-pJF_3K,,"Data privacy is critical in many decision-making contexts, such as healthcare and finance. A common mechanism is to create differentially private synthetic data using generative models. Such data generation reflects certain statistical properties of the original data, but often has an unacceptable privacy vs. utility trade-off. Since natural data inherently exhibits causal structure, we propose incorporating \emph{causal information} into the training process to favorably navigate the aforementioned trade-off. Under certain assumptions for linear Gaussian models and a broader class of models, we theoretically prove that causally informed generative models provide better differential privacy guarantees than their non-causal counterparts. We evaluate our proposal using variational autoencoders, and demonstrate that the trade-off is mitigated through better utility for comparable privacy.", Modeling the Data-Generating Process is Necessary for Out-of-Distribution Generalization,https://openreview.net/forum?id=uyqks-LILZX,https://openreview.net/pdf?id=uyqks-LILZX,,"Recent empirical studies on domain generalization (DG) have shown that DG algorithms that perform well on some distribution shifts fail on others, and no state-of-the-art DG algorithm performs consistently well on all shifts. Moreover, real-world data often has multiple distribution shifts over different attributes; hence we introduce multi-attribute distribution shift datasets and find that the accuracy of existing DG algorithms falls even further. To explain these results, we provide a formal characterization of generalization under multi-attribute shifts using a canonical causal graph. Based on the relationship between spurious attributes and the classification label, we obtain realizations of the canonical causal graph that characterize common distribution shifts and show that each shift entails different independence constraints over observed variables. As a result, we prove that any algorithm based on a single, fixed constraint cannot work well across all shifts, providing theoretical evidence for mixed empirical results on DG algorithms. Based on this insight, we develop Causally Adaptive Constraint Minimization (CACM), an algorithm that uses knowledge about the data-generating process to adaptively identify and apply the correct independence constraints for regularization. Results on fully synthetic, MNIST, small NORB, and Waterbirds datasets, covering binary and multi-valued attributes and labels, show that adaptive dataset-dependent constraints lead to the highest accuracy on unseen domains whereas incorrect constraints fail to do so. Our results demonstrate the importance of modeling the causal relationships inherent in the data-generating process.", Bayes-MIL: A New Probabilistic Perspective on Attention-based Multiple Instance Learning for Whole Slide Images,https://openreview.net/forum?id=_geIwiOyUhZ,https://openreview.net/pdf?id=_geIwiOyUhZ,Bayesian modeling of multiple instance learning addresses the untrustworthy and unsatisfactory interpretability problem of related methods.
,"Multiple instance learning (MIL) is a popular weakly-supervised learning model on the whole slide image (WSI) for AI-assisted pathology diagnosis. The recent advance in attention-based MIL allows the model to find its region-of-interest (ROI) for interpretation by learning the attention weights for image patches of WSI slides. However, we empirically find that the interpretability of some related methods is either untrustworthy as the principle of MIL is violated or unsatisfactory as the high-attention regions are not consistent with experts' annotations. In this paper, we propose Bayes-MIL to address the problem from a probabilistic perspective. The induced patch-level uncertainty is proposed as a new measure of MIL interpretability, which outperforms previous methods in matching doctors annotations. We design a slide-dependent patch regularizer (SDPR) for the attention, imposing constraints derived from the MIL assumption, on the attention distribution. SDPR explicitly constrains the model to generate correct attention values. The spatial information is further encoded by an approximate convolutional conditional random field (CRF), for better interpretability. Experimental results show Bayes-MIL outperforms the related methods in patch-level and slide-level metrics and provides much better interpretable ROI on several large-scale WSI datasets. ","Multiple instance learning, medical imaging, histopathology, Bayesian neural network" Exploring Connections Between Memorization And Membership Inference,https://openreview.net/forum?id=EGIvMUk5duH,https://openreview.net/pdf?id=EGIvMUk5duH,,"Membership inference (MI) allows privacy adversaries to query trained machine learning models to infer if a particular data sample was used in model training. Prior work has shown that the efficacy of MI is not the same for every sample in the training dataset; they broadly attribute this behavior to various data properties such as distributional difference. However, systematically analyzing the reasons for such disparate behavior has received little attention. In this work, we investigate the cause for such a discrepancy, and observe that the reason is more subtle and fundamental. We first provide empirical insight that an MI adversary is very successful with those samples that are highly $\textit{likely to be memorized}$, irrespective of whether the sample is from the same or a different distribution. Next, we provide a game-based formulation which lower-bounds the advantage of an adversary with the ability to determine if a sample is memorized or not, under certain assumptions made about the efficacy of the model on the memorized samples. Finally, based on our theoretical results, we present a practical instantiation of a highly effective MI attack on memorized samples.", Action Matching: A Variational Method for Learning Stochastic Dynamics from Samples,https://openreview.net/forum?id=T6HPzkhaKeS,https://openreview.net/pdf?id=T6HPzkhaKeS,We propose Action Matching for modeling stochastic dynamics by learning an underlying mechanism to move samples.,"Stochastic dynamics are ubiquitous in many fields of science, from the evolution of quantum systems in physics to diffusion-based models in machine learning. Existing methods such as score matching can be used to simulate these physical processes by assuming that the dynamics is a diffusion, which is not always the case. In this work, we propose a method called ""Action Matching"" that enables us to learn a much broader family of stochastic dynamics. 
Our method requires access only to samples from different time-steps, makes no explicit assumptions about the underlying dynamics, and can be applied even when samples are uncorrelated (i.e., are not part of a trajectory). Action Matching directly learns an underlying mechanism to move samples in time without modeling the distributions at each time-step. In this work, we showcase how Action Matching can be used for several computer vision tasks such as generative modeling, super-resolution, colorization, and inpainting, and we further discuss potential applications in other areas of science.", Pre-train Graph Neural Networks for Brain Network Analysis,https://openreview.net/forum?id=I1Mdyc2Bg5x,https://openreview.net/pdf?id=I1Mdyc2Bg5x,,"Human brains, controlling behaviors and cognition, are at the center of complex neurobiological systems. Recent studies in neuroscience and neuroimaging analysis have reached a consensus that interactions among brain regions of interest (ROIs) are driving factors for neural development and disorders. Graph neural networks, as powerful tools for analyzing graph-structured data, are naturally applied to the analysis of brain networks. However, training of deep learning models including GNNs often requires a significant amount of labeled data. Due to the complicated data acquisition process and restrictions on data sharing, brain network datasets are still small compared to other domains (e.g., molecules, proteins). Moreover, real clinical tasks (e.g., mental disorder analysis) are often conducted on local datasets at even smaller scales and with larger noise. To this end, we propose to leverage pre-training to capture the intrinsic brain network structures regardless of specific clinical outcomes. Specifically, we characterize the contributions in this work from two perspectives: (1) We design brain-network-oriented unsupervised pre-training techniques to utilize large-scale brain imaging studies without highly relevant task labels. (2) To facilitate effective knowledge transfer across studies with different ROI systems, we propose to develop a data-driven parcellation atlas mapping pipeline. The proposed pre-training techniques are validated with various GNN models. Extensive experiments demonstrate consistent improvement in performance as well as robustness.", Investigating Multi-task Pretraining and Generalization in Reinforcement Learning,https://openreview.net/forum?id=sSt9fROSZRO,https://openreview.net/pdf?id=sSt9fROSZRO,"Multi-task training and generalization on Atari game variants, showing benefits from fine-tuning over zero shot and scaling data size and model capacity.","Deep reinforcement learning (RL) has achieved remarkable successes in complex single-task settings. However, learning policies that can perform multiple tasks and leverage prior experience to learn faster remains challenging. Despite previous attempts to improve on these areas, our understanding of multi-task training and generalization in reinforcement learning remains limited. In this work we propose to investigate the generalization capabilities of a popular actor-critic method, IMPALA. We build on previous work that has advocated for the use of modes and difficulties of Atari 2600 games as a benchmark for transfer learning in reinforcement learning. We do so by pretraining an agent on multiple flavours of the same game before finetuning on the remaining unseen ones.
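Returning to the Action Matching entry above: our reading of its variational objective is to learn a scalar action s(x, t) whose spatial gradient moves samples forward in time, trained only from marginal samples. Treat the loss below as a sketch of the reported form, not a verified implementation:

```python
import torch
import torch.nn as nn

class ActionNet(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim + 1, 64), nn.Tanh(), nn.Linear(64, 1))
    def forward(self, x, t):
        return self.f(torch.cat([x, t], dim=1)).squeeze(1)

def action_matching_loss(s_net, x0, x1, xt, t):
    """Schematic variational objective (our reading):
       E_{q_0}[s(x,0)] - E_{q_1}[s(x,1)]
       + E_{t,q_t}[ 0.5*||grad_x s(x,t)||^2 + ds/dt ].
    Only uncorrelated samples from the marginals are needed."""
    term0 = s_net(x0, torch.zeros(len(x0), 1)).mean()
    term1 = s_net(x1, torch.ones(len(x1), 1)).mean()
    xt, t = xt.requires_grad_(True), t.requires_grad_(True)
    s = s_net(xt, t)
    gx, gt = torch.autograd.grad(s.sum(), (xt, t), create_graph=True)
    return term0 - term1 + (0.5 * gx.pow(2).sum(1) + gt.squeeze(1)).mean()

net, d = ActionNet(2), 2
x0, x1 = torch.randn(32, d), torch.randn(32, d) + 2.0
t = torch.rand(32, 1)
xt = (1 - t) * x0 + t * x1    # stand-in samples from intermediate marginals
print(action_matching_loss(net, x0, x1, xt, t).item())
```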
This protocol simplifies the multi-task pretraining phase by limiting negative interference between tasks and allows us to better understand the dynamics of multi-task training and generalization. We find that, given a fixed amount of pretraining data, agents trained with more variations of a game are able to generalize better. Surprisingly, we observe that this advantage can be more pronounced after finetuning for 200M environment frames than when doing zero-shot transfer. This highlights the importance of the learned representation and suggests that performance after finetuning might be more appropriate for evaluating generalization in reinforcement learning. We also find that, even though small networks have remained popular for solving Atari 2600 games, increasing the capacity of the value and policy networks is critical to achieving good performance as we increase the number of pretraining modes and difficulties. Overall, our findings emphasize key points that are crucial for efficient multi-task training and generalization in reinforcement learning.","generalization, transfer, atari" FIT: A Metric for Model Sensitivity,https://openreview.net/forum?id=PDG4-Y3aboN,https://openreview.net/pdf?id=PDG4-Y3aboN,"We propose the Fisher Information Trace (FIT) metric, to quantify the effects of mixed-precision quantization. FIT facilitates zero-shot performance prediction of quantized models, and is fast to compute.","Model compression is vital to the deployment of deep learning on edge devices. Low precision representations, achieved via quantization of weights and activations, can reduce inference time and memory requirements. However, quantifying and predicting the response of a model to the changes associated with this procedure remains challenging. This response is non-linear and heterogeneous throughout the network. Understanding which groups of parameters and activations are more sensitive to quantization than others is a critical stage in maximizing efficiency. For this purpose, we propose FIT. Motivated by an information geometric perspective, FIT combines the Fisher information with a model of quantization. We find that FIT can estimate the final performance of a network without retraining. FIT effectively fuses contributions from both parameter and activation quantization into a single metric. Additionally, FIT is fast to compute when compared to existing methods, demonstrating favourable convergence properties. These properties are validated experimentally across hundreds of quantization configurations, with a focus on layer-wise mixed-precision quantization.","Fisher Information, Quantization" Transfer Learning with Deep Tabular Models,https://openreview.net/forum?id=b0RuGUYo8pA,https://openreview.net/pdf?id=b0RuGUYo8pA,We find that transfer learning with deep tabular models provides a definitive advantage over gradient boosted decision tree methods when downstream data is limited.,"Recent work on deep learning for tabular data demonstrates the strong performance of deep tabular models, often bridging the gap between gradient boosted decision trees and neural networks. Accuracy aside, a major advantage of neural models is that they are easily fine-tuned in new domains and learn reusable features. This property is often exploited in computer vision and natural language applications, where transfer learning is indispensable when task-specific training data is scarce. In this work, we explore the benefits that representation learning provides for knowledge transfer in the tabular domain.
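A rough sketch of a FIT-style score for the entry above: an empirical Fisher term (squared gradients) contracted with the squared uniform-quantization perturbation, per parameter tensor. This is our schematic reading of the abstract, not the authors' exact formula:

```python
import torch
import torch.nn as nn

def fit_scores(model, loss, bits=8):
    """Layer sensitivity to quantization: empirical Fisher (squared
    gradients) weighted by the squared perturbation that a uniform
    `bits`-bit quantizer would introduce. Higher score = more sensitive."""
    loss.backward()
    scores = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        scale = p.detach().abs().max() / (2 ** (bits - 1) - 1)
        q = torch.round(p.detach() / scale) * scale        # uniform quantizer
        scores[name] = (p.grad.pow(2) * (p.detach() - q).pow(2)).sum().item()
    return scores

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
x, y = torch.randn(32, 8), torch.randint(0, 2, (32,))
print(fit_scores(model, nn.functional.cross_entropy(model(x), y)))
```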
We conduct experiments in a realistic medical diagnosis test bed with limited amounts of downstream data and find that transfer learning with deep tabular models provides a definitive advantage over gradient boosted decision tree methods. We further compare the supervised and self-supervised pretraining strategies and provide practical advice on transfer learning with tabular models. Finally, we propose a pseudo-feature method for cases where the upstream and downstream feature sets differ, a tabular-specific problem widespread in real-world applications.","tabular data, transfer learning, tabular models, representation learning" An Empirical Study on the Efficacy of Deep Active Learning Techniques,https://openreview.net/forum?id=K75z1mX4VTo,https://openreview.net/pdf?id=K75z1mX4VTo,Our paper provides a comprehensive empirical study of existing deep active learning methods.,"Deep Active Learning (DAL) has been advocated as a promising method to reduce labeling costs in supervised learning. However, existing evaluations of DAL methods are based on different settings, and their results are controversial. To tackle this issue, this paper comprehensively evaluates 19 existing DAL methods in a uniform setting, including traditional fully-\underline{s}upervised \underline{a}ctive \underline{l}earning (SAL) strategies and emerging \underline{s}emi-\underline{s}upervised \underline{a}ctive \underline{l}earning (SSAL) techniques. We have several non-trivial findings. First, most SAL methods cannot achieve higher accuracy than random selection. Second, semi-supervised training brings significant performance improvement compared to pure SAL methods. Third, performing data selection in the SSAL setting can achieve a significant and consistent performance improvement, especially with abundant unlabeled data. Our findings produce the following guidance for practitioners: one should (i) apply SSAL as early as possible and (ii) collect more unlabeled data whenever possible, for better model performance. We will release our code upon acceptance.","deep neural networks, active learning, semi-supervised learning" CrAM: A Compression-Aware Minimizer,https://openreview.net/forum?id=_eTZBs-yedr,https://openreview.net/pdf?id=_eTZBs-yedr,We propose a method for training accurate models that are robust to compression in a single step. ,"Deep neural networks (DNNs) often have to be compressed, via pruning and/or quantization, before they can be deployed in practical settings. In this work we propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way, in order to produce models whose local loss behavior is stable under compression operations such as pruning. Thus, dense models trained via CrAM should be compressible post-training, in a single step, without significant accuracy loss. Experimental results on standard benchmarks, such as residual networks for ImageNet classification and BERT models for language modelling, show that CrAM produces dense models that can be more accurate than the standard SGD/Adam-based baselines, but which are stable under weight pruning: for instance, on the ImageNet task, we can prune models in one-shot to 70-80% sparsity with reasonable (≤ 1%) accuracy loss, which is competitive with gradual compression methods. 
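For contrast with the random-selection finding in the deep active learning study above: entropy-based uncertainty sampling is one of the canonical SAL strategies such evaluations include. A minimal sketch:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def entropy_sampling(model, pool_x, budget):
    """Pick the `budget` unlabelled points whose predictive distribution has
    the highest entropy -- the classic uncertainty-based SAL baseline."""
    probs = torch.softmax(model(pool_x), dim=1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    return entropy.topk(budget).indices   # indices to send for labelling

model = nn.Linear(10, 3)
pool = torch.randn(50, 10)
print(entropy_sampling(model, pool, budget=5))
```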
Additionally, we show that CrAM produces sparse models which perform well for transfer learning, and that it also works for semi-structured pruning patterns supported by GPU hardware.","compression, sparsity, one shot pruning, optimization" Using Language to Extend to Unseen Domains,https://openreview.net/forum?id=eR2dG8yjnQ,https://openreview.net/pdf?id=eR2dG8yjnQ,Transforming multimodal embeddings with language improves accuracy on an unseen domain. ,"It is expensive to collect training data for every possible domain that a vision model may encounter when deployed. We instead consider how simply $\textit{verbalizing}$ the training domain (e.g.``photos of birds'') as well as domains we want to extend to but do not have data for (e.g.``paintings of birds'') can improve robustness. Using a multimodal model with a joint image and language embedding space, our method $\textit{LADS}$ learns a transformation of the image embeddings from the source domain to each target domain, while preserving task relevant information. Without using any images from the target domain, we show that over the $\textit{extended}$ domain containing both source and target, $\textit{LADS}$ outperforms standard fine-tuning and ensemble approaches over a suite of 4 benchmarks targeting domain adaptation and dataset bias.","vision and language, robust training, domain adaptation" Can We Find Nash Equilibria at a Linear Rate in Markov Games?,https://openreview.net/forum?id=eQzLwwGyQrb,https://openreview.net/pdf?id=eQzLwwGyQrb,A decentralized algorithm for finding Nash equilibria in two-player zero-sum discounted Markov games with global linear convergence.,"We study decentralized learning in two-player zero-sum discounted Markov games where the goal is to design a policy optimization algorithm for either agent satisfying two properties. First, the player does not need to know the policy of the opponent to update its policy. Second, when both players adopt the algorithm, their joint policy converges to a Nash equilibrium of the game. To this end, we construct a meta-algorithm, dubbed as $\texttt{Homotopy-PO}$, which provably finds a Nash equilibrium at a global linear rate. In particular, $\texttt{Homotopy-PO}$ interweaves two base algorithms $\texttt{Local-Fast}$ and $\texttt{Global-Slow}$ via homotopy continuation. $\texttt{Local-Fast}$ is an algorithm that enjoys local linear convergence while $\texttt{Global-Slow}$ is an algorithm that converges globally but at a slower sublinear rate. By switching between these two base algorithms, $\texttt{Global-Slow}$ essentially serves as a ``guide'' which identifies a benign neighborhood where $\texttt{Local-Fast}$ enjoys fast convergence. However, since the exact size of such a neighborhood is unknown, we apply a doubling trick to switch between these two base algorithms. The switching scheme is delicately designed so that the aggregated performance of the algorithm is driven by $\texttt{Local-Fast}$. 
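A simplified step in the spirit of the CrAM entry above, assuming one-shot magnitude pruning as the compression operator: the gradient is evaluated at the compressed point and applied to the dense weights, encouraging a loss landscape that stays flat under pruning. This is our schematic reading, not the authors' update rule:

```python
import torch

def topk_prune(w, sparsity=0.7):
    """One-shot magnitude pruning: keep only the largest-magnitude entries."""
    k = max(1, int(w.numel() * (1 - sparsity)))
    mask = torch.zeros_like(w).flatten()
    mask[w.abs().flatten().topk(k).indices] = 1.0
    return w * mask.view_as(w)

def cram_like_step(w, loss_fn, lr=0.1, sparsity=0.7):
    """Compute the gradient at the compressed weights C(w) and apply it to
    the dense weights, so the model remains accurate after one-shot pruning."""
    wc = topk_prune(w.detach(), sparsity).requires_grad_(True)
    loss = loss_fn(wc)
    (g,) = torch.autograd.grad(loss, wc)
    with torch.no_grad():
        w -= lr * g
    return loss.item()

w = torch.randn(100)
loss_fn = lambda v: ((v - 1.0) ** 2).sum()   # toy objective
for _ in range(50):
    cram_like_step(w, loss_fn)
```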
Furthermore, we prove that $\texttt{Local-Fast}$ and $\texttt{Global-Slow}$ can both be instantiated by variants of the optimistic gradient descent/ascent (OGDA) method, which is of independent interest.","Multi-Agent Reinforcement Learning, Markov Games, Linear Convergence, Policy Optimization" Speeding up Policy Optimization with Vanishing Hypothesis and Variable Mini-Batch Size,https://openreview.net/forum?id=bwss-i6lG45,https://openreview.net/pdf?id=bwss-i6lG45,"We propose a novel reinforcement learning method to speed up policy optimization by using a vanishing hypothesis term, and a varying mini-batch size.","Reinforcement learning-based algorithms have been used extensively in recent years due to their flexible nature, good performance, and the increasing number of said algorithms. However, the largest drawback of these techniques remains unsolved, that is, it usually takes a long time for the agents to learn how to solve a given problem. In this work, we outline a novel method that can be used to drastically reduce the training time of current state-of-the-art algorithms like Proximal Policy Optimization (PPO). We evaluate the performance of this approach in a unique environment where we use reinforcement learning to help with a practical astronomical problem: where to place a fixed number of observatory stations in the Solar System to observe space objects (e.g. asteroids) as permanently as possible. That is, the reward in this scenario corresponds to the total coverage of the trajectories of these objects. We apply noisy evaluation for calculating the reward to speed up the training, a technique that has already been applied efficiently in stochastic optimization. Namely, we allow the incorporation of some additional noise in the reward function in the form of a hypothesis term and a varying mini-batch size. However, in order to follow the theoretical guidelines, both of them are forced to vanish during training to let the noise converge to zero. Our experimental results show that using this approach we can reduce the training time remarkably, by as much as 75%.","reinforcement learning, policy optimization, noisy evaluation, variable mini-batch size, vanishing hypothesis, optimal placement of sensors" Understanding Train-Validation Split in Meta-Learning with Neural Networks,https://openreview.net/forum?id=JVlyfHEEm0k,https://openreview.net/pdf?id=JVlyfHEEm0k,,"The goal of meta-learning is to learn a good prior model from a collection of tasks such that the learned prior is able to adapt quickly to new tasks without accessing much data from the new tasks. A common practice in meta-learning is to perform a train-validation split on each task, where the training set is used for adapting the model parameter to that specific task and the validation set is used for learning a prior model that is shared across all tasks. Despite its success and popularity in multitask learning and few-shot learning, the understanding of the train-validation split is still limited, especially when the neural network models are used. In this paper, we study the benefit of train-validation split for classification problems with neural network models trained by gradient descent. We prove that the train-validation split is necessary to learn a good prior model when the noise in the training sample is large, while the train-train method fails. We validate our theory by conducting experiments on both synthetic and real datasets.
To the best of our knowledge, this is the first work towards the theoretical understanding of train-validation split in meta-learning with neural networks.","meta-learning, neural networks, deep learning, train-validation split, convolutional neural network" Revisiting Robustness in Graph Machine Learning,https://openreview.net/forum?id=h1o7Ry9Zctm,https://openreview.net/pdf?id=h1o7Ry9Zctm,"GNNs suffer from over-robustness, that is, robustness beyond the point of semantic change, which can be combated by including the known label-structure at inference time.","Many works show that node-level predictions of Graph Neural Networks (GNNs) are not robust to small, often termed adversarial, changes to the graph structure. However, because manual inspection of a graph is difficult, it is unclear if the studied perturbations always preserve a core assumption of adversarial examples: that of unchanged semantic content. To address this problem, we introduce a more principled notion of an adversarial graph, which is aware of semantic content change. Using Contextual Stochastic Block Models (CSBMs) and real-world graphs, our results suggest: $i)$ for a majority of nodes the prevalent perturbation models include a large fraction of perturbed graphs violating the unchanged semantics assumption; $ii)$ surprisingly, all assessed GNNs show over-robustness, that is, robustness beyond the point of semantic change. We find this to be a complementary phenomenon to adversarial examples and show that including the label-structure of the training graph into the inference process of GNNs significantly reduces over-robustness, while having a positive effect on test accuracy and adversarial robustness. Theoretically, leveraging our new semantics-aware notion of robustness, we prove that there is no robustness-accuracy tradeoff for inductively classifying a newly added node. ","graph neural networks, adversarial robustness, label propagation, node-classification, stochastic block models, Bayes classifier, non-i.i.d. data, graph learning, graphs, robustness" Variational Information Pursuit for Interpretable Predictions,https://openreview.net/forum?id=77lSWa-Tm3Z,https://openreview.net/pdf?id=77lSWa-Tm3Z,A Framework for Interpretable ML," There is a growing interest in the machine learning community in developing predictive algorithms that are ``interpretable by design"". Towards this end, recent work proposes to make interpretable decisions by sequentially asking interpretable queries about data until a prediction can be made with high confidence based on the answers obtained (the history). To promote short query-answer chains, a greedy procedure called Information Pursuit (IP) is used, which adaptively chooses queries in order of information gain. Generative models are employed to learn the distribution of query-answers and labels, which is in turn used to estimate the most informative query. However, learning and inference with a full generative model of the data is often intractable for complex tasks. In this work, we propose Variational Information Pursuit (V-IP), a variational characterization of IP which bypasses the need for learning generative models. V-IP is based on finding a query selection strategy and a classifier that minimizes the expected cross-entropy between true and predicted labels. We then demonstrate that the IP strategy is the optimal solution to this problem.
Therefore, instead of learning generative models, we can use our optimal strategy to directly pick the most informative query given any history. We then develop a practical algorithm by defining a finite-dimensional parameterization of our strategy and classifier using deep networks and train them end-to-end using our objective. Empirically, V-IP is 10-100x faster than IP on different Vision and NLP tasks with competitive performance. Moreover, V-IP finds much shorter query chains when compared to reinforcement learning which is typically used in sequential-decision-making problems. Finally, we demonstrate the utility of V-IP on challenging tasks like medical diagnosis where the performance is far superior to the generative modelling approach.","Interpretable ML, Explainable AI, Information Pursuit" Grammar-Induced Geometry for Data-Efficient Molecular Property Prediction,https://openreview.net/forum?id=SGQi3LgFnqj,https://openreview.net/pdf?id=SGQi3LgFnqj,We propose a data-efficient molecular property predictor based on an explicit geometry of the space of molecular graphs induced by a learnable hierarchical molecular grammar.,"The prediction of molecular properties is a crucial task in the field of material and drug discovery. The potential benefits of using deep learning techniques are reflected in the wealth of recent literature. Still, these techniques are faced with a common challenge in practice: Labeled data are limited by the cost of manual extraction from literature and laborious experimentation. In this work, we propose a data-efficient property predictor by utilizing a learnable hierarchical molecular grammar that can generate molecules from grammar production rules. Such a grammar induces an explicit geometry of the space of molecular graphs, which provides an informative prior on molecular structural similarity. The property prediction is performed using graph neural diffusion over the grammar-induced geometry. On both small and large datasets, our evaluation shows that this approach outperforms a wide spectrum of baselines, including supervised and pre-trained graph neural networks. We include a detailed ablation study and further analysis of our solution, showing its effectiveness in cases with extremely limited data (only ${\sim}100$ samples), and its extension to application in molecular generation.","Molecular property prediction, Graph grammar, Data-efficient model" EF21-P and Friends: Improved Theoretical Communication Complexity for Distributed Optimization with Bidirectional Compression,https://openreview.net/forum?id=fGMKL9dNR1,https://openreview.net/pdf?id=fGMKL9dNR1,,"The starting point of this paper is the discovery of a novel and simple error-feedback mechanism, which we call EF21-P, for dealing with the error introduced by a contractive compressor. Unlike all prior works on error feedback, where compression and correction operate in the dual space of gradients, our mechanism operates in the primal space of models. While we believe that EF21-P may be of interest in many situations where it is often advantageous to perform model perturbation prior to the computation of the gradient (e.g., randomized smoothing and generalization), in this work we focus our attention on its use as a key building block in the design of communication-efficient distributed optimization methods supporting bidirectional compression. In particular, we employ EF21-P as the mechanism for compressing and subsequently error-correcting the model broadcast by the server to the workers. 
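Our schematic reading of how a trained V-IP system (from the entry above) would run at test time; `querier`, `classifier`, and `answer_fn` are hypothetical stand-ins for the deep networks and query-answering mechanism the paper parameterizes:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def vip_predict(querier, classifier, answer_fn, history_dim,
                max_queries=10, threshold=0.9):
    """Greedy interpretable prediction: the querier picks the next query from
    the current query-answer history, the answer is folded into the history,
    and we stop once the classifier is confident enough."""
    history = torch.zeros(1, history_dim)
    for _ in range(max_queries):
        q = querier(history).argmax(dim=1).item()   # most informative next query
        history = history + answer_fn(q, history_dim)
        probs = torch.softmax(classifier(history), dim=1)
        if probs.max().item() >= threshold:
            break
    return probs.argmax(dim=1).item()

D, C = 20, 5
querier, classifier = nn.Linear(D, D), nn.Linear(D, C)
answer_fn = lambda q, d: nn.functional.one_hot(torch.tensor([q]), d).float()  # toy answers
print(vip_predict(querier, classifier, answer_fn, D))
```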
By combining EF21-P with suitable methods performing worker-to-server compression, we obtain novel methods supporting bidirectional compression and enjoying new state-of-the-art theoretical communication complexity for convex and nonconvex problems. For example, our bounds are the first that manage to decouple the variance/error coming from the workers-to-server and server-to-workers compression, transforming a multiplicative dependence to an additive one. In the convex regime, we obtain the first bounds that match the theoretical communication complexity of gradient descent. Even in this convex regime, our algorithms work with biased gradient estimators, which is non-standard and requires new proof techniques that may be of independent interest. Finally, our theoretical results are corroborated through suitable experiments.","communication compression, bidirectional compression, error feedback, distributed optimization" Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints,https://openreview.net/forum?id=T5nUQDrM4u,https://openreview.net/pdf?id=T5nUQDrM4u,"We create sparsely activated Mixture-of-Experts models from pre-existing dense models, showing significant performance improvements and computational savings in doing so.","Training large, deep neural networks to convergence can be prohibitively expensive. As a result, often only a small selection of popular, dense models are reused across different contexts and tasks. Increasingly, sparsely activated models, which seek to decouple model size from computation costs, are becoming an attractive alternative to dense models. Although more efficient in terms of quality and computation cost, sparse models remain data-hungry and costly to train from scratch in the large scale regime. In this work, we propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint. We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models, respectively, significantly outperform their dense counterparts on SuperGLUE and ImageNet, using only ~50% of the initial dense pretraining sunk cost. The upcycled models also outperform sparse models trained from scratch on 100% of the initial dense pretraining computation budget.","mixture of experts, sparse, vision, language, deep learning, superglue, imagenet" A simple Training-Free Method for Rejection Option,https://openreview.net/forum?id=K1DdnjL6p7,https://openreview.net/pdf?id=K1DdnjL6p7,We present a simple yet effective method to implement the rejection option for a pre-trained classifier. ,"We present a simple yet effective method to implement the rejection option for a pre-trained classifier. Our method is based on a sound mathematical framework, enjoys good properties, and is hyperparameter free. It is lightweight, since it does not require any re-training of the network, and it is flexible, since it can be used with any model that outputs soft-probabilities. We compare our solution to state-of-the-art methods considering popular benchmarks (Cifar-10, Cifar-100, SVHN), and various models (VGG-16, DenseNet-121, ResNet-34). 
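The mechanical core of the sparse upcycling entry above, as we read it, is simple: copy a dense checkpoint's FFN into E experts and add a freshly initialized router, so only the router starts from scratch. A minimal sketch:

```python
import copy
import torch
import torch.nn as nn

def upcycle_mlp(dense_mlp, d_model, num_experts=8):
    """Turn one dense FFN block into a Mixture-of-Experts layer whose experts
    all start as copies of the dense block, reusing the sunk pretraining cost."""
    experts = nn.ModuleList(copy.deepcopy(dense_mlp) for _ in range(num_experts))
    router = nn.Linear(d_model, num_experts)       # trained from scratch
    return experts, router

def moe_forward(x, experts, router, k=1):
    weight, idx = torch.softmax(router(x), -1).topk(k, dim=-1)
    out = torch.zeros_like(x)
    for j in range(k):                             # route tokens to top-k experts
        for e, expert in enumerate(experts):
            sel = idx[:, j] == e
            if sel.any():
                out[sel] += weight[sel, j:j + 1] * expert(x[sel])
    return out

d = 16
dense_mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
experts, router = upcycle_mlp(dense_mlp, d)
print(moe_forward(torch.randn(32, d), experts, router).shape)
```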
At evaluation time, our method, which is applied post-training to any classification model, achieves results similar to or better than those of competitors that usually require further training and/or tuning of the models.","rejection option, safety AI, deep learning" Self-Programming Artificial Intelligence Using Code-Generating Language Models,https://openreview.net/forum?id=SKat5ZX5RET,https://openreview.net/pdf?id=SKat5ZX5RET,We develop and experimentally validate the first practical implementation of a self-reprogramming AI system. ,"Recent progress in large-scale language models has enabled breakthroughs in previously intractable computer programming tasks. Prior work in meta-learning and neural architecture search has led to substantial successes across various task domains, spawning myriad approaches for algorithmically optimizing the design and learning dynamics of deep learning models. At the intersection of these research areas, we implement a code-generating language model with the ability to modify its own source code. Self-programming AI algorithms have been of interest since the dawn of AI itself. Although various theoretical formulations of generalized self-programming AI have been posed, no such system has been successfully implemented to date under real-world computational constraints. Applying AI-based code generation to AI itself, we develop and experimentally validate the first practical implementation of a self-programming AI system. We empirically show that a self-programming AI implemented using a code generation model can successfully modify its own source code to improve performance and program sub-models to perform auxiliary tasks. Our model can self-modify various properties including model architecture, computational capacity, and learning dynamics.","Self-programming AI, NLP, code generation, AutoML" Lossless Adaptation of Pretrained Vision Models For Robotic Manipulation,https://openreview.net/forum?id=5IND3TXJRb-,https://openreview.net/pdf?id=5IND3TXJRb-,,"Recent works have shown that large models pretrained on common visual learning tasks can provide useful representations for a wide range of specialized perception problems, as well as a variety of robotic manipulation tasks. While prior work on robotic manipulation has predominantly used frozen pretrained features, we demonstrate that in robotics, unlike in other domains, this approach can fail to reach optimal performance, and that fine-tuning of the full model can lead to significantly better results. Unfortunately, fine-tuning disrupts the pretrained visual representation, and causes representational drift towards the fine-tuned task, thus leading to a loss of the versatility of the original model. We introduce a method for lossless adaptation to address this shortcoming of classical fine-tuning. We demonstrate that appropriate placement of our parameter-efficient adapters can significantly reduce the performance gap between frozen pretrained representations and full end-to-end fine-tuning without changes to the original representation and thus preserving original capabilities of the pretrained model.
We perform a comprehensive investigation across three major model architectures (ViTs, NFNets, and ResNets), with supervised (ImageNet-1K classification) and self-supervised pretrained weights (CLIP, BYOL, Visual MAE), in three manipulation task domains and 35 individual tasks, and demonstrate that our claims are strongly validated in various settings.", Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models,https://openreview.net/forum?id=I8ly64E5Nt,https://openreview.net/pdf?id=I8ly64E5Nt,"We develop a new class of large language models that is embarrassingly parallel: different parts of the model are independently trained on different subsets of the data, with no need for multi-node training or inference.","We present Branch-Train-Merge (BTM), a communication-efficient algorithm for embarrassingly parallel training of large language models (LLMs). We show it is possible to independently train subparts of a new class of LLMs on different subsets of the data, eliminating the massive multi-node synchronization currently required to train LLMs. BTM learns a set of independent Expert LMs (ELMs), each specialized to a different textual domain, such as scientific or legal text. These ELMs can be added and removed to update data coverage, ensembled to generalize to new domains, or averaged to collapse back to a single LM for efficient inference. New ELMs are learned by branching from (mixtures of) ELMs in the current set, further training on new domains, and then merging the resulting models back into the set for future use. Experiments show that BTM improves in- and out-of-domain perplexities as compared to GPT-style Transformer LMs, when controlling for training cost. Through extensive analysis, we show that these results are robust to different ELM initialization schemes, but require expert domain specialization; ensembles with random data splits do not perform well. Our results suggest that aggressive parallelism could be used to efficiently scale larger LMs in future work.","sparsity, language model, efficient" Logical Message Passing Networks with One-hop Inference on Atomic Formulas,https://openreview.net/forum?id=SoyOsp7i_l,https://openreview.net/pdf?id=SoyOsp7i_l,,"Complex Query Answering (CQA) over Knowledge Graphs (KGs) has attracted a lot of attention, as it can potentially support many applications. Given that KGs are usually incomplete, neural models are proposed to answer the logical queries by parameterizing set operators with complex neural networks. However, such methods usually train neural set operators with a large number of entity and relation embeddings from scratch, where whether and how the embeddings or the neural set operators contribute to the performance remains unclear. In this paper, we propose a simple framework for complex query answering that decomposes the KG embeddings from neural set operators. We propose to represent the complex queries in the query graph. On top of the query graph, we propose the Logical Message Passing Neural Network (LMPNN) that connects the local one-hop inferences on atomic formulas to the global logical reasoning for complex query answering. We leverage existing effective KG embeddings to conduct one-hop inferences on atomic formulas, the results of which are regarded as the messages passed in LMPNN. The reasoning process over the overall logical formulas is turned into the forward pass of LMPNN that incrementally aggregates local information to finally predict the answers’ embeddings.
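One of the collapse options named in the BTM entry above is averaging the Expert LMs back into a single model for efficient inference. A sketch assuming all experts were branched from one seed model and are therefore key-aligned:

```python
import torch

def merge_expert_lms(state_dicts, weights=None):
    """Collapse a set of Expert LMs into one model by (weighted) parameter
    averaging. Because every expert was branched from a shared seed model,
    their state dicts are aligned key-by-key."""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float()
                          for w, sd in zip(weights, state_dicts))
    return merged

# usage sketch:
# model.load_state_dict(merge_expert_lms([torch.load(p) for p in expert_paths]))
```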
The complex logical inference across different types of queries will then be learned from training examples based on the LMPNN architecture. Theoretically, our query-graph representation is more general than the prevailing operator-tree formulation, so our approach applies to a broader range of complex KG queries. Empirically, our approach yields the new state-of-the-art neural CQA model. Our research bridges the gap between complex KG query answering tasks and the long-standing achievements of knowledge graph representation learning.","knowledge graph, complex query answering, graph neural network, representation learning" Noise-Robust De-Duplication at Scale,https://openreview.net/forum?id=bAz2DBS35i,https://openreview.net/pdf?id=bAz2DBS35i,," Identifying near duplicates within large, noisy text corpora has a myriad of applications that range from de-duplicating training datasets, reducing privacy risk, and evaluating test set leakage, to identifying reproduced news articles and literature within large corpora. Across these diverse applications, the overwhelming majority of work relies on N-grams. Limited efforts have been made to evaluate how well N-gram methods perform, in part because it is unclear how one could create an unbiased evaluation dataset for a massive corpus. This study uses the unique timeliness of historical news wires to create a 27,210-document dataset, with 122,876 positive duplicate pairs, for studying noise-robust de-duplication. The time-sensitivity of news makes comprehensive hand labelling feasible, despite the massive overall size of the corpus, as duplicates occur within a narrow date range. The study then develops and evaluates a range of de-duplication methods: hashing and N-gram overlap (which predominate in the literature), a contrastively trained bi-encoder, and a ""re-rank"" style approach combining a bi- and cross-encoder. The neural approaches significantly outperform hashing and N-gram overlap. We show that the bi-encoder scales well, de-duplicating a 10 million article corpus on a single GPU card in a matter of hours. The public release of our NEWS-COPY de-duplication dataset will facilitate further research and applications. ", P2PRISM - Peer to peer learning with individual prism for secure aggregation,https://openreview.net/forum?id=fa7PVbrgpj,https://openreview.net/pdf?id=fa7PVbrgpj,We highlight the vulnerabilities in peer-to-peer learning towards malicious attacks and propose a byzantine-robust defense against the same.,"Federated learning (FL) has made collaboration between nodes possible without explicit sharing of local data. However, it requires the participating nodes to trust the server and its model updates, the server itself being a critical node susceptible to failure and compromise. A loss of trust in the server and a demand to aggregate the model independently for oneself has led decentralized peer-to-peer learning (P2PL) to gain traction lately. In this paper, we highlight the previously unexposed vulnerabilities of P2PL to malicious attacks and how P2PL behaves differently from FL in such a malicious environment.
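In the spirit of the bi-encoder path in the NEWS-COPY entry above: a generic embedding-based near-duplicate detector. The encoder itself is out of scope here (any sentence-embedding model would do), and at the 10-million-article scale the paper targets, the exact pairwise comparison below would be replaced by an approximate-nearest-neighbour index:

```python
import numpy as np

def near_duplicates(embeddings, threshold=0.92):
    """Flag article pairs whose cosine similarity exceeds a threshold.
    `embeddings` is an (N, d) array produced by a bi-encoder."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = z @ z.T
    iu = np.triu_indices(len(z), k=1)            # each unordered pair once
    return [(i, j, float(sims[i, j]))
            for i, j in zip(*iu) if sims[i, j] > threshold]

rng = np.random.default_rng(0)
emb = rng.standard_normal((100, 64))
emb[1] = emb[0] + 0.01 * rng.standard_normal(64)  # plant a near-duplicate
print(near_duplicates(emb)[:3])
```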
We then present a robust defense, P2PRISM, a secure aggregation protocol for P2PL.","peer-to-peer, decentralized learning, byzantine-robust" Multi-scale Attention for Diabetic Retinopathy Detection in Retinal Fundus Images,https://openreview.net/forum?id=R370fuGO7JJ,https://openreview.net/pdf?id=R370fuGO7JJ,This paper proposes a novel deep learning-based approach for grading diabetic retinopathy in fundus photographs,"The diagnosis and/or grading of diabetic retinopathy (DR) in the retina fundus has traditionally been done by physicians using manual procedures. However, there has been a significant demand for automated eye diagnostic and grading systems due to the constant rise in the number of persons with diabetes over the past few decades. An excellent diagnostic and predictive value for treatment planning exists with automatic DR grading based on retinal fundus pictures. With the majority of the current automated DR grading systems, it is exceedingly challenging to capture significant features because of the minor changes between severity levels. This paper presents a deep learning-based method for automatically assessing diabetic retinopathy in retina fundus pictures. In order to increase the discriminative ability of the retrieved features, we implement a multi-scale attention mechanism within a deep convolutional neural network architecture. Additionally, we provide a new loss function, termed the modified grading loss, that improves the training convergence of the proposed approach by taking into account the distance between the various grades of distinct DR categories. The proposed technique is trained, validated, and tested using an openly available diabetic retinopathy dataset. Experimental findings illustrate that the proposed approach is competitive with existing methods.","diabetes, deep learning, diabetic retinopathy, microvascular complication, hyperglycemia, attention, CNN" Blessing from Experts: Super Reinforcement Learning in Confounded Environments,https://openreview.net/forum?id=47C06k5D2cn,https://openreview.net/pdf?id=47C06k5D2cn,,"We introduce super reinforcement learning in the batch setting, which takes the observed action as input for enhanced policy learning. In the presence of unmeasured confounders, the recommendations from human experts recorded in the observed data allow us to recover certain unobserved information. Including this information in the policy search, the proposed super reinforcement learning will yield a super policy that is guaranteed to outperform both the standard optimal policy and the behavior one (e.g., the expert’s recommendation). Furthermore, to address the issue of unmeasured confounding in finding super-policies, a number of non-parametric identification results are established. Finally, we develop two super-policy learning algorithms and derive their corresponding finite-sample regret guarantees.", Unscented Autoencoder,https://openreview.net/forum?id=US0hgxTfU7i,https://openreview.net/pdf?id=US0hgxTfU7i,Sampling fixed sigma points and regularizing posterior moments in VAEs promotes reconstruction quality while preserving a smooth latent space.,"The Variational Autoencoder (VAE) is a seminal approach in deep generative modeling with latent variables.
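The "modified grading loss" in the diabetic retinopathy entry above is described only as accounting for the distance between DR grades. One plausible shape, offered as our guess rather than the authors' formula, is cross-entropy plus an expected ordinal-distance penalty:

```python
import torch
import torch.nn.functional as F

def grading_loss(logits, target, n_grades=5, alpha=1.0):
    """Cross-entropy plus the expected |predicted grade - true grade| under
    the softmax, so probability mass on far-away DR grades costs more than
    mass on adjacent grades."""
    ce = F.cross_entropy(logits, target)
    probs = F.softmax(logits, dim=1)
    grades = torch.arange(n_grades, device=logits.device).float()
    dist = (grades.unsqueeze(0) - target.unsqueeze(1).float()).abs()  # (B, G)
    return ce + alpha * (probs * dist).sum(dim=1).mean()

logits = torch.randn(8, 5)
target = torch.randint(0, 5, (8,))
print(grading_loss(logits, target).item())
```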
It performs posterior inference by parameterizing a distribution of latent variables in the stochastic encoder (while penalizing the disparity to an assumed standard normal prior), and achieves sample reconstruction via a deterministic decoder. In our work, we start from a simple interpretation of the reconstruction process: a nonlinear transformation of the stochastic encoder. We apply the Unscented Transform (UT) from the field of filtering and control -- a well-known distribution approximation used in the Unscented Kalman Filter (UKF). A finite set of statistics called sigma points that are sampled deterministically provides a more informative and lower-variance posterior representation than the ubiquitous noise-scaling of the reparameterization trick. Inspired by the unscented transform, we derive a novel deterministic flavor of the VAE, the Unscented Autoencoder (UAE), trained purely with regularization-like terms on the per-sample, full-covariance posterior. A key ingredient for the good performance is the Wasserstein distribution metric in place of the Kullback-Leibler (KL) divergence, effectively performing covariance matrix regularization while allowing for a sharper posterior, which especially benefits reconstruction. Nevertheless, our results are consistent with recent findings showing that deterministic models can ensure good sample quality and smooth interpolation in the latent space. We empirically show superior performance in Fréchet Inception Distance (FID) scores over closely-related models, in addition to a lower training variance than the VAE.","generative models, variational autoencoders, deterministic autoencoders, unscented transform, wasserstein metric" Reinforcement Learning for Bandits with Continuous Actions and Large Context Spaces,https://openreview.net/forum?id=Q5uQecAw0vO,https://openreview.net/pdf?id=Q5uQecAw0vO,"We propose a reinforcement learning approach for the challenging contextual bandits scenario with continuous actions that can generalise to large context spaces, unlike the current literature. ","We consider the challenging scenario of contextual bandits with continuous actions and large input ``context'' spaces, e.g. images. We posit that by modifying reinforcement learning (RL) algorithms for continuous control, we can outperform hand-crafted contextual bandit algorithms for continuous actions on standard benchmark datasets, i.e. vector contexts. We demonstrate that parametric policy networks outperform recently published tree-based policies in both average regret and costs on held-out samples. Furthermore, in contrast to previous work, we successfully demonstrate that RL algorithms can generalise contextual bandit problems with continuous actions to large context spaces. We obtain state-of-the-art performance using RL and significantly outperform previous methods on image contexts. Lastly, we introduce a new contextual bandits domain with multi-dimensional continuous action space and image context. ","Contextual bandits, Continuous actions, Image context, reinforcement learning" Explanation Uncertainty with Decision Boundary Awareness,https://openreview.net/forum?id=2a3aR6geXxy,https://openreview.net/pdf?id=2a3aR6geXxy,We introduce a method that generates uncertainty estimates for feature attribution explanations.,"Post-hoc explanation methods have become increasingly relied upon for understanding black-box classifiers in high-stakes applications, precipitating a need for reliable explanations.
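The sigma-point machinery behind the Unscented Autoencoder entry above is the textbook unscented transform. Below is a minimal sketch that propagates a Gaussian through a nonlinearity; the kappa scaling is a common default, not taken from the paper.

```python
# Textbook unscented transform: deterministic sigma points of N(mu, P) pushed
# through a nonlinearity f, then mean/covariance re-estimated from them.
import numpy as np

def unscented_transform(mu, P, f, kappa=1.0):
    n = mu.shape[0]
    L = np.linalg.cholesky((n + kappa) * P)
    sigma = np.vstack([mu, mu + L.T, mu - L.T])        # 2n + 1 sigma points
    w = np.full(2 * n + 1, 1.0 / (2 * (n + kappa)))    # standard UT weights
    w[0] = kappa / (n + kappa)
    y = np.array([f(s) for s in sigma])                # push through f
    mean = w @ y
    cov = (w[:, None] * (y - mean)).T @ (y - mean)
    return mean, cov

mu, P = np.array([0.5, -1.0]), np.diag([0.2, 0.1])
mean, cov = unscented_transform(mu, P, lambda x: np.tanh(x))
print(mean, cov, sep="\n")
```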
While numerous explanation methods have been proposed, recent works have shown that many existing methods can be inconsistent or unstable. In addition, high-performing classifiers are often highly nonlinear and can exhibit complex behavior around the decision boundary, leading to brittle or misleading local explanations. Therefore, there is a pressing need to quantify the uncertainty of such explanation methods in order to understand when explanations are trustworthy. We introduce a novel uncertainty quantification method parameterized by a Gaussian Process model, which combines the uncertainty approximation of existing methods with a novel geodesic-based similarity that captures the complexity of the target black-box decision boundary. The proposed framework is highly flexible: it can be used with any black-box classifier and feature attribution method to amortize uncertainty estimates for explanations. We show theoretically that our proposed geodesic-based kernel similarity increases with the complexity of the decision boundary. Empirical results on multiple tabular and image datasets show that our decision boundary-aware uncertainty estimate improves understanding of explanations as compared to existing methods.","Explainability, Interpretability, XAI, Feature Importance, Explanation Uncertainty, Reliability" Hierarchical Neural Program Synthesis,https://openreview.net/forum?id=AfmFjelAqW6,https://openreview.net/pdf?id=AfmFjelAqW6,,"Program synthesis aims to automatically construct human-readable programs that satisfy given task specifications such as input/output pairs or demonstrations. Recent works have demonstrated encouraging results in a variety of domains such as string transformation, tensor manipulation, and describing behaviors of embodied agents. Most existing program synthesis methods are designed to synthesize programs from scratch, generating a program token by token, line by line. This fundamentally prevents these methods from scaling up to synthesize programs that are longer or more complex. In this work, we present a scalable program synthesis framework that instead synthesizes a program by hierarchically composing programs. Specifically, we first learn a task embedding space and a program decoder that can decode a task embedding into a program. Then, we train a high-level module to comprehend the task specification (e.g. input/output pairs or demonstrations) from long programs and produce a sequence of task embeddings, which are then decoded by the program decoder and composed to yield the synthesized program. We extensively evaluate our proposed framework in a string transformation domain with input/output pairs. The experimental results demonstrate that the proposed framework can synthesize programs that are significantly longer and more complex than the programs considered in prior program synthesis works.", SARNET: SARCASM VS TRUE-HATE DETECTION NETWORK,https://openreview.net/forum?id=Gid_Z_oUV5q,https://openreview.net/pdf?id=Gid_Z_oUV5q,"This research paper focuses on quasi-ternary classification of hate and sarcasm in a tweet using game theory, Nash Equilibrium and deep learning.","At times hate speech detection classifiers miss the context of a sentence and flag a sarcastic tweet incorrectly. To tackle this problem by emphasising the context of a tweet, we propose SarNet. SarNet is a two-fold deep learning-based model which follows a quasi-ternary labelling strategy and contextually classifies a tweet as hate, sarcastic or neither.
The first module of SarNet is an ANN-BiLSTM-based Pyramid Network used to calculate the hate and sarcastic probabilities of a sentence. The second module of SarNet is the Nash Equalizer, which stems from the concepts of game theory and the prisoner’s dilemma. It treats hate and sarcasm as two prisoners. A payoff matrix is constructed to calculate the true hate of the tweet. True hate measures the hateful content of a tweet excluding its sarcastic part. Thus, this gives a true estimate of the hate content in a tweet, thereby decreasing the number of sarcastic tweets being falsely flagged as hate. Our proposed model is trained on state-of-the-art hate speech and sarcasm datasets in the English language. The precision, recall, and F1 score of our proposed model are 0.93, 0.84, and 0.88, respectively. Comparisons with state-of-the-art architectures demonstrate that SarNet performs better by a significant margin.","Game Theory, Hate Speech, Sarcasm, Nash Equilibrium, Prisoner's Dilemma" Learning Portable Skills by Identifying Generalizing Features with an Attention-Based Ensemble,https://openreview.net/forum?id=nAvBCvT5oA,https://openreview.net/pdf?id=nAvBCvT5oA,We learn a generalizing state feature for skill transfer using an attention-based ensemble.,"The ability to rapidly generalize is crucial for reinforcement learning to be practical in real-world tasks. However, generalization is complicated by the fact that, in many settings, some state features reliably support generalization while others do not. We consider the problem of learning generalizable policies and skills (in the form of options) by identifying feature sets that generalize across instances. We propose an attention-ensemble approach, where a collection of minimally overlapping feature masks is learned, each of which individually maximizes performance on the source instance. Subsequent tasks are instantiated using the ensemble, and transfer performance is used to update the estimated probability that each feature set will generalize in the future. We show that our approach leads to fast policy generalization for eight tasks in the Procgen benchmark. We then show its use in learning portable options in Montezuma's Revenge, where it is able to generalize skills learned in the first screen to the remainder of the game. ","Hierarchical reinforcement learning, Skill transfer, Ensembling" Few-shot Backdoor Attacks via Neural Tangent Kernels,https://openreview.net/forum?id=a70lGJ-rwy,https://openreview.net/pdf?id=a70lGJ-rwy,We provide a new algorithm based on intuitions from kernel regression that uses neural tangent kernels to design stronger backdoor attacks.,"In a backdoor attack, an attacker injects corrupted examples into the training set. The goal of the attacker is to cause the final trained model to predict the attacker's desired target label when a predefined trigger is added to test inputs. Central to these attacks is the trade-off between the success rate of the attack and the number of corrupted training examples injected. We pose this attack as a novel bilevel optimization problem: construct strong poison examples that maximize the attack success rate of the trained model. We use neural tangent kernels to approximate the training dynamics of the model being attacked and automatically learn strong poison examples.
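The game-theoretic primitive behind SarNet's Nash Equalizer, described above, is a two-player payoff matrix with mutual best responses. The sketch below uses the classic prisoner's dilemma payoffs as a stand-in; SarNet's actual payoff construction is not reproduced here.

```python
# A generic two-player payoff matrix and pure-strategy best responses, the
# game-theoretic primitive behind SarNet's Nash Equalizer. The numbers are the
# classic prisoner's dilemma, not SarNet's payoff entries.
import numpy as np

# payoff[i, j] = (row player's payoff, column player's payoff)
payoff = np.array([[(-1, -1), (-3,  0)],
                   [( 0, -3), (-2, -2)]])

def best_response_row(col_action):
    return int(np.argmax(payoff[:, col_action, 0]))

def best_response_col(row_action):
    return int(np.argmax(payoff[row_action, :, 1]))

# A pure Nash equilibrium is a fixed point of mutual best responses.
equilibria = [(r, c) for r in range(2) for c in range(2)
              if best_response_row(c) == r and best_response_col(r) == c]
print(equilibria)  # -> [(1, 1)]: mutual defection
```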
We experiment on subclasses of CIFAR-10 and ImageNet with WideResNet-34 and ConvNeXt architectures on periodic and patch trigger attacks and show that NTBA-designed poisoned examples achieve, for example, an attack success rate of 90% with ten times fewer poison examples injected than the baseline. We provide an interpretation of the NTBA-designed attacks using the analysis of kernel linear regression. We further demonstrate a vulnerability in overparametrized deep neural networks, which is revealed by the shape of the neural tangent kernel. ","kernel regression, backdoor attack, data poisoning, robust machine learning, neural tangent kernel" Quantitative Universal Approximation Bounds for Deep Belief Networks,https://openreview.net/forum?id=WDX-0gwK7C,https://openreview.net/pdf?id=WDX-0gwK7C,We prove quantitative approximation results for deep belief networks with binary hidden units.,"We show that deep belief networks with binary hidden units can approximate any multivariate probability density under very mild integrability requirements on the parental density of the visible nodes. The approximation is measured in the $L^q$-norm for $q\in[1,\infty]$ ($q=\infty$ corresponding to the supremum norm) and in Kullback-Leibler divergence. Furthermore, we establish sharp quantitative bounds on the approximation error in terms of the number of hidden units.","Deep belief network, restricted Boltzmann machine, universal approximation property, expressivity, Kullback-Leibler approximation, probability density" Hyperparameter Optimization through Neural Network Partitioning,https://openreview.net/forum?id=nAgdXgfmqj,https://openreview.net/pdf?id=nAgdXgfmqj,We introduce partitioned networks and an out-of-training sample loss for scalable optimization of hyperparameters,"Well-tuned hyperparameters are crucial for obtaining good generalization behavior in neural networks. They can enforce appropriate inductive biases, regularize the model and improve performance --- especially in the presence of limited data. In this work, we propose a simple and efficient way to optimize hyperparameters inspired by the marginal likelihood, an optimization objective that requires no validation data. Our method partitions the training data and a neural network model into $K$ data shards and parameter partitions, respectively. Each partition is associated with and optimized only on specific data shards. Combining these partitions into subnetworks allows us to define the ""out-of-training-sample"" loss of a subnetwork, i.e., the loss on data shards unseen by the subnetwork, as the objective for hyperparameter optimization. We demonstrate that we can apply this objective to optimize a variety of different hyperparameters in a single training run while being significantly computationally cheaper than alternative methods aiming to optimize the marginal likelihood for neural networks.
Lastly, we focus on optimizing hyperparameters in federated learning, where retraining and cross-validation are particularly challenging.","Hyperparameter optimization, invariances, data-augmentation, marginal likelihood, federated learning" DiscoBAX - Discovery of optimal intervention sets in genomic experiment design,https://openreview.net/forum?id=mBkUeW8rpD6,https://openreview.net/pdf?id=mBkUeW8rpD6,We introduce DiscoBAX — a sample-efficient algorithm for the discovery of genetic interventions that maximize the movement of a phenotype in a direction of interest while covering a diverse set of underlying mechanisms,"The discovery of novel therapeutics to cure genetic pathologies relies on the identification of the different genes involved in the underlying disease mechanism. With billions of potential hypotheses to test, an exhaustive exploration of the entire space of potential interventions is impossible in practice. Sample-efficient methods based on active learning or Bayesian optimization bear the promise of identifying interesting targets using the fewest experiments possible. However, genomic perturbation experiments typically rely on proxy outcomes measured in biological model systems that may not completely correlate with the outcome of interventions in humans. In practical experiment design, one aims to find a set of interventions which maximally move a target phenotype via a diverse set of mechanisms in order to reduce the risk of failure in future stages of trials. To that end, we introduce DiscoBAX — a sample-efficient algorithm for the discovery of genetic interventions that maximize the movement of a phenotype in a direction of interest while covering a diverse set of underlying mechanisms. We provide theoretical guarantees on the optimality of the approach under standard assumptions, conduct extensive experiments in synthetic and real-world settings relevant to genomic discovery and demonstrate that DiscoBAX outperforms state-of-the-art active learning and Bayesian optimization methods in this task. Better methods for selecting effective and diverse perturbations in biological systems could enable researchers to potentially discover novel therapeutics for a range of genetically-driven diseases.","Optimal experiment design, Bayesian Algorithm Execution, Active learning, Genetic intervention, Drug design" How to Enable Uncertainty Estimation in Proximal Policy Optimization,https://openreview.net/forum?id=XKq49kJ5mZX,https://openreview.net/pdf?id=XKq49kJ5mZX,"A setup and comparison of out-of-distribution detection methods for PPO, with Masksembles as a novel, well-performing method in the setting of RL"," While deep reinforcement learning (RL) agents have showcased strong results across many domains, a major concern is their inherent opaqueness and the safety of such systems in real-world use cases. To overcome these issues, we need agents that can quantify their uncertainty and detect out-of-distribution (OOD) states. Existing uncertainty estimation techniques, like Monte-Carlo Dropout or Deep Ensembles, have not seen widespread adoption in on-policy deep RL. We posit that this is due to two reasons: first, concepts like uncertainty and OOD states are not well defined compared to supervised learning, especially for on-policy RL methods. Secondly, available implementations and comparative studies for uncertainty estimation methods in RL have been limited.
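The "out-of-training-sample" objective of the Neural Network Partitioning entry above can be mimicked in a heavily simplified form: sub-models fit individual data shards and the hyperparameter is scored on shards each sub-model never saw. The real method shares one partitioned network across shards; the sketch below only conveys the objective idea.

```python
# Heavily simplified analogue of the partitioned "out-of-training-sample" loss:
# K ridge regressors each fit one shard, and the hyperparameter lam is scored
# on the shards each regressor never saw. Only the objective idea is shown.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5)); w_true = rng.normal(size=5)
y = X @ w_true + 0.5 * rng.normal(size=120)
shards = np.array_split(np.arange(120), 3)               # K = 3 data shards

def oos_loss(lam):
    losses = []
    for k, idx in enumerate(shards):
        Xk, yk = X[idx], y[idx]                          # shard seen by model k
        w = np.linalg.solve(Xk.T @ Xk + lam * np.eye(5), Xk.T @ yk)
        unseen = np.concatenate([s for j, s in enumerate(shards) if j != k])
        losses.append(np.mean((X[unseen] @ w - y[unseen]) ** 2))
    return np.mean(losses)                               # hyperparameter score

for lam in [0.0, 0.1, 1.0, 10.0]:
    print(lam, round(oos_loss(lam), 4))
```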
To overcome the first gap, we propose definitions of uncertainty and OOD for Actor-Critic RL algorithms, namely, proximal policy optimization (PPO), and present possible applicable measures. In particular, we discuss the concepts of value and policy uncertainty. The second point is addressed by implementing different uncertainty estimation methods and comparing them across a number of environments. The OOD detection performance is evaluated via a custom evaluation benchmark of in-distribution (ID) and OOD states for various RL environments. We identify a trade-off between reward and OOD detection performance. To overcome this, we formulate a Pareto optimization problem in which we simultaneously optimize for reward and OOD detection performance. We show experimentally that the recently proposed method of Masksembles strikes a favourable balance among the surveyed methods, enabling high-quality uncertainty estimation and OOD detection while matching the performance of original RL agents.","Reinforcement Learning, Uncertainty Estimation, Out-of-distribution detection, Proximal Policy Optimization" Joint-Predictive Representations for Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=S80ioOGLpD9,https://openreview.net/pdf?id=S80ioOGLpD9,,"The recent advances in reinforcement learning have demonstrated the effectiveness of vision-based self-supervised learning (SSL). However, the main efforts in this direction have been devoted to the single-agent setting, leaving multi-agent reinforcement learning~(MARL) lagging thus far. There are two significant obstacles that prevent applying off-the-shelf SSL approaches with MARL on a partially observable multi-agent system: (a) each agent only gets a partial observation, and (b) previous SSL approaches only take consistent temporal representations into account, while ignoring the characterization that captures the interaction and fusion among agents. In this paper, we propose \textbf{M}ulti-\textbf{A}gent \textbf{Jo}int-Predictive \textbf{R}epresentations~(MAJOR), a novel framework to explore self-supervised learning on cooperative MARL. Specifically, we treat the latent representations of local observations of all agents as the sequence of masked contexts of the global state, and we then learn effective representations by predicting the future latent representations for each agent with the help of the agent-level information interactions in a joint transition model. We have conducted extensive experiments on a wide range of MARL environments, including both vision-based and state-based scenarios, and show that our proposed MAJOR achieves superior asymptotic performance and sample efficiency against other state-of-the-art methods.", "Symmetries, Flat Minima and the Conserved Quantities of Gradient Flow",https://openreview.net/forum?id=9ZpciCOunFb,https://openreview.net/pdf?id=9ZpciCOunFb,We introduce a framework for finding linear and nonlinear continuous symmetries in deep learning and show how they lead to extended local minima and conserved quantities,"Empirical studies have revealed that many minima are connected and reside in low-loss valleys in the loss landscape of deep networks. Ensemble models sampling different parts of a low-loss valley have reached SOTA performance. Yet, little is known about the theoretical origin of these low-loss valleys. We present a general framework for finding continuous symmetries in the parameter space, which give rise to the low-loss valleys.
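One standard route to the value uncertainty discussed in the PPO entry above is disagreement across an ensemble of value heads. A minimal sketch follows, with an illustrative architecture and threshold that are not the paper's setup.

```python
# Sketch of ensemble-based value uncertainty for an actor-critic agent: the
# spread of an ensemble of value heads serves as an OOD score for a state.
import torch
import torch.nn as nn

class ValueEnsemble(nn.Module):
    def __init__(self, obs_dim: int, n_heads: int = 5):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
             for _ in range(n_heads)])

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return torch.cat([h(obs) for h in self.heads], dim=-1)  # (B, n_heads)

ens = ValueEnsemble(obs_dim=8)
obs = torch.randn(4, 8)
values = ens(obs)
value_uncertainty = values.std(dim=-1)   # disagreement across heads
is_ood = value_uncertainty > 1.0         # illustrative threshold
print(value_uncertainty, is_ood)
```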
Importantly, we introduce a novel set of nonlinear, data-dependent symmetries for neural networks. These symmetries can transform a trained model such that it performs similarly on new samples. We then show that conserved quantities associated with linear symmetries can be used to define coordinates along low-loss valleys. The conserved quantities help reveal that using common initialization methods, gradient flow only explores a small part of the global minimum. By relating conserved quantities to convergence rate and sharpness of the minimum, we provide insights on how initialization impacts convergence and generalizability. We also find the nonlinear action to be viable for ensemble building to improve robustness under certain adversarial attacks.","symmetry, gradient flow, conserved quantity, flat minima, Lie group, Lie algebra" DP-SGD-LF: Improving Utility under Differentially Private Learning via Layer Freezing,https://openreview.net/forum?id=coLtCLTHFbW,https://openreview.net/pdf?id=coLtCLTHFbW,,"Differentially Private SGD (DP-SGD) is a widely known substitute for SGD to train deep learning models with privacy guarantees. However, privacy guarantees come at a cost in model utility. The key DP-SGD steps responsible for this utility cost are per-sample gradient clipping, which introduces bias, and adding noise to the aggregated (clipped) gradients, which increases the variance of model updates. Inspired by the observation that different layers in a neural network often converge at different rates following a bottom-up pattern, we incorporate layer freezing into DP-SGD to increase model utility at a fixed privacy budget. Through theoretical analysis and empirical evidence, we show that layer freezing improves model utility, by reducing both the bias and variance introduced by gradient clipping and noising. These improvements in turn lead to better model accuracy, and empirically generalize over multiple datasets, models, and privacy budgets.", Explainability as statistical inference,https://openreview.net/forum?id=GKB566-8WkZ,https://openreview.net/pdf?id=GKB566-8WkZ,We propose to embed any classification or regression model in a framework that casts interpretability as a maximum likelihood problem.,"A wide variety of model explanation approaches have been proposed in recent years, all guided by very different rationales and heuristics. In this paper, we take a new route and cast interpretability as a statistical inference problem. We propose a general deep probabilistic model designed to produce interpretable predictions. The model’s parameters can be learned via maximum likelihood, and the method can be adapted to any predictor network architecture and any type of prediction problem. Our method is a case of amortized interpretability models, where a neural network is used as a selector to allow for fast interpretation at inference time. Several popular interpretability methods are shown to be particular cases of regularised maximum likelihood for our general model. We propose new datasets with ground-truth selection, which allow for the evaluation of feature importance maps.
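A concrete instance of the linear parameter-space symmetries in the "Symmetries, Flat Minima" entry above is the positive rescaling symmetry of ReLU networks, which is easy to verify numerically (the paper's novel symmetries are nonlinear and data-dependent; this sketch only shows the classical linear case).

```python
# Numerical check of a well-known continuous symmetry of ReLU networks: scaling
# a hidden unit's incoming weights by a > 0 and its outgoing weights by 1/a
# leaves the network function unchanged.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 4)), rng.normal(size=16)
W2 = rng.normal(size=(1, 16))

def net(x, W1, b1, W2):
    return W2 @ np.maximum(W1 @ x + b1, 0.0)   # one-hidden-layer ReLU net

x = rng.normal(size=4)
a = rng.uniform(0.5, 2.0, size=16)             # one positive scale per unit
W1s, b1s, W2s = a[:, None] * W1, a * b1, W2 / a  # group action on parameters

# Outputs agree up to floating-point error:
print(net(x, W1, b1, W2), net(x, W1s, b1s, W2s))
```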
Using these datasets, we show experimentally that multiple imputation provides more reasonable interpretations.","Interpretability, Explainability, Statistical Learning, Imputation" Concept-based Explanations for Out-of-Distribution Detectors,https://openreview.net/forum?id=9rRhMKNOkeT,https://openreview.net/pdf?id=9rRhMKNOkeT,We propose the first work to provide concept-based explanations for out-of-distribution detectors.,"Out-of-distribution (OOD) detection plays a crucial role in ensuring the safe deployment of deep neural network (DNN) classifiers. While a myriad of methods have focused on improving the performance of OOD detectors, a critical gap remains in interpreting their decisions. We help bridge this gap by providing explanations for OOD detectors based on learned high-level concepts. We first propose two new metrics for assessing the effectiveness of a particular set of concepts for explaining OOD detectors: 1) $\textit{detection completeness}$, which quantifies the sufficiency of concepts for explaining an OOD-detector's decisions, and 2) $\textit{concept separability}$, which captures the distributional separation between in-distribution and OOD data in the concept space. Based on these metrics, we propose a framework for learning a set of concepts that satisfy the desired properties of detection completeness and concept separability, and demonstrate the framework's effectiveness in providing concept-based explanations for diverse OOD detection techniques. We also show how to identify prominent concepts that contribute to the detection results via a modified Shapley value-based importance score.","out-of-distribution detection, interpretability, concept-based explanations" FaDIn: Fast Discretized Inference for Hawkes Processes with General Parametric Kernels,https://openreview.net/forum?id=Z2Kgq-czhh,https://openreview.net/pdf?id=Z2Kgq-czhh,,"Temporal point processes (TPP) are a natural tool for modeling event-based data. Among all TPP models, Hawkes processes have proven to be the most widely used, mainly due to their simplicity and computational ease when considering exponential or non-parametric kernels. Although non-parametric kernels are an option, such models require large datasets. While exponential kernels are more data-efficient and relevant for certain applications where events immediately trigger more events, they are ill-suited for applications where latencies need to be estimated, such as in neuroscience. This work aims to offer an efficient solution to TPP inference using general parametric kernels with finite support. The developed solution consists of a fast L2 gradient-based solver leveraging a discretized version of the events. After supporting the use of discretization theoretically, the statistical and computational efficiency of the novel approach is demonstrated through various numerical experiments. Finally, the effectiveness of the method is evaluated by modeling the occurrence of stimuli-induced patterns from brain signals recorded with magnetoencephalography (MEG). Given the use of general parametric kernels, results show that the proposed approach leads to a more plausible estimation of pattern latency compared to the state-of-the-art.
","Hawkes processes, Neuroscience" Summarization Programs: Interpretable Abstractive Summarization with Neural Modular Trees,https://openreview.net/forum?id=ooxDOe7ZtBe,https://openreview.net/pdf?id=ooxDOe7ZtBe,An interpretable framework for abstractive summarization with neural modular trees,"Current abstractive summarization models either suffer from a lack of clear interpretability or provide incomplete rationales by only highlighting parts of the source document. To this end, we propose the Summarization Program (SP), an interpretable modular framework consisting of an (ordered) list of binary trees, each encoding the step-by-step generative process of an abstractive summary sentence from the source document. A Summarization Program contains one root node per summary sentence, and a distinct tree connects each summary sentence (root node) to the document sentences (leaf nodes) from which it is derived, with the connecting nodes containing intermediate generated sentences. Edges represent different modular operations involved in summarization such as sentence fusion, compression, and paraphrasing. We first propose an efficient best-first search method over neural modules, SP-Search that identifies SPs for human summaries by directly optimizing for ROUGE scores. Next, using these programs as automatic supervision, we propose seq2seq models that generate Summarization Programs, which are then executed to obtain final summaries. We demonstrate that SP-Search effectively represents the generative process behind hu- man summaries using modules that are typically faithful to their intended behavior. We also conduct a simulation study to show that Summarization Programs improve the interpretability of summarization models by allowing humans to better simulate model reasoning. Summarization Programs constitute a promising step toward interpretable and modular abstractive summarization, a complex task previously addressed primarily through blackbox end-to-end neural systems.", Planning with Large Language Models for Code Generation,https://openreview.net/forum?id=Lr8cOOtYbfL,https://openreview.net/pdf?id=Lr8cOOtYbfL,We provide a novel framework for code generation by combining the advantages of a large language model and a planning algorithm.,"Existing large language model-based code generation pipelines typically use beam search or sampling algorithms during the decoding process. Although the programs they generate achieve high token-matching-based scores, they often fail to compile or generate incorrect outputs. The main reason is that conventional Transformer decoding algorithms may not be the best choice for code generation. In this work, we propose a novel Transformer decoding algorithm, Planning-Guided Transformer Decoding (PG-TD), that uses a planning algorithm to do lookahead search and guide the Transformer to generate better programs. Specifically, instead of simply optimizing the likelihood of the generated sequences, the Transformer makes use of a planner that generates complete programs and tests them on public test cases. The Transformer can therefore make more informed decisions and generate tokens that will eventually lead to higher-quality programs. We also design a mechanism that shares information between the Transformer and the planner to make our algorithm computationally efficient. 
We empirically evaluate our framework with several large language models as backbones on public coding challenge benchmarks, showing that 1) it can generate programs that consistently achieve higher performance compared with competing baseline methods; 2) it enables controllable code generation, such as concise code and highly commented code, by optimizing a modified objective.","Large Language Model, Code Generation, Planning" Unleash Model Capacity for Universal Dense Retrieval by Task Specialty Optimization,https://openreview.net/forum?id=JzrPpPnTUhk,https://openreview.net/pdf?id=JzrPpPnTUhk,,"Universal dense retrieval, with one unified representation space to empower various retrieval scenarios, has many appealing advantages in simplicity, efficiency, and potential to break echo chambers with cross-scenario information access. However, standard multi-task trained dense retrievers often fail to meet the accuracy of scenario-specific models. In this paper, we analyze multi-task learning in universal retrieval and show that the model capacity is not the main bottleneck. Rather, the optimization fails to fully utilize the network parameters to capture task-specific signals. This motivated our development of TACO-DR, which conducts multi-task learning for universal retrieval with TAsk speCialty Optimization. TACO-DR dynamically adjusts the learning rate for each parameter regarding each task based on its task-specific sensitivity, to encourage parameters to better capture task-specific signals. On the KILT benchmark, TACO-DR outperforms various multi-task learning methods and achieves better overall accuracy than single-task models. Our analysis shows that TACO-DR better utilizes the model capacity with more task-specific parameters. Our code and model checkpoints will be open-sourced.","Dense Retrieval, Multi-task, Parameter sensitivity" Training Equilibria in Reinforcement Learning,https://openreview.net/forum?id=lpxeg8dhJ-,https://openreview.net/pdf?id=lpxeg8dhJ-,"We study conditions under which RL algorithms get stuck in local optima, and how to mitigate them.","In partially observable environments, reinforcement learning algorithms such as policy gradient and Q-learning may have multiple equilibria---policies that are stable under further training---and can converge to policies that are strictly suboptimal. Prior work blames insufficient exploration, but suboptimal equilibria can arise despite full exploration and other favorable circumstances like a flexible policy parametrization. We show theoretically that the core problem is that in partially observed environments, an agent's past actions induce a distribution on hidden states. Equipping the policy with memory helps it model the hidden state and leads to convergence to a higher-reward equilibrium, \emph{even when there exists a memoryless optimal policy}. Experiments show that policies with insufficient memory tend to learn to use the environment as auxiliary memory, and parameter noise helps policies escape suboptimal equilibria.
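The sensitivity-scaled learning rates of the TACO-DR entry above can be caricatured in a few lines; the gradient-magnitude proxy for task-specific sensitivity assumed below may differ from the paper's exact measure.

```python
# Toy sketch in the spirit of TACO-DR's task-specialty optimization: scale each
# parameter's learning rate by its sensitivity (here, gradient magnitude) for
# the current task. The paper's exact sensitivity measure may differ.
import torch

params = [torch.randn(10, requires_grad=True)]
base_lr, eps = 0.1, 1e-8

def sensitivity_scaled_step(loss):
    loss.backward()
    with torch.no_grad():
        for p in params:
            sens = p.grad.abs()
            scale = sens / (sens.mean() + eps)   # per-parameter, per-task scale
            p -= base_lr * scale * p.grad        # larger steps where sensitive
            p.grad = None

x = torch.randn(32, 10)
loss = ((x @ params[0]) ** 2).mean()             # stand-in task loss
sensitivity_scaled_step(loss)
```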
","theory, reinforcement learning, learning dynamics, partial observability, MDP, POMDP, markov decision processes" Hebbian Deep Learning Without Feedback,https://openreview.net/forum?id=8gd4M-_Rj1,https://openreview.net/pdf?id=8gd4M-_Rj1,"Advancing the state of the art in bio-plausible Deep Learning, and the plausibility of DL, through Hebbian plasticity and soft winner-take-all nets.","Recent approximations to backpropagation (BP) have mitigated many of BP's computational inefficiencies and incompatibilities with biology, but important limitations still remain. Moreover, the approximations significantly decrease accuracy in benchmarks, suggesting that an entirely different approach may be more fruitful. Here, grounded on recent theory for Hebbian learning in soft winner-take-all networks, we present multilayer SoftHebb, i.e. an algorithm that trains deep neural networks, without any feedback, target, or error signals. As a result, it achieves efficiency by avoiding weight transport, non-local plasticity, time-locking of layer updates, iterative equilibria, and (self-) supervisory or other feedback signals – which were necessary in other approaches. Its increased efficiency and biological compatibility do not trade off accuracy compared to state-of-the-art bio-plausible learning, but rather improve it. With up to five hidden layers and an added linear classifier, accuracies on MNIST, CIFAR-10, STL-10, and ImageNet, respectively reach 99.4%, 80.3%, 76.2%, and 27.3%. In conclusion, SoftHebb shows with a radically different approach from BP that Deep Learning over few layers may be plausible in the brain and increases the accuracy of bio-plausible machine learning.","Hebbian, winner-take-all, cortical circuits, unsupervised, online, biologically plausible, neuromorphic" A Simulation-based Framework for Robust Federated Learning to Training-time Attacks,https://openreview.net/forum?id=_5Q4covjmH,https://openreview.net/pdf?id=_5Q4covjmH,We frame robust distributed learning problem as a game between a server and an adversary that optimizes strong training-time attacks.,"Well-known robust aggregation schemes in federated learning (FL) are shown to be vulnerable to an informed adversary who can tailor training-time attacks [Fang et al., Xie et al.]. We frame robust distributed learning problem as a game between a server and an adversary that is able to optimize strong training-time attacks. We introduce RobustTailor, a simulation-based framework that prevents the adversary from being omniscient. The simulated game we propose enjoys theoretical guarantees through a regret analysis. RobustTailor improves robustness to training-time attacks significantly while preserving almost the same privacy guarantees as standard robust aggregation schemes in FL. Empirical results under challenging attacks show that RobustTailor performs similar to an upper bound with perfect knowledge of honest clients.","Robust federated learning, training-time attacks, game theory" Key Design Choices for Double-transfer in Source-free Unsupervised Domain Adaptation,https://openreview.net/forum?id=-PL1Gk4jt7,https://openreview.net/pdf?id=-PL1Gk4jt7,We systematically analyze the impact of the main design choices in Source-free Unsupervised Domain Adaptation through a large-scale empirical study.,"Fine-tuning and Domain Adaptation emerged as effective strategies for efficiently transferring deep learning models to new target tasks. However, target domain labels are not accessible in many real-world scenarios. 
This led to the development of Unsupervised Domain Adaptation (UDA) methods, which only employ unlabeled target samples. Furthermore, efficiency and privacy requirements may also prevent the use of source domain data during the adaptation stage. This particularly challenging setting, known as Source-free Unsupervised Domain Adaptation (SF-UDA), is still understudied. In this paper, we systematically analyze the impact of the main design choices in SF-UDA through a large-scale empirical study on 500 models and 74 domain pairs. We identify the normalization approach, pre-training strategy, and backbone architecture as the most critical factors. Based on our observations, we propose recipes to best tackle SF-UDA scenarios. Moreover, we show that SF-UDA performs competitively beyond standard benchmarks and backbone architectures as well, performing on par with UDA at a fraction of the data and computational cost. Experimental data and code will be released upon acceptance.","Transfer Learning, Unsupervised Domain Adaptation" PALM: Preference-based Adversarial Manipulation against Deep Reinforcement Learning,https://openreview.net/forum?id=YzOEjv-7nP,https://openreview.net/pdf?id=YzOEjv-7nP,A preference-based adversarial attack method that manipulates the victim policy to perform human desired behaviors.,"To improve the robustness of DRL agents, it is important to study their vulnerability under adversarial attacks that would lead to extreme behaviors desired by adversaries. Preference-based RL (PbRL) aims to learn desired behaviors from human preferences. In this paper, we propose PALM, a preference-based adversarial manipulation method against DRL agents, which adopts human preferences to perform targeted attacks with the assistance of an intention policy and a weighting function. The intention policy is trained based on the PbRL framework to guide the adversarial policy to mitigate restrictions of the victim policy during exploration, and the weighting function learns weight assignment to improve the performance of the adversarial policy. Theoretical analysis demonstrates that PALM converges to critical points under some mild conditions. Empirical results on a few manipulation tasks of Meta-world show that PALM exceeds the performance of state-of-the-art adversarial attack methods under the targeted setting. Additionally, we show the vulnerability of the offline RL agents by fooling them into behaving as humans desire on several Mujoco tasks. Our code and videos are available at https://sites.google.com/view/palm-adversarial-attack.","adversarial attack, deep reinforcement learning, preference-based reinforcement learning, bi-level optimization" Architectural optimization over subgroups of equivariant neural networks,https://openreview.net/forum?id=a6rCdfABJXg,https://openreview.net/pdf?id=a6rCdfABJXg,"Towards architectural optimization over subgroups of equivariant neural networks, we present two mechanisms for approximate equivariance over subgroups and two equivariance-aware neural architecture search algorithms that utilize them.","Incorporating equivariance to symmetry groups as a constraint during neural network training can improve performance and generalization for tasks exhibiting those symmetries, but such symmetries are often neither perfectly nor explicitly present. This motivates algorithmically optimizing the architectural constraints imposed by equivariance.
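As one concrete example of the normalization design choice studied in the SF-UDA entry above, AdaBN-style adaptation re-estimates BatchNorm statistics on unlabeled target data while touching no other weights. The sketch below shows that generic technique, not the paper's recommended recipe.

```python
# AdaBN-style sketch for source-free adaptation: recompute BatchNorm running
# statistics on unlabeled target batches; no labels and no weight updates.
import torch
import torch.nn as nn

def adapt_bn_stats(model: nn.Module, target_loader) -> nn.Module:
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.reset_running_stats()      # forget source-domain statistics
            m.momentum = None            # use a cumulative average instead
    model.train()                        # BN updates running stats in train mode
    with torch.no_grad():
        for x in target_loader:          # unlabeled target batches
            model(x)
    return model.eval()

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
target_loader = [torch.randn(16, 3, 32, 32) for _ in range(10)]
adapt_bn_stats(model, target_loader)
```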
We propose the equivariance relaxation morphism, which preserves functionality while reparameterizing a group equivariant layer to operate with equivariance constraints on a subgroup, as well as the $[G]$-mixed equivariant layer, which mixes layers constrained to different groups to enable within-layer equivariance optimization. We further present evolutionary and differentiable neural architecture search (NAS) algorithms that utilize these mechanisms respectively for equivariance-aware architectural optimization. Experiments across a variety of datasets show the benefit of dynamically constrained equivariance to find effective architectures with approximate equivariance.","equivariance, neural architecture search, geometric deep learning" Unsupervised Non-Parametric Signal Separation Using Bayesian Neural Networks,https://openreview.net/forum?id=xzqyoU4PsUj,https://openreview.net/pdf?id=xzqyoU4PsUj,The authors propose using BNNs as building blocks of graphical models and apply it to the spectral/spatial additive mixture example (signal/background).,"Bayesian neural networks (BNN) take the best from two worlds: the one of flexible and scalable neural networks and the one of probabilistic graphical models, the latter allowing for probabilistic interpretation of inference results. We take one extra step towards unification of these two domains and render BNN as an elementary unit of abstraction in the framework of probabilistic modeling, which allows us to promote well-known distributions to distribution fields. We use transformations to obtain field versions of several popular distributions and demonstrate the utility of our approach on the problem of signal/background separation. Starting from prior knowledge that a certain region of space contains predominantly one of the components, in an unsupervised and non-parametric manner, we recover the representation of both previously unseen components as well as their proportions.","bayesian neural networks, probabilistic graphical models, signal disaggregation" SPIDER: Searching Personalized Neural Architecture for Federated Learning,https://openreview.net/forum?id=BW9KtL-bott,https://openreview.net/pdf?id=BW9KtL-bott,SPIDER searches and trains heterogeneous architectures in a federated learning setting to achieve the objective of personalization.,"Federated learning (FL) is an efficient learning framework that assists distributed machine learning when data cannot be shared with a centralized server. Recent advancements in FL use predefined architecture-based learning for all the clients. However, given that clients' data are invisible to the server and data distributions are non-identical across clients, a predefined architecture discovered in a centralized setting may not be an optimal solution for all the clients in FL. Motivated by this challenge, we introduce SPIDER, an algorithmic framework that aims to Search Personalized neural architecture for feDERated learning. SPIDER is designed based on two unique features: (1) alternately optimizing one architecture-homogeneous global model (Supernet) in a generic FL manner and one architecture-heterogeneous local model that is connected to the global model by weight-sharing-based regularization; and (2) achieving the architecture-heterogeneous local model via an operation-level, perturbation-based neural architecture search method.
Experimental results demonstrate that SPIDER outperforms other state-of-the-art personalization methods with far less hyperparameter tuning.","Personalized Neural Architecture Search, Data Heterogeneity, Personalized Federated Learning" On Gradient Descent Convergence beyond the Edge of Stability,https://openreview.net/forum?id=JgwnZxlxA46,https://openreview.net/pdf?id=JgwnZxlxA46,We prove convergence results of Gradient Descent beyond Edge of Stability in several nonlinear and high-dimensional problems.,"Gradient Descent (GD) is a powerful workhorse of modern machine learning thanks to its scalability and efficiency in high-dimensional spaces. Its ability to find local minimisers is only guaranteed for losses with Lipschitz gradients, where it can be seen as a `bona-fide' discretisation of an underlying gradient flow. Yet, many ML setups involving overparametrised models do not fall into this problem class, which has motivated research beyond the so-called ``Edge of Stability'' (EoS), where the step-size crosses the admissibility threshold inversely proportional to the Lipschitz constant above. Perhaps surprisingly, GD has been empirically observed to still converge regardless of local instability and oscillatory behavior. The incipient theoretical analysis of this phenomenon has mainly focused on the overparametrised regime, where the effect of choosing a large learning rate may be associated with a `Sharpness-Minimisation' implicit regularisation within the manifold of minimisers, under appropriate asymptotic limits. In contrast, in this work we directly examine the conditions for such unstable convergence, focusing on simple, yet representative, learning problems. Specifically, we characterize a local condition involving third-order derivatives that stabilizes oscillations of GD above the EoS, and leverage this property in a teacher-student setting, under population loss. Finally, focusing on Matrix Factorization, we establish a non-asymptotic `Local Implicit Bias' of GD above the EoS, whereby quasi-symmetric initializations converge to symmetric solutions --- where sharpness is minimum amongst all minimisers. ","gradient descent, edge of stability" Synaptic Dynamics Realize First-order Adaptive Learning and Weight Symmetry,https://openreview.net/forum?id=adN-ccNeW4d,https://openreview.net/pdf?id=adN-ccNeW4d,,"Gradient-based first-order adaptive optimization methods such as the Adam optimizer are prevalent in training artificial networks, achieving state-of-the-art results. This work attempts to answer the question of whether it is viable for biological neural systems to adopt such optimization methods. To this end, we demonstrate a realization of the Adam optimizer using biologically-plausible mechanisms in synapses. The proposed learning rule has clear biological correspondence, runs continuously in time, and achieves performance comparable to Adam's. In addition, we present a new approach, inspired by the predisposition property of synapses observed in neuroscience, to circumvent the biological implausibility of the weight transport problem in backpropagation (BP). With only local information and no separate training phases, this method establishes and maintains weight symmetry in the forward and backward signaling paths, and is applicable to the proposed biologically plausible Adam learning rule.
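The stability threshold at the heart of the Edge of Stability entry above can be reproduced on a one-dimensional quadratic, where gradient descent is stable exactly when the step-size is below 2 divided by the sharpness:

```python
# Gradient descent on the quadratic f(x) = L * x**2 / 2 with sharpness L: the
# classical stability threshold is lr < 2/L. Slightly above it, iterates on a
# fixed quadratic oscillate and diverge; the paper's point is that nonlinear
# problems can nonetheless still converge in that regime.
L = 4.0
for lr in [0.49, 0.51]:                  # 2/L = 0.5 is the edge of stability
    x = 1.0
    for _ in range(50):
        x -= lr * L * x                  # GD step on f(x) = L * x**2 / 2
    print(f"lr={lr}: |x|={abs(x):.3e}")  # below: decays; above: blows up
```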
The aforementioned mechanisms may shed light on the way in which biological synaptic dynamics facilitate learning.","synapses, optimizer, biologically plausible, weight transport, Adam" FedAvg Converges to Zero Training Loss Linearly: The Power of Overparameterized Multi-Layer Neural Networks,https://openreview.net/forum?id=_5sMa2sdU4,https://openreview.net/pdf?id=_5sMa2sdU4,,"Federated Learning (FL) is a distributed learning paradigm that allows multiple clients to learn a joint model by utilizing privately held data at each client. Significant research efforts have been devoted to developing advanced algorithms that deal with the situation where the data at individual clients have different distributions (i.e., the data heterogeneity issue). In this work, we show that data heterogeneity can be dealt with from a different perspective. That is, by utilizing a certain overparameterized multi-layer neural network at each client, even the vanilla FedAvg (a.k.a. the Local SGD) algorithm can accurately optimize the training problem. Specifically, when each client has a neural network with one wide layer of size $N$ (where $N$ is the number of total training samples), followed by layers of smaller widths, FedAvg converges linearly to a solution that achieves (almost) zero training loss, without requiring any assumptions on the data distributions at each client. To our knowledge, this is the first work that demonstrates such resilience to data heterogeneity for FedAvg when trained on multi-layer neural networks. Our experiments also confirm that neural networks of large size can achieve better and more stable performance for FL problems.","Overparameterized Neural Network, FedAvg" Robustifying Language Models via Adversarial Training with Masked Gradient,https://openreview.net/forum?id=fKemamaw9M,https://openreview.net/pdf?id=fKemamaw9M,,"Fine-tuning pre-trained language models (LMs) has become the de-facto standard method for improving state-of-the-art performances on various NLP tasks. Although these models are usually evaluated with accuracy on fixed validation sets, it is insufficient for the reliable deployment of fine-tuned LMs in real-world settings, as there are known issues within existing model evaluations, such as adversarial robustness and model calibration. To address such issues, we propose a simple yet effective training algorithm, coined Robustifying LMs via Adversarial training with Masked gradient (RAM), to improve the robustness of fine-tuned LMs. In particular, we leverage adversarial training to robustify LMs for various types of perturbations. Simultaneously, to prevent the trained model from largely deviating from the initial pre-trained model, we selectively update the important model parameters using the masked gradients; their relative importance is obtained from the gradients calculated during training. Consequently, it enables the model to preserve the generalizability of the pre-trained model while improving its robustness. Additionally, we construct a new benchmark to evaluate the robustness of fine-tuned LMs in terms of four representative aspects of model robustness in a unified way. Under these benchmarks, we demonstrate the effectiveness of RAM compared to other state-of-the-art fine-tuning methods, and verify that RAM is successfully robustifying various types of LMs.
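For reference alongside the Synaptic Dynamics entry above, the first-order adaptive rule it realizes in synapses is the standard Adam update, written out below in plain numpy (following Kingma and Ba's formulation).

```python
# Plain Adam step: first/second moment estimates with bias correction.
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g                 # first-moment estimate
    v = b2 * v + (1 - b2) * g ** 2            # second-moment estimate
    m_hat = m / (1 - b1 ** t)                 # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.zeros(3); m = np.zeros(3); v = np.zeros(3)
for t in range(1, 101):                       # minimize f(w) = ||w - 1||^2 / 2
    g = w - np.ones(3)
    w, m, v = adam_step(w, g, m, v, t)
print(w)                                      # moves toward the optimum at 1
```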
Our work suggests a rethinking of the robustness aspect of LMs as an essential direction for their reliable deployment, along with a simple yet effective solution.","NLP, language model, robustness, classification" Robust Graph Representation Learning via Predictive Coding,https://openreview.net/forum?id=3LUxNRrhK1,https://openreview.net/pdf?id=3LUxNRrhK1,"For the first time, we use predictive coding in deep geometric learning and demonstrate that we can enhance the robustness of representation learning through energy minimization.","Graph neural networks have recently shown outstanding results in diverse types of tasks in machine learning, providing interdisciplinary state-of-the-art performance on structured data. However, they have been proved to be vulnerable to imperceptible adversarial attacks and shown to be unfit for out-of-distribution generalisation. Here, we address this problem by introducing a novel message-passing scheme based on the theory of predictive coding, an energy-based alternative to back-propagation that has its roots in neuroscience. As both graph convolution and predictive coding can be seen as low-pass filtering mechanisms, we postulate that predictive coding adds a second efficient filter to the messaging passing process which enhances the robustness of the learned representation. Through an extensive set of experiments, we show that the proposed model attains comparable performance to its graph convolution network counterpart, delivering strictly better performance on inductive tasks. Most importantly, we show that the energy minimization enhances the robustness of the produced representation and can be leveraged to further calibrate our models and provide representations that are more robust against advanced graph adversarial attacks. ","Predictive coding, deep geometric learning, deep learning, machine learning, bio-inspired learning, neuroscience" Accelerating Hamiltonian Monte Carlo via Chebyshev Integration Time,https://openreview.net/forum?id=FbRY1XVfwK,https://openreview.net/pdf?id=FbRY1XVfwK,,"Hamiltonian Monte Carlo (HMC) is a popular method in sampling. While there are quite a few works studying various aspects of this method, an interesting question is how to choose its integration time to achieve acceleration. In this work, we consider accelerating the process of sampling from a distribution $\pi(x) \propto \exp(-f(x))$ via HMC with time-varying integration time. When the potential $f$ is $L$-smooth and $m$-strongly convex, i.e. for sampling from a log-smooth and strongly log-concave target distribution $\pi$, it is known that under a constant integration time, the number of iterations that ideal HMC takes to get within Wasserstein-2 distance $\epsilon$ of the target $\pi$ is $O( \kappa \log \frac{1}{\epsilon} )$, where $\kappa := \frac{L}{m}$ is the condition number. We propose a scheme of time-varying integration time based on the roots of Chebyshev polynomials. We show that in the case of quadratic potential $f$, i.e. when the target $\pi$ is a Gaussian distribution, ideal HMC with this choice of integration time only takes $O( \sqrt{\kappa} \log \frac{1}{\epsilon} )$ iterations to reach Wasserstein-2 distance less than $\epsilon$; this improvement in the dependence on the condition number is akin to acceleration in optimization. The design and analysis of HMC with the proposed integration time are built on the tools of Chebyshev polynomials.
Experiments demonstrate the advantage of adopting our scheme of time-varying integration time even when sampling from distributions with smooth, strongly convex potentials that are not quadratic. ", PromptCAL: Contrastive Affinity Learning via Auxiliary Prompts for Generalized Category Discovery,https://openreview.net/forum?id=yVcLmMW5ySI,https://openreview.net/pdf?id=yVcLmMW5ySI,Our approach seeks to discover known and unknown classes in unlabelled datasets using affinity relationships between samples via auxiliary prompts.,"Recent advances in semi-supervised learning (SSL) have achieved remarkable success in learning with partially labeled in-distribution data. However, many existing SSL models fail to learn on unlabeled data sampled from novel semantic classes and thus rely on the closed-set assumption. In this work, we adopt the open-set SSL setting and target a pragmatic but under-explored generalized category discovery (GCD) setting. The GCD setting aims to categorize unlabeled training data coming from known or unknown novel classes by leveraging the information in the labeled data. We propose a two-stage contrastive affinity learning method with auxiliary visual prompts, dubbed PromptCAL, to address this challenging problem, which can discover reliable affinities between labeled and unlabelled samples to further learn better clusters for both known and novel classes. Specifically, we first embed learnable visual prompts into a pre-trained visual transformer (ViT) backbone and supervise these prompts with an auxiliary loss to reinforce semantic discriminativeness and learn generalizable affinity relationships. Secondly, we propose an affinity-based contrastive loss based on an iterative semi-supervised affinity propagation process which can further enhance the benefits of prompt supervision. Extensive experimental evaluation on six benchmarks demonstrates that our method is effective in discovering novel classes even with limited annotations and surpasses the current state-of-the-art on all six benchmark datasets (by more than 10% on CUB and StanfordCars, and by a significant margin on ImageNet-100). Our code and models will be publicly released.","Novel class discovery, General category discovery, Self-supervised learning, Label propagation" Multi-Hypothesis 3D human pose estimation metrics favor miscalibrated distributions,https://openreview.net/forum?id=N3FlFslv_J,https://openreview.net/pdf?id=N3FlFslv_J,"Pose estimation metrics favor overconfident models; we propose cGNF, a model capable of maximizing likelihood and thus estimating accurate and well-calibrated distributions of 3D poses.","Due to depth ambiguities and occlusions, lifting 2D poses to 3D is a highly ill-posed problem. Well-calibrated distributions of possible poses can make these ambiguities explicit and preserve the resulting uncertainty for downstream tasks. This study shows that previous attempts, which account for these ambiguities via multiple-hypothesis generation, produce miscalibrated distributions. We identify that miscalibration can be attributed to the use of sample-based metrics such as $\operatorname{minMPJPE}$. In a series of simulations, we show that minimizing $\operatorname{minMPJPE}$, as commonly done, should converge to the correct mean prediction. However, it fails to correctly capture the uncertainty, thus resulting in a miscalibrated distribution. To mitigate this problem, we propose an accurate and well-calibrated model called Conditional Graph Normalizing Flow (cGNFs).
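The Chebyshev-root construction in the HMC entry above can be sketched as follows; mapping each root lam onto the eigenvalue range [m, L] is standard, while converting it to an integration time proportional to 1/sqrt(lam) is an illustrative assumption, not necessarily the paper's exact schedule.

```python
# Degree-K Chebyshev nodes mapped onto [m, L], the eigenvalue range of a
# log-concave target with m-strongly-convex, L-smooth potential. The final
# conversion of a node lam to an integration time via 1/sqrt(lam) is an
# assumption made for illustration.
import numpy as np

def chebyshev_nodes(K, m, L):
    k = np.arange(1, K + 1)
    roots = np.cos((2 * k - 1) * np.pi / (2 * K))   # Chebyshev roots in (-1, 1)
    return (L + m) / 2 + (L - m) / 2 * roots        # affinely mapped to [m, L]

m, L, K = 1.0, 100.0, 8                             # condition number kappa = 100
nodes = chebyshev_nodes(K, m, L)
integration_times = 1.0 / np.sqrt(nodes)            # assumed conversion
print(integration_times)
```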
Our model is structured such that a single cGNF can estimate both conditional and marginal densities within the same model, effectively solving a zero-shot density estimation problem. We evaluate cGNF on the Human 3.6M dataset and show that cGNF provides a well-calibrated distribution estimate while being close to state-of-the-art in terms of overall $\operatorname{minMPJPE}$. Furthermore, cGNF outperforms previous methods on occluded joints while it remains well-calibrated.","Pose estimation, calibration, metrics, graph neural networks" Learning to Abstain from Uninformative Data,https://openreview.net/forum?id=Zo9MZCOn0u,https://openreview.net/pdf?id=Zo9MZCOn0u,Learning from data that is mostly uninformative by means of a selective loss,"Learning and decision making in domains with naturally high noise-to-signal ratios – such as Finance or Healthcare – can be challenging yet extremely important. In this paper, we study the problem of learning and decision making under a general noisy generative process. The distribution has a significant proportion of uninformative data with high label noise, while part of the data contains useful information represented by low label noise. This dichotomy is present during both training and inference, which requires the proper handling of uninformative data at testing time. We propose a novel approach to learn under these conditions via a loss inspired by selective learning theory. By minimizing the loss, the model is guaranteed to make a near-optimal decision by distinguishing informative data from uninformative data and making predictions. We build upon the strength of our theoretical guarantees by describing an iterative algorithm, which jointly optimizes both a predictor and a selector, and evaluate its empirical performance under a variety of settings.","PAC Learning, Sample Complexity, Selective Learning, Uninformative Data" Order Matters: Agent-by-agent Policy Optimization,https://openreview.net/forum?id=Q-neeWNVv1,https://openreview.net/pdf?id=Q-neeWNVv1,,"While multi-agent trust region algorithms have achieved great empirical success in solving coordination tasks, most of them suffer from a non-stationarity problem because agents update their policies simultaneously. In contrast, a sequential scheme that updates policies agent-by-agent provides another perspective and shows strong performance. However, sample inefficiency and the lack of monotonic improvement guarantees for each agent are still the two significant challenges for the sequential scheme. In this paper, we propose the \textbf{A}gent-by-\textbf{a}gent \textbf{P}olicy \textbf{O}ptimization (A2PO) algorithm to improve the sample efficiency and retain the guarantees of monotonic improvement for each agent during training. We justify the tightness of the monotonic improvement bound compared with other trust region algorithms. From the perspective of sequentially updating agents, we further consider the effect of agent updating order and extend the theory of non-stationarity into the sequential update scheme. To evaluate A2PO, we conduct a comprehensive empirical study on four benchmarks: StarCraftII, Multi-agent MuJoCo, Multi-agent Particle Environment, and Google Research Football full game scenarios. 
A2PO consistently outperforms strong baselines.",Multi-agent Reinforcement Learning "AQuaMaM: An Autoregressive, Quaternion Manifold Model for Rapidly Estimating Complex SO(3) Distributions",https://openreview.net/forum?id=2W6ExpOzWGV,https://openreview.net/pdf?id=2W6ExpOzWGV,,"Accurately modeling complex, multimodal distributions is necessary for optimal decision-making, but doing so for rotations in three dimensions, i.e., the SO(3) group, is challenging due to the curvature of the rotation manifold. The recently described implicit-PDF (IPDF) is a simple, elegant, and effective approach for learning arbitrary distributions on SO(3) up to a given precision. However, inference with IPDF requires $N$ forward passes through the network's final multilayer perceptron—where $N$ places an upper bound on the likelihood that can be calculated by the model—which is prohibitively slow for those without the computational resources necessary to parallelize the queries. In this paper, I introduce AQuaMaM, a neural network capable of both learning complex distributions on the rotation manifold and calculating exact likelihoods for query rotations in a single forward pass. Specifically, AQuaMaM autoregressively models the projected components of unit quaternions as a mixture of uniform distributions that partition their geometrically-restricted domain of values. On an ""infinite"" toy dataset with ambiguous viewpoints, AQuaMaM rapidly converges to a sampling distribution closely matching the true data distribution. In contrast, the sampling distribution for IPDF dramatically diverges from the true data distribution, despite IPDF approaching its theoretical minimum evaluation loss during training. On a constructed dataset of 500,000 renders of a die in different rotations, an AQuaMaM model trained from scratch reaches a log-likelihood 14% higher than an IPDF model using a pretrained ResNet-50. Further, compared to IPDF, AQuaMaM uses 24% fewer parameters, has a prediction throughput 52$\times$ faster on a single GPU, and converges in a similar amount of time during training.", Conformal Prediction is Robust to Label Noise,https://openreview.net/forum?id=yXk83o735o,https://openreview.net/pdf?id=yXk83o735o,,"We study the robustness of conformal prediction—a powerful tool for uncertainty quantification—to label noise. Our analysis tackles both regression and classification problems, characterizing when and how it is possible to construct uncertainty sets that correctly cover the unobserved noiseless ground truth labels. Through stylized theoretical examples and practical experiments, we argue that naïve conformal prediction covers the noiseless ground truth label unless the noise distribution is adversarially designed. This leads us to believe that correcting for label noise is unnecessary except for pathological data distributions or noise sources. In such cases, we can also correct for noise of bounded size in the conformal prediction algorithm in order to ensure correct coverage of the ground truth labels without score or data regularity.", $\Phi$-DVAE: Learning Physically Interpretable Representations with Nonlinear Filtering,https://openreview.net/forum?id=GUfVNbxIYv,https://openreview.net/pdf?id=GUfVNbxIYv,,"Incorporating unstructured data into physical models is a challenging problem that is emerging in data assimilation. Traditional approaches focus on well-defined observation operators whose functional forms are typically assumed to be known. 
This prevents these methods from achieving a consistent model-data synthesis in configurations where the mapping from data-space to model-space is unknown. To address these shortcomings, in this paper we develop a physics-informed dynamical variational autoencoder ($\Phi$-DVAE) for embedding diverse data streams into time-evolving physical systems described by differential equations. Our approach combines a standard (possibly nonlinear) filter for the latent state-space model and a VAE, to embed the unstructured data stream into the latent dynamical system. A variational Bayesian framework is used for the joint estimation of the embedding, latent states, and unknown system parameters. To demonstrate the method, we look at three examples: video datasets generated by the advection and Korteweg-de Vries partial differential equations, and a velocity field generated by the Lorenz-63 system. Comparisons with relevant baselines show that the $\Phi$-DVAE provides a data efficient dynamics encoding methodology that is competitive with standard approaches, with the added benefit of incorporating a physically interpretable latent space.","variational autoencoder, nonlinear filter, physics-informed, parameter estimation, variational inference, Bayesian inverse problems" Revisiting Structured Dropout,https://openreview.net/forum?id=ZAgV_f00Mm,https://openreview.net/pdf?id=ZAgV_f00Mm,,"Large neural networks are often overparameterised and prone to overfitting. Dropout is a widely used regularization technique to combat overfitting and improve model generalization. However, unstructured Dropout is not always effective for specific network architectures, and this has led to the development of multiple structured Dropout approaches to improve model performance and, sometimes, reduce the computational resources required for inference. In this work, we revisit structured Dropout, comparing different Dropout approaches on natural language processing and computer vision tasks for multiple state-of-the-art networks. Additionally, we devise an approach to structured Dropout we call \textbf{\emph{ProbDropBlock}}, which drops contiguous blocks from feature maps with a probability given by the normalized feature salience values. We find that with a simple scheduling strategy the proposed approach to structured Dropout consistently improves model performance compared to baselines and other Dropout approaches on a diverse range of tasks and models. In particular, we show \textbf{\emph{ProbDropBlock}} improves RoBERTa finetuning on MNLI by $0.22\%$, and training of ResNet50 on ImageNet by $0.28\%$. ", Reducing the Capacity Gap via Spherical Knowledge Distillation,https://openreview.net/forum?id=ubqLbhIzbk,https://openreview.net/pdf?id=ubqLbhIzbk,This work proposes an efficient knowledge distillation method to train competitive students distilled by oversized teachers.,"Knowledge distillation aims to obtain a small and effective student model by learning the output from a large knowledgeable teacher model. However, when the student is distilled by an oversized teacher, a critical performance degradation problem is exposed. This paper revisits the performance degradation problem from the perspective of model confidence. Specifically, we apply energy-based metrics to measure the confidence of models, and propose Spherical Knowledge Distillation (SKD): a more efficient knowledge distillation framework when distilling with larger teachers. 
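The abstract does not spell out the SKD objective, but one plausible reading of "spherical" distillation is to match teacher and student logits on a common sphere, so the student matches the teacher's direction rather than its confidence-laden magnitude, which grows with teacher size. The normalization and shared radius below are assumptions made for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def spherical_kd_loss(student_logits, teacher_logits, T=4.0):
    # Project both logit vectors onto a common sphere before the usual
    # softened-softmax matching (sketch; SKD's normalization may differ).
    s = F.normalize(student_logits, dim=-1)
    t = F.normalize(teacher_logits, dim=-1)
    scale = teacher_logits.norm(dim=-1, keepdim=True).detach()  # shared radius
    return F.kl_div(
        F.log_softmax(s * scale / T, dim=-1),
        F.softmax(t * scale / T, dim=-1),
        reduction="batchmean",
    ) * T * T

loss = spherical_kd_loss(torch.randn(8, 100), torch.randn(8, 100))
```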
A theoretical analysis is provided to show that SKD can effectively reduce the confidence gap between the teacher and student, thus alleviating the performance degradation problem. We demonstrate that SKD is easy to train, and can significantly outperform several strong baselines on various mainstream datasets, including CIFAR-100 and ImageNet. ","Knowledge Distillation, Model Compression" "Flatter, Faster: Scaling Momentum for Optimal Speedup of SGD",https://openreview.net/forum?id=3IXDfzaJ2LF,https://openreview.net/pdf?id=3IXDfzaJ2LF,"We find the implicit bias induced by noise in SGD with momentum; this leads us to identify a scaling limit of the momentum hyperparameter with the learning rate that maximally accelerates training, without depleting generalization.","Commonly used optimization algorithms often show a trade-off between good generalization and fast training times. For instance, stochastic gradient descent (SGD) tends to have good generalization; however, adaptive gradient methods have superior training times. Momentum can help accelerate training with SGD, but so far there has been no principled way to select the momentum hyperparameter. Here we study implicit bias arising from the interplay between SGD with label noise and momentum in the training of overparameterized neural networks. We find that scaling the momentum hyperparameter $1-\beta$ with the learning rate to the power of $2/3$ maximally accelerates training, without sacrificing generalization. To analytically derive this result we develop an architecture-independent framework, where the main assumption is the existence of a degenerate manifold of global minimizers, as is natural in overparameterized models. Training dynamics display the emergence of two characteristic timescales that are well-separated for generic values of the hyperparameters. The maximum acceleration of training is reached when these two timescales meet, which in turn determines the scaling limit we propose. We perform experiments, including matrix sensing and ResNet on CIFAR10, which provide evidence for the robustness of these results.","SGD, momentum, acceleration, generalization, scaling limit, deep learning, implicit bias, implicit regularization" Learning implicit hidden Markov models using neural likelihood-free inference,https://openreview.net/forum?id=5eCi6tAPc7,https://openreview.net/pdf?id=5eCi6tAPc7,"We propose a novel method, using an autoregressive flow, for carrying out likelihood-free Bayesian inference of a hidden Markov model","Likelihood-free inference methods for implicit models based on neural conditional density estimation were shown to drastically reduce the simulation burden in comparison to classical methods such as ABC. However, when applied in the context of any latent variable model, such as a Hidden Markov model (HMM), these methods are designed to only estimate the parameters rather than the joint posterior distribution of both the parameters and the hidden states. Naive application of these methods to an HMM, ignoring the inference of this joint posterior distribution, will result in overestimation of uncertainty of the posterior predictive. We propose a postprocessing step that can rectify this problem. Our approach relies on directly learning the intractable posterior distribution of the hidden states, using an autoregressive flow, by exploiting the Markov property. 
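Concretely, the Markov property lets the smoothing posterior over hidden states factorize backwards in time, so an autoregressive flow only ever has to model one low-dimensional conditional at a time. The identity below is standard for state-space models; the notation is ours ($x$ for hidden states, $y$ for observations), not taken from the paper:

```latex
p(x_{1:T} \mid y_{1:T}, \theta)
  = p(x_T \mid y_{1:T}, \theta) \prod_{t=1}^{T-1} p(x_t \mid x_{t+1}, y_{1:t}, \theta)
```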
Upon evaluating our approach on some intractable HMMs, we found that the quality of the estimates retrieved using our postprocessing is comparable to what can be achieved using computationally expensive particle filtering, which additionally requires a tractable data distribution.","likelihood-free, Bayesian inference, simulation based inference, ABC-SMC, HMM, simulator, implicit models" Brain Signal Generation and Data Augmentation with a Single-Step Diffusion Probabilistic Model,https://openreview.net/forum?id=woOQ5Hb1oOF,https://openreview.net/pdf?id=woOQ5Hb1oOF,"We show on multiple brain signal datasets that distilled diffusion probability models can synthesize EEG signals with high accuracy and diversity, which can be used for data augmentation.","Brain-computer interfaces based on deep learning rely on large amounts of high-quality data. Finding publicly available brain signal datasets that meet all requirements is a challenge. However, brain signals synthesized with generative models may provide a solution to this problem. Our work builds on diffusion probabilistic models (DPMs) and aims to generate brain signals that have the properties needed to develop further classification models based on deep learning. We show that our DPM can generate high-quality event-related potentials (ERPs) and motor imagery (MI) signals. Furthermore, with the progressive distillation of the model, subject-specific data can be produced in a one-step reverse process. We augment publicly available datasets and demonstrate the impact of the generated signals on a deep learning classification model. DPMs are versatile models, and this work shows that brain signal processing is one of many other tasks in which these models can be useful.","diffusion probabilistic model, generative model, electroencephalography, eeg, erp, motor imagery, synthesis, augmentation" Know Your Boundaries: The Advantage of Explicit Behavior Cloning in Offline RL,https://openreview.net/forum?id=MT2l4ziaxeE,https://openreview.net/pdf?id=MT2l4ziaxeE,,"We introduce an offline reinforcement learning (RL) algorithm that explicitly clones a behavior policy to constrain value learning. In offline RL, it is often important to prevent a policy from selecting unobserved actions, since the consequence of these actions cannot be presumed without additional information about the environment. One straightforward way to implement such a constraint is to explicitly model a given data distribution via behavior cloning and directly force a policy not to select uncertain actions. However, many offline RL methods instantiate the constraint indirectly---for example, pessimistic value estimation---due to a concern about errors when modeling a potentially complex behavior policy. In this work, we argue that it is not only viable but beneficial to explicitly model the behavior policy for offline RL because the constraint can be realized in a stable way with the trained model. We first suggest a theoretical framework that allows us to incorporate behavior-cloned models into value-based offline RL methods, enjoying the strength of both explicit behavior cloning and value learning. Then, we propose a practical method utilizing a score-based generative model for behavior cloning. 
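To make "knowing your boundaries" concrete, here is a minimal sketch in which an explicitly cloned behavior density gates value learning: candidate next actions the cloned policy deems too unlikely are simply excluded from the Bellman target. The log-density threshold and the toy callables are illustrative assumptions, not the paper's construction.

```python
import torch

def constrained_bellman_target(q_net, bc_log_prob, next_obs, candidate_actions,
                               reward, gamma=0.99, log_prob_min=-5.0):
    # Score candidate next actions under the cloned behavior policy and
    # mask out those it considers out-of-distribution.
    logp = bc_log_prob(next_obs, candidate_actions)          # (B, K)
    q = q_net(next_obs, candidate_actions)                   # (B, K)
    q = q.masked_fill(logp < log_prob_min, float("-inf"))    # trust the boundary
    return reward + gamma * q.max(dim=1).values

obs = torch.randn(4, 3)
acts = torch.randn(4, 10, 2)
target = constrained_bellman_target(
    q_net=lambda o, a: a.sum(-1),                # toy critic
    bc_log_prob=lambda o, a: -a.pow(2).sum(-1),  # toy Gaussian-like BC density
    next_obs=obs, candidate_actions=acts, reward=torch.zeros(4),
)
```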
With the proposed method, we show state-of-the-art performance on several datasets within the D4RL and Robomimic benchmarks and achieve competitive performance across all datasets tested.", "On the Convergence of AdaGrad on $\mathbb{R}^d$: Beyond Convexity, Non-Asymptotic Rate and Acceleration",https://openreview.net/forum?id=ULnHxczCBaE,https://openreview.net/pdf?id=ULnHxczCBaE,New techniques to prove the convergence rate of AdaGrad and new accelerated adaptive algorithms without bounded domain assumption beyond standard convex and smooth functions,"Existing analysis of AdaGrad and other adaptive methods for smooth convex optimization is typically for functions with bounded domain diameter. In unconstrained problems, previous works guarantee an asymptotic convergence rate without an explicit constant factor that holds true for the entire function class. Furthermore, in the stochastic setting, only a modified version of AdaGrad, different from the one commonly used in practice, in which the latest gradient is not used to update the stepsize, has been analyzed. Our paper aims at bridging these gaps and developing a deeper understanding of AdaGrad and its variants in the standard setting of smooth convex functions as well as the more general setting of quasar convex functions. First, we demonstrate new techniques to explicitly bound the convergence rate of the vanilla AdaGrad for unconstrained problems in both deterministic and stochastic settings. Second, we propose a variant of AdaGrad for which we can show the convergence of the last iterate, instead of the average iterate. Finally, we give new accelerated adaptive algorithms and their convergence guarantee in the deterministic setting with explicit dependency on the problem parameters, improving upon the asymptotic rate shown in previous works. ","Convex Optimization, Adaptive Algorithms" Bounded Attacks and Robustness in Image Transform Domains,https://openreview.net/forum?id=4WjVKtMUOP,https://openreview.net/pdf?id=4WjVKtMUOP,"A novel set of attacks operating in the well-known DCT and DWT domains that do not abandon the usual $L^\infty$ threat model, leading to adversarial examples with higher visual similarity and better adversarial learning transferability.","Classical image transformations such as the discrete cosine transform (DCT) and the discrete wavelet transforms (DWTs) provide semantically meaningful representations of images. In this paper we propose a general method for adversarial attacks in such transform domains that, in contrast to prior work, obey the $L^\infty$ constraint in the pixel domain. The key idea is to replace the standard attack based on projections with the barrier method. Experiments with DCT and DWTs produce adversarial examples that are significantly more similar to the original than with prior attacks. Further, through adversarial training we show that robustness against our attacks transfers to robustness against a broad class of common image perturbations.","Adversarial example, white-box attack, neural networks, discrete linear transforms, DCT, JPEG, wavelet" SP2 : A Second Order Stochastic Polyak Method,https://openreview.net/forum?id=5mqFra2ZSuf,https://openreview.net/pdf?id=5mqFra2ZSuf,,"Recently the SP (Stochastic Polyak step size) method has emerged as a competitive adaptive method for setting the step sizes of SGD. SP can be interpreted as a method specialized to interpolated models, since it solves the interpolation equations. SP solves these equations by using local linearizations of the model. 
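As a sketch of the machinery involved: the first function below implements the first-order SP step on the local linearization just described (assuming an interpolation target of zero), and the second provides the Hessian-vector product primitive that a second-order variant such as SP2 can build on. The actual SP2 update solves the quadratic interpolation equation and differs in detail.

```python
import torch

def sp_step(loss, params):
    # Stochastic Polyak step on the linearization:
    # delta = loss / ||g||^2 * g  (interpolation target assumed to be 0).
    g = torch.autograd.grad(loss, params, create_graph=True)
    sq_norm = sum((gi ** 2).sum() for gi in g)
    step = [loss.detach() / sq_norm.detach() * gi.detach() for gi in g]
    return step, g

def hvp(grads, params, vec):
    # Hessian-vector product via double backprop; SP2 uses products like
    # this to work with the local second-order model instead of the linear one.
    dot = sum((gi * vi).sum() for gi, vi in zip(grads, vec))
    return torch.autograd.grad(dot, params, retain_graph=True)

w = torch.randn(5, requires_grad=True)
loss = (w ** 2).sum()
step, g = sp_step(loss, [w])
Hv = hvp(g, [w], step)  # here the Hessian is 2*I, so Hv == 2 * step
```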
We take a step further and develop a method for solving the interpolation equations that uses the local second-order approximation of the model. Our resulting method, SP2, uses Hessian-vector products to speed up the convergence of SP. Furthermore, and rather uniquely among second-order methods, the design of SP2 in no way relies on positive definite Hessian matrices or convexity of the objective function. We show that SP2 is competitive both in theory and in experiments, performing strongly on matrix completion, non-convex test problems, and logistic regression. We also provide a convergence theory on sums-of-quadratics.", Multi-Objective GFlowNets,https://openreview.net/forum?id=3z1Ws6GEYV4,https://openreview.net/pdf?id=3z1Ws6GEYV4,We generate diverse Pareto-optimal candidates for high-dimensional multi-objective optimization problems with GFlowNets. ,"In many applications of machine learning, like drug discovery and material design, the goal is to generate candidates that simultaneously maximize a set of objectives. As these objectives are often conflicting, there is no single candidate that simultaneously maximizes all objectives, but rather a set of Pareto-optimal candidates where one objective cannot be improved without worsening another. Moreover, these objectives, when considered in practice, are often under-specified, making diversity of candidates a key consideration. The existing multi-objective optimization methods focus predominantly on covering the Pareto front, failing to capture diversity in the space of candidates. Motivated by the success of GFlowNets for generation of diverse candidates in a single-objective setting, in this paper we consider Multi-Objective GFlowNets (MOGFNs). MOGFNs consist of a Conditional GFlowNet, which models a family of single-objective sub-problems derived by decomposing the multi-objective optimization problem. Our work is the first to empirically demonstrate conditional GFlowNets. Through a series of experiments on synthetic tasks and real-world domains, we empirically demonstrate that MOGFNs outperform existing methods in terms of Hypervolume, R2-distance and candidate diversity. We also demonstrate the effectiveness of MOGFNs over existing methods in active learning settings. Finally, we supplement our empirical results with a careful analysis of each component of MOGFNs.","generative flow networks, multi-objective optimization, drug discovery, material design" Making Better Decision by Directly Planning in Continuous Control,https://openreview.net/forum?id=r8Mu7idxyF,https://openreview.net/pdf?id=r8Mu7idxyF,Directly using the environment model to plan can be an efficient way to make decisions. We propose a novel POMP algorithm with a D3P planner module to achieve efficient planning in continuous action space control problems.,"By properly utilizing the learned environment model, model-based reinforcement learning methods can improve the sample efficiency for decision-making problems. Beyond using the learned environment model to train a policy, the success of MCTS-based methods shows that directly incorporating the learned environment model as a planner to make decisions might be more effective. However, when the action space is high-dimensional and continuous, directly planning according to the learned model is costly and non-trivial because of two challenges: (1) the infinite number of candidate actions and (2) the temporal dependency between actions in different timesteps. 
To address these challenges, inspired by Differential Dynamic Programming (DDP) in optimal control theory, we design a novel Policy Optimization with Model Planning (POMP) algorithm, which incorporates a carefully designed Deep Differential Dynamic Programming (D3P) planner into the model-based RL framework. In the D3P planner, (1) to effectively plan in the continuous action space, we construct a locally quadratic programming problem that uses a gradient-based optimization process to replace search. (2) To take the temporal dependency of actions at different timesteps into account, we leverage the updated and latest actions of previous timesteps (i.e., step $1, \cdots, h-1$) to update the action of the current step (i.e., step $h$), instead of updating all actions simultaneously. We theoretically prove the convergence rate for our D3P planner and analyze the effect of the feedback term. In practice, to effectively apply the neural-network-based D3P planner in reinforcement learning, we leverage the policy network to initialize the action sequence and keep the action update conservative in the planning process. Experiments demonstrate that POMP consistently improves sample efficiency on widely used continuous control tasks. Our code is released at https://github.com/POMP-D3P/POMP-D3P. ","Model-based Reinforcement Learning, Reinforcement Learning, Planning, Policy Optimization" Large language models are not zero-shot communicators,https://openreview.net/forum?id=WgbcOQMNXB,https://openreview.net/pdf?id=WgbcOQMNXB,"Large language models are significantly worse than humans in interpreting language in context, which is a crucial aspect of communication.","The recent success of large language models (LLMs) has drawn heavy attention and investment in their use as conversational and embodied systems. Despite widespread use of LLMs as conversational agents, evaluations of performance fail to capture a crucial aspect of communication: interpreting language in context. Humans interpret language using beliefs, prior knowledge about the world, and more. For example, we intuitively understand the response ""I wore gloves"" to the question ""Did you leave fingerprints?"" as meaning ""No"". To investigate whether LLMs have the ability to make this type of inference, known as an implicature, we design a simple task and evaluate a set of models. We find that despite only evaluating on utterances that require a binary inference (yes or no), most perform close to random. Models adapted to be ""aligned with human intent"" via reinforcement learning perform much better, but still leave a significant gap with human performance. This gap is even more pronounced for context-heavy utterances. We present our findings as the starting gun for further research into evaluating how LLMs interpret language in context, in order to drive the development of more pragmatic and useful models of human discourse.","large language models, pragmatics, natural language processing, communication, conversation, implicature" Data dependent frequency sensitivity of convolutional neural networks,https://openreview.net/forum?id=fgaiMgCpZV,https://openreview.net/pdf?id=fgaiMgCpZV,We show with theory and experiments that the observed sensitivity of convolutional neural networks (CNNs) to low frequency perturbations of input images is a consequence of the frequency distribution of natural images.,"It is widely acknowledged that trained convolutional neural networks (CNNs) have different levels of sensitivity to signals of different frequency. 
In particular, a number of empirical studies have documented CNNs' sensitivity to low-frequency signals. In this work, we show with theory and experiments that this observed sensitivity is a consequence of the frequency distribution of natural images, which is known to have most of its power concentrated in low-to-mid frequencies. Our theoretical analysis relies on representations of the layers of a CNN in frequency space, an idea that has previously been used to accelerate computations and study implicit bias of network training algorithms, but to the best of our knowledge has not been applied in the domain of model robustness. ","deep learning, convolutional neural networks, sparsity, matrix factorization, robustness" Is end-to-end learning enough for fitness activity recognition?,https://openreview.net/forum?id=s5GClg38TU,https://openreview.net/pdf?id=s5GClg38TU,"With appropriately labeled data, end-to-end learning on raw pixels can compete with pose estimation.","End-to-end learning has taken hold of many computer vision tasks, in particular those related to still images, with task-specific optimization yielding very strong performance. Nevertheless, human-centric action recognition is still largely dominated by hand-crafted pipelines, and only individual components are replaced by neural networks that typically operate on individual frames. As a testbed to study the relevance of such pipelines, we present a new fully annotated video dataset of fitness activities. Any recognition capabilities in this domain are almost exclusively a function of human poses and their temporal dynamics, so pose-based solutions should perform well. We show that, with this labelled data, end-to-end learning on raw pixels can compete with state-of-the-art action recognition pipelines based on pose estimation. We also show that end-to-end learning can support temporally fine-grained tasks such as real-time repetition counting.","end-to-end learning, action recognition, 3D convolution, video, fitness" Debiasing the Pre-trained Language Model Through Fine-tuning the Downstream Tasks,https://openreview.net/forum?id=IfxsiXMZoNX,https://openreview.net/pdf?id=IfxsiXMZoNX,,"Recent studies have revealed that the widely-used pre-trained language models propagate societal biases from the large unmoderated pre-training corpora. Existing solutions have mostly focused on debiasing the pre-training corpora or embedding models. Thus, these approaches need a separate pre-training process and extra training datasets, which are resource-intensive and costly. Indeed, studies have shown that these approaches hurt the models' performance on downstream tasks. In this study, we focus on gender debiasing and propose Gender-tuning, which comprises two training processes: gender-word perturbation and fine-tuning. This combination aims to interrupt gender-word associations with other words in training examples and to classify the perturbed examples according to their ground-truth labels. Gender-tuning uses a joint loss for training both the perturbation model and fine-tuning. Comprehensive experiments show that Gender-tuning effectively reduces gender bias scores in pre-trained language models and, at the same time, improves performance on downstream tasks. Gender-tuning is applicable as a plug-and-play debiasing tool for pre-trained language models. 
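Schematically, the joint loss described above can be read as a weighted aggregation of the two training signals; the weighting coefficient $\lambda$ below is our notation, since the abstract does not specify how the two losses are combined:

```latex
\mathcal{L}_{\text{Gender-tuning}} = \lambda\, \mathcal{L}_{\text{MLM}}
  + (1 - \lambda)\, \mathcal{L}_{\text{fine-tune}}
```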
The source code and pre-trained models will be available on the author’s GitHub page.","NLP, Debiasing pre-trained language model, Social biases, Robustness" Efficient Exploration using Model-Based Quality-Diversity with Gradients,https://openreview.net/forum?id=5-X1XzdAWcC,https://openreview.net/pdf?id=5-X1XzdAWcC,,"Exploration is a key challenge in Reinforcement Learning, especially in long-horizon, deceptive and sparse-reward environments. For such applications, population-based approaches have proven effective. Methods such as Quality-Diversity deal with this by encouraging novel solutions and producing a diversity of behaviours. However, these methods are driven either by undirected sampling (i.e. mutations) or by approximated gradients (i.e. Evolution Strategies) in the parameter space, which makes them highly sample-inefficient. In this paper, we propose a model-based Quality-Diversity approach, relying on gradients and learning in imagination. Our approach optimizes all members of a population simultaneously to maintain both performance and diversity efficiently by leveraging the effectiveness of QD algorithms as good data generators to train deep models. We demonstrate that it maintains the divergent search capabilities of population-based approaches while significantly improving their sample efficiency (5 times faster) and quality of solutions (2 times more performant).","Quality-Diversity, Exploration, Reinforcement Learning" ResFed: Communication Efficient Federated Learning by Transmitting Deep Compressed Residuals,https://openreview.net/forum?id=TTcpISh-_oI,https://openreview.net/pdf?id=TTcpISh-_oI,We introduce the ResFed federated learning framework to achieve more efficient communication by leveraging deep compressed residuals rather than weights or gradients.,"Federated learning enables cooperative training among massively distributed clients by sharing their learned local model parameters. However, with increasing model size, deploying federated learning requires a large communication bandwidth, which limits its deployment in wireless networks. To address this bottleneck, we introduce a residual-based federated learning framework (ResFed), where residuals rather than model parameters are transmitted in communication networks for training. In particular, we integrate two pairs of shared predictors for the model prediction in both server-to-client and client-to-server communication. By employing a common prediction rule, both locally and globally updated models are always fully recoverable in clients and the server. We highlight that the residuals only indicate the quasi-update of a model in a single inter-round, and hence contain denser information and have a lower entropy, compared to model weights and gradients. Based on this property, we further conduct lossy compression of the residuals by sparsification and quantization and encode them for efficient communication. The experimental evaluation shows that our ResFed incurs remarkably lower communication costs and achieves better accuracy by leveraging less sensitive residuals, compared to standard federated learning. For instance, to train a 4.08 MB CNN model on CIFAR-10 with 10 clients under a non-independent and identically distributed (Non-IID) setting, our approach achieves a compression ratio over 700X in each communication round with minimum impact on the accuracy. 
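A minimal sketch of the residual path: because sender and receiver run the same predictor, only the sparsified and quantized residual has to travel over the network. The top-k sparsifier and uniform quantizer below are illustrative stand-ins for whatever lossy codec ResFed actually uses.

```python
import torch

def encode_residual(new_weights, predicted_weights, k=256, levels=256):
    # Residual between the locally updated model and the shared prediction.
    residual = new_weights - predicted_weights
    # Sparsify: keep only the k largest-magnitude entries.
    idx = residual.abs().flatten().topk(k).indices
    values = residual.flatten()[idx]
    # Quantize the surviving values uniformly (illustrative choice).
    scale = values.abs().max() / (levels // 2)
    q = torch.clamp((values / scale).round(), -(levels // 2), levels // 2 - 1)
    return idx, q, scale

def decode_residual(idx, q, scale, predicted_weights):
    # The receiver reconstructs the model from its own prediction plus
    # the decoded residual; no raw weights ever cross the network.
    residual = torch.zeros_like(predicted_weights).flatten()
    residual[idx] = q * scale
    return predicted_weights + residual.view_as(predicted_weights)

w_new, w_pred = torch.randn(64, 64), torch.randn(64, 64)
w_rec = decode_residual(*encode_residual(w_new, w_pred), w_pred)
```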
To reach an accuracy of 70%, it saves around 99% of the total communication volume, from 587.61 Mb to 6.79 Mb in up-streaming and to 4.61 Mb in down-streaming, on average for all clients.","Federated Learning, Communication Efficiency, Deep Compression" HiT-MDP: Learning the SMDP option framework on MDPs with Hidden Temporal Variables,https://openreview.net/forum?id=VuuDXDgujAc,https://openreview.net/pdf?id=VuuDXDgujAc,,"The standard option framework is developed on the Semi-Markov Decision Process (SMDP), which is unstable to optimize and sample-inefficient. To this end, we propose a novel Markov Decision Process (MDP), the Hidden Temporal MDP (HiT-MDP), and prove that the option-induced HiT-MDP is homomorphically equivalent to the option-induced SMDP. We also derive a sample-efficient structured variational inference-based algorithm, which leads to a novel stable option-discovery method under the maximum-entropy reinforcement learning framework. Extensive experiments on challenging \textit{Mujoco} environments demonstrate HiT-MDP's efficiency and effectiveness: under widely used configurations, HiT-MDP achieves competitive, if not better, performance compared to the state-of-the-art baselines on all finite horizon and transfer learning environments. Moreover, HiT-MDP significantly outperforms all baselines on infinite horizon environments while exhibiting smaller variance, faster convergence, and better interpretability.","Hierarchical Reinforcement Learning, Reinforcement Learning, Markov Decision Process" Improved Group Robustness via Classifier Retraining on Independent Splits,https://openreview.net/forum?id=j3AyKG-H3uM,https://openreview.net/pdf?id=j3AyKG-H3uM,"We propose a simple method to improve group robustness by fine-tuning only the classification layer on independent splits of the data, with minimal parameter tuning.","Deep neural networks learned by minimizing the average risk can achieve strong average performance, but their performance for a subgroup may degrade if the subgroup is underrepresented in the overall data population. Group distributionally robust optimization (Sagawa et al., 2020a, GDRO) is a standard baseline for learning models with strong worst-group performance. However, GDRO requires group labels for every example during training and can be prone to overfitting, often requiring careful model capacity control via regularization or early stopping. When only a limited amount of group labels is available, Just Train Twice (Liu et al., 2021, JTT) is a popular approach which infers a pseudo-group-label for every unlabeled example. The process of inferring pseudo labels can be highly sensitive during model selection. To alleviate overfitting for GDRO and the pseudo labeling process for JTT, we propose a new method via classifier retraining on independent splits (of the training data). We find that using a novel sample splitting procedure achieves robust worst-group performance in the fine-tuning step. When evaluated on benchmark image and text classification tasks, our approach consistently reduces the requirement of group labels and hyperparameter search during training. Experimental results confirm that our approach performs favorably compared with existing methods (including GDRO and JTT) when either group labels are available during training or are only available during validation.","spurious correlations, group shifts, overfitting, distributionally robust optimization" (Certified!!) 
Adversarial Robustness for Free!,https://openreview.net/forum?id=JLg5aHHv7j,https://openreview.net/pdf?id=JLg5aHHv7j,Using an off-the-shelf diffusion model as a denoiser gives state-of-the-art certified adversarial robustness.,"In this paper we show how to achieve state-of-the-art certified adversarial robustness to 2-norm bounded perturbations by relying exclusively on off-the-shelf pretrained models. To do so, we instantiate the denoised smoothing approach of Salman et al. by combining a pretrained denoising diffusion probabilistic model and a standard high-accuracy classifier. This allows us to certify 71% accuracy on ImageNet under adversarial perturbations constrained to be within a 2-norm of 0.5, an improvement of 14 percentage points over the prior certified SoTA using any approach, or an improvement of 30 percentage points over denoised smoothing. We obtain these results using only pretrained diffusion models and image classifiers, without requiring any fine-tuning or retraining of model parameters.", URVoice: An Akl-Toussaint/Graham-Sklansky Approach towards Convex Hull Computation for Sign Language Interpretation,https://openreview.net/forum?id=GG0sigkMnxF,https://openreview.net/pdf?id=GG0sigkMnxF,"We present URVoice, a vocalizer for the communication impaired, based on the Indian Sign Language Notations and real-time translation of gesture to text/voice using the convex hull as the computational geometry.","We present URVoice, a vocalizer for the communication impaired, based on the Indian Sign Language Notations. Contemporary psychological theories consider language and speech as devices to understand complex psychological processes and deliver them as cultural products of ideas and communication. Sign and gesture language, offering an intelligent co-ordination of eye-and-hand and ear-and-mouth, has evolved as an intelligent manifestation of speech for the impaired. However, it has very limited modality and iconicity in accommodating a greater range of linguistically relevant meanings. URVoice is an Augmentative and Alternative Communication (AAC) device, which currently features a pipeline of forward communication from signer to collocutor, with a novel vision-based approach built on the convex hull. The solution achieves real-time translation of gesture to text/voice using the convex hull as the computational geometry, following the Akl-Toussaint heuristic and the Graham-Sklansky scan algorithm. The results are weighed against our other solutions based on conventional Machine Learning and Deep Learning approaches. A futuristic version of URVoice, with voice translated to sign language gestures, will be a complete solution for effectively bridging the cognitive and communication gap between impaired and abled users.","Communication disorder, computational geometry, convex hull, sign language, URVoice, vocalizer, computer vision, deep learning" Gaussian-Bernoulli RBMs Without Tears,https://openreview.net/forum?id=eHkWu_OXBGt,https://openreview.net/pdf?id=eHkWu_OXBGt,,"We revisit the challenging problem of training Gaussian-Bernoulli restricted Boltzmann machines (GRBMs), introducing two innovations. We propose a novel Gibbs-Langevin sampling algorithm that outperforms existing methods like Gibbs sampling. We propose a modified contrastive divergence (CD) algorithm so that one can generate images with GRBMs starting from noise. This enables direct comparison of GRBMs with deep generative models, improving evaluation protocols in the RBM literature. 
Moreover, we show that modified CD and gradient clipping are enough to robustly train GRBMs with large learning rates, thus removing the necessity of various tricks in the literature. Experiments on Gaussian Mixtures, MNIST, FashionMNIST, and CelebA show GRBMs can generate good samples, despite their single-hidden-layer architecture. ","Gaussian-Bernoulli Restricted Boltzmann Machines, Restricted Boltzmann Machines, Langevin Monte Carlo Sampling, Contrastive Divergence" Image Emotion Recognition using Cognitive Contextual Summarization Framework,https://openreview.net/forum?id=r8PDECVumsJ,https://openreview.net/pdf?id=r8PDECVumsJ,We propose a novel framework for emotion prediction in continuous space (Valence-Arousal) and provide a new benchmark for the community to explore and enhance human emotion perception.,"Estimating the emotion perceived from visual stimuli has gained significant traction in recent years. The existing frameworks rely either on a person's presence in the image or are based on object feature extraction and low-level image features. By focusing on the person/object in the image, the existing frameworks fail to capture the context or the interaction between multiple elements in the image. Also, what if an image does not have a human subject or an object? We address this drawback by building a Cognitive Contextual Summarization (CCS) model based on a One-For-All (OFA) backbone trained on multiple tasks, including image captioning. The ability of the backbone to recognize elements in the image and generate captions helps us capture interactions through captions, which we decode using BERT for contextual understanding. The end-to-end fusion of the OFA and the BERT features helps us predict continuous human emotion (Valence, Arousal) from an image. We train our framework on the Building Emotional Machines dataset in the literature, and the experiments show that our model outperforms the state of the art.","Emotion Recognition, Valence-Arousal, Image Captioning, BERT, human cognition, Sentiment Analysis" PES: Probabilistic Exponential Smoothing for Time Series Forecasting,https://openreview.net/forum?id=EmABrt4zzz3,https://openreview.net/pdf?id=EmABrt4zzz3,,"Time series forecasting is a common task in many industries. It helps organizations in setting goals, making plans and taking decisions. Probabilistic forecasting, in addition, summarizes the confidence over future quantities, a useful property when targeting uncertainty. This paper proposes PES (Probabilistic Exponential Smoothing), a hybrid model for univariate time series forecasting. The contribution is two-fold: we introduce an RNN-like cell incorporating a simple exponential smoothing operator; building on this new cell, we develop an intelligible and data-efficient model. The proposed solution shows several desirable characteristics; being easy to implement and fast to train, it can accommodate multiple seasonalities and learn them via cross-learning. It can produce intervals as well as point forecasts, and its structure could be a valuable time series decomposition scheme. We test the PES model on a demand forecasting task on a well-known, publicly available, dataset. 
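The "RNN-like cell incorporating a simple exponential smoothing operator" suggests a recurrence of roughly the following shape, with the smoothing coefficient learned; everything beyond the smoothing recurrence itself (the sigmoid parameterization, the scalar state) is an assumption of this sketch.

```python
import torch

class ExpSmoothingCell(torch.nn.Module):
    """RNN-style cell built around simple exponential smoothing:
    s_t = alpha * x_t + (1 - alpha) * s_{t-1}, with alpha learned
    (a sketch of the idea, not the paper's exact cell)."""

    def __init__(self):
        super().__init__()
        self._alpha = torch.nn.Parameter(torch.tensor(0.0))  # sigmoid -> (0, 1)

    def forward(self, x, state):
        alpha = torch.sigmoid(self._alpha)
        return alpha * x + (1 - alpha) * state

cell, s = ExpSmoothingCell(), torch.zeros(1)
for x_t in torch.randn(12, 1):   # one univariate series, 12 steps
    s = cell(x_t, s)             # smoothed level, differentiable in alpha
```

Because the recurrence is differentiable in the smoothing coefficient, the level can be fit end-to-end by automatic differentiation, consistent with the keywords listed below.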
Finally, we show that the results obtained compare favorably to the state-of-the-art.","time series forecast, demand forecast, probabilistic forecast, recurrent neural network, exponential smoothing, automatic differentiation" Efficient Conditionally Invariant Representation Learning,https://openreview.net/forum?id=dJruFeSRym1,https://openreview.net/pdf?id=dJruFeSRym1,Batch-efficient conditional independence regularization,"We introduce the Conditional Independence Regression CovariancE (CIRCE), a measure of conditional independence for multivariate continuous-valued variables. CIRCE applies as a regularizer in settings where we wish to learn neural features $\varphi(X)$ of data $X$ to estimate a target $Y$, while being conditionally independent of a distractor $Z$ given $Y$. Both $Z$ and $Y$ are assumed to be continuous-valued but relatively low dimensional, whereas $X$ and its features may be complex and high dimensional. Relevant settings include domain-invariant learning, fairness, and causal learning. The procedure requires just a single ridge regression from $Y$ to kernelized features of $Z$, which can be done in advance. It is then only necessary to enforce independence of $\varphi(X)$ from residuals of this regression, which is possible with attractive estimation properties and consistency guarantees. By contrast, earlier measures of conditional feature dependence require multiple regressions for each step of feature learning, resulting in more severe bias and variance, and greater computational cost. When sufficiently rich features are used, we establish that CIRCE is zero if and only if $\varphi(X) \perp \!\!\! \perp Z \mid Y$. In experiments, we show superior performance to previous methods on challenging benchmarks, including learning conditionally invariant image features.","conditional independence, kernel methods" Distinguishing Feature Model for Ranking From Pairwise Comparisons,https://openreview.net/forum?id=JyD-NobfNL_,https://openreview.net/pdf?id=JyD-NobfNL_,,"We consider the problem of ranking a set of items from pairwise comparisons among them when the underlying preferences are intransitive in nature. Intransitivity is a common occurrence in real-world data sets, and we introduce a flexible and natural parametric model for pairwise comparisons that we call the \emph{Distinguishing Feature Model} (DF) to capture this. Under this model, the items have an unknown but fixed embedding and the pairwise comparison between a pair of items depends probabilistically on the feature in the embedding that can best distinguish the items. We study several theoretical properties including how it generalizes the popular transitive Bradley-Terry-Luce model. With just an embedding dimension $d = 3$, we show that the proposed model can capture arbitrarily long cyclic dependencies. Furthermore, we explicitly show the type of preference relations that cannot be modelled under the DF model for $d=3$. On the algorithmic side, we propose a Siamese-type neural-network-based algorithm which can learn to predict well under the DF model while at the same time being interpretable in the sense that the embeddings learnt can be extracted directly from the learnt model. 
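Under the DF model as summarized here, a comparison is decided by the single embedding coordinate that best separates the two items, which is what lets low-dimensional embeddings encode intransitive (cyclic) preferences; the logistic link in this sketch is an illustrative assumption.

```python
import numpy as np

def df_win_probability(u_i, u_j):
    """P(item i beats item j) under a Distinguishing-Feature-style model:
    only the coordinate with the largest gap between the two embeddings
    drives the comparison (link function assumed logistic)."""
    k = np.argmax(np.abs(u_i - u_j))              # most distinguishing feature
    return 1.0 / (1.0 + np.exp(-(u_i[k] - u_j[k])))

# The deciding feature differs per pair, so transitivity need not hold.
print(df_win_probability(np.array([0.9, 0.1, 0.5]),
                         np.array([0.2, 0.8, 0.5])))
```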
Our experimental results show that the model outperforms standard baselines on both synthetic and real-world ranking datasets.", Heterogeneous Neuronal and Synaptic Dynamics for Spike-Efficient Unsupervised Learning: Theory and Design Principles,https://openreview.net/forum?id=QIRtAqoXwj,https://openreview.net/pdf?id=QIRtAqoXwj,We prove that heterogeneity in neuronal dynamics improves the memory capacity while heterogeneity in the STDP synaptic dynamics improves the spike efficiency,"This paper shows that the heterogeneity in neuronal and synaptic dynamics reduces the spiking activity of a Recurrent Spiking Neural Network (RSNN) while improving prediction performance, enabling spike-efficient (unsupervised) learning. We analytically show that the diversity in the integration/relaxation dynamics of neurons improves an RSNN's ability to learn more distinct input patterns (higher memory capacity), leading to improved classification and prediction performance. We further prove that heterogeneous Spike-Timing-Dependent-Plasticity (STDP) dynamics of synapses reduce spiking activity but preserve memory capacity. The analytical results motivate \textbf{h}eterogeneous RSNN (HRSNN) design using Bayesian optimization to determine heterogeneity in neurons and synapses to improve $\mathcal{E}$, defined as the ratio of spiking activity and memory capacity. The empirical results on time series classification and prediction tasks show optimized HRSNN increases performance and reduces spiking activity compared to a ho\textbf{m}ogeneous RSNN (MRSNN).","theory, spiking neural network, LIF, STDP, heterogeneity, memory capacity, spike efficiency, bayesian optimization" MMVAE+: Enhancing the Generative Quality of Multimodal VAEs without Compromises,https://openreview.net/forum?id=sdQGxouELX,https://openreview.net/pdf?id=sdQGxouELX,,"Multimodal VAEs have recently gained attention as efficient models for weakly-supervised generative learning with multiple modalities. However, all existing variants of multimodal VAEs are affected by a non-trivial trade-off between generative quality and generative coherence. In particular, mixture-based models achieve good coherence only at the expense of sample diversity and a resulting lack of generative quality. We present a novel variant of the mixture-of-experts multimodal variational autoencoder that improves its generative quality, while maintaining high semantic coherence. We model shared and modality-specific information in separate latent subspaces, proposing an objective that overcomes certain dependencies on hyperparameters that arise for existing approaches with the same latent space structure. Compared to these existing approaches, we show increased robustness with respect to changes in the design of the latent space, in terms of the capacity allocated to modality-specific subspaces. We show that our model achieves both good generative coherence and high generative quality in challenging experiments, including more complex multimodal datasets than those used in previous works.","Multimodal Variational Autoencoder, Variational Autoencoder, Multimodal Generative Learning" Forget to Learn (F2L): Rethinking Replay Loss in Unsupervised Continuous Domain Adaptation,https://openreview.net/forum?id=QCcrFi7q3u,https://openreview.net/pdf?id=QCcrFi7q3u,,"Although continuous unsupervised domain adaptation (CUDA) has shown success in dealing with non-stationary data, catastrophic forgetting is still a challenge hindering its full potential. 
The current state-of-the-art (SOTA) focuses on training a single model to simultaneously perform adaptation (e.g., domain alignment) and knowledge retention (i.e., minimizing replay loss). However, the two conflicting objectives result in a hyper-parameter which is difficult to tune yet significantly affects model performance. Therefore, we propose to use two separate models, so that one model is dedicated to the retention of historical knowledge (i.e., high stability) while the other is dedicated to adaptation to future domains (i.e., high plasticity). This allows the algorithm to forget in order to achieve better overall performance, an approach we dub Forget to Learn (F2L). Specifically, F2L decomposes the training process into a specialist model and a generalist model, and uses knowledge distillation to transfer knowledge between the two models. We demonstrate the superiority of F2L compared to current CUDA trends (i.e., multi-task learning and single-task constrained learning) on different continuous unsupervised domain adaptation datasets.","Domain Adaptation, Lifelong Learning, Replay Loss, Knowledge Distillation, Stability Plasticity Dilemma" A probabilistic framework for task-aligned intra- and inter-area neural manifold estimation,https://openreview.net/forum?id=kt-dcBQcSA,https://openreview.net/pdf?id=kt-dcBQcSA,"New probabilistic estimator partitions multi-area neural variability into shared and private sources, aligned to meaningful task axes.","Latent manifolds provide a compact characterization of neural population activity and of shared co-variability across brain areas. Nonetheless, existing statistical tools for extracting neural manifolds face limitations in terms of interpretability of latents with respect to task variables, and can be hard to apply to datasets with no trial repeats. Here we propose a novel probabilistic framework that allows for interpretable partitioning of population variability within and across areas in the context of naturalistic behavior. Our approach for task-aligned manifold estimation (TAME-GP) explicitly partitions variability into private and shared sources, which can themselves be subdivided into task-relevant and task-irrelevant components, uses a realistic Poisson noise model, and introduces temporal smoothing of latent trajectories in the form of a Gaussian Process prior. This TAME-GP graphical model allows for robust estimation of task-relevant variability in local population responses, and of shared co-variability between brain areas. We demonstrate the efficiency of our estimator on within-model and biologically motivated simulated data. We also apply it to several datasets of neural population recordings during behavior. Overall, our results demonstrate the capacity of TAME-GP to capture meaningful intra- and inter-area neural variability with single-trial resolution.","neuroscience, dimensionality reduction, probabilistic methods, inter-area interactions" Applying Second Order Optimization to Deep Transformers with Parameter-Efficient Tuning,https://openreview.net/forum?id=4Fi-5Jiyy5w,https://openreview.net/pdf?id=4Fi-5Jiyy5w,,"Despite their theoretical superiority in convergence, second-order optimizers are generally not among the top choices for training large-scale neural networks due to their high computational and memory cost. 
Nevertheless, recent progress in parameter-efficient tuning has introduced a new paradigm in which large-scale pre-trained models (PTMs) can be adapted to specific tasks by optimizing a tiny proportion of parameters, which might change the game. We associate this new paradigm with the computational tractability of second-order optimizers and succeed in applying them to large PTMs that are from hundreds of millions to billions in scale. Beyond verifying their tractability, we further investigate the stability-influencing factors in the optimization process and accordingly propose a Newton-step-clipping approach in which we clip the update tensors rather than the gradients. This approach stabilizes the convergence by gating the magnitude of Newton steps along the optimization trajectories through the rugged landscapes of deep transformers. We conduct extensive experiments across different downstream tasks, demonstrating that, when equipped with our Newton-step-clipping strategy, second-order optimizers, especially Kronecker-factored curvature approximation (K-FAC), can attain comparable and even superior results, and faster convergence, relative to state-of-the-art baselines implemented with AdamW. Furthermore, we scale the model up to 3 billion parameters and validate the tractability and effectiveness of our method. This work is not only the first successful application of second-order optimization on such large-scale models but also sheds light on the possibility of further optimization-wise analysis on large-scale models in the future.","Pre-trained Models, NLP, Model Adaptation" Density Sketches for Sampling and Estimation,https://openreview.net/forum?id=SdXv2C2-tnj,https://openreview.net/pdf?id=SdXv2C2-tnj,online summary of data for density estimation and sampling new data.,"There has been an exponential increase in the data generated worldwide. Insights into this data led by machine learning (ML) have given rise to exciting applications such as recommendation engines, conversational agents, and so on. Often, data for these applications is generated at a rate faster than ML pipelines can consume it. In this paper, we propose Density Sketches (DS), a cheap and practical approach to reducing data redundancy in a streaming fashion. DS creates a succinct online summary of the data distribution. While DS does not store the samples from the stream, we can sample unseen data on the fly from DS to use for downstream learning tasks. In this sense, DS can replace actual data in many machine learning pipelines, analogous to generative models. Importantly, unlike generative models, which do not have statistical guarantees, the sampling distribution of DS asymptotically converges to the underlying unknown density.","density estimation, sampling, machine learning" Mask-tuning: Towards Improving Pre-trained Language Models' Generalization,https://openreview.net/forum?id=ZqvoLWz05jT,https://openreview.net/pdf?id=ZqvoLWz05jT,,"Pre-trained language models suffer from a well-known generalization problem. This issue emerges from the pre-trained language models' learning process, which heavily relies on spurious correlations that work for the majority of training examples but do not hold in general. As a consequence, the models' performance drops substantially on out-of-distribution datasets. Previous studies have proposed various solutions, including data augmentation and learning process improvement. 
In this paper, we present Mask-tuning, an approach that alleviates the impact of spurious correlations on the fine-tuning learning process. To achieve this goal, Mask-tuning integrates masked language training into the fine-tuning learning process. Specifically, Mask-tuning perturbs the linguistic relations of the downstream task's training examples and computes the masked language training loss. Then, the perturbed examples are fed into the fine-tuning process to be classified based on their ground-truth labels, and the fine-tuning training loss is computed. Afterward, the Mask-tuning loss, a weighted aggregation of the masked language model training loss and the fine-tuning loss, updates the masked language model and the fine-tuned model through training iterations. Extensive experiments show that Mask-tuning consistently improves the pre-trained language models' generalization on out-of-distribution datasets and enhances their performance on in-distribution datasets. The source code and pre-trained models will be available on the author's GitHub page.","NLP, Pre-trained language model, out-of-distribution learning, robust generalization, fine-tuning" Meta-Learning via Classifier(-free) Guidance,https://openreview.net/forum?id=8NLta1E_BPR,https://openreview.net/pdf?id=8NLta1E_BPR,We develop a meta-learning method that uses classifier(-free) guidance from the generative modeling literature to generate zero-shot adapted network weights.,"State-of-the-art meta-learning techniques do not optimize for zero-shot adaptation to unseen tasks, a setting in which humans excel. On the contrary, meta-learning algorithms learn hyperparameters and weight initializations that explicitly optimize for few-shot learning performance. In this work, we take inspiration from recent advances in generative modeling and language-conditioned image synthesis to propose meta-learning techniques that use natural language guidance to achieve higher zero-shot performance compared to the state-of-the-art. We do so by recasting the meta-learning problem as a multi-modal generative modeling problem: given a task, we consider its adapted neural network weights and its natural language description as equivalent multi-modal task representations. We first train an unconditional generative hypernetwork model to produce neural network weights; then we train a second ""guidance"" model that, given a natural language task description, traverses the hypernetwork latent space to find high-performance task-adapted weights in a zero-shot manner. We explore two alternative approaches for latent space guidance: ""HyperCLIP""-based classifier guidance and a conditional Hypernetwork Latent Diffusion Model (""HyperLDM""), which we show to benefit from the classifier-free guidance technique common in image generation. Finally, we demonstrate that our approaches outperform existing meta-learning methods with zero-shot learning experiments on our Meta-VQA dataset, which we specifically constructed to reflect the multi-modal meta-learning setting.","deep learning, meta learning, hypernetworks, generative models, classifier guidance, contrastive learning, clip, classifier-free guidance, latent diffusion, diffusion models" Tiered Pruning for Efficient Differentialble Inference-Aware Neural Architecture Search,https://openreview.net/forum?id=T5ADm9PHGeJ,https://openreview.net/pdf?id=T5ADm9PHGeJ,,"We propose three novel pruning techniques to improve the cost and results of inference-aware Differentiable Neural Architecture Search (DNAS).
First, we introduce $\textbf{Prunode}$, a stochastic bi-path building block for DNAS, which can search over inner hidden dimensions with $\mathcal{O}(1)$ memory and compute complexity. Second, we present an algorithm for pruning blocks within a stochastic layer of the SuperNet during the search. Third, we describe a novel technique for pruning unnecessary stochastic layers during the search. The optimized models resulting from the search are called PruNet and establish a new state-of-the-art Pareto frontier for NVIDIA V100 in terms of inference latency for ImageNet Top-1 image classification accuracy. PruNet as a backbone also outperforms GPUNet and EfficientNet on the COCO object detection task in terms of inference latency relative to mean Average Precision (mAP).","nas, dnas, neural architecture search, differentiable neural architecture search, state-of-the-art, imagenet, classification, gpunet, efficientnet, pruning, inference-aware, computer vision, object detection" MyoDex: Generalizable Representations for Dexterous Physiological Manipulation,https://openreview.net/forum?id=TBaS6AqX_F_,https://openreview.net/pdf?id=TBaS6AqX_F_,,"The complexity of human dexterity has attracted attention from multiple fields. Still, much is to be understood about how hand manipulation behaviors emerge. In this work we aim to learn dexterous manipulation behaviors with a physiologically realistic hand model: MyoHand. In contrast to prior works demonstrating isolated postural and force control, here we demonstrate musculoskeletal agents (MyoDex) exhibiting contact-rich dynamic dexterous manipulation behaviors in simulation. Furthermore, to demonstrate generalization, we show that a single MyoDex agent can be trained to solve up to 14 different contact-rich tasks. Aligned with human development, simultaneous learning of multiple tasks imparts physiologically coordinated muscle contractions, i.e., muscle synergies, that are not only shared amongst those in-domain tasks but are also effective in out-of-domain tasks. By leveraging these pre-trained manipulation synergies, we show generalization to 14 additional previously unsolved tasks. While physiological behaviors with large muscle groups (such as legged locomotion, arm reaching, etc.) have been demonstrated before, to the best of our knowledge, nimble behaviors of this complexity with smaller muscle groups are demonstrated here for the first time.","Musculoskeletal, Machine Learning, human dexterity, muscle synergies" Do We Really Need Labels for Backdoor Defense?,https://openreview.net/forum?id=0Hfv9xPBSPQ,https://openreview.net/pdf?id=0Hfv9xPBSPQ,,"Since training a model from scratch requires massive computational resources, it has recently become popular to download pre-trained backbones from third-party platforms and deploy them in various downstream tasks. While providing some convenience, this practice also introduces potential security risks like backdoor attacks, which lead to target misclassification for any input image with a specifically defined trigger (i.e., backdoored examples). Current backdoor defense methods always rely on clean labeled data, which indicates that safely deploying the pre-trained model in downstream tasks still demands these costly or hard-to-obtain labels. In this paper, we focus on how to purify a backdoored backbone with only unlabeled data. To evoke the backdoor patterns without labels, we propose to leverage the unsupervised contrastive loss to search for backdoors in the feature space.
Surprisingly, we find that we can mimic backdoored examples with adversarial examples crafted by the contrastive loss, and erase them with adversarial finetuning. Thus, we name our method Contrastive Backdoor Defense (CBD). Against several backdoored backbones from both supervised and self-supervised learning, extensive experiments demonstrate that our unsupervised method achieves defense comparable to or even better than supervised backdoor defense methods. Our method therefore allows practitioners to safely deploy pre-trained backbones on downstream tasks without extra labeling costs.", Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics,https://openreview.net/forum?id=PvLnIaJbt9,https://openreview.net/pdf?id=PvLnIaJbt9,Our work provides a unified and efficient framework for Metadata Archaeology -- uncovering and inferring metadata of examples in a dataset,"Modern machine learning research relies on relatively few carefully curated datasets. Even in these datasets, and typically in `untidy' or raw data, practitioners are faced with significant issues of data quality and diversity which can be prohibitively labor intensive to address. Existing methods for dealing with these challenges tend to make strong assumptions about the particular issues at play, and often require a priori knowledge or metadata such as domain labels. Our work is orthogonal to these methods: we instead focus on providing a unified and efficient framework for Metadata Archaeology -- uncovering and inferring metadata of examples in a dataset. We curate different subsets of data that might exist in a dataset (e.g. mislabeled, atypical, or out-of-distribution examples) using simple transformations, and leverage differences in learning dynamics between these probe suites to infer metadata of interest. Our method is on par with far more sophisticated mitigation methods across different tasks: identifying and correcting mislabeled examples, classifying minority-group samples, prioritizing points relevant for training and enabling scalable human auditing of relevant examples.","Metadata archaeology, Learning curves, Loss trajectory, Data auditing" Single SMPC Invocation DPHelmet: Differentially Private Distributed Learning on a Large Scale,https://openreview.net/forum?id=h1kEyC8CFI,https://openreview.net/pdf?id=h1kEyC8CFI,"Our differentially private distributed learning algorithm for image recognition tasks (e.g., CIFAR-10) scales better than prior work while improving the utility-privacy tradeoff on data-starved parties (50 data points per party).","Distributing machine learning predictors enables the collection of large-scale datasets while leaving sensitive raw data at trustworthy sites. We introduce a learning technique that is scalable to a large number of users, satisfies Differential Privacy, and is applicable to non-trivial tasks, such as CIFAR-10. For a large number of participants, communication cost is one of the main challenges. We achieve a low communication cost by requiring only a single invocation of an efficient secure multiparty summation protocol. By relying on state-of-the-art feature extractors, we are able to utilize differentially private convex learners for non-trivial tasks such as CIFAR-10. Convex learners have proven to have a strong utility-privacy tradeoff.
Our experimental results show that for $1{,}000$ users with $50$ data points each, our scheme outperforms state-of-the-art scalable distributed learning methods (differentially private federated learning, DP-FL for short) while requiring around $500$ times lower communication cost: for CIFAR-10, we achieve a classification accuracy of $67.3\,\%$ for $\varepsilon = 0.59$ while DP-FL achieves $57.6\,\%$. We also establish the learnability properties of convergence and uniform stability.","differential privacy, distributed learning, privacy-preserving machine learning, privacy, federated learning" A Scalable Training Strategy for Blind Multi-Distribution Noise Removal,https://openreview.net/forum?id=Jpctg2jSnMA,https://openreview.net/pdf?id=Jpctg2jSnMA,A Scalable Training Strategy for Blind Multi-Distribution Noise Removal,"Despite recent advances, developing general-purpose universal denoising and artifact-removal networks remains largely an open problem: Given fixed network weights, one inherently trades off specialization at one task (e.g., removing Poisson noise) for performance at another (e.g., removing speckle noise). In addition, training such a network is challenging due to the curse of dimensionality: As one increases the dimensions of the specification-space (i.e., the number of parameters needed to describe the noise distribution), the number of unique specifications one needs to train for grows exponentially. Uniformly sampling this space will result in a network that does well at very challenging problem specifications but poorly at easy problem specifications, where even large errors will have a small effect on the overall mean-squared-error. In this work we propose training denoising networks using an adaptive-sampling strategy. Our work improves upon a recent universal denoiser training strategy by extending the results to higher dimensions and by incorporating a polynomial approximation of the true specification-loss landscape. We test our method on joint Poisson-Gaussian-speckle noise and demonstrate that, with our training strategy, a single trained generalist denoiser network can achieve mean-squared-errors within a relatively uniform bound of specialized denoiser networks across a large range of operating conditions.","denoising, image restoration, curriculum learning" $\ell$Gym: Natural Language Visual Reasoning with Reinforcement Learning,https://openreview.net/forum?id=vV1aVdCD2WW,https://openreview.net/pdf?id=vV1aVdCD2WW,A new benchmark for language-conditioned reinforcement learning in visual environments with highly compositional human-written language.,"We present $\ell$Gym, a new benchmark for language-conditioned reinforcement learning in visual environments. $\ell$Gym is based on 2,661 human-written natural language statements grounded in an interactive visual environment, emphasizing compositionality and semantic diversity. We annotate all statements with Python programs representing their meaning. The programs are executable in an interactive visual environment to enable exact reward computation in every possible world state. Each statement is paired with multiple start states and reward functions to form thousands of distinct Markov Decision Processes of varying difficulty. We experiment with $\ell$Gym with different models and learning regimes. Our results and analysis show that while existing methods are able to achieve non-trivial performance, $\ell$Gym forms a challenging open problem.
","reinforcement learning, natural language, visual reasoning, benchmark" Re-Benchmarking Out-of-Distribution Detection in Deep Neural Networks,https://openreview.net/forum?id=VkPuKTKH2Gx,https://openreview.net/pdf?id=VkPuKTKH2Gx,,"Out-of-distribution (OOD) detection is a key challenge for making machine learning models robust in the real world, where we want models to be aware of uncertainty outside their training data distribution. Despite the rapid development of existing OOD detection algorithms, their experimental settings are usually inconsistent, e.g., datasets, evaluation metrics, model selection, implementation choices. In this paper, we aim to understand OOD detection fundamentally and provide a comprehensive benchmarking of the current state of the art OOD detection methods in a consistent and realistic evaluation setting. This benchmarking contains a serious of datasets split, model selection criteria and OOD detection algorithms. This experimental framework can be easily extended to new algorithms, datasets, and model selection criteria. We conduct extensive experiments on this benchmark and find that the threshold of OOD detection algorithms are not consistent over different datasets and model selection criteria.", Towards Antisymmetric Neural Ansatz Separation,https://openreview.net/forum?id=fadoo8Xs6pH,https://openreview.net/pdf?id=fadoo8Xs6pH,We show an exponential separation between the Slater Ansatz and Jastrow Ansatz used in quantum chemistry.,"We study separations between two fundamental models (or \emph{Ansätze}) of antisymmetric functions, that is, functions $f$ of the form $f(x_{\sigma(1)}, \ldots, x_{\sigma(N)}) = \text{sign}(\sigma)f(x_1, \ldots, x_N)$, where $\sigma$ is any permutation. These arise in the context of quantum chemistry, and are the basic modeling tool for wavefunctions of Fermionic systems. Specifically, we consider two popular antisymmetric Ansätze: the Slater representation, which leverages the alternating structure of determinants, and the Jastrow ansatz, which augments Slater determinants with a product by an arbitrary symmetric function. We construct an antisymmetric function that can be more efficiently expressed in Jastrow form, yet provably cannot be approximated by Slater determinants unless there are exponentially (in $N^2$) many terms. This represents the first explicit quantitative separation between these two Ansätze.","antisymmetric, Slater, Jastrow, quantum chemistry, separation" Multi-instance Interactive Segmentation with Self-Supervised Transformer,https://openreview.net/forum?id=7WgLZCURXxT,https://openreview.net/pdf?id=7WgLZCURXxT,Multi-instance interactive segmentation using Label Propagation and self-supervised representations from Vision Transformer.,"The rise of Vision Transformers (ViT) combined with better self-supervised learning pre-tasks has taken representation learning to the next level, beating supervised results on ImageNet. In particular, self-attention mechanism of ViT allows to easily visualize semantic information learned by the network. Following revealing of attention maps of DINO, many tried to leverage its representations for unsupervised segmentation. Despite very promising results for basic images with a single clear object in a simple background, representation of ViT are not able to segment images, with several classes and object instance, in an unsupervised fashion yet. 
In this paper, we propose SALT: Semi-supervised Segmentation with Self-supervised Attention Layers in Transformers, an interactive algorithm for multi-class/multi-instance segmentation. We follow the path of previous works and take it a step further by discriminating between different objects, using sparse human help to select said objects. We show that remarkable results are achieved with very sparse labels. Different pre-tasks are compared, and we show that self-supervised ones are more robust for panoptic segmentation, while overall achieving very similar performance. Evaluation is carried out on Pascal VOC 2007 and COCO-panoptic. Performance is evaluated under extreme conditions, such as very noisy and sparse interactions, going as low as one interaction per class.","Vision Transformer, Self-supervised learning, Interactive Image Segmentation, Semi-supervised learning" Triplet learning of task representations in latent space for continual learning,https://openreview.net/forum?id=xYmlvET9jNU,https://openreview.net/pdf?id=xYmlvET9jNU,,"Continual learning is a mechanism where a model is trained on tasks sequentially and learns the current task while retaining the knowledge of the previous tasks. Researchers have studied methods that utilize the latent space, including rehearsal with latent spaces and latent space partitioning. However, latent space overlapping can cause interference between knowledge of different tasks, leading to a performance drop on specific metrics, e.g. classification accuracy or quality of image reconstruction. To solve this problem, we propose a method of training an autoencoder with a triplet loss applied to partition its latent space. We denote the output of the encoder and some manually chosen layer of the decoder as the original latent space O and the common latent space C, respectively. Specifically, to mitigate the overlapping, we use the triplet loss in the common latent space to: (1) cluster the latent variables of data from the same class, so that their latent space is not too dispersive, and (2) push the latent spaces of data from different classes away from each other. We tested our method on several datasets, including MNIST, FashionMNIST, and CelebA. The experimental results show that our proposed model achieved an FID of 19 on MNIST and a recall of 0.272 on CelebA, which are better results than state-of-the-art models when trained under similar setups. Qualitatively, we achieve better partitioning, as shown by comparing the visualization of our latent space with those of other latent space methods.","Continual Learning, Triplet Learning, Image Generation" Spurious Features in Continual Learning,https://openreview.net/forum?id=Jp7NLnL3n_1,https://openreview.net/pdf?id=Jp7NLnL3n_1,This paper shows that catastrophic forgetting is partially due to spurious features.,"Continual Learning (CL) is the research field addressing learning without forgetting when the data distribution is not static. This paper studies spurious features' influence on continual learning algorithms. We show that continual learning algorithms solve tasks by selecting features that are not generalizable. Our experiments highlight that continual learning algorithms face two related problems: (1) spurious features (SP) and (2) local spurious features (LSP). The first one is due to a covariate shift between training and testing data, while the second is due to the limited access to data at each training step.
We study (1) through a consistent set of continual learning experiments varying the amount of spurious correlation and the support of the data distribution. We show that (2) is a major cause of performance decrease in continual learning, along with catastrophic forgetting. This paper presents a different way of understanding performance decrease in continual learning by highlighting the influence of (local) spurious features on algorithms' capabilities.","Spurious Features, Continual Learning, Plasticity" Time Series Subsequence Anomaly Detection via Graph Neural Networks,https://openreview.net/forum?id=73U_NlKaNx,https://openreview.net/pdf?id=73U_NlKaNx,A graph neural network-based time series subsequence anomaly detection method considering multiple effective heuristics.,"Time series subsequence anomaly detection is an important task in a large variety of real-world applications ranging from health monitoring to AIOps, and is challenging due to complicated underlying temporal dynamics and unpredictable anomalous patterns. Firstly, how to effectively learn the temporal dependency in time series remains a challenge. Secondly, diverse and complicated anomalous subsequences as well as the lack of labels make accurate detection difficult. For example, the popular subsequence anomaly detection algorithm---time series discord---fails to handle recurring anomalies. Thirdly, many existing algorithms require a proper subsequence length for effective detection, which is difficult or impossible to determine in practice. In this paper, we present a novel approach to subsequence anomaly detection which combines practical heuristics of time series discords and temporal relationships with deep neural networks. By performing length selection considering multi-scale information and incorporating prior knowledge using graph neural networks, our method can adaptively learn the appropriate subsequence length as well as integrated representations from both priors and raw data favorable to anomaly detection. In particular, our graph incorporates both semantic and temporal relationships between subsequences. The experimental results demonstrate the effectiveness of the proposed algorithm, which achieves superior performance on multiple time series anomaly benchmarks in comparison with state-of-the-art algorithms.", Aligning Model and Macaque Inferior Temporal Cortex Representations Improves Model-to-Human Behavioral Alignment and Adversarial Robustness,https://openreview.net/forum?id=SMYdcXjJh1q,https://openreview.net/pdf?id=SMYdcXjJh1q,Aligning late-stage model representations with neural recordings from macaque IT broadly improves adversarial robustness and alignment on human behavior.,"While some state-of-the-art artificial neural network systems in computer vision are strikingly accurate models of the corresponding primate visual processing, there are still many discrepancies between these models and the behavior of primates on object recognition tasks. Many current models suffer from extreme sensitivity to adversarial attacks and often do not align well with the image-by-image behavioral error patterns observed in humans. Previous research has provided strong evidence that primate object recognition behavior can be very accurately predicted by neural population activity in the inferior temporal (IT) cortex, a brain area in the late stages of the visual processing hierarchy.
Therefore, here we directly test whether making the late-stage representations of models more similar to those of macaque IT produces new models that exhibit more robust, primate-like behavior. We conducted chronic, large-scale multi-electrode recordings across the IT cortex in six non-human primates (rhesus macaques). We then use these data to fine-tune (end-to-end) the model ""IT"" representations such that they are more aligned with the biological IT representations, while preserving accuracy on object recognition tasks. We generate a cohort of models with a range of IT similarity scores validated on held-out animals across two image sets with distinct statistics. Across a battery of optimization conditions, we observe a strong correlation between the models' IT-likeness and alignment with human behavior, as well as an increase in their adversarial robustness. We further assess the limitations of this approach and find that the improvements in behavioral alignment and adversarial robustness generalize across different image statistics, but not to object categories outside of those covered in our IT training set. Taken together, our results demonstrate that building models that are more aligned with the primate brain leads to more robust and human-like behavior, and call for larger neural datasets to further augment these gains.","Computer Vision, Primate Vision, Adversarial Robustness, Behavioral Alignment, Inferior Temporal Cortex" Improving Generalization of Motor-Imagery Brainwave Decoding via Dynamic Convolutions,https://openreview.net/forum?id=2lrx543-MbS,https://openreview.net/pdf?id=2lrx543-MbS,Tackling inter-subject variability using dynamic convolutions and causal reasoning,"Deep Convolutional Neural Networks (CNNs) have recently demonstrated impressive results in electroencephalogram (EEG) decoding for several Brain-Computer Interfaces (BCI) paradigms, including Motor-Imagery (MI). However, neurophysiological processes underpinning EEG signals vary across subjects, causing covariate shifts in data distributions and hence hindering the generalization of deep models across subjects. In this paper, we aim to address the challenge of inter-subject variability in MI. To this end, we employ causal reasoning to characterize all possible distribution shifts in the MI task and propose a dynamic convolution framework to account for shifts caused by the inter-subject variability. Using publicly available MI datasets, we demonstrate improved generalization performance across subjects in various MI tasks for four well-established deep architectures.","Brain-Computer Interfaces, Dynamic Convolution, Causality" On the Expressive Power of Geometric Graph Neural Networks,https://openreview.net/forum?id=Rkxj1GXn9_,https://openreview.net/pdf?id=Rkxj1GXn9_,We propose a geometric version of the Weisfeiler-Leman graph isomorphism test to characterise the expressive power of GNNs for geometric graphs.,"The expressive power of Graph Neural Networks (GNNs) has been studied extensively through the lens of the Weisfeiler-Leman (WL) graph isomorphism test. Yet, many graphs arising in real-world applications come embedded in Euclidean space with an additional notion of geometric isomorphism, which is not covered by the WL framework. In this work, we propose a geometric version of the WL test (GWL) for discriminating geometric graphs while respecting the underlying physical symmetries: permutation, rotation, reflection, and translation.
We use GWL to characterise the expressive power of GNNs that are invariant or equivariant to physical symmetries by studying the classes of geometric graphs that can or cannot be distinguished by these architectures. This allows us to formalise the advantages equivariant GNN layers have over their invariant counterparts in the Geometric Deep Learning blueprint. Finally, we connect our discrimination-based perspective with the universal approximation properties of geometric GNNs and prove they are two sides of the same coin.","Graph Neural Networks, Geometric Deep Learning, Equivariance, Expressive Power, Graph Isomorphism" Fusion over the Grassmann Manifold for Incomplete-Data Clustering,https://openreview.net/forum?id=qLKammDlpF,https://openreview.net/pdf?id=qLKammDlpF,We introduce a novel paradigm that optimizes on the Grassmannian to complete and cluster incomplete data in a union of subspaces.,"This paper presents a new paradigm to cluster incomplete vectors using subspaces as proxies to exploit the geometry of the Grassmannian. We leverage this new perspective to develop an algorithm to cluster and complete data in a union of subspaces via a fusion penalty formulation. Our approach does not require prior knowledge of the number of subspaces, is naturally suited to handle noise, and only requires an upper bound on the subspaces’ dimensions. In developing our model, we present local convergence guarantees. We describe clustering, completion, model selection, and sketching techniques that can be used in practice, and complement our analysis with synthetic and real-data experiments.","high-rank matrix completion, subspace clustering, manifold learning" Off Policy Average Reward Actor Critic with Deterministic Policy Search,https://openreview.net/forum?id=_3Lk3cUWSI,https://openreview.net/pdf?id=_3Lk3cUWSI,This paper proposes an actor-critic algorithm with a deterministic policy for the average reward criterion,"The average reward criterion is relatively unexplored, as most existing works in the Reinforcement Learning literature consider the discounted reward criterion. A few recent works present on-policy average reward actor-critic algorithms, but the average reward off-policy actor-critic setting is relatively less explored. In this paper, we present both on-policy and off-policy deterministic policy gradient theorems for the average reward performance criterion. Using these theorems, we also present an Average Reward Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) Algorithm. We provide a finite-time analysis of the resulting three-timescale stochastic approximation scheme and obtain an $\epsilon$-optimal stationary policy with a sample complexity of $\Omega(\epsilon^{-2.5})$. We compare the average reward performance of our proposed algorithm with state-of-the-art on-policy average reward actor-critic algorithms and observe better empirical performance on MuJoCo-based environments.","reinforcement learning, actor critic algorithm, deterministic policy, off-policy, target network, average reward, finite time analysis, convergence, three time scale stochastic approximation, DeepMind control suite" Why Did This Model Forecast This Future? Information-Theoretic Temporal Saliency for Counterfactual Explanations of Probabilistic Forecasts,https://openreview.net/forum?id=Qi4oCA89CmO,https://openreview.net/pdf?id=Qi4oCA89CmO,"We propose an information-theoretic saliency-based framework for counterfactual reasoning in probabilistic forecasting.
For common distributions, we obtain a closed-form expression for the saliency of observed timesteps towards a model's forecasts. ","Probabilistic forecasting of multivariate time series is significant in several research domains where multiple futures exist for a single observed sequence. Identifying the observations on which a well-performing model bases its forecasts can enable domain experts to form data-driven hypotheses about the causal relationships between features. Consequently, we begin by revisiting the question: what constitutes a causal explanation? One hurdle in the landscape of explainable artificial intelligence is that what constitutes an explanation is not well-grounded. We build upon Miller's framework of explanations derived from research in multiple social science disciplines, and establish a conceptual link between counterfactual reasoning and saliency-based explanation techniques. However, a complication is the lack of a consistent and principled notion of saliency. Also, commonly derived saliency maps may be inconsistent with the data generation process and the underlying model. We therefore leverage a unifying definition of information-theoretic saliency grounded in preattentive human visual cognition and extend it to forecasting settings. In contrast to existing methods that require either explicit training of the saliency mechanism or access to the internal parameters of the underlying model, we obtain a closed-form solution for the resulting saliency map for commonly used density functions in probabilistic forecasting. To empirically evaluate our explainability framework in a principled manner, we construct a synthetic dataset of conversation dynamics and demonstrate that our method recovers the true salient timesteps for a forecast given a well-performing underlying model.","probabilistic forecasting, saliency, explainability" CLMIU: Commonsense Learning in Multimodal Image Understanding.,https://openreview.net/forum?id=klOPHkfx0ic,https://openreview.net/pdf?id=klOPHkfx0ic,"Incorporating external commonsense knowledge into multimodal image understanding tasks, e.g., image captioning. The proposed method achieves state of the art results without needing a pretrained object detector.","The problem of automatically describing the content of an image through accurate and meaningful captions has been attracting considerable attention among computer vision researchers. Recently, Transformers have been applied to image captioning to encode cross-modal information, in conjunction with Convolutional Neural Networks, which provide image region descriptions in terms of embeddings and object labels as input. However, the generated captions sometimes fail to capture the intentions, relationships, and abstract concepts that rely on general or commonsense knowledge. In this work, we propose a novel network design, combining the strengths of Transformer models with graph-based models conveying external (common sense) knowledge. Our proposed architecture is a pure vision transformer-based image captioning model, with sequences of image patches used directly as input, without extracting any regional features. In particular, unlike prior work, our architecture incorporates a knowledge-augmented encoder with a Transformer backbone to inject the external knowledge extracted from a knowledge graph.
Furthermore, bidirectional training on a vision-language corpus of image-text pairs, using modality-specific self-supervised learning objectives, achieves promising results compared to the state-of-the-art. Our method has been trained from scratch on a small dataset, achieving improvements of 3.8%, 2.7%, 3.2% and 6.3% in BLEU@4, Meteor, Rouge and Cider scores, respectively. We also report competitive results on the NoCaps dataset, showing that the model generalizes to unseen object categories.","Vision and language pretraining, Image captioning, Commonsense knowledge, Transformers, Graph attention networks, Group masked model learning" Topological Data Analysis-Deep Learning Framework for Predicting Cancer Phenotypes,https://openreview.net/forum?id=4gwZXPNhBt,https://openreview.net/pdf?id=4gwZXPNhBt,The use of topological data analysis to predict cancer-type phenotypes.,"Classification of patient cancer phenotypes from gene expression profiles remains a challenge in the field of transcriptomics. Gene expression data typically suffers from extreme noise and performs poorly with deep learning models alone. We build on recent work by Mandal et al. by incorporating the concept of differential gene expression analysis to pre-select genes that are necessary but not sufficient for disease association in our topological data analysis approach. The outcome is a reduction in computational cost in the calculation of persistent homology. We also test multiple topological representations to optimise prediction. Deep learning with topological features performs better than deep learning on raw data. Thus, topological features offer a new perspective on the difficult-to-unravel non-linear connection between genotype and phenotype.","Topological data analysis, Deep learning, Gene expression, Cancer Phenotype prediction" In-Situ Text-Only Adaptation of Speech Models with Low-Overhead Speech Imputations,https://openreview.net/forum?id=T2Ncx_PN2K,https://openreview.net/pdf?id=T2Ncx_PN2K,A lightweight text-only adaptation technique for end-to-end speech recognition that is both fast and accurate.,"Fast and accurate adaptation of automatic speech recognition (ASR) systems using only text data in the target domain is a problem of long-standing practical relevance. Text-only adaptation was easy in traditional cascaded ASR systems with completely decoupled acoustic and language models. Recently, the RNN-Transducer (RNN-T) has emerged as a default ASR model because of its high accuracy, low latency, and capability of supporting streaming input. However, text-only adaptation of the RNN-T model is significantly more challenging due to its tight integration of acoustic and language models and end-to-end training. Existing recent approaches for text-only adaptation of RNN-Ts either entail significant modification to the network or introduce high latency during decoding. We propose a new approach (TOLSTOI) that imputes speech representations internal to a baseline RNN-T, starting from text-only inputs, and performs in-situ adaptation that results in higher adaptation accuracy without any runtime overhead during decoding. Our imputation model is a function of the labeled data and trained parameters of the ASR model and, as we show, is more effective in controlling catastrophic forgetting compared to existing methods. We establish the effectiveness of TOLSTOI using three target domains and two ASR models of varying complexity.
We achieve up to a 35% relative reduction in word error rate with text-only adaptation while forgetting the least compared to existing adaptation approaches. Our method is easy to implement and can be harnessed on existing RNN-T models without requiring ASR model training from scratch.","Text-Only Adaptation, End-to-end Speech Recognition" Rethinking Uniformity in Self-Supervised Representation Learning,https://openreview.net/forum?id=hFUlfiyf1oQ,https://openreview.net/pdf?id=hFUlfiyf1oQ,,"Self-supervised representation learning has achieved great success in many machine learning tasks. While many research efforts focus on learning better representations by preventing the \emph{collapse} problem, less attention has been paid to analyzing the degree of collapse of representations. In this paper, we present a formal study of collapse analysis via the \emph{uniformity} metric, which measures how uniformly learned representations distribute on the surface of the unit hypersphere. We fundamentally find that \textit{representations that obey a zero-mean isotropic Gaussian distribution have the ideal uniformity}, since their $l_2$-normalized forms distribute uniformly on the surface of the unit hypersphere. Therefore, we propose to use the Wasserstein distance between the distribution of learned representations and the ideal distribution as a quantifiable metric of \emph{uniformity}. Moreover, we design five desirable constraints for ideal uniformity metrics, based on which we find that the proposed uniformity metric satisfies all constraints while the existing one does not. Synthetic experiments also demonstrate that the proposed uniformity metric is capable of dealing with dimensional collapse, while the existing one is insensitive to it. Furthermore, we impose the proposed \emph{uniformity} metric as an auxiliary loss term for various existing self-supervised methods, which consistently improves the downstream performance.","Collapse Analysis, Wasserstein Distance, Self-Supervised Representation Learning" Proposal-Contrastive Pretraining for Object Detection from Fewer Data,https://openreview.net/forum?id=gm0VZ-h-hPy,https://openreview.net/pdf?id=gm0VZ-h-hPy,"We present Proposal Selection Contrast (ProSeCo), a novel unsupervised overall pretraining approach for Object Detection that leverages the large number of object proposals generated by transformer-based detectors using an improved contrastive loss.","The use of pretrained deep neural networks represents an attractive way to achieve strong results with little data available. When specialized in dense problems such as object detection, learning local rather than global information in images has proven to be more efficient. However, for unsupervised pretraining, the popular contrastive learning requires a large batch size and, therefore, a lot of resources. To address this problem, we are interested in transformer-based object detectors that have recently gained traction in the community with good performance and with the particularity of generating many diverse object proposals. In this work, we present Proposal Selection Contrast (ProSeCo), a novel unsupervised overall pretraining approach that leverages this property. ProSeCo uses the large number of object proposals generated by the detector for contrastive learning, which allows the use of a smaller batch size, combined with object-level features to learn local information in the images.
To improve the effectiveness of the contrastive loss, we incorporate object location information into the selection of positive examples, to take into account multiple overlapping object proposals. When reusing a pretrained backbone, we advocate for consistency in learning local information between the backbone and the detection head. We show that our method outperforms the state of the art in unsupervised pretraining for object detection on standard and novel benchmarks when learning with less data.","Object Detection, Unsupervised, Pretraining, Contrastive Learning" SkillS: Adaptive Skill Sequencing for Efficient Temporally-Extended Exploration,https://openreview.net/forum?id=Hyan74saltV,https://openreview.net/pdf?id=Hyan74saltV,,"The ability to effectively reuse prior knowledge is a key requirement when building general and flexible Reinforcement Learning (RL) agents. Skill reuse is one of the most common approaches, but current methods have considerable limitations. For example, fine-tuning an existing policy frequently fails, as the policy can degrade rapidly early in training, particularly in sparse reward tasks. In a similar vein, distillation of expert behavior can lead to poor results when given sub-optimal experts. We compare several common approaches for skill transfer on multiple domains and in several different transfer settings, including under changes in task and system dynamics. We identify how existing methods can fail and introduce an alternative approach which sidesteps some of these problems. Our approach learns to sequence existing temporally-abstract skills for exploration but learns the final policy directly from the raw experience. This conceptual split enables rapid adaptation and thus efficient data collection but without constraining the final solution. Our approach significantly outperforms many classical methods across a suite of evaluation tasks and we use a broad set of ablations to highlight the importance of different components of our method.","Reinforcement Learning, Control, Skills, Priors, Hierarchical Reinforcement Learning" Bridging between Pool- and Stream-Based Active Learning with Temporal Data Coherence,https://openreview.net/forum?id=X6MIKw1XuxF,https://openreview.net/pdf?id=X6MIKw1XuxF,New stream-based active learning approach using properties of temporal stream data,"Active learning (AL) reduces the amount of labeled data needed for training a machine learning model by intelligently choosing which instances to label. Classic pool-based AL needs all data to be present in a datacenter, which can be challenging with the increasing amounts of data needed in deep learning. However, AL on mobile devices and robots like autonomous cars can filter the data from perception sensor streams before it ever reaches the datacenter. In our work, we investigate AL for such image streams and propose a new concept exploiting their temporal properties. We define three methods using a pseudo uncertainty based on loss learning (Yoo & Kweon, 2019). The first considers the temporal change of uncertainty and requires 5% less labeled data than the vanilla approach. It is extended by the change in latent space in the second method. The third method, temporal distance loss stream (TDLS), combines both with submodular optimization. In our evaluation on an extension of the public Audi Autonomous Driving Dataset (Geyer et al., 2020) we outperform state-of-the-art approaches by using 1% fewer labels.
Additionally, we compare our stream-based approaches with existing approaches for AL in a pool-based scenario. Our experiments show that, although pool-based AL has access to more data, our stream-based AL approaches need 0.5% fewer labels.", The Robustness Limits of SoTA Vision Models to Natural Variation,https://openreview.net/forum?id=BW2A403ema,https://openreview.net/pdf?id=BW2A403ema,"Even today's best vision models are not robust and struggle to generalize to changes in factors such as pose, size, and position.","Recent state-of-the-art vision models introduced new architectures, learning paradigms, and larger pretraining data, leading to impressive performance on tasks such as classification. While previous generations of vision models were shown to lack robustness to factors such as pose, it is unclear to what extent this next generation of models is more robust. To study this question, we develop a dataset of more than 7 million images with controlled changes in pose, position, background, lighting, and size. We study not only how robust recent state-of-the-art models are, but also the extent to which models can generalize to variation in factors present during training. We consider a catalog of recent vision models, including vision transformers (ViT), self-supervised models such as masked autoencoders (MAE), and models trained on larger datasets such as CLIP. We find that, out of the box, even today's best models are not robust to common changes in pose, size, and background. When some samples were varied during training, we found models required a significant amount of diversity to generalize, though robustness did eventually improve. When diversity was seen for only some classes, however, models did not generalize to other classes, unless the classes were very similar to those seen varying during training. We hope our work will shed further light on the blind spots of SoTA models and spur the development of more robust vision models.","robustness, computer vision, generalization, deep learning" Scaling Laws For Deep Learning Based Image Reconstruction,https://openreview.net/forum?id=op-ceGueqc4,https://openreview.net/pdf?id=op-ceGueqc4,"The performance improvement of deep learning based image reconstruction methods as a function of the training set size slows already at moderate training set sizes, indicating that only marginal gains are expected beyond a few thousand examples.","Deep neural networks trained end-to-end to map a measurement of a (noisy) image to a clean image perform excellently for a variety of linear inverse problems. Current methods are only trained on a few hundred or thousand images, as opposed to the millions of examples deep networks are trained on in other domains. In this work, we study whether major performance gains are expected from scaling up the training set size. We consider image denoising, accelerated magnetic resonance imaging, and super-resolution, and empirically determine the reconstruction quality as a function of training set size, while optimally scaling the network size. For all three tasks we find that an initially steep power-law scaling slows significantly already at moderate training set sizes. Interpolating those scaling laws suggests that even training on millions of images would not significantly improve performance. To understand the expected behavior, we analytically characterize the performance of a linear estimator learned with early-stopped gradient descent.
The result formalizes the intuition that once the error induced by learning the signal model is small relative to the error floor, more training examples do not improve performance.","scaling laws, number of training examples, inverse problems, deep learning, denoising, magnetic resonance imaging, super-resolution" Robust Exploration via Clustering-based Online Density Estimation,https://openreview.net/forum?id=tVrRejrC-RZ,https://openreview.net/pdf?id=tVrRejrC-RZ,We derive a theoretically sound and effective exploration bonus for deep RL using online clustering to estimate visitation densities in a learned latent representation space,"Intrinsic motivation is a critical ingredient in reinforcement learning to enable progress when rewards are sparse. However, many existing approaches that measure the novelty of observations are brittle, or rely on restrictive assumptions about the environment which limit generality. We propose to decompose the exploration problem into two orthogonal sub-problems: (i) finding the right representation (metric) for exploration, and (ii) estimating densities in this representation space. To address (ii), we introduce Robust Exploration via Clustering-based Online Density Estimation (RECODE), a non-parametric method that estimates visitation counts for clusters of states that are similar according to the metric induced by any arbitrary representation learning technique. We adapt classical clustering algorithms to design a new type of memory that allows RECODE to keep track of the history of interactions over thousands of episodes, thus effectively tracking global visitation counts. This is in contrast to existing non-parametric approaches, which can only store the recent history, typically the current episode. The generality of RECODE allows us to easily address (i) by leveraging both off-the-shelf and novel representation learning techniques. In particular, we introduce a novel generalization of the action-prediction representation that leverages multi-step predictions and that we find to be better suited to a suite of challenging 3D-exploration tasks in DM-HARD-8. We show experimentally that our approach can work with a variety of RL agents, and obtain state-of-the-art performance on Atari and DM-HARD-8.","exploration, representation learning, density estimation, reinforcement learning" Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning,https://openreview.net/forum?id=3oWo92cQyxL,https://openreview.net/pdf?id=3oWo92cQyxL,"We introduce a novel multimodal few-shot meta-learner, by learning how to bridge large-scale frozen vision and language models.","Multimodal few-shot learning is challenging due to the large domain gap between vision and language modalities. Existing methods attempt to communicate visual concepts as prompts to frozen language models, but rely on hand-engineered task induction to reduce the hypothesis space. To make the whole process learnable, we introduce a multimodal meta-learning approach. Specifically, our approach decomposes the training of the model into a set of related multimodal few-shot tasks. We define a meta-mapper network, acting as a meta-learner, to efficiently bridge frozen large-scale vision and language models and leverage their already learned capacity. By updating only the learnable parameters of the meta-mapper, it learns to accrue shared meta-knowledge among these tasks. Thus, it can rapidly adapt to newly presented samples with only a few gradient updates.
Importantly, it induces the task in a completely data-driven manner, with no need for hand-engineered task induction. We evaluate our approach on recently proposed multimodal few-shot benchmarks, measuring how rapidly the model can bind novel visual concepts to words and answer visual questions by observing only a limited set of labeled examples. The experimental results show that our meta-learning approach outperforms the baseline across multiple datasets and various training settings, while being computationally more efficient.","multimodal, few-shot learning, meta-learning, transformers, vision and language models" DLP: Data-Driven Label-Poisoning Backdoor Attack,https://openreview.net/forum?id=33daZzvuzY6,https://openreview.net/pdf?id=33daZzvuzY6,"We introduce a new type of end-to-end clean-sample backdoor attack, allowing attackers to backdoor effectively, as measured by test performance, for an arbitrary backdoor sample size.","Backdoor attacks, which aim to disrupt or paralyze classifiers on specific tasks, are becoming an emerging concern in several learning scenarios, e.g., Machine Learning as a Service (MLaaS). Various backdoor attacks have been introduced in the literature, including perturbation-based methods, which modify a subset of training data, and clean-sample methods, which relabel only a proportion of training samples. Indeed, clean-sample attacks can be particularly stealthy since they never require modifying the samples at the training and test stages. However, the state-of-the-art clean-sample attack, which relabels training data based on semantic meaning, can be ineffective and inefficient in test performance due to its heuristic selection of semantic patterns. In this work, we introduce a new type of clean-sample backdoor attack, named the DLP backdoor attack, which allows attackers to backdoor effectively, as measured by test performance, for an arbitrary backdoor sample size. The critical component of DLP is a data-driven backdoor scoring mechanism embedded in a multi-task formulation, which enables attackers to simultaneously perform well on the normal learning tasks and the backdoor tasks. Systematic empirical evaluations show the superior performance of the proposed DLP to state-of-the-art clean-sample attacks.","Backdoor learning, End-to-end learning, Clean-sample attack" AlphaFold Distillation for Improved Inverse Protein Folding,https://openreview.net/forum?id=brk7Ct4Tb1M,https://openreview.net/pdf?id=brk7Ct4Tb1M,"We distilled AlphaFold to generate structural confidence scores (pTM, pLDDT) for any protein sequence and applied it to inverse folding design and antibody infilling","Inverse protein folding, i.e., designing sequences that fold into a given three-dimensional structure, is one of the fundamental design challenges in bio-engineering and drug discovery. Traditionally, inverse folding mainly involves learning from sequences that have an experimentally resolved structure. However, the known structures cover only a tiny space of the protein sequences, imposing limitations on the model learning. Recently proposed forward folding models, e.g., AlphaFold, offer an unprecedented opportunity for accurate estimation of the structure given a protein sequence.
Naturally, incorporating a forward folding model as a component of an inverse folding approach offers the potential to significantly improve inverse folding, as the folding model can provide feedback on any generated sequence in the form of the predicted protein structure or a structural confidence metric. However, at present, these forward folding models are still prohibitively slow to be a part of the model optimization loop during training. In this work, we propose to perform knowledge distillation on the folding model's confidence metrics, e.g., pTM or pLDDT scores, to obtain a smaller, faster, and end-to-end differentiable distilled model, which can then be included as part of the structure consistency regularized inverse folding model training. Moreover, our regularization technique is general enough to be applied to other design tasks, e.g., sequence-based protein infilling. Extensive experiments show a clear benefit of our method over the non-regularized baselines. For example, in inverse folding design problems we observe up to a 3% improvement in sequence recovery and up to a 45% improvement in protein diversity, while still preserving structural consistency of the generated sequences.","Inverse Protein Folding Design, Protein Design, Model Distillation, AlphaFold, Protein Folding" Convexifying Transformers: Improving optimization and understanding of transformer networks,https://openreview.net/forum?id=PJVZCd4Dn2w,https://openreview.net/pdf?id=PJVZCd4Dn2w,We first propose a convex alternative to the self-attention mechanism and then develop a convex analytic framework to train attention/transformer networks.,"Understanding the fundamental mechanism behind the success of transformer networks is still an open problem in the deep learning literature. Although their remarkable performance has been mostly attributed to the self-attention mechanism, the literature still lacks a solid analysis of these networks and interpretation of the functions learned by them. To this end, we study the training problem of attention/transformer networks and introduce a novel convex analytic approach to improve the understanding and optimization of these networks. Particularly, we first introduce a convex alternative to the self-attention mechanism and reformulate the regularized training problem of attention/transformer networks. Then, we cast the reformulation as a convex optimization problem that is interpretable and easier to optimize. Moreover, as a byproduct of our convex analysis, we reveal an implicit regularization mechanism, which promotes sparsity across tokens. Therefore, we not only improve the optimization of attention/transformer networks but also provide a solid theoretical understanding of the functions learned by them. We also demonstrate the effectiveness of our theory through several numerical experiments.","Convex optimization, transformers, attention, self-attention, group sparsity" Unsupervised Model-based Pre-training for Data-efficient Control from Pixels,https://openreview.net/forum?id=XGXPeOXbiT-,https://openreview.net/pdf?id=XGXPeOXbiT-,"We combine the findings of a large-scale study on the Unsupervised RL Benchmark and propose a new hybrid planner, to establish an end-to-end approach that can be efficiently fine-tuned after unsupervised pre-training.","Controlling artificial agents from visual sensory data is an arduous task. Reinforcement learning (RL) algorithms can succeed in this but require large amounts of interaction between the agent and the environment.
To alleviate this issue, unsupervised RL proposes employing self-supervised interaction and learning to adapt faster to future tasks. Yet, whether current unsupervised strategies improve generalization capabilities is still unclear, especially in visual control settings. In this work, we design an unsupervised RL strategy for data-efficient visual control. First, we show that world models pre-trained with data collected using unsupervised RL can facilitate adaptation for future tasks. Then, we analyze several design choices for adapting faster: effectively reusing the agents' pre-trained components and planning in imagination with our hybrid planner, which we dub Dyna-MPC. By combining the findings of a large-scale empirical study, we establish an approach that strongly improves performance on the Unsupervised RL Benchmark, requiring 20$\times$ less data to match the performance of supervised methods. The approach also demonstrates robust performance on the Real-World RL benchmark, hinting that it generalizes to noisy environments.","unsupervised reinforcement learning, world models, planning" A Cognitive-inspired Multi-Module Architecture for Continual Learning,https://openreview.net/forum?id=wPLEzBcSC7p,https://openreview.net/pdf?id=wPLEzBcSC7p,"Cognitive Continual Learner (CCL), a novel cognitive-inspired method that employs multiple modules, implicit and explicit knowledge representation dichotomy, inductive bias, and a multi-memory system. ","Artificial neural networks (ANNs) exhibit a narrow scope of expertise on stationary independent data. However, data in the real world is continuous and dynamic, and ANNs must adapt to novel scenarios while also retaining the learned knowledge to become lifelong learners. The ability of humans to excel at these tasks can be attributed to multiple factors ranging from cognitive computational structures and cognitive biases to the multi-memory systems in the brain. We incorporate key concepts from each of these to design a cognitive-inspired continual learning method. Cognitive Continual Learner (CCL) includes multiple modules, an implicit and explicit knowledge representation dichotomy, inductive bias, and a multi-memory system. CCL shows improvement across all continual learning settings and also exhibits reduced task recency bias. To test the versatility of continual learning methods under a challenging distribution shift, we introduce a novel domain-incremental dataset, Domain${^2}$IL. In addition to improved performance on existing benchmarks, CCL also demonstrates superior performance on this dataset.","Continual Learning, Catastrophic Forgetting, Experience Replay, Lifelong Learning, Cognitive Learning, Inductive Bias" Shuffled Transformers for Blind Training,https://openreview.net/forum?id=sWUlKZOM8kfs,https://openreview.net/pdf?id=sWUlKZOM8kfs,,"Conventional split learning faces the challenge of preserving training data and model privacy, as a part of the training is beyond the data owner's control. We tackle this problem by introducing blind training, i.e., training without being aware of the data or the model, realized by shuffled Transformers. This is attributed to our intriguing findings that the inputs and the model weights of the Transformer encoder blocks, the backbone of the Transformer, can be shuffled without degrading the model performance.
We not only prove the shuffling invariance property in theory, but also design a privacy-preserving split learning framework following the property, with little modification to the original Transformer architecture. We verify the properties through experiments, and also show that our proposed framework successfully defends against privacy attacks on split learning.","Data privacy, split learning, Transformer, privacy-preserving" Non-Gaussian Process Regression,https://openreview.net/forum?id=lCYrsdHb5SQ,https://openreview.net/pdf?id=lCYrsdHb5SQ,We extend the Gaussian process regression model to allow for locally adaptive behaviour through time-changed GPs and learn latent probabilistic representations of inputs.,"Standard GPs offer a flexible modelling tool for well-behaved processes. However, deviations from Gaussianity are expected to appear in real world datasets, with structural outliers and shocks routinely observed. In these cases GPs can fail to model uncertainty adequately and may over-smooth inferences. Here we extend the GP framework into a new class of time-changed GPs that allow for straightforward modelling of heavy-tailed non-Gaussian behaviours, while retaining a tractable conditional GP structure through an infinite mixture of non-homogeneous GPs representation. The conditional GP structure is obtained by conditioning the observations on a latent transformed input space, and the random evolution of the latent transformation is modelled using a Lévy process, which allows Bayesian inference over both the posterior predictive density and the latent transformation function. We present Markov chain Monte Carlo inference procedures for this model and demonstrate the potential benefits compared to a standard GP.","Non-parametric regression, Bayesian methods, Approximate Inference, Levy processes" ImageNet-X: Understanding Model Mistakes with Factor of Variation Annotations,https://openreview.net/forum?id=HXz7Vcm3VgM,https://openreview.net/pdf?id=HXz7Vcm3VgM,we annotate ImageNet images with factor labels to explain model mistakes,"Deep learning vision systems are widely deployed across applications where reliability is critical. However, even today's best models can fail to recognize an object when its pose, lighting, or background varies. While existing benchmarks surface examples challenging for models, they do not explain why such mistakes arise. To address this need, we introduce ImageNet-X—a set of sixteen human annotations of factors such as pose, background, or lighting for the entire ImageNet-1k validation set, as well as a random subset of 12k training images. Equipped with ImageNet-X, we investigate 2,200 current recognition models and study the types of mistakes as a function of the model's (1) architecture, e.g. transformer vs. convolutional, (2) learning paradigm, e.g. supervised vs. self-supervised, and (3) training procedures, e.g., data augmentation. Regardless of these choices, we find that models have consistent failure modes across ImageNet-X categories. We also find that while data augmentation can improve robustness to certain factors, it induces spill-over effects on other factors. For example, color-jitter augmentation improves robustness to color and brightness, but surprisingly hurts robustness to pose. Together, these insights suggest that to advance the robustness of modern vision models, future research should focus on collecting additional data and understanding data augmentation schemes.
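To picture how factor annotations of this kind support a per-factor error analysis, here is a minimal sketch; the file name and column layout are hypothetical, standing in for ImageNet-X-style annotations joined with a model's predictions.

```python
import pandas as pd

# Hypothetical layout: one row per validation image, boolean factor columns
# (pose, background, ...) plus a `correct` column for a given model.
df = pd.read_csv("imagenet_x_annotations_with_predictions.csv")
factors = ["pose", "background", "lighting", "color", "occlusion"]

overall_err = 1.0 - df["correct"].mean()
for f in factors:
    sub = df[df[f] == 1]
    err = 1.0 - sub["correct"].mean()
    # error ratio > 1 means the model is disproportionately hurt by this factor
    print(f"{f:12s} error ratio: {err / overall_err:.2f}")
```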
Along with these insights, we release a toolkit based on ImageNet-X to spur further study into the mistakes image recognition systems make.", Hardware-aware compression with Random Operation Access Specific Tile (ROAST) hashing,https://openreview.net/forum?id=xFiI7VmVB4H,https://openreview.net/pdf?id=xFiI7VmVB4H,efficient model compression using parameter sharing tuned to underlying hardware and algorithm implementations.,"Advancements in deep learning are often associated with increasing model sizes. Training and deploying large models require sophisticated hardware and incur significantly higher costs. Thus, model compression is a widely explored approach to solving the problem. However, SOTA techniques fall short in one or more desirable aspects of compression - for instance, pruning does not reduce memory for training, quantization can only provide up to $32\times$ compression, HashedNet is cache-inefficient, etc. This paper proposes a model-agnostic, cache-friendly, and hardware-aware model compression approach: Random Operation Access Specific Tile (ROAST) hashing. ROAST collapses the parameters by clubbing them through a lightweight mapping. While clubbing these parameters, ROAST utilizes cache hierarchies by aligning the memory access pattern with the parameter access pattern. ROAST is up to $\sim 25 \times$ faster to train and $\sim 50 \times$ faster to infer than the popular parameter sharing method HashedNet. Additionally, ROAST introduces global weight sharing, which is empirically and theoretically superior to local weight sharing in HashedNet, and can be of independent interest. With ROAST, we can efficiently train and deploy the model using a much smaller memory footprint ($\sim 10\times-100\times$ smaller) in text and image classification tasks.","model compression, hardware aware" SoftZoo: A Soft Robot Co-design Benchmark For Locomotion In Diverse Environments,https://openreview.net/forum?id=Xyme9p1rpZw,https://openreview.net/pdf?id=Xyme9p1rpZw,We introduce a new virtual environment for soft robot co-design.,"While significant research progress has been made in robot learning for control, unique challenges arise when simultaneously co-optimizing morphology. Existing work has typically been tailored for particular environments or representations. In order to more fully understand inherent design and performance tradeoffs and accelerate the development of new breeds of soft robots, a comprehensive virtual platform — with well-established tasks, environments, and evaluation metrics — is needed. In this work, we introduce SoftZoo, a soft robot co-design platform for locomotion in diverse environments. SoftZoo supports an extensive, naturally-inspired material set, including the ability to simulate environments such as flat ground, desert, wetland, clay, ice, snow, shallow water, and ocean. Further, it provides a variety of tasks relevant for soft robotics, including fast locomotion, agile turning, and path following, as well as differentiable design representations for morphology and control. Combined, these elements form a feature-rich platform for analysis and development of soft robot co-design algorithms. We benchmark prevalent representations and co-design algorithms, and shed light on (1) the interplay between environment, morphology, and behavior; (2) the importance of design space representations; (3) the ambiguity in muscle formation and controller synthesis; and (4) the value of differentiable physics.
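To illustrate the tile-based sharing idea behind ROAST, here is a toy sketch, under our own simplifying assumptions: a single global parameter array from which each layer reads a contiguous chunk starting at a hashed offset, so memory accesses stay contiguous, unlike per-element hashing in HashedNet. This is an illustration of the concept, not the paper's implementation.

```python
import torch

class TiledSharedLinear(torch.nn.Module):
    """Toy tile-style parameter sharing: all layers draw weights from one
    shared array, reading a *contiguous* (cache-friendly) chunk at a
    seed-determined offset."""
    def __init__(self, global_memory, in_features, out_features, seed):
        super().__init__()
        self.memory = global_memory                    # shared nn.Parameter, shape (M,)
        n = in_features * out_features
        g = torch.Generator().manual_seed(seed)        # hash-like, reproducible offset
        self.offset = int(torch.randint(0, global_memory.numel() - n, (1,), generator=g))
        self.shape = (out_features, in_features)

    def forward(self, x):
        n = self.shape[0] * self.shape[1]
        w = self.memory[self.offset:self.offset + n].view(self.shape)
        return torch.nn.functional.linear(x, w)

memory = torch.nn.Parameter(torch.randn(100_000) * 0.01)   # the compressed footprint
layer = TiledSharedLinear(memory, 256, 128, seed=0)
```

Since every layer views slices of the same array, the trainable memory is the global array alone, giving the compression, while contiguous reads keep the access pattern cache-aligned.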
We envision SoftZoo serving as a standard platform and template for the development of novel representations and algorithms for co-designing soft robots' behavioral and morphological intelligence. Demos are available on our project page.","soft robot, co-design, differentiable physics" Smooth Mathematical Functions from Compact Neural Networks,https://openreview.net/forum?id=c2l1XbSRnpZ,https://openreview.net/pdf?id=c2l1XbSRnpZ,Smooth mathematical functions are obtained from neural networks comprising few weight parameters, using a new activation function and a new batch method.,"This paper concerns smooth function approximation by neural networks (NNs). Mathematical or physical functions can be replaced by NN models through regression. In this study, we obtain NNs that generate highly accurate and highly smooth functions, comprising only a few weight parameters, by discussing several topics related to regression. First, we reinterpret the inner workings of NNs for regression; consequently, we propose a new activation function--the integrated sigmoid linear unit (ISLU). Then, the special characteristics of metadata for regression, which differ from those of other data such as images or sound, are discussed for improving the performance of neural networks. Finally, a simple hierarchical NN that generates models substituting mathematical functions is presented, and a new batch concept, ``meta-batch"", which improves the performance of NNs several times over, is introduced. The new activation function, the meta-batch method, features of numerical data, meta-augmentation with metaparameters, and a structure of NN generating a compact multi-layer perceptron (MLP) are essential in this study.","regression, function approximation, mathematical function, compact model" Self-Supervised Learning of Maximum Manifold Capacity Representations,https://openreview.net/forum?id=sRsceSk_5l0,https://openreview.net/pdf?id=sRsceSk_5l0,We present a novel self-supervised framework that maximizes the number of linearly separable augmentation manifolds.,"Self-supervised Learning (SSL) has recently emerged as a successful strategy for learning useful representations of images without relying on hand-assigned labels. Many such methods aim to learn a function that maps distinct views of the same scene or object to nearby points in the representation space. These methods are often justified by showing that they optimize an objective that is an approximation of (or correlated with) the mutual information between representations of different views. Here, we recast the problem from the perspective of manifold capacity, a recently developed measure for evaluating the quality of a representation. Specifically, we develop a contrastive learning framework that aims to maximize the number of linearly separable object manifolds, yielding a Maximum Manifold Capacity Representation (MMCR). We apply this method to unlabeled images, each augmented by a set of basic transformations, and find that it learns meaningful features using the standard linear evaluation protocol. Specifically, we find that MMCRs support performance on object recognition comparable to recently developed SSL frameworks, while providing more robustness to adversarial attacks. Finally, empirical analysis reveals the means by which compression of object manifolds gives rise to class separability.
","self-supervised learning, representation geometry, neural manifolds, statistical physics of learning, theoretical neuroscience" PMI-guided Masking Strategy to Enable Few-shot Learning for Genomic Applications,https://openreview.net/forum?id=wYZeJ3BsXl6,https://openreview.net/pdf?id=wYZeJ3BsXl6,PMI-masking in MLMs helps to achieve better few-shot classification performance in gene sequence modeling applications.,"Learning effective gene representations is of great research interest. Lately, large-scale language models based on the ""transformer"" architecture, such as DNABert and LOGO, have been proposed to learn gene representations from the Human Reference Genome. Although these large language models outperform previous approaches, currently, no study empirically determined the best strategy for representing gene sequences as tokens. Therefore, the uniform random masking strategy, which is the default during the pretraining of such masked language models, might lead to pretraining inefficiency, resulting in suboptimal downstream task performance in the few-shot setting. However, good few-shot performance is critical, with dataset sizes in (personalized) medicine often not exceeding a couple of hundred data points. In this paper, we develop a novel strategy to adapt ""Pointwise Mutual Information (PMI) masking"" used previously in the NLP setting to the domain of gene sequence modeling. PMI-masking masks spans of tokens that are more likely to co-occur, forming a statistically relevant span. First, we learn a vocabulary of tokens with a high PMI score from our pretraining corpus (the ""Human Reference Genome""). Next, we utilize this side information in the following step and train our model by masking tokens based on PMI scores. In extensive experiments, we evaluate the effectiveness of the PMI-masking strategy on two baseline models of DNABert and LOGO, over three benchmark datasets (two on promoters and one on enhancer), and on a variety of few-shot settings. We observe that our PMI-masking-guided baseline models substantially outperform the baseline models. We further observe that almost all the top-ranked DNA tokens in terms of PMI score are closely associated with existing ""conserved DNA sequence motifs"".","gene sequence modeling, few-shot, MLM masking" TOWARDS AN OBJECTIVE EVALUATION OF THE TRUSTWORTHINESS OF CLASSIFIERS,https://openreview.net/forum?id=IM4Iwo58T4M,https://openreview.net/pdf?id=IM4Iwo58T4M,,"With the widespread deployment of AI models in applications that impact human lives, research on model trustworthiness has become increasingly important, as a result of which model effectiveness alone (measured, e.g., with accuracy, F1, etc.) should not be the only criteria to evaluate predictive models; additionally the trustworthiness of these models should also be factored in. It has been argued that the features deemed important by a black-box model should be aligned with the human perception of the data, which in turn, should contribute to increasing the trustworthiness of a model. Existing research in XAI evaluates such alignments with user studies - the limitations being that these studies are subjective, difficult to reproduce, and consumes a large amount of time to conduct. We propose an evaluation framework, which provides a quantitative measure for trustworthiness of a black-box model, and hence, we are able to provide a fair comparison between a number of different black-box models. 
Our framework is applicable to both text and images, and our experimental results show that a model with higher accuracy does not necessarily exhibit better trustworthiness.","Model trustworthiness, Explainable AI" Fine-grain Inference on Out-of-Distribution Data with Hierarchical Classification,https://openreview.net/forum?id=6s8XPvu7bI8,https://openreview.net/pdf?id=6s8XPvu7bI8,,"Machine learning methods must be trusted to make appropriate decisions in real-world environments, even when faced with out-of-distribution (OOD) samples. Many current approaches simply aim to detect OOD examples and alert the user when an unrecognized input is given. However, when the OOD sample significantly overlaps with the training data, binary anomaly detection is neither interpretable nor explainable, and provides little information to the user. We propose a new model for OOD detection that makes predictions at varying levels of granularity—as the inputs become more ambiguous, the model predictions become coarser and more conservative. Consider an animal classifier that encounters an unknown bird species and a car. Both cases are OOD, but the user gains more information if the classifier recognizes that its uncertainty over the particular species is too large and predicts “bird” instead of detecting it as OOD. Furthermore, we diagnose the classifier’s performance at each level of the hierarchy, improving the explainability and interpretability of the model’s predictions. We demonstrate the effectiveness of hierarchical classifiers for both fine- and coarse-grained OOD tasks.", ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech,https://openreview.net/forum?id=4daKS8wEze5,https://openreview.net/pdf?id=4daKS8wEze5,,"Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples. However, their iterative refinement process in high-dimensional data space results in slow inference speed, which restricts their application in real-time systems. Previous works have explored speeding up by minimizing the number of inference steps but at the cost of sample quality. In this work, to improve the inference speed for DDPM-based TTS models while achieving high sample quality, we propose ResGrad, a lightweight diffusion model which learns to refine the output spectrogram of an existing TTS model (e.g., FastSpeech 2) by predicting the residual between the model output and the corresponding ground-truth speech. ResGrad has several advantages: 1) Compared with other acceleration methods for DDPMs, which need to synthesize speech from scratch, ResGrad reduces the complexity of the task by changing the generation target from the ground-truth mel-spectrogram to the residual, resulting in a more lightweight model and thus a smaller real-time factor. 2) ResGrad is employed in the inference process of the existing TTS model in a plug-and-play way, without re-training this model. We verify ResGrad on the single-speaker dataset LJSpeech and two more challenging datasets with multiple speakers (LibriTTS) and high sampling rate (VCTK). Experimental results show that in comparison with other speed-up methods for DDPMs: 1) ResGrad achieves better sample quality with the same inference speed measured by real-time factor; 2) with similar speech quality, ResGrad synthesizes speech faster than baseline methods by more than 10 times. Audio samples are available at \url{https://resgrad1.github.io/}.
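A minimal sketch of the residual target that ResGrad-style training uses, under standard DDPM conventions; the module and variable names are placeholders, and the real system of course conditions the denoiser on the TTS output as well.

```python
import torch

def residual_training_pair(mel_gt, tts_model, text, t, alpha_bar):
    """Sketch: the diffusion model denoises the *gap* between a frozen TTS
    model's spectrogram and ground truth, not the spectrogram itself."""
    with torch.no_grad():
        mel_tts = tts_model(text)                 # frozen base model, e.g. FastSpeech 2
    residual = mel_gt - mel_tts                   # a far simpler signal than mel_gt
    noise = torch.randn_like(residual)
    a = alpha_bar[t]                              # standard DDPM forward process at step t
    noisy = a.sqrt() * residual + (1 - a).sqrt() * noise
    return noisy, noise                           # the network is trained to predict `noise`

# inference (sketch): mel = mel_tts + sampled_residual, using few denoising steps
```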
", The Adversarial Regulation of the Temporal Difference Loss Costs More Than Expected,https://openreview.net/forum?id=vDybC2brXKh,https://openreview.net/pdf?id=vDybC2brXKh,,"Deep reinforcement learning research has enabled reaching significant performance levels for sequential decision making in MDPs with highly complex observations and state dynamics with the aid of deep neural networks. However, this aid came with a cost that is inherent to deep neural networks which have increased sensitivities towards indistinguishable peculiarly crafted non-robust directions. To alleviate these sensitivities several studies suggested techniques to cope with this problem via explicitly regulating the temporal difference loss for the worst-case sensitivity. In our study, we show that these worst-case regularization techniques come with a cost that intriguingly causes inconsistencies and overestimations in the state-action value functions. Furthermore, our results essentially demonstrate that vanilla trained deep reinforcement learning policies have more accurate and consistent estimates for the state-action values. We believe our results reveal foundational intrinsic properties of the adversarial training techniques and demonstrate the need to rethink the approach to robustness in deep reinforcement learning.", Beyond Link Prediction: On Pre-Training Knowledge Graph Embeddings,https://openreview.net/forum?id=QvIyd7l718,https://openreview.net/pdf?id=QvIyd7l718,,"Knowledge graph embeddings (KGE) models provide low-dimensional representations of the entities and relations in a knowledge graph (KG). Most prior work focused on training and evaluating KGE models for the task of link prediction; the question of whether or not KGE models provide useful representations more generally remains largely open. In this work, we explore the suitability of KGE models (i) for more general graph-structure prediction tasks and (ii) for downstream tasks such as entity classification. For (i), we found that commonly trained KGE models often perform poorly at structural tasks other than link prediction. Based on this observation, we propose a more general multi-task training approach, which includes additional self-supervised tasks such as neighborhood prediction or domain prediction. In our experiments, these multi-task KGE models showed significantly better overall performance for structural prediction tasks. For (ii), we investigate whether KGE models provide useful features for a variety of downstream tasks. Here we view KGE models as a form of self-supervised pre-training and study the impact of both model training and model selection on downstream task performance. We found that multi-task pre-training can (but does not always) significantly improve performance and that KGE models can (but do not always) compete with or even outperform task-specific GNNs trained in a supervised fashion. Our work suggests that more research is needed on how to pre-train KGE models and on their suitability for downstream applications.", SYNC: Efficient Neural Code Search Through Structurally Guided Hard Negative Curricula,https://openreview.net/forum?id=xdVNc95sY1l,https://openreview.net/pdf?id=xdVNc95sY1l,This paper proposes an AST-guided hard negative sampling method for training efficient neural code search models using curriculum learning.,"Efficient code snippet search using natural language queries can be a great productivity tool for developers (beginners and professionals alike). 
Recently, neural code search has become popular: a neural method embeds both the query (NL) and the code snippet (PL) into a common representation space, which is then used to retrieve the PL that best satisfies the intent of the query. Transformer-based pre-trained language models (such as CodeBERT, GraphCodeBERT, and UniXCoder) have been especially effective at learning such representations. These models often make mistakes such as retrieving snippets with incorrect data types and incorrect method names or signatures, even when exposed to the underlying structural information of the code (such as the Abstract Syntax Tree and other static analysis outputs) during pre-training. The generalization ability beyond the training data is also limited (as the code retrieval datasets vary in the ways NL-PL pairs are collected). In this work, we propose a structure-aware hard negative sampling method and a mastering-rate based curriculum learning technique (SYNC) that enhances the pre-trained representation using both soft (random) and synthesized hard negative samples. Our experiments on three state-of-the-art pre-trained language models for programming languages, over four Python code retrieval datasets, show the efficacy of the approach (under both in-distribution and out-of-distribution settings).","Neural Code Search, Curriculum Learning, Hard Negative Mining, Abstract Syntax Tree, Structure-aware Embeddings" Masked Siamese ConvNets: Towards an Effective Masking Strategy for General-purpose Siamese Networks ,https://openreview.net/forum?id=NnHz2rU0Hjp,https://openreview.net/pdf?id=NnHz2rU0Hjp,We propose a masking strategy for siamese networks with ConvNets.,"Siamese Networks are a popular self-supervised learning framework that learns useful representations without human supervision by encouraging representations to be invariant to distortions. Existing methods heavily rely on hand-crafted augmentations, which are not easily adapted to new domains. To explore a general-purpose or domain-agnostic siamese network, we investigate using masking as augmentations in siamese networks. Recently, masking for siamese networks has only been shown useful with transformer architectures, e.g. MSN and data2vec. In this work, we identify the underlying problems of masking for siamese networks with arbitrary backbones, including ConvNets. We propose an effective and general-purpose masking strategy and demonstrate its effectiveness on various siamese network frameworks. Our method generally improves siamese networks' performance on few-shot image classification and object detection tasks.","self-supervised learning, siamese networks, masking, convNets" Canary in a Coalmine: Better Membership Inference with Ensembled Adversarial Queries,https://openreview.net/forum?id=b7SBTEBFnC,https://openreview.net/pdf?id=b7SBTEBFnC,,"As industrial applications are increasingly automated by machine learning models, enforcing personal data ownership and intellectual property rights requires tracing training data back to their rightful owners. Membership inference algorithms approach this problem by using statistical techniques to discern whether a target sample was included in a model's training set. However, existing methods only utilize the unaltered target sample or simple augmentations of the target to compute statistics. Such a sparse sampling of the model's behavior carries little information, leading to poor inference capabilities.
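To make the SYNC-style training signal concrete, here is a hedged sketch of a contrastive retrieval loss that mixes in-batch (soft) negatives with synthesized hard negatives, gated by a mastering rate; the gating scheme is our simplification, not the paper's exact curriculum.

```python
import math
import torch
import torch.nn.functional as F

def curriculum_contrastive_loss(q_emb, code_emb, hard_neg_emb, mastery, tau=0.05):
    """q_emb, code_emb: (B, D) L2-normalized query/code embeddings (diagonal = true pairs).
    hard_neg_emb: (B, D) embeddings of structure-perturbed (e.g. AST-edited) negatives.
    mastery in (0, 1]: hard negatives are weighted in as the model improves."""
    sims = q_emb @ code_emb.t() / tau                           # in-batch soft negatives
    hard = (q_emb * hard_neg_emb).sum(-1, keepdim=True) / tau   # one hard negative per query
    hard = hard + math.log(max(mastery, 1e-6))                  # curriculum down-weighting
    logits = torch.cat([sims, hard], dim=1)
    labels = torch.arange(q_emb.size(0))
    return F.cross_entropy(logits, labels)
```

Early in training the hard-negative column is heavily discounted; as mastery grows it approaches full weight, matching the intuition of a mastering-rate based curriculum.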
In this work, we use adversarial tools to directly optimize for queries that are discriminative and diverse. Our improvements achieve significantly more accurate membership inference than existing methods, especially in offline scenarios and in the low false-positive regime, which is critical in legal settings.", Maximum Entropy Information Bottleneck for Confidence-aware Stochastic Embedding,https://openreview.net/forum?id=UbH1jxLIPhb,https://openreview.net/pdf?id=UbH1jxLIPhb,We use the maximum entropy objective to better learn stochastic embedding.,"Stochastic embedding has several advantages over deterministic embedding, such as the capability of associating uncertainty with the resulting embedding and robustness to noisy data. This is especially useful when the input data has ambiguity (e.g., blurriness or corruption), which often occurs in in-the-wild settings. Many existing methods for stochastic embedding are limited by the assumption that the embedding follows a standard normal distribution under the variational information bottleneck principle. We present a different variational approach to stochastic embedding in which maximum entropy acts as the bottleneck, which we call ""Maximum Entropy Information Bottleneck"" or MEIB. We show that models trained with the MEIB objective outperform existing methods in terms of regularization, perturbation robustness, probabilistic contrastive learning, and risk-controlled recognition performance.","Deep learning, Computer vision, Stochastic embedding" Reprogramming Large Pretrained Language Models for Antibody Sequence Infilling,https://openreview.net/forum?id=axFCgjTKP45,https://openreview.net/pdf?id=axFCgjTKP45,We use large pretrained language models and reprogram them for the new task of protein infilling,"Antibodies comprise the most versatile class of binding molecules, with numerous applications in biomedicine. Therapeutic antibody development requires designing novel and diverse sequences with improved properties, while maintaining structural consistency. Computational design of antibodies involves unusual challenges relative to designing other classes of proteins, as antibodies comprise multiple long, variable, and unstructured loops at the complementarity-determining region (CDR) that determine the antigen binding affinity and specificity of an antibody. Recently, deep language models and graph neural networks have shown impressive success in antibody sequence generation. Since only a limited number of antibody structures are known, training a model using this limited data can lead to degraded performance, particularly a lack of diversity in the generated samples. To address such issues, we leverage the method of Model Reprogramming (MR) here, which focuses on repurposing pretrained machine learning models for target domain tasks with scarce data, where it may be difficult to train a high-performing model from scratch. Prior works in MR have primarily focused on classification-based tasks. We extend the capabilities of reprogramming beyond classification tasks towards the more complex problem of antibody sequence generation. Specifically, we introduce Reprogramming for Protein Sequence Infilling, a framework in which pretrained natural language models are repurposed for protein sequence infilling via reprogramming, to infill protein sequence templates as a method of novel protein generation.
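As an aside on the MEIB objective above: since a diagonal-Gaussian embedding has a closed-form differential entropy, the maximum-entropy bottleneck can be written in a few lines. The loss shape below is a minimal sketch under that Gaussian assumption, with our own variable names.

```python
import math
import torch
import torch.nn.functional as F

def meib_style_loss(log_var, logits, labels, beta=0.01):
    """Sketch: instead of a KL term to a fixed N(0, I) prior (as in VIB), the
    stochastic embedding z ~ N(mu, diag(exp(log_var))) is regularized by
    *maximizing* its differential entropy H = 0.5 * sum(log(2*pi*e*sigma^2))."""
    task = F.cross_entropy(logits, labels)
    entropy = 0.5 * (log_var + math.log(2 * math.pi * math.e)).sum(dim=-1)
    return task - beta * entropy.mean()          # minimize task loss, maximize entropy
```

Intuitively, the entropy bonus keeps the per-input variances from collapsing, so the learned uncertainty can track input ambiguity.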
For variable CDR sequence design, we formulate the task as text infilling that uses the constant region of an antibody as the sequence template. Results on antibody design benchmarks show that our model, reprogrammed on a low-resource antibody sequence dataset, provides highly diverse CDR sequences, with up to a more than two-fold increase in diversity over the baselines, without losing structural integrity or naturalness. The performance benefit of the reprogrammed model, which learns only from antibody sequences, is more evident for longer CDR design or for multiple loop infilling at once, compared to existing graph-based models that require additional structural information. The generated sequences also demonstrate enhanced antigen binding specificity or virus neutralization ability.","Antibody, Protein, CDR, Infilling" Optimal Scalarizations for Provable Multiobjective Optimization,https://openreview.net/forum?id=M8rwWdaGa6x,https://openreview.net/pdf?id=M8rwWdaGa6x,Don't linearly combine your objectives: Hypervolume scalarizations provide provable and more optimal multiobjective optimization.,"Linear scalarization is a simple and widely-used technique that can be deployed in any multiobjective setting to combine diverse objectives into one reward function, but such heuristics are not theoretically understood. To that end, we perform a case study of the multiobjective stochastic linear bandits framework with $k$ objectives, and our goal is to provably scalarize and explore a diverse set of optimal actions on the Pareto frontier, as measured by the dominated hypervolume. Even in this elementary convex setting, the choice of scalarizations and weight distribution surprisingly affects performance, and the natural use of linear scalarization with uniform weights is suboptimal due to a non-uniform Pareto curvature. Instead, we suggest using the theoretically-inspired hypervolume scalarizations with non-adaptive uniform weights, showing that they come with novel hypervolume regret bounds of $\tilde{O}( d T^{-1/2} + T^{-1/k})$, with optimal matching lower bounds of $\Omega(T^{-1/k})$. We support our theory with strong empirical performance of the hypervolume scalarization that consistently outperforms both the linear and Chebyshev scalarizations in high dimensions.","multiobjective optimization, scalarization, linear bandits" Using semantic distance for diverse and sample efficient genetic programming,https://openreview.net/forum?id=DwOaHJJKy9,https://openreview.net/pdf?id=DwOaHJJKy9,"We show the importance of diversity in semantic (phenotypic) space when mutating genetic programs, and apply it to learning ML components.","Evolutionary methods, such as genetic programming, search a space of programs to find those with good fitness, often using mutations that manipulate the syntactic structure of programs without being aware of how they affect the semantics. For applications where the semantics are highly sensitive to small syntactic mutations, or where fitness evaluation is expensive, this can make learning programs intractable. We introduce a mutation operator that yields mutated programs that are semantically far from previously evaluated programs, while still being semantically close to their parent. For function regression, this leads to an algorithm that is one to two orders of magnitude more sample efficient than other gradient-free methods, such as genetic programming, or learning the weights of a neural network using evolutionary strategies.
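A minimal sketch of such a semantics-aware mutation operator, under our own assumptions: `behavior(p)` maps a program to its outputs on a fixed set of probe inputs (its semantics), and candidates are filtered by distance to the parent while maximizing distance to an archive of previously evaluated programs.

```python
import numpy as np

def semantic_mutation(parent, mutate, behavior, archive, n_trials=32, d_parent_max=1.0):
    """Propose syntactic mutations; keep the candidate that stays semantically
    close to its parent yet is far from everything evaluated so far."""
    b_parent = behavior(parent)
    best, best_novelty = None, -np.inf
    for _ in range(n_trials):
        child = mutate(parent)                     # any syntactic mutation operator
        b = behavior(child)
        if np.linalg.norm(b - b_parent) > d_parent_max:
            continue                               # drifted too far from the parent
        novelty = min(np.linalg.norm(b - a) for a in archive) if archive else np.inf
        if novelty > best_novelty:                 # far from previously seen semantics
            best, best_novelty = child, novelty
    return best                                    # may be None if all trials rejected
```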
We show how this method can be applied to learning architecture-specific and general purpose neural network optimizers, and to reinforcement learning loss functions. The learnt components are simple, interpretable, and high-performing, and contain novel features not seen before, such as weight growth.","genetic programming, meta learning" Semi-parametric Prompt-Generation for Model Editing,https://openreview.net/forum?id=bjCAHZLoKy,https://openreview.net/pdf?id=bjCAHZLoKy,An efficient prefix generation method to efficiently update knowledge of large language models.,"Large language models are used in various downstream tasks with great success. However, changing specific knowledge or beliefs of a model (a.k.a. model editing) efficiently to revise inaccurate predictions while not affecting all other cases is still challenging. Most previous methods compute gradients to change the model. These strategies generally work, but at the cost of high compute and memory complexity. The semi-parametric strategy has recently shown its effectiveness in alleviating this complexity via introducing a memory to store knowledge edits. However, the memory lacks a proper mechanism to be utilized by a large pre-trained language model, limiting its generalizability to more complicated model editing scenarios. This work proposes a prompt generation mechanism to bridge the gap. Our method encodes the edits as prefix prompts for language models, then has the large pre-trained language model perform inference with the prompts. In other words, the model is edited by prompts without changing model parameters. Our method, SEPROG, significantly outperforms state-of-the-art methods by up to 20% on entailed edit benchmarks and provides up to 30% better performance over gradient-based methods on non-entailed benchmarks. These advantages are achieved with much less computation and memory consumption, proving prompt generation's great potential in model editing problems.","Model Editing, Prefix Tuning, Fine-tuning" Fast Bayesian Updates for Deep Learning with a Use Case in Active Learning,https://openreview.net/forum?id=7MthJsb-nm,https://openreview.net/pdf?id=7MthJsb-nm,,"Retraining deep neural networks when new data arrives is typically computationally expensive. Moreover, certain applications do not allow such costly retraining due to time or computational constraints. Fast Bayesian updates are a possible solution to this issue. Therefore, we propose a Bayesian update based on Monte-Carlo samples and a last-layer Laplace approximation for different Bayesian neural network types, i.e., Dropout, Ensemble, and Spectral Normalized Neural Gaussian Process (SNGP). In a large-scale evaluation study, we show that our updates combined with SNGP represent a fast and competitive alternative to costly retraining. As a use case, we combine the Bayesian updates for SNGP with different sequential query strategies to demonstrate their improved selection performance in active learning.","Bayesian Neural Networks, Deep Learning, Active Learning" Improved Learning-augmented Algorithms for k-means and k-medians Clustering,https://openreview.net/forum?id=dCSFiAl_VO3,https://openreview.net/pdf?id=dCSFiAl_VO3,,"We consider the problem of clustering in the learning-augmented setting. We are given a data set in $d$-dimensional Euclidean space, and a label for each data point given by a predictor indicating what subsets of points should be clustered together.
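As an aside on the prefix-prompt editing idea above: a hedged sketch of how an edit, encoded as prefix embeddings by a small generator, could condition a frozen language model without any weight update. The `prefix_generator` and edit encoding are hypothetical components; the call pattern follows the Hugging Face-style `inputs_embeds` interface.

```python
import torch

def edited_forward(lm, prefix_generator, edit_emb, input_ids):
    """Sketch: a small generator maps a stored edit to prefix embeddings;
    the frozen LM conditions on them, so the 'edit' lives in the prompt,
    not in the model parameters."""
    prefix = prefix_generator(edit_emb)                   # (n_prefix, d_model)
    tok = lm.get_input_embeddings()(input_ids)            # (B, T, d_model)
    prefix = prefix.unsqueeze(0).expand(tok.size(0), -1, -1)
    inputs_embeds = torch.cat([prefix, tok], dim=1)       # prepend the edit prefix
    return lm(inputs_embeds=inputs_embeds)
```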
This setting captures situations where we have access to some auxiliary information about the data set relevant to our clustering objective, for instance, the labels output by a neural network. Following prior work, we assume that each predicted cluster contains at most an $\alpha$ fraction of false positives and false negatives, for some $\alpha \in (0,c)$ with $c<1$; in the absence of these errors, the labels would attain the optimal clustering cost $\mathrm{OPT}$. For a dataset of size $m$, we propose a deterministic $k$-means algorithm that produces centers with an improved bound on the clustering cost compared to the previous randomized state-of-the-art algorithm, while preserving the $O( d m \log m)$ runtime. Furthermore, our algorithm works even when the predictions are not very accurate, i.e., our cost bound holds for $\alpha$ up to $1/2$, an improvement over $\alpha$ being at most $1/7$ in previous work. For the $k$-medians problem we again improve upon prior work by achieving a biquadratic improvement in the dependence of the approximation factor on the accuracy parameter $\alpha$, obtaining a cost of $(1+O(\alpha))\mathrm{OPT}$ while requiring essentially just $O(md \log^3 m/\alpha)$ runtime.","clustering, learning-augmented algorithms" A Subspace Correction Method for ReLU Neural Networks for Solving PDEs,https://openreview.net/forum?id=cTdzZI83Iud,https://openreview.net/pdf?id=cTdzZI83Iud,We propose a novel algorithm called Neuron-wise Parallel Subspace Correction Method for training ReLU neural networks for solving partial differential equations.,"In this paper, we propose a novel algorithm called Neuron-wise Parallel Subspace Correction Method (NPSC) for training ReLU neural networks for the numerical solution of partial differential equations (PDEs). Despite extensive research activity in applying neural networks to numerical PDEs, there is still a serious lack of training algorithms that can be used to obtain approximations with adequate accuracy. Based on recent results on the spectral properties of linear layers and landscape analysis for single neuron problems, we develop a special type of subspace correction method that deals with the linear layer and each neuron in the nonlinear layer separately. An optimal preconditioner that resolves the ill-conditioning of the linear layer is presented, so that the linear layer is trained in a uniform number of iterations with respect to the number of neurons. In each single neuron problem, a good local minimum is found by a superlinearly convergent algorithm, avoiding regions where the loss function is flat. Performance of the proposed method is demonstrated through numerical experiments for function approximation problems and PDEs.","ReLU neural network, Subspace correction method, Training algorithm, Function approximation, Partial differential equation" Neural Implicit Shape Editing using Boundary Sensitivity,https://openreview.net/forum?id=CMPIBjmhpo,https://openreview.net/pdf?id=CMPIBjmhpo,,"Neural fields are receiving increased attention as a geometric representation due to their ability to compactly store detailed and smooth shapes and easily undergo topological changes. Compared to classic geometry representations, however, neural representations do not allow the user to exert intuitive control over the shape. Motivated by this, we leverage \emph{boundary sensitivity} to express how perturbations in parameters move the shape boundary. This allows us to interpret the effect of each learnable parameter and to study achievable deformations.
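To illustrate why noisy predicted labels can still yield good centers in the learning-augmented clustering setting above, here is a hedged sketch using a coordinate-wise trimmed mean per predicted cluster; this is an illustrative robust aggregation, not the paper's exact deterministic procedure.

```python
import numpy as np

def robust_centers(X, pred_labels, k, alpha):
    """With up to an alpha fraction of mislabelled points per predicted cluster,
    a coordinate-wise trimmed mean is far less sensitive to the errors than
    the plain mean of the cluster."""
    centers = np.zeros((k, X.shape[1]))
    for c in range(k):
        pts = X[pred_labels == c]
        lo, hi = np.quantile(pts, [alpha, 1 - alpha], axis=0)   # per-coordinate cutoffs
        for j in range(X.shape[1]):
            col = pts[:, j]
            kept = col[(col >= lo[j]) & (col <= hi[j])]          # drop extreme alpha tails
            centers[c, j] = kept.mean()
    return centers
```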
With this, we perform \emph{geometric editing}: finding a parameter update which best approximates a globally prescribed deformation. Prescribing the deformation only locally allows deforming the rest of the shape according to some prior, such as \emph{semantics} or \emph{deformation} rigidity. Unlike previous efforts, our method is model-agnostic: it can be applied to a pre-trained NN and updates it in place. Furthermore, we show how boundary sensitivity helps optimize and constrain objectives (such as surface area and volume), which are difficult to compute without first converting to another representation, such as a mesh.", Amortised Invariance Learning for Contrastive Self-Supervision,https://openreview.net/forum?id=nXOhmfFu5n,https://openreview.net/pdf?id=nXOhmfFu5n,,"Contrastive self-supervised learning methods famously produce high quality transferable representations by learning invariances to different data augmentations. Invariances established during pre-training can be interpreted as strong inductive biases. However, these may or may not be helpful, depending on whether they match the invariance requirements of downstream tasks. This has led to several attempts to learn task-specific invariances during pre-training; however, these methods are highly compute-intensive and tedious to train. We introduce the notion of amortized invariance learning for contrastive self-supervision. In the pre-training stage, we parameterize the feature extractor by differentiable invariance hyper-parameters that control the invariances encoded by the representation. Then, for any downstream task, both the linear readout and the task-specific invariance requirements can be efficiently and effectively learned by gradient descent. We evaluate the notion of amortized invariances for contrastive learning over two different modalities: vision and audio, on two widely-used contrastive learning methods in vision (SimCLR and MoCo-v2, with popular architectures like ResNets and Vision Transformers) and on SimCLR with ResNet-18 for audio. We show that our amortized features provide a reliable way to learn diverse downstream tasks with different invariance requirements, while using a single feature and avoiding task-specific pre-training. This provides an exciting perspective that opens up new horizons in the field of general purpose representation learning.", Direct-Effect Risk Minimization,https://openreview.net/forum?id=fkmO7EFaNT,https://openreview.net/pdf?id=fkmO7EFaNT,"We develop a novel domain generalization algorithm for correlation shift based on direct causal effects, which achieves good results in our experiments on 5 correlation-shifted datasets and the DomainBed benchmark.","We study the problem of out-of-distribution (o.o.d.) generalization where spurious correlations of attributes vary across training and test domains. This is known as the problem of correlation shift and has raised concerns about the reliability of machine learning. In this work, we introduce the concepts of direct and indirect effects from causal inference to the domain generalization problem. Under mild conditions, we show that models that learn direct effects provably minimize the worst-case risk across correlation-shifted domains.
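A loose sketch of the downstream stage of amortized invariance learning as described above: only a handful of invariance knobs and a linear head are fit by gradient descent, while the invariance-conditioned backbone stays frozen. The backbone here is a placeholder standing in for the pre-trained extractor.

```python
import torch
from torch.nn import functional as F

n_aug, feat_dim, n_classes = 5, 128, 10
# Placeholder for the frozen, invariance-conditioned feature extractor;
# the `0 * inv.sum()` term just keeps the toy graph connected.
backbone = lambda x, inv: torch.randn(x.size(0), feat_dim) + 0 * inv.sum()

inv = torch.nn.Parameter(torch.zeros(n_aug))     # one differentiable knob per augmentation
head = torch.nn.Linear(feat_dim, n_classes)
opt = torch.optim.Adam([inv, *head.parameters()], lr=1e-3)

x, y = torch.randn(32, 3, 32, 32), torch.randint(0, n_classes, (32,))
feats = backbone(x, torch.sigmoid(inv))          # invariance strengths in (0, 1)
loss = F.cross_entropy(head(feats), y)
opt.zero_grad(); loss.backward(); opt.step()     # fits only `inv` and the readout
```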
To eliminate the indirect effects, our algorithm consists of two stages: in the first stage, we learn an indirect-effect representation by minimizing the prediction error of domain labels using the representation and the class label; in the second stage, we remove the indirect effects learned in the first stage by matching each data point with another that has a similar indirect-effect representation but a different class label. Experiments on 5 correlation-shifted datasets and the DomainBed benchmark verify the effectiveness of our approach.","Domain Generalization, Causal Inference, Direct and Indirect Effects" DIFFUSION GENERATIVE MODELS ON SO(3),https://openreview.net/forum?id=jHA-yCyBGb,https://openreview.net/pdf?id=jHA-yCyBGb,"In this work, we establish a framework for score-based denoising diffusion generative models on the SO(3) manifold.","Diffusion-based generative models represent the current state-of-the-art for image generation. However, standard diffusion models are based on Euclidean geometry and do not translate directly to manifold-valued data. In this work, we develop extensions of both score-based generative models (SGMs) and Denoising Diffusion Probabilistic Models (DDPMs) to the Lie group of 3D rotations, SO(3). SO(3) is of particular interest in many disciplines such as robotics, biochemistry and astronomy/planetary science. In contrast to more general Riemannian manifolds, SO(3) admits a tractable solution to heat diffusion, allowing us to implement efficient training of diffusion models. We apply both SO(3) DDPMs and SGMs to synthetic densities on SO(3) and demonstrate state-of-the-art results.","Deep Generative Models, Manifold Learning, SO(3), Denoising Diffusion, Score-based models" Certifiably Robust Transformers with 1-Lipschitz Self-Attention,https://openreview.net/forum?id=hzG72qB0XQ,https://openreview.net/pdf?id=hzG72qB0XQ,We propose a 1-Lipschitz Transformer which allows us to achieve superior certified robustness than existing transformer architectures.,"Recent works have shown that neural networks with Lipschitz constraints achieve high adversarial robustness. In this work, we propose the first One-Lipschitz Self-Attention (OLSA) mechanism for Transformer models. In particular, we first orthogonalize all the linear operations in the self-attention mechanism. We then bound the overall Lipschitz constant by aggregating the Lipschitz constants of the softmax elements via a weighted sum. Based on the proposed self-attention mechanism, we construct an OLSA Transformer to achieve deterministic certified robustness. We evaluate our model on multiple natural language processing (NLP) tasks and show that it outperforms existing certification on Transformers, especially for models with multiple layers. As an example, for 3-layer Transformers we achieve an ℓ2 deterministic certified robustness radius of 1.733 and 0.979 on the word embedding space for the Yelp and SST datasets, while the existing SOTA certification baseline on the same embedding space only achieves 0.061 and 0.110. In addition, our certification is significantly more efficient than previous works, since we only need the output logits and Lipschitz constant for certification. We also fine-tune our OLSA Transformer as a downstream classifier of a pre-trained BERT model and show that it achieves significantly higher certified robustness on the BERT embedding space compared with previous works (e.g.,
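The stage-2 matching step described above is simple to picture; here is a minimal sketch of pairing each example with its nearest neighbour in indirect-effect space that carries a different class label (function and variable names are ours).

```python
import torch

def match_pairs(indirect_repr, labels):
    """Pair each example with its nearest neighbour in indirect-effect space
    that has a *different* class label, so contrasting the pair cancels the
    indirect (spurious) effect while preserving the label signal."""
    d = torch.cdist(indirect_repr, indirect_repr)          # (N, N) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    d = d.masked_fill(same, float("inf"))                  # forbid same-label matches
    return d.argmin(dim=1)                                 # index of each example's match
```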
from 0.071 to 0.368 on the QQP dataset).","Certified robustness, Transformers" Revisiting Populations in multi-agent Communication,https://openreview.net/forum?id=n-UHRIdPju,https://openreview.net/pdf?id=n-UHRIdPju,,"Despite evidence from cognitive sciences that larger groups of speakers tend to develop more structured languages in human communication, scaling up to populations has failed to yield significant benefits in emergent multi-agent communication. In this paper, we advocate for an alternate population-level training paradigm for referential games based on the idea of ""partitioning"" the agents into sender-receiver pairs and limiting co-adaptation across pairs. We show that this results in optimizing a different objective at the population level, where agents maximize (1) their respective ""internal"" communication accuracy and (2) some measure of alignment between agents. In experiments, we find that this leads to the emergence of languages that are significantly more compositional. Moreover, when agents are trained in populations that are not fully connected (i.e., not all agent pairs interact at training time), this approach reduces multi-linguality and improves zero-shot communication with new agents (i.e., agents are able to communicate successfully with other agents outside their training partners).",emergent communication Univariate vs Multivariate Time Series Forecasting with Transformers,https://openreview.net/forum?id=GpW327gxLTF,https://openreview.net/pdf?id=GpW327gxLTF,We achieve SOTA results via a simple method of producing multivariate forecasts in a univariate manner which points to flaws in current architectures.,"Multivariate time series forecasting is a challenging problem, and a number of Transformer-based long-term time series forecasting models have been developed to tackle it. These models, however, are impeded by the additional information available in multivariate forecasting. In this paper, we propose a simple univariate setting as an alternative method for producing multivariate forecasts. The univariate model is trained on each individual dimension of the time series. This single model is then used to forecast each dimension of the multivariate series in turn. A comparative study shows that our setting outperforms state-of-the-art Transformers in the multivariate setting on benchmark datasets. To investigate why, we formulate three hypotheses and verify them via an empirical study, which leads to a criterion for when our univariate setting is likely to lead to better performance and reveals flaws in the current multivariate Transformers for long-term time series forecasting.","forecasting, time series, transformers, univariate, multivariate" Semantic Transformation-based Data Augmentation for Few-Shot Learning,https://openreview.net/forum?id=xmXsZBRTzrI,https://openreview.net/pdf?id=xmXsZBRTzrI,We propose a semantic transformation based data augmentation approach by transferring samples from base dataset to the novel tasks in an encoder-decoder paradigm to alleviate the data scarce problem.,"Few-shot learning (FSL), a data-scarce setting, aims to recognize instances of unseen classes based solely on very few examples. However, the model can easily overfit due to the biased distribution formed by extremely limited training data. This paper presents a task-specific data augmentation approach that transfers samples from the base dataset to the novel tasks in an encoder-decoder paradigm, which guarantees the generation of semantically meaningful features.
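The univariate recipe from the forecasting paper above is short enough to write out in full; a minimal sketch, assuming a trained univariate model that maps a (batch, time, 1) history to a (batch, horizon, 1) forecast:

```python
import torch

def multivariate_forecast(univariate_model, series):
    """One shared univariate model, applied to each dimension of the series
    in turn; `series` is (batch, time, channels)."""
    preds = [univariate_model(series[..., c:c + 1]) for c in range(series.size(-1))]
    return torch.cat(preds, dim=-1)               # (batch, horizon, channels)
```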
Specifically, the feature transfer process is carried out in semantic space. We further impose a compactness constraint on the generated features, with the prototypes serving as reference points, which ensures that the generated features are distributed around the class centers. Moreover, we incorporate the cluster centers of the query set with the prototypes of the support set to reduce the bias of the class centers. With the supervision of the compactness loss, the model is encouraged to generate discriminative features with high inter-class dispersion and intra-class compactness. Extensive experiments show that our method outperforms the state of the art on four benchmarks, namely MiniImageNet, TieredImageNet, CUB and CIFAR-FS. ","few-shot learning, data augmentation, semantic feature transformation, sample bias" Sequential Gradient Coding For Straggler Mitigation,https://openreview.net/forum?id=-lGvSmht7a,https://openreview.net/pdf?id=-lGvSmht7a,We propose to improve gradient coding by exploiting the temporal dimension while training deep learning models in distributed cloud systems.,"In distributed computing, slower nodes (stragglers) usually become a bottleneck. Gradient Coding (GC), introduced by Tandon et al., is an efficient technique that uses principles of error-correcting codes to distribute gradient computation in the presence of stragglers. In this paper, we consider the distributed computation of a sequence of gradients $\{g(1),g(2),\ldots,g(J)\}$, where processing of each gradient $g(t)$ starts in round-$t$ and finishes by round-$(t+T)$. Here $T\geq 0$ denotes a delay parameter. For the GC scheme, coding is only across computing nodes, and this results in a solution where $T=0$. On the other hand, having $T>0$ allows for designing schemes which exploit the temporal dimension as well. In this work, we propose two schemes that demonstrate improved performance compared to GC. Our first scheme combines GC with selective repetition of previously unfinished tasks and achieves improved straggler mitigation. In our second scheme, which constitutes our main contribution, we apply GC to a subset of the tasks and repetition for the remainder of the tasks. We then multiplex these two classes of tasks across workers and rounds in an adaptive manner, based on past straggler patterns. Using theoretical analysis, we demonstrate that our second scheme achieves significant reduction in the computational load. In our experiments, we study a practical setting of concurrently training multiple neural networks over an AWS Lambda cluster involving 256 worker nodes, where our framework naturally applies. We demonstrate that the latter scheme can yield a 16\% improvement in runtime over the baseline GC scheme, in the presence of naturally occurring, non-simulated stragglers. ","gradient coding, straggler mitigation, distributed computation, coded computing" TTN: A Domain-Shift Aware Batch Normalization in Test-Time Adaptation,https://openreview.net/forum?id=EQfeudmWLQ,https://openreview.net/pdf?id=EQfeudmWLQ,"We propose a test-time batch normalization method that interpolates source and current batch statistics according to each layer's domain-shift sensitivity, showing robust performance over various realistic evaluation scenarios.","This paper proposes a novel batch normalization strategy for test-time adaptation.
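As an aside on the compactness constraint described above for few-shot feature generation: a minimal sketch of pulling generated features toward their class prototypes (names are ours, and the real loss also incorporates query-set cluster centers).

```python
import torch

def compactness_loss(gen_feats, labels, prototypes):
    """gen_feats: (N, D) generated features; prototypes: (C, D) class centers.
    Generated features are pulled toward their class prototypes, encouraging
    intra-class compactness of the augmented set."""
    centers = prototypes[labels]                      # (N, D) prototype per sample
    return ((gen_feats - centers) ** 2).sum(dim=-1).mean()
```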
Recent test-time adaptation methods heavily rely on a modified batch normalization, i.e., transductive batch normalization (TBN), which calculates the mean and the variance from the current test batch rather than using the running mean and variance obtained from the source data, i.e., conventional batch normalization (CBN). Adopting TBN, which employs test-batch statistics, mitigates the performance degradation caused by domain shift. However, re-estimating normalization statistics from test data depends on the impractical assumptions that a test batch is large enough and drawn from an i.i.d. stream, and we observe that previous methods with TBN show a critical performance drop when these assumptions do not hold. In this paper, we identify that CBN and TBN are in a trade-off relationship and present a new test-time normalization (TTN) method that interpolates the statistics by adjusting the importance between CBN and TBN according to the domain-shift sensitivity of each BN layer. Our proposed TTN improves model robustness to shifted domains across a wide range of batch sizes and in various realistic evaluation scenarios. TTN is widely applicable to other test-time adaptation methods that rely on updating model parameters via backpropagation. We demonstrate that adopting TTN further improves their performance and achieves state-of-the-art performance in various standard benchmarks.","Test time adaptation, Domain adaptation, Batch Normalization" Choreographer: Learning and Adapting Skills in Imagination,https://openreview.net/forum?id=PhkWyijGi5b,https://openreview.net/pdf?id=PhkWyijGi5b,"Choreographer: a model-based agent that discovers and learns unsupervised skills in latent imagination, and it's able to efficiently coordinate and adapt the skills to solve downstream tasks.","Unsupervised skill learning aims to learn a rich repertoire of behaviors without external supervision, providing artificial agents with the ability to control and influence the environment. However, without appropriate knowledge and exploration, skills may provide control only over a restricted area of the environment, limiting their applicability. Furthermore, it is unclear how to leverage the learned skill behaviors for adapting to downstream tasks in a data-efficient manner. We present Choreographer, a model-based agent that exploits its world model to learn and adapt skills in imagination. Our method decouples the exploration and skill learning processes, being able to discover skills in the latent state space of the model. During adaptation, the agent uses a meta-controller to evaluate and adapt the learned skills efficiently by deploying them in parallel in imagination. Choreographer is able to learn skills both from offline data and by collecting data simultaneously with an exploration policy. The skills can be used to effectively adapt to downstream tasks, as we show in the URL benchmark, where we outperform previous approaches from both pixel and state inputs. The skills also explore the environment thoroughly, finding sparse rewards more frequently, as shown in goal-reaching tasks from the DMC Suite and Meta-World.
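The CBN/TBN interpolation at the heart of TTN is compact enough to sketch; a minimal illustration for 2D BatchNorm, assuming a per-layer weight `alpha` (how it is estimated from domain-shift sensitivity is the paper's contribution and is not shown here).

```python
import torch

class InterpolatedBN(torch.nn.Module):
    """Sketch: mix source (running) statistics with current test-batch
    statistics; alpha -> 1 recovers CBN, alpha -> 0 recovers TBN."""
    def __init__(self, bn, alpha):
        super().__init__()
        self.bn, self.alpha = bn, alpha               # bn: trained, affine nn.BatchNorm2d

    def forward(self, x):
        mu_b = x.mean(dim=(0, 2, 3))                  # test-batch statistics (TBN side)
        var_b = x.var(dim=(0, 2, 3), unbiased=False)
        mu = self.alpha * self.bn.running_mean + (1 - self.alpha) * mu_b
        var = self.alpha * self.bn.running_var + (1 - self.alpha) * var_b
        x_hat = (x - mu[None, :, None, None]) / (var[None, :, None, None] + self.bn.eps).sqrt()
        return self.bn.weight[None, :, None, None] * x_hat + self.bn.bias[None, :, None, None]
```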
Project website: https://doubleblind-repos.github.io/","unsupervised reinforcement learning, skill learning, world models" Disentanglement of Correlated Factors via Hausdorff Factorized Support,https://openreview.net/forum?id=OKcJhpQiGiX,https://openreview.net/pdf?id=OKcJhpQiGiX,We develop a method that allows for disentangled representation learning not only under the assumption of independent factors of variation but instead fundamentally allows for much more realistic correlations during training.,"A grand goal in deep learning research is to learn representations capable of generalizing across distribution shifts. Disentanglement is one promising direction aimed at aligning a model’s representations with the underlying factors generating the data (e.g. color or background). Existing disentanglement methods, however, rely on an often unrealistic assumption: that factors are statistically independent. In reality, factors (like object color and shape) are correlated. To address this limitation, we propose a relaxed disentanglement criterion – the Hausdorff Factorized Support (HFS) criterion – that encourages a factorized support, rather than a factorial distribution, by minimizing a Hausdorff distance. This allows for arbitrary distributions of the factors over their support, including correlations between them. We show that the use of HFS consistently facilitates disentanglement and recovery of ground-truth factors across a variety of correlation settings and benchmarks, even under severe training correlations and correlation shifts, with relative improvements of over 60% compared to existing disentanglement methods in some settings. In addition, we find that leveraging HFS for representation learning can even facilitate transfer to downstream tasks such as classification under distribution shifts. We hope our original approach and positive empirical results inspire further progress on the open problem of robust generalization.","disentanglement, representation learning, generalization" TimeSeAD: Benchmarking Deep Time-Series Anomaly Detection,https://openreview.net/forum?id=UiNhIyGi1MT,https://openreview.net/pdf?id=UiNhIyGi1MT,"We analyze multivariate time-series datasets, introduce an evaluation metric for time series, and evaluate numerous deep anomaly detection methods.","Developing new methods for detecting anomalies in time series is of great practical significance, but progress is hindered by the difficulty of assessing the benefit of new methods, for the following reasons. (1) Public benchmarks are flawed (e.g., due to questionable anomaly labels), (2) there is no widely accepted standard evaluation metric, and (3) evaluation protocols are mostly inconsistent. In this work, we address all three issues: (1) We critically analyze several of the most widely-used multivariate datasets, identify a number of significant issues, and select the best candidates for evaluation. (2) We introduce a new evaluation metric for time-series anomaly detection, which—in contrast to previous metrics—is recall-consistent and takes temporal correlations into account. (3) We analyze and overhaul existing evaluation protocols and provide the largest benchmark of deep multivariate time-series anomaly detection methods to date. We focus on deep-learning-based methods and multivariate data, a common setting in modern anomaly detection. 
We provide all implementations and analysis tools in a new comprehensive library for Time Series Anomaly Detection, called TimeSeAD.","anomaly detection, multivariate time series, deep learning, benchmark, evaluation metrics" On the optimization and generalization of overparameterized implicit neural networks,https://openreview.net/forum?id=oN7tNztrYa3,https://openreview.net/pdf?id=oN7tNztrYa3,This paper analyzes training and generalization for an implicit neural network with random initialization.,"Implicit neural networks have become increasingly attractive in the machine learning community since they can achieve competitive performance but use far fewer computational resources. Recently, a line of theoretical works established global convergence of first-order methods such as gradient descent when the implicit networks are over-parameterized. However, as they train all layers together, their analyses are equivalent to only studying the evolution of the output layer. It is unclear how the implicit layer contributes to the training. Thus, in this paper, we restrict ourselves to only training the implicit layer. We show that global convergence is guaranteed, even if only the implicit layer is trained. On the other hand, the theoretical understanding of when and how the training performance of an implicit neural network generalizes to unseen data is still under-explored. Although this problem has been studied in standard feed-forward networks, the case of implicit neural networks is still intriguing since implicit networks theoretically have infinitely many layers. Therefore, this paper investigates the generalization error for implicit neural networks. Specifically, we study the generalization of an implicit network activated by the ReLU function over random initialization. We provide a generalization bound that is initialization sensitive. As a result, we show that gradient flow with proper random initialization can train a sufficiently over-parameterized implicit network to achieve arbitrarily small generalization errors.","Gradient descent, generalization, implicit neural networks" Differentially Private Conditional Text Generation For Synthetic Data Production,https://openreview.net/forum?id=LUql3ZOFwFD,https://openreview.net/pdf?id=LUql3ZOFwFD,synthesis of private text classification datasets via conditional text generation through GPT-2 fine-tuned with DP-SGD,"Companies have faced increasing pressure in recent years to anonymize user-collected data when sharing it internally or with third parties. Text data in particular contains copious amounts of personally identifiable information that has proven difficult to de-identify while remaining useful for the party of interest. Previous works have suggested that synthetic text generation could provide a promising avenue for curating high-performing and private datasets. In this paper, we introduce an approach to synthesize high-utility text classification datasets by performing conditional generation through a large language model, distilGPT2, while providing measurable guarantees via differential privacy. We show that naive approaches suffer heavily from utility loss by entangling task-relevant factors in the transformer embedding space, making controlled generation more difficult. 
We analyze how incorporating a secondary learning objective can improve the performance of the generative model and the utility of the generated data.","differential privacy, conditional text generation, NLP" Multi-Task Structural Learning using Local Task Similarity induced Neuron Creation and Removal,https://openreview.net/forum?id=_DYi95e8CAe,https://openreview.net/pdf?id=_DYi95e8CAe,We propose a multi-task learning method inspired by structural learning in the brain that simultaneously learns the architecture and its parameters.,"Multi-task learning has the potential to improve generalization by maximizing positive transfer between tasks while reducing task interference. Fully achieving this potential is hindered by manually designed architectures that remain static throughout training. In contrast, learning in the brain occurs through structural changes that are in tandem with changes in synaptic strength. Therefore, we propose Multi-Task Structural Learning (MTSL), which simultaneously learns the multi-task architecture and its parameters. MTSL begins with an identical single-task network for each task and alternates between a task learning phase and a structural learning phase. In the task learning phase, each network specializes in the corresponding task. In each of the structural learning phases, starting from the earliest layer, locally similar task layers first transfer their knowledge to a newly created group layer, after which they become redundant and are removed. Our experimental results show that MTSL achieves competitive generalization with various baselines and improves robustness to out-of-distribution data.","Multi-task Learning, Structural Learning, Brain-inspired Neural Network, Neuron Creation, Neuron Removal" Generating Sequences by Learning to Self-Correct,https://openreview.net/forum?id=hH36JeQZDaO,https://openreview.net/pdf?id=hH36JeQZDaO,,"Sequence generation applications require satisfying semantic constraints, such as ensuring that programs are correct, using certain keywords, or avoiding undesirable content. Language models, whether fine-tuned or prompted with few-shot demonstrations, frequently violate these constraints, and lack a mechanism to iteratively revise their outputs. Moreover, some powerful language models are of extreme scale or inaccessible, making it inefficient, if not infeasible, to update their parameters for task-specific adaptation. We present Self-Correction, an approach that decouples an imperfect base generator (an off-the-shelf language model or supervised sequence-to-sequence model) from a separate corrector that learns to iteratively correct imperfect generations. To train the corrector, we propose an online training procedure that can use either scalar or natural language feedback on intermediate imperfect generations. We show that Self-Correction improves upon the base generator in three diverse generation tasks - mathematical program synthesis, lexically-constrained generation, and toxicity control - even when the corrector is much smaller than the base generator. ","Language models, text generation" Bringing robotics taxonomies to continuous domains via GPLVM on hyperbolic manifolds,https://openreview.net/forum?id=e964ppNfoIJ,https://openreview.net/pdf?id=e964ppNfoIJ,A GPLVM with hyperbolic latent space augmented with graph-based priors for representing robotic taxonomies.,"Robotic taxonomies have appeared as high-level hierarchical abstractions that classify how humans move and interact with their environment. 
They have proven useful to analyse grasps, manipulation skills, and whole-body support poses. Despite the efforts devoted to designing their hierarchy and underlying categories, their use in application fields remains scarce. This may be attributed to the lack of computational models that fill the gap between the discrete hierarchical structure of the taxonomy and the high-dimensional heterogeneous data associated with its categories. To overcome this problem, we propose to model taxonomy data via hyperbolic embeddings that capture the associated hierarchical structure. To do so, we formulate a Gaussian process hyperbolic latent variable model and enforce the taxonomy structure through graph-based priors on the latent space and distance-preserving back constraints. We test our model on the whole-body support pose taxonomy to learn hyperbolic embeddings that comply with the original graph structure. We show that our model properly encodes unseen poses from existing or new taxonomy categories, can be used to generate trajectories between the embeddings, and outperforms its Euclidean counterparts.","GPLVM, hyperbolic geometry, robotic taxonomies" COC curve: operating neural networks at high accuracy and low manual effort,https://openreview.net/forum?id=dyRVv79XBAB,https://openreview.net/pdf?id=dyRVv79XBAB,,"In human-AI collaboration systems for critical applications based on neural networks, humans should set an operating point based on a model's confidence to determine when the decision should be delegated to experts. The underlying assumption is that the network's confident predictions are also correct. However, modern neural networks are notoriously overconfident in their predictions, thus achieving lower accuracy even when operated at high confidence. Network calibration methods mitigate this problem by encouraging models to make predictions whose confidence is consistent with the accuracy, i.e., they encourage confidence to reflect the number of mistakes the network is expected to make. However, they do not consider that data need to be manually analysed by experts in critical applications if the confidence of the network is below a certain level. This can be crucial for applications where available expert time is limited and expensive, e.g., medical ones. In this paper, we propose (1) the Confidence Operating Characteristics (COC) curve, which assesses a predictive model in terms of accuracy and the manual analysis it requires for varying operating points on confidence, and (2) a new loss function for classification that takes into account both aspects and is derived from the COC curve. We perform extensive experiments on multiple computer vision and medical image datasets for classification and compare the proposed approach with existing network calibration methods. Our results demonstrate that our method improves classification accuracy while delegating fewer decisions to human experts, achieves better out-of-distribution sample detection, and attains on-par calibration performance compared to existing methods. 
", Learning to Unlearn: Instance-wise Unlearning for Pre-trained Classifiers,https://openreview.net/forum?id=z8mVbZIMOjx,https://openreview.net/pdf?id=z8mVbZIMOjx,We propose an instance-wise unlearning framework with two regularization approaches to reduce forgetting on remaining data.,"Since the recent advent of regulations for data protection (e.g., the General Data Protection Regulation), there has been increasing demand in deleting information learned from sensitive data in pre-trained models without retraining from scratch. The inherent vulnerability of neural networks towards adversarial attacks and unfairness also calls for a robust method to remove or correct information in an instance-wise fashion, while retaining the predictive performance across remaining data. To this end, we define instance-wise unlearning, of which the goal is to delete information on a set of instances from a pre-trained model, by either misclassifying each instance away from its original prediction or relabeling the instance to a different label. We also propose two methods that reduce forgetting on the remaining data: 1) utilizing adversarial examples to overcome forgetting at the representation-level and 2) leveraging weight importance metrics to pinpoint network parameters guilty of propagating unwanted information. Both methods only require the pre-trained model and data instances to forget, allowing painless application to real-life settings where the entire training set is unavailable. Through extensive experimentation on various image classification benchmarks, we show that our approach effectively preserves knowledge of remaining data while unlearning given instances in both single-task and continual unlearning scenarios.","Machine Unlearning, Adversarial Examples, Weight Importance" Repository-Level Prompt Generation for Large Language Models of Code,https://openreview.net/forum?id=MtGmCCPJD-,https://openreview.net/pdf?id=MtGmCCPJD-,,"With the success of large language models (LLMs) of code and their use as code assistants (e.g.\ Codex used in GitHub Copilot, techniques for introducing domain-specific knowledge in the prompt design process become important. In this work, we propose a framework called Repo-Level Prompt Generator that learns to generate example-specific prompts using prompt proposals. The prompt proposals take context from the entire repository, thereby incorporating both the structure of the repository and the context from other relevant files (e.g.\ imports, parent class files). Our technique doesn't require any access to the weights of the LLM, making it applicable in cases where we only have black-box access to the LLM. We conduct experiments on the task of single-line code-autocompletion using code repositories taken from Google Code archives. We demonstrate that an oracle constructed from our prompt proposals gives a remarkably high relative improvement of 36\% over Codex, showing the quality of these proposals. 
Further, we show that when we train a model to select the best prompt proposal, we can achieve significant performance gains over Codex and other baselines.","prompt generation, codex, large language models of code, code-autocompletion, source code, LLM, retrieval" Predicting Out-of-Domain Generalization with Local Manifold Smoothness,https://openreview.net/forum?id=3rGLfR0dqp,https://openreview.net/pdf?id=3rGLfR0dqp,Local manifold smoothness is a novel complexity measure that can be used to predict generalization even on out-of-domain test sets without labels.,"Understanding how machine learning models generalize to new environments is a critical part of their safe deployment. Recent work has proposed a variety of complexity measures that directly predict or theoretically bound the generalization capacity of a model. However, these methods rely on a strong set of assumptions that in practice are not always satisfied. Motivated by the limited settings in which existing measures can be applied, we propose a novel complexity measure based on the local manifold smoothness of a classifier. We define local manifold smoothness as a classifier's output sensitivity to perturbations in the manifold neighborhood around a given test point. Intuitively, a classifier that is less sensitive to these perturbations should generalize better. To estimate smoothness we sample points using data augmentation and measure the fraction of these points classified into the majority class. Our method only requires selecting a data augmentation method and makes no other assumptions about the model or data distributions, meaning it can be applied even in out-of-domain (OOD) settings where existing methods cannot. In experiments on robustness benchmarks in image classification, sentiment analysis, and natural language inference, we demonstrate a strong and robust correlation between our manifold smoothness measure and actual OOD generalization on over 4,000 models evaluated on over 100 train/test domain pairs.","complexity measure, out of domain generalization, smoothness" FP_AINet: Fusion Prototype with Adaptive Induction Network for Few-Shot Learning,https://openreview.net/forum?id=QmH1_mn6SI,https://openreview.net/pdf?id=QmH1_mn6SI,,"The conventional prototypical network treats all samples equally and does not consider the effects of noisy samples, which leads to a biased class representation. In this paper, we propose a novel Fusion Prototype with Adaptive Induction Network (FP_AINet) for few-shot learning that can learn representative prototypes from a few support samples. Specifically, to address the problem of noisy samples in the support set, an adaptive induction network is developed, which can learn different class representations for diverse queries and assign adaptive scores to support samples according to their relative significance. Moreover, the proposed model can generate a more accurate prototype than comparison methods by considering the query-related samples. As the number of samples increases, the prototypical network becomes more expressive, since the adaptive induction network ignores relative local features. As a result, a Gaussian-based fusion algorithm is designed to learn more representative prototypes. Extensive experiments are conducted on three datasets: miniImageNet, tieredImageNet, and CIFAR_FS. 
The experimental results demonstrate the superiority of FP_AINet over state-of-the-art few-shot learning methods.", CLUSTERBERT: MULTI-STAGE FINE-TUNING OF TRANSFORMERS FOR DEEP TEXT CLUSTERING,https://openreview.net/forum?id=oja19FZn5Y,https://openreview.net/pdf?id=oja19FZn5Y,,"Transformer models were originally designed for text generation, classification, and sequence labelling, and they have achieved new state-of-the-art results in those areas. Recent deep clustering methods learn cluster-friendly spaces for complex data and thereby outperform traditional clustering algorithms, especially on images and graphs. We propose ClusterBERT, an unsupervised algorithm that combines the strengths of both approaches. By tightly integrating transformer-based sentence representation learning with clustering, our method discovers a cluster-friendly representation of text data that retains useful semantic information. ClusterBERT is a multi-stage procedure that consists of domain adaptation, clustering, and hardening of the clusters. Starting from an initial representation obtained by transformer models, ClusterBERT learns a cluster-friendly space for text data by jointly optimizing the reconstruction loss and a clustering loss. Our experiments demonstrate that ClusterBERT outperforms state-of-the-art text clustering methods. ","Text Clustering, Deep Clustering, Transformer, Sentence Embedding" Neural Network Differential Equation Solvers allow unsupervised error estimation and correction,https://openreview.net/forum?id=a40XE0dgOdL,https://openreview.net/pdf?id=a40XE0dgOdL,A blueprint for deep learning solution models for differential equations,"Neural Network Differential Equation (NN DE) solvers have surged in popularity due to a combination of factors: computational advances making their optimization more tractable, their capacity to handle high-dimensional problems, easy interpretability, etc. However, most NN DE solvers suffer from a fundamental limitation: their loss functions are not explicitly dependent on the errors associated with the solution estimates. As such, validation and error estimation usually require knowledge of the true solution. Indeed, when the true solution is unknown, we are often reduced to simply hoping that a ``\textit{low enough}'' loss implies ``\textit{small enough}'' errors, since explicit relationships between the two are not available. In this work, we describe a general strategy for efficiently constructing error estimates and corrections for Neural Network Differential Equation solvers. Our methods do not require \textit{a priori} knowledge of the true solutions and obtain explicit relationships between loss functions and the errors, given certain assumptions on the DE. In turn, these explicit relationships directly allow us to estimate and correct for the errors.","Deep learning, Numerical Methods, Differential Equations" Wide Attention is the Way Forward for Transformers,https://openreview.net/forum?id=-rHOeHtdWP,https://openreview.net/pdf?id=-rHOeHtdWP,"Widening the attention layer in a Transformer and only using a single layer is surprisingly effective, with a number of advantages.","The Transformer is an extremely powerful and prominent deep learning architecture. In this work, we challenge the commonly held belief in deep learning that going deeper is better, and present an alternative design approach: building wider attention Transformers. 
We demonstrate that wide single-layer Transformer models can compete with or outperform deeper ones in a variety of Natural Language Processing (NLP) tasks when both are trained from scratch. The impact of changing the model aspect ratio on Transformers is then studied systematically. This ratio balances the number of layers and the number of attention heads per layer while keeping the total number of attention heads and all other hyperparameters constant. On average, across 4 NLP tasks and 10 attention types, single-layer wide models perform 0.3% better than their deep counterparts. We present an in-depth evaluation and demonstrate how wide models require a far smaller memory footprint and can run faster on commodity hardware; in addition, these wider models are also more interpretable. For example, a single-layer Transformer on IMDb byte-level text classification has 3.1x faster inference latency on a CPU than its equally accurate deeper counterpart, and is half the size. Our results suggest that the critical direction for building better Transformers for NLP is their width, and that their depth is less relevant.","transformer, attention, wide, deep, accuracy, latency, interpretability, xformer, size" Variational Prompt Tuning Improves Generalization of Vision-Language Models,https://openreview.net/forum?id=t2qu5Hotedi,https://openreview.net/pdf?id=t2qu5Hotedi,,"Prompt tuning provides an efficient mechanism to adapt large vision-language models to downstream tasks by treating part of the input language prompts as learnable parameters while freezing the rest of the model. Existing works on prompt tuning are, however, prone to damaging the generalization capabilities of the foundation models, because the learned prompts lack the capacity to cover certain concepts within the language model. To avoid this limitation, we propose a probabilistic modeling of the underlying distribution of prompts, allowing prompts within the support of an associated concept to be derived through stochastic sampling. This results in a more complete and richer transfer of the information captured by the language model, providing better generalization capabilities for downstream tasks. The resulting algorithm relies on a simple yet powerful variational framework that can be directly integrated with other developments. We show our approach is seamlessly integrated into both standard and conditional prompt learning frameworks, improving performance in both cases considerably, especially with regard to preserving the generalization capability of the original model. Our method provides the current state-of-the-art for prompt learning, surpassing CoCoOp by 1.6% average Top-1 accuracy on the standard benchmark. Remarkably, it even surpasses the original CLIP model in terms of generalization to new classes. Implementation code will be released.", DCT-DiffStride: Differentiable Strides with Real-Valued Data,https://openreview.net/forum?id=HgJ3HYIP3pY,https://openreview.net/pdf?id=HgJ3HYIP3pY,"We propose DCT-DiffStride, a differentiable method to learn strides leveraging the energy compaction properties of the discrete cosine transform.","Reducing the size of intermediate feature maps within various neural network architectures is critical for generalization performance, and memory and computational complexity. Until recently, most methods required downsampling rates (i.e., decimation) to be predefined and static during training, with optimal downsampling rates requiring a vast hyper-parameter search. 
Recent work has proposed a novel and differentiable method for learning strides, named DiffStride, which uses the discrete Fourier transform (DFT) to learn strides for decimation. However, in many cases the DFT does not capture signal properties as efficiently as the discrete cosine transform (DCT). Therefore, we propose an alternative method for learning decimation strides, DCT-DiffStride, as well as new regularization methods to reduce model complexity. Our work employs the DCT and its inverse as a low-pass filter in the frequency domain to reduce feature map dimensionality. Leveraging the well-known energy compaction properties of the DCT for natural signals, we evaluate DCT-DiffStride against its competitors on image and audio datasets, demonstrating a favorable tradeoff in model performance and model complexity compared to competing methods. Additionally, we show DCT-DiffStride and DiffStride can be applied to data outside the natural signal domain, increasing the general applications of such methods.","strides, decimation, deep learning, discrete cosine transform" Interneurons accelerate learning dynamics in recurrent neural networks for statistical adaptation,https://openreview.net/forum?id=3mlITJRYYbs,https://openreview.net/pdf?id=3mlITJRYYbs,We show that adding interneurons to a recurrent neural network for statistical whitening accelerates the learning dynamics,"Early sensory systems in the brain rapidly adapt to fluctuating input statistics, which requires recurrent communication between neurons. Mechanistically, such recurrent communication is often indirect and mediated by local interneurons. In this work, we explore the computational benefits of mediating recurrent communication via interneurons compared with direct recurrent connections. To this end, we consider two mathematically tractable recurrent neural networks that statistically whiten their inputs --- one with direct recurrent connections and the other with interneurons that mediate recurrent communication. By analyzing the corresponding continuous synaptic dynamics and numerically simulating the networks, we show that the network with interneurons is more robust to initialization than the network with direct recurrent connections in the sense that the convergence time for the synaptic dynamics in the network with interneurons (resp. direct recurrent connections) scales logarithmically (resp. linearly) with the spectrum of their initialization. Our results suggest that interneurons are computationally useful for rapid adaptation to changing input statistics. Interestingly, the network with interneurons is an overparameterized solution of the whitening objective for the network with direct recurrent connections, so our results can be viewed as a recurrent neural network analogue of the implicit acceleration phenomenon observed in overparameterized feedforward linear networks.","Interneurons, recurrent neural networks, gradient flows, implicit acceleration, statistical whitening" Burstormer: Burst Image Restoration and Enhancement Transformer,https://openreview.net/forum?id=VXyzRA_Zaj1,https://openreview.net/pdf?id=VXyzRA_Zaj1,,"On a shutter press, modern handheld cameras capture multiple images in rapid succession and merge them to generate a single image. However, individual frames in a burst are misaligned due to inevitable motions and contain multiple degradations. The challenge is to properly align the successive image shots and merge their complementary information to achieve high-quality outputs. 
To this end, we propose Burstormer: a novel transformer-based architecture for burst image restoration and enhancement. In comparison to existing works, our approach exploits multi-scale local and non-local features to achieve improved alignment and feature fusion. Our key idea is to enable inter-frame communication in the burst neighborhoods for information aggregation and progressive fusion while modeling the burst-wide context. However, the input burst frames need to be properly aligned before fusing their information. Therefore, we propose an enhanced deformable alignment module for aligning burst features with respect to the reference frame. Unlike existing techniques, the proposed alignment module not only aligns burst features but also exchanges feature information and maintains focused communication with the reference frame through the proposed reference-based feature enrichment mechanism. This additional exchange of information helps in aligning multi-frame features under complex motions. After multi-level alignment and enrichment, we re-emphasize inter-frame communication within burst frames using a new cyclic burst sampling technique. Finally, the inter-frame information is aggregated using our proposed burst feature fusion module, followed by a progressive increase in spatial resolution by shuffling the feature information available in burst frames. The proposed Burstormer outperforms the existing state-of-the-art approaches on three popular tasks: burst super-resolution, burst denoising, and burst low-light enhancement. Our codes will be made public.","Burst super-resolution, multi-frame processing, feature alignment" Understanding DDPM Latent Codes Through Optimal Transport,https://openreview.net/forum?id=6PIrhAx1j4i,https://openreview.net/pdf?id=6PIrhAx1j4i,The DDIM encoder is almost equal to the optimal transport map,"Diffusion models have recently outperformed alternative approaches to model the distribution of natural images. Such diffusion models allow for deterministic sampling via the probability flow ODE, giving rise to a latent space and an encoder map. While having important practical applications, such as the estimation of the likelihood, the theoretical properties of this map are not yet fully understood. In the present work, we partially address this question for the popular case of the VP-SDE (DDPM) approach. We show that, perhaps surprisingly, the DDPM encoder map coincides with the optimal transport map for common distributions; we support this claim with extensive numerical experiments using an advanced tensor-train solver for the multidimensional Fokker-Planck equation. We provide additional theoretical evidence for the case of multivariate normal distributions.","diffusion models, ddpm, optimal transport, theory" Soft Sampling for Efficient Training of Deep Neural Networks on Massive Data,https://openreview.net/forum?id=cnutOGKrz7f,https://openreview.net/pdf?id=cnutOGKrz7f,,"We investigate soft sampling, a simple yet effective approach for efficient training of large-scale deep neural network models when dealing with massive data. Soft sampling selects a subset uniformly at random with replacement from the full data set in each epoch. First, we derive a theoretical convergence guarantee for soft sampling on non-convex objective functions and give the convergence rate. Next, we analyze the data coverage and occupancy properties of soft sampling from the perspective of the coupon collector's problem. 
Finally, we evaluate soft sampling on various machine learning tasks using a range of network architectures and demonstrate its effectiveness. Compared to existing coreset-based data selection methods, soft sampling offers a better accuracy-efficiency trade-off. Especially on real-world industrial-scale data sets, soft sampling can achieve significant speedup and competitive performance with almost no additional computing cost.", Learning About Progress From Experts,https://openreview.net/forum?id=sKc6fgce1zs,https://openreview.net/pdf?id=sKc6fgce1zs,"We learn a model of long-term progress using expert demonstrations, and show that it can be used to form an exploration reward that allows reinforcement learning agents to solve very challenging sparse tasks in NetHack.","Many important tasks involve some notion of long-term progress in multiple phases: e.g. to clean a shelf it must be cleared of items, cleaning products applied, and then the items placed back on the shelf. In this work, we explore the use of expert demonstrations in long-horizon tasks to learn a monotonically increasing function that summarizes progress. This function can then be used to aid agent exploration in environments with sparse rewards. As a case study we consider the NetHack environment, which requires long-term progress at a variety of scales and is far from being solved by existing approaches. In this environment, we demonstrate that by learning a model of long-term progress from expert data containing only observations, we can achieve efficient exploration in challenging sparse tasks, well beyond what is possible with current state-of-the-art approaches. We will open-source the curated expert training data at publication time.","learning from demonstrations, reinforcement learning, exploration, nethack" Learning Fair Graph Representations via Automated Data Augmentations,https://openreview.net/forum?id=1_OGWcP1s9w,https://openreview.net/pdf?id=1_OGWcP1s9w,We propose an automated graph data augmentation method to learn fair graph representations.,"We consider fair graph representation learning via data augmentations. While this direction has been explored previously, existing methods invariably rely on certain assumptions about the properties of fair graph data in order to design fixed data augmentation strategies. Nevertheless, the exact properties of fair graph data may vary significantly in different scenarios. Hence, heuristically designed augmentations may not always generate fair graph data in different application scenarios. In this work, we propose a method, known as Graphair, to learn fair representations based on automated graph data augmentations. Such fairness-aware augmentations are themselves learned from data. Our Graphair is designed to automatically discover fairness-aware augmentations from input graphs in order to circumvent sensitive information while preserving other useful information. Experimental results demonstrate that our Graphair consistently outperforms many baselines on multiple node classification datasets in terms of fairness-accuracy trade-off performance. 
In addition, results indicate that Graphair can automatically learn to generate fair graph data without prior knowledge of fairness-relevant graph properties.", FUN: Filter-based Unlearnable Datasets,https://openreview.net/forum?id=iaCzfh6vtwQ,https://openreview.net/pdf?id=iaCzfh6vtwQ,"We propose a novel, model-free convolutional filter-based unlearnable dataset (FUN) generation technique that protects data from empirical risk minimization and adversarial training with various budgets.","Large-scale training of modern deep learning models heavily relies on publicly available data on the web. This potentially unauthorized usage of online data leads to concerns regarding data privacy. Recent works aim to make data unlearnable for deep learning models by adding small, specially designed noise to tackle this issue. However, these methods are vulnerable to adversarial training (AT) and/or are computationally heavy. In this work, we propose a novel, model-free convolutional Filter-based UNlearnable (FUN) dataset generation technique. FUN performs controlled class-wise convolutions using filters that are randomly generated via a private key. FUN encourages the network to learn the relation between filters and labels rather than informative features for classifying the clean data. We develop a theoretical analysis demonstrating that FUN can successfully poison Gaussian mixture data by reducing the clean data performance of the optimal Bayes classifier. We also empirically demonstrate the effectiveness of FUN with various datasets (CIFAR-10, CIFAR-100, and ImageNet-100), and architectures (ResNet-18, VGG-16, Wide ResNet-34-10, and DenseNet-121). Our experiments show that FUN is robust to various data augmentations and training approaches such as smoothing, AT with different budgets, transfer learning, and fine-tuning. For instance, training a ResNet-18 on FUN ImageNet-100 data achieves only 8.96$\%$, 40.08$\%$, and 20.58$\%$ clean test accuracies with empirical risk minimization (ERM), $L_{\infty}$ AT, and $L_{2}$ AT, respectively. Here, ERM on the clean training data achieves a clean test accuracy of 80.66$\%$. Furthermore, we also show that FUN is robust to adaptive defenses designed specifically to break it.","unlearnable examples, data protection, privacy, adversarial machine learning" A new photoreceptor-inspired CNN layer enables deep learning models of retina to generalize across lighting conditions,https://openreview.net/forum?id=L9RXJBTQaDf,https://openreview.net/pdf?id=L9RXJBTQaDf,A new bio-inspired deep learning model that enables generalization in dynamic lighting conditions,"As we move our eyes, and as lighting changes in our environment, the light intensity reaching our retinas changes dramatically and on multiple timescales. Despite these changing conditions, our retinas effortlessly extract visual information that allows downstream brain areas to make sense of the visual world. Such processing capabilities are desirable in many settings, including computer vision systems that operate in dynamic lighting environments like in self-driving cars, and in algorithms that translate visual inputs into neural signals for use in vision-restoring prosthetics. To mimic retinal processing, we first require models that can predict retinal ganglion cell (RGC) responses reliably. While existing state-of-the-art deep learning models can accurately predict RGC responses to visual scenes under steady-state lighting conditions, these models fail under dynamic lighting conditions. 
This is because changes in lighting markedly alter RGC responses: adaptation mechanisms dynamically tune RGC receptive fields on multiple timescales. Because current deep learning models of the retina have no in-built notion of light level or these adaptive mechanisms, they are unable to accurately predict RGC responses under lighting conditions that they were not trained on. We present here a new deep learning model of the retina that can predict RGC responses to visual scenes at different light levels without requiring training at each light level. Our model combines a fully trainable biophysical front end capturing the fast and slow adaptation mechanisms in the photoreceptors with convolutional neural networks (CNNs) capturing downstream retinal processing. We tested our model’s generalization performance across light levels using monkey and rat retinal data. Whereas conventional CNN models without the photoreceptor layer failed to predict RGC responses when the lighting conditions changed, our model with the photoreceptor layer as a front end fared much better in this challenge. Overall, our work demonstrates a new hybrid approach that equips deep learning models with biological vision mechanisms enabling them to adapt to dynamic environments.","retina model, photoreceptor model, bio-inspired artificial vision, retina predictor, dynamic environments" 3D Neural Embedding Likelihood for Robust Sim-to-Real Transfer in Inverse Graphics,https://openreview.net/forum?id=6jqSG88Mf_D,https://openreview.net/pdf?id=6jqSG88Mf_D,"We propose 3D Neural Embedding Likelihoods (3DNEL), a 3D likelihood that models both shape information from depth and appearance information from RGB via neural embeddings and bridges the sim-to-real gap in 3D inverse graphics.","A central challenge in 3D scene perception via inverse graphics is robustly modeling the gap between 3D graphics and real-world data. We propose a novel 3D Neural Embedding Likelihood (3DNEL) over RGB-D images to address this gap. 3DNEL uses neural embeddings to predict 2D-3D correspondences from RGB and combines this with depth in a principled manner. 3DNEL is trained entirely from synthetic images and generalizes to real-world data. To showcase this capability, we develop a multi-stage inverse graphics pipeline that uses 3DNEL for 6D object pose estimation from real RGB-D images. Our method outperforms the previous state-of-the-art in sim-to-real pose estimation on the YCB-Video dataset, and improves robustness, with significantly fewer large-error predictions. Unlike existing bottom-up, discriminative approaches that are specialized for pose estimation, 3DNEL adopts a probabilistic generative formulation that jointly models multi-object scenes. This generative formulation enables easy extension of 3DNEL to additional tasks like object and camera tracking from video, using principled inference in the same probabilistic model without task-specific retraining.","3D inverse graphics, probabilistic inference, likelihood, RGB-D, neural embedding, object pose estimation" Dynamic Scheduled Sampling with Imitation Loss for Neural Text Generation,https://openreview.net/forum?id=UmHG2bD7X3w,https://openreview.net/pdf?id=UmHG2bD7X3w,We propose a new training objective that alleviates the exposure bias problem in text generation.,"State-of-the-art neural text generation models are typically trained to maximize the likelihood of each token in the ground-truth sequence conditioned on the previous target tokens. 
However, during inference, the model needs to make a prediction conditioned on the tokens generated by itself. This train-test discrepancy is referred to as exposure bias. Scheduled sampling is a curriculum learning strategy that gradually exposes the model to its own predictions during training to mitigate this bias. Most of the proposed approaches design a scheduler based on training steps, which generally requires careful tuning depending on the training setup. In this work, we introduce Dynamic Scheduled Sampling with Imitation Loss (DySI), which maintains the schedule based solely on training-time accuracy, while enhancing the curriculum learning by introducing an imitation loss, which attempts to make the behavior of the decoder indistinguishable from that of a teacher-forced decoder. DySI is universally applicable across training setups with minimal tuning. Extensive experiments and analysis show that DySI not only achieves notable improvements on standard machine translation benchmarks, but also significantly improves the robustness of other text generation models. ","exposure bias, text generation" Emergence of Maps in the Memories of Blind Navigation Agents,https://openreview.net/forum?id=lTt4KjHSsyl,https://openreview.net/pdf?id=lTt4KjHSsyl,"‘Blind’ AI navigation agents (with only egomotion sensing) can learn to navigate new environments and build map-like representations (supporting the ability to take shortcuts, follow walls, and predict free-space and collisions) of their environment.","Animal navigation research posits that organisms build and maintain internal spatial representations, or maps, of their environment. We ask if machines – specifically, artificial intelligence (AI) navigation agents – also build implicit (or ‘mental’) maps. A positive answer to this question would (a) explain the surprising phenomenon in recent literature of ostensibly map-free neural networks achieving strong performance, and (b) strengthen the evidence of mapping as a fundamental mechanism for navigation by intelligent embodied agents, whether they be biological or artificial. Unlike animal navigation, we can judiciously design the agent’s perceptual system and control the learning paradigm to nullify alternative navigation mechanisms. Specifically, we train ‘blind’ agents – with sensing limited to only egomotion and no other sensing of any kind – to perform PointGoal navigation (‘go to $\Delta$x, $\Delta$y’) via reinforcement learning. Our agents are composed of navigation-agnostic components (fully-connected and recurrent neural networks), and our experimental setup provides no inductive bias towards mapping. Despite these harsh conditions, we find that blind agents are (1) surprisingly effective navigators in new environments (∼95% success); (2) they utilize memory over long horizons (remembering ∼1,000 steps of past experience in an episode); (3) this memory enables them to exhibit intelligent behavior (following walls, detecting collisions, taking shortcuts); (4) there is emergence of maps and collision detection neurons in the representations of the environment built by a blind agent as it navigates; and (5) the emergent maps are selective and task dependent (e.g. the agent ‘forgets’ exploratory detours). 
Overall, this paper presents no new techniques for the AI audience, but a surprising finding, an insight, and an explanation.","embodied AI, navigation, characterizing representations" Latent Neural ODEs with Sparse Bayesian Multiple Shooting,https://openreview.net/forum?id=moIlFZfj_1b,https://openreview.net/pdf?id=moIlFZfj_1b,,"Training dynamic models, such as neural ODEs, on long trajectories is a hard problem that requires using various tricks, such as trajectory splitting, to make model training work in practice. These methods are often heuristics with poor theoretical justifications, and require iterative manual tuning. We propose a principled multiple shooting technique for neural ODEs that splits the trajectories into manageable short segments, which are optimized in parallel, while ensuring probabilistic control on continuity over consecutive segments. We derive variational inference for our shooting-based latent neural ODE models and propose amortized encodings of irregularly sampled trajectories with a transformer-based recognition network with temporal attention and relative positional encoding. We demonstrate efficient and stable training, and state-of-the-art performance on multiple large-scale benchmark datasets.", $\mathcal{O}$-GNN: incorporating ring priors into molecular modeling,https://openreview.net/forum?id=5cFfz6yMVPU,https://openreview.net/pdf?id=5cFfz6yMVPU,,"Cyclic compounds that contain at least one ring play an important role in drug design. Despite the recent success of molecular modeling with graph neural networks (GNNs), few models explicitly take rings in compounds into consideration, consequently limiting the expressiveness of the models. In this work, we design a new variant of GNN, the ring-enhanced GNN ($\mathcal{O}$-GNN), that explicitly models rings in addition to atoms and bonds in compounds. In $\mathcal{O}$-GNN, each ring is represented by a latent vector, which contributes to and is iteratively updated by atom and bond representations. Theoretical analysis shows that $\mathcal{O}$-GNN is able to distinguish two isomorphic subgraphs lying on different rings using only one layer, while conventional graph convolutional neural networks require multiple layers to distinguish them, demonstrating that $\mathcal{O}$-GNN is more expressive. Through experiments, $\mathcal{O}$-GNN shows good performance on $\bf{11}$ public datasets. In particular, it achieves the state-of-the-art validation result on the PCQM4Mv1 benchmark (outperforming the previous KDDCup champion solution) and on the drug-drug interaction prediction task on DrugBank. Furthermore, $\mathcal{O}$-GNN outperforms strong baselines (without modeling rings) on the molecular property prediction and retrosynthesis prediction tasks.","Graph Neural Network, Ring, Molecular Modeling" MACTA: A Multi-agent Reinforcement Learning Approach for Cache Timing Attacks and Detection,https://openreview.net/forum?id=CDlHZ78-Xzi,https://openreview.net/pdf?id=CDlHZ78-Xzi,,"Security vulnerabilities in computer systems raise serious concerns as computers process an unprecedented amount of private and sensitive data today. Cache-timing attacks pose an important practical threat as they have been shown to be able to effectively breach many protection mechanisms in today's systems. However, the current detection of cache timing attacks relies heavily on heuristics and expert knowledge, which can lead to brittleness and inability to adapt to new attacks. 
To mitigate these problems, we develop a two-player environment for cache-timing attacks and detection, and leverage the idea of population-based multi-agent reinforcement learning (MARL) to train both attackers and detectors. Our empirical results indicate that, without any manual input from security experts, the trained attacker is able to act more stealthily, while the trained detector can generalize to \emph{unseen} attacks and is less exploitable by high-bandwidth attacks. Furthermore, in this environment, we found that agents equipped with a Transformer encoder substantially outperform agents with multi-layer perceptron encoders, which have been commonly used in RL tasks, suggesting that Transformers may learn better representations in such real-world tasks. ","multi-agent reinforcement learning, security, game theory" Training Normalizing Flows from Dependent Data,https://openreview.net/forum?id=Z4lOwCEJQ8Z,https://openreview.net/pdf?id=Z4lOwCEJQ8Z,,"Normalizing flows are powerful non-parametric statistical models that function as a hybrid between density estimators and generative models. Current learning algorithms for normalizing flows assume that data points are sampled independently, an assumption that is frequently violated in practice, which may lead to erroneous density estimation and data generation. We propose a likelihood objective of normalizing flows incorporating dependencies between the data points, for which we derive a flexible and efficient learning algorithm suitable for different dependency structures. We show that respecting dependencies between observations can improve empirical results on both synthetic and real-world data. ",Normalizing Flows Spectral Augmentation for Self-Supervised Learning on Graphs,https://openreview.net/forum?id=DjzBCrMBJ_p,https://openreview.net/pdf?id=DjzBCrMBJ_p,We propose a novel spectral augmentation method which uses graph spectrum to capture structural properties and guide topology augmentations for graph self-supervised learning.,"Graph contrastive learning (GCL), as an emerging self-supervised learning technique on graphs, aims to learn representations via instance discrimination. Its performance heavily relies on graph augmentation to reflect invariant patterns that are robust to small perturbations; yet it remains unclear what graph invariance GCL should capture. Recent studies mainly perform topology augmentations in a uniformly random manner in the spatial domain, ignoring their influence on the intrinsic structural properties embedded in the spectral domain. In this work, we aim to find a principled way to perform topology augmentations by exploring the invariance of graphs from the spectral perspective. We develop spectral augmentation, which guides topology augmentations by maximizing the spectral change. Extensive experiments on both graph and node classification tasks demonstrate the effectiveness of our method in self-supervised representation learning. The proposed method also brings promising generalization capability in transfer learning, and is equipped with an intriguing robustness property under adversarial attacks. 
Our study sheds light on a general principle for graph topology augmentation.","graph self-supervised learning, graph spectral theory, graph augmentation" An ensemble view on mixup,https://openreview.net/forum?id=k_iNqflnekU,https://openreview.net/pdf?id=k_iNqflnekU,mixup brings about all the benefits of an expensive ensemble; you can improve things by evaluating multiple mixed up examples at test time,"Deep ensembles are widely used to improve the generalization, calibration, uncertainty estimates and adversarial robustness of neural networks. In parallel, the data augmentation technique of mixup has grown popular for the very same reasons. Could these two techniques be related? This work suggests that both implement a similar inductive bias to “linearize” decision boundaries. We show how to obtain diverse predictions from a single mixup machine by interpolating a test instance with multiple reference points. These “mixup ensembles” are cheap: one needs to train and store one single model, as opposed to the K independent members forming a deep ensemble. Motivated by the limitations of ensembles in modeling uncertainty far away from the training data, we propose a variant of mixup that builds augmented examples using both random interpolations and extrapolations of examples. We evaluate the efficacy of our proposed methods across a variety of in-domain and out-domain metrics on the CIFAR-10 and CIFAR-10-NEG datasets.","mixup, ensembles, generalization, calibration, ood, detection" Improving Adversarial Robustness by Contrastive Guided Diffusion Process,https://openreview.net/forum?id=quCOIL8JQnp,https://openreview.net/pdf?id=quCOIL8JQnp,,"Synthetic data generation has become an emerging tool to help improve the adversarial robustness in classification tasks, since robust learning requires a significantly larger number of training samples compared with standard classification tasks. Among various deep generative models, the diffusion model has been shown to produce high-quality synthetic images and has achieved good performance in improving adversarial robustness. However, diffusion-type methods are typically slow in data generation as compared with other generative models. Although different acceleration techniques have been proposed recently, it is also of great importance to study how to improve the sample efficiency of generated data for the downstream task. In this paper, we first analyze the optimality condition of synthetic distribution for achieving non-trivial robust accuracy. We show that enhancing the distinguishability among the generated data is critical for improving adversarial robustness. Thus, we propose the Contrastive-Guided Diffusion Process (Contrastive-DP), which adopts the contrastive loss to guide the diffusion model in data generation. We verify our theoretical results using simulations and demonstrate the good performance of Contrastive-DP on image datasets.", $\sigma$Reparam: Stable Transformer Training with Spectral Reparametrization,https://openreview.net/forum?id=QwqxO8URJzn,https://openreview.net/pdf?id=QwqxO8URJzn,"We introduce a weight reparameterization method which stabilizes transformer training across a variety of domains and setups, enabling simpler training recipes and robustness to hyperparameters without performance tradeoffs.","Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. 
In particular, we track the ""attention entropy"" for each attention head during the course of training, which is a proxy for the attention's sharpness. We observe a common, non-monotonic evolution of attention entropy across different settings: the attention entropy first decreases quickly in the initial phase of training, then quickly increases, and finally enters a long stable phase. While the exact shape can be affected by hyperparameters such as warmup, initialization, and learning rate, we find that there is a close correlation between the minima of attention entropy and the model's training stability. To address this, we propose a simple and efficient solution dubbed $\sigma$Reparam, where we reparametrize all linear layers with Spectral Normalization and an additional learned scalar. We provide a lower bound on the attention entropy as a function of the spectral norms of the query and key projections, which suggests that small attention entropy can be obtained with large spectral norms. $\sigma$Reparam decouples the growth rate of a weight matrix's spectral norm from its dimensionality, which we verify empirically. We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, automatic speech recognition and language modeling tasks. We show that $\sigma$Reparam provides great stability and robustness with respect to the choice of hyperparameters.","Transformers, self-attention, optimization, stability, spectral normalization, self-supervised learning, vision, speech, language, contrastive learning" Towards Multi-spatiotemporal-scale Generalized PDE Modeling,https://openreview.net/forum?id=Uk40pC45YJG,https://openreview.net/pdf?id=Uk40pC45YJG,We present a comprehensive study of Fourier and U-Net inspired architectural choices towards generalization of multi-spatiotemporal-scale problems.,"Partial differential equations (PDEs) are central to describing complex physical system simulations. Their expensive solution techniques have led to an increased interest in deep neural network based surrogates. However, the practical utility of training such surrogates is contingent on their ability to model complex multi-scale spatio-temporal phenomena. Various neural network architectures have been proposed to model complex spatial information in PDE solutions, most notably Fourier Neural Operators (FNOs), which give a natural handle over local & global information via parameterization of different Fourier modes, and U-Nets, which treat local and global information via downsampling and upsampling paths. However, generalizing across different equation parameters or different time-scales still remains a challenge. In this work, we make a comprehensive comparison between various FNO and U-Net like approaches on fluid mechanics problems in both vorticity-stream and velocity function form. For U-Nets, we transfer recent architectural improvements from computer vision, most notably from object segmentation and generative modeling. We further analyze the design considerations for using FNO layers to improve the performance of U-Net architectures without major degradation of computational performance.
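The $\sigma$Reparam reparametrization described above is simple enough to sketch directly; below is a minimal numpy version in which each linear layer's weight is scaled by a learned scalar divided by its spectral norm, estimated with power iteration. Training-loop details are omitted.

```python
import numpy as np

def spectral_norm(w, n_iters=20):
    """Largest singular value of w, estimated by power iteration."""
    v = np.random.randn(w.shape[1])
    for _ in range(n_iters):
        u = w @ v
        u /= np.linalg.norm(u)
        v = w.T @ u
        v /= np.linalg.norm(v)
    return float(u @ w @ v)

def sigma_reparam_forward(x, w, gamma):
    """Linear layer with the weight reparametrized as (gamma / sigma(w)) * w."""
    return x @ ((gamma / spectral_norm(w)) * w)

# gamma is a learned scalar (one per layer), typically initialized to 1.
w = np.random.randn(64, 64) * 0.1
x = np.random.randn(8, 64)
print(sigma_reparam_forward(x, w, gamma=1.0).shape)  # (8, 64)
```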
Finally, we show promising results on generalization to different PDE parameters and time-scales with a single surrogate model.","PDE modeling, multi-spatiotemporal-scale, PDE generalization, Fourier vs U-Net, fluid mechanics" PAC Reinforcement Learning for Predictive State Representations,https://openreview.net/forum?id=FVW7Mi2ph6C,https://openreview.net/pdf?id=FVW7Mi2ph6C,PAC learning for PSRs.,"In this paper, we study online Reinforcement Learning (RL) in partially observable dynamical systems. We focus on the Predictive State Representations (PSRs) model, which is an expressive model that captures other well-known models such as Partially Observable Markov Decision Processes (POMDPs). A PSR represents the states using a set of predictions of future observations and is defined entirely using observable quantities. We develop a novel model-based algorithm for PSRs that can learn a near-optimal policy with sample complexity scaling polynomially with respect to all the relevant parameters of the system. Our algorithm naturally works with function approximation to extend to systems with potentially large state and observation spaces. We show that given a realizable model class, the sample complexity of learning the near-optimal policy only scales polynomially with respect to the statistical complexity of the model class, without any explicit polynomial dependence on the size of the state and observation spaces. Notably, ours is the first work to show polynomial sample complexity for competing with the globally optimal policy in PSRs. Finally, we demonstrate how our general theorem can be directly used to derive sample complexity bounds for special models including $m$-step weakly revealing and $m$-step decodable tabular POMDPs, POMDPs with low-rank latent transition, and POMDPs with linear emission and latent transition. ",Reinforcement learning theory (statistical learning theory) Federated Learning on Adaptively Weighted Nodes by Bilevel Optimization,https://openreview.net/forum?id=6aKcyoDJBaX,https://openreview.net/pdf?id=6aKcyoDJBaX,We propose a federated learning method with adaptively weighted nodes and analyze its generalization performance.,We propose a federated learning method with weighted nodes in which the weights can be modified to optimize the model’s performance on a separate validation set. The problem is formulated as a bilevel optimization problem where the inner problem is a federated learning problem with weighted nodes and the outer problem focuses on optimizing the weights based on the validation performance of the model returned from the inner problem. A communication-efficient federated optimization algorithm is designed to solve this bilevel optimization problem. We analyze the generalization performance of the output model and identify the scenarios in which our method is, in theory, superior to training a model locally and superior to federated learning with static and evenly distributed weights. ,"federated learning, bilevel optimization, distributed optimization, generalization performance" Removing Structured Noise with Diffusion Models,https://openreview.net/forum?id=yNRfzsGELb,https://openreview.net/pdf?id=yNRfzsGELb,We propose a novel posterior sampling method to efficiently remove structured noise in various inverse problems using diffusion models.,"Solving ill-posed inverse problems requires careful formulation of prior beliefs over the signals of interest and an accurate description of their manifestation into noisy measurements.
Handcrafted signal priors based on, e.g., sparsity are increasingly replaced by data-driven deep generative models, and several groups have recently shown that state-of-the-art score-based diffusion models yield particularly strong performance and flexibility. In this paper, we show that the powerful paradigm of posterior sampling with diffusion models can be extended to include rich, structured noise models. To that end, we propose a joint conditional reverse diffusion process with learned scores for the noise and signal-generating distributions. We demonstrate strong performance gains across various inverse problems with structured noise, outperforming competitive baselines that use normalizing flows and adversarial networks. This opens up new opportunities and relevant practical applications of diffusion modeling for inverse problems in the context of non-Gaussian measurements.","inverse problems, diffusion models, score-based, generative models, structured noise" Stein Variational Goal Generation for adaptive Exploration in Multi-Goal Reinforcement Learning,https://openreview.net/forum?id=XnF9OtkASy,https://openreview.net/pdf?id=XnF9OtkASy,,"Multi-goal Reinforcement Learning has recently attracted a large amount of research interest. By allowing experience to be shared between related training tasks, this setting favors generalization for new tasks at test time, whenever some smoothness exists in the considered representation space of goals. However, in settings with discontinuities in state or goal spaces (e.g. walls in a maze), a majority of goals are difficult to reach, due to the sparsity of rewards in the absence of expert knowledge. This implies hard exploration, for which a curriculum of goals must be discovered, to help agents learn by adapting training tasks to their current capabilities. We propose a novel approach: Stein Variational Goal Generation (SVGG), which builds on recent automatic curriculum learning techniques for goal-conditioned policies. SVGG preferentially samples new goals in the agent's zone of proximal development, by leveraging a learned model of its abilities and a goal distribution modeled as particles in the exploration space. Our approach relies on Stein Variational Gradient Descent to dynamically attract the goal sampling distribution toward areas of appropriate difficulty. We demonstrate the performance of the approach, in terms of success coverage in the goal space, compared to recent state-of-the-art RL methods for hard exploration problems.","Exploration, Goal-conditioned Policies, Automatic curriculum, Stein Variational Gradient Descent" Fourier PINNs: From Strong Boundary Conditions to Adaptive Fourier Bases,https://openreview.net/forum?id=40Mw2GJnlZ,https://openreview.net/pdf?id=40Mw2GJnlZ,,"Interest in Physics-Informed Neural Networks (PINNs) is rising, as they offer a mesh-free alternative to traditional numerical solvers for partial differential equations (PDEs). While successful, PINNs often struggle to learn high-frequency and multi-scale target solutions—which, according to prior analysis, might arise from competition during optimization between the weakly enforced boundary loss and residual loss terms. By creatively modifying the neural network architecture, some simple boundary conditions (BCs) can be satisfied exactly without jointly optimizing an additional loss term, thus avoiding the aforementioned competition altogether.
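SVGG relies on Stein Variational Gradient Descent to move its goal particles; a minimal numpy sketch of one generic SVGD update with an RBF kernel follows. The Gaussian score function here is a stand-in: in SVGG the score would come from the learned model of the agent's abilities.

```python
import numpy as np

def svgd_step(particles, score_fn, step=0.1, bandwidth=1.0):
    """One SVGD update: the kernel-weighted score term attracts particles to
    high-density regions; the kernel-gradient term repels them from each other."""
    n = particles.shape[0]
    diffs = particles[:, None, :] - particles[None, :, :]  # diffs[i, j] = x_i - x_j
    k = np.exp(-(diffs ** 2).sum(-1) / bandwidth)          # RBF kernel matrix
    grad_k = -2.0 / bandwidth * diffs * k[..., None]       # grad_k[j, i] = d k(x_j, x_i) / d x_j
    phi = (k @ score_fn(particles) + grad_k.sum(axis=0)) / n
    return particles + step * phi

# Stand-in target: a standard Gaussian, whose score is -x.
rng = np.random.default_rng(0)
goals = rng.normal(0.0, 3.0, size=(50, 2))
for _ in range(200):
    goals = svgd_step(goals, lambda x: -x)
print(goals.mean(axis=0), goals.std(axis=0))
```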
Motivated by this analysis, we first study a strong BC version of PINNs for Dirichlet BCs and observe a consistent improvement compared to standard PINNs. We conduct a Fourier analysis and find that strong BC PINNs can better learn the amplitudes of high-frequency components of the target solutions. While strong BC PINNs provide a promising improvement, constructing these unique architectures is an intricate process made difficult (if not impossible) by certain BCs and domain geometries. Informed by our analysis, we propose Fourier PINNs—a simple, general, yet powerful method that augments PINNs with pre-specified, dense Fourier bases. Our proposed architecture likewise better learns high-frequency components but places no restrictions on the particular BCs. We develop an adaptive learning and basis selection algorithm based on alternating NN basis optimization, Fourier and NN basis coefficient estimation, and coefficient truncation. This scheme can flexibly identify the significant frequencies while suppressing the insignificant ones, to better capture the target solution's power spectrum. We show the advantage of our approach in learning high-frequency and multi-scale solutions in a set of systematic experiments. ","Physics Informed Machine Learning, Fourier Analysis, Scientific Machine Learning, Partial Differential Equations" Distributed Graph Neural Network Training with Periodic Stale Representation Synchronization,https://openreview.net/forum?id=XYDSqLaHFVq,https://openreview.net/pdf?id=XYDSqLaHFVq,A novel distributed GNN training framework that achieves vast training speedup without compromising performance.,"Despite the recent success of Graph Neural Networks (GNNs), it remains challenging to train a GNN on large graphs with millions of nodes & billions of edges, which are prevalent in many graph-based applications such as social networks, recommender systems, and knowledge graphs. Traditional sampling-based methods accelerate GNN training by dropping edges and nodes, which impairs the graph integrity and model performance. In contrast, distributed GNN algorithms accelerate GNN training by utilizing multiple computing devices and can be classified into two types: ""partition-based"" methods enjoy low communication cost but suffer from information loss due to dropped edges, while ""propagation-based"" methods avoid information loss but suffer from prohibitive communication overhead caused by neighbor explosion. To jointly address these problems, this paper proposes DIGEST (DIstributed Graph reprEsentation SynchronizaTion), a novel distributed GNN training framework that synergizes the complementary strengths of both categories of existing methods. We propose to allow each device to utilize the stale representations of its neighbors in other subgraphs during subgraph-parallel training. This way, our method preserves global graph information from neighbors to avoid information loss and reduces the communication cost. Therefore, DIGEST is both computation-efficient and communication-efficient, as it does not need to frequently (re-)compute and transfer across devices the massive representation data caused by neighbor explosion. DIGEST provides synchronous and asynchronous training modes for homogeneous and heterogeneous training environments, respectively. We prove that the approximation error induced by the staleness of the representations can be upper-bounded. More importantly, our convergence analysis demonstrates that DIGEST enjoys a state-of-the-art convergence rate.
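To make the Fourier-augmentation-plus-truncation idea concrete, here is a minimal numpy sketch on a 1-D function-fitting analogue; a real Fourier PINN would also fit the PDE residual with an NN term, which is omitted here.

```python
import numpy as np

def fourier_features(x, n_freqs):
    """Dense Fourier basis [sin(k*pi*x), cos(k*pi*x)] for k = 1..n_freqs."""
    k = np.arange(1, n_freqs + 1)
    return np.concatenate([np.sin(np.pi * k * x[:, None]),
                           np.cos(np.pi * k * x[:, None])], axis=1)

# Target with high-frequency content that plain PINNs tend to miss.
x = np.linspace(0, 2, 400)
y = np.sin(np.pi * x) + 0.3 * np.sin(12 * np.pi * x)

# Estimate coefficients of a dense basis by least squares, then truncate the
# insignificant ones -- a simple analogue of the adaptive basis selection loop.
phi = fourier_features(x, n_freqs=32)
coef, *_ = np.linalg.lstsq(phi, y, rcond=None)
coef[np.abs(coef) < 0.05] = 0.0
print("kept basis indices:", np.nonzero(coef)[0])
print("fit error:", np.linalg.norm(phi @ coef - y))
```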
Extensive experimental evaluation on large, real-world graph datasets shows that DIGEST achieves up to 21.82× speedup without compromising performance, compared to state-of-the-art distributed GNN training frameworks.","GNN, Distributed training" SAGE: Semantic-Aware Global Explanations for Named Entity Recognition,https://openreview.net/forum?id=OVbY-QCCjAh,https://openreview.net/pdf?id=OVbY-QCCjAh,,"In recent decades, deep learning approaches have achieved impressive results in many research fields, such as Computer Vision and Natural Language Processing (NLP). NLP in particular has greatly benefited from unsupervised methods that allow learning distributed representations of language. In the race for better performance, Language Models have now reached hundreds of billions of parameters. Despite the remarkable results, deep models are still far from being fully exploited in real-world applications. Indeed, these approaches are black-boxes, i.e. they are neither interpretable by design nor explainable, which is often crucial for making business decisions. Several task-agnostic methods have been proposed in the literature to explain models' decisions. Most techniques rely on the ""local"" assumption, i.e. explanations are made example-wise. In this paper, instead, we present a post-hoc method to produce highly interpretable global rules to explain NLP classifiers. Rules are extracted with a data mining approach on a semantically enriched input representation, instead of using words/wordpieces alone. Semantic information yields more abstract and general rules that are both more explanatory and less complex, while also better reflecting the model behaviour. In the experiments we focus on Named Entity Recognition, an NLP task where explainability is under-investigated. We explain the predictions of BERT NER classifiers trained on two popular benchmarks, CoNLL03 and Ontonotes, and compare our model against LIME.","Explainable AI, Named Entity Recognition, Language Models, Natural Language Processing" Decentralized Optimistic Hyperpolicy Mirror Descent: Provably No-Regret Learning in Markov Games,https://openreview.net/forum?id=bn0GZZdDfI1,https://openreview.net/pdf?id=bn0GZZdDfI1,,"We study decentralized policy learning in Markov games where we control a single agent playing against nonstationary and possibly adversarial opponents. Our goal is to develop a no-regret online learning algorithm that (i) takes actions based on the local information observed by the agent and (ii) is able to find the best policy in hindsight. For such a problem, the nonstationary state transitions due to the varying opponent pose a significant challenge. In light of a recent hardness result (Liu et al., 2022), we focus on the setting where the opponent's previous policies are revealed to the agent for decision making. With such an information structure, we propose a new algorithm, Decentralized Optimistic hypeRpolicy mIrror deScent (DORIS), which achieves $\sqrt{K}$-regret in the context of general function approximation, where $K$ is the number of episodes. Moreover, when all the agents adopt DORIS, we prove that their mixture policy constitutes an approximate coarse correlated equilibrium. In particular, DORIS maintains a hyperpolicy, which is a distribution over the policy space. The hyperpolicy is updated via mirror descent, where the update direction is obtained by an optimistic variant of least-squares policy evaluation.
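A minimal sketch of the stale-representation idea behind DIGEST: each device aggregates remote neighbors from a periodically refreshed cache instead of communicating every step. The one-layer mean aggregation, the partitioning, and the refresh period are illustrative assumptions, not the framework's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, refresh_period = 8, 4, 5
adj = (rng.random((n, n)) < 0.4).astype(float)
feats = rng.standard_normal((n, d))
part = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # node -> device assignment
cache = feats.copy()                        # stale representations of all nodes

def device_step(device, h):
    """One local aggregation: fresh embeddings for local neighbors,
    stale cached embeddings for neighbors owned by the other device."""
    out = h.copy()
    for v in np.where(part == device)[0]:
        nbrs = np.where(adj[v] > 0)[0]
        if len(nbrs):
            msgs = [h[u] if part[u] == device else cache[u] for u in nbrs]
            out[v] = np.mean(msgs, axis=0)
    return out

h = feats
for step in range(20):
    for dev in (0, 1):                      # devices proceed without communicating
        h = device_step(dev, h)
    if step % refresh_period == 0:          # periodic synchronization of the cache
        cache = h.copy()
print(h.shape)
```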
Furthermore, to illustrate the power of our method, we apply DORIS to constrained and vector-valued MDPs, which can be formulated as zero-sum Markov games with a fictitious opponent. ",Reinforcement learning theory (statistical learning theory) Graph Contrastive Learning with Model Perturbation,https://openreview.net/forum?id=rfvuuHmqHOQ,https://openreview.net/pdf?id=rfvuuHmqHOQ,,"Graph contrastive learning (GCL) has achieved great success in pre-training graph neural networks (GNNs) without ground-truth labels. The performance of GCL mainly relies on designing high-quality contrastive views via data augmentation. However, finding desirable augmentations is difficult and requires cumbersome effort due to the diverse modalities in graph data. In this work, we study model perturbation to perform efficient contrastive learning on graphs without using data augmentation. Instead of searching for the optimal combination among perturbing nodes, edges, or attributes, we propose to conduct perturbation on the model architectures (i.e., GNNs). However, it is non-trivial to achieve effective perturbations on GNN models without a performance drop compared with data augmentation counterparts. This is because data augmentation 1) makes complex perturbations in the graph space, so it is hard to mimic its effect in the model parameter space with a fixed noise distribution, and 2) produces different disturbances even on the same nodes between two views owing to the randomness. Motivated by this, we propose a novel model perturbation framework -- \textsc{PerturbGCL} -- to pre-train GNN encoders. We focus on perturbing two key operations in a GNN: message propagation and transformation. Specifically, we propose \emph{weightPrune} to create a dynamic perturbed model to contrast with the target one by pruning its transformation weights according to their magnitudes. Contrasting the two models leads to adaptive mining of the perturbation distribution from the data. Furthermore, we present \emph{randMP} to disturb the steps of message propagation in the two contrastive models. By randomly choosing the propagation steps during training, it helps to increase the local variance of nodes between the contrastive views. Despite its simplicity, coupling the two strategies enables us to perform effective contrastive learning on graphs with model perturbation. We conduct extensive experiments on 15 benchmarks. The results demonstrate the superiority of \textsc{PerturbGCL}: it achieves competitive results against strong baselines across both node-level and graph-level tasks, while requiring less computation time. The code is available at \url{https://anonymous.4open.science/r/PerturbGCL-F17D}.","Graph Contrastive Learning, Model Perturbation, Graph Augmentation" Robust Scheduling with GFlowNets,https://openreview.net/forum?id=ZBUthI6wK9h,https://openreview.net/pdf?id=ZBUthI6wK9h,We use GFlowNets for robust scheduling.,"Finding the best way to schedule operations in a computation graph is a classical NP-hard problem which is central to compiler optimization. However, evaluating the goodness of a schedule on the target hardware can be very time-consuming. Traditional approaches as well as previous machine learning ones typically optimize proxy metrics, which are fast to evaluate but can lead to bad schedules when tested on the target hardware. In this work, we propose a new approach to scheduling by sampling proportionally to the proxy metric using a novel GFlowNet method.
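A minimal numpy sketch of the two perturbations named above: \emph{weightPrune} masks the smallest-magnitude transformation weights to form the contrast model, and \emph{randMP} draws a different number of propagation steps per view. The simplified one-layer GNN is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def weight_prune(w, ratio=0.3):
    """weightPrune: zero out the smallest-magnitude transformation weights
    to obtain the perturbed model that is contrasted with the target one."""
    thresh = np.quantile(np.abs(w), ratio)
    return np.where(np.abs(w) >= thresh, w, 0.0)

def gnn_view(adj_norm, x, w, steps):
    """Simplified GNN: `steps` rounds of propagation, then one transformation."""
    h = x
    for _ in range(steps):
        h = adj_norm @ h
    return h @ w

n, d = 6, 8
adj = (rng.random((n, n)) < 0.5).astype(float)
adj_norm = adj / np.maximum(adj.sum(1, keepdims=True), 1.0)
x, w = rng.standard_normal((n, d)), rng.standard_normal((d, d))

# randMP: each view uses a randomly drawn number of message-passing steps.
view_a = gnn_view(adj_norm, x, w, steps=int(rng.integers(1, 4)))
view_b = gnn_view(adj_norm, x, weight_prune(w), steps=int(rng.integers(1, 4)))
print(view_a.shape, view_b.shape)   # a contrastive loss would compare these
```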
We introduce a technique to control the trade-off between diversity and goodness of the proposed schedules at inference time and demonstrate empirically that pure optimization baselines can lead to subpar performance relative to our approach when tested on a target model. Furthermore, we show that conditioning the GFlowNet on the computation graph enables generalization to unseen scheduling problems for both synthetic and real-world compiler datasets.","Scheduling, GFlowNets, Combinatorial Optimization" Pareto Manifold Learning: Tackling multiple tasks via ensembles of single-task models,https://openreview.net/forum?id=C9uEwyfklBE,https://openreview.net/pdf?id=C9uEwyfklBE,,"In Multi-Task Learning, tasks may compete and limit the performance achieved on each other rather than guiding the optimization trajectory to a common solution superior to its single-task counterparts. There is often not a single solution that is optimal for all tasks, leading practitioners to balance tradeoffs between tasks' performance and to resort to optimality in the Pareto sense. Current Multi-Task Learning methodologies either completely neglect this aspect of functional diversity, producing one solution in the Pareto Front predefined by their optimization scheme, or produce diverse but discrete solutions, each requiring a separate training run. In this paper, we conjecture that there exist Pareto Subspaces, i.e., weight subspaces where multiple optimal functional solutions lie. We propose Pareto Manifold Learning, an ensembling method in weight space that is able to discover such a parameterization and produces a continuous Pareto Front in a single training run, allowing practitioners to modulate the performance on each task on the fly during inference. We validate the proposed method on a diverse set of multi-task learning benchmarks, ranging from image classification to tabular datasets and scene understanding, and show that Pareto Manifold Learning outperforms state-of-the-art algorithms. ","Multi-Task Learning, multitask learning, mode connectivity, loss landscape, pareto optimal, pareto frontier" Autoregressive Conditional Neural Processes,https://openreview.net/forum?id=OAsXFPBfTBh,https://openreview.net/pdf?id=OAsXFPBfTBh,,"Conditional neural processes (CNPs; Garnelo et al., 2018a) are attractive meta-learning models which produce well-calibrated predictions and are trainable via a simple maximum likelihood procedure. Although CNPs have many advantages, they are unable to model dependencies in their predictions. Various works propose solutions to this, but these come at the cost of either requiring approximate inference or being limited to Gaussian predictions. In this work, we instead propose to change how CNPs are deployed at test time, without any modifications to the model or training procedure. Instead of making predictions independently for every target point, we autoregressively define a joint predictive distribution using the chain rule of probability, taking inspiration from the neural autoregressive density estimator (NADE) literature. We show that this simple procedure allows factorised Gaussian CNPs to model highly dependent, non-Gaussian predictive distributions. Perhaps surprisingly, in an extensive range of tasks with synthetic and real data, we show that CNPs in autoregressive (AR) mode not only significantly outperform non-AR CNPs, but are also competitive with more sophisticated models that are significantly more computationally expensive and challenging to train.
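The AR deployment described above amounts to a simple loop; here is a minimal sketch in which each sampled target is appended to the context before the next prediction. The kernel-smoother `cnp_predict` is a hypothetical stand-in for a trained CNP.

```python
import numpy as np

rng = np.random.default_rng(0)

def cnp_predict(ctx_x, ctx_y, target_x):
    """Stand-in for a trained CNP: returns a factorised Gaussian prediction.
    Here a kernel-smoother mean with fixed variance; a real CNP is learned."""
    w = np.exp(-(target_x - ctx_x) ** 2)
    return (w @ ctx_y) / w.sum(), 0.1

def ar_sample(ctx_x, ctx_y, target_xs):
    """Chain rule of probability: sample targets one at a time, feeding each
    sample back into the context before predicting the next target."""
    samples = []
    for tx in target_xs:
        mu, sigma = cnp_predict(ctx_x, ctx_y, tx)
        y = rng.normal(mu, sigma)
        samples.append(y)
        ctx_x, ctx_y = np.append(ctx_x, tx), np.append(ctx_y, y)  # grow context
    return np.array(samples)

ctx_x, ctx_y = np.array([0.0, 1.0, 2.0]), np.array([0.0, 0.8, 0.9])
print(ar_sample(ctx_x, ctx_y, np.linspace(0, 2, 5)))
```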
This performance is remarkable given that AR CNPs are not trained to model joint dependencies. Our work provides an example of how ideas from neural distribution estimation can benefit neural processes, and motivates research into the AR deployment of other neural process models.", Exploring Methods for Parsing Movie Scripts - Feature Extraction for Further Social Injustice Analysis,https://openreview.net/forum?id=4Ox_yJWZP56,https://openreview.net/pdf?id=4Ox_yJWZP56,An exploration of methods to parse movie scripts ,"Movie script formatting varies considerably due to inconsistencies among authors, so analysing scripts for properties such as bias requires methods that can extract all the relevant features needed for further analysis. In this paper, we discuss multiple parsing techniques that can be used to extract features and understand the structure of movie scripts in an automated fashion. We compare and contrast the accuracy and runtime of a rule-based approach and a variety of machine learning approaches, including Deep Neural Networks, Decision Trees and a BERT sequence classification model. ","Movie Parsing, Script Parsing, Parsers, IMSDB, Deep Neural Networks, Decision Tree, BERT Parser" MultiQuan RDP: Rate-Distortion-Perception Coding via Offset Quantizers,https://openreview.net/forum?id=QutyHwpIKVw,https://openreview.net/pdf?id=QutyHwpIKVw,"We propose the MultiQuan quantizers, interpolating between a single quantizer and dithered quantization, for rate-distortion-perception coding.","The rate-distortion-perception (RDP) framework has attracted significant recent attention due to its application in neural compression. It is important to understand the underlying mechanism connecting procedures with common randomness and those without. Different from previous efforts, we study this problem from a quantizer design perspective. By analyzing an idealized setting, we provide an interpretation of the advantage of dithered quantization in the RDP setting, which further allows us to make a conceptual connection between randomized (dithered) quantizers and quantizers without common randomness. This new understanding leads to a new procedure for RDP coding based on multiple quantizers with offsets. Though the procedure can be viewed as an intermediate between the two extremes, its explicit structure can be advantageous in some cases. Experimental results are given on both simple data sources and images to illustrate its behavior. ","information theory, quantization, rate-distortion-perception, compression" $k$NN Prompting: Learning Beyond the Context with Nearest Neighbor Inference,https://openreview.net/forum?id=fe2S7736sNS,https://openreview.net/pdf?id=fe2S7736sNS,"We enable data scaling under the gradient-free paradigm of large language models using kNN inference, and bring substantial improvements over standard In-Context Learning.","In-Context Learning, which formulates target tasks as prompt completion conditioned on in-context demonstrations, has become the prevailing and standard utilization of large language models. In this paper, we disclose an actual predicament of this typical usage: it cannot scale up with training data due to context length restrictions. We then advocate a simple and effective solution, $k$NN Prompting, which not only outperforms In-Context Learning in few-shot scenarios, but more importantly, can scale up with as much training data as is available.
$k$NN Prompting queries the LLM with training data for distributed representations and caches them locally as anchors. At inference time, it predicts by simply aggregating nearest neighbors. We conduct comprehensive experiments and ablations across different scales of LLMs to demonstrate its substantial improvements, as well as other appealing aspects such as robustness and explainability. The proposed approach successfully bridges data scaling into model scaling, and brings new potential for the gradient-free paradigm of LLM deployment.","Large Language Models, In-Context Learning, K Nearest Neighbors" Closed-loop Transcription via Convolutional Sparse Coding,https://openreview.net/forum?id=NE5P2sEK4Z5,https://openreview.net/pdf?id=NE5P2sEK4Z5,This paper combines the recent closed-loop transcription framework with convolutional sparse coding layers and demonstrates superior generative autoencoding performance.,"Autoencoding has been a popular and effective framework for learning generative models for images, with much empirical success. Autoencoders often use generic deep networks as the encoder and decoder, which are difficult to interpret, and the learned representations lack clear structure. In this work, we replace the encoder and decoder with standard convolutional sparse coding and decoding layers, obtained from unrolling an optimization algorithm for solving a (convexified) sparse coding program. Furthermore, to avoid computational difficulties in minimizing distributional distance between the real and generated images, we utilize the recent closed-loop transcription (CTRL) framework that maximizes the rate reduction of the learned sparse representations. We show that such a simple framework demonstrates surprisingly competitive performance on large datasets, such as ImageNet-1K, compared to existing autoencoding and generative methods under fair conditions. Even with simpler networks and fewer computational resources, our method demonstrates splendid visual quality in regenerated images with striking sample-wise consistency. More surprisingly, the learned autoencoder generalizes to unseen datasets. Our method enjoys several side benefits, including more structured and interpretable representations, more stable convergence, scalability to large datasets -- indeed, our method is the first sparse coding generative method to scale up to ImageNet -- and trainability with smaller batch sizes.","Convolutional Sparse Coding, Inverse Models, Rate Reduction" Transformers Learn Shortcuts to Automata,https://openreview.net/forum?id=De4FYqjFueZ,https://openreview.net/pdf?id=De4FYqjFueZ,"Shallow, non-recurrent Transformers can simulate the recurrent dynamics of finite-state automata, via counterintuitive shortcuts.","Algorithmic reasoning requires capabilities which are most naturally understood through recurrent models of computation, like the Turing machine. However, Transformer models, while lacking recurrence, are able to perform such reasoning using far fewer layers than the number of reasoning steps. This raises the question: what solutions are these shallow and non-recurrent models finding? We investigate this question in the setting of learning automata, discrete dynamical systems naturally suited to recurrent modeling and expressing algorithmic tasks. Our theoretical results completely characterize shortcut solutions, whereby a shallow Transformer with only $o(T)$ layers can exactly replicate the computation of an automaton on an input sequence of length $T$.
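A minimal sketch of the anchor-and-aggregate scheme described above: the LLM's output distribution for each training example is cached as an anchor, and test predictions are a majority vote over the $k$ nearest anchors. `llm_label_distribution` is a hypothetical stand-in for querying the LLM.

```python
import numpy as np
from collections import Counter

def llm_label_distribution(text):
    """Hypothetical stand-in: in kNN Prompting this would be the LLM's
    next-token (label-word) distribution for the prompted input."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    p = rng.random(4)
    return p / p.sum()

# 1) Query the LLM once per training example and cache anchors locally.
train = [("great movie", 1), ("terrible plot", 0),
         ("loved it", 1), ("awful acting", 0)]
anchors = np.stack([llm_label_distribution(t) for t, _ in train])
labels = [y for _, y in train]

# 2) At inference, aggregate the labels of the k nearest anchors.
def knn_prompt_predict(text, k=3):
    q = llm_label_distribution(text)
    nearest = np.argsort(((anchors - q) ** 2).sum(1))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

print(knn_prompt_predict("what a fantastic film"))
```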
By representing automata using the algebraic structure of their underlying transformation semigroups, we obtain $O(\log T)$-depth simulators for all automata and $O(1)$-depth simulators for all automata whose associated groups are solvable. Empirically, we perform synthetic experiments by training Transformers to simulate a wide variety of automata, and show that shortcut solutions can be learned via standard training. We further investigate the brittleness of these solutions and propose potential mitigations.","Transformer, self-attention, group theory, semigroup theory, algebraic automata theory, shortcut learning, theory of deep learning" Efficient neural representation in the cognitive neuroscience domain: Manifold Capacity in One-vs-rest Recognition Limit,https://openreview.net/forum?id=-itAMjwvDJC,https://openreview.net/pdf?id=-itAMjwvDJC,Our Sparse Replica Manifold Analysis enables a separability and geometric analysis of neural data by extending the scope of the theory to a realistic number of neurons and tasks more relevant to cognitive neuroscience.,"The study of neural representations as manifolds has become a popular approach to understanding information encoding in neural populations. One particular interest is the connection between object recognition capability and the separability of neural representations for different objects, often called ""object manifolds."" In learning theory, separability has been studied under the notion of storage capacity, which refers to the number of patterns encoded in a feature dimension. Chung et al. (2018) extended the notion of capacity from discrete points to manifolds, where manifold capacity refers to the maximum number of object manifolds that can be linearly separated with high probability given random assignment of labels. Despite the use of manifold capacity in analyzing artificial neural networks (ANNs), its application to neuroscience has been limited. Due to the limited number of ""features"", such as neurons, available in neural experiments, manifold capacity cannot be verified empirically, unlike in ANNs. Additionally, the usage of random label assignment, while common in learning theory, is of limited relevance to the definition of object recognition tasks in cognitive science. To overcome these limits, we present the Sparse Replica Manifold analysis to study object recognition. Sparse manifold capacity measures how many object manifolds can be separated under one-versus-rest classification, a form of task widely used in both cognitive neuroscience experiments and machine learning applications. We demonstrate that the application of sparse manifold capacity allows analysis of a wider class of neural data - in particular, neural data that has a limited number of neurons with empirical measurements. Furthermore, sparse manifold capacity requires fewer computations to evaluate underlying geometries and enables a connection to a measure of dimension, the participation ratio. We analyze the relationship between capacity and dimension, and demonstrate that both manifold intrinsic dimension and the ambient space dimension play a role in capacity.
","computational neuroscience, statistical physics of learning, representation geometry, perceptual manifolds, object recognition" ULF: UNSUPERVISED LABELING FUNCTION CORRECTION USING CROSS-VALIDATION FOR WEAK SUPERVISION,https://openreview.net/forum?id=mumZwT6OrEV,https://openreview.net/pdf?id=mumZwT6OrEV,We introduce a new algorithm ULF for denoising weakly annotated data based on the principle of k-fold cross-validation. ULF uses models trained on all but some LFs to detect and correct biases specific to the held-out LFs.,"A way to overcome expensive and time-consuming manual data labeling is weak supervision - automatic annotation of data samples via a predefined set of labeling functions (LFs), rule-based mechanisms that generate artificial labels for the classes associated with the LFs. In this work, we investigate noise reduction techniques for weak supervision based on the principle of k-fold cross-validation. We introduce a new algorithm ULF for denoising weakly annotated data which uses models trained on all but some LFs to detect and correct biases specific to the held-out LFs. Specifically, ULF refines the allocation of LFs to classes by re-estimating this assignment on highly reliable cross-validated samples. We realize two variants of this algorithm: feature-based ULF (relying on count-based feature vectors), and DeepULF (fine-tuning pre-trained language models). We compare ULF to methods originally developed for detecting erroneous samples in manually annotated data, as well as to our extensions of such methods to the weakly supervised setting. Our new weak supervision-specific methods (ULF and extensions) leverage the information about matching LFs, making detecting noisy samples more accurate. Evaluation on several datasets shows that ULF can successfully improve weakly supervised learning without utilizing any manually labeled data.","nlp, weak supervision, text classification, sentiment analysis" Islands of Confidence: Robust Neural Network Classification with Uncertainty Quantification,https://openreview.net/forum?id=pNnXjO3q82,https://openreview.net/pdf?id=pNnXjO3q82,We address the overconfidence of neural networks and related issues with a new centroidal-based confidence measure.,"We propose a Gaussian confidence measure and its optimization, for use in neural network classifiers. The measure comes with theoretical results, simultaneously resolving two pressing problems in NN classification: uncertainty quantification, and robustness. Existing research in uncertainty quantification mostly revolves around the confidence reflected in the input feature space. Instead, we focus on the learned representation of the network and analyze the confidence in the penultimate layer space. We formally prove that, independent of optimization-procedural effects, a set of centroids always exists such that softmax classifiers are nearest-centroid classifiers. Softmax confidence, however, does not reflect that the classification is based on nearest centroids: artificially inflated confidence is also given to out-of-distributions samples that are not near any centroid, but slightly less distant from one centroid than from the others. Our new confidence measure is centroid-based, and hence no longer suffers from the artificial confidence inflation of out-of-distribution samples. We also show that our proposed centroidal confidence measure is providing a robustness certificate against attacks. 
As such, it manages to reflect what the model doesn't know (as demanded by uncertainty quantification) and to address the robustness of neural networks.","uncertainty quantification, neural collapse, deep learning" REST: REtrieve & Self-Train for generative action recognition,https://openreview.net/forum?id=-yqNb_CxRr,https://openreview.net/pdf?id=-yqNb_CxRr,,"This work is on training a generative action/video recognition model whose output is a free-form action-specific caption describing the video (rather than an action class label). A generative approach has practical advantages like producing more fine-grained and human-readable output, and being naturally open-world. To this end, we propose to adapt a pre-trained generative Vision & Language (V&L) Foundation Model for video/action recognition. While recently there have been a few attempts to adapt V&L models trained with contrastive learning (e.g. CLIP) for video/action, to the best of our knowledge, we propose the very first method that sets out to accomplish this goal for a generative model. We first show that direct fine-tuning of a generative model to produce action classes suffers from severe overfitting. To alleviate this, we introduce REST, a training framework consisting of two key components: (a) an unsupervised method for adapting the generative model to action/video by means of pseudo-caption generation and Self-training, i.e. without using any action-specific labels; (b) a Retrieval approach based on CLIP for discovering a diverse set of pseudo-captions for each video to train the model. Importantly, we show that both components are necessary to obtain high accuracy. We evaluate REST on the problem of zero-shot action recognition where we show that our approach is very competitive when compared to contrastive learning-based methods. Code will be made available.", Quantization-aware Policy Distillation (QPD),https://openreview.net/forum?id=UpyXmNMdQEn,https://openreview.net/pdf?id=UpyXmNMdQEn,"We introduce a method based on quantization and policy distillation that can effectively compress a network down to 0.5% of its original size, without any loss in performance.","Recent advancements have made Deep Reinforcement Learning (DRL) considerably more powerful, but the produced models remain very computationally complex and therefore difficult to deploy on edge devices. Compression methods such as quantization and distillation can be used to increase the applicability of DRL models on these low-power edge devices by decreasing the necessary precision and number of operations, respectively. Training in low precision, however, is notoriously less stable, an effect amplified by the decrease in representational power when limiting the number of trainable parameters. We propose Quantization-aware Policy Distillation (QPD), which overcomes this instability by providing a smoother transition from high- to low-precision network parameters. A new distillation loss specifically designed for the compression of actor-critic networks is also defined, resulting in a higher accuracy after compression.
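A minimal numpy sketch of a centroid-based confidence of the kind described above: class centroids are estimated in the penultimate-layer space, and confidence decays with distance to the nearest centroid, so far-away (out-of-distribution) points no longer receive inflated scores. The Gaussian form and scale are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_centroids(feats, labels, n_classes):
    """Per-class mean of penultimate-layer features."""
    return np.stack([feats[labels == c].mean(0) for c in range(n_classes)])

def centroid_confidence(z, centroids, sigma=1.0):
    """Gaussian confidence in the nearest centroid; unlike softmax,
    this goes to zero for points far from every centroid."""
    d2 = ((centroids - z) ** 2).sum(1)
    return np.exp(-d2.min() / (2 * sigma**2)), int(d2.argmin())

# Two well-separated classes in a 2-D "penultimate" space.
feats = np.concatenate([rng.normal(0, 0.3, (100, 2)),
                        rng.normal(4, 0.3, (100, 2))])
labels = np.array([0] * 100 + [1] * 100)
cents = fit_centroids(feats, labels, 2)

print(centroid_confidence(np.array([0.1, -0.2]), cents))   # in-distribution: high
print(centroid_confidence(np.array([20.0, 20.0]), cents))  # far OOD: near zero
```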
Our experiments show that these combined methods can effectively compress a network down to 0.5% of its original size, without any loss in performance.","DRL, Quantization, Distillation, Model Compression, Low-Power, Actor-Critic" Conceptual Behavior and Human-Likeness in Vision-and-Language Models,https://openreview.net/forum?id=U09miyCFe6T,https://openreview.net/pdf?id=U09miyCFe6T,,"Learning conceptual structures requires acquiring knowledge of how members of a class share a set of semantic properties. The challenge is that some properties are more efficiently learned by perceptual experience (e.g., an image of a dog that shows its texture, shape and color) while others benefit from language input (e.g., “a dog is a mammal”). Unimodal machine learning systems, as opposed to human brains, are therefore fundamentally limited in this respect. In contrast, systems integrating multimodal information should be able to learn a more human-like representational space since they can leverage both types of complementary sources of information. Multimodal neural network models offer a unique opportunity to test this hypothesis. We evaluate this proposal through a series of experiments on architecturally diverse vision-and-language networks trained on massive image-caption datasets. We introduce an analytic framework that characterizes the semantic information behind the discrimination of concepts (i.e., lexicalized categories) through image-text matching tasks and representational similarity. We further compare how this discrimination (i.e., the model’s “conceptual behavior”) differs from that of humans and unimodal networks, and to what extent it depends on the multimodal encoder mechanism. Our results suggest promising avenues to align human and machine representational invariants via multimodal inputs.","multimodal deep learning, vision-and-language, semantic knowledge, conceptual knowledge, representational analysis, human-likeness" Highly Parallel Deep Ensemble Learning,https://openreview.net/forum?id=DyFvlCAj8j_,https://openreview.net/pdf?id=DyFvlCAj8j_,,"In this paper, we propose a novel highly parallel deep ensemble learning method, which leads to highly compact and parallel deep neural networks. The main idea is to \textit{split data into spectral subsets; train subnetworks separately; and ensemble the output results in the inference stage}. The proposed method has parallel branches, with each branch being an independent neural network trained using one spectral subset of the training data. It ensembles the outputs of the parallel branches to produce an overall network with substantially stronger generalization capability. It can also scale the model up to large-scale datasets with limited memory. The joint data/model parallel method is amenable to GPU implementation. Due to the reduced size of inputs, the proposed spectral tensor network exhibits an inherent network compression, which accelerates the training process. We evaluate the proposed spectral tensor networks on the MNIST, CIFAR-10 and ImageNet data sets, to highlight that they simultaneously achieve network compression, reduction in computation and parallel speedup.
Specifically, on both the ImageNet-1K and ImageNet-21K datasets, our proposed AlexNet-spectral, VGG-16-spectral, ResNet-34-spectral, CycleMLP-spectral and MobileVit-spectral networks achieve comparable performance to the vanilla ones, and enjoy up to a $4 \times$ compression ratio and $1.5 \times$ speedups.", On the Forward Invariance of Neural ODEs,https://openreview.net/forum?id=EO-NrUPaFLz,https://openreview.net/pdf?id=EO-NrUPaFLz,This paper proposes to achieve specification guarantees in the output space of neural ODEs with invariance set propagation.,"To ensure robust and trustworthy decision-making, it is highly desirable to enforce constraints over a neural network's parameters and its inputs automatically by back-propagating output specifications. This way, we can guarantee that the network makes reliable decisions under perturbations. Here, we propose a new method for achieving a class of specification guarantees for neural Ordinary Differential Equations (ODEs) by using invariance set propagation. An invariance of a neural ODE is defined as an output specification, such as satisfying mathematical formulae, physical laws, or system safety. We use control barrier functions to specify the invariance of a neural ODE on the output layer and propagate it back to the input layer. Through this invariance backpropagation, we map output specifications onto constraints on the neural ODE parameters or its inputs. The satisfaction of the corresponding constraints implies the satisfaction of output specifications. This allows us to achieve output specification guarantees by changing the input or parameters while maximally preserving the model performance. We demonstrate the invariance propagation on a comprehensive series of representation learning tasks, including spiral curve regression, autoregressive modeling of joint physical dynamics, the convexity portrait of a function, and safe neural control of collision avoidance for autonomous vehicles.","Neural ODE, Forward Invariance, Specification Guarantees" Obtaining More Generalizable Fair Classifiers on Imbalanced Datasets,https://openreview.net/forum?id=zVrw4OH1Lch,https://openreview.net/pdf?id=zVrw4OH1Lch,,"Imposing fairness constraints during learning has been widely used to ensure algorithmic fairness. However, many datasets have an inherent imbalance in certain label classes (e.g. ""healthy"") and sensitive subgroups (e.g. ""older patients""), which leads to a lack of generalizability not only of classification but also of fairness properties, especially in over-parameterized models. For example, fairness-aware training may ensure equalized odds (EO) on the training data, but the EO constraint is far from satisfied on new users. In this paper, we propose a theoretically principled, yet flexible approach that encourages both classification and fairness generalization and can be readily combined with many existing fair learning methods with logits-based losses. While our main focus is on EO, our approach can be directly applied to achieve equal opportunity (EqOpt); under certain conditions, it can also be applied to other fairness notions.
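For reference, a minimal numpy sketch of the equalized-odds gap that such methods aim to keep small on held-out data: the largest difference in true- and false-positive rates across sensitive groups.

```python
import numpy as np

def equalized_odds_gap(y_true, y_pred, group):
    """Max over {TPR, FPR} of the absolute rate difference between groups.
    EO holds (gap = 0) when both rates match across sensitive groups."""
    gaps = []
    for positive in (1, 0):  # 1 -> compares TPRs, 0 -> compares FPRs
        rates = [y_pred[(group == g) & (y_true == positive)].mean()
                 for g in np.unique(group)]
        gaps.append(max(rates) - min(rates))
    return max(gaps)

y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 1])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(equalized_odds_gap(y_true, y_pred, group))  # 0.5 for this toy data
```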
We demonstrate the power of our new approach by combining it with a popular fair classification algorithm, and the resulting algorithm achieves significantly better fairness generalization on several real-world datasets.","Fairness, Generalization" GMML is All you Need,https://openreview.net/forum?id=vDY5Y8HMNxO,https://openreview.net/pdf?id=vDY5Y8HMNxO,,"Vision transformers have generated significant interest in the computer vision (CV) community because of their flexibility in exploiting contextual information, whether it is sharply confined and local or long-range and global. However, they are known to be data hungry. This has motivated research on self-supervised transformer pretraining, which does not need to decode the semantic information conveyed by labels to link it to the image properties, but rather focuses directly on extracting a concise representation of the image data that reflects the notion of similarity and is invariant to nuisance factors. The key vehicle of the self-learning process used by the majority of such methods is the generation of multiple views of the training data and the creation of pretext tasks which use these views to define the notion of image similarity and data integrity. However, this approach lacks the natural propensity to extract contextual information. We propose group mask model learning (GMML), a self-supervised learning (SSL) mechanism for pretraining vision transformers with the ability to extract the contextual information present in all the concepts in an image. GMML achieves this by manipulating random groups of connected tokens, thereby covering a meaningful part of a semantic concept, and then recovering the hidden semantic information from the visible part of the concept. GMML implicitly introduces a novel data augmentation process. Unlike most of the existing SSL approaches, GMML does not require a momentum encoder, nor does it rely on careful implementation details such as large batches and gradient stopping, which are all artefacts of most current self-supervised learning techniques. Since its conception at the beginning of 2021, GMML has remained an unbeaten SSL method with several desirable benefits, and it marked a significant milestone in computer vision as one of the first self-supervised pretraining methods to outperform supervised pretraining consistently and by a large margin. GMML is simple, elegant, and currently the best mechanism to extract information from a given dataset and instil this information into a transformer's weights. The code will be made publicly available for the community to train on bigger corpora. ","Self-supervised Learning, Group Masked Model Learning, Masked Autoencoders, Vision Transformers." Understanding The Robustness of Self-supervised Learning Through Topic Modeling,https://openreview.net/forum?id=7Cb7Faxa1OB,https://openreview.net/pdf?id=7Cb7Faxa1OB,,"Self-supervised learning has significantly improved the performance of many NLP tasks. However, how self-supervised learning discovers useful features, and why it is better than traditional approaches such as probabilistic models, are still largely unknown. In this paper, we focus on the context of topic modeling and highlight a key advantage of self-supervised learning - when applied to data generated by topic models, self-supervised learning can be oblivious to the specific model, and hence is less susceptible to model misspecification.
In particular, we prove that commonly used self-supervised objectives based on reconstruction or contrastive samples can both recover useful posterior information for general topic models. Empirically, we show that the same objectives can perform on par with posterior inference using the correct model, while outperforming posterior inference using misspecified models.","deep learning theory, self-supervised learning" Temporal Disentanglement of Representations for Improved Generalisation in Reinforcement Learning,https://openreview.net/forum?id=sPgP6aISLTD,https://openreview.net/pdf?id=sPgP6aISLTD,"We introduce Temporal Disentanglement (TED) to learn disentangled representations for Reinforcement Learning, improving generalisation to unseen environment variables.","Reinforcement Learning (RL) agents are often unable to generalise well to environment variations in the state space that were not observed during training. This issue is especially problematic for image-based RL, where a change in just one variable, such as the background colour, can change many pixels in the image, leading to drastic changes in the agent's latent representation of the image and causing the learned policy to fail. To learn more robust representations, we introduce TEmporal Disentanglement (TED), a self-supervised auxiliary task that leads to disentangled image representations by exploiting the sequential nature of RL observations. We find empirically that RL algorithms utilising TED as an auxiliary task adapt more quickly to changes in environment variables with continued training compared to state-of-the-art representation learning methods. Since TED enforces a disentangled structure of the representation, we also find that policies trained with TED generalise better to unseen values of variables irrelevant to the task (e.g.\ background colour) as well as unseen values of variables that affect the optimal policy (e.g.\ goal positions).","Reinforcement Learning, Representation Learning, Disentanglement" Distilling Pre-trained Knowledge in Chemical Reactions for Molecular Property Prediction,https://openreview.net/forum?id=t00nS5YLjSc,https://openreview.net/pdf?id=t00nS5YLjSc,We propose a novel method to incorporate chemical domain knowledge for molecular property prediction.,"How to effectively represent molecules is a long-standing challenge for molecular property prediction and drug discovery. Recently, the accumulation of unlabelled molecular data has spurred the rapid development of pre-training methods for molecular representation learning. However, these works mainly focus on devising self-supervised learning tasks and/or introducing 3D geometric information based on molecular structures, with little chemical domain knowledge involved. To address this issue, we propose a novel method (MolKD) that Distills pre-trained Knowledge in chemical reactions to assist Molecular property prediction. Specifically, MolKD first learns effective representations by incorporating reaction yields to measure the transformation efficiency of reactant-product pairs when pre-training on reactions. Next, MolKD introduces reaction-to-molecule distillation to transfer cross-modal knowledge between pre-training chemical reaction data and the downstream molecular property prediction tasks.
Extensive experiments show that our method can learn effective molecular representations, achieving superior performance compared with state-of-the-art baselines, e.g., a 2.8% absolute Hit@1 gain on USPTO in chemical reaction prediction and a 1.6% absolute AUC-ROC gain on Tox21 with 1/3 of the pre-training data size in molecular property prediction. Further investigation of the pre-trained molecular representations indicates that MolKD learns to distinguish chemically meaningful molecular similarities, which enables molecular property prediction with high robustness and interpretability.","Molecular property prediction, Chemical reactions, Pre-training for molecular representations, Knowledge distillation, AI for drug discovery" Provably Efficient Neural Offline Reinforcement Learning via Perturbed Rewards,https://openreview.net/forum?id=WOquZTLCBO1,https://openreview.net/pdf?id=WOquZTLCBO1,A provably and computationally efficient algorithm for offline RL with deep neural networks,"We propose a novel offline reinforcement learning (RL) algorithm, namely PEturbed-Reward Value Iteration (PERVI), which amalgamates the randomized value function idea with the pessimism principle. Most current offline RL algorithms explicitly construct statistical confidence regions to obtain pessimism via lower confidence bounds (LCB), which cannot easily scale to complex problems where a neural network is used to estimate the value functions. Instead, PERVI implicitly obtains pessimism by simply perturbing the offline data multiple times with carefully designed i.i.d. Gaussian noise to learn an ensemble of estimated state-action values, and acting greedily with respect to the minimum of the ensemble. The estimated state-action values are obtained by fitting a parametric model (e.g. neural networks) to the perturbed datasets using gradient descent. As a result, PERVI only needs $\mathcal{O}(1)$ time complexity for action selection, while LCB-based algorithms require at least $\Omega(K^2)$, where $K$ is the total number of trajectories in the offline data. We also propose a novel data splitting technique that helps remove the potentially large log covering number in the learning bound. We prove that PERVI yields a provable uncertainty quantifier with overparameterized neural networks and achieves an $\tilde{\mathcal{O}}\left( \frac{ \kappa H^{5/2} \tilde{d} }{\sqrt{K}} \right)$ sub-optimality, where $\tilde{d}$ is the effective dimension, $H$ is the horizon length and $\kappa$ measures the distributional shift. We corroborate the statistical and computational efficiency of PERVI with an empirical evaluation on a wide set of synthetic and real-world datasets. To the best of our knowledge, PERVI is the first offline RL algorithm that is both provably and computationally efficient in general Markov decision processes (MDPs) with neural network function approximation. ","Offline Reinforcement Learning, Neural Networks" Learning Debiased Representations via Conditional Attribute Interpolation,https://openreview.net/forum?id=16BDzjpOwe,https://openreview.net/pdf?id=16BDzjpOwe,This paper proposes a novel method to learn debiased representation via conditional attribute interpolation.,"An image is usually associated with more than one attribute, e.g., annotated based on both ""shape"" and ""color"".
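A minimal numpy sketch of the perturbed-reward pessimism described above: the offline rewards are perturbed several times with Gaussian noise, one value estimate is fit per perturbed copy (linear least squares standing in for the neural network), and actions are chosen greedily against the ensemble minimum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Offline data: features of (state, action) pairs and observed rewards.
n, d = 200, 5
phi = rng.standard_normal((n, d))                 # feature map of (s, a)
rewards = phi @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Fit one estimator per Gaussian-perturbed copy of the rewards.
M, noise_std = 10, 0.5
ensemble = []
for _ in range(M):
    perturbed = rewards + noise_std * rng.standard_normal(n)
    w, *_ = np.linalg.lstsq(phi, perturbed, rcond=None)  # NN stand-in
    ensemble.append(w)

def pessimistic_value(phi_sa):
    """Minimum over the ensemble: implicit pessimism without explicit LCBs."""
    return min(phi_sa @ w for w in ensemble)

# Greedy action selection over candidates: O(M d) per action, with no
# dependence on the dataset size K at decision time.
candidates = rng.standard_normal((4, d))
best = max(range(4), key=lambda a: pessimistic_value(candidates[a]))
print(best)
```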
If most samples have attributes spuriously correlated with the target label, a Deep Neural Network (DNN) is prone to neglecting those samples whose attributes are intrinsically consistent with the targets, which leads to representations with large intra-class covariance. To improve the generalization ability of such a biased model, we propose a $\chi^2$-model to fill in the intra-class blanks and learn debiased representations. First, we use a $\chi$-shape pattern to match the training dynamics of a DNN and find Intermediate Attribute Samples (IASs) --- samples near decision boundaries when discerning various attributes, which indicate how attribute values change from one extreme to another. Then we rectify the decision boundary with a $\chi$-branch metric learning objective. Conditional interpolation among IASs eliminates the negative effect of peripheral attributes and facilitates making intra-class samples compact. Experiments show that $\chi^2$-model learns debiased representation effectively and achieves remarkable improvements on various datasets.","debiased representation, conditional attribute interpolation, image classification" Active Learning at the ImageNet Scale,https://openreview.net/forum?id=JhsVJoK13u,https://openreview.net/pdf?id=JhsVJoK13u,"We identify sampling-imbalance as a major failure mode in large-scale active learning, and we propose Balanced Selection, a simple, scalable AL algorithm to remedy it.","Active learning (AL) algorithms aim to identify an optimal subset of data for annotation, such that deep neural networks (DNN) can achieve better performance when trained on this labeled subset. AL is especially impactful in industrial-scale settings where data labeling costs are high and practitioners use every tool at their disposal to improve model performance. The recent success of self-supervised pretraining (SSP) highlights the importance of harnessing abundant unlabeled data to boost model performance. By combining AL with SSP, we can make use of unlabeled data while simultaneously labeling and training on particularly informative samples. In this work, we study a combination of AL and SSP on ImageNet. We find that performance on small toy datasets – the typical benchmark setting in the literature – is not representative of performance on ImageNet due to the class-imbalanced samples selected by an active learner. Among the existing baselines we test, popular AL algorithms across a variety of small- and large-scale settings fail to outperform random sampling. To remedy the class-imbalance problem, we propose Balanced Selection (BASE), a simple, scalable AL algorithm that outperforms random sampling consistently by selecting more balanced samples for annotation than existing methods.","active learning, large-scale active learning" Deep Probabilistic Time Series Forecasting over Long Horizons,https://openreview.net/forum?id=22h1XSEiN0,https://openreview.net/pdf?id=22h1XSEiN0,We demonstrate that with simple adaptations high-performing deterministic models can be made into state-of-the-art probabilistic forecasters.,"Recent advances in neural network architectures for time series have led to significant improvements on deterministic forecasting metrics like mean squared error. 
We show that for many common benchmark datasets with deterministic evaluation metrics, intrinsic stochasticity is so significant that simply predicting summary statistics of the inputs outperforms many state-of-the-art methods, despite these simple forecasters capturing essentially no information from the noisy signals in the dataset. We demonstrate that using a probabilistic framework and moving away from deterministic evaluation acts as a simple fix for this apparent misalignment between good performance and poor understanding. With simple and scalable approaches for uncertainty representation we can adapt state-of-the-art architectures for point prediction to be excellent probabilistic forecasters, outperforming complex probabilistic methods constructed from deep generative models (DGMs) on popular benchmarks. Finally, we demonstrate that our simple adaptations to point predictors yield reliable probabilistic forecasts on many problems of practical significance, namely large and highly stochastic datasets of climatological and economic data.","time series, neural networks, probabilistic forecasting" Revealing Dominant Eigendirections via Spectral Non-Robustness Analysis in the Deep Reinforcement Learning Policy Manifold,https://openreview.net/forum?id=aG_B1SZ92t,https://openreview.net/pdf?id=aG_B1SZ92t,,"Deep neural policies have recently been deployed in a diverse set of settings, from biotechnology to automated financial systems. However, the utilization of deep neural networks to approximate the state-action value function raises concerns about decision boundary stability, in particular with regard to the sensitivity of policy decision making to indiscernible, non-robust features due to highly non-convex and complex deep neural manifolds. These concerns constitute an obstruction to understanding the reasoning made by deep neural policies, and their foundational limitations. Thus, it is crucial to develop techniques that aim to understand the sensitivities in the learnt representations of neural network policies. To achieve this we introduce a method that identifies the dominant eigen-directions via spectral analysis of non-robust directions in the deep neural policy decision boundary across both time and space. Through experiments in the Arcade Learning Environment (ALE), we demonstrate the effectiveness of our spectral analysis algorithm for identifying correlated non-robust directions, and for measuring how sample shifts remold the set of sensitive directions in the neural policy landscape. Most importantly, we show that state-of-the-art adversarial training techniques yield learning of sparser high-sensitivity directions, with dramatically larger oscillations over time, when compared to standard training. We believe our results reveal the fundamental properties of the decision process made by the deep reinforcement learning policies, and can help in constructing safe, reliable and value-aligned deep neural policies.", MC-SSL: Towards Multi-Concept Self-Supervised Learning,https://openreview.net/forum?id=uTshHIKOtan,https://openreview.net/pdf?id=uTshHIKOtan,,"Self-supervised pre-training is the method of choice for natural language processing models and is rapidly gaining popularity in many vision tasks. Recently, self-supervised pre-training has been shown to outperform supervised pre-training for many downstream vision applications, marking a milestone in the area. 
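The point-to-probabilistic adaptation described in the forecasting abstract above can be sketched in a few lines: give the point predictor a second output for variance and train with Gaussian negative log-likelihood. The backbone, data, and training loop are toy stand-ins, not the paper's setup.

```python
# Turn a deterministic forecaster into a probabilistic one with a Gaussian head.
import torch
import torch.nn as nn

class ProbForecaster(nn.Module):
    def __init__(self, context_len=24):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(context_len, 64), nn.ReLU())
        self.mean_head = nn.Linear(64, 1)
        self.logvar_head = nn.Linear(64, 1)    # log-variance keeps the variance positive

    def forward(self, x):
        h = self.backbone(x)
        return self.mean_head(h), self.logvar_head(h)

model = ProbForecaster()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(128, 24), torch.randn(128, 1)   # toy context windows and targets

for _ in range(50):
    mu, logvar = model(x)
    nll = 0.5 * (logvar + (y - mu) ** 2 / logvar.exp()).mean()  # Gaussian NLL up to a constant
    opt.zero_grad(); nll.backward(); opt.step()
```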
This superiority is attributed to the negative impact of incomplete labelling of the training images, which convey multiple concepts, but are annotated using a single dominant class label. Although Self-Supervised Learning (SSL), in principle, is free of this limitation, the choice of a pretext task facilitating SSL can perpetuate this shortcoming by driving the learning process towards a single concept output. This study aims to investigate the possibility of modelling all the concepts present in an image without using labels. In this respect, the proposed Multi-Concept SSL (MC-SSL) framework is a step towards unsupervised learning which embraces all the diverse content in an image with the aim of explicitly modelling the information from all the concepts present in the image. MC-SSL involves two core design steps: group masked model learning (GMML) and learning of pseudo-concepts for data tokens using a momentum encoder (teacher-student) framework. An added benefit of MC-SSL is the ability to train data-hungry transformers on small datasets with high accuracy without external data. Experimental results on multi-label and multi-class image classification downstream tasks demonstrate that MC-SSL not only surpasses existing SSL methods but also outperforms supervised transfer learning. The source code will be made publicly available for the community to train on bigger corpora. ","Self-supervised Learning, Group Masked Model Learning, Masked Autoencoders, Vision Transformers, Knowledge Distillation" Latent Hierarchical Imitation Learning for Stochastic Environments,https://openreview.net/forum?id=stgewiZP0OH,https://openreview.net/pdf?id=stgewiZP0OH,We formalize and alleviate challenges in imitation learning when hierarchical policies are used to prevent mode collapse. ,"Many applications of imitation learning require the agent to avoid mode collapse and mirror the full distribution of observed behaviors. Existing methods improving this distributional realism typically rely on hierarchical policies conditioned on sampled types that model agent-internal features like persona, goal, or strategy. However, these methods are often inappropriate for stochastic environments, where internal and external factors of influence on the observed agent trajectories have to be disentangled, and only internal factors should be encoded in the agent type to be robust to changing environment conditions. We formalize this challenge as distribution shifts in the marginal and conditional distributions of agent types under environmental stochasticity, in addition to the familiar covariate shift in state visitations. We propose Robust Type Conditioning (RTC), which eliminates these shifts with adversarial training under randomly sampled types. Experiments on two domains, including the large-scale Waymo Open Motion Dataset, show improved distributional realism while maintaining or improving task performance compared to state-of-the-art baselines.","hierarchical imitation learning, learning from demonstrations, autonomous driving, causal confusion" Trimsformer: Trimming Transformer via Searching for Low-Rank Structure,https://openreview.net/forum?id=T-camDtiuUg,https://openreview.net/pdf?id=T-camDtiuUg,Constructing the efficient low-rank vision transformer structure based on neural architecture search.,"Vision Transformers (ViT) have recently been used successfully in various computer vision tasks, but the high computational cost hinders their practical deployment. 
One of the most well-known methods to alleviate the computational burden is low-rank approximation. However, how to automatically search for a low-rank configuration efficiently remains a challenge. In this paper, we propose Trimsformer, an end-to-end automatic low-rank approximation framework based on a neural architecture search scheme, which tackles the inefficiency of searching for a target low-rank configuration out of numerous ones. We propose weight inheritance, which encodes enormous rank choices into a single search space. In addition, we share the gradient information among building blocks to boost the convergence of the supernet training. Furthermore, to mitigate the initial performance gap between subnetworks caused by using pre-trained weights, we adopt non-uniform sampling to promote the overall subnetwork performance. Extensive results show the efficacy of our Trimsformer framework. For instance, with our method, Trim-DeiT-B/Trim-Swin-B can save up to 57%/46% FLOPs with 1.1%/0.2% higher accuracy over DeiT-B/Swin-B. Last but not least, Trimsformer exhibits remarkable generality and orthogonality. We can yield an extra 21%$\sim$26% FLOPs reduction on top of popular compression methods as well as compact hybrid structures. Our code will be released.","Vision Transformer, Model Compression, Low-Rank Approximation, Neural Architecture Search" Exploring the Limits of Differentially Private Deep Learning with Group-wise Clipping,https://openreview.net/forum?id=oze0clVGPeX,https://openreview.net/pdf?id=oze0clVGPeX,Explore the limit of the efficiency of DP-SGD with group-wise clipping,"Differentially private deep learning has recently witnessed advances in computational efficiency and privacy-utility trade-off. We explore whether further improvements along the two axes are possible and provide affirmative answers leveraging two instantiations of \emph{group-wise clipping}. To reduce the compute time overhead of private learning, we show that \emph{per-layer clipping}, where the gradient of each neural network layer is clipped separately, allows clipping to be performed in conjunction with backpropagation in differentially private optimization. This results in private learning that is as memory-efficient and almost as fast per training update as non-private learning for many workflows of interest. While per-layer clipping with constant thresholds tends to underperform standard flat clipping, per-layer clipping with adaptive thresholds matches or outperforms flat clipping under given training epoch constraints, hence attaining similar or better task performance in less wall time. To explore the limits of scaling (pretrained) models in differentially private deep learning, we privately fine-tune the 175 billion-parameter GPT-3. We bypass scaling challenges associated with clipping gradients that are distributed across multiple devices with \emph{per-device clipping} that clips the gradient of each model piece separately on its host device. 
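An illustrative sketch of the per-layer clipping just described: each layer's per-sample gradient is clipped to its own threshold as soon as it is available. The naive per-sample loop, uniform thresholds, and noise multiplier are placeholders; efficient implementations fuse the clipping with backpropagation rather than looping.

```python
# Per-layer (group-wise) clipping for DP-SGD, written naively for clarity.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
thresholds = {name: 1.0 for name, _ in model.named_parameters()}   # per-layer C_l
noise_mult, lr = 1.0, 0.1

x, y = torch.randn(8, 10), torch.randn(8, 1)
accum = {name: torch.zeros_like(p) for name, p in model.named_parameters()}

for i in range(x.shape[0]):                    # per-sample gradients (naive loop)
    model.zero_grad()
    nn.functional.mse_loss(model(x[i:i+1]), y[i:i+1]).backward()
    for name, p in model.named_parameters():
        g = p.grad
        scale = min(1.0, thresholds[name] / (g.norm().item() + 1e-12))  # clip this layer only
        accum[name] += g * scale

with torch.no_grad():
    for name, p in model.named_parameters():   # noise calibrated to each layer's threshold
        noisy = accum[name] + noise_mult * thresholds[name] * torch.randn_like(p)
        p -= lr * noisy / x.shape[0]
```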
Privately fine-tuning GPT-3 with per-device clipping achieves a task performance at $\epsilon=1$ better than what is attainable by non-privately fine-tuning the largest GPT-2 on a summarization task.","differential privacy, per-layer clipping, efficiency, DP-SGD" Continual Zero-shot Learning through Semantically Guided Generative Random Walks,https://openreview.net/forum?id=g-kR7WU4Iw-,https://openreview.net/pdf?id=g-kR7WU4Iw-,,"Learning new knowledge, not forgetting previous ones, and adapting it to future tasks occur simultaneously throughout a human's lifetime. However, this learning procedure is mostly studied individually in deep learning, either from the perspective of lifetime learning without forgetting (continual learning) or adaptation to recognize unseen tasks (zero-shot learning, ZSL). Continual ZSL (CZSL), the desired and more natural learning setting, has been introduced in recent years and is most developed in the transductive setting, which is unrealistic in practice. In this paper, we focus on inductive continual generalized zero-shot learning (CGZSL) via a generative approach, where no unseen class information is provided during the training. The heart of the success of previous generative-based approaches is that they learn quality representations from seen classes to improve the generative understanding of the unseen visual space. Motivated by this, we first introduce generalization bound tools and provide the first theoretical explanation for the benefits of generative modeling to ZSL and CZSL tasks. Second, we develop a pure Inductive Continual Generalized Zero-Shot Learner using our theoretical analysis to guide the improvement of the generation quality. The learner employs a novel semantically-guided Generative Random Walk (GRW) loss, where we encourage high transition probability, computed by random walk, from seen space to a realistic generative unseen space. We also demonstrate that our learner continually improves the unseen class representation quality, achieving state-of-the-art performance on AWA1, AWA2, CUB, and SUN datasets and surpassing existing CGZSL methods by around 3-7\% on different datasets. Code is available at https://anonymous.4open.science/r/cgzsl-76E7/main.py","Continual Learning, Zero-shot Learning, Random Walk" Mesh-free Eulerian Physics-Informed Neural Networks,https://openreview.net/forum?id=253DOGs6EF,https://openreview.net/pdf?id=253DOGs6EF,,"Physics-informed Neural Networks (PINNs) have recently emerged as a principled way to include prior physical knowledge in the form of partial differential equations (PDEs) into neural networks. Although PINNs are generally viewed as mesh-free, current approaches still rely on collocation points within a bounded region, even in settings with spatially sparse signals. Furthermore, if the boundaries are not known, the selection of such a region is difficult and often results in a large proportion of collocation points being selected in areas of low relevance. To resolve this severe drawback of current methods, we present a mesh-free and adaptive approach termed particle-density PINN (pdPINN), which is inspired by the microscopic viewpoint of fluid dynamics. The method is based on the Eulerian formulation and, different from classical mesh-free methods, does not require the introduction of Lagrangian updates. We propose to sample directly from the distribution over the particle positions, eliminating the need to introduce boundaries while adaptively focusing on the most relevant regions. 
This is achieved by interpreting a non-negative physical quantity (such as the density or temperature) as an unnormalized probability distribution from which we sample with dynamic Monte Carlo methods. The proposed method leads to higher sample efficiency and improved performance of PINNs. These advantages are demonstrated on various experiments based on the continuity equations, Fokker-Planck equations, and the heat equation.","Physics-informed Neural Network, PINN, SIREN, fluid dynamics, implicit neural representations, PDEs" Self-supervised learning with rotation-invariant kernels,https://openreview.net/forum?id=8uu6JStuYm,https://openreview.net/pdf?id=8uu6JStuYm,A regularization loss based on kernel mean embeddings with rotation-invariant kernels on the hypersphere for self-supervised learning of image representations,"We introduce a regularization loss based on kernel mean embeddings with rotation-invariant kernels on the hypersphere (also known as dot-product kernels) for self-supervised learning of image representations. Besides being fully competitive with the state of the art, our method significantly reduces time and memory complexity for self-supervised training, making it implementable for very large embedding dimensions on existing devices and more easily adjustable than previous methods to settings with limited resources. Our work follows the major paradigm where the model learns to be invariant to some predefined image transformations (cropping, blurring, color jittering, etc.), while avoiding a degenerate solution by regularizing the embedding distribution. Our particular contribution is to propose a loss family promoting the embedding distribution to be close to the uniform distribution on the hypersphere, with respect to the maximum mean discrepancy pseudometric. We demonstrate that this family encompasses several regularizers of former methods, including uniformity-based and information-maximization methods, which are variants of our flexible regularization loss with different kernels. Beyond its practical consequences for state of the art self-supervised learning with limited resources, the proposed generic regularization approach opens perspectives to leverage more widely the literature on kernel methods in order to improve self-supervised learning methods.","Self-supervised learning, maximum mean discrepancy, rotation-invariant kernel, hypersphere" Strong inductive biases provably prevent harmless interpolation,https://openreview.net/forum?id=7i6OZa7oij,https://openreview.net/pdf?id=7i6OZa7oij,We show that the strength of a model’s inductive bias determines whether interpolation of noisy data is harmless or harmful.,"Classical wisdom suggests that estimators should avoid fitting noise to achieve good generalization. In contrast, modern overparameterized models can yield small test error despite interpolating noise — a phenomenon often called ""benign overfitting"" or ""harmless interpolation"". This paper argues that the degree to which interpolation is harmless hinges upon the strength of an estimator's inductive bias, i.e., how heavily the estimator favors solutions with a certain structure: while strong inductive biases prevent harmless interpolation, weak inductive biases can even require fitting noise to generalize well. Our main theoretical result establishes tight non-asymptotic bounds for high-dimensional kernel regression that reflect this phenomenon for convolutional kernels, where the filter size regulates the strength of the inductive bias. 
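One piece of the rotation-invariant-kernel method above is compact enough to sketch: with a dot-product kernel on the hypersphere, the cross-term of the maximum mean discrepancy against the uniform distribution is a constant, so minimizing MMD to uniform reduces to minimizing the mean pairwise kernel of the normalized embeddings. The exponential kernel and temperature below are arbitrary examples, not necessarily the paper's choices.

```python
# Uniformity regularizer: mean pairwise dot-product kernel on the hypersphere.
import torch

def uniformity_loss(z, t=2.0):
    z = torch.nn.functional.normalize(z, dim=1)   # project embeddings onto the hypersphere
    k = torch.exp(t * (z @ z.T))                  # rotation-invariant kernel k(u,v)=exp(t<u,v>)
    n = z.shape[0]
    return (k.sum() - k.diagonal().sum()) / (n * (n - 1))  # off-diagonal mean; smaller = more uniform

embeddings = torch.randn(256, 128, requires_grad=True)
loss = uniformity_loss(embeddings)
loss.backward()   # usable as one term of a self-supervised objective
```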
We further provide empirical evidence of the same behavior for deep neural networks with varying filter sizes and rotational invariance.","high-dimensional statistics, non-parametric regression, deep learning theory, generalization bounds, benign overfitting" Active Learning based Structural Inference,https://openreview.net/forum?id=RN4iVt9ndGa,https://openreview.net/pdf?id=RN4iVt9ndGa,,"In this paper, we propose an active-learning based framework, Active Learning based Structural Inference (ALaSI), to infer the existence of directed connections from observed agents' states over a time period in a dynamical system. With the help of deep active learning, ALaSI is competent in learning the representation of connections with a relatively small pool of prior knowledge. Moreover, based on information theory, we propose inter- and out-of-scope message learning pipelines, which are remarkably beneficial to structural inference for large dynamical systems. We evaluate ALaSI on various large datasets, including simulated systems and real-world networks, to demonstrate that ALaSI is able to precisely infer the existence of connections in these systems under either supervised learning or unsupervised learning, with better performance than baseline methods.","Structural Inference, Active Learning, Mutual Information, Deep Learning" Batch Normalization Explained,https://openreview.net/forum?id=JFtHy-Ve7e,https://openreview.net/pdf?id=JFtHy-Ve7e,Batch normalization adapts the geometry of the deep network to the data manifold and serves as a smart initialization and a margin maximization method,"A critically important, ubiquitous, and yet poorly understood ingredient in modern deep networks (DNs) is batch normalization (BN), which centers and normalizes the feature maps. To date, only limited progress has been made in understanding why BN boosts DN learning and inference performance; work has focused exclusively on showing that BN smooths a DN's loss landscape. In this paper, we study BN theoretically from the perspective of function approximation; we exploit the fact that most of today's state-of-the-art DNs are continuous piecewise affine (CPA) splines that fit a predictor to the training data via affine mappings defined over a partition of the input space (the so-called ``linear regions''). We demonstrate that BN is an unsupervised learning technique that -- independent of the DN's weights or gradient-based learning -- adapts the geometry of a DN's spline partition to match the data. BN provides a ``smart initialization'' that boosts the performance of DN learning, because it adapts even a DN initialized with random weights to align its spline partition with the data. We also show that the variation of BN statistics between mini-batches introduces a dropout-like random perturbation to the partition boundaries and hence the decision boundary for classification problems. This per mini-batch perturbation reduces overfitting and improves generalization by increasing the margin between the training samples and the decision boundary. 
","batch normalization, continuous piecewise linear networks, unsupervised learning" AN OPERATOR NORM BASED PASSIVE FILTER PRUNING METHOD FOR EFFICIENT CNNS,https://openreview.net/forum?id=Tjp51oUrk3,https://openreview.net/pdf?id=Tjp51oUrk3,A passive filter pruning framework is proposed by incorporating significance of filters in producing output to eliminate unimportant CNN filters for reducing computational complexity and paramters in CNNs.,"Convolutional neural networks (CNNs) have shown state-of-the-art performance in various applications. However, CNNs are resource-hungry due to their requirement of high computational complexity and memory storage. Recent efforts toward achieving computational efficiency in CNNs involve filter pruning methods that eliminate some of the filters in CNNs based on the ""importance"" of the filters. Existing passive filter pruning methods typically use the entry-wise norm of the filters to quantify filter importance, without considering how well the filter contributes in producing the node output. Under situations where the large number of filters are to be pruned from the network, the entry-wise norm methods always select high entry-wise norm filters as important, and ignore the diversity learned by the other filters that may result in degradation in the performance. To address this, we present a passive filter pruning method where the filters are pruned based on their contribution in producing output by implicitly considering the operator norm of the filters. The computational cost and memory requirement is reduced significantly by eliminating filters and their corresponding feature maps from the network. Accuracy similar to the original network is recovered by fine-tuning the pruned network. The proposed pruning method gives similar or better performance and recovers accuracy faster during the fine-tuning process than the entry-wise norm-based pruning methods. The efficacy of the proposed pruning method is evaluated on audio scene classification (e.g. TAU Urban Acoustic Scenes 2020) and image classification (MNIST handwritten digit classification). ","Convolutional neural network, filter pruning, VGGish, DCASE, MNIST" Neuromechanical Autoencoders: Learning to Couple Elastic and Neural Network Nonlinearity,https://openreview.net/forum?id=QubsmJT_A0,https://openreview.net/pdf?id=QubsmJT_A0,"We introduce Neuromechanical Autoencoders, a framework for co-design of neural network and mechanical metamaterials for performing morphological computation.","Intelligent biological systems are characterized by their embodiment in a complex environment and the intimate interplay between their nervous systems and the nonlinear mechanical properties of their bodies. This coordination, in which the dynamics of the motor system co-evolved to reduce the computational burden on the brain, is referred to as ""mechanical intelligence"" or ""morphological computation"". In this work, we seek to develop machine learning analogs of this process, in which we jointly learn the morphology of complex nonlinear elastic solids along with a deep neural network to control it. By using a specialized differentiable simulator of elastic mechanics coupled to conventional deep learning architectures---which we refer to as neuromechanical autoencoders---we are able to learn to perform morphological computation via gradient descent. Key to our approach is the use of mechanical metamaterials---cellular solids, in particular---as the morphological substrate. 
Just as deep neural networks provide flexible and massively-parametric function approximators for perceptual and control tasks, cellular solid metamaterials are promising as a rich and learnable space for approximating a variety of actuation tasks. In this work, we take advantage of these complementary computational concepts to co-design materials and neural network controls to achieve nonintuitive mechanical behavior. We demonstrate in simulation how it is possible to achieve translation, rotation, and shape matching, as well as a ""digital MNIST"" task. We additionally manufacture and evaluate one of the designs to verify its real-world behavior. ","morphological computation, mechanical metamaterials, computational mechanics, mechanical co-design, automatic differentiation, differentiable simulation" Temporal Dynamics Aware Adversarial Attacks On Discrete-Time Graph Models,https://openreview.net/forum?id=yUY15QBERj,https://openreview.net/pdf?id=yUY15QBERj,Introduces a novel constraint to attack dynamic graph models while preserving the original graph evolution and presents an effective approach to find such attacks,"Real-world graphs such as social networks, communication networks, and rating networks are constantly evolving over time. Many architectures have been developed to learn effective node representations using both graph structure and its dynamics. While the robustness of static graph models is well-studied, the vulnerability of the dynamic graph models to adversarial attacks is underexplored. In this work, we design a novel adversarial attack on discrete-time dynamic graph models where we desire to perturb the input graph sequence in a manner that preserves the temporal dynamics of the graph. To this end, we motivate a novel Temporal Dynamics-Aware Perturbation (TDAP) constraint, which ensures that perturbations introduced at each time step are restricted to only a small fraction of the number of changes in the graph since the previous time step. We present a theoretically-grounded Projected Gradient Descent approach for dynamic graphs to find the effective perturbations under the TDAP constraint. Experiments on two tasks, dynamic link prediction and node classification, show that our approach is up to 4x more effective than the baseline methods for attacking these models. We also consider the practical online setting where graph snapshots become available in real-time and extend our attack approach to use Online Gradient Descent for performing attacks under the TDAP constraint. In this more challenging setting, we demonstrate that our method achieves up to 5x superior performance when compared to representative baselines.","Graph Neural Networks, Dynamic Graphs, Adversarial Attacks, Evolution-preserving" Automatic Curriculum Generation for Reinforcement Learning in Zero-Sum Games,https://openreview.net/forum?id=eYm_Q5KLQr,https://openreview.net/pdf?id=eYm_Q5KLQr,"In this work, we present the first theoretical framework of automatic curriculum learning in the setting of zero-sum games and derive a surprisingly simple indicator of training progress, i.e., the policy variance","Curriculum learning (CL), whose core idea is to train from easy to hard, is a popular technique to accelerate reinforcement learning (RL) training. It has also been a trend to automate the curriculum generation process. 
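A toy sketch of projecting a candidate attack onto a TDAP-style budget as motivated above: perturbations at each step are limited to a fraction of the change observed since the previous snapshot. Scoring edge flips by a generic magnitude stands in for the paper's projected-gradient machinery.

```python
# Keep only the top-k candidate edge flips, with k tied to the natural graph change.
import numpy as np

def tdap_project(flip_scores, adj_prev, adj_curr, eps=0.1):
    """flip_scores: per-edge attack scores, same shape as the adjacency matrix."""
    num_changes = int(np.abs(adj_curr - adj_prev).sum()) // 2   # undirected edge changes
    budget = max(1, int(eps * num_changes))                     # TDAP-style fraction
    flat = flip_scores.ravel()
    mask = np.zeros_like(flat)
    mask[np.argsort(flat)[-budget:]] = 1.0                      # spend budget on top flips
    return mask.reshape(flip_scores.shape)

rng = np.random.default_rng(1)
a = np.triu(rng.integers(0, 2, size=(20, 20)), 1)
adj_prev = a + a.T
adj_curr = adj_prev.copy()
for i, j in rng.integers(0, 20, size=(6, 2)):                   # simulate natural evolution
    if i != j:
        adj_curr[i, j] = adj_curr[j, i] = 1 - adj_curr[i, j]

mask = tdap_project(rng.random((20, 20)), adj_prev, adj_curr, eps=0.5)
print("allowed flips this step:", int(mask.sum()))
```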
Existing automatic CL works primarily focus on goal-conditioned RL problems, where an explicit indicator of training progress, e.g., reward or success rate, can be used to prioritize the training tasks. However, such a requirement is no longer valid in zero-sum games: there are no goals for the agents, and the cumulative reward of the learning policy can constantly fluctuate throughout training. In this work, we present the first theoretical framework of automatic curriculum learning in the setting of zero-sum games and derive a surprisingly simple indicator of training progress, i.e., the Q-value variance, which can be directly approximated by computing the variance of value network ensembles. With such a progression metric, we further adopt a particle-based task sampler to generate initial environment configurations for training, which is particularly lightweight, computation-efficient, and naturally multi-modal. Combining these techniques with multi-agent PPO training, we obtain our final algorithm, Zero-sum Automatic Curriculum Learning (ZACL). We first evaluate ZACL in a 2D particle-world environment, where ZACL produces much stronger policies than popular RL methods for zero-sum games using the same number of samples. Then we show in the challenging hide-and-seek environment that ZACL can lead to all four emergent phases using a single desktop computer, which is reported for the first time in the literature. The project website is at https://sites.google.com/view/zacl.","multi-agent reinforcement learning, curriculum learning, zero-sum games" Internet-augmented language models through few-shot prompting for open-domain question answering,https://openreview.net/forum?id=hFCUPkSSRE,https://openreview.net/pdf?id=hFCUPkSSRE,We use few-shot prompting to condition pre-trained LMs on Google retrieved evidence for improving open-domain question answering ,"In this work, we aim to capitalize on the unique few-shot capabilities of large-scale language models (LSLMs) to overcome some of their challenges with respect to grounding to factual and up-to-date information. Motivated by semi-parametric language models (LMs), which ground their decisions in external retrieved evidence, we use few-shot prompting to learn to condition LMs on information returned from the web using Google Search, a broad and constantly updated knowledge source. Our approach does not involve fine-tuning or learning additional parameters, thus making it applicable to any LM, therefore offering a strong baseline. Indeed, we find that LMs conditioned on the web surpass the performance of closed-book models of similar, or even larger, model sizes in open-domain question answering. Finally, we find that increasing the inference-time compute of models, achieved via using multiple retrieved pieces of evidence to generate multiple answers followed by a reranking stage that uses scores generated by the same LMs, leads to better performance and alleviates the lower performance of smaller few-shot LMs. 
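The inference-time recipe in the abstract above (condition on retrieved evidence, generate one answer per piece of evidence, rerank with the same LM's scores) can be sketched schematically. `lm_generate` and `lm_logprob` are hypothetical stand-ins for calls into a real language model; the toy lambdas exist only so the sketch runs end-to-end.

```python
# Evidence-conditioned generation followed by LM-scored reranking.
def answer_with_reranking(question, evidences, lm_generate, lm_logprob):
    candidates = []
    for ev in evidences:
        prompt = f"Evidence: {ev}\nQ: {question}\nA:"   # few-shot examples omitted
        candidates.append((ev, lm_generate(prompt)))
    # Rerank: score each candidate answer with the same LM.
    scored = [(lm_logprob(f"Evidence: {ev}\nQ: {question}\nA: {ans}"), ans)
              for ev, ans in candidates]
    return max(scored)[1]

answer = answer_with_reranking(
    "Who wrote Hamlet?",
    ["Hamlet is a tragedy by William Shakespeare.", "Hamlet premiered around 1600."],
    lm_generate=lambda p: "William Shakespeare" if "Shakespeare" in p else "Unknown",
    lm_logprob=lambda p: float(len(p)),                # dummy score for illustration
)
print(answer)
```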
All in all, our findings suggest that it might be beneficial to slow down the race towards the biggest model and instead shift attention towards finding more effective ways to use models, including, but not limited to, better prompting or increasing inference-time compute.","language models, few-shot prompting, retrieval-augmented, question answering" Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training,https://openreview.net/forum?id=YJ7o2wetJ2,https://openreview.net/pdf?id=YJ7o2wetJ2,A method for pre-training a goal-conditioned value function on human videos that can be effectively used as zero-shot visual reward and representation for unseen robotics tasks in simulation and real-world.,"Reward and representation learning are two long-standing challenges for learning an expanding set of robot manipulation skills from sensory observations. Given the inherent cost and scarcity of in-domain, task-specific robot data, learning from large, diverse, offline human videos has emerged as a promising path towards acquiring a generally useful visual representation for control; however, how these human videos can be used for general-purpose reward learning remains an open question. We introduce $\textbf{V}$alue-$\textbf{I}$mplicit $\textbf{P}$re-training (VIP), a self-supervised pre-trained visual representation capable of generating dense and smooth reward functions for unseen robotic tasks. VIP casts representation learning from human videos as an offline goal-conditioned reinforcement learning problem and derives a self-supervised dual goal-conditioned value-function objective that does not depend on actions, enabling pre-training on unlabeled human videos. Theoretically, VIP can be understood as a novel implicit time contrastive objective that generates a temporally smooth embedding, enabling the value function to be implicitly defined via the embedding distance, which can then be used to construct the reward for any goal-image specified downstream task. Trained on large-scale Ego4D human videos and without any fine-tuning on in-domain, task-specific data, VIP's frozen representation can provide dense visual reward for an extensive set of simulated and $\textbf{real-robot}$ tasks, enabling diverse reward-based visual control methods and significantly outperforming all prior pre-trained representations. Notably, VIP can enable simple, few-shot offline RL on a suite of real-world robot tasks with as few as 20 trajectories.","Pre-Training for Control, Offline RL, Goal-Conditioned RL, Deep RL, Robot Learning, Self-Supervised Learning, Visuomotor Control" Language Modeling Using Tensor Trains,https://openreview.net/forum?id=TqCHPi7xlV,https://openreview.net/pdf?id=TqCHPi7xlV,,"Tensor networks have previously been shown to have potential in language modelling in theory, but have lacked practical evidence to support this. We propose a novel Tensor Train Language Model (TTLM) based on Tensor-Train decomposition. We prove that TTLM generalizes Second-order Recurrent Neural Networks (RNNs), Recurrent Arithmetic Circuits and Multiplicative Integration RNNs, in the sense that the architectures of all of these are, essentially, special cases of that of TTLM. 
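For VIP above, the reward construction admits a compact sketch: treat the goal-conditioned value as the negative embedding distance to a goal image and reward per-step progress in that value. The random encoder stands in for VIP's pretrained representation, and discounting details of the actual objective are omitted.

```python
# Dense reward from a frozen embedding: progress toward the goal in latent space.
import torch
import torch.nn as nn

phi = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128)).eval()  # frozen encoder stand-in

def value(obs, goal):
    with torch.no_grad():
        return -torch.norm(phi(obs) - phi(goal), dim=1)   # V(s; g) = -||phi(s) - phi(g)||

def dense_reward(obs_t, obs_t1, goal):
    return value(obs_t1, goal) - value(obs_t, goal)       # reward = progress in value

obs_t, obs_t1 = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
goal = torch.randn(1, 3, 64, 64)
print("reward:", dense_reward(obs_t, obs_t1, goal).item())
```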
To show the usefulness of TTLM, we perform a principled experimental evaluation on language modeling tasks, showing that our proposed variants, TTLM-large and TTLM-Tiny, can be more effective than a vanilla RNN, with TTLM-Tiny using half the model size.","Tensor network, RNNs, Language modeling" Bridging the Gap to Real-World Object-Centric Learning,https://openreview.net/forum?id=b9tUk-f_aG,https://openreview.net/pdf?id=b9tUk-f_aG,Our method uses slot attention with self-supervised DINO features to discover objects on real-world data.,"Humans naturally decompose their environment into entities at the appropriate level of abstraction to act in the world. Allowing machine learning algorithms to derive this decomposition in an unsupervised way has become an important line of research. However, current methods are restricted to simulated data or require additional information in the form of motion or depth in order to successfully discover objects. In this work, we overcome this limitation by showing that reconstructing features from models trained in a self-supervised manner is a sufficient training signal for object-centric representations to arise in a fully unsupervised way. Our approach, DINOSAUR, significantly outperforms existing object-centric learning models on simulated data and is the first unsupervised object-centric model that scales to real-world datasets such as COCO and PASCAL VOC. DINOSAUR is conceptually simple and shows competitive performance compared to more involved pipelines from the computer vision literature. ","object discovery, object-centric learning, vision transformer, self-supervised learning, unsupervised learning" Towards a Unified Theoretical Understanding of Non-contrastive Learning via Rank Differential Mechanism,https://openreview.net/forum?id=cIbjyd2Vcy,https://openreview.net/pdf?id=cIbjyd2Vcy,,"Recently, many advances in self-supervised visual learning have been brought about by contrastive learning, which aligns positive pairs while pushing negative pairs apart. Surprisingly, a variety of new methods, such as BYOL, SimSiam, SwAV, and DINO, show that when equipped with some architectural asymmetric designs, aligning positive pairs alone is sufficient to attain good performance. However, it is still not fully clear how these seemingly different asymmetric designs can avoid feature collapse. Despite some understanding of specific modules (like the predictor in BYOL), there is yet no unified theoretical understanding, particularly for those that also work without the predictor (like DINO). In this work, we propose a new understanding of non-contrastive learning, named the Rank Differential Mechanism (RDM). We show that these asymmetric designs all create a consistent difference in the dual-branch outputs as measured by their effective rank. This rank difference will provably lead to an improvement of effective dimensionality and alleviate either complete or dimensional feature collapse. Different from previous theories, our RDM theory is applicable to different asymmetric designs (with and without the predictor), and thus can serve as a unified understanding of existing non-contrastive learning methods. Besides, our RDM theory also provides practical guidelines for designing many new non-contrastive variants. We show that these variants indeed achieve comparable performance to existing methods on benchmark datasets, and some of them even outperform the baselines. 
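A toy sketch of the tensor-train mechanics behind the TTLM claim above: scoring a sequence contracts a shared core with each token in turn, carrying a rank-r state that is bilinear in the previous state and the current token, which is exactly the second-order RNN structure the abstract mentions. Vocabulary size, rank, and the random cores are illustrative.

```python
# Unnormalized sequence scoring with a shared tensor-train core.
import numpy as np

rng = np.random.default_rng(0)
vocab, rank = 50, 8
core = rng.normal(scale=0.5, size=(rank, vocab, rank))   # shared TT core
v_left, v_right = rng.normal(size=rank), rng.normal(size=rank)

def tt_score(token_ids):
    h = v_left
    for t in token_ids:
        h = h @ core[:, t, :]        # contract the carried state with the token's slice
    return float(h @ v_right)        # unnormalized score of the whole sequence

print(tt_score([3, 17, 42, 5]))
```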
", Weighted Regularization for Efficient Neural Network Compression,https://openreview.net/forum?id=fFDM6aIZhN,https://openreview.net/pdf?id=fFDM6aIZhN,A weighted regularization method for network compression is proposed and theoretical analysis is given.,"Regularization is widely applied to model complexity reduction and neural network compression. Existing $L_1$ and nuclear norm regularizations can achieve favorable results, but these methods treat all parameters equally and ignore the importance of the parameters. Taking the trained parameters as prior information to construct weights, a weighted regularization method is proposed in this paper. Theoretically, we establish the bounds on the estimation errors for values of the global minimum for a fully connected single hidden layer neural network. Further we prove the estimates generated from the weighted $L_1$ regularization and the weighted nuclear norm regularization can recover the sparsity and the low rank structure of a global minimum of the neural network with a high probability, respectively. The effectiveness of the algorithm is validated by conducting a numerical simulation and experiments with popular neural networks on public datasets from real-world applications.","Weighted Regularization, Neural Network Compression" Stay Moral and Explore: Learn to Behave Morally in Text-based Games,https://openreview.net/forum?id=CtS2Rs_aYk,https://openreview.net/pdf?id=CtS2Rs_aYk,,"Reinforcement learning (RL) in text-based games has developed rapidly and achieved promising results. However, little effort has been expended to design agents that pursue objectives while behaving morally, which is a critical issue in the field of autonomous agents. In this paper, we propose a general framework named Moral Awareness Adaptive Learning (MorAL) that enhances the morality capacity of an agent using a plugin moral-aware learning model. The framework allows the agent to execute task learning and morality learning adaptively. The agent selects trajectories from past experiences during task learning. Meanwhile, the trajectories are used to conduct self-imitation learning with a moral-enhanced objective. In order to achieve the trade-off between morality and task progress, the agent uses the combination of task policy and moral policy for action selection. We evaluate on the Jiminy Cricket benchmark, a set of text-based games with various scenes and dense morality annotations. Our experiments demonstrate that, compared with strong contemporary value alignment approaches, the proposed framework improves task performance while reducing immoral behaviours in various games.", Efficient Discovery of Dynamical Laws in Symbolic Form,https://openreview.net/forum?id=JDuEddUsSb,https://openreview.net/pdf?id=JDuEddUsSb,"Given a time series that is governed by an ordinary differential equation (ODE), our model infers the mathematical expression of the ODE.","We propose a transformer-based sequence-to-sequence model that recovers scalar ordinary differential equations (ODEs) in symbolic form from time-series data of a single observed solution trajectory of the ODE. Our method is efficiently scalable: after one-time pretraining on a large set of ODEs, we can infer the governing laws of a new observed solution in a few forward passes of the model. 
First, we generate and make available a large dataset of more than 3M ODEs together with more than 63M numerical solutions for different initial conditions that may serve as a useful benchmark for future work on machine learning for dynamical systems. Then we show that our model performs better than or on par with existing methods in various test cases in terms of accurate symbolic recovery of the ODE, especially for more complex expressions. Reliably recovering the symbolic form of dynamical laws is important as it allows for further dissemination of the inferred dynamics as well as meaningful modifications for predictions under interventions.","Symbolic, ODE, Transformer" Brain2GAN; Reconstructing perceived faces from the primate brain via StyleGAN3,https://openreview.net/forum?id=hT1S68yza7,https://openreview.net/pdf?id=hT1S68yza7,Reconstruction of perceived faces by neural decoding of cortical responses from the primate brain,Neural coding characterizes the relationship between stimuli and their corresponding neural responses. The use of synthesized yet photorealistic reality by generative adversarial networks (GANs) allows for superior control over these data: the underlying feature representations that account for the semantics in synthesized data are known a priori and their relationship is perfect rather than approximated post-hoc by feature extraction models. We exploit this property in neural decoding of multi-unit activity responses that we recorded from the primate brain upon presentation with synthesized face images in a passive fixation experiment. The face reconstructions we acquired from brain activity were astonishingly similar to the originally perceived face stimuli. This provides strong evidence that the neural face manifold and the disentangled w-latent space conditioned on StyleGAN3 (rather than the z-latent space of arbitrary GANs or other feature representations we have encountered so far) share how they represent the high-level semantics of the high-dimensional space of faces.,"Face reconstruction, generative adversarial networks, neural decoding" Self-Guided Diffusion Models,https://openreview.net/forum?id=Gzmyu-Baq0,https://openreview.net/pdf?id=Gzmyu-Baq0,,"Diffusion models have demonstrated remarkable progress in image generation quality, especially when guidance is used to control the generative process. However, guidance requires a large amount of image-annotation pairs for training and is thus dependent on their availability, correctness and unbiasedness. In this paper, we eliminate the need for such annotation by instead leveraging the flexibility of self-supervision signals to design a framework for $\textit{self-guided}$ diffusion models. By leveraging a feature extraction function and a self-annotation function, our method provides guidance signals at various image granularities: from the level of holistic images to object boxes and even segmentation masks. Our experiments on single-label and multi-label image datasets demonstrate that self-labeled guidance always outperforms diffusion models without guidance and may even surpass guidance based on ground-truth labels, especially on unbalanced data. When equipped with self-supervised box or mask proposals, our method further generates visually diverse yet semantically consistent images, without the need for any class, box, or segment label annotation. Self-guided diffusion is simple, flexible and expected to profit from deployment at scale. 
","diffusion model, self-supervised learning, unsupervised learning" Optimistic Exploration with Learned Features Provably Solves Markov Decision Processes with Neural Dynamics,https://openreview.net/forum?id=9kBCMNb5mc,https://openreview.net/pdf?id=9kBCMNb5mc,We identify a class of Markov decision processes with neural network parameterization and propose an oracle-efficient algorithm whose sample complexity does not depend on the Eluder dimension of the NN class.,"Incorporated with the recent advances in deep learning, deep reinforcement learning (DRL) has achieved tremendous success in empirical study. However, analyzing DRL is still challenging due to the complexity of the neural network class. In this paper, we address such a challenge by analyzing the Markov decision process (MDP) with neural dynamics, which covers several existing models as special cases, including the kernelized nonlinear regulator (KNR) model and the linear MDP. We propose a novel algorithm that designs exploration incentives via learnable representations of the dynamics model by embedding the neural dynamics into a kernel space induced by the system noise. We further establish an upper bound on the sample complexity of the algorithm, which demonstrates the sample efficiency of the algorithm. We highlight that, unlike previous analyses of RL algorithms with function approximation, our bound on the sample complexity does not depend on the Eluder dimension of the neural network class, which is known to be exponentially large (Dong et al., 2021).","Reinforcement Learning, Neural Network, Representation Learning." Removing Backdoors in Pre-trained Models by Regularized Continual Pre-training,https://openreview.net/forum?id=JKuBOuzntQ,https://openreview.net/pdf?id=JKuBOuzntQ,,"Large-scale pre-trained models (PTMs) have become the cornerstones of deep learning. Trained on massive data, general-purpose PTMs allow quick adaptation to a broad range of downstream tasks with superior performance. However, recent researches reveal that PTMs are vulnerable to backdoor attacks even before being fine-tuned on downstream tasks. By associating specific triggers with pre-defined embeddings, the attackers are capable of implanting transferable task-agnostic backdoors in PTMs, and controlling model outputs on any downstream task at inference time. As a result, all downstream applications can be highly risky after the backdoored PTMs are released and deployed. Given such an emergent threat, it is essential to defend PTMs against backdoor attacks and thus build reliable AI systems. Although there are a series of works aiming to erase backdoors on downstream models, as far as we know, no defenses against PTMs have been proposed. Worse still, existing backdoor-repairing defenses require task-specific knowledge (i.e., some clean downstream data), making them unsuitable for backdoored PTMs. To this end, we propose the first task-irrelevant backdoor removal method for PTMs. Motivated by the sparse activation phenomenon, we design a simple and effective backdoor eraser by continually pre-training the backdoored PTMs with a regularization term, guiding the models to ""forget'' backdoors. Our method only needs a few auxiliary task-irrelevant data, e.g., unlabelled plain texts, and thus is practical in typical applications. We conduct extensive experiments across modalities (vision and language) and architectures (CNNs and Transformers) on pre-trained VGG, ViT, BERT and CLIP models. 
The results show that our method can effectively remove backdoors and preserve benign functionalities in PTMs.", Would decentralization hurt generalization?,https://openreview.net/forum?id=_-eJYVfSYH,https://openreview.net/pdf?id=_-eJYVfSYH,D-SGD introduces an implicit regularization that penalizes the learned minima's sharpness.,"Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on massive devices without the control of a central server. Existing theory suggests that decentralization degrades generalizability, which conflicts with experimental results in large-batch settings showing that D-SGD generalizes better than centralized SGD (C-SGD). This work presents new theory that reconciles the conflict between the two perspectives. We prove that D-SGD introduces an implicit regularization that simultaneously penalizes (1) the sharpness of the learned minima and (2) the consensus distance between the consensus model and local models. We then prove that the implicit regularization is amplified in the large-batch settings when the linear scaling rule is applied. We further analyze the escaping efficiency of D-SGD, which suggests that D-SGD favors super-quadratic flat minima. Experiments are in full agreement with our theory. The code will be released publicly. To the best of our knowledge, this is the first work on the implicit regularization and escaping efficiency of D-SGD.","decentralized SGD, flat minima, generalization, implicit regularization, large batch training" Variational Pseudo Labels for Meta Test-time Adaptation,https://openreview.net/forum?id=iOag71mvHI,https://openreview.net/pdf?id=iOag71mvHI,We address test-time adaptation in a probabilistic formulation by introducing variational pseudo labels with meta adaptation. ,"Test-time model adaptation has shown great effectiveness in generalizing over domain shifts. A most successful tactic for test-time adaptation conducts further optimization on the target data using the predictions by the source-trained model. However, due to domain shifts, the source-trained model predictions themselves can be largely inaccurate, which results in a model misspecified to the target data and therefore damages its adaptation ability. In this paper, we address test-time adaptation from a probabilistic perspective. We formulate model adaptation as a probabilistic inference problem, which incorporates the uncertainty into source model predictions by modeling pseudo labels as distributions. Based on the probabilistic formalism, we propose variational pseudo labels that explore the information of neighboring target samples to improve pseudo labels and achieve a model better specified to target data. By a meta-learning paradigm, we train our model by simulating domain shifts and the test-time adaptation procedure. In doing so, our model learns the ability to generate more accurate pseudo-label distributions and to adapt to new domains. Experiments on three widely used datasets demonstrate the effectiveness of our proposal. ","Test-time adaptation, probabilistic framework, variational pseudo label, meta learning" No-Regret Learning in Strongly Monotone Games Converges to a Nash Equilibrium,https://openreview.net/forum?id=Ey2ePmtABj,https://openreview.net/pdf?id=Ey2ePmtABj,,"This paper studies a class of online games involving multiple agents with continuous actions that aim to minimize their local loss functions. An open question in the study of online games is whether no-regret learning for such agents leads to a Nash equilibrium. 
We address this question by providing a sufficient condition for strongly monotone games that guarantees Nash equilibrium convergence in a time-average sense. Furthermore, we show that the class of games for which no-regret learning leads to a Nash equilibrium can be expanded if some further information on the learning algorithm is known. Specifically, we provide relaxed sufficient conditions for first-order and zeroth-order gradient descent algorithms as well as for best response algorithms in which agents choose actions that best respond to other players' actions during the last episode. We analyze the convergence rate for these algorithms and present numerical experiments on three economic market problems to illustrate their performance. ","Online game, no-regret learning, Nash equilibrium convergence, monotone game" Generalized Belief Transport,https://openreview.net/forum?id=QHevLM-OnA,https://openreview.net/pdf?id=QHevLM-OnA,,"Human learners have the ability to adopt appropriate learning approaches depending on constraints such as the prior on the hypothesis and the urgency of the decision. However, existing learning models are typically considered individually rather than in relation to one another. To build agents that have the ability to move between different modes of learning over time, it is important to understand how learning models are related as points in a broader space of possibilities. We introduce a mathematical framework, Generalized Belief Transport (GBT), that unifies and generalizes prior models, including Bayesian inference, cooperative communication and classification, as parameterizations of three learning constraints within Unbalanced Optimal Transport (UOT). We visualize the space of learning models encoded by GBT as a cube which includes classic learning models as special points. We derive critical properties of this parameterized space, including continuity and differentiability, which are the basis for model interpolation, and study the limiting behavior of the parameters, which allows attaching learning models on the boundaries. Moreover, we investigate the long-run behavior of GBT, explore convergence properties of models in GBT mathematically and computationally, and formulate conjectures about general behavior. We conclude with open questions and implications for more unified models of learning.", Adversarial Cheap Talk,https://openreview.net/forum?id=rYgeBuEHlh,https://openreview.net/pdf?id=rYgeBuEHlh,"We can cause an RL agent to fail, succeed, or be manipulatable by deterministically perturbing irrelevant features in its observation during training.","Adversarial attacks in reinforcement learning (RL) often assume highly-privileged access to the victim's parameters, environment, or data. Instead, this paper proposes a novel adversarial setting called a Cheap Talk MDP in which an Adversary can merely append deterministic messages to the Victim's observation, resulting in a minimal range of influence. The Adversary cannot occlude ground truth, influence underlying environment dynamics or reward signals, introduce non-stationarity, add stochasticity, see the Victim's actions, or access their parameters. Additionally, we present a simple meta-learning algorithm called Adversarial Cheap Talk (ACT) to train Adversaries in this setting. We demonstrate that an Adversary trained with ACT can still significantly influence the Victim's training and testing performance, despite the highly constrained setting. 
Affecting train-time performance reveals a new attack vector and provides insight into the success and failure modes of existing RL algorithms. More specifically, we show that an ACT Adversary is capable of harming performance by interfering with the learner’s function approximation, or instead helping the Victim’s performance by outputting useful features. Finally, we show that an ACT Adversary can manipulate messages during train-time to directly and arbitrarily control the Victim at test-time.","Meta-Learning, Reinforcement Learning, Meta-Reinforcement Learning" Multi-stationary point losses for robust model,https://openreview.net/forum?id=LKXAKOxu-T,https://openreview.net/pdf?id=LKXAKOxu-T,"We propose a family of multi-stationary point losses, which improve robustness. ","We identify that cross-entropy (CE) loss does not guarantee a robust boundary for neural networks. The reason is that CE loss has only one asymptotic stationary point. It stops pushing the boundary forward as long as the sample is correctly classified, which leaves the boundary right next to the samples. A robust boundary should be kept in the middle of samples from different classes, thus maximizing the margins from the boundary to the samples. In this paper, we propose a family of new losses, called multi-stationary point (MS) losses, which introduce additional stationary points beyond the asymptotic stationary point. We prove that a robust boundary can be guaranteed by MS loss without losing much accuracy. With MS loss, bigger perturbations are required to generate adversarial examples. We demonstrate that robustness is improved under a variety of adversarial attacks by applying MS loss. Moreover, the robust boundary learned by MS loss also performs well on imbalanced datasets. Finally, we modify other losses into two-stationary-point forms and observe improved model robustness.","Robustness, MS loss, Cross-entropy loss, Multi-stationary point losses, Adversarial attack" Learning Stackelberg Equilibria and Applications to Economic Design Games,https://openreview.net/forum?id=zNVpWmE6JM,https://openreview.net/pdf?id=zNVpWmE6JM,,"We study the use of reinforcement learning to learn the optimal leader's strategy in Stackelberg games. Learning a leader’s strategy has an innate stationarity problem---when optimizing the leader’s strategy, the followers’ strategies might shift. To circumvent this problem, we model the followers via no-regret dynamics that converge to a Bayesian Coarse-Correlated Equilibrium (B-CCE) of the game induced by the leader. We then embed the followers' no-regret dynamics in the leader's learning environment, which allows us to formulate our learning problem as a standard POMDP. We prove that the optimal policy of this POMDP achieves the same utility as the optimal leader's strategy in our Stackelberg game. We solve this POMDP using actor-critic methods, where the critic is given access to the joint information of all the agents. Finally, we show that our methods are able to learn optimal leader strategies in a variety of settings of increasing complexity, including indirect mechanisms where the leader’s strategy is setting up the mechanism’s rules.","Multi-agent Systems, Reinforcement Learning, Economic Design" Learning to Induce Causal Structure ,https://openreview.net/forum?id=hp_RwhKDJ5,https://openreview.net/pdf?id=hp_RwhKDJ5,,"The fundamental challenge in causal induction is to infer the underlying graph structure given observational and/or interventional data.
Most existing causal induction algorithms operate by generating candidate graphs and evaluating them using either score-based methods (including continuous optimization) or independence tests. In our work, we instead treat the inference process as a black box and design a neural network architecture that learns the mapping from both observational and interventional data to graph structures via supervised training on synthetic graphs. The learned model generalizes to new synthetic graphs, is robust to train-test distribution shifts, and achieves state-of-the-art performance on naturalistic graphs at low sample complexity.","causality, deep learning" Attention Based Models for Cell Type Classification on Single-Cell RNA-Seq Data,https://openreview.net/forum?id=QFm186CbBp,https://openreview.net/pdf?id=QFm186CbBp,We propose two novel models through representation and attention learning for the cell type classification task on single-cell RNA-seq data.,"Cell type classification serves as one of the most fundamental analyses in bioinformatics. It helps discover new cell types, recognize tumor cells in the cancer microenvironment, and facilitate downstream tasks such as trajectory inference. Single-cell RNA-sequencing (scRNA-seq) technology can profile the whole transcriptome of different cells, thus providing invaluable data for cell type classification. Existing cell type classification methods can be mainly categorized into statistical models and neural network models. The statistical models either make hypotheses on the gene expression distribution which may not be consistent with the real data, or heavily rely on prior knowledge such as marker genes for specific cell types. By contrast, the neural networks are more robust and flexible, but it is hard to interpret the biological meanings hidden behind a mass of model parameters. Recently, the attention mechanism has been widely applied in diverse fields due to the good interpretability of the attention weights. In this paper, we examine the effectiveness and interpretability of the attention mechanism by proposing two novel models for the cell type classification task. The first model classifies cells by a capsule attention network (CAN) that performs attention on the capsule features extracted for cells. To align the features with genes, the second model first factorizes the scRNA-seq matrix to obtain the representation vectors for all genes and cells, and then performs the attention operation on the cell and gene vectors. We name it Cell-Gene Representation Attention network (CGRAN). Experiments show that our attention-based models achieve higher accuracy in cell type classification compared to existing methods on diverse datasets. Moreover, the key genes picked by their high attention scores in different cell types perfectly match the acknowledged marker genes.","Single-cell RNA-seq data cell type classification, attention mechanism, learning representations" Personalized federated composite learning with forward-backward envelopes,https://openreview.net/forum?id=rM6CpkZLPB,https://openreview.net/pdf?id=rM6CpkZLPB,,"Federated composite optimization (FCO) is an optimization problem in federated learning whose loss function contains a non-smooth regularizer. It arises naturally in applications of federated learning (FL) that involve requirements such as sparsity, low-rankness, and monotonicity.
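One classical smooth surrogate for such composite objectives is the forward-backward envelope (FBE), which the next method builds on. A minimal numerical sketch with a toy quadratic smooth part f and g = λ‖·‖₁ (soft-thresholding prox); all problem data are illustrative, and the FBE shares the minimizers of f + g while being smooth.

    import numpy as np

    Q = np.array([[3.0, 1.0], [1.0, 2.0]]); b = np.array([1.0, -2.0]); lam = 0.1
    f  = lambda x: 0.5 * x @ Q @ x - b @ x                                 # smooth part
    df = lambda x: Q @ x - b
    prox = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - lam * t, 0.0)  # prox of t*lam*||.||_1

    def fbe(x, gamma=0.1):
        # Forward-backward envelope: f(x) - (gamma/2)*||df(x)||^2 plus the
        # Moreau envelope of g evaluated at the forward step x - gamma*df(x).
        z = x - gamma * df(x)
        p = prox(z, gamma)
        moreau_g = lam * np.abs(p).sum() + np.sum((p - z) ** 2) / (2 * gamma)
        return f(x) - (gamma / 2) * np.sum(df(x) ** 2) + moreau_g

    print(fbe(np.zeros(2)))  # differentiable everywhere, unlike f + lam*||.||_1 itself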
In this study, we propose a personalization method, called pFedFBE, for FCO by using the forward-backward envelope (FBE) as clients’ loss functions. With FBE, we not only decouple the personalized model from the global model, but also allow personalized models to be smooth and easily optimized. In spite of the nonsmoothness of FCO, pFedFBE enjoys the same convergence complexity results as FedAvg for FL with unconstrained smooth objectives. Numerical experiments demonstrate the effectiveness of our proposed method.","Federated composite optimization, personalization, forward-backward envelopes" Tackling Imbalanced Class in Federated Learning via Class Distribution Estimation,https://openreview.net/forum?id=-qjmJkacGv,https://openreview.net/pdf?id=-qjmJkacGv,,"Federated Learning (FL) has become a fast-rising machine learning method due to its applicability in large-scale distributed systems and its privacy-preserving property. However, in real-world applications, the presence of the class imbalance issue, especially the mismatch between local and global class distributions, greatly degrades the performance of FL. Moreover, due to the privacy constraint, the class distribution information of clients cannot be accessed directly. To tackle the class imbalance issue under the FL setting, a novel algorithm, FedRE, is proposed in this paper. We propose a new class distribution estimation method for the FedRE algorithm, which requires no extra client data information and thus raises no privacy concern. Both experimental results and theoretical analysis are provided to support the validity of our distribution estimation method. The proposed algorithm is verified with several experiments, including different datasets in the presence of class imbalance and local-global distribution mismatch. The experimental results show that FedRE is effective and that it outperforms other related methods in terms of both overall and minority class classification accuracy.","Federated Learning, class imbalance, class distribution estimation" Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning,https://openreview.net/forum?id=AHvFDPi-FA,https://openreview.net/pdf?id=AHvFDPi-FA,Diffusion models serve as expressive policies to boost offline RL performance. ,"Offline reinforcement learning (RL), which aims to learn an optimal policy using a previously collected static dataset, is an important paradigm of RL. Standard RL methods often perform poorly in this regime due to function approximation errors on out-of-distribution actions. While a variety of regularization methods have been proposed to mitigate this issue, they are often constrained by policy classes with limited expressiveness that can lead to highly suboptimal solutions. In this paper, we propose representing the policy as a diffusion model, a recent class of highly-expressive deep generative models. We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy. In our approach, we learn an action-value function and we add a term maximizing action-values into the training loss of the conditional diffusion model, which results in a loss that seeks optimal actions that are near the behavior policy. We show that the expressiveness of the diffusion model-based policy, and the coupling of behavior cloning and policy improvement under the diffusion model, both contribute to the outstanding performance of Diffusion-QL.
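A compressed sketch of the loss structure just described, assuming toy network sizes and a simplified DDPM sampler; the point is only that the behavior-cloning diffusion loss and a Q-maximization term are summed, with the Q-term backpropagating through the reverse diffusion sampler. Hyperparameters and architectures are placeholders, not the paper's.

    import torch, torch.nn as nn

    S, A, T = 4, 2, 10  # state dim, action dim, diffusion steps (toy sizes)
    eps_net = nn.Sequential(nn.Linear(S + A + 1, 64), nn.ReLU(), nn.Linear(64, A))  # noise predictor
    q_net = nn.Sequential(nn.Linear(S + A, 64), nn.ReLU(), nn.Linear(64, 1))
    betas = torch.linspace(1e-4, 0.02, T); alphas = 1 - betas; abar = torch.cumprod(alphas, 0)

    def bc_diffusion_loss(s, a):
        # Standard DDPM denoising objective: clones the behavior policy.
        t = torch.randint(0, T, (s.shape[0],))
        noise = torch.randn_like(a)
        a_t = abar[t].sqrt().unsqueeze(1) * a + (1 - abar[t]).sqrt().unsqueeze(1) * noise
        pred = eps_net(torch.cat([s, a_t, t.float().unsqueeze(1) / T], 1))
        return ((pred - noise) ** 2).mean()

    def sample_action(s):
        # Simplified reverse process; gradients flow back into eps_net.
        a = torch.randn(s.shape[0], A)
        for t in reversed(range(T)):
            tt = torch.full((s.shape[0], 1), t / T)
            eps = eps_net(torch.cat([s, a, tt], 1))
            a = (a - betas[t] / (1 - abar[t]).sqrt() * eps) / alphas[t].sqrt()
            if t > 0:
                a = a + betas[t].sqrt() * torch.randn_like(a)
        return a

    s, a = torch.randn(32, S), torch.randn(32, A).clamp(-1, 1)
    eta = 1.0  # trades off cloning the behavior policy against seeking high Q-values
    loss = bc_diffusion_loss(s, a) - eta * q_net(torch.cat([s, sample_action(s)], 1)).mean()
    loss.backward()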
We illustrate the superiority of our method compared to prior works in a simple 2D bandit example with a multimodal behavior policy. We then show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.","offline RL, diffusion models, behavior cloning, policy regularization, Q-learning" Sublinear Algorithms for Kernel Matrices via Kernel Density Estimation,https://openreview.net/forum?id=74A-FDAyiL,https://openreview.net/pdf?id=74A-FDAyiL,We give a framework for using recently developed tools for kernel density estimation to solve downstream kernel problems in sub-quadratic time.,"Kernel matrices, as well as weighted graphs represented by them, are ubiquitous objects in machine learning, statistics and other related fields. The main drawback of using kernel methods (learning and inference using kernel matrices) is efficiency -- given $n$ input points, most kernel-based algorithms need to materialize the full $n \times n$ kernel matrix before performing any subsequent computation, thus incurring $\Omega(n^2)$ runtime. Breaking this quadratic barrier for various problems has, therefore, been a subject of extensive research efforts. We break the quadratic barrier and obtain \emph{sublinear} time algorithms for several fundamental linear-algebraic and graph processing primitives, including approximating the top eigenvalue and eigenvector, spectral sparsification, solving linear systems, local clustering, low-rank approximation, arboricity estimation and counting weighted triangles. We build on the recently developed Kernel Density Estimation framework, which (after preprocessing in time subquadratic in $n$) can return estimates of row/column sums of the kernel matrix. In particular, we develop efficient reductions from \emph{weighted vertex} and \emph{weighted edge sampling} on kernel graphs, \emph{simulating random walks} on kernel graphs, and \emph{importance sampling} on matrices to Kernel Density Estimation and show that we can generate samples from these distributions in \emph{sublinear} (in the support of the distribution) time. Our reductions are the central ingredient in each of our applications and we believe they may be of independent interest. We empirically demonstrate the efficacy of our algorithms on low-rank approximation (LRA) and spectral sparsification, where we observe a $\textbf{9x}$ decrease in the number of kernel evaluations over baselines for LRA and a $\textbf{41x}$ reduction in the graph size for spectral sparsification.","kernel density estimation, sublinear time algorithms" CASA: Bridging the Gap between Policy Improvement and Policy Evaluation with Conflict Averse Policy Iteration,https://openreview.net/forum?id=-H7FPruqEX,https://openreview.net/pdf?id=-H7FPruqEX,This paper proposes a method to eliminate gradient conflicts between policy improvement and policy evaluation.,"We study the problem of model-free reinforcement learning, which is often solved following the principle of Generalized Policy Iteration (GPI). While GPI is typically an interplay between policy evaluation and policy improvement, most conventional model-free methods with function approximation assume the independence of GPI steps, despite the inherent connections between them. In this paper, we present a method that attempts to eliminate the inconsistency between the policy evaluation step and the policy improvement step, leading to a conflict averse GPI solution with gradient-based functional approximation.
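The abstract does not spell out the update rule, but a common way to make two objectives "conflict averse" is gradient surgery: if the policy-evaluation and policy-improvement gradients point in conflicting directions (negative inner product), project one onto the normal plane of the other before combining. The sketch below illustrates that generic idea (in the spirit of PCGrad), not CASA's exact mechanism.

    import numpy as np

    def conflict_averse_update(g_eval, g_improve):
        # If the two GPI gradients conflict, remove from g_improve its
        # component along g_eval before summing (illustrative rule only).
        dot = g_eval @ g_improve
        if dot < 0:
            g_improve = g_improve - dot / (g_eval @ g_eval) * g_eval
        return g_eval + g_improve

    g = conflict_averse_update(np.array([1.0, 0.0]), np.array([-0.5, 1.0]))
    print(g)  # [1., 1.]: the conflicting component has been projected out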
Our method is key to balancing exploitation and exploration between policy-based and value-based methods, and is applicable to existing policy-based and value-based methods. We conduct extensive experiments to study the theoretical properties of our method and demonstrate its effectiveness on the Atari 200M benchmark.","reinforcement learning, policy iteration" Achieve Near-Optimal Individual Regret & Low Communications in Multi-Agent Bandits,https://openreview.net/forum?id=QTXKTXJKIh,https://openreview.net/pdf?id=QTXKTXJKIh,A near-optimal algorithm for both individual and group regrets and only requiring O(\log (\log T)) communication times,"Cooperative multi-agent multi-armed bandits (\CMAB) study how distributed agents cooperatively play the same multi-armed bandit game. Most existing \CMAB works have focused on maximizing the group performance of all agents---the accumulation of all agents' individual performance (i.e., individual reward). However, in many applications, the performance of the system is more sensitive to the ``bad'' agent---the agent with the worst individual performance. For example, in a drone swarm, a ``bad'' agent may crash into other drones and severely degrade the system performance. In that case, the key to the learning algorithm design is to coordinate computational and communication resources among agents so as to optimize the individual learning performance of the ``bad'' agent. In \CMAB, maximizing the group performance is equivalent to minimizing the group regret of all agents, and optimizing the performance of the ``bad'' agent corresponds to minimizing the maximum (worst) individual regret among agents. Minimizing the maximum individual regret was largely ignored in prior literature, and currently, there is little work on how to minimize this objective with a low communication overhead. In this paper, we propose an algorithm that is near-optimal on both individual and group regrets; in addition, we propose a novel communication module in the algorithm, which only needs \(O(\log (\log T))\) communication times, where \(T\) is the number of decision rounds. We also conduct simulations to illustrate the advantage of our algorithm by comparing it to other known baselines.","Multi-agent multi-armed bandits, individual regret, communication" Online Boundary-Free Continual Learning by Scheduled Data Prior,https://openreview.net/forum?id=qco4ekz2Epm,https://openreview.net/pdf?id=qco4ekz2Epm,We propose a new continual learning setup without explicit task boundary and a method to address it.,"The typical continual learning setup assumes that the dataset is split into multiple discrete tasks. We argue that this is less realistic, as streamed data in the real world has no notion of task boundaries. Here, we take a step forward to investigate more realistic online continual learning – learning a continuously changing data distribution without explicit task boundaries, which we call the boundary-free setup. As there is no clear boundary between tasks, it is not obvious when and which past information should be preserved as a better remedy for the stability-plasticity dilemma. To this end, we propose a scheduled transfer of previously learned knowledge. We further propose a data-driven balancing between the knowledge in the past and the present in the learning objective. Moreover, since it is not straightforward to use the previously proposed forgetting measure without task boundaries, we further propose a novel forgetting measure based on information theory that can capture forgetting.
We empirically evaluate our method on a Gaussian data stream and its periodic extension, which models the periodic data distributions frequently observed in real-life data, as well as on the conventional disjoint task split. Our method outperforms prior art by large margins in various setups, using four popular benchmark datasets – CIFAR-10, CIFAR-100, TinyImageNet and ImageNet.","Continual learning, data prior, boundary-free" HypeR: Multitask Hyper-Prompted Training Enables Large-Scale Retrieval Generalization,https://openreview.net/forum?id=kUf4BcWXGJr,https://openreview.net/pdf?id=kUf4BcWXGJr,A multitask hyper-prompted training mechanism that enables a neural retriever to dynamically process different types of queries with different hyper-prompts and transfer learned knowledge across different domains and tasks. ,"Recently, large-scale text retrieval has made impressive progress, facilitating both information retrieval and downstream knowledge-intensive tasks (e.g., open-domain QA and dialogue). With a moderate amount of data, a neural text retriever can outperform traditional methods such as BM25 by a large margin. However, when applied to out-of-domain data, the performance of a neural retriever degrades considerably. Therefore, how to enable a retriever to perform more robustly across different domains or tasks and even show strong zero-shot transfer ability is critical for building scalable IR systems. To this end, we propose HypeR, a hyper-prompted training mechanism to enable uniform retrieval across tasks of different domains. Specifically, our approach jointly trains the query encoder with a shared prompt-based parameter pool and a prompt synthesizer that dynamically composes a hyper-prompt for encoding each query from different tasks or domains. Besides, to avoid mode collapse of the prompt attention distribution for different queries, we design a contrastive prompt regularization that promotes the mode of prompt attention to be aligned and uniform. Through multi-task hyper-prompted training, our retriever can master the ability to dynamically represent different types of queries and transfer knowledge across different domains and tasks. Extensive experiments show our model attains better retrieval performance across different tasks and better zero-shot transfer ability compared with various previous methods.","Uniformed Large-Scale Retrieval, Multi-Task hyper-prompted training, Retrieval Generalization" HiT-DVAE: Human Motion Generation via Hierarchical Transformer Dynamical VAE,https://openreview.net/forum?id=gy1YvA9mh6q,https://openreview.net/pdf?id=gy1YvA9mh6q,,"In this paper, we address the problem of 3D human motion generation, which aims at learning a model to generate plausible and diverse future sequences of 3D human poses from an observed one. Current state-of-the-art solutions propose injecting a single random latent vector into a deterministic motion prediction framework. The stochasticity in the generative process is thus modeled at the whole-sequence level, which is inconsistent with the inherent time-dependent uncertainty of human motion (e.g. people can jump or walk after getting up from a chair). To overcome this limitation, we propose the Hierarchical Transformer Dynamical Variational Autoencoder (HiT-DVAE), a deep generative model with sequential latent variables that can efficiently learn the stochastic dynamics of human motion. The proposed model learns an expressive time-varying latent space that encodes diverse and realistic human motions.
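The key structural idea, sequential latent variables rather than a single sequence-level code, can be sketched in a few lines. Below is a generic dynamical-VAE generator (a GRU stand-in for the paper's hierarchical transformer; all sizes are toy assumptions): a fresh latent z_t is drawn at every timestep, so stochasticity enters throughout the motion rather than once.

    import torch, torch.nn as nn

    D, Z, H = 48, 16, 64  # pose dim (e.g., 16 joints x 3), latent dim, hidden dim (toy)
    prior = nn.Linear(H, 2 * Z)                                       # p(z_t | h_t)
    decoder = nn.Sequential(nn.Linear(H + Z, H), nn.ReLU(), nn.Linear(H, D))
    rnn = nn.GRUCell(D + Z, H)

    def generate(x0, steps=25):
        h, x, poses = torch.zeros(1, H), x0, [x0]
        for _ in range(steps):
            mu, logvar = prior(h).chunk(2, dim=1)
            z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)      # new latent each step
            x = decoder(torch.cat([h, z], 1))                         # next pose
            h = rnn(torch.cat([x, z], 1), h)
            poses.append(x)
        return torch.stack(poses, 1)

    future = generate(torch.randn(1, D))  # shape (1, 26, 48): time-varying stochastic futures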
A thorough evaluation on the HumanEva-I and Human3.6M datasets using various metrics shows that HiT-DVAE performs better than current state-of-the-art methods. Our code will be released upon publication.","generative models, human motion generation" Efficient Learning of Rationalizable Equilibria in General-Sum Games,https://openreview.net/forum?id=HjOo2k8lhFl,https://openreview.net/pdf?id=HjOo2k8lhFl,We develop provably efficient algorithms for finding approximate CE and CCE that are also rationalizable.,"A natural goal in multi-agent learning is to learn \emph{rationalizable} behavior, where players learn to avoid any Iteratively Dominated Action (IDA). However, standard no-regret based equilibria-finding algorithms could require exponentially many samples to find such rationalizable strategies. In this paper, we first propose a simple yet sample-efficient algorithm for finding a rationalizable action profile in multi-player general-sum games under bandit feedback, which substantially improves over the results of Wu et al. We further develop algorithms with the first efficient guarantees for learning rationalizable Coarse Correlated Equilibria (CCE) and Correlated Equilibria (CE). Our algorithms incorporate several novel techniques to guarantee the elimination of IDA and no (swap-)regret simultaneously, including a correlated exploration scheme and adaptive learning rates, which may be of independent interest. We complement our results with a sample complexity lower bound showing the sharpness of our guarantees.","Game Theory, Online Learning, Rationalizability" A Higher Precision Algorithm for Computing the $1$-Wasserstein Distance,https://openreview.net/forum?id=aMXD8gqsIiC,https://openreview.net/pdf?id=aMXD8gqsIiC,,"We consider the problem of computing the $1$-Wasserstein distance $\mathcal{W}(\mu,\nu)$ between two $d$-dimensional discrete distributions $\mu$ and $\nu$ that are within the unit hypercube. Let $A$ (resp. $B$) be the support of $\mu$ (resp. $\nu$). There are several algorithms that estimate $\mathcal{W}(\mu,\nu)$ within an additive factor of $\varepsilon$. However, when $\mathcal{W}(\mu,\nu)$ is small, the additive error $\varepsilon$ dominates, leading to noisy results. Consider any additive approximation algorithm with execution time $T(n,\varepsilon)$. We propose an algorithm that runs in $O(T(n,\varepsilon/d) \log n)$ time and boosts the accuracy of estimating $\mathcal{W}(\mu,\nu)$ to an additive factor of $\min\{\varepsilon, (d\log_{\sqrt{d}/\varepsilon} n)\mathcal{W}(\mu,\nu)\}$. For the special case where every point in $A \cup B$ has a mass of $1/n$ (also called the Euclidean Bipartite Matching problem) we describe an algorithm to boost the accuracy of any additive approximation algorithm to $\min\{\varepsilon, (d\log\log n)\mathcal{W}(\mu,\nu)\}$ in $O(T(n, \varepsilon/d)\log\log n)$ time. ","Wasserstein Distance, Earth Movers Distance, Bipartite Matching" Energy-Based Test Sample Adaptation for Domain Generalization,https://openreview.net/forum?id=3dnrKbeVatv,https://openreview.net/pdf?id=3dnrKbeVatv,We propose a discriminative energy-based model to adapt target samples to the source domain distributions for domain generalization.,"In this paper, we propose energy-based sample adaptation at test time for domain generalization. Whereas previous works adapt their models to target domains, we adapt the unseen target samples to source-trained models.
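A minimal sketch of this sample-side adaptation, assuming a scalar energy network already trained on source data; the target input itself is updated by the Langevin-style energy minimization that the abstract describes next. Step sizes, noise scale, and the toy energy net are illustrative.

    import torch, torch.nn as nn

    energy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))  # assumed source-trained

    def adapt(x, steps=20, step_size=0.01, noise=0.005):
        x = x.clone().requires_grad_(True)
        for _ in range(steps):
            g, = torch.autograd.grad(energy(x).sum(), x)
            with torch.no_grad():                       # SGLD-style update on the sample itself
                x = x - 0.5 * step_size * g + noise * torch.randn_like(x)
            x.requires_grad_(True)
        return x.detach()

    adapted = adapt(torch.randn(5, 8))  # target samples nudged toward low source energy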
To this end, we design a discriminative energy-based model, which is trained on source domains to jointly model the conditional distribution for classification and the data distribution for sample adaptation. The model is optimized to simultaneously learn a classifier and an energy function. To adapt target samples to source distributions, we iteratively update the samples by energy minimization with stochastic gradient Langevin dynamics. Moreover, to preserve the categorical information in the sample during adaptation, we introduce a categorical latent variable into the energy-based model. The latent variable is learned from the original sample before adaptation by variational inference and fixed as a condition to guide the sample update. Experiments on six benchmarks for classification of images and microblog threads demonstrate the effectiveness of our proposal.","domain generalization, energy-based model, test-time sample adaptation, variational inference" Representation Power of Graph Convolutions : Neural Tangent Kernel Analysis,https://openreview.net/forum?id=jgUqPzuMiJQ,https://openreview.net/pdf?id=jgUqPzuMiJQ,"Graph NTK shows that row normalized graph convolution preserves the underlying class structure, and skip connections retain the class structure at infinite depth.","The fundamental principle of Graph Neural Networks (GNNs) is to exploit the structural information of the data by aggregating the neighboring nodes using a ‘graph convolution’. Therefore, understanding its influence on the network performance is crucial. Convolutions based on the graph Laplacian have emerged as the dominant choice, with the symmetric normalization of the adjacency matrix $A$, defined as $D^{-1/2}AD^{-1/2}$, being the most widely adopted one, where $D$ is the degree matrix. However, some empirical studies show that row normalization $D^{-1}A$ outperforms it in node classification. Despite the widespread use of GNNs, there is no rigorous theoretical study on the representation power of these convolution operators that could explain this behavior. In this work, we analyze the influence of graph convolutions theoretically using the Graph Neural Tangent Kernel in a semi-supervised node classification setting. Under a Degree Corrected Stochastic Block Model, we prove that: (i) row normalization preserves the underlying class structure better than other graph convolutions; (ii) performance degrades with network depth due to over-smoothing, but the loss in class information is the slowest in row normalization; (iii) skip connections retain the class information even at infinite depth, thereby eliminating over-smoothing. We finally validate our theoretical findings on real datasets.","Graph Neural Networks, Neural Tangent Kernels, Node classification, Stochastic Block Model" Bidirectional Language Models Are Also Few-shot Learners,https://openreview.net/forum?id=wCFB37bzud4,https://openreview.net/pdf?id=wCFB37bzud4,"We present Sequential Autoregressive Prompting, a technique that enables prompting of bidirectional models demonstrating prompt-based learning is an emergent property of a broader class of language models, rather than of only unidirectional models.","Large language models such as GPT-3 (Brown et al., 2020) can perform arbitrary tasks without undergoing fine-tuning after being prompted with only a few labeled examples.
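This few-shot paradigm is simple to illustrate: a handful of labeled examples are concatenated into a text prompt and the model completes the final line. The task and examples below are purely illustrative.

    examples = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
    prompt = "Translate English to French.\n"
    for en, fr in examples:
        prompt += f"{en} => {fr}\n"          # a few labeled demonstrations
    prompt += "plush giraffe =>"             # the model's completion is the prediction
    print(prompt)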
An arbitrary task can be reformulated as a natural language prompt, and a language model can be asked to generate the completion, indirectly performing the task in a paradigm known as prompt-based learning. To date, emergent prompt-based learning capabilities have mainly been demonstrated for unidirectional language models. However, bidirectional language models pre-trained on denoising objectives such as masked language modeling produce stronger learned representations for transfer learning. This motivates the possibility of prompting bidirectional models, but their pre-training objectives have made them largely incompatible with the existing prompting paradigm. We present SAP (Sequential Autoregressive Prompting), a technique that enables the prompting of bidirectional models. Utilizing the machine translation task as a case study, we prompt the bidirectional mT5 model (Xue et al., 2021) with SAP and demonstrate its few-shot and zero-shot translations outperform the few-shot translations of unidirectional models like GPT-3 and XGLM (Lin et al., 2021), despite mT5 having approximately 50% fewer parameters. We further show SAP is effective on question answering and summarization. For the first time, our results demonstrate prompt-based learning is an emergent property of a broader class of language models, rather than only unidirectional models.","prompting, prompt-based learning, mt5, t5, machine translation, llm, large language models" Revisiting adapters with adversarial training,https://openreview.net/forum?id=HPdxC1THU8T,https://openreview.net/pdf?id=HPdxC1THU8T,,"While adversarial training is generally used as a defense mechanism, recent works show that it can also act as a regularizer. By co-training a neural network on clean and adversarial inputs, it is possible to improve classification accuracy on the clean, non-adversarial inputs. We demonstrate that, contrary to previous findings, it is not necessary to separate batch statistics when co-training on clean and adversarial inputs, and that it is sufficient to use adapters with few domain-specific parameters for each type of input. We establish that using the classification token of a Vision Transformer (ViT) as an adapter is enough to match the classification performance of dual normalization layers, while using significantly fewer additional parameters. First, we improve upon the top-1 accuracy of a non-adversarially trained ViT-B16 model by +1.12% on ImageNet (reaching 83.76% top-1 accuracy). Second, and more importantly, we show that training with adapters enables model soups through linear combinations of the clean and adversarial tokens. These model soups, which we call adversarial model soups, allow us to trade off between clean and robust accuracy without sacrificing efficiency. Finally, we show that we can easily adapt the resulting models in the face of distribution shifts. Our ViT-B16 obtains top-1 accuracies on ImageNet variants that are on average +4.00% better than those obtained with Masked Autoencoders.","adapters, adversarial, robustness, soup" Human-AI Coordination via Human-Regularized Search and Learning,https://openreview.net/forum?id=qqcIHdvjyJr,https://openreview.net/pdf?id=qqcIHdvjyJr,"a new method for human-AI collaboration based on human regularized search, imitation learning and RL, tested with large scale human experiments.","We consider the problem of making AI agents that collaborate well with humans in partially observable fully cooperative environments given datasets of human behavior.
Inspired by piKL, a human-data-regularized search method that improves upon a behavioral cloning policy without diverging far away from it, we develop a three-step algorithm that achieves strong performance in coordinating with real humans in the Hanabi benchmark. We first use a regularized search algorithm and behavioral cloning to produce a better human model that captures diverse skill levels. Then, we integrate the policy regularization idea into reinforcement learning to train a human-like best response to the human model. Finally, we apply regularized search on top of the best response policy at test time to handle out-of-distribution challenges when playing with humans. We evaluate our method in two large-scale experiments with humans. First, we show that our method outperforms experts when playing with a group of diverse human players in ad-hoc teams. Second, we show that our method beats a vanilla best-response-to-behavioral-cloning baseline by having experts play repeatedly with the two agents.","human-ai collaboration, multi-agent, search, deep reinforcement learning" Solving Math Word Problems with Process-based and Outcome-based Feedback,https://openreview.net/forum?id=MND1kmmNy0O,https://openreview.net/pdf?id=MND1kmmNy0O,"Both process- and outcome-based feedback with all the tricks achieve similar final-answer error rates and SOTA results, but generating accurate reasoning steps requires either process-based supervision, or a reward model that emulates it.","Recent work has shown that prompting language models to generate reasoning steps improves performance on many reasoning tasks. When moving beyond prompting, this raises the question of how we should supervise the finetuning of such models: outcome-based approaches which supervise the final result, or process-based approaches which supervise the reasoning process itself? Differences between these approaches might naturally be expected not just in final-answer errors but also in reasoning errors, which can be difficult to detect and are problematic in many real-world domains such as education. We run the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task, GSM8K. We find that pure outcome-based supervision produces similar final-answer error rates with less label supervision. However, for correct reasoning steps we find it necessary to use process-based supervision or supervision from learned reward models that emulate process-based feedback. In total, we improve the previous best results from 16.8% $\rightarrow$ 12.7% final-answer error and 14.0% $\rightarrow$ 3.4% reasoning error among final-answer-correct solutions.","language models, reasoning, reward models" EPISODE: Episodic Gradient Clipping with Periodic Resampled Corrections for Federated Learning with Heterogeneous Data,https://openreview.net/forum?id=ytZIYmztET,https://openreview.net/pdf?id=ytZIYmztET,"We introduce EPISODE, an algorithm for federated learning with heterogeneous data under the relaxed smoothness setting for training deep neural networks, and provide state-of-the-art computational and communication complexity guarantees.","Gradient clipping is an important technique for deep neural networks with exploding gradients, such as recurrent neural networks.
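For reference, the standard global-norm clipping step this line of work builds on, g ← g · min(1, c/‖g‖), applied to a toy recurrent model; the model, data, and threshold are placeholders.

    import torch
    import torch.nn as nn

    model = nn.GRU(8, 16)  # recurrent nets are the classic exploding-gradient case
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    out, _ = model(torch.randn(20, 1, 8))
    out.pow(2).mean().backward()
    # rescale the global gradient norm to at most max_norm before the step
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()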
Recent studies have shown that the loss functions of these networks do not satisfy the conventional smoothness condition, but instead satisfy a relaxed smoothness condition, i.e., the Lipschitz constant of the gradient scales linearly in terms of the gradient norm. Motivated by this observation, several gradient clipping algorithms have been developed for nonconvex and relaxed-smooth functions. However, the existing algorithms only apply to the single-machine or multiple-machine setting with homogeneous data across machines. It remains unclear how to design provably efficient gradient clipping algorithms in the general Federated Learning (FL) setting with heterogeneous data and limited communication rounds. In this paper, we design EPISODE, the very first algorithm to solve FL problems with heterogeneous data in the nonconvex and relaxed smoothness setting. The key ingredients of the algorithm are two new techniques called \textit{episodic gradient clipping} and \textit{periodic resampled corrections}. At the beginning of each round, EPISODE resamples stochastic gradients from each client and obtains the global averaged gradient, which is used to (1) determine whether to apply gradient clipping for the entire round and (2) construct local gradient corrections for each client. Notably, our algorithm and analysis provide a unified framework for both homogeneous and heterogeneous data under any noise level of the stochastic gradient, and it achieves state-of-the-art complexity results. In particular, we prove that EPISODE can achieve linear speedup in the number of machines, and it requires significantly fewer communication rounds. Experiments on several heterogeneous datasets, including text classification and image classification, show the superior performance of EPISODE over several strong baselines in FL.","Non-convex optimization, federated learning, heterogeneous data, gradient clipping, relaxed smoothness" Memory-Efficient Reinforcement Learning with Priority based on Surprise and On-policyness,https://openreview.net/forum?id=xkSlKCYyV_,https://openreview.net/pdf?id=xkSlKCYyV_,We propose a method to prune experiences in the replay buffer using a metric based on surprise and on-policyness of the experience and use it to save memory consumption in off-policy reinforcement learning.,"In off-policy reinforcement learning, an agent collects transition data (a.k.a. experience tuples) from the environment and stores them in a replay buffer for the incoming parameter updates. Storing those tuples consumes a large amount of memory when the environment observations are given as images. Large memory consumption is especially problematic when reinforcement learning methods are applied in scenarios where computational resources are limited. In this paper, we introduce a method to prune relatively unimportant experience tuples by a simple metric that estimates the importance of experiences, reducing the overall memory consumption of the buffer. To measure the importance of experiences, we use $\textit{surprise}$ and $\textit{on-policyness}$. Surprise is quantified by the information gain the model can obtain from the experiences, and on-policyness ensures that they are relevant to the current policy.
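A sketch of how such a pruning score might be assembled, with |TD error| as a stand-in for surprise and the importance ratio π/μ as a stand-in for on-policyness; these proxies are illustrative, not the paper's exact quantities.

    import numpy as np

    def priority(td_error, pi_prob, mu_prob, eps=1e-6):
        surprise = np.abs(td_error)                            # proxy for information gain
        on_policyness = np.minimum(pi_prob / (mu_prob + eps), 1.0)
        return surprise * on_policyness

    scores = priority(np.array([0.5, 2.0, 0.1]),
                      np.array([0.3, 0.6, 0.05]),   # current-policy probabilities
                      np.array([0.4, 0.5, 0.5]))    # behavior probabilities at collection time
    keep = np.argsort(scores)[1:]  # prune the lowest-priority experience tuple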
In our experiments, we empirically show that our method can significantly reduce the memory consumption of the replay buffer without decreasing performance in vision-based environments.","replay buffer, reinforcement learning, memory efficiency" Uncovering Directions of Instability via Quadratic Approximation of Deep Neural Loss in Reinforcement Learning,https://openreview.net/forum?id=QL85H5Mkip,https://openreview.net/pdf?id=QL85H5Mkip,,"Learning in MDPs with highly complex state representations is currently possible due to multiple advancements in reinforcement learning algorithm design. However, this increase in complexity, and furthermore the increase in the dimensions of the observation, comes at the cost of non-robustness that can be taken advantage of (i.e. moving along worst-case directions in the observation space). To solve this policy instability problem, we propose a novel method to ascertain the presence of these non-robust directions via quadratic approximation of the deep neural policy loss. Our method provides a theoretical basis for the fundamental cut-off between stable observations and non-robust observations. Furthermore, our technique is computationally efficient, and does not depend on the methods used to produce the worst-case directions. We conduct extensive experiments in the Arcade Learning Environment with several different non-robust alteration techniques. Most significantly, we demonstrate the effectiveness of our approach even in the setting where alterations are explicitly optimized to circumvent our proposed method.", Marginal Probability Explanation: A Saliency Map with Closed-loop Validation,https://openreview.net/forum?id=h_Ma6BSi9Q,https://openreview.net/pdf?id=h_Ma6BSi9Q,We propose a saliency map using marginal probability for each input dimension whose meaningfulness can be closed-loop validated.,"In this work, we propose a saliency map with pixel-level resolution, called Marginal Probability Explanation (MPE), for a black-box classifier. MPE visualizes the contribution of each input dimension to the classifier by calculating marginal probabilities when only one dimension is considered. Marginal probabilities are estimated via Monte Carlo sampling from the training dataset. Based on MPE, we propose typical samples, i.e. samples that maximize their marginal probability in every input dimension. We verify that our proposed MPE is meaningful through closed-loop validation experiments, where replacing the few pixels with the lowest marginal probability by the corresponding pixels of the typical sample ``corrects'' the classification. Based on these experiments, we find that deep neural networks probably still use pixel-level logic for image classification. Moreover, the critical pixels are not necessarily related to the subject. ","MPE, saliency map, closed-loop validation, typical sample" A Theory of Dynamic Benchmarks,https://openreview.net/forum?id=i8L9qoeZOS,https://openreview.net/pdf?id=i8L9qoeZOS,We propose a formal model of dynamic benchmarks illuminating their benefits and limitations.,"Dynamic benchmarks interweave model fitting and data collection in an attempt to mitigate the limitations of static benchmarks. In contrast to an extensive theoretical and empirical study of the static setting, the dynamic setting lags behind due to limited empirical studies and no apparent theoretical foundation to date. Responding to this deficit, we initiate a theoretical study of dynamic benchmarking.
We examine two realizations, one capturing current practice and the other modeling more complex settings. In the first model, where data collection and model fitting alternate sequentially, we prove that model performance improves initially but can stall after only three rounds. Label noise arising from, for instance, annotator disagreement leads to even stronger negative results. Our second model generalizes the first to the case where data collection and model fitting have a hierarchical dependency structure. We show that this design guarantees strictly more progress than the first, albeit at a significant increase in complexity. We support our theoretical analysis by simulating dynamic benchmarks on two popular datasets. These results illuminate the benefits and practical limitations of dynamic benchmarking, providing both a theoretical foundation and a causal explanation for observed bottlenecks in empirical work. ","Dynamic Benchmarks, Adversarial Data Collection" On the Trade-Off between Actionable Explanations and the Right to be Forgotten,https://openreview.net/forum?id=HWt4BBZjVW,https://openreview.net/pdf?id=HWt4BBZjVW,"We analyze the tradeoff between actionable explanations and the right to be forgotten, and provide algorithms to find a critical subset of training data points, which, when removed, would lead to a maximum invalidation of recourses.","As machine learning (ML) models are increasingly being deployed in high-stakes applications, policymakers have suggested tighter data protection regulations (e.g., GDPR, CCPA). One key principle is the “right to be forgotten”, which gives users the right to have their data deleted. Another key principle is the right to an actionable explanation, also known as algorithmic recourse, allowing users to reverse unfavorable decisions. To date, it is unknown whether these two principles can be operationalized simultaneously. Therefore, we introduce and study the problem of recourse invalidation in the context of data deletion requests. More specifically, we theoretically and empirically analyze the behavior of popular state-of-the-art algorithms and demonstrate that the recourses generated by these algorithms are likely to be invalidated if a small number of data deletion requests (e.g., 1 or 2) warrant updates of the predictive model. For the setting of linear models and overparameterized neural networks – studied through the lens of neural tangent kernels (NTKs) – we suggest a framework to identify a minimal subset of critical training points which, when removed, maximizes the fraction of invalidated recourses. Using our framework, we empirically show that the removal of as little as 2 data instances from the training set can invalidate up to 95 percent of all recourses output by popular state-of-the-art algorithms.
Thus, our work raises fundamental questions about the compatibility of “the right to an actionable explanation” in the context of the “right to be forgotten”, while also providing constructive insights on the determining factors of recourse robustness.","Counterfactual Explanations, Algorithmic Recourse, Explainability, Interpretability, Transparency" Learning to Cooperate and Communicate Over Imperfect Channels,https://openreview.net/forum?id=LemVOgJ4yP,https://openreview.net/pdf?id=LemVOgJ4yP,We investigate communication in multi-agent reinforcement learning and propose an adaptive message size selection that enables agents to use an imperfect communication channel more efficiently.,"Information exchange in multi-agent systems improves the cooperation among agents, especially in partially observable settings. This can be seen as part of a joint problem in which the agents simultaneously learn how to communicate and how to solve a shared task. In the real world, communication is often carried out over imperfect channels, and this requires the agents to deal with uncertainty due to potential information loss. In this paper, we consider a cooperative multi-agent system where the agents act and exchange information in a decentralized manner using a limited and unreliable channel. To cope with such channel constraints, we propose a novel communication approach based on independent Q-learning. Our method allows agents to dynamically adapt how much information to share by sending messages of different sizes, depending on their local observations and the channel properties. In addition to this message size selection, agents learn to encode and decode messages to improve their policies. We show that our approach outperforms approaches without adaptive capabilities and discuss its limitations in different environments.","multi-agent systems, deep reinforcement learning, emergent communication, imperfect communication channels" A GENERAL SCENARIO-AGNOSTIC REINFORCEMENT LEARNING FOR TRAFFIC SIGNAL CONTROL,https://openreview.net/forum?id=RKMbC8Tslx,https://openreview.net/pdf?id=RKMbC8Tslx,,"Reinforcement learning has been recently adopted to revolutionize and optimize traditional traffic signal control systems. Existing methods are based either on a single scenario or on multiple independent scenarios, where each scenario has a separate simulation environment with a predefined road network topology and traffic signal settings. These models implement training and testing in the same scenario, thus being strictly tied to the specific setting and heavily sacrificing model generalization. While a few recent models can be trained on multiple scenarios, they require a huge amount of manual labor to label the intersection structure, hindering the model’s generalization. In this work, we aim at a general framework that could eliminate heavy labeling and model a variety of scenarios simultaneously. To this end, we propose a GEneral Scenario-Agnostic (GESA) reinforcement learning framework for traffic signal control with: (1) A general plug-in module to map all different intersections into a unified structure, freeing us from the heavy manual labor to specify the structure of intersections; (2) A unified state and action space to keep the model input and output consistently structured; (3) A large-scale co-training with multiple scenarios, leading to a generic traffic signal control algorithm.
In experiments, we demonstrate that our algorithm is the first that can be co-trained on seven different scenarios without manual annotation, obtaining 17.20% higher rewards than benchmarks. When dealing with a new scenario, our model can still achieve 10.36% higher rewards. The code and scenarios will be released upon acceptance.","reinforcement learning, model generalizability, traffic signal control, smart mobility" Uncertainty-aware off policy learning,https://openreview.net/forum?id=MZFDUB40NJ,https://openreview.net/pdf?id=MZFDUB40NJ,"We consider the estimation uncertainty of the logging policy, and propose a new estimator for improved off-policy learning by controlling the effect of inaccurate estimation of the logging policy.","Off-policy learning, referring to the procedure of policy optimization with access only to logged feedback data, has shown importance in various real-world applications, such as search engines, recommender systems, etc. While the ground-truth logging policy, which generates the logged data, is usually unknown, previous work directly takes its estimated value in off-policy learning, resulting in a biased estimator. This estimator has both high bias and high variance on samples with small and inaccurately estimated logging probabilities. In this work, we explicitly model the uncertainty in the estimated logging policy and propose a novel \underline{U}ncertainty-aware \underline{I}nverse \underline{P}ropensity \underline{S}core estimator (UIPS) for improved off-policy learning. Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.","off-policy learning, uncertainty" Renamer: A Transformer Architecture In-variant to Variable Renaming,https://openreview.net/forum?id=7hYCGFacpz,https://openreview.net/pdf?id=7hYCGFacpz,,"Modeling tasks often take inputs from languages including programming languages and natural language. Many such tasks involve learning functions which are invariant to certain types of input transformations. In this work we consider a specific class of invariance: semantics-preserving variable renaming. We first show that transformer networks trained on such tasks do not always mirror the invariance of the underlying function. To address this, we propose Renamer, a transformer architecture which is invariant to semantics-preserving variable renaming. Renamer improves over a vanilla transformer by a 24.79% to 52.80% reduction in error on a case study on learning a surrogate of a large-scale CPU simulator. Furthermore, the invariant network does not experience the same sensitivity to variable renaming, and its error remains constant when evaluated on a variable-renamed version of the test set. Finally, the invariant network is more efficient to train, and matches the best error of the vanilla network with a 25.15% to 60.00% reduction in training epochs.", Learning What and Where - Unsupervised Disentangling Location and Identity Tracking,https://openreview.net/forum?id=NeDc-Ak-H_,https://openreview.net/pdf?id=NeDc-Ak-H_,"Loci: an unsupervised disentangled LOCation and Identity tracking system, which excels on the CATER and related object tracking challenges featuring emergent object permanence and stable entity disentanglement via fully unsupervised learning.","Our brain can almost effortlessly decompose visual data streams into background and salient objects.
Moreover, it can anticipate object motion and interactions, which are crucial abilities for conceptual planning and reasoning. Recent object reasoning datasets, such as CATER, have revealed fundamental shortcomings of current vision-based AI systems, particularly when targeting explicit object encodings, object permanence, and object reasoning. Here we introduce a self-supervised LOCation and Identity tracking system (Loci), which excels on the CATER tracking challenge. Inspired by the dorsal-ventral pathways in the brain, Loci tackles the binding problem by processing separate, slot-wise encodings of `what' and `where'. Loci's predictive coding-like processing encourages active error minimization, such that individual slots tend to encode individual objects. Interactions between objects and object dynamics are processed in the disentangled latent space. Truncated backpropagation through time combined with forward eligibility accumulation significantly speeds up learning and improves memory efficiency. Besides exhibiting superior performance in current benchmarks, Loci effectively extracts objects from video streams and separates them into location and Gestalt components. We believe that this separation offers an encoding that will facilitate effective planning and reasoning on conceptual levels.","object permanence, CATER, unsupervised learning, binding problem" BALTO: efficient tensor program optimization with diversity-based active learning,https://openreview.net/forum?id=CN223OXgyb5,https://openreview.net/pdf?id=CN223OXgyb5,,"Tensor program optimization (TPO) based on pre-trained models can effectively reduce the computing time of deep neural networks. However, training of such models is prohibitively expensive, as it highly depends on a large-scale dataset and thus requires tremendous time-consuming performance measurements (more than 1 million) on target platforms. In this paper, we propose BALTO, a fast TPO approach with biased-diversity-based active learning, aiming at much lower training costs under similar optimization accuracy. The key insight is that the random sampling of existing approaches suffers from a heavy redundancy of low-performance programs, which incurs tremendous duplicated time-consuming measurements. Inspired by this, BALTO removes such redundancy by introducing active learning (AL) to TPO for a much lower training cost. However, applying AL in a brute-force way in BALTO can lead to an overestimation problem. To address this, we further propose a biased-diversity-based sampling scheme specially designed for BALTO. We compare BALTO against TenSet on $6$ typical hardware platforms over $2$ learning models. Experimental results show that, on average, BALTO only requires 5% of the total performance measurements of TenSet to achieve the same or higher model accuracy. Moreover, the optimized tensor programs even outperform those of TenSet by 1.06% due to higher model accuracy.", RoCourseNet: Distributionally Robust Training of a Prediction Aware Recourse Model,https://openreview.net/forum?id=zufPou5foW,https://openreview.net/pdf?id=zufPou5foW,,"Counterfactual (CF) explanations for machine learning (ML) models are preferred by end-users, as they explain the predictions of ML models by providing a recourse (or contrastive) case to individuals who are adversely impacted by predicted outcomes. Existing CF explanation methods generate recourses under the assumption that the underlying target ML model remains stationary over time.
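For context, a standard gradient-based recourse search of the kind such methods build on (in the spirit of Wachter et al.): push the model's output toward the favorable class while penalizing distance from the original input. The classifier, weights, and step counts are illustrative stand-ins.

    import torch, torch.nn as nn

    clf = nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 1))  # assumed trained model

    def recourse(x, steps=200, lr=0.05, dist_weight=0.1):
        x0, x = x.clone(), x.clone().requires_grad_(True)
        opt = torch.optim.Adam([x], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            # flip the prediction to the favorable class while staying close to x0
            loss = nn.functional.softplus(-clf(x)).mean() + dist_weight * (x - x0).abs().sum()
            loss.backward(); opt.step()
        return x.detach()

    cf = recourse(torch.randn(1, 5))  # the recommended (counterfactual) input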
However, due to commonly occurring distributional shifts in training data, ML models constantly get updated in practice, which might render previously generated recourses invalid and diminish end-users' trust in our algorithmic framework. To address this problem, we propose RoCourseNet, a training framework that jointly optimizes for predictions and recourses that are robust to future data shifts. We have three main contributions: (i) We propose a novel \emph{virtual data shift (VDS)} algorithm to find worst-case shifted ML models by explicitly considering the worst-case data shift in the training dataset. (ii) We leverage adversarial training to solve a novel tri-level optimization problem inside RoCourseNet, which simultaneously generates predictions and corresponding robust recourses. (iii) Finally, we evaluate RoCourseNet's performance on three real-world datasets and show that RoCourseNet outperforms state-of-the-art baselines by $\sim$10\% in generating robust CF explanations. ","Counterfactual Explanation, Algorithmic Recourse, Adversarial ML, Robustness" Inducing Meaningful Units from Character Sequences with Dynamic Capacity Slot Attention,https://openreview.net/forum?id=PzbYN5d76a,https://openreview.net/pdf?id=PzbYN5d76a,We propose an unsupervised method to learn the abstract meaning-bearing units in a sequence of characters with Dynamic Capacity Slot Attention. ,"Characters do not convey meaning, but sequences of characters do. We propose an unsupervised distributional method to learn the abstract meaning-bearing units in a sequence of characters. Rather than segmenting the sequence, our Dynamic Capacity Slot Attention model discovers continuous representations of the \textit{objects} in the sequence, extending an architecture for object discovery in images. We train our model on different languages and evaluate the quality of the obtained representations with forward and reverse probing classifiers. These experiments show that our model succeeds in discovering units which are similar to those proposed previously in form, content and level of abstraction, and which show promise for capturing meaningful information at a higher level of abstraction.","Unsupervised representation learning, Morphology induction, Deep learning" Enhanced Spatio-Temporal Image Encoding for Online Human Activity Recognition,https://openreview.net/forum?id=DP_u25iQWg,https://openreview.net/pdf?id=DP_u25iQWg,"In this work, we propose to improve the spatio-temporal image encoding of 3D skeleton data, by studying the concept of motion energy, which focuses mainly on the joints that are most solicited during an action.","Human Activity Recognition (HAR) based on sensor data can be seen as a time series classification problem where the challenge is to handle both spatial and temporal dependencies, while focusing on the most relevant data variations. It can be done using 3D skeleton data extracted from an RGB+D camera. In this work, we propose to improve the spatio-temporal image encoding of 3D skeletons captured from a Kinect sensor by studying the concept of motion energy, which focuses mainly on the skeleton joints that are most solicited during an action. This encoding allows us to achieve better discrimination in the detection of online activities by focusing on the most significant parts of the actions. The article presents this new encoding and its application for HAR using a deep learning model trained on the encoded 3D skeleton data.
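One plausible formalization of the motion-energy idea (our illustration; the paper's exact formula may differ): accumulate each joint's displacement across frames and use the normalized totals to re-weight joints before the image encoding, so the most solicited joints dominate.

    import numpy as np

    seq = np.random.rand(100, 25, 3)                     # T frames x 25 Kinect joints x (x, y, z)
    disp = np.linalg.norm(np.diff(seq, axis=0), axis=2)  # per-frame joint displacement
    energy = disp.sum(axis=0)                            # motion energy per joint
    weights = energy / energy.sum()
    weighted_seq = seq * weights[None, :, None]          # emphasize the most solicited joints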
For this purpose, we propose to investigate the knowledge transferability of several pre-trained CNNs provided by Keras. The article shows a significant improvement in learning accuracy relative to the state of the art.","3D Skeleton Data, Spatio-temporal Image Encoding, Motion Energy, Online Action Recognition, Human Activity Recognition, Deep learning" In-context Reinforcement Learning with Algorithm Distillation,https://openreview.net/forum?id=hy0a5MMPUv,https://openreview.net/pdf?id=hy0a5MMPUv,"We present Algorithm Distillation, a method that outputs an in-context RL algorithm by treating learning to reinforcement learn as a sequential prediction problem.","We propose Algorithm Distillation (AD), a method for distilling reinforcement learning (RL) algorithms into neural networks by modeling their training histories with a causal sequence model. Algorithm Distillation treats learning to reinforcement learn as an across-episode sequential prediction problem. A dataset of learning histories is generated by a source RL algorithm, and then a causal transformer is trained by autoregressively predicting actions given their preceding learning histories as context. Unlike sequential policy prediction architectures that distill post-learning or expert sequences, AD is able to improve its policy entirely in-context without updating its network parameters. We demonstrate that AD can reinforcement learn in-context in a variety of environments with sparse rewards, combinatorial task structure, and pixel-based observations, and find that AD learns a more data-efficient RL algorithm than the one that generated the source data.","Reinforcement Learning, Transformers, Learning to Learn, Large Language Models" BiasPAD: A Bias-Progressive Auto-Debiasing Framework,https://openreview.net/forum?id=oXHHj0_NN9,https://openreview.net/pdf?id=oXHHj0_NN9,,"While large pre-trained language models have made great strides in natural language understanding benchmarks, recent studies have found that models rely heavily on superficial or shortcut features to make predictions. In this paper, we study how to progressively and automatically detect and filter the biased data to train a robust debiased model for NLU tasks. Rather than focusing on human-predefined biases or biases captured by a limited-capacity bias-only model, we introduce a new debiasing framework, called Bias-Progressive Auto-Debiasing (BiasPAD), based on two observations: i) the higher the proportion of bias in the training data, the more biased the model will be, and ii) a more biased model has higher confidence in predicting the bias. The framework progressively trains a bias-only model by using the most biased samples detected in the previous epoch, which ensures a more biased model and leads to a robust debiased model. Extensive experiments demonstrate the effectiveness of the proposed framework on several challenging NLU datasets; on HANS, we achieve a 5% accuracy improvement.", On the Importance of Diversity in Data-free Model Stealing,https://openreview.net/forum?id=Lrxaf7IPVT,https://openreview.net/pdf?id=Lrxaf7IPVT,,"Machine Learning as a Service (MLaaS) allows users to query a machine learning model through an API, which provides an opportunity for users to enjoy the benefits brought by a high-performance model trained on valuable data. This interface has fueled the growth of machine-learning-based applications, while, on the other hand, introducing an attack surface for model stealing attacks.
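A minimal sketch of the Algorithm Distillation setup described in the entry above: learning histories from a source RL algorithm are flattened into one across-episode sequence, and a causal model is trained to predict actions autoregressively. The `model(obs, act, rew)` interface is an assumption for illustration, not the authors' code.

```python
import torch

def flatten_history(episodes):
    """episodes: list of dicts with 'obs' [T, D] float, 'act' [T] long, and
    'rew' [T] float tensors, ordered from early (poor) to late (good)
    episodes of the source RL algorithm."""
    obs = torch.cat([e["obs"] for e in episodes])   # [N, D]
    act = torch.cat([e["act"] for e in episodes])   # [N]
    rew = torch.cat([e["rew"] for e in episodes])   # [N]
    return obs, act, rew

def ad_loss(model, episodes):
    """Cross-entropy on actions given the across-episode context that
    precedes them; `model` is any causal sequence model over (o, a, r)
    tokens that returns per-step action logits."""
    obs, act, rew = flatten_history(episodes)
    logits = model(obs, act, rew)                   # [N, num_actions]
    return torch.nn.functional.cross_entropy(logits, act)
```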
Existing model stealing attacks have relaxed their attack assumptions to the data-free setting, while retaining their effectiveness. However, these methods are complex and consist of several components, which obscures the core factors on which the attack really depends. In this paper, we revisit the model stealing problem from a diversity perspective and demonstrate that keeping the generated data samples more diverse across all the classes is the critical point for improving the attack performance. Based on this conjecture, we provide a simplified attack framework. We empirically verify our conjecture by evaluating the effectiveness of our attack, and experimental results show that our approach is able to achieve comparable or even better performance compared with the state-of-the-art method. Furthermore, benefiting from the absence of redundant components, our method demonstrates its advantages in attack efficiency and query budget.", Computing all Optimal Partial Transports,https://openreview.net/forum?id=gwcQajoXNF,https://openreview.net/pdf?id=gwcQajoXNF,,"We consider the classical version of the optimal partial transport problem. Let $\mu$ (with a mass of $U$) and $\nu$ (with a mass of $S$) be two mass distributions with $S \le U$. For a parameter $\alpha \in [0,S]$, consider the minimum-cost transport plan $\sigma_\alpha$ that transports a mass of $\alpha$ from $\nu$ to $\mu$. An \emph{OT-profile} captures the behavior of the cost of $\sigma_\alpha$ as $\alpha$ varies from $0$ to $S$. The OT-profile has been used for studying mathematical properties of optimal partial transports (see~\cite{figalli2010optimal}). In this paper, we consider the question of computing the OT-profile. When $\mu$ and $\nu$ are discrete mass distributions, we show that the OT-profile is a piecewise-linear non-decreasing convex function. Letting $K$ be the complexity\footnote{The combinatorial complexity of such a piecewise-linear function is simply the number of line segments it contains.} of this function, we present an exact algorithm to compute the OT-profile in $\tilde{O}(n^2K)$ time. Given $\delta > 0$, we also show that the algorithm by (Lahn et al., NeurIPS 2019) can be used to $\delta$-approximate the OT-profile of $\mu$ and $\nu$ with another piecewise-linear function in $O(n^2/\delta + n/\delta^2)$ time. The complexity of this approximation is just $O(1/\delta)$. An OT-profile is arguably more valuable than the OT-cost itself and can be used within applications. For instance, under a reasonable assumption on outliers, we prove that the first derivative of the OT-profile sees a noticeable rise before any of the mass from outliers is transported. By using the OT-profile, we get improved prediction accuracy for an outlier detection experiment as well as estimation of class priors within PU-Learning experiments, both of which are conducted on real datasets.","Optimal Transport, Combinatorial Optimization" Towards Federated Learning of Deep Graph Neural Networks,https://openreview.net/forum?id=OMwyBv1UBh,https://openreview.net/pdf?id=OMwyBv1UBh,We study the problem of graph representation learning under a federated setting and propose a novel framework for federated learning of deep graph neural networks via reconstructing neighborhood information of nodes.,"Graph neural networks (GNNs) learn node representations by recursively aggregating neighborhood information on graph data.
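A minimal sketch of the neighborhood aggregation that the federated deep-GNN entry above builds on: each layer averages neighbor embeddings and applies a learned transform, so a k-layer GNN has a k-hop receptive field (which is why depth raises cross-client communication costs in the federated setting). Layer design here is a generic illustrative choice.

```python
import torch
import torch.nn as nn

class MeanAggLayer(nn.Module):
    """One round of message passing: concatenate each node's embedding with
    the mean of its neighbors' embeddings, then apply a linear + ReLU."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(2 * d_in, d_out)

    def forward(self, h, adj):
        # h: [num_nodes, d_in]; adj: dense [num_nodes, num_nodes] float 0/1.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neigh = adj @ h / deg                  # mean over neighbors
        return torch.relu(self.lin(torch.cat([h, neigh], dim=-1)))
```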
However, in the federated setting, data samples (nodes) located in different clients may be connected to each other, leading to severe information loss for the training method. Existing federated graph learning frameworks solve such a problem by generating missing neighbors or sending information across clients directly. None are suitable for training deep GNNs, which require a larger receptive field and thus incur higher communication costs. In this work, we introduce a novel framework named $Fed^2GNN$ for federated graph learning of deep GNNs via reconstructing neighborhood information of nodes. Specifically, we design a graph structure named rooted tree. The node embedding obtained by encoding on the rooted tree is the same as that obtained by encoding on the induced subgraph surrounding the node, which allows us to reconstruct the neighborhood information by building the rooted tree of the node. An encoder-decoder framework is then proposed, wherein we first encode missing neighbor information and then decode it to build the rooted tree. Extensive experiments on real-world network datasets show the effectiveness of our framework for training deep GNNs while also achieving better performance for training shallow GNN models.","federated learning, graph representation learning, deep graph neural networks" CounterNet: End-to-End Training of Prediction Aware Counterfactual Explanations,https://openreview.net/forum?id=PocqkbIelt,https://openreview.net/pdf?id=PocqkbIelt,,"Counterfactual (or CF) explanations are a type of local explanation for Machine Learning (ML) model predictions, which offer a contrastive case as an explanation by finding the smallest changes (in feature space) to the input data point that will lead to a different prediction by the ML model. Existing CF explanation techniques suffer from two major limitations: (i) all of them are post-hoc methods designed for use with proprietary ML models --- as a result, their procedure for generating CF explanations is uninformed by the training of the ML model, which leads to misalignment between model predictions and explanations; and (ii) most of them rely on solving separate time-intensive optimization problems to find CF explanations for each input data point (which negatively impacts their runtime). This work makes a novel departure from the prevalent post-hoc paradigm (of generating CF explanations) by presenting CounterNet, an end-to-end learning framework which integrates predictive model training and the generation of counterfactual (CF) explanations into a single pipeline. We adopt a block-wise coordinate descent procedure which helps in effectively training CounterNet's network. Our extensive experiments on multiple real-world datasets show that CounterNet generates high-quality predictions, and consistently achieves 100% CF validity and very low proximity scores (thereby achieving a well-balanced cost-invalidity trade-off) for any new input instance, and runs 3X faster than existing state-of-the-art baselines. ","Counterfactual Explanation, Algorithmic Recourse, Explainable AI, Interpretability" SmilesFormer: Language Model for Molecular Design,https://openreview.net/forum?id=VBQZkYu22G,https://openreview.net/pdf?id=VBQZkYu22G,"We developed a transformer-based language model for SMILES strings, able to generate and efficiently optimize molecules for drug discovery.","The objective of drug discovery is to find novel compounds with desirable chemical properties.
Generative models have been utilized to sample molecules at the intersection of multiple property constraints. In this paper we pose molecular design as a language modeling problem where the model implicitly learns the vocabulary and composition of valid molecules and is hence able to generate new molecules of interest. We present SmilesFormer, a Transformer-based model which is able to encode molecules, molecule fragments, and fragment compositions as latent variables, which are in turn decoded to stochastically generate novel molecules. This is achieved by fragmenting the molecules into smaller combinatorial groups, then learning the mapping between the input fragments and valid SMILES sequences. The model is able to optimize molecular properties through a stochastic latent space traversal technique. This technique systematically searches the encoded latent space to find latent vectors that produce molecules meeting the multi-property objective. The model was validated through various de novo molecular design tasks, achieving state-of-the-art performance compared to previous methods. Furthermore, we used the proposed method to demonstrate a drug rediscovery pipeline for Donepezil, a known Acetylcholinesterase Inhibitor.","De novo drug design, Language model, Molecule Optimization" Continuously Parameterized Mixture Models,https://openreview.net/forum?id=Ch4e4wk7Ew,https://openreview.net/pdf?id=Ch4e4wk7Ew,We parameterize mixtures of factor analyzers by a neural ordinary differential equation and train with a smooth curriculum to learn an interpretable likelihood model superior to standard mixture results.,"Mixture models are universal approximators of smooth densities but are difficult to utilize in complicated datasets due to restrictions on typically available modes and challenges with initializations. We show that by continuously parameterizing a mixture of factor analyzers using a learned ordinary differential equation, we can improve the fit of mixture models over direct methods. Once trained, the mixture components can be extracted and the neural ODE can be discarded, leaving us with an effective, but low-resource model. We additionally explore the use of a training curriculum from an easy-to-model latent space extracted from a normalizing flow to the more complex input space and show that the smooth curriculum helps to stabilize and improve results with and without the continuous parameterization. Finally, we introduce a hierarchical version of the model to enable more flexible, robust classification and clustering, and show substantial improvements against traditional parameterizations of GMMs.","mixture models, normalizing flows, ordinary differential equations, clustering, interpretable learning" AE-FLOW: Autoencoders with Normalizing Flows for Medical Images Anomaly Detection ,https://openreview.net/forum?id=9OmCr1q54Z,https://openreview.net/pdf?id=9OmCr1q54Z,We propose a normalizing flow based autoencoder for medical anomaly detection and it outperformed the other approaches by a large margin.,"Anomaly detection from medical images is an important task for clinical screening and diagnosis. In general, a large dataset of normal images is available while only a few abnormal images can be collected in clinical practice. By mimicking the diagnosis process of radiologists, we attempt to tackle this problem by learning a tractable distribution of normal images and identifying anomalies by comparing the original image with the reconstructed normal image.
More specifically, we propose a normalizing flow based autoencoder for an efficient and tractable representation of normal medical images. The anomaly score consists of the likelihood derived from the normalizing flow and the reconstruction error of the autoencoder, which allows us to identify abnormality and provides interpretability at both the image and pixel levels. Experimental evaluation on two medical image datasets showed that the proposed model outperformed the other approaches by a large margin, which validated the effectiveness and robustness of the proposed method.","Anomaly Detection, Normalizing Flow, Auto-encoder." Learning a Domain-Agnostic Policy through Adversarial Representation Matching for Cross-Domain Policy Transfer,https://openreview.net/forum?id=VqrEwH4WwI-,https://openreview.net/pdf?id=VqrEwH4WwI-,"We obtain a domain-invariant feature space by behavioral cloning and adversarial training using unpaired trajectories of proxy tasks, and use it for zero-shot cross-domain transfer.","The low transferability of learned policies is one of the most critical problems limiting the applicability of learning-based solutions to decision-making tasks. In this paper, we present a way to align latent representations of states and actions between different domains by optimizing an adversarial objective. We train two models, a policy and a domain discriminator, with unpaired trajectories of proxy tasks through behavioral cloning as well as adversarial training. After the latent representations are aligned between domains, a domain-agnostic part of the policy trained with any method in the source domain can be immediately transferred to the target domain in a zero-shot manner. We empirically show that our simple approach achieves comparable performance to the latest methods in zero-shot cross-domain transfer. We also observe that our method performs better than other approaches in transfer between domains with different complexities, whereas other methods fail catastrophically.","imitation learning, domain transfer, zero-shot transfer" A Self-Attention Ansatz for Ab-initio Quantum Chemistry,https://openreview.net/forum?id=xveTeHVlF7j,https://openreview.net/pdf?id=xveTeHVlF7j,We use a novel self-attention neural network to make quantum chemistry calculations from first principles much more accurate.,"We present a novel neural network architecture using self-attention, the Wavefunction Transformer (PsiFormer), which can be used as an approximation (or ""Ansatz"") for solving the many-electron Schrödinger equation, the fundamental equation for quantum chemistry and material science. This equation can be solved *from first principles*, requiring no external training data. In recent years, deep neural networks like the FermiNet and PauliNet have been used to significantly improve the accuracy of these first-principle calculations, but they lack an attention-like mechanism for gating interactions between electrons. Here we show that the PsiFormer can be used as a drop-in replacement for these other neural networks, often dramatically improving the accuracy of the calculations. On larger molecules especially, the ground state energy can be improved by dozens of kcal/mol, a qualitative leap over previous methods.
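A minimal sketch, under assumed module interfaces, of the two-part anomaly score in the AE-FLOW entry above: a flow likelihood term plus an autoencoder reconstruction term. The `flow` and `autoencoder` objects and the weighting `alpha` are illustrative assumptions, not the paper's implementation.

```python
import torch

def anomaly_score(x, flow, autoencoder, alpha=0.5):
    # Likelihood term: negative log-likelihood of x under a normalizing flow
    # trained on normal images only (assumed `log_prob` interface).
    nll = -flow.log_prob(x)                               # [batch]
    # Reconstruction term: per-image L2 error of the autoencoder.
    recon = ((autoencoder(x) - x) ** 2).flatten(1).mean(dim=1)
    # Higher score = more anomalous; alpha trades off the two terms.
    return alpha * nll + (1 - alpha) * recon
```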
This demonstrates that self-attention networks can learn complex quantum mechanical correlations between electrons, and are a promising route to reaching unprecedented accuracy in chemical calculations on larger systems.","Machine learning for science, attention, Transformers, Monte Carlo, MCMC, self-generative learning, quantum physics, chemistry, machine learning for physics, machine learning for molecules, machine learning for chemistry" Probabilistically Robust Recourse: Navigating the Trade-offs between Costs and Robustness in Algorithmic Recourse,https://openreview.net/forum?id=sC-PmTsiTB,https://openreview.net/pdf?id=sC-PmTsiTB,We propose a novel framework to generate probabilistically robust algorithmic recourse,"As machine learning models are increasingly being employed to make consequential decisions in real-world settings, it becomes critical to ensure that individuals who are adversely impacted (e.g., loan denied) by the predictions of these models are provided with a means for recourse. While several approaches have been proposed to construct recourses for affected individuals, the recourses output by these methods either achieve low costs (i.e., ease-of-implementation) or robustness to small perturbations (i.e., noisy implementations of recourses), but not both due to the inherent trade-offs between the recourse costs and robustness. Furthermore, prior approaches do not provide end users with any agency over navigating the aforementioned trade-offs. In this work, we address the above challenges by proposing the first algorithmic framework which enables users to effectively manage the recourse cost vs. robustness trade-offs. More specifically, our framework Probabilistically ROBust rEcourse (PROBE) lets users choose the probability with which a recourse could get invalidated (recourse invalidation rate) if small changes are made to the recourse, i.e., the recourse is implemented somewhat noisily. To this end, we propose a novel objective function which simultaneously minimizes the gap between the achieved (resulting) and desired recourse invalidation rates, minimizes recourse costs, and also ensures that the resulting recourse achieves a positive model prediction. We develop novel theoretical results to characterize the recourse invalidation rates corresponding to any given instance w.r.t. different classes of underlying models (e.g., linear models, tree based models etc.), and leverage these results to efficiently optimize the proposed objective. Experimental evaluation with multiple real world datasets demonstrates the efficacy of the proposed framework.","Counterfactual explanations, algorithmic recourse, explainability" How robust is unsupervised representation learning to distribution shift?,https://openreview.net/forum?id=LiXDW7CF94J,https://openreview.net/pdf?id=LiXDW7CF94J,Representations learned from self-supervised learning and auto-encoder based algorithms are surprisingly robust to distribution shift.,"The robustness of machine learning algorithms to distribution shift is primarily discussed in the context of supervised learning (SL). As such, there is a lack of insight into the robustness of the representations learned from unsupervised methods, such as self-supervised learning (SSL) and auto-encoder based algorithms (AE), to distribution shift. We posit that the input-driven objectives of unsupervised algorithms lead to representations that are more robust to distribution shift than the target-driven objective of SL.
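A minimal sketch, not the paper's exact objective, of the PROBE-style trade-off described in the entry above: penalize overshoot of a target recourse invalidation rate, plus recourse cost, while keeping a positive prediction. The smooth indicator, the Monte-Carlo estimate, and all interfaces are illustrative assumptions.

```python
import torch

def probe_style_loss(model, x, x_cf, target_rate, sigma=0.1, n_samples=64):
    # Smooth Monte-Carlo estimate of the recourse invalidation rate: the
    # expected fraction of noisy implementations of x_cf whose predicted
    # probability falls back below 0.5 (sigmoid replaces the hard indicator
    # so the estimate stays differentiable).
    noise = sigma * torch.randn(n_samples, *x_cf.shape)
    probs = model(x_cf + noise).squeeze()
    invalidation = torch.sigmoid((0.5 - probs) / 0.05).mean()
    gap = torch.relu(invalidation - target_rate)    # penalize overshoot only
    cost = torch.norm(x_cf - x, p=1)                # ease of implementation
    validity = torch.relu(0.5 - model(x_cf).squeeze())  # want model(x_cf) > 0.5
    return gap + cost + validity
```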
We verify this by extensively evaluating the performance of SSL and AE on both synthetic and realistic distribution shift datasets. Following observations that the linear layer used for classification itself can be susceptible to spurious correlations, we evaluate the representations using a linear head trained on a small amount of out-of-distribution (OOD) data, to isolate the robustness of the learned representations from that of the linear head. We also develop “controllable” versions of existing realistic domain generalisation datasets with adjustable degrees of distribution shifts. This allows us to study the robustness of different learning algorithms under versatile yet realistic distribution shift conditions. Our experiments show that representations learned from unsupervised learning algorithms generalise better than SL under a wide variety of extreme as well as realistic distribution shifts.","distribution shift, OOD generalisation, spurious correlation, simplicity bias, SSL, unsupervised learning, auto-encoder" Autoregressive Generative Modeling with Noise Conditional Maximum Likelihood Estimation,https://openreview.net/forum?id=r0JMLPgGXS_,https://openreview.net/pdf?id=r0JMLPgGXS_,"We propose a noise-robust modification for maximum likelihood estimation. Under this framework, we improve density estimation and significantly enhance the sample quality of images generated by autoregressive models.","We introduce a simple modification to the standard maximum likelihood estimation (MLE) framework. Rather than maximizing a single unconditional likelihood of the data under the model, we maximize a family of \textit{noise conditional} likelihoods consisting of the data perturbed by a continuum of noise levels. We find that models trained this way are more robust to noise, obtain higher test likelihoods, and generate higher quality images. They can also be sampled from via a novel score-based sampling scheme which combats the classical \textit{covariate shift} problem that occurs during sample generation in autoregressive models. Applying this augmentation to autoregressive image models, we obtain 3.32 bits per dimension on the ImageNet 64x64 dataset, and substantially improve the quality of generated samples in terms of the Frechet Inception distance (FID) --- from 37.50 to 12.09 on the CIFAR-10 dataset.","density estimation, autoregressive models, generative modeling, score-based models, diffusion models" Multi-Behavior Dynamic Contrastive Learning for Recommendation,https://openreview.net/forum?id=ykOpK9O5qYv,https://openreview.net/pdf?id=ykOpK9O5qYv,,"Dynamic behavior modeling has become an essential task in personalized recommender systems for learning the time-evolving user preference in online platforms. However, most next-item recommendation methods follow a single-type behavior learning paradigm, which notably limits their user representation performance in practice, since user-item relationships are often multi-typed in real-life applications (e.g., click, tag-as-favorite, review and purchase). To offer better recommendations, this work proposes Evolving Graph Contrastive Memory Network (EGCM) to model dynamic interaction heterogeneity for multi-behavior sequential recommendation. Specifically, we first develop a multi-behavior graph encoder to capture the short-term preference heterogeneity, and preserve the dedicated relation semantics for different types of user-item interactions.
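A minimal sketch of the noise conditional likelihood idea in the autoregressive-modeling entry above: draw a noise level per example from a continuum and maximize the model's likelihood conditioned on that level. The `model.log_prob(x, sigma)` interface is an assumed stand-in for a noise-conditioned density model.

```python
import torch

def noise_conditional_mle_loss(model, x, sigma_min=0.01, sigma_max=1.0):
    batch = x.shape[0]
    # One noise level per example, drawn from a continuum of levels.
    sigma = torch.empty(batch).uniform_(sigma_min, sigma_max)
    shape = [batch] + [1] * (x.dim() - 1)          # broadcast over data dims
    x_noisy = x + sigma.view(shape) * torch.randn_like(x)
    # Maximize the family of noise-conditional likelihoods.
    return -model.log_prob(x_noisy, sigma).mean()
```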
In addition, we design a dynamic cross-relational memory network, empowering EGCM to distill the long-term multi-behavior preference of users and the underlying evolving cross-type behavior dependencies over time. To enhance the user representation with multi-behavior commonality and diversity, we design a multi-behavior contrastive learning paradigm with heterogeneous short- and long-term interest modeling. Experiments on several real-world datasets show the superiority of our recommender system over various state-of-the-art baselines.","Multi-Behavior Recommendation, Dynamic Contrastive Learning" Analyzing diffusion as serial reproduction,https://openreview.net/forum?id=WpGqKAEwMn,https://openreview.net/pdf?id=WpGqKAEwMn,We identify a correspondence between diffusion models and a cognitive paradigm known as serial reproduction and use that to explain key features of diffusion models.,"Diffusion models are a class of generative models that learn to synthesize samples by inverting a diffusion process that gradually maps data into noise. While these models have enjoyed great success recently, a full theoretical understanding of their observed properties is still lacking, in particular, their weak sensitivity to the choice of noise family and the role of adequate scheduling of noise levels for good synthesis. By identifying a correspondence between diffusion models and a well-known paradigm in cognitive science known as serial reproduction, whereby human agents iteratively observe and reproduce stimuli from memory, we show how the aforementioned properties of diffusion models can be explained as a natural consequence of this correspondence. We then complement our theoretical analysis with simulations that exhibit these key features. Our work highlights how classic paradigms in cognitive science can shed light on state-of-the-art machine learning problems.","diffusion models, cognitive science, serial reproduction, generative models" Pseudo-label Training and Model Inertia in Neural Machine Translation,https://openreview.net/forum?id=eXkhH12DTD9,https://openreview.net/pdf?id=eXkhH12DTD9,pseudo-label training improves model stablity to updates and input perturbations,"Like many other machine learning applications, neural machine translation (NMT) benefits from over-parameterized deep neural models. However, these models have been observed to be brittle: NMT model predictions are sensitive to small input changes and can show significant variation across re-training or incremental model updates. This work studies a frequently used method in NMT, pseudo-label training (PLT), which is common to the related techniques of forward-translation (or self-training) and sequence-level knowledge distillation. While the effect of PLT on quality is well-documented, we highlight a lesser-known effect: PLT can enhance a model's stability to model updates and input perturbations, a set of properties we call \textit{model inertia}. We study inertia effects under different training settings and we identify distribution simplification as a mechanism behind the observed results. ","knowledge distillation, semi-supervised learning, self-training, forward translation, stability, robustness, machine translation" Adaptive Smoothing Gradient Learning for Spiking Neural Networks,https://openreview.net/forum?id=s5NL0rQ31zJ,https://openreview.net/pdf?id=s5NL0rQ31zJ,,"Spiking neural networks (SNNs) with biologically inspired spatio-temporal dynamics show higher energy efficiency on neuromorphic architectures. 
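A minimal sketch of pseudo-label training (PLT) as studied in the NMT entry above: a teacher model labels source sentences with its own translations and the student trains on these forward-translations. `teacher.translate` and `train_step` are illustrative stand-ins, not any library's API.

```python
def pseudo_label_training(teacher, student, source_sentences, train_step):
    """One PLT pass: the student never sees gold targets, only the teacher's
    (distributionally simplified) outputs."""
    for src in source_sentences:
        pseudo_tgt = teacher.translate(src)    # teacher-generated label
        train_step(student, src, pseudo_tgt)   # ordinary supervised update
    return student
```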
Error backpropagation in SNNs is prohibited by the all-or-none nature of spikes. Existing solutions circumvent this problem by relaxing the gradient calculation with a continuous function of constant relaxation degree, so-called surrogate gradient learning. Nevertheless, such a relaxation introduces additional smoothness error in spike firing, which leads to inaccurately estimated gradients. Thus, how to adjust the relaxation degree adaptively and eliminate the smoothness error progressively is crucial. Here, we propose a methodology in which training a prototype neural network gradually evolves into training an SNN by fusing a learnable relaxation degree into the network together with random spike noise. In this way, the network adaptively learns the accurate gradients of the loss landscape in SNNs. The theoretical analysis further shows that optimization of such a noisy network progressively evolves into optimization of the embedded SNN with shared weights. Moreover, we conduct extensive experiments on static images, dynamic event streams, speech, and instrumental sounds. The results show the proposed method achieves state-of-the-art performance across all the datasets with remarkable robustness to different relaxation degrees.", Going Beyond Approximation: Encoding Constraints for Explainable Multi-hop Inference via Differentiable Combinatorial Solvers,https://openreview.net/forum?id=im5YMG981ST,https://openreview.net/pdf?id=im5YMG981ST,The paper presents a novel neuro-symbolic framework that integrates Integer Linear Programming with pre-trained transformers to perform end-to-end explainable multi-hop inference,"Integer Linear Programming (ILP) provides a viable mechanism to encode explicit and controllable assumptions about explainable multi-hop inference with natural language. However, an ILP formulation is non-differentiable and cannot be integrated into broader deep learning architectures. Recently, Thayaparan et al. (2021a) proposed a novel methodology to integrate ILP with Transformers to achieve end-to-end differentiability for complex multi-hop inference. While this hybrid framework has been demonstrated to deliver better answer and explanation selection than transformer-based and existing ILP solvers, the neuro-symbolic integration still relies on a convex relaxation of the ILP formulation, which can produce sub-optimal solutions. To address these limitations, we propose Diff-Comb Explainer, a novel neuro-symbolic architecture based on Differentiable BlackBox Combinatorial solvers (DBCS) (Pogančić et al., 2019). Unlike existing differentiable solvers, the presented model does not require the transformation and relaxation of the explicit semantic constraints, allowing for direct and more efficient integration of ILP formulations.
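A minimal sketch of the surrogate gradient learning described in the SNN entry above: the forward pass keeps the all-or-none spike, while the backward pass substitutes a smooth relaxation whose sharpness is set by a relaxation degree `beta` (fixed here; the entry's point is that it can be made learnable and fused with spike noise).

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    @staticmethod
    def forward(ctx, v, beta):
        ctx.save_for_backward(v)
        ctx.beta = beta
        return (v > 0).float()                 # hard threshold: 0/1 spike

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        s = torch.sigmoid(ctx.beta * v)        # smooth stand-in for the step
        return grad_out * ctx.beta * s * (1 - s), None

# usage: spikes = SurrogateSpike.apply(membrane_potential - threshold, 5.0)
```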
Diff-Comb Explainer demonstrates improved accuracy and explainability over non-differentiable solvers, Transformers and existing differentiable constraint-based multi-hop inference frameworks.","Explainable AI, Constrained Optimization, Integer Linear Programming, Question Answering" Robust and accelerated single-spike spiking neural network training with applicability to challenging temporal tasks,https://openreview.net/forum?id=kRCRcDayfk6,https://openreview.net/pdf?id=kRCRcDayfk6,We propose a new model for robust and accelerated training of single-spike SNNs with competitive performance across various image and neuromorphic datasets and demonstrate a broader computational role for single-spike SNNs than previously believed.,"Spiking neural networks (SNNs), particularly the single-spike variant in which neurons spike at most once, are considerably more energy efficient than standard artificial neural networks (ANNs). However, single-spike SNNs are difficult to train due to their dynamic and non-differentiable nature, where current solutions are either slow or suffer from training instabilities. These networks have also been critiqued for their limited computational applicability, such as being unsuitable for time-series datasets. We propose a new model for training single-spike SNNs which mitigates the aforementioned training issues and obtains competitive results across various image and neuromorphic datasets, with up to a $13.98\times$ training speedup and up to an $81\%$ reduction in spikes compared to the multi-spike SNN. Notably, our model performs on par with multi-spike SNNs in challenging tasks involving neuromorphic time-series datasets, demonstrating a broader computational role for single-spike SNNs than previously believed.","Spiking neural networks, single-spike, accelerated training, dead neuron problem" Using Planning to Improve Semantic Parsing of Instructional Texts,https://openreview.net/forum?id=Ajk3Bfo9AUW,https://openreview.net/pdf?id=Ajk3Bfo9AUW,Integrating symbolic planning information as a decoding constraint improves few-shot semantic parsing of instructional texts,"We develop a method for few-shot semantic parsing of instructional texts. The system takes long-form instructional texts as input and produces sequences of actions in a formal language that enable execution of the instructions. This task poses unique challenges since input texts may contain long context dependencies and ambiguous and domain-specific language. Valid semantic parses also require sequences of steps that constitute an executable plan. We build on recent progress in semantic parsing by leveraging large language models to learn parsers from small amounts of training data. During decoding, our method employs planning methods and domain information to rank and correct candidate parses. To validate our method, we investigate recipe interpretation in two cooking domains. We present results for few-shot semantic parsing using leave-one-out cross-validation. We show that utilizing planning domain information improves the quality of generated plans. Through ablations we also explore the effects of our decoder design choices and model size.","nlp, semantic parsing, planning" A NEW PARADIGM FOR CROSS-MODALITY PERSON RE-IDENTIFICATION,https://openreview.net/forum?id=TKcVjKZ0BxE,https://openreview.net/pdf?id=TKcVjKZ0BxE,,"Visible-infrared person re-identification (ReID) is still very challenging on account of the scarcity of cross-modality datasets and large inter-modality variation.
Most existing cross-modality ReID methods have trouble eliminating the cross-modality discrepancy resulting from the heterogeneous images. In this paper, we present an effective framework and build a large benchmark, named NPU-ReID. To this end, we propose a dual-path fusion network, taking the transformer as the smallest feature extraction unit. To expand cross-modality sample diversity, we propose a modality augmentation strategy to generate semi-modality pedestrian images by exchanging certain patches; the main innovation is that the cross-modality gap can be indirectly minimized by reducing the variance between the semi-modality and the infrared or visible modality. Moreover, to make the traditional triplet loss more suitable for cross-modal matching tasks, we design a multi-masking triplet loss that optimizes the relative distance between anchors and cross-modality positive/negative sample pairs, especially constraining the distance between easy and hard positive samples. Experimental results demonstrate that our proposed method achieves superior performance to other methods on SYSU-MM01, RegDB and our proposed NPU-ReID dataset, especially on the RegDB dataset with significant improvements of 6.81$\%$ in rank-1 and 9.65$\%$ in mAP.",People Re-identification,Cross-modality Causal Mean Field Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=Uzgfy7_v7BH,https://openreview.net/pdf?id=Uzgfy7_v7BH,This paper aims at the scalability problem in large-scale multi-agent systems. We use causal inference to improve the robustness of mean field Q-learning. Experiments verify that our method achieves superior scalability performance.,"Scalability remains a challenge in multi-agent reinforcement learning and is currently under active research. However, existing works lack the ability to identify the essential interactions in non-stationary environments. We propose causal mean field Q-learning (CMFQ) to address this problem. It retains the advantage of MFQ, which can compress the space size dramatically, while being more robust to the non-stationarity caused by an increasing number of agents. We enable agents to identify which ally or opponent is more crucial by asking ""what if"" with the help of the structural causal model (SCM), and then pay more attention to the more crucial ones. We test CMFQ in mixed cooperative-competitive and cooperative games, which verifies our method's scalability performance.","multi-agent reinforcement learning, causal inference" Hidden Markov Mixture of Gaussian Process Functional Regression: Utilizing Multi-Scale Structure for Time-Series Forecasting,https://openreview.net/forum?id=QP4nkeQ1BpT,https://openreview.net/pdf?id=QP4nkeQ1BpT,,"The mixture of Gaussian process functional regressions (GPFRs) assumes that a batch of time series or sample curves is generated by independent random processes with different temporal structures. However, in real situations, these structures actually transition randomly over a longer time scale, so the assumption of independent curves does not hold in practice. To remove this limitation, we propose the hidden Markov based GPFR mixture model (HM-GPFR) by describing these curves with both fine and coarse level temporal structures. Specifically, the temporal structure is described by the Gaussian process model at the fine level and a hidden Markov process at the coarse level. The whole model can be regarded as a random process with state switching dynamics.
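A minimal sketch of the mean field Q-learning (MFQ) idea that the CMFQ entry above extends: each agent's Q-function conditions on its own state and action plus the mean action of its neighbors, collapsing the joint action space. The causal "what if" reweighting of CMFQ is not shown; the one-hot action encoding and `q_net` interface are illustrative assumptions.

```python
import torch

def mean_field_q(q_net, state, own_action, neighbor_actions):
    """state: [d_s]; own_action: [num_actions] one-hot;
    neighbor_actions: [num_neighbors, num_actions] one-hot rows."""
    mean_action = neighbor_actions.float().mean(dim=0)   # [num_actions]
    # The many-agent interaction is summarized by a single mean action,
    # so Q no longer scales with the number of neighbors.
    return q_net(torch.cat([state, own_action, mean_action]))
```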
To further enhance the robustness of the model, we also place a prior on the model parameters and develop the Bayesian hidden Markov based GPFR mixture model (BHM-GPFR). Experimental results demonstrate that the proposed methods have both high prediction accuracy and good interpretability. ", HyperDeepONet: learning operator with complex target function space using the limited resources via hypernetwork,https://openreview.net/forum?id=OAw6V3ZAhSd,https://openreview.net/pdf?id=OAw6V3ZAhSd,,"Fast and accurate predictions for complex physical dynamics are a big challenge across various applications. Real-time prediction on resource-constrained hardware is even more crucial in real-world problems. The deep operator network (DeepONet) has recently been proposed as a framework for learning nonlinear mappings between function spaces. However, the DeepONet requires many parameters and has a high computational cost when learning operators, particularly those with complex (discontinuous or non-smooth) target functions. In this study, we propose HyperDeepONet, which uses the expressive power of the hypernetwork to enable learning of a complex operator with a smaller set of parameters. The DeepONet and its variant models can be thought of as a method of injecting the input function information into the target function. From this perspective, these models can be viewed as a special case of HyperDeepONet. We analyze the complexity of DeepONet and conclude that HyperDeepONet needs relatively lower complexity to obtain the desired accuracy for operator learning. HyperDeepONet was successfully applied to various operator learning problems using low computational resources compared to other benchmarks.","Hypernetwork, Operator learning, Deep operator network, DeepONet" CLAS: Central Latent Action Spaces for Coordinated Multi-Robot Manipulation,https://openreview.net/forum?id=bGRqRRVA8C,https://openreview.net/pdf?id=bGRqRRVA8C,We propose a method for coordinating multi-robot manipulation by learning a latent action space that is task specific and acts on the manipulated object.,"Multi-robot manipulation tasks involve various control entities that can be separated into dynamically independent parts. A typical example of such real-world tasks is dual-arm manipulation. Learning to naively solve such tasks with reinforcement learning is often infeasible due to the combinatorial explosion of the action and state spaces. Instead, we would like to handle such environments as multi-agent systems and have several agents control parts of the whole. However, decentralizing the generation of actions requires coordination across agents through a channel limited to information central to the task. This paper proposes an approach to coordinating multi-robot manipulation through learned latent action spaces that are shared across different agents.
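A minimal sketch of the hypernetwork pattern behind the HyperDeepONet entry above: a small network consumes the sampled input function and emits all the weights of a tiny target network, which then evaluates the output function at query points. All sizes here are arbitrary illustrative choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TinyHyperNet(nn.Module):
    def __init__(self, n_sensors, hidden=32):
        super().__init__()
        # Target net: 1 -> hidden -> 1 MLP; the hypernetwork generates all
        # of its parameters (3*hidden + 1 numbers in total).
        self.hidden = hidden
        self.n_params = (1 * hidden + hidden) + (hidden * 1 + 1)
        self.hyper = nn.Sequential(nn.Linear(n_sensors, 64), nn.ReLU(),
                                   nn.Linear(64, self.n_params))

    def forward(self, u, y):
        # u: [n_sensors] input function sampled at fixed sensor locations;
        # y: [m, 1] query locations for the output function.
        p = self.hyper(u)
        h = self.hidden
        w1 = p[:h].view(h, 1); b1 = p[h:2 * h]
        w2 = p[2 * h:3 * h].view(1, h); b2 = p[3 * h:]
        z = torch.tanh(y @ w1.t() + b1)        # [m, h]
        return z @ w2.t() + b2                 # [m, 1] output function values
```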
We validate our method in simulated multi-robot manipulation tasks and demonstrate improvement over previous baselines in terms of sample efficiency, learning performance, and interpretability.","Latent Action Spaces, Multi-Robot Manipulation, Cooperative Control, Reinforcement Learning" Edge Guided GANs with Contrastive Learning for Semantic Image Synthesis,https://openreview.net/forum?id=qcJmsP3oE9,https://openreview.net/pdf?id=qcJmsP3oE9,A novel contrastive learning based edge guided GAN for semantic image synthesis.,"We propose a novel \underline{e}dge guided \underline{g}enerative \underline{a}dversarial \underline{n}etwork with \underline{c}ontrastive learning (ECGAN) for the challenging semantic image synthesis task. Although considerable improvement has been achieved, the quality of synthesized images is far from satisfactory due to three largely unresolved challenges. 1) The semantic labels do not provide detailed structural information, making it difficult to synthesize local details and structures. 2) The widely adopted CNN operations such as convolution, down-sampling, and normalization usually cause spatial resolution loss and thus are unable to fully preserve the original semantic information, leading to semantically inconsistent results (e.g., missing small objects). 3) Existing semantic image synthesis methods focus on modeling `local' semantic information from a single input semantic layout. However, they ignore `global' semantic information of multiple input semantic layouts, i.e., semantic cross-relations between pixels across different input layouts. To tackle 1), we propose to use edge information as an intermediate representation, which is further adopted to guide image generation via a proposed attention guided edge transfer module. Edge information is produced by a convolutional generator and introduces detailed structure information. To tackle 2), we design an effective module to selectively highlight class-dependent feature maps according to the original semantic layout to preserve the semantic information. To tackle 3), inspired by current methods in contrastive learning, we propose a novel contrastive learning method, which aims to enforce pixel embeddings belonging to the same semantic class to generate more similar image content than those from different classes. By doing so, it can capture more semantic relations by explicitly exploring the structures of labeled pixels from multiple input semantic layouts. Experiments on three challenging datasets show that ECGAN achieves significantly better results than state-of-the-art methods. ","Semantic image synthesis, contrastive learning, GANs, edge" Towards Reliable Link Prediction with Robust Graph Information Bottleneck,https://openreview.net/forum?id=MWGDhOQkr3,https://openreview.net/pdf?id=MWGDhOQkr3,We provide an information-theory-guided principle and its two instantiations for robust link prediction under inherent edge noise.,"Link prediction on graphs has achieved great success with the rise of deep graph learning. However, its robustness under edge noise has been less investigated. We reveal that the inherent edge noise that naturally perturbs both input topology and target label leads to severe performance degradation and representation collapse. Here, we propose an information-theory-guided principle, Robust Graph Information Bottleneck (RGIB), to extract reliable supervision signals and avoid representation collapse.
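A minimal sketch of the class-level pixel contrastive idea in the ECGAN entry above, written as an InfoNCE-style loss over sampled pixel embeddings grouped by semantic class; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(emb, labels, tau=0.1):
    # emb: [N, D] pixel embeddings sampled across layouts; labels: [N] class ids.
    emb = F.normalize(emb, dim=1)
    sim = emb @ emb.t() / tau                          # [N, N] similarities
    self_mask = torch.eye(len(emb), dtype=torch.bool, device=emb.device)
    sim = sim.masked_fill(self_mask, -1e9)             # exclude self-pairs
    pos = (labels[:, None] == labels[None, :]).float()
    pos = pos.masked_fill(self_mask, 0.0)              # positives: same class
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    n_pos = pos.sum(dim=1).clamp(min=1)
    # Average log-probability of picking a same-class pixel, per anchor.
    return -(pos * log_prob).sum(dim=1).div(n_pos).mean()
```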
Different from the general information bottleneck, RGIB decouples and balances the mutual dependence among graph topology, edge label, and representation, building a new learning objective for robust representation. We also provide two implementations, RGIB-SSL and RGIB-REP, that benefit from different methodologies, i.e., self-supervised learning and data reparametrization, for indirect and direct data denoising, respectively. Extensive experiments on six benchmarks with various scenarios verify the effectiveness of the proposed RGIB.","Robust link prediction, Inherent edge noise, Graph representation learning" Enforcing Delayed-Impact Fairness Guarantees,https://openreview.net/forum?id=tgAI50giBbg,https://openreview.net/pdf?id=tgAI50giBbg,,"Recent research has shown that seemingly fair machine learning models, when used to inform decisions that have an impact on people's lives or well-being (e.g., applications involving education, employment, and lending), can inadvertently increase social inequality in the long term. Existing fairness-aware algorithms consider static fairness constraints, such as equal opportunity or demographic parity, but enforcing constraints of this type may result in models that have a negative long-term impact on disadvantaged individuals and communities. We introduce ELF (Enforcing Long-term Fairness), the first classification algorithm that provides high-confidence fairness guarantees in terms of long-term, or delayed, impact. Importantly, ELF solves the open problem of providing such guarantees based only on historical data that includes observations of delayed impact. Prior methods, by contrast, require prior knowledge (or an estimate) of analytical models describing the relationship between a classifier's predictions and their corresponding delayed impact. We prove that ELF satisfies delayed-impact fairness constraints with high confidence and that it is guaranteed to identify a fair solution, if one exists, given sufficient data. We show empirically, using real-life data, that ELF can successfully mitigate long-term unfairness with high confidence.", Affinity-Aware Graph Networks,https://openreview.net/forum?id=p9zz7hLzH-4,https://openreview.net/pdf?id=p9zz7hLzH-4,"We show how to use affinity measures arising from random walks (e.g., effective resistance) to design message passing networks that are shown to outperform various benchmarks with fewer message passing steps.","Graph Neural Networks (GNNs) have emerged as a powerful technique for learning on relational data. Owing to the relatively limited number of message passing steps they perform—and hence a smaller receptive field—there has been significant interest in improving their expressivity by incorporating structural aspects of the underlying graph. In this paper, we explore the use of affinity measures as features in graph neural networks, in particular measures arising from random walks, including effective resistance, hitting and commute times. We propose message passing networks based on these features and evaluate their performance on a variety of node and graph property prediction tasks.","Graph neural networks, message passing networks, effective resistance" Few-shot Lifelong Reinforcement Learning with Generalization Guarantees: An Empirical PAC-Bayes Approach,https://openreview.net/forum?id=2bJ6Cqrd-a,https://openreview.net/pdf?id=2bJ6Cqrd-a,,"We propose a new empirical PAC-Bayes approach to develop lifelong reinforcement learning algorithms with theoretical guarantees. 
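A minimal sketch of the effective-resistance affinity used as a feature in the Affinity-Aware Graph Networks entry above, computed from the pseudoinverse of the graph Laplacian via the standard identity R(u,v) = L+_uu + L+_vv - 2 L+_uv.

```python
import numpy as np

def effective_resistance(adj):
    """adj: [n, n] symmetric 0/1 adjacency matrix of a connected graph.
    Returns the [n, n] matrix of pairwise effective resistances."""
    deg = np.diag(adj.sum(axis=1))
    lap_pinv = np.linalg.pinv(deg - adj)   # pseudoinverse of the Laplacian
    d = np.diag(lap_pinv)
    return d[:, None] + d[None, :] - 2 * lap_pinv
```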
The main idea is to extend the PAC-Bayes theory in supervised learning to the reinforcement learning regime. More specifically, we train a distribution of policies, and gradually improve the distribution parameters by optimizing the generalization error bound using trajectories from each task. As the agent sees more tasks, it elicits better prior distributions of policies, resulting in tighter generalization bounds and improved future learning. To demonstrate the superior performance of our method compared to recent state-of-the-art methods, we test the proposed algorithms on various OpenAI Gym and MuJoCo environments and show that they adapt to new tasks more efficiently by continuously distilling knowledge from past tasks.","Few-shot Learning, Lifelong Meta RL, Multi-Task RL, PAC-Bayes Bound, Generalization Error Bound" Towards the Detection of Diffusion Model Deepfakes,https://openreview.net/forum?id=RZHdb7FnqlY,https://openreview.net/pdf?id=RZHdb7FnqlY,We take a first look at the detection of images generated by diffusion models by evaluating state-of-the-art detectors and analyzing DM-generated images in the frequency domain.,"Diffusion models (DMs) have recently emerged as a promising method in image synthesis. They have surpassed generative adversarial networks (GANs) in both diversity and quality, and have achieved impressive results in text-to-image modeling. However, to date, only little attention has been paid to the detection of DM-generated images, which is critical to prevent adverse impacts on our society. While prior works have shown that GAN-generated images can be reliably detected using automated methods, it is unclear whether the same methods are effective against DMs. In this work, we address this challenge and take a first look at detecting DM-generated images. We approach the problem from two different angles: First, we evaluate the performance of state-of-the-art detectors on a variety of DMs. Second, we analyze DM-generated images in the frequency domain and study different factors that influence the spectral properties of these images. Most importantly, we demonstrate that GANs and DMs produce images with different characteristics, which requires adaptation of existing classifiers to ensure reliable detection. We believe this work provides the foundation and starting point for further research to detect DM deepfakes effectively.","Diffusion Model, Generative Adversarial Network, GAN, Deepfakes, Detection, Frequency Artifact, Frequency Analysis, Spectrum Discrepancies, Synthetic Images, Disinformation, Social Media" Global-Scale Species Mapping From Crowdsourced Data,https://openreview.net/forum?id=1mjOVFZ3C-,https://openreview.net/pdf?id=1mjOVFZ3C-,A new model for jointly estimating the spatial range of thousands of different species from sparse partially observed data. ,"Estimating the geographical range of a species from in situ observational data is a challenging and important geospatial prediction problem. Given a set of locations indicating where a species has been observed, the goal is to learn a model that can predict how likely it is for the species to be present at any other location. While this is a well-studied problem, traditional approaches are unable to take advantage of more recently available large-scale datasets that cover many locations and species. We propose a new approach that jointly estimates the geographical ranges of tens of thousands of different species simultaneously.
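A minimal sketch of the frequency-domain analysis mentioned in the diffusion-deepfake entry above: the azimuthally averaged power spectrum of an image, a standard diagnostic whose shape often differs between real and generated images.

```python
import numpy as np

def radial_power_spectrum(img):
    """img: [H, W] grayscale array. Returns 1D mean power per radial frequency."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = spec.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h // 2, xx - w // 2).astype(int)
    # Average the power over rings of equal radius around the DC component.
    sums = np.bincount(r.ravel(), weights=spec.ravel())
    counts = np.bincount(r.ravel())
    return sums / np.maximum(counts, 1)
```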
We develop a series of benchmark evaluation tasks that measure different aspects of the species range and spatial representation learning problems. We show that our approach scales both in terms of amount of training data and species, where adding more data enables the models to learn better spatial representations that generalize to other species. Despite being only trained on weakly supervised crowdsourced data, our models can approach the predictions of current expert-developed gold standard models.","species distribution modeling, coordinate networks, deep learning" CANIFE: Crafting Canaries for Empirical Privacy Measurement in Federated Learning,https://openreview.net/forum?id=Kf7Yyf4O0u,https://openreview.net/pdf?id=Kf7Yyf4O0u,Crafting canaries to measure empirical privacy of DP-FL training under a realistic threat model,"Federated Learning (FL) is a setting for training machine learning models in distributed environments where the clients do not share their raw data but instead send model updates to a server. However, model updates can be subject to attacks and leak private information. Differential Privacy (DP) is a leading mitigation strategy which involves adding noise to clipped model updates, trading off performance for strong theoretical privacy guarantees. Previous work has shown that the threat model of DP is conservative and that the obtained guarantees may be vacuous or may not directly translate to information leakage in practice. In this paper, we aim to achieve a tighter measurement of the model exposure by considering a realistic threat model. We propose a novel method, CANIFE, that uses canaries, samples carefully crafted by a strong adversary, to evaluate the empirical privacy of a training round. We apply this attack to vision models trained on CIFAR-10 and CelebA and to language models trained on Sent140 and Shakespeare. In particular, in realistic FL scenarios, we demonstrate that the empirical epsilon obtained with CANIFE is 2-7$\times$ lower than the theoretical bound. ","Federated Learning, Differential Privacy, Empirical Privacy, Model Auditing, Membership Inference Attack" Multivariate Time Series Forecasting By Graph Attention Networks With Theoretical Guarantees,https://openreview.net/forum?id=qg2XdQ773R,https://openreview.net/pdf?id=qg2XdQ773R,,"Multivariate time series forecasting (MTSF) aims to predict future values of multiple variables based on past values of multivariate time series, and has been applied in fields including traffic flow prediction, stock price forecasting, and anomaly detection. Capturing the inter-dependencies among variables poses one significant challenge to MTSF. Several methods that model the correlations between variables with an aim to improve the test prediction accuracy have been considered in recent works; however, none of them has theoretical guarantees. In this paper, we develop a new norm-bounded graph attention network (GAT) for MTSF by upper-bounding the Frobenius norm of the weights in each layer of the GAT model to achieve optimal performance. Under optimal parameters, we theoretically show that our model can achieve a generalization error bound expressed as a product of the Frobenius norms of the weights in each layer and the numbers of neighbors and attention heads, where the latter appear as polynomial terms whose degree is the number of layers. Empirically, we investigate the impact of different components of GAT models on the performance of MTSF. Our experiment also verifies our theoretical findings.
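A minimal sketch of one natural way to enforce the norm bounding in the GAT entry above: project each weight matrix back into a Frobenius-norm ball after every optimizer step. The projection step is an assumption about implementation; the paper's contribution is the bound itself.

```python
import torch

@torch.no_grad()
def clip_frobenius_(module, bound):
    """Rescale any fully connected weight matrix whose Frobenius norm
    exceeds `bound` back onto the ball's surface, in place."""
    for w in module.parameters():
        if w.dim() == 2:                    # weight matrices only
            norm = torch.linalg.matrix_norm(w, ord="fro")
            if norm > bound:
                w.mul_(bound / norm)

# usage, after each step: optimizer.step(); clip_frobenius_(model, bound=3.0)
```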
Empirically, we also observe that the generalization performance of our method is dependent on the number of attention heads, the number of neighbors, the scales (norms) of the weight matrices, the scale of the input features, and the number of layers. Our method provides novel perspectives for improving the generalization performance for MTSF, and our theoretical guarantees give substantial implications for designing attention-based methods for MTSF. ","Multivariate Time Series Forecasting, Graph Attention Networks, Generalization Error Bound, Rademacher Complexity" Wasserstein Generalization Bound for Few-Shot Learning,https://openreview.net/forum?id=ciZrud3kf3,https://openreview.net/pdf?id=ciZrud3kf3,"We use properties of Wasserstein distance to give a tight bound for few shot learning, specifically prototypical networks.","In the absence of large quantities of annotated data, few shot learning is used to train neural networks that make predictions based on similarities between datapoints. To better understand how models would behave when presented with unfamiliar data, research on generalization bounds has revealed some important properties about deep neural networks. However, when extended to the domain of few shot learning it often yields loose bounds since it does not take into account the nature and methodology of few shot learning. We propose a novel stochastic generalization bound for prototypical neural networks by constructing a Wasserstein sphere centered around the distribution of weight matrices. We show that by applying concentration inequalities on the distribution of weight matrices in the Wasserstein sphere, stricter generalization bounds can be obtained. Comparison with previous generalization bounds shows the efficacy of our approach, and to our knowledge this is the first bound that makes use of Wasserstein distance to give a measure of generalizability of deep neural networks.","Few shot learning, Generalization" Maximal Correlation-Based Post-Nonlinear Learning for Bivariate Causal Discovery,https://openreview.net/forum?id=Or8rcTLo7U,https://openreview.net/pdf?id=Or8rcTLo7U,,"Bivariate causal discovery aims to determine the causal relationship between two random variables from passive observational data (as intervention is not affordable in many scientific fields), which is considered fundamental and challenging. Designing algorithms based on the post-nonlinear (PNL) model has attracted much attention for its generality. However, the state-of-the-art (SOTA) PNL-based algorithms involve highly non-convex objectives for neural network training, which are time-consuming and unable to produce meaningful solutions with finite samples. In this paper, we propose a novel method that incorporates maximal correlation into the PNL model learning (short as MC-PNL) such that the underlying nonlinearities can be accurately recovered. Owing to the benign structure of our objective function when modeling the nonlinearities with linear combinations of random Fourier features, the target optimization problem can be solved rather efficiently and rapidly via block coordinate descent. We also compare MC-PNL with SOTA methods on downstream synthetic and real causal discovery tasks to show its superiority in time and accuracy.
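A minimal sketch of the random Fourier features used in the MC-PNL entry above to model the post-nonlinearities as linear combinations of fixed random cosine features (Rahimi and Recht, 2007), which is what keeps the learning problem benign; the bandwidth and feature count are illustrative choices.

```python
import numpy as np

def random_fourier_features(x, n_features=100, gamma=1.0, seed=0):
    """x: [n, 1] inputs. Returns [n, n_features] features whose inner
    products approximate the RBF kernel exp(-gamma * (x - y)**2)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=np.sqrt(2 * gamma), size=(1, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x @ w + b)

# A nonlinearity is then fit as a linear model on these features, so each
# block-coordinate subproblem stays linear in its own parameters.
```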
Our code is available at https://anonymous.4open.science/r/MC-PNL-E446/.", A View From Somewhere: Human-Centric Face Representations,https://openreview.net/forum?id=mMaInr0r0c,https://openreview.net/pdf?id=mMaInr0r0c,Implicit discovery of face-varying dimensions and annotator bias by learning on a novel face similarity dataset,"Biases in human-centric computer vision models are often attributed to a lack of sufficient data diversity, with many demographics insufficiently represented. However, auditing datasets for diversity can be difficult, due to an absence of ground-truth labels of relevant features. Few datasets contain self-identified demographic information, inferring demographic information risks introducing additional biases, and collecting and storing data on sensitive attributes can carry legal risks. Moreover, categorical demographic labels do not necessarily capture all the relevant dimensions of human diversity that are important for developing fair and robust models. We propose to implicitly learn a set of continuous face-varying dimensions, without ever asking an annotator to explicitly categorize a person. We uncover the dimensions by learning on a novel dataset of 638,180 human judgments of face similarity (FAX). We demonstrate the utility of our learned embedding space for predicting face similarity judgments, collecting continuous face attribute values, attribute classification, and comparative dataset diversity auditing. Moreover, using a novel conditional framework, we show that an annotator's demographics influence the importance they place on different attributes when judging similarity, underscoring the need for diverse annotator groups to avoid biases.","similarity, faces, annotator bias, computer vision, cognitive, mental representations, diversity" Identifiability Results for Multimodal Contrastive Learning,https://openreview.net/forum?id=U_2kuqoTcB,https://openreview.net/pdf?id=U_2kuqoTcB,"We show that multimodal contrastive learning can block-identify latent factors shared between heterogeneous modalities (e.g., images and captions), even in the presence of nontrivial statistical and causal dependencies.","Contrastive learning is a cornerstone underlying recent progress in multi-view and multimodal learning, e.g., in representation learning with image/caption pairs. While its effectiveness is not yet fully understood, a line of recent work reveals that contrastive learning can invert the data generating process and recover ground truth latent factors shared between views. In this work, we present new identifiability results for multimodal contrastive learning, showing that it is possible to recover shared factors in a more general setup than the multi-view setting studied previously. Specifically, we distinguish between the multi-view setting with one generative mechanism (e.g., multiple cameras of the same type) and the multimodal setting that is characterized by distinct mechanisms (e.g., cameras and microphones). Our work generalizes previous identifiability results by redefining the generative process in terms of distinct mechanisms with modality-specific latent variables. We prove that contrastive learning can block-identify latent factors shared between modalities, even when there are nontrivial dependencies between factors. We empirically verify our identifiability results with numerical simulations and corroborate our findings on a complex multimodal dataset of image/text pairs. 
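For readers unfamiliar with the underlying objective, a minimal sketch of a symmetric InfoNCE-style contrastive loss over paired modality embeddings follows (the batch size, dimensionality, and temperature are assumed; this is illustrative, not the paper's code):

import numpy as np

def info_nce(z_a, z_b, tau=0.1):
    # Symmetric InfoNCE over L2-normalized embeddings of paired modalities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau                       # pairwise similarities
    ls_ab = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    ls_ba = logits.T - np.log(np.exp(logits.T).sum(1, keepdims=True))
    # Matched pairs sit on the diagonal; both retrieval directions are averaged.
    return -0.5 * (np.mean(np.diag(ls_ab)) + np.mean(np.diag(ls_ba)))

rng = np.random.default_rng(0)
print(info_nce(rng.normal(size=(8, 4)), rng.normal(size=(8, 4))))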
Zooming out, our work provides a theoretical basis for multimodal representation learning and explains in which settings multimodal contrastive learning can be effective in practice.","multimodal learning, multi-view learning, contrastive learning, causal representation learning, nonlinear ica, identifiability" Task-Agnostic Unsupervised Robust Representation Learning,https://openreview.net/forum?id=DZ4FS-Evau7,https://openreview.net/pdf?id=DZ4FS-Evau7,"We propose a method to learn robust representations without any labels or adversarial fine-tuning in downstream tasks, based on a theoretically grounded unsupervised robustness regularizer.","It has been reported that deep learning models are extremely vulnerable to small but intentionally chosen perturbations of their input. In particular, a deep network, despite its near-optimal accuracy on the clean images, often misclassifies an image with a worst-case but humanly imperceptible perturbation (a so-called adversarial example). To tackle this problem, a great amount of research has been done to study the training procedure of a network to improve its robustness. However, most of the research so far has focused on the case of supervised learning. With the increasing popularity of self-supervised learning methods, it is also important to study and improve the robustness of their resulting representation on downstream tasks. In this paper, we study the problem of robust representation learning with unlabeled data in a task-agnostic manner. Specifically, we first derive an upper bound on the adversarial loss of a prediction model (which is based on the learned representation) on any downstream task, using its loss on the clean data and a robustness regularizer. Importantly, the regularizer is task-independent; thus, we propose to minimize it directly during the representation learning phase to make the downstream prediction model more robust. Extensive experiments show that our method results in a robust model for downstream tasks without any supervised adversarial training, and achieves favorable adversarial performance compared to relevant baselines.","unsupervised robustness, transferable adversarial robustness" Federated Learning as Variational Inference: A Scalable Expectation Propagation Approach,https://openreview.net/forum?id=dZrQR7OR11,https://openreview.net/pdf?id=dZrQR7OR11,This work introduces a probabilistic message-passing algorithm for federated learning based on expectation propagation (FedEP) and studies algorithmic considerations to scale up classic expectation propagation to modern federated learning settings.,"The canonical formulation of federated learning treats it as a distributed optimization problem where the model parameters are optimized against a global loss function that decomposes across client loss functions. A recent alternative formulation instead treats federated learning as a distributed inference problem, where the goal is to infer a global posterior from partitioned client data (Al-Shedivat et al., 2021). This paper extends the inference view and describes a variational inference formulation of federated learning where the goal is to find a global variational posterior that well-approximates the true posterior. This naturally motivates an expectation propagation approach to federated learning (FedEP), where approximations to the global posterior are iteratively refined through probabilistic message-passing between the central server and the clients.
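A minimal sketch of this message-passing pattern, reduced to one-dimensional Gaussian factors in natural parameters (the per-client likelihood terms below are toy values; the real method operates on neural-network posteriors, so this only illustrates the update structure):

import numpy as np

def ep_round(global_np, factors, client_likelihoods):
    # One sequential EP round. Natural parameters (lam, eta) of a 1-D Gaussian:
    # lam = 1/variance (precision), eta = mean/variance.
    for k, (lik_lam, lik_eta) in enumerate(client_likelihoods):
        cav = (global_np[0] - factors[k][0], global_np[1] - factors[k][1])  # remove this client's factor
        tilted = (cav[0] + lik_lam, cav[1] + lik_eta)          # client-side refined posterior
        factors[k] = (tilted[0] - cav[0], tilted[1] - cav[1])  # new approximating factor
        global_np = tilted                                     # server swaps in the new factor
    return global_np, factors

prior = (1.0, 0.0)                       # N(0, 1) in natural parameters
clients = [(2.0, 2.0), (4.0, -1.0)]      # assumed per-client Gaussian likelihood terms
factors = [(0.0, 0.0) for _ in clients]  # client factors initialized to uniform
post, factors = ep_round(prior, factors, clients)
print(post[1] / post[0], 1.0 / post[0])  # posterior mean and variance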
We conduct an extensive empirical study across various algorithmic considerations and describe practical strategies for scaling up expectation propagation to the modern federated setting. We apply FedEP on standard federated learning benchmarks and find that it outperforms strong baselines in terms of both convergence speed and accuracy.","federated learning, variational inference, expectation propagation" Latent Graph Inference using Product Manifolds,https://openreview.net/forum?id=JLR_B7n_Wqr,https://openreview.net/pdf?id=JLR_B7n_Wqr,,"Graph Neural Networks usually rely on the assumption that the graph topology is available to the network as well as optimal for the downstream task. Latent graph inference allows models to dynamically learn the intrinsic graph structure of problems where the connectivity patterns of data may not be directly accessible. In this work, we generalize the discrete Differentiable Graph Module (dDGM) for latent graph learning. The original dDGM architecture used the Euclidean plane to encode latent features based on which the latent graphs were generated. By incorporating Riemannian geometry into the model and generating more complex embedding spaces, we can improve the performance of the latent graph inference system. In particular, we propose a computationally tractable approach to produce product manifolds of constant curvature model spaces that can encode latent features of varying structure. The latent representations mapped onto the inferred product manifold are used to compute richer similarity measures that are leveraged by the latent graph learning model to obtain optimized latent graphs. Moreover, the curvature of the product manifold is learned during training alongside the rest of the network parameters and based on the downstream task, rather than being a static embedding space. Our novel approach is tested on a wide range of datasets, and outperforms the original dDGM model.","Latent Graph Inference, Product Manifolds, Graph Neural Networks" UNICORN: A Unified Backdoor Trigger Inversion Framework,https://openreview.net/forum?id=Mj7K4lglGyj,https://openreview.net/pdf?id=Mj7K4lglGyj,,"The backdoor attack, where the adversary uses inputs stamped with triggers (e.g., a patch) to activate pre-planted malicious behaviors, is a severe threat to Deep Neural Network (DNN) models. Trigger inversion is an effective way of identifying backdoor models and understanding embedded adversarial behaviors. A challenge of trigger inversion is that there are many ways of constructing the trigger. Existing methods make certain assumptions or impose attack-specific constraints, and therefore cannot generalize to various types of triggers. The fundamental reason is that existing work does not formally define the trigger and the inversion problem. This work formally defines and analyzes the trigger and the inversion problem. Then, it proposes a unified framework to invert backdoor triggers based on the formalization of triggers and the identified inner behaviors of backdoor models from our analysis. Our prototype UNICORN is general and effective in inverting backdoor triggers in DNNs. The code can be found at https://anonymous.4open.science/r/UNICORN-FA0E.", DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention,https://openreview.net/forum?id=rByagyHWlpb,https://openreview.net/pdf?id=rByagyHWlpb,,"Many studies have been conducted to improve the efficiency of the Transformer from quadratic to linear complexity under long-sequence conditions.
Among them, the low-rank-based methods aim to learn projection matrices to compress the sequence length, thus achieving efficiency gains. However, the projection matrices are fixed once they have been learned, so they compress the sequence length with the same coefficients for tokens at a given position, regardless of the input sequence. Adopting such input-invariant low-rank projections ignores the fact that the most informative part of a sequence varies from sequence to sequence, thus failing to preserve the most useful information that lies in varied positions of different sequences. In addition, previous efficient Transformers only focus on the influence of the sequence length while neglecting the effect of the hidden state dimension, which could yield further efficiency gains. To address the aforementioned problems, we present an efficient yet effective attention mechanism, namely the Dynamic Bilinear Low-Rank Attention (DBA), which compresses the sequence length by input-sensitive dynamic projection matrices and achieves linear time and space complexity by jointly optimizing the sequence length and hidden state dimension while maintaining state-of-the-art performance. Specifically, we first theoretically demonstrate that the sequence length can be compressed non-destructively from a novel information-theoretic perspective, with the compression matrices dynamically determined by the input sequence. Furthermore, we show that the hidden state dimension can be approximated by extending the Johnson–Lindenstrauss lemma, incurring only a high-order small error and optimizing the attention in bilinear form. In addition, theoretical analysis shows that the DBA is proficient in capturing high-order relations in cross-attention problems. Experiments over tasks with diverse sequence length conditions show that the DBA achieves state-of-the-art performance compared with various strong baselines while consuming less memory and running faster, demonstrating the effectiveness and efficiency of the DBA.",Efficient Transformer On the Robustness of Dataset Inference,https://openreview.net/forum?id=tNAYMjSd296,https://openreview.net/pdf?id=tNAYMjSd296,"Dataset Inference, a model fingerprinting technique published at ICLR 2021, suffers from false positives and false negatives.","Machine learning (ML) models are costly to train as they can require a significant amount of data, computational resources and technical expertise. Thus, they constitute valuable intellectual property that needs protection from adversaries wanting to steal them. $\textit{Ownership verification}$ techniques allow the victims of model stealing attacks to demonstrate that a suspect model was in fact stolen from theirs. Although a number of ownership verification techniques based on watermarking or fingerprinting have been proposed, most of them fall short either in terms of security guarantees (well-equipped adversaries can evade verification) or computational cost. A fingerprinting technique introduced at ICLR '21, $\textit{Dataset Inference}$ (DI), has been shown to offer better robustness and efficiency than prior methods. The authors of DI provided a correctness proof for linear (suspect) models. However, in the same setting, we prove that DI suffers from high false positives (FPs) -- it can incorrectly identify an independent model trained with non-overlapping data from the same distribution as stolen. We further prove that DI also triggers FPs in realistic, non-linear suspect models.
We then empirically confirm, with high confidence, that DI leads to FPs. Second, we show that DI also suffers from false negatives (FNs) -- an adversary can fool DI by regularising a stolen model's decision boundaries using adversarial training, thereby leading to an FN. Specifically, we demonstrate that DI fails to identify a model adversarially trained from a stolen dataset -- the setting where DI is the hardest to evade. Finally, we discuss the implications of our findings, the viability of fingerprinting-based ownership verification in general, and suggest directions for future work.","ownership verification, model extraction, model stealing, fingerprinting" Client-agnostic Learning and Zero-shot Adaptation for Federated Domain Generalization,https://openreview.net/forum?id=S4PGxCIbznF,https://openreview.net/pdf?id=S4PGxCIbznF,Propose client-agnostic learning and zero-shot adaptation for federated domain generalization,"Federated domain generalization (federated DG) aims to learn a client-agnostic global model from various distributed source domains and generalize the model to new clients in completely unseen domains. The main challenges of federated DG are the difficulty of building the global model with local client models from different domains while keeping data private, and the low generalizability to test clients whose data distribution deviates from that of the training clients. To address these challenges, we present two strategies: (1) client-agnostic learning with mixed instance-global statistics and (2) zero-shot adaptation with estimated statistics. In client-agnostic learning, we first augment local features by using the data distribution of other clients via global statistics in the global model's batch normalization layers. This approach allows the generation of diverse domains by mixing local and global feature statistics while keeping data private. Local models then learn client-invariant representations by applying our client-agnostic objectives with the augmented data. Next, we propose a zero-shot adapter to help the learned global model directly bridge a large domain gap between seen and unseen clients. At inference time, the adapter mixes instance statistics of a test input with global statistics that are vulnerable to distribution shift. With the aid of the adapter, the global model improves generalizability further by reflecting the test distribution. We comprehensively evaluate our methods on several benchmarks in federated DG.","Federated learning, Domain generalization, Zero-shot adaptation" Towards Robust Model Watermark via Reducing Parametric Vulnerability,https://openreview.net/forum?id=wysXxmukfCA,https://openreview.net/pdf?id=wysXxmukfCA,"Based on the observation of the watermarked model in parametric space, we propose a minimax approach to improve the robustness of watermarked models against state-of-the-art removal attacks.","Deep neural networks are valuable assets, considering their commercial potential and the costly annotation and computation resources required to build them. To protect the copyright of these deep models, backdoor-based ownership verification has recently become popular, in which the model owner can watermark the model by embedding a specific behavior before releasing it. The defender (usually the model owner) can identify whether a suspicious third-party model is ``stolen'' from it based on the presence of the behavior. Unfortunately, these watermarks are proven to be vulnerable to removal attacks, even ones as simple as fine-tuning.
To further explore this vulnerability, we investigate the parametric space and find that there exist many watermark-removed models in the vicinity of the watermarked one, which removal attacks can easily exploit. Inspired by this finding, we propose a minimax formulation to find these watermark-removed models and recover their watermark behavior. Extensive experiments demonstrate that our method improves the robustness of model watermarking against parametric changes and numerous watermark-removal attacks.","Model Watermarking, Backdoor Watermark, Ownership Verification, Deep IP Protection" DP-InstaHide: Data Augmentations Provably Enhance Guarantees Against Dataset Manipulations,https://openreview.net/forum?id=3i_Bzt7Hmcm,https://openreview.net/pdf?id=3i_Bzt7Hmcm,,"Data poisoning and backdoor attacks manipulate training data to induce security breaches in a victim model. These attacks can be provably deflected using differentially private (DP) training methods, although this comes with a sharp decrease in model performance. The InstaHide method has recently been proposed as an alternative to DP training that leverages supposed privacy properties of the mixup augmentation, although without rigorous guarantees. In this paper, we rigorously show that $k$-way mixup provably yields at least $k$ times stronger DP guarantees than a naive DP mechanism, and we observe that this enhanced privacy guarantee is a strong foundation for building defenses against poisoning.", This Looks Like It Rather Than That: ProtoKNN For Similarity-Based Classifiers,https://openreview.net/forum?id=lh-HRYxuoRr,https://openreview.net/pdf?id=lh-HRYxuoRr,,"Among research on the interpretability of deep learning models, the 'this looks like that' framework with ProtoPNet has attracted significant attention. By combining the strong power of deep learning models with the interpretability of case-based inference, ProtoPNet can achieve high accuracy while keeping its reasoning process interpretable. Many methods based on ProtoPNet have emerged to take advantage of this benefit, but despite their practical usefulness, they run into difficulty when utilizing similarity-based classifiers, e.g., in domains where unknown class samples exist. This is because ProtoPNet and its variants adopt the training process specific to linear classifiers, which allows the prototypes to represent useful image features for class recognition. Due to this difficulty, the effectiveness of similarity-based classifiers (e.g., k-nearest neighbor (KNN)) on the 'this looks like that' framework has not been sufficiently examined. To alleviate this problem, we propose ProtoKNN, an extension of ProtoPNet that adopts KNN classifiers. Extensive experiments on multiple open datasets demonstrate that the proposed method can achieve competitive results with a state-of-the-art method.","XAI, Inherently Interpretable Model, This Looks Like That Framework, Fine-grained Image Classification, Deep Learning" SEQuence-rPPG: A Fast BVP Signal Extraction Method From Frame Sequences,https://openreview.net/forum?id=QYiN3R9nVUG,https://openreview.net/pdf?id=QYiN3R9nVUG,"A new rPPG method is proposed, which is very simple, fast and accurate.","Non-contact heart rate estimation has essential implications for the development of affective computing and telemedicine. However, real-time measurement remains a challenge for existing deep learning-based methods, so a simple, fast, pre-processing-free approach is needed. Our work consists of two main parts.
First, we propose SEQ-rPPG, which first transforms the RGB frame sequence into a raw BVP signal sequence by a learning-based linear mapping and then outputs the final BVP signal using a 1D-CNN-based spectral transform and time-domain filtering. Second, to address the shortcomings of existing datasets for training the model, we collected a new large-scale dataset for training and testing. Our approach achieved competitive results on the collected large dataset (the best) and on the public dataset UBFC-rPPG (0.81 MAE with a 30s time window, test only). It requires no complex pre-processing, has the fastest speed, can run in real-time on mobile ARM CPUs, and can achieve real-time beat-to-beat performance on desktop CPUs. Benefiting from the high-quality training set, other deep learning-based models reduced errors by at least 53$\%$. We compared the methods with and without the spectral transformation, and the results show that the processing in the time domain is effective.","rPPG, Remote vital sensing, Signal processing" Understanding weight-magnitude hyperparameters in training binary networks,https://openreview.net/forum?id=uBKBoix9NXa,https://openreview.net/pdf?id=uBKBoix9NXa,We analyse the effects of hyperparameters in BNN optimization and propose an optimizer that is based upon Infinite Impulse Response Filters,"Binary Neural Networks (BNNs) are compact and efficient by using binary weights instead of real-valued weights. Current BNNs use latent real-valued weights during training, where several training hyper-parameters are inherited from real-valued networks. The interpretation of several of these hyperparameters is based on the magnitude of the real-valued weights. For BNNs, however, the magnitude of binary weights is not meaningful, and thus it is unclear what these hyperparameters actually do. One example is weight-decay, which aims to keep the magnitude of real-valued weights small. Other examples are latent weight initialization, the learning rate, and learning rate decay, which influence the magnitude of the real-valued weights. The magnitude is interpretable for real-valued weights, but loses its meaning for binary weights. In this paper we offer a new interpretation of these magnitude-based hyperparameters based on higher-order gradient filtering during network optimization. Our analysis makes it possible to understand how magnitude-based hyperparameters influence the training of binary networks, which allows for new optimization filters specifically designed for binary neural networks that are independent of their real-valued interpretation. Moreover, our improved understanding reduces the number of hyperparameters, which in turn eases the hyperparameter tuning effort and may lead to better hyperparameter values for improved accuracy.","Binary Neural Networks, Optimization, Digital Signal Processing, Infinite Impulse Response Filter" Sample-efficient multi-objective molecular optimization with GFlowNets,https://openreview.net/forum?id=ztgT8Iok130,https://openreview.net/pdf?id=ztgT8Iok130,A GFlowNet-based Bayesian optimization algorithm for sample-efficient multi-objective molecular optimization,"Many crucial scientific problems involve designing novel molecules with desired properties, which can be formulated as an expensive black-box optimization problem over the discrete chemical space. Computational methods have achieved initial success but still struggle with simultaneously optimizing multiple competing properties in a sample-efficient manner.
In this work, we propose a multi-objective Bayesian optimization (MOBO) algorithm leveraging the hypernetwork-based GFlowNets (HN-GFN) as an acquisition function optimizer, with the purpose of sampling a diverse batch of candidate molecular graphs from an approximate Pareto front. Using a single preference-conditioned hypernetwork, HN-GFN learns to explore various trade-offs between objectives. Inspired by reinforcement learning, we further propose a hindsight-like off-policy strategy to share high-performing molecules among different preferences in order to speed up learning for HN-GFN. Through synthetic experiments, we illustrate that HN-GFN has adequate capacity to generalize over preferences. Extensive experiments show that our framework outperforms the best baselines by a large margin in terms of hypervolume in various real-world MOBO settings.","multi-objective molecular optimization, Bayesian optimization, generative flow networks" Learning Robust Kernel Ensembles with Kernel Average Pooling,https://openreview.net/forum?id=ul7HSEpkEHX,https://openreview.net/pdf?id=ul7HSEpkEHX,Using Kernel Average Pools for learning robust kernel ensembles in neural networks,"Model ensembles have long been used in machine learning to reduce the variance in individual model predictions, making them more robust to input perturbations. Pseudo-ensemble methods like dropout have also been commonly used in deep learning models to improve generalization. However, the application of these techniques to improve neural networks' robustness against input perturbations remains underexplored. We introduce \emph{Kernel Average Pool (KAP)}, a new neural network building block that applies the mean filter along the kernel dimension of the layer activation tensor. We show that ensembles of kernels with similar functionality naturally emerge in convolutional neural networks equipped with KAP and trained with backpropagation. Moreover, we show that when combined with activation noise, KAP models are remarkably robust against various forms of adversarial attacks. Empirical evaluations on CIFAR10, CIFAR100, TinyImagenet, and Imagenet datasets show substantial improvements in robustness against strong adversarial attacks such as AutoAttack that are on par with adversarially trained networks but are importantly obtained without training on any adversarial examples.","convolutional neural networks, topographical neural networks, adversarial robustness, ensemble models" Affinity-VAE for clustering and classification of objects in multidimensional image data,https://openreview.net/forum?id=FDiO2xfKnkj,https://openreview.net/pdf?id=FDiO2xfKnkj,,"In this work we present affinity-VAE: a framework for automatic clustering and classification of objects in multidimensional image data based on their similarity. The method expands on the concept of $\beta$-VAEs with an informed similarity-based loss component driven by an affinity matrix. The affinity-VAE is able to create rotationally-invariant, morphologically homogeneous clusters in the latent representation, with improved cluster separation compared with a standard $\beta$-VAE. We explore the extent of latent disentanglement and continuity of the latent spaces on both 2D and 3D image data, including simulated biological electron cryo-tomography (cryo-ET) volumes as an example of a scientific application. 
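One plausible reading of the informed similarity term described above, sketched with assumed shapes, weightings, and a toy affinity matrix (an illustration of the loss structure, not the authors' implementation): a $\beta$-VAE objective plus a penalty pushing latent similarities towards a given affinity matrix.

import numpy as np

def affinity_vae_loss(recon_err, kl, z, affinity, beta=4.0, gamma=1.0):
    # Sketch: beta-VAE objective plus an affinity-driven similarity term that
    # encourages the cosine similarity of latents to match the affinity matrix.
    z_n = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z_n @ z_n.T
    affinity_term = np.mean((sim - affinity) ** 2)
    return recon_err + beta * kl + gamma * affinity_term

rng = np.random.default_rng(0)
z = rng.normal(size=(6, 8))   # batch of latent codes (shape assumed)
A = np.eye(6)                 # toy affinity: each item similar only to itself
print(affinity_vae_loss(recon_err=1.0, kl=0.1, z=z, affinity=A))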
","representation learning, VAE, $\beta$-VAE, affinity, cryo-ET, cryo-electron tommography, structural biology, visual proteomics" Model Stealing Attacks Against Vision-Language Models,https://openreview.net/forum?id=v-rx235RlfI,https://openreview.net/pdf?id=v-rx235RlfI,We propose the first model stealing attack against the vision-language models.,"Vision-language models have flourished these years and are regarded as promising solutions to vision-language tasks. However, training vision-language models always requires enormous effort, making the models valuable intellectual properties (IPs). In this paper, we pioneer to propose the first model stealing attack against the vision-language models, the goal of which is to steal the functionality of the target models. Specifically, we target fine-tuned CLIP models with black-box access. We query the model to extract model information through either the text-to-image retrieval or the image-to-text retrieval API and then leverage the information to train a local copy of the target model. Experiments show the effectiveness of the model stealing attacks. We validate that our attacks are query-efficient, API-agnostic, data-agnostic, and architecture-agnostic, which broaden the attack scenarios. As a counterpart, we examine a defense based on the idea of out-of-distribution detection, which is impotent without strong assumptions. Our research pressures the unprotected release and prevalence of powerful vision-language models and appeals to the community that their IP protections, if not the least, cannot be less.","Vision-Language Model, Model Stealing Attack" Causal Attention to Exploit Transient Emergence of Causal Effect,https://openreview.net/forum?id=lXMlDL78Alx,https://openreview.net/pdf?id=lXMlDL78Alx,We propose the causal attention mechanism for a class of causal network reconstruction tasks.,"We propose a causal reasoning mechanism called $\textit{causal attention}$ that can improve performance of machine learning models on a class of causal inference tasks by revealing the generation process behind the observed data. We consider the problem of reconstructing causal networks (e.g., biological neural networks) connecting large numbers of variables (e.g., nerve cells), of which evolution is governed by nonlinear dynamics consisting of weak coupling-drive (i.e., causal effect) and strong self-drive (dominants the evolution). The core difficulty is sparseness of causal effect that emerges (the coupling force is significant) only momentarily and otherwise remains dormant in the neural activity sequence. $\textit{Causal attention}$ is designed to guide the model to make inference focusing on the critical regions of time series data where causality may manifest. Specifically, attention coefficients are assigned autonomously by a neural network trained to maximise the Attention-extended Transfer Entropy, which is a novel generalization of the iconic transfer entropy metric. Our results show that, without any prior knowledge of dynamics, $\textit{causal attention}$ explicitly identifies areas where the strength of coupling-drive is distinctly greater than zero. 
This mechanism substantially improves reconstruction performance for both synthetic and real causal networks using data generated by neuronal models widely used in neuroscience.","causal attention mechanism, coupling-drive, sparse causal effect, neural dynamics, causal network reconstruction" A Simple Nadaraya-Watson Head for Explainable and Calibrated Classification,https://openreview.net/forum?id=Hrj3MhDO_a,https://openreview.net/pdf?id=Hrj3MhDO_a,"We present a simple, nonparametric replacement to the fully-connected head in the image classification setting based on the Nadaraya-Watson (NW) estimator, which can be shown to be interpretable and well-calibrated.","We propose a simple, non-learnable, and nonparametric prediction head to be used with any neural network architecture. The proposed head can be viewed as a classic Nadaraya-Watson (NW) model, where the prediction is a weighted average of labels from a support set. The weights are computed from distances between the query feature and support features. This is in contrast to the dominant approach of using a learnable classification head (e.g., a fully-connected layer) on the features, which can be challenging to interpret and can yield poorly calibrated predictions. Our empirical results on an array of computer vision tasks demonstrate that the NW head can yield better calibration than its parametric counterpart, while having comparable accuracy and with minimal computational overhead. To further increase inference-time efficiency, we propose a simple approach that involves a clustering step run on the training set to create a relatively small distilled support set. In addition to using the weights as a means of interpreting model predictions, we further present an easy-to-compute ``support influence function,'' which quantifies the influence of a support element on the prediction for a given query. As we demonstrate in our experiments, the influence function can allow the user to debug a trained model. We believe that the NW head is a flexible, interpretable, and highly useful building block that can be used in a range of applications.","Image Classification, Nonparametric, Interpretability, Explainability, Calibration" Imitating Human Behaviour with Diffusion Models,https://openreview.net/forum?id=Pv1GPQzRrC8,https://openreview.net/pdf?id=Pv1GPQzRrC8,,"Diffusion models have emerged as powerful generative models in the text-to-image domain. This paper studies their application as observation-to-action models for imitating human behaviour in sequential environments. Human behaviour is stochastic and multimodal, with structured correlations between action dimensions. Meanwhile, standard modelling choices in behaviour cloning are limited in their expressiveness and may introduce bias into the cloned policy. We begin by pointing out the limitations of these choices. We then propose that diffusion models are an excellent fit for imitating human behaviour, since they learn an expressive distribution over the joint action space. We introduce several innovations to make diffusion models suitable for sequential environments: designing suitable architectures, investigating the role of guidance, and developing reliable sampling strategies.
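A schematic DDPM-style reverse loop for sampling an action conditioned on an observation (the noise schedule, step count, and stand-in denoiser below are assumptions, not the paper's architecture):

import numpy as np

def sample_action(denoise, obs, act_dim, steps=50, seed=0):
    # Schematic DDPM ancestral sampling over the action vector.
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    a = rng.normal(size=act_dim)         # start from pure noise
    for t in reversed(range(steps)):
        eps_hat = denoise(a, obs, t)     # predicted noise, conditioned on obs
        a = (a - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            a = a + np.sqrt(betas[t]) * rng.normal(size=act_dim)
    return a

# A trained observation-conditioned network would replace this stand-in denoiser.
toy_denoiser = lambda a, obs, t: a - obs[:len(a)]
print(sample_action(toy_denoiser, obs=np.ones(2), act_dim=2))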
Experimentally, diffusion models closely match human demonstrations in a simulated robotic control task and a modern 3D gaming environment.","imitation learning, behavioral cloning, diffusion models, generative models" Learning Privacy-Preserving Graph Embeddings Against Sensitive Attributes Inference,https://openreview.net/forum?id=cRzIAw8Rem2,https://openreview.net/pdf?id=cRzIAw8Rem2,Preserving the inference privacy of certain sensitive attributes associated with graph nodes for graph representation learning.,"We focus on preserving the privacy of some sensitive attributes associated with certain private nodes on a graph when releasing graph data. Notably, deleting the sensitive attributes from the graph data does not defend against adversarial attacks, because an adversary can still leverage the graph structure information and the non-sensitive node features to predict the sensitive attributes. We propose a framework to learn graph embeddings insensitive to the changes of certain specified sensitive attributes while maximally preserving the graph structure information and non-sensitive node features for downstream tasks. The key ingredient of our framework is a novel conditional variational graph autoencoder (CVGAE), which captures the relationship between the learned embeddings and the sensitive attributes. This allows us to quantify the privacy loss, which can be used to penalize privacy leakage when learning graph embeddings without adversarial training.","Inference privacy, differential privacy, graph representation" InteriorSim: A Photorealistic Simulator for Embodied AI,https://openreview.net/forum?id=SYsmAZ8PHez,https://openreview.net/pdf?id=SYsmAZ8PHez,InteriorSim is a photorealistic simulator for embodied AI in the home.,"Interactive simulators are becoming powerful tools for training embodied agents, but existing simulators suffer from limited content diversity, physical interactivity, and visual fidelity. We address these limitations by introducing InteriorSim, a photorealistic simulator for embodied AI in the home. To create our simulator, we worked closely with a team of professional artists for over a year to construct 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each of our environments features detailed geometry, photorealistic materials, and a unique floor plan and object layout designed by a professional artist, i.e., we do not rely on remixing existing layouts to create additional content. Our environments are implemented as Unreal Engine assets, and we provide an OpenAI Gym interface for interacting with the environments via Python. We demonstrate the utility of our simulator by using it in a zero-shot sim-to-real transfer scenario, i.e., we train a point-goal navigation policy entirely in simulation that can successfully navigate through cluttered real-world environments when deployed on a real robot. We also demonstrate that our simulator is quantitatively more photorealistic than existing simulators measured by human comparisons and standard metrics for evaluating generative models. Finally, we demonstrate that our simulator achieves better sim-to-real performance than existing simulators on a real-world semantic segmentation task.
All of our assets and code will be made available online.","simulation environment, embodied AI" Prompt-Based Metric Learning for Few-Shot NER,https://openreview.net/forum?id=wHt8UumYfGT,https://openreview.net/pdf?id=wHt8UumYfGT,,"Few-shot named entity recognition (NER) targets generalizing to unseen labels and/or domains with few labeled examples. Existing metric learning methods compute token-level similarities between query and support sets, but are not able to fully incorporate label semantics into modeling. To address this issue, we propose a simple method to largely improve metric learning for NER: 1) multiple prompt schemas are designed to enhance label semantics; 2) we propose a novel architecture to effectively combine multiple prompt-based representations. Empirically, our method achieves new state-of-the-art (SOTA) results under 16 of the 18 considered settings, substantially outperforming the previous SOTA by an average of 8.84% and a maximum of 34.51% in relative gains of micro F1.", MetaPhysiCa: Causality-aware Robustness to OOD Initial Conditions in Physics-informed Machine Learning,https://openreview.net/forum?id=4ojYamKgnQc,https://openreview.net/pdf?id=4ojYamKgnQc,"This work proposes combining causal structural discovery, invariant risk minimization, and meta-learning in order to make Physics-informed Machine Learning robust to OOD tasks.","A fundamental challenge in physics-informed machine learning (PIML) is the design of robust PIML methods for out-of-distribution (OOD) forecasting tasks, where the tasks require learning-to-learn from observations of the same (ODE) dynamical system with different unknown parameters, and demand accurate forecasts even under initial conditions outside the training support. In this work we propose a solution for such tasks, which we define as a meta-learning procedure for causal structural discovery (including invariant risk minimization). Using three different OOD tasks, we empirically observe that the proposed approach significantly outperforms existing state-of-the-art PIML and deep learning methods.","physics-informed machine learning, out-of-distribution, robustness, causality" Representation Balancing with Decomposed Patterns for Treatment Effect Estimation,https://openreview.net/forum?id=xgQO2_F-w8b,https://openreview.net/pdf?id=xgQO2_F-w8b,"We derive the bound for individual propensity confusion and decompose representation balancing into patterns of (i) individual propensity confusion and group distance minimization and (ii) pre-balancing and balancing, for treatment effect estimation.","Estimating treatment effects from observational data is subject to a problem of covariate shift caused by selection bias. Recent studies have attempted to mitigate this problem by group distance minimization, that is, balancing the distribution of representations between the treated and controlled groups. The rationale behind this is that learning balanced representations while preserving the predictive power of factual outcomes is expected to generalize to counterfactual inference. Inspired by this, we propose a new approach to better capture the patterns that contribute to representation balancing and outcome prediction. Specifically, we derive a theoretical bound that naturally ties the notion of propensity confusion to representation balancing, and further transform the balancing Patterns into Decompositions of Individual propensity confusion and Group distance minimization (PDIG). 
Moreover, we propose to decompose proxy features into Patterns of Pre-balancing and Balancing Representations (PPBR), as it is insufficient if only balanced representations are considered in outcome prediction. Extensive experiments on simulation and benchmark data confirm not only that PDIG leads to mutual reinforcement between individual propensity confusion and group distance minimization, but also that PPBR brings improvements to outcome prediction, especially counterfactual inference. We believe these findings provide useful heuristics for further investigation of what affects the generalizability of representation balancing models in counterfactual estimation.","Treatment Effect Estimation, Counterfactual Estimation, Representation Balancing, Selection Bias, Covariate Shift" Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning,https://openreview.net/forum?id=3Pf3Wg6o-A4,https://openreview.net/pdf?id=3Pf3Wg6o-A4,Using language models to produce a human interpretable chain of logical reasoning to answer questions.,"Large language models (LLMs) have been shown to be capable of impressive few-shot generalisation to new tasks. However, they still tend to perform poorly on multi-step logical reasoning problems. Here we carry out a comprehensive evaluation of LLMs on 46 tasks that probe different aspects of logical reasoning. We show that language models tend to perform fairly well at single step inference or entailment tasks, but struggle to chain together multiple reasoning steps to solve more complex problems. In light of this, we propose a Selection-Inference (SI) framework that exploits pre-trained LLMs as general processing modules, and alternates between selection and inference to generate a series of interpretable, causal reasoning steps leading to the final answer. We show that a 7B parameter LLM used within the SI framework in a 5-shot generalisation setting, with no fine-tuning, yields a performance improvement of over 100% compared to an equivalent vanilla baseline on a suite of 10 logical reasoning tasks. The same model in the same setting even outperforms a significantly larger 280B parameter baseline on the same suite of tasks. Moreover, answers produced by the SI framework are accompanied by a causal natural-language-based reasoning trace, which has important implications for the safety and trustworthiness of the system.","System 2, Logical reasoning, Language Models, Large Language Models, Reasoning, Neuro-symbolic, Neural Symbolic, Interpretability" Guided Safe Shooting: model based reinforcement learning with safety constraints,https://openreview.net/forum?id=nMZhFqYsiad,https://openreview.net/pdf?id=nMZhFqYsiad,,"In the last decade, reinforcement learning has successfully solved complex control tasks and decision-making problems, like the Go board game. Yet, there are few success stories when it comes to deploying those algorithms to real-world scenarios. One of the reasons is the lack of guarantees when dealing with and avoiding unsafe states, a fundamental requirement in critical control engineering systems. In this paper, we introduce Guided Safe Shooting (GuSS), a model-based RL approach that can learn to control systems with minimal violations of the safety constraints. The model is learned on the data collected during the operation of the system in an iterated batch fashion, and is then used to plan for the best action to perform at each time step.
We propose three different safe planners, one based on a simple random shooting strategy and two based on MAP-Elites, a more advanced divergent-search algorithm. Experiments show that these planners help the learning agent avoid unsafe situations while maximally exploring the state space, a necessary aspect when learning an accurate model of the system. Furthermore, compared to model-free approaches, learning a model allows GuSS to reduce the number of interactions with the real system while still reaching high rewards, a fundamental requirement when handling engineering systems.","Model-based Reinforcement learning, Safe-RL, Evolutionary method, Planning" Contrastive Meta-Learning for Partially Observable Few-Shot Learning,https://openreview.net/forum?id=6iVJOtr2zL2,https://openreview.net/pdf?id=6iVJOtr2zL2,An approach for meta-learning contrastive representations under partial observability.,"Many contrastive and meta-learning approaches learn representations by identifying common features in multiple views. However, the formalism for these approaches generally assumes that features must be shared across views in order to be captured coherently. We consider the problem of learning a unified representation from partial observations, where useful features may be present in only some of the views. We approach this through a probabilistic formalism enabling views to map to representations with different levels of uncertainty in different components; these views can then be integrated with one another through marginalisation over that uncertainty. Our approach, Partial Observation Experts Modelling (POEM), then enables us to meta-learn consistent representations from partial observations. We evaluate our approach on an adaptation of a comprehensive few-shot learning benchmark, Meta-Dataset, and demonstrate the benefits of POEM over other meta-learning methods at representation learning from partial observations. We further demonstrate the utility of POEM by meta-learning to represent an environment from partial views observed by an agent exploring the environment.","Contrastive Learning, Meta-Learning, Few-Shot Learning, Partial Observability" Analyzing Transformers in Embedding Space,https://openreview.net/forum?id=nlda8uNdwuJ,https://openreview.net/pdf?id=nlda8uNdwuJ,,"Understanding Transformer-based models has attracted significant attention, as they lie at the heart of recent technological advances across machine learning. While most interpretability methods rely on running models over inputs, recent work has shown that a zero-pass approach, where parameters are interpreted directly without a forward/backward pass, is feasible for some Transformer parameters, and for two-layer attention networks. In this work, we present a theoretical analysis where all parameters of a trained Transformer are interpreted by projecting them into the embedding space, that is, the space of vocabulary items they operate on. We derive a simple theoretical framework to support our arguments and provide ample evidence for its validity. First, an empirical analysis showing that parameters of both pretrained and fine-tuned models can be interpreted in embedding space. Second, we present two applications of our framework: (a) aligning the parameters of different models that share a vocabulary, and (b) constructing a classifier without training by ""translating"" the parameters of a fine-tuned classifier to parameters of a different model that was only pretrained.
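A minimal sketch of this zero-pass projection (the toy vocabulary, embedding matrix, and parameter vector below are assumed for illustration):

import numpy as np

def top_tokens(param_vec, embedding, vocab, k=3):
    # Project a parameter vector onto the vocabulary through the embedding
    # matrix and return the k items it aligns with most strongly.
    scores = embedding @ param_vec            # one inner product per vocab item
    idx = np.argsort(scores)[::-1][:k]
    return [(vocab[i], float(scores[i])) for i in idx]

rng = np.random.default_rng(0)
vocab = ['cat', 'dog', 'car', 'tree', 'river']
E = rng.normal(size=(len(vocab), 8))      # toy embedding matrix
w = E[1] + 0.1 * rng.normal(size=8)       # a parameter vector near the row for dog
print(top_tokens(w, E, vocab))            # dog should rank first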
Overall, our findings open the door to interpretation methods that, at least in part, abstract away from model specifics and operate in the embedding space only.","transformers, interpretability, embedding space, explainability" Enhancing the Inductive Biases of Graph Neural ODE for Modeling Dynamical Systems,https://openreview.net/forum?id=ATLEl_izD87,https://openreview.net/pdf?id=ATLEl_izD87,Inferring the dynamics of physical systems can be significantly enhanced by Graph neural ODEs with appropriate inductive biases,"Neural networks with physics-based inductive biases such as Lagrangian neural networks (Lnn) and Hamiltonian neural networks (Hnn) learn the dynamics of physical systems by encoding strong inductive biases. Alternatively, Neural ODEs with appropriate inductive biases have also been shown to give similar performances. However, these models, when applied to particle-based systems, are transductive in nature and hence do not generalize to large system sizes. In this paper, we present a graph-based neural ODE, Gnode, to learn the time evolution of dynamical systems. Further, we carefully analyse the role of different inductive biases on the performance of Gnode. We show that, similar to Lnn and Hnn, encoding the constraints explicitly can significantly improve the training efficiency and performance of Gnode. Our experiments also assess the value of additional inductive biases, such as Newton's third law, on the final performance of the model. We demonstrate that inducing these biases can enhance the performance of the model by orders of magnitude in terms of both energy violation and rollout error. Interestingly, we observe that the Gnode trained with the most effective inductive biases, namely mcgnode, outperforms the graph versions of Lnn and Hnn, namely, Lagrangian graph networks (Lgn) and Hamiltonian graph networks (Hgn), in terms of energy violation error by $\sim$4 orders of magnitude for a pendulum system, and $\sim$2 orders of magnitude for spring systems. These results suggest that node-based systems can give competitive performances with energy-conserving neural networks by employing appropriate inductive biases.","Neural ODE, Graph neural network, physical systems, Graph Neural ODE" Efficient Planning in a Compact Latent Action Space,https://openreview.net/forum?id=cA77NrVEuqn,https://openreview.net/pdf?id=cA77NrVEuqn,We propose Trajectory Autoencoding Planner (TAP) a model-based RL method that learns a compact discrete latent action space for efficient planning.,"Planning-based reinforcement learning has shown strong performance in tasks in discrete and low-dimensional continuous action spaces. However, planning usually brings significant computational overhead for decision making, so scaling such methods to high-dimensional action spaces remains challenging. To advance efficient planning for high-dimensional continuous control, we propose Trajectory Autoencoding Planner (TAP), which learns low-dimensional latent action codes with a state-conditional VQ-VAE. The decoder of the VQ-VAE thus serves as a novel dynamics model that takes latent actions and current state as input and reconstructs long-horizon trajectories. During inference time, given a starting state, TAP searches over discrete latent actions to find trajectories that have both high probability under the training distribution and high predicted cumulative reward. Empirical evaluation in the offline RL setting demonstrates low decision latency that is insensitive to the growing raw action dimensionality.
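The discrete bottleneck at the heart of this design can be sketched as a nearest-neighbour codebook lookup (the codebook size, latent dimension, and planning comment are assumptions for illustration):

import numpy as np

def quantize(z, codebook):
    # VQ step: replace each continuous latent with its nearest codebook entry.
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # squared distances
    codes = d.argmin(1)
    return codes, codebook[codes]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))   # 16 discrete latent action codes (assumed)
z = rng.normal(size=(3, 4))           # encoder outputs for 3 latent steps
codes, z_q = quantize(z, codebook)
print(codes)   # the discrete indices a planner would search over
# Planning would then score decoded candidates, e.g. decoder(state, z_q) -> trajectory.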
For Adroit robotic hand manipulation tasks with high-dimensional continuous action space, TAP surpasses existing model-based methods by a large margin and also beats strong model-free actor-critic baselines.","Model-based RL, Planning, Sequence Modelling RL, Generative Model, Offline Reinforcement Learning" Improved Stein Variational Gradient Descent with Importance Weights,https://openreview.net/forum?id=eWvjcZIZrWu,https://openreview.net/pdf?id=eWvjcZIZrWu,theoretical paper to show the power of importance weights in SVGD,"Stein Variational Gradient Descent~(\algname{SVGD}) is a popular sampling algorithm used in various machine learning tasks. It is well known that \algname{SVGD} arises from a discretization of the kernelized gradient flow of the Kullback-Leibler divergence $\KL\left(\cdot\mid\pi\right)$, where $\pi$ is the target distribution. In this work, we propose to enhance \algname{SVGD} via the introduction of {\em importance weights}, which leads to a new method for which we coin the name \algname{$\beta$-SVGD}. In the continuous time and infinite particles regime, the time for this flow to converge to the equilibrium distribution $\pi$, quantified by the Stein Fisher information, depends on $\rho_0$ and $\pi$ very weakly. This is very different from the kernelized gradient flow of Kullback-Leibler divergence, whose time complexity depends on $\KL\left(\rho_0\mid\pi\right)$. Under certain assumptions, we provide a descent lemma for the population limit \algname{$\beta$-SVGD}, which covers the descent lemma for the population limit \algname{SVGD} when $\beta\to 0$. We also illustrate the advantages of \algname{$\beta$-SVGD} over \algname{SVGD} by simple experiments.","SVGD, Importance Sampling, Importance Weights, Sampling, R\'enyi Divergence, KL-divergence" Correlative Information Maximization Based Biologically Plausible Neural Networks for Correlated Source Separation,https://openreview.net/forum?id=8JsaP7j1cL0,https://openreview.net/pdf?id=8JsaP7j1cL0,This paper proposes biologically plausible neural networks for blind separation of correlated sources exploiting prior domain assumptions via an information maximization criterion.,"The brain effortlessly extracts latent causes of stimuli, but how it does this at the network level remains unknown. Most prior attempts at this problem proposed neural networks that implement independent component analysis, which works under the limitation that latent elements are mutually independent. Here, we relax this limitation and propose a biologically plausible neural network that extracts correlated latent sources by exploiting information about their domains. To derive this network, we choose maximum correlative information transfer from inputs to outputs as the separation objective under the constraint that the outputs are restricted to their presumed sets. The online formulation of this optimization problem naturally leads to neural networks with local learning rules. Our framework incorporates infinitely many source domain choices and flexibly models complex latent structures. Choices of simplex or polytopic source domains result in networks with piecewise-linear activation functions. 
We provide numerical examples to demonstrate the superior correlated source separation capability for both synthetic and natural sources.","Biologically Plausible Neural Networks, Blind Correlated Source Separation, Correlative Information Maximization" Simplicity bias leads to amplified performance disparities,https://openreview.net/forum?id=mAWJpM7S21-,https://openreview.net/pdf?id=mAWJpM7S21-,"We introduce difficulty disparity and difficulty amplification, where a model's bias towards simplicity results in disparate performance between groups.","The simple idea that not all things are equally difficult has surprising implications when applied in a fairness context. In this work we explore how ""difficulty"" is model-specific, such that different models find different parts of a dataset challenging. When difficulty correlates with group information, we term this difficulty disparity. Drawing a connection with recent work exploring the inductive bias towards simplicity of SGD-trained models, we show that when such a disparity exists, it is further amplified by commonly-used models. We quantify this amplification factor across a range of settings aiming towards a fuller understanding of the role of model bias. We also present a challenge to the simplifying assumption that ``fixing'' a dataset is sufficient to ensure unbiased performance.","fairness, model bias, dataset bias, bias amplification, simplicity bias" Annealed Fisher Implicit Sampler,https://openreview.net/forum?id=eLgK35G3A5d,https://openreview.net/pdf?id=eLgK35G3A5d,Train an implicit sampler by minimizing Fisher Divergence with a novel S2D loss.,"Sampling from an un-normalized target distribution is an important problem in many scientific fields. An implicit sampler uses a parametric transform $x=G_\theta(z)$ to push forward an easy-to-sample latent code $z$ to obtain a sample $x$. Such samplers are favored for fast inference speed and flexible architecture. Thus it is appealing to train an implicit sampler for sampling from the un-normalized target. In this paper, we propose a novel approach to training an implicit sampler by minimizing the Fisher Divergence between sampler and target distribution. We find that the trained sampler works well for relatively simple targets but may fail for more complicated multi-modal targets. To improve the training for multi-modal targets, we propose another adaptive training approach that trains the sampler to gradually learn a sequence of annealed distributions. We construct the annealed distribution path to bridge a simple distribution and the complicated target. With the annealed approach, the sampler is capable of handling challenging multi-modal targets. In addition, we also introduce a few MCMC correction steps after the sampler to better spread the samples. We call our proposed sampler \emph{the Annealed Fisher Implicit Sampler} (AFIS). We test AFIS on several sampling benchmarks. The experiments show that our AFIS outperforms baseline methods in many aspects. We also show in theory that the added MC correction steps get faster mixing by using the learned sampler as MCMC's initialization. ","Implicit Generative Models, Score Matching, Learning to Sample, Sampling" Do You Remember? 
Overcoming Catastrophic Forgetting for Fake Audio Detection,https://openreview.net/forum?id=fUX3bszZSOw,https://openreview.net/pdf?id=fUX3bszZSOw,We propose a regularized adaptive weight modification algorithm to overcome catastrophic forgetting for fake audio detection.,"Current fake audio detection algorithms achieve promising performance on most datasets. However, their performance may be significantly degraded when dealing with audio from a different dataset. The orthogonal weight modification to overcome catastrophic forgetting does not consider the similarity of some audio, including fake audio obtained by the same algorithm and genuine audio, on different datasets. To overcome this limitation, we propose a continual learning algorithm for fake audio detection to overcome catastrophic forgetting, called Regularized Adaptive Weight Modification (RAWM). Specifically, when fine-tuning a detection network, our approach adaptively computes the direction of weight modification according to the ratio of genuine to fake utterances. The adaptive modification direction ensures the network can detect fake audio on the new dataset while preserving the knowledge of the previous model, thus mitigating catastrophic forgetting. In addition, orthogonal weight modification of fake audio in the new dataset will skew the distribution of inferences on audio in the previous dataset with similar acoustic characteristics, so we introduce a regularization constraint to force the network to remember this distribution. We evaluate our approach across multiple datasets and obtain a significant performance improvement on cross-dataset experiments.","fake audio detection, regularized adaptive weight modification, catastrophic forgetting, continual learning" Towards Conditionally Dependent Masked Language Models,https://openreview.net/forum?id=RnH_0iL4xao,https://openreview.net/pdf?id=RnH_0iL4xao,"We study the limitations of MRFs defined from MLMs' unary conditionals, and propose alternatives that are either better (from a probabilistic modeling standpoint) or faster to run","Masked language modeling has proven to be an effective paradigm for learning representations of language. However, when multiple tokens are masked out, the masked language model's (MLM) distribution over the masked positions assumes that the masked tokens are conditionally independent given the unmasked tokens---an assumption that does not hold in practice. Existing work addresses this limitation by interpreting the sum of unary scores (i.e., the logits or the log probabilities of single tokens when conditioned on all others) as the log potential of a Markov random field (MRF). While this new model no longer makes any independence assumptions, it remains unclear whether this approach (i) results in a good probabilistic model of language and further (ii) derives a model that is faithful (i.e., has matching unary distributions) to the original model. This paper studies MRFs derived this way in a controlled setting where only two tokens are masked out at a time, which makes it possible to compute exact distributional properties. We find that such pairwise MRFs are often worse probabilistic models of language from a perplexity standpoint, and moreover have unary distributions that do not match the unary distributions of the original MLM. We then study a statistically-motivated iterative optimization algorithm for deriving joint pairwise distributions that are more compatible with the original unary distributions. 
While this iterative approach outperforms the MRF approach, the algorithm itself is too expensive to be practical. We thus amortize this optimization process through a parameterized feed-forward layer that learns to modify the original MLM's pairwise distributions to be both non-independent and faithful, and find that this approach outperforms the MLM for scoring pairwise tokens.","Markov random fields, masked language models, compatibility" Leveraging Importance Weights in Subset Selection,https://openreview.net/forum?id=9Nj_gNdvqYf,https://openreview.net/pdf?id=9Nj_gNdvqYf,,"We present a subset selection algorithm designed to work with arbitrary model families in a practical batch setting. In such a setting, an algorithm can sample examples one at a time but, in order to limit overhead costs, is only able to update its state (i.e. further train model weights) once a large enough batch of examples is selected. Our algorithm, IWeS, selects examples by importance sampling where the sampling probability assigned to each example is based on the entropy of models trained on previously selected batches. IWeS achieves significant performance improvements compared to other subset selection algorithms on seven publicly available datasets. Additionally, it is competitive in an active learning setting, where the label information is not available at selection time. We also provide an initial theoretical analysis to support our importance weighting approach, proving generalization and sampling rate bounds.","data subset selection, importance weighted sampling" Interactive Sequential Generative Models,https://openreview.net/forum?id=PExjmUwEAAH,https://openreview.net/pdf?id=PExjmUwEAAH,We propose a novel framework for multiagent trajectories that augments sequential generative models with latent social structures.,"Understanding spatiotemporal relationships among several agents is of considerable relevance for many domains. Team sports represent a particularly interesting real-world proving ground since modeling interacting athletes requires capturing highly dynamic and complex agent-agent dependencies in addition to temporal components. However, existing generative methods in this field either entangle all latent factors into a single variable and are thus constrained in practical applicability, or they focus on uncovering interaction structures, which restricts their generative ability. To address this gap, we propose a framework for multiagent trajectories that augments sequential generative models with latent social structures. First, we derive a novel objective via approximate inference using a disentangled latent space that accurately describes the data generating process in such systems. Based on the proposed training criterion, we then present a model architecture that unifies insights from neural interaction inference and graph-structured variational recurrent neural networks for generating collective movements while allocating latent information. We validate our model on data from professional soccer and basketball. 
Our framework not only improves upon existing state-of-the-art approaches in forecasting trajectories, but also infers semantically meaningful representations that can be used in downstream tasks.","Generative Models and Autoencoders, Graph Neural Networks, Recurrent Networks, Sequential Models, Multi-Agent" Gradient flow in the gaussian covariate model: exact solution of learning curves and multiple descent structures,https://openreview.net/forum?id=O2cW5Q3bH_M,https://openreview.net/pdf?id=O2cW5Q3bH_M,,"A recent line of work has shown remarkable behaviors of the generalization error curves in simple learning models. Even least-squares regression has shown atypical features such as model-wise double descent, and further works have observed triple or multiple descents. Other important characteristics are the epoch-wise descent structures which emerge during training. The observations of model-wise and epoch-wise descents have been analytically derived in limited theoretical settings (such as the random feature model) and are otherwise experimental. In this work, we provide a full and unified analysis of the whole time-evolution of the generalization curve, in the asymptotic large-dimensional regime and under gradient flow, within a wider theoretical setting stemming from a Gaussian covariate model. In particular, we cover most cases already disparately observed in the literature, and also provide examples of the existence of multiple descent structures as a function of a model parameter or time. Furthermore, we show that our theoretical predictions adequately match the learning curves obtained by gradient descent over realistic datasets. Technically, we compute averages of rational expressions involving random matrices using recent developments in random matrix theory based on ""linear pencils"". Another contribution, which is also of independent interest in random matrix theory, is a new derivation of related fixed point equations (and an extension thereof) using Dyson Brownian motions.","Gaussian Covariate Model, Gradient Flow, Gradient Descent, Double Descent, Epoch-wise Double Descent, Random Matrix, Linear Pencil, Cauchy Integrals, High-dimensional Limits, Stieltjes Transform, Random Feature Model" Copy is All You Need,https://openreview.net/forum?id=CROlOA9Nd8C,https://openreview.net/pdf?id=CROlOA9Nd8C,," The dominant text generation models compose output by selecting words in a fixed vocabulary. In this paper, we formulate text generation as progressively copying text segments (e.g., words or phrases) from an existing text collection. We compute the contextualized representations of meaningful text segments and index them using efficient vector search toolkits. The task of text generation is then decomposed into a series of copy-and-paste operations: at each time step, we seek suitable text spans from existing articles in the text collection rather than selecting from a standalone vocabulary. Experiments on the standard language modeling benchmark (WikiText-103) show that our approach achieves better generation quality by copying from the original training data (0.758 vs. 0.691 MAUVE). We also show that our approach attains additional performance gains by simply scaling up to larger text collections without extra training. Furthermore, our approach allows for effective domain adaptation by simply switching to any domain-specific text collection, again without further training. 
Finally, we observe that our approach achieves better inference efficiency than standard token-level autoregressive models thanks to the reduction of decoding steps.",neural text generation Graph Backup: Data Efficient Backup Exploiting Markovian Transitions,https://openreview.net/forum?id=iI8zWfCyCIQ,https://openreview.net/pdf?id=iI8zWfCyCIQ,"In this paper, we treat the transition data of the MDP as a graph, and define a novel backup operator, Graph Backup, which exploits this graph structure for better value estimation. ","The successes of deep Reinforcement Learning (RL) are limited to settings where we have a large stream of online experiences, but applying RL in the data-efficient setting with limited access to online interactions is still challenging. A key to data-efficient RL is good value estimation, but current methods in this space fail to fully utilise the structure of the trajectory data gathered from the environment. In this paper, we treat the transition data of the MDP as a graph, and define a novel backup operator, Graph Backup, which exploits this graph structure for better value estimation. Compared to multi-step backup methods such as $n$-step $Q$-Learning and TD($\lambda$), Graph Backup can perform counterfactual credit assignment and gives stable value estimates for a state regardless of which trajectory the state is sampled from. Our method, when combined with popular off-policy value-based methods, provides improved performance over one-step and multi-step methods on a suite of data-efficient RL benchmarks including MiniGrid, Minatar and Atari100K. We further analyse the reasons for this performance boost through a novel visualisation of the transition graphs of Atari games.","reinforcement learning, graph structure, neuro-symbolic methods, data efficient reinforcement learning" Finding Generalization Measures by Contrasting Signal and Noise,https://openreview.net/forum?id=vHgL7XYBiTd,https://openreview.net/pdf?id=vHgL7XYBiTd,A new generalization measure,"Generalization is one of the most fundamental challenges in deep learning, aiming to predict model performance on unseen data. Empirically, such predictions usually rely on a validation set, while recent works showed that an unlabeled validation set also works. Without validation sets, it is extremely difficult to obtain non-vacuous generalization bounds, which leads to a weaker task of finding generalization measures that monotonically relate to generalization error. In this paper, we propose a new generalization measure REF Complexity (RElative Fitting velocity between signal and noise), motivated by the intuition that a given model-algorithm pair may generalize well if it fits signal (e.g., true labels) fast while fitting noise (e.g., random labels) slowly. Empirically, REF Complexity monotonically relates to test accuracy in real-world datasets without accessing additional validation sets, and achieves $-0.988$ correlation on CIFAR-10 and $-0.960$ correlation on CIFAR-100. We further theoretically verify the utility of REF Complexity under the regime of convex training with stochastic gradient descent. 
","generalization measure, signal and noise" Linearised Implicit Variational Inference,https://openreview.net/forum?id=6uv5W_DXvRr,https://openreview.net/pdf?id=6uv5W_DXvRr,A novel bound for training implicit variational approximations for Bayesian Neural Networks,"Bayesian neural networks (BNNs) are touted for robustness under data drift, resilience to overfitting and catastrophic forgetting whilst also producing actionable uncertainty estimates. In variational inference, these elegant properties are contingent on the expressivity of the variational approximation. Posteriors over parameters of large models are usually multimodal and highly correlated and hence cannot be well-approximated by simple, prescribed densities. We posit implicit variational distributions specified using differentiable generators are more flexible and propose a novel bound for training BNNs using such approximations (amortized neural samplers). The proposed bound uses an approximation of the variational distribution's entropy by locally linearising the generator. Unlike existing works, our method does not require a discriminator network and moves away from an unfavourable adversarial objective. Our formulation resembles normalizing flows but does not necessitate invertibility of the generator. Moreover, we use a differentiable numerical lower bound on the Jacobians of the generator, mitigating computational concerns. We report log-likelihoods on UCI datasets competitive with deep ensembles and test our method on out-of-distribution benchmarks.","Implicit models, Variational Inference, Bayesian Deep Learning" Adversarial Driving Policy Learning by Misunderstanding the Traffic Flow,https://openreview.net/forum?id=bW-gfNJatfXX,https://openreview.net/pdf?id=bW-gfNJatfXX,,"Acquiring driving policies that can transfer to unseen environments is essential for driving in dense traffic flows. Adversarial training is a promising path to improve robustness under disturbances. Most prior works leverage few agents to induce driving policy's failures. However, we argue that directly implementing this training framework into dense traffic flow degrades transferability in unseen environments. In this paper, we propose a novel robust policy training framework that is capable of applying adversarial training based on a coordinated traffic flow. We start by building up a coordinated traffic flow where agents are allowed to communicate Social Value Orientations (SVOs). Adversary emerges when the traffic flow misunderstands the SVO of driving agent. We utilize this property to formulate a minimax optimization problem where the driving policy maximizes its own reward and a spurious adversarial policy minimizes it. Experiments demonstrate that our adversarial training framework significantly improves zero-shot transfer performance of the driving policy in dense traffic flows compared to existing algorithms.", Differentiable and transportable structure learning,https://openreview.net/forum?id=Z-CqSH6J_VK,https://openreview.net/pdf?id=Z-CqSH6J_VK,"We introduce an architecture and loss to encourage transportability in gradient-based graph learning methods. Before our method, gradient-based approaches were not transportable.","Directed acyclic graphs (DAGs) encode a lot of information about a particular distribution in its structure. However, compute required to infer these structures is typically super-exponential in the number of variables, as inference requires a sweep of a combinatorially large space of potential structures. 
That is, until recent advances made it possible to search this space using a differentiable metric, drastically reducing search time. While this technique, named NOTEARS, is widely considered a seminal work in DAG discovery, it concedes an important property in favour of differentiability: transportability. To be transportable, the structures discovered on one dataset must apply to another dataset from the same domain. In our paper, we introduce D-Struct, which recovers transportability in the discovered structures through a novel architecture and loss function, while remaining completely differentiable. Because D-Struct remains differentiable, our method can be easily adopted in existing differentiable architectures, as was previously done with NOTEARS. In our experiments, we empirically validate D-Struct with respect to edge accuracy and structural Hamming distance in a variety of settings.",graph learning Distributed Inference and Fine-tuning of Large Language Models Over The Internet,https://openreview.net/forum?id=HLQyRgRnoXo,https://openreview.net/pdf?id=HLQyRgRnoXo,We propose a practical algorithm for running large language models by pooling together weak geographically distributed devices. Our system can run inference on BLOOM-176B over the Internet more than 10x faster compared to RAM offloading.,"Large language models (LLMs) are useful in many NLP tasks and become more capable with size, scaling to over 100 billion parameters. With the release of BLOOM-176B and OPT-175B, everyone can download pretrained models of this scale. Still, using a pre-trained 100B+ model requires high-end hardware, making it inaccessible to most researchers. Recent studies in memory-efficient training (e.g. offloading) could alleviate these costs, but they do not cover important use cases of LLMs, such as autoregressive inference. In this work, we investigate methods for cost-efficient inference of large language models, comparing local and distributed strategies. We observe that a large enough model (100B+) could run efficiently on geodistributed devices in a consumer-grade network, for example by connecting existing compute resources of multiple research groups or pooling under-utilized compute from multiple cloud regions. To run LLMs in this unconventional setting, we develop a fault-tolerant inference algorithm for language models. We propose Petals, a decentralized system for running LLMs, and show that it can run BLOOM-176B over the Internet more than $10\times$ faster than offloading for sequential generation. We evaluate the performance of our system in both simulated conditions and an actual distributed system spanning two continents. The design of Petals allows participants to run inference, fine-tune, or run inference on fine-tuned models simultaneously without affecting each other's results.","volunteer computing, distributed deep learning, distributed inference, efficient inference, large language models, gpt-3" Association Rules in QUBO Samples and Where to Find Them,https://openreview.net/forum?id=vKHuq9WeHMU,https://openreview.net/pdf?id=vKHuq9WeHMU,Find valuable association rules from QUBO samples to simplify QUBO problems and improve optimisation results,"There are sometimes strong associations between variables in the samples of a Quadratic Unconstrained Binary Optimization (QUBO) problem. A natural question arises: is there any value in these associations? We study the max-cut problem and observe that associations can be represented as rules to simplify the QUBO problem. 
Classical and quantum annealers work better when the problem size is smaller. To effectively and efficiently find associations between variables, we adapt traditional association rule mining to the case of QUBO samples and propose a Fast Association Rule Mining algorithm (FARM) specifically for mining QUBO samples. We also propose strategies and a workflow to select and apply promising rules and simplify QUBO problems. We evaluate our method on the D-Wave Quantum Annealer as well as the Fujitsu Digital Annealer. The experiments demonstrate the utility of FARM as a visualisation tool for understanding associations in QUBO samples. The results also demonstrate the potential of our method in closing the gap between samples and ground truth. The source code will be disclosed to the public if the manuscript is accepted. ","Association rule, Annealing, QUBO" Why adversarial training can hurt robust accuracy,https://openreview.net/forum?id=-CA8yFkPc7O,https://openreview.net/pdf?id=-CA8yFkPc7O,Adversarial training can hurt robust generalization for perceptible perturbations when the sample size is small,"Machine learning classifiers with high test accuracy often perform poorly under adversarial attacks. It is commonly believed that adversarial training alleviates this issue. In this paper, we demonstrate that, surprisingly, the opposite can be true for a natural class of perceptible perturbations --- even though adversarial training helps when enough data is available, it may in fact hurt robust generalization in the small sample size regime. We first prove this phenomenon for a high-dimensional linear classification setting with noiseless observations. Using intuitive insights from the proof, we surprisingly find perturbations on standard image datasets for which this behavior persists. Specifically, it occurs for perceptible attacks that effectively reduce class information such as object occlusions or corruptions. ","Adversarial training, Learning Theory, Robust generalisation" ExpressivE: A Spatio-Functional Embedding For Knowledge Graph Completion,https://openreview.net/forum?id=xkev3_np08z,https://openreview.net/pdf?id=xkev3_np08z,ExpressivE: A fully expressive KGC model that captures a rich set of patterns with an intuitive geometric interpretation and state-of-the-art performance.,"Knowledge graphs are inherently incomplete. Therefore, substantial research has been directed towards knowledge graph completion (KGC), i.e., predicting missing triples from the information represented in the knowledge graph (KG). Embedding models have yielded promising results for KGC, yet no current KGC embedding model is capable of: (1) fully capturing vital inference patterns (e.g., composition), (2) capturing prominent logical rules jointly (e.g., hierarchy and composition), and (3) providing an intuitive interpretation of captured patterns. In this work, we propose ExpressivE, a fully expressive spatio-functional embedding model that solves all these challenges simultaneously. ExpressivE embeds pairs of entities as points and relations as hyper-parallelograms in the virtual triple space $\mathbb{R}^{2d}$. This model design allows ExpressivE not only to capture a rich set of inference patterns jointly but additionally to display any supported inference pattern through the spatial relation of hyper-parallelograms, offering an intuitive and consistent geometric interpretation of ExpressivE embeddings and their captured patterns. 
Experimental results on standard KGC benchmarks reveal that ExpressivE is competitive with state-of-the-art models and even significantly outperforms them on WN18RR.","knowledge graph embedding, knowledge graph completion, composition, hierarchy, geometric interpretation" Counterfactual Explanation via Search in Gaussian Mixture Distributed Latent Space,https://openreview.net/forum?id=AXP2Sf6qqSZ,https://openreview.net/pdf?id=AXP2Sf6qqSZ,,"Counterfactual Explanations (CEs) are an important tool in Algorithmic Recourse for addressing two questions: 1. What are the crucial factors that led to an automated prediction/decision? 2. How can these factors be changed to achieve a more favorable outcome from a user's perspective? Thus, guiding the user's interaction with AI systems by proposing easy-to-understand explanations and easy-to-attain actionable changes is essential for the trustworthy adoption and long-term acceptance of AI systems. In the literature, various methods have been proposed to generate CEs, and different quality measures have been suggested to evaluate these methods. However, the generation of CEs is usually computationally expensive, and the resulting suggestions are unrealistic and thus non-actionable. In this paper, we introduce a new method to generate CEs for a pre-trained binary classifier by first shaping the latent space of an autoencoder to be a mixture of Gaussian distributions. CEs are then generated in latent space by linear interpolation between the query sample and the centroid of the target class. We show that our method maintains the characteristics of the input sample during the counterfactual search. In various experiments on image and tabular datasets, we show that the proposed method is competitive on different quality measures and efficiently returns results that are closer to the original data manifold than those of three state-of-the-art methods, which is essential for realistic high-dimensional machine learning applications.","XAI, Counterfactual Explanation, Autoencoder, Gaussian-Mixture Distribution, Disentanglement" FedPD: Defying data heterogeneity through privacy distillation,https://openreview.net/forum?id=IERSU0La-Nt,https://openreview.net/pdf?id=IERSU0La-Nt,,"Model performance of federated learning (FL) typically suffers from data heterogeneity, i.e., data distributions vary across clients. Prior works have already shown the great potential of sharing client information to mitigate data heterogeneity. Yet, some literature shows a dilemma in preserving strong privacy and promoting model performance simultaneously. Revisiting the purpose of sharing information motivates us to raise the fundamental questions: Which part of the data is more critical for model generalization? Which part of the data is more privacy-sensitive? Can we solve this dilemma by sharing useful (for generalization) features and maintaining more sensitive data locally? Our work sheds light on data-dominated sharing and training by decoupling the original training data into sensitive features and generalizable features. To be specific, we propose a \textbf{Fed}erated \textbf{P}rivacy \textbf{D}istillation framework named FedPD to alleviate the privacy-performance dilemma. Namely, FedPD keeps the distilled sensitive features locally and constructs a global dataset using shared generalizable features in a differentially private manner. 
Accordingly, clients can perform local training on both the local and the securely shared data to acquire high model performance while avoiding the leakage of the locally kept sensitive features. Theoretically, we demonstrate the superiority of the strategy of sharing only useful features over sharing raw data. Empirically, we show the efficacy of FedPD in promoting performance with comprehensive experiments.", Harnessing Client Drift with Decoupled Gradient Dissimilarity,https://openreview.net/forum?id=bp6Lr0TmmUS,https://openreview.net/pdf?id=bp6Lr0TmmUS,,"The performance of federated learning (FL) typically suffers from client drift caused by heterogeneous data, where data distributions vary across clients. Recent studies show that the gradient dissimilarity between clients induced by the data distribution discrepancy causes the client drift. Thus, existing methods mainly focus on correcting the gradients. However, it is challenging to identify which clients should (or should not) be corrected. This challenge raises a series of questions: will local training, without gradient correction, contribute to the server model's generalization on other clients' distributions? When does this generalization contribution hold? How can the challenge be addressed when it fails? To answer these questions, we analyze the generalization contribution of local training and conclude that it is bounded by the conditional Wasserstein distance between clients' distributions. Thus, the key to promoting the generalization contribution is to leverage similar conditional distributions for local training. As collecting data distributions can cause privacy leakage, we propose decoupling the deep models, i.e., splitting them into high-level and low-level models, to harness client drift. Namely, high-level models are trained on shared feature distributions, which promotes the generalization contribution and alleviates gradient dissimilarity. Experimental results demonstrate that FL with decoupled gradient dissimilarity is robust to data heterogeneity.","Federated Learning, Deep Learning" SeKron: A Decomposition Method Supporting Many Factorization Structures,https://openreview.net/forum?id=bG0TaFFa1c9,https://openreview.net/pdf?id=bG0TaFFa1c9,,"While convolutional neural networks (CNNs) have become the de facto standard for most image processing and computer vision applications, their deployment on edge devices remains challenging. Tensor decomposition methods provide a means of compressing CNNs to meet the wide range of device constraints by imposing certain factorization structures on their convolution tensors. However, being limited to the small set of factorization structures presented by state-of-the-art decomposition approaches can lead to sub-optimal performance. We propose SeKron, a novel tensor decomposition method that offers a wide variety of factorization structures, using sequences of Kronecker products. By recursively finding approximating Kronecker factors, we arrive at optimal decompositions for each of the factorization structures. We show that SeKron is a flexible decomposition that generalizes widely used methods, such as Tensor-Train (TT), Tensor-Ring (TR), Canonical Polyadic (CP) and Tucker decompositions. Crucially, we derive an efficient convolution projection algorithm shared by all SeKron structures, leading to seamless compression of CNN models. 
We validate SeKron for model compression on both high-level and low-level computer vision tasks and find that it outperforms state-of-the-art decomposition methods.","model compression, tensor decomposition, factorization structure" Localized Randomized Smoothing for Collective Robustness Certification,https://openreview.net/forum?id=-k7Lvk0GpBl,https://openreview.net/pdf?id=-k7Lvk0GpBl,We propose a novel collective robustness certificate based on randomized smoothing that uses different anisotropic smoothing distributions for the different outputs of a multi-output model.,"Models for image segmentation, node classification and many other tasks map a single input to multiple labels. By perturbing this single shared input (e.g. the image) an adversary can manipulate several predictions (e.g. misclassify several pixels). Collective robustness certification is the task of provably bounding the number of robust predictions under this threat model. The only dedicated method that goes beyond certifying each output independently is limited to strictly local models, where each prediction is associated with a small receptive field. We propose a more general collective robustness certificate for all types of models and further show that this approach is beneficial for the larger class of softly local models, where each output is dependent on the entire input but assigns different levels of importance to different input regions (e.g. based on their proximity in the image). The certificate is based on our novel localized randomized smoothing approach, where the random perturbation strength for different input regions is proportional to their importance for the outputs. Localized smoothing Pareto-dominates existing certificates on both image segmentation and node classification tasks, simultaneously offering higher accuracy and stronger guarantees.","Robustness, Certification, Verification, Trustworthiness, Graph neural networks" Learning Dictionaries over Datasets through Wasserstein Barycenters,https://openreview.net/forum?id=K1KJ0NbFu1h,https://openreview.net/pdf?id=K1KJ0NbFu1h,We apply Wasserstein Dictionary Learning to datasets understood as empirical distributions.,"Dictionary learning consists of trying to represent objects in terms of basic elements (atoms) weighted by an importance factor (representation). Non-linear dictionary learning using optimal transport as a metric has been previously studied for normalized non-negative data on a fixed grid. We propose a new framework by using Wasserstein Dictionary Learning on datasets understood as empirical distributions. We leverage Wasserstein barycenters for learning a dictionary of virtual datasets and embeddings in a simplex. We apply our method to unsupervised domain adaptation, improving the state of the art by 1.96% and 2.70%, respectively, and to manifold learning of Gaussian distributions and color histograms.","Dictionary Learning, Optimal Transport, Domain Adaptation, Manifold Learning" Spatial Entropy as an Inductive Bias for Vision Transformers,https://openreview.net/forum?id=ZfILgclY4J-,https://openreview.net/pdf?id=ZfILgclY4J-,We propose a self-supervised pretext task to include an object-based inductive bias in Vision Transformers.,"Recent work showed that the attention maps of Vision Transformers (VTs), when trained with self-supervision, can contain a semantic segmentation structure which does not spontaneously emerge when training is supervised. 
In this paper, we explicitly encourage the emergence of this spatial clustering as a form of training regularization, thereby including a self-supervised pretext task in standard supervised learning. In more detail, we exploit the assumption that, in a given image, objects usually correspond to a few connected regions, and we propose a spatial formulation of the information entropy to quantify this object-based inductive bias. By minimizing the proposed spatial entropy, we include an additional self-supervised signal during training. Using extensive experiments, we show that the proposed regularization is beneficial with different training scenarios, datasets, downstream tasks and VT architectures. The code will be available upon acceptance.","Vision Transformers, Self-Supervised Learning, Attention, Regularization." Learning Interpretable Neural Discrete Representation for Time Series Classification,https://openreview.net/forum?id=XkYe7K_AQ8I,https://openreview.net/pdf?id=XkYe7K_AQ8I,We propose a model for time series classification based on a convolutional model which learns a small dictionary of patterns in an unsupervised manner.,"Time series classification is a challenging research field with many real-life applications. Recent advances in deep learning have significantly improved the state of the art: recurrent or convolutional architectures allow automatic extraction of complex discriminating patterns that improve performance. Those approaches suffer from a lack of interpretability: the patterns are mapped into a high-dimensional latent vector space, are not representable in the time domain, and are often not even localizable. In this paper, we present a novel neural convolutional architecture that aims to provide a trade-off between interpretability and effectiveness based on the learning of a dictionary of discrete representations. The proposed model guarantees (1) that a small number of patterns are learned, and they are visualizable and interpretable; (2) a shift-equivariance property of the model associated with time-consistency of the representation; and (3) a linear classifier over a limited number of patterns, leading to an explainable decision. To ensure the robustness of the discrete representations, they are learned in an unsupervised process independently of the classification task. This further enables strong performance in transfer learning. We present extensive experiments on the UCR benchmark with respect to usual baselines. The interpretability of the model is illustrated empirically. The chosen trade-off naturally results in a decrease in performance compared to the state of the art. The performance drop is, however, limited and strongly dependent on the application domain. The experiments highlight the efficiency of the model for the transfer learning task, showing the robustness of the representations.","Time series classification, discrete neural representation, interpretability, deep learning" Representational Dissimilarity Metric Spaces for Stochastic Neural Networks,https://openreview.net/forum?id=xjb563TH-GH,https://openreview.net/pdf?id=xjb563TH-GH,Representational dissimilarity metrics that account for noise geometry in biological and artificial neural responses. ,"Quantifying similarity between neural representations---e.g. hidden layer activation vectors---is a perennial problem in deep learning and neuroscience research. Existing methods compare deterministic responses (e.g. 
artificial networks that lack stochastic layers) or averaged responses (e.g., trial-averaged firing rates in biological data). However, these measures of deterministic representational similarity ignore the scale and geometric structure of noise, both of which play important roles in neural computation. To rectify this, we generalize previously proposed shape metrics (Williams et al. 2021) to quantify differences in stochastic representations. These new distances satisfy the triangle inequality, and thus can be used as a rigorous basis for many supervised and unsupervised analyses. We show that this approach is practical for large-scale data (e.g. comparisons across thousands of networks) and provides insights that cannot be measured with existing metrics. For example, we find that the stochastic geometries of neurobiological representations of oriented visual gratings and naturalistic scenes respectively resemble untrained and trained deep network representations. Further, we are able to more accurately predict certain network attributes (e.g. training hyperparameters) from a network's position in stochastic (versus deterministic) shape space.","stochastic neural networks, noise, representational geometry, dissimilarity metrics" Hierarchical Prototypes for Unsupervised Dynamics Generalization in Model-Based Reinforcement Learning,https://openreview.net/forum?id=dO4aZ9-CsTn,https://openreview.net/pdf?id=dO4aZ9-CsTn,,"By incorporating the environment-specific factor into the dynamics prediction, model-based reinforcement learning (MBRL) is able to generalise to environments with diverse dynamics. In the majority of real-world scenarios, the environment-specific factor is not observable, so existing methods attempt to estimate it from historical transition segments. Nevertheless, earlier research was unable to identify distinct clusters for environment-specific factors learned from different environments, resulting in poor performance. To address this issue, we introduce a set of environmental prototypes to represent the environment-specific representation for each environment. By encouraging learned environment-specific factors to resemble their assigned environmental prototypes more closely, the discrimination between factors estimated from distinct environments is enhanced. To learn such prototypes, we first construct prototypes for each sampled trajectory and then hierarchically combine trajectory prototypes with similar semantics into one environmental prototype. Experiments demonstrate that environment-specific factors estimated by our method have superior clustering performance and consistently improve MBRL's generalisation performance in six environments.","Unsupervised Dynamics Generalization, Model-Based Reinforcement Learning" MILAN: Masked Image Pretraining on Language Assisted Representation,https://openreview.net/forum?id=bvxY6ThxGzB,https://openreview.net/pdf?id=bvxY6ThxGzB,,"Self-attention based transformer models have been dominating many computer vision tasks in the past few years. Their superb model quality heavily depends on excessively large labeled image datasets. In order to reduce the reliance on large labeled datasets, reconstruction based masked autoencoders are gaining popularity, which learn high quality transferable representations from unlabeled images. For the same purpose, recent weakly supervised image pretraining methods explore language supervision from text captions accompanying the images. 
In this work, we propose masked image pretraining on language assisted representation, dubbed MILAN. Instead of predicting raw pixels or low-level features, our pretraining objective is to reconstruct the image features with substantial semantic signals that are obtained using caption supervision. Moreover, to accommodate our reconstruction target, we propose a more efficient prompting decoder architecture and a semantic aware mask sampling mechanism, which further advance the transfer performance of the pretrained model. Experimental results demonstrate that MILAN delivers higher accuracy than previous works. When the masked autoencoder is pretrained and finetuned on the ImageNet-1K dataset with an input resolution of 224×224, MILAN achieves a top-1 accuracy of 85.4% on ViT-Base, surpassing the previous state of the art by 1%. In the downstream semantic segmentation task, MILAN achieves 52.7 mIoU using the ViT-Base backbone on the ADE20K dataset, outperforming previous masked pretraining results by 4 points.", Irregularity Reflection Neural Network for Time Series Forecasting,https://openreview.net/forum?id=zuQQ7GrDFfH,https://openreview.net/pdf?id=zuQQ7GrDFfH,This research devises the Irregularity Representation Block, which extracts and learns the irregularity of time series data using a CNN architecture.,"Time series forecasting is a long-standing challenge in a variety of industries, and deep learning stands as the mainstream paradigm for handling this forecasting problem. With recent success, representations of time series components (e.g., trend and seasonality) are also considered in the learning process of the models. However, the residual remains underexplored due to the difficulty of formulating its inherent complexity. In this study, we propose a novel Irregularity Reflection Neural Network (IRN) that reflects the residual for time series forecasting. First, we redefine the residual as the irregularity and express it as a sum of individual short regular waves, viewing the Fourier series from a micro perspective. Second, we design a module, named the Irregularity Representation Block (IRB), based on convolutional architectures, to mimic the variables of the derived irregularity representation. IRN places the IRB on top of a forecasting model to learn the irregularity representation of time series. Extensive experiments on multiple real-world datasets demonstrate that IRN outperforms the state-of-the-art benchmarks in time series forecasting tasks. ","DNN, CNN, time series, fourier series, irregularity representation learning" Sequential Learning of Neural Networks for Prequential MDL,https://openreview.net/forum?id=dMMPUvNSYJr,https://openreview.net/pdf?id=dMMPUvNSYJr,Techniques for obtaining short MDL description lengths for image datasets with modern NN architectures and continual learning.,"Minimum Description Length (MDL) provides a framework and an objective for principled model evaluation. It formalizes Occam's Razor and can be applied to data from non-stationary sources. In the prequential formulation of MDL, the objective is to minimize the cumulative next-step log-loss when sequentially going through the data and using previous observations for parameter estimation. It thus closely resembles a continual- or online-learning problem. In this study, we evaluate approaches for computing prequential description lengths for image classification datasets with neural networks. 
Considering the computational cost, we find that online learning with rehearsal has favorable performance compared to the previously widely used block-wise estimation. We propose forward-calibration to better align the model's predictions with the empirical observations and introduce replay-streams, a minibatch incremental training technique to efficiently implement approximate random replay while avoiding large in-memory replay buffers. As a result, we present description lengths for a suite of image classification datasets that improve upon previously reported results by large margins.","mdl, continual-learning, deep-learning" Reducing Communication Entropy in Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=o8HjiLULWq5,https://openreview.net/pdf?id=o8HjiLULWq5,,"Communication in multi-agent reinforcement learning has been drawing attention recently for its significant role in cooperation. However, multi-agent systems may suffer from limitations on communication resources and thus need efficient communication techniques in real-world scenarios. According to the Shannon-Hartley theorem, messages must have lower entropy to be transmitted reliably over worse channels. Therefore, we aim to reduce message entropy in multi-agent communication. A fundamental challenge here is that the gradients of entropy are either 0 or infinity, disabling gradient-based methods. To handle this, we propose a pseudo gradient descent scheme, which reduces entropy by adjusting the distributions of messages wisely. We conduct experiments on six environment settings and two base communication frameworks and find that our scheme can reduce communication entropy by up to 90% with nearly no loss of performance.","multi-agent reinforcement learning, multi-agent communication, low entropy" Relaxed Attention for Transformer Models,https://openreview.net/forum?id=O5PXo5Y0csVi,https://openreview.net/pdf?id=O5PXo5Y0csVi,A simple smoothing in the attention function of transformer models contributes to improved regularization and internal language model suppression.,"The powerful modeling capabilities of all-attention-based transformer architectures often cause overfitting and, for natural language processing tasks, lead to an implicitly learned internal language model in the autoregressive transformer decoder, complicating the integration of external language models. In this paper, we explore relaxed attention, a simple and easy-to-implement smoothing of the attention weights, yielding a two-fold improvement to the general transformer architecture: First, relaxed attention provides regularization when applied to the self-attention layers in the encoder. Second, we show that it naturally supports the integration of an external language model as it suppresses the implicitly learned internal language model by relaxing the cross attention in the decoder. We demonstrate the benefit of relaxed attention across several tasks with clear improvement in combination with recent benchmark approaches. Specifically, we exceed the former state-of-the-art performance of 26.90% word error rate on the largest public lip-reading LRS3 benchmark with a word error rate of 26.31%, and achieve a top-performing BLEU score of 37.67 on the IWSLT14 (DE$\rightarrow$EN) machine translation task without external language models and with virtually no additional model parameters. 
Code and models will be made publicly available.","transformer, attention, regularization, internal language model, relaxed attention" SynBench: Task-Agnostic Benchmarking of Pretrained Representations using Synthetic Data,https://openreview.net/forum?id=TLx9diIRJVj,https://openreview.net/pdf?id=TLx9diIRJVj,,"Recent success in fine-tuning large models, pretrained on broad data at scale, on downstream tasks has led to a significant paradigm shift in deep learning, from task-centric model design to task-agnostic representation learning and task-specific fine-tuning. As the representations of pretrained models are used as a foundation for different downstream tasks, this paper proposes a new task-agnostic framework, \textit{SynBench}, to measure the quality of pretrained representations using synthetic data. We set up a reference using a theoretically derived robustness-accuracy tradeoff of the class-conditional Gaussian mixture. Given a pretrained model, the representations of data synthesized from the Gaussian mixture are compared with our reference to infer the quality. By comparing the ratio of area-under-curve between the raw data and their representations, SynBench offers a quantifiable score for robustness-accuracy performance benchmarking. Our framework applies to a wide range of pretrained models taking continuous data inputs and is independent of the downstream tasks and datasets. Evaluated with several pretrained vision transformer models, the experimental results show that our SynBench score matches well the actual linear probing performance of the pre-trained model when fine-tuned on downstream tasks. Moreover, our framework can be used to inform the design of robust linear probing on pretrained representations to mitigate the robustness-accuracy tradeoff in downstream tasks.", Learning topology-preserving data representations,https://openreview.net/forum?id=lIu-ixf-Tzf,https://openreview.net/pdf?id=lIu-ixf-Tzf,We propose a method for learning topology-preserving data representations.,"We propose a method for learning topology-preserving data representations (dimensionality reduction). The method aims to provide topological similarity between the data manifold and its latent representation via enforcing the similarity in topological features (clusters, loops, 2D voids, etc.) and their localization. The core of the method is the minimization of the Representation Topology Divergence (RTD) between original high-dimensional data and low-dimensional representation in latent space. RTD minimization provides closeness in topological features with strong theoretical guarantees. We develop a scheme for RTD differentiation and apply it as a loss term for the autoencoder. The proposed method ``RTD-AE'' better preserves the global structure and topology of the data manifold than state-of-the-art competitors as measured by linear correlation, triplet distance ranking accuracy, and Wasserstein distance between persistence barcodes.","representation learning, dimensionality reduction, topological data analysis" Interpreting Class Conditional GANs with Channel Awareness,https://openreview.net/forum?id=CCF5eG4UPNS,https://openreview.net/pdf?id=CCF5eG4UPNS,"This work discovers that some channels are primarily responsible for a particular class, and some channels are shared by all classes. This finding further facilitates multiple novel applications.","Understanding the mechanism of generative adversarial networks (GANs) helps us better use GANs for downstream applications. 
Existing efforts mainly target interpreting unconditional models, leaving underexplored how a conditional GAN learns to render images of various categories. This work fills this gap by investigating how a class conditional generator unifies the synthesis of multiple classes. For this purpose, we dive into the widely used class-conditional batch normalization (CCBN), and observe that each feature channel is activated at varying degrees given different categorical embeddings. To describe such a phenomenon, we propose channel awareness, which quantitatively characterizes how a single channel contributes to the final synthesis. Extensive evaluations and analyses on the BigGAN model pre-trained on ImageNet reveal that only a subset of channels is primarily responsible for the generation of a particular category, that similar categories (e.g., cat and dog) usually relate to some of the same channels, and that some channels turn out to share information across all classes. Furthermore, our algorithm enables several novel applications with conditional GANs. Concretely, we achieve (1) versatile image editing by simply altering a single channel and manage to (2) harmoniously hybridize two different classes. We further verify that the proposed channel awareness shows promising potential in (3) segmenting the synthesized image and (4) evaluating the category-wise synthesis performance.","class-conditional GANs, representations, interpretation" Escaping saddle points in zeroth-order optimization: two function evaluations suffice,https://openreview.net/forum?id=5d_yTyTj646,https://openreview.net/pdf?id=5d_yTyTj646,We provide the first result showing that zeroth-order optimization with a constant number of function evaluations per iteration can escape saddle points efficiently.,"Zeroth-order methods are useful in solving black-box optimization and reinforcement learning problems in unknown environments. These methods use function values to estimate the gradient. As optimization problems are often nonconvex, it is a natural question to understand how zeroth-order methods escape saddle points. In this paper, we consider zeroth-order methods that, at each iteration, may freely choose $2m$ function evaluations, where $m$ ranges from 1 to $d$, with $d$ denoting the problem dimension. We show that by adding an appropriate isotropic perturbation at each iteration, a zeroth-order algorithm based on $2m$ function evaluations per iteration can not only find $\epsilon$-second-order stationary points polynomially fast, but do so using only $\tilde{O}(\frac{d}{\epsilon^{2.5}})$ function evaluations.","zeroth-order optimization, nonconvex optimization, escape saddle points" Vector Quantization and Shifting: Exploiting Latent Properties to Optimize Neural Codecs,https://openreview.net/forum?id=QVSoh6VM4nG,https://openreview.net/pdf?id=QVSoh6VM4nG,We improve the performance of neural codecs via uniform vector quantization and the gradient of the entropy.,"End-to-end image/video codecs are getting competitive compared to traditional compression techniques that have been developed through decades of manual engineering efforts. These trainable codecs have many advantages over traditional techniques such as easy adaptation to perceptual distortion metrics and high performance on specific domains thanks to their learning ability. However, state-of-the-art neural codecs take advantage of neither the vector quantization technique nor the existence of the gradient of entropy on the decoding device. 
In this research, we provide theoretical insights about these two properties (quantization and entropy gradient) and show that they can improve the performance of many off-the-shelf codecs. First, we prove that a non-uniform quantization map on the neural codec's latents is not necessary. Thus, we improve the performance by using a predefined optimal uniform vector quantization map. Secondly, we theoretically show that the gradient of entropy (available at the decoder side) is correlated with the gradient of the reconstruction error (which is not available at the decoder side). Thus, we use the former as a proxy in order to improve the compression performance. According to our results, this proposal saves between 2\% and 4\% of rate at the same quality for various pre-trained methods.","Neural Image Compression, Uniform Vector Quantization, Gradient of Entropy" Time-Myopic Go-Explore: Learning A State Representation for the Go-Explore Paradigm,https://openreview.net/forum?id=i1Z_VysEgu8,https://openreview.net/pdf?id=i1Z_VysEgu8,"To add flexibility to and possibly improve Go-Explore, we study how its representation heuristic can be replaced by a time-predicting neural network.","Very large state spaces with a sparse reward signal are difficult to explore. The lack of sophisticated guidance results in poor performance for numerous reinforcement learning algorithms. In these cases, the commonly used random exploration is often not helpful. The literature shows that this kind of environment requires enormous effort to systematically explore large chunks of the state space. Learned state representations can help here to improve the search by providing semantic context and building structure on top of the raw observations. In this work we introduce a novel time-myopic state representation that clusters temporally close states together while providing a time prediction capability between them. By adapting this model to the Go-Explore paradigm (Ecoffet et al., 2021b), we demonstrate the first learned state representation that reliably estimates novelty instead of using the hand-crafted representation heuristic. Our method shows an improved solution for the detachment problem which still remains an issue in the Go-Explore exploration phase. We provide evidence that our proposed method covers the entire state space with respect to all possible time trajectories, without causing disadvantageous conflict-overlaps in the cell archive. Analogous to native Go-Explore, our approach is evaluated on the hard exploration environments MontezumaRevenge, Gravitar and Frostbite (Atari) in order to validate its capabilities on difficult tasks. Our experiments show that time-myopic Go-Explore is an effective alternative to the domain-engineered heuristic while also being more general. The source code of the method is available on GitHub.","Exploration, Self-Supervised Learning, Go-Explore" Mastering Spatial Graph Prediction of Road Networks,https://openreview.net/forum?id=OIcMPYZXFPL,https://openreview.net/pdf?id=OIcMPYZXFPL,,"Accurately predicting road networks from satellite images requires a global understanding of the network topology. We propose to capture such high-level information by introducing a graph-based framework that simulates the addition of sequences of graph edges using a reinforcement learning (RL) approach. In particular, given a partially generated graph associated with a satellite image, an RL agent nominates modifications that maximize a cumulative reward. 
In contrast to standard supervised techniques, which tend to be restricted to commonly used surrogate losses, these rewards can be based on various complex, potentially non-continuous, metrics of interest. This yields more power and flexibility to encode problem-dependent knowledge. Empirical results on several benchmark datasets demonstrate enhanced performance and increased high-level reasoning about the graph topology when using a tree-based search. We further highlight the superiority of our approach under substantial occlusions by introducing a new synthetic benchmark dataset for this task.", A Simple Framework for Low-Resolution Detection with High-resolution Knowledge,https://openreview.net/forum?id=hgHEUpY3TB6,https://openreview.net/pdf?id=hgHEUpY3TB6,We propose a simple distillation-based framework to improve detection on low-resolution images.,"This paper is dedicated to improving object detection performance on low-resolution images. The intuitive way is to distill high-resolution knowledge from models trained on high-resolution images, termed cross-resolution distillation. Unfortunately, most existing distillation methods focus on knowledge distillation with same-resolution images for both teacher and student. Directly applying these methods to cross-resolution distillation results in limited improvement. To address this issue, we introduce a simple yet effective framework, i.e., LRDet. The key in LRDet is the bridge branch, acting as an intermediate state between teacher and student. With the bridge branch, LRDet can i) align the resolution and supervision targets between the high-resolution teacher and the low-resolution student, and ii) then transfer the high-resolution knowledge smoothly and effectively. Experiments demonstrate that LRDet consistently improves various well-known detectors on low-resolution images, e.g., from 35.4 mAP to 37.8 mAP with RetinaNet-R50 on MS COCO using 600 × 1000 input. Meanwhile, it is easy to utilize large teachers in LRDet as conventional distillation methods do, which can further improve the low-resolution performance. For example, RetinaNet-R50 with 600 × 1000 resolution can achieve 39.7 mAP when distilling from RetinaNet-X101.","object detection, low-resolution, knowledge distillation" Learning Probabilistic Topological Representations Using Discrete Morse Theory,https://openreview.net/forum?id=cXMHQD-xQas,https://openreview.net/pdf?id=cXMHQD-xQas,We use discrete Morse theory and persistent homology to construct a one-parameter family of structures as the topological/structural representation space to perform inference tasks.,"Accurate delineation of fine-scale structures is a very important yet challenging problem. Existing methods use topological information as an additional training loss, but are ultimately making pixel-wise predictions. In this paper, we propose the first deep-learning-based method to learn topological/structural representations. We use discrete Morse theory and persistent homology to construct a one-parameter family of structures as the topological/structural representation space. Furthermore, we learn a probabilistic model that can perform inference tasks in such a topological/structural representation space. Our method generates true structures rather than pixel-maps, leading to better topological integrity in automatic segmentation tasks.
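A minimal sketch of the bridge-branch idea in the LRDet entry above, assuming all three branches already produce feature maps at a common size; the actual LRDet alignment and losses are more involved, so this only illustrates the two-hop transfer structure.

```python
import torch.nn.functional as F

def cross_resolution_distill_loss(student_feat, bridge_feat, teacher_feat):
    # Two easier hops instead of one hard one: the bridge branch learns
    # from the high-resolution teacher, and the low-resolution student
    # learns from the bridge.
    loss_tb = F.mse_loss(bridge_feat, teacher_feat.detach())
    loss_bs = F.mse_loss(student_feat, bridge_feat.detach())
    return loss_tb + loss_bs
```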
It also facilitates semi-automatic interactive annotation/proofreading via the sampling of structures and structure-aware uncertainty.","Topological Representation, Discrete Morse Theory, Persistent Homology" Zero-Label Prompt Selection,https://openreview.net/forum?id=tMfuHn80HtH,https://openreview.net/pdf?id=tMfuHn80HtH,,"Natural language prompts have been shown to facilitate cross-task generalization for large language models. However, with no or limited labeled examples, the cross-task performance is highly sensitive to the choice of prompts, while selecting a high-performing prompt is challenging given the scarcity of labels. To address the issue, we propose a Zero-Label Prompt Selection (ZPS) method that selects prompts without any labeled data or gradient update. Specifically, given the candidate human-written prompts for a task, ZPS labels a set of unlabeled data with a prompt ensemble and uses the pseudo-labels for prompt selection. Experiments show that ZPS improves over prior methods by a sizeable margin in zero-label performance. We also extend ZPS to a few-shot setting and show its advantages over strong baselines such as prompt tuning and model tuning.", The Curious Case of Benign Memorization,https://openreview.net/forum?id=4C8ChYvMYBn,https://openreview.net/pdf?id=4C8ChYvMYBn,,"Despite the empirical advances of deep learning across a variety of learning tasks, our theoretical understanding of its success is still very restricted. One of the key challenges is the overparametrized nature of modern models, enabling complete overfitting of the data even if the labels are randomized, i.e. networks can completely \textit{memorize} all given patterns. While such a memorization capacity seems worrisome, in this work we show that under training protocols that include \textit{data augmentation}, neural networks learn to memorize entirely random labels in a benign way, i.e. they learn embeddings that lead to highly non-trivial performance under nearest neighbour probing. We demonstrate that deep models have the surprising ability to separate noise from signal by distributing the task of memorization and feature learning to different layers. As a result, only the very last layers are used for memorization, while preceding layers encode performant features which remain largely unaffected by the label noise. We explore the intricate role of the augmentations used for training and identify a memorization-generalization trade-off in terms of their diversity, marking a clear distinction from all previous works. Finally, we give a first explanation for the emergence of benign memorization by showing that \textit{malign} memorization under data augmentation is infeasible due to the insufficient capacity of the model for the increased sample size. As a consequence, the network is forced to leverage the correlated nature of the augmentations and as a result learns meaningful features. To complete the picture, a better theory of feature learning in deep neural networks is required to fully understand the origins of this phenomenon.", A Connection between One-Step Regularization and Critic Regularization in Reinforcement Learning,https://openreview.net/forum?id=ceOdspvoaEA,https://openreview.net/pdf?id=ceOdspvoaEA,We discuss some connections between one-step RL and critic regularization methods,"As with any machine learning problem with limited data, effective offline RL algorithms require careful regularization to avoid overfitting.
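The Zero-Label Prompt Selection entry above lends itself to a short sketch: score each candidate prompt by its agreement with ensemble pseudo-labels on unlabeled data. This is a simplified reading of ZPS, with a plain majority vote standing in for the paper's ensembling.

```python
import numpy as np

def zero_label_prompt_selection(preds):
    # preds[p][i]: label that prompt p assigns to unlabeled example i.
    preds = np.asarray(preds)
    # Ensemble pseudo-labels: per-example majority vote across prompts.
    pseudo = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, preds)
    # Each prompt is scored by its agreement with the pseudo-labels.
    scores = (preds == pseudo).mean(axis=1)
    return int(scores.argmax()), scores
```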
One-step methods perform regularization by doing just a single step of policy improvement, while critic regularization methods do many steps of policy improvement with a regularized objective. These methods appear distinct. One-step methods, such as advantage-weighted regression and conditional behavioral cloning, truncate policy iteration after just one step. This ``early stopping'' makes one-step RL simple and stable, but can limit its asymptotic performance. Critic regularization typically requires more compute but has appealing lower-bound guarantees. In this paper, we draw a close connection between these methods: applying a multi-step critic regularization method with a regularization coefficient of 1 yields the same policy as one-step RL. While practical implementations violate our assumptions and critic regularization is typically applied with smaller regularization coefficients, our experiments nevertheless show that our analysis makes accurate, testable predictions about practical offline RL methods (CQL and one-step RL) with commonly-used hyperparameters. Our results do not imply that every problem can be solved with a single step of policy improvement, but rather that one-step RL might be competitive with critic regularization on RL problems that demand strong regularization.","reinforcement learning, regularization, one-step RL, theory" Deep Class Conditional Gaussians for Continual Learning,https://openreview.net/forum?id=2L-nspTvNVC,https://openreview.net/pdf?id=2L-nspTvNVC,We present an empirical Bayesian method to solve the problem in continual learning of how to use simple metric-based probabilistic models when the embedding function must be learnt online.,"The current state of the art for continual learning with frozen, pre-trained embedding networks consists of simple probabilistic models defined over the embedding space, for example class-conditional Gaussians. However, as of yet, in the task-incremental online setting, it has been an open question how to extend these methods to the case where the embedding function has to be learned from scratch. In this paper, we propose an empirical Bayesian framework that works by storing a fixed number of examples in memory, which are used to calculate the posterior of the probabilistic model and a conditional marginal likelihood term used to fit the embedding function. The learning of the embedding function can be interpreted as using a variant of experience replay, which is a highly performant method for continual learning. As part of our framework, we decide which examples to store by selecting the subset that minimises the KL divergence between the true posterior and the posterior induced by the subset, which is shown to be necessary to achieve good performance. We demonstrate the performance of our method on a range of task-incremental online settings, including those with overlapping tasks, which thus far have been under-explored.
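For the one-step methods named in the connection entry above, advantage-weighted regression admits a compact sketch: a single policy-improvement step that reweights behavioral cloning by exponentiated advantages. The `policy.log_prob(states, actions)` interface, the clamp, and `beta` are illustrative assumptions.

```python
import torch

def awr_policy_loss(policy, states, actions, q_values, v_values, beta=1.0):
    advantage = (q_values - v_values).detach()
    # Exponentiated-advantage weights; clamped for numerical stability.
    weights = torch.exp(advantage / beta).clamp(max=100.0)
    log_prob = policy.log_prob(states, actions)
    # Weighted behavioral cloning = one step of policy improvement.
    return -(weights * log_prob).mean()
```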
Our method outperforms all others, including several replay-based methods, evidencing the potential of our approach.","Continual Learning, Lifelong Learning, Bayesian, Empirical Bayes, Probabilistic Machine Learning" AGREE: A Simple Aggregator of Detectors’ Decisions,https://openreview.net/forum?id=3lJ3pMuAwDT,https://openreview.net/pdf?id=3lJ3pMuAwDT,"We propose a simple yet effective method to aggregate the decisions based on the soft-probability outputs of multiple trained detectors, possibly provided by a third party.","A simple yet effective method to aggregate the decisions based on the soft-probability outputs of multiple trained detectors, possibly provided by a third party, is introduced. We formally derive a mathematically sound theoretical framework, which is straightforward as it does not require further training of the given detectors, and modular, allowing existing (and future) detectors to be merged into a single one. As an application, we evaluate our framework by tackling the recently proposed problem of simultaneous adversarial examples detection, i.e. when the attacks at evaluation time can be simultaneously crafted according to a variety of algorithms and objective loss functions. While each individual detector tends to underperform or fail in the aforementioned attack scenario, our framework successfully aggregates the knowledge of the available detectors to guarantee a more reliable decision. We validate our AGgregatoR of dEtectors' dEcisions (Agree) on popular datasets (e.g., CIFAR10 and SVHN) and we show that it consistently outperforms the state-of-the-art when simultaneous adversarial attacks are present at evaluation time.","AI Safety, Algorithms Evaluation, Deep Learning, Adversarial Examples" Unbiased Supervised Contrastive Learning,https://openreview.net/forum?id=Ph5cJSfD2XN,https://openreview.net/pdf?id=Ph5cJSfD2XN,"We introduce FairKL, a debiasing regularization technique along with a metric learning theoretical framework and a novel formulation of the supervised contrastive loss, ϵ-SupInfoNCE","Many datasets are biased, namely they contain easy-to-learn features that are highly correlated with the target class only in the dataset but not in the true underlying distribution of the data. For this reason, learning unbiased models from biased data has become a very relevant research topic in recent years. In this work, we tackle the problem of learning representations that are robust to biases. We first present a margin-based theoretical framework that allows us to clarify why recent contrastive losses (InfoNCE, SupCon, etc.) can fail when dealing with biased data. Based on that, we derive a novel formulation of the supervised contrastive loss ($\epsilon$-SupInfoNCE), providing more accurate control of the minimal distance between positive and negative samples. Furthermore, thanks to our theoretical framework, we also propose FairKL, a new debiasing regularization loss that works well even with extremely biased data.
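One simple instantiation of the detector aggregation described in the AGREE entry above, as a hedged sketch: fuse the detectors' soft probabilities by a convex combination and threshold the result. The paper derives its aggregation rule formally; the uniform weights and the 0.5 threshold here are assumptions.

```python
import numpy as np

def aggregate_detectors(probs, weights=None, threshold=0.5):
    # probs[k][i]: detector k's probability that input i is adversarial.
    probs = np.asarray(probs)
    if weights is None:
        weights = np.full(len(probs), 1.0 / len(probs))
    fused = weights @ probs          # convex combination of soft decisions
    return fused > threshold
```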
We validate the proposed losses on standard vision datasets including CIFAR10, CIFAR100, and ImageNet, and we assess the debiasing capability of FairKL with $\epsilon$-SupInfoNCE, reaching state-of-the-art performance on a number of biased datasets, including real instances of biases ""in the wild"".","contrastive learning, debiasing, supervised learning, representation learning, deep learning, neural networks" ReaKE: Contrastive Molecular Representation Learning with Chemical Synthetic Knowledge Graph,https://openreview.net/forum?id=2aSj08z30A1,https://openreview.net/pdf?id=2aSj08z30A1,,"Molecular representation learning has demonstrated great promise in bridging machine learning and chemical science and in supporting novel chemical discoveries. State-of-the-art methods mostly employ graph neural networks (GNNs) with self-supervised learning (SSL) and extra chemical reaction knowledge to empower the learned embeddings. However, prior works ignore three major issues in modeling reaction data, namely the abnormal energy flow, ambiguous embedding, and sparse embedding space problems. To address these problems, we propose ReaKE, a chemical synthetic knowledge graph-driven pre-training framework for molecular representation learning. We first construct a large-scale chemical synthetic knowledge graph comprising reactants, products and reaction rules. We then propose triplet-level and graph-level contrastive learning strategies to jointly optimize the knowledge graph and molecular embeddings. Representations learned by ReaKE can capture intermolecular relationships reflected in the semantic knowledge graph and molecular structures. By comparing with other state-of-the-art methods, we show that ReaKE can achieve competitive performance on the reaction prediction pretext task and that the learned representations transfer well to various downstream tasks, including reaction classification, yield prediction, and molecule property prediction. Further visualization shows that the learned representations can capture the fine-grained differences both between reactions and between molecules.", Graph MLP-Mixer,https://openreview.net/forum?id=N4k3klHNzQj,https://openreview.net/pdf?id=N4k3klHNzQj,,"Graph Neural Networks (GNNs) have shown great potential in the field of graph representation learning. Standard GNNs define a local message-passing mechanism which propagates information over the whole graph domain by stacking multiple layers. This paradigm suffers from two major limitations, over-squashing and poor long-range dependencies, that can be alleviated using global attention, but this significantly increases the computational cost to quadratic complexity. In this work, we consider an alternative approach to overcome these structural limitations while keeping a low complexity cost. Motivated by the recent MLP-Mixer architecture introduced in computer vision, we propose to generalize this network to graphs. This GNN model, namely Graph MLP-Mixer, can make long-range connections without over-squashing or high complexity, thanks to the mixer layer applied to the graph patches extracted from the original graph.
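A hedged sketch in the spirit of the $\epsilon$-SupInfoNCE loss from the unbiased-contrastive-learning entry above: each positive similarity must beat every negative similarity by a margin. The exact formulation in the paper differs in details; `eps` and `tau` here are illustrative.

```python
import torch
import torch.nn.functional as F

def eps_sup_infonce(z, labels, eps=0.1, tau=0.1):
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                       # scaled cosine similarities
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    loss, count = 0.0, 0
    for i in range(n):
        negs = sim[i][~pos[i] & ~eye[i]]
        for p in sim[i][pos[i]]:
            # Margin eps on the positive logit before the softmax.
            logits = torch.cat([(p - eps).view(1), negs])
            loss = loss - F.log_softmax(logits, dim=0)[0]
            count += 1
    return loss / max(count, 1)
```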
As a result, this architecture exhibits promising results when comparing standard GNNs with Graph MLP-Mixers on benchmark graph datasets.", Multivariate Gaussian Representation of Previous Tasks for Continual Learning,https://openreview.net/forum?id=uw9hk67tDEE,https://openreview.net/pdf?id=uw9hk67tDEE,we propose a method for storing and reproducing distributed representations of data for each class in memory,"Continual learning has recently become increasingly important with the development of deep learning technology. Memory-based rehearsal is one of the dominant methods: it samples data from a previous task, stores them in memory, and retrains them with the current task. However, since all the data cannot be stored within a fixed memory capacity, knowledge of previous data can be lost. In this paper, we propose a method for storing and reproducing distributed representations of data for each class in memory. Data representations are categorized by class and converted into a multivariate Gaussian distribution, which is stored in memory in the form of means and variances. A generative algorithm regenerates the model of previous tasks to restore the data representation for the current task. In the inference process, local adaptation adjusts the model to the distributed representation of data that changes as the number of tasks increases. Experiments with CIFAR10, CIFAR100, and tiny-ImageNet show performance improvements of 2.2%p, 5.01%p, and 3.44%p, respectively, compared to the state-of-the-art method of memory replay, confirming the effectiveness of the proposed method in data representation for memory replay.","Continual Learning, Memory Replay, Sample Generation, Multivariate Gaussian Distribution, Expectation-Maximization, Local Adaptation" Learning to Register Unbalanced Point Pairs,https://openreview.net/forum?id=ocyru3h_WIi,https://openreview.net/pdf?id=ocyru3h_WIi,We propose a new point cloud registration approach that can handle unbalanced point pairs where one point cloud is significantly larger than the other in terms of number of points and size.,"Point cloud registration methods can effectively handle large-scale, partially overlapping point cloud pairs. Despite their practicality, matching unbalanced pairs in terms of spatial extent and density has been overlooked and rarely studied. We present a novel method, dubbed UPPNet, for Unbalanced Point cloud Pair registration. We propose to incorporate a hierarchical framework that effectively finds inlier correspondences by gradually reducing the search space. The proposed method first predicts subregions within the target point cloud that are likely to overlap with the query. Then, the following super-point matching and fine-grained refinement modules predict accurate inlier correspondences between the target and query. Additional geometric constraints are applied to refine the correspondences that satisfy spatial compatibility. The proposed network can be trained in an end-to-end manner, predicting the accurate rigid transformation with a single forward pass. To validate the efficacy of the proposed method, we create a carefully designed benchmark, named the KITTI-UPP dataset, by augmenting the KITTI odometry dataset. Extensive experiments reveal that the proposed method not only outperforms state-of-the-art point cloud registration methods by large margins on the KITTI-UPP benchmark, but also achieves competitive results on standard pairwise registration benchmarks including 3DMatch, 3DLoMatch, ScanNet, and KITTI, thus showing the applicability of our method on various datasets.
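A minimal NumPy sketch of the storage scheme in the continual-learning entry above: per class, keep only the mean and covariance of feature representations and sample them back for replay. The interface is illustrative; the paper additionally uses EM and local adaptation.

```python
import numpy as np

class GaussianClassMemory:
    def __init__(self):
        self.stats = {}

    def store(self, feats, label):
        # feats: (n_samples, dim) representations of one class.
        self.stats[label] = (feats.mean(axis=0), np.cov(feats, rowvar=False))

    def replay(self, label, n):
        # Regenerate pseudo-representations of a previous task's class.
        mu, cov = self.stats[label]
        return np.random.multivariate_normal(mu, cov, size=n)
```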
The source code and dataset will be publicly released.",Point Cloud Registration On Feature Diversity in Energy-based Models,https://openreview.net/forum?id=5ZSBBhiAapV,https://openreview.net/pdf?id=5ZSBBhiAapV,We derive generalization bounds for energy-based models showing that reducing the redundancy of the features can lead to better generalization,"Energy-based learning is a powerful learning paradigm that encapsulates various discriminative and generative approaches. An energy-based model (EBM) is typically formed of inner-model(s) that learn a combination of the different features to generate an energy mapping for each input configuration. In this paper, we focus on the diversity of the produced feature set. We extend the probably approximately correct (PAC) theory of EBMs and analyze the effect of redundancy reduction on the performance of EBMs. We derive generalization bounds for various learning contexts, i.e., regression, classification, and implicit regression, with different energy functions, and we show that indeed reducing redundancy of the feature set can consistently decrease the gap between the true and empirical expectation of the energy and boost the performance of the model.","energy-based models, redundancy reduction, feature diversity" Physics Model-based Autoencoding for Magnetic Resonance Fingerprinting,https://openreview.net/forum?id=nnrdKoTNDV6,https://openreview.net/pdf?id=nnrdKoTNDV6,"For Magnetic Resonance Fingerprinting (MRF), we propose a physics-based auto-encoder framework where a fast and differentiable MRI physics model guides the encoder to learn generalizable representations.","Magnetic Resonance Fingerprinting (MRF) is a promising paradigm to achieve fast quantitative Magnetic Resonance Imaging (QMRI). However, current MRF methods suffer from slow imaging speeds and poor generalization performance on radio frequency pulse sequences generated with varied settings. To address this challenging task, we propose a novel model-based MRF method that learns better representations by integrating a fast and differentiable MRI physics model as causal regularization. The proposed approach adopts a supervised auto-encoder framework consisting of an encoder and a decoder, where the encoder predicts the target tissue properties (anti-causal task) and the decoder reconstructs the inputs (causal task). Specifically, the encoder embeds high-dimensional MRF time sequences into a low-dimensional tissue property space, while the decoder exploits an MRI physics model to reconstruct the input signals using the estimated tissue properties and associated MRI settings. The causal regularization induced by the decoder improves the generalization performance and uniform stability of the approach, leading to the best performance on tissue property estimation, outperforming state-of-the-art competing methods.","Magnetic Resonance Fingerprinting (MRF), Physics based representation learning, Medical imaging, Anti-causal representation learning, Quantitative Magnetic Resonance Imaging (QMRI)" Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection,https://openreview.net/forum?id=mE91GkXYipg,https://openreview.net/pdf?id=mE91GkXYipg,,"Prompt tuning with large-scale pretrained vision-language models empowers open-vocabulary prediction trained on limited base categories, e.g., object classification and detection.
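A hedged sketch of the physics-based auto-encoder in the MRF entry above: the encoder predicts tissue properties, and a differentiable MRI physics model acts as the decoder that reconstructs the input signal. The `physics_model` interface and the 0.1 weighting are assumptions.

```python
import torch.nn.functional as F

def physics_autoencoder_loss(encoder, physics_model, signal, props_gt, settings):
    props = encoder(signal)                 # anti-causal task: signal -> properties
    recon = physics_model(props, settings)  # causal task: properties -> signal
    return F.mse_loss(props, props_gt) + 0.1 * F.mse_loss(recon, signal)
```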
In this paper, we propose compositional prompt tuning with motion cues: an extended prompt tuning paradigm for compositional predictions on video data. In particular, we present Relation Prompt (RePro) for Open-vocabulary Video Visual Relation Detection (Open-VidVRD), where conventional prompt tuning is easily biased to certain subject-object combinations and motion patterns. To this end, RePro addresses the two technical challenges of Open-VidVRD: 1) the prompt tokens should respect the two different semantic roles of subject and object, and 2) the tuning should account for the diverse spatiotemporal motion patterns of the subject-object compositions. Our RePro achieves a new state-of-the-art performance on two VidVRD benchmarks in terms of not only the base training object and predicate categories, but also the unseen ones. Extensive ablations also demonstrate the effectiveness of the proposed compositional and multi-mode design of prompts.","Prompt Tuning, Video Relation Detection" Multi-objective optimization via equivariant deep hypervolume approximation,https://openreview.net/forum?id=fSa5IjNMmmi,https://openreview.net/pdf?id=fSa5IjNMmmi,"Hypervolume approximation using permutation invariant, scaling equivariant neural network","Optimizing multiple competing objectives is a common problem across science and industry. The inherent inextricable trade-off between those objectives leads one to the task of exploring their Pareto front. A meaningful quantity for the purpose of the latter is the hypervolume indicator, which is used in Bayesian Optimization (BO) and Evolutionary Algorithms (EAs). However, the computational complexity for the calculation of the hypervolume scales unfavorably with an increasing number of objectives and data points, which restricts its use in those common multi-objective optimization frameworks. To overcome these restrictions, we propose to approximate the hypervolume function with a deep neural network, which we call DeepHV. For better sample efficiency and generalization, we exploit the fact that the hypervolume is scale-equivariant in each of the objectives as well as permutation invariant w.r.t.\ both the objectives and the samples, by using a deep neural network that is equivariant w.r.t.\ the combined group of scalings and permutations. We evaluate our method against exact and approximate hypervolume methods in terms of accuracy, computation time, and generalization. We also apply and compare our methods to state-of-the-art multi-objective BO methods and EAs on a range of synthetic benchmark test cases. The results show that our methods are promising for such multi-objective optimization tasks.","Multi-objective optimization, Hypervolume approximation, Geometric deep learning, Bayesian optimization, Evolutionary algorithms" Conditional Execution Of Cascaded Models Improves The Accuracy-Efficiency Trade-Off,https://openreview.net/forum?id=KPyHNVpear1,https://openreview.net/pdf?id=KPyHNVpear1,We show how to combine pairs of pretrained models to improve the entire ImageNet accuracy-compute Pareto front.,"The compute effort required to perform inference on state-of-the-art deep learning models is ever growing. Practical applications are commonly limited to a certain cost per inference. Cascades of pretrained models with conditional execution address these requirements based on the intuition that some inputs are easy enough that they can be processed correctly by a small model, allowing for an early exit.
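For the DeepHV entry above, here is a short exact computation of the 2-D hypervolume indicator, the quantity the network learns to approximate in higher dimensions, assuming maximization and a reference point dominated by all points. Note how scaling one objective scales the hypervolume accordingly, which is the equivariance the paper exploits.

```python
def hypervolume_2d(points, ref):
    # points: list of (x, y) objective vectors (maximization); ref: reference point.
    pts = sorted(points, key=lambda p: p[0], reverse=True)
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:                 # skip dominated points
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

# hypervolume_2d([(3, 1), (2, 2), (1, 3)], ref=(0, 0)) == 6.0
```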
If the small model is not sufficiently confident in its prediction, the input is passed on to a larger model. The choice of the confidence threshold allows trading off compute effort against accuracy. In this work, we explore the effective design of model cascades, and thoroughly evaluate the impact on the accuracy-compute trade-off. We find that cascades not only interpolate favorably between pretrained models, but that their trade-off curve commonly outperforms single models. This allows us to redefine most of the ImageNet Pareto front already with 2-model cascades, achieving an average reduction in compute effort at equal accuracy of almost 3.1x above 86% and more than 1.9x between 80% and 86% top-1 accuracy. We confirm the wide applicability and effectiveness of the method on the GLUE benchmark. We release the code to reproduce our experiments in the supplementary material and use only publicly available models and datasets.","inference, efficiency, cascades, pretrained" Adversarial Text to Continuous Image Generation,https://openreview.net/forum?id=9X3UZJSGIg9,https://openreview.net/pdf?id=9X3UZJSGIg9,"Receiving text input, hypernetworks generate weights for an INR-GAN to synthesize images.","Implicit Neural Representations (INR) provide a natural way to parametrize images as a continuous signal, using an MLP that predicts the RGB color at an (x, y) image location. Recently, it has been demonstrated that high-quality INR-decoders can be designed and integrated with Generative Adversarial Networks (GANs) to facilitate unconditional continuous image generation that is no longer bound to a specific spatial resolution. In this paper, we introduce HyperCGAN, a conceptually simple approach for Adversarial Text to Continuous Image Generation based on HyperNetworks, which are networks that produce parameters for another network. HyperCGAN utilizes HyperNetworks to condition an INR-based GAN model on text. In this setting, the generator and the discriminator weights are controlled by their corresponding HyperNetworks, which modulate weight parameters using the provided text query. We propose an effective Word-level hyper-modulation Attention operator, termed WhAtt, which encourages grounding words to independent pixels at input (x, y) coordinates. To the best of our knowledge, our work is the first that explores text-controllable continuous image generation. We conduct comprehensive experiments on the COCO 256x256, CUB 256x256, and ArtEmis 256x256 benchmarks, the last of which we introduce in this paper. HyperCGAN improves the performance of text-controllable image generators over the baselines while significantly reducing the gap between text-to-continuous and text-to-discrete image synthesis. Additionally, we show that HyperCGAN, when conditioned on text, retains the desired properties of continuous generative models (e.g., extrapolation outside of image boundaries, accelerated inference of low-resolution images, out-of-the-box super-resolution).","gan, generative modelling, text-to-image, text2image, hypernetworks" Fine-grained Few-shot Recognition by Deep Object Parsing,https://openreview.net/forum?id=rpVxn-rX2Wh,https://openreview.net/pdf?id=rpVxn-rX2Wh,"A method for fine-grained few-shot recognition that relies on representing objects as a collection of parts, where each part is identified by a location and a set of active templates.","We propose a new method for fine-grained few-shot recognition via deep object parsing.
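A minimal PyTorch sketch of the two-model cascade with conditional execution described above; the max-softmax confidence rule and the threshold value are the obvious instantiation, assumed rather than taken verbatim from the paper.

```python
import torch

@torch.no_grad()
def cascade_predict(small, large, x, threshold=0.8):
    probs = small(x).softmax(dim=-1)
    conf, pred = probs.max(dim=-1)
    hard = conf < threshold                      # inputs the small model defers
    if hard.any():
        pred[hard] = large(x[hard]).argmax(dim=-1)
    return pred
# Sweeping `threshold` from 0 to 1 traces the accuracy-compute trade-off curve.
```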
In our framework, an object is made up of $K$ distinct parts and for each part, we learn a dictionary of templates, which is shared across all instances and categories. An object is parsed by estimating the locations of these $K$ parts and a set of active templates that can reconstruct the part features. We recognize a test instance by comparing its active templates and the relative geometry of its part locations against those of the presented few-shot instances. Our method is end-to-end trainable to learn part templates on top of a convolutional backbone. To combat visual distortions such as orientation, pose and size, we learn templates at multiple scales, and at test time parse and match instances across these scales. We show that our method is competitive with the state-of-the-art, and by virtue of parsing enjoys interpretability as well.","Few-shot learning, Representation learning" Can Wikipedia Help Offline Reinforcement Learning?,https://openreview.net/forum?id=eHrqmewX1B-,https://openreview.net/pdf?id=eHrqmewX1B-,,"Fine-tuning reinforcement learning (RL) models has been challenging because of a lack of large-scale off-the-shelf datasets as well as high variance in transferability among different environments. Recent work has looked at tackling offline RL from the perspective of sequence modeling, with improved results as a result of the introduction of the Transformer architecture. However, when the model is trained from scratch, it suffers from slow convergence speeds. In this paper, we look to take advantage of this formulation of reinforcement learning as sequence modeling and investigate the transferability of sequence models pre-trained on other domains (vision, language) when finetuned on offline RL tasks (control, games). To this end, we also propose techniques to improve transfer between these domains. Results show consistent performance gains in terms of both convergence speed and reward on a variety of environments, accelerating training by 3-6x and achieving state-of-the-art performance in a variety of tasks using Wikipedia-pretrained and GPT2 language models. We hope that this work not only brings light to the potential of leveraging generic sequence modeling techniques and pre-trained models for RL, but also inspires future work on sharing knowledge between generative modeling tasks of completely different domains.","offline rl, language models, transfer learning" NGswin: N-Gram Swin Transformer for Efficient Single Image Super-Resolution,https://openreview.net/forum?id=QlK8nHnY8zc,https://openreview.net/pdf?id=QlK8nHnY8zc,"In our efficient NGswin, N-Gram context is proposed for deep learning in single image super-resolution for the first time.","In single image super-resolution (SISR), many deep learning-based methods suffer from intensive computational operations. In addition, while Swin Transformer-based methods such as SwinIR established state-of-the-art results, they still suffer from the problem of ignoring broad regions when computing window self-attention (WSA) to reconstruct high-frequency information. In this paper, we propose the efficient NGswin network, which is the first attempt to introduce N-Gram into deep learning on images. For text analysis, N-Gram is a sequence of consecutive characters or words, but in an image, we define N-Gram as neighboring local windows (in the WSA of Swin Transformer) which interact with each other by sliding-WSA.
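A hedged sketch of the transfer recipe in the Wikipedia-for-offline-RL entry above: keep a language-pretrained GPT-2 as the sequence backbone and learn small linear adapters for trajectory tokens, in the style of decision transformers. Combining state and action embeddings by addition, and all sizes, are simplifying assumptions.

```python
import torch.nn as nn
from transformers import GPT2Model

class WikiRLBackbone(nn.Module):
    def __init__(self, state_dim, act_dim, hidden=768):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained("gpt2")  # language pre-training
        self.embed_state = nn.Linear(state_dim, hidden)
        self.embed_action = nn.Linear(act_dim, hidden)
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, states, actions):
        # (batch, T, hidden) trajectory tokens fed through the LM backbone.
        tokens = self.embed_state(states) + self.embed_action(actions)
        h = self.gpt2(inputs_embeds=tokens).last_hidden_state
        return self.head(h)                            # next-action predictions
```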
We propose the N-Gram interaction, an SCDP bottleneck, and a pooling-cascading mechanism, which enable the network to consider broad regions beneficial to recovering degraded neighboring pixels. Moreover, we employ a hierarchical encoder with patch-merging, uni-Gram embedding, and a compact decoder in NGswin to enhance network efficiency. Experimental results show that the proposed model achieves competitive performance in terms of PSNR and SSIM scores with fewer operations (Mult-Adds) compared to other methods.","N-Gram, Single Image Super-Resolution, Swin Transformer, Efficiency, Deep Learning" Lightweight Equivariant Graph Representation Learning for Protein Engineering,https://openreview.net/forum?id=IWoHx6bY4Zm,https://openreview.net/pdf?id=IWoHx6bY4Zm,We design a lightweight pre-training model for multi-task protein representation learning from its 3D structure and sequence.,"This work tackles directed evolution in computational protein design, i.e., making accurate predictions of the function of a protein mutant. We design a lightweight pre-training graph neural network model for multi-task protein representation learning from its 3D structure. Rather than reconstructing and optimizing the protein structure, the trained model recovers the amino acid types and key properties of the central residues from a given noisy three-dimensional local environment. On the prediction task for higher-order mutants, where many amino acid sites of the protein are mutated, the proposed training strategy achieves a remarkable 20% improvement while requiring less than 1% of the computational resources needed by popular transformer-based state-of-the-art deep learning models for protein design.",graph neural networks DiffusER: Diffusion via Edit-based Reconstruction,https://openreview.net/forum?id=nG9RF9z1yy3,https://openreview.net/pdf?id=nG9RF9z1yy3,"We propose a generally applicable text generation model which takes inspiration from diffusion models and parameterises generation steps as text editing steps, without compromising performance and while adding flexibility.","In text generation, models that generate text from scratch one token at a time are currently the dominant paradigm. Despite being performant, these models lack the ability to revise existing text, which limits their usability in many practical scenarios. We look to address this with DiffusER (Diffusion via Edit-based Reconstruction), a new edit-based generative model for text based on denoising diffusion models -- a class of models that use a Markov chain of denoising steps to incrementally generate data. DiffusER is not only a strong generative model in general, rivalling autoregressive models on several tasks spanning machine translation, summarization, and style transfer; it can also perform other varieties of generation that standard autoregressive models are not well-suited for.
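A short sketch of the pre-training task in the protein entry above: perturb the 3-D coordinates of a residue's local environment and train the GNN to recover the central residue's amino-acid type. The `local_env` container with `.features`/`.coords` attributes and the noise scale are hypothetical.

```python
import torch
import torch.nn.functional as F

def pretrain_step(gnn, local_env, center_aa, noise=0.1):
    # Corrupt the local 3-D environment, then denoise/classify.
    noisy_coords = local_env.coords + noise * torch.randn_like(local_env.coords)
    logits = gnn(local_env.features, noisy_coords)  # predict amino-acid type
    return F.cross_entropy(logits, center_aa)
```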
For instance, we demonstrate that DiffusER makes it possible for a user to condition generation on a prototype, or an incomplete sequence, and continue revising based on previous edit steps.","text generation, editing, denoising autoencoders" Modeling Temporal Data as Continuous Functions with Process Diffusion,https://openreview.net/forum?id=1TxMUE7cF6_,https://openreview.net/pdf?id=1TxMUE7cF6_,We modify the diffusion framework to model continuous functions and apply the learned generative model on different time series tasks.,"Temporal data like time series are often observed at irregular intervals, which is a challenging setting for existing machine learning methods. To tackle this problem, we view such data as samples from some underlying continuous function. We then define a diffusion-based generative model that adds noise from a predefined stochastic process while preserving the continuity of the resulting underlying function. A neural network is trained to reverse this process, which allows us to sample new realizations from the learned distribution. We define suitable stochastic processes as noise sources and introduce novel denoising and score-matching models on processes. Further, we show how to apply this approach to the multivariate probabilistic forecasting and imputation tasks. Through our extensive experiments, we demonstrate that our method outperforms previous models on synthetic and real-world datasets.","time series, stochastic process, diffusion, probabilistic forecasting, score-based matching" KeyCLD: Learning Constrained Lagrangian Dynamics in Keypoint Coordinates from Images,https://openreview.net/forum?id=Ll21hhIJ_oG,https://openreview.net/pdf?id=Ll21hhIJ_oG,"We learn unsupervised keypoint representations as state, jointly with constrained Lagrangian dynamics, based on videos of dynamical systems.","We present KeyCLD, a framework to learn Lagrangian dynamics from images. Learned keypoint representations derived from images are directly used as the positional state vector for jointly learning constrained Lagrangian dynamics. KeyCLD is trained unsupervised end-to-end on sequences of images. Our method explicitly models the mass matrix, potential energy and the input matrix, thus allowing energy-based control. We demonstrate learning of Lagrangian dynamics from images on the dm_control pendulum, cartpole and acrobot environments, whether they are unactuated, underactuated or fully actuated. Trained models are able to produce long-term video predictions, showing that the dynamics are accurately learned. Our method strongly outperforms recent works on learning Lagrangian or Hamiltonian dynamics from images. The benefits of including a Lagrangian prior and prior knowledge of a constraint function are further investigated and empirically evaluated.","Lagrangian, dynamics, keypoints, images, unsupervised" DynaMS: Dynamic Margin Selection for Efficient Deep Learning,https://openreview.net/forum?id=7oPAgqxNb20,https://openreview.net/pdf?id=7oPAgqxNb20,A general dynamic data selection framework for efficient deep neural network training that enjoys both theoretical and practical advantages.,"The great success of deep learning is largely driven by training over-parameterized models on massive datasets. To avoid excessive computation, extracting and training only on the most informative subset is drawing increasing attention. Nevertheless, it is still an open question how to select such a subset on which the trained model generalizes on par with the full data.
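For the process-diffusion entry above, a small sketch of the kind of noise source involved: instead of i.i.d. noise per observation, sample from a Gaussian process over the (possibly irregular) observation times, so the corrupted series remains a continuous function. The RBF kernel and its hyperparameters are illustrative.

```python
import numpy as np

def gp_noise(t, scale=1.0, length=0.1):
    # t: 1-D array of observation times (may be irregularly spaced).
    t = np.asarray(t)[:, None]
    K = scale**2 * np.exp(-0.5 * ((t - t.T) / length) ** 2)  # RBF kernel
    K += 1e-6 * np.eye(len(t))                               # jitter
    return np.random.multivariate_normal(np.zeros(len(t)), K)
```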
In this paper, we propose dynamic margin selection (DynaMS). DynaMS leverages the distance from candidate samples to the classification boundary to construct the subset, and the subset is dynamically updated during model training. We show that DynaMS converges with high probability, and for the first time show, both in theory and in practice, that dynamically updating the subset can result in better generalization than previous works. To reduce the additional computation incurred by the selection, a lightweight parameter-sharing proxy~(PSP) is designed. PSP is able to faithfully evaluate instances with respect to the current model, which is necessary for dynamic selection. Extensive analysis and experiments demonstrate the superiority of the proposed approach in data selection against many state-of-the-art counterparts on benchmark datasets.","efficient training, data selection" TANGOS: Regularizing Tabular Neural Networks through Gradient Orthogonalization and Specialization,https://openreview.net/forum?id=n6H86gW8u0d,https://openreview.net/pdf?id=n6H86gW8u0d,"We introduce TANGOS, a regularization method that orthogonalizes the gradient attribution of neurons to improve the generalization of deep neural networks on tabular data.","Despite their success with unstructured data, deep neural networks are not yet a panacea for structured tabular data. In the tabular domain, their efficiency crucially relies on various forms of regularization to prevent overfitting and provide strong generalization performance. Existing regularization techniques include broad modelling decisions such as choice of architecture, loss functions, and optimization methods. In this work, we introduce Tabular Neural Gradient Orthogonalization and Specialization (TANGOS), a novel framework for regularization in the tabular setting built on latent unit attributions. The gradient attribution of an activation with respect to a given input feature suggests how the neuron attends to that feature, and is often employed to interpret the predictions of deep networks. In TANGOS, we take a different approach and incorporate neuron attributions directly into training to encourage orthogonalization and specialization of latent attributions in a fully-connected network. Our regularizer encourages neurons to focus on sparse, non-overlapping input features and results in a set of diverse and specialized latent units. In the tabular domain, we demonstrate that our approach can lead to improved out-of-sample generalization performance, outperforming other popular regularization methods. We provide insight into why our regularizer is effective and demonstrate that TANGOS can be applied jointly with existing methods to achieve even greater generalization performance.","Deep Learning, Tabular Data, Regularization" How does Uncertainty-aware Sample-selection Help Decision against Action Noise?,https://openreview.net/forum?id=x01dnxUEDRv,https://openreview.net/pdf?id=x01dnxUEDRv,"To learn a robust imitation learning policy against action noise, this work proposes a novel paradigm called USN, which bridges Uncertainty-aware Sample-selection with Negative learning.","Learning from imperfect demonstrations has become a vital problem in imitation learning (IL). Since the assumption that the collected demonstrations are optimal cannot always hold in real-world tasks, many previous works consider learning from a mixture of optimal and sub-optimal demonstrations. On the other hand, video records can be hands-down demonstrations in practice.
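A hedged PyTorch sketch of a TANGOS-style regularizer from the entry above: compute each latent unit's gradient attribution w.r.t. the input features, then penalize attribution magnitude (specialization) and pairwise attribution overlap (orthogonalization). The weights and the exact attribution and penalty forms used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def tangos_penalty(encoder, x, lam_spec=1.0, lam_orth=1.0):
    x = x.requires_grad_(True)
    h = encoder(x)                                   # (batch, n_units)
    attrs = []
    for j in range(h.size(1)):
        g, = torch.autograd.grad(h[:, j].sum(), x, create_graph=True)
        attrs.append(g)                              # (batch, n_features)
    A = torch.stack(attrs, dim=1)                    # (batch, units, features)
    spec = A.abs().mean()                            # sparse attributions
    A_n = F.normalize(A, dim=2)
    cos = torch.einsum('buf,bvf->buv', A_n, A_n)     # pairwise cosine sims
    off = cos - torch.diag_embed(torch.diagonal(cos, dim1=1, dim2=2))
    orth = off.abs().mean()                          # non-overlapping units
    return lam_spec * spec + lam_orth * orth
```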
Leveraging such demonstrations requires annotators to provide an action for each frame. However, action noise often occurs when the annotators are not domain experts or encounter confusing state frames. Previous IL methods can be vulnerable to such demonstrations with state-dependent action noise. To tackle this problem, we propose a robust learning paradigm called USN, which bridges Uncertainty-aware Sample-selection with Negative learning. First, the IL model feeds forward all demonstration data and estimates its predictive uncertainty. Then, we select large-loss samples in the light of the uncertainty measures. Next, we update the model parameters with additional negative learning on the selected samples. Empirical results on Box2D tasks and Atari games demonstrate that USN improves the performance of state-of-the-art IL methods by more than 10% under a large portion of action noise.","Imitation Learning, State-Dependent Action Noise, Uncertainty-Aware, Negative Learning" SPC-Net: A New Scalable Point Cloud Compression Framework for Both Machine and Human Vision Tasks,https://openreview.net/forum?id=Y3wBqgTbxyw,https://openreview.net/pdf?id=Y3wBqgTbxyw,,"Recently, point cloud processing and analysis have attracted increasing attention in various machine vision tasks, and several point cloud compression algorithms have been developed. However, such compression algorithms are developed for human vision, while most point cloud data will be used for automated point cloud analysis (e.g., detection of abnormal events and early warning in autonomous driving) and may not be seen by humans. To this end, we design a new scalable point cloud compression framework (SPC-Net) for both machine and human vision tasks, in which a scalable bit-stream is used to describe the point cloud for both machine vision and human vision tasks. For machine vision tasks, only part of the bit-stream is transmitted for bit-rate saving, while the full bit-stream is transmitted when used for the human vision task. Additionally, we propose a new octree depth-level predictor to automatically predict the optimal depth level in order to control the bit-rate cost for the machine vision tasks. As a result, for simple objects/scenarios, we use fewer depth levels with fewer bits for the machine vision tasks, while for complex objects/scenarios, we prefer deeper octree levels with more bits. Experimental results on different datasets (e.g., ModelNet10, ModelNet40, ShapeNet and ScanNet) demonstrate that our proposed scalable point cloud compression framework SPC-Net achieves better performance on the machine vision tasks (e.g., classification, segmentation and detection) without degrading the performance of the human vision task.","point cloud, compression, scalable coding" Inversely Eliciting Numerical Reasoning in Language Models via Solving Linear Systems,https://openreview.net/forum?id=ukea-WPOL4Dw,https://openreview.net/pdf?id=ukea-WPOL4Dw,"leverage simple numbers as ""anchors"" to probe the implicitly inferred arithmetic relationship from language models and then explicitly apply the relationship on complex numbers","Numerical reasoning over natural language has been a long-standing goal for the research community. However, recent language models have proven difficult to reliably generalize to a broad range of numbers, although they have shown proficiency in reasoning over common and simple numbers.
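A minimal sketch of the USN selection-plus-negative-learning step from the entry above: rank samples by loss as an uncertainty proxy, then push probability away from the given (possibly noisy) actions on the selected ones. The selection ratio and the exact uncertainty measure are assumptions.

```python
import torch
import torch.nn.functional as F

def usn_negative_learning_loss(logits, labels, select_ratio=0.2):
    ce = F.cross_entropy(logits, labels, reduction='none')
    k = max(1, int(select_ratio * logits.size(0)))
    idx = ce.topk(k).indices                  # large-loss (uncertain) samples
    p = logits[idx].softmax(dim=-1)
    p_y = p.gather(1, labels[idx].unsqueeze(1)).squeeze(1)
    # Negative learning: -log(1 - p_y) discourages the suspect action.
    return -torch.log1p(-p_y.clamp(max=1 - 1e-6)).mean()
```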
In this paper, we propose a novel method to elicit and exploit the numerical reasoning knowledge hidden in pre-trained language models using simple anchor numbers. Concretely, we first leverage simple numbers as anchors to probe the implicitly inferred arithmetic expressions from language models, and then explicitly apply the expressions to complex numbers to get the corresponding answers. To inversely elicit arithmetic expressions, we transform and formulate the task as an analytically solvable linear system. Experimental results on several numerical reasoning benchmarks demonstrate that our approach is highly effective. More importantly, our approach works in the inference phase without extra model training, making it highly portable and achieving significant and consistent performance benefits across a variety of language models in zero-shot, few-shot, and fine-tuning scenarios.","numerical reasoning, language model" Model-based Causal Bayesian Optimization,https://openreview.net/forum?id=Vk-34OQ7rFo,https://openreview.net/pdf?id=Vk-34OQ7rFo,A principled algorithm for causal bayesian optimization.,"How should we intervene on an unknown structural equation model to maximize a downstream variable of interest? This setting, also known as causal Bayesian optimization (CBO), has important applications in medicine, ecology, and manufacturing. Standard Bayesian optimization algorithms fail to effectively leverage the underlying causal structure. Existing CBO approaches assume noiseless measurements and do not come with guarantees. We propose the {\em model-based causal Bayesian optimization algorithm (MCBO)} that learns a full system model instead of only modeling intervention-reward pairs. MCBO propagates epistemic uncertainty about the causal mechanisms through the graph and trades off exploration and exploitation via the optimism principle. We bound its cumulative regret, and obtain the first non-asymptotic bounds for CBO. Unlike in standard Bayesian optimization, our acquisition function cannot be evaluated in closed form, so we show how the reparameterization trick can be used to apply gradient-based optimizers. The resulting practical implementation of MCBO compares favorably with state-of-the-art approaches empirically.","causal inference, bayesian optimization" Targeted Attacks on Timeseries Forecasting,https://openreview.net/forum?id=4vYWYGd13cZ,https://openreview.net/pdf?id=4vYWYGd13cZ,,"Real-world deep learning models built for Time Series Forecasting are used in several critical applications from medical devices to the security domain. Many previous works have shown how deep learning models are prone to adversarial attacks and studied their vulnerabilities. However, the vulnerabilities of time series models for forecasting due to adversarial inputs are not extensively studied. While an attack on a forecasting model might be intended to deteriorate the model's performance overall, it is more effective if the attack focuses on a specific impact on the model's output. In this paper, we propose a novel formulation of Directional, Amplitudinal, and Temporal targeted adversarial attacks on time series forecasting models. These targeted attacks create a specific impact on, or hide a potential high-impact area in, the forecasting output. We use existing adversarial attack techniques from the computer vision domain and adapt them for time series. Additionally, we propose a modified version of the Auto Projected Gradient Descent attack for targeted attacks.
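A toy sketch of the anchor idea in the numerical-reasoning entry above: substitute simple numbers into the question, read off the model's answers, and solve a linear system for the implied expression $y = \mathbf{w} \cdot \mathbf{x} + b$, which is then applied to the original complex numbers. `query_lm(question, nums)` is a hypothetical helper returning the model's numeric answer; the paper's formulation is more general.

```python
import numpy as np

def elicit_linear_expression(query_lm, question, n_vars=2):
    # Probe with simple anchor inputs: scaled unit vectors plus the zero vector.
    anchors = [np.eye(n_vars)[i] * 10 for i in range(n_vars)] + [np.zeros(n_vars)]
    A = np.array([np.append(a, 1.0) for a in anchors])   # rows [x, 1]
    y = np.array([query_lm(question, a) for a in anchors])
    coef = np.linalg.solve(A, y)                         # weights w and bias b
    # Apply the recovered expression to arbitrary (complex) numbers.
    return lambda nums: float(np.dot(coef[:-1], nums) + coef[-1])
```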
We compare the impact of the proposed targeted attacks with that of untargeted attacks and use KS tests to statistically validate the impact of the attacks. Our experimental results demonstrate that targeted attacks on time series models are practical and more powerful in terms of statistical similarity, and hence harder to detect through statistical methods. We believe that this work opens a new paradigm in the time series forecasting domain and is an important consideration for developing better defenses.","timeseries, forecasting, targeted attacks, adversarial ml, ai security, apgd" QuAFL: Federated Averaging Made Asynchronous and Communication-Efficient,https://openreview.net/forum?id=Rb3RN0maB7F,https://openreview.net/pdf?id=Rb3RN0maB7F,,"Federated Learning (FL) is an emerging paradigm to enable the large-scale distributed training of machine learning models, while still providing privacy guarantees. In this work, we address two of the main practical challenges when scaling federated optimization to large node counts: the need for tight synchronization between the central authority and individual computing nodes, and the large communication cost of transmissions between the central server and clients. Specifically, we present a new variant of the classic federated averaging (FedAvg) algorithm, which supports both asynchronous communication and communication compression. We provide a new analysis technique showing that, in spite of these system relaxations, our algorithm can provide similar convergence to FedAvg in some parameter regimes. On the experimental side, we show that our algorithm ensures fast convergence for standard federated tasks.","Federated Learning, Distributed optimization, Load Balancing" Random Matrix Analysis to Balance between Supervised and Unsupervised Learning under the Low Density Separation Assumption,https://openreview.net/forum?id=4CQ9os3s4h3,https://openreview.net/pdf?id=4CQ9os3s4h3,We introduce a semi-supervised learning algorithm based on the low density assumption and propose a theoretical analysis of the latter.,"We propose a theoretical framework to analyze semi-supervised classification under the low density separation assumption in a high-dimensional regime. In particular, we introduce QLDS, a linear classification model, where the low density separation assumption is implemented via quadratic margin maximization. The algorithm has an explicit solution with rich theoretical properties, and we show that particular cases of our algorithm are the least-square support vector machine in the supervised case, the spectral clustering in the fully unsupervised regime, and a class of semi-supervised graph-based approaches. As such, QLDS establishes a smooth bridge between these supervised and unsupervised learning methods. Using recent advances in random matrix theory, we formally derive a theoretical evaluation of the classification error in the asymptotic regime. As an application, we derive a hyperparameter selection policy that finds the best balance between the supervised and the unsupervised terms of our learning criterion.
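Returning to the targeted-forecasting-attack entry above, a minimal targeted PGD sketch: gradient steps that pull the forecast toward a chosen target trajectory (e.g., shifted in direction, amplitude, or time) under an $L_\infty$ budget. This is plain targeted PGD, not the paper's modified Auto-PGD; hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def targeted_pgd(model, x, target, eps=0.1, alpha=0.01, steps=40):
    x0 = x.detach()
    x_adv = x0.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.mse_loss(model(x_adv), target)  # distance to target forecast
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() - alpha * grad.sign()   # move toward the target
        x_adv = x0 + (x_adv - x0).clamp(-eps, eps)     # L-inf projection
    return x_adv.detach()
```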
Finally, we provide extensive illustrations of our framework, as well as an experimental study on several benchmarks, to demonstrate that QLDS, while being computationally more efficient, improves over cross-validation for hyperparameter selection, indicating the high promise of random matrix theory for semi-supervised model selection.","Random Matrix Theory, semi-supervised learning, high dimensional statistics" Learning to Solve Constraint Satisfaction Problems with Recurrent Transformers,https://openreview.net/forum?id=udNhDCr2KQe,https://openreview.net/pdf?id=udNhDCr2KQe,We show how Recurrent Transformer can be trained to perform logical reasoning on constraint satisfaction problems in an end-to-end manner and how to inject discrete constraints into its training,"Constraint satisfaction problems (CSPs) are about finding values of variables that satisfy the given constraints. We show that the Transformer model extended with recurrence is a viable approach to learning to solve CSPs in an end-to-end manner, having clear advantages over state-of-the-art methods such as Graph Neural Networks, SATNet, and some neuro-symbolic models. With the ability of Transformers to handle visual input, the proposed Recurrent Transformer can straightforwardly be applied to visual constraint reasoning problems while successfully addressing the symbol grounding problem. We also show how to leverage deductive knowledge of discrete constraints in the Transformer's inductive learning to achieve sample-efficient learning and semi-supervised learning for CSPs.","transformer, constraint reasoning, semi-supervised learning" MARLlib: Extending RLlib for Multi-agent Reinforcement Learning,https://openreview.net/forum?id=q4qocCgE3uM,https://openreview.net/pdf?id=q4qocCgE3uM,"We introduce MARLlib, the MARL extension of RLlib","Despite the fast development of multi-agent reinforcement learning (MARL) methods, there is a lack of commonly-acknowledged baseline implementations and evaluation platforms. As a result, an urgent need for MARL researchers is to develop an integrated library suite, similar to the role of RLlib in single-agent RL, that delivers reliable MARL implementations and replicable evaluation on various benchmarks. To fill this research gap, in this paper, we propose Multi-Agent RLlib (MARLlib), a comprehensive MARL algorithm library that facilitates RLlib in solving multi-agent problems. With a novel design of agent-level distributed dataflow, MARLlib manages to unify tens of algorithms, including different types of independent learning, centralized critic, and value decomposition methods; this leads to a highly composable integration of MARL algorithms that were not possible to unify before. Furthermore, MARLlib goes beyond current work by integrating diverse environment interfaces and providing flexible parameter sharing strategies; this allows users to create versatile solutions to cooperative, competitive, and mixed tasks with minimal code modifications. A plethora of experiments are conducted to substantiate the correctness of our implementation, based on which we further derive new insights into the relationship between the performance and the design of algorithmic components. With MARLlib, we expect researchers to be able to tackle broader real-world multi-agent problems with trustworthy solutions.
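A compact sketch of the recurrent-Transformer idea from the CSP entry above: one weight-tied Transformer encoder applied for several rounds over the CSP cells (e.g., 81 Sudoku tokens), refining per-cell predictions each round. All sizes and the digit-token encoding are illustrative.

```python
import torch.nn as nn

class RecurrentTransformer(nn.Module):
    def __init__(self, vocab=10, dim=128, steps=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.block = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(dim, vocab)
        self.steps = steps

    def forward(self, cells):            # cells: (batch, 81) digit ids, 0 = blank
        h = self.embed(cells)
        for _ in range(self.steps):      # weight-tied recurrence over the board
            h = self.block(h)
        return self.head(h)              # per-cell digit logits
```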
Our code\footnote{\url{https://github.com/ICLR2023Paper4242/MARLlib}} and documentation\footnote{\url{https://iclr2023marllib.readthedocs.io/}} are released for reference.",MARL Improving the imputation of missing data with Markov Blanket discovery,https://openreview.net/forum?id=GrpU6dxFmMN,https://openreview.net/pdf?id=GrpU6dxFmMN,,"The process of imputing missing data typically relies on generative and regression models. These approaches often operate on the unrealistic assumption that all of the data features are directly related to one another, and use all of the available features to impute missing values. In this paper, we propose a novel Markov Blanket discovery approach to determine the optimal feature set for a given variable, by considering both observed variables and the missingness of partially observed variables to account for systematic missingness. We then incorporate this method into the learning process of the state-of-the-art MissForest imputation algorithm, such that it informs MissForest which features to consider when imputing missing values, depending on the variable the missing value belongs to. Experiments across different case studies and multiple imputation algorithms show that the proposed solution improves imputation accuracy, both under random and systematic missingness.","Feature selection, Imputation, Markov Blanket discovery" Boosting the Cycle Counting Power of Graph Neural Networks with I$^2$-GNNs,https://openreview.net/forum?id=kDSmxOspsXQ,https://openreview.net/pdf?id=kDSmxOspsXQ,,"Message Passing Neural Networks (MPNNs) are a widely used class of Graph Neural Networks (GNNs). The limited representational power of MPNNs inspires the study of provably powerful GNN architectures. However, knowing one model is more powerful than another gives little insight about what functions they can or cannot express. It is still unclear whether these models are able to approximate specific functions such as counting certain graph substructures, which is essential for applications in biology, chemistry and social network analysis. Motivated by this, we propose to study the counting power of Subgraph MPNNs, a recent and popular class of powerful GNN models that extract rooted subgraphs for each node, assign the root node a unique identifier and encode the root node's representation within its rooted subgraph. Specifically, we prove that Subgraph MPNNs fail to count more-than-4-cycles at node level, implying that node representations cannot correctly encode the surrounding substructures like ring systems with more than four atoms. To overcome this limitation, we propose I$^2$-GNNs to extend Subgraph MPNNs by assigning different identifiers for the root node and its neighbors in each subgraph. I$^2$-GNNs' discriminative power is shown to be strictly stronger than Subgraph MPNNs and partially stronger than the 3-WL test. More importantly, I$^2$-GNNs are proven capable of counting all 3, 4, 5 and 6-cycles, covering common substructures like benzene rings in organic chemistry, while still keeping linear complexity. To the best of our knowledge, it is the first linear-time GNN model that can count 6-cycles with theoretical guarantees.
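A hedged sketch of the idea in the imputation entry above: impute a variable's missing entries with a random-forest regressor (MissForest-style) fit only on the variable's discovered Markov blanket rather than on all features. Markov blanket discovery itself, and MissForest's iterative refinement, are omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def impute_with_markov_blanket(X, target_col, mb_cols):
    # Fit on rows where the target and its Markov blanket are observed.
    miss = np.isnan(X[:, target_col])
    mb_ok = ~np.isnan(X[:, mb_cols]).any(axis=1)
    model = RandomForestRegressor(n_estimators=100).fit(
        X[~miss & mb_ok][:, mb_cols], X[~miss & mb_ok, target_col])
    X = X.copy()
    fill = miss & mb_ok
    X[fill, target_col] = model.predict(X[fill][:, mb_cols])
    return X
```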
We validate its counting power in cycle counting tasks and demonstrate its competitive performance in molecular prediction benchmarks.",Graph neural networks Optimizing Connectivity through Network Gradients for the Restricted Machine,https://openreview.net/forum?id=7tmlbL5JQyt,https://openreview.net/pdf?id=7tmlbL5JQyt,This paper proposes a novel methodology that optimizes the network connectivity of an RBM using the idea of connection gradients jointly with other model parameters. ,"Leveraging sparse networks to connect successive layers in deep neural networks has recently been shown to provide benefits to large-scale state-of-the-art models. However, network connectivity also plays a significant role in the learning performance of shallow networks, such as the classic Restricted Boltzmann Machines (RBM). Efficiently finding sparse connectivity patterns that improve the learning performance of shallow networks is a fundamental problem. While recent principled approaches explicitly include network connections as model parameters that must be optimized, they often rely on explicit penalization or have network sparsity as a hyperparameter. This work presents a method to find optimal connectivity patterns for RBMs based on the idea of network gradients (NCG): computing the gradient of every possible connection, given a specific connection pattern, and using the gradient to drive a continuous connection strength parameter that in turn is used to determine the connection pattern. Thus, learning RBM parameters and learning network connections are truly jointly performed, albeit with different learning rates, and without changes to the objective function. The method is applied to MNIST and other datasets, showing that better RBM models are found for the benchmark tasks of sample generation and input classification. Results also show that NCG is robust to network initialization, both adding and removing network connections while learning. ","Neural Networks, Restricted Boltzmann Machine, Neural Architecture Search, Network Pruning, Network Optimization, AutoML" Energy Consumption-Aware Tabular Benchmarks for Neural Architecture Search,https://openreview.net/forum?id=35PLkGkJOQ4,https://openreview.net/pdf?id=35PLkGkJOQ4,Energy consumption-aware tabular benchmarks for NAS can be used to access a sub-space of architectures that are inherently efficient.,"The demand for large-scale computational resources for Neural Architecture Search (NAS) has been lessened by tabular benchmarks for NAS. Evaluating NAS strategies is now possible on extensive search spaces and at a moderate computational cost. But so far, NAS has mainly focused on maximising performance on some hold-out validation/test set. However, energy consumption is a partially conflicting objective that should not be neglected. We hypothesise that constraining NAS to include the energy consumption of training the models could reveal a sub-space of undiscovered architectures that are more computationally efficient with a smaller carbon footprint. To support the hypothesis, an existing tabular benchmark for NAS is augmented with the energy consumption of each architecture. We then perform multi-objective optimisation that includes energy consumption as an additional objective. We demonstrate the usefulness of multi-objective NAS for uncovering the trade-off between performance and energy consumption as well as for finding more energy-efficient architectures. 
The updated tabular benchmark is open-sourced to encourage further exploration of energy consumption-aware NAS.","NAS, tabular benchmarks, energy consumption, carbon footprint, deep learning, multi-objective optimisation, automl" Fundamental Limits in Formal Verification of Message-Passing Neural Networks,https://openreview.net/forum?id=WlbG820mRH-,https://openreview.net/pdf?id=WlbG820mRH-,We prove that certain safety properties of MPNN cannot be verified formally.,"Output reachability and adversarial robustness are among the most relevant safety properties of neural networks. We show that in the context of Message Passing Neural Networks (MPNN), a common Graph Neural Network (GNN) model, formal verification is impossible. In particular, we show that output reachability of graph-classifier MPNN, working over graphs of unbounded size, non-trivial degree and sufficiently expressive node labels, cannot be verified formally: there is no algorithm that answers correctly (with yes or no), given an MPNN, whether there exists some valid input to the MPNN such that the corresponding output satisfies a given specification. However, we also show that output reachability and adversarial robustness of node-classifier MPNN can be verified formally when a limit on the degree of input graphs is given a priori. We discuss the implications of these results towards obtaining a complete picture of the possibility, in principle, of formally verifying GNNs, depending on the expressiveness of the involved GNN models and input-output specifications.","formal verification, graph neural networks" Score Matching via Differentiable Physics,https://openreview.net/forum?id=t9myAV_dpCB,https://openreview.net/pdf?id=t9myAV_dpCB,We propose to learn score fields with a differentiable physics operator for natural non-deterministic physical processes like diffusion in order to solve inverse problems and obtain their posterior distribution.,"Diffusion models based on stochastic differential equations (SDEs) gradually perturb a data distribution $p(\mathbf{x})$ over time by adding noise to it. A neural network is trained to approximate the score $\nabla_\mathbf{x} \log p_t(\mathbf{x})$ at time $t$, which can be used to reverse the corruption process. In this paper, we focus on learning the score field that is associated with the time evolution according to a physics operator in the presence of natural non-deterministic physical processes like diffusion. A decisive difference from previous methods is that the SDE underlying our approach transforms the state of a physical system to another state at a later time. For that purpose, we replace the drift of the underlying SDE formulation with a differentiable simulator or a neural network approximation of the physics. At the core of our method, we optimize the so-called probability flow ODE to fit a training set of simulation trajectories inside an ODE solver and solve the reverse-time SDE for inference to sample plausible trajectories that evolve towards a given end state. 
We demonstrate the competitiveness of our approach on several challenging inverse problems.","diffusion models, physics simulations, stochastic partial differential equations" QUIC-FL: Quick Unbiased Compression for Federated Learning,https://openreview.net/forum?id=04OL67rm6ok,https://openreview.net/pdf?id=04OL67rm6ok,A distributed mean estimation compression scheme with accuracy on par with the state of the art while asymptotically improving the decoding time.,"Distributed Mean Estimation (DME) is a fundamental building block in communication-efficient federated learning. In DME, clients communicate their lossily compressed gradients to the parameter server, which estimates the average and updates the model. State-of-the-art DME techniques apply either unbiased quantization methods, resulting in large estimation errors, or biased quantization methods, where unbiasing the result requires that the server decodes each gradient individually, which markedly slows aggregation. In this paper, we propose QUIC-FL, a DME algorithm that achieves the best of all worlds. QUIC-FL is unbiased, offers fast aggregation, and is competitive with the most accurate (slow aggregation) DME techniques. To achieve this, we formalize the problem in a novel way that allows us to use standard solvers to design near-optimal unbiased quantization schemes.","Distributed, Mean Estimation, Federated Learning, Quantization, Unbiased, Communication Efficient, Bandwidth Reduction, Compression" FedMEKT: Split Multimodal Embedding Knowledge Transfer in Federated Learning,https://openreview.net/forum?id=vFK6PhTxKHR7,https://openreview.net/pdf?id=vFK6PhTxKHR7,The paper designs the split embedding knowledge transfer mechanism for multimodal federated learning under a semi-supervised learning setting,"Federated Learning (FL) enables a decentralized machine-learning paradigm to collaboratively train a generalized global model without sharing users' private data. However, most existing FL approaches solely utilize single-modal data, thus limiting these systems from exploiting valuable multimodal data in future personalized applications. Furthermore, most FL methods still rely on labeled data at the client side, which is limited in real-world applications because users cannot self-annotate their data. To leverage the representations from different modalities in FL, we propose a novel multimodal FL framework under a semi-supervised learning setting. Specifically, we develop a split multimodal embedding knowledge transfer mechanism in federated learning, namely, FedMEKT, which enables the exchange of personalized and generalized multimodal representations between server and clients using a small multimodal proxy dataset. Hence, FedMEKT iteratively updates the generalized encoders from the collaborative embedding knowledge of each client, such as modality-averaged representations. Thereby, the generalized encoder can guide personalized encoders to enhance the generalization abilities of client models; afterward, personalized classifiers can be trained using the labeled proxy data to perform supervised tasks. 
Through extensive experiments on three multimodal human activity recognition tasks, we demonstrate that FedMEKT achieves superior performance for both local and global encoder models on linear evaluation while guaranteeing user privacy for personal data and model parameters.","Semi-supervised learning, Multimodal Learning, Federated Learning, Knowledge Transfer" Short-Term Memory Convolutions,https://openreview.net/forum?id=4DU_HCijfJp,https://openreview.net/pdf?id=4DU_HCijfJp,,"The real-time processing of time series signals is a critical issue for many real-life applications. The idea of real-time processing is especially important in the audio domain, as human audio perception is sensitive to any kind of disturbance in perceived signals, especially the lag between auditory and visual modalities. The rise of deep learning (DL) models has complicated the landscape of signal processing. Although they often have superior quality compared to standard DSP methods, this advantage is diminished by higher latency. In this work we propose a method for minimizing latency and memory consumption, called Short-Term Memory Convolution (STMC), and its transposed counterpart. The main advantage of STMC is its low latency, comparable to long short-term memory (LSTM) networks. Furthermore, the training of STMC-based models is faster and more stable as the method is based solely on convolutional neural networks (CNNs). In this study we demonstrate an application of this solution to a U-Net model for an audio separation task. We achieved a 5-fold reduction in inference time and a 2-fold reduction in latency without affecting the output quality.", End-to-End Speech Synthesis Based on Deep Conditional Schrödinger Bridges,https://openreview.net/forum?id=K7YxdCYmd6w,https://openreview.net/pdf?id=K7YxdCYmd6w,," Speech synthesis plays an important role in human-computer interaction. Existing methods mainly employ a traditional two-stage pipeline, e.g., text-to-speech and vocoder. In this paper, we propose a system called Schr\""on, which can generate speech waves in an end-to-end manner by solving Schr\""odinger bridge problems (SBP). In order to make SBP suitable for speech synthesis, we generalize SBP in two aspects. The first generalization makes it possible to accept conditioning variables, which are used to control the generated speech, and the second generalization allows it to handle variable-size input. Besides these two generalizations, we propose two techniques to fill the large information gap between text and speech waveforms for generating high-quality voice. The first technique is to use a text-mel joint representation as the conditional input of the conditional SBP. The second one is to use a branch network for the generation of mel scores as a regularization, so that the text features do not degenerate. Experimental results show that Schr\""on achieves a state-of-the-art MOS of 4.52 on the public dataset LJSpeech. Audio samples are available at https://schron.github.io/.", LexMAE: Lexicon-Bottlenecked Pretraining for Large-Scale Retrieval,https://openreview.net/forum?id=PfpEtB3-csK,https://openreview.net/pdf?id=PfpEtB3-csK,"A new pre-training framework, dubbed lexicon-bottlenecked masked autoencoder, is proposed to learn importance-aware lexicon representations in line with the lexicon-weighting paradigm for large-scale retrieval. 
","In large-scale retrieval, the lexicon-weighting paradigm, learning weighted sparse representations in vocabulary space, has shown promising results with high quality and low latency. Despite it deeply exploiting the lexicon-representing capability of pre-trained language models, a crucial gap remains between language modeling and lexicon-weighting retrieval -- the former preferring certain or low-entropy words whereas the latter favoring pivot or high-entropy words -- becoming the main barrier to lexicon-weighting performance for large-scale retrieval. To bridge this gap, we propose a brand-new pre-training framework, lexicon-bottlenecked masked autoencoder (LexMAE), to learn importance-aware lexicon representations. Essentially, we present a lexicon-bottlenecked module between a normal language modeling encoder and a weakened decoder, where a continuous bag-of-words bottleneck is constructed to learn a lexicon-importance distribution in an unsupervised fashion. The pre-trained LexMAE is readily transferred to the lexicon-weighting retrieval via fine-tuning. On the ad-hoc retrieval benchmark, MS-Marco, it achieves 42.6% MRR@10 with 45.8 QPS for the passage dataset and 44.4% MRR@100 with 134.8 QPS for the document dataset, by a CPU machine. And LexMAE shows state-of-the-art zero-shot transfer capability on BEIR benchmark with 12 datasets. ","Self-Supervised Learning, Lexicon Representation, Large-Scale Retrieval" A GNN-Guided Predict-and-Search Framework for Mixed-Integer Linear Programming,https://openreview.net/forum?id=pHMpgT5xWaE,https://openreview.net/pdf?id=pHMpgT5xWaE,,"Mixed-integer linear programming (MILP) is widely employed for modeling combinatorial optimization problems. In practice, similar MILP instances with only coefficient variations are routinely solved, and machine learning (ML) algorithms are capable of capturing common patterns across these MILP instances. In this work, we combine ML with optimization and propose a novel predict-and-search framework for efficiently identifying high-quality feasible solutions. Specifically, we first predict the solution distributions, then we search for the best feasible solution within a properly defined ball around the predicted solution. We show that our framework is both theoretically and computationally superior to fixing strategies. We conduct extensive experiments on four public datasets and numerical results demonstrate that our proposed framework achieves 51% and 14% better primal gaps than state-of-the-art general-purpose optimization solvers SCIP and Gurobi, respectively. ", Parameter Averaging for SGD Stabilizes the Implicit Bias towards Flat Regions,https://openreview.net/forum?id=p4RvNzlJX7W,https://openreview.net/pdf?id=p4RvNzlJX7W,This paper shows that averaged SGD with a large step-size efficiently converges to flat regions.,"Stochastic gradient descent is a workhorse for training deep neural networks due to its excellent generalization performance. Several studies demonstrated this success is attributed to the implicit bias of the method that prefers a flat minimum and developed new methods based on this perspective. Recently, Izmailov et al. (2018) empirically observed that an averaged stochastic gradient descent with a large step size can bring out the implicit bias more effectively and can converge more stably to a flat minimum than the vanilla stochastic gradient descent. 
In our work, we theoretically justify this observation by showing that the averaging scheme improves the bias-optimization tradeoff arising from the stochastic gradient noise: a large step size amplifies the bias but makes convergence unstable, and vice versa. Specifically, we show that, under certain conditions, averaged stochastic gradient descent can get closer to a solution of a sharpness-penalized objective than vanilla stochastic gradient descent using the same step size. In experiments, we verify our theory and demonstrate that this learning scheme significantly improves performance. ","stochastic gradient descent, large step size, parameter averaging, implicit bias, flat minima" On Explaining Neural Network Robustness with Activation Path,https://openreview.net/forum?id=piIsx-G3Gux,https://openreview.net/pdf?id=piIsx-G3Gux,,"Despite their verified performance, neural networks are prone to being misled by maliciously designed adversarial examples. This work investigates the robustness of neural networks from the activation pattern perspective. We find that despite the complex structure of the deep neural network, most of the neurons provide locally stable contributions to the output, while a minority, which we refer to as float neurons, can greatly affect the prediction. We decompose the computational graph of the neural network into a fixed path and a float path and investigate their roles in generating adversarial examples. Based on our analysis, we categorize the vulnerable examples into Lipschitz vulnerability and float neuron vulnerability. We show that the boost in robust accuracy from randomized smoothing is the result of correcting the latter. We then propose SC-RFP (a smoothed classifier with repressed float path) to further reduce the instability of the float neurons and show that it provides a higher certified radius as well as higher accuracy. ","Randomized Smoothing, Robustness, Neural Network" Flareon: Stealthy Backdoor Injection via Poisoned Augmentation,https://openreview.net/forum?id=ujibH3ervr,https://openreview.net/pdf?id=ujibH3ervr,"A simple, stealthy, lightweight, and effective backdoor injection mechanism that targets the data augmentation pipeline with motion-based triggers.","Open software supply chain attacks, once successful, can exact heavy costs in mission-critical applications. As open-source ecosystems for deep learning flourish and become increasingly universal, they present attackers with previously unexplored avenues to code-inject malicious backdoors in deep neural network models. This paper proposes Flareon, a simple, stealthy, mostly-free, and yet effective backdoor injection payload that specifically targets the data augmentation pipeline with motion-based triggers. Flareon neither alters ground-truth labels, nor modifies the training loss objective, nor does it assume prior knowledge of the victim model architecture and training hyperparameters. By learning multiple triggers for targets simultaneously, it can even produce models that learn target-conditional (or ``any2any'') backdoors. Models trained under Flareon exhibit higher attack success rates for any target choice and better clean accuracies than competing attacks, even though those attacks not only seize greater capabilities but also assume more restrictive attack targets. We also demonstrate the effectiveness of Flareon against recent defenses. 
Flareon is fully open-source and available online to the deep learning community.", Dimensionless instance segmentation by learning graph representations of point clouds,https://openreview.net/forum?id=IN499pgOOEl,https://openreview.net/pdf?id=IN499pgOOEl,Novel method for learning point cloud graph representation for instance segmentation,"Point clouds are an increasingly common spatial data modality, being produced by sensors used in robotics and self-driving cars, and as natural intermediate representations of objects in microscopy and other bioimaging domains (e.g., cell locations over time, or filaments, membranes, or organelle boundaries in cryo-electron micrographs or tomograms). However, semantic and instance segmentation of this data remains challenging due to the complex nature of objects in point clouds, especially in bioimaging domains where objects are often large and can intersect or overlap. Furthermore, methods for operating on point clouds should not be sensitive to the specific orientation or translation of the point cloud, which is often arbitrary. Here, we frame the point cloud instance segmentation problem as a graph learning problem in which we seek to learn a function that accepts the point cloud as input and outputs a probability distribution over neighbor graphs whose connected components correspond to individual object instances. We introduce the Dimensionless Instance Segmentation Transformer (DIST), a deep neural network for spatially invariant instance segmentation of point clouds to solve this point cloud-to-graph problem. DIST uses an SO(n) invariant transformer layer architecture to operate on point clouds of arbitrary dimension and outputs, for each pair of points, the probability that an edge exists between them in the instance graph. We then decode the most likely set of instances using a graph cut. We demonstrate the power of DIST for the segmentation of biomolecules in cryo-electron micrographs and tomograms, far surpassing existing methods for membrane and filament segmentation in empirical evaluation. DIST also applies to scene and object understanding, performing competitively on the ScanNetV2 3D instance segmentation challenge. We anticipate that DIST will underpin a new generation of methods for point cloud segmentation in bioimaging and that our general model and approach will provide useful insights for point cloud segmentation methods in other domains.","instance segmentation, graph representation, point cloud segmentation" Structure by Architecture: Structured Representations without Regularization,https://openreview.net/forum?id=O_lFCPaF48t,https://openreview.net/pdf?id=O_lFCPaF48t,A novel autoencoder architecture to structure the learned representation without regularizing the objective and improve sampling for generative modeling.,"We study the problem of self-supervised structured representation learning using autoencoders for downstream tasks such as generative modeling. Unlike most methods which rely on matching an arbitrary, relatively unstructured, prior distribution for sampling, we propose a sampling technique that relies solely on the independence of latent variables, thereby avoiding the trade-off between reconstruction quality and generative performance typically observed in VAEs. We design a novel autoencoder architecture capable of learning a structured representation without the need for aggressive regularization. 
Our structural decoders learn a hierarchy of latent variables, thereby ordering the information without any additional regularization or supervision. We demonstrate how these models learn a representation that improves results in a variety of downstream tasks including generation, disentanglement, and extrapolation, using several challenging natural image datasets.","Autoencoder, Structure, Generative, Architecture, Disentanglement, Regularization, Hybridization" Understanding Neural Coding on Latent Manifolds by Sharing Features and Dividing Ensembles,https://openreview.net/forum?id=1UCaQYUdE_o,https://openreview.net/pdf?id=1UCaQYUdE_o,We propose neural latent variable models with feature sharing and ensemble detection.,"Systems neuroscience relies on two complementary views of neural data, characterized by single neuron tuning curves and analysis of population activity. These two perspectives combine elegantly in neural latent variable models that constrain the relationship between latent variables and neural activity, modeled by simple tuning curve functions. This has recently been demonstrated using Gaussian processes, with applications to realistic and topologically relevant latent manifolds. Those and previous models, however, miss crucial shared coding properties of neural populations. We propose $\textit{feature sharing}$ across neural tuning curves, which significantly improves performance and leads to better-behaved optimization. We also propose a solution to the problem of $\textit{ensemble detection}$, whereby different groups of neurons, i.e., ensembles, can be modulated by different latent manifolds. This is achieved through a soft clustering of neurons during training, thus allowing for the separation of mixed neural populations in an unsupervised manner. These innovations lead to more interpretable models of neural population activity that train well and perform better even on mixtures of complex latent manifolds. Finally, we apply our method to a recently published grid cell dataset, recovering distinct ensembles, inferring toroidal latents, and predicting neural tuning curves, all in a single integrated modeling framework.","neuroscience, neural activity, tuning curves, neural ensemble detection, grid cells, latent variable models" Learning Fast and Slow for Time Series Forecasting,https://openreview.net/forum?id=q-PbpHD3EOk,https://openreview.net/pdf?id=q-PbpHD3EOk,A novel time series forecasting framework based on continual learning and the Complementary Learning System theory,"Despite the recent success of deep learning for time series forecasting, these methods are not scalable for many real-world applications where data arrives sequentially. Training deep neural forecasters on the fly is notoriously challenging because of their limited ability to adapt to non-stationary environments and remember old knowledge. We argue that the fast adaptation capability of deep neural networks is critical and that successful solutions require effectively handling both new and recurring patterns. In this work, inspired by the Complementary Learning Systems (CLS) theory, we propose Fast and Slow learning Network (FSNet) as a novel framework to address the challenges of online forecasting. Particularly, FSNet improves the slowly-learned backbone by dynamically balancing fast adaptation to recent changes and retrieving similar old knowledge. 
FSNet achieves this via an interaction between two novel complementary components: (i) a per-layer adapter to support fast learning from individual layers, and (ii) an associative memory to support remembering, updating, and recalling repeating events. Extensive experiments on real and synthetic datasets validate FSNet's efficacy and robustness to both new and recurring patterns. Our code will be made available.","online time series forecasting, continual learning" Perturbation Defocusing for Adversarial Defense,https://openreview.net/forum?id=E2y2TrpJhYN,https://openreview.net/pdf?id=E2y2TrpJhYN,propose a new perspective to defend against text adversarial attacks,"Recent research indicates adversarial attacks are likely to deceive neural systems, including large-scale, pre-trained language models. Given a natural sentence, an attacker replaces a subset of words to fool objective models. To defend against adversarial attacks, existing works aim to reconstruct the adversarial examples. However, these methods show limited defense performance on adversarial examples whilst also damaging clean performance on natural examples. Our findings indicate that reconstructing adversarial examples is not necessary for better defense performance. More specifically, we inject non-toxic perturbations into adversarial examples, which can disable almost all malicious perturbations. In order to minimize performance sacrifice, we employ an adversarial example detector to distinguish and repair detected adversarial examples, which alleviates mis-defense on natural examples. Our experimental results on three datasets, two objective models and a variety of adversarial attacks show that the proposed method successfully repairs up to ∼97% of correctly identified adversarial examples with ≤∼2% performance sacrifice. We provide an anonymous demonstration of adversarial detection and repair based on our work.","text adversarial defense, perturbation defocusing" Accuracy Boosters: Epoch-Driven Mixed-Mantissa Block Floating-Point for DNN Training,https://openreview.net/forum?id=QsgeAdRwILD,https://openreview.net/pdf?id=QsgeAdRwILD,"We propose an epoch-driven mixed-mantissa Hybrid Block Floating-Point training method converting 99.7% of arithmetic operations in training to 4-bit mantissas and using 6-bit mantissas in the last epoch, while preserving/outperforming FP32 accuracy.","The unprecedented growth in DNN model complexity, size and the amount of training data has led to a commensurate increase in demand for computing and a search for minimal encoding. Recent research advocates Hybrid Block Floating-Point (HBFP) as a technique that minimizes silicon provisioning in accelerators by converting the majority of arithmetic operations in training to 8-bit fixed-point. In this paper, we perform a full-scale exploration of the HBFP design space including minimal mantissa encoding, varying block sizes, and mixed mantissa bit-widths across layers and epochs. We propose Accuracy Boosters, an epoch-driven mixed-mantissa HBFP that uses 6-bit mantissas only in the last epoch and converts 99.7% of all arithmetic operations in training to 4-bit mantissas. 
Accuracy Boosters enable reducing silicon provisioning for an HBFP training accelerator by 16.98× compared to FP32, while preserving or outperforming FP32 accuracy.","DNN training, low-precision training, mixed-precision training, efficient training, number formats, numerical encodings, block floating-point, DNN accelerators" Compressing multidimensional weather and climate data into neural networks,https://openreview.net/forum?id=Y5SEe3dfniJ,https://openreview.net/pdf?id=Y5SEe3dfniJ,We compress weather and climate data into neural network weights.,"Weather and climate simulations produce petabytes of high-resolution data that are later analyzed by researchers in order to understand climate change or severe weather. We propose a new method of compressing this multidimensional weather and climate data: a coordinate-based neural network is trained to overfit the data, and the resulting parameters are taken as a compact representation of the original grid-based data. While compression ratios range from 300x to more than 3,000x, our method outperforms the state-of-the-art compressor SZ3 in terms of weighted RMSE and MAE. It can faithfully preserve important large-scale atmospheric structures and does not introduce artifacts. When using the resulting neural network as a 790x compressed dataloader to train the WeatherBench forecasting model, its RMSE increases by less than 2%. This three-orders-of-magnitude compression democratizes access to high-resolution climate data and enables numerous new research directions.","Weather and climate data, Compression" Guess the Instruction! Making Language Models Stronger Zero-Shot Learners,https://openreview.net/forum?id=FtOxgKe_Zg2,https://openreview.net/pdf?id=FtOxgKe_Zg2,"We introduce Flipped Learning, a meta-training method that computes the likelihood of the task instruction given the input instance and label.","Meta-training, which fine-tunes the language model (LM) on various downstream tasks by maximizing the likelihood of the target label given the task instruction and input instance, has improved the zero-shot task generalization performance. However, meta-trained LMs still struggle to generalize to challenging tasks containing novel labels unseen during meta-training. In this paper, we propose Flipped Learning, an alternative method of meta-training which trains the LM to generate the task instruction given the input instance and label. During inference, the LM trained with Flipped Learning, referred to as Flipped, selects the label option that is most likely to generate the task instruction. On 14 tasks of the BIG-bench benchmark, the 3B-sized Flipped outperforms the 4-times-larger zero-shot T0-11B (Sanh et al., 2021) and even the 60-times-larger 3-shot GPT-3 (175B) (Brown et al., 2020) on average by 1.8% and 3.1%, respectively. Flipped gives particularly large improvements on unseen labels, outperforming T0-11B by up to +20% average F1 score. This indicates that the strong task generalization of Flipped comes from improved generalization to novel labels. ","natural language processing, zeroshot language models, large language models" Probabilistic Imputation for Time-series Classification with Missing Data,https://openreview.net/forum?id=fBoNN1Y6PjG,https://openreview.net/pdf?id=fBoNN1Y6PjG,,"Multivariate time series data available for real-world applications typically contain a significant amount of missing values. 
A dominant approach for classification with such missing values is to heuristically impute the missing values with specific values (zero, mean, values of adjacent time-steps) or learnable parameters. However, these simple strategies do not take the data generative process into account, and more importantly, do not effectively capture the uncertainty in prediction due to the multiple possibilities for the missing values. In this paper, we propose a novel probabilistic framework for classifying multivariate time series data with missing values. Our model consists of two parts: a deep generative model for missing value imputation and a classifier. Extending the existing deep generative models to better capture structures of time-series data, our deep generative model part is trained to impute the missing values in multiple plausible ways, effectively modeling the uncertainty of the imputation. The classifier part takes the time series data along with the imputed missing values and classifies signals, and is trained to capture the predictive uncertainty due to the multiple possibilities of imputations. Importantly, we show that na\""ively combining the generative model and the classifier could result in trivial solutions where the generative model does not produce meaningful imputations. To resolve this, we present a novel regularization technique that encourages the model to produce useful imputation values that actually help classification. Through extensive experiments on real-world time series data with missing values, we demonstrate the effectiveness of our method.","Missing Data Imputation, Multivariate Time Series, Classification, Generative Model, Variational Autoencoder, Dropout" Delve into the Layer Choice of BP-based Attribution Explanations,https://openreview.net/forum?id=r3B62g_nL0l,https://openreview.net/pdf?id=r3B62g_nL0l,"We quantified the influence of the layer choice on BP-based attributions. Based on the experimental results, we show how to fuse different layer attributions to obtain complete, fine-grained and reliable attribution results.","Many issues in attribution methods have been recognized to be related to the choice of target layers, such as class insensitivity in earlier layers and low resolution in deeper layers. However, as the ground truth of the decision process is unknown, the effect of layer selection has not been well-studied. In this paper, we first employ backdoor attacks to control the decision-making process of the model and quantify the influence of layer choice on class sensitivity, fine-grained localization, and completeness. We obtain three observations: (1) We find that the energy distributions of bottom-layer attributions are class-sensitive, and class-insensitive visualizations come from the presence of a large number of class-insensitive low-score pixels. (2) The choice of target layers determines the completeness and the granularity of attributions. (3) We find that single-layer attributions cannot perform well on both the LeRF and MoRF reliability evaluations. To address these issues, we propose TIF (Threshold Interception and Fusion), a technique to combine the attribution results of all layers. 
Qualitative and quantitative experiments show that the proposed solution is visually sharper and more tightly constrained to the object region than other methods, addresses all of the above issues, and outperforms mainstream methods in reliability and localization evaluations.","XAI, attribution methods, layer choice, TIF" Timing is Everything: Learning to Act Selectively with Costly Actions and Budgetary Constraints,https://openreview.net/forum?id=_BoPed4tYww,https://openreview.net/pdf?id=_BoPed4tYww,,"Many real-world settings involve costs for performing actions; transaction costs in financial systems and fuel costs being common examples. In these settings, performing actions at each time step quickly accumulates costs, leading to vastly suboptimal outcomes. Additionally, repeatedly acting produces wear and tear and, ultimately, damage. Determining when to act is crucial for achieving successful outcomes, and yet the challenge of efficiently learning to behave optimally when actions incur minimally bounded costs remains unresolved. In this paper, we introduce a reinforcement learning (RL) framework named Learnable Impulse Control Reinforcement Algorithm (LICRA) for learning to optimally select both when to act and which actions to take when actions incur costs. At the core of LICRA is a nested structure that combines RL and a form of policy known as impulse control, which learns to maximise objectives when actions incur costs. We prove that LICRA, which seamlessly adopts any RL method, converges to policies that optimally select when to perform actions and their optimal magnitudes. We then augment LICRA to handle problems in which the agent can perform at most k < ∞ actions and, more generally, faces a budget constraint. We show LICRA learns the optimal value function and ensures budget constraints are satisfied almost surely. We empirically demonstrate LICRA's superior performance against benchmark RL methods in OpenAI Gym's Lunar Lander, in Highway environments, and in a variant of the Merton portfolio problem from finance.","dynamic programming, impulse control, optimal stopping, reinforcement learning" Multi-Head State Space Model for Sequence Modeling,https://openreview.net/forum?id=hrRNkyyGGgx,https://openreview.net/pdf?id=hrRNkyyGGgx,"We develop a novel multi-head state space model as a replacement and/or complement to attention, achieving state-of-the-art performance in speech recognition and masked language modeling.","Recently, state space models (SSMs) have shown promising results on sequence modeling tasks. However, a potential challenge of existing works is that SSMs are usually introduced or initialized in a homogeneous way, encouraging the model to only capture similar temporal dynamics on different features. In this paper, we propose a multi-head state space model (MSSM), in which parallel heads are introduced to learn different temporal dynamics on sequence data. Furthermore, we propose a novel variant of the Transformer, referred to as the Stateformer, which combines MSSMs with attention. Experiments on large-scale automatic speech recognition (ASR) and language modeling tasks show the MSSM outperforming a range of attention-based baselines. 
The Stateformer further improves performance, achieving state-of-the-art results on the LibriSpeech ASR task.","multi-head, state space, transformer, stateformer, sequence model, rnn" A Weight Variation-Aware Training Method for Hardware Neuromorphic Chips,https://openreview.net/forum?id=3urtgEaXCA9,https://openreview.net/pdf?id=3urtgEaXCA9,,"Hardware neuromorphic chips that mimic biological nervous systems have recently attracted significant attention due to their ultra-low power consumption and parallel computation. However, the inherent variability of nano-scale synaptic devices causes weight perturbations and a performance drop in neural networks. This paper proposes a training method to find weights that are robust to intrinsic device variability. The stochastic weight characteristic incurred by inherent device variability is considered during training. We investigate the impact of weight variation on both Spiking Neural Networks (SNNs) and standard Artificial Neural Networks (ANNs) with different architectures, including fully connected networks, convolutional neural networks (CNNs), VGG, and ResNet, on MNIST, CIFAR-10, and CIFAR-100. Experimental results show that the weight variation-aware training method (WVAT) can dramatically reduce the performance drop under weight variability by exploring a flat loss landscape. When there are weight perturbations, WVAT yields 85.21% accuracy with VGG-5 on CIFAR-10, reducing accuracy degradation by more than 1/10 compared with SGD. Finally, WVAT is easy to implement on various architectures with little computational overhead.","edge computing systems, neuro-inspired computing, hardware implementation, synaptic device, hardware-oriented neural network" Semantic Prior for Weakly Supervised Class-Incremental Segmentation,https://openreview.net/forum?id=0vG8GbuPOH3,https://openreview.net/pdf?id=0vG8GbuPOH3,Leveraging semantic similarities between old and new classes to improve weakly supervised class-incremental semantic segmentation,"Class-incremental semantic image segmentation assumes multiple model updates, each enriching the model to segment new categories. This is typically carried out by providing pixel-level manual annotations for all new objects, limiting the adoption of such methods. Approaches which solely require image-level labels offer an attractive alternative; yet such annotations lack crucial information about the location and boundary of new objects. In this paper we argue that, since classes represent not just indices but semantic entities, the conceptual relationships between them can provide valuable information that should be leveraged. We propose a weakly supervised approach that leverages such semantic relations in order to transfer some cues from the previously learned classes into the new ones, complementing the supervisory signal from image-level labels. We validate our approach on a number of continual learning tasks, and show how even a simple pairwise interaction between classes can significantly improve the segmentation mask quality of both old and new classes. 
We show that these conclusions still hold for longer and, hence, more realistic sequences of tasks and for a challenging few-shot scenario.","class-incremental learning, weakly supervised semantic segmentation" A Mutual Information Duality Algorithm for Multi-Agent Specialization,https://openreview.net/forum?id=Cx1xYn6vVm2,https://openreview.net/pdf?id=Cx1xYn6vVm2,The social behavioral change in population learning is impacted by the dual properties of mutual information.,"Social behavior change in a population has long been studied as an essential component of multi-agent learning. Learning behavioral change not only involves reinforcement learning (RL) but can also be measured against the general population with mutual information (MI). The combination of RL and MI led us to derive MI optimizations from the policy gradient. With MI as the multi-agent optimization objective, we discover that the dual properties of MI can result in distinctly different population behaviors. From MI maximization that maximizes the stability of a population to MI minimization that enables specialization among the agents, the dual of MI creates a significant change in a population's behavioral properties. In this paper, we propose a minimax formulation of MI (M\&M) that enables agent specialization with stable regularization. Empirically, we evaluate M\&M against the prior SOTA MARL framework and analyze the social behavior change in performance, diversity, and the stability of social graphs. ","Multi-agent, Reinforcement Learning, Mutual Information, Duality, Policy Gradient, Social Graph" DECAP: Decoding CLIP Latents for Zero-shot Captioning,https://openreview.net/forum?id=Lt8bMlhiwx2,https://openreview.net/pdf?id=Lt8bMlhiwx2,We propose a simple framework with a lightweight visual-aware language decoder for zero-shot captioning.,"Large-scale pre-trained multi-modal models (e.g., CLIP) demonstrate strong zero-shot transfer capability in many discriminative tasks, e.g., image classification. Their adaptation to zero-shot image-conditioned text generation tasks has drawn increasing interest. Prior works approach zero-shot captioning by either utilizing existing large language models (e.g., GPT-2) or pre-training an encoder-decoder network in an end-to-end manner. However, the large language models may not generate sensible descriptions due to the task discrepancy between captioning and language modeling, while the end-to-end pre-training requires paired data and extensive computational resources. In this work, we propose a simple framework, named DeCap, for zero-shot captioning. We introduce a lightweight visual-aware language decoder. This decoder is both data-efficient and computationally efficient: 1) it only requires text data for training, easing the burden of collecting paired data; 2) it does not require end-to-end training. When trained with only the text data, the decoder takes the text embedding extracted from the off-the-shelf CLIP encoder as a prefix embedding. The challenge is that the decoder is trained on a text corpus but, at the inference stage, needs to generate captions based on visual inputs. Though the CLIP text embedding and the visual embedding are correlated, a modality gap is widely observed in multi-modal contrastive models, which prevents us from directly taking the visual embedding as the prefix embedding. We propose a training-free mechanism to reduce the modality gap. 
We project the visual embedding into the CLIP text embedding space, such that the projected embedding retains the information of the visual input. Taking the projected embedding as the prefix embedding, the decoder generates high-quality descriptions that match the visual input. The experiments show that DeCap outperforms other zero-shot captioning methods by a large margin on the typical image captioning benchmarks, i.e., MSCOCO and Flickr30k. On Flickr30k, the zero-shot result is even competitive with fully supervised methods. We apply DeCap to video captioning and achieve state-of-the-art zero-shot performance on MSR-VTT and ActivityNet-Captions.","Zero-shot captioning, Decoder training, Multi-modal learning" Heterogeneous Loss Function with Aggressive Rejection for Contaminated data in anomaly detection,https://openreview.net/forum?id=jD2jOzx1aef,https://openreview.net/pdf?id=jD2jOzx1aef,This paper proposes a heterogeneous loss with aggressive rejection for contaminated data in anomaly detection. ,"A clean training dataset, consisting of only normal data, is crucial for detecting anomalous data. However, a clean dataset is challenging to produce in practice. Here, a heterogeneous loss function with aggressive rejection is proposed, which strengthens robustness against contamination. Aggressive rejection constrains training on the intersection of the normal and abnormal distributions to handle potential anomalies. The heterogeneous loss function utilizes an adaptive mini-batch stochastic choice of the order of the asymptotic polynomial of the GA loss, which further dynamically optimizes the gradient for the intersection. Through the proposed method, mean-square-error-based models can outperform various robust loss functions and achieve performance comparable to robust models on contaminated data across three image datasets.","Anomaly detection, contaminated data, Unsupervised learning" Biological Factor Regulatory Neural Network,https://openreview.net/forum?id=UqVDq19iVx,https://openreview.net/pdf?id=UqVDq19iVx,"We propose BFReg-NN to simulate the complex biological processes in a cell system, understand the functions of genes or proteins, and ultimately give insights into the mechanism of larger living systems","Genes are fundamental for analyzing biological systems, and many recent works have proposed utilizing gene expression for various biological tasks via deep learning models. Despite their promising performance, it is hard for deep neural networks to provide biological insights for humans due to their black-box nature. Recently, some works have integrated biological knowledge with neural networks to improve the transparency and performance of their models. However, these methods can only incorporate partial biological knowledge, leading to suboptimal performance. In this paper, we propose the Biological Factor Regulatory Neural Network (BFReg-NN), a generic framework to model relations among biological factors in cell systems. BFReg-NN starts from gene expression data and is capable of merging most existing biological knowledge into the model, including the regulatory relations among genes or proteins (e.g., gene regulatory networks (GRN), protein-protein interaction networks (PPI)) and the hierarchical relations among genes, proteins and pathways (e.g., several genes/proteins are contained in a pathway). Moreover, BFReg-NN also has the ability to provide new biologically meaningful insights because of its white-box characteristics. 
Experimental results on different gene expression-based tasks verify the superiority of BFReg-NN compared with baselines. Our case studies also show that the key insights found by BFReg-NN are consistent with the biological literature.","biological factor, gene regulatory network" Preserving Semantics in Textual Adversarial Attacks,https://openreview.net/forum?id=LofRPZeXNNk,https://openreview.net/pdf?id=LofRPZeXNNk,We propose a novel sentence encoder that improves the quality of textual adversarial attacks.,"Adversarial attacks in NLP challenge the way we look at language models. The goal of this kind of adversarial attack is to modify the input text to fool a classifier while maintaining the original meaning of the text. Although most existing adversarial attacks claim to fulfill the constraint of semantics preservation, careful scrutiny shows otherwise. We show that the problem lies in the text encoders used to determine the similarity of adversarial examples, specifically in the way they are trained. Unsupervised training methods make these encoders more susceptible to problems with antonym recognition. To overcome this, we introduce a simple, fully supervised sentence embedding technique called Semantics-Preserving-Encoder (SPE). The results show that our solution minimizes the variation in the meaning of the adversarial examples generated. It also significantly improves the overall quality of adversarial examples, as confirmed by human evaluators. Furthermore, it can be used as a component in any existing attack to speed up its execution while maintaining similar attack success.","NLP, Adversarial Attacks, Sentence Encoders, Semantics Similarity" Unbiased Decisions Reduce Regret: Adversarial Optimism for the Bank Loan Problem,https://openreview.net/forum?id=wqu7hutzn3K,https://openreview.net/pdf?id=wqu7hutzn3K,We use adversarial domain adaptation to combat accumulating bias for a class of online learning problems.,"In many real-world settings, binary classification decisions are made based on limited data in near real-time, e.g., when assessing a loan application. We focus on a class of these problems that share a common feature: the true label is only observed when a data point is assigned a positive label by the learner, e.g., we only learn the outcome of \emph{accepted} loan applications. In this setting, sometimes referred to as the Bank Loan Problem (BLP) in the literature, the labelled training set suffers from accumulating bias since it is created by the learner's past decisions. Prior work mitigates the consequences of this bias by injecting optimism into the model to allow the learner to correct self-reinforcing false rejections. This reduces long-term regret but comes at the cost of a higher false acceptance rate. We introduce \emph{adversarial optimism} (AdOpt) to directly address the bias in the training set using \emph{adversarial domain adaptation}. The goal of AdOpt is to learn an unbiased but informative representation of past data, by reducing the distributional shift between the set of \textit{accepted} data points and all data points seen thus far. AdOpt integrates classification made using this debiased representation of the data with the recently proposed \emph{pseudo-label optimism} (PLOT) method to increase the rate of correct decisions at every timestep. 
AdOpt significantly exceeds state-of-the-art performance on a set of challenging BLP benchmark problems.","adversarial domain adaptation, online learning, adversarial de-biasing, neural optimism" That Label's got Style: Handling Label Style Bias for Uncertain Image Segmentation,https://openreview.net/forum?id=wZ2SVhOTzBX,https://openreview.net/pdf?id=wZ2SVhOTzBX,We present a method to reduce bias caused by differing label styles for uncertain image segmentation.,"Segmentation uncertainty models predict a distribution over plausible segmentations for a given input, which they learn from the annotator variation in the training set. However, in practice these annotations can differ systematically in the way they are generated, for example through the use of different labeling tools. This results in datasets that contain both data variability and differing label styles. In this paper, we demonstrate that applying state-of-the-art segmentation uncertainty models to such datasets can lead to model bias caused by the different label styles. We present an updated modelling objective that conditions on labeling style for aleatoric uncertainty estimation, and modify two state-of-the-art architectures for segmentation uncertainty accordingly. We show with extensive experiments that this method reduces label style bias, while improving segmentation performance, increasing the applicability of segmentation uncertainty models in the wild. We curate two datasets, with annotations in different label styles, which we will make publicly available along with our code upon publication.","Uncertainty Quantification, Segmentation" Prompt Injection: Parameterization of Fixed Inputs,https://openreview.net/forum?id=_Esnr5KvO6-,https://openreview.net/pdf?id=_Esnr5KvO6-,We formulate a new problem called Prompt Injection (PI) that focuses on injecting the prompt into the parameters of an LM as an efficient alternative to attaching fixed prompts to the input.,"Recent works have shown that attaching prompts to the input is effective at conditioning Language Models (LM) to perform specific tasks. However, prompts are always included in the input text during inference, thus incurring substantial computational and memory overhead. Also, there is currently no straightforward method of utilizing prompts that are longer than the maximum input length of the LMs without incurring additional costs during inference. We formulate a new problem called Prompt Injection (PI) that focuses on injecting the prompt into the parameters of an LM as an efficient alternative to attaching fixed prompts to the input. We show that in scenarios with long fixed prompts, PI can be up to 280 times more efficient in terms of total FLOPs than previous approaches. We further explore methodologies for PI and show promising results in persona-dependent conversation, semantic parsing, and zero-shot learning with task instructions. Through these explorations, we show that PI can be a promising direction for conditioning language models, especially in scenarios with long and fixed prompts.","Prompt, Injection, Parameterization, Language Model, Inference Efficiency, Distillation" Holistic Adversarially Robust Pruning,https://openreview.net/forum?id=sAJDi9lD06L,https://openreview.net/pdf?id=sAJDi9lD06L,"We propose HARP, which realizes adversarially robust pruning in a holistic way and yields outstanding capability at aggressive compression.","Neural networks can be drastically shrunk in size by removing redundant parameters. 
While crucial for deployment on resource-constrained hardware, compression often comes with a severe drop in accuracy and a lack of adversarial robustness. Despite recent advances, counteracting both aspects has only succeeded for moderate compression rates so far. We propose a novel method, HARP, that copes with aggressive pruning significantly better than prior work. For this, we consider the network holistically, learning a global compression strategy that, however, can be specific to each layer. Our method optimizes both how many parameters (compression rate) and which parameters (scoring connections) to prune. It fine-tunes an existing model with dynamic regularization that follows a step-wise incremental function balancing the different objectives. It starts by favoring robustness before shifting focus to reaching the target compression rate and only then handles the objectives equally. The learned compression strategies allow us to maintain the pre-trained model's natural accuracy as well as its adversarial robustness for a reduction by 99% of the network's original size. Moreover, we observe a crucial influence of non-uniform compression across layers.","adversarial robustness, model pruning" PASHA: Efficient HPO and NAS with Progressive Resource Allocation,https://openreview.net/forum?id=syfgJE6nFRW,https://openreview.net/pdf?id=syfgJE6nFRW,Efficient multi-fidelity method with progressive resource allocation for HPO and NAS.,"Hyperparameter optimization (HPO) and neural architecture search (NAS) are methods of choice to obtain the best-in-class machine learning models, but in practice they can be costly to run. When models are trained on large datasets, tuning them with HPO or NAS rapidly becomes prohibitively expensive for practitioners, even when efficient multi-fidelity methods are employed. We propose an approach to tackle the challenge of tuning machine learning models trained on large datasets with limited computational resources. Our approach, named PASHA, extends ASHA and is able to dynamically allocate maximum resources for the tuning procedure depending on the need. The experimental comparison shows that PASHA identifies well-performing hyperparameter configurations and architectures while consuming significantly fewer computational resources than ASHA.","Neural architecture search, Hyperparameter optimization, Large datasets, Computational efficiency, Cost" Thinking fourth dimensionally: Treating Time as a Random Variable in EBMs,https://openreview.net/forum?id=m0fEJ2bvwpw,https://openreview.net/pdf?id=m0fEJ2bvwpw,"We generalize the common practice of utilizing a series of auxiliary distributions in EBM training, and utilize this approach to improve the performance of existing methods.","Recent years have seen significant progress in techniques for learning high-dimensional distributions. Many modern methods, from diffusion models to Energy-Based Models (EBMs), adopt a coarse-to-fine approach. This is often done by introducing a series of auxiliary distributions that gradually change from the data distribution to some simple distribution (e.g. white Gaussian noise). Methods in this category separately learn each auxiliary distribution (or transition between pairs of consecutive distributions) and then use the learned models sequentially to generate samples. 
In this paper, we offer a simple way to generalize this idea by treating the ``time'' index of the series as a random variable and framing the problem as that of learning a single joint distribution of ``time'' and samples. We show that this joint distribution can be learned using any existing EBM method and that it leads to improved results. As an example, we demonstrate this approach using contrastive divergence (CD) in its most basic form. On CIFAR-10 and CelebA ($32\times 32$), this method outperforms previous CD-based methods in terms of inception and FID scores.","energy-based models, markov chain monte carlo, contrastive divergence" Diversity of Generated Unlabeled Data Matters for Few-shot Hypothesis Adaptation,https://openreview.net/forum?id=_apb5VI2_0o,https://openreview.net/pdf?id=_apb5VI2_0o,,"Generating unlabeled data has been recently shown to help address the few-shot hypothesis adaptation (FHA) problem, where we aim to train a classifier for the target domain with a few labeled target-domain data and a well-trained source-domain classifier (i.e., a source hypothesis), using the additional information of highly-compatible unlabeled data. However, the data generated by existing methods are extremely similar or even identical. The strong dependency among the generated data will cause the learning to fail. In this paper, we propose a diversity-enhancing generative network (DEG-Net) for the FHA problem, which can generate diverse unlabeled data with the help of a kernel independence measure: the Hilbert-Schmidt independence criterion (HSIC). Specifically, DEG-Net generates data by minimizing the HSIC value (i.e., maximizing the independence) among the semantic features of the generated data. By DEG-Net, the generated unlabeled data are more diverse and more effective for addressing the FHA problem. Experimental results show that DEG-Net outperforms existing FHA baselines and further verifies that generating diverse data plays an important role in addressing the FHA problem. ",Hypothesis Adaptation StableDR: Stabilized Doubly Robust Learning for Recommendation on Data Missing Not at Random,https://openreview.net/forum?id=3VO1y5N7K1H,https://openreview.net/pdf?id=3VO1y5N7K1H,This paper proposes a theoretically guaranteed stabilized doubly robust learning approach that overcomes the shortcomings due to the presence of extremely small propensities in debiased recommendations.,"In recommender systems, users always choose their favorite items to rate, which leads to data missing not at random and poses a great challenge for unbiased evaluation and learning of prediction models. Currently, doubly robust (DR) methods have been widely studied and demonstrate superior performance. However, in this paper, we show that DR methods are unstable and have unbounded bias, variance, and generalization bounds under extremely small propensities. Moreover, the fact that DR relies more on extrapolation leads to suboptimal performance. To address the above limitations while retaining double robustness, we propose a stabilized doubly robust (StableDR) learning approach with a weaker reliance on extrapolation. Theoretical analysis shows that StableDR has bounded bias, variance, and generalization error bound simultaneously under inaccurate imputed errors and arbitrarily small propensities. In addition, we propose a novel learning approach for StableDR that updates the imputation, propensity, and prediction models cyclically, achieving more stable and accurate predictions.
Extensive experiments show that our approaches significantly outperform the existing methods.","Recommender System, Bias, Debias, Doubly Robust" Hybrid-Regressive Neural Machine Translation,https://openreview.net/forum?id=2NQ8wlmU9a_,https://openreview.net/pdf?id=2NQ8wlmU9a_,," In this work, we empirically confirm that non-autoregressive translation with an iterative refinement mechanism (IR-NAT) suffers from poor acceleration robustness because it is more sensitive to decoding batch size and computing device settings than autoregressive translation (AT). Inspired by this, we investigate how to better combine the strengths of the autoregressive and non-autoregressive translation paradigms. To this end, we demonstrate through synthetic experiments that prompting with a small number of AT predictions can promote one-shot non-autoregressive translation to achieve performance equivalent to IR-NAT. Following this line, we propose a new two-stage translation prototype called hybrid-regressive translation (HRT). Specifically, HRT first generates discontinuous sequences via autoregression (e.g., making a prediction every $k$ tokens, $k>1$) and then fills in all previously skipped tokens at once in a non-autoregressive manner. We also propose a bag of techniques to effectively and efficiently train HRT without adding any model parameters. HRT achieves the state-of-the-art BLEU score of 28.49 on the WMT En-De task and is at least 1.5x faster than AT, regardless of batch size and device. In addition, another bonus of HRT is that it successfully inherits the good characteristics of AT in the deep-encoder-shallow-decoder architecture. Concretely, compared to the vanilla HRT with a 6-layer encoder and 6-layer decoder, the inference speed of HRT with a 12-layer encoder and 1-layer decoder is further doubled on both GPU and CPU without BLEU loss.","autoregressive translation, non-autoregressive translation, inference acceleration" Variational Causal Dynamics: Discovering Modular World Models from Interventions,https://openreview.net/forum?id=a1ttBXvNCLO,https://openreview.net/pdf?id=a1ttBXvNCLO,,"Latent world models allow agents to reason about complex environments with high-dimensional observations. However, adapting to new environments and effectively leveraging previous knowledge remain significant challenges. We present variational causal dynamics (VCD), a structured world model that exploits the invariance of causal mechanisms across environments to achieve fast and modular adaptation. By causally factorising a transition model, VCD is able to identify reusable components across different environments. This is achieved by combining causal discovery and variational inference to learn a latent representation and transition model jointly in an unsupervised manner. Specifically, we optimise the evidence lower bound jointly over a representation model and a transition model structured as a causal graphical model. In evaluations on simulated environments with state and image observations, we show that VCD is able to successfully identify causal variables, and to discover consistent causal structures across different environments. Moreover, given a small number of observations in a previously unseen, intervened environment, VCD is able to identify the sparse changes in the dynamics and to adapt efficiently. In doing so, VCD significantly extends the capabilities of the current state-of-the-art in latent world models while also comparing favourably in terms of prediction accuracy.
", Query The Agent: Improving Sample Efficiency Through Epistemic Uncertainty Estimation,https://openreview.net/forum?id=hF1WEiIYPNb,https://openreview.net/pdf?id=hF1WEiIYPNb,Designing more sample efficient reinforcement learning curricula by measuring and exploiting agents' epistemic uncertainty. ,"Curricula for goal-conditioned reinforcement learning agents typically rely on poor estimates of the agent's epistemic uncertainty or fail to consider the agents' epistemic uncertainty altogether, resulting in poor sample efficiency. We propose a novel algorithm, Query The Agent (QTA), which significantly improves sample efficiency by estimating the agent's epistemic uncertainty throughout the state space and setting goals in highly uncertain areas. Encouraging the agent to collect data in highly uncertain states allows the agent to improve its estimation of the value function rapidly. QTA utilizes a novel technique for estimating epistemic uncertainty, Predictive Uncertainty Networks (PUN), to allow QTA to assess the agent's uncertainty in all previously observed states. We demonstrate that QTA offers decisive sample efficiency improvements over preexisting methods.","goal-conditioned reinforcement learning, reinforcement learning, goal-conditioned, goal, model-free, sample efficiency, deep reinforcement learning" Differentiable Logic Programming for Probabilistic Reasoning,https://openreview.net/forum?id=FbC2VeNlth5,https://openreview.net/pdf?id=FbC2VeNlth5,Learn Logic Rules for Reasoning,"This paper studies inductive logic programming for probabilistic reasoning. The key problems, i.e. learning rule structures and learning rule weights, have been extensively studied with traditional discrete searching methods as well as recent neural-based approaches. In this paper, we present a new approach called Differentiable Logic Programming (DLP), which provides a flexible framework for learning first-order logical rules for reasoning. We propose a continuous version of optimization problem for learning high-quality rules as a proxy and generalize rule learning and forward chaining algorithms in a differentiable manner, which enables us to efficiently learn rule structures and weights via gradient-based methods. Theoretical analysis and empirical results show effectiveness of our approach.","Inductive Logic Programming, Differentiable Programming, Logic Rules" Rewiring with Positional Encodings for GNNs,https://openreview.net/forum?id=5c9imxdLlCW,https://openreview.net/pdf?id=5c9imxdLlCW,,"Several recent works use positional encodings to extend the receptive fields of graph neural network (GNN) layers equipped with attention mechanisms. These techniques, however, extend receptive fields to the complete graph, at substantial computational cost and risking a change in the inductive biases of conventional GNNs, or require complex architecture adjustments. As a conservative alternative, we use positional encodings to expand receptive fields to $r$-hop neighborhoods. More specifically, our method augments the input graph with additional nodes/edges and uses positional encodings as node and/or edge features. We thus modify graphs before inputting them to a downstream GNN model, instead of modifying the model itself. This makes our method model-agnostic, i.e. compatible with any existing GNN architectures. We also provide examples of positional encodings that are lossless with a one-to-one map between the original and the modified graphs. 
We demonstrate that extending receptive fields via positional encodings and a virtual fully-connected node significantly improves GNN performance and alleviates over-squashing using small $r$. We obtain improvements on a variety of models and datasets, and reach state-of-the-art performance using traditional GNNs or graph Transformers.", Automatic Dictionary Generation: Could Brothers Grimm Create a Dictionary with BERT?,https://openreview.net/forum?id=8xZogWcm73f,https://openreview.net/pdf?id=8xZogWcm73f,,"The creation of the most famous German dictionary, also referred to as ``Deutsches Wörterbuch'' or in English ``The German Dictionary'', by the two brothers Jacob and Wilhelm Grimm, took more than a lifetime to be finished (1838--1961). In our work we examine the question of whether it would have been possible for them to create a dictionary using present technology, i.e., language models such as BERT. Starting with the definition of the task of Automatic Dictionary Generation, we propose a method based on contextualized word embeddings and hierarchical clustering to create a dictionary given unannotated text corpora. We justify our design choices by running variants of our method on English texts, where ground truth dictionaries are available. Finally, we apply our approach to Shakespeare's work and automatically generate a dictionary tailored to Shakespearean vocabulary and contexts without human intervention.","dictionary generation, natural language processing, transformers" Feed-Forward Latent Domain Adaptation,https://openreview.net/forum?id=BR1qoDGxjWp,https://openreview.net/pdf?id=BR1qoDGxjWp,Cross-attention based meta-learning approach for fast latent domain adaptation,"We study the highly practical but comparatively under-studied problem of latent-domain adaptation, where a source model should be adapted to a target dataset that contains a mixture of unlabelled domain-relevant and domain-irrelevant examples. Furthermore, motivated by the requirements for data privacy and the need for embedded and resource-constrained devices of all kinds to adapt to local data distributions, we focus on the setting of feed-forward source-free domain adaptation, where adaptation should not require access to the source dataset, and should also be backpropagation-free. Our solution is to meta-learn a network capable of embedding the mixed-relevance target dataset and dynamically adapting inference for target examples using cross-attention. The resulting framework leads to consistent improvement on strong ERM baselines. We also show that our framework sometimes even improves on the upper bound of domain-supervised adaptation, where only domain-relevant instances are provided for adaptation. This suggests that human-annotated domain labels may not always be optimal, and raises the possibility of doing better through automated instance selection.","latent domain adaptation, source-free, cross-attention, meta-learning" "Sampling-based inference for large linear models, with application to linearised Laplace",https://openreview.net/forum?id=aoDyX6vSqsd,https://openreview.net/pdf?id=aoDyX6vSqsd,We scale the linearised Laplace method for uncertainty estimation to large neural networks and datasets using an efficient method for posterior sampling,"Large-scale linear models are ubiquitous throughout machine learning, with contemporary application as surrogate models for neural network uncertainty quantification; that is, the linearised Laplace method.
Alas, the computational cost associated with Bayesian linear models constrains this method's application to small networks, small output spaces and small datasets. We address this limitation by introducing a scalable sample-based Bayesian inference method for conjugate Gaussian multi-output linear models, together with a matching method for hyperparameter (regularisation) selection. Furthermore, we use a classic feature normalisation method (the g-prior) to resolve a previously highlighted pathology of the linearised Laplace method. Together, these contributions allow us to perform linearised neural network inference with ResNet-18 on CIFAR100 (11M parameters, 100 output dimensions × 50k datapoints) and with a U-Net on a high-resolution tomographic reconstruction task (2M parameters, 251k output dimensions).","Laplace, linearised Laplace, Bayesian neural network, Bayesian linear regression, uncertainty estimation, Bayesian deep learning, EM, large scale regression, sample then optimise, evidence framework" Defending against Adversarial Audio via Diffusion Model,https://openreview.net/forum?id=5-Df3tljit7,https://openreview.net/pdf?id=5-Df3tljit7,We propose a defense method based on diffusion models for acoustic systems against diverse audio adversarial examples.,"Deep learning models have been widely used in commercial acoustic systems in recent years. However, adversarial audio examples can cause abnormal behaviors in those acoustic systems, while being hard for humans to perceive. Various methods, such as transformation-based defenses and adversarial training, have been proposed to protect acoustic systems from adversarial attacks, but they are less effective against adaptive attacks. Furthermore, directly applying the methods from the image domain can lead to suboptimal results because of the unique properties of audio data. In this paper, we propose an adversarial purification-based defense pipeline, AudioPure, for acoustic systems via off-the-shelf diffusion models. Taking advantage of the strong generation ability of diffusion models, AudioPure first adds a small amount of noise to the adversarial audio and then runs the reverse sampling step to purify the noisy audio and recover clean audio. AudioPure is a plug-and-play method that can be directly applied to any pretrained classifier without any fine-tuning or re-training. We conduct extensive experiments on the speech command recognition task to evaluate the robustness of AudioPure. Our method is effective against diverse adversarial attacks (e.g. L2 or L∞-norm). It outperforms the existing methods under both strong adaptive white-box and black-box attacks bounded by L2 or L∞-norm (up to +54% in robust accuracy). Besides, we also evaluate the certified robustness for perturbations bounded by L2-norm via randomized smoothing. Our pipeline achieves a higher certified accuracy than baselines.","Adversarial attack and defense, AI security, speech recognition, diffusion models" Text-Guided Diffusion Image Style Transfer with Contrastive Loss Fine-tuning,https://openreview.net/forum?id=iJ_E0ZCy8fi,https://openreview.net/pdf?id=iJ_E0ZCy8fi,,"Recently, diffusion models have demonstrated superior performance in text-guided image style transfer. However, there exists a fundamental trade-off between transforming styles and maintaining content in diffusion models.
Although a simple remedy would be to use a deterministic sampling scheme such as the denoising diffusion implicit model (DDIM), which guarantees perfect reconstruction, it requires computationally expensive fine-tuning of the diffusion models. To address this, here we present a text-guided sampling scheme using patch-wise contrastive loss fine-tuning. By exploiting the contrastive loss between the samples and the original images, our diffusion model can generate an image with the same semantic content as the source image. Experimental results demonstrate that our approach outperforms the existing methods while maintaining content and requiring no additional training of the diffusion model.","Diffusion models, Style transfer" FedProp: Cross-client Label Propagation for Federated Semi-supervised Learning,https://openreview.net/forum?id=JrVIWD81Z0u,https://openreview.net/pdf?id=JrVIWD81Z0u,,"Federated learning (FL) allows multiple clients to jointly train a machine learning model in such a way that no client has to share their data with any other participating party. In the supervised setting, where all client data is fully labeled, FL has been widely adopted for learning tasks that require data privacy. However, it is an ongoing research question how to best perform federated learning in a semi-supervised setting, where the clients possess data that is only partially labeled or even completely unlabeled. In this work, we propose a new method, FedProp, that follows a manifold-based approach to semi-supervised learning (SSL). It estimates the data manifold jointly from the data of multiple clients and computes pseudo-labels using cross-client label propagation. To avoid clients having to share their data with anyone, FedProp employs two cryptographically secure yet highly efficient protocols: multi-party Hamming distance computation and secure aggregation. Experiments on three standard benchmarks show that FedProp achieves higher classification accuracy than previous federated SSL methods. Furthermore, as a pseudo-label-based technique, FedProp is complementary to other federated SSL approaches, in particular consistency-based ones. We demonstrate experimentally that further accuracy gains are possible by combining both.","federated learning, semi-supervised learning, label propagation, cryptographically secure computation" Test-Time Adaptation for Real-World Denoising Networks via Noise-Aware Image Generation,https://openreview.net/forum?id=7uIycrR-KOa,https://openreview.net/pdf?id=7uIycrR-KOa,,"Image denoising addresses the challenging task of recovering clean images from unseen noise, which can follow different distributions depending on scenes, camera models, ISO settings, etc. Previous works have attempted to handle unseen noise by adapting denoising neural networks to each given noisy image. However, a single noisy image can only provide a limited amount of information for training networks. Therefore, we propose to generate noisy images with diverse yet realistic noise that is similar to the noise in a given input image. Such noise generation is difficult to achieve given only a single noisy image. To address the challenge, we propose a normalizing flow (NF) framework that can learn the latent representation of noise, conditioned on noisy images. We also employ a Gaussian mixture model to better handle real-world unseen noise by leveraging multiple noise distributions.
Using the proposed NF model, our framework can generate multiple synthetic noisy images to facilitate the adaptation of denoising networks to each given image. To further improve the adaptation to unseen noise, we integrate a meta-learning algorithm into our framework. The experimental results demonstrate that our framework substantially improves the performance of several denoising networks on unseen real-world noise across numerous real-world benchmark datasets.", Gated Inference Network: Inferencing and Learning State-Space Models,https://openreview.net/forum?id=eCcG7QlFWeJ,https://openreview.net/pdf?id=eCcG7QlFWeJ,"A structure, using Bayesian properties and non-linearity in its design, is introduced that can learn complex state-spaces.","State-space models (SSMs) perform predictions by learning the underlying dynamics of the observed sequence. We propose a new SSM for both high- and low-dimensional observation spaces, which utilizes Bayesian filtering-smoothing to model the system's dynamics more accurately than RNN-based SSMs and can be learned in an end-to-end manner. The designed architecture, which we call the Gated Inference Network (GIN), is able to integrate uncertainty estimates and learn the complicated dynamics of the system, enabling us to perform estimation and imputation tasks in both the presence and absence of data. The proposed model incorporates GRU cells into its structure to complete the data flow, while avoiding expensive computations and potentially unstable matrix inversions. The GIN can deal with any time-series data and provides strong robustness to observational noise. In the numerical experiments, we show that the GIN reduces the uncertainty of estimates and outperforms its counterparts: LSTMs, GRUs and variational approaches.","Time Series, Recurrent Networks, Gaussian Process" Theoretical Characterization of the Generalization Performance of Overfitted Meta-Learning,https://openreview.net/forum?id=Jifob4dSh99,https://openreview.net/pdf?id=Jifob4dSh99,,"Meta-learning has arisen as a successful method for improving training performance by training over many similar tasks, especially with deep neural networks (DNNs). However, the theoretical understanding of when and why overparameterized models such as DNNs can generalize well in meta-learning is still limited. As an initial step towards addressing this challenge, this paper studies the generalization performance of overfitted meta-learning under a linear regression model with Gaussian features. In contrast to a few recent studies along the same line, our framework allows the number of model parameters to be arbitrarily larger than the number of features in the ground truth signal, and hence naturally captures the overparameterized regime in practical deep meta-learning. We show that the overfitted min $\ell_2$-norm solution of model-agnostic meta-learning (MAML) can be beneficial, which is similar to the recent remarkable findings on ""benign overfitting"" and ""double descent"" phenomena in the classical (single-task) linear regression. However, due to the uniqueness of meta-learning such as task-specific gradient descent inner training and the diversity/fluctuation of the ground-truth signals among training tasks, we find new and interesting properties that do not exist in single-task linear regression. We first provide a high-probability upper bound (under reasonable tightness) on the generalization error, where certain terms decrease when the number of features increases.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large. Under this circumstance, we show that the overfitted min $\ell_2$-norm solution can achieve an even lower generalization error than the underparameterized solution.", Cold Posteriors through PAC-Bayes,https://openreview.net/forum?id=HwcEuhLtCJr,https://openreview.net/pdf?id=HwcEuhLtCJr,"PAC-Bayes objectives naturally contain a temperature parameter; we investigate its relation to the cold posterior effect.","We investigate the cold posterior effect through the lens of PAC-Bayes generalization bounds. We argue that in the non-asymptotic setting, when the number of training samples is (relatively) small, discussions of the cold posterior effect should take into account that approximate Bayesian inference does not readily provide guarantees of performance on out-of-sample data. Instead, out-of-sample error is better described through a generalization bound. In this context, we explore the connections of the ELBO objective from variational inference and the PAC-Bayes objectives. We note that, while the ELBO and PAC-Bayes objectives are similar, the latter objectives naturally contain a temperature parameter $\lambda$ which is not restricted to be $\lambda=1$. For both simplified regression and realistic classification tasks, in the case of Laplace approximations to the posterior, we show how this PAC-Bayesian interpretation of the temperature parameter captures important aspects of the cold posterior effect.","cold posteriors, Bayesian, Bayesian Neural Networks, PAC-Bayes, Laplace approximation" Learning 3D Point Cloud Embeddings using Optimal Transport,https://openreview.net/forum?id=ar_09k_Gsos,https://openreview.net/pdf?id=ar_09k_Gsos,"We introduce a novel method to learn Wasserstein embeddings for 3D point clouds, built on a contrastive learning setup.","Learning embeddings of any data largely depends on the ability of the target space to capture semantic relations. The widely used Euclidean space, where embeddings are represented as point vectors, is known to be lacking in its potential to exploit complex structures and relations. Contrary to standard Euclidean embeddings, in this work, we embed point clouds as discrete probability distributions in Wasserstein space. We build a contrastive learning setup to learn Wasserstein embeddings that can be used as a pre-training method with or without supervision for any downstream task. We show that the features captured by Wasserstein embeddings are better in preserving the point cloud geometry, including both global and local information, thus resulting in improved quality embeddings. We perform exhaustive experiments and demonstrate the effectiveness of our method for point cloud classification, transfer learning, segmentation and interpolation tasks over multiple datasets including synthetic and real-world objects in both supervised and self-supervised settings. We also compare against other existing methods and show that our method outperforms them in all downstream tasks.
Additionally, our study reveals a promising interpretation in terms of capturing the critical points of point clouds, which makes our proposed method self-explainable.","Optimal Transport, 3D Point Cloud, Wasserstein Space, Contrastive Learning" Training language models for deeper understanding improves brain alignment,https://openreview.net/forum?id=KzkLAE49H9b,https://openreview.net/pdf?id=KzkLAE49H9b,"We show that training language models for deeper narrative understanding (characters, emotions, relationships) results in richer representations that have improved alignment to human brain activity.","Building systems that achieve a deeper understanding of language is one of the central goals of natural language processing (NLP). Towards this goal, recent works have begun to train language models on narrative datasets which require extracting the most critical information by integrating across long contexts. However, it is still an open question whether these models are learning a deeper understanding of the text, or if the models are simply learning a heuristic to complete the task. This work investigates this further by turning to the one language processing system that truly understands complex language: the human brain. We show that training language models for deeper narrative understanding results in richer representations that have improved alignment to human brain activity. We further find that the improvements in brain alignment are larger for character names than for other discourse features, which indicates that these models are learning important narrative elements. Taken together, these results suggest that this type of training can indeed lead to deeper language understanding. These findings have consequences both for cognitive neuroscience by revealing some of the significant factors behind brain-NLP alignment, and for NLP by highlighting that understanding of long-range context can be improved beyond language modeling.","language, nlp, neuroscience, fMRI, interpretability" DeNF: Unsupervised Scene-Decompositional Normalizing Flows,https://openreview.net/forum?id=2qM88ymKO6r,https://openreview.net/pdf?id=2qM88ymKO6r,,"Unsupervised object-centric scene decomposition models can learn compositional and hierarchical representations of multi-object scene data that allow the abstraction of the data into object entities and spaces. However, previous approaches, either based on VAE or GAN frameworks, have no guarantee of preserving particular aspects of the image in scene reconstruction. In this work, we propose the first such probabilistic model, called DeNF. Based on recent advances in normalizing flows, we represent the scene as a mixture of bidirectional flows that map a set of structured prior distributions into the scene data distribution. The bijective mapping of DeNF yields efficient sampling and density evaluation at training time. Furthermore, it improves the fidelity of the scene's visual contents in the reconstruction process.
In our experiments on real and synthetic image data for unsupervised scene decomposition, DeNF achieves competitive results.", VQ-TR: Vector Quantized Attention for Time Series Forecasting,https://openreview.net/forum?id=RMnJxnLwGak,https://openreview.net/pdf?id=RMnJxnLwGak,A linear transformer using a vector quantized cross attention block for time series forecasting.,"Modern time series datasets can easily contain hundreds or thousands of time points; however, Transformer-based models scale poorly with sequence length, constraining their context size in the seq-to-seq setting. In this work, we introduce VQ-TR, which maps large sequences to a discrete set of latent representations as part of the Attention module. This allows us to attend over larger context windows with linear complexity with respect to the sequence length. We compare this method with other competitive deep learning and classical univariate probabilistic models and highlight its performance using both probabilistic and point forecasting metrics on a variety of open datasets from different domains.","deep learning, time series forecasting, latent variable models, transformer" Local KL Convergence Rate for Stein Variational Gradient Descent with Reweighted Kernel,https://openreview.net/forum?id=k2CRIF8tJ7Y,https://openreview.net/pdf?id=k2CRIF8tJ7Y,," We study convergence properties of the Stein Variational Gradient Descent (SVGD) algorithm for sampling from an unnormalized probability distribution $p_*({x})\propto\exp(-f_*({x}))$. Compared with the Kernelized Stein Discrepancy (KSD) convergence analyzed in the previous literature, KL convergence is a more convincing criterion that can better explain the effectiveness of SVGD in real-world applications. In the population limit, SVGD performs smoothed gradient descent with a kernel integral operator. Notably, SVGD with smoothing kernels suffers from gradient vanishing in low-density areas, which makes the error term between the smoothed gradient and the Wasserstein gradient uncontrollable. In this context, we introduce a reweighted kernel to amplify the smoothed gradient in low-density areas, which leads to a bounded error term with respect to the Wasserstein gradient. When $p_*({x})$ satisfies a log-Sobolev inequality, we develop the convergence rate for SVGD in KL divergence with the reweighted kernel. Our analysis points out the defects of conventional smoothing kernels in SVGD and provides the convergence rate for SVGD in KL divergence.","SVGD, asymptotic analysis" A Decomposition Based Dual Projection Model for Multivariate Time Series Forecasting and Anomaly Detection,https://openreview.net/forum?id=u4quDwcd9S3,https://openreview.net/pdf?id=u4quDwcd9S3,A seasonal-trend decomposition-based model with channel-wise and sequence-wise dual projection is developed for efficient and accurate multivariate time series forecasting and anomaly detection.,"Efficient anomaly detection and diagnosis in multivariate time series data is of great importance for various application areas. Forecasting of long-sequence time series is an important problem to prepare for future changes. An accurate prediction can help to detect anomalous events beforehand and make better decisions. It seems that one has to use more complex structures for deep learning models to get better performance, e.g., the recent surge of Transformer variants for time series modeling. However, such complex architectures require a large amount of training data and extensive computing resources.
In addition, many of the considerations behind such architectures do not hold for time series applications. The objective of this study is to reconsider the effectiveness of deep learning architectures for efficient and accurate time series forecasting and anomaly detection. A model with direct projections is proposed, and it outperforms existing Transformer-based models in most cases by a significant margin. The new decomposition-based dual projection (DBDP) model consists of an anchored global profile and a varied number of decomposed seasonal local profiles of the time series for better forecasting performance. In addition to forecasting, which is a non-contrastive self-supervised learning approach, we propose to include a contrastive learning module in the DBDPC model for better forecasting performance and robustness. Finally, we apply the DBDP and DBDPC models to forecasting-based time series anomaly detection and achieve superior performance over the latest SoTA models. These results demonstrate the effectiveness of the several key considerations behind the DBDP and DBDPC models, which also encourages the development of new architectures for time series applications.","Multivariate Time Series, Decomposition, Projection, Forecasting, Anomaly Detection" LEXA: Language-agnostic Cross-consistency Training for Question Answering Tasks,https://openreview.net/forum?id=DUIjRKNFFbG,https://openreview.net/pdf?id=DUIjRKNFFbG,We developed a novel pre-training method to improve cross-lingual consistency in a language model. We demonstrate the achieved ability on several question answering datasets.,"Cross-lingual information retrieval (CLIR) is a knowledge-intensive NLP task that requires a lot of domain-specific data in different languages. In previous works, authors mostly used machine translation and iterative training for data mining. We consider the problem from another angle and present a novel cross-lingual pre-training and fine-tuning approach for CLIR tasks based on cross-lingual alignment. We present a new model, LEXA-LM, that significantly improves cross-lingual knowledge transfer, thus achieving a new state of the art in cross-lingual and monolingual question answering and cross-lingual sentence retrieval. Moreover, we show that our pre-training technique LEXA is a very powerful tool for the zero-shot scenario, allowing us to outperform some supervised methods. ","pre-training, language model, natural language processing" FedHPO-Bench: A Benchmark Suite for Federated Hyperparameter Optimization,https://openreview.net/forum?id=kTqYrmqnUm1,https://openreview.net/pdf?id=kTqYrmqnUm1,,"Hyperparameter optimization (HPO) is crucial for machine learning algorithms to achieve satisfactory performance. Its research progress has been boosted by existing HPO benchmarks. Nonetheless, existing efforts in benchmarking all focus on HPO for traditional centralized learning while ignoring federated learning (FL), a promising paradigm for collaboratively learning models from dispersed data. In this paper, we first identify some uniqueness of HPO for FL algorithms from various aspects. Due to this uniqueness, existing HPO benchmarks no longer satisfy the need to compare HPO methods in the FL setting. To facilitate the research of HPO in the FL setting, we propose and implement a benchmark suite FedHPO-Bench that incorporates comprehensive FedHPO problems, enables flexible customization of the function evaluations, and eases continuing extensions.
We also conduct extensive experiments based on FedHPO-Bench to provide the community with more insights into FedHPO. We open-sourced FedHPO-Bench at https://github.com/FedHPO-Bench/FedHPO-Bench-ICLR23.","federated learning, hyperparameter optimization" CCT: Cross-consistency training for Clone Detection and Code Search Tasks,https://openreview.net/forum?id=aLgoMwjNDsi,https://openreview.net/pdf?id=aLgoMwjNDsi,"We present a novel approach for pre-training language models for better code and text representations, improving results in code search and clone detection.","Clone detection is a well-known task that can be formulated for any programming language, although, to the best of our knowledge, no cross-lingual clone detection task has been formulated. In this work we formulate such a task, along with a specific training procedure, CCT, for a deep learning language model. This procedure allows the CCT-trained model to outperform the existing approaches on the POJ-104 benchmark with a result of 95.67\% MAP, as well as on the newly created cross-lingual clone detection benchmark XCD. Moreover, the CCT model sets new state-of-the-art results on the code search task AdvTest with 47.15\% MRR.","pre-training, language model" Robust Explanation Constraints for Neural Networks,https://openreview.net/forum?id=_hHYaKu0jcj,https://openreview.net/pdf?id=_hHYaKu0jcj,We present a method for guaranteeing adversarial robustness of explanations that are based on the input gradient of a neural network.,"Post-hoc explanation methods are used with the intent of providing insights about neural networks and are sometimes said to help engender trust in their outputs. However, popular explanation methods have been found to be fragile to minor perturbations of input features or model parameters. Relying on constraint relaxation techniques from non-convex optimization, we develop a method that upper-bounds the largest change an adversary can make to a gradient-based explanation via bounded manipulation of either the input features or model parameters. By propagating a compact input or parameter set as symbolic intervals through the forwards and backwards computations of the neural network we can formally certify the robustness of gradient-based explanations. Our bounds are differentiable, hence we can incorporate provable explanation robustness into neural network training. Empirically, our method surpasses the robustness provided by previous heuristic approaches. We find that our training method is the only method able to learn neural networks with certificates of explanation robustness across all six datasets tested. ","Explainability, Adversarial Robustness, Neural Networks, Robustness Certification" Cyclophobic Reinforcement Learning,https://openreview.net/forum?id=FoRC6dIfO8u,https://openreview.net/pdf?id=FoRC6dIfO8u,"Cyclophobic Reinforcement Learning systematically and efficiently explores the state space by penalizing cycles, achieving excellent results in sparse reward environments.","In environments with sparse rewards, finding a good inductive bias for exploration is crucial to the agent's success. However, there are two competing goals: novelty search and systematic exploration. While existing approaches such as curiosity-driven exploration find novelty, they sometimes do not systematically explore the whole state space, akin to depth-first search vs. breadth-first search. In this paper, we propose a new intrinsic reward that is cyclophobic, i.e. it does not reward novelty, but punishes redundancy by avoiding cycles.
Augmenting the cyclophobic intrinsic reward with a sequence of hierarchical representations based on the agent's cropped observations, we are able to achieve excellent results in the MiniGrid and MiniHack environments. Both are particularly hard, as they require complex interactions with different objects in order to be solved. Detailed comparisons with previous approaches and thorough ablation studies show that our newly proposed cyclophobic reinforcement learning is vastly more efficient than other state-of-the-art methods.","Reinforcement learning, intrinsic rewards, exploration, transfer learning, objects" Emergent collective intelligence from massive-agent cooperation and competition,https://openreview.net/forum?id=4orJ47he7WV,https://openreview.net/pdf?id=4orJ47he7WV,,"Inspired by organisms evolving through cooperation and competition between different populations on Earth, we study the emergence of artificial collective intelligence through massive-agent reinforcement learning. To this end, we propose a new massive-agent reinforcement learning environment, Lux, where dynamic and massive agents in two teams scramble for limited resources and fight off the darkness. In Lux, we build our agents through the standard reinforcement learning algorithm in curriculum learning phases and leverage centralized control via a pixel-to-pixel policy network. As agents co-evolve through self-play, we observe several stages of intelligence, from the acquisition of atomic skills to the development of group strategies. Since these learned group strategies arise from individual decisions without an explicit coordination mechanism, we claim that artificial collective intelligence emerges from massive-agent cooperation and competition. We further analyze the emergence of various learned strategies through metrics and ablation studies, aiming to provide insights for reinforcement learning implementations in massive-agent environments.","Reinforcement Learning, Multi-agent System, Emergent Behavior" Offline Reinforcement Learning via High-Fidelity Generative Behavior Modeling,https://openreview.net/forum?id=42zs3qa2kpy,https://openreview.net/pdf?id=42zs3qa2kpy,,"In offline reinforcement learning, weighted regression is a common method to ensure the learned policy stays close to the behavior policy and to prevent selecting out-of-sample actions. In this work, we show that due to the limited distributional expressivity of policy models, previous methods might still select unseen actions during training, which deviates from their initial motivation. To address this problem, we adopt a generative approach by decoupling the learned policy into two parts: an expressive generative behavior model and an action evaluation model. The key insight is that such decoupling avoids learning an explicitly parameterized policy model with a closed-form expression. Directly learning the behavior policy allows us to leverage existing advances in generative modeling, such as diffusion-based methods, to model diverse behaviors. As for action evaluation, we combine our method with an in-sample planning technique to further avoid selecting out-of-sample actions and increase computational efficiency. Experimental results on D4RL datasets show that our proposed method achieves competitive or superior performance compared with state-of-the-art offline RL methods, especially in complex tasks such as AntMaze.
We also empirically demonstrate that our method can successfully learn from a heterogeneous dataset containing multiple distinctive but similarly successful strategies, whereas previous unimodal policies fail.","offline reinforcement learning, generative models, diffusion models, behavior modeling" GraphVF: Controllable Protein-Specific 3D Molecule Generation with Variational Flow,https://openreview.net/forum?id=IB5Njg_ztYB,https://openreview.net/pdf?id=IB5Njg_ztYB,,"Designing molecules that bind to specific target proteins is a fundamental task in drug discovery. Recent generative models leveraging geometrical constraints imposed by proteins and molecules have shown great potential in generating protein-specific 3D molecules. Nevertheless, these existing methods fail to generate 3D molecules with 2D skeletal constraints, which encode pharmacophoric patterns essential to drug potency. To cope with this challenge, we propose GraphVF, which seamlessly integrates geometrical and skeletal restraints into a variational flow framework, where the former is captured through a flow transformation and the latter is encoded by an amortized factorized Gaussian. We empirically verify that our method achieves state-of-the-art performance on protein-specific 3D molecule generation in terms of binding affinity and some other drug properties. In particular, it represents the first controllable, geometry-aware, protein-specific molecule generation method, which enables creating 3D molecules with specified chemical sub-structures or drug properties. ","Controllable Molecular Generation, Pocket-based Drug Design, Variational Flow" Graph Neural Networks as Gradient Flows: understanding graph convolutions via energy,https://openreview.net/forum?id=M3GzgrA7U4,https://openreview.net/pdf?id=M3GzgrA7U4,We apply the gradient flow formalism to GNNs to both develop new frameworks and provide a better theoretical understanding of existing ones.,"Gradient flows are differential equations that minimize an energy functional and constitute the main descriptors of physical systems. We apply this formalism to Graph Neural Networks (GNNs) to develop new frameworks for learning on graphs as well as provide a better theoretical understanding of existing ones. We derive GNNs as a gradient flow equation of a parametric energy that provides a physics-inspired interpretation of GNNs as learning particle dynamics in the feature space. In particular, we show that in graph convolutional models (GCN), the positive/negative eigenvalues of the channel mixing matrix correspond to attractive/repulsive forces between adjacent features. We rigorously prove how the channel-mixing can learn to steer the dynamics towards low or high frequencies, which allows us to deal with heterophilic graphs. We show that the same class of energies is decreasing along a larger family of GNNs; albeit not gradient flows, they retain their inductive bias.
We experimentally evaluate an instance of the gradient flow framework that is principled, more efficient than GCN, and achieves competitive performance on graph datasets of varying homophily, often outperforming recent baselines specifically designed to target heterophily.","Graph Neural Networks, Gradient flows, Energy functionals, Spectral theory" CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers,https://openreview.net/forum?id=rB6TpjAuSRy,https://openreview.net/pdf?id=rB6TpjAuSRy,,"Large-scale pretrained transformers have reached a milestone in text (GPT-3) and text-to-image (DALL-E and CogView) generation. However, their application to video generation still faces several challenges: unaffordable computation cost, and the scarcity and weak relevance of text-video datasets. In this work, we present CogVideo, a 9B-parameter transformer for text-to-video generation. The CogVideo model has been trained by inheriting a pretrained text-to-image model, CogView2, which significantly reduces the training cost and alleviates the problem of scarcity and weak relevance. We also propose a multi-frame-rate training strategy for better aligning text and video clips. CogVideo achieves state-of-the-art performance in machine evaluation and outperforms publicly available models by a large margin in human evaluation. Its codes and model are also publicly available. The anonymous web demo is available at https://cogvideo.pka.moe.","pretraining, text-to-video generation" Revisit Finetuning strategy for Few-Shot Learning to Strengthen the Equivariance of Emdeddings,https://openreview.net/forum?id=tXc-riXhmx,https://openreview.net/pdf?id=tXc-riXhmx,We design a novel finetuning strategy to finetune the feature extractor with unbiased estimation in Few-Shot Learning.,"Few-Shot Learning (FSL) aims to learn a simple and effective bias on limited novel samples. Recently, many methods have focused on re-training a randomly initialized linear classifier to adapt it to the novel features extracted by the pre-trained feature extractor (called Linear-Probing-based methods). These methods typically assumed the pre-trained feature extractor was robust enough, i.e., finetuning was not needed, and hence the pre-trained feature extractor does not change on the novel samples. However, the unchanged pre-trained feature extractor will distort the features of novel samples because the robustness assumption may not hold, especially on out-of-distribution samples. To extract the undistorted features, we designed Linear-Probing-Finetuning with Firth-Bias (LP-FT-FB) to yield an accurate bias on the limited samples for better finetuning the pre-trained feature extractor, imposing equivariance on the whole model. In LP-FT-FB, we further proposed inverse Firth Bias Reduction (i-FBR) to regularize the over-parameterized feature extractor on which FBR does not work well. The proposed i-FBR effectively alleviates the over-fitting problem of the feature extractor in the process of finetuning and helps extract undistorted novel features. To show the effectiveness of the designed LP-FT-FB, we conducted extensive experiments on the commonly used FSL datasets under different backbones, including in-domain and cross-domain FSL tasks.
The experimental results show that the proposed LP-FT-FB outperforms the SOTA FSL methods.","Few-Shot Learning, Finetuning, Equivariance" Memory Learning of Multivariate Asynchronous Time Series,https://openreview.net/forum?id=LPcxnvN9vLw,https://openreview.net/pdf?id=LPcxnvN9vLw,Modeling Multivariate Asynchronous Time Series,"Sequential observations from complex systems are usually collected irregularly and asynchronously across variables. Besides, they are typically both serially and cross-sectionally dependent. Recurrent networks are commonly used to model such sequential data, trying to simultaneously capture marginal dynamics and dependence dynamics with one shared memory. This leads to two problems. First, some heterogeneous marginal information is difficult to preserve in the shared memory. Second, in an asynchronous setting, missing values across variables will introduce bias in the shared memory. To solve these problems, this paper designs a new architecture that seamlessly integrates continuous-time ODE solvers with a set of memory-aware GRU blocks. It learns memory profiles separately and addresses the issue of asynchronous observations. Numerical results confirm that this new architecture outperforms a variety of state-of-the-art baseline models on datasets from various fields.","Multivariate Asynchronous Time Series, Gated Recurrent Unit, Sequential Models" Scalable Multi-Modal Continual Meta-Learning,https://openreview.net/forum?id=ytuGu-E4cIl,https://openreview.net/pdf?id=ytuGu-E4cIl,,"This paper focuses on continual meta-learning, where few-shot tasks are sequentially available and sampled from a non-stationary distribution. Motivated by this challenging setting, many works have been developed with a mixture of meta-knowledge to cope with the heterogeneity and a dynamically changing number of components to capture incremental information. However, the underlying assumption of mutual exclusiveness among mixture components prevents sharing meta-knowledge across different clusters of tasks. Moreover, the existing incremental methods only rely on the prior to determine whether to increase meta-knowledge, where the unlimited increase would lead to parameter inefficiency. In our work, we propose a Scalable Multi-Modal Continual Meta-Learning (SMM-CML) algorithm. It employs a multi-modal premise that not only encourages different clusters of tasks to share meta-knowledge but also maintains their diversity. Moreover, to capture the incremental information, our algorithm uses the Indian Buffet Process (IBP) as a prior on the number of components and proposes a sparsity method based on evidential theory to filter out components that do not receive support information directly from tasks. Thus we can learn the posterior number of components to avoid parameter inefficiency and reduce computational consumption. Experiments show SMM-CML outperforms SOTA baselines, which illustrates the effectiveness of our multi-modal meta-knowledge, and confirms that our algorithm can learn the truly needed meta-knowledge from tasks.","Continual meta-learning, Indian Buffet Process, Evidential Sparsification" Optimizing Spca-based Continual Learning: A Theoretical Approach,https://openreview.net/forum?id=Vf6WcUDnY7c,https://openreview.net/pdf?id=Vf6WcUDnY7c,This paper proposes a theoretical analysis of a simple but efficient continual learning algorithm,"Catastrophic forgetting and the stability-plasticity dilemma are two major obstacles to continual learning.
In this paper we first propose a theoretical analysis of an SPCA-based continual learning algorithm using high-dimensional statistics. Second, we design OSCL (Optimized Spca-based Continual Learning), which builds on a flexible task optimization based on the theory. By optimizing a single task, catastrophic forgetting can be prevented theoretically. When optimizing multiple tasks, the trade-off between integrating knowledge from the new task and retaining previous knowledge of the old task can be achieved by assigning appropriate weights to the corresponding tasks in compliance with the objectives. Experimental results confirm that the various theoretical conclusions are robust to a wide range of data distributions. Besides, several applications on synthetic and real data show that the proposed method, while being computationally efficient, achieves results comparable to some state-of-the-art methods.","continual learning, high dimensional statistics, machine learning theory" Value Memory Graph: A Graph-Structured World Model for Offline Reinforcement Learning,https://openreview.net/forum?id=UYcIheNY9Pf,https://openreview.net/pdf?id=UYcIheNY9Pf,,"Reinforcement Learning (RL) methods are typically applied directly in environments to learn policies. In some complex environments with continuous state-action spaces, sparse rewards, and/or long temporal horizons, learning a good policy in the original environments can be difficult. Focusing on the offline RL setting, we aim to build a simple and discrete world model that abstracts the original environment. RL methods are applied to our world model instead of the environment data for simplified policy learning. Our world model, dubbed Value Memory Graph (VMG), is designed as a directed-graph-based Markov decision process (MDP) whose vertices and directed edges represent graph states and graph actions, respectively. As the state-action spaces of VMG are finite and relatively small compared to the original environment, we can directly apply the value iteration algorithm on VMG to estimate graph state values and figure out the best graph actions. VMG is trained from and built on the offline RL dataset. Together with an action translator that converts the abstract graph actions in VMG to real actions in the original environment, VMG controls agents to maximize episode returns. Our experiments on the D4RL benchmark show that VMG can outperform state-of-the-art offline RL methods in several tasks, especially when environments have sparse rewards and long temporal horizons. Code will be made publicly available.","Offline Reinforcement Learning, Model Based Reinforcement learning" CAKE: CAusal and collaborative proxy-tasKs lEarning for Semi-Supervised Domain Adaptation,https://openreview.net/forum?id=L97ftsVhiUi,https://openreview.net/pdf?id=L97ftsVhiUi,,"Semi-supervised domain adaptation (SSDA) adapts a learner to a new domain by effectively utilizing source domain data and a few labeled target samples. It is a practical yet under-investigated research topic. In this paper, we analyze the SSDA problem from two perspectives that have previously been overlooked, and correspondingly decompose it into two \emph{key subproblems}: \emph{robust domain adaptation (DA) learning} and \emph{maximal cross-domain data utilization}. \textbf{(i)} From a causal theoretical view, a robust DA model should distinguish the invariant ``concept'' (key clue to image label) from the nuisance of confounding factors across domains.
To achieve this goal, we propose to generate \emph{concept-invariant samples} to enable the model to classify the samples through causal intervention, yielding improved generalization guarantees; \textbf{(ii)} Based on the robust DA theory, we aim to maximally utilize the rich source domain data and the few labeled target samples to boost SSDA further. Consequently, we propose a collaborative debiasing learning framework that utilizes two complementary semi-supervised learning (SSL) classifiers to mutually exchange their unbiased knowledge, which helps unleash the potential of source and target domain training data, thereby producing more convincing pseudo-labels. The labels thus obtained facilitate cross-domain feature alignment and duly improve the invariant concept learning. In our experimental study, we show that the proposed model significantly outperforms SOTA methods in terms of effectiveness and generalisability on SSDA datasets.", RulE: Neural-Symbolic Knowledge Graph Reasoning with Rule Embedding,https://openreview.net/forum?id=UBSPGUwjNV,https://openreview.net/pdf?id=UBSPGUwjNV,,"Knowledge graph (KG) reasoning, which predicts missing links by reasoning over existing facts, is an important problem for knowledge graphs. Knowledge graph embedding (KGE) is one of the most popular methods to address this problem. It embeds entities and relations into low-dimensional vectors and uses the learned entity/relation embeddings to predict missing facts. However, KGE only uses zeroth-order (propositional) logic to encode existing triplets (e.g., ``Alice is Bob's wife.""); it is unable to leverage first-order (predicate) logic to represent generally applicable logical \textbf{rules} (e.g., ``$\forall x,y \colon x ~\text{is}~ y\text{'s wife} \rightarrow y ~\text{is}~ x\text{'s husband}$''). On the other hand, traditional rule-based KG reasoning methods usually rely on hard logical rule inference, making them brittle and hardly competitive with KGE. In this paper, we propose RulE, a novel and principled framework to represent and model logical rules and triplets. RulE jointly represents entities, relations and logical rules in a unified embedding space. By learning an embedding for each logical rule, RulE can perform logical rule inference in a soft way and give a confidence score to each grounded rule, similar to how KGE gives each triplet a confidence score. Compared to KGE alone, RulE allows injecting prior logical rule information into the embedding space, which improves the generalization of knowledge graph embedding. Besides, the learned confidence scores of rules improve the logical rule inference process by softly controlling the contribution of each rule, which alleviates the brittleness of logic. We evaluate our method with link prediction tasks. Experimental results on multiple benchmark KGs demonstrate the effectiveness of RulE.", Sampling-free Inference for Ab-Initio Potential Energy Surface Networks,https://openreview.net/forum?id=Tuk3Pqaizx,https://openreview.net/pdf?id=Tuk3Pqaizx,We improve neural wave function methods by avoiding numerical integration at inference time and introducing restricted neural wave functions.,"Recently, it has been shown that neural networks not only approximate the ground-state wave functions of a single molecular system well but can also generalize to multiple geometries. While such generalization significantly speeds up training, each energy evaluation still requires Monte Carlo integration, which limits the evaluation to a few geometries.
In this work, we address the inference shortcomings by proposing the Potential learning from ab-initio Networks (PlaNet) framework, in which we simultaneously train a surrogate model in addition to the neural wave function. At inference time, the surrogate avoids expensive Monte-Carlo integration by directly estimating the energy, accelerating the process from hours to milliseconds. In this way, we can accurately model high-resolution multi-dimensional energy surfaces for larger systems that were previously unobtainable via neural wave functions. Finally, we explore an additional inductive bias by introducing physically-motivated restricted neural wave function models. We implement such a function with several additional improvements in the new PESNet++ model. In our experimental evaluation, PlaNet accelerates inference by 7 orders of magnitude for larger molecules like ethanol while preserving accuracy. Compared to previous energy surface networks, PESNet++ reduces energy errors by up to 74%.","Graph Neural Networks, Computational Physics, Self-Generative Learning, Machine Learning for Science, Online Learning, Self-Supervised Learning, Molecules" PET-NeuS: Positional Encoding Triplanes for Neural Surfaces,https://openreview.net/forum?id=MHXO5xRCSXh,https://openreview.net/pdf?id=MHXO5xRCSXh,"We improve NeuS by introducing Tri-planes, modulated positional encoding, and learned self-attention convolutions.","The signed distance function (SDF) represented by an MLP network is commonly used for multi-view neural surface reconstruction. We build on the successful recent method NeuS to extend it by three new components. The first component is to borrow the Tri-plane representation from EG3D and represent signed distance fields as a mixture of tri-planes and MLPs instead of representing them with MLPs only. Discretizing the scene space with Tri-planes leads to a more expressive data structure, but involving tri-planes introduces noise due to discretization discontinuities. The second component is to use a new type of positional encoding with learnable weights to combat noise in the reconstruction process. We divide the features in the tri-plane into multiple frequency bands and modulate them with sin and cos functions of different frequencies. The third component is to use learnable convolution operations on the tri-plane features using self-attention convolution to produce features with different frequencies. The experiments show that PET-NeuS achieves high-fidelity surface reconstruction on standard datasets. Following previous work and using the Chamfer metric as the most important way to measure surface reconstruction quality, we are able to improve upon the NeuS baseline by 25\% on Nerf-synthetic (0.84 compared to 1.12) and by 14\% on DTU (0.75 compared to 0.87). The qualitative evaluation reveals how our method can better control the interference of high-frequency noise.","Multi-view Surface Reconstruction, Neural Radiance Fields, Signed Distance Functions" Hidden Schema Networks,https://openreview.net/forum?id=KyxJ9Yfxo2,https://openreview.net/pdf?id=KyxJ9Yfxo2,"A neural language model that discovers networks of symbols (schemata) from text datasets via a VAE framework with pretrained BERT and GPT-2 as encoder and decoder, respectively.","Most modern language models infer representations that, albeit powerful, lack both compositionality and semantic interpretability.
Starting from the assumption that a large proportion of semantic content is necessarily relational, we introduce a neural language model that discovers networks of symbols (schemata) from text datasets. Using a variational autoencoder (VAE) framework, our model encodes sentences into sequences of symbols (composed representation), which correspond to the nodes visited by biased random walkers on a global latent graph. We first demonstrate that the model is able to uncover ground-truth graphs from artificially generated datasets of random token sequences. Next we leverage pretrained BERT and GPT-2 language models as encoder and decoder, respectively, to train our model on language modelling and commonsense knowledge generation tasks. Qualitatively, the model is able to infer schema networks whose nodes (symbols) can be interpreted as encoding different aspects of natural language (e.g., topics and sentiments). Quantitatively, our results show that the model successfully interprets the encoded symbol sequences, as it achieves state-of-the-art scores on VAE language modeling benchmarks. Source code to reproduce all experiments is provided with the supplementary material.","Discrete representation learning, Unsupervised knowledge graph learning, Relational inductive biases, Semantic representation, Pretrained language models, Discrete VAE, Neuro-symbolic AI, Language modelling" A New Hierarchy of Expressivity for Graph Neural Networks,https://openreview.net/forum?id=5cAI0qXxyv,https://openreview.net/pdf?id=5cAI0qXxyv,,"The expressive power of Graph Neural Networks (GNNs) is fundamental for understanding their capabilities and limitations, i.e., what graph properties can or cannot be learnt by a GNN. Since standard GNNs have been characterised to be upper-bounded by the Weisfeiler-Lehman (1-WL) algorithm, recent attempts have concentrated on developing more expressive GNNs in terms of the $k$-WL hierarchy, a well-established framework for graph isomorphism testing. In this work we show that, contrary to the widely accepted view, the $k$-WL hierarchy is not well-suited for measuring the expressive power of GNNs. This is due to limitations that are inherent to high-dimensional WL algorithms, such as the lack of a natural interpretation and high computational costs, which make it difficult to draw any firm conclusions about the expressive power of GNNs that go beyond 1-WL. Thus, we propose a novel hierarchy of graph isomorphism tests, namely \emph{Neighbourhood WL} ($\mathscr{N}$-WL) algorithms, which enables us to better measure the expressive power of GNNs. We further introduce \emph{Graph Neighbourhood Neural Network} (G3N) by building upon the $\mathscr{N}$-WL algorithms, and empirically verify its expressive power on synthetic and real-world benchmarks.","Graph neural network, Weisfeiler-Lehman algorithm, k-WL hierarchy, graph classification" Learning Input-agnostic Manipulation Directions in StyleGAN with Text Guidance,https://openreview.net/forum?id=47B_ctC4pJ,https://openreview.net/pdf?id=47B_ctC4pJ,,"With the advantages of fast inference and human-friendly flexible manipulation, image-agnostic style manipulation via text guidance enables new applications that were not previously available. The state-of-the-art text-guided image-agnostic manipulation method embeds the representation of each channel of StyleGAN independently in the Contrastive Language-Image Pre-training (CLIP) space, and provides it in the form of a Dictionary to quickly find the channel-wise manipulation direction at inference time.
However, in this paper we argue that this dictionary, which is constructed by controlling each channel individually, is too limited to accommodate the versatility of text guidance, since the collective and interactive relations among multiple channels are not considered. Indeed, we show that it fails to discover a large portion of the manipulation directions that can be found by existing methods, which manually manipulate the latent space without text. To alleviate this issue, we propose a novel method that learns a Dictionary, whose entries each correspond to the representation of a single channel, by taking into account the manipulation effect coming from the interaction with multiple other channels. We demonstrate that our strategy resolves the inability of previous methods to find diverse known directions from unsupervised methods and unknown directions from random text, while maintaining real-time inference speed and disentanglement ability.","image manipulation, deep learning, generative adversarial network" Learning Task Agnostic Temporal Consistency Correction,https://openreview.net/forum?id=R3xfdth4Gyl,https://openreview.net/pdf?id=R3xfdth4Gyl,This work provides a general framework for task agnostic video temporal consistency correction capable of producing visually pleasing and temporally consistent videos without requiring their unprocessed counterparts.,"In many video restoration/translation tasks, image processing operations are naively extended to the video domain by processing each frame independently. This disregard for the temporal connection in video processing often leads to severe temporal inconsistencies. State-of-the-art techniques that address these inconsistencies rely on the availability of unprocessed videos to siphon consistent video dynamics to restore the temporal consistency of frame-wise processed videos. We propose a novel general framework for this task that learns to infer consistent motion dynamics from inconsistent videos to mitigate the temporal flicker while preserving the perceptual quality for both the temporally neighboring and relatively distant frames. The proposed framework produces state-of-the-art results on two benchmark datasets, DAVIS and videvo.net, processed by numerous image processing applications in a frame-wise manner. The code and the trained models will be released upon acceptance.","Video Processing, Temporal Consistency Correction, Video Restoration" Does Structural Information have been Fully Exploited in Graph Data?,https://openreview.net/forum?id=fH4xGeqdgLb,https://openreview.net/pdf?id=fH4xGeqdgLb,"We propose a topology-aware graph-based architecture, termed Curvphormer, by leveraging a geometric notion, i.e., discrete Ricci curvature.","In the real world, graph-structured data is pervasive, operating as an abstraction of data containing nodes and interactions between nodes. There are numerous methods dedicated to exploiting structural information explicitly or implicitly, but whether structural information has been adequately exploited remains an unanswered question. We offer Curvphormer, a curvature-based topology-aware Graphormer that integrates Discrete Ricci Curvature (DRC) into a powerful graph-based Transformer architecture to construct a more expressive graph-based model. This work employs DRC, a geometric descriptor, to reveal additional structural information.
We intuitively characterize how our model can make better use of the topological information in graph data and extract desired structural information, such as the inherent community structure in graphs with homogeneous information. We conduct extensive experiments on datasets of different scales, such as PCQM4M-LSC, ZINC and MolHIV, and achieve remarkable performance gains on various graph-level tasks and fine-tuning tasks. Codes will be released upon acceptance.","Transformer, Discrete Ricci Curvature, Structural Information" Prescribed Safety Performance Imitation Learning from A Single Expert Dataset,https://openreview.net/forum?id=aPc-R01WvJV,https://openreview.net/pdf?id=aPc-R01WvJV,We propose a method to conduct safe imitation learning with prescribed safety performance.,"Existing safe imitation learning (safe IL) methods mainly focus on learning safe policies that are similar to expert ones, but may fail in applications requiring different safety constraints. In this paper, we propose the Lagrangian Generative Adversarial Imitation Learning (LGAIL) algorithm, which can adaptively learn safe policies from a single expert dataset under diverse prescribed safety constraints. To achieve this, we augment GAIL with safety constraints and then relax it as an unconstrained optimization problem by utilizing a Lagrange multiplier. The Lagrange multiplier enables explicit consideration of safety and is dynamically adjusted to balance the imitation and safety performance during training. Then, we apply a two-stage optimization framework to solve LGAIL: (1) a discriminator is optimized to measure the similarity between the agent-generated data and the expert ones; (2) forward reinforcement learning is employed to improve the similarity while considering safety concerns enabled by a Lagrange multiplier. Furthermore, theoretical analyses on the convergence and safety of LGAIL demonstrate its capability of adaptively learning a safe policy given prescribed safety constraints. Finally, extensive experiments in OpenAI Safety Gym demonstrate the effectiveness of our approach.","safe imitation learning, inverse reinforcement learning" End-to-end Invariance Learning with Relational Inductive Biases in Multi-Object Robotic Manipulation,https://openreview.net/forum?id=Jm-MaqTF6om,https://openreview.net/pdf?id=Jm-MaqTF6om,,"Although reinforcement learning has seen remarkable progress in recent years, solving robust dexterous object-manipulation tasks in multi-object settings remains a challenge. In this paper, we focus on models that can learn manipulation tasks in fixed multi-object settings \emph{and} extrapolate this skill zero-shot without any drop in performance when the number of objects changes. We consider the generic task of moving a single cube out of a set to a goal position. We find that previous approaches, which primarily leverage attention and graph neural network-based architectures, do not exhibit this invariance when the number of input objects changes while scaling as $K^2$. We analyse the effects of different relational inductive biases on generalization and then propose an efficient plug-and-play module that overcomes these limitations.
Besides exceeding the performance of previous approaches in their training environment, we show that our approach, which scales linearly in $K$, allows agents to extrapolate and generalize zero-shot to any new number of objects.", DAVA: Disentangling Adversarial Variational Autoencoder,https://openreview.net/forum?id=CW6KmU5wPh,https://openreview.net/pdf?id=CW6KmU5wPh,We propose an adversarial variational auto-encoder that alleviates the issue of hyperparameter selection for disentanglement learning and propose a new unsupervised disentanglement metric.,"The use of well-disentangled representations poses many advantages for downstream tasks, e.g. increasing sample efficiency, or enabling interpretability. Their quality is, however, determined to a large extent by the choice of dataset-specific hyperparameters, most notably the regularization strength. To address the issue, we introduce DAVA, a novel training procedure for variational auto-encoders that alleviates the issue of hyperparameter selection at the cost of a comparatively small overhead. We compare DAVA against models with optimal choice of hyperparameters. Without any hyperparameter tuning, DAVA is competitive across a diverse range of commonly used datasets. Further, even under an adequate set of hyperparameters, the success of the disentanglement process remains heavily influenced by randomness in network initialization. We therefore present the new unsupervised PIPE disentanglement metric, capable of evaluating representation quality. We demonstrate the PIPE metric's ability to positively predict performance of downstream models in abstract reasoning. We also exhaustively examine correlations with existing supervised and unsupervised metrics.","Disentanglement learning, varational auto-encoder, curriculum learning, generative adversarial networks" Comparing Auxiliary Tasks for Learning Representations for Reinforcement Learning,https://openreview.net/forum?id=7Kf5_7-b7q,https://openreview.net/pdf?id=7Kf5_7-b7q,This paper empirically compares common auxiliary tasks used to learn representations for reinforcement learning (RL) across diverse continuous control environments and RL algorithms.,"Learning state representations has gained steady popularity in reinforcement learning (RL) due to its potential to improve both sample efficiency and returns on many environments. A straightforward and efficient method is to generate representations with a distinct neural network trained on an auxiliary task, i.e. a task that differs from the actual RL task. While a whole range of such auxiliary tasks has been proposed in the literature, a comparison on typical continuous control benchmark environments is computationally expensive and has, to the best of our knowledge, not been performed before. This paper presents such a comparison of common auxiliary tasks, based on hundreds of agents trained with state-of-the-art off-policy RL algorithms. We compare possible improvements in both sample efficiency and returns for environments ranging from a simple pendulum to a complex simulated robotics task. Our findings show that representation learning with auxiliary tasks is beneficial for environments of higher dimension and complexity, and that learning environment dynamics is preferable to predicting rewards.
We believe these insights will enable other researchers to make more informed decisions on how to utilize representation learning for their specific problem.","Reinforcement learning, Representation learning, Auxiliary task, Comparison" TDR-CL: Targeted Doubly Robust Collaborative Learning for Debiased Recommendations,https://openreview.net/forum?id=EIgLnNx_lC,https://openreview.net/pdf?id=EIgLnNx_lC,This paper proposes a principled approach that can effectively reduce the bias and variance simultaneously compared to existing DR estimators for debiased recommendations.,"Bias is a common problem inherent in recommender systems, which is entangled with users' preferences and poses a great challenge to unbiased learning. For debiasing tasks, the doubly robust (DR) method and its variants show superior performance due to the double robustness property, that is, DR is unbiased when either imputed errors or learned propensities are accurate. However, our theoretical analysis reveals that DR usually has a large variance. Meanwhile, DR can suffer from unexpectedly large bias and poor generalization caused by inaccurate imputed errors and learned propensities, which often occur in practice. In this paper, we propose a principled approach that can effectively reduce the bias and variance simultaneously for existing DR estimators when the error-imputation model is misspecified. In addition, we further propose a novel semi-parametric collaborative counterfactual learning approach that decomposes imputed errors into parametric and nonparametric parts and updates them collaboratively, resulting in more accurate predictions. Both theoretical analysis and experiments demonstrate the superiority of the proposed methods compared with existing debiasing methods.","Recommender System, Bias, Debias, Doubly Robust" Dynamic-Aware GANs: Time-Series Generation with Handy Self-Supervision,https://openreview.net/forum?id=Oy-e1gcBzo,https://openreview.net/pdf?id=Oy-e1gcBzo,This paper presents Dynamic-Aware GANs as a data-efficient self-supervised paradigm for time-series data generation.,"This paper presents Dynamic-Aware GAN (DAGAN) as a data-efficient self-supervised paradigm for time-series data generation. To support sequential generation with sufficient clues of temporal dynamics, we explicitly model the transition dynamics within the data sequence through differencing, thus refining the vanilla sequence into one with inter-correlated triplets to characterize each time-step. This localized triplet-consistent structure contributes to a self-supervision mechanism, which can provide more aspects of supervision for the overall stepwise dependencies encoded within the training data. Such a handy self-supervision mechanism is simple but can be beneficial especially when a model is presented with limited training data. Based on this insight, we present DAGAN, which generalizes the locally regularized triplet consistency to the distributional level via dynamic encoding and joint distribution matching. Experiments on various synthetic and real-world datasets verify that our model achieves superior generation results with better quality and diversity compared with the state-of-the-art benchmarks, especially when the training data is scarce.
Moreover, benefiting from the dynamic-conditional and dynamic-consistent design, our DAGAN is capable of generating sequences that exhibit specified dynamics.","Time-series modelling, Self-supervision, Deep Generative Models" Learning Gradient-based Mixup towards Flatter Minima for Domain Generalization,https://openreview.net/forum?id=FGlL0dLjpn,https://openreview.net/pdf?id=FGlL0dLjpn, We propose a policy to generate the instance weights for mixup based on gradient similarity and optimize a learnable similarity function towards flatter minima for better generalization.,"To address the distribution shifts between training and test data, domain generalization (DG) leverages multiple source domains to learn a model that generalizes well to unseen domains. However, existing DG methods generally suffer from overfitting to the source domains, partly due to the limited coverage of the expected region in feature space. Motivated by this, we propose to perform mixup with data interpolation and extrapolation to cover the potential unseen regions. To prevent the detrimental effects of unconstrained extrapolation, we carefully design a policy to generate the instance weights, named Flatness-aware Gradient-based Mixup (FGMix). The policy employs a gradient-based similarity to assign greater weights to instances that carry more invariant information, and learns the similarity function towards flatter minima for better generalization. On the DomainBed benchmark, we validate the efficacy of various designs of FGMix and demonstrate its superiority over other DG algorithms.","Domain Generalization, Mixup, Gradient-based Method, Flatness-aware Optimization" DeepGRAND: Deep Graph Neural Diffusion,https://openreview.net/forum?id=wTGORH_cHPX,https://openreview.net/pdf?id=wTGORH_cHPX,"We propose the Deep Graph Neural Diffusion (DeepGRAND), a class of continuous-depth graph neural networks that leverages a data-dependent scaling term and a perturbation to the graph diffusivity to overcome the oversmoothing issue.","We propose the Deep Graph Neural Diffusion (DeepGRAND), a class of continuous-depth graph neural networks based on the diffusion process on graphs. DeepGRAND leverages a data-dependent scaling term and a perturbation to the graph diffusivity to make the real parts of all eigenvalues of the diffusivity matrix negative, which ensures two favorable theoretical properties: (i) the node representation does not exponentially converge to a constant vector as the model depth increases, thus alleviating the over-smoothing issue; (ii) the stability of the model is guaranteed by controlling the norm of the node representation. Compared to the baseline GRAND, DeepGRAND mitigates the accuracy drop-off with increasing depth and improves the overall accuracy of the model. We empirically corroborate the advantage of DeepGRAND over many existing graph neural networks on various graph deep learning benchmark tasks.","graph neural diffusion, graph neural networks, oversmoothing" Learning Discrete Representation with Optimal Transport Quantized Autoencoders,https://openreview.net/forum?id=fcg9phFVzjd,https://openreview.net/pdf?id=fcg9phFVzjd,"We propose a simple approach to train VQ-VAE, which can avoid the codebook collapse with the help of optimal transport.","Vector quantized variational autoencoder (VQ-VAE) has recently emerged as a powerful generative model for learning discrete representations. As with other vector quantization methods, one key challenge of training VQ-VAE is codebook collapse, i.e.
only a fraction of the codes are used, limiting reconstruction quality. To address this, VQ-VAE often leverages carefully designed heuristics during training to use more codes. In this paper, we propose a simple yet effective approach to overcome this issue through optimal transport, which regularizes the quantization by explicitly assigning an equal number of samples to each code. The proposed approach, named OT-VAE, enforces full utilization of the codebook while not requiring any heuristics. We empirically validate our approach on three different data modalities: images, speech, and 3D human motions. For all the modalities, OT-VAE shows better reconstruction with higher perplexity than other VQ-VAE variants on several datasets. In particular, OT-VAE achieves state-of-the-art results on the AIST++ dataset for 3D dance generation. Our code will be released upon publication.","VQ-VAE, Optimal Transport, 3D Motion Generation" How to Keep Cool While Training,https://openreview.net/forum?id=Z7O43UCtGMO,https://openreview.net/pdf?id=Z7O43UCtGMO,"The paper proposes a calibration method applied during neural network training, which removes the need for a learning rate schedule.","Modern classification neural networks are notoriously prone to being overly confident in their predictions. With multiple calibration methods having been proposed so far, there has been noteworthy progress in reducing this overconfidence. However, to the best of our knowledge, prior methods have exclusively focused on the factors that affect calibration, leaving open the reverse question of how (mis)calibration impacts network training. Aiming for a better understanding of this interplay, we propose a temperature-based Cooling method for calibrating classification neural networks during training. Cooling has a substantial effect on the gradients and reduces the need for a learning rate schedule. We investigate different variants of Cooling, with the simplest one, last layer Cooling, also being the best-performing one, improving network performance on a range of datasets, network architectures, and hyperparameter settings.","neural network, calibration, network calibration, cooling, temperature scaling, classification" Learning System Dynamics from Sensory Input under Optimal Control Principles,https://openreview.net/forum?id=fcA--b8ycdX,https://openreview.net/pdf?id=fcA--b8ycdX,,"Identifying the underlying dynamics of actuated physical systems from sensory input is of high interest in control, robotics, and engineering in general. In the context of control problems, existing approaches decouple the construction of the feature space where the dynamics identification process occurs from the target control tasks, potentially leading to a mismatch between feature and real state spaces: the systems may not be controllable in feature space, and synthesized controls may not be applicable in the state space. Borrowing from the Koopman formalism, we propose instead to learn an embedding of both the states and controls in feature spaces where the dynamics are linear, and to include the target control task in the learning objective in the form of a differentiable and robust optimal control problem.
We validate this approach with simulation experiments on systems with non-linear dynamics, demonstrating that the controls obtained in feature space can be used to drive the corresponding physical systems and that the learned model can be used for future state prediction.", Dual Algorithmic Reasoning,https://openreview.net/forum?id=hhvkdRdWt1F,https://openreview.net/pdf?id=hhvkdRdWt1F,A neural algorithmic reasoning approach exploiting the duality principle,"Neural Algorithmic Reasoning is an emerging area of machine learning which seeks to infuse algorithmic computation in neural networks, typically by training neural models to approximate steps of classical algorithms. In this context, much of the current work has focused on learning reachability and shortest path graph algorithms, showing that joint learning on similar algorithms is beneficial for generalisation. However, when targeting more complex problems, such ""similar"" algorithms become more difficult to find. Here, we propose to learn algorithms by exploiting duality of the underlying algorithmic problem. Many algorithms solve optimisation problems. We demonstrate that simultaneously learning the dual definition of these optimisation problems in algorithmic learning allows for better learning and qualitatively better solutions. Specifically, we exploit the max-flow min-cut theorem to simultaneously learn these two algorithms over synthetically generated graphs, demonstrating the effectiveness of the proposed approach. We then validate the real-world utility of our dual algorithmic reasoner by deploying it on a challenging brain vessel classification task, which likely depends on the vessels’ flow properties. We demonstrate a clear performance gain when using our model within such a context, and empirically show that learning the max-flow and min-cut algorithms together is critical for achieving such a result.","Algorithmic Reasoning, Deep Learning for Graphs" Lmser-pix2seq: Learning Stable Sketch Representations For Sketch Healing,https://openreview.net/forum?id=I9J8gIyqRE,https://openreview.net/pdf?id=I9J8gIyqRE,,"Sketch healing aims to recreate a complete sketch from a corrupted one. The sparse and abstract nature of sketches makes this challenging to learn: the features extracted from a corrupted sketch may be inconsistent with those from the corresponding full sketch. In this paper, we present Lmser-pix2seq to learn stable sketch representations against the missing information by employing a Least mean square error reconstruction (Lmser) block, which falls into the encoder-decoder paradigm. Taking as input a corrupted sketch, the Lmser encoder computes the embeddings of structural patterns of the input, while the decoder reconstructs the complete sketch from the embeddings. We build bi-directional skip connections between the encoder and the decoder in our Lmser block. The feedback connections enable recurrent paths to receive more information about the reconstructed sketch produced by the decoder, which helps the encoder extract stable sketch features. The features captured by the Lmser block are eventually fed into a recurrent neural network decoder to recreate the sketches.
Experimental results show that our Lmser-pix2seq outperforms the state-of-the-art methods in sketch healing, especially when the sketches are heavily masked or corrupted.","sketch healing, Lmser, stable representations, bi-directional connections" UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion,https://openreview.net/forum?id=e9T1iIAbfKj,https://openreview.net/pdf?id=e9T1iIAbfKj,,"Unlabeled speech contains rich speaker style information, which can improve few-shot modeling capability. This paper proposes UnifySpeech to make use of large amounts of unlabeled data for model training and boost the performance of text-to-speech (TTS) and voice conversion (VC) simultaneously. UnifySpeech brings TTS and VC into a unified framework for the first time. The model is based on the assumption that speech can be decoupled into three independent components: content information, speaker information, and prosody information. Both TTS and VC can be regarded as mining these three parts of information from the input and completing the reconstruction of speech. For TTS, the speech content information is derived from the text, while in VC it is derived from the source speech, so all the remaining units are shared except for the speech content extraction module in the two pipelines. We applied vector quantization and loss optimization to bridge the gap between the content domains of TTS and VC. Objective evaluation shows UnifySpeech achieves higher speaker similarity and pitch prediction accuracy, indicating improved style modeling ability. Subjective evaluation shows speech generated by UnifySpeech obtains a high mean opinion score (MOS), indicating that the audio is as natural as a human voice.","decoupling, zero-shot learning, text-to-speech, voice conversion, vector quantization" Toward Effective Deep Reinforcement Learning for 3D Robotic Manipulation: End-to-End Learning from Multimodal Raw Sensory Data,https://openreview.net/forum?id=Zp4EIt1yFID,https://openreview.net/pdf?id=Zp4EIt1yFID,,"Sample-efficient reinforcement learning (RL) methods capable of learning directly from raw sensory data without the use of human-crafted representations would open up real-world applications in robotics and control. Recent advances in visual RL have shown that learning a latent representation together with existing RL algorithms closes the gap between state-based and image-based training. However, image-based training is still significantly sample-inefficient with respect to learning in 3D continuous control problems (for example, robotic manipulation) compared to state-based training. In this study, we propose an effective model-free off-policy RL method for 3D robotic manipulation that can be trained in an end-to-end manner from multimodal raw sensory data obtained from a vision camera and a robot's joint encoders, without the need for human-crafted representations. Notably, our method is capable of learning a latent multimodal representation and a policy in an efficient, joint, and end-to-end manner from multimodal raw sensory data.
Our method, which we dub MERL (Multimodal End-to-end Reinforcement Learning), is a simple but effective approach capable of significantly outperforming both current state-of-the-art visual RL and state-based RL methods with respect to sample efficiency, learning performance, and training stability on 3D robotic manipulation tasks from DeepMind Control.","deep reinforcement learning, robotic manipulation, end-to-end learning, multimodal representation learning" Domain Generalisation via Domain Adaptation: An Adversarial Fourier Amplitude Approach,https://openreview.net/forum?id=7IG0wsTND7w,https://openreview.net/pdf?id=7IG0wsTND7w,We tackle the domain generalisation problem by posing it as a domain adaptation task where we adversarially synthesise the worst-case target domain via Fourier amplitude generation.,"We tackle the domain generalisation (DG) problem by posing it as a domain adaptation (DA) task where we adversarially synthesise the worst-case `target' domain and adapt a model to that worst-case domain, thereby improving the model’s robustness. To synthesise data that is challenging yet semantics-preserving, we generate Fourier amplitude images and combine them with source domain phase images, exploiting the widely believed conjecture from signal processing that amplitude spectra mainly determine image style, while phase data mainly captures image semantics. To synthesise a worst-case domain for adaptation, we train the classifier and the amplitude generator adversarially. Specifically, we exploit the maximum classifier discrepancy (MCD) principle from DA that relates the target domain performance to the discrepancy of classifiers in the model hypothesis space. By Bayesian hypothesis modeling, we express the model hypothesis space effectively as a posterior distribution over classifiers given the source domains, making adversarial MCD minimisation feasible. On the DomainBed benchmark including the large-scale DomainNet dataset, the proposed approach yields significantly improved domain generalisation performance over the state-of-the-art.","Domain generalisation, Domain adaptation, Fourier analysis" Improving Generative Flow Networks with Path Regularization,https://openreview.net/forum?id=7qyLeRm1e3,https://openreview.net/pdf?id=7qyLeRm1e3,We propose a novel path regularization method based on optimal transport theory that places prior constraints on the underlying structure of the GFlowNets,"Generative Flow Networks (GFlowNets) are recently proposed models for learning stochastic policies that generate compositional objects by sequences of actions with probability proportional to a given reward function. The central problem of GFlowNets is to improve their exploration and generalization. In this work, we propose a novel path regularization method based on optimal transport theory that places prior constraints on the underlying structure of the GFlowNets. The prior is designed to help the GFlowNets better discover the latent structure of the target distribution or enhance their ability to explore the environment in the context of active learning. The path regularization controls the flow in GFlowNets to generate more diverse and novel candidates via maximizing the optimal transport distances between two forward policies or to improve the generalization via minimizing the optimal transport distances.
In addition, we derive an efficient implementation of the regularization by finding its closed-form solutions in specific cases and a meaningful upper bound that can be used as an approximation to minimize the regularization term. We empirically demonstrate the advantage of our path regularization on a wide range of tasks, including synthetic hypergrid environment modeling, discrete probabilistic modeling, and biological sequence design.","generative flow networks, path regularization, optimal transport" On the Shortcut Learning in Multilingual Neural Machine Translation,https://openreview.net/forum?id=fQGjNpkGVf,https://openreview.net/pdf?id=fQGjNpkGVf,Single-centric MNMT suffers from off-target issues due to overfitting of shortcut patterns of language mappings. Multilingual pretraining aggravates such overfitting. We propose a simple training strategy to eliminate such shortcut patterns.,"In this study, we connect the commonly-cited off-target issues in zero-shot translation to the usage of a single centric language in the training datasets of multilingual neural machine translation (MNMT). By carefully designing experiments on different MNMT scenarios and models, we attribute off-target issues to the overfitting of the shortcut patterns of (non-centric, centric) language mappings. Specifically, the learned shortcut patterns bias MNMT to mistakenly translate non-centric languages into the centric language instead of the expected non-centric language. We analyze the learning dynamics of MNMT and find that the shortcut learning generally occurs at the later stage of model training. Pretraining accelerates and aggravates the shortcut learning via a fast transformation from the copy pattern embedded in the pretraining initialization to the (non-centric, centric) mapping pattern embedded in the MNMT data. Based on these observations, we propose a simple and effective training strategy to eliminate the shortcut patterns in MNMT models by leveraging the forgetting nature of model training. The only difference between our approach and the conventional training is that we only present the training examples of (centric, non-centric) language mapping (excluding the reverse direction) to MNMT models in the later stage of model training. Without introducing any additional data or computational costs, our approach can consistently and significantly improve the performance of zero-shot translation by alleviating the shortcut learning, and maintain the performance of supervised translation for different MNMT models on several benchmarks.","Multilingual Neural Machine Translation, shortcut learning, generalization" Confidential-PROFITT: Confidential PROof of FaIr Training of Trees,https://openreview.net/forum?id=iIfDQVyuFD,https://openreview.net/pdf?id=iIfDQVyuFD,We introduce a method to provide a confidential proof of fair training.,"Post hoc auditing of model fairness suffers from potential drawbacks: (1) auditing may be highly sensitive to the test samples chosen; (2) the model and/or its training data may need to be shared with an auditor, thereby breaking confidentiality. We address these issues by instead providing a certificate that demonstrates that the learning algorithm itself is fair, and hence, as a consequence, so too is the trained model. We introduce a method to provide a confidential proof of fairness for training, in the context of widely used decision trees, which we term Confidential-PROFITT.
We propose novel fair decision tree learning algorithms along with customized zero-knowledge proof protocols to obtain a proof of fairness that can be audited by a third party. Using zero-knowledge proofs enables us to guarantee confidentiality of both the model and its training data. We show empirically that bounding the information gain of each node with respect to the sensitive attributes reduces the unfairness of the final tree. In extensive experiments on the COMPAS, Communities and Crime, Default Credit, and Adult datasets, we demonstrate that a company can use Confidential-PROFITT to certify the fairness of their decision tree to an auditor in less than 2 minutes, thus indicating the applicability of our approach. This is true for both the demographic parity and equalized odds definitions of fairness. Finally, we extend Confidential-PROFITT to apply to ensembles of trees.","Fairness, Audit, Confidentiality, Zero-Knowledge Proof" Consolidator: Mergable Adapter with Group Connections for Vision Transformer,https://openreview.net/forum?id=J_Cja7cpgW,https://openreview.net/pdf?id=J_Cja7cpgW,We propose a module named consolidator to achieve both parameter- and inference-efficient transfer learning for vision transformers,"Recently, transformers have shown strong ability as visual feature extractors, surpassing traditional convolution-based models in various scenarios. However, the success of vision transformers largely owes to their capacity to accommodate numerous parameters. As a result, new challenges arise in adapting a well-trained transformer to downstream tasks. On resource-limited devices, classic fine-tuning, which tunes and stores a full copy of parameters in the pretrained model for every downstream task, is usually impracticable due to the shortage of storage space. However, few works have focused on how to efficiently and effectively transfer the knowledge in a vision transformer. Existing methods do not dive into the properties of visual features, leading to inferior performance. Moreover, some of them incur heavy inference costs despite saving storage. To tackle these problems, we propose consolidator to achieve efficient transfer learning for vision transformers. Our consolidator modifies the pretrained model with the addition of a small set of tunable parameters to temporarily store the task-specific knowledge while freezing the backbone model during adaptation. Motivated by the success of group-wise convolution, we adopt grouped connections across the features extracted by transformer blocks to construct tunable parts in a consolidator. To further enhance the model capacity to transfer knowledge under constrained storage budget and keep inference efficient, we consolidate the parameters in two stages: (1) between adaptation and storage, and (2) between loading and inference.
On a series of downstream visual tasks, our consolidator can reach better performance than full fine-tuning with less than 0.5\% of the parameters stored per task, and outperform state-of-the-art parameter-efficient tuning methods.","Efficient Transfer Learning, Groups Connections, Vision Transformer" Statistical Theory of Differentially Private Marginal-based Data Synthesis Algorithms,https://openreview.net/forum?id=hxUwnEGxW87,https://openreview.net/pdf?id=hxUwnEGxW87,We analyze the differentially private marginal-based data synthesis algorithms in a statistical framework and establish a theoretical guarantee for the accuracy and utility.," Marginal-based methods achieve promising performance in the synthetic data competition hosted by the National Institute of Standards and Technology (NIST). To deal with high-dimensional data, the distribution of synthetic data is represented by a probabilistic graphical model (e.g., a Bayesian network), while the raw data distribution is approximated by a collection of low-dimensional marginals. Differential privacy (DP) is guaranteed by introducing random noise to each low-dimensional marginal distribution. Despite its promising performance in practice, the statistical properties of marginal-based methods are rarely studied in the literature. In this paper, we study DP data synthesis algorithms based on Bayesian networks (BN) from a statistical perspective. We establish a rigorous accuracy guarantee for BN-based algorithms, where the errors are measured by the total variation (TV) distance or the $L^2$ distance. Related to downstream machine learning tasks, an upper bound for the utility error of the DP synthetic data is also derived. To complete the picture, we establish a lower bound for TV accuracy that holds for every $\epsilon$-DP synthetic data generator.","Synthetic data, differential privacy, marginal-based method, Bayesian network, learning theory" Homotopy-based training of NeuralODEs for accurate dynamics discovery,https://openreview.net/forum?id=33csNbhVnD,https://openreview.net/pdf?id=33csNbhVnD,"Building upon ideas from the chaos literature, we introduce a novel method of training neural ordinary differential equations with drastic improvements for long complex time series prediction.","Conceptually, Neural Ordinary Differential Equations (NeuralODEs) pose an attractive way to extract dynamical laws from time series data, as they are natural extensions of the traditional differential equation-based modeling paradigm of the physical sciences. In practice, NeuralODEs display long training times and suboptimal results, especially for longer duration data where they may fail to fit the data altogether. While methods have been proposed to stabilize NeuralODE training, many of these involve placing a strong constraint on the functional form the trained NeuralODE can take, which the actual underlying governing equation is not guaranteed to satisfy. In this work, we present a novel NeuralODE training algorithm that leverages tools from the chaos and mathematical optimization communities -- synchronization and homotopy optimization -- for a breakthrough in tackling the NeuralODE training obstacle. We demonstrate architectural changes are unnecessary for effective NeuralODE training. Compared to the conventional training methods, our algorithm achieves drastically lower loss values without any changes to the model architectures.
Experiments on both simulated and real systems with complex temporal behaviors demonstrate that NeuralODEs trained with our algorithm are able to accurately capture true long-term behaviors and correctly extrapolate into the future.","neural ordinary differential equation, synchronization, homotopy, dynamical systems, physics" Transformers with Multiresolution Attention Heads,https://openreview.net/forum?id=L8qKBr_bht,https://openreview.net/pdf?id=L8qKBr_bht,"We propose the Transformer with Multiresolution-head Attention (MrsFormer), a class of efficient transformers inspired by the multiresolution approximation (MRA) for approximating a signal f using wavelet bases","We propose the Transformer with Multiresolution-head Attention (MrsFormer), a class of efficient transformers inspired by the multiresolution approximation (MRA) for approximating a signal f using wavelet bases. MRA decomposes a signal into components that lie on orthogonal subspaces at different scales. Similarly, MrsFormer decomposes the attention heads in the multi-head attention into fine-scale and coarse-scale heads, modeling the attention patterns between tokens and between groups of tokens. Computing the attention heads in MrsFormer requires significantly less computation and a smaller memory footprint than the standard softmax transformer with multi-head attention. We analyze and validate the advantage of MrsFormer over the standard transformers on a wide range of applications including image and time series classification.","transformer, multiresolution analysis, attention heads" Anti-Symmetric DGN: a stable architecture for Deep Graph Networks,https://openreview.net/forum?id=J3Y7cgZOOS,https://openreview.net/pdf?id=J3Y7cgZOOS,,"Deep Graph Networks (DGNs) currently dominate the research landscape of learning from graphs, due to their efficiency and ability to implement an adaptive message-passing scheme between the nodes. However, DGNs are typically limited in their ability to propagate and preserve long-term dependencies between nodes, i.e., they suffer from the over-squashing phenomenon. As a result, we can expect them to under-perform, since different problems require capturing interactions at different (and possibly large) radii in order to be effectively solved. In this work, we present Anti-Symmetric Deep Graph Networks (A-DGNs), a framework for stable and non-dissipative DGN design, conceived through the lens of ordinary differential equations. We give a theoretical proof that our method is stable and non-dissipative, leading to two key results: long-range information between nodes is preserved, and no gradient vanishing or explosion occurs in training. We empirically validate the proposed approach on several graph benchmarks, showing that A-DGN yields improved performance and enables effective learning even when dozens of layers are used.", Contrastive Learning for Unsupervised Domain Adaptation of Time Series,https://openreview.net/forum?id=xPkJYRsQGM,https://openreview.net/pdf?id=xPkJYRsQGM,"We develop a novel framework for UDA of time series data, called CLUDA, through a contrastive learning framework to learn domain-invariant contextual representations in multivariate time series.","Unsupervised domain adaptation (UDA) aims to learn, from a labeled source domain, a machine learning model that performs well on a similar yet different, unlabeled target domain. UDA is important in many applications such as medicine, where it is used to adapt risk scores across different patient cohorts.
In this paper, we develop a novel framework for UDA of time series data, called CLUDA. Specifically, we propose a contrastive learning framework to learn contextual representations in multivariate time series, so that these preserve label information for the prediction task. In our framework, we further capture the variation in the contextual representations between source and target domain via custom nearest-neighbor contrastive learning. To the best of our knowledge, ours is the first framework to learn domain-invariant, contextual representations for UDA of time series data. We evaluate our framework using a wide range of time series datasets to demonstrate its effectiveness and show that it achieves state-of-the-art performance for time series UDA.","unsupervised domain adaptation, time series, contrastive learning, deep learning" Model-Based Decentralized Policy Optimization ,https://openreview.net/forum?id=ZxhIjuo6p4,https://openreview.net/pdf?id=ZxhIjuo6p4,A novel model-based decentralized policy optimization algorithm,"Decentralized policy optimization has been commonly used in cooperative multi-agent tasks. However, since all agents are updating their policies simultaneously, the environment is non-stationary from the perspective of individual agents, making it hard to guarantee monotonic policy improvement. To make policy improvement stable and monotonic, we propose model-based decentralized policy optimization (MDPO), which incorporates a latent variable function to help construct the transition and reward function from an individual perspective. We theoretically show that the policy optimization of MDPO is more stable than model-free decentralized policy optimization. Moreover, due to non-stationarity, the latent variable function is time-varying and hard to model. We further propose a latent variable prediction method to reduce the error of the latent variable function, which theoretically contributes to the monotonic policy improvement. Empirically, MDPO can indeed obtain superior performance over model-free decentralized policy optimization in a variety of cooperative multi-agent tasks.",Multi-Agent Reinforcement Learning CLIP model is an Efficient Continual Learner,https://openreview.net/forum?id=qZ1yqKvQtE,https://openreview.net/pdf?id=qZ1yqKvQtE,A simple baseline for future comparisons in the continual learning tasks using zero-shot CLIP.,"The continual learning setting aims to learn new tasks over time without forgetting the previous ones. The literature reports several significant efforts to tackle this problem with limited or no access to previous task data. Among such efforts, typical solutions offer sophisticated techniques involving memory replay, knowledge distillation, model regularization, and dynamic network expansion. The resulting methods have a retraining cost at each learning task, dedicated memory requirements, and setting-specific design choices. In this work, we show that a frozen CLIP (Contrastive Language-Image Pretraining) model offers astounding continual learning performance without any fine-tuning (zero-shot evaluation). We evaluate CLIP under a variety of settings including class-incremental, domain-incremental and task-agnostic incremental learning on five popular benchmarks (ImageNet-100 & 1K, CORe50, CIFAR-100, and TinyImageNet). Without any bells and whistles, the CLIP model outperforms the state-of-the-art continual learning approaches in the majority of the settings.
We show the effect of varying text inputs via simple prompt templates on the CLIP model’s performance. To the best of our knowledge, this is the first work to report the CLIP zero-shot performance in a continual setting. We advocate the use of this strong yet embarrassingly simple baseline for future comparisons in the continual learning tasks.","Continual Learning, Vision-Language Models, Zero-shot Classifiers" Online Low Rank Matrix Completion,https://openreview.net/forum?id=47KG_AvNqeZ,https://openreview.net/pdf?id=47KG_AvNqeZ,A novel algorithm for solving the online low-rank matrix completion problem with optimal regret for the rank-one case.," We study the problem of online low-rank matrix completion with $\mathsf{M}$ users, $\mathsf{N}$ items and $\mathsf{T}$ rounds. In each round, the algorithm recommends one item per user, for which it gets a (noisy) reward sampled from a low-rank user-item preference matrix. The goal is to design a method with sub-linear regret (in $\mathsf{T}$) and nearly optimal dependence on $\mathsf{M}$ and $\mathsf{N}$. The problem can be easily mapped to the standard multi-armed bandit problem where each item is an independent arm, but that leads to poor regret as the correlation between arms and users is not exploited. On the other hand, exploiting the low-rank structure of the reward matrix is challenging due to the non-convexity of the low-rank manifold. We first demonstrate that the low-rank structure can be exploited using a simple explore-then-commit (ETC) approach that ensures a regret of $O(\mathsf{polylog} (\mathsf{M}+\mathsf{N}) \mathsf{T}^{2/3})$. That is, roughly only $\mathsf{polylog} (\mathsf{M}+\mathsf{N})$ item recommendations are required per user to get a non-trivial solution. We then improve our result for the rank-$1$ setting which in itself is quite challenging and encapsulates some of the key issues. Here, we propose OCTAL (Online Collaborative filTering using iterAtive user cLustering) that guarantees nearly optimal regret of $O(\mathsf{polylog} (\mathsf{M}+\mathsf{N}) \mathsf{T}^{1/2})$. OCTAL is based on a novel technique of clustering users that allows iterative elimination of items and leads to a nearly optimal minimax rate. ","Matrix Completion, Online Learning, Recommendation System" Modality Complementariness: Towards Understanding Multi-modal Robustness,https://openreview.net/forum?id=gfHLOC35Zh,https://openreview.net/pdf?id=gfHLOC35Zh,,"Along with the success of multi-modal learning, the robustness of multi-modal learning is receiving attention due to real-world safety concerns. Multi-modal models are anticipated to be more robust due to the possible redundancy between modalities. However, some empirical results have offered contradictory conclusions. In this paper, we point out an essential factor that causes this discrepancy: the difference in the amount of modality-wise complementary information. We provide an information-theoretical analysis of how modality complementariness affects multi-modal robustness. Based on the analysis, we design a metric for quantifying how complementary the modalities are to each other and propose an effective pipeline to calculate our metric. Experiments on carefully-designed synthetic data verify our theory. Further, we apply our metric to real-world multi-modal datasets and reveal their properties. 
To the best of our knowledge, we are the first to identify modality complementariness as an important factor affecting multi-modal robustness.",multimodal robustness theory Effective Offline Reinforcement Learning via Conservative State Value Estimation,https://openreview.net/forum?id=aySB6rDo0z,https://openreview.net/pdf?id=aySB6rDo0z,,"Offline RL promises to learn effective policies from static experience datasets without further interaction; these policies are expected to perform well in the online environment. However, it faces a major challenge: value over-estimation introduced by the distributional drift between the dataset and the current learned policy, which leads to learning failure in practice. The common approach is to add a penalty term to the reward or value estimation in the Bellman iterations, which has given rise to a number of successful algorithms such as CQL. Meanwhile, to avoid extrapolation on unseen states, existing methods focus on conservative Q-function estimation. In this paper, we propose CSVE, a new approach that directly imposes a penalty on out-of-distribution states. We prove that for the evaluated policy, our conservative state value estimation satisfies: (1) over the state distribution that samples penalizing states, it lower bounds the true values in expectation, and (2) over the marginal state distribution of data, it is no more than the true values in expectation plus a constant determined by the sampling error. Further, we develop a practical actor-critic algorithm in which the critic does the conservative value estimation by additionally sampling and penalizing the states 'around' the dataset, while the actor applies advantage-weighted updates to improve the policy. We evaluate in classic continuous control tasks of D4RL, showing that our method performs better than the conservative Q-function learning methods (e.g., CQL) and is strongly competitive among recent SOTA methods.",Offline Reinforcement Learning ChemAlgebra: Algebraic Reasoning on Chemical Reactions,https://openreview.net/forum?id=u05JdX1Qz-b,https://openreview.net/pdf?id=u05JdX1Qz-b,"In this paper we propose ChemAlgebra, a benchmark for measuring the reasoning capabilities of deep learning models through the prediction of stoichiometrically-balanced chemical reactions.","While deep learning models show impressive performance on various kinds of learning tasks, it is yet unclear whether they have the ability to robustly tackle reasoning tasks. Measuring the robustness of reasoning in machine learning models is challenging as one needs to provide a task that cannot be easily shortcut by exploiting spurious statistical correlations in the data, while operating on complex objects and constraints. To address this issue, we propose ChemAlgebra, a benchmark for measuring the reasoning capabilities of deep learning models through the prediction of stoichiometrically-balanced chemical reactions. ChemAlgebra requires manipulating sets of complex discrete objects – molecules represented as formulas or graphs – under algebraic constraints such as the mass preservation principle. 
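To make the mass-preservation constraint above concrete, here is a toy sketch (not from the ChemAlgebra paper) that checks whether a reaction is stoichiometrically balanced. It assumes simple unnested formulas, and the example reaction is illustrative.

```python
# Toy check of stoichiometric balance for unnested chemical formulas
# (e.g. CH4 + 2 O2 -> CO2 + 2 H2O); illustrative only.
import re
from collections import Counter

def atom_counts(formula: str, multiplier: int = 1) -> Counter:
    counts = Counter()
    # Each match is an element symbol plus an optional count, e.g. "H2".
    for element, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += multiplier * (int(n) if n else 1)
    return counts

def side_counts(side):  # side: list of (coefficient, formula) pairs
    total = Counter()
    for coeff, formula in side:
        total += atom_counts(formula, coeff)
    return total

reactants = [(1, "CH4"), (2, "O2")]
products = [(1, "CO2"), (2, "H2O")]
assert side_counts(reactants) == side_counts(products)  # mass is preserved
```

A model that predicts products violating this per-element equality has, by construction, broken the algebraic constraint the benchmark tests.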
We believe that ChemAlgebra can serve as a useful test bed for the next generation of machine reasoning models and as a promoter of their development.","algebraic reasoning, datasets, benchmarks, learning-reasoning integration, chemical reactions" A Primal-Dual Framework for Transformers and Neural Networks,https://openreview.net/forum?id=U_T8-5hClV,https://openreview.net/pdf?id=U_T8-5hClV,We show that the self-attention corresponds to the support vector expansion derived from a support vector regression problem and provide a principled framework for constructing new attention mechanisms from popular neural network layers.,"Self-attention is key to the remarkable success of transformers in sequence modeling tasks including many applications in natural language processing and computer vision. Like neural network layers, these attention mechanisms are often developed by heuristics and experience. To provide a principled framework for constructing attention layers in transformers, we show that the self-attention corresponds to the support vector expansion derived from a support vector regression problem, whose primal formulation has the form of a neural network layer. Using our framework, we derive popular attention layers used in practice and propose two new attentions: 1) the Batch Normalized Attention (Attention-BN) derived from the batch normalization layer and 2) the Attention with Scaled Head (Attention-SH) derived from using less training data to fit the SVR model. We empirically demonstrate the advantages of the Attention-BN and Attention-SH in reducing head redundancy, increasing the model's accuracy, and improving the model's efficiency in a variety of practical applications including image and time-series classification. ","attention, transformer, neural network, support vector regression, primal, dual" Explaining RL Decisions with Trajectories,https://openreview.net/forum?id=5Egggz1q575,https://openreview.net/pdf?id=5Egggz1q575,This work focuses on idea of explaining actions of offline RL agent by attributing the actions to trajectories encountered during the training.,"Explanation is a key component for the adoption of reinforcement learning (RL) in many real-world decision-making problems. In the literature, the explanation is often provided by saliency attribution to the features of the RL agent's state. In this work, we propose a complementary approach to these explanations, particularly for offline RL, where we attribute the policy decisions of a trained RL agent to the trajectories encountered by it during training. To do so, we encode trajectories in offline training data individually as well as collectively (encoding a set of trajectories). We then attribute policy decisions to a set of trajectories in this encoded space by estimating the sensitivity of the decision with respect to that set. Further, we demonstrate the effectiveness of the proposed approach in terms of quality of attributions as well as practical scalability in diverse environments that involve both discrete and continuous state and action spaces such as grid-worlds, video games (Atari) and continuous control (MuJoCo). 
We also conduct a human study on a simple navigation task to observe how participants' understanding of the task compares with the data attributed for a trained RL policy.","Explainable RL, Explainable AI, Offline Reinforcement Learning, Trajectory Attribution, Decision-Aware AI" Reinforcement Learning using a Molecular Fragment Based Approach for Reaction Discovery,https://openreview.net/forum?id=IqN5SgOmxp,https://openreview.net/pdf?id=IqN5SgOmxp,A multi-pronged deep learning approach using a fragment based method is applied to chemical reaction discovery,"Deep learning methods have recently been applied to both predictive and generative tasks in the molecular space. While molecular generation and prediction of an associated property are now reasonably common, studies on the reaction outcomes of the generated molecules remain less explored. Chemical reactions present a complex scenario as they involve multiple molecules and the breaking/forming of bonds. In reaction discovery, one aims to maximise yield and/or selectivity, which depends on a multitude of factors, including partner reactants and reaction conditions. We propose a multi-pronged approach that combines policy gradient reinforcement learning with a recurrent neural network-based deep generative model to identify prospective new reactants, whose yield/selectivity is estimated by a pre-trained regressor. Using SMILES (simplified molecular-input line-entry system) as the raw representation, our approach involves attaching a user-defined core fragment to the generated molecules for reaction-specific learning. On three distinct reaction types (alcohol deoxyfluorination, imine-thiol coupling, asymmetric hydrogenation of imines and alkenes), we obtain notable improvements in yield and enantioselectivity. The generated molecules are diverse, while remaining synthetically accessible.","reinforcement learning, transfer learning, reaction discovery, deep generative model" Keypoint Matching via Random Network Consensus,https://openreview.net/forum?id=WhbWzFg8cZ,https://openreview.net/pdf?id=WhbWzFg8cZ,"A new approach that uses convolutional neural networks (CNNs) for keypoint description, detection, and matching, without requiring the deep networks to be trained.","Visual description, detection, and matching of keypoints in images are fundamental components of many computer vision problems, such as camera tracking and (re)localization. Recently, learning-based feature extractors on top of convolutional neural networks (CNNs) have achieved state-of-the-art performance. In this paper, we further explore the usage of CNNs and present a new approach that ensembles randomly initialized CNNs without any training. Our observation is that the CNN architecture inherently extracts features with a certain extent of robustness to viewpoint/illumination changes and thus, its features can be regarded as visual descriptors. Consequently, randomized CNNs serve as descriptor extractors and a subsequent consensus mechanism detects keypoints using them. Such a description and detection pipeline can be used to match keypoints in images and achieves higher generalization ability than the state-of-the-art methods in our experiments. 
","Computer Vision, Keypoint Matching, Randomized Networks, Visual Descriptor, Detector" Visually-augmented pretrained language models for NLP Tasks without Images,https://openreview.net/forum?id=YWVkyLV53X,https://openreview.net/pdf?id=YWVkyLV53X,,"Although pre-trained language models (PLMs) have shown impressive perfor- mance by text-only self-supervised training, they are found lack of visual se- mantics or commonsense, e.g., sizes, shapes and colors of commonplace objects. Existing solutions often rely on explicit images for visual knowledge augmenta- tion (requiring time-consuming retrieval or generation), and they also conduct the augmentation for the whole input text, without considering whether it is actually needed in specific inputs or tasks. To address these issues, we propose a novel visually-augmented fine-tuning approach that can be generally applied to various PLMs or NLP tasks, without using any retrieved or generated images, namely VAWI. Specifically, we first identify the visually-hungry words (VH-words) from input text via a token selector, where three different methods have been proposed, including syntax-, attention- and learning-based strategies. Then, we adopts a fixed CLIP text encoder to generate the visually-augmented representations of these VH-words. As it has been pre-trained by visual-language alignment task on large-scale corpus, it is capable of injecting visual semantics into the aligned text representations. Finally, the visually-augmented features will be fused and trans- formed into several pre-designed visual prompts based on VH-words, which can be inserted into PLMs to enrich the visual semantics in word repersentations. We conduct extensive experiments on ten NLP tasks, i.e., GLUE benchmark, Com- monsenseQA, CommonGen and SNLI-VE. Experimental results show that our approach can consistently improve the performance of BERT, RoBERTa, BART and T5 at different scales, and outperform several competitive baselines signifi- cantly. Besides, the generated visual prompts of our framework can also be used for parameter-efficient tuning, which boosts the performance of T5-3B. We will make our code, data, and models publicly available.",Visually-augmented pretrained language models for NLP tasks without images Calibration for Decision Making via Empirical Risk Minimization,https://openreview.net/forum?id=ih3mo7J-vb,https://openreview.net/pdf?id=ih3mo7J-vb,,"Neural networks for classification can achieve high accuracy but their probabilistic predictions may be not well-calibrated, in particular overconfident. Different general calibration measures and methods were proposed. But how exactly does the calibration affect downstream tasks? We derive a new task-specific definition of calibration for the problem of statistical decision making with a known cost matrix. We then show that so-defined calibration can be theoretically rigorously improved by minimizing the empirical risk in the adjustment parameters like temperature. For the empirical risk minimization, which is not differentiable, we propose improvements to and analysis of the direct loss minimization approach. Our experiments indicate that task-specific calibration can perform better than a generic one. 
But we also carefully investigate weaknesses of the proposed tool and issues in the statistical evaluation for problems with highly unbalanced decision costs.","Calibration, Risk, Bayesian decision making, score decomposition, temperature scaling, direct loss minimization, margin rescaling" Improving Adversarial Robustness via Frequency Regularization,https://openreview.net/forum?id=10E_ZGfTBt,https://openreview.net/pdf?id=10E_ZGfTBt,"We show that AT-CNNs extract robust features from the low-frequency region to gain robustness and explain why the white-box attack is hard to defend from a spectral perspective, then propose a frequency regularization to improve the robustness.","Deep neural networks (DNNs) are incredibly vulnerable to crafted, human-imperceptible adversarial perturbations. While adversarial training (AT) has proven to be an effective defense approach, the properties of AT for robustness improvement remain an open issue. In this paper, we investigate AT from a spectral perspective, providing new insights into the design of effective defenses. Our analyses show that AT induces the deep model to focus more on the low-frequency region, which retains the shape-biased representations, to gain robustness. Further, we find that the spectrum of a white-box attack is primarily distributed in regions the model focuses on, and the perturbation attacks the spectral bands where the model is vulnerable. To train a model tolerant to frequency-varying perturbation, we propose a frequency regularization (FR) such that the spectral output inferred by an attacked input stays as close as possible to its natural input counterpart. Experiments demonstrate that FR and its weight averaging (WA) extension could significantly improve the robust accuracy by 1.14%–4.57%, across multiple datasets (SVHN, CIFAR-10, CIFAR-100, and Tiny ImageNet), and various attacks (PGD, C&W, and AutoAttack), without any extra data.", "I Speak, You Verify: Toward Trustworthy Neural Program Synthesis",https://openreview.net/forum?id=D6lZTMvBo3,https://openreview.net/pdf?id=D6lZTMvBo3,Large language models over source code can be made more trustworthy when they jointly generate programs and specifications,"We develop an approach for improving the trustworthiness and overall accuracy of program synthesizers based on large language models for source code. Given a natural language description of a programming problem, our method samples both candidate programs as well as candidate predicates specifying what the program should compute. Our method learns to analyze the agreement between programs and predicates to judge both which program is most likely to be correct and whether the language model is able to solve the programming problem in the first place. This latter capacity allows favoring high precision over broad recall: fostering trust by only proposing a program when the system is certain that it is correct.","program synthesis, large language models" FastFill: Efficient Compatible Model Update,https://openreview.net/forum?id=rnRiiHw8Vy,https://openreview.net/pdf?id=rnRiiHw8Vy,We propose a new uncertainty based updating scheme for online model upgrades of image retrieval systems for compatible representation learning.,"In many retrieval systems the original high-dimensional data (e.g., images) is mapped to a lower-dimensional feature through a learned embedding model. The task of retrieving the most similar data from a gallery set for a given query is performed through similarity comparison on features. 
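The retrieval setup that the FastFill abstract describes can be sketched in a few lines; below is a generic cosine-similarity top-k search, with the gallery and query arrays as placeholder data rather than anything from the paper.

```python
# Sketch of feature-based retrieval: compare a query embedding against a
# gallery of precomputed embeddings by cosine similarity (top-k).
import numpy as np

def top_k(query: np.ndarray, gallery: np.ndarray, k: int = 5) -> np.ndarray:
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity to every gallery item
    return np.argsort(-sims)[:k]      # indices of the k most similar items

gallery = np.random.randn(10_000, 128).astype(np.float32)  # old-model features
query = np.random.randn(128).astype(np.float32)            # new-model feature
print(top_k(query, gallery))
```

The backfilling problem arises precisely here: after a model update, the query embedding and the stored gallery embeddings may no longer live in comparable spaces, so either the gallery must be re-encoded or the two models must be made compatible.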
When the embedding model is updated, it might produce features that are not comparable/compatible with features already in the gallery computed with the old model. Subsequently, all features in the gallery need to be re-computed using the new embedding model -- a computationally expensive process called \emph{backfilling}. Recently, compatible representation learning methods have been proposed to avoid backfilling. Despite their relative success, there is an inherent trade-off between new model performance and its compatibility with the old model. In this work, we introduce FastFill: a compatible model update process using feature alignment and policy-based partial backfilling to promptly elevate retrieval performance. We show that previous backfilling strategies suffer from decreased performance and demonstrate the importance of both the training objective and the ordering in \emph{online} partial backfilling. We propose a new training method for feature alignment between old and new embedding models using uncertainty estimation. Compared to previous works, we obtain significantly improved backfilling results on a variety of datasets: mAP on ImageNet (+4.4\%), Places-365 (+2.7\%), and VGG-Face2 (+1.3\%). Further, we demonstrate that when updating a biased model with FastFill, the minority subgroup accuracy gap promptly vanishes with a small fraction of partial backfilling.","Compatible Representation Learning, Image Retrieval, Model Regression" Learnable Graph Convolutional Attention Networks,https://openreview.net/forum?id=WsUMeHPo-2,https://openreview.net/pdf?id=WsUMeHPo-2,"We propose a GNN which learns to use, in each layer, an interpolation of a GCN, GAT, and a GAT with convolved features. It outperforms existing methods, is more robust, and removes the need for cross-validation.","Existing Graph Neural Networks (GNNs) compute the message exchange between nodes by either aggregating uniformly (convolving) the features of all the neighboring nodes, or by applying a non-uniform score (attending) to the features. Recent works have shown the strengths and weaknesses of the resulting GNN architectures, respectively, GCNs and GATs. In this work, we aim at exploiting the strengths of both approaches to their full extent. To this end, we first introduce the graph convolutional attention layer (CAT), which relies on convolutions to compute the attention scores. Unfortunately, as in the case of GCNs and GATs, we show that there exists no clear winner between the three—neither theoretically nor in practice—as their performance directly depends on the nature of the data (i.e., of the graph and features). This result brings us to the main contribution of our work, the learnable graph convolutional attention network (L-CAT): a GNN architecture that automatically interpolates between GCN, GAT and CAT in each layer, by adding only two scalar parameters. 
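A rough sketch of the interpolation mechanism described above follows. The exact L-CAT parameterization is in the paper; here two learnable scalars simply gate between uniform (GCN-like) and attention-based (GAT-like) aggregation on a dense adjacency, assuming self-loops so every row has at least one neighbour.

```python
# Hedged sketch, not the paper's layer: two scalars blend GCN-like and
# GAT-like aggregation, loosely mirroring the L-CAT interpolation idea.
import torch
import torch.nn as nn

class InterpolatedGNNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
        self.att = nn.Linear(2 * dim, 1)
        self.lam1 = nn.Parameter(torch.zeros(1))  # attention-vs-uniform gate
        self.lam2 = nn.Parameter(torch.zeros(1))  # convolved-features gate

    def forward(self, x, adj):  # x: [N, d]; adj: dense [N, N] with self-loops
        h = self.lin(x)
        n = x.size(0)
        pair = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                          h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        scores = self.att(pair).squeeze(-1).masked_fill(adj == 0, float("-inf"))
        attn = torch.softmax(scores, dim=-1)        # GAT-like weights
        uniform = adj / adj.sum(-1, keepdim=True)   # GCN-like weights
        g1 = torch.sigmoid(self.lam1)
        g2 = torch.sigmoid(self.lam2)
        feats = g2 * (uniform @ h) + (1 - g2) * h   # optionally convolve features
        return torch.relu((g1 * attn + (1 - g1) * uniform) @ feats)
```

With both gates driven to 0 or 1 the layer collapses to a plain GCN-like or GAT-like scheme, which is the sense in which the interpolation subsumes its endpoints.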
Our results demonstrate that L-CAT is able to efficiently combine different GNN layers along the network, outperforming competing methods in a wide range of datasets, and resulting in a more robust model that reduces the need for cross-validation.","GNN, GCN, GAT" Indoor Localisation for Detecting Medication Use in Parkinson's Disease,https://openreview.net/forum?id=D3lPaQ7iqw,https://openreview.net/pdf?id=D3lPaQ7iqw,A transformer-based network is proposed for indoor localisation utilising dual modalities where the derived in-home mobility features can be used to classify the medication state of a person with Parkinson's disease,"Parkinson’s disease (PD) is a slowly progressive debilitating neurodegenerative disease which is prominently characterised by motor symptoms. Indoor localisation, including its in-home mobility features, could provide a digital biomarker that can be used to quantify how mobility changes as this disease progresses. To improve the effectiveness of current methods for indoor localisation, we propose a transformer-based approach utilising multiple modalities, Received Signal Strength Indicator (RSSI) and accelerometer data from wearable devices, which provide complementary views of movement. To properly evaluate our proposed method, we use a free-living dataset where the movements and mobility are greatly varied and unstructured as expected in real-world conditions. 12 pairs of people (one with PD, and the other a control participant) lived for five days in a smart home with various sensors. Our evaluation on such a dataset, which includes subjects with and without PD, demonstrates that our proposed network outperforms the current state-of-the-art in indoor localisation. We also show how the accurate room-level localisation predictions can be transformed into in-home mobility features (i.e. room-to-room transition duration) which can be used to effectively classify whether the PD participant is taking their medications or withholding them (increasing their symptoms).","Transformer, Indoor Localisation, Medication State Classification, Parkinson's Disease" Scaffolding a Student to Instill Knowledge,https://openreview.net/forum?id=N4K5ck-BTT,https://openreview.net/pdf?id=N4K5ck-BTT,We develop a novel KD scheme where the teacher scaffolds the student's prediction on hard-to-learn examples. It smoothens the student's loss landscape so that the student encounters fewer local minima. As a result it has good generalization properties.,"We propose a novel knowledge distillation (KD) method to selectively instill teacher knowledge into a student model motivated by situations where the student's capacity is significantly smaller than that of the teacher. In vanilla KD, the teacher primarily sets a predictive target for the student to follow, and we posit that this target is overly optimistic due to the student's lack of capacity. We develop a novel scaffolding scheme where the teacher, in addition to setting a predictive target, also scaffolds the student's prediction by censoring hard-to-learn examples. Scaffolding utilizes the same information as the teacher's soft-max predictions as inputs, and in this sense, our proposal can be viewed as a natural variant of vanilla KD. We show on synthetic examples that censoring hard examples smooths the student's loss landscape so that the student encounters fewer local minima. As a result, it has good generalization properties. 
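As a hedged sketch of the censoring idea in the scaffolding abstract above: the loss below distills only on examples where the teacher's soft-max confidence clears a threshold. The threshold and the exact censoring rule are assumptions for illustration, not the paper's scheme.

```python
# Sketch of scaffolded distillation: censor examples the teacher itself finds
# hard (low max soft-probability) and distill only on the remaining ones.
import torch
import torch.nn.functional as F

def scaffolded_kd_loss(student_logits, teacher_logits, labels,
                       tau: float = 2.0, conf_threshold: float = 0.5):
    t_soft = F.softmax(teacher_logits / tau, dim=-1)
    keep = t_soft.max(dim=-1).values >= conf_threshold   # censor hard examples
    ce = F.cross_entropy(student_logits, labels)          # usual supervised term
    if keep.any():
        kd = F.kl_div(F.log_softmax(student_logits[keep] / tau, dim=-1),
                      t_soft[keep], reduction="batchmean") * tau ** 2
    else:
        kd = torch.zeros((), device=student_logits.device)
    return ce + kd
```

Note that the censoring mask uses only the teacher's soft-max outputs, matching the abstract's claim that scaffolding needs no information beyond what vanilla KD already consumes.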
Compared to vanilla KD, we achieve improved performance on benchmark datasets and are comparable to more intrusive techniques that leverage feature matching. ","knowledge distillation, tiny capacity student, large capacity teacher, budget constrained learning" User-Interactive Offline Reinforcement Learning,https://openreview.net/forum?id=a4COps0uokg,https://openreview.net/pdf?id=a4COps0uokg,Offline RL policies need to be adaptive after training so that a user can alter their behavior to the user's needs.,"Offline reinforcement learning algorithms still lack trust in practice due to the risk that the learned policy performs worse than the original policy that generated the dataset or behaves in an unexpected way that is unfamiliar to the user. At the same time, offline RL algorithms are not able to tune their most important hyperparameter - the proximity of the learned policy to the original policy. We propose an algorithm that allows the user to tune this hyperparameter at runtime, thereby addressing both of the above-mentioned issues simultaneously. This allows users to start with the original behavior and grant successively greater deviation, as well as stopping at any time when the policy deteriorates or the behavior is too far from the familiar one.","Offline RL, Reinforcement Learning, User, Model-based, Adaptive" No-regret Learning in Repeated First-Price Auctions with Budget Constraints,https://openreview.net/forum?id=rHqa5_nzaCA,https://openreview.net/pdf?id=rHqa5_nzaCA,We present a bidding strategy with sublinear regret for budget-constrained buyers in first-price auctions.,"Recently the online advertising market has exhibited a gradual shift from second-price auctions to first-price auctions. Although there has been a line of work concerning online bidding strategies in first-price auctions, it still remains open how to handle budget constraints in the problem. In the present paper, we initiate the study of learning online bidding strategies in repeated first-price auctions for a budget-constrained buyer. We propose an RL-based bidding algorithm against the optimal non-anticipating strategy under stationary competition. Our algorithm obtains $\widetilde O(\sqrt T)$-regret if the bids are all revealed at the end of each round. With the restriction that the buyer only sees the winning bid after each round, our modified algorithm obtains $\widetilde O(T^{\frac{7}{12}})$-regret by techniques developed from survival analysis. Our analysis extends to the more general scenario where the buyer has any bounded instantaneous utility function with regrets of the same order.","first-price auction, budget, no-regret learning" Server Aggregation as Linear Regression: Reformulation for Federated Learning,https://openreview.net/forum?id=kV0cA81Vau,https://openreview.net/pdf?id=kV0cA81Vau,,"We propose a conceptually novel framework for Federated Learning (FL) called FedFit to mitigate issues of FL. FedFit is a reformulation of the server aggregation in FL, where the global model is updated by linear regression. This reformulation naturally enables us to utilize established linear regression techniques for several FL issues. For example, we apply robust regression to alleviate the vulnerability issue against attacks on the global model from collapsed clients, and we apply LASSO regression to introduce sparsity into the model to reduce the communication cost in FL. Moreover, FedFit enables clients to upload compressed model parameters to the server, significantly reducing the data traffic. 
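To give one hedged, toy reading of "aggregation as regression" (a simplification for illustration; FedFit's actual regression formulation is in the paper): treat each client's uploaded vector as a noisy observation of the global weights and estimate them with a robust regressor instead of a plain mean, which is what makes collapsed or malicious clients less damaging.

```python
# Toy sketch: server aggregation as per-coordinate robust regression.
# Each client vector is modeled as y_i = w_global + noise; a Huber regressor
# on an intercept-only design yields a robust "mean". Not FedFit's exact model.
import numpy as np
from sklearn.linear_model import HuberRegressor

def robust_aggregate(client_params: np.ndarray) -> np.ndarray:
    # client_params: [n_clients, n_weights]
    n_clients, n_weights = client_params.shape
    X = np.ones((n_clients, 1))            # intercept-only design matrix
    agg = np.empty(n_weights)
    for j in range(n_weights):             # robust estimate per coordinate
        h = HuberRegressor(fit_intercept=False)
        h.fit(X, client_params[:, j])
        agg[j] = h.coef_[0]
    return agg

clients = np.random.randn(10, 32)
clients[0] += 50.0                         # one collapsed/poisoned client
w_global = robust_aggregate(clients)       # barely moved by the outlier
```

Swapping the Huber loss for an L1-penalized fit is the analogous route to the sparsity (LASSO) variant the abstract mentions.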
In experiments, we demonstrate that FedFit successfully improves robustness against attacks on a global model via robust regression and reduces the global model size via LASSO regression.","Federated Learning, Machine Learning" Private and Efficient Meta-Learning with Low Rank and Sparse decomposition,https://openreview.net/forum?id=GIZg_kOXqyG,https://openreview.net/pdf?id=GIZg_kOXqyG,Provable meta-learning via privacy-preserving and optimal low-rank+sparse decomposition,"Meta-learning is critical for a variety of practical ML systems -- like personalized recommendation systems -- that are required to generalize to new tasks despite a small number of task-specific training points. Existing meta-learning techniques use two complementary approaches of either learning a low-dimensional representation of points for all tasks, or task-specific fine-tuning of a global model trained using all the tasks. In this work, we propose a novel meta-learning framework that combines both the techniques to enable handling of a large number of data-starved tasks. Our framework models network weights as a sum of low-rank and sparse matrices. This allows us to capture information from multiple domains together in the low-rank part while still allowing task-specific personalization using the sparse part. We instantiate and study the framework in the linear setting, where the problem reduces to that of estimating the sum of a rank-$r$ and a $k$-column sparse matrix using a small number of linear measurements. We propose an alternating minimization method with hard thresholding -- AMHT-LRS -- to learn the low-rank and sparse part effectively and efficiently. For the realizable, Gaussian data setting, we show that AMHT-LRS indeed solves the problem efficiently with nearly optimal samples. We extend AMHT-LRS to ensure that it preserves privacy of each individual user in the dataset, while still ensuring strong generalization with nearly optimal number of samples. Finally, on multiple datasets, we demonstrate that the framework allows personalized models to obtain superior performance in the data-scarce regime.","Meta-learning, Low-rank, Sparse, Privacy" $\omega$GNNs: Deep Graph Neural Networks Enhanced by Multiple Propagation Operators,https://openreview.net/forum?id=fwn2Mqpy4pS,https://openreview.net/pdf?id=fwn2Mqpy4pS,"We propose a modification to GNNs to prevent over-smoothing and enhance their expressiveness, followed by a theoretical analysis and experiments on 15 real-world datasets, reaching similar or better accuracy than state-of-the-art methods.","Graph Neural Networks (GNNs) are limited in their propagation operators. These operators often contain non-negative elements only and are shared across channels and layers, limiting the expressiveness of GNNs. Moreover, some GNNs suffer from over-smoothing, limiting their depth. On the other hand, Convolutional Neural Networks (CNNs) can learn diverse propagation filters, and phenomena like over-smoothing are typically not apparent in CNNs. In this paper, we bridge this gap by incorporating trainable channel-wise weighting factors $\omega$ to learn and mix multiple smoothing and sharpening propagation operators at each layer. Our generic method is called $\omega$GNN, and we study two variants: $\omega$GCN and $\omega$GAT. For $\omega$GCN, we theoretically analyse its behaviour and the impact of $\omega$ on the obtained node features. Our experiments confirm these findings, demonstrating and explaining how both variants do not over-smooth. 
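A hedged sketch of the channel-wise mixing idea above: one weight per channel blends a smoothing (low-pass) propagation with its sharpening (high-pass) counterpart. The exact operator definitions in the paper may differ; this only conveys the mechanism.

```python
# Illustrative omega-style propagation: per-channel sigmoid weights mix
# neighbourhood averaging (smoothing) with its residual (sharpening).
import torch
import torch.nn as nn

class OmegaPropagate(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.omega = nn.Parameter(torch.zeros(dim))  # one weight per channel

    def forward(self, x, adj_norm):          # adj_norm: normalized adjacency
        smooth = adj_norm @ x                # low-pass: average over neighbours
        sharp = x - smooth                   # high-pass: contrast to neighbours
        w = torch.sigmoid(self.omega)        # in (0, 1), broadcast over nodes
        return w * smooth + (1 - w) * sharp  # channel-wise smooth/sharpen mix
```

Because some channels can keep sharpening while others smooth, repeated application need not drive all node features to a common value, which is the intuition behind the no-over-smoothing claim.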
Additionally, we experiment with 15 real-world datasets on node- and graph-classification tasks, where our $\omega$GCN and $\omega$GAT perform better than or on par with state-of-the-art methods. ","Graph Neural Networks, Deep Learning, Graph Propagation Operators, Over-Smoothing" Few-bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction,https://openreview.net/forum?id=8CJrjp73sfk,https://openreview.net/pdf?id=8CJrjp73sfk,,"Memory footprint is one of the main limiting factors for large neural network training. In backpropagation, one needs to store the input to each operation in the computational graph. Every modern neural network model has quite a few pointwise nonlinearities in its architecture, and each such operation induces additional memory costs which --- as we show --- can be significantly reduced by quantization of the gradients. We propose a systematic approach to compute optimal quantization of the retained gradients of the pointwise nonlinear functions with only a few bits per element. We show that such an approximation can be achieved by computing an optimal piecewise-constant approximation of the derivative of the activation function, which can be done by dynamic programming. The drop-in replacements are implemented for all popular nonlinearities and can be used in any existing pipeline. We confirm the memory reduction and the same convergence on several open benchmarks.", SLTUNET: A Simple Unified Model for Sign Language Translation,https://openreview.net/forum?id=EBS4C77p_5S,https://openreview.net/pdf?id=EBS4C77p_5S,A simple unified model for sign language translation that achieves (near) state-of-the-art performance,"Despite recent successes with neural models for sign language translation (SLT), translation quality still lags behind spoken languages because of the data scarcity and modality gap between sign video and text. To address both problems, we investigate strategies for cross-modality representation sharing for SLT. We propose SLTUNET, a simple unified neural model designed to support multiple SLT-related tasks jointly, such as sign-to-gloss, gloss-to-text and sign-to-text translation. Jointly modeling different tasks endows SLTUNET with the capability to explore the cross-task relatedness that could help narrow the modality gap. In addition, this allows us to leverage the knowledge from external resources, such as abundant parallel data used for spoken-language machine translation (MT). We show in experiments that SLTUNET achieves competitive and even state-of-the-art performance on PHOENIX-2014T and CSL-Daily when augmented with MT data and equipped with a set of optimization techniques. We further use the DGS Corpus for end-to-end SLT for the first time. It covers broader domains with a significantly larger vocabulary, which is more challenging and which we consider to allow for a more realistic assessment of the current state of SLT than the former two. Still, SLTUNET obtains improved results on the DGS Corpus. Code will be released.","Unified Modeling, Multi-task Learning, Sign Language Translation, Cross-modality Learning" Pruning by Active Attention Manipulation,https://openreview.net/forum?id=JunUr1y3Wa6,https://openreview.net/pdf?id=JunUr1y3Wa6,,"Structured pruning of a CNN is typically achieved by applying discrete masks on the CNN's filter weights or activation maps, post-training. 
Here, we present a new filter-importance-scoring concept named pruning by active attention manipulation (PAAM) that sparsifies the CNN's set of filters through a particular attention mechanism, during training. PAAM learns continuous filter scores from the filter weights by optimizing a cost function regularized by an additive term in the scores. As the filters are not independent, we use attention to dynamically learn their correlations. Moreover, by training the pruning scores of all layers simultaneously, PAAM can account for layer inter-dependencies, which is essential to finding a performant sparse sub-network. PAAM can also train and generate a pruned network from scratch in a straightforward, one-stage training process without requiring a pre-trained network. Finally, PAAM does not need layer-specific hyperparameters and pre-defined layer budgets, since it can implicitly determine the appropriate number of filters in each layer. Our experimental results on different network architectures suggest that PAAM outperforms state-of-the-art (SOTA) structured-pruning methods. On the CIFAR-10 dataset, without requiring a pre-trained baseline network, we obtain 1.02% and 1.19% accuracy gains and 52.3% and 54% parameter reductions, on ResNet56 and ResNet110, respectively. Similarly, on the ImageNet dataset, PAAM achieves a 1.06% accuracy gain while pruning 51.1% of the parameters on ResNet50. For CIFAR-10, this is better than the SOTA by margins of 9.5% and 6.6%, respectively, and on ImageNet by a margin of 11%.", Robustness of Unsupervised Representation Learning without Labels,https://openreview.net/forum?id=lYZZl2hp1gp,https://openreview.net/pdf?id=lYZZl2hp1gp,We provide a framework for robustness evaluation and adversarial training of representation encoders without the need for labelled data.,"Unsupervised representation learning leverages large unlabeled datasets and is competitive with supervised learning. But non-robust encoders may affect downstream task robustness. Recently, robust representation encoders have become of interest. Still, all prior work evaluates robustness using a downstream classification task. Instead, we propose a family of unsupervised robustness measures, which are model- and task-agnostic and label-free. We benchmark state-of-the-art representation encoders and show that none dominates the rest. We offer unsupervised extensions to the FGSM and PGD attacks. When used in adversarial training, they improve most unsupervised robustness measures, including certified robustness. We validate our results against a linear probe and show that, for MOCOv2, adversarial training results in 3 times higher certified accuracy, a 2-fold decrease in impersonation attack success rate and considerable improvements in certified robustness.","robustness, representation learning" Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization,https://openreview.net/forum?id=iUYpN14qjTF,https://openreview.net/pdf?id=iUYpN14qjTF,,"Adaptive gradient methods such as Adam have gained increasing popularity in deep learning optimization. However, it has been observed that, in many deep learning applications such as image classification, Adam can converge to a different solution with a worse test error compared to (stochastic) gradient descent, even with a fine-tuned regularization. 
In this paper, we provide a theoretical explanation for this phenomenon: we show that in the nonconvex setting of learning over-parameterized two-layer convolutional neural networks starting from the same random initialization, for a class of data distributions (inspired by image data), Adam and gradient descent (GD) can converge to different global solutions of the training objective with provably different generalization errors, even with weight decay regularization. In contrast, we show that if the training objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam and GD, will converge to the same solution if the training is successful. This suggests that the generalization gap between Adam and SGD in the presence of weight decay regularization is closely tied to the nonconvex landscape of deep learning optimization, which cannot be covered by the recent neural tangent kernel (NTK) based analysis. ","generalization, gradient descent, Neural tangent kernel, Adam, nonconvex optimization" ACQL: An Adaptive Conservative Q-Learning Framework for Offline Reinforcement Learning,https://openreview.net/forum?id=o_HqtIc-oF,https://openreview.net/pdf?id=o_HqtIc-oF,"We propose Adaptive Conservative Q-Learning (ACQL), a general framework that enables more flexible control over the conservative level of the Q-function for offline RL.","Offline Reinforcement Learning (RL), which relies only on static datasets without additional interactions with the environment, provides an appealing alternative for learning a safe and promising control policy. Most existing offline RL methods do not consider relative data quality and only crudely constrain the distribution gap between the learned policy and the behavior policy in general. Moreover, these algorithms cannot adaptively control the conservative level in a more fine-grained way, e.g., for each state-action pair, leading to a performance drop, especially over highly diversified datasets. In this paper, we propose an Adaptive Conservative Q-Learning (ACQL) framework that enables more flexible control over the conservative level of the Q-function for offline RL. Specifically, we present two adaptive weight functions to shape the Q-values for collected and out-of-distribution data. Then we discuss different conditions under which the conservative level of the learned Q-function changes and define its monotonicity with respect to data quality and similarity. Motivated by the theoretical analysis, we propose a novel algorithm with the ACQL framework, using neural networks as the adaptive weight functions. To learn proper adaptive weight functions, we design surrogate losses incorporating the conditions for adjusting conservative levels and a contrastive loss to maintain the monotonicity of the adaptive weight functions. 
We evaluate ACQL on the commonly-used D4RL benchmark and conduct extensive ablation studies to illustrate its effectiveness and state-of-the-art performance compared to existing offline DRL baselines.","Deep Reinforcement Learning, Offline Deep Reinforcement Learning" Fisher-Legendre (FishLeg) optimization of deep neural networks,https://openreview.net/forum?id=c9lAOPvQHS,https://openreview.net/pdf?id=c9lAOPvQHS,"We introduce a new approach to estimate the natural gradient via Legendre-Fenchel duality, provide a convergence proof, and show competitive performance on a number of benchmarks.","Incorporating second-order gradient information (curvature) into optimization can dramatically reduce the number of iterations required to train machine learning models. In natural gradient descent, such information comes from the Fisher information matrix which yields a number of desirable properties. As exact natural gradient updates are intractable for large models, successful methods such as KFAC and sequels approximate the Fisher in a structured form that can easily be inverted. However, this requires model/layer-specific tensor algebra and certain approximations that are often difficult to justify. Here, we use ideas from Legendre-Fenchel duality to learn a direct and efficiently evaluated model for the product of the inverse Fisher with any vector, in an online manner, leading to natural gradient steps that get progressively more accurate over time despite noisy gradients. We prove that the resulting ``Fisher-Legendre'' (FishLeg) optimizer converges to a (global) minimum of non-convex functions satisfying the PL condition, which applies in particular to deep linear networks. On standard auto-encoder benchmarks, we show empirically that FishLeg outperforms standard first-order optimization methods, and performs on par with or better than other second-order methods, especially when using small batches. Thanks to its generality, we expect our approach to facilitate the handling of a variety of neural network layers in future work.","Second-order optimization, Natural Gradient, Deep Learning, Meta-learning, Fisher information, Legendre-Fenchel duality" "A law of adversarial risk, interpolation, and label noise",https://openreview.net/forum?id=0_TxFpAsEI,https://openreview.net/pdf?id=0_TxFpAsEI,"Laws for how interpolating label noise increases adversarial risk, with stronger guarantees in presence of inductive bias and distributional assumptions.","In supervised learning, it has been shown that label noise in the data can be interpolated without penalties on test accuracy. We show that interpolating label noise induces adversarial vulnerability, and prove the first theorem showing the relationship between label noise and adversarial risk for any data distribution. Our results are almost tight if we do not make any assumptions on the inductive bias of the learning algorithm. We then investigate how different components of this problem affect this result, including properties of the distribution. We also discuss non-uniform label noise distributions and prove a new theorem showing that uniform label noise induces nearly as large an adversarial risk as the worst poisoning with the same noise rate. Then, we provide theoretical and empirical evidence that uniform label noise is more harmful than typical real-world label noise. 
Finally, we show how inductive biases amplify the effect of label noise and argue for the need for future work in this direction.","label noise, adversarial robustness, lower bound, robust machine learning" Lossy Image Compression with Conditional Diffusion Models,https://openreview.net/forum?id=X8-VWbONvr,https://openreview.net/pdf?id=X8-VWbONvr,We show that hybridizing compressive VAEs with denoising diffusion models leads to strong performance in perceptual image compression.,"Denoising diffusion models have recently marked a milestone in high-quality image generation. One may thus wonder if they are suitable for neural image compression. This paper outlines an end-to-end optimized image compression framework based on a conditional diffusion model, drawing on the transform-coding paradigm. Besides the latent variables inherent to the diffusion process, this paper introduces an additional discrete ""content"" latent variable to condition the denoising process on. This variable is equipped with a hierarchical prior for entropy coding. The remaining ""texture"" latent variables characterizing the diffusion process are synthesized (either stochastically or deterministically) at decoding time. We furthermore show that the performance can be tuned toward perceptual metrics of interest. Our extensive experiments involving five datasets and 16 image perceptual quality assessment metrics show that our approach not only compares favorably in terms of rate and perceptual distortion tradeoffs but also shows robust performance under all metrics while other baselines show less consistent behavior. ","neural image compression, denoising diffusion models" Invariance Makes a Difference: Disentangling the Role of Invariance and Equivariance in Representations,https://openreview.net/forum?id=xRMOW3kvsEv,https://openreview.net/pdf?id=xRMOW3kvsEv,"We show that the performance of representations critically depends on their invariances, via controlled experiments on synthetic data.","Representations learned by deep neural networks are the foundation that enables their tremendous success and consequently a lot of work has been invested into understanding their properties. Most of this work, however, focuses on the relationships between representations and features in the input without explicitly characterizing their nature, i.e. whether they are invariances or equivariances. In this work, we concretely define and disentangle these relationships and show with carefully controlled experiments that, in fact, invariance is of central importance in achieving high generalization on downstream tasks, often more so than equivariance. To this end, we investigate the properties and performance of image classification models on synthetic datasets that we introduce and which allow us to precisely control factors of variation in the models' training and test data. With this method we explore a) the role of invariance in enabling high performance when transferring to target tasks and b) the factors that influence which invariances a model learns. We highlight the importance of representational invariance by showing that the representations learned by classification models transfer well to new classes but perform poorly when the required invariances change, and that learning the wrong invariances can be harmful. 
Additionally, we find that the invariances learned by models are primarily determined by the relationship of features in the training data with the training objective, and that there are inductive biases that make certain invariances more difficult to learn than others.","representation learning, invariance, synthetic data" Improving the generalization ability of the chaotic time-series classification models by residual component extraction,https://openreview.net/forum?id=YyTFUpG0WmN,https://openreview.net/pdf?id=YyTFUpG0WmN,A novel approach to improve the generalization ability of machine learning methods in the task of chaotic time-series classification.,"This paper presents a new method in chaotic time-series classification that improves the generalization ability of the machine learning models. Such models are trained on a certain set of parameters of the nonlinear dynamical systems. Because of the sensitive dependence on initial conditions of such systems, when the parameters change, the accuracy of machine learning models drops drastically. In our paper, we improve the generalization ability of the Multivariate LSTM-FCN model by extracting and training the model on the residual component of the time-series. The steps applied are generic and can be reapplied to any other machine learning method. As an illustrative example, we consider the problem of computing 2D bifurcation diagrams for oscillatory time-series (periodic or chaotic) obtained as solutions of nonlinear systems of ordinary differential equations. The Multivariate LSTM-FCN network was trained on the time-series set obtained for the plane of parameters $(R_1, R_2) \times (C_1, C_2) \times L = (6, 15.5) \times (2.8, 3.4) \times 1$ and it achieved an accuracy of $96.96\%$. Once the parameters were changed to $(R_1, R_2) \times (C_1, C_2) \times L = (20, 29.5) \times (2.2, 2.8) \times 2$, the accuracy dropped to $34.38\%$. In contrast, the accuracy of the model trained on the residual components of the same dataset decreased only from $96.50\%$ to $65.52\%$ on the new parameters' time-series dataset. Furthermore, we show that by taking the mean and variance of the residual component we are able to extract many very interesting features of the 2D bifurcation diagrams using the KMeans, GMM, and OPTICS clustering algorithms.","chaotic time-series classification, feature extraction, deep learning, clustering, chaotic time-series, time-series decomposition, generalization" Learning DAGs from Fourier-Sparse Data,https://openreview.net/forum?id=L64Bs1OSNjZ,https://openreview.net/pdf?id=L64Bs1OSNjZ,We leverage recent causal Fourier analysis to pose the novel problem of learning DAGs from data with sparse spectrum and propose a solution with better performance than existing DAG learning methods.,"We present a novel perspective on learning directed acyclic graphs (DAGs) from data, leveraging a recent novel form of causal Fourier analysis on DAGs. We build on prior work that learned DAGs from data generated by a structural equation model (SEM). First, we show that data generated by linear SEMs can be characterized in the frequency domain as having dense spectra with random coefficients. Then we propose the new problem of learning DAGs from approximately Fourier-sparse data, which we solve by minimizing the $L^1$ norm of the spectrum. 
We provide a motivation for this problem and compare our method to prior DAG learning methods, showing superior performance.","directed acyclic graph, DAG learning, causal Fourier analysis, structural equation models, additive noise, Fourier-sparse" ASIF: coupled data turns unimodal models to multimodal without training,https://openreview.net/forum?id=YAxV_Krcdjm,https://openreview.net/pdf?id=YAxV_Krcdjm,How to build a CLIP-like model with two pretrained encoders and a limited amount of image-text pairs without tuning a neuron.,"Aligning the visual and language spaces requires training deep neural networks from scratch on giant multimodal datasets; CLIP trains both an image and a text encoder, while LiT manages to train just the latter by taking advantage of a pretrained vision network. In this paper, we show that sparse relative representations are sufficient to align text and images without training any network. Our method relies on readily available single-domain encoders (trained with or without supervision) and a modest (in comparison) number of image-text pairs. ASIF redefines what constitutes a multimodal model by explicitly disentangling memory from processing: here the model is defined by the embedded pairs of all the entries in the multimodal dataset, in addition to the parameters of the two encoders. Experiments on standard zero-shot visual benchmarks demonstrate the typical transfer ability of image-text models. Overall, our method represents a simple yet surprisingly strong baseline for foundation multi-modal models, raising important questions on their data efficiency and on the role of retrieval in machine learning.","Representation learning, Multimodal models, Analogy, Sparsity, Relative representations" Momentum Boosted Episodic Memory for Improving Learning in Long-Tailed RL Environments,https://openreview.net/forum?id=U3_J25hAoRQ,https://openreview.net/pdf?id=U3_J25hAoRQ,Improving learning in long-tailed RL environments using momentum boosted episodic memory.,"Conventional Reinforcement Learning (RL) algorithms assume the distribution of the data to be uniform or mostly uniform. However, this is not the case with most real-world applications like autonomous driving or in nature, where animals roam. Some objects are encountered frequently, and most of the remaining experiences occur rarely; the resulting distribution is called \emph{Zipfian}. Taking inspiration from the theory of \emph{complementary learning systems}, we propose an architecture for learning from Zipfian distributions in which long-tail states are discovered in an unsupervised manner and states, along with their recurrent activations, are kept longer in episodic memory. The recurrent activations are then reinstated from episodic memory using a similarity search, with weighted importance. The proposed architecture yields improved performance in a Zipfian task over conventional architectures. 
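As a hedged sketch of the reinstatement step just described: retrieve the stored recurrent activations whose state keys are most similar to the current state and blend them with softmax weights. The buffer policy (which states are kept longer) is the paper's contribution and is not modeled here; the temperature and k are illustrative choices.

```python
# Sketch of episodic-memory reinstatement via similarity-weighted retrieval.
import numpy as np

def reinstate(query, keys, values, k=8, temp=0.1):
    sims = keys @ query / (np.linalg.norm(keys, axis=1)
                           * np.linalg.norm(query) + 1e-8)
    idx = np.argsort(-sims)[:k]                      # k nearest memories
    w = np.exp(sims[idx] / temp)
    w /= w.sum()                                     # weighted importance
    return w @ values[idx]                           # blended recurrent state

keys = np.random.randn(1000, 64)     # stored state embeddings
values = np.random.randn(1000, 128)  # stored recurrent activations
state = np.random.randn(64)
blended = reinstate(state, keys, values)
```

Keeping rare (long-tail) states in the buffer longer is what lets this lookup compensate for how seldom the parametric network sees them during training.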
Our method outperforms IMPALA by a significant margin of 20.3\% when maps/objects occur with a uniform distribution and by 50.2\% on the rarest 20\% of the distribution.","long tail distribution, reinforcement learning, representation learning, contrastive learning, complementary learning system, hippocampus" ProtoGNN: Prototype-Assisted Message Passing Framework for Non-Homophilous Graphs,https://openreview.net/forum?id=LeZ39Gkwbi0,https://openreview.net/pdf?id=LeZ39Gkwbi0,Class prototype-assisted message passing framework for improving node representation learning on non-homophilous graphs,"Many well-known Graph Neural Network (GNN) models assume the underlying graphs are homophilous, where nodes share similar features and labels with their neighbours. They rely on message passing that iteratively aggregates neighbours' features and often suffer performance degradation on non-homophilous graphs where useful information is hardly available in the local neighbourhood. In addition, earlier studies show that in some cases GNNs are even outperformed by a Multi-Layer Perceptron, indicating insufficient exploitation of node feature information. Motivated by the two limitations, we propose ProtoGNN, a novel message passing framework that augments existing GNNs by effectively combining node features with structural information. ProtoGNN learns multiple prototypes for each class from raw node features with the slot-attention mechanism. These prototype representations are then transferred onto the structural node features with explicit message passing to all non-training nodes irrespective of distance. This form of message passing, from training nodes to class prototypes to non-training nodes, also serves as a shortcut that bypasses local graph neighbourhoods and captures global information. ProtoGNN is a generic framework which can be applied to any existing GNN backbone to improve node representations when node features are strong and local graph information is scarce. We demonstrate through extensive experiments that ProtoGNN brings performance improvement to various GNN backbones and achieves state-of-the-art performance on several non-homophilous datasets.","Graph Neural Networks, Graph representation learning, Non-homophilous Graph, Heterophily, Non-homophily, Node Classification" MonoFlow: A Unified Generative Modeling Framework for GAN Variants,https://openreview.net/forum?id=HVVDVaegjaW,https://openreview.net/pdf?id=HVVDVaegjaW,,"Generative adversarial networks (GANs) play a minimax two-player game via adversarial training. The conventional understanding of adversarial training is that the discriminator is trained to estimate a divergence and the generator learns to minimize this divergence. We argue that despite the fact that many variants of GANs are developed following this paradigm, the existing theoretical understanding of GANs and the practical algorithms are inconsistent. In order to gain deeper theoretical insights and algorithmic inspiration for these GAN variants, we leverage Wasserstein gradient flows which characterize the evolution of particles in the sample space. Based on this, we introduce a unified generative modeling framework – MonoFlow: the particle evolution is rescaled via an arbitrary monotonically increasing mapping. Under our framework, adversarial training can be viewed as a procedure that first obtains MonoFlow's vector field via the discriminator, after which the generator learns to parameterize the flow defined by the corresponding vector field. 
We also reveal the fundamental difference between variational divergence minimization and adversarial training. These analyses help us identify what types of generator loss functions can lead to the successful training of GANs and suggest that GANs may have more loss designs beyond those developed in the literature, e.g., non-saturated loss, as long as they realize MonoFlow. Consistent empirical studies are also included to validate the effectiveness of our framework.", The Effective coalitions of Shapley value For Integrated Gradients,https://openreview.net/forum?id=Th4_9F0a6l,https://openreview.net/pdf?id=Th4_9F0a6l,The Effective coalitions of Shapley value For Integrated Gradients,"Many methods aim to explain deep neural networks (DNN) by attributing the prediction of DNN to its input features, like Integrated Gradients and Deep Shap, which both have critical baseline problems. Previous studies pursue a perfect but intractable baseline value, which is hard to find and has a very high computational cost, limiting the application range of these baseline methods. In this paper, we propose to find a set of baseline values corresponding to Shapley values that are easier to find and have a lower computation cost. To solve the computation dilemma of the Shapley value, we propose the Effective Shapley value (ES), a proportional sampling method that faithfully approximates the ratios between the Shapley values of features, and then propose Shapley Integrated Gradients (SIG) to combine Integrated Gradients with ES, to achieve a good balance between efficiency and effectiveness. Experimental results show that our ES method stably approximates the ratios between Shapley values, and our SIG method is substantially more accurate than common baseline values at similar computational cost.",Explanation Shapley value Cold Rao-Blackwellized Straight-Through Gumbel-Softmax Gradient Estimator,https://openreview.net/forum?id=EN8YE5dkOO,https://openreview.net/pdf?id=EN8YE5dkOO,Improved gradient estimator for categorical random variables by finding the zero temperature limit of the Rao-Blackwellized Straight-Through Gumbel-Softmax Gradient Estimator,"The problem of estimating the gradient of an expectation in discrete random variables arises in many applications: learning with discrete latent representations, training neural networks with quantized weights, activations, conditional blocks, etc. This work contributes to the development of the popular Gumbel-Softmax family of estimators, which is based on approximating argmax with a temperature-parametrized softmax. The state of the art in this family, the Gumbel-Rao estimator, uses internal MC samples to reduce the variance. We show that in the limit of zero temperature the internal integration has a closed form solution. The limit estimator, called ZGR, has a favorable bias and variance, is simple to implement, computationally inexpensive, and free of the temperature hyperparameter. Furthermore, ZGR is unbiased for the class of quadratic functions of categorical variables and can be decomposed into a sum of two simple estimators that perform poorly on their own: the straight-through estimator and the DARN estimator.
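For context, the straight-through Gumbel-Softmax baseline this estimator family builds on can be written compactly in PyTorch; ZGR itself is its zero-temperature, Rao-Blackwellized limit, whose closed form is given in the paper:

```python
import torch
import torch.nn.functional as F

def st_gumbel_softmax(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Straight-through Gumbel-Softmax: hard one-hot samples in the forward
    pass, softmax gradients in the backward pass. This is the baseline the
    ZGR estimator above refines."""
    gumbels = -torch.empty_like(logits).exponential_().log()  # Gumbel(0,1) noise
    y_soft = F.softmax((logits + gumbels) / tau, dim=-1)
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(logits).scatter_(-1, index, 1.0)
    return y_hard - y_soft.detach() + y_soft  # straight-through trick
```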
Experiments thoroughly validate the method.","Gumbel-Softmax, categorical variables, Concrete distribution, gradient, straight-through, VAE, quantization" Generative Spoken Language Model based on continuous word-sized audio tokens,https://openreview.net/forum?id=a0e7x2EuFO,https://openreview.net/pdf?id=a0e7x2EuFO,We introduced a generative spoken language model based on continuous word-sized acoustic tokens.,"In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard inputs of spoken LMs are 20ms-long discrete units (shorter than a phoneme). Taking inspiration from word-based LMs, we introduce a Generative Spoken Language Model (GSLM) based on word-size continuous-valued audio tokens that can generate diverse and expressive language output. This is obtained by replacing the lookup table for lexical types with a Lexical Embedding function, the cross-entropy loss with a contrastive loss, and multinomial sampling with k-NN sampling. The resulting model is the first generative language model based on word-size continuous tokens. Its performance is on par with discrete unit GSLMs regarding generation quality and zero resource challenge metrics. Moreover, it is five times more memory efficient because of its larger units. In addition, the embeddings before and after the Lexical Embedder are phonetically and semantically interpretable.","spoken language model, sentence generation, speech synthesis, k nearest neighbors, unsupervised learning, textless technology" Tree-structure segmentation for logistic regression,https://openreview.net/forum?id=IXBC0sG3cN,https://openreview.net/pdf?id=IXBC0sG3cN,"Practitioners, in particular in the banking industry, often perform clustering to obtain ""client segments"" on which they fit separate supervised models. We perform both by learning ""logistic regression trees"".","The decision for a financial institution to accept or deny a loan is based on the probability of a client paying back their debt in time. This probability is given by a model such as a logistic regression, and estimated based on, e.g., the clients’ characteristics, their credit history, and their repayment performance. Historically, different models have been developed on different markets and/or credit products and/or addressed populations. We show that this amounts to modelling default as a mixture model composed of a decision tree and logistic regressions on its leaves (hereafter “logistic regression tree”). We seek to optimise this practice by considering the population to which a client belongs as a latent variable, which we will estimate. After exposing the context, the notations and the problem formalisation, we will conduct estimation using a Stochastic-Expectation-Maximisation (SEM) algorithm. We will finally show the performance on simulated data, and on real retail credit data from [COMPANY], as well as real open-source data.","logistic regression, decision tree, credit scoring, segmentation" Neural Image Compression with a Diffusion-based Decoder,https://openreview.net/forum?id=4Jq0XWCZQel,https://openreview.net/pdf?id=4Jq0XWCZQel,Diffusion-based neural image codec allowing smooth and competitive rate-distortion-perception traversal at test time.,"Diffusion probabilistic models have recently achieved remarkable success in generating high quality image and video data. In this work, we build on this class of generative models and introduce a method for lossy compression of high resolution images.
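The "logistic regression tree" practice described above has a simple two-step baseline that can be sketched with scikit-learn; this naive version fits the tree and the leaf-level regressions separately, whereas the paper treats the segment as a latent variable and estimates everything jointly via SEM:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

class LogisticRegressionTree:
    """Naive two-step baseline: segment the population with a shallow decision
    tree, then fit one logistic regression per leaf. All hyperparameters are
    illustrative."""

    def __init__(self, max_leaf_nodes: int = 4):
        self.tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes)
        self.leaf_models = {}

    def fit(self, X, y):
        self.tree.fit(X, y)
        leaves = self.tree.apply(X)                # leaf id per training sample
        for leaf in np.unique(leaves):
            mask = leaves == leaf
            if len(np.unique(y[mask])) < 2:        # degenerate one-class segment
                self.leaf_models[leaf] = float(y[mask].mean())
            else:
                self.leaf_models[leaf] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
        return self

    def predict_proba(self, X):
        leaves, out = self.tree.apply(X), np.zeros(len(X))
        for leaf in np.unique(leaves):
            mask, model = leaves == leaf, self.leaf_models[leaf]
            out[mask] = model if isinstance(model, float) else model.predict_proba(X[mask])[:, 1]
        return out
```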
The resulting codec, which we call \emph{DIffusion-based Residual Augmentation Codec (DIRAC)}, is the first neural codec to allow smooth traversal of the rate-distortion-perception tradeoff at test time, while obtaining competitive performance with GAN-based methods in perceptual quality. Furthermore, while sampling from diffusion probabilistic models is notoriously expensive, we show that in the compression setting the number of steps can be drastically reduced.","neural, image, lossy, compression, diffusion, gan, perceptual" Learning ReLU networks to high uniform accuracy is intractable,https://openreview.net/forum?id=nchvKfvNeX0,https://openreview.net/pdf?id=nchvKfvNeX0,Learning target classes of ReLU networks to high uniform accuracy needs exponentially many samples.,"Statistical learning theory provides bounds on the necessary number of training samples needed to reach a prescribed accuracy in a learning problem formulated over a given target class. This accuracy is typically measured in terms of a generalization error, that is, an expected value of a given loss function. However, for several applications --- for example in a security-critical context or for problems in the computational sciences --- accuracy in this sense is not sufficient. In such cases, one would like to have guarantees for high accuracy on every input value, that is, with respect to the uniform norm. In this paper we precisely quantify the number of training samples needed for any conceivable training algorithm to guarantee a given uniform accuracy on any learning problem formulated over target classes containing (or consisting of) ReLU neural networks of a prescribed architecture. We prove that, under very general assumptions, the minimal number of training samples for this task scales exponentially both in the depth and the input dimension of the network architecture.","learning theory, ReLU networks, hardness results, sample complexity, teacher-student learning" GAML: geometry-aware meta-learning via a fully adaptive preconditioner,https://openreview.net/forum?id=mencEcrobES,https://openreview.net/pdf?id=mencEcrobES,,"Model-Agnostic Meta-Learning (MAML) is one of the most successful meta-learning algorithms. It has a bi-level optimization structure, where the outer-loop process learns the shared initialization and the inner-loop process optimizes the task-specific weights. Although MAML relies on the standard gradient descent in the inner loop, recent works have shown that it can be beneficial to control the inner loop's gradient descent with a meta-learned preconditioner. The existing preconditioners, however, cannot adapt in a task-specific and path-dependent way at the same time. Also, most of them do not consider the geometry of the loss surface. In this work, we propose Geometry-Aware Meta-Learning (GAML) that can overcome the limitations. GAML can efficiently meta-learn a preconditioner that is dependent on the task-specific parameters and its preconditioner can be shown to be a Riemannian metric that defines the geometry of the loss surface. Therefore, we can perform a fully-adaptive and geometry-aware optimization in the inner loop.
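A generic preconditioned inner-loop step of the kind discussed above can be sketched as follows; `precond_net` is a hypothetical meta-learned network, and the diagonal softplus parameterization is only one way to keep the preconditioner positive definite, not GAML's exact construction:

```python
import torch

def preconditioned_inner_step(params, loss, precond_net, lr: float = 0.01):
    """One inner-loop update with a meta-learned diagonal preconditioner.
    `precond_net` is a hypothetical network mapping the (flattened)
    task-specific parameters to per-coordinate scaling factors; the softplus
    keeps the diagonal positive, so it defines a valid Riemannian metric."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat = torch.cat([p.reshape(-1) for p in params])
    scale = torch.nn.functional.softplus(precond_net(flat))  # positive diagonal
    new_params, i = [], 0
    for p, g in zip(params, grads):
        n = p.numel()
        new_params.append(p - lr * scale[i:i + n].reshape(p.shape) * g)
        i += n
    return new_params
```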
Experimental results show that GAML outperforms the state-of-the-art MAML family and PGD-MAML family for a variety of few-shot learning tasks.","Meta-learning, few-shot learning" Caption supervision enables robust learners: a controlled study of distributionally robust model training,https://openreview.net/forum?id=Rkk51I-BpMH,https://openreview.net/pdf?id=Rkk51I-BpMH,"We introduce CaptionNet, a fully captioned, fully supervised dataset with ImageNet-compliant labels, and through experiment, show how the choice of loss function, data filtration and supervision strategy enable robust computer vision.","Vision language models like CLIP are robust to natural distribution shifts, in part because CLIP learns on unstructured data using a technique called caption supervision; the model interprets image-linked texts as ground-truth labels. In a carefully controlled comparison study, we show that CNNs trained with a standard cross-entropy loss can also benefit from caption supervision, in some cases even more than VL models, on the same data. To facilitate future experiments with high-accuracy caption-supervised models, we introduce CaptionNet, one piece of which is a class-balanced, fully supervised dataset with over 50,000 new human-labeled ImageNet-compliant samples which includes web-scraped captions. In a series of experiments on CaptionNet, we show how the choice of loss function, data filtration and supervision strategy enable robust computer vision.","self-supervised learning, computer vision, effective robustness, vision language, CLIP, ImageNet, LAION, CC12M, YFCC" Active Learning for Object Detection with Evidential Deep Learning and Hierarchical Uncertainty Aggregation,https://openreview.net/forum?id=MnEjsw-vj-X,https://openreview.net/pdf?id=MnEjsw-vj-X,We propose an active learning method for object detection using evidential deep learning and a novel uncertainty aggregation method.,"Despite the huge success of object detection, the training process still requires an immense amount of labeled data. Although various active learning solutions for object detection have been proposed, most existing works do not take advantage of epistemic uncertainty, which is an important metric for capturing the usefulness of the sample. Also, previous works pay little attention to the attributes of each bounding box (e.g., nearest object, box size) when computing the informativeness of an image. In this paper, we propose a new active learning strategy for object detection that overcomes the shortcomings of prior works. To make use of epistemic uncertainty, we adopt evidential deep learning (EDL) and propose a new module termed the model evidence head (MEH) that makes EDL highly compatible with object detection. Based on the computed epistemic uncertainty of each bounding box, we propose hierarchical uncertainty aggregation (HUA) for obtaining the informativeness of an image. HUA realigns all bounding boxes into multiple levels based on the attributes and aggregates uncertainties in a bottom-up order, to effectively capture the context within the image.
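The epistemic-uncertainty building block underlying EDL-based approaches like the one above is short enough to state directly; this is the standard Dirichlet-evidence formulation, not the paper's MEH module:

```python
import torch

def dirichlet_epistemic_uncertainty(evidence: torch.Tensor) -> torch.Tensor:
    """Standard evidential-deep-learning uncertainty for a K-way head:
    non-negative evidence e (e.g. a softplus of logits) parameterizes a
    Dirichlet with alpha = e + 1, and u = K / sum(alpha) grows when the
    total collected evidence is low."""
    alpha = evidence + 1.0                       # Dirichlet concentration
    strength = alpha.sum(dim=-1, keepdim=True)   # total evidence + K
    k = alpha.shape[-1]
    return (k / strength).squeeze(-1)            # epistemic uncertainty per box
```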
Experimental results show that our method outperforms existing state-of-the-art methods by a considerable margin.","Active Learning, Object Detection, Uncertainty Estimation, Bayesian Learning" How Sharpness-Aware Minimization Minimizes Sharpness?,https://openreview.net/forum?id=5spDgWmpY6x,https://openreview.net/pdf?id=5spDgWmpY6x,we prove the implicit bias of Sharpness-Aware Minimization (SAM) is minimizing the top eigenvalue of Hessian in the full-batch setting or minimizing the trace of Hessian when batch size is 1.,"Sharpness-Aware Minimization (SAM) is a highly effective regularization technique for improving the generalization of deep neural networks for various settings. However, the underlying working of SAM remains elusive because of various intriguing approximations in the theoretical characterizations. SAM intends to penalize a notion of sharpness of the model but implements a computationally efficient variant; moreover, a third notion of sharpness was used for proving generalization guarantees. The subtle differences in these notions of sharpness can indeed lead to significantly different empirical results. This paper rigorously nails down the exact sharpness notion that SAM regularizes and clarifies the underlying mechanism. We also show that the two steps of approximations in the original motivation of SAM individually lead to inaccurate local conclusions, but their combination accidentally reveals the correct effect, when full-batch gradients are applied. Furthermore, we also prove that the stochastic version of SAM in fact regularizes another notion of sharpness, which is most likely to be the preferred notion for practical performance. The key mechanism behind this intriguing phenomenon is the implicit alignment between the gradient and the top eigenvector of Hessian when running SAM.","implicit bias, implicit regularization, sharpness, sharpness aware minimization" Learning to solve the Hidden Clique Problem with Graph Neural Networks,https://openreview.net/forum?id=TpqJy1BmD6K,https://openreview.net/pdf?id=TpqJy1BmD6K,Comparison of different graph neural networks on the hidden clique problem,"We study data-driven methods for the hidden clique problem in random graphs. The training data is obtained by hiding a clique in the random graph, where the signal to noise ratio is tuned by choosing the size of the hidden clique and the density of the random graph. Using synthetic datasets allows us to test empirically the performance and generalization properties of various graph neural network (GNN) architectures at different levels of difficulties for the task. We compare message passing GNNs and GNNs augmented with a single quadratic operation (matrix multiplication) first introduced in \citep{maron2019fgnn}. Adding skip connections and normalization to these augmented GNNs is shown to improve their learning process and their generalization properties without any loss in time complexity. For hard instances of our hidden clique problem, they are shown to outperform message passing GNNs.","graph, graph neural network, GNN, hidden clique, supervised learning, deep learning" On discrete symmetries of robotics systems: A group-theoretic and data-driven analysis,https://openreview.net/forum?id=TBOFHtBariC,https://openreview.net/pdf?id=TBOFHtBariC,"We present a group-theoretic analysis of bilateral/radial symmetries of dynamical systems. Characterizing the symmetries of the system's dynamics, control, and proprioceptive/exteroceptive data. 
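For reference, one full-batch SAM update, the object whose implicit bias the paper above analyzes, can be sketched as follows; `loss_fn` is assumed to be a closure that recomputes the loss from the model's current weights:

```python
import torch

def sam_step(model, loss_fn, optimizer, rho: float = 0.05):
    """One full-batch SAM update: perturb the weights toward the approximate
    worst case within an L2 ball of radius rho, recompute the gradient there,
    and descend with that gradient. A minimal sketch."""
    loss_fn(model).backward()
    grad_norm = torch.norm(torch.stack(
        [p.grad.norm() for p in model.parameters() if p.grad is not None]))
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None); continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e); eps.append(e)        # climb to the perturbed point
    model.zero_grad()
    loss_fn(model).backward()               # gradient at the perturbed point
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)                   # restore the original weights
    optimizer.step(); optimizer.zero_grad()
```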
And elucidating how to exploit these symmetries in DL","In this work, we study the Morphological Symmetries of dynamical systems with one or more planes of symmetry, a predominant feature in animal biology and robotic systems, characterized by the duplication and balanced distribution of body parts. These morphological symmetries imply that the system's dynamics are symmetric (or approximately symmetric), which in turn imprints symmetries in optimal control policies and in all proprioceptive and exteroceptive measurements related to the evolution of the system's dynamics. For data-driven methods, symmetry represents an inductive bias that justifies data augmentation and the construction of symmetric function approximators. To this end, we use Group Theory to present a theoretical and practical framework allowing for (1) the identification of the system's morphological symmetry Group $\G$, (2) the characterization of how the group acts upon the system state variables and any relevant measurement living in the Euclidean space, and (3) the exploitation of data symmetries through the use of $\G$-equivariant/$\G$-invariant Neural Networks, for which we present experimental results on synthetic and real-world applications, demonstrating how symmetry constraints lead to better sample efficiency and generalization while reducing the number of trainable parameters.","Morphological Symmetries, Discrete Symmetries of Dynamical Systems, Equivariant Dynamics, Equivariant Function Approximators, Geometric Deep Learning" The Implicit Bias of Minima Stability in Multivariate Shallow ReLU Networks,https://openreview.net/forum?id=xtbog7cfsr,https://openreview.net/pdf?id=xtbog7cfsr,"A study of multivariate two layer ReLU nets via dynamical stability, showing bias to smooth functions, depth separation and stable approximation guarantees","We study the type of solutions to which stochastic gradient descent converges when used to train a single hidden-layer multivariate ReLU network with the quadratic loss. Our results are based on a dynamical stability analysis. In the univariate case, it was shown that linearly stable minima correspond to network functions (predictors) $f$, whose second derivative has a bounded weighted $L^1$ norm. The bound on the norm gets smaller as the step size increases, implying that training with a large step size leads to `smoother' predictors. Here we generalize this result to the multivariate case, showing that a similar result applies to the Laplacian of the predictor, $\Delta f$. We demonstrate the tightness of our bound on the MNIST dataset, and show that it accurately captures the behavior of the solutions as a function of the step size. Additionally, we prove a depth separation result on the approximation power of ReLU networks corresponding to stable minima of the loss. Specifically, although shallow ReLU networks are universal approximators, we prove that stable shallow networks are not. Namely, there is a function that cannot be well-approximated by stable single hidden-layer ReLU networks trained with a non-vanishing step size. In contrast, we show that the same function can be realized as a stable two hidden-layer ReLU network. 
Finally, we prove that if a function is sufficiently smooth (Sobolev) then it can be approximated arbitrarily well using single hidden-layer ReLU networks that correspond to stable solutions of gradient descent.","Implicit bias, implicit regularization, stability, Hessian, dynamical systems, depth separation, approximation" Out-of-Domain Intent Detection Considering Multi-turn Dialogue Contexts,https://openreview.net/forum?id=si1JH05iUV,https://openreview.net/pdf?id=si1JH05iUV,A context-aware OOD intent detection framework (Caro) that aims to consider multi-turn contexts in OOD intent detection tasks.,"Out-of-Domain (OOD) intent detection is vital for practical dialogue systems, and it usually requires considering long dialogue histories. However, previous OOD intent detection approaches are limited to single-turn contexts since it is non-trivial to gather or synthesize high-quality OOD samples in multi-turn settings, and the long-distance obstacle exhibited in multi-turn contexts hinders us from obtaining robust features for intent detection. In this paper, we introduce a context-aware OOD intent detection (Caro) framework that aims to consider multi-turn contexts in OOD intent detection tasks. Specifically, we follow the information bottleneck principle to extract robust representations from multi-turn dialogue contexts by eliminating superfluous information that is not related to intent detection tasks. We also propose to synthesize pseudo OOD samples with the help of unlabeled data under the constraint of dialogue contexts, i.e., candidate OOD samples are retrieved from unlabeled data based on their context similarities and representations of these candidates are mixed-up to produce pseudo OOD samples. A three-stage training process is introduced in Caro to combine the above approaches. Empirical results validate the superiority of our method on benchmark datasets.","OOD Detection, Intent Detection, Multi-turn Dialogue Context" Consciousness-Aware Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=u0aNcjqhEJ,https://openreview.net/pdf?id=u0aNcjqhEJ,,"In cooperative multi-agent reinforcement learning, centralized training with decentralized execution (CTDE) shows great promise for a trade-off between independent Q-learning and joint action learning. However, vanilla CTDE methods, which assume a fixed number of agents, can hardly adapt to real-world scenarios where dynamic team compositions typically suffer from dramatic partial observability variance. Specifically, agents with extensive sight ranges are prone to be affected by trivial environmental substrates, dubbed the “attention distraction” issue; ones with limited observability can hardly sense their teammates, hindering the quality of cooperation. In this paper, we propose a Consciousness-Aware Multi-Agent reinforcement learning (CAMA) approach, which is rooted in a divide-and-conquer strategy to facilitate stable and sustainable teamwork. Concretely, CAMA divides the input entities with controlled observability masks via an Entity Dividing Module (EDM) according to their execution relevance for consciousness learning. To tackle the attention distraction issue, the highly related entities are fed to a Consciousness Enhancement Module (CEM) for consciousness-aware representation extraction via action prediction with an inverse model.
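An inverse-dynamics objective of the kind the CEM description above relies on can be sketched as follows; the encoder architecture and layer sizes are hypothetical:

```python
import torch
import torch.nn as nn

class InverseModelCEM(nn.Module):
    """Toy consciousness-enhancement module in the spirit described above:
    encode the highly related entities at t and t+1 and predict the action
    taken between them, so the representation keeps action-relevant
    information."""

    def __init__(self, obs_dim: int, hidden: int, n_actions: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.inverse = nn.Linear(2 * hidden, n_actions)

    def forward(self, obs_t, obs_tp1, actions):
        z_t, z_tp1 = self.encoder(obs_t), self.encoder(obs_tp1)
        logits = self.inverse(torch.cat([z_t, z_tp1], dim=-1))
        return nn.functional.cross_entropy(logits, actions)
```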
For better out-of-sight-range cooperation, the weakly related entities are compressed into brief messages by a Consciousness Replenishment Module (CRM) with a conditional mutual information estimator. Our CAMA outperforms the SOTA methods significantly on the challenging StarCraft II, MPE, and Traffic Junction benchmarks.",multi-agent reinforcement learning Better with Less: Data-Active Pre-training of Graph Neural Networks,https://openreview.net/forum?id=663Cl-KetJ,https://openreview.net/pdf?id=663Cl-KetJ,,"Recently, pre-training on graph neural networks (GNNs) has become an active research area and is used to learn transferable knowledge for downstream tasks with unlabeled data. The success of graph pre-training models is often attributed to the massive amount of input data. In this paper, however, we identify the curse of big data phenomenon in graph pre-training: more training samples and graph datasets do not necessarily lead to better performance. Motivated by this observation, we propose a better-with-less framework for graph pre-training: few, but carefully chosen, data points are fed into a GNN model to enhance pre-training. This novel pre-training pipeline is called the data-active graph pre-training (APT) framework, and is composed of a graph selector and a pre-training model. The graph selector chooses the most representative and instructive data points based on the inherent properties of graphs as well as the predictive uncertainty. The proposed uncertainty, as feedback from the pre-training model, measures the model's confidence in the data. When fed with the chosen data, on the other hand, the pre-training model grasps an initial understanding of the new, unseen data, and at the same time attempts to remember the knowledge learnt from the previous data. Therefore, the integration and interaction between these two components form a unified framework, in which graph pre-training is performed in a progressive way. Experimental results show that the proposed APT framework is able to obtain an efficient pre-training model with less training data and better downstream performance.","Pre-training, Graph Neural Networks" MAST: Masked Augmentation Subspace Training for Generalizable Self-Supervised Priors,https://openreview.net/forum?id=5KUPKjHYD-l,https://openreview.net/pdf?id=5KUPKjHYD-l,Disentangled and uncertainty-aware learning of augmentation invariances during SSL improves generalization on downstream tasks,"Recent Self-Supervised Learning (SSL) methods are able to learn feature representations that are invariant to different data augmentations, which can then be transferred to downstream tasks of interest. However, different downstream tasks require different invariances for their best performance, so the optimal choice of augmentations for SSL depends on the target task. In this paper, we aim to learn self-supervised features that generalize well across a variety of downstream tasks (e.g., object classification, detection and instance segmentation) without knowing any task information beforehand. We do so by Masked Augmentation Subspace Training (or MAST) to encode in a single feature space the priors from different data augmentations in a factorized way. Specifically, we disentangle the feature space into separate subspaces, each induced by a learnable mask that selects relevant feature dimensions to model invariance to a specific augmentation.
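The learnable-mask subspace idea in the MAST abstract above admits a very small sketch; the sigmoid gating is an assumption about how such a soft mask could be parameterized, not the paper's exact design:

```python
import torch
import torch.nn as nn

class MaskedSubspaces(nn.Module):
    """Sketch of the masked-subspace idea above: one learnable (sigmoid-gated)
    mask per augmentation selects which dimensions of the shared embedding
    model invariance to that augmentation. Dimensions are hypothetical."""

    def __init__(self, feat_dim: int, n_augmentations: int):
        super().__init__()
        self.mask_logits = nn.Parameter(torch.zeros(n_augmentations, feat_dim))

    def forward(self, z: torch.Tensor, aug_idx: int) -> torch.Tensor:
        mask = torch.sigmoid(self.mask_logits[aug_idx])  # soft feature selection
        return z * mask                                  # subspace for this augmentation
```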
We show the success of MAST in jointly capturing generalizable priors from different augmentations, using both unique and shared features across the subspaces. We further show that MAST benefits from uncertainty modeling to reweight ambiguous samples from strong augmentations that may cause similarity mismatch in each subspace. Experiments demonstrate that MAST consistently improves generalization on various downstream tasks, while being task-agnostic and efficient during SSL. We also provide interesting insights about how different augmentations are related and how uncertainty reflects learning difficulty.","Self-Supervised Learning (SSL), Generalization, Computer vision" Pseudo-Edge: Semi-Supervised Link Prediction with Graph Neural Networks,https://openreview.net/forum?id=NM1Lt3ZBhal,https://openreview.net/pdf?id=NM1Lt3ZBhal,,"Pseudo-labeling is one of the powerful Semi-Supervised Learning (SSL) approaches, which generates confident pseudo-labels of unlabeled data and leverages them for training. Recently, pseudo-labeling has been further extended to Graph Neural Networks (GNNs) to address the data sparsity problem due to the nature of graph-structured data. Despite their success in the graph domain, these methods have been mainly designed for node-level tasks by utilizing node-level algorithms (e.g., Label Propagation) for pseudo-labeling, which cannot be directly applied to the link prediction task. Besides, existing works for link prediction only use given edges as positively-labeled data, and there have been no attempts to leverage non-visible edges for training a model in a semi-supervised manner. To address these limitations, we revisit the link prediction task in a semi-supervised fashion and propose a novel pseudo-labeling framework, Pseudo-Edge, that generates qualified pseudo-labels in consideration of graph structures and harnesses them for link prediction. Specifically, our framework constructs distance-based potential edge candidates and carefully selects pseudo-labels through our relation-aware pseudo-label generation, which reflects the comparative superiority of each unlabeled edge over its local neighborhoods in graphs. Also, we propose uncertainty-aware pseudo-label generation that can effectively filter out over-confident samples when the model overfits to specific graph structures. Extensive experiments show that our method achieves remarkable performance across five link prediction benchmark datasets and GNN architectures, compared to state-of-the-art baselines.","Graph Neural Networks, Link Prediction, Pseudo-labeling, Semi-supervised learning" Graph-based Deterministic Policy Gradient for Repetitive Combinatorial Optimization Problems,https://openreview.net/forum?id=yHIIM9BgOo,https://openreview.net/pdf?id=yHIIM9BgOo,A general learning framework is proposed to learn reusable node or edge representations that can reduce the optimality gap of fast heuristics for repetitive combinatorial optimization problems.,"We propose an actor-critic framework for graph-based machine learning pipelines with non-differentiable blocks, and apply it to repetitive combinatorial optimization problems (COPs) under hard constraints. Repetitive COP refers to problems to be solved repeatedly on graphs of the same or slowly changing topology but rapidly changing node or edge weights.
Compared to one-shot COPs, repetitive COPs often rely on fast heuristics to solve one instance of the problem before the next one arrives, at the cost of a relatively large optimality gap. Through numerical experiments on several discrete optimization problems, we show that our approach can learn reusable node or edge representations to reduce the optimality gap of fast heuristics for independent repetitive COPs, and can optimize the long-term objectives for repetitive COPs embedded in graph-based Markov decision processes. Source code at https://github.com/XzrTGMu/twin-nphard ", Lower Bounds on the Depth of Integral ReLU Neural Networks via Lattice Polytopes,https://openreview.net/forum?id=2mvALOAWaxY,https://openreview.net/pdf?id=2mvALOAWaxY,We derive lower bounds on the depth of integral ReLU neural networks using volume arguments for lattice polytopes arising from connections to tropical geometry.,"We prove that the set of functions representable by ReLU neural networks with integer weights strictly increases with the network depth while allowing arbitrary width. More precisely, we show that $\lceil\log_2(n)\rceil$ hidden layers are indeed necessary to compute the maximum of $n$ numbers, matching known upper bounds. Our results are based on the known duality between neural networks and Newton polytopes via tropical geometry. The integrality assumption implies that these Newton polytopes are lattice polytopes. Then, our depth lower bounds follow from a parity argument on the normalized volume of faces of such polytopes.","Rectified Linear Unit, Neural Network Expressivity, Neural Network Depth, Lattice Polytope, Normalized Volume" Contextual Transformer for Offline Reinforcement Learning,https://openreview.net/forum?id=7pl0FRiS0Td,https://openreview.net/pdf?id=7pl0FRiS0Td,This paper explores how prompts help sequence-modeling based offline-RL algorithms,"Recently, the pretrain-tuning paradigm in large-scale sequence models has made significant progress in Natural Language Processing and Computer Vision. However, such a paradigm is still hindered by intractable challenges in Reinforcement Learning (RL), including the lack of self-supervised large-scale pretraining methods based on offline data and efficient fine-tuning/prompt-tuning over unseen downstream tasks. In this work, we explore how prompts can help sequence-modeling-based offline Reinforcement Learning (offline-RL) algorithms. Firstly, we propose prompt tuning for offline RL, where a context vector sequence is concatenated with the input to guide the conditional generation. As such, we can pretrain a model on the offline dataset with supervised loss and learn a prompt to guide the policy to play the desired actions. Secondly, we extend the framework to the Meta-RL setting and propose Contextual Meta Transformer (CMT), which leverages the context among different tasks as the prompt to improve the performance on unseen tasks. We conduct extensive experiments across three different offline-RL settings: offline single-agent RL on the D4RL dataset, offline Meta-RL on the MuJoCo benchmark, and offline MARL on the SMAC benchmark. 
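The depth bound above has a concrete constructive side: max(a, b) = (a + b + |a - b|)/2 with |x| = relu(x) + relu(-x) gives a pairwise maximum with one hidden ReLU layer and integer weights, and a tournament over pairs computes the maximum of n numbers in ceil(log2 n) rounds, matching the lower bound. A minimal numpy illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def max2(a, b):
    # max(a, b) = (a + b + |a - b|) / 2, with |x| = relu(x) + relu(-x),
    # so one pairwise max costs a single hidden ReLU layer (integer weights).
    return 0.5 * (a + b + relu(a - b) + relu(b - a))

def max_n(values):
    """Tournament of pairwise maxima: ceil(log2(n)) rounds of depth,
    matching the upper bound that the lattice-polytope argument above
    proves to be tight."""
    vals = list(values)
    while len(vals) > 1:
        pairs = [max2(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:        # an odd element passes through this round
            pairs.append(vals[-1])
        vals = pairs
    return vals[0]

print(max_n([3.0, -1.0, 7.5, 2.0]))  # 7.5
```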
The results validate the strong performance and generality of our methods.","Offline Meta Reinforcement Learning, Prompt Tuning, Transformer" Two-Dimensional Weisfeiler-Lehman Graph Neural Networks for Link Prediction,https://openreview.net/forum?id=8XQd91fDSf9,https://openreview.net/pdf?id=8XQd91fDSf9,We propose provably powerful 2-WL variants for link prediction and successfully implement them to get competitive results and speed advantage.,"Link prediction is one important application of graph neural networks (GNNs). Most existing GNNs for link prediction are based on the one-dimensional Weisfeiler-Lehman ($1$-WL) test. $1$-WL-GNNs first compute node representations by iteratively passing neighboring node features to the center, and then obtain link representations by aggregating the pairwise node representations. As pointed out by previous works, this two-step procedure results in low discriminating power, as $1$-WL-GNNs by nature learn node-level representations instead of link-level. In this paper, we study a completely different approach which can directly obtain node pair (link) representations based on \textit{two-dimensional Weisfeiler-Lehman ($2$-WL) tests}. $2$-WL tests directly use links (2-tuples) as message passing units instead of nodes, and thus can directly obtain link representations. We theoretically analyze the expressive power of $2$-WL tests to discriminate non-isomorphic links, and prove their superior link discriminating power over $1$-WL. Based on different $2$-WL variants, we propose a series of novel $2$-WL-GNN models for link prediction. Experiments on a wide range of real-world datasets demonstrate their competitive performance against state-of-the-art baselines and superiority over plain $1$-WL-GNNs.", Wasserstein Auto-encoded MDPs: Formal Verification of Efficiently Distilled RL Policies with Many-sided Guarantees,https://openreview.net/forum?id=JLLTtEdh1ZY,https://openreview.net/pdf?id=JLLTtEdh1ZY,Formal Verification of Efficiently Distilled RL Policies with Many-sided Guarantees,"Although deep reinforcement learning (DRL) has many success stories, the large-scale deployment of policies learned through these advanced techniques in safety-critical scenarios is hindered by their lack of formal guarantees. Variational Markov Decision Processes (VAE-MDPs) are discrete latent space models that provide a reliable framework for distilling formally verifiable controllers from any RL policy. While the related guarantees address relevant practical aspects such as the satisfaction of performance and safety properties, the VAE approach suffers from several learning flaws (mode collapse, slow learning speed, poor dynamics estimates), primarily due to the absence of abstraction and representation guarantees to support latent optimization. We introduce the Wasserstein auto-encoded MDP (WAE-MDP), a latent space model that fixes those issues by minimizing a penalized form of the optimal transport between the behaviors of the agent executing the original policy and the distilled policy, for which the formal guarantees apply. Our approach yields bisimulation guarantees while learning the distilled policy, allowing concrete optimization of the abstraction and representation model quality. Our experiments show that, besides distilling policies up to 10 times faster, the latent model quality is indeed better in general. Moreover, we present experiments with a simple time-to-failure verification algorithm on the latent space.
The fact that our approach enables such simple verification techniques highlights its applicability.","Reinforcement learning, Formal Verification, Representation Learning" Efficient Controllable Generation with Guarantee,https://openreview.net/forum?id=ROHDcdx8g6n,https://openreview.net/pdf?id=ROHDcdx8g6n,We propose a general method with theoretical guarantee and integrate it with NVAE for controllable generation.,"Generative models have achieved great success in image synthesis, and controllability of the generative process is a key requirement for their successful adoption in real-world applications. Most existing methods for controllable generation lack theoretical guarantees and are time-consuming, which weakens their reliability and applicability. In this paper, we propose an identifiability theorem to provide a guarantee of controllability. This theorem ensures that semantic attributes can be disentangled and hence independently controlled by orthogonalization in latent space in a supervised manner. Based on the theoretical analysis, we propose a general method for controllable generation, which can be integrated with most latent-variable generative models. We further propose to plug it into a pre-trained NVAE. Such a scheme significantly reduces the cost of time and has better consistency in image editing due to the merits of NVAE. Experiments show that our method is comparable with the state-of-the-art methods in attribute-conditional generation and image editing, and has advantages in efficiency and consistency.","controllable generation, variational autoencoder, identifiability" Towards graph-level anomaly detection via deep evolutionary mapping,https://openreview.net/forum?id=UL3RnLLQ-jK,https://openreview.net/pdf?id=UL3RnLLQ-jK,We propose a novel graph-level anomaly detection framework by mapping graphs into a specially designed feature space in which anomalies and normal graphs are well-separated.,"Graph-level anomaly detection aims at depicting anomalous individual graphs in a graph set. Due to its significance in various real-world application fields, such as identifying rare molecules in chemistry and detecting potential frauds in online social networks, graph-level anomaly detection has received great attention. In distinction from node- and edge-level anomaly detection that is devoted to identifying anomalies on a single graph, graph-level anomaly detection faces more significant challenges because both the intra- and inter-graph structural and attribute patterns need to be taken into account to distinguish anomalies that exhibit deviating structures, rare attributes or the both. Although deep graph representation learning shows effectiveness in fusing high-level representations and capturing characters of individual graphs, most of the existing works are defective in graph-level anomaly detection because of their limited capability in exploring information across graphs, the imbalanced data distribution of anomalies, and low interpretability of the black-box graph neural networks (GNNs). To bridge these gaps, we propose a novel deep evolutionary graph mapping framework named GmapAD, which can adaptively map each graph into a new feature space based on its similarity to a set of representative nodes chosen from the graph set. By automatically adjusting the candidate nodes using a specially designed evolutionary algorithm, anomalies and normal graphs are mapped to separate areas in the new feature space where a clear boundary between them can be learned. 
The selected candidate nodes can therefore be regarded as a benchmark for explaining anomalies because anomalies are more dissimilar/similar to the benchmark than normal graphs. Through our extensive experiments on nine real-world datasets, we demonstrate that exploring both intra- and inter-graph structural and attribute information are critical to spot anomalous graphs, and our framework outperforms the state of the art on all datasets used in the experiments.","Graph anomaly detection, anomaly detection, graph representation, deep learning, graph neural network" Global Explainability of GNNs via Logic Combination of Learned Concepts,https://openreview.net/forum?id=OTbRTIY4YS,https://openreview.net/pdf?id=OTbRTIY4YS,"We propose GLGExplainer, the first Global Explainer for GNNs capable of generating explanations as arbitrary Boolean combinations of graphical concepts.","While instance-level explanation of GNN is a well-studied problem with plenty of approaches being developed, providing a global explanation for the behaviour of a GNN is much less explored, despite its potential in interpretability and debugging. Existing solutions either simply list local explanations for a given class, or generate a synthetic prototypical graph with maximal score for a given class, completely missing any combinatorial aspect that the GNN could have learned. In this work, we propose GLGExplainer (Global Logic-based GNN Explainer), the first Global Explainer capable of generating explanations as arbitrary Boolean combinations of learned graphical concepts. GLGExplainer is a fully differentiable architecture that takes local explanations as inputs and combines them into a logic formula over graphical concepts, represented as clusters of local explanations. Contrary to existing solutions, GLGExplainer provides accurate and human-interpretable global explanations that are perfectly aligned with ground-truth explanations (on synthetic data) or match existing domain knowledge (on real-world data). Extracted formulas are faithful to the model predictions, to the point of providing insights into some occasionally incorrect rules learned by the model, making GLGExplainer a promising diagnostic tool for learned GNNs.","Explainability, Graph Neural Networks, Concept Learning" Pessimistic Policy Iteration for Offline Reinforcement Learning,https://openreview.net/forum?id=TmJtBnIWkB,https://openreview.net/pdf?id=TmJtBnIWkB,,"Offline reinforcement learning suffers from extrapolation error in the Q-value function. In addition, most methods enforce a consistent constraint on the policy during training, regardless of its out-of-distribution level. We propose pessimistic policy iteration, which guarantees that the Q-value evaluation error is small under the trained policy's distribution and bounds the suboptimality gap of the trained policy's value function. At the same time, pessimistic policy iteration's core component is a horizon-flexible uncertainty quantifier, which could set a constraint according to regional uncertainty. The empirical study shows that the proposed method could boost the performance of baseline methods and is robust to the scale of the constraint. 
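A common uncertainty quantifier used for pessimism of the kind described above is the ensemble standard deviation of Q-values; this generic sketch is a standard construction, not the paper's horizon-flexible quantifier:

```python
import torch

def pessimistic_q(q_ensemble, states, actions, beta: float = 1.0) -> torch.Tensor:
    """Generic ensemble-based pessimism: penalize the mean Q-value by the
    ensemble standard deviation, so regions with high disagreement (likely
    out-of-distribution) receive lower value estimates."""
    qs = torch.stack([q(states, actions) for q in q_ensemble])  # [E, batch]
    return qs.mean(dim=0) - beta * qs.std(dim=0)
```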
Also, a flexible horizon of uncertainty is necessary to identify out-of-distribution regions.",offline reinforcement learning BO-Muse: A Human expert and AI teaming framework for accelerated experimental design ,https://openreview.net/forum?id=zZXztocaN9,https://openreview.net/pdf?id=zZXztocaN9,A Human-AI collaborative optimisation approach using sample-efficient Bayesian optimisation ,"In this paper we introduce BO-Muse, a new approach to human-AI teaming for the optimisation of expensive black-box functions. Inspired by the intrinsic difficulty of extracting expert knowledge and distilling it back into AI models and by observations of human behaviour in real-world experimental design, our algorithm lets the human expert take the lead in the experimental process. The human expert can use their domain expertise to its full potential, while the AI plays the role of a muse, injecting novelty and searching for areas of weakness to break the human out of over-exploitation induced by cognitive entrenchment. With mild assumptions, we show that our algorithm converges sub-linearly, at a rate faster than the AI or human alone. We validate our algorithm using synthetic data and with human experts performing real-world experiments.","Experimental Design, Machine learning, Optimisation, Bayesian optimisation, Human-AI Teaming" Coordination Scheme Probing for Generalizable Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=PAKkOriJBd,https://openreview.net/pdf?id=PAKkOriJBd,A few-shot MARL approach for improving coordination ability with diverse teammates,"Coordinating with previously unknown teammates without joint learning is a crucial need for real-world multi-agent applications, such as human-AI interaction. An active research topic on this problem is ad hoc teamwork, which improves agents' coordination ability in zero-shot settings. However, previous works can only solve the problem of a single agent's coordination with different teams, which is not in line with arbitrary group-to-group coordination in complex multi-agent scenarios. Moreover, they commonly suffer from limited adaptation ability within an episode in a zero-shot setting. To address these problems, we introduce the Coordination Scheme Probing (CSP) approach that applies a disentangled scheme probing module to represent and classify the newly arrived teammates beforehand with limited pre-collected episodic data and makes multi-agent control accordingly. To achieve generalization, CSP learns a meta-policy with multiple sub-policies that follow distinguished coordination schemes in an end-to-end fashion and automatically reuses it to coordinate with unseen teammates. Empirically, we show that the proposed method achieves remarkable performance compared to existing ad hoc teamwork and policy generalization methods in various multi-agent cooperative scenarios.","reinforcement learning, multi-agent reinforcement learning, agent modeling" Generalization error bounds for Neural Networks with ReLU activation,https://openreview.net/forum?id=3v2DIO9oVl,https://openreview.net/pdf?id=3v2DIO9oVl,We show that the generalization error of Neural Networks with ReLU activations approaches zero with probability 1 as the number of training points increases,"We show rigorous bounds on the generalization error for Neural Networks with ReLU activation under the condition that the network size doesn't grow with the training set size.
In order to prove these bounds, we weaken the notion of uniform stability of a learning algorithm in a probabilistic way by positing the notion of almost sure (a.s.) support stability and proving that if an algorithm has low enough a.s. support stability its generalization error tends to 0 as the training set size increases. Further, we show that for Stochastic Gradient Descent to be almost surely support stable we only need the loss function to be locally Lipschitz and locally smooth with probability 1, thereby showing low generalization error with weaker conditions than have been used in the literature. We then show that Neural Networks with ReLU activation and a doubly differentiable loss function possess these properties, thereby proving low generalization error. The caveat is that the size of the NN must not grow with the size of the training set. Finally, we present experimental evidence to validate our theoretical results.","relu stability, sgd stability, non smooth neural network stability" "Two Birds, One Stone: An Equivalent Transformation for Hyper-relational Knowledge Graph Modeling",https://openreview.net/forum?id=e3U6bGsfcA,https://openreview.net/pdf?id=e3U6bGsfcA,We propose a simple yet effective transformation strategy for hyper-relational knowledge graph modeling with both semantic and structural information captured.,"By representing knowledge in a primary triple associated with additional attribute value qualifiers, the hyper-relational knowledge graph (HKG), which generalizes the triple-based knowledge graph (KG), has been attracting research attention recently. Compared with KG, HKG is enriched with the semantic difference between the primary triple and additional qualifiers as well as the structural connection between entities in the hyper-relational graph structure. However, to model HKG, existing studies mainly focus on either semantic information or structural information therein, and fail to capture both simultaneously. To tackle this issue, in this paper, we propose an equivalent transformation for HKG modeling, referred to as TransEQ. Specifically, the equivalent transformation transforms an HKG into a KG, which considers both semantic and structural characteristics. Then a generalized encoder-decoder framework is developed to bridge the modeling research between KG and HKG. In the encoder part, KG-based graph neural networks are leveraged for structural modeling; while in the decoder part, various HKG-based scoring functions are exploited for semantic modeling. In particular, we design the sharing embedding mechanism in the encoder-decoder framework to capture semantic relatedness. We further theoretically prove that TransEQ preserves complete information in the equivalent transformation, and also achieves full expressivity. Finally, extensive experiments on three benchmarks demonstrate the superior performance of TransEQ in terms of both effectiveness and efficiency. On the largest benchmark WikiPeople, TransEQ significantly improves the state-of-the-art models by 15% in MRR.","Hyper-relational knowledge graph, hyperedge expansion, graph neural network" Gradient Gating for Deep Multi-Rate Learning on Graphs,https://openreview.net/forum?id=JpRExTbl1-,https://openreview.net/pdf?id=JpRExTbl1-,,"We present Gradient Gating (G$^2$), a novel framework for improving the performance of Graph Neural Networks (GNNs). Our framework is based on gating the output of GNN layers with a mechanism for multi-rate flow of message passing information across nodes of the underlying graph.
Local gradients are harnessed to further modulate message passing updates. Our framework flexibly allows one to use any basic GNN layer as a wrapper around which the multi-rate gradient gating mechanism is built. We rigorously prove that G$^2$ alleviates the oversmoothing problem and allows the design of deep GNNs. Empirical results are presented to demonstrate that the proposed framework achieves state-of-the-art performance on a variety of graph learning tasks, including on large-scale heterophilic graphs.","GNNs, message-passing, oversmoothing, heterophilic graphs, multi-rate learning, gating, large graphs" Self-Supervised Extreme Compression of Gigapixel Images,https://openreview.net/forum?id=oBbsPbMsY3,https://openreview.net/pdf?id=oBbsPbMsY3,We adapted the self-supervised learning framework to learn gigapixels images embeddings and show their linear superiority on several downstream classification tasks.,"Whole slide images (WSI) are microscopy images of stained tissue slides routinely prepared for diagnosis and treatment selection in clinical practice. WSI are very large (gigapixel size) and complex (made of up to millions of cells). The current state-of-the-art (SoTA) approach to classify WSI subdivides them into tiles, encodes them by pre-trained networks, and applies Multiple Instance Learning (MIL) to train for specific downstream tasks. However, annotated datasets are often small, typically a few hundred to a few thousand WSI, which may cause overfitting and underperforming models. On the other hand, the number of unannotated WSI is ever increasing, with datasets of tens of thousands (soon to be millions) of images available. Nevertheless, using unannotated WSI is limited due to the challenges of extending self-supervised learning from natural images to WSI. We propose a strategy of slide-level self-supervised learning (SSL) to leverage the large number of images without annotations to infer powerful slide representations. The resulting embeddings allow compression of the whole public WSI dataset available at the Cancer-Genome Atlas (TCGA), one of the most widely used data resources in cancer research, from 16 TB to 23 MB, thus dramatically simplifying future studies in the field of computational pathology in terms of data storage and processing. We show that a linear classifier trained on top of these embeddings maintains or improves previous SoTA performances on various benchmark WSI classification tasks. Finally, we observe that training a classifier on these representations with tiny datasets (e.g. 50 slides) improved performances over SoTA by an average of +6.3 AUC points over all downstream tasks. We further analyze the conditions necessary for such a training framework to succeed, bringing insights into WSI processing.","Self-supervision, gigapixel, pathology, cancer, data-augmentations, compression" Combating noisy labels with stochastic noise-tolerated supervised contrastive learning,https://openreview.net/forum?id=gPxd1tTvoaC,https://openreview.net/pdf?id=gPxd1tTvoaC,,"Learning with noisy labels (LNL) aims to achieve good generalization performance given a label-corrupted training set. In this work, we consider a more challenging situation of LNL on \emph{fine-grained} datasets (LNL-FG). Due to large inter-class ambiguity among those fine-grained classes, deep models are more prone to overfitting to noisy labels, leading to poor generalization performance. 
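One plausible reading of the gating mechanism above is to modulate each node's update rate by the magnitude of local graph gradients (feature differences across edges), so nodes whose neighbourhood has already smoothed out stop updating; the sketch below assumes a PyG-style layer signature and is not the paper's exact formula:

```python
import torch

def gradient_gated_update(x, gnn_layer, edge_index, p: float = 2.0):
    """One gated GNN update: x_new = (1 - tau) * x + tau * F(x), where the
    per-node, per-channel rate tau is derived from aggregated local feature
    gradients. An interpretive sketch of multi-rate gating."""
    src, dst = edge_index                        # [2, E] COO edge list
    h = gnn_layer(x, edge_index)                 # any basic GNN layer (PyG-style)
    grad_mag = torch.zeros_like(x).index_add_(
        0, dst, (h[src] - h[dst]).abs() ** p)    # aggregated local gradients
    tau = torch.tanh(grad_mag)                   # rate in [0, 1)
    return (1.0 - tau) * x + tau * h
```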
To handle this problem, we propose a novel framework called stochastic noise-tolerated supervised contrastive learning (SNSCL) that can enhance the discriminability of deep models. Specifically, SNSCL contains a noise-tolerated contrastive loss and a stochastic module. To combat the fitting of noisy labels, we design a noise-tolerated supervised contrastive learning loss that incorporates a weight-aware mechanism for noisy label correction and selectively updates momentum queue lists. By this mechanism, SCL mitigates the effects of noisy anchors and avoids inserting noisy labels into the momentum-updated queue. Besides, to avoid manually-defined augmentation strategies in SCL, we propose an efficient stochastic module that samples feature embeddings from a generated distribution, which can also enhance the representation ability of SCL. Our proposed SNSCL is general and compatible with prevailing robust LNL strategies to improve their performance for LNL-FG. Extensive experiments on four noisy benchmarks and an open-world dataset with varying noise ratios demonstrate that our proposed framework significantly improves the performance of current LNL methods for LNL-FG.", MAESTRO: Open-Ended Environment Design for Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=sKWlRDzPfd7,https://openreview.net/pdf?id=sKWlRDzPfd7,,"Open-ended learning methods that automatically generate a curriculum of increasingly challenging tasks serve as a promising avenue toward generally capable reinforcement learning (RL) agents. Existing methods adapt curricula independently over either environment parameters (in single-agent settings) or co-player policies (in multi-agent settings). However, the strengths and weaknesses of co-players can manifest themselves differently depending on environmental features. It is thus crucial to consider the dependency between the environment and co-player when shaping a curriculum in multi-agent domains. In this work, we use this insight and extend Unsupervised Environment Design (UED) to multi-agent environments. We then introduce Multi-Agent Environment-Space Response Oracles (MAESTRO), the first multi-agent UED approach for two-player zero-sum settings. MAESTRO efficiently produces adversarial, joint curricula over both environment parameters and co-player policies and attains minimax-regret guarantees at Nash equilibrium. Our experiments show that MAESTRO outperforms a number of strong baselines on competitive two-player environments, spanning discrete and continuous control.","reinforcement learning, multi-agent learning, unsupervised environment design" Capturing the Motion of Every Joint: 3D Human Pose and Shape Estimation with Independent Tokens,https://openreview.net/forum?id=0Vv4H4Ch0la,https://openreview.net/pdf?id=0Vv4H4Ch0la,"We present a novel, effective and robust model with designed independent tokens to estimate 3D human pose and shape from monocular videos","In this paper we present a novel method to estimate 3D human pose and shape from monocular videos. This task requires directly recovering pixel-aligned 3D human pose and body shape from monocular images or videos, which is challenging due to its inherent ambiguity. To improve precision, existing methods highly rely on the initialized mean pose and shape as prior estimates and on parameter regression in an iterative error feedback manner.
In addition, video-based approaches model overall changes in image-level features to temporally enhance the single-frame feature, but they fail to capture rotational motion at the joint level and cannot guarantee local temporal consistency. To address these issues, we propose a novel Transformer-based model with a design of independent tokens. First, we introduce three types of tokens independent of the image feature: \textit{joint rotation tokens, shape token, and camera token}. By progressively interacting with image features through Transformer layers, these tokens learn to encode 3D rotations of human joints, body shape, and position information, while adapting to a given image and learning priors over joint rotation angles from large-scale data. Second, benefiting from the proposed token-based representation, we further use a temporal model to focus on capturing the rotational temporal information of each joint, which is empirically conducive to preventing large jitters in local parts. Despite being conceptually simple, the proposed method attains superior performance on the 3DPW and Human3.6M datasets. Using ResNet-50 and Transformer architectures, it achieves 42.0 mm error on the PA-MPJPE metric of the challenging 3DPW, outperforming state-of-the-art counterparts by a large margin.","3D human pose and shape estimation, 3d human reconstruction, transformer, independent tokens, temporal modeling, joint rotational motion" Almost Linear Constant-Factor Sketching for $\ell_1$ and Logistic Regression,https://openreview.net/forum?id=gu-SC0dpkvw,https://openreview.net/pdf?id=gu-SC0dpkvw,We give the first constant factor approximate sketches for l1 and logistic regression in a turnstile stream with almost linear sketching dimension that result in an efficient optimization problem in the sketch space.,"We improve upon previous oblivious sketching and turnstile streaming results for $\ell_1$ and logistic regression, giving a much smaller sketching dimension achieving $O(1)$-approximation and yielding an efficient optimization problem in the sketch space. Namely, we achieve for any constant $c>0$ a sketching dimension of $\tilde{O}(d^{1+c})$ for $\ell_1$ regression and $\tilde{O}(\mu d^{1+c})$ for logistic regression, where $\mu$ is a standard measure that captures the complexity of compressing the data. For $\ell_1$-regression our sketching dimension is near-linear and improves previous work which either required $\Omega(\log d)$-approximation with this sketching dimension, or required a larger $\operatorname{poly}(d)$ number of rows. Similarly, for logistic regression previous work had worse $\operatorname{poly}(\mu d)$ factors in its sketching dimension. We also give a tradeoff that yields a $1+\varepsilon$ approximation in input sparsity time by increasing the total size to $(d\log(n)/\varepsilon)^{O(1/\varepsilon)}$ for $\ell_1$ and to $(\mu d\log(n)/\varepsilon)^{O(1/\varepsilon)}$ for logistic regression.
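As a point of reference for the sketching entry above, here is a toy oblivious sketch for $\ell_1$ regression using a dense Cauchy matrix, a classical heavy-tailed construction rather than the paper's near-linear one; the sketched problem is then solved with plain iteratively reweighted least squares.

```python
import numpy as np

def cauchy_sketch(A, b, r, rng):
    """Dense Cauchy sketch for l1 regression (classical construction,
    NOT the near-linear sketch from the entry above)."""
    S = rng.standard_cauchy(size=(r, A.shape[0])) / r
    return S @ A, S @ b

def l1_solve(A, b, iters=50, eps=1e-6):
    """Iteratively reweighted least squares for min ||Ax - b||_1."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    for _ in range(iters):
        m = 1.0 / np.sqrt(np.maximum(np.abs(A @ x - b), eps))
        x = np.linalg.lstsq(A * m[:, None], b * m, rcond=None)[0]
    return x

rng = np.random.default_rng(0)
n, d, r = 5000, 10, 200
A = rng.normal(size=(n, d)); x_true = rng.normal(size=d)
b = A @ x_true + rng.laplace(scale=0.1, size=n)
SA, Sb = cauchy_sketch(A, b, r, rng)     # compress 5000 rows to 200
x_hat = l1_solve(SA, Sb)                 # solve in the sketch space
print(np.linalg.norm(x_hat - x_true))    # small: sketch preserves the solution
```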
Finally, we show that our sketch can be extended to approximate a regularized version of logistic regression where the data-dependent regularizer corresponds to the variance of the individual logistic losses.","regression, sketching, data streams, logistic regression" Neural-based classification rule learning for sequential data,https://openreview.net/forum?id=7tJyBmu9iCj,https://openreview.net/pdf?id=7tJyBmu9iCj,,"Discovering interpretable patterns for classification of sequential data is of key importance for a variety of fields, ranging from genomics to fraud detection or more generally interpretable decision-making. In this paper, we propose a novel differentiable fully interpretable method to discover both local and global patterns (i.e. catching a relative or absolute temporal dependency) for rule-based binary classification. It consists of a convolutional binary neural network with an interpretable neural filter and a training strategy based on dynamically-enforced sparsity. We demonstrate the validity and usefulness of the approach on synthetic datasets and on an open-source peptides dataset. Key to this end-to-end differentiable method is that the expressive patterns used in the rules are learned alongside the rules themselves.","classification rule learning, binary neural network, interpretable AI, sequential data" Q-learning Decision Transformer: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL,https://openreview.net/forum?id=oIkZyOytR3g,https://openreview.net/pdf?id=oIkZyOytR3g,,"Recent works have shown that tackling offline reinforcement learning (RL) with a conditional policy produces promising results. The Decision Transformer (DT) combines the conditional policy approach and a transformer architecture, showing competitive performance on several benchmarks. However, DT lacks stitching ability -- one of the critical abilities for offline RL to learn the optimal policy from sub-optimal trajectories. This issue becomes particularly significant when the offline dataset only contains sub-optimal trajectories. On the other hand, the conventional RL approaches based on Dynamic Programming (such as Q-learning) do not have the same limitation; however, they suffer from unstable learning behaviours, especially when they rely on function approximation in an off-policy learning setting. In this paper, we propose the Q-learning Decision Transformer (QDT) to address the shortcomings of DT by leveraging the benefits of Dynamic Programming (Q-learning). It utilises the Dynamic Programming results to relabel the return-to-go in the training data and then trains the DT on the relabelled data. Our approach efficiently exploits the benefits of the two approaches, each compensating for the other's shortcomings, to achieve better performance. We empirically demonstrate these benefits in both simple toy environments and the more complex D4RL benchmark, observing competitive performance gains. ","offline reinforcement learning, reward conditional model, dynamic programming, decision transformer" $\epsilon$-Invariant Hierarchical Reinforcement Learning for Building Generalizable Policy,https://openreview.net/forum?id=wJkXkCzWFSx,https://openreview.net/pdf?id=wJkXkCzWFSx,"We propose a new HRL method, which can build generalizable policies with general subgoals, for solving complex high-dimensional maze-navigation control tasks.","Goal-conditioned Hierarchical Reinforcement Learning (HRL) has shown remarkable potential for solving complex control tasks.
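Returning to the Q-learning Decision Transformer entry above, its relabelling step can be sketched in a few lines. The exact rule combining value estimates and empirical returns is an assumption on our part; `values` stands in for the learned Dynamic Programming estimates.

```python
import numpy as np

def relabel_returns_to_go(rewards, values):
    """Hedged sketch of QDT-style relabelling: walk a trajectory backwards
    and lift each return-to-go to at least the learned value estimate, so
    the DT sees targets consistent with stitched (better) behaviour.
    `values[t]` is an assumed learned state-value estimate at step t."""
    T = len(rewards)
    rtg = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + running        # empirical return-to-go
        rtg[t] = max(running, values[t])      # lower-bound by the value estimate
        running = rtg[t]                      # propagate the relabelled target
    return rtg
```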
However, existing methods struggle in tasks that require generalization since the learned subgoals are highly task-specific and therefore hardly reusable. In this paper, we propose a novel HRL framework called \textit{$\epsilon$-Invariant HRL} that uses abstract, task-agnostic subgoals reusable across tasks, resulting in a more generalizable policy. Although such subgoals are reusable, a transition mismatch problem caused by the inevitably incorrect value evaluation of subgoals can lead to non-stationary learning and even collapse. We mitigate this mismatch problem by training the high-level policy to be adaptable to the stochasticity manually injected into the low-level policy. As a result, our framework can leverage reusable subgoals to constitute a hierarchical policy that can effectively generalize to unseen new tasks. Theoretical analysis and experimental results in continuous control navigation tasks and challenging zero-shot generalization tasks show that our approach significantly outperforms state-of-the-art methods.","hierarchical reinforcement learning, generalizable policy, zero-shot generalization" Learning Control by Iterative Inversion,https://openreview.net/forum?id=-SKvXtXPCaJ,https://openreview.net/pdf?id=-SKvXtXPCaJ,"Inverting a dynamical system to give the actions which yield desired behavior, represented as an embedding of a trajectory.","We formulate learning for control as an inverse problem - inverting a dynamical system to give the actions which yield desired behavior. The key challenge in this formulation is a distribution shift in the inputs to the function to be inverted - the learning agent can only observe the forward mapping (its actions' consequences) on trajectories that it can execute, yet must learn the inverse mapping for inputs-outputs that correspond to a different, desired behavior. We propose a general recipe for inverse problems with a distribution shift that we term $\textit{iterative inversion}$ - learn the inverse mapping under the current input distribution (policy), then use it on the desired output samples to obtain a new input distribution, and repeat. As we show, iterative inversion can converge to the desired inverse mapping, but under rather strict conditions on the mapping itself. We next apply iterative inversion to learn control. Our input is a set of demonstrations of desired behavior, given as video embeddings of trajectories (without actions), and our method iteratively learns to imitate trajectories generated by the current policy, perturbed by random exploration noise. We find that by constantly adding the demonstrated trajectory embeddings as input to the policy when generating trajectories to imitate, a la iterative inversion, we effectively steer the learning towards the desired trajectory distribution. To the best of our knowledge, this is the first exploration of learning control from the viewpoint of inverse problems, and the main advantage of our approach is simplicity - it does not require rewards, and only employs supervised learning, which can be easily scaled to use state-of-the-art trajectory embedding techniques and policy representations. Indeed, with a VQ-VAE embedding and a transformer-based policy, we demonstrate non-trivial continuous control on several tasks. Further, we report improved performance on imitating diverse behaviors compared to reward-based methods. 
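The iterative-inversion recipe just described translates almost line-for-line into code. The scalar toy problem, the crude linear inverse fits, and the exploration-noise scale below are illustrative assumptions, not the entry's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def iterative_inversion(forward, desired_y, init_x, fit_inverse, rounds=10):
    """Iterative inversion: fit the inverse map under the current input
    distribution, push the desired outputs through it (plus exploration
    noise, as the entry above suggests) to get new inputs, and repeat."""
    x = init_x
    for _ in range(rounds):
        y = forward(x)                        # forward map on current inputs
        inv = fit_inverse(y, x)               # supervised fit: outputs -> inputs
        x = inv(desired_y) + rng.normal(scale=0.1, size=desired_y.shape)
    return x

# toy usage: invert f(x) = x**3 toward output 8 with local linear inverse fits
f = lambda x: x ** 3
def fit_inverse(y, x):
    a, b = np.polyfit(y, x, 1)                # crude local linear inverse model
    return lambda q: a * q + b

x = iterative_inversion(f, np.full(200, 8.0), rng.normal(size=200), fit_inverse)
print(round(x.mean(), 2))                     # approaches 2.0 = 8 ** (1/3)
```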
","RL, IRL" DetectBench: An Object Detection Benchmark for OOD Generalization Algorithms,https://openreview.net/forum?id=7o6iMO1gkeJ,https://openreview.net/pdf?id=7o6iMO1gkeJ,,"The consensus about practical machine learning tasks, such as object detection, is still the test data are drawn from the same distribution as the training data, which is known as IID (Independent and Identically Distributed). However, it can not avoid being confronted with OOD (Out-of-Distribution) scenarios in real practice. It is risky to apply an object detection algorithm without figuring out its OOD generalization performance. On the other hand, a plethora of OOD generalization algorithms has been proposed to amortize the gap between the in-house and open-world performances of machine learning systems. However, their effectiveness was only demonstrated in the image classification tasks. It is still an opening question of how these algorithms perform on complex and practical tasks. In this paper, we first specify the setting of OOD-OD (OOD generalization object detection). Then, we propose DetectBench consisting of four OOD-OD benchmark datasets to evaluate various object detection and OOD generalization algorithms. From extensive experiments on DetectBench, we find that existing OOD generalization algorithms fail dramatically when applied to the more practical object detection tasks. This raises questions over the current progress on a large number of these algorithms and whether they can be effective in practice beyond simple toy examples. For future work, we sincerely hope that DetectBench can serve as a foothold for OOD-OD research.","Out-of-Distribution, object detection, benchmark" Generalization Bounds with Arbitrary Complexity Measures,https://openreview.net/forum?id=WhwtdGkbaDr,https://openreview.net/pdf?id=WhwtdGkbaDr,We provide novel probabilistic generalization bounds able to integrate arbitrary complexity measures be leveraging the framework of disintegrated PAC-Bayes bounds ,"In statistical learning theory, generalization bounds usually involve a complexity measure that is constrained by the considered theoretical framework. This limits the scope of such analysis, as in practical algorithms, other forms of regularization are used. Indeed, the empirical work of Jiang et al. (2019) shows that (I) common complexity measures (such as the VC-dimension) do not correlate with the generalization gap and that (ii) there exist arbitrary complexity measures that are better correlated with the generalization gap, but come without generalization guarantees. In this paper, we bridge the gap between this line of empirical works and generalization bounds of statistical learning theory. To do so, we leverage the framework of disintegrated PAC-Bayes bounds to derive a generalization bound that involves an arbitrary complexity measure. Our bound stands in probability jointly over the hypotheses and the learning sample, which allows us to improve the correlation between generalization gap and complexity, as the latter can be set to fit both the hypothesis class and the task.","Complexity Measure, Generalization Bounds, Disintegrated PAC-Bayes Bounds" Graphics Capsule: Learning hierarchical 3D representations from 2D images and its application on human faces,https://openreview.net/forum?id=h1SoBc6wLgp,https://openreview.net/pdf?id=h1SoBc6wLgp,,"The function of constructing the hierarchy of objects is important to the visual process of the human brain. 
Previous studies have successfully adopted capsule networks to decompose digits and faces into parts in an unsupervised manner to investigate the analogous perception mechanism of neural networks. However, their descriptions are restricted to the 2D space, limiting their capacity to imitate the intrinsic 3D perception ability of humans. In this paper, we propose an Inverse Graphics Capsule Network (IGC-Net) to learn hierarchical 3D representations from large-scale unlabeled images. The core of IGC-Net is a new type of capsule, named graphics capsule, which represents 3D primitives with interpretable parameters in computer graphics (CG), including depth, albedo, and 3D pose. Specifically, IGC-Net first decomposes the objects into a set of semantic-consistent parts and then assembles them into object-level descriptions to build the hierarchy. The learned graphics capsules reveal how neural networks, oriented toward visual perception, understand objects as a hierarchy of 3D models. Besides, the discovered parts can be deployed to the unsupervised face segmentation task to evaluate the semantic consistency of our method. Moreover, the part-level descriptions with explicit physical meanings provide insight into face analysis that otherwise operates as a black box, such as the importance of shape and texture for face recognition. Experiments on CelebA, BP4D, and Multi-PIE validate the effectiveness of our method.", Weak Supervision Variational Auto-Encoder,https://openreview.net/forum?id=0oDzoRjrbj,https://openreview.net/pdf?id=0oDzoRjrbj,"A VAE model with specifically designed components to perform weak supervision. Compared to existing weak supervision methods, it is considerably more robust to labelling function design.","Recent advances in weak supervision (WS) techniques allow to mitigate the enormous labelling cost of human data annotation for deep learning by automating it using simple rule-based labelling functions (LFs). However, LFs need to be carefully designed, often requiring expert domain knowledge, to be sufficiently accurate, cover enough data, and be independent of each other for existing WS methods to be viable. In addition, weak supervision methods often rely on small amounts of validation data with true labels to fine-tune and select models. To tackle these issues, we propose the Weak Supervision Variational Auto-Encoder (WS-VAE), a novel framework that combines unsupervised representation learning and weak labelling to reduce the dependence of WS on expert and manual engineering of LFs. The proposed technique learns from inputs and weak labels jointly and captures the input signal distribution with an artificial latent space, leading to considerably improved robustness to LF quality. Our extensive empirical evaluation shows that our WS-VAE performs competitively with existing WS methods on a standard WS benchmark while being substantially more robust to LF engineering.","Variational Auto-Encoders, Weak Supervision, Weak Labelling" Object Detection with OOD Generalizable Neural Architecture Search,https://openreview.net/forum?id=GHOMWtsFhj,https://openreview.net/pdf?id=GHOMWtsFhj,,"We present a Neural Architecture Search (NAS) framework guided by feature orthogonalization to improve Out-of-Distribution (OOD) Generalization on Object Detection. Specifically, we attribute the failure to generalize on OOD data to the spurious correlations of category-related features and context-related features.
The category-related features describe the causal information for predicting the target objects, e.g., ""a car with four wheels"", while the context-related features describe the non-causal information, e.g., ""a car driving at night"", and the context-related features are easily mistaken for causal information because the training and testing data distributions differ (OOD). Therefore, we aim at automatically discovering an optimal architecture that is able to disentangle the category-related features and the context-related features with a novel weight-based detector head. Both theoretical and experimental results show that the proposed scheme is able to achieve the desired disentanglement and better performance on both independent and identically distributed (IID) datasets (Pascal VOC 2012 and MS COCO) and OOD datasets (BDD100K-weather and BDD100K-time-of-day).","Out-of-Distribution, neural architecture search" Learning To Invert: Simple Adaptive Attacks for Gradient Inversion in Federated Learning,https://openreview.net/forum?id=deit1AdsFU,https://openreview.net/pdf?id=deit1AdsFU,,"Gradient inversion attacks enable recovery of training samples from model updates in federated learning (FL) and constitute a serious threat to data privacy. To mitigate this vulnerability, prior work proposed principled defenses based on differential privacy as well as heuristic defenses based on gradient compression as countermeasures. These defenses have so far been very effective, in particular those based on gradient compression that allow the model to maintain high accuracy while greatly reducing the attack's effectiveness. In this work, we argue that such findings do not accurately reflect the privacy risk in FL, and show that existing defenses can be broken by a simple adaptive attack that trains a model using auxiliary data to learn how to invert gradients on both vision and language tasks.", Leveraging Unlabeled Data to Track Memorization,https://openreview.net/forum?id=ORp91sAbzI,https://openreview.net/pdf?id=ORp91sAbzI,"We propose a practical metric to track memorization for neural networks, which together with the overall training accuracy can distinguish models with low label noise memorization on the training set and high generalization to unseen data.","Deep neural networks may easily memorize noisy labels present in real-world data, which degrades their ability to generalize. It is therefore important to track and evaluate the robustness of models against noisy label memorization. We propose a metric, called $\textit{susceptibility}$, to gauge such memorization for neural networks. Susceptibility is simple and easy to compute during training. Moreover, it does not require access to ground-truth labels and only uses unlabeled data. We empirically show the effectiveness of our metric in tracking memorization on various architectures and datasets and provide theoretical insights into the design of the susceptibility metric. Finally, we show through extensive experiments on datasets with synthetic and real-world label noise that one can utilize susceptibility and the overall training accuracy to distinguish models that maintain a low memorization on the training set and generalize well to unseen clean data. 
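The entry above leaves the metric's formula to the paper. The following is a hypothetical proxy consistent with the stated properties (no ground-truth test labels, only unlabeled data for evaluation), not the paper's definition: compare predictions on unlabeled data between a normally trained model and one trained with partially randomized labels.

```python
import numpy as np

def susceptibility_proxy(train_fn, X_train, y_train, X_unlabeled,
                         noise_frac=0.1, seed=0):
    """Hypothetical memorization proxy (NOT the paper's susceptibility):
    randomize a fraction of training labels, retrain, and measure how much
    predictions shift on unlabeled data. Models that memorize label noise
    show larger shifts. `train_fn` is an assumed sklearn-style trainer."""
    rng = np.random.default_rng(seed)
    y_noisy = y_train.copy()
    idx = rng.choice(len(y_train), int(noise_frac * len(y_train)), replace=False)
    y_noisy[idx] = rng.choice(np.unique(y_train), size=len(idx))  # random labels

    p_clean = train_fn(X_train, y_train).predict_proba(X_unlabeled)
    p_noisy = train_fn(X_train, y_noisy).predict_proba(X_unlabeled)
    return np.abs(p_clean - p_noisy).mean()   # mean prediction shift
```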
","memorization, label noise, generalization, unlabeled data, deep learning" CCIL: Context-conditioned imitation learning for urban driving,https://openreview.net/forum?id=n-d5xFHrk4,https://openreview.net/pdf?id=n-d5xFHrk4,,"Imitation learning is a promising solution to the challenging autonomous urban driving task as experienced human drivers can effortlessly tackle highly complex driving scenarios. Behavior cloning is the most widely applied imitation learning approach in autonomous driving due to its exemption from potentially risky online interactions, but it suffers from the covariate shift issue. To mitigate this problem, we propose a context-conditioned imitation learning approach that learns a policy to map the context state into the ego vehicle's state instead of the typical formulation from both ego and context state to the ego action. Besides, to make full use of the spatial and temporal relations in the context to infer the ego future states, we design a novel policy network based on the Transformer, whose attention mechanism has demonstrated excellent performance in capturing relations. Finally, during evaluation, a linear quadratic controller is employed to produce smooth planning based on the predicted states from the policy network. Experiments on the real-world large-scale Lyft and nuPlan datasets demonstrate that our method can surpass the state-of-the-art method significantly. ", Improving Continual Learning by Accurate Gradient Reconstructions of the Past,https://openreview.net/forum?id=BVaytYu5Yj,https://openreview.net/pdf?id=BVaytYu5Yj,"We propose a new, principled yet practical continual learning method that combines the complementary benefits of function-regularisation, weight-regularisation and experience replay.","Knowledge reuse is essential for continual learning, and current methods attempt to realize it through regularization or experience replay. These two strategies have complementary strengths, e.g., regularization methods are compact, but replay methods can mimic batch training more accurately. At present, little has been done to find principled ways to combine the two methods and current heuristics can give suboptimal performance. Here, we provide a principled approach to combine and improve them by using a recently proposed principle of adaptation, where the goal is to reconstruct the “gradients of the past”, i.e., to mimic batch training by estimating gradients from past data. Using this principle, we design a prior that provably gives better gradient reconstructions by utilizing two types of replay and a quadratic weight-regularizer. This improves performance on standard benchmarks such as Split CIFAR, Split TinyImageNet, and ImageNet-1000. Our work shows that a good combination of replay and regularizer-based methods can be very effective in reducing forgetting, and can sometimes even completely eliminate it.", Revisiting Dense Retrieval with Unaswerable Counterfactuals,https://openreview.net/forum?id=e8MaU4BNVA,https://openreview.net/pdf?id=e8MaU4BNVA,,"The retriever-reader framework is popular for open-domain question answering (ODQA), where a retriever samples for the reader a set of relevant candidate passages from a large corpus. A key assumption behind this method is that high relevance score from the retriever likely indicates high answerability from the reader, which implies a high probability that the retrieved passages contain answers to a given question. 
In this work, we empirically dispel this belief and observe that recent dense retrieval models based on DPR often rank unanswerable counterfactual passages higher than their answerable original passages. To address such answer-unawareness in dense retrievers, we seek to use counterfactual samples as additional training resources to better synchronize the relevance measurement of DPR with the answerability of question-passage pairs. Specifically, we present Counterfactually-Pivoting Contrastive Learning (PiCL), a novel representation learning approach for passage retrieval that leverages counterfactual samples as pivots between positive and negative samples in their learned embedding space. We incorporate PiCL into retriever training to show the effectiveness of PiCL on ODQA benchmarks and the robustness of the learned models.","Open-domain Question Answering, Text Retrieval" Group-wise Verifiable Distributed Computing for Machine Learning under Adversarial Attacks,https://openreview.net/forum?id=N6NO4o_b5r,https://openreview.net/pdf?id=N6NO4o_b5r,This paper tackles adversarial attack and straggler effect in distributed computing by proposing Group-wise Verifiable Coded Computing. ,"Distributed computing has been a promising solution in machine learning to accelerate the training procedure on large-scale datasets by utilizing multiple workers in parallel. However, two major issues remain to be addressed: i) adversarial attacks from malicious workers, and ii) the effect of slow workers known as stragglers. In this paper, we tackle both problems simultaneously by proposing Group-wise Verifiable Coded Computing (GVCC), which leverages coding techniques and group-wise verification to provide robustness to adversarial attacks and resiliency to straggler effects in distributed computing. The key idea of GVCC is to verify a group of computation results from workers at a time, while providing resilience to stragglers by encoding the tasks assigned to workers with Group-wise Verifiable Codes. Experimental results show that GVCC outperforms the existing methods in terms of overall processing time and verification time for executing matrix multiplication, which is a key computational component in machine learning and deep learning. ","Adversarial attack, Verifiable computing, Distributed Computing, Coded computing" Extending graph transformers with quantum computed aggregation,https://openreview.net/forum?id=241s3NHjxc,https://openreview.net/pdf?id=241s3NHjxc,A new Graph Neural Network architecture where the aggregation weights are computed with a quantum computer.,"Recently, efforts have been made in the community to design new Graph Neural Networks (GNNs), as limitations of Message Passing Neural Networks became more apparent. This led to the appearance of Graph Transformers using global graph features such as Laplacian Eigenmaps. In our paper, we introduce a GNN architecture where the aggregation weights are computed using the long-range correlations of a quantum system. These correlations are generated by translating the graph topology into the interactions of a set of qubits in a quantum computer. The recent development of quantum processing units enables the computation of a new family of global graph features that would be otherwise out of reach for classical hardware. We give some theoretical insights about the potential benefits of this approach, and benchmark our algorithm on standard datasets.
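A classically simulated stand-in for the idea in the entry above: couple one spin per node along the graph's edges and use thermal two-point correlations as aggregation weights. A real quantum processor would produce such correlations from qubit dynamics; the classical Ising Gibbs state here is an illustrative assumption.

```python
import numpy as np
from itertools import product

def ising_correlations(adj, beta=1.0):
    """Illustrative classical stand-in for quantum-computed aggregation
    weights: treat each node as a spin coupled along edges, and return the
    absolute thermal two-point correlations <s_i s_j> under a Gibbs state.
    (Exact enumeration; only feasible for tiny graphs.)"""
    n = adj.shape[0]
    spins = np.array(list(product([-1, 1], repeat=n)))      # all 2^n states
    energy = -0.5 * np.einsum('ki,ij,kj->k', spins, adj, spins)
    p = np.exp(-beta * energy); p /= p.sum()                # Gibbs weights
    corr = np.einsum('k,ki,kj->ij', p, spins, spins)        # <s_i s_j>
    return np.abs(corr)

rng = np.random.default_rng(0)
adj = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], float)  # path graph
W = ising_correlations(adj)        # long-range weights that decay with distance
X = rng.normal(size=(4, 8))        # toy node features
X_agg = W @ X                      # aggregation step of a toy GNN layer
```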
Although not suited to every dataset, our model performs comparably to standard GNN architectures, and paves a promising path for quantum-enhanced GNNs.","graph neural networks, graph representation learning, quantum computing, graph transformers" Policy-Based Self-Competition for Planning Problems,https://openreview.net/forum?id=SmufNDN90G,https://openreview.net/pdf?id=SmufNDN90G,Solving deterministic single-agent problems through self-competition by including a historical policy in the planning process of Gumbel AlphaZero.,"AlphaZero-type algorithms may stop improving on single-player tasks when the value network guiding the tree search is unable to approximate the exact outcome of an episode sufficiently well. One technique to address this problem is transforming the single-player task through self-competition. The main idea is to compute a scalar baseline from the agent’s historical performances and to reshape an episode’s reward into a binary output, indicating whether the baseline has been exceeded or not. However, this baseline only carries limited information for the agent about how to improve. We leverage the idea of self-competition and directly incorporate a historical policy into the planning process instead of its scalar performance. Based on the recently introduced Gumbel AlphaZero (GAZ), we propose our algorithm GAZ ‘Play-to-Plan’ (GAZ PTP), in which the agent learns to find strong trajectories by planning against possible strategies of its past self. We show the effectiveness of our approach in two well-known combinatorial optimization problems, the Traveling Salesman Problem and the Job-Shop Scheduling Problem. With only half of the simulation budget for search, GAZ PTP consistently outperforms all selected single-player variants of GAZ.","reinforcement learning, alphazero, self-competition, self-critical, gumbel, mcts" Can Fair Federated Learning reduce the need for personalization?,https://openreview.net/forum?id=1yaLQb4mIl,https://openreview.net/pdf?id=1yaLQb4mIl,"This work evaluates Q-Fair Federated Learning as an alternative to personalization, we find that it does not satisfactorily improve local federated model performance and propose an approach based on Knowledge Distillation offering favourable results.","Federated Learning (FL) allows edge devices to collaboratively train machine learning models without sharing local data. Since the data distribution varies across client partitions, the performance of the federated model on local data also varies. To solve this, fair FL approaches attempt to reduce the accuracy disparity between local partitions by emphasizing clients with larger losses during training; while local adaptation personalizes the federated model by re-training on local data to provide a device participation incentive in cases where a federated model underperforms relative to one trained locally---their accuracy difference is less than zero. This paper evaluates Q-Fair Federated Learning (Q-FFL) in this relative domain and determines whether it provides a better starting point for personalization or supplants it. Contrary to expectation, Q-FFL does not significantly reduce the number of underperforming clients in a language task, while doubling their number in an image recognition task. Furthermore, fairness levels which maintain average accuracy provide no benefit to relative accuracy in federated or adapted models.
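For reference alongside the discussion above, the q-FFL objective (as formulated in the original q-FFL paper by Li et al., assumed here) makes the "emphasize clients with larger losses" mechanism explicit.

```python
import numpy as np

def qffl_objective(client_losses, p, q):
    """q-FFL objective, f_q(w) = sum_k p_k * F_k(w)^(q+1) / (q+1), following
    Li et al.'s formulation (an assumption; the entry above does not restate
    it). q = 0 recovers the standard weighted FL objective; larger q shifts
    emphasis toward high-loss clients, trading average accuracy for uniformity."""
    F = np.asarray(client_losses, dtype=float)
    return np.sum(p * F ** (q + 1)) / (q + 1)

losses = np.array([0.2, 0.4, 1.5])            # the last client underperforms
p = np.ones(3) / 3                            # equal client weights
for q in [0.0, 1.0, 5.0]:
    print(q, qffl_objective(losses, p, q))    # high-loss client dominates as q grows
```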
We postulate that Q-FFL is unsuitable for our goal since clients with highly accurate local models require the federated model to have a disproportionate local partition accuracy to receive a benefit. Instead, we propose using knowledge distillation during FL training to create models with a higher local accuracy floor without forfeiting the ceiling. Our preliminary evaluation shows a 50% reduction in underperforming clients in the language task with no increase in underperforming clients for the image task. Thus, we argue that this simple change represents a more promising avenue for reducing the need for personalization than fairness.","Federated Learning, Fair Federated Learning, FL, Fair FL, Local Adaptation, Personalization, Machine Learning, ML, Deep Learning, DL, Distributed Machine Learning" Efficient Out-of-Distribution Detection based on In-Distribution Data Patterns Memorization with Modern Hopfield Energy,https://openreview.net/forum?id=KkazG4lgKL,https://openreview.net/pdf?id=KkazG4lgKL,"We propose a novel out-of-distribution detection method motivated by Modern Hopfield Energy, and further derive a simplified version that is effective, efficient and hyperparameter-free.","Out-of-Distribution (OOD) detection is essential for safety-critical applications of deep neural networks. OOD detection is challenging since DNN models tend to produce very high logit values even for OOD samples. Hence, it is difficult to discriminate OOD data by directly adopting Softmax on output logits as the confidence score. Unlike existing OOD methods that refine the confidence estimation procedure from output logits with handpicked hyperparameters, we propose a new store-then-compare paradigm. In more detail, penultimate layer outputs on the training set are considered the representation of in-distribution (ID) data. Thus they can be transformed into stored patterns that serve as anchors to measure the discrepancy of unseen data for OOD detection. Starting from an energy function defined in the Modern Hopfield Network for the discrepancy score calculation, we derive a simplified version SHE with theoretical analysis. In SHE, we only utilize one stored pattern to represent each class, and these patterns can be obtained by simply averaging the penultimate layer outputs of training samples within this class. SHE has the advantages of being hyperparameter-free and computationally efficient. Evaluations on nine widely-used OOD datasets show the promising performance of such a simple yet effective approach and its superiority over state-of-the-art models. ","Out-of-Distribution detection, Hopfield Energy, Hyperparameter-Free" Conditional Policy Similarity: An Overlooked Factor in Zero-Shot Coordination,https://openreview.net/forum?id=J9p5s5jwna,https://openreview.net/pdf?id=J9p5s5jwna,,"Multi-Agent Reinforcement Learning (MARL) in cooperative tasks usually follows the self-play setting, where agents are trained by playing with a fixed group of agents. However, in the face of Zero-Shot Coordination (ZSC), where an agent must coordinate with unseen partners, self-play agents may fail. ZSC performance is traditionally measured by cross-play, where individually trained agents are required to play with each other. However, the cross-play score varies widely across combinations of agents, so a model's cross-play score averaged over several partners is not reliable enough on its own to evaluate its ZSC performance.
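The SHE entry above describes storing one pattern per class (the class mean of penultimate-layer features) and scoring test samples against them. A minimal sketch of that store-then-compare paradigm follows; taking the simplified energy to be an inner product with the predicted class's pattern is an assumption about the exact simplification and sign convention.

```python
import numpy as np

def she_scores(feats_train, labels_train, feats_test, logits_test):
    """Sketch of a SHE-style store-then-compare scorer: store the per-class
    mean of penultimate features as patterns, then score each test sample by
    its inner product with the pattern of its predicted class. Higher scores
    suggest in-distribution; threshold for OOD detection."""
    classes = np.unique(labels_train)
    patterns = np.stack([feats_train[labels_train == c].mean(0)
                         for c in classes])          # one stored pattern per class
    pred = logits_test.argmax(1)                     # predicted class per sample
    return np.einsum('nd,nd->n', feats_test, patterns[pred])
```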
We hypothesize that this is because the cross-play score is strongly related to the similarity between an agent's training partner and its ZSC partner, and this similarity varies widely. Therefore, we define the Conditional Policy Similarity between an agent's Training partner and Testing partner (CPSTT) and conduct extensive experiments confirming a strong linear correlation between CPSTT and cross-play score. Based on this, we propose a new criterion to evaluate ZSC performance: a model is considered better if it has a higher cross-play score compared to another model given the same CPSTT. Furthermore, we put forward a Similarity-Based Robust Training (SBRT) scheme that improves agents' ZSC performance by perturbing their partners' actions during training according to a pre-defined CPSTT value. We apply SBRT to four MARL frameworks and their ZSC performance improves whether measured by the traditional criterion or ours.","multi-agent reinforcement learning, zero-shot coordination, conditional policy similarity" Pareto-Efficient Decision Agents for Offline Multi-Objective Reinforcement Learning,https://openreview.net/forum?id=Ki4ocDm364,https://openreview.net/pdf?id=Ki4ocDm364,We introduce new dataset & benchmarks and propose new algorithms for offline Multi-Objective Reinforcement Learning (MORL),"The goal of multi-objective reinforcement learning (MORL) is to learn policies that simultaneously optimize multiple competing objectives. In practice, an agent's preferences over the objectives may not be known a priori, and hence, we require policies that can generalize to arbitrary preferences at test time. In this work, we propose a new data-driven setup for offline MORL, where we wish to learn a preference-agnostic policy using only a finite dataset of offline demonstrations of other agents and their preferences. The key contributions of this work are two-fold. First, we introduce D4MORL, (D)atasets for MORL that are specifically designed for offline settings. It contains 1.8 million annotated demonstrations obtained by rolling out reference policies that optimize for randomly sampled preferences on 6 MuJoCo environments with 2-3 objectives each. Second, we propose Pareto-Efficient Decision Agents (PEDA), a family of offline MORL algorithms that builds and extends Decision Transformers via a novel preference-and-return-conditioned policy. Empirically, we show that PEDA closely approximates the behavioral policy on the D4MORL benchmark and provides an excellent approximation of the Pareto-front with appropriate conditioning, as measured by the hypervolume and sparsity metrics. ","Reinforcement Learning, Offline Reinforcement Learning, Multi-Objective Reinforcement Learning, Decision Transformer, Sequential Decision Making" Learning from Asymmetrically-corrupted Data in Regression for Sensor Magnitude,https://openreview.net/forum?id=1ehuYMrigt,https://openreview.net/pdf?id=1ehuYMrigt,This paper addresses a regression problem for sensor magnitude in which a low value of labels can also mean incomplete observation. We derive an unbiased learning algorithm with a regression learned from data without incomplete observations.,"This paper addresses a regression problem in which output label values represent the results of sensing the magnitude of a phenomenon. A low label value can mean either that the actual magnitude of the phenomenon has been low or that the sensor has made an incomplete observation.
This biases the labels, and hence the learned model, toward lower values, because labels from incomplete observations are recorded as lower than those from typical observations, even when both have monitored similar phenomena. Moreover, because an incomplete observation does not provide any tags indicating its incompleteness, such samples can neither be removed nor imputed. To address this issue, we propose a learning algorithm that explicitly models the incomplete observations as corrupted with an asymmetric noise that always has a negative value. We show that our algorithm is unbiased with respect to a regression learned from uncorrupted data free of incomplete observations. We demonstrate the advantages of our algorithm through numerical experiments.","regression, sensor data analytics, healthcare" NAGphormer: A Tokenized Graph Transformer for Node Classification in Large Graphs,https://openreview.net/forum?id=8KYeilT3Ow,https://openreview.net/pdf?id=8KYeilT3Ow,We propose a novel Graph Transformer that utilizes the neighborhood aggregation of multiple hops to build the input sequence of token vectors and thereby can handle large graphs efficiently.,"The graph Transformer emerges as a new architecture and has shown superior performance on various graph mining tasks. In this work, we observe that existing graph Transformers treat nodes as independent tokens and construct a single long sequence composed of all node tokens so as to train the Transformer model, making it hard to scale to large graphs due to the quadratic complexity of self-attention in the number of nodes. To this end, we propose a Neighborhood Aggregation Graph Transformer (NAGphormer) that treats each node as a sequence containing a series of tokens constructed by our proposed Hop2Token module. For each node, Hop2Token aggregates the neighborhood features from different hops into different representations and thereby produces a sequence of token vectors as one input. In this way, NAGphormer can be trained in a mini-batch manner and thus scales to large graphs. Moreover, we mathematically show that compared to a category of advanced Graph Neural Networks (GNNs), the decoupled Graph Convolutional Networks, NAGphormer can learn more informative node representations from multi-hop neighborhoods. Extensive experiments on benchmark datasets from small to large are conducted to demonstrate that NAGphormer consistently outperforms existing graph Transformers and mainstream GNNs. ","Graph Transformer, node classification, neighborhood aggregation, multi-hop neighborhood" Bayesian Oracle for bounding information gain in neural encoding models,https://openreview.net/forum?id=iYC5hOMqUg,https://openreview.net/pdf?id=iYC5hOMqUg,We provide a method to obtain upper bounds of information gain in order to evaluate neural encoding models.,"In recent years, deep learning models have set new standards in predicting neural population responses. Most of these models currently focus on predicting the mean response of each neuron for a given input. However, neural variability around this mean is not just noise and plays a central role in several theories on neural computation. To capture this variability, we need models that predict full response distributions for a given stimulus. However, to measure the quality of such models, commonly used correlation-based metrics are not sufficient as they mainly care about the mean of the response distribution.
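The Hop2Token module described in the NAGphormer entry above reduces to a few lines: aggregate features over 0..K hops and stack the results as a per-node token sequence. A sketch follows; the symmetric normalization is an assumption, as the entry does not specify one.

```python
import numpy as np

def hop2token(adj, X, K):
    """Hop2Token-style construction: propagate features with a normalized
    adjacency for K hops and stack the per-hop aggregates, yielding a
    (K+1)-token sequence per node that a Transformer can consume in
    mini-batches without whole-graph self-attention."""
    deg = adj.sum(1)
    A_hat = adj / np.maximum(np.sqrt(np.outer(deg, deg)), 1e-12)  # D^-1/2 A D^-1/2
    tokens, H = [X], X
    for _ in range(K):
        H = A_hat @ H                 # one more hop of neighborhood aggregation
        tokens.append(H)
    return np.stack(tokens, axis=1)   # shape: (num_nodes, K+1, feat_dim)

rng = np.random.default_rng(0)
adj = (rng.random((6, 6)) < 0.4).astype(float); adj = np.triu(adj, 1); adj += adj.T
seqs = hop2token(adj, rng.normal(size=(6, 16)), K=3)   # 4 tokens per node
```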
An interpretable alternative evaluation metric for likelihood-based models is \textit{Information Gain} (IG), which evaluates the likelihood of a model relative to a lower and upper bound. However, while a lower bound is usually easy to obtain, constructing an upper bound turns out to be challenging for neural recordings with relatively low numbers of repeated trials, high (shared) variability, and sparse responses. In this work, we generalize the jack-knife oracle estimator for the mean---commonly used for correlation metrics---to a flexible Bayesian oracle estimator for IG based on posterior predictive distributions. We describe and address the challenges that arise when estimating the lower and upper bounds from small datasets. We then show that our upper bound estimate is data-efficient and robust even in the case of sparse responses and low signal-to-noise ratio. We further provide the derivation of the upper bound estimator for a variety of common distributions including the state-of-the-art zero-inflated mixture models, and relate IG to common mean-based metrics. Finally, we use our approach to evaluate such a mixture model, resulting in $90\%$ IG performance.","information theory, evaluation metrics, Bayesian, Neuroscience, encoding models" Near Optimal Private and Robust Linear Regression,https://openreview.net/forum?id=wPVw218szF,https://openreview.net/pdf?id=wPVw218szF,We provide a private gradient descent method with adaptive clipping that achieves a near-optimal error rate and robustness against label noise.,"We study the canonical statistical estimation problem of linear regression from $n$ i.i.d. examples under $(\varepsilon,\delta)$-differential privacy when a fraction of response variables are adversarially corrupted. We propose a variant of the popular differentially private stochastic gradient descent (DP-SGD) algorithm with two innovations: a full-batch gradient descent to improve sample complexity and a novel adaptive clipping to guarantee robustness. When there is no adversarial corruption, this algorithm improves upon the existing state-of-the-art approach and achieves near optimal sample complexity. Under label-corruption, this is the first efficient linear regression algorithm to provably guarantee both $(\varepsilon,\delta)$-DP and robustness. Synthetic experiments confirm the superiority of our approach. ","differential privacy, private estimation, linear regression, label corruption" From Distance to Dependency: A Paradigm Shift of Full-reference Image Quality Assessment,https://openreview.net/forum?id=NxnqA1iKeT,https://openreview.net/pdf?id=NxnqA1iKeT,"Beyond distance measure, we propose the first Deep Image Dependency (DID) based full-reference image quality assessment model to capture transformation-invariant texture perception.","Deep learning-based full-reference image quality assessment (FR-IQA) models typically rely on the feature distance between the reference and distorted images. However, these models' underlying assumption that distance in the deep feature domain quantifies quality degradation does not align with invariant texture perception, especially when the images are generated artificially by neural networks. In this paper, we bring a radical shift to inferring quality with learned features and propose the Deep Image Dependency (DID) based FR-IQA model.
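For the private and robust linear regression entry above, the two ingredients (full-batch gradients and adaptive clipping) can be sketched as follows. The quantile-based clipping rule and noise calibration below are placeholders, not the paper's calibrated scheme, and the privacy accounting is omitted.

```python
import numpy as np

def dp_gd_linear(X, y, T=100, lr=0.1, noise_mult=1.0, seed=0):
    """Hedged sketch of private full-batch gradient descent for linear
    regression: clip per-example gradients, average, and add Gaussian noise.
    The median-based adaptive clip level is an illustrative assumption."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        residual = X @ w - y
        g = X * residual[:, None]                       # per-example gradients
        norms = np.linalg.norm(g, axis=1)
        C = np.quantile(norms, 0.5)                     # adaptive clip level
        g = g * np.minimum(1.0, C / np.maximum(norms, 1e-12))[:, None]
        noisy = g.mean(0) + rng.normal(scale=noise_mult * C / n, size=d)
        w -= lr * noisy
    return w
```

Because corrupted labels produce outsized residuals, their gradients are the ones the clipping suppresses, which is the intuition linking the privacy mechanism to robustness.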
The feature dependency facilitates comparisons of deep learning features in a high-order manner with Brownian distance covariance, which is characterized by the joint distribution of the features from reference and test images, as well as their marginal distributions. This enables the quantification of feature dependency under nonlinear transformations, going far beyond the computation of numerical errors in the feature space. Experiments on image quality prediction, texture image similarity, and geometric invariance validate the appealing performance of our proposed measure, and the implementation will be publicly available.","Image quality assessment, brownian distance covariance, distance dependency" Siamese-NAS: Using Trained Samples Efficiently to Find Lightweight Neural Architecture by Prior Knowledge,https://openreview.net/forum?id=lVltX5KwODu,https://openreview.net/pdf?id=lVltX5KwODu,The proposed Siamese-Predictor finds lightweight neural architectures using a few trained samples and prior knowledge.,"In the past decade, many convolutional neural network architectures, such as VGG16, ResNet, and DenseNet, were designed by hand, each achieving state-of-the-art performance on different tasks in its time. However, handcrafted design relies on human intuition and experience and requires extensive trial and error. Neural Architecture Search (NAS) addresses this issue. In recent works, neural predictors have improved significantly while using only a few trained architectures as training samples, although the sampling cost remains considerable. In this paper, we propose the Siamese-Predictor, inspired by past work on predictor-based NAS. It is constructed with the proposed Estimation Code, prior knowledge about the training procedure, and gains significant benefits from this idea, surpassing the current SOTA predictor on NASBench-201. To explore the impact of the Estimation Code, we analyze its relationship with accuracy. We also propose the search space Tiny-NanoBench for lightweight CNN architectures; compared with NASBench-201, this well-designed search space makes it easier to find better architectures with few FLOPs. In summary, the proposed Siamese-Predictor is a predictor-based NAS approach that achieves SOTA-level performance, especially under limited computation budgets. Applied to the proposed Tiny-NanoBench, it can find extremely lightweight CNN architectures using only a few trained samples.","predictor-based NAS, prior knowledge, sampling efficiency, lightweight CNN architecture." Inverse Learning with Extremely Sparse Feedback for Recommendation,https://openreview.net/forum?id=_izzMPiE1y,https://openreview.net/pdf?id=_izzMPiE1y,We propose inverse learning with inverse dual loss and inverse gradient to annotate the unlabeled data and achieve denoising augmentation from both positive and negative perspectives.,"Negative sampling is widely used in modern recommender systems, where negative data is randomly sampled from the whole item pool. However, such a strategy often introduces false-positive noise. Existing approaches to denoising recommendation mainly focus on positive instances while ignoring the noise in the large amount of sampled negative feedback. In this paper, we propose a meta learning method to annotate the unlabeled data from loss and gradient perspectives, which considers the noise on both positive and negative instances.
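The Brownian distance covariance that the DID entry above builds on has a compact sample estimator (Székely et al.): double-center the two pairwise distance matrices and average their elementwise product. A minimal sketch, with random feature matrices standing in for reference and test embeddings:

```python
import numpy as np

def distance_covariance(X, Y):
    """Sample (Brownian) distance covariance of two feature sets: the
    square root of the mean elementwise product of their double-centered
    pairwise distance matrices. Captures nonlinear dependency, unlike a
    plain feature-space error."""
    def centered(D):
        return D - D.mean(0, keepdims=True) - D.mean(1, keepdims=True) + D.mean()
    A = centered(np.linalg.norm(X[:, None] - X[None], axis=-1))
    B = centered(np.linalg.norm(Y[:, None] - Y[None], axis=-1))
    return np.sqrt(np.maximum((A * B).mean(), 0.0))

rng = np.random.default_rng(0)
f_ref, f_dist = rng.normal(size=(64, 16)), rng.normal(size=(64, 16))
print(distance_covariance(f_ref, f_dist))      # lower: independent features
print(distance_covariance(f_ref, f_ref ** 2))  # higher: nonlinear dependency
```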
Specifically, we first propose $\textit{inverse dual loss}$ (IDL) to boost true-label learning and suppress false-label learning, based on the loss of unlabeled data with respect to true and false labels during training. To achieve more robust sampling on hard instances, we further propose $\textit{inverse gradient}$ (IG) to explore the correct update direction and adjust the update based on meta learning. We conduct extensive experiments on a benchmark and an industrially collected dataset where our proposed method can significantly improve AUC by $9.25\%$ against state-of-the-art methods. Further analysis verifies that the proposed inverse learning is model-agnostic and annotates labels well when combined with different recommendation backbones. The source code along with the best hyper-parameter settings is available at this link: https://anonymous.4open.science/r/InverseLearning-4F4F. ","Recommender System, Unlabeled Data, Denoising Training" Instance-Specific Augmentation: Capturing Local Invariances,https://openreview.net/forum?id=kAx_rZtFbY,https://openreview.net/pdf?id=kAx_rZtFbY,,"We introduce InstaAug, a method for automatically learning input-specific augmentations from data. Previous data augmentation methods have generally assumed independence between the original input and the transformation applied to that input. This can be highly restrictive, as the invariances that the augmentations are based on are themselves often highly input dependent; e.g., we can change a leaf from green to yellow while maintaining its label, but not a lime. InstaAug instead allows for input dependency by introducing an invariance module that maps inputs to tailored transformation distributions. It can be simultaneously trained alongside the downstream model in a fully end-to-end manner, or separately learned for a pre-trained model. We empirically demonstrate that InstaAug learns meaningful input-dependent augmentations for a wide range of transformation classes, which in turn provides better performance on both supervised and self-supervised tasks.","inductive bias, data augmentation, invariance" Spectral Subgraph Localization,https://openreview.net/forum?id=TS_VsCpuWr,https://openreview.net/pdf?id=TS_VsCpuWr,We localize a subgraph Q in a graph G by manipulating their Laplacian spectra.,"Several graph mining problems are based on some variant of the subgraph isomorphism problem: Given two graphs, G and Q, does G contain a subgraph isomorphic to Q? As this problem is NP-hard, many methods avoid addressing it explicitly. In this paper, we propose a method that solves the problem by localizing, i.e., finding the position of, Q in G, by means of an alignment among graph spectra. Finding a node correspondence from Q to G thereafter is relegated to a separate task, as an instance of the graph alignment problem. We demonstrate that our spectral approach outperforms a baseline based on the state-of-the-art method for graph alignment in terms of accuracy on real graphs and scales to hundreds of nodes as no other method does.","subgraph isomorphism, subgraph localization" Dynamical Signatures of Learning in Recurrent Networks,https://openreview.net/forum?id=8JEpyIgQS0t,https://openreview.net/pdf?id=8JEpyIgQS0t,"Self-organized learning of temporal sequences results in subcritical dynamics, which we propose is a signature of specialization. 
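For the Spectral Subgraph Localization entry above, a brute-force toy makes the spectral idea tangible: score candidate node sets by how well the Laplacian spectrum of their induced subgraph matches Q's. The paper aligns spectra directly rather than enumerating candidates, so this is only an illustration of the scoring principle.

```python
import numpy as np
from itertools import combinations

def laplacian_spectrum(adj):
    """Sorted eigenvalues of the combinatorial Laplacian L = D - A."""
    L = np.diag(adj.sum(1)) - adj
    return np.sort(np.linalg.eigvalsh(L))

def localize_by_spectrum(adj_G, adj_Q):
    """Toy localization: exhaustively score every |V(Q)|-sized node subset
    of G by the distance between its induced subgraph's Laplacian spectrum
    and Q's spectrum; return the best-matching subset."""
    k = adj_Q.shape[0]
    target = laplacian_spectrum(adj_Q)
    best, best_score = None, np.inf
    for nodes in combinations(range(adj_G.shape[0]), k):
        sub = adj_G[np.ix_(nodes, nodes)]
        score = np.linalg.norm(laplacian_spectrum(sub) - target)
        if score < best_score:
            best, best_score = nodes, score
    return best
```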
","Recurrent neural networks (RNNs) are powerful computational tools that operate best near the edge of chaos, where small perturbations in neuronal firing are transmitted between neurons with minimal amplification or loss. In this article, we depart from the observation that both stimulus and noise can be seen as perturbations to the intrinsic dynamics of a recurrent network, however stimulus information must be reliably preserved, while noise must be discarded. First, we show that self-organizing recurrent networks (SORNs) that learn the spatio-temporal structure of their inputs, increase their recurrent memory by preferentially propagating the relevant stimulus-specific structured signals, while becoming more robust to random perturbation. We find that the computational advantages gained through self-supervised learning are accompanied by a shift from critical to ordered dynamics, and that this dynamical shift varies with the structure of the stimulus. Next, we show that SORNs with subcritical dynamics can outperform their random RNNs counterparts with critical dynamics, on a range of tasks, including a temporal MNIST and a sequential shape-rotation task. Interestingly, when a shape is rotated, both the invariant (shape) and the variant (motion direction) aspects of the stimulus sequence are improved through learning in the subcritical SORNs. We propose that the shift in criticality is a signature of specialization and we expect it to be found in all cases in which general-purpose recurrent networks acquire self-correcting properties by internalizing the statistical structure of their inputs.","RNNs, self-organization, criticality, spatio-temporal dynamics" Shifts 2.0: Extending The Dataset of Real Distributional Shifts,https://openreview.net/forum?id=5RSq86IM6mE,https://openreview.net/pdf?id=5RSq86IM6mE,We introduce two new datasets into the Shifts Benchmark for assessing robustness and uncertainty - Multiple Sclerosis Lesion Segmentation in MRI images and Cargo Vessel Power Consumption Prediction,"Distributional shift, or the mismatch between training and deployment data, is a significant obstacle to the usage of machine learning in high-stakes industrial applications, such as autonomous driving and medicine. This creates a need to be able to assess how robustly ML models generalize as well as the quality of their uncertainty estimates. Standard ML datasets do not allow these properties to be assessed, as the training, validation and test data are often identically distributed. Recently, a range of dedicated benchmarks have appeared, featuring both distributionally matched and shifted data. The Shifts dataset stands out in terms of the diversity of tasks and data modalities it features. Unlike most benchmarks, which are dominated by 2D image data, Shifts contains tabular weather forecasting, machine translation, and vehicle motion prediction tasks. This enables models to be assessed on a diverse set of industrial-scale tasks and either universal or directly applicable task-specific conclusions to be reached. In this paper, we extend the Shifts Dataset with two datasets sourced from industrial, high-risk applications of high societal importance. Specifically, we consider the tasks of segmentation of white matter Multiple Sclerosis lesions in 3D magnetic resonance brain images and the estimation of power consumption in marine cargo vessels. Both tasks feature ubiquitous distributional shifts and strict safety requirements due to the high cost of errors. 
These new datasets will allow researchers to explore robust generalization and uncertainty estimation in new situations. This work provides a description of the datasets and baseline results for both tasks.","Distributional Shift, Uncertainty Estimation, Benchmark, MRI 3D segmentation, medical data, industrial tabular data" $\Lambda$-DARTS: Mitigating Performance Collapse by Harmonizing Operation Selection among Cells,https://openreview.net/forum?id=oztkQizr3kk,https://openreview.net/pdf?id=oztkQizr3kk,"We pinpoint the reason for performance collapse in DARTS and provide theoretical and empirical analysis of it, as well as a solution that remedies the performance collapse by harmonizing the decisions of different cells.","Differentiable neural architecture search (DARTS) is a popular method for neural architecture search (NAS), which performs cell-search and utilizes continuous relaxation to improve the search efficiency via gradient-based optimization. The main shortcoming of DARTS is performance collapse, where the discovered architecture suffers from a pattern of declining quality during search. Performance collapse has become an important topic of research, with many methods trying to solve the issue through either regularization or fundamental changes to DARTS. However, the weight-sharing framework used for cell-search in DARTS and the convergence of architecture parameters have not been analyzed yet. In this paper, we provide a thorough and novel theoretical and empirical analysis of DARTS and its point of convergence. We show that DARTS suffers from a specific structural flaw due to its weight-sharing framework that limits its convergence to saturation points of the softmax function. This point of convergence gives an unfair advantage to layers closer to the output in choosing the optimal architecture, causing performance collapse. We then propose two new regularization terms that aim to prevent performance collapse by harmonizing operation selection via aligning gradients of layers. Experimental results on six different search spaces and three different datasets show that our method ($\Lambda$-DARTS) does indeed prevent performance collapse, providing justification for our theoretical analysis and the proposed remedy.", Prototypical Context-aware Dynamics Generalization for High-dimensional Model-based Reinforcement Learning,https://openreview.net/forum?id=OcHQVmfLn2c,https://openreview.net/pdf?id=OcHQVmfLn2c,We present a prototypical Context-aware Dynamics (ProtoCAD) model to capture the local dynamics via temporally consistent latent context.,"The latent world model provides a promising way to learn policies in a compact latent space for tasks with high-dimensional observations; however, its generalization across diverse environments with unseen dynamics remains challenging. Although the recurrent structure utilized in current advances helps to capture local dynamics, modeling only state transitions without an explicit understanding of environmental context limits the generalization ability of the dynamics model. To address this issue, we propose a Prototypical Context-aware Dynamics (ProtoCAD) model, which captures the local dynamics by temporally consistent latent context and enables dynamics generalization in high-dimensional control tasks.
ProtoCAD extracts useful contextual information with the help of the prototypes clustered over batch and benefits model-based RL in two ways: 1) It utilizes a temporally consistent prototypical regularizer that encourages the prototype assignments produced for different time parts of the same latent trajectory to be temporally consistent, instead of comparing features directly; 2) A context representation is designed that combines both the projection embedding of latent states and aggregated prototypes and can significantly improve the dynamics generalization ability. Extensive experiments show that ProtoCAD surpasses existing methods in terms of dynamics generalization. Compared with the recurrent-based model RSSM, ProtoCAD delivers 13.2% and 26.7% better mean and median performance across all dynamics generalization tasks.","model-based reinforcement learning, dynamics generalization, prototypical representation learning, latent world model" Efficient Hyperparameter Optimization Through Tensor Completion,https://openreview.net/forum?id=Ivkh2_UdL9O,https://openreview.net/pdf?id=Ivkh2_UdL9O,An approach for hyperparameter optimization based on tensor completion methods.,"Hyperparameter optimization is a prerequisite for state-of-the-art performance in machine learning, with current strategies including Bayesian optimisation, Hyperband, and evolutionary methods. Whereas such methods have been shown to improve performance, none of these is designed to explicitly take advantage of the underlying data structure. To this end, we introduce a completely different approach for hyperparameter optimization, based on low-rank tensor completion. This is achieved by first forming a multi-dimensional tensor which comprises performance scores for different combinations of hyperparameters. Based on the realistic underlying assumption that the so-formed tensor has a low rank structure, this then allows for reliable estimates of the unobserved validation scores of combinations of hyperparameters to be obtained through tensor completion, and from only a fraction of known elements. Through extensive experimentation on various datasets and learning models, the proposed method is shown to exhibit competitive or superior performance to the state-of-the-art hyperparameter optimization strategies. Distinctive advantages of the proposed method include its ability to simultaneously handle any hyperparameter type (e.g., kind of optimizer, number of neurons, number of layers, etc.), its relative simplicity compared to the competing methods, as well as the ability to suggest multiple optimal combinations of hyperparameters. ","hyperparameter optimization, tensor completion" Learning Vortex Dynamics for Fluid Inference and Prediction,https://openreview.net/forum?id=nYWqxUwFc3x,https://openreview.net/pdf?id=nYWqxUwFc3x,,"We propose a novel machine learning method based on differentiable vortex particles to infer and predict fluid dynamics from a single video. The key design of our system is a particle-based latent space to encapsulate the hidden, Lagrangian vortical evolution underpinning the observable, Eulerian flow phenomena. We devise a novel differentiable vortex particle system in conjunction with their learnable, vortex-to-velocity dynamics mapping to effectively capture and represent the complex flow features in a reduced space. We further design an end-to-end training pipeline to directly learn and synthesize simulators from data that can reliably deliver future video rollouts based on limited observation.
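As a concrete illustration of the tensor-completion view of hyperparameter optimization described above, the following minimal sketch forms a 3-way tensor of validation scores, observes only a fraction of its entries, and fills in the rest with a low-rank CP model; the hyperparameter grid, the rank, and the gradient-descent fitting routine are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

# Hypothetical 3-way grid of hyperparameters: learning rate x batch size x depth.
lrs = [1e-3, 1e-2, 1e-1]
bss = [32, 64, 128, 256]
depths = [2, 4, 6]
shape = (len(lrs), len(bss), len(depths))

rng = np.random.default_rng(0)
scores = rng.random(shape)            # stand-in for true validation accuracies
mask = rng.random(shape) < 0.3        # only ~30% of configurations were trained

def cp_complete(T, mask, rank=2, steps=3000, lr=0.05):
    """Fit a rank-`rank` CP model to the observed entries by gradient descent
    and return the dense reconstruction (the completed tensor)."""
    g = np.random.default_rng(1)
    A, B, C = (0.1 * g.standard_normal((n, rank)) for n in T.shape)
    for _ in range(steps):
        R = np.einsum('ir,jr,kr->ijk', A, B, C)
        E = np.where(mask, R - T, 0.0)          # error on observed entries only
        gA = np.einsum('ijk,jr,kr->ir', E, B, C)
        gB = np.einsum('ijk,ir,kr->jr', E, A, C)
        gC = np.einsum('ijk,ir,jr->kr', E, A, B)
        A, B, C = A - lr * gA, B - lr * gB, C - lr * gC
    return np.einsum('ir,jr,kr->ijk', A, B, C)

est = cp_complete(scores, mask)
# Keep the measured scores where available, use estimates elsewhere.
best = np.unravel_index(np.argmax(np.where(mask, scores, est)), shape)
print('suggested config: lr=%g, batch=%d, depth=%d'
      % (lrs[best[0]], bss[best[1]], depths[best[2]]))
```

Because the completed tensor ranks every untried configuration, reading off the top few entries also yields the multiple suggested combinations the abstract mentions.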
The value of our method is twofold: first, our learned simulator enables the inference of hidden physics quantities (e.g., velocity field) purely from visual observation, to be used for motion analysis; second, it also supports future prediction, constructing the input video's sequel along with its future dynamics evolution. We demonstrate our method's efficacy by comparing quantitatively and qualitatively with a range of existing methods on both synthetic and real-world videos, displaying improved data correspondence, visual plausibility, and physical integrity.", Self-Supervised SVDE from Videos with Depth Variance to Shifted Positional Information,https://openreview.net/forum?id=u21cKsZAUAr,https://openreview.net/pdf?id=u21cKsZAUAr,,"Recently, much attention has been drawn to learning the underlying 3D structures of a scene from monocular videos in a fully self-supervised fashion. One of the most challenging aspects of this task is handling the independently moving objects as they break the rigid-scene assumption. For the first time, we show that pixel positional information can be exploited to learn SVDE (Single View Depth Estimation) from videos. Our proposed moving object (MO) masks, which are induced by depth variance to shifted $positional information$ (SPI) and referred to as `SPIMO' masks, are very robust and consistently remove the independently moving objects in the scenes, allowing for better learning of SVDE from videos. Additionally, we introduce a new adaptive quantization scheme that assigns the best per-pixel quantization curve for our depth discretization. Finally, we employ existing boosting techniques in a new way to further self-supervise the depth of the moving objects. With these features, our pipeline is robust against moving objects and generalizes well to high-resolution images, even when trained with small patches, yielding state-of-the-art (SOTA) results with 4 to 8x fewer parameters than the previous SOTA that learns from videos. We present extensive experiments on KITTI and CityScapes that show the effectiveness of our method. ","Self-supervised, Depth Estimation, Monocular Depth" Discovering Generalizable Multi-agent Coordination Skills from Multi-task Offline Data,https://openreview.net/forum?id=53FyUAdP7d,https://openreview.net/pdf?id=53FyUAdP7d,We propose a novel multi-agent reinforcement learning algorithm to discover coordination skills from multi-task offline data and realize multi-task generalization.,"Cooperative multi-agent reinforcement learning (MARL) faces the challenge of adapting to multiple tasks with varying agents and targets. Previous multi-task MARL approaches require costly interactions to simultaneously learn or fine-tune policies in different tasks. However, the situation where an agent must generalize to multiple tasks using only offline data from a limited set of tasks is more in line with the needs of real-world applications. Since offline multi-task data contains a variety of behaviors, an effective data-driven approach is to extract informative latent variables that can represent universal skills for realizing coordination across tasks. In this paper, we propose a novel Offline MARL algorithm to Discover coordInation Skills (ODIS) from multi-task data. ODIS first extracts task-invariant coordination skills from offline multi-task data and learns to delineate different agent behaviors with the discovered coordination skills.
Then we train a coordination policy to choose optimal coordination skills under the centralized training and decentralized execution paradigm. We further demonstrate that the discovered coordination skills can assign effective coordinative behaviors, thus significantly enhancing generalization to unseen tasks. Empirical results in cooperative MARL benchmarks, including the StarCraft multi-agent challenge, show that ODIS obtains superior performance in a wide range of tasks only with offline data from limited sources.","multi-agent reinforcement learning, multi-task reinforcement learning, skill discovery, offline reinforcement learning" MATA*: Combining Learnable Node Matching with A* Algorithm for Approximate Graph Edit Distance Computation,https://openreview.net/forum?id=fC4pNRot_ys,https://openreview.net/pdf?id=fC4pNRot_ys,"We present a data-driven hybrid approach MATA* based on Graph Neural Networks and A* algorithms, which leverages the learned candidate matching nodes to prune unpromising search directions of the A* algorithm.","Graph Edit Distance (GED) is a general and domain-agnostic metric to measure graph similarity, widely used in graph search or retrieving tasks. However, the exact GED computation is known to be NP-complete. For instance, the widely used A* algorithms explore the entire search space to find the optimal solution, which inevitably suffers from scalability issues. Learning-based methods apply graph representation techniques to learn the GED by formulating a regression task, which cannot recover the edit path and leads to inaccurate GED approximation (i.e., the predicted GED is smaller than the exact one). To this end, in this work, we present a data-driven hybrid approach MATA* for approximate GED computation based on Graph Neural Networks and A* algorithms, which models from the perspective of learning to match nodes instead of directly regressing GED. That is, it leverages the learned node matchings to prune unpromising search directions of the A* algorithm. Specifically, aware of the combinatorial property of structure-dominant operations (i.e., node and edge insertion/deletion) in GED computation, a structure-enhanced Graph Neural Network is first designed to effectively learn powerful node embeddings w.r.t. node matchings. Based on this, the pairwise node similarity matrix is next built. Second, top-k candidate matching nodes are produced from the similarity matrix, adhering to the combinatorial property of multiple optimal node matchings. Third, benefiting from the candidate nodes, MATA* performs the search only on the promising directions, reaching the solution efficiently. Finally, extensive experiments demonstrate the superiority of MATA* as it significantly outperforms the combinatorial search-based, learning-based and hybrid approaches and scales well to large-size graphs.","Graph Edit Distance, Machine Learning for Combinatorial Optimization, Graph Neural Networks, A* algorithm, Graph Similarity" On student-teacher deviations in distillation: does it pay to disobey?,https://openreview.net/forum?id=xJz9LTHP0K,https://openreview.net/pdf?id=xJz9LTHP0K,On the lack of fidelity in knowledge distillation,"Knowledge distillation has been widely used to improve the performance of a ``student'' network by encouraging it to mimic the soft probabilities of a ``teacher'' network. Yet, for self-distillation to work, the student {\em must} deviate from the teacher in some manner \citep{stanton21does}.
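The pruning step at the heart of the MATA* abstract above, deriving top-k candidate matching nodes from a learned similarity matrix so that A* expands only those, can be sketched in a few lines; the random embeddings stand in for the structure-enhanced GNN, and cosine similarity is an assumption.

```python
import numpy as np

# Hypothetical node embeddings for two graphs (n1 x d and n2 x d), e.g. from a GNN.
rng = np.random.default_rng(0)
H1, H2 = rng.standard_normal((5, 16)), rng.standard_normal((6, 16))

def topk_candidates(H1, H2, k=3):
    """Pairwise cosine similarity, then the k most similar nodes of G2 for
    every node of G1 -- the only matchings an A* search would expand."""
    A = H1 / np.linalg.norm(H1, axis=1, keepdims=True)
    B = H2 / np.linalg.norm(H2, axis=1, keepdims=True)
    S = A @ B.T                                   # (n1, n2) similarity matrix
    return np.argsort(-S, axis=1)[:, :k]          # candidate matches per node

cands = topk_candidates(H1, H2)
print(cands)  # row i lists the k candidate partners for node i of G1
```

Restricting successor generation to these candidates shrinks the A* branching factor from n2 to k, which is where the claimed efficiency would come from.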
We conduct a variety of experiments across image and language classification datasets to more precisely understand the nature of student-teacher deviations and how they relate to accuracy gains. Our first key empirical observation is that in a majority of our settings, the student underfits points that the teacher finds hard. Next, we find that student-teacher deviations during the \textit{initial} phase of training are \textit{not} crucial to get the benefits of distillation --- simply switching to distillation in the middle of training can recover a significant fraction of distillation's accuracy gains. We then provide two parallel theoretical perspectives on student-teacher deviations, one casting distillation as a regularizer in eigenspace, and another as a denoiser of gradients. In both these views, we discuss how our empirically reported student-teacher deviations may emerge, and how they may relate to generalization. Importantly, our analysis bridges key gaps between existing theory and practice by focusing on gradient descent and avoiding label noise assumptions.","knowledge distillation, regularization, understanding, underfitting" Quantum Vision Transformers,https://openreview.net/forum?id=p7xPXoKB0H,https://openreview.net/pdf?id=p7xPXoKB0H,"We propose several quantum algorithms to mimic or enhance the transformer architecture, prove theoretical guarantees and provide experiments on real quantum hardware","In this work, we design and analyse quantum transformers, extending the state-of-the-art classical transformer neural network architectures known to be very performant in natural language processing and image analysis. Building upon the previous work of using parametrised quantum circuits for data loading and orthogonal neural layers, we introduce three types of quantum transformers, including a quantum transformer based on compound matrices. These quantum architectures can be built using shallow quantum circuits and produce qualitatively different classification models. The three proposed quantum attention layers vary on the spectrum between closely following the classical transformers and exhibiting more quantum characteristics. We propose a method for loading a matrix as quantum states along with two new trainable quantum orthogonal layers adaptable to different levels of connectivity and quality of quantum computers. We performed extensive simulations of the quantum transformers on standard medical image datasets that showed competitive, and at times better, performance compared to the classical benchmarks, including the best-in-class classical vision transformers. The trained quantum transformers require fewer parameters as compared to the standard classical benchmarks, confirming the predicted computational advantage of our quantum attention layers with respect to the size of the classified images. Finally, we implemented our quantum transformers on superconducting quantum computers and obtained encouraging results for up to six-qubit experiments.","quantum computing, quantum machine learning, quantum deep learning, transformers, vision transformers" Merging Models Pre-Trained on Different Features with Consensus Graph,https://openreview.net/forum?id=5s6NuOP9cW,https://openreview.net/pdf?id=5s6NuOP9cW,Combining Pre-Trained Models with Different Feature Sets via Learning Consensus Graph,"Learning global models effectively on private and decentralized datasets has become an increasingly important challenge of machine learning when applied in practice.
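The mid-training switch highlighted in the distillation abstract above can be illustrated with a toy loop that trains the student on plain cross-entropy first and only then blends in a distillation term; the linear models, the switch epoch, the temperature, and the 0.5/0.5 loss mix are stand-in assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X, y = torch.randn(256, 20), torch.randint(0, 5, (256,))
teacher = torch.nn.Linear(20, 5)          # pretend this is already trained
student = torch.nn.Linear(20, 5)
opt = torch.optim.SGD(student.parameters(), lr=0.1)
T, switch_epoch = 4.0, 5                  # temperature; begin distilling at epoch 5

for epoch in range(10):
    logits = student(X)
    if epoch < switch_epoch:
        loss = F.cross_entropy(logits, y)            # plain supervised phase
    else:
        with torch.no_grad():
            t_logits = teacher(X)
        # standard softened-logits KD term, scaled by T^2 as usual
        kd = F.kl_div(F.log_softmax(logits / T, -1),
                      F.softmax(t_logits / T, -1),
                      reduction='batchmean') * T * T
        loss = 0.5 * F.cross_entropy(logits, y) + 0.5 * kd
    opt.zero_grad(); loss.backward(); opt.step()
```

Comparing this schedule against distilling from epoch 0 is exactly the kind of experiment the abstract's "switching in the middle of training" finding refers to.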
Federated Learning (FL) has recently emerged as a solution paradigm to address this challenge. In particular, the FL clients agree to a common model parameterization in advance, which can then be updated collaboratively via synchronous aggregation of their local model updates. However, such a strong requirement of modeling homogeneity and synchronicity across clients makes FL inapplicable to many practical learning scenarios that cannot afford it. For example, in distributed sensing, a network of heterogeneous sensors samples from different data modalities of the same phenomenon. Each sensor thus requires its own specialized model. Local learning therefore needs to happen in isolation but inference still requires merging the local models for better performance. To enable this, we investigate a feature fusion approach that extracts local feature representations from local models and incorporates them into a global representation to train a more holistic predictive model. We study two key aspects of this feature incorporation. First, we develop an alignment algorithm that draws accurate correspondence between feature components which are arbitrarily arranged across clients. Next, we propose learning a consensus graph that captures the high-order interactions between these feature components, which reveals how data with heterogeneous features can be stitched together coherently to train a better model. The proposed framework is demonstrated on four real-life data sets including monitoring and predicting power grids and traffic networks.","Graph Neural Network, Probabilistic Methods" Unsupervised Performance Predictor for Architecture Search,https://openreview.net/forum?id=074e7Rojdj,https://openreview.net/pdf?id=074e7Rojdj,"We propose a performance predictor which can utilize existing fully-trained architectures, thus reducing the high cost of annotating architectures in the background of NAS.","Performance predictors can directly predict the performance value of given neural architectures without training them, and have thus been broadly studied to alleviate the prohibitive cost of Neural Architecture Search (NAS). However, existing performance predictors still require training a large number of architectures from scratch to get their performance labels as the training dataset, which is still computationally expensive. To solve this issue, we develop an unsupervised performance predictor called USPP, which can avoid costly dataset construction by using existing fully-trained architectures. Specifically, a progressive domain-invariant feature extraction method is proposed to extract domain-invariant features despite the great transferability challenge caused by rich domain-specific features. Furthermore, a learnable representation (denoted as operation embedding) is designed to replace the fixed encoding of the operations to transfer more knowledge about operations to the target search space. In experiments, we train the predictor on the labeled architectures in NAS-Bench-101 and predict the architectures in the DARTS search space.
Compared with other state-of-the-art NAS methods, the proposed USPP costs only $0.02$ GPU days and finds architectures achieving $97.86\%$ accuracy on CIFAR-10 and $96.50\%$ top-1 accuracy on ImageNet.","Neural Architecture Search, AutoML, Performance Predictor" Efficient recurrent architectures through activity sparsity and sparse back-propagation through time,https://openreview.net/forum?id=lJdOlWg8td,https://openreview.net/pdf?id=lJdOlWg8td,"We add an activity sparsity mechanism to the GRU using a thresholding function, which makes both the forward and backward passes computationally sparse. This model achieves competitive performance on various benchmarks including language modeling.","Recurrent neural networks (RNNs) are well suited for solving sequence tasks in resource-constrained systems due to their expressivity and low computational requirements. However, there is still a need to bridge the gap between what RNNs are capable of in terms of efficiency and performance and real-world application requirements. The memory and computational requirements arising from propagating the activations of all the neurons at every time step to every connected neuron, together with the sequential dependence of activations, contribute to the inefficiency of training and using RNNs. We propose a solution inspired by biological neuron dynamics that makes the communication between RNN units sparse and discrete. This makes the backward pass with backpropagation through time (BPTT) computationally sparse and efficient as well. We base our model on the gated recurrent unit (GRU), extending it with units that emit discrete events for communication triggered by a threshold so that no information is communicated to other units in the absence of events. We show theoretically that the communication between units, and hence the computation required for both the forward and backward passes, scales with the number of events in the network. Our model achieves efficiency without compromising task performance, demonstrating competitive performance compared to state-of-the-art recurrent network models in real-world tasks, including language modeling. The dynamic activity sparsity mechanism also makes our model well suited for novel energy-efficient neuromorphic hardware.","RNN, GRU, recurrent network, language modeling, dvs, gesture recognition, activity sparsity, efficiency" Quality-Similar Diversity via Population Based Reinforcement Learning,https://openreview.net/forum?id=bLmSMXbqXr,https://openreview.net/pdf?id=bLmSMXbqXr,We formulate the Quality-Similar Diversity (QSD) problem and propose an efficient population-based RL algorithm to optimize the user-defined diversity at multiple quality levels throughout training.,"Diversity is a growing research topic in Reinforcement Learning (RL). Previous research on diversity has mainly focused on promoting diversity to encourage exploration and thereby improve quality (the cumulative reward), maximizing diversity subject to quality constraints, or jointly maximizing quality and diversity, known as the quality-diversity problem. In this work, we present the quality-similar diversity problem that features diversity among policies of similar qualities. In contrast to task-agnostic diversity, we focus on task-specific diversity defined by a set of user-specified Behavior Descriptors (BDs). A BD is a scalar function of a trajectory (e.g., the fire action rate for an Atari game), which delivers the type of diversity the user prefers.
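A minimal sketch of the thresholded event-based communication the GRU abstract above describes; hard-masking the hidden state is a simplification (the actual model keeps internal unit state and trains through the threshold with surrogate gradients), so treat this as illustrative only.

```python
import torch

class EventGRU(torch.nn.Module):
    """GRU whose units communicate only when their state magnitude crosses a
    per-network threshold theta; silent units contribute nothing downstream."""
    def __init__(self, n_in, n_hidden, theta=0.5):
        super().__init__()
        self.cell = torch.nn.GRUCell(n_in, n_hidden)
        self.theta = theta

    def forward(self, x_seq):                # x_seq: (time, batch, n_in)
        h = torch.zeros(x_seq.shape[1], self.cell.hidden_size)
        activity = []
        for x in x_seq:
            h = self.cell(x, h)
            mask = (h.abs() > self.theta).float()
            h = h * mask                     # sub-threshold units emit no event
            activity.append(mask.mean())     # fraction of units that fired
        return h, torch.stack(activity).mean()

out, frac = EventGRU(8, 32)(torch.randn(20, 4, 8))
print(f'mean fraction of active units: {frac:.2f}')
```

Since both the forward messages and the BPTT gradients flow only through the non-zero entries, compute in such a model scales with the event count, which is the efficiency argument the abstract makes.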
To derive the gradient of the user-specified diversity with respect to a policy, which is not trivially available, we introduce a set of BD estimators and connect them with the classical policy gradient theorem. Based on the diversity gradient, we develop a population-based RL algorithm to adaptively and efficiently optimize the population diversity at multiple quality levels throughout training. Extensive results on MuJoCo and Atari demonstrate that our algorithm significantly outperforms previous methods in terms of generating user-specified diverse policies across different quality levels.","quality diversity, reinforcement learning, user-defined, population" PREDICTION OF TOURISM FLOW WITH SPARSE DATA INCORPORATING TOURIST GEOLOCATIONS,https://openreview.net/forum?id=4VFNnqSinf,https://openreview.net/pdf?id=4VFNnqSinf,"We apply state-of-the-art deep learning models such as GNNs, RNNs and Transformers to the problem of Tourism flow predictions","Modern tourism in the 21st century is facing numerous challenges. One of these challenges is the rapidly growing number of tourists in space-limited regions such as historical city centers, museums, or geographical bottlenecks like narrow valleys. In this context, a proper and accurate prediction of tourism volume and tourism flow within a certain area is critical for visitor management tasks such as sustainable treatment of the environment and prevention of overcrowding. Static flow control methods like conventional low-level controllers or limiting access to overcrowded venues have not yet solved the problem. In this paper, we empirically evaluate the performance of state-of-the-art deep-learning methods such as RNNs, GNNs, and Transformers as well as the classic statistical ARIMA method. Granular limited data supplied by a tourism region is extended by exogenous data such as geolocation trajectories of individual tourists, weather and holidays. In the field of visitor flow prediction with sparse data, we are thereby capable of increasing the accuracy of our predictions, incorporating modern input feature handling as well as mapping geolocation data on top of discrete POI data.","GNN, RNN, Transformer, Tourism, Tourism flow prediction" Uncertainty-oriented Order Learning for Facial Beauty Prediction,https://openreview.net/forum?id=lcZTuVcAyW,https://openreview.net/pdf?id=lcZTuVcAyW,,"Previous Facial Beauty Prediction (FBP) methods generally model the FB feature of an image as a point in the latent space, and learn a mapping from the point to a precise score. Although existing regression methods perform well on a single dataset, they are inclined to be sensitive to test data and have weak generalization ability. We think they underestimate two inconsistencies existing in the FBP problem: 1. inconsistency of FB standards among multiple datasets, and 2. inconsistency of human cognition on FB of an image. To address these issues, we propose a new Uncertainty-oriented Order Learning (UOL), where the order learning addresses the inconsistency of FB standards by learning the FB order relations among face images rather than a mapping, and the uncertainty modeling represents the inconsistency in human cognition. The key contribution of UOL is a designed distribution comparison module, which enables conventional order learning to learn the order of uncertain data.
Extensive experiments show that UOL outperforms the state-of-the-art methods in both accuracy and generalization ability.", Modeling the Uncertainty with Maximum Discrepant Students for Semi-supervised 2D Pose Estimation,https://openreview.net/forum?id=wOTLra5iXh,https://openreview.net/pdf?id=wOTLra5iXh,"Under the mean-teacher framework, the two maximum discrepant students (MDSs) are created to effectively model the uncertainty of pseudo-labels, so as to select the high quality pseudo-labels to improve the semi-supervised pose estimation performance.","Semi-supervised pose estimation is a practically challenging task for computer vision. Although numerous excellent semi-supervised classification methods have emerged, these methods typically use confidence to evaluate the quality of pseudo-labels, which is difficult to achieve in pose estimation tasks. For example, in pose estimation, confidence represents only the probability that a position in the heatmap is a keypoint, not the quality of that prediction. In this paper, we propose a simple yet efficient framework to estimate the quality of pseudo-labels in semi-supervised pose estimation tasks from the perspective of modeling the uncertainty of the pseudo-labels. Concretely, under the dual mean-teacher framework, we construct two maximum discrepant students (MDSs) to effectively push two teachers to generate different decision boundaries for the same sample. Moreover, we create multiple uncertainties to assess the quality of the pseudo-labels. Experimental results demonstrate that our method improves the performance of semi-supervised pose estimation on three datasets.","semi-supervised learning, pose estimation, uncertainty, convolutional neural networks" UTS: When Monotonic Value Factorisation Meets Non-monotonic and Stochastic Targets,https://openreview.net/forum?id=ELmZduELxm,https://openreview.net/pdf?id=ELmZduELxm,We propose a novel value factorisation method to deal with non-monotonic and stochastic target joint action-values. ,"Extracting decentralised policies from joint action-values is an attractive way to exploit centralised learning. It is possible to apply monotonic value factorisation to guarantee consistency between the centralised and decentralised policies. However, the best strategy for training decentralised policies when the target joint action-values are non-monotonic and stochastic is still unclear. We propose a novel value factorisation method named uncertainty-based target shaping (UTS) to solve this problem. UTS employs networks that estimate the reward and the following state's embedding, where a large prediction error indicates that the target is stochastic. By replacing deterministic targets for the suboptimal with the best per-agent values, we enforce that all shaped targets become a subset of the space that can be represented by monotonic value factorisation. Empirical results show that UTS outperforms state-of-the-art baselines on multiple benchmarks, including matrix games, predator-prey, and challenging tasks in StarCraft II micromanagement.","Value Decomposition, Multi-Agent Reinforcement Learning" TransLog: A Unified Transformer-based Framework for Log Anomaly Detection,https://openreview.net/forum?id=_q7iqj1Ns8,https://openreview.net/pdf?id=_q7iqj1Ns8,,"Log anomaly detection is a key component in the field of artificial intelligence for IT operations (AIOps).
Considering log data from various domains, retraining the whole network for unknown domains is inefficient in real industrial scenarios, especially for low-resource domains. However, previous deep models merely focused on extracting the semantics of log sequences in the same domain, leading to poor generalization on multi-domain logs. To alleviate this issue, we propose a unified Transformer-based framework for Log anomaly detection (TransLog) to improve the generalization ability across different domains from a new perspective, where we establish a two-stage process including the pre-training and adapter-based tuning stages. Specifically, our model is first pre-trained on the source domain to obtain shared semantic knowledge of log data. Then, we transfer such knowledge to the target domain via shared parameters. Besides, the adapter, designed for log data, is utilized to improve transfer efficiency while reducing cost. The proposed method is evaluated on three public datasets and one real-world dataset. Experimental results demonstrate that our simple yet efficient approach, with fewer trainable parameters and lower training costs in the target domain, achieves state-of-the-art performance on all benchmarks.","System Security, Log data, Anomaly detection" Meta-learning with Auto-generated Tasks for Predicting Human Behaviour in Normal Form Games,https://openreview.net/forum?id=ms5edDXWcX,https://openreview.net/pdf?id=ms5edDXWcX,,"In recent years, machine learning methods have been successfully applied to predict human behaviour in strategic settings. However, as available data of human behaviour is not always large enough and people's reasoning processes in different types of games vary, it is challenging to acquire a satisfactory prediction model. In this paper, we employ a meta-learning method to improve learning performance in predicting human behaviour in normal form games. In particular, we first design a deep neural network that captures mixed human behaviour features and can be trained to serve as an underlying behavioural predictor. Then, using a dataset of experimental human behaviour, we apply unsupervised learning to generate tasks and use meta-learning to improve learning proficiency. Experimental results show that our proposed meta-learning method with the designed neural network and auto-generated tasks considerably increases the prediction accuracy and significantly exceeds the previous state-of-the-art.", Are Graph Attention Networks Attentive Enough? Rethinking Graph Attention by Capturing Homophily and Heterophily,https://openreview.net/forum?id=Xk10fyKR8G,https://openreview.net/pdf?id=Xk10fyKR8G,Propose a new attention mechanism for Graph ML,"Attention Mechanism has been successfully applied in Graph Neural Networks (GNNs). However, as messages propagate along the edges, the node embeddings for edge-connected nodes will become closer even though we cannot ensure these nodes have similar features and labels, especially in heterophily graphs. The current attention mechanisms cannot adaptively extract information from the neighbors because they cannot fully use the graph information in the self-attention calculation. We introduce a new graph attention mechanism (GATv3) that directly involves the graph information in the self-attention calculation and can be aware of the homophily or heterophily of the graphs.
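The adapter-based tuning stage described in the TransLog abstract above typically boils down to a small residual bottleneck trained while the pre-trained encoder stays frozen; the sketch below shows that general pattern, with sizes and placement as assumptions rather than the paper's exact design.

```python
import torch

class Adapter(torch.nn.Module):
    """Residual bottleneck: the only module that trains on the target domain."""
    def __init__(self, d_model=256, d_bottleneck=32):
        super().__init__()
        self.down = torch.nn.Linear(d_model, d_bottleneck)
        self.up = torch.nn.Linear(d_bottleneck, d_model)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))

encoder = torch.nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
adapter = Adapter()
for p in encoder.parameters():       # freeze the pre-trained (source-domain) weights,
    p.requires_grad = False          # so only the adapter is tuned on the target domain

x = torch.randn(2, 50, 256)          # a batch of embedded log sequences
out = adapter(encoder(x))
print(out.shape, sum(p.numel() for p in adapter.parameters()), 'trainable params')
```

Because only the bottleneck parameters update, per-domain tuning stays cheap, which is the "fewer trainable parameters and lower training costs" claim in the abstract.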
We conduct an extensive evaluation in node classification tasks and show that using graph information and features simultaneously can produce more diverse attention scores. Our code is available at https://github.com/anonymousSubscriber/G-GAT","Graph ML, attention mechanism" FairGrad: Fairness Aware Gradient Descent,https://openreview.net/forum?id=hZRxiAZFJC,https://openreview.net/pdf?id=hZRxiAZFJC,A method to enforce fairness based on a reweighting scheme that iteratively learns group specific weights based on whether they are advantaged or not.,"We address the problem of group fairness in classification, where the objective is to learn models that do not unjustly discriminate against subgroups of the population. Most existing approaches are limited to simple binary tasks or involve difficult-to-implement training mechanisms. This reduces their practical applicability. In this paper, we propose FairGrad, a method to enforce fairness based on a reweighting scheme that iteratively learns group specific weights based on whether they are advantaged or not. FairGrad is easy to implement and can accommodate various standard fairness definitions. Furthermore, we show that it is competitive with standard baselines over various datasets including ones used in natural language processing and computer vision.","Group Fairness, Optimization" Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation,https://openreview.net/forum?id=M0_sUuEyHs,https://openreview.net/pdf?id=M0_sUuEyHs,The proposed method dynamically introduces part of teacher's features to student as prior knowledge before applying knowledge distillation. ,"Knowledge distillation (KD) has shown very promising capabilities in transferring learning representations from large models (teachers) to small models (students). However, as the capacity gap between students and teachers becomes larger, existing KD methods fail to achieve better results. Our work shows that the 'prior knowledge' is vital to KD, especially when applying large teachers. Particularly, we propose the dynamic prior knowledge (DPK), which integrates part of the teacher's features as the prior knowledge before the feature distillation. This means that our method also takes the teacher's feature as `input', not just `target'. Besides, we dynamically adjust the ratio of the prior knowledge during the training phase according to the feature gap, thus guiding the student at an appropriate difficulty. To evaluate the proposed method, we conduct extensive experiments on two image classification benchmarks (i.e., CIFAR100 and ImageNet) and an object detection benchmark (i.e., MS COCO). The results demonstrate the superiority of our method in performance under varying settings. Besides, our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers. More importantly, DPK provides a fast solution for teacher model selection for any given model.
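The reweighting scheme named in the FairGrad abstract above can be conveyed with a toy loop that raises weights for disadvantaged groups and lowers them for advantaged ones; the specific update rule, the rate, and the frozen per-example errors are assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
groups = rng.integers(0, 2, 1000)            # two demographic groups
errors = np.where(groups == 0, 0.10, 0.25)   # stand-in per-example error rates
weights = np.ones(2)                         # one learnable weight per group

for step in range(100):
    rate = np.array([errors[groups == g].mean() for g in (0, 1)])
    gap = rate - rate.mean()                 # > 0 means the group is disadvantaged
    weights += 0.1 * gap                     # up-weight disadvantaged groups
    weights = np.clip(weights, 0.0, None)
    # a learner would now take a gradient step on sum_i weights[groups[i]] * loss_i,
    # which would in turn change the per-group error rates over iterations

print(dict(zip(['group0', 'group1'], weights.round(3))))
```

In the real method the errors respond to the reweighted training, so the weights and the model co-adapt until the chosen group-fairness criterion is (approximately) satisfied.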
Our code will be publicly available for reproducibility.",Knowledge Distillation Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow,https://openreview.net/forum?id=XVjTT1nw5z,https://openreview.net/pdf?id=XVjTT1nw5z,,"We present rectified flow, a simple approach to learning (neural) ordinary differential equation (ODE) models to transport between two empirically observed distributions $\pi_0$ and $\pi_1$, hence providing a unified solution to generative modeling and domain transfer, among various other tasks involving distribution transport. The idea of rectified flow is to learn the ODE to follow the straight paths connecting the points drawn from $\pi_0$ and $\pi_1$ as much as possible. This is achieved by solving a straightforward nonlinear least squares optimization problem, which can be easily scaled to large models without introducing extra parameters beyond standard supervised learning. The straight paths are the shortest paths between two points, and can be simulated exactly without time discretization and hence yield computationally efficient models. We show that, by learning a rectified flow from data, we effectively turn an arbitrary coupling of $\pi_0$ and $\pi_1$ to a new deterministic coupling with provably non-increasing convex transport costs. In addition, with a ``reflow"" procedure that iteratively learns a new rectified flow from the data bootstrapped from the previous one, we obtain a sequence of flows with increasingly straight paths, which can be simulated accurately with coarse time discretization in the inference phase. In empirical studies, we show that rectified flow performs superbly on image generation, image-to-image translation, and domain adaptation. In particular, on image generation and translation, our method yields nearly straight flows that give high quality results even with \emph{a single Euler discretization step}. Code will be made publicly available.", "Inequality phenomenon in $l_{\infty}$-adversarial training, and its unrealized threats",https://openreview.net/forum?id=4t9q35BxGr,https://openreview.net/pdf?id=4t9q35BxGr,"We find an intriguing phenomenon of $l_{\infty}$ adversarial training, and this phenomenon brings unrealized threats to adversarially trained models.","The appearance of adversarial examples raises attention from both academia and industry. Along with the attack-defense arms race, adversarial training is the most effective defense against adversarial examples. However, we find that inequality phenomena occur during $l_{\infty}$-adversarial training: a few features dominate the prediction made by the adversarially trained model. We systematically evaluate such inequality phenomena by extensive experiments and find that such phenomena become more obvious when performing adversarial training with increasing adversarial strength (evaluated by $\epsilon$). We hypothesize that such inequality phenomena make the $l_{\infty}$-adversarially trained model less reliable than the standard trained model when a few ``important features"" are influenced. To validate our hypothesis, we propose two simple attacks that either perturb or replace important features with noise or occlusion. Experiments show that the $l_{\infty}$-adversarially trained model can be easily attacked when the few important features are influenced.
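The nonlinear least squares objective in the rectified-flow abstract above is simple enough to sketch end to end: draw pairs (x0, x1), form straight-line interpolants, and regress a velocity field onto the displacement x1 - x0. The tiny MLP and the Gaussian toy distributions below are stand-ins.

```python
import torch

torch.manual_seed(0)
# v(x, t): a small velocity-field network over 2D points plus a time channel.
v = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
opt = torch.optim.Adam(v.parameters(), lr=1e-3)

for step in range(1000):
    x0 = torch.randn(128, 2)                           # samples ~ pi_0
    x1 = torch.randn(128, 2) + torch.tensor([4., 0.])  # samples ~ pi_1
    t = torch.rand(128, 1)
    xt = t * x1 + (1 - t) * x0                         # straight-path interpolant
    # least squares regression of v(x_t, t) onto the constant displacement
    loss = ((v(torch.cat([xt, t], 1)) - (x1 - x0)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# One-step Euler transport: valid to the extent the learned flow is straight.
x0 = torch.randn(5, 2)
x1_hat = x0 + v(torch.cat([x0, torch.zeros(5, 1)], 1))
print(x1_hat)
```

The single Euler step at the end mirrors the abstract's claim that nearly straight flows can be simulated accurately with very coarse time discretization.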
Our work sheds light on the practical limitations of $l_{\infty}$-adversarial training.","Adversarial training, Adversarial robustness, Adversarial feature representation" Tensor-Based Sketching Method for the Low-Rank Approximation of Data Streams.,https://openreview.net/forum?id=rOFKmzNTbC,https://openreview.net/pdf?id=rOFKmzNTbC,,"Low-rank approximation in data streams is a fundamental and significant task in computing science, machine learning and statistics. Multiple streaming algorithms have emerged over the years and most of them are inspired by randomized algorithms, more specifically, sketching methods. However, many algorithms are not able to leverage the information in data streams and consequently suffer from low accuracy. Existing data-driven methods improve accuracy, but their training cost is high in practice. In this paper, from a subspace perspective, we propose a tensor-based sketching method for low-rank approximation of data streams. The proposed algorithm fully exploits the structure of data streams and obtains quasi-optimal sketching matrices by performing tensor decomposition on training data. A series of experiments are carried out and show that the proposed tensor-based method can be more accurate and much faster than the previous work.", CRISP: Curriculum based Sequential neural decoders for Polar code family,https://openreview.net/forum?id=cxul04S-aG,https://openreview.net/pdf?id=cxul04S-aG,"We introduce CRISP, a novel curriculum learning based neural decoder that attains near optimal reliability on the Polar code family in the short blocklength regime.","Polar codes are widely used state-of-the-art codes for reliable communication that have recently been included in the $5^{\text{th}}$ generation wireless standards ($5$G). However, there still remains room for the design of polar decoders that are both efficient and reliable in the short blocklength regime. Motivated by recent successes of data-driven channel decoders, we introduce a novel $\textbf{ C}$ur${\textbf{RI}}$culum based $\textbf{S}$equential neural decoder for $\textbf{P}$olar codes (CRISP). We design a principled curriculum, guided by information-theoretic insights, to train CRISP and show that it outperforms the successive-cancellation (SC) decoder and attains near-optimal reliability performance on the $\text{Polar}(16,32)$ and $\text{Polar}(22,64)$ codes. The choice of the proposed curriculum is critical in achieving the accuracy gains of CRISP, as we show by comparing against other curricula. More notably, CRISP can be readily extended to Polarization-Adjusted-Convolutional (PAC) codes, where existing SC decoders are significantly less reliable. To the best of our knowledge, CRISP constructs the first data-driven decoder for PAC codes and attains near-optimal performance on the $\text{PAC}(16,32)$ code. ","information theory, coding theory, wireless communication, polar codes, PAC codes, machine learning, deep learning" A Mathematical Framework for Characterizing Dependency Structures of Multimodal Learning,https://openreview.net/forum?id=t851DsVVtA,https://openreview.net/pdf?id=t851DsVVtA,,"Dependency structures between modalities have been utilized explicitly and implicitly in multimodal learning to enhance classification performance, particularly when the training samples are insufficient.
Recent efforts have concentrated on developing suitable dependence structures and applying them in deep neural networks, but the interplay between the training sample size and various structures has not received enough attention. To address this issue, we propose a mathematical framework that can be utilized to characterize conditional dependency structures in analytic ways. It provides an explicit description of the sample size required for learning various structures in a non-asymptotic regime. Additionally, it demonstrates how task complexity and a fitness evaluation of conditional dependence structures affect the results. Furthermore, we develop an autonomously updated coefficient algorithm, auto-CODES, based on the theoretical framework and conduct experiments on multimodal emotion recognition tasks using the MELD and IEMOCAP datasets. The experimental results validate our theory and show the effectiveness of the proposed algorithm.","Multimodal learning, dependency structures, correlation analyses, emotion recognition, classification" Language Models are Realistic Tabular Data Generators,https://openreview.net/forum?id=cEygmQNOeI,https://openreview.net/pdf?id=cEygmQNOeI,The GReaT approach utilizes the capabilities of large language models to synthesize realistic tabular data. A challenging set of experiments validates the proposed method’s efficiency.,"Tabular data is among the oldest and most ubiquitous forms of data. However, the generation of synthetic samples with the original data's characteristics remains a significant challenge for tabular data. While many generative models from the computer vision domain, such as autoencoders or generative adversarial networks, have been adapted for tabular data generation, less research has been directed towards recent transformer-based large language models (LLMs), which are also generative in nature. To this end, we propose GReaT (Generation of Realistic Tabular data), which exploits an auto-regressive generative LLM to sample synthetic and yet highly realistic tabular data. Furthermore, GReaT can model tabular data distributions by conditioning on any subset of features; the remaining features are sampled without additional overhead. We demonstrate the effectiveness of the proposed approach in a series of experiments that quantify the validity and quality of the produced data samples from multiple angles. We find that GReaT maintains state-of-the-art performance across many real-world data sets with heterogeneous feature types coming in various sizes.","tabular data, tabular data generation, large language models, transformers, probabilistic modeling, deep neural networks" Data augmentation alone can improve adversarial training,https://openreview.net/forum?id=y4uc4NtTWaq,https://openreview.net/pdf?id=y4uc4NtTWaq,data augmentation alone can significantly improve adversarial training regarding both accuracy and robustness,"Adversarial training suffers from the issue of robust overfitting, which seriously impairs its generalization performance. Data augmentation, which is effective at preventing overfitting in standard training, has been observed by many previous works to be ineffective in mitigating overfitting in adversarial training. This work proves that, contrary to previous findings, data augmentation alone can significantly boost accuracy and robustness in adversarial training. We find that the hardness and the diversity of data augmentation are important factors in combating robust overfitting.
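A sketch of the textual-encoding step that lets an auto-regressive LLM model tabular rows, in the spirit of the GReaT abstract above; the "key is value" template and the feature-order shuffling are plausible assumptions rather than the paper's exact format.

```python
import random

# Two toy rows standing in for a real tabular dataset.
rows = [
    {'age': 39, 'education': 'Bachelors', 'income': '<=50K'},
    {'age': 52, 'education': 'HS-grad', 'income': '>50K'},
]

def row_to_text(row, rng):
    """Serialize one row as a sentence. Randomizing the feature order means the
    fine-tuned LM can later be prompted with any subset of features and asked
    to complete the rest, which is the arbitrary-conditioning property."""
    items = list(row.items())
    rng.shuffle(items)
    return ', '.join(f'{k} is {v}' for k, v in items)

rng = random.Random(0)
for row in rows:
    print(row_to_text(row, rng))
# e.g. "education is Bachelors, income is <=50K, age is 39"; such strings are
# fed to causal-LM fine-tuning, and sampled strings are parsed back into rows.
```

Decoding is the mirror image: split the sampled sentence on commas, split each clause on " is ", and cast values back to the column types.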
In general, diversity can improve both accuracy and robustness, while hardness can boost robustness at the cost of accuracy within a certain limit, and degrades both beyond that limit. To mitigate robust overfitting, we first propose a new crop transformation, Cropshift, with improved diversity compared to the conventional one (Padcrop). We then propose a new data augmentation scheme, based on Cropshift, with much improved diversity and well-balanced hardness. Empirically, our augmentation method achieves state-of-the-art accuracy and robustness among data augmentations in adversarial training. Furthermore, it matches, or even exceeds when combined with weight averaging, the performance of the best contemporary regularization methods for alleviating robust overfitting.","deep learning, adversarial training, data augmentation, adversarial robustness" Learning Rotation-Equivariant Features for Visual Correspondence,https://openreview.net/forum?id=GCF6ZOA6Npk,https://openreview.net/pdf?id=GCF6ZOA6Npk,We introduce a self-supervised learning framework to yield discriminative rotation-invariant descriptors using local features extracted from group-equivariant CNNs for the task of visual correspondence.,"Extracting discriminative local features that are invariant to imaging variations is an integral part of establishing correspondences between images. In this work, we introduce a self-supervised learning framework to extract discriminative rotation-invariant descriptors using group-equivariant CNNs. Thanks to employing group-equivariant CNNs, our method effectively learns to obtain rotation-equivariant features and their orientations explicitly, without having to perform sophisticated data augmentations. The resultant features and their orientations are further processed by group aligning, a novel invariant mapping technique that shifts the group-equivariant features by their orientations along the group dimension. Our group aligning technique achieves rotation-invariance without any collapse of the group dimension and thus eschews loss of discriminability. The proposed method is trained end-to-end in a self-supervised manner, where we use an orientation alignment loss for the orientation estimation and a contrastive descriptor loss for robust local descriptors to geometric/photometric variations. Our method demonstrates state-of-the-art matching accuracy among existing rotation-invariant descriptors under varying rotation and also shows competitive results when transferred to the tasks of keypoint matching and camera pose estimation.","visual correspondence, equivariant representation learning, deep local feature extraction, self-supervised learning" Learning Diffusion Bridges on Constrained Domains,https://openreview.net/forum?id=WH1yCa0TbB,https://openreview.net/pdf?id=WH1yCa0TbB,,"Diffusion models have achieved promising results on generative learning recently. However, because diffusion processes are most naturally applied on the unconstrained Euclidean space $\mathrm{R}^d$, key challenges arise for developing diffusion based models for learning data on constrained and structured domains. We present a simple and unified framework to achieve this that can be easily adapted to various types of domains, including product spaces of any type (be it bounded/unbounded, continuous/discrete, categorical/ordinal, or their mix).
In our model, the diffusion process is driven by a drift force that is a sum of two terms: one singular force designed by Doob's $h$-transform that ensures all outcomes of the process belong to the constrained domain $\Omega$, and one non-singular neural force field that is trained to make sure the outcome follows the data distribution statistically. Experiments show that our method performs superbly on generating tabular data, images, semantic segments and 3D point clouds. ", Revisiting Uncertainty Estimation for Node Classification: New Benchmark and Insights,https://openreview.net/forum?id=DB3BH3arU2Y,https://openreview.net/pdf?id=DB3BH3arU2Y,We analyze uncertainty estimation for node classification problems: we propose a benchmark covering distribution shifts of different types and perform a thorough analysis of various uncertainty estimation techniques.,"Uncertainty estimation is an important task that can be essential for high-risk applications of machine learning. This problem is especially challenging for node-level prediction in graph-structured data, as the samples (nodes) are interdependent. Recently, several studies addressed node-level uncertainty estimation. However, there is no established benchmark for evaluating these methods in a unified setup covering diverse distributional shifts. In this paper, we address this problem and propose such a benchmark together with a technique for the controllable generation of data splits with various types of distributional shift. Importantly, besides the standard feature-based distributional shift, we also consider shifts specifically designed for graph-structured data. In summary, our benchmark consists of several graph datasets equipped with various distributional shifts on which we evaluate the robustness of models and uncertainty estimation performance. This allows us to compare existing solutions in a unified setup. Moreover, we decompose the current state-of-the-art Dirichlet-based framework and perform an ablation study on its components. In our experiments, we demonstrate that when faced with complex yet realistic distributional shift, most models fail to maintain high classification performance and consistency of uncertainty estimates with prediction errors. However, ensembling techniques help to partially overcome significant drops in performance and achieve better results than distinct models. Among single-pass models, Natural Posterior Network with GNN encoder achieves the best performance.","uncertainty estimation, distribution shift, graph, node classification, benchmark" CUTS: Neural Causal Discovery from Unstructured Time-Series Data,https://openreview.net/forum?id=UG8bQcD3Emv,https://openreview.net/pdf?id=UG8bQcD3Emv,Discovering causal relations from unstructured time-series data with a mutually boosting iterative method.,"Causal discovery from time-series data has been a central task in machine learning. Recently, Granger causality inference has been gaining momentum due to its good explainability and high compatibility with emerging deep neural networks. However, most existing methods assume structured input data and degrade greatly when encountering data with randomly missing entries or non-uniform sampling frequencies, which hampers their application in real scenarios.
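The two-term drift described in the diffusion-bridge abstract above can be written compactly as follows; this is a generic Doob $h$-transform sketch consistent with the abstract, with the exact parameterization of the learned term assumed.

```latex
% Drift of the constrained bridge: a singular h-transform term that pins the
% terminal state inside \Omega, plus a learned non-singular field s_\theta.
\[
  \mathrm{d}X_t
  = \Big[\underbrace{\nabla_x \log h(X_t, t)}_{\text{singular: forces } X_1 \in \Omega}
  + \underbrace{s_\theta(X_t, t)}_{\text{learned: matches the data law}}\Big]\,\mathrm{d}t
  + \mathrm{d}W_t,
  \qquad
  h(x, t) = \mathbb{P}\big(X_1 \in \Omega \mid X_t = x\big).
\]
```

The $h$-transform term diverges as $t \to 1$ for states that would exit $\Omega$, which is what makes it "singular" and guarantees feasible outcomes regardless of how well $s_\theta$ is trained.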
To address this issue, here we present CUTS, a neural Granger causal discovery algorithm that jointly imputes unobserved data points and builds causal graphs, via plugging two mutually boosting modules into an iterative framework: (i) Latent data prediction stage: designs a Delayed Supervision Graph Neural Network (DSGNN) to hallucinate and register unstructured data, which may be high-dimensional and have complex distributions; (ii) Causal graph fitting stage: builds a causal adjacency matrix with imputed data under a sparsity penalty. Experiments show that CUTS effectively infers causal graphs from unstructured time-series data, with significantly superior performance to existing methods. Our approach constitutes a promising step towards applying causal discovery to real applications with non-ideal observations.","Time series, Granger causality, Causal discovery, Neural networks, Graph neural networks" Multi-Source Transfer Learning for Deep Model-Based Reinforcement Learning,https://openreview.net/forum?id=rDm09u4Hws4,https://openreview.net/pdf?id=rDm09u4Hws4,"Modular multi-source transfer learning techniques for model-based reinforcement learning that autonomously learn to extract information from a set of source tasks, regardless of differences between environments.","A crucial challenge in reinforcement learning is to reduce the number of interactions with the environment that an agent requires to master a given task. Transfer learning proposes to address this issue by re-using knowledge from previously learned tasks. However, determining which source task qualifies as optimal for knowledge extraction, as well as the choice regarding which algorithm components to transfer, represent severe obstacles to its application in reinforcement learning. The goal of this paper is to alleviate these issues with modular multi-source transfer learning techniques. Our proposed methodologies automatically learn how to extract useful information from source tasks, regardless of the difference in state-action space and reward function. We support our claims with extensive and challenging cross-domain experiments for visual control.","multi-source transfer learning, world models, model-based reinforcement learning, sample efficiency, cross-domain transfer learning" Balancing MSE against Abrupt Changes for Time-Series Forecasting,https://openreview.net/forum?id=tEyV6gwCECk,https://openreview.net/pdf?id=tEyV6gwCECk,,"Time-series forecasting models often encounter drastic changes in a given period of time (i.e., abrupt changes), which generally occur due to unexpected or unknown events. Despite their scarce occurrences in the training set, abrupt changes incur loss (i.e., MSE) that significantly contributes to the total loss. Therefore, they act as noisy training samples and prevent the model from learning generalizable sequence patterns, namely the normal states. Based on such an intuition, we propose a reweighting framework that down-weights the loss incurred by abrupt changes and up-weights those by normal states. For the reweighting framework, we first define a measure termed Local Discrepancy (LD) which measures the degree of abruptness of a change in a given period of time. Then, we calculate how frequently the temporal changes appear in the training set based on LD (i.e., estimated LD density). Since normal states generally appear frequently compared to abrupt changes, they achieve higher LD density. Using such a property, we reweight the losses proportionally to the estimated LD density.
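A schematic of the two mutually boosting stages from the CUTS abstract above, with a placeholder linear one-lag fit standing in for the DSGNN and a hard threshold standing in for the sparsity penalty; everything here is illustrative, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 200, 3
X = rng.standard_normal((T, n)).cumsum(0)       # ground-truth multivariate series
obs = rng.random((T, n)) < 0.8                  # ~20% of entries are unobserved
X_imp = np.where(obs, X, 0.0)                   # crude initial imputation
A = np.zeros((n, n))

for it in range(10):
    # (i) latent data prediction: fit a one-lag predictor, refill the gaps with it
    W, *_ = np.linalg.lstsq(X_imp[:-1], X_imp[1:], rcond=None)
    pred = np.vstack([X_imp[:1], X_imp[:-1] @ W])
    X_imp = np.where(obs, X, pred)
    # (ii) causal graph fitting: lagged coefficients as Granger-style influences
    A = np.abs(W.T)                             # A[i, j]: influence of series j on i
    A[A < 0.05] = 0.0                           # sparsity via a hard threshold

print('estimated Granger adjacency:\n', A.round(2))
```

Each pass through the loop gives the predictor cleaner inputs and the graph better-imputed data, which is the "mutually boosting" iteration the abstract describes.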
Our reweighting framework is applicable to existing time-series forecasting models regardless of their architectures. Through extensive experiments on 12 time-series forecasting models over eight datasets with various input-output sequence lengths, we demonstrate that applying our reweighting framework reduces MSE by 10.1% on average and by up to 18.6% in the state-of-the-art model.","time-series forecasting, data imbalance, loss imbalance, noisy samples" PAVI: Plate-Amortized Variational Inference,https://openreview.net/forum?id=hpr8KTZzz4W,https://openreview.net/pdf?id=hpr8KTZzz4W,We share a variational family's parameterization and learning across a model's plates to tackle efficiently very large plate cardinality regimes.,"Given observed data and a probabilistic generative model, Bayesian inference aims at obtaining the distribution of a model’s latent parameters that could have yielded the data. This task is challenging for large population studies where thousands of measurements are performed over a cohort of hundreds of subjects, resulting in a massive latent parameter space. This large cardinality renders off-the-shelf Variational Inference (VI) computationally impractical. In this work, we design structured VI families that can efficiently tackle large population studies. Our main idea is to share the parameterization and learning across the different i.i.d. variables in a generative model --symbolized by the model’s plates. We name this concept plate amortization and illustrate the powerful synergies it enables, resulting in large-scale hierarchical variational distributions that are expressive, parsimoniously parameterized, and orders of magnitude faster to train. We illustrate the practical utility of PAVI through a challenging Neuroimaging example featuring a million latent parameters, demonstrating a significant step towards scalable and expressive Variational Inference.","structured Variational Inference, Bayesian inference, Hierarchical Bayesian Models, Inference amortization, Neuroimaging" Near-optimal Coresets for Robust Clustering,https://openreview.net/forum?id=Nc1ZkRW8Vde,https://openreview.net/pdf?id=Nc1ZkRW8Vde,"We obtain an \epsilon-coreset of near-optimal size for (k, z)-clustering (which includes k-median and k-means) with m outliers","We consider robust clustering problems in $\mathbb{R}^d$, specifically $k$-clustering problems (e.g., $k$-Median and $k$-Means) with $m$ \emph{outliers}, where the cost for a given center set $C \subset \mathbb{R}^d$ aggregates the distances from $C$ to all but the furthest $m$ data points, instead of all points as in classical clustering. We focus on the $\epsilon$-coreset for robust clustering, a small proxy of the dataset that preserves the clustering cost within $\epsilon$-relative error for all center sets. Our main result is an $\epsilon$-coreset of size $O(m + \mathrm{poly}(k \epsilon^{-1}))$ that can be constructed in near-linear time. This significantly improves previous results, which either suffer an exponential dependence on $(m + k)$ [Feldman and Schulman, SODA'12], or have a weaker bi-criteria guarantee [Huang et al., FOCS'18]. Furthermore, we show that this dependence on $m$ is nearly optimal, and the fact that it is isolated from other factors may be crucial for dealing with a large number of outliers.
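The Local Discrepancy reweighting pipeline described in the forecasting abstract above is easy to prototype: compute a per-window abruptness score, estimate its density over the training set, and weight losses in proportion to that density; the window statistic and the kernel density estimator are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
y = np.sin(np.linspace(0, 20, 500)) + 0.05 * rng.standard_normal(500)
y[240:250] += 3.0                               # inject one abrupt change

w_len = 5
# LD proxy: mean absolute step size inside each short window.
ld = np.array([np.abs(np.diff(y[t:t + w_len])).mean() for t in range(len(y) - w_len)])
density = gaussian_kde(ld)(ld)                  # normal states are frequent => high density
weights = density / density.mean()              # up-weight normal states,
                                                # down-weight rare abrupt changes
print('mean weight on abrupt region:', weights[238:250].mean().round(3))
print('mean weight elsewhere:      ', np.delete(weights, range(238, 250)).mean().round(3))
```

Multiplying each window's forecasting loss by its weight reproduces the abstract's "proportional to estimated LD density" rule, and the printout should show the injected spike receiving far less weight than the rest of the series.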
We construct our coresets by adapting to the outlier setting a recent framework [Braverman et al., FOCS'22] which was designed for capacity-constrained clustering, overcoming a new challenge that the participating terms in the cost, particularly the excluded $m$ outlier points, are dependent on the center set $C$. We validate our coresets on various datasets, and we observe a superior size-accuracy tradeoff compared with popular baselines including uniform sampling and sensitivity sampling. We also achieve a significant speedup of existing approximation algorithms for robust clustering using our coresets.","clustering, outlier, robustness, coreset" CLUTR: Curriculum Learning via Unsupervised Task Representation Learning,https://openreview.net/forum?id=DvMDIEFtyjV,https://openreview.net/pdf?id=DvMDIEFtyjV,,"Reinforcement Learning (RL) algorithms are often known for sample inefficiency and difficult generalization. Recently, Unsupervised Environment Design (UED) emerged as a new paradigm for zero-shot generalization by simultaneously learning a task distribution and agent policies on the sampled tasks. This is a non-stationary process where the task distribution evolves along with agent policies; creating an instability over time. While past works demonstrated the potential of such approaches, sampling effectively from the task space remains an open challenge, bottlenecking these approaches. To this end, we introduce CLUTR: a novel curriculum learning algorithm that decouples task representation and curriculum learning into a two-stage optimization. It first trains a recurrent variational autoencoder on randomly generated tasks to learn a latent task manifold. Next, a teacher agent creates a curriculum by optimizing a minimax REGRET-based objective on a set of latent tasks sampled from this manifold. By keeping the task manifold fixed, we show that CLUTR successfully overcomes the non-stationarity problem and improves stability. Our experimental results show CLUTR outperforms PAIRED, a principled and popular UED method, in terms of generalization and sample efficiency in the challenging CarRacing and navigation environments: showing an 18x improvement on the F1 CarRacing benchmark. CLUTR also performs comparably to the non-UED state-of-the-art for CarRacing, outperforming it in nine of the 20 tracks. CLUTR also achieves a 33% higher solved rate than PAIRED on a set of 18 out-of-distribution navigation tasks.","reinforcement learning, curriculum learning" Test-time Adaptation for Segmentation via Image Synthesis,https://openreview.net/forum?id=TQZkycVeMIy,https://openreview.net/pdf?id=TQZkycVeMIy,We propose a test-time adaptation framework that optimizes image synthesis loss to improve image segmentation.,"We consider the problem of segmenting scenes into constituent objects and their parts. Current supervised visual detectors, though impressive within their training distribution, often fail to segment out-of-distribution scenes into their constituent entities. Recent test-time adaptation methods use auxiliary self-supervised losses to adapt the network parameters to each test example independently and have shown promising results towards generalization outside the training distribution for the task of image classification. In our work, we find evidence that these losses can be insufficient for instance segmentation tasks, without also considering architectural inductive biases. 
For image segmentation, recent slot-centric generative models break such dependence on supervision by attempting to segment scenes into entities in a self-supervised manner by reconstructing pixels. Drawing upon these two lines of work, we propose Generating Fast and Slow Networks (GFS-Nets), a semi-supervised instance segmentation model equipped with a slot-centric image or point-cloud rendering component that is adapted per scene at test time through gradient descent on reconstruction or novel view synthesis objectives. We show that test-time adaptation greatly improves segmentation in out-of-distribution scenes. We evaluate GFS-Nets in several 3D and 2D scene segmentation benchmarks and show substantial out-of-distribution performance improvements against state-of-the-art supervised feed-forward detectors and self-supervised domain adaptation models.","object-centric learning, test-time adaptation, unsupervised domain adaptation, test-time training, entity-centric models" On the Importance of In-distribution Class Prior for Out-of-distribution Detection,https://openreview.net/forum?id=72lzvXrKqqd,https://openreview.net/pdf?id=72lzvXrKqqd,,"Given a pre-trained in-distribution (ID) model, inference-time out-of-distribution (OOD) detection methods aim to recognize upcoming OOD data at inference time. However, some representative methods share an unproven assumption that the probability that OOD data belong to every ID class should be the same, i.e., probabilities that OOD data belong to ID classes form a uniform distribution. In this paper, we theoretically and empirically show that this assumption makes these methods incapable of recognizing OOD data when the ID model is trained with class-imbalanced data. Fortunately, by analyzing the causal relations between ID/OOD classes and features, we identify several common scenarios where the probabilities that OOD data belong to ID classes should follow the ID-class-prior distribution. Based on the above finding, we propose two effective strategies to modify previous inference-time OOD detection methods: 1) if they explicitly use the uniform distribution, we can replace the uniform distribution with the ID-class-prior distribution; 2) otherwise, we can reweight their scores according to the similarity between the ID-class-prior distribution and the softmax outputs of the pre-trained model. Extensive experiments show that both strategies significantly improve the accuracy of recognizing OOD data when the ID model is pre-trained with imbalanced data. As a highlight, when evaluated on the iNaturalist dataset, our method achieves a ~36% increase in AUROC and a ~61% decrease in FPR95, compared with the original Energy method, reflecting the importance of the ID-class prior in OOD detection and opening a new avenue for studying this problem.", Quantized Compressed Sensing with Score-Based Generative Models,https://openreview.net/forum?id=OOWLRfAI_V_,https://openreview.net/pdf?id=OOWLRfAI_V_,Quantized Compressed Sensing with Score-Based Generative Models,"We consider the general problem of recovering a high-dimensional signal from noisy quantized measurements. Quantization, especially coarse quantization such as one-bit sign measurements, leads to severe information loss, and thus good prior knowledge of the unknown signal is helpful for accurate recovery.
Motivated by the power of score-based generative models (SGM, also known as diffusion models) in capturing the rich structure of natural signals beyond simple sparsity, we propose an unsupervised data-driven approach called quantized compressed sensing with SGM (QCS-SGM), where the prior distribution is modeled by a pre-trained SGM. To perform posterior sampling, an annealed likelihood score called the noise-perturbed likelihood score is introduced and combined with the prior score of the SGM. The proposed QCS-SGM applies to an arbitrary number of quantization bits. Experiments on a variety of baseline datasets demonstrate that the proposed QCS-SGM outperforms existing state-of-the-art algorithms by a large margin for both in-distribution and out-of-distribution samples. Moreover, as a posterior sampling method, QCS-SGM can be easily used to obtain confidence intervals or uncertainty estimates of the reconstructed results. Our code will be open-sourced after acceptance. ","generative models, compressed sensing, linear inverse problems, quantization" Unbiased Representation of Electronic Health Records for Patient Outcome Prediction,https://openreview.net/forum?id=fnDbEm6RxqH,https://openreview.net/pdf?id=fnDbEm6RxqH,,"Fairness is one of the newly emerging focuses for building trustworthy artificial intelligence (AI) models. One of the causes of an unfair model is algorithmic bias towards different groups of samples. A biased model may benefit certain groups but disfavor others. As a result, leaving the fairness problem unresolved might have a significant negative impact, especially in the context of healthcare applications. Integrating both domain-specific and domain-invariant representations, we propose a masked triple attention transformer encoder (MTATE) to learn unbiased and fair data representations of different subpopulations. Specifically, MTATE includes multiple domain classifiers and uses three attention mechanisms to effectively learn the representations of diverse subpopulations. In the experiment on real-world healthcare data, MTATE performed the best among the compared models regarding overall performance and fairness.","Deep Learning, Electronic Health Records Representation Learning, Healthcare AI, Model Fairness" Valid P-Value for Deep Learning-driven Salient Region,https://openreview.net/forum?id=qihMOPw4Sf_,https://openreview.net/pdf?id=qihMOPw4Sf_,We propose a novel method to quantify the reliability of neural network-driven saliency regions in a statistical hypothesis testing framework.,"Various saliency map methods have been proposed to interpret and explain predictions of deep learning models. Saliency maps allow us to interpret which parts of the input signals have a strong influence on the prediction results. However, since a saliency map is obtained by complex computations in deep learning models, it is often difficult to know how reliable the saliency map itself is. In this study, we propose a method to quantify the reliability of a saliency region in the form of p-values. Our idea is to consider a saliency map as a hypothesis selected by the trained deep learning model and employ the selective inference framework. The proposed method provably provides a valid p-value for the detected salient region, i.e., we can provably control the false positive rate of the detected salient region. We demonstrate the validity of the proposed method through numerical examples in synthetic and real datasets.
Furthermore, we develop a Keras-based framework for conducting the proposed selective inference for a wide class of CNNs without additional implementation cost.","Saliency Map, Attention, Selective Inference, Uncertainty Quantification, P-value, Statistical Hypothesis Testing" Unsupervised Semantic Segmentation with Self-supervised Object-centric Representations,https://openreview.net/forum?id=1_jFneF07YC,https://openreview.net/pdf?id=1_jFneF07YC,Strong and simple baseline for unsupervised segmentation methods obtained by leveraging and combining object-centric priors.,"In this paper, we show that recent advances in self-supervised representation learning enable unsupervised object discovery and semantic segmentation with a performance that matches the state of the field on supervised semantic segmentation 10 years ago. We propose a methodology based on unsupervised saliency masks and self-supervised feature clustering to kickstart object discovery, followed by training a semantic segmentation network on pseudo-labels to bootstrap the system on images with multiple objects. We show that while being conceptually simple our proposed baseline is surprisingly strong. We present results on PASCAL VOC that go far beyond the current state of the art (47.3 mIoU; +10.1 mIoU), and we report for the first time results on MS COCO for the whole set of 81 classes: our method discovers 34 categories with more than 20% IoU, while obtaining an average IoU of 19.6 for all 81 categories.","unsupervised semantic segmentation, object segmentation, object-centric learning" Skill Graph for Real-world Quadrupedal Robot Reinforcement Learning,https://openreview.net/forum?id=vdm4WnG5u-M,https://openreview.net/pdf?id=vdm4WnG5u-M,We propose a novel structured skill graph for accelerating the learning of robotic DRL policies and rapid adaptation to unseen real-world tasks.,"Deep Reinforcement Learning (DRL) is one of the promising methods for learning general policies from the environment. However, DRL has two basic problems: sample inefficiency and weak generalization. Real-world robotic DRL, for example, often requires time-consuming data collection and frequent human intervention to reset the environment. If a robot can master basic skills in advance instead of learning from scratch, its learning efficiency and adaptability when faced with a new environment or task will be greatly improved. Therefore, in this paper, we propose a novel structured skill graph (SG) for accelerating the learning of robotic DRL policies and rapid adaptation to unseen real-world tasks. Similar to the knowledge graph (KG), the SG adopts a tri-element structure to store information. But unlike the KG, which stores static knowledge, the SG stores dynamic policies and adopts different tri-elements. To construct the SG, we utilize various real-world quadrupedal locomotion skills in different realistic environments. When faced with new real-world tasks, the relevant skills in the SG are extracted and used to help the robotic DRL learning and rapid adaptation. Extensive experimental results on real-world quadruped robot locomotion tasks demonstrate the effectiveness of the SG for facilitating DRL-based robot learning. Real-world quadrupedal robots can adapt to new environments or tasks in minutes with the help of our SG.","Skill Graph, Quadrupedal Robot, Deep Reinforcement Learning."
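The abstract above does not spell out the SG's schema, but its analogy with knowledge-graph triples suggests a natural data structure. A hypothetical sketch (the `(context, relation, policy)` triple layout and the similarity-based retrieval rule are assumptions, not the paper's design):

```python
from dataclasses import dataclass, field

@dataclass
class SkillGraph:
    # Tri-element store: (context, relation, policy) triples, by analogy
    # with the (head, relation, tail) triples of a knowledge graph, except
    # that the third element is a dynamic policy rather than static knowledge.
    triples: list = field(default_factory=list)

    def add(self, context, relation, policy):
        self.triples.append((context, relation, policy))

    def retrieve(self, query_context, similarity):
        # Rank stored policies by how similar their context is to the new
        # task, so DRL can warm-start from them instead of learning from scratch.
        scored = [(similarity(query_context, c), p) for c, _, p in self.triples]
        return [p for _, p in sorted(scored, key=lambda x: -x[0])]

# sg = SkillGraph(); sg.add("grass_slope", "locomotion", trot_policy)
# candidates = sg.retrieve("gravel_slope", similarity=context_similarity)
```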
Pre-training Protein Structure Encoder via Siamese Diffusion Trajectory Prediction,https://openreview.net/forum?id=Tb3ZJBDF7aA,https://openreview.net/pdf?id=Tb3ZJBDF7aA,"In this work, we propose a novel protein structure pre-training algorithm SiamDiff to effectively maximize mutual information between protein structure-sequence co-diffusion trajectories.","Due to the determining role of protein structures in diverse protein functions, pre-training representations of proteins on massive unlabeled protein structures has attracted rising research interest. Among recent efforts in this direction, mutual information (MI) maximization based methods have shown superior performance on various downstream benchmark tasks. The core of these methods is to design correlated views that share common information about a protein. Previous view designs focus on capturing structural motif co-occurrence on the same protein structure, while they cannot capture detailed atom/residue interactions. To address this limitation, we propose the Siamese Diffusion Trajectory Prediction (SiamDiff) method. SiamDiff builds a view as the trajectory that gradually approaches the protein's native structure from scratch, which facilitates the modeling of atom/residue interactions underlying protein structural dynamics. Specifically, we employ the multimodal diffusion process as a faithful simulation of the structure-sequence co-diffusion trajectory, where rich patterns of protein structural changes are embedded. On this basis, we design a principled theoretical framework to maximize the MI between correlated multimodal diffusion trajectories. We study the effectiveness of SiamDiff on both residue-level and atom-level structures. On the EC and ATOM3D benchmarks, we extensively compare our method with previous protein structure pre-training approaches. The experimental results verify the consistently superior or competitive performance of SiamDiff on all benchmark tasks compared to existing baselines. The source code will be made public upon acceptance.","Protein representation learning, diffusion models, self-supervised learning" Indiscriminate Poisoning Attacks on Unsupervised Contrastive Learning,https://openreview.net/forum?id=f0a_dWEYg-Td,https://openreview.net/pdf?id=f0a_dWEYg-Td,,"Indiscriminate data poisoning attacks are quite effective against supervised learning. However, not much is known about their impact on unsupervised contrastive learning (CL). This paper is the first to consider indiscriminate poisoning attacks on contrastive learning. We propose contrastive poisoning, the first effective such attack on CL. We empirically show that contrastive poisoning not only drastically reduces the performance of CL algorithms, but also attacks supervised learning models, making it the most generalizable indiscriminate poisoning attack. We also show that CL algorithms with a momentum encoder are more robust to indiscriminate poisoning, and propose a new countermeasure based on matrix completion.
Our code will be publicly available upon publication.","data poisoning, contrastive learning" Decompositional Generation Process for Instance-Dependent Partial Label Learning,https://openreview.net/forum?id=lKOfilXucGB,https://openreview.net/pdf?id=lKOfilXucGB, We consider instance-dependent PLL and assume that the generation process of the candidate labels can be decomposed into two sequential parts.,"Partial label learning (PLL) is a typical weakly supervised learning problem, where each training example is associated with a set of candidate labels among which only one is true. Most existing PLL approaches assume that the incorrect labels in each training example are randomly picked as the candidate labels and model the generation process of the candidate labels in a simple way. However, these approaches usually do not perform as well as expected due to the fact that the generation process of the candidate labels is always instance-dependent. Therefore, it deserves to be modeled in a refined way. In this paper, we consider instance-dependent PLL and assume that the generation process of the candidate labels can be decomposed into two sequential parts, where the correct label emerges first in the mind of the annotator but then the incorrect labels related to the feature are also selected with the correct label as candidate labels due to uncertainty of labeling. Motivated by this consideration, we propose a novel PLL method that performs Maximum A Posteriori (MAP) inference based on an explicitly modeled generation process of candidate labels via decomposed probability distribution models. Extensive experiments on manually corrupted benchmark datasets and real-world datasets validate the effectiveness of the proposed method.","partial label learning, weakly supervised learning, decompositional generation process" Multimodal Masked Autoencoders Learn Transferable Representations,https://openreview.net/forum?id=Z-aIURmBbBk,https://openreview.net/pdf?id=Z-aIURmBbBk,Multimodal Transformers for vision-language representation learning trained with masked token prediction.,"Building scalable models to learn from diverse, multimodal data remains an open challenge. For vision-language data, the dominant approaches are based on contrastive learning objectives that train a separate encoder for each modality. While effective, contrastive learning approaches introduce sampling bias depending on the data augmentations used, which can degrade performance on downstream tasks. Moreover, these methods are limited to paired image-text data, and cannot leverage widely-available unpaired data. In this paper, we investigate whether a large multimodal model trained purely via masked token prediction, without using modality-specific encoders or contrastive learning, can learn transferable representations for downstream tasks. We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE), which learns a unified encoder for both vision and language data via masked token prediction. We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks. Surprisingly, we find that M3AE benefits from a higher text mask ratio (50-90%), in contrast to BERT whose standard masking ratio is 15%, due to the joint training of two data modalities. We also provide qualitative analysis showing that the learned representation incorporates meaningful information from both image and language.
Lastly, we demonstrate the scalability of M3AE with larger model sizes and longer training, and its flexibility to train on both paired image-text data and unpaired data. ","Multimodal, Self-supervised Learning, Pre-training, Masked Autoencoder, Representation Learning" Adversarial Causal Augmentation for Graph Covariate Shift,https://openreview.net/forum?id=TJPmwnQIMmw,https://openreview.net/pdf?id=TJPmwnQIMmw,"We propose a novel graph data augmentation method, Adversarial Causal Augmentation (AdvCA), to address the covariate shift issues.","Out-of-distribution (OOD) generalization on graphs is drawing widespread attention. However, existing efforts mainly focus on the OOD issue of correlation shift. Another type, covariate shift, remains largely unexplored and is the focus of this work. From a data generation view, causal features are stable substructures in data, which play key roles in OOD generalization. Their complementary parts, environments, are unstable features that often lead to various distribution shifts. Correlation shift establishes spurious statistical correlations between environments and labels. In contrast, covariate shift means that there exist unseen environmental features in test data. Existing strategies of graph invariant learning and data augmentation suffer from limited environments or unstable causal features, which greatly limits their generalization ability on covariate shift. In view of this, we propose a novel graph augmentation strategy: Adversarial Causal Augmentation (AdvCA), to alleviate the covariate shift. Specifically, it adversarially augments the data to explore diverse distributions of the environments. Meanwhile, it keeps the causal features invariant across diverse environments. It maintains the environmental diversity while ensuring the invariance of the causal features, thereby effectively alleviating the covariate shift. Extensive experimental results with in-depth analyses demonstrate that AdvCA can outperform 14 baselines on synthetic and real-world datasets with various covariate shifts.","Graph Data Augmentation, Graph Neural Networks, Covariate Shift, OOD Generalization" Learning from conflicting data with hidden contexts,https://openreview.net/forum?id=kvAQEZZ_BI1,https://openreview.net/pdf?id=kvAQEZZ_BI1,We formulate the problem of learning from conflicting data with hidden contexts and propose a subjective learning framework to tackle this problem.,"Classical supervised learning assumes a stable relation between inputs and outputs. However, this assumption is often invalid in real-world scenarios where the input-output relation in the data depends on some hidden contexts. We formulate a more general setting where the training data is sampled from multiple unobservable domains, while different domains may possess semantically distinct input-output maps. Training data exhibits inherent conflict in this setting, rendering vanilla empirical risk minimization problematic. We propose to tackle this problem by introducing an allocation function that learns to allocate conflicting data to different prediction models, resulting in an algorithm that we term LEAF. We draw an intriguing connection between our approach and a variant of the Expectation-Maximization algorithm. We provide theoretical justifications for LEAF on its identifiability, learnability, and generalization error.
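The allocation idea just described has a simple hard-EM reading: assign each conflicting sample to the predictor that currently explains it best, then refit. Below is a minimal sketch under that assumed reading, not the paper's exact algorithm; `leaf_step`, `losses`, and `fit` are hypothetical names.

```python
import numpy as np

def leaf_step(models, X, Y, losses, fit):
    # E-like step: the allocation function assigns each sample to the
    # prediction model with the lowest current loss on it.
    per_model = np.stack([losses(m, X, Y) for m in models])  # (n_models, n)
    assign = per_model.argmin(axis=0)
    # M-like step: refit each model on the subset allocated to it.
    for k, model in enumerate(models):
        idx = assign == k
        if idx.any():
            fit(model, X[idx], Y[idx])
    return assign  # iterate until the allocation stabilizes
```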
Empirical results demonstrate the efficacy and potential applications of LEAF in a range of regression and classification tasks on both synthetic data and real-world datasets.","Conflicting data, hidden contexts, subjective learning, multi-domain learning" Building a Subspace of Policies for Scalable Continual Learning,https://openreview.net/forum?id=UKr0MwZM6fL,https://openreview.net/pdf?id=UKr0MwZM6fL,We introduce a continual reinforcement learning method that incrementally builds a subspace of policies and adaptively prunes it to preserve a good trade-off between model size and performance.,"The ability to continuously acquire new knowledge and skills is crucial for autonomous agents. Existing methods are typically based on either fixed-size models that struggle to learn a large number of diverse behaviors, or growing-size models that scale poorly with the number of tasks. In this work, we aim to strike a better balance between scalability and performance by designing a method whose size grows adaptively depending on the task sequence. We introduce Continual Subspace of Policies (CSP), a new approach that incrementally builds a subspace of policies for training a reinforcement learning agent on a sequence of tasks. The subspace's high expressivity allows CSP to perform well for many different tasks while growing more slowly than the number of tasks. Our method does not suffer from forgetting and also displays positive transfer to new tasks. CSP outperforms a number of popular baselines on a wide range of scenarios from two challenging domains, Brax (locomotion) and Continual World (robotic manipulation). Interactive visualizations of the subspace can be found at https://share.streamlit.io/continual-subspace/policies/main.","continual learning, deep reinforcement learning" Test-Time AutoEval with Supporting Self-supervision,https://openreview.net/forum?id=ztkUF_MQj7J,https://openreview.net/pdf?id=ztkUF_MQj7J,A new framework for unsupervised model evaluation without touching training sets,"The Automatic Model Evaluation (AutoEval) framework entertains the possibility of evaluating a trained machine learning model without resorting to a labeled testing set, which is commonly neither accessible nor provided in real-world scenarios. Existing AutoEval methods always rely on computing the distribution shift between the unlabelled testing set and the training set. However, this line of work does not fit well with some real-world ML applications, such as edge computing boxes, where the original training set is inaccessible. Contrastive Learning (CL) is an efficient self-supervised learning task, which can learn helpful visual representations for downstream classification tasks. In our work, we surprisingly find that CL accuracy and classification accuracy exhibit a strong linear correlation ($r > 0.88$). This finding motivates us to regress classification accuracy on CL accuracy. In our experiments, we show that without touching training sets, our framework can achieve results comparable to SOTA AutoEval baselines.
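The regression at the heart of this framework is one line of fitting: estimate classification accuracy from label-free CL accuracy. A toy sketch assuming a simple linear fit as suggested by the reported correlation; the numbers are made up for illustration only.

```python
import numpy as np

def fit_autoeval(cl_acc, cls_acc):
    # Fit cls_acc ~ a * cl_acc + b on meta-sets where labels are available.
    a, b = np.polyfit(cl_acc, cls_acc, deg=1)
    return lambda new_cl_acc: a * new_cl_acc + b

# Toy numbers for illustration only.
predict = fit_autoeval(np.array([0.62, 0.70, 0.74, 0.81]),
                       np.array([0.55, 0.66, 0.71, 0.80]))
print(predict(0.68))  # estimated classification accuracy, no test labels needed
```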
Besides, our subsequent experiments demonstrate that different CL approaches and model structures can easily fit into our framework.","Test-Time, AutoEval, Self-supervised Learning" Complexity-Based Prompting for Multi-step Reasoning,https://openreview.net/forum?id=yf1icZHC-l9,https://openreview.net/pdf?id=yf1icZHC-l9,We show using prompts with more reasoning steps can improve language models multi-step reasoning ability ,"We study the task of prompting large-scale language models to perform multi-step reasoning. Existing work shows that when prompted with a chain of thoughts (CoT), sequences of short sentences describing intermediate reasoning steps towards a final answer, large language models can generate new reasoning chains and predict answers for new inputs. A central question is which reasoning examples make the most effective prompts. In this work, we propose complexity-based prompting, a simple and effective example selection scheme for multi-step reasoning. We show that prompts with higher reasoning complexity, i.e., chains with more reasoning steps, achieve substantially better performance on math word reasoning tasks over strong baselines. We further extend our complexity-based criteria from prompting (selecting inputs) to decoding (selecting outputs), where we sample multiple reasoning chains from the model, then choose the majority of generated answers from complex reasoning chains (over simple chains). When used to prompt GPT-3, our approach substantially improves multi-step reasoning accuracy, with an 8.6% absolute improvement on GSM8K, and 6.4% on MathQA. Compared with existing example selection schemes like manual tuning or retrieval-based selection, selection based on reasoning complexity is intuitive, easy to implement, and annotation-efficient. Further results demonstrate the robustness of performance gains from complex prompts under format perturbation and distribution shift.","Chain-of-Thoughts, Multi-Step Reasoning, Large Language Models, Prompting" ECLAD: Extracting Concepts with Local Aggregated Descriptors,https://openreview.net/forum?id=FvqcQ_9u7Mo,https://openreview.net/pdf?id=FvqcQ_9u7Mo,,"Convolutional neural networks (CNNs) are increasingly being used in critical systems, where robustness and alignment are crucial. In this context, the field of explainable artificial intelligence has proposed the generation of high-level explanations of the prediction process of CNNs through concept extraction. While these methods can detect whether or not a concept is present in an image, they are unable to determine its location. What is more, a fair comparison of such approaches is difficult due to a lack of proper validation procedures. To address these issues, we propose a novel method for automatic concept extraction and localization based on representations obtained through pixel-wise aggregations of CNN activation maps. Further, we introduce a process for the validation of concept-extraction techniques based on synthetic datasets with pixel-wise annotations of their main components, reducing the need for human intervention. 
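The pipeline sketched in the ECLAD abstract above (per-pixel descriptors from aggregated CNN activation maps, then clustering into concepts) might look roughly as follows; the bilinear upsampling, channel concatenation, and k-means choices are assumptions, and both function names are hypothetical.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import MiniBatchKMeans

def pixelwise_descriptors(activation_maps, out_hw):
    # Upsample each layer's activation map to a common resolution and
    # concatenate channels: one descriptor per pixel.
    ups = [F.interpolate(a, size=out_hw, mode="bilinear", align_corners=False)
           for a in activation_maps]             # each a: (B, C_l, H_l, W_l)
    feats = torch.cat(ups, dim=1)                # (B, sum_l C_l, H, W)
    return feats.permute(0, 2, 3, 1).reshape(-1, feats.shape[1]).numpy()

def extract_concepts(descriptors, n_concepts=10):
    # Each cluster of descriptors is a candidate concept; the per-pixel
    # cluster ids simultaneously localize the concept in the images.
    km = MiniBatchKMeans(n_clusters=n_concepts, n_init=3).fit(descriptors)
    return km.labels_, km.cluster_centers_
```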
Extensive experimentation on both synthetic and real-world datasets demonstrates that our method outperforms state-of-the-art alternatives.","explainable artificial intelligence, deep learning, interpretability, concept extraction" Not All Tasks Are Born Equal: Understanding Zero-Shot Generalization,https://openreview.net/forum?id=KGV-GBh8fb,https://openreview.net/pdf?id=KGV-GBh8fb,,"Recent work has achieved remarkable zero-shot performance with multi-task prompted pretraining, but little has been understood. For the first time, we show that training on a small number of key tasks beats using all the training tasks, while removing these key tasks substantially hurts performance. We also find that these key tasks are mostly question answering (QA) tasks. We design a shuffle experiment to further show that training on these QA tasks leads to better cross-task generalization in multi-task learning under various training/test task splits. These novel findings combined deepen our understanding of zero-shot generalization---training on certain tasks such as QA encodes general knowledge transferable to a wide range of tasks, which explains the improved zero-shot performance in recent progress. In addition, to automate this procedure, we devise a method to identify and upsample key training tasks without observing the test tasks, based on cross validation. Empirically, our approach achieves improved results across various model scales and tasks.","Zero-Shot Learning, Multi-Task Learning, Transfer Learning" MA2QL: A Minimalist Approach to Fully Decentralized Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=TMYzh1hsHd,https://openreview.net/pdf?id=TMYzh1hsHd,A new algorithm for multi-agent reinforcement learning,"Decentralized learning has shown great promise for cooperative multi-agent reinforcement learning (MARL). However, non-stationarity remains a significant challenge in fully decentralized learning. In the paper, we tackle the non-stationarity problem in the simplest and most fundamental way and propose multi-agent alternate Q-learning (MA2QL), where agents take turns updating their Q-functions by Q-learning. MA2QL is a minimalist approach to fully decentralized cooperative MARL but is theoretically grounded. We prove that when each agent guarantees $\varepsilon$-convergence at each turn, their joint policy converges to a Nash equilibrium. In practice, MA2QL only requires minimal changes to independent Q-learning (IQL). We empirically evaluate MA2QL on a variety of cooperative multi-agent tasks. Results show MA2QL consistently outperforms IQL, which verifies the effectiveness of MA2QL, despite such minimal changes.",multi-agent reinforcement learning Learning Asymmetric Visual Semantic Embedding for Image-Text Retrieval,https://openreview.net/forum?id=THp4UABcMv,https://openreview.net/pdf?id=THp4UABcMv,"In this paper, we propose a novel method to calculate visual semantic similarity for image-text matching and outperform recent state-of-the-art methods on two widely used datasets.","Learning visual semantic similarity is the key challenge in bridging the correspondences between images and texts. However, there are many inherent variations between vision and language data, such as information density, i.e., images can contain textual information from multiple different views, which makes it difficult to accurately compute the similarity between data of these two modalities.
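Returning to the MA2QL entry above: the turn-taking update it describes is a small change to independent Q-learning. A tabular toy sketch under an assumed `env` interface (reset/step returning per-agent observations); exploration is omitted for brevity, and this is an illustration rather than the paper's implementation.

```python
import numpy as np

def ma2ql(env, n_agents, Q, turns, episodes_per_turn, alpha=0.1, gamma=0.99):
    # Q[i] is agent i's tabular Q-function. Agents take turns: during a
    # turn, only the learner updates its table; the others act greedily
    # with frozen tables, which removes the usual non-stationarity.
    for turn in range(turns):
        learner = turn % n_agents
        for _ in range(episodes_per_turn):
            obs, done = env.reset(), False
            while not done:
                acts = [int(np.argmax(Q[i][obs[i]])) for i in range(n_agents)]
                nxt, reward, done = env.step(acts)
                i = learner  # the only table that changes this turn
                td = reward + gamma * np.max(Q[i][nxt[i]]) - Q[i][obs[i]][acts[i]]
                Q[i][obs[i]][acts[i]] += alpha * td
                obs = nxt
    return Q
```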
In the mainstream methods, global-level methods cannot effectively handle the above problem, while local-level methods need complicated mechanisms, which significantly affect retrieval efficiency. In this paper, we propose Asymmetric Visual Semantic Embedding (AVSE), which aims to design a novel model to learn visual semantic similarity by explicitly considering the difference in information density between the two modalities while eschewing prohibitive computations. Specifically, to keep the information density of images, AVSE exploits the large spatial redundancy of image regions to capture and concatenate multi-view features as the image embedding. It also has a novel module to efficiently calculate the visual semantic similarity of asymmetric image embeddings and text embeddings by dividing embeddings into semantic blocks of the same dimension and computing the similarity by finding the optimal match between these semantic blocks. Extensive experiments on the large-scale MS-COCO and Flickr30K datasets verify the superiority of our proposed AVSE compared with recent state-of-the-art methods. Compared to the recent NAAF method, our AVSE inference is 1000 times faster on the 1K test set and more accurate on the widely used benchmarks.","Cross-modal retrieval, image-text matching" Representation Interference Suppression via Non-linear Value Factorization for Indecomposable Markov Games,https://openreview.net/forum?id=rgp4_59eC0,https://openreview.net/pdf?id=rgp4_59eC0,,"Value factorization is an efficient approach for centralized training with decentralized execution in cooperative multi-agent reinforcement learning tasks. As the simplest implementation of value factorization, Linear Value Factorization (LVF) attracts wide attention. In this paper, firstly, we investigate the applicable conditions of LVF, which is important but usually neglected by previous works. We prove that due to the representation limitation, LVF is only perfectly applicable to an extremely narrow class of tasks, which we define as the decomposable Markov game. Secondly, to handle the indecomposable Markov game where LVF is inapplicable, we turn to value factorization with complete representation capability (CRC) and explore the general form of the value factorization function that satisfies both the Independent Global Max (IGM) and CRC conditions. A common problem of these value factorization functions is the representation interference among true Q values with shared local Q value functions. As a result, the policy could be trapped in local optimums due to the representation interference on the optimal true Q values. Thirdly, to address the problem, we propose a novel value factorization method, namely Q Factorization with Representation Interference Suppression (QFRIS). QFRIS adaptively reduces the gradients of the local Q value functions contributed by the non-optimal true Q values. Our method is evaluated on various benchmarks. Experimental results demonstrate the good convergence of QFRIS.",Multi-agent Reinforcement Learning On Threshold Functions in Learning to Generate Feasible Solutions of Mixed Integer Programs,https://openreview.net/forum?id=VezcnWFSd2d,https://openreview.net/pdf?id=VezcnWFSd2d,We introduce a post-hoc method and a learning approach to optimize the selection rate for partial discrete variable assignments in MIP to find feasible solutions efficiently.
,"Finding a high-quality feasible solution to a combinatorial optimization problem in a given time budget is a challenging task due to its discrete nature. Neural diving is a learning-based approach to generating partial assignments for the discrete variables in MIP. We find that there usually is a small range of selection rates which lead to feasible and optimal solutions; when too many parameters are selected, the solution space is too restricted to find a feasible solution; when too few parameters are selected, the solution space is too wide to efficiently find a feasible solution. Therefore, the choice of selection rate is the critical determinant of the Neural diving performance. In this context, we present theoretical insights that there exist threshold functions in feasibility and feasible optimality over the selection rate. Based on the theoretical foundations, we introduce a post-hoc method, and a learning-based approach to optimize the selection rate for partial discrete variable assignments in MIP more efficiently. A key idea is to jointly learn to restrict the selection rate search space, and to predict the selection rate in the learned search space that results in a high-quality feasible solution. MIP solver is integrated into the end-to-end learning framework. We suggest that learning a deep neural network to generate a threshold-aware selection rate is effective in finding high-quality feasible solutions more quickly. Experimental results demonstrate that our method achieves state-of-the-art performance in NeurIPS ML4CO datasets. In the workload apportionment dataset, our method achieves the optimality gap of 0.45%, which is around 10× better than SCIP, at the one-minute time limit.","Mixed Integer Programming, Neural Combinatorial Optimization" SDAC: Efficient Safe Reinforcement Learning with Low-Biased Distributional Actor-Critic,https://openreview.net/forum?id=X4DOJ-wL2I,https://openreview.net/pdf?id=X4DOJ-wL2I,We propose a safe reinforcement learning method based on the trust region method and distributional critics.,"To apply reinforcement learning (RL) to real-world practical applications, agents are required to adhere to the safety guidelines of their respective domains. Safe RL can effectively handle the guidelines by maximizing returns while maintaining safety satisfaction. In this paper, we develop a safe distributional RL method based on the trust region method which has the capability of satisfying safety constraints consistently. However, importance sampling required for the trust region method can hinder performance due to its significant variance, and policies may not meet the safety guidelines due to the estimation bias of distributional critics. Hence, we enhance safety performance through the following approaches. First, we propose novel surrogates for the trust region method expressed with Q-functions using the reparameterization trick. Second, we utilize distributional critics trained with a target distribution where bias-variance can be traded off. In addition, if an initial policy violates safety constraints, there can be no policy satisfying safety constraints within the trust region. Thus, we propose a gradient integration method which is guaranteed to find a policy satisfying multiple constraints from an unsafe initial policy. From extensive experiments, the proposed method shows minimal constraint violations while achieving high returns compared to existing safe RL methods. 
Furthermore, we demonstrate the benefit of safe RL for problems in which the reward function cannot be easily specified.","Reinforcement learning, Safety, Distributional Critic" So-TVAE: Sentiment-oriented Transformer-based Variational Autoencoder Network for Live Video Commenting,https://openreview.net/forum?id=pxnp5lBtXPr,https://openreview.net/pdf?id=pxnp5lBtXPr,This paper proposes a Sentiment-oriented Transformer-based Variational Autoencoder model which can achieve diverse video commenting with multiple sentiments and semantics for the automatic live video commenting task.,"Automatic live video commenting has attracted increasing attention due to its significance in narration generation, topic explanation, etc. However, the sentiment consideration of the generated comments is missing from current methods. Thus, in this paper, we introduce and investigate a task, namely sentiment-guided automatic live video commenting, which aims to generate live video comments based on sentiment guidance. To address this problem, we propose a Sentiment-oriented Transformer-based Variational Autoencoder (So-TVAE) network, which consists of a sentiment-oriented diversity encoder module and a batch-attention module. Specifically, our sentiment-oriented diversity encoder elegantly combines a VAE and a random mask mechanism to achieve semantic diversity under sentiment guidance, which is then fused with cross-modal features to generate live video comments. Furthermore, a batch attention module is also proposed in this paper to alleviate the problem of missing sentiment samples caused by data imbalance, which is common in live videos as video popularity varies. Extensive experiments on the Livebot and VideoIC datasets demonstrate that the proposed So-TVAE outperforms the state-of-the-art methods in terms of the quality and diversity of generated comments. Related codes will be released.","Automatic live video commenting, batch attention, cross-modal fusion" SoTeacher: Toward Student-oriented Teacher Network Training for Knowledge Distillation,https://openreview.net/forum?id=aR3hRo_O6cn,https://openreview.net/pdf?id=aR3hRo_O6cn,We study the feasibility of training a teacher network towards the performance of the student with empirical risk minimization.,"How to train an ideal teacher for knowledge distillation is still an open problem. It has been widely observed that a best-performing teacher does not necessarily yield the best-performing student, suggesting a fundamental discrepancy between the current practice in teacher training and the distillation objective. To fill this gap, we explore the feasibility of training a teacher that is oriented toward student performance with empirical risk minimization. Our analyses are inspired by the recent findings that the effectiveness of knowledge distillation hinges on the teacher’s capability to approximate the true label distribution of training inputs. We theoretically establish that (1) the empirical risk minimizer can provably approximate the true label distribution of training data if the loss function is a proper scoring rule and the hypothesis function is locally-Lipschitz continuous around training inputs; and (2) when data augmentation is employed for training, an additional constraint is required: the minimizer has to produce consistent predictions across augmented views of the same training input.
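The consistency condition just stated translates directly into a regularizer. A minimal PyTorch sketch, assuming a symmetric-KL penalty between the teacher's predictions on two augmented views (one plausible choice; the paper's exact form may differ):

```python
import torch.nn.functional as F

def consistency_loss(model, view1, view2):
    # Penalize disagreement between predicted label distributions on two
    # augmented views of the same training input (symmetric KL).
    p = F.log_softmax(model(view1), dim=-1)
    q = F.log_softmax(model(view2), dim=-1)
    return 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean")
                  + F.kl_div(q, p, log_target=True, reduction="batchmean"))

# Sketch of a teacher objective: proper scoring rule plus two regularizers
# (lipschitz_penalty is hypothetical and not specified here).
# loss = F.cross_entropy(model(x), y) \
#        + lam_c * consistency_loss(model, aug(x), aug(x)) \
#        + lam_l * lipschitz_penalty(model, x)
```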
In light of our theory, we propose a teacher training method, SoTeacher, which augments empirical risk minimization with Lipschitz regularization and consistency regularization. Experiments on two benchmark datasets confirm that SoTeacher can improve student performance significantly and consistently across various knowledge distillation algorithms and teacher-student pairs.","Knowledge distillation, Teacher-student training, Empirical risk minimization" GuardHFL: Privacy Guardian for Heterogeneous Federated Learning,https://openreview.net/forum?id=__GGLJ79pV,https://openreview.net/pdf?id=__GGLJ79pV,,"Heterogeneous federated learning (HFL) enables clients with different computation and communication capabilities to collaboratively train their own customized models via a query-response paradigm on auxiliary datasets. However, such a paradigm raises serious privacy issues due to the leakage of highly sensitive query samples and response predictions. Although existing secure querying solutions may be extended to enhance the privacy of HFL with non-trivial adaptation, they suffer from two key limitations: (1) lacking customized protocol designs and (2) relying on heavy cryptographic primitives, which could lead to poor performance. In this work, we put forth GuardHFL, the first-of-its-kind efficient and privacy-preserving HFL framework. GuardHFL is equipped with a novel HFL-friendly secure querying scheme that is built on lightweight secret sharing and symmetric-key techniques. Its core is a set of customized multiplication and comparison protocols, which substantially boost the execution efficiency. Extensive evaluations demonstrate that GuardHFL outperforms the state-of-the-art works by up to two orders of magnitude in efficiency.", Class-wise Visual Explanations for Deep Neural Networks,https://openreview.net/forum?id=ix3UDwIN5E,https://openreview.net/pdf?id=ix3UDwIN5E,We propose a method to visualize global explanations in the input space for every class learned in the training procedure.,"Many explainable AI (XAI) methods have been proposed to interpret, through gradient information, why neural networks locally predict what they predict. Yet existing works, which mainly target local explanation, lack the global knowledge to show class-wise explanations over the whole training procedure. To fill this gap, we propose to visualize global explanations in the input space for every class learned in the training procedure. Specifically, our solution finds a representation set that could demonstrate the learned knowledge for each class. To achieve this goal, we optimize the representation set by imitating the model training procedure over the full dataset. Experimental results show that our method could generate class-wise explanations with high quality on a series of image classification datasets. Using our global explanation, we further analyze the model knowledge in different training procedures, including adversarial training and noisy label learning.
Moreover, we illustrate that the generated explanations could lend insights into diagnosing model failures, such as revealing triggers in a backdoored model.","Class-wise explanation, Backdoor attack detection, Global explanation" Decentralized Policy Optimization,https://openreview.net/forum?id=tyyNcEVrklJ,https://openreview.net/pdf?id=tyyNcEVrklJ,A principled algorithm for fully decentralized policy optimization,"The study of decentralized learning or independent learning in cooperative multi-agent reinforcement learning has a history of decades. Recent empirical studies show that independent PPO (IPPO) can obtain good performance, close to or even better than the methods of centralized training with decentralized execution, in several benchmarks. However, decentralized actor-critic with a convergence guarantee is still open. In this paper, we propose decentralized policy optimization (DPO), a decentralized actor-critic algorithm with monotonic improvement and convergence guarantees. We derive a novel decentralized surrogate for policy optimization such that the monotonic improvement of the joint policy can be guaranteed by each agent independently optimizing the surrogate. In practice, this decentralized surrogate can be realized by two adaptive coefficients for policy optimization at each agent. Empirically, we compare DPO with IPPO in a variety of cooperative multi-agent tasks, covering discrete and continuous action spaces, and fully and partially observable environments. The results show DPO outperforms IPPO in most tasks, providing evidence for our theoretical results.",multi-agent reinforcement learning Identification of the Adversary from a Single Adversarial Example,https://openreview.net/forum?id=V9dXRjqvqcD,https://openreview.net/pdf?id=V9dXRjqvqcD,This paper proposes a forensic mechanism for the aftermath of adversarial attacks.,"Deep neural networks have been shown to be vulnerable to adversarial examples. Even though many defence methods have been proposed to enhance robustness, we are still a long way from an attack-free method for building a trustworthy machine learning system. In this paper, instead of enhancing the robustness, we take the investigator's perspective and propose a new framework to trace the first compromised model in a forensic investigation manner. Specifically, we focus on the following setting: the machine learning service provider provides models for a set of customers. However, one of the customers conducted adversarial attacks to fool the system. Therefore, the investigator's objective is to identify the first compromised model by collecting and analyzing evidence from only the available adversarial examples. To make the tracing viable, we design a random mask watermarking mechanism to differentiate adversarial examples from different models. First, we propose a tracing approach for the data-limited case where the original example is also available. Then, we design a data-free approach to identify the adversary without accessing the original example. Finally, the effectiveness of our proposed framework is evaluated by extensive experiments with different model architectures, adversarial attacks, and datasets.","Adversarial examples, Forensic investigation" Similarity of Neural Architectures Based on Input Gradient Transferability,https://openreview.net/forum?id=JZRBSoJv7lb,https://openreview.net/pdf?id=JZRBSoJv7lb,We propose a similarity score between neural networks.
We provide analyses of 69 neural architectures using the proposed score.","In this paper, we aim to design a quantitative similarity function between two neural architectures. Specifically, we define a model similarity using input gradient transferability. We generate adversarial samples of two networks and measure the average accuracy of the networks on adversarial samples of each other. If two networks are highly correlated, then the attack transferability will be high, resulting in high similarity. Using the similarity score, we investigate two topics: (1) Which network component contributes to the model diversity? (2) How does model diversity affect practical scenarios? We answer the first question by providing feature importance analysis and clustering analysis. We address the second question with two different scenarios: model ensemble and knowledge distillation. Our findings show that model diversity plays a key role when interacting with different neural architectures. For example, we found that more diversity leads to better ensemble performance. We also observe that the relationship between teacher and student networks and distillation performance depends on the choice of the base architecture of the teacher and student networks. We expect our analysis tool to help with a high-level understanding of the differences between various neural architectures, as well as to offer practical guidance when using multiple architectures.","neural architecture similarity, model similarity, model diversity, model ensemble, knowledge distillation" Image Segmentation using Transfer Learning with DeepLabv3 to Facilitate Photogrammetric Limb Scanning,https://openreview.net/forum?id=Pt1KTsjSfRG,https://openreview.net/pdf?id=Pt1KTsjSfRG,,"In this paper, we explore the use of deep learning (DL) in conjunction with photogrammetry for scanning amputated limbs. Combining these two technologies can expand the scope of prosthetic telemedicine by facilitating low-cost limb scanning using cell phones. Previous research identified image segmentation as one of the main limitations of using photogrammetry for limb scanning. Based on those limitations, this work sought to answer two main research questions: (1) Can a neural network be trained to identify and segment an amputated limb automatically? (2) Will segmenting 2D limb images using neural networks impact the accuracy of 3D models generated via photogrammetry? To answer the first question, transfer learning was applied to a neural network with the DeepLabv3 architecture. After training, the model was able to successfully identify and segment limb images with an IoU of 79.9%. To answer the second question, the fine-tuned DL model was applied to a dataset of 22 scans comprising 6312 limb images, then 3D models were rendered utilizing Agisoft Metashape. The Mean Absolute Error (MAE) of models rendered from images segmented with DL was 0.57 mm ± 0.63 mm when compared to models rendered from ground truth images. These results are important because segmentation with DL makes photogrammetry for limb scanning feasible on a large clinical scale.
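The cross-attack measurement described in the "Similarity of Neural Architectures" entry above reduces to a few lines once an attack routine is fixed. A sketch assuming `attack` is any adversarial-example generator (e.g., PGD) supplied by the caller; the exact score normalization in the paper may differ, and both function names are hypothetical.

```python
import torch

@torch.no_grad()
def evaluate(model, x, y):
    return (model(x).argmax(dim=1) == y).float().mean().item()

def architecture_similarity(model_a, model_b, x, y, attack):
    # Craft adversarial examples on each model and measure how well they
    # transfer to the other; higher cross-attack success => higher similarity.
    x_adv_a = attack(model_a, x, y)
    x_adv_b = attack(model_b, x, y)
    acc_b = evaluate(model_b, x_adv_a, y)  # B attacked with A's examples
    acc_a = evaluate(model_a, x_adv_b, y)  # A attacked with B's examples
    return 1.0 - 0.5 * (acc_a + acc_b)     # low surviving accuracy = similar
```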
Future work should focus on generalizing the segmentation model for different types of amputations and imaging conditions.","3D Scanning, Deep Learning, Image Segmentation, Photogrammetry, Telemedicine" G-Censor: Graph Contrastive Learning with Task-Oriented Counterfactual Views,https://openreview.net/forum?id=LiWGbK8_iOB,https://openreview.net/pdf?id=LiWGbK8_iOB,"Graph Contrastive learning with task-oriented counterfactual positive/negative views, a model-agnostic framework designed for node property prediction tasks.","Graph Contrastive learning (GCL) has achieved great success in learning representations from unlabeled graph-structure data. However, how to automatically obtain the optimal contrastive views w.r.t specific downstream tasks is little studied. Theoretically, a downstream task can be causally correlated to particular sub-structures in graphs. The existing GCL methods may fail to enhance model performance on a given task when the task-related semantics are incomplete/preserved in the positive/negative views. To address this problem, we propose G-CENSOR, i.e., Graph Contrastive lEarniNg with taSk-oriented cOunteRfactual views, a model-agnostic framework designed for node property prediction tasks. G-CENSOR can simultaneously generate the optimal task-oriented counterfactual positive/negative views for raw ego-graphs and train graph neural networks (GNNs) with a contrastive objective between the raw ego-graphs and their corresponding counterfactual views. Extensive experiments on eight real-world datasets demonstrate that G-CENSOR can consistently outperform existing state-of-the-art GCL methods to improve the task performance and generalizability of a series of typical GNNs. To the best of our knowledge, this is a pioneering investigation to explore task-oriented graph contrastive learning from a counterfactual perspective in node property prediction tasks. We will release the source code after the review process.","graph contrastive learning, node property prediction, task-oriented counterfactual views" Unsupervised 3d object learning through neuron activity aware plasticity,https://openreview.net/forum?id=mXPoBtnpMnuy,https://openreview.net/pdf?id=mXPoBtnpMnuy,,"We present an unsupervised deep learning model for 3D object classification. Conventional Hebbian learning, a well-known unsupervised model, suffers from loss of local features leading to reduced performance for tasks with complex geometric objects. We present a deep network with a novel Neuron Activity Aware (NeAW) Hebbian learning rule that dynamically switches neurons to be governed by Hebbian learning or anti-Hebbian learning, depending on their activity. We analytically show that NeAW Hebbian learning relieves the bias in neuron activity, allowing more neurons to attend to the representation of the 3D objects. Empirical results show that NeAW Hebbian learning outperforms other variants of Hebbian learning and shows higher accuracy than fully supervised models when training data is limited.",Hebbian learning Visually-Augmented Language Modeling,https://openreview.net/forum?id=8IN-qLkl215,https://openreview.net/pdf?id=8IN-qLkl215,"We propose a novel pre-trained framework, to Visually-augment text tokens with retrieved relevant images for multimodal grounded Language Modeling.","Human language is grounded on multimodal knowledge including visual knowledge like colors, sizes, and shapes.
However, current large-scale pre-trained language models rely on the text-only self-supervised training with massive text data, which precludes them from utilizing relevant visual information when necessary. To address this, we propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling. Specifically, VaLM builds on a novel latent text-image alignment method via an image retrieval module to fetch corresponding images given a textual context. With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling by attending on both text context and visual knowledge in images. We evaluate VaLM on various visual knowledge intensive commonsense reasoning tasks, which require visual information to excel. The experimental results illustrate that VaLM outperforms all strong language-only and vision-language baselines with substantial gains on reasoning object commonsense including color, size, and shape.","visually-grounded language modeling, visual commonsense reasoning, pre-trained visually-augmented language model" A HIERARCHICAL FRAGMENT-BASED MODEL FOR 3D DRUG-LIKE MOLECULE GENERATION,https://openreview.net/forum?id=walno7E1F8w,https://openreview.net/pdf?id=walno7E1F8w,This paper introduced a hierarchical generative model for fragment-based drug-like molecule generation.,"De novo design of hit molecules is an important task in drug discovery. With the help of deep generative models, 3D molecular point set generation for smaller molecules (QM9) has been proposed by a few researchers. However, it is a non-trivial task to generate drug-like molecules which have relatively large atom numbers in the 3D space. Inspired by the human prior from domain experts, we propose a hierarchical fragment-based model. In order to avoid fragment collisions and maintain chemical validity, we solve the problem by generating high-level features and then sampling specific fragments and edges conditioned on the former. This hierarchical framework can capture basic chemical rules while generating 3D molecules of high quality. To evaluate our model's ability to sample molecules from the drug-like chemical space, we tested our method on multiple metrics. Among all evaluated metrics, our model outperforms the baseline model by a large margin. ","Drug Design, Molecule Generation, Deep Learning, Computational Biology" Multi-Layered 3D Garments Animation,https://openreview.net/forum?id=vmFwJeiSx4X,https://openreview.net/pdf?id=vmFwJeiSx4X,,"Most existing 3D garment animation datasets are restricted to human bodies with single-layered garments. Even though cases with upper shirts and lower pants are included, only a few overlap areas among such garment combinations exist. Moreover, they often regard human body movement as the only driving factor that causes garment animation. Approaches developed on top of these datasets thus tend to model garments as functions of human body parameters such as body shape and pose. While such treatment leads to promising performance on existing datasets, it leaves a gap between experimental environments and real scenarios, where a body can wear multiple layered garments and the corresponding garment dynamics can be affected by environmental factors and garment attributes. Consequently, existing approaches often struggle to generalize to multi-layered garments and realistic scenarios. 
To facilitate the advance of 3D garment animation toward handling more challenging cases, this paper presents a new large-scale synthetic dataset called LAYERS, covering 4,900 different combinations of multi-layered garments with 700k frames in total. The animation of these multi-layered garments follows the laws of physics and is affected by not only human body movements but also random environmental wind and garment attributes. To demonstrate the quality of LAYERS, we further propose a novel method, LayersNet, for 3D garment animation, which represents garments as unions of particles and subsequently adopts a neural network to animate garments via particle-based simulation. In this way, the interactions between different parts of one garment, different garments on the same body, and garments against various driving factors, can be naturally and uniformly handled via the interactions of particles. Through comprehensive experiments, LayersNet demonstrates superior performance in terms of animation accuracy and generality over baselines. The proposed dataset, LAYERS, as well as the proposed method, LayersNet, will be publicly available.", Preventing Mode Collapse When Imitating Latent Policies from Observations,https://openreview.net/forum?id=Mf9fQ0OgMzo,https://openreview.net/pdf?id=Mf9fQ0OgMzo,,"Imitation from observations only (ILfO) is an extension of the classic imitation learning setting to cases where expert observations are easy to obtain but no expert actions are available. Most existing ILfO methods either require access to task-specific cost functions or large amounts of interactions with the target environment. Learning a forward dynamics model in combination with a latent policy has been shown to solve these issues. However, the limited supervision in the ILfO scenario can lead to a mode collapse in learning the generative forward model and the corresponding latent policy. In this paper, we analyse the mode collapse problem and show that it can occur whenever the expert is deterministic, and may also occur due to bad initialization of the models. Under the assumption of piecewise continuous system dynamics, we propose a method to prevent the mode collapse using clustering of expert transitions to pre-train the generative model and the latent policy. We show that the resulting method prevents mode collapse and improves performance in five different OpenAI Gym environments.","Imitation Learning, Imitation from Observations Only, Latent Policy Learning" Unsupervised Learning of Structured Representations via Closed-Loop Transcription,https://openreview.net/forum?id=jZdJd1dGF2A,https://openreview.net/pdf?id=jZdJd1dGF2A,,"This paper proposes an unsupervised method for learning a unified representation that serves both discriminative and generative purposes. While most existing unsupervised learning approaches focus on a representation for only one of these two goals, we show that a unified representation can enjoy the mutual benefits of having both. Such a representation is attainable by generalizing the recently proposed closed-loop transcription framework, known as CTRL, to the unsupervised setting. This entails solving a constrained maximin game over a rate reduction objective that expands features of all samples while compressing features of augmentations of each sample. Through this process, we see discriminative low-dimensional structures emerge in the resulting representations. 
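An illustrative aside on the closed-loop transcription entry above: the abstract does not spell out the rate reduction objective, so the sketch below shows the coding-rate function from the related MCR^2 literature, which this line of work builds on. The exact form used by the authors is an assumption here, not a quotation of their code.

```python
# Minimal sketch of a rate-reduction objective (assumed MCR^2-style form).
# coding_rate(Z) estimates the bits needed to encode features Z (n samples,
# d dims) up to distortion eps; the maximin game expands the rate of all
# features while compressing the rate of each sample's augmentation cluster.
import torch

def coding_rate(Z: torch.Tensor, eps: float = 0.5) -> torch.Tensor:
    """R(Z) = 1/2 logdet(I + d / (n eps^2) Z^T Z) for Z of shape (n, d)."""
    n, d = Z.shape
    I = torch.eye(d, device=Z.device)
    return 0.5 * torch.logdet(I + (d / (n * eps ** 2)) * Z.T @ Z)

def rate_reduction(Z_all: torch.Tensor, clusters: list) -> torch.Tensor:
    # Expand globally, compress each augmentation cluster (list of tensors).
    return coding_rate(Z_all) - sum(coding_rate(Zk) for Zk in clusters) / len(clusters)
```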
Under comparable experimental conditions and network complexities, we demonstrate that these structured representations enable classification performance close to state-of-the-art unsupervised discriminative representations, and conditionally generated image quality significantly higher than that of state-of-the-art unsupervised generative models.","Unsupervised Conditional Image Generation, Self-supervised Generative Model, Closed-Loop Transcription" DETRDistill: A Simple Knowledge Distillation Framework for DETR-Families,https://openreview.net/forum?id=cLcN6JY69aG,https://openreview.net/pdf?id=cLcN6JY69aG,,"Transformer-based detectors (DETRs) have attracted great attention due to their sparse training paradigm and the removal of post-processing operations, but the huge model is computationally time-consuming and difficult to deploy in real-world applications. To tackle this problem, knowledge distillation (KD) can be employed to compress the huge model by constructing a simple teacher-student learning framework. Different from the traditional CNN detectors, where the distillation targets can be naturally aligned through the feature map, DETR regards object detection as a set prediction problem, leading to an unclear relationship between teacher and student during distillation. In this paper, we propose DETRDistill, a novel knowledge distillation framework dedicated to DETR families. We first explore a sparse matching paradigm with progressive stage-by-stage instance distillation. Considering the diverse attention mechanisms adopted in different DETRs, we propose an attention-agnostic feature distillation module to overcome the ineffectiveness of conventional feature imitation. Finally, to fully leverage the intermediate products from the teacher, we introduce teacher-assisted assignment distillation, which greatly alleviates the instability of label assignment caused by bipartite graph matching. Extensive experiments demonstrate that our distillation method achieves significant improvement on various competitive DETR approaches, without introducing extra cost in the inference phase. To the best of our knowledge, this is the first systematic study to explore a general distillation method for DETR-style detectors.","Knowledge Distillation, DETR, Transformer, Model Compression" Closed Boundary Learning for NLP Classification Tasks with the Universum Class,https://openreview.net/forum?id=ygN9NbyVkyy,https://openreview.net/pdf?id=ygN9NbyVkyy,,"The Universum class, often known as the other class or the miscellaneous class, is defined as a collection of samples that do not belong to any class of interest. It is a typical class that exists in many classification-based tasks in natural language processing (NLP), such as relation extraction, named entity recognition, sentiment analysis, etc. During data labeling, a significant number of samples are annotated as Universum because there are always some samples that exist in the dataset but do not belong to preset target classes and are not of interest in the task. The Universum class exhibits very different properties, namely heterogeneity and lack of representativeness in training data; however, existing methods often treat the Universum class the same as the classes of interest. Although the Universum class only contains samples that are not of interest, improper treatment will result in the misclassification of samples of interest.
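A hedged sketch of the teacher-assisted assignment idea from the DETRDistill entry above (our reading of the abstract, not the authors' code): the student reuses a Hungarian assignment computed from the teacher's matching costs, so its supervision targets do not flip between training steps.

```python
# Sketch: reuse the teacher's bipartite matching to supervise student queries.
import numpy as np
from scipy.optimize import linear_sum_assignment

def teacher_assisted_assignment(teacher_cost: np.ndarray) -> np.ndarray:
    """teacher_cost[q, g] is the matching cost of teacher query q against
    ground-truth object g. Returns the query index assigned to each object."""
    q_idx, g_idx = linear_sum_assignment(teacher_cost)
    assignment = np.full(teacher_cost.shape[1], -1)
    assignment[g_idx] = q_idx
    return assignment

# The student's query assignment[g] is then trained to predict object g,
# sidestepping the instability of matching on the student's noisy predictions.
```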
In this work, we propose a closed boundary learning method that treats the Universum class and classes of interest differently. We apply closed decision boundaries to classes of interest and designate the area outside all closed boundaries in the feature space as the space of the Universum class. Specifically, we formulate the closed boundaries as arbitrary shapes, propose a strategy to estimate the probability of the Universum class according to its unique property rather than the within-class sample distribution, and propose a boundary learning loss to learn decision boundaries based on the balance of misclassified samples inside and outside the boundary. We evaluate our method on 6 state-of-the-art works in 3 different tasks, and the performance of all 6 works is improved. Our code will be released on GitHub.", Solving Constrained Variational Inequalities via a First-order Interior Point-based Method,https://openreview.net/forum?id=RQY2AXFMRiu,https://openreview.net/pdf?id=RQY2AXFMRiu,"We derive a first-order method for solving the constrained variational inequality problem with general constraints, by combining interior-point methods and ADMM.","We develop an interior-point approach to solve constrained variational inequality (cVI) problems. Inspired by the efficacy of the alternating direction method of multipliers (ADMM) in the single-objective context, we generalize ADMM to derive a first-order method for cVIs, which we refer to as the ADMM-based interior-point method for constrained VIs (ACVI). We provide convergence guarantees for ACVI in two general classes of problems: (i) when the operator is $\xi$-monotone, and (ii) when it is monotone, some constraints are active and the game is not purely rotational. When, in the latter case, the operator is additionally L-Lipschitz, we match known lower bounds on rates for the gap function of $\mathcal{O}(1/\sqrt{K})$ and $\mathcal{O}(1/K)$ for the last and average iterate, respectively. To the best of our knowledge, this is the first presentation of a first-order interior-point method for the general cVI problem that has a global convergence guarantee. Moreover, unlike previous work in this setting, ACVI provides a means to solve cVIs when the constraints are nontrivial. Empirical analyses demonstrate clear advantages of ACVI over common first-order methods. In particular, (i) cyclical behavior is notably reduced as our methods approach the solution from the analytic center, and (ii) unlike projection-based methods that zigzag when near a constraint, ACVI efficiently handles the constraints.","constrained variational inequality, interior point, admm" MeGraph: Graph Representation Learning on Connected Multi-scale Graphs,https://openreview.net/forum?id=Oz0npxjLAsI,https://openreview.net/pdf?id=Oz0npxjLAsI,We present a novel graph network architecture learning on a mega graph derived by connecting multi-scale graphs. The architecture allows repeated information exchange across multiple scaled graphs.,"We present MeGraph, a novel network architecture for graph-structured data. Given any input graph, we create multi-scale graphs using graph pooling. Then, we connect them into a mega graph by bridging inter-graph edges according to the graph pooling results. Instead of universally stacking graph convolutions over the mega graph, we apply general graph convolutions over intra-graph edges, while the convolutions over inter-graph edges follow a bidirectional pathway to deliver the information along the hierarchy for one turn.
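To make the MeGraph construction above concrete, here is a minimal sketch (a hypothetical helper of ours, not the paper's implementation) of one pooling step of the mega graph: intra-graph edges of the coarse graph plus inter-graph bridges from each fine node to its super-node.

```python
# Sketch: build coarse-graph edges and fine-to-coarse bridge edges from a
# pooling assignment. edges: list of (u, v); cluster: node -> super-node id.
def build_mega_graph(edges, cluster):
    coarse_edges = {(cluster[u], cluster[v]) for u, v in edges if cluster[u] != cluster[v]}
    inter_edges = [(u, cluster[u]) for u in cluster]  # traversed in both directions
    return sorted(coarse_edges), inter_edges

# Example: a 4-node path pooled into clusters A = {0, 1} and B = {2, 3}.
print(build_mega_graph([(0, 1), (1, 2), (2, 3)], {0: "A", 1: "A", 2: "B", 3: "B"}))
# -> ([('A', 'B')], [(0, 'A'), (1, 'A'), (2, 'B'), (3, 'B')])
```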
Graph convolution and graph pooling are two core elementary operations of MeGraph. In our implementation, we adopt the graph full network (GFuN) and propose the strided edge contraction pooling (S-EdgePool) with an adjustable pooling ratio, which are extended from conventional graph convolution and edge contraction pooling. The MeGraph model enables information exchange across multi-scale graphs, repeatedly, for a deeper understanding of wide-range correlations in graphs. This distinguishes MeGraph from many recent hierarchical graph neural networks like Graph U-Nets. We conduct comprehensive empirical studies on tens of public datasets, in which we observe consistent performance gains compared to baselines. Specifically, we establish 5 new graph theory benchmark tasks that require long-term inference and deduction to solve, where MeGraph demonstrates dominant performance compared with popular graph neural networks.","Hierarchical Graph Learning, Multi-scale, Graph Pooling, Graph Neural Networks (GNNs)" Learning Reduced Fluid Dynamics,https://openreview.net/forum?id=fsa9jrF73fo,https://openreview.net/pdf?id=fsa9jrF73fo,Learning optimal model-reduced fluid dynamics,"Predicting the state evolution of ultra high-dimensional, time-reversible fluid dynamic systems is a crucial but computationally expensive task. Model-reduction has been proven to be an effective method to reduce the computational cost by learning a low-dimensional state embedding. However, existing reduced models disregard either the time-reversible property or the nonlinear dynamics, leading to sub-optimal performance. We propose a model-based approach to identify locally optimal, model-reduced, time-reversible, nonlinear fluid dynamic systems. Our main idea is to use stochastic Riemannian optimization to obtain a high-quality reduced fluid model by minimizing the expected trajectory-wise model-reduction error over a given distribution of initial conditions. To this end, our method formulates the reduced fluid dynamics as an invertible state transfer function parameterized by the reduced subspace. We further show that the reduced trajectories are differentiable with respect to the subspace bases over the entire Grassmannian manifold, under proper choices of timestep sizes and numerical integrators. Finally, we propose a loss function measuring the trajectory-wise discrepancy between the original and reduced models. By tensor precomputation, we show that gradient information of such loss functions can be evaluated efficiently over a long trajectory without time-integrating the high-dimensional dynamic system. Through evaluations on a series of simulation benchmarks, we show that our method lowers the discrepancy by 45%-97% compared to conventional reduced models.","Fluid Dynamics, Model Reduction" Symmetric Pruning in Quantum Neural Networks,https://openreview.net/forum?id=K96AogLDT2K,https://openreview.net/pdf?id=K96AogLDT2K,We prove how the symmetry enhances the training performance of QNNs and then devise an efficient symmetric pruning scheme to distill a symmetric ansatz from an over-parameterized and asymmetric ansatz.,"Many fundamental properties of a quantum system are captured by its Hamiltonian and ground state. Despite the significance, ground state preparation (GSP) is classically intractable for large-scale Hamiltonians. Quantum neural networks (QNNs), which exert the power of modern quantum machines, have emerged as a leading protocol to conquer this issue.
As such, the performance enhancement of QNNs becomes the core problem in GSP. Empirical evidence showed that QNNs with handcrafted symmetric ans\"atze generally experience better trainability than those with asymmetric ans\"atze, while theoretical explanations remain vague. To fill this knowledge gap, here we propose the effective quantum neural tangent kernel (EQNTK) and connect this concept with over-parameterization theory to quantify the convergence of QNNs towards the global optima. We uncover that the advantage of symmetric ans\"atze is attributed to their large EQNTK value and low effective dimension, which require few parameters and shallow quantum circuits to reach the over-parameterization regime permitting a benign loss landscape and fast convergence. Guided by EQNTK, we further devise a symmetric pruning (SP) scheme to automatically tailor a symmetric ansatz from an over-parameterized and asymmetric one to greatly improve the performance of QNNs when the explicit symmetry information of the Hamiltonian is unavailable. Extensive numerical simulations are conducted to validate the analytical results of EQNTK and the effectiveness of SP. ","quantum neural networks, symmetry, pruning, quantum neural tangent kernel, effective dimension" Managing Temporal Resolution in Continuous Value Estimation: A Fundamental Trade-off,https://openreview.net/forum?id=ZmYHoQm0SWH,https://openreview.net/pdf?id=ZmYHoQm0SWH,By analyzing Monte-Carlo value estimation for LQR systems we uncover a fundamental trade-off between approximation and statistical error in value estimation.,"A default assumption in reinforcement learning and optimal control is that experience arrives at discrete time points on a fixed clock cycle. Many applications, however, involve continuous systems where the time discretization is not fixed but instead can be managed by a learning algorithm. By analyzing Monte-Carlo value estimation for LQR systems in both finite-horizon and infinite-horizon settings, we uncover a fundamental trade-off between approximation and statistical error in value estimation. Importantly, these two errors behave differently with respect to time discretization, which implies that there is an optimal choice for the temporal resolution that depends on the data budget. These findings show how adapting the temporal resolution can provably improve value estimation quality in LQR systems from finite data. Empirically, we demonstrate the trade-off in numerical simulations of LQR instances and several non-linear environments.","Temporal Discretization, Continuous Time, Langevin System, LQR, Policy Evaluation" On the Robustness of Randomized Ensembles to Adversarial Perturbations,https://openreview.net/forum?id=k3VANp85b4S,https://openreview.net/pdf?id=k3VANp85b4S,"We derive fundamental results on the robustness of randomized ensemble classifiers to adversarial perturbations. Empirically, we propose a boosting algorithm for training robust randomized ensemble classifiers.","Randomized ensemble classifiers (RECs), where one classifier is randomly selected during inference, have emerged as an attractive alternative to traditional ensembling methods for realizing adversarially robust classifiers with limited compute requirements. However, recent works have shown that existing methods for constructing RECs are more vulnerable than initially claimed, casting major doubts on their efficacy and prompting fundamental questions such as: ""When are RECs useful?"", ""What are their limits?"", and ""How do we train them?"".
In this work, we first demystify RECs as we derive fundamental results regarding their theoretical limits, necessary and sufficient conditions for them to be useful, and more. Leveraging this new understanding, we propose a new boosting algorithm (BARRE) for training robust RECs, and empirically demonstrate its effectiveness at defending against strong $\ell_\infty$ norm-bounded adversaries across various network architectures and datasets. Our code is submitted as part of the supplementary material, and will be publicly released on GitHub.","adversarial robustness, efficient inference, randomized ensembles, boosting" Minimum Variance Unbiased N:M Sparsity for the Neural Gradients,https://openreview.net/forum?id=vuD2xEtxZcj,https://openreview.net/pdf?id=vuD2xEtxZcj,A method to use structured N:M sparsity on all training GEMM operations,"In deep learning, fine-grained N:M sparsity reduces the data footprint and bandwidth of a General Matrix multiply (GEMM) by up to 2x, and doubles throughput by skipping computation of zero values. So far, it has mainly been used to prune weights to accelerate the forward and backward phases. We examine how this method can also be used for the neural gradients (i.e. loss gradients with respect to the intermediate neural layer outputs). To this end, we first establish a tensor-level optimality criterion. Previous works aimed to minimize the mean-square-error (MSE) of each pruned block. We show that while minimization of the MSE works fine for pruning the weights and activations, it catastrophically fails for the neural gradients. Instead, we show that accurate pruning of the neural gradients requires an unbiased minimum-variance pruning mask. We design such specialized masks, and find that in most cases, 1:2 sparsity is sufficient for training, and 2:4 sparsity is usually enough when this is not the case. Further, we suggest combining several such methods in order to potentially speed up training even more. A reference implementation is supplied in the supplementary material.","pruning, compression, structured sparsity, acceleration" Incremental Learning of Structured Memory via Closed-Loop Transcription,https://openreview.net/forum?id=XrgjF5-M3xi,https://openreview.net/pdf?id=XrgjF5-M3xi,,"This work proposes a minimal computational model for learning structured memories of multiple object classes in an incremental setting. Our approach is based on establishing a {\em closed-loop transcription} between the classes and a corresponding set of subspaces, known as a linear discriminative representation, in a low-dimensional feature space. Our method is simpler than existing approaches for incremental learning, and more efficient in terms of model size, storage, and computation: it requires only a single, fixed-capacity autoencoding network with a feature space that is used for both discriminative and generative purposes. Network parameters are optimized simultaneously without architectural manipulations, by solving a constrained minimax game between the encoding and decoding maps over a single rate reduction-based objective.
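The minimum-variance unbiased mask from the N:M sparsity entry above has a compact closed form for the 1:2 case; the sketch below is our rendering of that standard estimator, not the reference implementation supplied with the paper.

```python
# Sketch: unbiased 1:2 pruning of a pair (a, b). Keep one entry with
# probability proportional to its magnitude and rescale it, so the output
# equals the input in expectation (unlike MSE-optimal "keep the larger").
import numpy as np

def mvue_1_2(pair, rng):
    a, b = pair
    s = abs(a) + abs(b)
    if s == 0.0:
        return np.zeros(2)
    p_a = abs(a) / s
    return np.array([a / p_a, 0.0]) if rng.random() < p_a else np.array([0.0, b / (1.0 - p_a)])

rng = np.random.default_rng(0)
est = np.mean([mvue_1_2((3.0, 1.0), rng) for _ in range(100_000)], axis=0)
print(est)  # approx [3.0, 1.0]: unbiased despite 50% of entries being zeroed
```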
Experimental results show that our method can effectively alleviate catastrophic forgetting, achieving significantly better performance than prior generative replay work on MNIST, CIFAR-10, and ImageNet-50, despite requiring fewer resources.","Generative Replay Incremental Learning, Closed Loop Transcription" Curved Data Representations in Deep Learning,https://openreview.net/forum?id=_bFeNCnBAl7,https://openreview.net/pdf?id=_bFeNCnBAl7,A comprehensive analysis of curvature for data representations in deep neural networks,"The phenomenal success of deep neural networks inspires many to understand the inner mechanisms of these models. To this end, several works have been studying geometric properties such as the intrinsic dimension of latent data representations produced by the layers of the network. In this paper, we investigate the curvature of data manifolds, i.e., the deviation of the manifold from being flat in its principal directions. We find that state-of-the-art trained convolutional neural networks have a characteristic curvature profile along layers: an initial increase, followed by a long plateau, and then another increase. In contrast, untrained networks exhibit qualitatively and quantitatively different curvature profiles. We also show that the curvature gap between the last two layers is strongly correlated with the performance of the network. Further, we find that the intrinsic dimension of latent data along the network layers is not necessarily indicative of curvature. Finally, we evaluate the effect of common regularizers such as weight decay and mixup on curvature, and we find that mixup-based methods flatten intermediate layers, whereas the final layers still feature high curvatures. Our results indicate that relatively flat manifolds which transform to highly-curved manifolds toward the last layers generalize well to unseen data.","representation learning, curvature analysis, deep neural networks" When Data Geometry Meets Deep Function: Generalizing Offline Reinforcement Learning,https://openreview.net/forum?id=lMO7TC7cuuh,https://openreview.net/pdf?id=lMO7TC7cuuh,,"In offline reinforcement learning (RL), one detrimental issue to policy learning is the error accumulation of the deep \textit{Q} function in out-of-distribution (OOD) areas. Unfortunately, existing offline RL methods are often over-conservative, inevitably hurting generalization performance outside data distribution. In our study, one interesting observation is that deep \textit{Q} functions approximate well inside the convex hull of training data. Inspired by this, we propose a new method, \textit{DOGE (Distance-sensitive Offline RL with better GEneralization)}. DOGE marries dataset geometry with deep function approximators in offline RL, and enables exploitation in generalizable OOD areas rather than strictly constraining the policy within the data distribution. Specifically, DOGE trains a state-conditioned distance function that can be readily plugged into standard actor-critic methods as a policy constraint. Simple yet elegant, our algorithm enjoys better generalization compared to state-of-the-art methods on D4RL benchmarks.
Theoretical analysis demonstrates the superiority of our approach over existing methods that are solely based on data distribution or support constraints.","offline reinforcement learning, deep Q functions generalization" Self-supervised debiasing using low rank regularization,https://openreview.net/forum?id=PHpK5B2iGpq,https://openreview.net/pdf?id=PHpK5B2iGpq,,"Spurious correlations can cause strong biases in deep neural networks, impairing generalization ability. While most existing debiasing methods require full supervision on either spurious attributes or target labels, training a debiased model from a limited amount of both annotations is still an open issue. To overcome such limitations, we first examine an interesting phenomenon through spectral analysis of latent representations: spuriously correlated, easy-to-learn attributes make neural networks inductively biased towards encoding lower effective rank representations. We also show that a rank regularization can amplify this bias in a way that encourages highly correlated features. Motivated by these observations, we propose a self-supervised debiasing framework that is potentially compatible with unlabeled samples. We first pretrain a biased encoder in a self-supervised manner with the rank regularization, serving as a semantic bottleneck to force the encoder to learn the spuriously correlated attributes. This biased encoder is then used to discover and upweight bias-conflicting samples in a downstream task, serving as a boosting mechanism to effectively debias the main model. Remarkably, the proposed debiasing framework significantly improves the generalization performance of self-supervised learning baselines and, in some cases, even outperforms state-of-the-art supervised debiasing approaches.","Debiasing, spurious correlation, self-supervised learning" Wasserstein Gradient Flows for Optimizing GMM-based Policies,https://openreview.net/forum?id=1UBSvnGHFxK,https://openreview.net/pdf?id=1UBSvnGHFxK,Policy structure-aware Optimization via Wasserstein Gradient Flows for Robot Motion Adaptation,"Robots often rely on a repertoire of previously-learned motion policies for performing tasks of diverse complexities. When facing unseen task conditions or when new task requirements arise, robots must adapt their motion policies accordingly. In this context, policy optimization is the de facto paradigm to adapt robot policies as a function of task-specific objectives. Most commonly-used motion policies carry particular structures that are often overlooked in policy optimization algorithms. We instead propose to leverage the structure of probabilistic policies by casting the policy optimization as an optimal transport problem. Specifically, we focus on robot motion policies that build on Gaussian mixture models (GMMs) and formulate the policy optimization as a Wasserstein gradient flow over the GMM space. This naturally allows us to constrain the policy updates via the $L^2$-Wasserstein distance between GMMs to enhance the stability of the policy optimization process. Furthermore, we leverage the geometry of the Bures-Wasserstein manifold to optimize the Gaussian distributions of the GMM policy via Riemannian optimization. We evaluate our approach over a set of common robotic settings: reaching motions, collision-avoidance behaviors, and multi-goal tasks. Our results show that our method outperforms common policy optimization baselines in terms of task success rate and low-variance solutions.
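For the GMM-policy entry above, the basic geometric ingredient is the closed-form $L^2$-Wasserstein distance between Gaussian components; the sketch below uses the standard Bures-Wasserstein formula (a textbook identity, assumed to match the paper's usage).

```python
# Sketch: W2^2 between two Gaussians,
# ||mu1 - mu2||^2 + tr(cov1 + cov2 - 2 (cov1^{1/2} cov2 cov1^{1/2})^{1/2}).
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian(mu1, cov1, mu2, cov2):
    s1 = sqrtm(cov1)
    cross = np.real(sqrtm(s1 @ cov2 @ s1))  # discard tiny imaginary residue
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2.0 * cross))
```

A constraint on GMM policy updates can then be built from such component-wise terms combined with an optimal coupling of the mixture weights.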
","Optimal transport, policy optimization, gaussian mixture models, robot motion adaptation." Compositional Image Generation and Manipulation with Latent Diffusion Models,https://openreview.net/forum?id=SvcawuEiUVM,https://openreview.net/pdf?id=SvcawuEiUVM,,"We propose a principled method for compositional image generation and manipulation using diffusion probabilistic models. In particular, for any pre-trained generative model with a semantic latent space, we train a latent diffusion model and auxiliary latent classifiers to help navigate latent representations in a non-linear fashion. We show that such conditional generation achieved by latent classifier guidance provably maximizes a lower bound of the conditional log-likelihood during training, and can reduce to a simple latent arithmetic method with additional assumption, which is surprisingly under-studied in the context of compositionality. We then derive a new guidance term which is shown to be crucial for maintaining the original semantics when doing manipulation. Unlike previous methods, our method is agnostic to pre-trained generative models and latent spaces, while still achieving competitive performance on compositional image generation as well as sequential manipulation of real and synthetic images.","Compositionality, Diffusion Models" Neural Unbalanced Optimal Transport via Cycle-Consistent Semi-Couplings,https://openreview.net/forum?id=QIpfInYnAu2,https://openreview.net/pdf?id=QIpfInYnAu2,,"Comparing unpaired samples of a distribution or population taken at different points in time is a fundamental task in many application domains where measuring populations is destructive and cannot be done repeatedly on the same sample, such as in single-cell biology. Optimal transport (OT) can solve this challenge by learning an optimal coupling of samples across distributions from unpaired data. However, the usual formulation of OT assumes conservation of mass, which is violated in unbalanced scenarios in which the population size changes (e.g., cell proliferation or death) between measurements. In this work, we introduce NubOT, a neural unbalanced OT formulation that relies on the formalism of semi-couplings to account for creation and destruction of mass. To estimate such semi-couplings and generalize out-of-sample, we derive an efficient parameterization based on neural optimal transport maps and propose a novel algorithmic scheme through a cycle-consistent training procedure. We apply our method to the challenging task of forecasting heterogeneous responses of multiple cancer cell lines to various drugs, where we observe that by accurately modeling cell proliferation and death, our method yields notable improvements over previous neural optimal transport methods.", Prompt Tuning for Graph Neural Networks,https://openreview.net/forum?id=SZojABvWnkx,https://openreview.net/pdf?id=SZojABvWnkx,We explore the prompt tuning method for pre-trained GNN models.,"In recent years, prompt tuning has set off a research boom in the adaptation of pre-trained models. In this paper, we propose Graph Prompt as an efficient and effective alternative to full fine-tuning for adapting the pre-trianed GNN models to downstream tasks. To the best of our knowledge, we are the first to explore the effectiveness of prompt tuning on existing pre-trained GNN models. 
Specifically, without tuning the parameters of the pre-trained GNN model, we train a task-specific graph prompt that provides graph-level transformations on the downstream graphs during the adaptation stage. Then, we introduce a concrete implementation of the graph prompt, called GP-Feature (GPF), which adds learnable perturbations to the feature space of the downstream graph. GPF has strong expressive ability: it can implicitly modify both the node features and the graph structure. Accordingly, we demonstrate that GPF can achieve an effect approximately equivalent to that of any graph-level transformation under most existing pre-trained GNN models. We validate the effectiveness of GPF on numerous pre-trained GNN models, and the experimental results show that with a small amount (about 0.1% of that for fine-tuning) of tunable parameters, GPF can achieve performance comparable to fine-tuning, and even obtain significant performance gains in some cases. ", Budgeted Training for Vision Transformer,https://openreview.net/forum?id=sVzBN-DlJRi,https://openreview.net/pdf?id=sVzBN-DlJRi,,"The superior performance of Vision Transformers often comes with higher training costs. Compared to their CNN counterparts, Transformer models are hungry for large-scale data and their training schedules are usually prolonged. This sets great restrictions on training Transformers with limited resources, where a proper trade-off between training cost and model performance is desired. In this paper, we address the problem by proposing a framework that enables the training process under any training budget, while achieving competitive model performances. Specifically, based on the observation that Transformers exhibit different levels of model redundancy at different stages of training, we propose to dynamically control the activation rate of model parameters along the training process and meet the demand on the training budget by adjusting the duration at each level of model complexity. Extensive experiments demonstrate that our framework is applicable to various Vision Transformers, and achieves competitive performances on a wide range of training budgets.", Knowledge-in-Context: Towards Knowledgeable Semi-Parametric Language Models,https://openreview.net/forum?id=a2jNdqE2102,https://openreview.net/pdf?id=a2jNdqE2102,"We develop a novel knowledge-rich semi-parametric model, KiC, that is able to achieve superior zero-shot performance on unseen tasks with a much smaller model size.","Fully-parametric language models generally require a huge number of model parameters to store the necessary knowledge for solving multiple natural language tasks in zero/few-shot settings. In addition, it is hard to adapt to the changing world knowledge without costly model re-training. In this paper, we develop a novel semi-parametric language model architecture, Knowledge-in-Context (KiC), which empowers a parametric text-to-text language model with a knowledge-rich external memory. Specifically, the external memory contains six different knowledge types: commonsense, entity, event, dictionary, script, and causal knowledge. For each input instance, the KiC model adaptively selects a knowledge type and retrieves the knowledge pieces that are most helpful. The input instance along with its knowledge augmentation is fed into a text-to-text model (e.g., T5) to generate the output answer, where both the input and the output are in natural language forms after prompting.
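A minimal sketch of the GP-Feature idea from the prompt-tuning entry above (assuming a generic frozen GNN encoder; the names here are ours, not the paper's):

```python
# Sketch: GPF adds one learnable perturbation vector to every node feature of
# the downstream graph; the pre-trained GNN stays frozen and only the prompt
# (plus a task head) is tuned.
import torch
import torch.nn as nn

class GPF(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.zeros(feat_dim))  # the only new parameters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.prompt  # broadcasts over nodes: (num_nodes, feat_dim)

# Usage sketch:
# for p in pretrained_gnn.parameters():
#     p.requires_grad_(False)
# logits = head(pretrained_gnn(gpf(node_features), edge_index))
```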
Interestingly, we find that KiC can be identified as a special mixture-of-experts (MoE) model, where the knowledge selector plays the role of a router. This key observation inspires us to develop a novel algorithm for learning KiC with an instance-adaptive knowledge selector. As a knowledge-rich semi-parametric language model, KiC only needs a relatively small parametric part to achieve superior zero-shot performance on unseen tasks. For instance, KiC-large with 770M parameters easily outperforms a 3B model on several benchmarks; that is, KiC exhibits emergent abilities at a much smaller model scale compared to the fully-parametric models.","Semi-parametric language model, text-to-text model, mixture-of-experts, natural language understanding" Understanding and Mitigating Robust Overfitting through the Lens of Feature Dynamics,https://openreview.net/forum?id=0JD3EN75NJE,https://openreview.net/pdf?id=0JD3EN75NJE,,"Adversarial Training (AT) has become arguably the state-of-the-art algorithm for extracting robust features. However, researchers have recently noticed that AT suffers from severe robust overfitting problems, particularly after the learning rate (LR) decay, while the existing static view of feature robustness fails to explain this phenomenon. In this paper, we propose a new dynamic feature robustness framework which takes the dynamic interplay between the model trainer and the attacker into consideration. By tracing temporal and dataset-specific feature robustness, we develop a new understanding of robust overfitting from the dynamics of non-robust features, and empirically verify it on real-world datasets. Built upon this understanding, we explore three techniques to restore the balance between the model trainer and the attacker, and show that they could effectively alleviate robust overfitting and attain state-of-the-art robustness on benchmark datasets. Notably, different from previous studies, our interpretation highlights the necessity of considering the min-max nature of AT for robust overfitting. ","Adversarial Training, Robust Overfitting, Generalization, Robustness, Adversarial Attack" DualMatch: Promoting Semi-Supervised Learning with Hierarchical Label and Contrastive Learning,https://openreview.net/forum?id=zPkbpQdAfFi,https://openreview.net/pdf?id=zPkbpQdAfFi,,"The recently proposed FixMatch and FlexMatch have achieved remarkable results in the field of semi-supervised learning. But these two methods go to two extremes as FixMatch and FlexMatch use a pre-defined constant threshold for all classes and an adaptive threshold for each category, respectively. By only investigating consistency regularization, they suffer from unstable results and indiscriminative feature representations, especially when few labeled samples are available. In this paper, we propose a novel DualMatch method, which can learn an adaptive threshold for all classes to perform instance-level prediction matching, as well as discriminative features via graph-matching-based contrastive learning. We first present a memory-bank based near-global threshold learning strategy to select highly-confident samples. In the meantime, we make full use of the structured information in the hierarchical labels to learn an accurate affinity graph for contrastive learning. DualMatch achieves very stable and superior results on several commonly-used benchmarks.
For example, DualMatch achieves 8.44% and 9.02% error rate reduction over FlexMatch on CIFAR-100 under WRN-28-2 with only 4 and 25 labeled samples per class, respectively. ","Semi-supervised Learning, Contrastive Learning, Hierarchical Label Matching" Augmentative Topology Agents For Open-Ended Learning,https://openreview.net/forum?id=Rywi6F_HVCO,https://openreview.net/pdf?id=Rywi6F_HVCO,This work brings generalization capabilities and ability to solve complex environments to Open Ended Learning framework by adding agents that augment their topologies over time.,"In this work, we tackle the problem of Open-Ended Learning with a method that simultaneously evolves agents and increasingly challenging environments. Unlike previous open-ended approaches that optimize agents using a fixed neural network topology, we hypothesize that generalization can be improved by allowing agents' controllers to become more complex as they encounter more difficult environments. Our method, Augmentative Topology EPOET (ATEP), extends the Enhanced Paired Open-Ended Trailblazer (EPOET) algorithm by allowing agents to evolve their own neural network structures over time, adding complexity and capacity as necessary. Empirical results demonstrate that ATEP results in general agents capable of solving more environments than a fixed-topology baseline. We also investigate mechanisms for transferring agents between environments and find that a species-based approach further improves the performance and generalization of agents.","Open-Ended Learning, NeuroEvolution" Partial Differential Equation-Regularized Neural Networks: An Application to Image Classification,https://openreview.net/forum?id=YurfS_kh5ib,https://openreview.net/pdf?id=YurfS_kh5ib,Learn a PDE (approximated by a neural network) from data for image classification,"Differential equations can be used to design neural networks. For instance, neural ordinary differential equations (neural ODEs) can be considered as a continuous generalization of residual networks. In this work, we present a novel partial differential equation (PDE)-based approach for image classification, where we construct a continuous-depth and continuous-width neural network as a form of solutions of PDEs, and the PDEs defining the evolution of the solutions are also learned from data. Owing to recent advances in identifying PDEs, the presented novel concept, called PR-Net, can be implemented. Our method shows comparable (or better) accuracy and robustness for various datasets and tasks in comparison with neural ODEs and Isometric MobileNet V3. Thanks to the efficient nature of PR-Net, it is suitable for deployment in resource-scarce environments, e.g., in place of MobileNet.","partial differential equations, image classification, physics-informed neural networks" Learning to Boost Resilience of Complex Networks via Neural Edge Rewiring,https://openreview.net/forum?id=RHsOd1Aineq,https://openreview.net/pdf?id=RHsOd1Aineq,We develop an inductive network resilience optimization method with the proposed topology-inspired FireGNN for learning inductive neural edge rewiring to boost resilience of complex networks without rich features.,"The resilience of complex networks, a critical structural characteristic in network science, measures the network's ability to withstand noise corruption and structural changes. Improving resilience typically resorts to minimal modifications of the network structure via degree-preserving edge rewiring-based methods.
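The degree-preserving rewiring move mentioned at the end of the resilience entry above is a classic double edge swap; a minimal sketch follows (a textbook operation, not ResiNet itself):

```python
# Sketch: swap (u, v), (x, y) -> (u, x), (v, y); every node keeps its degree.
import random

def norm(e):  # store undirected edges as sorted tuples
    return tuple(sorted(e))

def double_edge_swap(edges: set, rng: random.Random) -> set:
    (u, v), (x, y) = rng.sample(sorted(edges), 2)
    new1, new2 = norm((u, x)), norm((v, y))
    if u == x or v == y or new1 in edges or new2 in edges:
        return edges  # reject swaps creating self-loops or parallel edges
    return (edges - {(u, v), (x, y)}) | {new1, new2}
```

Resilience optimizers of this family search over sequences of such swaps; per its abstract, ResiNet's contribution is learning which swap to take.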
Despite their effectiveness, existing methods are learning-free, sharing the limitation of transduction: an edge rewiring strategy obtained on one graph cannot be generalized to another. Such a limitation cannot be trivially addressed by existing graph neural networks (GNNs)-based approaches since there are no rich initial node features for GNNs to learn meaningful representations. However, neural edge rewiring relies on GNNs for obtaining meaningful representations from pure graph topologies to select edges. We find that existing GNNs degenerate remarkably with only pure topologies on the resilience task, leading to the undesired infinite action backtracking. In this work, inspired by persistent homology, we specifically design a variant of GNN called FireGNN for learning inductive edge rewiring strategies. Based on meaningful representations from FireGNN, we develop the first end-to-end inductive method, ResiNet, to discover $\textbf{resi}$lient $\textbf{net}$work topologies while balancing network utility. ResiNet reformulates network resilience optimization as a Markov decision process equipped with an edge rewiring action space and learns to select correct edges successively. Extensive experiments demonstrate that ResiNet achieves a near-optimal resilience gain on various graphs while balancing the utility and outperforms existing approaches by a large margin.","complex networks, network resilience, network robustness, graph neural networks" Deep Transformer Q-Networks for Partially Observable Reinforcement Learning,https://openreview.net/forum?id=cddqs4kvC20,https://openreview.net/pdf?id=cddqs4kvC20,,"Real-world reinforcement learning tasks often involve some form of partial observability where the observations only give a partial or noisy view of the true state of the world. Such tasks typically require some form of memory, where the agent has access to multiple past observations, in order to perform well. One popular way to incorporate memory is by using a recurrent neural network to access the agent's history. However, recurrent neural networks in reinforcement learning are often fragile and difficult to train, susceptible to catastrophic forgetting, and sometimes fail completely as a result. In this work, we propose Deep Transformer Q-Networks (DTQN), a novel architecture utilizing transformers and self-attention to encode an agent's history. DTQN is designed modularly, and we compare results against several modifications to our base model. Our experiments demonstrate the transformer can solve partially observable tasks faster and more stably than previous recurrent approaches.", Mind's Eye: Grounded Language Model Reasoning through Simulation,https://openreview.net/forum?id=4rXMRuoJlai,https://openreview.net/pdf?id=4rXMRuoJlai,We present a new reasoning paradigm that grounds language model reasoning on simulation results from the advanced physics engine MuJoCo.,"Successful and effective communication between humans and AI relies on a shared experience of the world. By training solely on written text, current language models (LMs) miss the grounded experience of humans in the real world---their failure to relate language to the physical world causes knowledge to be misrepresented and leads to obvious mistakes in their reasoning. We present Mind's Eye, a paradigm to ground language model reasoning in the physical world.
Given a physical reasoning question, we use a computational physics engine (DeepMind's MuJoCo) to simulate the possible outcomes, and then use the simulation results as part of the input, which enables language models to perform reasoning. Experiments on 39 tasks in a physics alignment benchmark demonstrate that Mind's Eye can improve reasoning ability by a large margin (27.9% zero-shot, and 46.0% few-shot absolute accuracy improvement on average). Smaller language models armed with Mind's Eye can obtain similar performance to models that are 100x larger. Finally, we confirm the robustness of Mind's Eye through ablation studies.","reasoning, alignment, simulation, physics, grounding" Visual Expertise and the Log-Polar Transform Explain Image Inversion Effects,https://openreview.net/forum?id=6bRKHpeZi7,https://openreview.net/pdf?id=6bRKHpeZi7,,"Visual expertise can be defined as the ability to discriminate among subordinate-level objects in homogeneous classes, such as identities of faces within the class ""face"". Despite being able to discriminate many faces, subjects perform poorly at recognizing even familiar faces once inverted. This face-inversion effect is in contrast to subjects’ performance identifying inverted objects for which their experience is at a basic level, which results in less impairment. Experimental results have suggested that when identifying mono-oriented objects, such as cars, car novices' performance is between that of faces and other objects. We build an anatomically-inspired neurocomputational model to explore this effect. Our model includes a foveated retina and the log-polar mapping from the visual field to V1. This transformation causes changes in scale to appear as horizontal translations, leading to scale equivariance. Rotation is similarly equivariant, leading to vertical translations. When fed into a standard convolutional network, this provides rotation and scale invariance. It may be surprising that a rotation-invariant network shows any inversion effect at all. This is because there is a crucial topological difference between scale and rotation: Rotational invariance is discontinuous, with V1 ranging from 90 degrees (vertically up) to 270 degrees (vertically down). Hence when a face is inverted, the configural information in the face is disrupted while feature information is relatively unaffected. We show that the inversion effect arises as a result of visual expertise, where configural information becomes relevant as more identities are learned at the subordinate level. Our model matches the classic result: faces suffer more from inversion than mono-oriented objects, which are more disrupted than non-mono-oriented objects when objects are only familiar at a basic level.", Cross-Protein Wasserstein Transformer for Protein-Protein Interactions,https://openreview.net/forum?id=7HgnhMmbIB,https://openreview.net/pdf?id=7HgnhMmbIB,,"Previous studies reveal intimate relationships between the structure and function of proteins. Motivated by this, for protein-protein interactions (PPIs), we hypothesize that cross-protein structural correspondence, including both global correlation and local co-occurrence, exerts a great influence. Accordingly, a novel deep learning framework named Cross-Protein Wasserstein Transformer (CPWT) is proposed to predict PPI sites through fine-grained cross-graph structural modeling. Considering the irregular architecture of amino acid sequences, for a pair of proteins, graphs are constructed to describe them.
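A hedged sketch of the Mind's Eye pipeline described above (the `simulate` and `lm` interfaces are hypothetical placeholders; the paper uses DeepMind's MuJoCo and its own prompt format):

```python
# Sketch: ground a physics question by injecting a simulator's outcome
# into the language model's prompt.
def minds_eye_answer(question: str, scene: dict, simulate, lm) -> str:
    outcome = simulate(scene)  # e.g. "both balls reach the ground at the same time"
    prompt = (
        f"Simulation result: {outcome}\n"
        f"Question: {question}\n"
        f"Answer:"
    )
    return lm(prompt)
```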
Then, a core Cross-Graph Transformer (CGT) module of two branches (i.e., ligand and receptor branches) is proposed for cross-protein structural modeling. Specifically, in this module, Wasserstein affinity across graphs is calculated through cross-graph query (i.e. ligand (query) - receptor (key) or the converse), based on which the multi-head attention is derived to adaptively mine fine-grained cues of PPI sites. By stacking CGT modules, the two branches in CGT are co-evolved in a deep architecture during forward inference, hence being powerful and advantageous in cross-protein structural representation and fine-grained learning. We verify the effectiveness of our CPWT framework by conducting comprehensive experiments on multiple PPI datasets, and further visualize the learned fine-grained saliencies for intuitive understanding.", What Do Self-Supervised Vision Transformers Learn?,https://openreview.net/forum?id=azCKuYyS74,https://openreview.net/pdf?id=azCKuYyS74,"We show that (1) CL primarily captures global patterns compared with MIM, (2) CL is more shape-oriented whereas MIM is more texture-oriented, and (3) CL plays a key role in the later layers while MIM focuses on the early layers.","We present comparative studies on how and why contrastive learning (CL) and masked image modeling (MIM) differ in their representations and performance on downstream tasks. In particular, self-supervised Vision Transformers (ViTs) have the following properties: (1) CL trains self-attentions to capture longer-range global patterns, such as the shape of an object, compared with MIM. This property of CL helps ViTs linearly separate images in their representation spaces. However, it also makes the self-attentions collapse into homogeneity for all heads, depths, and query tokens. Such homogeneity of self-attention reduces representations' diversity, resulting in worse scalability and dense prediction performance; (2) CL reduces the high-frequency signals of the representations, but MIM amplifies them. Since the low-frequency information stands for the shapes and the high frequencies represent the textures, CL is more shape-oriented, whereas MIM is more texture-oriented; (3) CL plays a crucial role in the later layers of ViT architecture, while MIM mainly focuses on the early layers. Based on these analyses, we find that CL and MIM can complement each other and observe that the simplest harmonization can enjoy the advantages of both methods. ","contrastive learning, masked image modeling, vision transformer, representation learning, self-supervised learning, empirical analysis" Continuous Monte Carlo Graph Search,https://openreview.net/forum?id=9NHWYzbKHLd,https://openreview.net/pdf?id=9NHWYzbKHLd,"This paper proposes Continuous Monte Carlo Graph Search (CMCGS), a novel extension of MCTS to online planning in environments with continuous state and action spaces.","In many complex sequential decision making tasks, online planning is crucial for high performance. For efficient online planning, Monte Carlo Tree Search (MCTS) employs a principled mechanism for trading off between exploration and exploitation. MCTS outperforms comparison methods in various discrete decision making domains such as Go, Chess, and Shogi. Subsequently, extensions of MCTS to continuous domains have been proposed. However, the inherent high branching factor and the resulting explosion of search tree size limit existing methods.
To solve this problem, this paper proposes Continuous Monte Carlo Graph Search (CMCGS), a novel extension of MCTS to online planning in environments with continuous state and action spaces. CMCGS takes advantage of the insight that, during planning, sharing the same action policy between several states can yield high performance. To implement this idea, at each time step CMCGS clusters similar states into a limited number of stochastic action bandit nodes, which produce a layered graph instead of an MCTS search tree. Experimental evaluation with limited sample budgets shows that CMCGS outperforms comparison methods in several complex continuous DeepMind Control Suite benchmarks and a 2D navigation task.","online planning, sequential decision making, monte carlo tree search, MCTS, continuous control" Confident Sinkhorn Allocation for Pseudo-Labeling,https://openreview.net/forum?id=jNt9ql72mBg,https://openreview.net/pdf?id=jNt9ql72mBg,a new pseudo-labeling method for semi-supervised learning without domain knowledge,"Semi-supervised learning is a critical tool in reducing machine learning’s dependence on labeled data. It has been successfully applied to structured data, such as image and language data, by exploiting the inherent spatial and semantic structure therein with pretrained models or data augmentation. Some of these methods are no longer applicable to data where domain structure is not available, because pretrained models or data augmentation cannot be used. Due to their simplicity, existing pseudo-labeling (PL) methods can be widely used without any domain assumption, but are vulnerable to noisy samples and to greedy assignments given a predefined threshold, which is typically unknown. This paper addresses this problem by proposing Confident Sinkhorn Allocation (CSA), which assigns labels only to samples with high confidence scores and learns the best label allocation via optimal transport. CSA outperforms the current state-of-the-art in this practically important area of semi-supervised learning.","pseudo-labeling, semi-supervised learning, tabular data" Rank-1 Matrix Completion with Gradient Descent and Small Random Initialization,https://openreview.net/forum?id=BGqYCl1k1fN,https://openreview.net/pdf?id=BGqYCl1k1fN,Proves the convergence of gradient descent with small random initialization for rank-1 matrix completion.,"The nonconvex formulation of the matrix completion problem has received significant attention in recent years due to its affordable complexity compared to the convex formulation. Gradient descent (GD) is a simple yet efficient baseline algorithm for solving nonconvex optimization problems. The success of GD has been witnessed in many different problems in both theory and practice when it is combined with random initialization. However, previous works on matrix completion require either careful initialization or a regularizer to prove the convergence of GD. In this work, we study rank-1 symmetric matrix completion and prove that GD converges to the ground truth when small random initialization is used. We show that in a logarithmic number of iterations, the trajectory enters the region where local convergence occurs. We provide an upper bound on the initialization size that is sufficient to guarantee convergence and show that a larger initialization can be used as more samples are available.
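The allocation step in the Confident Sinkhorn Allocation entry above is an optimal transport computation; the sketch below shows standard Sinkhorn iterations (the choice of marginals and the confidence filtering are assumptions on our part, not the paper's exact procedure):

```python
# Sketch: soft pseudo-label assignment by alternately matching row sums
# (one unit of mass per confident sample) and column sums (class proportions).
import numpy as np

def sinkhorn(scores: np.ndarray, row_marg, col_marg, n_iters: int = 50):
    K = np.exp(scores)  # scores: log-affinities of samples (rows) to classes (cols)
    for _ in range(n_iters):
        K *= (row_marg / K.sum(axis=1))[:, None]
        K *= (col_marg / K.sum(axis=0))[None, :]
    return K  # argmax per row yields the pseudo-label
```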
We observe that the implicit regularization effect of GD plays a critical role in the analysis: throughout the entire trajectory, it prevents each entry from becoming much larger than the others.","Matrix Completion, Small Initialization, Gradient Descent" Adversarial Robustness based on Randomized Smoothing in Quantum Machine Learning ,https://openreview.net/forum?id=o-Yxq5iicIp,https://openreview.net/pdf?id=o-Yxq5iicIp,"Algorithm, theoretical proof, circuits, and results for a certifiably robust Quantum Computing classifier based on Randomized Smoothing.","We present an end-to-end Quantum Machine Learning algorithm that encodes a classical input into a Quantum Computing state and provides a certified radius for a base classifier, with robustness guarantees based on randomized smoothing - the current state-of-the-art defense against adversarial attacks. Classically, the number of samples, which is also the number of queries to the classifier, scales as $O(1/\epsilon^2)$, where $\epsilon$ is the desired error bound on the expected value of the probability measure $\rho$ defined over the randomized smoothing neighborhood. Our algorithm is designed to solve the same problem for a Quantum Computing classifier. We prove that the number of queries to the classifier scales as $O(1/\epsilon)$ for the same confidence and error bound. We also present the unitary circuit corresponding to the quantum randomized smoothing algorithm, as well as the state preparation methods and circuits for smoothing distributions used to defend against common adversaries - modeled using $l_0$, $l_1$, $l_2$ norms, and other metrics. Results comparing the classical algorithm with a simulation of the quantum algorithm are also discussed.","Quantum Computing, Adversarial Machine learning, Randomized Smoothing, Quantum Amplitude Estimation" Multi-Vector Retrieval as Sparse Alignment,https://openreview.net/forum?id=2EFQ_QlcPs8,https://openreview.net/pdf?id=2EFQ_QlcPs8,We propose a novel multi-vector retrieval model with pairwise alignment and unary salience.,"Multi-vector retrieval models improve over single-vector dual encoders on many information retrieval tasks. In this paper, we cast the multi-vector retrieval problem as sparse alignment between query and document tokens. We propose ALIGNER, a novel multi-vector retrieval model that learns sparsified pairwise alignments between query and document tokens (e.g. `dog' vs. `puppy') and per-token unary saliences reflecting their relative importance for retrieval. We show that controlling the sparsity of pairwise token alignments often brings significant performance gains. While most factoid questions focusing on a specific part of a document require a smaller number of alignments, others requiring a broader understanding of a document favor a larger number of alignments. Unary saliences, on the other hand, decide whether a token ever needs to be aligned with others for retrieval (e.g. `kind' from `what kind of currency is used in new zealand'). With sparsified unary saliences, we are able to prune a large number of query and document token vectors and improve the efficiency of multi-vector retrieval. We learn the sparse unary saliences with entropy-regularized linear programming, which outperforms other methods at achieving sparsity. In a zero-shot setting, ALIGNER scores 51.1 nDCG@10, achieving a new retriever-only state-of-the-art on 13 tasks in the BEIR benchmark.
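The rank-1 setting analyzed in the matrix-completion entry above is easy to reproduce end to end; here is a minimal sketch (our toy instantiation of the analyzed algorithm, with arbitrary problem sizes):

```python
# Sketch: factorized GD on observed entries of a rank-1 symmetric matrix,
# from a small random initialization and with no explicit regularizer.
import numpy as np

rng = np.random.default_rng(0)
n, lr = 50, 0.01
x_star = rng.standard_normal(n)
M = np.outer(x_star, x_star)             # rank-1 ground truth
mask = rng.random((n, n)) < 0.3          # observed entries
x = 1e-6 * rng.standard_normal(n)        # small random initialization

for _ in range(5000):
    R = mask * (np.outer(x, x) - M)      # residual on observed entries
    x -= lr * (R + R.T) @ x              # gradient of 0.5 * ||R||_F^2

print(np.linalg.norm(np.outer(x, x) - M) / np.linalg.norm(M))  # near zero
```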
In addition, adapting pairwise alignments with a few examples (<= 8) further improves performance by up to 15.7 nDCG@10 points on argument retrieval tasks. The unary saliences of ALIGNER help us keep only 20% of the document token representations with minimal performance loss. We further show that our model often produces interpretable alignments and significantly improves its performance when initialized from larger language models.","natural language processing, document retrieval, information retrieval" Sampled Transformer for Point Sets,https://openreview.net/forum?id=F7f4BYnDAIc,https://openreview.net/pdf?id=F7f4BYnDAIc,,"The sparse transformer can reduce the computational complexity of the self-attention layers to $O(n)$, whilst still being a universal approximator of continuous sequence-to-sequence functions. However, this permutation-variant operation is not appropriate for direct application to sets. In this paper, we propose an $O(n)$ complexity sampled transformer that can process point set elements directly without any additional inductive bias. Our sampled transformer introduces random element sampling, which randomly splits point sets into subsets, followed by applying a shared Hamiltonian self-attention mechanism to each subset. The overall attention mechanism can be viewed as a Hamiltonian cycle in the complete attention graph, and the permutation of point set elements is equivalent to randomly sampling Hamiltonian cycles. This mechanism implements a Monte Carlo simulation of the $O(n^2)$ dense attention connections. We show that it is a universal approximator for continuous set-to-set functions. Experimental results for classification and few-shot learning on point clouds show comparable or better accuracy with significantly reduced computational complexity compared to the dense transformer or alternative sparse attention schemes.", Scaling Laws in Mean-Field Games,https://openreview.net/forum?id=fB4V-2QvCEm,https://openreview.net/pdf?id=fB4V-2QvCEm,,"In this work, we attempt to bridge the two largely independently evolving fields of finite-agent and infinite-agent games by studying the scaling laws in mean-field games. The key is to obtain the optimal policies of a set of finite-agent games with different numbers of agents (population sizes) and then investigate how the policies evolve with the population size. However, deriving the closed-form solution for each game is theoretically intractable, training a distinct policy for each game is computationally intensive, and directly applying a policy trained on one game to other games is sub-optimal. We address these challenges through the \textbf{P}opulation-size-\textbf{A}ware \textbf{P}olicy \textbf{O}ptimization (PAPO). Our contributions are three-fold. First, to efficiently generate effective policies for games with different population sizes, we propose PAPO, which unifies two natural options (augmentation and hypernetwork) and achieves significantly better performance. PAPO consists of three components: i) a population-size encoding, which transforms the raw population size into an equivalent encoding to avoid training collapse, ii) a hypernetwork that generates a distinct policy for each game conditioned on the population size, and iii) the population size as an additional input to the generated policy. Next, we construct a multi-task-based training procedure to efficiently train the neural networks of PAPO by sampling data from multiple games with different population sizes.
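To make the hypernetwork component concrete, here is an illustrative PyTorch sketch in the spirit of PAPO: an encoded population size is mapped to the weights of a per-game policy, and the population size is also appended to the policy input. The binary encoding, sizes, and architecture are assumptions for exposition, not the paper's implementation.

```python
# Hypernetwork conditioned on a population-size encoding (illustrative sketch).
import torch
import torch.nn as nn

OBS, ACT, HID, ENC = 8, 4, 32, 10

def encode_population(n, bits=ENC):
    # binary encoding of the population size, rather than feeding raw magnitudes
    return torch.tensor([(n >> i) & 1 for i in range(bits)], dtype=torch.float32)

hyper = nn.Sequential(nn.Linear(ENC, 64), nn.ReLU(),
                      nn.Linear(64, (OBS + 1) * HID + HID * ACT))

def policy_logits(obs, n_agents):
    w = hyper(encode_population(n_agents))            # generated policy weights
    w1 = w[:(OBS + 1) * HID].view(OBS + 1, HID)
    w2 = w[(OBS + 1) * HID:].view(HID, ACT)
    x = torch.cat([obs, torch.tensor([n_agents / 100.0])])  # size as extra input
    return torch.tanh(x @ w1) @ w2

print(policy_logits(torch.randn(OBS), n_agents=25))
```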
Finally, extensive experiments on multiple environments show the significant superiority of PAPO over baselines, and extensive analysis of the scaling laws of the generated policies further deepens our understanding of the two fields of finite-agent and infinite-agent games. To the best of our knowledge, this work presents the first attempt to bridge the two research fields.", PartAfford: Part-level Affordance Discovery,https://openreview.net/forum?id=bQZ2wEYxRBL,https://openreview.net/pdf?id=bQZ2wEYxRBL,Discover 3D object part affordances by learning contrast in affordance compositions.,"Understanding what objects can offer to humans (learning object affordance) is the crux of bridging perception and action. In the vision community, prior work has primarily focused on learning object affordance with dense (e.g., per-pixel) supervision. In stark contrast, we humans learn object affordance without dense labels. As such, the fundamental question in devising a computational model is: What is the natural way to learn object affordance from geometry with humanlike sparse supervision? In this work, we present the new task of part-level affordance discovery (PartAfford): Given only the affordance labels for each object, the machine is tasked to (i) decompose 3D shapes into parts and (ii) discover how each part of the object corresponds to a certain affordance category. We propose a novel learning framework that discovers part-level representations by leveraging only affordance set supervision and geometric primitive regularization, without dense supervision. To learn and evaluate PartAfford, we construct a part-level, cross-category 3D object affordance dataset, annotated with 24 affordance categories shared among >25,000 objects. We demonstrate through extensive experiments that our method enables both the abstraction of 3D objects and part-level affordance discovery, with generalizability to difficult and cross-category examples. Further ablations reveal the contribution of each component.", On The Relative Error of Random Fourier Features for Preserving Kernel Distance,https://openreview.net/forum?id=qs2YCziX2o-,https://openreview.net/pdf?id=qs2YCziX2o-,"We characterize for which kernels the random Fourier features method, proposed in a seminal paper by Rahimi and Recht, preserves the kernel distance with relative error.","The method of random Fourier features (RFF), proposed in a seminal paper by Rahimi and Recht (NIPS'07), is a powerful technique to find approximate low-dimensional representations of points in (high-dimensional) kernel space, for shift-invariant kernels. While RFF has been analyzed under various notions of error guarantee, the ability to preserve the kernel distance with \emph{relative} error is less understood. We show that for a significant range of kernels, including the well-known Laplacian kernels, RFF cannot approximate the kernel distance with small relative error using low dimensions. We complement this by showing that, as long as the shift-invariant kernel is analytic, RFF with $\mathrm{poly}(\epsilon^{-1} \log n)$ dimensions achieves $\epsilon$-relative error for the pairwise kernel distances of $n$ points, and the dimension bound is improved to $\mathrm{poly}(\epsilon^{-1}\log k)$ for the specific application of kernel $k$-means. Finally, going beyond RFF, we take the first step towards data-oblivious dimension reduction for general shift-invariant kernels, and we obtain a similar $\mathrm{poly}(\epsilon^{-1} \log n)$ dimension bound for Laplacian kernels.
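For readers who want the object of study in code, this is the standard NumPy construction of random Fourier features for a shift-invariant kernel (here the Gaussian/RBF kernel); the feature dimension D plays the role of the poly(eps^{-1} log n) bound discussed above.

```python
# Random Fourier features for the Gaussian kernel exp(-gamma * ||x - y||^2).
import numpy as np

rng = np.random.default_rng(0)

def rff(X, D, gamma=1.0):
    """Map X [n, d] to D-dimensional features whose inner products
    approximate the Gaussian kernel."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))  # kernel's spectral density
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = rng.standard_normal((5, 3))
Z = rff(X, D=4096)
approx = Z @ Z.T                                       # approximate kernel matrix
exact = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1))
print(np.abs(approx - exact).max())                    # small approximation error
```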
We also validate the dimension-error tradeoff of our methods on simulated datasets, where they demonstrate superior performance compared with other popular methods, including random projection and Nystr\""{o}m methods.","random Fourier features, kernel methods, dimension reduction, clustering, Laplacian kernel" UTC-IE: A Unified Token-pair Classification Architecture for Information Extraction,https://openreview.net/forum?id=cRQwl-59CU8,https://openreview.net/pdf?id=cRQwl-59CU8,Reformulate and unify three IE tasks as token-pair classifications and propose Plusformer to effectively model the interaction between token pairs.,"Information Extraction (IE) spans several tasks with different output structures, such as named entity recognition, relation extraction and event extraction. Previously, those tasks were solved with different models because of their diverse output structures. Through re-examining IE tasks, we find that all of them can be interpreted as extracting spans and span relations. We propose using the start and end tokens of a span to pinpoint the span in texts, and using the start-to-start and end-to-end token pairs of two spans to determine the relation. Hence, we can unify all IE tasks under the same token-pair classification formulation. Based on the reformulation, we propose a \textbf{U}nified \textbf{T}oken-pair \textbf{C}lassification architecture for \textbf{I}nformation \textbf{E}xtraction (\textbf{UTC-IE}), where we introduce Plusformer on top of the token-pair feature matrix. Specifically, it models axis-aware interaction with plus-shaped self-attention and local interaction with a Convolutional Neural Network over token pairs. Experiments show that our approach outperforms task-specific and unified models on all tasks in 10 datasets, and achieves better or comparable results on 2 joint IE datasets. Moreover, UTC-IE is significantly faster than state-of-the-art IE models on most datasets, which verifies the effectiveness of our architecture.","Information extraction, unified classification, Transformer, CNN" Robust Quantity-Aware Aggregation for Federated Learning,https://openreview.net/forum?id=OUV0Fh5Lgm2,https://openreview.net/pdf?id=OUV0Fh5Lgm2,,"Federated learning (FL) enables multiple clients to collaboratively train models without sharing their local data, and has become an important privacy-preserving machine learning framework. However, classical FL faces serious security and robustness problems, e.g., malicious clients can poison model updates and at the same time claim large quantities to amplify the impact of their model updates in the model aggregation. Existing defense methods for FL, while all handling malicious model updates, either treat all quantities as benign or simply ignore/truncate the quantities of all clients. The former is vulnerable to quantity-enhanced attacks, while the latter leads to sub-optimal performance, since the local data on different clients usually differs significantly in size. In this paper, we propose a robust quantity-aware aggregation algorithm for federated learning, called FedRA, that performs aggregation with awareness of local data quantities while being able to defend against quantity-enhanced attacks. More specifically, we propose a method to filter malicious clients by jointly considering the uploaded model updates and data quantities from different clients, and performing quantity-aware weighted averaging on the model updates from the remaining clients.
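A hedged NumPy sketch of quantity-aware robust aggregation in the spirit of the step just described: the most deviant updates are filtered out, and the rest are averaged with weights proportional to their claimed quantities. The median-distance filter below is an illustrative stand-in for the paper's joint filtering over updates and quantities.

```python
# Filter suspicious clients, then quantity-weighted averaging (illustrative sketch).
import numpy as np

def aggregate(updates, quantities, n_filter):
    """updates: [n_clients, dim]; quantities: [n_clients] claimed sample counts."""
    center = np.median(updates, axis=0)                 # robust reference point
    dist = np.linalg.norm(updates - center, axis=1)
    keep = np.argsort(dist)[: len(updates) - n_filter]  # drop the most deviant clients
    w = quantities[keep] / quantities[keep].sum()       # quantity-aware weights
    return (w[:, None] * updates[keep]).sum(axis=0)

updates = np.random.randn(10, 5)
updates[0] += 50                        # one poisoned update
quantities = np.random.randint(10, 1000, size=10).astype(float)
print(aggregate(updates, quantities, n_filter=2))
```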
Moreover, as the number of malicious clients participating in federated learning may dynamically change in different rounds, we also propose a malicious client number estimator to predict how many suspicious clients should be filtered in each round. Experiments on four public datasets demonstrate the effectiveness of our FedRA method in defending FL against quantity-enhanced attacks. Our code is available at \url{https://anonymous.4open.science/r/FedRA-4C1E}. ","Federated Learning, Robustness, Defense" Efficient debiasing with contrastive weight pruning,https://openreview.net/forum?id=0DIkhwclYX3,https://openreview.net/pdf?id=0DIkhwclYX3,,"Neural networks are often biased to spuriously correlated features that provide misleading statistical evidence that does not generalize. This raises a fundamental question: ""Does an optimal unbiased functional subnetwork exist in a severely biased network? If so, how can we extract such a subnetwork?"" While a few studies have revealed the existence of such optimal subnetworks with the guidance of ground-truth unbiased samples, how to discover the optimal subnetworks with a biased training dataset remains unexplored in practice. To address this, we first present theoretical insight that highlights potential limitations of existing algorithms in exploring unbiased subnetworks in the presence of strong spurious correlations. We then further elucidate the importance of bias-conflicting samples for structure learning. Motivated by these observations, we propose a Debiased Contrastive Weight Pruning (DCWP) algorithm, which probes unbiased subnetworks without expensive group annotations. Experimental results demonstrate that our approach significantly outperforms state-of-the-art debiasing methods despite its considerable reduction in the number of parameters.","Debiasing, spurious correlation, pruning" Linear Convergence of Decentralized FedAvg for Non-Convex Objectives: The Interpolation Regime,https://openreview.net/forum?id=sn8w7P9TrYf,https://openreview.net/pdf?id=sn8w7P9TrYf,Our work shows linear convergence for the Federated Averaging algorithm in the {\em Server} and {\em Decentralized} settings.,"In the age of Big Data, Federated Learning (FL) provides machine learning (ML) practitioners with an indispensable tool for solving large-scale learning problems. FL is a distributed optimization paradigm where multiple nodes, each having access to a local dataset, collaborate (with or without a server) to solve a joint problem. Federated Averaging (FedAvg), although the algorithm of choice for many FL applications, is not well understood, especially in the interpolation regime, a common phenomenon observed in modern overparameterized neural networks. In this work, we address this challenge and perform a thorough theoretical performance analysis of FedAvg in the interpolation regime for training overparameterized neural networks. Specifically, we analyze the performance of FedAvg in two settings: (i) {\em[Server]}: When the network has access to a server that coordinates the information sharing among nodes, and (ii) {\em[Decentralized]:} The serverless setting, where the local nodes communicate over an undirected graph. We consider a class of non-convex functions satisfying the Polyak-Lojasiewicz (PL) condition, a condition that is satisfied by overparameterized neural networks.
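The quantity $T$ analyzed below is the number of local steps per communication round; the following minimal NumPy sketch of server-side FedAvg uses quadratic local losses, which satisfy the PL condition and realize the interpolation regime (one parameter vector fits every node's data). Problem sizes, step size, and data are illustrative assumptions.

```python
# FedAvg with T local steps per round, in an interpolation setting (sketch).
import numpy as np

rng = np.random.default_rng(0)
dim, n_nodes, T, eta = 5, 4, 10, 0.05
A = [rng.standard_normal((20, dim)) for _ in range(n_nodes)]
x_star = rng.standard_normal(dim)
b = [Ai @ x_star for Ai in A]           # interpolation: one x_star fits all nodes

w = np.zeros(dim)
for _ in range(50):                     # communication rounds
    local_models = []
    for i in range(n_nodes):
        wi = w.copy()
        for _ in range(T):              # T local gradient steps on node i
            wi -= eta * A[i].T @ (A[i] @ wi - b[i]) / len(b[i])
        local_models.append(wi)
    w = np.mean(local_models, axis=0)   # server averaging
print("distance to interpolating solution:", np.linalg.norm(w - x_star))
```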
For the first time, we establish that FedAvg under both the {\em Server} and {\em Decentralized} settings achieves linear convergence rates of $\mathcal{O}(T^{3/2} \log(1/\epsilon))$ and $\mathcal{O}(T^{2} \log(1/\epsilon))$, respectively, where $\epsilon$ is the desired solution accuracy and $T$ is the number of local updates at each node. In contrast to the standard FedAvg analysis, our work does not require bounded heterogeneity, variance, or gradient assumptions. Instead, we show that sample-wise (and local) smoothness of the local loss functions suffices to capture the effect of heterogeneity in FL training. We use a novel application of induction to prove the linear convergence in the {\em Decentralized} setting, which can be of independent interest. Finally, we conduct experiments on multiple real datasets to corroborate our theoretical findings.","Polyak-Lojasiewicz (PL) inequality, Federated Averaging, Linear convergence" Rethinking Missing Modality Learning: From a Decoding View,https://openreview.net/forum?id=PRpO-cOCQoX,https://openreview.net/pdf?id=PRpO-cOCQoX,,"The conventional pipeline of multimodal learning consists of three stages: encoding, fusion, and decoding. Most existing methods for the missing-modality condition focus on the first stage and aim to learn modality-invariant representations or reconstruct missing features. However, these methods rely on strong assumptions (i.e., all the pre-defined modalities are available for each input sample during training and the number of modalities is fixed). To solve this problem, we propose a simple yet effective method called Interaction Augmented Prototype Decomposition (IPD) for a more general setting, where the number of modalities is arbitrary and various incomplete modality conditions can occur in both the training and inference phases, including unseen testing conditions. Different from previous methods, we improve the decoding stage. Concretely, IPD jointly learns common and modality-specific task prototypes. Considering that the number of missing-modality conditions scales exponentially with the number of modalities ${\bf O}({\text 2^n})$ and that different conditions may have implicit interactions, a low-rank partial prototype decomposition, supported by theoretical analysis, is employed for the modality-specific components to reduce the complexity. The decomposition can also promote generalization to unseen conditions via the modality factors of existing conditions. To simulate the low-rank setup, we further constrain the explicit interaction of specific modality conditions by employing disentangled contrastive constraints. Extensive results on newly-created benchmarks for multiple tasks illustrate the effectiveness of our proposed model. ","multimodal, decoding, tensor decomposition" Global Nash Equilibrium in a Class of Nonconvex N-player Games,https://openreview.net/forum?id=HcGb9QnNAew,https://openreview.net/pdf?id=HcGb9QnNAew,,"We consider seeking the global Nash equilibrium (NE) in a class of nonconvex N-player games. The structured nonconvex payoffs are composed of canonical functions and quadratic operators, which are broadly investigated in various tasks such as robust network training and sensor network communication.
However, the full-fledged development of nonconvex min-max games may not directly help, due to the interference of multiple players' coupled stationarity conditions, and the existing results on convex games may also perform unsatisfactorily, since they may get stuck at local NEs or Nash stationary points rather than the global NE. Here, we first take a canonical conjugate transformation of the nonconvex N-player game and cast the complementary problem as a variational inequality (VI) problem for the derivation of the global NE. Then we design a conjugate-based ordinary differential equation (ODE) for the solvable VI problem, and establish the equilibrium equivalence and guaranteed convergence of the ODE. Furthermore, we provide a discretized algorithm based on the ODE, and discuss step-size settings and convergence rates in two typical nonconvex N-player games. Finally, we conduct experiments on practical tasks to illustrate the effectiveness of our approach.", UNDERSTANDING PURE CLIP GUIDANCE FOR VOXEL GRID NERF MODELS,https://openreview.net/forum?id=3gZop22KWP,https://openreview.net/pdf?id=3gZop22KWP,We explore various mechanics that prevent adversarial generations when using CLIP as guidance for training a voxel grid NeRF model without any datasets.,"We explore the task of text-to-3D object generation using CLIP. Specifically, we use CLIP for guidance without access to any datasets, a setting we refer to as pure CLIP guidance. While prior work has adopted this setting, there is no systematic study of mechanics for preventing adversarial generations within CLIP. We illustrate how different image-based augmentations prevent the adversarial generation problem, and how the generated results are impacted. We test different CLIP model architectures and show that ensembling different models for guidance can prevent adversarial generations within bigger models and generate sharper results. Furthermore, we implement an implicit voxel grid model to show how neural networks provide an additional layer of regularization, resulting in better geometrical structure and coherency of generated objects. Compared to prior work, we achieve more coherent results with higher memory efficiency and faster training speeds.","Text to 3D Generation, CLIP, NeRF, Adversarial Examples, Augmentation" Neural Semi-Counterfactual Risk Minimization,https://openreview.net/forum?id=m0X42gFenw,https://openreview.net/pdf?id=m0X42gFenw,we study semi-counterfactual risk minimization by considering access to logged datasets with both known and unknown rewards.,"Counterfactual risk minimization is a framework for offline policy optimization with logged data, which consists of context, action, propensity score, and reward for each sample point. In this work, we build on this framework and propose a learning method for settings where the rewards for some samples are not observed, so the logged data consists of a subset of samples with unknown rewards and a subset of samples with known rewards. This setting arises in many application domains, including advertising and healthcare. While reward feedback is missing for some samples, it is possible to leverage the unknown-reward samples in order to minimize the risk, and we refer to this setting as semi-counterfactual risk minimization. To approach this kind of learning problem, we derive new upper bounds on the true risk under the inverse propensity score estimator.
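For concreteness, here is a standard NumPy sketch of the inverse propensity score (IPS) estimator that these bounds are stated for: logged rewards are re-weighted by the ratio of the target policy's action probability to the logging propensity. The logging policy, rewards, and data below are synthetic.

```python
# IPS estimate of a target policy's risk from logged bandit feedback.
import numpy as np

rng = np.random.default_rng(0)
n, n_actions = 1000, 4
contexts = rng.standard_normal((n, 3))
actions = rng.integers(0, n_actions, size=n)
propensities = np.full(n, 1.0 / n_actions)      # uniform logging policy
rewards = (actions == 0).astype(float)          # toy setup: action 0 is rewarded

def ips_risk(pi_probs):
    """pi_probs: [n, n_actions] action probabilities of the target policy.
    Returns the IPS estimate of the expected loss (1 - reward)."""
    w = pi_probs[np.arange(n), actions] / propensities
    return np.mean(w * (1.0 - rewards))

greedy = np.zeros((n, n_actions)); greedy[:, 0] = 1.0
print(ips_risk(greedy))    # near 0: always picking action 0 has low estimated risk
```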
We then build upon these bounds to propose a regularized counterfactual risk minimization method, where the regularization term is based only on the logged unknown-rewards dataset; hence it is reward-independent. We also propose another algorithm based on generating pseudo-rewards for the logged unknown-rewards dataset. Experimental results with neural networks and benchmark datasets indicate that these algorithms can leverage the logged unknown-rewards dataset in addition to the logged known-reward dataset.","semi-Counterfactual risk minimization, importance weighting upper bound, KL regularisation, Batch learning" Task-Agnostic Online Meta-Learning in Non-stationary Environments,https://openreview.net/forum?id=jsZ8PDQOVU,https://openreview.net/pdf?id=jsZ8PDQOVU,We propose a novel algorithm for task-agnostic online meta-learning in non-stationary environments without knowledge of task boundaries.,"Online meta-learning has recently emerged as a marriage between batch meta-learning and online learning, achieving quick adaptation to new tasks in a lifelong manner. However, most existing approaches focus on the restrictive setting where the distribution of the online tasks remains fixed with known task boundaries. In this work, we relax these assumptions and propose a novel algorithm for task-agnostic online meta-learning in non-stationary environments. More specifically, we first propose two simple but effective detection mechanisms for task switches and distribution shifts, based on empirical observations, which serve as key building blocks for more elegant online model updates in our algorithm: the task switch detection mechanism allows reusing the best model available for the current task at hand, and the distribution shift detection mechanism differentiates the meta model update so as to preserve knowledge for in-distribution tasks and quickly learn new knowledge for out-of-distribution tasks. Motivated by recent advances in online learning, our online meta model updates are based only on the current data, which eliminates the need to store previous data, as required in most existing methods. This crucial choice is also well supported by our theoretical analysis of dynamic regret in online meta-learning, where sublinear regret can be achieved by updating the meta model at each round using the current data only. Empirical studies on three different benchmarks clearly demonstrate the significant advantage of our algorithm over related baseline approaches. ","online meta-learning, domain shift, dynamic regret, out of distribution detection" Meta-Weighted Language Model Tuning for Augmentation-Enhanced Few-Shot Learning,https://openreview.net/forum?id=mduJQSy7KE,https://openreview.net/pdf?id=mduJQSy7KE,We study how to effectively tune language models to generate training samples for few-shot learning.,"Recent studies have revealed the intriguing few-shot learning ability of pretrained language models (PLMs): They can quickly adapt to a new task when fine-tuned on a small amount of labeled data formulated as prompts, without requiring abundant task-specific annotations. Despite their promising performance, most existing few-shot approaches that learn only from the small training set still underperform fully supervised training by nontrivial margins.
In this work, we study few-shot learning with PLMs from a different perspective: We first tune an autoregressive PLM on the few-shot samples and then use it as a generator to synthesize a large number of novel training samples that augment the original training set. To encourage the generator to produce label-discriminative samples, we train it via weighted maximum likelihood, where the weight of each token is automatically adjusted based on a discriminative meta-learning objective. A classification PLM can then be fine-tuned on both the few-shot and the synthetic samples with regularization for better generalization and stability. Our approach, FewGen, achieves overall better results than existing few-shot learning methods across seven classification tasks of the GLUE benchmark.","Few-Shot Learning, Natural Language Understanding" Online Reinforcement Learning via Posterior Sampling of Policy,https://openreview.net/forum?id=1uPo_IrEp8,https://openreview.net/pdf?id=1uPo_IrEp8,,"We propose a Reward-Weighted Posterior Sampling of Policy (RWPSP) algorithm to tackle the classic trade-off problem between exploration and exploitation under finite Markov decision processes (MDPs). Thompson sampling has so far only considered posterior sampling over transition probabilities, which makes it hard to attain globally near-optimal rewards. RWPSP instead runs posterior sampling over stationary policy distributions rather than transition probabilities, while keeping the transition probabilities updated. Particularly, we leverage both relevant count functions and reward-weighting to update the policy posterior online, aiming to balance between local and long-term policy distributions for a globally near-optimal game value. Theoretically, we establish a bound of $\tilde{\mathcal{O}}(\Gamma\sqrt{T}/S^{2})$\footnote{The symbol $\tilde{\mathcal{O}}$ hides logarithmic factors.} on the total regret in time horizon $T$ with $\Gamma/S^{2} < D\sqrt{SA}$ satisfied in general, where $S$ and $A$ represent the sizes of the state and action spaces, respectively, and $D$ the diameter. This matches the best regret bound thus far for MDPs. Experimental results corroborate our theoretical results and show the advantage of our algorithm over the state of the art in terms of efficiency.", NewModel: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing,https://openreview.net/forum?id=sE7-XhLxHA,https://openreview.net/pdf?id=sE7-XhLxHA,,"This paper presents a new pre-trained language model, NewModel, which improves the original DeBERTa model by replacing masked language modeling (MLM) with replaced token detection (RTD), a more sample-efficient pre-training task. Our analysis shows that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance. This is because the training losses of the discriminator and the generator pull token embeddings in different directions, creating the “tug-of-war” dynamics. We thus propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics, improving both training efficiency and the quality of the pre-trained model. We have pre-trained NewModel using the same settings as DeBERTa to demonstrate its exceptional performance on a wide range of downstream natural language understanding (NLU) tasks.
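A hedged PyTorch sketch of gradient-disentangled embedding sharing as described above: the discriminator reads the generator's token embeddings through a stop-gradient plus a learned residual, so the RTD loss never pulls on the shared table. Shapes and the toy loss are assumptions for exposition.

```python
# Gradient-disentangled embedding sharing via detach + residual (sketch).
import torch
import torch.nn as nn

vocab, dim = 1000, 64
E_gen = nn.Embedding(vocab, dim)      # generator embeddings, trained by the MLM loss
E_delta = nn.Embedding(vocab, dim)    # discriminator-specific residual
nn.init.zeros_(E_delta.weight)

def disc_embed(token_ids):
    # detach() realizes the disentanglement: RTD gradients update only E_delta
    return E_gen(token_ids).detach() + E_delta(token_ids)

ids = torch.randint(0, vocab, (2, 8))
loss = disc_embed(ids).pow(2).mean()   # stand-in for the RTD loss
loss.backward()
print(E_gen.weight.grad is None, E_delta.weight.grad is not None)  # True True
```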
Taking the GLUE benchmark with eight tasks as an example, the NewModel Large model achieves a 91.37% average score, which is 1.37% over DeBERTa and 1.91% over ELECTRA, setting a new state-of-the-art (SOTA) among models with a similar structure. Furthermore, we have pre-trained a multi-lingual model, mNewModel, and observed a larger improvement over strong baselines compared to English models. For example, the mNewModel Base achieves a 79.8% zero-shot cross-lingual accuracy on XNLI, a 3.6% improvement over XLM-R Base, creating a new SOTA on this benchmark. We will make our model and code publicly available.", Weakly Supervised Neuro-Symbolic Image Manipulation via Multi-Hop Complex Instructions,https://openreview.net/forum?id=RqJZTlQMph,https://openreview.net/pdf?id=RqJZTlQMph,We propose a weakly supervised neuro-symbolic approach for the problem of image manipulation using text instructions.,"We are interested in image manipulation via natural language text – a task that is extremely useful for multiple AI applications but requires complex reasoning over multi-modal spaces. Recent work on neuro-symbolic approaches (Mao et al., 2019) (NSCL) has been quite effective for solving VQA, as it offers better modularity, interpretability, and generalizability. We extend NSCL to the image manipulation task and propose a solution referred to as NeuroSIM. Previous work either requires supervised training data in the form of manipulated images or can only deal with very simple reasoning instructions over single-object scenes. In contrast, NeuroSIM can perform complex multi-hop reasoning over multi-object scenes and only requires weak supervision in the form of annotated data for VQA. NeuroSIM parses an instruction into a symbolic program, based on a Domain Specific Language (DSL) comprising object attributes and manipulation operations, that guides the manipulation. We design neural modules for manipulation, as well as novel loss functions that are capable of testing the correctness of manipulated object and scene graph representations via query networks trained merely on VQA data. An image decoder is trained to render the final image from the manipulated scene graph. Extensive experiments demonstrate that NeuroSIM, without using target images as supervision, is highly competitive with SOTA baselines that make use of supervised data for manipulation.","Neuro-Symbolic Reasoning, Natural Language Guided Image Manipulation, Visual Question Answering, Weakly Supervised Learning" Graph Neural Networks for Aerodynamic Flow Reconstruction from Sparse Sensing,https://openreview.net/forum?id=_kf9GU5c_sE,https://openreview.net/pdf?id=_kf9GU5c_sE,,"Sensing the fluid flow around an arbitrary geometry entails extrapolating from the physical quantities perceived at its surface in order to reconstruct the features of the surrounding fluid. This is a challenging inverse problem, yet one that, if solved, could have a significant impact on many engineering applications. The exploitation of such an inverse logic has gained interest in recent years with the advent of widely available, cheap but capable MEMS-based sensors. When combined with novel data-driven methods, these sensors may allow for flow reconstruction around immersed structures, benefiting applications such as unmanned airborne/underwater vehicle path planning or control, and structural health monitoring of wind turbine blades.
In this work, we train deep reversible Graph Neural Networks (GNNs) to perform flow sensing (flow reconstruction) around two-dimensional aerodynamic shapes: airfoils. Motivated by recent work, which has shown that GNNs can be powerful alternatives to mesh-based forward physics simulators, we implement a Message-Passing Neural Network to simultaneously reconstruct both the pressure and velocity fields surrounding simulated airfoils based on their surface pressure distributions, whilst additionally gathering useful far-field properties in the form of context vectors. We generate a unique dataset of Computational Fluid Dynamics simulations by simulating random yet meaningful combinations of input boundary conditions and airfoil shapes. We show that, despite the challenges associated with reconstructing the flow around arbitrary airfoil geometries under high-Reynolds-number turbulent inflow conditions, our framework is able to generalize well to unseen cases.","CFD, flow reconstruction, GNNs" Learning Binary Networks on Long-Tailed Distributions,https://openreview.net/forum?id=ZEXh0XyO2hh,https://openreview.net/pdf?id=ZEXh0XyO2hh,We propose the first method in the literature to learn binary networks on long-tailed distributions.,"In deploying deep models to real-world scenarios, there are a number of issues, including computational resource constraints and long-tailed data distributions. For the first time in the literature, we address the combined challenge of learning long-tailed distributions under the extreme resource constraints of using binary networks as backbones. Specifically, we propose a framework for calibrating off-the-shelf pretrained full-precision weights that are learned on $\textit{non-long-tailed}$ distributions when training binary networks on long-tailed datasets. In the framework, we additionally propose a novel adversarial balancing method and a multi-resolution learning method for better generalization to diverse semantic domains and input resolutions. We conduct extensive empirical evaluations on 15 datasets, including long-tailed datasets newly derived from existing balanced datasets, which is the largest benchmark in the literature. Our empirical studies show that our proposed method outperforms prior art by large margins, $\textit{e.g.}$, at least $+14.33\%$ on average.","binary neural network, long-tailed recognition, distillation" Deep Attention Pooling Graph Neural Network for Text Classification,https://openreview.net/forum?id=1YE_zTFICdr,https://openreview.net/pdf?id=1YE_zTFICdr,"A fresh model based on GNN with a dual adjacency matrix, and attention pooling for text classification.","Graph Neural Networks (GNNs) are a classical method that has been applied to document classification as a compelling message-passing framework within and between documents. However, graph-based models are transductive when representing documents as nodes in one graph (inter-document), and demand substantial memory and time when a GNN is applied to each document after padding all documents to the length of the longest one (intra-document). This paper proposes a novel method named Deep Attention Pooling Graph Neural Networks (DAPG) that uses the structure of each document for inductive document classification. The attention pooling layer (APL) in DAPG adaptively selects nodes to form smaller graphs based on their scalar attention values, to alleviate resource consumption.
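An illustrative PyTorch sketch of the attention pooling idea just described: each node receives a scalar attention score, and only the top-scoring nodes, gated by their scores, are kept to form a smaller induced subgraph. The scoring layer and keep ratio are assumptions in the spirit of the APL, not the paper's exact layer.

```python
# Scalar-attention top-k node pooling on a graph (illustrative sketch).
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim, ratio=0.5):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.ratio = ratio

    def forward(self, x, adj):
        """x: [n, dim] node features; adj: [n, n] adjacency."""
        s = torch.sigmoid(self.score(x)).squeeze(-1)   # scalar attention per node
        k = max(1, int(self.ratio * x.size(0)))
        idx = s.topk(k).indices                        # keep the top-k nodes
        x_pooled = x[idx] * s[idx].unsqueeze(-1)       # gate kept features by score
        return x_pooled, adj[idx][:, idx]              # induced subgraph

pool = AttentionPool(dim=16)
x, adj = torch.randn(10, 16), (torch.rand(10, 10) > 0.7).float()
x2, adj2 = pool(x, adj)
print(x2.shape, adj2.shape)   # torch.Size([5, 16]) torch.Size([5, 5])
```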
Additionally, to handle structural variation, a fresh dual adjacency matrix for individual graphs, based on word co-occurrence and word distance, is built to overcome sparsity and maintain stability after pooling. Experiments conducted on five standard text classification datasets show that our method is competitive with the state-of-the-art. Ablation studies reveal further insights into the impact of the different components on performance.","GNN, Attention, Pooling, Adjacency matrix, Text Classification" Backdoor Mitigation by Correcting Activation Distribution Alteration,https://openreview.net/forum?id=Yc9tld-ENbf,https://openreview.net/pdf?id=Yc9tld-ENbf,,"Backdoor (Trojan) attacks are an important type of adversarial exploit against deep neural networks (DNNs), wherein a test instance is (mis)classified to the attacker's target class whenever a backdoor trigger is present. In this paper, we reveal and analyze an important property of backdoor attacks: a successful attack causes an alteration in the distribution of internal layer activations for backdoor-trigger instances, compared to that for clean instances. Even more importantly, we find that instances with the backdoor trigger will be correctly classified to their original source classes if this distribution alteration is reversed. Based on our observations, we propose an efficient and effective method that achieves post-training backdoor mitigation by correcting the distribution alteration using reverse-engineered triggers. Notably, our method does not change any trainable parameters of the DNN, yet achieves generally better mitigation performance than existing methods that do require intensive DNN parameter tuning. It also efficiently detects test instances with the trigger, which may help to catch adversarial entities.","backdoor, Trojan, adversarial learning, deep neural network" Pose Transfer using a Single Spatial Transformation,https://openreview.net/forum?id=wF-QfZebInw,https://openreview.net/pdf?id=wF-QfZebInw,,"In this paper, we address the problem of pose transfer: the goal is to regenerate the source image in a new target pose. The pose is provided by a set of spatial landmarks. The transfer function is directly estimated from the difference between the landmarks given in the new target pose and the landmarks of the source image. Existing methods perform the task using two specialized networks, one to move the patches of the source sample and the other to generate the new patches that are not visible in the source image. Contrary to these strategies, we develop an end-to-end trainable neural network that learns to estimate both the visible and invisible parts using a simple warping module. In other words, we propose a flow estimation method that not only displaces the patches to their new locations but also generates new pixels that are not visible in the source image, all in an unsupervised manner, without the need for a ground-truth flow map. In this way, moving the patches and introducing new parts are unified into a single network, ensuring that an overall solution is achieved for these two mutually dependent tasks. Additionally, this method avoids the need for a human observer to determine a trade-off between the performance of the two separate networks, thus avoiding a cartoonish addition of the new parts to the visible areas. Extensive experiments demonstrate the superiority of our method over state-of-the-art algorithms.
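To illustrate the single warping module at the heart of this design, here is a minimal PyTorch sketch of a differentiable warp driven by a predicted flow field; the flow predictor itself is assumed and omitted, and only the warp (grid_sample) is shown.

```python
# Differentiable image warp driven by a per-pixel flow field (sketch).
import torch
import torch.nn.functional as F

def warp(source, flow):
    """source: [B, C, H, W]; flow: [B, 2, H, W] pixel offsets (x, y)."""
    B, _, H, W = source.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float()           # [2, H, W] identity grid
    coords = base[None] + flow                             # displaced sample locations
    # normalize to [-1, 1] as grid_sample expects; grid layout is [B, H, W, 2]
    grid = torch.stack((2 * coords[:, 0] / (W - 1) - 1,
                        2 * coords[:, 1] / (H - 1) - 1), dim=-1)
    return F.grid_sample(source, grid, align_corners=True)

out = warp(torch.randn(1, 3, 64, 64), torch.zeros(1, 2, 64, 64))
print(out.shape)   # zero flow reproduces the source (identity warp)
```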
We conduct our experiments on two well-known datasets: Deepfashion and Market1501. ", Local Distance Preserving Auto-encoders using Continuous k-Nearest Neighbours Graphs,https://openreview.net/forum?id=MpwWSMOlkc,https://openreview.net/pdf?id=MpwWSMOlkc,,"Auto-encoder models that preserve similarities in the data are a popular tool in representation learning. In this paper, we introduce several auto-encoder models that preserve local distances when mapping from the data space to the latent space. We use a local distance-preserving loss that is based on the continuous k-nearest neighbours graph, which is known to capture topological features at all scales simultaneously. To improve training performance, we formulate learning as a constrained optimisation problem, with local distance preservation as the main objective and reconstruction accuracy as a constraint. We generalise this approach to hierarchical variational auto-encoders, thus learning generative models with geometrically consistent latent and data spaces. Our method provides state-of-the-art or comparable performance across several standard datasets and evaluation metrics.","manifold learning, representational learning, generative models" Clustering for directed graphs using parametrized random walk diffusion kernels,https://openreview.net/forum?id=nNpv6IDLu5,https://openreview.net/pdf?id=nNpv6IDLu5,A novel clustering algorithm for directed graphs based on the diffusion geometry framework and parametrized random walk operators,"Clustering based on the random walk operator has been proven effective for undirected graphs, but its generalization to directed graphs (digraphs) is much more challenging. Although the random walk operator is well-defined for digraphs, in most cases such graphs are not strongly connected, and hence the associated random walks are not irreducible, which is a crucial property for clustering that exists naturally in the undirected setting. To remedy this, the usual workaround is to either naively symmetrize the adjacency matrix or to replace the natural random walk operator by the teleporting random walk operator, but this can lead to the loss of valuable information carried by edge directionality. In this paper, we introduce a new clustering framework, Parametrized Random Walk Diffusion Kernel Clustering (P-RWDKC), which is suitable for handling both directed and undirected graphs. Our framework is based on diffusion geometry and the generalized spectral clustering framework. Accordingly, we propose an algorithm that automatically reveals the cluster structure at a given scale, by considering the random walk dynamics associated with a parametrized kernel operator and by estimating its critical diffusion time. Experiments on $K$-NN graphs constructed from real-world datasets and on real-world graphs show that our clustering approach performs well in all tested cases, and outperforms existing approaches in most of them. ","clustering, diffusion geometry, parametrized random walks, directed graphs" Poisoning Generative Models to Promote Catastrophic Forgetting,https://openreview.net/forum?id=a18z-D9l763,https://openreview.net/pdf?id=a18z-D9l763,We develop a novel poisoning attack on generative models to promote catastrophic forgetting.,"Generative models have grown into the workhorse of many state-of-the-art machine learning methods. However, their vulnerability to poisoning attacks has been largely understudied.
In this work, we investigate this issue in the context of continual learning, where generative replayers are utilized to tackle catastrophic forgetting. By developing a novel customization of dirty-label input-aware backdoors for the online setting, our attacker manages to stealthily promote forgetting while retaining high accuracy on the current task and withstanding strong defenses. Our approach taps into an intriguing property of generative models, namely that they cannot capture input-dependent triggers well. Experiments on four standard datasets corroborate the poisoner’s effectiveness. ","poisoning attacks, backdoor attacks" Squeeze Training for Adversarial Robustness,https://openreview.net/forum?id=Z_tmYu060Kr,https://openreview.net/pdf?id=Z_tmYu060Kr,"We highlight that some collaborative examples, which exhibit extremely low prediction loss, can be utilized to enhance adversarial training. A novel method called squeeze training (ST) is thus proposed.","The vulnerability of deep neural networks (DNNs) to adversarial examples has attracted great attention in the machine learning community. The problem is related to the local non-smoothness and steepness of normally obtained loss landscapes. Training augmented with adversarial examples (a.k.a. adversarial training) is considered an effective remedy. In this paper, we highlight that some collaborative examples, nearly perceptually indistinguishable from both adversarial and benign examples yet exhibiting extremely low prediction loss, can be utilized to enhance adversarial training. A novel method is therefore proposed to achieve a new state of the art in adversarial robustness. Code will be made publicly available.","Adversarial Training, Adversarial Examples, Model Robustness" Concealing Sensitive Samples for Enhanced Privacy in Federated Learning,https://openreview.net/forum?id=Ms4S3XC3vtW,https://openreview.net/pdf?id=Ms4S3XC3vtW,A method for improving privacy in federated learning by obfuscating sensitive data with adaptively synthesized concealed samples.,"Federated Learning (FL) is a distributed learning paradigm that promises to protect users’ privacy by not requiring the clients to share their raw and private data with the server. Despite the success, recent studies reveal the vulnerability of FL to model inversion attacks by showing that attackers can reconstruct users’ private data by eavesdropping on the shared gradient information. Most existing defence methods to preserve privacy in FL are formulated to protect all data samples equally, which has in turn proven brittle against attacks and compromises FL performance. In this paper, we argue that data containing sensitive information should take precedence. We present a simple yet effective defence strategy that obfuscates the gradients of the sensitive data with concealed samples. In doing so, we propose to synthesize concealed samples that simulate the sensitive data at the gradient level. Furthermore, we employ a gradient projection technique to obscure sensitive data without compromising the quality of the shared gradients, hence enabling FL to retain its performance. Compared to the previous art, our empirical evaluations suggest that the proposed technique provides the strongest protection while simultaneously maintaining the FL performance.
We also provide examples of how the proposed method can be combined with other defences to boost the privacy-performance trade-off even further.","Privacy preserving, Model inversion attack, Federated Learning" Knowledge Unlearning for Mitigating Privacy Risks in Language Models,https://openreview.net/forum?id=zAxuIJLb38,https://openreview.net/pdf?id=zAxuIJLb38,We propose knowledge unlearning for efficiently providing empirical privacy guarantees for large language models as an alternative solution to existing methods.,"Pretrained Language Models (LMs) memorize a vast amount of knowledge during initial pretraining, including information that may violate the privacy of personal lives and identities. Previous work addressing privacy issues for language models has mostly focused on data preprocessing and differential privacy methods, both requiring re-training the underlying LM. We propose knowledge unlearning as an alternative method to reduce privacy risks for LMs post hoc. We show that simply applying the unlikelihood training objective to target token sequences is effective at forgetting them with little to no degradation of general language modeling performance; it sometimes even substantially improves the underlying LM with just a few iterations. We also find that sequential unlearning is better than trying to unlearn all the data at once, and that unlearning is highly dependent on which kind of data (domain) is forgotten. By comparing with a previous data preprocessing method known to mitigate privacy risks for LMs, we show that unlearning can give a stronger empirical privacy guarantee in scenarios where the data vulnerable to extraction attacks are known a priori, while being orders of magnitude more computationally efficient. We release the code and dataset needed to replicate our results at http://www.omitted.link/.","privacy, large language models, knowledge unlearning, natural language processing" Understanding Graph Contrastive Learning From A Statistical Perspective,https://openreview.net/forum?id=1VQnc0wnIQ,https://openreview.net/pdf?id=1VQnc0wnIQ,"From a statistical perspective, we propose two principles to guide graph contrastive learning.","Although recent advances have brought prosperity to graph contrastive learning, research on universal principles for model design and on desirable properties of latent representations remains inadequate. From a statistical perspective, this paper proposes two principles for guidance and constructs a general graph self-supervised framework. Reformulating data augmentation as a mixture process, the first, termed the consistency principle, emphasizes exploring and mapping cross-view common information to consistent and essence-revealing representations. To instantiate it, four statistical indicators are employed to estimate and maximize the correlation between representations from various views, whose concordant variation during training indicates the extraction of common content. Since the consistency principle alone is insufficient, suffering from degenerate and coupled solutions, a decorrelation principle is put forward to encourage diverse and informative representations. Accordingly, two specific strategies, operating in representation space and eigenspectral space, respectively, are proposed to decouple the various representation channels. Under the two principles, various combinations of concrete implementations derive a family of methods.
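For reference, here is a minimal PyTorch sketch of the InfoNCE loss analyzed next: two augmented views of the same nodes form positive pairs, and all other nodes in the batch act as negatives. The temperature and random inputs are illustrative.

```python
# Standard InfoNCE contrastive loss over two views (sketch).
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.2):
    """z1, z2: [n, d] representations of two augmented views of the same nodes."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau                # [n, n] cross-view similarities
    labels = torch.arange(z1.size(0))       # positive pairs sit on the diagonal
    return F.cross_entropy(logits, labels)

z1, z2 = torch.randn(128, 64), torch.randn(128, 64)
print(info_nce(z1, z2).item())
```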
By decomposing and analyzing the commonly used \textit{InfoNCE} loss, we prove that approaches based on mutual information maximization implicitly fulfill the two principles and are covered by our framework. Comparison experiments with current state-of-the-art methods demonstrate the effectiveness and sufficiency of the two principles for high-quality graph representations. Furthermore, visual studies reveal how the principles affect the learned representations.","graph contrastive learning, unsupervised, general principles" Revisiting the Activation Function for Federated Image Classification,https://openreview.net/forum?id=hf6JLVbAog,https://openreview.net/pdf?id=hf6JLVbAog,We empirically observe that off-the-shelf activation functions used in the centralized setting yield a totally different accuracy ranking in federated learning.,"Federated learning (FL) has become one of the most popular distributed machine learning paradigms; these paradigms enable training on a large corpus of decentralized data that resides on devices. The recent evolution in FL research is mainly credited to refinements of training procedures through the development of optimization methods. However, there has been little verification of other technical improvements, especially improvements to the activation functions (e.g., ReLU) that are widely used in the conventional centralized approach (i.e., standard data-centric optimization). In this work, we verify the effectiveness of activation functions in various federated settings. We empirically observe that off-the-shelf activation functions that are used in centralized settings exhibit a totally different performance trend in federated settings. The experimental results demonstrate that HardTanh achieves the best accuracy when severe data heterogeneity or a low participation rate is present. We provide a thorough analysis to investigate why the representation power of activation functions changes in a federated setting, by measuring similarities in terms of weight parameters and representations. Lastly, we deliver guidelines for selecting activation functions in both a cross-silo setting (i.e., number of clients <= 20) and a cross-device setting (i.e., number of clients >= 100). We believe that our work provides benchmark data and intriguing insights for designing FL models. ","Federated Learning, Activation Function" Rethinking Knowledge Distillation with Raw Features for Semantic Segmentation,https://openreview.net/forum?id=HwbEioBGLo3,https://openreview.net/pdf?id=HwbEioBGLo3,In-depth analysis of raw feature distillation and the design of an effective feature distillation method for semantic segmentation.,"Most existing knowledge distillation methods for semantic segmentation focus on extracting various complex forms of knowledge from raw features. However, such knowledge is usually manually designed and relies on prior knowledge, as in traditional feature engineering. In this paper, in order to seek a simpler and more effective way to perform feature distillation, we analyze the naive feature distillation method with raw features and reveal that it actually attempts to make the student learn both the magnitude and the angular information from the teacher features simultaneously. We further find experimentally that the angular information is more effective than the magnitude information for feature distillation.
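A hedged PyTorch sketch of distilling only the angular component found more effective above: the loss aligns the directions of student and teacher feature vectors via cosine similarity and ignores their magnitudes. The 1x1 projection for matching channel widths is a common convention assumed here, not necessarily the paper's choice.

```python
# Angular-only feature distillation via per-pixel cosine similarity (sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

proj = nn.Conv2d(64, 256, kernel_size=1)     # map student channels to teacher's

def angular_distill_loss(f_student, f_teacher):
    """f_student: [B, 64, H, W]; f_teacher: [B, 256, H, W]."""
    s = F.normalize(proj(f_student), dim=1)  # unit vectors: direction only
    t = F.normalize(f_teacher, dim=1)
    return (1 - (s * t).sum(dim=1)).mean()   # 1 - cosine similarity, per pixel

loss = angular_distill_loss(torch.randn(2, 64, 32, 32), torch.randn(2, 256, 32, 32))
print(loss.item())
```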
Based on this finding, we propose a simple and effective feature distillation method for semantic segmentation, which eliminates the need to manually design distillation knowledge. Experimental results on three popular benchmark datasets show that our method achieves state-of-the-art distillation performance for semantic segmentation. The code will be available.","Knowledge Distillation, Semantic Segmentation, Raw Feature Learning" Open-domain Visual Entity Linking,https://openreview.net/forum?id=xisxfxXJI21,https://openreview.net/pdf?id=xisxfxXJI21,We present a new task (with an associated dataset) that targets linking visual content to entities in a knowledge base,"We introduce the task of Open-domain Visual Entity Linking (OVEN), targeting a wide range of entities including animals, plants, buildings, locations and much more. Given an image (e.g., an image of an aircraft), a text query (`What is the model?' or `What is the airline?'), and a multi-modal knowledge base (e.g., Wikipedia), the goal is to link to an entity (Boeing-777 or EVA Air) out of all entities in the knowledge base. We build a benchmark dataset (OVEN-wiki) by repurposing 14 existing image classification, image retrieval, and visual QA datasets. We link all existing labels to Wikipedia entities when possible, using a state-of-the-art entity linking system and human annotators, creating a diverse and unified label space. OVEN is a rich and challenging task, which requires models to recognize and link visual content to both a small set of seen entities and a much larger set of unseen entities (e.g., unseen aircraft models). OVEN also requires models to generalize to previously unseen intents that may require more fine-grained reasoning (`Who manufactured the aircraft in the back?'). We build strong baselines based on state-of-the-art pre-trained models and find that current pre-trained models struggle to address the challenges posed by OVEN. We hope OVEN will inspire next-generation pre-training techniques and pave the way to future knowledge-intensive vision tasks.","Open-domain Visual Entity Linking, Vision and Language" Robustify Transformers with Robust Kernel Density Estimation,https://openreview.net/forum?id=2Fb-h04mt5I,https://openreview.net/pdf?id=2Fb-h04mt5I,We propose a robust transformer that can alleviate the effect of contaminated data while improving clean-data performance.,"Recent advances in Transformer architecture have empowered its empirical success in various tasks across different domains. However, existing works mainly focus on improving standard accuracy and computational cost, without considering robustness to contaminated samples. Existing work (Nguyen et al., 2022, FourierFormer) has shown that the self-attention mechanism, which is the center of the Transformer architecture, can be viewed as a non-parametric estimator based on the well-known kernel density estimation (KDE). This motivates us to leverage robust kernel density estimation (RKDE) in the self-attention mechanism, to alleviate the issue of data contamination by down-weighting bad samples in the estimation process. The modified self-attention mechanism can be incorporated into different Transformer variants.
Empirical results on language modeling and image classification tasks demonstrate the effectiveness of this approach.","Transformers, Kernel Density Estimation, Robustness" Pushing the Accuracy-Fairness Tradeoff Frontier with Introspective Self-play,https://openreview.net/forum?id=MofT9KEF0kw,https://openreview.net/pdf?id=MofT9KEF0kw,A principled training method to improve a deep model's uncertainty and active learning performance under dataset bias.,"Improving the accuracy-fairness frontier of deep neural network (DNN) models is an important problem. Uncertainty-based active learning (AL) can potentially improve the frontier by preferentially sampling underrepresented subgroups to create a more balanced training dataset. However, the quality of uncertainty estimates from modern DNNs tends to degrade in the presence of spurious correlations and dataset bias, compromising the effectiveness of AL for sampling tail groups. In this work, we propose $Introspective Self-play$ (ISP), a simple approach to improve the uncertainty estimation of a deep neural network under dataset bias, by adding an auxiliary $introspection$ task requiring a model to predict the bias for each data point in addition to the label. We show that ISP provably improves the bias-awareness of the model representation and the resulting uncertainty estimates. On two real-world tabular and language tasks, ISP serves as a simple “plug-in” for AL model training, consistently improving both the tail-group sampling rate and the final accuracy-fairness trade-off frontier of popular AL methods.","Uncertainty Quantification, Spurious Correlation, Active Learning" MDPose: Real-Time Multi-Person Pose Estimation via Mixture Density Model,https://openreview.net/forum?id=csccDKhK7tw,https://openreview.net/pdf?id=csccDKhK7tw,"We reformulate the multi-person pose estimation task as density estimation, enabling real-time instance-aware keypoint estimation without any additional instance identification process.","One of the major challenges in multi-person pose estimation is instance-aware keypoint estimation. Previous methods address this problem by leveraging an off-the-shelf detector, a heuristic post-grouping process, or an explicit instance identification process, hindering further improvements in inference speed, which is an important factor for practical applications. From a statistical point of view, those additional processes for identifying instances are necessary to bypass learning the high-dimensional joint distribution of human keypoints, which is a critical factor for another major challenge, the occlusion scenario. In this work, we propose a novel framework for single-stage instance-aware pose estimation by modeling the joint distribution of human keypoints with a mixture density model, termed MDPose. Our MDPose estimates the distribution of human keypoints' coordinates using a mixture density model with an instance-aware keypoint head consisting simply of 8 convolutional layers. It is trained by minimizing the negative log-likelihood of the ground-truth keypoints. We also propose a simple yet effective training strategy, Random Keypoint Grouping (RKG), which significantly alleviates the underflow problem, leading to successful learning of relations between keypoints. On the OCHuman dataset, which consists of images with highly occluded people, our MDPose achieves state-of-the-art performance by successfully learning the high-dimensional joint distribution of human keypoints.
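A minimal PyTorch sketch of the mixture-density objective just described: the keypoint coordinates are modeled with a mixture of Gaussians and trained by negative log-likelihood. The component count, shapes, and independence structure across keypoints are illustrative assumptions, not the paper's head design.

```python
# Negative log-likelihood of keypoints under a Gaussian mixture (sketch).
import torch
import torch.distributions as D

def mixture_nll(pi_logits, mu, sigma, keypoints):
    """pi_logits: [K]; mu, sigma: [K, J, 2]; keypoints: [J, 2] ground truth."""
    mix = D.Categorical(logits=pi_logits)
    comp = D.Independent(D.Normal(mu, sigma), 2)   # joint density over all J keypoints
    gmm = D.MixtureSameFamily(mix, comp)
    return -gmm.log_prob(keypoints)

K, J = 4, 17                                       # mixture components, keypoints
pi = torch.randn(K, requires_grad=True)
mu = torch.randn(K, J, 2, requires_grad=True)
sigma = torch.ones(K, J, 2)
print(mixture_nll(pi, mu, sigma, torch.randn(J, 2)).item())
```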
Furthermore, our MDPose shows a significant improvement in inference speed with competitive accuracy on MS COCO, a widely-used human keypoint dataset, thanks to our much simpler single-stage pipeline.","Instance-aware Pose Estimation, Density Estimation, Mixture Model" Learning to Predict Parameter for Unseen Data,https://openreview.net/forum?id=6FEULL9vSUt,https://openreview.net/pdf?id=6FEULL9vSUt,,"Typical deep learning models depend heavily on large amounts of training data and resort to an iterative optimization algorithm (e.g., SGD or Adam) for learning network parameters, which makes the training process very time- and resource-intensive. In this paper, we propose a new training paradigm and formulate network parameter training into a prediction task: given a network architecture, we observe that there exist correlations between datasets and their corresponding optimal network parameters, and explore whether we can learn a hyper-mapping between them to capture the relations, such that we can directly predict the parameters of the network for a new dataset never seen during the training phase. To do this, we put forward a new hypernetwork with the purpose of building a mapping between datasets and their corresponding network parameters, and then predict parameters for unseen data with only a single forward propagation of the hypernetwork. At its heart, our model benefits from a series of weight-sharing GRUs to capture the dependencies of parameters among different layers in the network. Extensive experimental studies are performed, and the results validate that our proposed method achieves surprisingly good efficacy. For instance, it takes 119 GPU seconds to train ResNet-18 using Adam from scratch and obtain a top-1 accuracy of 74.56%, while our method costs only 0.5 GPU seconds to predict the network parameters of ResNet-18, achieving comparable performance (73.33%), more than 200 times faster than the traditional training paradigm.","Parameter Prediction, Training paradigm" PADDLES: Phase-Amplitude Spectrum Disentangled Early Stopping for Learning with Noisy Labels,https://openreview.net/forum?id=Sme6eesZqW,https://openreview.net/pdf?id=Sme6eesZqW,We propose a new early stopping training method for learning with noisy labels by choosing different stopping points for the phase and amplitude spectra in the frequency domain.,"Deep Neural Networks (DNNs) have demonstrated superiority in learning various patterns. However, DNNs are sensitive to label noise and easily overfit noisy labels during training. The early stopping strategy halts training beyond the early phase, averting DNNs from overfitting noisy labels, and is widely employed as an effective method when learning with noisy labels. Motivated by biological findings that the amplitude spectrum (AS) and phase spectrum (PS) in the frequency domain play different roles in the animal's vision system, we observe that PS, which captures more semantic information, is more resistant to label noise than AS. Performing early stopping on AS and PS at the same time is therefore undesirable. Instead, we propose stopping early at different times for AS and PS. To achieve this, we disentangle the features of some layer(s) into AS and PS using the Discrete Fourier Transform (DFT) during training. The AS and PS are detached from the gradient computational graph at different training stages. The features are then restored via the inverse DFT (iDFT) for the next layer. We term the proposed method Phase-AmplituDe DisentangLed Early Stopping (PADDLES). 
Simple yet effective, PADDLES outperforms other early stopping methods and obtains state-of-the-art performance on both synthetic and real-world label-noise datasets.","Learning with noisy labels, Frequency domain decomposition, Early Stopping Training" UNREAL: Unlabeled Nodes Retrieval and Labeling for Heavily-imbalanced Node Classification,https://openreview.net/forum?id=Hh0BdBf6Ls,https://openreview.net/pdf?id=Hh0BdBf6Ls,A method for retrieving unlabeled node information to handle heavily-imbalanced node classification,"Extremely skewed label distributions are common in real-world node classification tasks. If not dealt with appropriately, they significantly hurt the performance of GNNs on minority classes. Owing to its practical importance, a series of recent studies has been devoted to this challenge. Existing over-sampling techniques smooth the label distribution by generating ''fake'' minority nodes and synthesizing their features and local topology, largely ignoring the rich information of unlabeled nodes on graphs. Recent methods based on loss function modification re-weight different samples or change classification margins, and achieve good performance. However, representative methods need label information to estimate the distance of each node to its class center, which is unavailable for unlabeled nodes. In this paper, we propose UNREAL, an iterative over-sampling method. The first key difference is that we only add unlabeled nodes instead of synthetic nodes, which eliminates the challenge of feature and neighborhood generation. To select which unlabeled nodes to add, we propose geometric ranking, which ranks unlabeled nodes based on unsupervised learning results in the node embedding space. Finally, we identify the issue of geometric imbalance in the embedding space and provide a simple metric to filter out geometrically imbalanced nodes. Extensive experiments on real-world benchmark datasets are conducted, and the empirical results show that our method significantly and consistently outperforms current state-of-the-art methods on different datasets with different imbalance ratios.","Node Classification, Heavily-imbalanced Representation Learning, Graph Neural Networks" Textless Phrase Structure Induction from Visually-Grounded Speech,https://openreview.net/forum?id=0c2SbGJ3Lt,https://openreview.net/pdf?id=0c2SbGJ3Lt,"The first study on grammar induction from audio-visual inputs, without relying on intermediate text or ASR.","We study phrase structure induction from visually-grounded speech without intermediate text or text pre-trained models. The core idea is to first segment the speech waveform into sequences of word segments, and then induce phrase structure based on the inferred segment-level continuous representations. To this end, we present the Audio-Visual Neural Syntax Learner (AV-NSL), which learns non-trivial phrase structure by listening to audio and looking at images, without ever reading text. Experiments on SpokenCOCO, the spoken version of MSCOCO with paired images and spoken captions, show that AV-NSL infers meaningful phrase structures similar to those learned from naturally-supervised text parsing, both quantitatively and qualitatively. 
The findings in this paper extend prior work in unsupervised language acquisition from speech and grounded grammar induction, and point to one possible way of bridging the gap between the two fields.","unsupervised speech processing, grammar induction, speech representation learning, visually-grounded representation learning" On Nullspace of Vision Transformers and What Does it Tell Us?,https://openreview.net/forum?id=d8VrVfNARSy,https://openreview.net/pdf?id=d8VrVfNARSy,Our work highlights and discusses the concept of nullspace with respect to vision transformers.,"The nullspace of a linear mapping is the subspace that is mapped to the zero vector. For a linear map, adding an element of the nullspace to its input has no effect on the output of the mapping. We position this work as an exposition towards answering one simple question, ``Does a vision transformer have a non-trivial nullspace?'' If TRUE, this would imply that adding elements from this non-trivial nullspace to an input will have no effect on the output of the network. This finding can eventually lead us closer to understanding the generalization properties of vision transformers. In this paper, we first demonstrate that a non-trivial nullspace provably exists for a particular class of vision transformers. The proof follows by simply computing the nullspace of the patch embedding matrices. We extend this idea to the non-linear layers of the vision transformer and show that it is possible to learn a non-linear counterpart to the nullspace via simple optimisations for any vision transformer. Subsequently, we perform studies to understand the robustness properties of ViTs under nullspace noise; in particular, we investigate prediction stability and the (network and interpretation) fooling properties of the noise. Lastly, we provide image watermarking as an application of nullspace noise.","nullspace, vision transformers, robustness, watermarking, fooling interpretations, fooling models" Max-Margin Works while Large Margin Fails: Generalization without Uniform Convergence,https://openreview.net/forum?id=n-hKHMzBgy,https://openreview.net/pdf?id=n-hKHMzBgy,We prove generalization for max-margin solutions on a two-layer NN problem in regimes where uniform convergence bounds provably fail.,"A major challenge in modern machine learning is theoretically understanding the generalization properties of overparameterized models. Many existing tools rely on uniform convergence (UC), a property that, when it holds, guarantees that the test loss will be close to the training loss, uniformly over a class of candidate models. Nagarajan and Kolter (2019) show that in certain simple linear and neural-network settings, any uniform convergence bound will be vacuous, leaving open the question of how to prove generalization in settings where UC fails. Our main contribution is proving novel generalization bounds in two such settings, one linear, and one non-linear. We study the linear classification setting of Nagarajan and Kolter (2019), and a quadratic ground truth function learned via a two-layer neural network in the non-linear regime. We prove a new type of margin bound showing that above a certain signal-to-noise threshold, any near-max-margin classifier will achieve almost no test loss in these two settings. Our results show that near-max-margin is important: while any model that achieves at least a $(1 - \epsilon)$-fraction of the max-margin generalizes well, a classifier achieving half of the max-margin may fail terribly. 
Our analysis provides insight into why memorization can coexist with generalization: we show that in this challenging regime where generalization occurs but UC fails, near-max-margin classifiers simultaneously contain some generalizable components and some overfitting components that memorize the data. The presence of the overfitting components is enough to preclude UC, but the near-extremal margin guarantees that sufficient generalizable components are present.","generalization, uniform convergence, overparameterization, learning theory, max-margin, two-layer neural network" The batch size can affect inference results,https://openreview.net/forum?id=9MDjKb9lGi,https://openreview.net/pdf?id=9MDjKb9lGi,,"When performing matrix multiplication using GPUs, the cuBLAS library is commonly used for computational efficiency. Because of cuBLAS’s heuristics, a large, deep neural network model run on GPUs may produce different test results depending on the batch sizes used in the training and inference stages. In this paper, we show that the batch size affects the inference results of deep neural network models. Our test models were the well-known bidirectional encoder representations from transformers (BERT) and generative pre-trained transformer (GPT) natural language processing (NLP) models, and the super-resolution generative adversarial network (SRGAN) image generation model, in FP32 and TF32. In the TF32 setting, the evaluation loss in BERT using the general language understanding evaluation (GLUE) data sometimes varied for different batch sizes. The GPT model generated different sentences depending on the batch size, and we report the mean square error of the logits as the token length increases. The SRGAN model produced different images from batch to batch. However, these phenomena were not observed under the FP32 setting. Therefore, the batch size must be carefully managed in large deep neural networks under the TF32 setting.","Matrix operation, Floating-point, Batch size, GEMM" Asymptotic Instance-Optimal Algorithms for Interactive Decision Making,https://openreview.net/forum?id=oGVu9spZaJJ,https://openreview.net/pdf?id=oGVu9spZaJJ,We design the first instance-optimal algorithm for general interactive decision making problems.,"Past research on interactive decision making problems (bandits, reinforcement learning, etc.) mostly focuses on the minimax regret that measures the algorithm's performance on the hardest instance. However, an ideal algorithm should adapt to the complexity of a particular problem instance and incur smaller regrets on easy instances than on worst-case instances. In this paper, we design the first asymptotic instance-optimal algorithm for general interactive decision making problems with a finite number of decisions, under mild conditions. On every instance $f$, our algorithm outperforms all consistent algorithms (those achieving non-trivial regrets on all instances), and has asymptotic regret $\mathcal{C}(f) \ln n$, where $\mathcal{C}(f)$ is an exact characterization of the complexity of $f$. The key step of the algorithm involves hypothesis testing with active data collection. It computes the most economical decisions with which the algorithm collects observations to test whether an estimated instance is indeed correct; thus, the complexity $\mathcal{C}(f)$ is the minimum cost to test the instance $f$ against other instances. 
Our results, instantiated on concrete problems, recover the classical gap-dependent bounds for multi-armed bandits and prior works on linear bandits, and improve upon the previous best instance-dependent upper bound for reinforcement learning. ","reinforcement learning theory, instance optimality" GRAPHSENSOR: A Graph Attention Network for Time-Series Sensor Data,https://openreview.net/forum?id=-pAV454n6mS,https://openreview.net/pdf?id=-pAV454n6mS,,"Our work focuses on the exploration of the internal relationships of signals in an individual sensor. In particular, we address the problem of being unable to evaluate such relationships due to the lack of rich and explicit feature representations. To solve this problem, we propose GRAPHSENSOR, a graph attention network with a shared-weight convolution feature encoder that generates the signal segments and learns the internal relationships between them. Furthermore, we enrich the representation of the features by utilizing a multi-head approach when creating the internal relationship graph. Compared with traditional multi-head approaches, we propose a more efficient convolution-based multi-head mechanism, which requires only 56% of the model parameters of the best multi-head baseline, as demonstrated in the experiments. Moreover, GRAPHSENSOR achieves state-of-the-art performance on an electroencephalography dataset and improves accuracy by 13.8% over the best baseline on an inertial measurement unit (IMU) dataset.", ProsodyBERT: Self-Supervised Prosody Representation for Style-Controllable TTS,https://openreview.net/forum?id=7wk9PqiiW2D,https://openreview.net/pdf?id=7wk9PqiiW2D,A self-supervised approach to learning prosody representations from raw audio,"We propose ProsodyBERT, a self-supervised approach to learning prosody representations from raw audio. Different from most previous works, which use information bottlenecks to disentangle prosody features from speech content and speaker information, we perform an offline clustering of speaker-normalized prosody-related features (energy, pitch, their dynamics, etc.) and use the cluster labels as targets for HuBERT-like masked unit prediction. A span boundary loss is also introduced to capture long-range prosodic information. We demonstrate the effectiveness of ProsodyBERT on a multi-speaker style-controllable text-to-speech (TTS) system. Experiments show that the TTS system trained with ProsodyBERT features can generate natural and expressive speech samples, surpassing the model supervised by energy and pitch in subjective human evaluation. Also, the style and expressiveness of the synthesized audio can be controlled by manipulating the prosody features. 
In addition, we achieve new state-of-the-art results on the IEMOCAP emotion recognition task by combining our prosody features with HuBERT features, showing that ProsodyBERT is complementary to popular pretrained speech self-supervised models.","prosody, self-supervised learning, text-to-speech, speech processing, emotion recognition, speech synthesis" FedDebias: Reducing the Local Learning Bias Improves Federated Learning on Heterogeneous Data,https://openreview.net/forum?id=m_thN8e6qrF,https://openreview.net/pdf?id=m_thN8e6qrF,We propose a unified method to reduce the local feature and classifier bias in Federated Learning.,"Federated Learning (FL) is a machine learning paradigm that learns from data kept locally to safeguard the privacy of clients, whereas local SGD is typically employed on the clients' devices to improve communication efficiency. However, such a scheme is currently constrained by the slow and unstable convergence induced by clients' heterogeneous data. In this work, we identify three under-explored phenomena of biased local learning that may explain these challenges caused by local updates in supervised FL. As a remedy, we propose FedDebias, a novel unified algorithm that reduces the local learning bias on features and classifiers to tackle these challenges. FedDebias consists of two components: The first component alleviates the bias in the local classifiers by balancing the output distribution of models. The second component learns client-invariant features that are close to global features but considerably distinct from those learned from other input distributions. In a series of experiments, we show that FedDebias consistently outperforms other SOTA FL and domain generalization (DG) baselines, and that both components yield individual performance gains.",Federated Learning CRISP: Curriculum inducing Primitive Informed Subgoal Prediction for Hierarchical Reinforcement Learning,https://openreview.net/forum?id=ydv0gtW4WLU,https://openreview.net/pdf?id=ydv0gtW4WLU,We effectively leverage expert demonstrations using our curriculum learning based approach to deal with non-stationarity in the context of hierarchical reinforcement learning.,"Hierarchical reinforcement learning is a promising approach that uses temporal abstraction to solve complex long-horizon problems. However, simultaneously learning a hierarchy of policies is unstable, as it is challenging to train a higher-level policy when the lower-level primitive is non-stationary. In this paper, we propose to generate a curriculum of achievable subgoals for evolving lower-level primitives using reinforcement learning and imitation learning. The lower-level primitive periodically performs data relabeling on a handful of expert demonstrations using our primitive-informed parsing method. We derive expressions to bound the sub-optimality of our method and develop a practical algorithm for hierarchical reinforcement learning. Since our approach uses a handful of expert demonstrations, it is suitable for most real-world robotic control tasks. Experimental results on complex maze navigation and robotic manipulation environments show that inducing hierarchical curriculum learning significantly improves sample efficiency and results in better learning of goal-conditioned policies in complex temporally extended tasks. 
","Hierarchical Reinforcement Learning, Inverse Reinforcement Learning, Imitation Learning, Curriculum Learning" Near-Optimal Deployment Efficiency in Reward-Free Reinforcement Learning with Linear Function Approximation,https://openreview.net/forum?id=SNwH0dDGl7_,https://openreview.net/pdf?id=SNwH0dDGl7_,We design algorithms for reward free RL under linear MDP with near-optimal deployment complexity and sample complexity.,"We study the problem of deployment efficient reinforcement learning (RL) with linear function approximation under the \emph{reward-free} exploration setting. This is a well-motivated problem because deploying new policies is costly in real-life RL applications. Under the linear MDP setting with feature dimension $d$ and planning horizon $H$, we propose a new algorithm that collects at most $\widetilde{O}(\frac{d^2H^5}{\epsilon^2})$ trajectories within $H$ deployments to identify $\epsilon$-optimal policy for any (possibly data-dependent) choice of reward functions. To the best of our knowledge, our approach is the first to achieve optimal deployment complexity and optimal $d$ dependence in sample complexity at the same time, even if the reward is known ahead of time. Our novel techniques include an exploration-preserving policy discretization and a generalized G-optimal experiment design, which could be of independent interest. Lastly, we analyze the related problem of regret minimization in low-adaptive RL and provide information-theoretic lower bounds for switching cost and batch complexity.","Reinforcement Learning, Deployment efficiency, Reward free RL, Low adaptive RL" Mitigating Out-of-Distribution Data Density Overestimation in Energy-Based Models,https://openreview.net/forum?id=G1STYDZDBeH,https://openreview.net/pdf?id=G1STYDZDBeH,We investigate why EBMs assign high density to OOD data and propose a method to mitigate this problem.,"Deep energy-based models (EBMs), which use deep neural networks (DNNs) as energy functions, are receiving increasing attention due to their ability to learn complex distributions. To train deep EBMs, the maximum likelihood estimation (MLE) with short-run Langevin Monte Carlo (LMC) is often used. While the MLE with short-run LMC is computationally efficient compared to an MLE with full Markov Chain Monte Carlo (MCMC), it often assigns high density to out-of-distribution (OOD) data. To address this issue, here we systematically investigate why the MLE with short-run LMC can converge to EBMs with wrong density estimates, and reveal that the heuristic modifications to LMC introduced by previous works were the main problem. We then propose a Uniform Support Partitioning (USP) scheme that optimizes a set of points to evenly partition the support of the EBM and then uses the resulting points to approximate the EBM-MLE loss gradient. We empirically demonstrate that USP avoids the pitfalls of short-run LMC, leading to significantly improved OOD data detection performance on Fashion-MNIST.",Energy-Based Model Provably efficient multi-task Reinforcement Learning in large state spaces,https://openreview.net/forum?id=p6wiThIOS5m,https://openreview.net/pdf?id=p6wiThIOS5m,We develop sample efficient reinforcement learning algorithm with general function approximation.,"We study multi-task Reinforcement Learning where shared knowledge among different environments is distilled to enable scalable generalization to a variety of problem instances. 
In the context of general function approximation, the Markov Decision Process (MDP) with low Bilinear rank encapsulates a wide range of structural conditions that permit polynomial sample complexity in large state spaces, where the Bellman errors are related to bilinear forms of features with low intrinsic dimensions. To achieve multi-task learning in MDPs, we propose online representation learning algorithms to capture the shared features in the different task-specific bilinear forms. We show that in the presence of low-rank structures in the features of the bilinear forms, the algorithms benefit from sample complexity improvements compared to single-task learning. We thereby achieve the first sample-efficient multi-task reinforcement learning algorithm with general function approximation.","Reinforcement Learning, Multi-task Learning, Function Approximation, Sample Efficiency" An Equal-Size Hard EM Algorithm for Diverse Dialogue Generation,https://openreview.net/forum?id=k5PEHHY4spM,https://openreview.net/pdf?id=k5PEHHY4spM,"We propose an efficient, effective, and theoretically understood EqHard-EM algorithm for diverse dialogue generation.","Open-domain dialogue systems aim to interact with humans through natural language texts in an open-ended fashion. However, the widely successful neural networks may not work well for dialogue systems, as they tend to generate generic responses. In this work, we propose an Equal-size Hard Expectation--Maximization (EqHard-EM) algorithm to train a multi-decoder model for diverse dialogue generation. Our algorithm assigns a sample to a decoder in a hard manner and additionally imposes an equal-assignment constraint to ensure that all decoders are well-trained. We provide detailed theoretical analysis to justify our approach. Further, experiments on two large-scale, open-domain dialogue datasets verify that our EqHard-EM algorithm generates high-quality diverse responses.","dialogue systems, diverse text generation, EM algorithm" NeuralEQ: Neural-Network-Based Equalizer for High-Speed Wireline Communication,https://openreview.net/forum?id=1Lr5QxntGcM,https://openreview.net/pdf?id=1Lr5QxntGcM,,"The rapid growth of ML applications demands high-performance computing systems to perform massive data processing. In such systems, I/O bandwidth must be scaled up to prevent any performance degradation due to limited data transfer rates. To meet this demand, wireline communication has recently started adopting PAM4 signaling and DSP-based equalizers. However, multi-level signaling and conventional equalizing techniques degrade the bit-error-rate (BER) performance significantly. To mitigate this problem, this paper proposes a novel neural network architecture that mimics the forward-backward algorithm, which estimates the posterior probabilities in Hidden Markov Models. The proposed neural network outperforms existing equalizers such as feed-forward and decision-feedback equalizers, while reducing the complexity of the forward-backward algorithm.","Forward-backward algorithm, Equalizer, Neural network, BER" Which is Better for Learning with Noisy Labels: The Semi-supervised Method or Modeling Label Noise?,https://openreview.net/forum?id=sDCMrYnXNGY,https://openreview.net/pdf?id=sDCMrYnXNGY,,"In real life, accurately annotating large-scale datasets is sometimes difficult. Datasets used for training deep learning models are likely to contain label noise. To make use of datasets containing label noise, two typical methods have been proposed. 
One is to employ the semi-supervised method by exploiting labeled \textit{confident examples} and unlabeled \textit{non-confident examples}. The other is to \textit{model label noise} and design \textit{statistically consistent} classifiers. A natural question remains unsolved: which one should be used for a specific real-world application? In this paper, we answer the question from the perspective of the \textit{causal data generative process}. Specifically, the semi-supervised method depends heavily on the data generation process, while the label-noise modeling method is independent of the generation process. For example, if a given dataset has a causal generative structure in which the features cause the label, the semi-supervised method would not be helpful. When the causal structure is unknown, we provide an intuitive method to discover the causal structure for a given dataset containing label noise.", The hidden uniform cluster prior in self-supervised learning,https://openreview.net/forum?id=04K3PMtMckp,https://openreview.net/pdf?id=04K3PMtMckp,"Many common self-supervised learning frameworks provably impose a hidden uniform prior, which is detrimental when pretraining with real-world class-imbalanced data.","A successful paradigm in representation learning is to perform self-supervised pretraining using tasks based on mini-batch statistics (e.g., SimCLR, VICReg, SwAV, MSN). We show that the formulation of all these methods contains an overlooked prior to learn features that enable uniform clustering of the data. While this prior has led to remarkably semantic representations when pretraining on class-balanced data, such as ImageNet, we demonstrate that it can hamper performance when pretraining on class-imbalanced data. By moving away from conventional uniformity priors and instead preferring power-law distributed feature clusters, we show that one can improve the quality of the learned representations on real-world class-imbalanced datasets. To demonstrate this, we develop an extension of the Masked Siamese Networks (MSN) method to support the use of arbitrary feature priors.","self-supervised learning, unsupervised learning, representation learning, transfer learning" Revisiting Over-smoothing in Graph Neural Networks,https://openreview.net/forum?id=NB69ih1tiA1,https://openreview.net/pdf?id=NB69ih1tiA1,,"Shallow graph neural networks (GNNs) are state-of-the-art models for relational data. However, it is known that deep GNNs suffer from over-smoothing where, as the number of layers increases, node representations become nearly indistinguishable and model performance on the downstream task degrades significantly. Despite multiple approaches being proposed to address this problem, it is unclear when any of these methods (or their combination) works best and how they perform when evaluated under exactly the same experimental setting. In this paper, we systematically and carefully evaluate different methods for alleviating over-smoothing in GNNs. Furthermore, inspired by standard deeply supervised nets, we propose a general architecture that helps alleviate over-smoothing based on the idea of layer-wise supervision. We term this architecture deeply supervised GNNs (or DSGNNs for short). Our experiments show that deeper GNNs can indeed provide better performance when trained on a combination of different approaches and that DSGNNs are robust under various conditions and can provide the best performance in missing-feature scenarios. 
", Optical Flow Regularization of Implicit Neural Representations for Video Frame Interpolation,https://openreview.net/forum?id=G29-Xa55dCXD,https://openreview.net/pdf?id=G29-Xa55dCXD,We show that constraining the derivatives of video INR to satisfy the optical flow constraint equation allows to reach state of the art VFI on limited motion ranges without relying on additional training data.,"Recent works have shown the ability of Implicit Neural Representations (INR) to carry meaningful representations of signal derivatives. In this work, we leverage this property to perform Video Frame Interpolation (VFI) by explicitly constraining the derivatives of the INR to satisfy the optical flow constraint equation. We achieve state of the art VFI on limited motion ranges using only a target video and its optical flow, without learning the interpolation operator from additional training data. We further show that constraining the INR derivatives not only allows to better interpolate intermediate frames but also improves the ability of narrow networks to fit the observed frames, which suggests potential applications to video compression and INR optimization.","Implicit Neural Representation, Video Representation, Video Frame Interpretation" Mosaic Representation Learning for Self-supervised Visual Pre-training,https://openreview.net/forum?id=JAezPMehaUu,https://openreview.net/pdf?id=JAezPMehaUu,"We propose a simple and effective mosaic representation learning framework consisting of a new data augmentation strategy, which aims to adequately learn discriminative feature representations.","Self-supervised learning has achieved significant success in learning visual representations without the need for manual annotation. To obtain generalizable representations, a meticulously designed data augmentation strategy is one of the most crucial parts. Recently, multi-crop strategies utilizing a set of small crops as positive samples have been shown to learn spatially structured features. However, it overlooks the diverse contextual backgrounds, which reduces the variance of the input views and degenerates the performance. To address this problem, we propose a mosaic representation learning framework (MosRep), consisting of a new data augmentation strategy that enriches the backgrounds of each small crop and improves the quality of visual representations. Specifically, we randomly sample numbers of small crops from different input images and compose them into a mosaic view, which is equivalent to introducing different background information for each small crop. Additionally, we further jitter the mosaic view to prevent memorizing the spatial locations of each crop. Along with optimization, our MosRep gradually extracts more discriminative features. Extensive experimental results demonstrate that our method improves the performance far greater than the multi-crop strategy on a series of downstream tasks, e.g., +7.4% and +4.9% than the multi-crop strategy on ImageNet-1K with 1% label and 10% label, respectively. ","self-supervised learning, computer vision" Inverse Optimal Transport with Application to Contrastive Learning,https://openreview.net/forum?id=OA-gbD-ANFt,https://openreview.net/pdf?id=OA-gbD-ANFt,,"Previous works in contrastive learning (CL) mainly focus on pairwise views to learn the representations by attracting the positive samples and repelling negative ones. 
In this work, we understand CL through a collective point-set matching view and solve this problem with the formulation of inverse optimal transport (IOT), a min-min optimization for learning the features. By varying the relaxation degree of constraints in the inner minimization of IOT, one can naturally obtain three different contrastive losses and reveal that InfoNCE is a special case of them, which offers a new and more general understanding of CL. In addition, based on our soft matching view, we propose a uniformity penalty to improve representation learning. Experimental results show the effectiveness of our methods.","Contrastive Learning, Inverse Optimal Transport" Learning Multi-Object Positional Relationships via Emergent Communication,https://openreview.net/forum?id=Bc_R_YyycnK,https://openreview.net/pdf?id=Bc_R_YyycnK,,"The study of emergent communication has been dedicated to interactive artificial intelligence. While existing work focuses on communication about single objects or complex image scenes, we argue that communicating relationships between multiple objects is important in more realistic tasks, but understudied. In this paper, we try to fill this gap and focus on emergent communication about positional relationships between two objects. We train agents in the referential game where observations contain two objects, and find that generalization is the major problem when the positional relationship is involved. The key factor affecting the generalization ability of the emergent language is the input variation between Speaker and Listener, which is realized by a random image generator in our work. Further, we find that the learned language can generalize well in a new multi-step MDP task where the positional relationship describes the goal, and performs better than raw-pixel images as well as pre-trained image features, verifying the strong generalization ability of discrete sequences. We also show that language transfer from the referential game performs better in the new task than learning language directly in this task, implying the potential benefits of pre-training in referential games. All in all, our experiments demonstrate the viability and merit of having agents learn to communicate positional relationships between multiple objects through emergent communication.",emergent communication FluidLab: A Differentiable Environment for Benchmarking Complex Fluid Manipulation,https://openreview.net/forum?id=Cp-io_BoFaE,https://openreview.net/pdf?id=Cp-io_BoFaE,,"Humans manipulate various kinds of fluids in their everyday life: creating latte art, scooping floating objects from water, rolling an ice cream cone, etc. Using robots to augment or replace human labor in these daily settings remains a challenging task due to the multifaceted complexities of fluids. Previous research in robotic fluid manipulation mostly considers fluids governed by an ideal, Newtonian model in simple task settings (e.g., pouring water into a container). However, the vast majority of real-world fluid systems manifest their complexities in terms of the fluid’s complex material behaviors (e.g., elastoplastic deformation) and multi-component interactions (e.g., coffee and frothed milk when making latte art), both of which are well beyond the scope of the current literature. To evaluate robot learning algorithms on understanding and interacting with such complex fluid systems, a comprehensive virtual platform with versatile simulation capabilities and well-established tasks is needed. 
In this work, we introduce FluidLab, a simulation environment with a diverse set of manipulation tasks involving complex fluid dynamics. These tasks address interactions between solids and fluids as well as among multiple fluids. At the heart of our platform is a fully differentiable physics simulator, FluidEngine, providing GPU-accelerated simulations and gradient calculations for various material types and their couplings, extending the scope of the existing differentiable simulation engines. We identify several challenges for fluid manipulation learning by evaluating a set of reinforcement learning and trajectory optimization methods on our platform. To address these challenges, we propose several domain-specific optimization schemes coupled with differentiable physics, which are empirically shown to be effective in tackling optimization problems characterized by fluid systems’ non-convex and non-smooth properties. FluidLab and FluidEngine will be publicly available.","Complex Fluid Manipulation, Differentiable Physics" "Route, Interpret, Repeat: Blurring the Line Between Posthoc Explainability and Interpretable Models ",https://openreview.net/forum?id=jVDhQl8mPpx,https://openreview.net/pdf?id=jVDhQl8mPpx,We seek to iteratively carve out the interpretable model from a trained blackbox and explain the prediction in terms of human-interpretable concepts using First Order Logic (FOL),"The current approach to ML model design is either to choose a flexible Blackbox model and explain it post hoc or to start with an interpretable model. Blackbox models are flexible but difficult to explain, whereas interpretable models are designed to be explainable. However, developing interpretable models necessitates extensive ML knowledge, and the resulting models tend to be less flexible, offering potentially subpar performance compared to their Blackbox equivalents. This paper aims to blur the distinction between a post hoc explanation of a Blackbox and constructing interpretable models. We propose beginning with a flexible Blackbox model and gradually carving out a mixture of interpretable models and a residual network. Our design identifies a subset of samples and routes them through the interpretable models. The remaining samples are routed through a flexible residual network. We adopt First Order Logic (FOL) as the interpretable model's backbone, which provides basic reasoning on the concepts retrieved from the Blackbox model. On the residual network, we repeat the method until the proportion of data explained by the residual network falls below a desired threshold. Our approach offers several advantages. First, the mixture of interpretable and flexible residual networks results in almost no compromise in performance. Second, the route, interpret, and repeat approach yields a highly flexible interpretable model. Our extensive experiments demonstrate the performance of the model on various datasets. We show that by editing the FOL model, we can fix the shortcut learned by the original Blackbox model. 
Finally, our method provides a framework for a hybrid symbolic-connectionist network that is simple to train and adaptable to many applications.","Explainable AI, Posthoc explanation, Computer Vision" On Regularization for Explaining Graph Neural Networks: An Information Theory Perspective,https://openreview.net/forum?id=5rX7M4wa2R_,https://openreview.net/pdf?id=5rX7M4wa2R_,"We rethink the role of regularization in GNN explainability from the perspective of information theory, and propose four intriguing propositions about regularization.","This work studies the explainability of graph neural networks (GNNs), which is important for the credibility of GNNs in practical usage. Existing work mostly follows a two-phase paradigm to interpret a prediction: feature attribution and selection. However, another important component --- regularization, which is crucial to facilitate the above paradigm --- has been seldom studied. In this work, we explore the role of regularization in GNN explainability from the perspective of information theory. Our main findings are: 1) regularization essentially pursues a balance between the two phases, 2) its optimal coefficient is proportional to the sparsity of explanations, 3) existing methods imply an implicit regularization effect of the stochastic mechanism, and 4) its contradictory effects on the two phases are responsible for the out-of-distribution (OOD) issue in post-hoc explainability. Based on these findings, we propose two general optimization methods, which can bolster the performance of the current explanation methods via sparsity-adaptive and OOD-resistant regularization schemes. Extensive empirical studies validate our findings and proposed methods. Code is available at https://anonymous.4open.science/r/Rethink_Reg-07F0. ","Explainability, Graph Neural Networks, Regularization" The Dark Side of Invariance: Revisiting the Role of Augmentations in Contrastive Learning,https://openreview.net/forum?id=emwOOlciu9v,https://openreview.net/pdf?id=emwOOlciu9v,Contrastive learning can succeed even if the augmentations sometimes change the ground-truth label---and there are cases where this can actually help rather than hurt learning,"What role do augmentations play in contrastive learning? Recent work suggests that good augmentations are label-preserving with respect to a specific downstream task. We complicate this picture by showing that label-destroying augmentations are often crucial in the foundation model setting, where the goal is to learn diverse, general-purpose representations for multiple downstream tasks. We perform contrastive learning experiments on a range of image and audio datasets with multiple downstream tasks (e.g., for digits superimposed on photographs, predicting the class of one vs. the other). We find that Viewmaker Networks, a recently proposed model for learning augmentations for contrastive learning, produce label-destroying augmentations that stochastically destroy features needed for different downstream tasks. These augmentations are interpretable (e.g., altering shapes, digits, or letters added to images) and surprisingly often result in better performance compared to expert-designed augmentations, despite not preserving label information. To support our empirical results, we theoretically analyze a simple contrastive learning setting with a linear model. In this setting, label-destroying augmentations are crucial for preventing one set of features from suppressing the learning of features useful for another downstream task. 
Our results highlight the need to analyze the interaction between multiple downstream tasks when trying to explain the success of foundation models.","contrastive learning, self-supervised learning, feature suppression" Language model with Plug-in Knowldge Memory,https://openreview.net/forum?id=Plr5l7r0jY6,https://openreview.net/pdf?id=Plr5l7r0jY6,We propose a pre-training framework to decouple knowledge storage from the PLM,"Large-scale pre-trained language models (PLMs) have achieved impressive results in a wide range of NLP tasks, and it has been revealed that one of the key factors in their success is that their parameters implicitly learn various types of knowledge from the pre-training corpus. However, encoding knowledge implicitly in the model parameters has two fundamental drawbacks. First, the knowledge is neither editable nor scalable once the model is trained, which is especially problematic given that knowledge is constantly evolving. Second, it lacks interpretability and prevents us from understanding what kind of knowledge a PLM needs to solve a certain task. In this paper, we introduce PlugLM, a pre-training model with a differentiable plug-in memory (DPM). The key intuition is to decouple the knowledge storage from the model parameters with an editable and scalable key-value memory, and to leverage knowledge in an explainable manner via knowledge retrieval in the DPM. We conduct extensive experiments under various settings to justify this design choice. In the domain adaptation setting, PlugLM can be easily adapted to different domains with a pluggable in-domain memory---obtaining 3.95 F1 improvements across four domains, without any in-domain training. PlugLM can also keep absorbing new knowledge after pre-training is done via the knowledge-updating operation in the DPM, without re-training. Finally, we show that by incorporating training samples into the DPM with knowledge prompting, PlugLM can be further improved by the guidance of in-task knowledge.","pre-training, language model, memory" Hierarchical Gaussian Mixture based Task Generative Model for Robust Meta-Learning,https://openreview.net/forum?id=A4fSkNAs6E1,https://openreview.net/pdf?id=A4fSkNAs6E1,,"Meta-learning enables quick adaptation of machine learning models to new tasks with limited data. While tasks could come from varying distributions in reality, most of the existing meta-learning methods consider both training and testing tasks as coming from the same uni-component distribution, overlooking two critical needs of a practical solution: (1) the various sources of tasks may compose a multi-component mixture distribution, and (2) novel tasks may come from a distribution that is unseen during meta-training. In this paper, we demonstrate that these two challenges can be solved jointly by modeling the density of task instances. We develop a meta-training framework underlain by a novel Hierarchical Gaussian Mixture based Task Generative Model (HTGM). HTGM extends the widely used empirical process of sampling tasks to a theoretical model, which learns task embeddings, fits the mixture distribution of tasks, and enables density-based scoring of novel tasks. The framework is agnostic to the encoder and scales well with large backbone networks. The model parameters are learned end-to-end by maximum likelihood estimation via an Expectation-Maximization algorithm. 
Extensive experiments on benchmark datasets indicate the effectiveness of our method for both sample classification and novel task detection.", Game-Theoretic Understanding of Misclassification,https://openreview.net/forum?id=A2EeU2Jn3iX,https://openreview.net/pdf?id=A2EeU2Jn3iX,,"This paper analyzes various types of image misclassification from a game-theoretic view. Particularly, we consider the misclassification of clean, adversarial, and corrupted images and characterize it through the distribution of multi-order interactions. We discover that the distribution of multi-order interactions varies across the types of misclassification. For example, misclassified adversarial images have a higher strength of high-order interactions than correctly classified clean images, which indicates that adversarial perturbations create spurious features that arise from complex cooperation between pixels. By contrast, misclassified corrupted images have a lower strength of low-order interactions than correctly classified clean images, which indicates that corruptions break the local cooperation between pixels. We also provide the first analysis of Vision Transformers using interactions. We find that Vision Transformers show a tendency in the distribution of interactions different from that of CNNs, which implies that they exploit features that CNNs do not use for prediction. Our study demonstrates that the recent game-theoretic analysis of deep learning models can be broadened to analyze various malfunctions of deep learning models, including Vision Transformers, by using the distribution, order, and sign of interactions. ", The Final Ascent: When Bigger Models Generalize Worse on Noisy-Labeled Data,https://openreview.net/forum?id=L5pRidCQlRc,https://openreview.net/pdf?id=L5pRidCQlRc,"When the noise-to-sample-size ratio is sufficiently large, increasing the width or density of the model beyond a certain point only hurts the generalization performance.","Increasing the size of overparameterized neural networks has been shown to improve their generalization performance. However, real-world datasets often contain a significant fraction of noisy labels, which can drastically harm the performance of the models trained on them. In this work, we study how neural networks' test loss changes with model size when the training set contains noisy labels. We show that under a sufficiently large noise-to-sample-size ratio, generalization error eventually increases with model size. First, we provide a theoretical analysis of random feature regression and show that this phenomenon occurs as the variance of the generalization loss experiences a second ascent under a large noise-to-sample-size ratio. Then, we present extensive empirical evidence confirming that our theoretical results hold for neural networks. Furthermore, we empirically observe that the adverse effect of network size is more pronounced when robust training methods are employed to learn from noisy-labeled data. Our results have important practical implications: First, larger models should be employed with extra care, particularly when trained on smaller datasets or using robust learning methods. Second, a large sample size can alleviate the effect of noisy labels and allow larger models to achieve superior performance even under noise. 
","supervised learning, generalization, overfitting, memorization" Long-Tailed Partial Label Learning via Dynamic Rebalancing,https://openreview.net/forum?id=sXfWoK4KvSW,https://openreview.net/pdf?id=sXfWoK4KvSW,"We propose a novel method RECORDS for long-tailed partial label learning, which overcomes the drawback of the straightforward combination between long-tailed learning and partial label learning, and significantly improves the performance.","Real-world data usually couples the label ambiguity and heavy imbalance, challenging the algorithmic robustness of partial label learning (PLL) and long-tailed learning (LT). The straightforward combination of LT and PLL, i.e., LT-PLL, suffers from a fundamental dilemma: LT methods build upon a given class distribution that is unavailable in PLL, and the performance of PLL is severely influenced in long-tailed context. We show that even with the auxiliary of an oracle class prior, the state-of-the-art methods underperform due to an adverse fact that the constant rebalancing in LT is harsh to the label disambiguation in PLL. To overcome this challenge, we thus propose a dynamic rebalancing method, termed as RECORDS, without assuming any prior knowledge about the class distribution. Based on a parametric decomposition of the biased output, our method constructs a dynamic adjustment that is benign to the label disambiguation process and theoretically converges to the oracle class prior. Extensive experiments on three benchmark datasets demonstrate the significant gain of RECORDS compared with a range of baselines. Our code will be publicly available.","long-tailed learning, partial label learning" Task Ambiguity in Humans and Language Models,https://openreview.net/forum?id=QrnDe_9ZFd8,https://openreview.net/pdf?id=QrnDe_9ZFd8,"We motivate the direction of studying task ambiguity in humans and language models, evaluating them on a new benchmark of ambiguously-specified tasks and develop methods for improving performance","Language models have recently achieved strong performance across a wide range of NLP benchmarks. However, real world tasks are often poorly specified, and agents must deduce the intended behavior from a combination of context, instructions, and examples. We investigate how both humans and models behave in the face of such task ambiguity by proposing AmbiBench, a new benchmark of six ambiguously-specified classification tasks. We evaluate humans and models on AmbiBench by seeing how well they identify the intended task using 1) instructions with varying degrees of ambiguity, and 2) different numbers of labeled examples. We find that the combination of model scaling (to 175B parameters) and reinforcement learning from human feedback (RLHF) enables models to approach or exceed the accuracy of human participants across tasks, but that either one of these alone is not sufficient. In addition, we show how to dramatically improve the accuracy of language models trained without RLHF by finetuning on a small number of ambiguous in-context examples, providing a promising direction for teaching models to generalize well in the face of ambiguity.","task ambiguity, safety, language models, few-shot learning, in-context learning" Learning from student's mistakes: Improving mean teacher for end-to-end semi-supervised video action detection,https://openreview.net/forum?id=KDTaSChivXd,https://openreview.net/pdf?id=KDTaSChivXd,,"In this work, we focus on semi-supervised learning for video action detection. 
We present Enhanced Mean Teacher, a simple end-to-end student-teacher-based framework that relies on pseudo-labels to learn from unlabeled samples. The limited amount of data makes the teacher prone to unreliable boundaries when detecting spatio-temporal actions. We propose a novel auxiliary module, which learns from the student’s mistakes on labeled samples and improves the spatio-temporal pseudo-labels generated by the teacher on the unlabeled set. The proposed framework utilizes spatial and temporal augmentations to generate pseudo-labels, where both classification and spatio-temporal consistencies are used to train the model. We evaluate our approach on two action detection benchmark datasets, UCF101-24 and JHMDB-21. On UCF101-24, our approach outperforms the supervised baseline by an approximate margin of 19% on f-mAP@0.5 and 25% on v-mAP@0.5. Using merely 10-15% of the annotations in UCF101-24, the proposed approach provides competitive performance compared to the supervised baseline trained on 100% of the annotations. We also evaluate the effectiveness of Enhanced Mean Teacher on video object segmentation, demonstrating its generalization capability to other tasks in the video domain.","semi-supervised, activity detection, student-teacher, video understanding" Equivariant Disentangled Transformation for Domain Generalization under Combination Shift,https://openreview.net/forum?id=bn2J_zqfsEf,https://openreview.net/pdf?id=bn2J_zqfsEf,Learning data augmentations based on the algebraic structures of labels is a promising approach for combination shift.,"Machine learning systems may encounter unexpected problems when the data distribution changes in the deployment environment. A major reason is that certain combinations of domains and labels are not observed during training but appear in the test environment. Although various invariance-based algorithms can be applied, we find that the performance gain is often marginal. To formally analyze this issue, we provide a unique algebraic formulation of the combination shift problem based on the concepts of homomorphism, equivariance, and a refined definition of disentanglement. The algebraic requirements naturally lead to a simple yet effective method, referred to as equivariant disentangled transformation (EDT), which augments the data based on the algebraic structures of labels and makes the transformation satisfy the equivariance and disentanglement requirements. Experimental results demonstrate that invariance may be insufficient, and it is important to exploit the equivariance structure in the combination shift problem.","distribution shift, domain generalization, equivariance, invariance, disentanglement, algebraic theory, group theory, category theory" Best Possible Q-Learning,https://openreview.net/forum?id=t02FF6Fj5mH,https://openreview.net/pdf?id=t02FF6Fj5mH,A new Q-learning algorithm for multi-agent reinforcement learning,"Fully decentralized learning, where the global information, i.e., the actions of other agents, is inaccessible, is a fundamental challenge in multi-agent reinforcement learning. However, the convergence and optimality of most decentralized algorithms are not theoretically guaranteed, since the transition probabilities are non-stationary as all agents are updating policies simultaneously. 
To tackle this challenge, we propose the \textit{best possible operator}, a novel decentralized operator, and prove that the policies of agents will converge to the optimal joint policy if each agent independently updates its individual state-action value by the operator. Further, to make the update more efficient and practical, we simplify the operator and prove that the convergence and optimality still hold with the simplified one. By instantiating the simplified operator, the derived fully decentralized algorithm, best possible Q-learning (BQL), does not suffer from non-stationarity. Empirically, we show that BQL achieves remarkable improvement over baselines in a variety of cooperative multi-agent tasks.",multi-agent reinforcement learning Analysis of Radio Localiser Networks under Distribution Shift,https://openreview.net/forum?id=oCl1ggnIZ1_,https://openreview.net/pdf?id=oCl1ggnIZ1_,Comparing and benchmarking SOTA RF localisation methods,"Deploying radio frequency (RF) localisation systems invariably entails non-trivial effort, particularly for the latest learning-based breeds. There has been little prior work on characterising and comparing how learnt localiser networks can be deployed in the field under real-world RF distribution shifts. In this paper, we present RadioBench: a suite of 8 learnt localiser nets from the state-of-the-art to study and benchmark their real-world deployability, utilising five novel industry-grade datasets. We train 10k models to analyse the inner workings of these learnt localiser nets and uncover their differing behaviours across three performance axes: (i) learning, (ii) proneness to distribution shift, and (iii) localisation. We use insights gained from this analysis to recommend best practices for the deployability of learning-based RF localisation under practical constraints.","RF localisation, RF positioning, robustness, benchmarking, domain shift, localisation, positioning" Winning Both the Accuracy of Floating Point Activation and the Simplicity of Integer Arithmetic,https://openreview.net/forum?id=z92lBy1ehjI,https://openreview.net/pdf?id=z92lBy1ehjI,,"Even though floating point (FP) numbers have been adopted as a de facto standard data format for deep learning computing, the complexity of FP arithmetic impedes a broader deployment of Deep Neural Networks (DNNs). Recent works such as quantization have attempted to replace the FP matrix multiplication (MatMul) of DNNs with simple integer MatMul by transforming the datatypes of both weights and activations into integers. Unfortunately, unlike weight values that are static, it is challenging to represent dynamic activations with integers. In this paper, to simultaneously achieve the accuracy of FP activation and the simplicity of integer arithmetic, we present a method for replacing FP arithmetic with integer arithmetic without changing the storage format of FP activations while weights are quantized. The proposed method pre-aligns the significands of FP activations on the fly, just ahead of the MatMul, so that the aligned significands (integers) can be used for the computation. Inspired by the observation that conventional FP arithmetic does not produce precise results due to rounding, we demonstrate that our proposed integer arithmetic-based scheme can produce the same level of errors as that of FP arithmetic when DNNs use FP activations and quantized weights. 
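An illustrative aside on the scheme just described: a toy NumPy sketch of pre-aligning FP significands to a shared exponent so that an integer MatMul could consume them. The group-wise alignment and the `sig_bits` budget are assumptions for illustration, not the paper's exact design.

```python
import numpy as np

def prealign_activations(acts, sig_bits=12):
    # Each FP value is m * 2**e (np.frexp). Aligning every value to the
    # group's maximum exponent turns significands into integers an integer
    # MatMul can consume; values far below the max exponent lose low-order
    # bits, mimicking the rounding FP arithmetic performs anyway.
    m, e = np.frexp(acts)
    e_max = e.max()
    ints = np.round(m * 2.0 ** (sig_bits - (e_max - e))).astype(np.int64)
    return ints, int(e_max) - sig_bits   # ints * 2**scale ~= acts

acts = np.random.randn(8).astype(np.float32)
q, scale = prealign_activations(acts)
print(np.abs(q * 2.0 ** scale - acts).max())   # tiny alignment error
```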
Experimental results show that hardware based on the proposed scheme achieves significant improvement over FP arithmetic-based designs in terms of energy efficiency and throughput-per-area while maintaining a similar level of accuracy.", Tensor Decompositions For Temporal Knowledge Graph Completion with Time Perspective,https://openreview.net/forum?id=elIEtsQdOYP,https://openreview.net/pdf?id=elIEtsQdOYP,"Instead of focusing on facts and their evolution, we observe temporal knowledge graphs through time perspective, and improve the current tensor decomposition model based on the observed properties.","Facts in the real world are often tied to time, such as the spread of diseases and the state of military affairs. Therefore, knowledge graphs combined with temporal factors have gained growing attention. In the temporal knowledge graph, most researchers focus on the original facts and pay attention to their changes over time. The temporal factors are only used as auxiliary information for representation learning. In this paper, we try to observe from the perspective of time and find some interesting properties of temporal knowledge graphs: (1) Simultaneousness. Various facts occur at the same time; (2) Aggregation. Facts may occur in aggregate around a certain individual, organization, or location; (3) Associativity. Some specific relations tend to occur at specific times, such as celebrations at festivals. Based on the above three properties, we add a simple time-aware module to the existing tensor decomposition-based temporal knowledge graph model TComplEx (Lacroix et al., 2020), which obtains impressive improvements and achieves state-of-the-art results on four standard temporal knowledge graph completion benchmarks. Specifically, in terms of mean reciprocal rank (MRR), we advance the state-of-the-art by +21.8% on ICEWS14, +16.9% on ICEWS05-15, +20.7% on YAGO15k, and +13.1% on GDELT.","Knowledge Graph, Temporal Knowledge Graph Completion" Preference Transformer: Modeling Human Preferences using Transformers for RL,https://openreview.net/forum?id=Peot1SFDX0,https://openreview.net/pdf?id=Peot1SFDX0,We introduce a transformer-based architecture for preference-based RL considering non-Markovian rewards.,"Preference-based reinforcement learning (RL) provides a framework to train agents using human preferences between two behaviors. However, preference-based RL has been challenging to scale since it requires a large amount of human feedback to learn a reward function aligned with human intent. In this paper, we present Preference Transformer, a neural architecture that models human preferences using transformers. Unlike prior approaches that assume human judgment is based on Markovian rewards contributing equally to the decision, we introduce a new preference model based on the weighted sum of non-Markovian rewards. We then design the proposed preference model using a transformer architecture that stacks causal and bidirectional self-attention layers. We demonstrate that Preference Transformer can solve a variety of control tasks using real human preferences, while prior approaches fail to work. 
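As an aside, the weighted-sum preference model described above admits a compact Bradley-Terry-style sketch; the softmax-normalised weights standing in for the transformer's importance head are an assumption:

```python
import torch
import torch.nn.functional as F

def preference_logit(r0, w0, r1, w1):
    # Weighted non-Markovian returns of two behaviour segments; the
    # Bradley-Terry logit of "segment 1 preferred" is their difference.
    s0 = (w0 * r0).sum(dim=-1)
    s1 = (w1 * r1).sum(dim=-1)
    return s1 - s0

r0, r1 = torch.randn(4, 50), torch.randn(4, 50)   # per-step rewards
w0 = torch.softmax(torch.randn(4, 50), dim=-1)    # hypothetical importance
w1 = torch.softmax(torch.randn(4, 50), dim=-1)    # weights from a transformer
labels = torch.ones(4)                            # human picked segment 1
loss = F.binary_cross_entropy_with_logits(
    preference_logit(r0, w0, r1, w1), labels)
```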
We also show that Preference Transformer can induce a well-specified reward and attend to critical events in the trajectory by automatically capturing the temporal dependencies in human decision-making.","preference-based reinforcement learning, human-in-the-loop reinforcement learning, deep reinforcement learning" Flow Matching for Generative Modeling,https://openreview.net/forum?id=PqvMRDCJT9t,https://openreview.net/pdf?id=PqvMRDCJT9t,"We introduce a new simulation-free approach for training Continuous Normalizing Flows, generalizing the probability paths induced by simple diffusion processes. We obtain state-of-the-art on ImageNet in both NLL and FID among competing methods.","We introduce a new paradigm for generative modeling built on Continuous Normalizing Flows (CNFs), allowing us to train CNFs at unprecedented scale. Specifically, we present the notion of Flow Matching (FM), a simulation-free approach for training CNFs based on regressing vector fields of fixed conditional probability paths. Flow Matching is compatible with a general family of Gaussian probability paths for transforming between noise and data samples---which subsumes existing diffusion paths as specific instances. Interestingly, we find that employing FM with diffusion paths results in a more robust and stable alternative for training diffusion models. Furthermore, Flow Matching opens the door to training CNFs with other, non-diffusion probability paths. An instance of particular interest is using Optimal Transport (OT) displacement interpolation to define the conditional probability paths. These paths are more efficient than diffusion paths, provide faster training and sampling, and result in better generalization. Training CNFs using Flow Matching on ImageNet leads to state-of-the-art performance in terms of both likelihood and sample quality, and allows fast and reliable sample generation using off-the-shelf numerical ODE solvers.","continuous normalizing flows, generative models" Graph-informed Neural Point Process With Monotonic Nets,https://openreview.net/forum?id=UR_HvaCdgt6,https://openreview.net/pdf?id=UR_HvaCdgt6,Graph-informed neural point process for sequential data modeling with conditional monotonic neural network.,"Multi-class event data is ubiquitous in real-world applications. Recent neural temporal point processes have used monotonic nets to model the cumulative conditional intensity to avoid an intractable integration in the likelihood. While successful, they are restricted to single-type events and easily fall into poor learning outcomes. To address these limitations and exploit valuable structural information within event participants, we develop a Graph-Informed Neural Point Process (GINPP) that can freely handle multiple event types, greatly improve learning efficiency with the monotonic net, and effectively integrate the graph information to facilitate training. First, we find the bottleneck of the previous model arises from the standard soft-plus transformation over the output of the monotonic net, which greatly enlarges the prediction variation of the monotonic net and increases the training challenge. We propose a shift-scale version that can significantly reduce the variation and promote learning efficiency. Second, we use a conditional mark distribution to model multiple event types, without the need for explicitly estimating the intensity for each type. The latter can be much more challenging. 
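An aside before the abstract's third point: a minimal sketch of one plausible reading of the two ingredients just named, a "shift-scale" soft-plus and a conditional mark head; the paper's exact parameterisation may differ.

```python
import torch

def shift_scale_softplus(mono_out, a, b):
    # One plausible reading of the "shift-scale" soft-plus: rescale and shift
    # the monotonic net's output before the soft-plus, shrinking the output
    # variation the abstract blames for the training bottleneck. `a` is
    # assumed positive so the cumulative intensity stays monotonic in time.
    return torch.nn.functional.softplus(a * mono_out + b)

class MarkHead(torch.nn.Module):
    """Conditional mark distribution p(type | time, history): a single
    softmax head over the history embedding replaces the (much harder)
    estimation of a separate intensity per event type."""
    def __init__(self, hidden_dim, n_types):
        super().__init__()
        self.proj = torch.nn.Linear(hidden_dim, n_types)

    def forward(self, h):
        return torch.log_softmax(self.proj(h), dim=-1)
```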
Third, we use random walks to collect the neighborhood of each event participant and use an attention mechanism to update the hidden state of each participant according to the observed events of both the participant itself and its neighborhood. In this way, we can effectively leverage the graph knowledge and scale up to large graphs. We have shown the advantage of our approach in both ablation studies and real-world applications.","Point Process, Sequential Model, Graph Neural Network" Targeted Hyperparameter Optimization with Lexicographic Preferences Over Multiple Objectives,https://openreview.net/forum?id=0Ij9_q567Ma,https://openreview.net/pdf?id=0Ij9_q567Ma,Hyperparameter tuning under lexicographic preference,"We propose to do targeted hyperparameter optimization with lexicographic preference over multiple objectives, motivated by various practical applications. We first provide a rigorous problem formulation. The formulation is novel and general, allowing a clear specification of an automatable optimization goal. We then propose a randomized directed search method named LexiFlow to solve this problem. We demonstrate the strong empirical performance of the proposed algorithm in multiple hyperparameter optimization tasks.","Automatic Machine learning, Hyperparameter tuning, Lexicographic preference" Learning to Decouple Complex System for Sequential Data,https://openreview.net/forum?id=tsPXEkMzPjB,https://openreview.net/pdf?id=tsPXEkMzPjB,"We propose to learn to decouple a complex system into simple but interacting latent sub-systems, which proved effective and powerful in sequential modeling.","A complex system with cluttered observations may be a coupled mixture of multiple simple sub-systems corresponding to \emph{latent entities}. Such sub-systems may hold distinct dynamics in the continuous-time domain, wherein the complicated interactions between sub-systems also evolve over time. This setting is fairly common in the real world but has received little attention. In this paper, we propose a sequential learning approach under this setting by decoupling a complex system for handling irregularly sampled and cluttered sequential observations. Such decoupling brings about not only sub-systems describing the dynamics of each latent entity, but also a meta-system capturing the interaction between entities over time. Specifically, we argue that the meta-system of interactions is governed by a smoothed version of \emph{projected differential equations}. Experimental results on synthetic and real-world datasets show the advantages of our approach when facing complex and cluttered sequential data compared to the state-of-the-art.","neural differential equation, sequential learning, decoupling complex system" Restoration based Generative Models,https://openreview.net/forum?id=iNUtsk4h2q1,https://openreview.net/pdf?id=iNUtsk4h2q1,A new framework on generative modeling in the perspective of restoration.,"Denoising generative models (DGMs) have recently attracted increasing attention by showing impressive synthesis quality. DGMs are built on a diffusion process that pushes data to the noise distribution, and the models learn to denoise. In this paper, we establish an interpretation of DGMs in terms of image restoration (IR). Integrating the IR literature allows us to use an alternative objective and diverse forward processes, without being confined to the diffusion process. By imposing prior knowledge on the loss function grounded in MAP estimation, we eliminate the need for the expensive sampling of DGMs. 
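Returning briefly to the Flow Matching entry above, its conditional objective with the OT displacement path reduces to a few lines; `vf` is any vector-field network, and the paper's sigma_min smoothing is omitted in this sketch:

```python
import torch

def flow_matching_loss(vf, x1):
    # OT conditional path between noise x0 and data x1: x_t = (1-t) x0 + t x1,
    # whose velocity is the constant x1 - x0. Regressing vf onto it requires
    # no simulation of the flow.
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))
    xt = (1 - t) * x0 + t * x1
    target = x1 - x0
    return ((vf(xt, t) - target) ** 2).mean()

# toy usage with a stand-in vector field; a real net takes (x_t, t)
vf = lambda x, t: torch.zeros_like(x)
print(flow_matching_loss(vf, torch.randn(16, 2)))
```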
We also propose multi-scale training, which alleviates the latent inefficiency of DGMs by taking advantage of the flexibility of the forward process. Our model improves the quality and efficiency of both training and inference, achieving state-of-the-art performance when the number of forward steps is limited. Furthermore, we show the applicability of our model to inverse problems. We believe that our framework paves the way for designing a new type of flexible general generative model.","Diffusion Generative Models, Image Restoration, Maximum a Posteriori" How hard are computer vision datasets? Calibrating dataset difficulty to viewing time,https://openreview.net/forum?id=zA7hVj3rR19,https://openreview.net/pdf?id=zA7hVj3rR19,"We develop a new dataset difficulty metric based on how long humans must view an image in order to classify a target object, finding that the distribution of current datasets is skewed towards easy images.","Humans outperform object recognizers despite the fact that models perform well on current datasets. Numerous attempts have been made to create more challenging datasets by scaling them up from the web, exploring distribution shift, or adding controls for biases. The difficulty of each image in each dataset is not independently evaluated, nor is the concept of dataset difficulty as a whole well-posed. We develop a new dataset difficulty metric based on how long humans must view an image in order to classify a target object. Images whose objects can be recognized in 17 ms are considered to be easier than those which require seconds of viewing time. Using 133,588 judgments on two major datasets, ImageNet and ObjectNet, we determine the distribution of image difficulties in those datasets, which we find varies wildly but significantly undersamples hard images. Rather than hoping that distribution shift or other approaches will lead to hard datasets, we should measure the difficulty of datasets and seek to explicitly fill out the class of difficult examples. Analyzing model performance guided by image difficulty reveals that models tend to have lower performance and a larger generalization gap on harder images. Encouragingly for the biological validity of current architectures, much of the variance in human difficulty can be accounted for, given an object recognizer, by computing a combination of prediction depth, c-score, and adversarial robustness. We release a dataset of such judgments as a complementary metric to raw performance and a network’s ability to explain neural recordings. Such experiments with humans allow us to create a metric for progress in object recognition datasets (which we find are skewed toward easy examples), to test the biological validity of models in a novel way, and to develop tools for shaping datasets as they are gathered, focusing them on filling out the missing class of hard examples in today’s datasets. 
Dataset and analysis code can be found at https://github.com/image-flash/image-flash-2022.","deep learning, computer vision, cognitive science, datasets" Self-Supervised Logit Adjustment,https://openreview.net/forum?id=mqLowjofGBm,https://openreview.net/pdf?id=mqLowjofGBm,"We propose a novel algorithm for self-supervised long-tailed learning, which overcomes the intrinsic limitation of the conventional contrastive learning loss, i.e., sample-level uniformity, and progressively approaches the category-level uniformity.","Self-supervised learning (SSL) has achieved tremendous success on various well-curated datasets in computer vision and natural language processing. Nevertheless, it is hard for existing works to capture transferable and robust features when facing the long-tailed distributions of real-world scenarios. This is attributed to the fact that plain SSL methods pursuing sample-level uniformity easily lead to a distorted embedding space, where head classes with huge sample numbers dominate the feature regime and tail classes passively collapse. To tackle this problem, we propose a novel Self-Supervised Logit Adjustment ($S^2LA$) method to achieve category-level uniformity from a geometric perspective. Specifically, we measure the geometric statistics of the embedding space to construct the calibration, and jointly learn a surrogate label allocation to constrain the space expansion of head classes and avoid the passive collapse of tail classes. Our proposal does not alter the setting of SSL and can be easily integrated into existing works in an end-to-end and low-cost manner. Extensive results on a range of benchmark datasets show the effectiveness of $S^2LA$ with high tolerance to distribution skewness. ","Long-tailed learning, self-supervised learning, logit adjustment, optimal transport" Proportional Amplitude Spectrum Training Augmentation for Synthetic-to-Real Domain Generalization,https://openreview.net/forum?id=5FqeE2SojJi,https://openreview.net/pdf?id=5FqeE2SojJi,"We propose Proportional Amplitude Spectrum Training Augmentation (PASTA), an augmentation strategy for Synthetic-to-Real Generalization","Synthetic data offers the promise of cheap and bountiful training data for settings where large amounts of labeled real-world data for a task are unavailable. However, models trained on synthetic data significantly underperform on real-world data. In this paper, we propose Proportional Amplitude Spectrum Training Augmentation (PASTA), a simple and effective augmentation strategy to improve out-of-the-box synthetic-to-real (syn-to-real) generalization performance. PASTA involves perturbing the amplitude spectrums of the synthetic images in the Fourier domain to generate augmented views. We design PASTA to perturb the amplitude spectrums in a structured manner such that high-frequency components are perturbed relatively more than the low-frequency ones. 
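A toy NumPy sketch of the structured amplitude perturbation just described; the frequency-dependent strength schedule (alpha, k, beta) and the multiplicative jitter are illustrative choices, not PASTA's exact parameterisation. The input is assumed to be a float HWC image in [0, 1].

```python
import numpy as np

def pasta_like_augment(img, alpha=3.0, k=2.0, beta=0.25):
    # Perturb each channel's Fourier amplitude with zero-mean noise whose
    # strength grows with spatial frequency, then restore the original phase.
    out = np.empty_like(img, dtype=np.float64)
    h, w = img.shape[:2]
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)        # 0 at DC, large at high freq
    strength = alpha * radius ** k + beta      # more jitter at high freq
    for c in range(img.shape[2]):
        spec = np.fft.fft2(img[..., c])
        amp, phase = np.abs(spec), np.angle(spec)
        amp = amp * (1.0 + strength * np.random.randn(h, w))
        out[..., c] = np.fft.ifft2(amp * np.exp(1j * phase)).real
    return np.clip(out, 0.0, 1.0)
```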
For the tasks of semantic segmentation (GTAV→Real), object detection (Sim10K→Real), and object recognition (VisDA-C Syn→Real), across a total of 5 syn-to-real shifts, we find that PASTA either outperforms or is consistently competitive with more complex state-of-the-art methods while being complementary to other generalization approaches.","Synthetic-to-Real Generalization, Fourier Space Augmentation, Single Domain Generalization" "More Centralized Training, Still Decentralized Execution: Multi-Agent Conditional Policy Factorization",https://openreview.net/forum?id=znLlSgN-4S0,https://openreview.net/pdf?id=znLlSgN-4S0,We propose a novel multi-agent reinforcement learning algorithm where dependency among agents is explicitly considered,"In cooperative multi-agent reinforcement learning (MARL), combining value decomposition with actor-critic enables agents to learn stochastic policies, which are more suitable for partially observable environments. Given the goal of learning local policies that enable decentralized execution, agents are commonly assumed to be independent of each other, even in centralized training. However, such an assumption may prohibit agents from learning the optimal joint policy. To address this problem, we explicitly take the dependency among agents into account in centralized training. Although this leads to the optimal joint policy, it may not be factorized for decentralized execution. Nevertheless, we theoretically show that from such a joint policy, we can always derive another joint policy that achieves the same optimality but can be factorized for decentralized execution. To this end, we propose multi-agent conditional policy factorization (MACPF), which takes more centralized training but still enables decentralized execution. We empirically verify MACPF in various cooperative MARL tasks and demonstrate that MACPF achieves better performance or faster convergence than baselines.",Multi-Agent Reinforcement Learning StepGCN: Step-oriented Graph Convolutional Networks in Representation Learning,https://openreview.net/forum?id=O_m1c-A5w6w,https://openreview.net/pdf?id=O_m1c-A5w6w,,"Graph Convolutional Networks (GCNs) are employed to address a number of real-world tasks through their representation learning approach. Despite their effectiveness and usefulness, the majority of GCN-oriented approaches suffer from over-smoothing. Over-smoothing is the problem of node representations converging to a certain value, making the nodes indistinguishable. To effectively address the over-smoothing problem, we introduce StepGCN, a GCN model that integrates step learning techniques with graph residual connection networks. With StepGCN, we achieve significant performance improvements on multiple representation learning benchmark datasets, and we demonstrate that step learning can be extended to other graph networks. ", GAPS: Few-Shot Incremental Semantic Segmentation via Guided Copy-Paste Synthesis,https://openreview.net/forum?id=cDVL245jZa,https://openreview.net/pdf?id=cDVL245jZa,"This paper proposes a guided copy-paste synthesis process for few-shot incremental semantic segmentation, which can be combined with existing methods to achieve a dramatic performance increase.","Few-shot incremental segmentation is the task of updating a segmentation model, as novel classes are introduced online over time with a small number of training images. 
Although incremental segmentation methods exist in the literature, they tend to fall short in the few-shot regime and when given partially-annotated training images, where only the novel class is segmented. This paper proposes a data synthesizer, Guided copy-And-Paste Synthesis (GAPS), that improves the performance of few-shot incremental segmentation in a model-agnostic fashion. Despite the great success of copy-paste synthesis in conventional offline visual recognition, we demonstrate substantially degraded performance of its naive extension in our online scenario, due to newly encountered challenges. To this end, GAPS (i) addresses the partial-annotation problem by leveraging copy-paste to generate fully-labeled data for training, (ii) helps augment the few images of novel objects by introducing a guided sampling process, and (iii) mitigates catastrophic forgetting by employing a diverse memory-replay buffer. Compared to existing state-of-the-art methods, GAPS dramatically boosts the novel IoU of baseline methods on established few-shot incremental segmentation benchmarks by up to 80%. More notably, GAPS maintains good performance in even more impoverished annotation settings, where only single instances of novel objects are annotated.","continual learning, incremental learning, incremental segmentation, few-shot learning" Edgeformers: Graph-Empowered Transformers for Representation Learning on Textual-Edge Networks,https://openreview.net/forum?id=2YQrqe4RNv,https://openreview.net/pdf?id=2YQrqe4RNv,,"Edges in many real-world social/information networks are associated with rich text information (e.g., user-user communications or user-product reviews). However, mainstream network representation learning models focus on propagating and aggregating node attributes, lacking specific designs to utilize text semantics on edges. While there exist edge-aware graph neural networks, they directly initialize edge attributes as a feature vector, which cannot fully capture the contextualized text semantics of edges. In this paper, we propose Edgeformers, a framework built upon graph-enhanced Transformers, to perform edge and node representation learning by modeling texts on edges in a contextualized way. Specifically, in edge representation learning, we inject network information into each Transformer layer when encoding edge texts; in node representation learning, we aggregate edge representations through an attention mechanism within each node’s ego-graph. On five public datasets from three different domains, Edgeformers consistently outperform state-of-the-art baselines in edge classification and link prediction, demonstrating the efficacy in learning edge and node representations, respectively. Code can be found at https://anonymous.4open.science/r/Edgeformer-release-F422.", How Distinguishable Are Vocoder Models? Analyzing Vocoder Fingerprints for Fake Audio,https://openreview.net/forum?id=cCjxF2QB-AT,https://openreview.net/pdf?id=cCjxF2QB-AT,,"In recent years, vocoders powered by deep neural networks (DNNs) have found much success in the task of generating raw waveforms from acoustic features, as the generated audio becomes increasingly realistic. This development, however, raises a few challenges, especially in the field of forensics, where the attribution of audio to real or generated sources is vital. To our knowledge, our investigation constitutes the first effort to answer the question of whether vocoder fingerprints exist and to analyze them. 
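As an aside on the GAPS entry above, the core copy-paste step that turns partially annotated samples into fully labelled ones is tiny; the guided sampling process and replay buffer are omitted in this sketch:

```python
import numpy as np

def copy_paste(base_img, base_lbl, src_img, src_mask, novel_id):
    # Paste the pixels of a novel-class object (boolean src_mask) onto a base
    # image and write the novel class id into its label map, turning a
    # partially annotated sample into a fully labelled one.
    img, lbl = base_img.copy(), base_lbl.copy()
    img[src_mask] = src_img[src_mask]
    lbl[src_mask] = novel_id
    return img, lbl
```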
In this paper, we present our discoveries in identifying the sources of generated audio waveforms. Our experiments conducted on the multi-speaker LibriTTS dataset show that (1) vocoder models do leave model-specific fingerprints on the audio they generate, and (2) minor differences in vocoder training can result in sufficiently different fingerprints in generated audio as to allow for distinguishing between the two. We believe that these differences are strong evidence that there exist vocoder-specific fingerprints that can be exploited for source identification purposes.", Hierarchical Multi-Resolution Graph Generation Networks,https://openreview.net/forum?id=iYZkvCji36L,https://openreview.net/pdf?id=iYZkvCji36L,,"In real-world domains, graphs often have natural hierarchies. However, data-driven graph generation has yet to effectively respect and exploit such structures. We propose a novel approach that recursively generates community structures at multiple resolutions, with the generated structures conforming to the training data distribution at each level of the hierarchy. While the generation of a community at one level takes place sequentially, lower-level sub-structures of the community can be handled in parallel. This significantly improves the speed of both generation and learning, resulting in $\mathcal{O}(\log n)$ generative rounds. Our method is further supported by an expressive probability distribution for intermediate and leaf levels of this hierarchical model. Our method achieves state-of-the-art performance in graph generation, in both accuracy and efficiency, on many datasets.","Graph Generative models, GNN, Multinomial distribution" Any-scale Balanced Samplers for Discrete Space,https://openreview.net/forum?id=lEkl0jdSb7B,https://openreview.net/pdf?id=lEkl0jdSb7B,"We identify two key issues of existing gradient based locally balanced samplers, and provide improved proposals with adjusted weight function and 2nd order approximation.","The locally balanced informed proposal has proved to be highly effective for sampling from discrete spaces. However, its success relies on the ""local'' factor, which ensures that whenever the proposal distribution is restricted to be near the current state, the locally balanced weight functions are asymptotically optimal and the gradient approximations are accurate. In seeking a more efficient sampling algorithm, many recent works have considered increasing the scale of the proposal distributions, but this causes the ""local'' factor to no longer hold. Instead, we propose any-scale balanced samplers to repair the gap in non-local proposals. In particular, we substitute the locally balanced function with an any-scale balanced function that can self-adjust to achieve better efficiency for proposal distributions at any scale. We also use quadratic approximations to capture the curvature of the target distribution and reduce the error in the gradient approximation, while employing a Gaussian integral trick with a special estimated diagonal to efficiently sample from the quadratic proposal distribution. 
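For context on the samplers described above, a sketch of the single-flip locally balanced baseline they generalise (binary variables, first-order gradient approximation, weight function g(t) = sqrt(t)); the paper's any-scale weight function and quadratic proposal are not reproduced here:

```python
import numpy as np

def locally_balanced_flip_probs(x, grad):
    # First-order log-density change of flipping bit i is -(2 x_i - 1) grad_i;
    # the locally balanced weight g(t) = sqrt(t) turns it into flip scores
    # exp(delta / 2), normalised into a categorical proposal over bits.
    delta = -(2.0 * x - 1.0) * grad
    scores = np.exp(delta / 2.0)
    return scores / scores.sum()

x = np.random.randint(0, 2, size=50).astype(float)
grad = np.random.randn(50)            # stand-in for the log-density gradient
flip = np.random.choice(50, p=locally_balanced_flip_probs(x, grad))
```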
On various synthetic and real distributions, the proposed sampler substantially outperforms existing approaches.","MCMC, Discrete Space Sampling, Locally Balanced Proposal" Stochastic Optimization under Strongly Convexity and Lipschitz Hessian: Minimax Sample Complexity,https://openreview.net/forum?id=DD8ZJNdTPtO,https://openreview.net/pdf?id=DD8ZJNdTPtO,,"Optimization of convex functions under stochastic zeroth-order feedback has been a major and challenging question in online learning. In this work we consider the problem of optimizing second-order smooth and strongly convex functions where the algorithm has access only to noisy evaluations of the objective function it queries. We provide the first tight characterization of the minimax simple regret rate by developing matching upper and lower bounds. We propose an algorithm that features a combination of a bootstrapping stage and a mirror-descent stage. The main innovation of our approach is the usage of a gradient estimation scheme that exploits the local geometry of the objective function, and we provide a sharp analysis for the corresponding estimation bounds. ", BinSGDM: Extreme One-Bit Quantization for Communication Efficient Large-Scale Distributed Training ,https://openreview.net/forum?id=U45w87vFQ3,https://openreview.net/pdf?id=U45w87vFQ3, Extreme One-Bit Quantization for Communication Efficient Large-Scale Distributed Training ,"To alleviate the communication bottleneck of large-scale distributed training, a rich body of communication-compression optimizers has been proposed. These methods focus mainly on high compression ratios in the hope of acceleration. However, as some recent works have pointed out, when running with distributed training frameworks (\emph{e.g.}, \emph{DistributedDataParallel} in PyTorch), these methods may provide no acceleration over off-the-shelf uncompressed SGD/Adam in typical settings, due to heavy compression/decompression computation, incompatibility with efficient communication primitives, or the requirement of an uncompressed warmup at the early stage. For these reasons, we propose a novel extreme one-bit quantization optimizer, dubbed \emph{BinSGDM}. The quantization of \emph{BinSGDM} is computed easily and cheaply, and it does not need to resort to uncompressed optimizers for warmup. We also theoretically prove that it enjoys the same convergence rate as the original Adam. Moreover, we further present a hierarchical communication scheme to lower the communication volume. Extensive experiments are conducted on 8 to 64 GPUs (1 to 8 nodes) for distributed training with \emph{DistributedDataParallel}, and the experimental results demonstrate that \emph{BinSGDM} with the communication scheme can achieve up to {$\bm{2.47 \times}$} speedup for training ResNet-50 and $\bm{6.26\times}$ speedup for training BERT-Base, compared to the full-precision optimizers.","Distributed Learning, Optimizer, Communication Efficiency" Gradient-based Algorithms for Pessimistic Bilevel Optimization,https://openreview.net/forum?id=vUC2qwPVvw,https://openreview.net/pdf?id=vUC2qwPVvw,"We propose the first gradient-based algorithm for pessimistic bilevel optimization, provide the first convergence result with nonlinear objective functions, and validate our design and analysis through experiments on several robust learning problems.","As a powerful framework for a variety of machine learning problems, bilevel optimization has attracted much attention. 
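An aside on the BinSGDM entry above: the abstract does not spell out its exact quantiser, but the generic one-bit recipe such optimisers build on, sign bits plus a single scale with error feedback, looks like this sketch:

```python
import numpy as np

def one_bit_compress(m, residual):
    # Fold the previous quantisation error into the update, transmit one sign
    # bit per coordinate plus a single scale, and keep the new error for the
    # next round (error feedback).
    v = m + residual
    scale = np.abs(v).mean()
    q = scale * np.sign(v)
    return np.signbit(v), scale, v - q

bits, scale, resid = one_bit_compress(np.random.randn(1000), np.zeros(1000))
```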
While many modern gradient-based algorithms have been devised for optimistic bilevel optimization, pessimistic bilevel optimization (PBO) remains under-explored and has only been studied in linear settings. To fill this void, we investigate PBO with nonlinear inner- and outer-level objective functions in this work, by reformulating it into a single-level constrained optimization problem. In particular, two gradient-based algorithms are first proposed to solve the reformulated problem, i.e., the switching gradient method (SG-PBO) and the primal-dual method (PD-PBO). Through carefully handling the bias errors in gradient estimations resulting from the nature of bilevel optimization, we show that both SG-PBO and PD-PBO converge to the global minimum of the reformulated problem when it is strongly convex, which immediately implies convergence for the original PBO. Moreover, we propose the proximal scheme (Prox-PBO) with a convergence guarantee for the nonconvex reformulated problem. To the best of our knowledge, this is the first work that investigates gradient-based algorithms and provides convergence analysis for PBO under nonlinear settings. We further conduct experiments on an illustrative example and a robust hyperparameter learning problem, which clearly validate our algorithmic design and theoretical analysis.","pessimistic bilevel optimization, convergence analysis, nonconvex, gradient-based method" Equivariant Shape-Conditioned Generation of 3D Molecules for Ligand-Based Drug Design,https://openreview.net/forum?id=4MbGnp4iPQ,https://openreview.net/pdf?id=4MbGnp4iPQ,We develop a shape-conditioned 3D generative model for ligand-based drug design,"Shape-based virtual screening is widely employed in ligand-based drug design to search chemical libraries for molecules with similar 3D shapes yet novel 2D chemical structures compared to known ligands. 3D deep generative models have the potential to automate this exploration of shape-conditioned 3D chemical space; however, no existing models can reliably generate valid drug-like molecules in conformations that adopt a specific shape such as a known binding pose. We introduce a new multimodal 3D generative model that enables shape-conditioned 3D molecular design by equivariantly encoding molecular shape and variationally encoding chemical identity. We ensure local geometric and chemical validity of generated molecules by using autoregressive fragment-based generation with heuristic bonding geometries, allowing the model to prioritize the scoring of rotatable bonds to best align the growing conformational structure to the target shape. We evaluate our 3D generative model in tasks relevant to drug design including shape-conditioned generation of chemically diverse molecular structures and shape-constrained molecular property optimization, demonstrating its utility over virtual screening of enumerated libraries.","molecules, equivariance, generation" Leaves: Learning Views for Time-Series Data in Contrastive Learning,https://openreview.net/forum?id=f8PIYPs-nB,https://openreview.net/pdf?id=f8PIYPs-nB,We propose a simple but effective method to automatically learn views for time-series data in contrastive learning,"Contrastive learning, a self-supervised learning method that can learn representations from unlabeled data, has developed promisingly. Many methods of contrastive learning depend on data augmentation techniques, which generate different views from the original signal. 
However, tuning policies and hyper-parameters for more effective data augmentation methods in contrastive learning is often time- and resource-consuming. Researchers have designed approaches to automatically generate new views for some input signals, especially for image data. However, view learning is not well developed for time-series data. In this work, we propose a simple but effective module for automating view generation for time-series data in contrastive learning, named learning views for time-series data (LEAVES). The proposed module learns the hyper-parameters for augmentations using adversarial training in contrastive learning. We validate the effectiveness of the proposed method using multiple time-series datasets. The experiments demonstrate that the proposed method is more effective in finding reasonable views and performs downstream tasks better than the baselines, including manually tuned augmentation-based contrastive learning methods and SOTA methods.","Contrastive Learning, Data Augmentation, Learning Views, Time-Series Data, Adversarial Learning" The Eigenlearning Framework: A Conservation Law Perspective on Kernel Ridge Regression and Wide Neural Networks,https://openreview.net/forum?id=rMkd7_6fB7,https://openreview.net/pdf?id=rMkd7_6fB7,"We identify a conserved quantity in kernel ridge regression and leverage it to develop a theory of generalization, concluding with various applications of our framework.","We derive simple closed-form estimates for the test risk and other generalization metrics of kernel ridge regression (KRR). Relative to prior work, our derivations are greatly simplified and our final expressions are far more interpretable. These improvements are enabled by our identification of a sharp conservation law which limits the ability of KRR to learn any orthonormal basis of functions. Test risk and other objects of interest are expressed in a transparent, interpretable way in terms of our conserved quantity evaluated in the kernel eigenbasis. We use our improved framework to: i) provide a theoretical explanation for the ``deep bootstrap"" of Nakkiran et al (2020), ii) prove a new result regarding the hardness of the classic parity problem, iii) fashion a theoretical tool for the study of adversarial robustness, and iv) draw a tight analogy between KRR and a well-studied system in statistical physics.","kernel regression, kernel ridge regression, wide neural networks, neural tangent kernel, ntk, generalization, conservation laws, learnability" Imbalanced Semi-supervised Learning with Bias Adaptive Classifier,https://openreview.net/forum?id=rVM8wD2G7Dy,https://openreview.net/pdf?id=rVM8wD2G7Dy,This work proposes a bi-level learning framework to learn a tailored classifier for imbalanced semi-supervised learning.,"Pseudo-labeling has proven to be a promising semi-supervised learning (SSL) paradigm. Existing pseudo-labeling methods commonly assume that the class distributions of training data are balanced. However, such an assumption is far from realistic scenarios and thus severely limits the performance of current pseudo-labeling methods under the context of class imbalance. To alleviate this problem, we design a bias adaptive classifier that targets imbalanced SSL setups. The core idea is to automatically assimilate the training bias caused by class imbalance via the bias adaptive classifier, which is composed of a novel bias attractor and the original linear classifier. 
The bias attractor is designed as a light-weight residual network and learned through a bi-level learning framework, which enables the bias adaptive classifier to fit imbalanced training data, while the linear classifier can provide unbiased label prediction for each class. We conduct extensive experiments under various imbalanced semi-supervised setups, and the results demonstrate that our method can be applied to different pseudo-labeling models and is superior to current state-of-the-art methods.","semi-supervised learning, weakly-supervised learning" COMNET : CORTICAL MODULES ARE POWERFUL,https://openreview.net/forum?id=Crw1sKsLDvl,https://openreview.net/pdf?id=Crw1sKsLDvl,"A novel CNN architecture leveraging biological structures in visual cortex to cater real-time applications with low latency, smaller depths","Existing CNN architectures may achieve efficiency in only one or two dimensions (FLOPs, depth, accuracy, representation power, latency) but not in all. In this work, we present a pragmatically designed novel CNN architecture “CoMNet” which offers multi-dimensional efficiency at once: simple yet accurate, lower latency and FLOPs, high representation power with limited parameters, low memory consumption, negligible branching, smaller depth, and only a few design hyperparameters. The key to achieving this multi-dimensional efficiency is our use of biological underpinnings in CoMNet, primarily the organization of cortical modules in the visual cortex. To realize CoMNet, a few concepts from well-understood CNN designs, such as residual learning, are directly inherited. Our experimental evaluations demonstrate the superiority of CoMNet over many dominant industry and academic architectures such as ResNet and RepVGG. For instance, CoMNet surpasses ResNet-50 on ImageNet while being 50% shallower, with 22% fewer parameters, 25% lower FLOPs and latency, and 16% fewer training epochs. Code will be open-sourced after the reviews.","CNN Architecture, Multi-dimensional efficiencies, Cortical Modules, Columnar Structure, Real-Time Applications, Latency" A Multi-objective Perspective towards Improving Meta-Generalization,https://openreview.net/forum?id=0o_PPAJstY,https://openreview.net/pdf?id=0o_PPAJstY,We propose to improve meta-generalization from a multi-objective point of view. ,"Improving meta-generalization, i.e., accommodating out-of-domain meta-testing tasks beyond the meta-training ones, is significant for extending the success of meta-learning beyond standard benchmarks. Previous heterogeneous meta-learning algorithms have shown that tailoring the global meta-knowledge by the learned clusters during meta-training promotes better meta-generalization to novel meta-testing tasks. Inspired by this, we propose a novel multi-objective perspective to sharpen the compositionality of the meta-trained clusters, through which we empirically validate that meta-generalization further improves. Grounded on the hierarchically structured meta-learning framework, we formulate a hypervolume loss to evaluate the degree of conflict between multiple cluster-conditioned parameters in the two-dimensional loss space over two randomly chosen tasks belonging to two clusters and two mixed tasks imitating out-of-domain tasks. Experimental results on more than 16 few-shot image classification datasets show not only improved performance on out-of-domain meta-testing datasets but also better clusters in visualization. 
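As an aside, the two-dimensional hypervolume underlying the conflict measure just described can be computed with a short sweep; the exact pairing of tasks and clusters in the paper is not reproduced in this sketch:

```python
import numpy as np

def hypervolume_2d(losses, ref):
    # Area dominated by 2-D loss points w.r.t. a reference that upper-bounds
    # both losses (minimisation). Sort by the first loss, sweep while keeping
    # the best second loss seen so far, and add each non-dominated rectangle.
    pts = losses[np.argsort(losses[:, 0])]
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y < prev_y:
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv

pts = np.array([[1.0, 3.0], [2.0, 1.0]])
print(hypervolume_2d(pts, ref=np.array([4.0, 4.0])))  # -> 7.0
```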
","meta learning, multi-objective optimization" Do We Always Need to Penalize Variance of Losses for Learning with Label Noise?,https://openreview.net/forum?id=FJdSi_seSg,https://openreview.net/pdf?id=FJdSi_seSg,,"Algorithms which minimize the averaged loss have been widely designed for dealing with noisy labels. Intuitively, when there is a finite training sample, penalizing the variance of losses will improve the stability and generalization of the algorithms. Interestingly, we found that the variance of losses sometimes needs to be increased for the problem of learning with noisy labels. Specifically, increasing the variance of losses would boost the memorization effect and reduce the harmfulness of incorrect labels. Regularizers can be easily designed to increase the variance of losses and be plugged in many existing algorithms. Empirically, the proposed method by increasing the variance of losses could improve the generalization ability of baselines on both synthetic and real-world datasets.", DeepGuiser: Learning to Disguise Neural Architectures for Impeding Adversarial Transfer Attacks,https://openreview.net/forum?id=PArJcOptzg,https://openreview.net/pdf?id=PArJcOptzg,"DeepGuiser is an automatic, hardware-agnostic, and retrain-free neural architecture disguising method to disguise the neural architectures, to resist possible adversarial attacks rendered by the model extraction attacks. ","Security is becoming increasingly critical in deep learning applications. Recent researches demonstrate that NN models are vulnerable to adversarial attacks, which can mislead them with only small input perturbations. Moreover, adversaries who know the architecture of victim models can conduct more effective attacks. Unfortunately, the architectural knowledge can usually be stolen by the adversaries by exploiting the system-level hints through many side channels, which is referred to as the neural architecture extraction attack. Conventional countermeasures for neural architecture extraction can introduce large overhead, and different hardware platforms have diverse types of side-channel leakages such that many expert efforts are needed in developing hardware-specific countermeasures. In this paper, we propose DeepGuiser, an automatic, hardware-agnostic, and retrain-free neural architecture disguising method, to disguise the neural architectures to reduce the harm of neural architecture extraction attacks. In a nutshell, given a trained model, DeepGuiser outputs a deploy model that is functionally equivalent with the trained model but with a different (i.e., disguising) architecture. DeepGuiser can minimize the harm of the follow-up adversarial transfer attacks to the deploy model, even if the disguising architecture is completely stolen by the architecture extraction attack. Experiments demonstrate that DeepGuiser can effectively disguise diverse architectures and impede the adversarial transferability by 13.87% ∼ 32.59%, while only introducing 10% ∼ 40% extra inference latency.","Neural architecture extraction attack, neural architecture disguising, adversarial robustness, transferability predictor, policy learning" Network Controllability Perspectives on Graph Representation,https://openreview.net/forum?id=N6iz-EQkuar,https://openreview.net/pdf?id=N6iz-EQkuar,We develop a novel graph representation method using network control properties and demonstrate its theoretical merits. 
,"Graph representations in fixed dimensional feature space are vital in applying learning tools and data mining algorithms to perform graph analytics. Such representations must encode the graph's topological and structural information at the local and global scales without posing significant computation overhead. This paper employs a unique approach grounded in networked control system theory to obtain expressive graph representations with desired properties. We consider graphs as networked dynamical systems and study their controllability properties to explore the underlying graph structure. The controllability of a networked dynamical system profoundly depends on the underlying network topology, and we exploit this relationship to design novel graph representations using controllability Gramian and related metrics. We discuss the merits of this new approach in terms of the desired properties (for instance, permutation and scale invariance) of the proposed representations. Our evaluation of various benchmark datasets in the graph classification framework demonstrates that the proposed representations either outperform (sometimes by more than 6%), or give similar results to the state-of-the-art embeddings. ","Graph Representation, Network Controllability, Graph Classification" FACS: FAST ADAPTIVE CHANNEL SQUEEZING,https://openreview.net/forum?id=QrdSiDAv5ek,https://openreview.net/pdf?id=QrdSiDAv5ek,Computationally Efficient Channel Squeezing in CNNs with high representation power,"Channel squeezing is one of the central operations performed in CNN bottlenecks to reduce the number of channels in a feature map. This operation is carried out by using a 1 × 1 pointwise convolution which constitutes a significant amount of computations and parameters in a given network. ResNet-50 for instance, consists of 16 such layers which form 33% of total layers and 25% (1.05B/4.12B) of total FLOPs or computations. In the light of their predominance, we propose a novel “Fast Adaptive Channel Squeezing” module which carries out the squeezing operation in a computationally efficient manner. The key benefit of FACS is that it neither alters the number of parameters nor affects the accuracy of a given network. When plugged into diverse CNNs architectures, namely ResNet, VGG, and MobileNet-v2, FACS achieves state-of-the-art performance on ImageNet and CIFAR datasets at dramatically reduced FLOPs. FACS also cuts the training time significantly, and lowers the latency which is particularly advantageous for fast inference on edge devices. The source-code will be made publicly available.","Fast Channel squeezing, Edge Devices, CNN" Pre-trained Language Models can be Fully Zero-Shot Learners,https://openreview.net/forum?id=jCpTofV7iY_,https://openreview.net/pdf?id=jCpTofV7iY_,,"How can we extend a pre-trained model to many language understanding tasks, without labeled or additional unlabeled data? Pre-trained language models (PLMs) have been effective for a wide range of NLP tasks. However, existing approaches either require fine-tuning on downstream labeled datasets or manually constructing proper prompts. In this paper, we propose nonparametric prompting PLM (NPPrompt) for fully zero-shot language understanding. Unlike previous methods, NPPrompt uses only pre-trained language models and does not require any labeled data or additional raw corpus for further fine-tuning, nor does it rely on humans to construct a comprehensive set of prompt label words. 
We evaluate NPPrompt against previous major few-shot and zero-shot learning methods on diverse NLP tasks, including text classification, text entailment, similar text retrieval, and paraphrasing. Experimental results demonstrate that our NPPrompt outperforms the previous best fully zero-shot method by large margins, with absolute gains of 12.8% in accuracy on text classification and 18.9% on the GLUE benchmark.","pre-trained language models, zero-shot learning, prompt" On Compositional Uncertainty Quantification for Seq2seq Graph Parsing,https://openreview.net/forum?id=rJcLocAJpA6,https://openreview.net/pdf?id=rJcLocAJpA6,"In this paper, we aim to quantify and evaluate compositional uncertainty for seq2seq graph parsing by proposing a simple probabilistic framework and rigorous evaluation metrics.","Recent years have witnessed the success of applying seq2seq models to graph parsing tasks, where the outputs are compositionally structured (e.g., a graph or a tree). However, these seq2seq approaches pose a challenge in quantifying the model’s compositional uncertainty on graph structures due to the gap between seq2seq output probability and structural probability on the graph. This work is the first to quantify and evaluate compositional uncertainty for seq2seq graph parsing tasks. First, we propose a generic, probabilistically interpretable framework that establishes correspondences between seq2seq output probabilities and structural probabilities on the graph. This framework serves as a powerful medium for quantifying a seq2seq model's compositional uncertainty on graph elements (i.e., nodes or edges). Second, to evaluate uncertainty quality in terms of calibration, we propose a novel metric called Compositional Expected Calibration Error (CECE), which can measure a model’s calibration behavior in predicting graph structures. Through a thorough evaluation of compositional uncertainty on three different tasks across ten domains, we demonstrate that CECE reflects distribution shift better than vanilla sequence ECE. Finally, we validate the effectiveness of compositional uncertainty on the task of collaborative semantic parsing, where the model is allowed to send limited subgraphs for human review. The results show that the collaborative performance based on uncertain subgraph selection consistently outperforms random subgraph selection (30\% average error reduction rate) and performs comparably to oracle subgraph selection (only 0.33 difference in average prediction error), indicating that compositional uncertainty is an ideal signal for model errors and can benefit various downstream tasks.","Uncertainty Quantification, Seq2seq Graph Parsing" Generative Gradual Domain Adaptation with Optimal Transport,https://openreview.net/forum?id=E1_fqDe3YIC,https://openreview.net/pdf?id=E1_fqDe3YIC,,"Unsupervised domain adaptation (UDA) adapts a model from a labeled source domain to an unlabeled target domain in a one-off way. Though widely applied, UDA faces a great challenge whenever the distribution shift between the source and the target is large. Gradual domain adaptation (GDA) mitigates this limitation by using intermediate domains to gradually adapt from the source to the target domain. However, how to leverage this paradigm when the oracle intermediate domains are missing or scarce remains an open problem. 
To approach this practical challenge, we propose Generative Gradual Domain Adaptation with Optimal Transport (GOAT), an algorithmic framework that can generate intermediate domains in a data-dependent way. More concretely, we generate intermediate domains along the Wasserstein geodesic between two given consecutive domains in a feature space, and apply gradual self-training, a standard GDA algorithm, to adapt the source-trained classifier to the target along the sequence of intermediate domains. Empirically, we demonstrate that our GOAT framework can improve the performance of standard GDA when the oracle intermediate domains are scarce, significantly broadening the real-world application scenarios of GDA.","Domain Adaptation, Gradual Domain Adaptation, Distribution Shift" ENHANCING THE PRIVACY OF FEDERATED LEARNING THROUGH DATA SYNTHESIS,https://openreview.net/forum?id=J0IhgZ8ziv-,https://openreview.net/pdf?id=J0IhgZ8ziv-,,"Federated Learning (FL) is a distributed machine learning architecture where edge devices collaboratively learn the shared model, while the training data is securely held at the edge devices. FL promises a way forward to preserving data privacy by sending model updates in the form of gradients or the weights themselves. However, these updates still contain the essence of the original training data and can be reconstructed using gradient-based attacks. To overcome this, we propose a novel Privacy-Preserving Federated Learning algorithm (PPFed) wherein we generate a condensed dataset from the original training data at each edge device. The clients then train their local models on the condensed dataset, which is then broadcast to the server, followed by regular federated averaging. Our method provides privacy by being robust against gradient-based attacks, which holds across different benchmark datasets and CNN-based architectures.", Free Lunch for Domain Adversarial Training: Environment Label Smoothing,https://openreview.net/forum?id=GPTjnA57h_3,https://openreview.net/pdf?id=GPTjnA57h_3,"We propose to smooth environment labels for domain adversarial training methods, which is experimentally and theoretically shown able to improve training stability, local convergence, and robustness to noisy labels.","A fundamental challenge for machine learning models is how to generalize learned models for out-of-distribution (OOD) data. Among various approaches, exploiting invariant features by Domain Adversarial Training (DAT) has received widespread attention. Despite its success, we observe training instability from DAT, mostly due to an over-confident domain discriminator and environment label noise. To address this issue, we propose Environment Label Smoothing (ELS), which encourages the discriminator to output soft probabilities, thus reducing its confidence and alleviating the impact of noisy environment labels. We demonstrate, both experimentally and theoretically, that ELS can improve training stability, local convergence, and robustness to noisy environment labels. By incorporating ELS with DAT methods, we are able to yield state-of-the-art results on a wide range of domain generalization/adaptation tasks, particularly when the environment labels are highly noisy. 
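The ELS recipe described above is essentially label smoothing applied to the domain discriminator's environment targets; a sketch for the two-environment case, with eps = 0.1 as an illustrative value:

```python
import torch
import torch.nn.functional as F

def smoothed_domain_loss(disc_logits, env_labels, eps=0.1):
    # Train the domain discriminator against (1 - eps)/eps soft targets
    # instead of hard 0/1 environment labels, keeping it from becoming
    # over-confident and softening the damage of mislabelled environments.
    soft = env_labels * (1.0 - eps) + (1.0 - env_labels) * eps
    return F.binary_cross_entropy_with_logits(disc_logits, soft)

loss = smoothed_domain_loss(torch.randn(16), torch.randint(0, 2, (16,)).float())
```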
","Out-of-Distribution Generalization, Domain adaptation/generalization, Domain adversarial training, environmnt label noise, non-asymptotic convergence" DYNAMIC ENSEMBLE FOR PROBABILISTIC TIME- SERIES FORECASTING VIA DEEP REINFORCEMENT LEARNING,https://openreview.net/forum?id=a6NvoZ5DLoe,https://openreview.net/pdf?id=a6NvoZ5DLoe,We develop a general dynamic ensemble framework for probabilistic multi-horizon time series forecasting using deep reinforcement learning.,"Ensembles from given base learners are known to be indispensable in improving accuracy for most of the prediction tasks, leading to numerous methods. However, the only ensembling strategies that have been considered for time series forecasting in the past have been static methods, ones that have access to the predictions of the base learners but not to the base learners themselves. In this paper, we propose a novel \textit{dynamic ensemble policy}, which, unlike static methods, uses the power of the ensemble to improve each of the base learners being ensembled by reducing the error accumulation of each base learner via consecutively feeding an ensembled sample to each base learner. To do so, we adopt a deep Reinforcement Learning (RL) framework with a Markov Decision Process (MDP) designed where the ensemble agent interacts with our environment (\textit{TS-GYM}) from offline data. The output of our ensemble strategy is a single autoregressive forecaster that supports several desirable properties of uncertainty quantification and sample path, along with notable performance gain. The effectiveness of the proposed framework is demonstrated in multiple synthetic and real-world experiments.","Time series, ensemble, reinforcement learning" Scaling Forward Gradient With Local Losses,https://openreview.net/forum?id=JxpBP1JM15-,https://openreview.net/pdf?id=JxpBP1JM15-,,"Forward gradient learning computes a noisy directional gradient and is a biologically plausible alternative to backprop for learning deep neural networks. The standard forward gradient algorithm suffers from the curse of dimensionality in the number of parameters. In this paper, we propose to scale forward gradient by adding a large number of local greedy loss functions. We consider block-wise, patch-wise, and channel group-wise local losses, and show that activity perturbation reduces variance compared to weight perturbation. Inspired by MLPMixer, we also propose a new architecture, LocalMixer, that is more suitable for local learning. We find local learning can work well with both supervised classification and self-supervised contrastive learning. 
Empirically, it can match backprop on MNIST and CIFAR-10 and significantly outperform backprop-free algorithms on ImageNet.", Recommendation with User Active Disclosing Willingness,https://openreview.net/forum?id=wUXcwhZ9yT,https://openreview.net/pdf?id=wUXcwhZ9yT,,"Recommender systems have been deployed in a large number of real-world applications, profoundly influencing people's daily life and production. Traditional recommender models mostly collect user behaviors as comprehensively as possible for accurate preference estimation. However, considering privacy, storage/computation burdens, and other issues, users may not want to disclose all their behaviors for training the model. In this paper, we study a novel recommendation paradigm, where the users are allowed to indicate their ""willingness"" on disclosing different behaviors, and the models are optimized by trading off the recommendation quality against the violation of the user ""willingness"". More specifically, we formulate the recommendation problem as a multi-player game, where the action is a selection vector representing whether or not to involve the items in the model training. For efficiently solving this game, we design a tailored algorithm based on the influence function to lower the time cost of recommendation quality exploration, and also extend it with multiple anchor selection vectors. We conduct extensive experiments to demonstrate the effectiveness of our model in balancing the recommendation quality and user disclosing willingness.","recommender system, recommendation, collaborative filtering, user behavior modeling" PAC-NeRF: Physics Augmented Continuum Neural Radiance Fields for Geometry-Agnostic System Identification,https://openreview.net/forum?id=tVkrbkz42vc,https://openreview.net/pdf?id=tVkrbkz42vc,,"Existing approaches to system identification (estimating the physical parameters of an object) from videos assume known object geometries. This precludes their applicability in a vast majority of scenes where object geometries are complex or unknown. In this work, we aim to identify parameters characterizing a physical system from a set of multi-view videos without any assumption on object geometry or topology. To this end, we propose ""Physics Augmented Continuum Neural Radiance Fields"" (PAC-NeRF), to estimate both the unknown geometry and physical parameters of highly dynamic objects from multi-view videos. We design PAC-NeRF to only ever produce physically plausible states by enforcing the neural radiance field to follow the conservation laws of continuum mechanics. For this, we design a hybrid Eulerian-Lagrangian representation of the neural radiance field, i.e., we use the Eulerian grid representation for NeRF density and color fields, while advecting the neural radiance fields via Lagrangian particles. This hybrid Eulerian-Lagrangian representation seamlessly blends efficient neural rendering with the material point method (MPM) for robust differentiable physics simulation.
We validate the effectiveness of our proposed framework on geometry and physical parameter estimation over a vast range of materials, including elastic bodies, plasticine, sand, Newtonian and non-Newtonian fluids, and demonstrate significant performance gains on most tasks.","System Identification, Neural Radiance Fields, Differentiable Physics, Material Point Method" Linearly Constrained Bilevel Optimization: A Smoothed Implicit Gradient Approach,https://openreview.net/forum?id=LzPN-BHiJuc,https://openreview.net/pdf?id=LzPN-BHiJuc,,"This work develops analysis and algorithms for solving a class of bilevel optimization problems where the lower-level (LL) problems have linear constraints. Most of the existing approaches for constrained bilevel problems rely on value-function-based approximate reformulations, which suffer from issues such as non-convex and non-differentiable constraints. In contrast, in this work, we develop an implicit gradient-based approach, which is easy to implement, and is suitable for machine learning applications. We first provide an in-depth understanding of the problem, by showing that the implicit objective for such problems is in general non-differentiable. However, if we add a small (linear) perturbation to the LL objective, the resulting problem becomes differentiable almost surely. This key observation opens the door for developing (deterministic and stochastic) gradient-based algorithms similar to the state-of-the-art ones for unconstrained bilevel problems. We show that when the implicit function is assumed to be strongly-convex, convex, or non-convex, the resulting algorithms converge at guaranteed rates. Finally, we experimentally corroborate the theoretical findings and evaluate the performance of the proposed framework on numerical and adversarial learning problems. To our knowledge, this is the first time that (implicit) gradient-based methods have been developed and analyzed for the considered class of bilevel problems.", Mastering the Game of No-Press Diplomacy via Human-Regularized Reinforcement Learning and Planning,https://openreview.net/forum?id=F61FwJTZhb,https://openreview.net/pdf?id=F61FwJTZhb,We train a bot that places first in a no-press Diplomacy tournament with humans by using human-data-regularized reinforcement learning and planning ,"No-press Diplomacy is a complex strategy game involving both cooperation and competition that has served as a benchmark for multi-agent AI research. While self-play reinforcement learning has resulted in numerous successes in purely adversarial games like chess, Go, and poker, self-play alone is insufficient for achieving optimal performance in domains involving cooperation with humans. We address this shortcoming by first introducing a planning algorithm we call DiL-piKL that regularizes a reward-maximizing policy toward a human imitation-learned policy. We prove that this is a no-regret learning algorithm under a modified utility function. We then show that DiL-piKL can be extended into a self-play reinforcement learning algorithm we call RL-DiL-piKL that provides a model of human play while simultaneously training an agent that responds well to this human model. We used RL-DiL-piKL to train an agent we name Diplodocus.
In a 200-game no-press Diplomacy tournament involving 62 human participants spanning skill levels from beginner to expert, two Diplodocus agents both achieved a higher average score than all other participants who played more than two games, and ranked first and third according to an Elo ratings model.", Understanding Embodied Reference with Touch-Line Transformer,https://openreview.net/forum?id=ugA1HX69sf,https://openreview.net/pdf?id=ugA1HX69sf,People often refer to objects using both language expressions and pointing gestures at the same time. Our model locates these objects (referents) more accurately using the virtual touch line. ,"We study embodied reference understanding: locating referents using embodied gestural cues and language references. A popular misconception is that referents lie on the elbow-wrist line. However, as shown by human studies, the virtual touch line much more accurately indicates the direction of referents. This more accurate indicator of referent directions is missing from human pose representations in existing computational models. Consequently, existing computational models cannot effectively incorporate gestural information when locating referents. We help computational models utilize this critical gestural information by devising the touch-line transformer. Our touch-line transformer takes tokenized visual and textual features as inputs and simultaneously predicts the referent’s bounding box and a touch-line vector. At the same time, to facilitate the use of the touch-line prior, we apply a geometric consistency loss that encourages the co-linearity between referents and touch lines. Incorporating gestural information improves model performance significantly. Experiments on the YouRefIt dataset show our method achieves a +25.0% accuracy improvement under the 0.75 IoU criterion, closing 63.6% of the gap between model and human performances. Furthermore, we computationally verify prior human studies by showing that computational models more accurately locate referents when using the virtual touch line than when using the elbow-wrist line. Our codes and models will be publicly available.", Evaluating Robustness of Cooperative MARL: A Model-based Approach,https://openreview.net/forum?id=kugE_tCwsC,https://openreview.net/pdf?id=kugE_tCwsC,A novel model-based adversarial attack framework for cooperative multi-agent reinforcement learning with a novel victim-agent selection strategy.,"In recent years, a proliferation of methods has been developed for cooperative multi-agent reinforcement learning (c-MARL). However, the robustness of c-MARL agents against adversarial attacks has rarely been explored. In this paper, we propose to evaluate the robustness of c-MARL agents via a model-based approach, named c-MBA. Our proposed formulation can craft much stronger adversarial state perturbations of c-MARL agents to lower total team rewards than existing model-free approaches. In addition, we propose the first victim-agent selection strategy and the first data-driven approach to define targeted failure states, each of which allows us to develop even stronger adversarial attacks without expert knowledge of the underlying environment.
Our numerical experiments on two representative MARL benchmarks illustrate the advantage of our approach: our model-based attack consistently outperforms the baselines in all tested environments.","robust c-MARL, model-based adversarial attack" The Emergence of Prototypicality: Unsupervised Feature Learning in Hyperbolic Space,https://openreview.net/forum?id=uVyD2VRZg_T,https://openreview.net/pdf?id=uVyD2VRZg_T,We propose a novel unsupervised learning method by leveraging the property of hyperbolic space for organizing images based on prototypicality and semantics.,"Prototypicality is extensively studied in machine learning and computer vision. However, there is still no widely accepted definition of prototypicality. In this paper, we first propose to define prototypicality based on the concept of congealing. Then, we develop a novel method called HACK to automatically discover prototypical examples from the dataset. HACK conducts unsupervised prototypicality learning in \underline{H}yperbolic space with sphere p\underline{ACK}ing. HACK first generates uniformly packed particles in the Poincar\'e ball of hyperbolic space and then assigns each image uniquely to a particle. Due to the geometrical property of hyperbolic space, prototypical examples naturally emerge and tend to be located near the center of the Poincar\'e ball. HACK naturally leverages hyperbolic space to discover prototypical examples in a data-driven fashion. We verify the effectiveness of the method with a synthetic dataset and natural image datasets. Extensive experiments show that HACK can naturally discover the prototypical examples without supervision. The discovered prototypical examples and atypical examples can be used to reduce sample complexity and increase model robustness.","Unsupervised Learning, Prototypicality, Hyperbolic Space" Rethinking Symbolic Regression Datasets and Benchmarks for Scientific Discovery,https://openreview.net/forum?id=i2e2wqt0nAI,https://openreview.net/pdf?id=i2e2wqt0nAI,We propose new datasets and an evaluation metric to discuss the performance of symbolic regression for scientific discovery (SRSD).,"This paper revisits datasets and evaluation criteria for Symbolic Regression, a task of expressing given data using mathematical equations, specifically focused on its potential for scientific discovery. Focusing on a set of formulas used in the existing datasets based on Feynman Lectures on Physics, we recreate 120 datasets to discuss the performance of symbolic regression for scientific discovery (SRSD). For each of the 120 SRSD datasets, we carefully review the properties of the formula and its variables to design reasonably realistic sampling ranges of values so that our new SRSD datasets can be used for evaluating the potential of SRSD, such as whether or not an SR method can (re)discover physical laws from such datasets. As an evaluation metric, we also propose to use normalized edit distances between a predicted equation and the ground-truth equation trees. While existing metrics are either binary or errors between the target values and an SR model's predicted values for a given input, normalized edit distances evaluate a sort of similarity between the ground-truth and predicted equation trees. We have conducted experiments on our new SRSD datasets using five state-of-the-art SR methods in SRBench and a simple baseline based on a recent Transformer architecture.
The results show that we provide a more realistic performance evaluation and open up a new machine learning-based approach for scientific discovery. We provide our datasets and code as part of the supplementary material.","symbolic regression for scientific discovery, physics, datasets, benchmarks" The Cost of Privacy in Fair Machine Learning,https://openreview.net/forum?id=32w-1DCZuVS,https://openreview.net/pdf?id=32w-1DCZuVS,,"A common task in fair machine learning is training ML models that preserve certain summary statistics across subpopulations defined by sensitive attributes. However, access to such sensitive attributes in training data is restricted and the learner must rely on noisy proxies for the sensitive attributes. In this paper, we study the effect of a privacy mechanism that obfuscates the sensitive attributes from the learner on the fairness of the resulting classifier. We show that the cost of privacy in fair ML is a decline in the generalizability of fairness constraints.", Coordinated Strategy Identification Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=44DHnx0Ya_j,https://openreview.net/pdf?id=44DHnx0Ya_j,"We present a framework which expedites and stabilizes learning in hierarchical multi-agent reinforcement learning via episodic memory, while achieving coordinated behaviors among agents with a novel theoretic regularization.","An agent's strategy can be considered as a subset of action spaces, specialized in certain goals. This paper introduces coordinated Strategy Identification Multi-Agent reinforcement learning (MARL) with episodic memory, called SIMA. SIMA derives a new temporal difference (TD) target to increase the sample efficiency. The efficiency is achieved by keeping the best returns and the corresponding best joint strategies for given states. This TD target with an additive strategy mixer automatically switches between episodic control and conventional Q-learning according to the existence of similar memories. In addition, each agent needs to behave similarly according to its strategy trajectory for coordinated behaviors among agents and coherent evaluation of a group's joint strategies. To this end, SIMA introduces a theoretical regularization for action policies to maximize the mutual information between an agent’s trajectory and its specified strategy. We demonstrate its significant performance improvement on the StarCraft Multi-Agent Challenge benchmark. ","Multi-Agent Reinforcement Learning, Coordinated Strategy, Hierarchical Multi-Agent learning, Episodic Memory" VARIATIONAL ADAPTIVE GRAPH TRANSFORMER FOR MULTIVARIATE TIME SERIES MODELING,https://openreview.net/forum?id=PWWW73yQVp,https://openreview.net/pdf?id=PWWW73yQVp,,"Multivariate time series (MTS) are widely collected by large-scale complex systems, such as internet services, IT infrastructures, and wearable devices. The modeling of MTS has long been an important but challenging task. To capture complex long-range dynamics, Transformers have been utilized in MTS modeling and achieved attractive performance. However, Transformers in general do not capture well the diverse relationships between different channels within MTS and have difficulty in modeling MTS with complex distributions due to the lack of stochasticity. In this paper, we first incorporate relational modeling into the Transformer to develop an adaptive Graph Transformer (G-Trans) module for MTS.
Then, we further consider stochasticity by introducing a powerful embedding-guided probabilistic generative module for G-Trans to construct the Variational adaptive Graph Transformer (VG-Trans), which is a well-defined variational generative dynamic model. VG-Trans is utilized to learn expressive representations of MTS, being a plug-and-play framework that can be applied to forecasting and anomaly detection tasks of MTS. For efficient inference, we develop an autoencoding variational inference scheme with a combined prediction and reconstruction loss. Extensive experiments on diverse datasets show the efficiency of VG-Trans on MTS modeling: it outperforms state-of-the-art methods on a variety of MTS modeling tasks.", One-Vs-All AUC Maximization: an effective solution to the low-resource named entity recognition problem,https://openreview.net/forum?id=Azw-0kVtsX,https://openreview.net/pdf?id=Azw-0kVtsX,,"Named entity recognition (NER), a sequence labelling/token classification task, has been traditionally considered a multi-class classification problem, the learning objective of which is to either optimise the multi-class cross entropy loss (CE) or train a conditional random field (CRF). However, these standard learning objectives, though scalable to large NER datasets and used in state-of-the-art work, largely ignore the problem of imbalanced label distributions that is inherent in all NER corpora. We show this leads to degraded performance in low-resource settings. Reformulating this standard multi-class labelling problem as a one-vs-all (OVA) learning problem, we propose to optimise the NER model with an AUC-based alternative loss function that is more capable of handling imbalanced datasets. As OVA often leads to a higher training time compared to the standard multi-class setting, we also develop two training strategies: one trains together the labels that share similar linguistic characteristics, and the other employs a meta-learning approach to speed up convergence. In order to motivate some of our experiments and better interpret the results, we also develop a Bayesian theory of what the AUC function represents during learning. Experimental results under low-resource NER settings from benchmark corpora show that our methods can achieve consistently better performance compared with the learning objectives commonly used in NER. We also give evidence that our methods are robust and agnostic to the underlying NER embeddings, models, domains, and label distributions. The code to replicate this work will be released upon the publication of this paper.","NLP, NER, Low-Resource, Imbalanced Distribution, AUC Maximization, One-Vs-All" Efficient Large-scale Transformer Training via Random and Layerwise Token Dropping,https://openreview.net/forum?id=CPg5IRu9PL,https://openreview.net/pdf?id=CPg5IRu9PL,We present a novel random and layerwise token dropping method that can save up to 33.3% theoretical compute cost and 25.6% wall-clock time while achieving comparable accuracy as compared to the standard training procedure.,"Large-scale transformer models have become the de-facto architectures for various machine learning applications, e.g., CV and NLP. However, those large models also introduce prohibitive training costs. To mitigate this issue, we propose a novel random and layerwise token dropping method (\OURS), which skips the computation of a subset of the input tokens at all middle layers.
Particularly, \OURS achieves considerable speedups and comparable accuracy to the standard training baseline. Compared to other token dropping methods, \OURS does not require (1) any importance score-based metrics, (2) any special token treatment (e.g., \texttt{[CLS]}), or (3) training many layers at full sequence length, except the first and the last layers. Besides, a new \layertoken learning rate schedule is proposed for pretraining problems, which resolves the heavy tuning requirement for our proposed training mechanism. Finally, we demonstrate that \OURS can be applied to broader applications, including \gpt and \bert pretraining as well as ViT and \gpt finetuning tasks. Our results show that \OURS can save about 33.3\% theoretical compute cost and 25.6\% wall-clock training time while achieving similar zero-shot evaluations on \gptb as compared to baseline.","Efficient Training, Large-scale Transformers, Token Dropping, GPT, BERT, ViT" Towards Robust Dataset Learning,https://openreview.net/forum?id=OA4o8yKW3q,https://openreview.net/pdf?id=OA4o8yKW3q,We study the problem of learning a robust dataset such that any classifier naturally trained on the dataset is adversarially robust. ,"We study the problem of learning a robust dataset such that any classifier naturally trained on the dataset is adversarially robust. Such a dataset benefits the downstream tasks as natural training is much faster than adversarial training, and demonstrates that the desired property of robustness is transferable between models and data. In this work, we propose a principled, tri-level optimization to formulate the robust dataset learning problem. We show that, under an abstraction model that characterizes robust vs. non-robust features, the proposed method provably learns a robust dataset. Extensive experiments on MNIST, CIFAR10, and TinyImageNet demonstrate the effectiveness of our algorithm with different network initializations and architectures.",robust dataset learning Demystifying black-box DNN training processes through Concept-Monitor,https://openreview.net/forum?id=Us8pHYSEgO,https://openreview.net/pdf?id=Us8pHYSEgO,,"Despite the successes of deep neural networks (DNNs) on a broad range of tasks, little has been understood about why and how they achieve such victories, due to their complex architectures and opaque black-box training processes. With the goal of unveiling the mystery of DNNs, in this work, we propose a general framework called Concept-Monitor to uncover the black-box DNN training processes automatically for the first time. Our proposed Concept-Monitor enables human-interpretable visualization of the DNN training processes and thus facilitates transparency as well as a deeper understanding of how DNNs function and operate along the training iterations. Using Concept-Monitor, we are able to observe and compare different training paradigms with ease, including standard training, fine-tuning, adversarial training and network pruning for the Lottery Ticket Hypothesis, which brings new insights on why and how adversarial training and network pruning work and how they modify the network during training.
For example, we find that the lottery ticket hypothesis discovers a mask that makes neurons interpretable at initialization, \textit{without} any finetuning, and we also find that adversarially robust models have more neurons relying on color as compared to standard models trained on the same dataset.","interpretability, deep learning" Generalization Mechanics in Deep Learning,https://openreview.net/forum?id=xIr81Cft7s,https://openreview.net/pdf?id=xIr81Cft7s,"Uncovers an implicit regularization in deep learning, and gives a new approach to obtain non-vacuous generalization bounds","Deep neural nets are well known for their ability to generalize, but understanding the mechanics of generalization remains an important open question. We analyze the probability-flow through a deep net, especially what happens to this flow during the training process. Quantitatively, this study leads to a new approach for obtaining generalization bounds, using which we derive a non-vacuous bound for an over-parameterized deep network that learns a relatively small dataset such as CIFAR-10. On the qualitative side, this reveals that the process of learning imposes an implicit regularization, \viz orbit-regularization, as an essential ingredient for generalization. ","Deep Learning, Generalization, regularization" Large Language Models Can Self-improve,https://openreview.net/forum?id=NiEtU7blzN,https://openreview.net/pdf?id=NiEtU7blzN,Improving Reasoning Ability of Large Language Models in An Unsupervised Fashion,"Large Language Models (LLMs) have achieved excellent performances in various tasks. However, fine-tuning an LLM requires extensive supervision. Humans, on the other hand, may improve their reasoning abilities by self-thinking without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate “high-confidence” rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%→82.1% on GSM8K, 78.2%→83.0% on DROP, 90.0%→94.4% on OpenBookQA, and 63.4%→67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground truth label. We conduct ablation studies and show that finetuning on reasoning is critical for self-improvement.","natural language processing, unsupervised learning, chain of thought" Calibration Matters: Tackling Maximization Bias in Large-scale Advertising Recommendation Systems,https://openreview.net/forum?id=wzlWiO_WY4,https://openreview.net/pdf?id=wzlWiO_WY4,We solve the maximization bias problem in large-scale advertising recommendation systems.,"Calibration is defined as the ratio of the average predicted click rate to the true click rate. The optimization of calibration is essential to many online advertising recommendation systems because it directly affects the downstream bids in ads auctions and the amount of money charged to advertisers. Despite its importance, calibration often suffers from a problem called “maximization bias”. Maximization bias refers to the phenomenon that the maximum of predicted values overestimates the true maximum. The problem is introduced because the calibration is computed on the set selected by the prediction model itself.
It persists even if unbiased predictions are achieved on every datapoint and worsens when covariate shifts exist between the training and test sets. To mitigate this problem, we quantify maximization bias and propose a variance-adjusting debiasing (VAD) meta-algorithm in this paper. The algorithm is efficient, robust, and practical as it is able to mitigate the maximization bias problem under covariate shifts, without incurring additional online serving costs or compromising the ranking performance. We demonstrate the effectiveness of the proposed algorithm using a state-of-the-art recommendation neural network model on a large-scale real-world dataset.","Maximization bias, calibration, distribution shifts, neural networks, recommendation system, computational advertisement" Excess risk analysis for epistemic uncertainty with application to variational inference,https://openreview.net/forum?id=kXiS0PpoSP6,https://openreview.net/pdf?id=kXiS0PpoSP6,,"Bayesian deep learning plays an important role especially for its ability to evaluate epistemic uncertainty (EU). Due to computational complexity issues, approximation methods such as variational inference (VI) have been used in practice to obtain posterior distributions, and their generalization abilities have been analyzed extensively, for example, by PAC-Bayesian theory; however, little analysis exists on EU, although many numerical experiments have been conducted on it. In this study, we analyze the EU of supervised learning in approximate Bayesian inference by focusing on its excess risk. First, we theoretically show novel relations between generalization error and the widely used EU measurements, such as the variance and mutual information of the predictive distribution, and derive their convergence behaviors. Next, we clarify how the objective function of VI regularizes the EU. With this analysis, we propose a new objective function for VI that directly controls the prediction performance and the EU based on the PAC-Bayesian theory. Numerical experiments show that our algorithm significantly improves the EU evaluation over the existing VI methods.","Uncertainty, variational inference, Bayesian inference" Memorization-Dilation: Modeling Neural Collapse Under Noise,https://openreview.net/forum?id=cJWxqmmDL2b,https://openreview.net/pdf?id=cJWxqmmDL2b,,"The notion of neural collapse refers to several emergent phenomena that have been empirically observed across various canonical classification problems. During the terminal phase of training a deep neural network, the feature embeddings of all examples of the same class tend to collapse to a single representation, and the features of different classes tend to separate as much as possible. Neural collapse is often studied through a simplified model, called the layer-peeled model, in which the network is assumed to have ``infinite expressivity'' and can map each data point to any arbitrary representation. In this work, we study a more realistic variant of the layer-peeled model, which takes the positivity of the features into account. Furthermore, we extend this model to also incorporate the limited expressivity of the network. Empirical evidence suggests that the memorization of noisy data points leads to a degradation (dilation) of the neural collapse. Using a model of the memorization-dilation (M-D) phenomenon, we show one mechanism by which different losses lead to different performances of the trained network on noisy data.
Our proofs reveal why label smoothing, a modification of cross-entropy empirically observed to produce a regularization effect, leads to improved generalization in classification tasks.","Neural collapse, feature representation, label smoothing, cross entropy" Spacetime Representation Learning,https://openreview.net/forum?id=qV_M_rhYajc,https://openreview.net/pdf?id=qV_M_rhYajc,Representation of directed graphs by exploiting the causal structure of spacetimes via Lorentzian pre-length spaces,"Much of the data we encounter in the real world can be represented as directed graphs. In this work, we introduce a general family of representations for directed graphs through connected time-oriented Lorentz manifolds, called ""spacetimes"" in general relativity. Spacetimes intrinsically contain a causal structure that indicates whether or not there exists a causal or even chronological order between points of the manifold, called events. This chronological order allows us to naturally represent directed edges by imposing the correct ordering when the nodes are embedded as events in the spacetime. Previous work in machine learning only considers embeddings lying on the simplest Lorentz manifold or does not exploit the connection between Lorentzian pre-length spaces and directed graphs. We introduce a well-defined approach to map data onto a general family of spacetimes. We empirically evaluate our framework on the tasks of hierarchy extraction of undirected graphs, directed link prediction, and representation of directed graphs.","pseudo-Riemannian geometry, spacetimes, Lorentz geometry, Lorentzian causality theory, Lorentzian pre-length spaces, directed graphs" Meta-Learning General-Purpose Learning Algorithms with Transformers,https://openreview.net/forum?id=Y2ShteTrnX2,https://openreview.net/pdf?id=Y2ShteTrnX2,Transformers and other black-box models can exhibit learning-to-learn that generalizes to significantly different datasets while undergoing multiple phase transitions in terms of their learning behavior.,"Modern machine learning requires system designers to specify aspects of the learning pipeline, such as losses, architectures, and optimizers. Meta-learning, or learning-to-learn, instead aims to learn those aspects, and promises to unlock greater capabilities with less manual effort. One particularly ambitious goal of meta-learning is to train general-purpose learning algorithms from scratch, using only black-box models with minimal inductive bias. A general-purpose learning algorithm is one which takes in training data, and produces test-set predictions across a wide range of problems, without any explicit definition of an inference model, training loss, or optimization algorithm. In this paper, we show that Transformers and other black-box models can be meta-trained to act as general-purpose learning algorithms, and can generalize to learn on datasets different from those used during meta-training. We characterize phase transitions between algorithms that generalize, algorithms that memorize, and algorithms that fail to meta-train at all, induced by changes in model size, number of tasks used during meta-training, and meta-optimization hyper-parameters. We further show that the capabilities of meta-trained algorithms are bottlenecked by the accessible state size determining the next prediction, unlike standard models which are thought to be bottlenecked by parameter count.
Finally, we propose practical interventions, such as biasing the training distribution, that improve the meta-training and meta-generalization of general-purpose learning algorithms.","meta-learning, general-purpose, transformers, learning-to-learn, meta-optimization, large-models, black-box" Learning to Extrapolate: A Transductive Approach,https://openreview.net/forum?id=lid14UkLPd4,https://openreview.net/pdf?id=lid14UkLPd4,,"Machine learning systems, especially overparameterized deep neural networks, can generalize to novel testing instances drawn from the same distribution as the training data. However, they fare poorly when evaluated on out-of-support testing points. In this work, we tackle the problem of developing machine learning systems that retain the power of overparametrized function approximators, while enabling extrapolation to out-of-support testing points when possible. This is accomplished by noting that under certain conditions, a ""transductive"" reparameterization can convert an out-of-support extrapolation problem into a problem of within-support combinatorial generalization. We propose a simple strategy based on bilinear embeddings to enable this type of combinatorial generalization, thereby addressing the out-of-support extrapolation problem. We instantiate a simple, practical algorithm applicable to various supervised learning problems and imitation learning tasks. ", Label-free Concept Bottleneck Models,https://openreview.net/forum?id=FlCg47MNvBA,https://openreview.net/pdf?id=FlCg47MNvBA,"Scalable, automated and efficient way to create Concept Bottleneck Models without labeled concept data.","Concept bottleneck models (CBMs) are a popular way of creating more interpretable neural networks by having hidden layer neurons correspond to human-understandable concepts. However, existing CBMs and their variants have two crucial limitations: first, the need to collect labeled data for each of the predefined concepts, which is time-consuming and labor-intensive; second, the accuracy of a CBM is often significantly lower than that of a standard neural network, especially on more complex datasets. This poor performance creates a barrier for adoption in practical real-world applications. Motivated by these challenges, we propose \textit{Label-free} CBM, which is a framework to transform any neural network into an interpretable CBM without labeled concept data, while retaining a high accuracy. Our Label-free CBM has many advantages: it is \textit{scalable} - we present the first CBM scaled to ImageNet, \textit{efficient} - creating a CBM takes only a few hours even for very large datasets, and \textit{automated} - training it for a new dataset requires minimal human effort.","Interpretability, Explainability, Concept Bottleneck Models" COMBAT: Alternated Training for Near-Perfect Clean-Label Backdoor Attacks,https://openreview.net/forum?id=Udho-Hry4RZ,https://openreview.net/pdf?id=Udho-Hry4RZ,"We propose a novel mechanism to develop clean-label attacks with near-perfect attack performance, based on alternated training between a trigger generator and a surrogate classifier model.","Backdoor attacks pose a critical concern to the practice of using third-party data for AI development. The data can be poisoned to make a trained model misbehave when a predefined trigger pattern appears, granting the attackers illegal benefits. While most proposed backdoor attacks are dirty-label, clean-label attacks are more desirable as they keep data labels unchanged to dodge human inspection.
However, designing a working clean-label attack is a challenging task, and existing clean-label attacks show underwhelming performance. In this paper, we propose a novel mechanism to develop clean-label attacks with near-perfect attack performance. The key component is a trigger pattern generator, which is trained together with a surrogate model in an alternating manner. Our proposed mechanism is flexible and customizable, allowing different backdoor trigger types and behaviors for either single or multiple target labels. Our backdoor attacks can reach near-perfect attack success rates and bypass all state-of-the-art backdoor defenses, as illustrated via comprehensive experiments on three standard benchmark datasets, including CIFAR-10, GTSRB, and CelebA.","backdoor, clean-label, alternated training" Multi-level Protein Structure Pre-training via Prompt Learning,https://openreview.net/forum?id=XGagtiJ8XC,https://openreview.net/pdf?id=XGagtiJ8XC,,"A protein can focus on different structure levels to implement its functions. Each structure level has its own merits and driving forces in describing specific characteristics, and they cannot replace each other. Most existing function prediction methods take the tertiary structure as input, unintentionally ignoring the other levels of protein structure. Considering that protein sequences can determine multi-level structures, in this paper, we aim to realize the comprehensive potential of protein sequences for function prediction. Specifically, we propose a new prompt-guided multi-task pre-training and fine-tuning framework, and the resulting protein model is called PromptProtein. Through the prompt-guided multi-task pre-training, we learn multiple prompt signals to steer the model to focus on different structure levels. We also design a prompt fine-tuning module to provide downstream tasks with the on-demand flexibility of utilizing the respective levels of structure information. Extensive experiments on function prediction and protein engineering show that PromptProtein outperforms state-of-the-art methods by large margins.","protein representation learning, prompt learning, multi-task learning, multi-level structure" CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks,https://openreview.net/forum?id=iPWiwWHc1V,https://openreview.net/pdf?id=iPWiwWHc1V,"We propose an automated method for generating descriptions of the representation learned by hidden layer neurons, leveraging the multimodal CLIP model.","In this paper, we propose CLIP-Dissect, a new technique to automatically describe the function of individual hidden neurons inside vision networks. CLIP-Dissect leverages recent advances in multimodal vision/language models to label internal neurons with open-ended concepts without the need for any labeled data or human examples, which are required for existing tools to succeed. We show that CLIP-Dissect provides more accurate descriptions than existing methods for last-layer neurons, where the ground truth is available, as well as qualitatively good descriptions for hidden layer neurons. In addition, our method is very flexible: it is model agnostic, can easily handle new concepts and can be extended to take advantage of better multimodal models in the future.
Finally, CLIP-Dissect is computationally efficient and can label all neurons from five layers of ResNet-50 in just four minutes.","Interpretability, Explainability, Network Dissection" GLM-130B: An Open Bilingual Pre-trained Model,https://openreview.net/forum?id=-Aw0rrrPUF,https://openreview.net/pdf?id=-Aw0rrrPUF,,"We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model as good as GPT-3 and unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we face numerous unexpected technical and engineering challenges, particularly on loss spikes and divergence. In this paper, we introduce the pre-training process of GLM-130B, including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resultant GLM-130B model offers significant outperformance over GPT-3 175B on a wide range of popular English benchmarks, an advantage not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B—the largest Chinese language model—across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization with almost no performance loss, making it the first among 100B-scale models, and more importantly, allowing its effective inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the most affordable GPUs ever required for using 100B-scale models. The GLM-130B model weights are publicly accessible and its code, training logs, related toolkit, and lessons learned are open-sourced at https://anonymous.4open.science/r/GLM-130B/.", Causal Estimation for Text Data with (Apparent) Overlap Violations,https://openreview.net/forum?id=Ha2MnQM9Ph,https://openreview.net/pdf?id=Ha2MnQM9Ph,,"Consider the problem of estimating the causal effect of some attribute of a text document; for example: what effect does writing a polite vs. rude email have on response time? To estimate a causal effect from observational data, we need to adjust for confounding aspects of the text that affect both the treatment and outcome---e.g., the topic or writing level of the text. These confounding aspects are unknown a priori, so it seems natural to adjust for the entirety of the text (e.g., using a transformer). However, causal identification and estimation procedures rely on the assumption of overlap: for all levels of the adjustment variables, there is randomness leftover so that every unit could have (not) received treatment. Since the treatment here is itself an attribute of the text, it is perfectly determined, and overlap is apparently violated. The purpose of this paper is to show how to handle causal identification and obtain robust causal estimation in the presence of apparent overlap violations. In brief, the idea is to use supervised representation learning to produce a data representation that preserves confounding information while eliminating information that is only predictive of the treatment. This representation then suffices for adjustment and satisfies overlap. Adapting results on non-parametric estimation, we show that this procedure is robust to conditional outcome misestimation and yields a low-bias estimator that admits valid uncertainty quantification under weak conditions.
Empirical results show reductions in bias and strong improvements in uncertainty quantification relative to the natural (transformer-based) baseline.", Understanding Pruning at Initialization: An Effective Node-Path Balancing Perspective,https://openreview.net/forum?id=nqoxB03tzi,https://openreview.net/pdf?id=nqoxB03tzi,Our paper gives new perspectives on pruning at initialization from the configuration of subnetworks (particularly effective paths and nodes) to better design pruning algorithms.,"Pruning at initialization (PaI) methods aim to remove weights of neural networks before training in pursuit of reducing training costs. While current PaI methods are promising and outperform random pruning, much work remains to be done to understand and improve PaI methods to achieve the performance of pruning after training. In particular, recent studies (Frankle et al., 2021; Su et al., 2020) present empirical evidence for the potential of PaI, and show intriguing properties, e.g., that layerwise random shuffling of a pruned network's connections preserves or even improves the performance. Our paper gives new perspectives on PaI from the geometry of subnetwork configurations. We propose to use two quantities to probe the shape of subnetworks: the numbers of effective paths and effective nodes (or channels). Using these numbers, we provide a principled framework to better understand PaI methods. Our main findings are: (i) the width of subnetworks matters in regular sparsity levels (< 99%) - this matches the competitive performance of shuffled layerwise subnetworks; (ii) node-path balancing plays a critical role in the quality of PaI subnetworks, especially in extreme sparsity regimes. These findings suggest an important direction for network pruning that takes into account the subnetwork topology itself. To illustrate the promise of this direction, we present a fairly naive method based on SynFlow (Tanaka et al., 2020) and conduct extensive experiments on different architectures and datasets to demonstrate its effectiveness.","Pruning Neural Network, Sparsity" Intrinsic Computational Complexity of Equivariant Neural Networks,https://openreview.net/forum?id=-MQWXqNyoa,https://openreview.net/pdf?id=-MQWXqNyoa,This paper theoretically studies the required computational complexity for equivariant neural networks to achieve a desired expressivity.,"Equivariant neural networks have shown significant advantages in learning on data with intrinsic symmetries represented by groups. A major concern is the high computational cost in the case of large-scale groups, especially in the inference stage. This paper studies the required computational complexity of equivariant neural networks in inference for achieving a desired expressivity. We theoretically compare three classes of ReLU networks: (1) two-layer group-averaging networks (TGNs); (2) two-layer layer-wise equivariant networks (TENs); and (3) two-layer networks without any equivariant constraints (TNs), with a new notion, {\it intrinsic computational complexity}, for better characterizing computational costs.
We prove that (1) TGNs/TENs have equal and full expressivity to represent any invariant function that can be learned by a TN, where the TGNs and TENs have equal intrinsic computational complexities; (2) a TGN/TEN requires at most double the intrinsic computational complexity of a TN; and (3) a TEN can achieve an inference speed coincident with its intrinsic computational complexity, while TGNs are strictly slower, which justifies the computational advantages of layer-wise equivariant architectures over group averaging. Our theory rules out the existence of equivariant networks with group-scale-independent computational costs, summarized in a new no-free-lunch theorem: when more equivariance is desired, more computation is required.","Equivariant Neural Networks, Learning Theory" Data Continuity Matters: Improving Sequence Modeling with Lipschitz Regularizer,https://openreview.net/forum?id=27uBgHuoSQ,https://openreview.net/pdf?id=27uBgHuoSQ,,"Sequence modeling is a core problem in machine learning, and various neural networks have been designed to process different types of sequence data. However, few attempts have been made to understand the inherent properties of sequence data, neglecting a critical factor that may significantly affect the performance of sequence modeling. In this paper, we theoretically and empirically analyze a generic property of sequence data, i.e., continuity, and connect this property with the performance of deep models. First, we empirically observe that different kinds of models for sequence modeling prefer data with different continuity. Then, we theoretically analyze the continuity preference of different models in both time and frequency domains. To further utilize continuity to improve sequence modeling, we propose a simple yet effective Lipschitz Regularizer that can flexibly adjust data continuity according to model preferences, while bringing very little extra computational cost. Extensive experiments on various tasks demonstrate that altering data continuity via the Lipschitz Regularizer can largely improve the performance of many deep models for sequence modeling.","deep learning, data continuity, sequence modeling" MoDem: Accelerating Visual Model-Based Reinforcement Learning with Demonstrations,https://openreview.net/forum?id=JdTnc9gjVfJ,https://openreview.net/pdf?id=JdTnc9gjVfJ,"We find that leveraging just a handful of demonstrations can dramatically improve the sample-efficiency of model-based RL, but requires three key ingredients: policy pretraining, targeted exploration, and oversampling of demonstration data.","Poor sample efficiency continues to be the primary challenge for deployment of deep Reinforcement Learning (RL) algorithms for real-world applications, and in particular for visuo-motor control. Model-based RL has the potential to be highly sample efficient by concurrently learning a world model and using synthetic rollouts for planning and policy improvement. However, in practice, sample-efficient learning with model-based RL is bottlenecked by the exploration challenge. In this work, we find that leveraging just a handful of demonstrations can dramatically improve the sample-efficiency of model-based RL. Simply appending demonstrations to the interaction dataset, however, does not suffice. We identify key ingredients for leveraging demonstrations in model learning -- policy pretraining, targeted exploration, and oversampling of demonstration data -- which forms the three phases of our model-based RL framework.
We empirically study three complex visuo-motor control domains and find that our method is 260%-350% more successful in completing sparse reward tasks compared to prior approaches in the low data regime (100K interaction steps, 5 demonstrations). Webpage: https://modemrl.github.io/","model-based reinforcement learning, visual reinforcement learning, learning from demonstrations" Improving the Estimation of Instance-dependent Transition Matrix by using Self-supervised Learning,https://openreview.net/forum?id=d_w12b7fb20,https://openreview.net/pdf?id=d_w12b7fb20,,"The transition matrix reveals the transition relationship between clean labels and noisy labels. It plays an important role in building statistically consistent classifiers. In real-world applications, the transition matrix is usually unknown and has to be estimated. It is a challenging task to accurately estimate the transition matrix, especially when it depends on the instance. Given that both instances and noisy labels are available, the major difficulty of learning the transition matrix comes from the lack of clean information. Many methods have been proposed to infer clean information. Self-supervised learning has demonstrated great success. These methods could even achieve comparable performance to supervised learning on some datasets without requiring any labels during training. This implies that these methods can efficiently infer clean labels. Motivated by this, in this paper, we propose a practical method that leverages self-supervised learning to help learn the instance-dependent transition matrix. Empirically, the proposed method achieves state-of-the-art performance on different datasets.", Holographic-(V)AE: an end-to-end SO(3)-Equivariant (Variational) Autoencoder in Fourier Space,https://openreview.net/forum?id=LcQ3aRCEuKK,https://openreview.net/pdf?id=LcQ3aRCEuKK,,"Group-equivariant neural networks have emerged as a data-efficient approach to solve classification and regression tasks, while respecting the relevant symmetries of the data. However, little work has been done to extend this paradigm to the unsupervised and generative domains. Here, we present Holographic-(V)AE (H-(V)AE), a fully end-to-end SO(3)-equivariant (variational) autoencoder in Fourier space, suitable for unsupervised learning and generation of data distributed around a specified origin. H-(V)AE is trained to reconstruct the spherical Fourier encoding of data, learning in the process a latent space with a maximally informative invariant embedding alongside an equivariant frame describing the orientation of the data. We extensively test the performance of H-(V)AE on diverse datasets and show that its latent space efficiently encodes the categorical features of spherical images and structural features of protein atomic environments. Our work can further be seen as a case study for equivariant modeling of a data distribution by reconstructing its Fourier encoding.", A general differentially private learning framework for decentralized data,https://openreview.net/forum?id=kAfl36VUr95,https://openreview.net/pdf?id=kAfl36VUr95,,"Decentralized consensus learning, which minimizes a finite sum of expected objective functions over a network of nodes, has been hugely successful. However, the local communication across neighboring nodes in the network may lead to the leakage of private information.
To address this challenge, we propose a general differentially private (DP) learning framework for decentralized data that applies to many non-smooth learning problems. We show that the proposed algorithm retains the performance guarantee in terms of stability, generalization, and finite sample performance. We investigate the impact of local privacy-preserving computation on the global DP guarantee. Further, we extend the discussion by adopting a new class of noise-adding DP mechanisms based on generalized Gaussian distributions to improve the utility-privacy trade-offs. Our numerical results demonstrate the effectiveness of our algorithm and its better performance over the state-of-the-art baseline methods in various decentralized settings.", Wasserstein Barycenter-based Model Fusion and Linear Mode Connectivity of Neural Networks,https://openreview.net/forum?id=qHbyR1MKG8K,https://openreview.net/pdf?id=qHbyR1MKG8K,We propose a framework for model fusion through the Wasserstein barycenter and Gromov-Wasserstein barycenter and connect it to the understanding of linear mode connectivity of neural networks.,"Based on the concepts of Wasserstein barycenter (WB) and Gromov-Wasserstein barycenter (GWB), we propose a unified mathematical framework for neural network (NN) model fusion and utilize it to reveal new insights about the linear mode connectivity of SGD solutions. In our framework, the fusion occurs in a layer-wise manner and builds on an interpretation of a node in a network as a function of the layer preceding it. The versatility of our mathematical framework allows us to talk about model fusion and linear mode connectivity for a broad class of NNs, including fully connected NNs, CNNs, ResNets, RNNs, and LSTMs, in each case exploiting the specific structure of the network architecture. We present extensive numerical experiments to: 1) illustrate the strengths of our approach in relation to other model fusion methodologies and 2) from a certain perspective, provide new empirical evidence for recent conjectures which say that two local minima found by gradient-based methods end up lying in the same basin of the loss landscape after a proper permutation of weights is applied to one of the models.","Linear mode connectivity, Neural network model fusion, Wasserstein barycenter, Federated learning" Evaluating Robustness of Generative Models with Adversarial Networks,https://openreview.net/forum?id=ZTPzwWtKW7o,https://openreview.net/pdf?id=ZTPzwWtKW7o,,"With the advent of adversarial robustness as a research area, much novel work attempts to design creative defense mechanisms against adversarial vulnerabilities that arise. While classification models are the most common target of adversarial robustness research, generative models are often underestimated, though they play essential roles in many applications. This work evaluates generative models for reconstruction tasks in terms of their adversarial robustness. We construct two frameworks: a standard and a universal-attack framework. The standard framework requires an input to find its perturbation, and the universal-attack framework generates adversarial perturbations from the distribution of a dataset. Extensive experimental evidence discussed in this paper suggests that both frameworks can effectively alter how images are reconstructed and classified using classic generative models trained on the MNIST and Cropped Yale Face datasets. Further, these frameworks outperform state-of-the-art adversarial attacks.
Moreover, we showcase how the proposed framework can be used to retrain a generative model and improve its resilience against adversarial perturbations. Furthermore, in the case of generative models, an attacker may wish not to alter the latent space. Thus, we also include an analysis of the latent space.","Adversarial examples, Generative models, Generative adversarial networks, Adversarial attacks" Weakly-Supervised Domain Adaptation in Federated Learning,https://openreview.net/forum?id=_1gu0EX0mM3,https://openreview.net/pdf?id=_1gu0EX0mM3,We leverage auxiliary information and propose gradient projection (GP) to tackle the federated domain adaptation problem under weak supervision. ,"Federated domain adaptation (FDA) describes the setting where a set of source clients seek to optimize the performance of a target client. To be effective, FDA must address some of the distributional challenges of federated learning (FL). For instance, FL systems exhibit distribution shifts across clients. Further, labeled data are not always available among the clients. To this end, we propose and compare novel approaches for FDA, combining the few labeled target samples with the source data when auxiliary labels are available to the clients. The in-distribution auxiliary information is included during local training to boost out-of-domain accuracy. Also, during fine-tuning, we devise a simple yet efficient gradient projection method to detect the valuable components of each source client model along the target direction. Extensive experiments on medical imaging datasets show that our proposed framework significantly improves federated domain adaptation performance.","federated learning, domain adaptation, healthcare" PD-MORL: Preference-Driven Multi-Objective Reinforcement Learning Algorithm,https://openreview.net/forum?id=zS9sRyaPFlJ,https://openreview.net/pdf?id=zS9sRyaPFlJ,A novel approach that obtains a single policy network optimizing multiple objectives using multi-objective reinforcement learning on challenging continuous control tasks.,"Multi-objective reinforcement learning (MORL) approaches have emerged to tackle many real-world problems with multiple conflicting objectives by maximizing a joint objective function weighted by a preference vector. These approaches find fixed customized policies corresponding to preference vectors specified during training. However, the design constraints and objectives typically change dynamically in real-life scenarios. Furthermore, storing a policy for each potential preference is not scalable. Hence, obtaining a set of Pareto front solutions for the entire preference space in a given domain with a single training run is critical. To this end, we propose a novel MORL algorithm that trains a single universal network to cover the entire preference space, scalable to continuous robotic tasks. The proposed approach, Preference-Driven MORL (PD-MORL), utilizes the preferences as guidance to update the network parameters. It also employs a novel parallelization approach to increase sample efficiency.
We show that PD-MORL achieves up to 25% larger hypervolume for challenging continuous control tasks and uses an order of magnitude fewer trainable parameters compared to prior approaches.","multi-objective reinforcement learning, MORL, DDQN, TD3, HER, continuous control, robotics application" When Majorities Prevent Learning: Eliminating Bias to Improve Worst-group and Out-of-distribution Generalization,https://openreview.net/forum?id=W7udwvFMnAd,https://openreview.net/pdf?id=W7udwvFMnAd,,"Modern neural networks trained on large datasets have achieved state-of-the-art (in-distribution) generalization performance on various tasks. However, their good generalization performance has been shown to be largely attributable to overfitting spurious biases in large datasets. This is evidenced by the poor generalization performance of such models on minorities and out-of-distribution data. To alleviate this issue, subsampling the majority groups has been shown to be very effective. However, it is not clear how to find the subgroups (e.g. within a class) in large real-world datasets. Moreover, naively subsampling the majority groups can entirely deplete some of their smaller sub-populations and drastically harm the in-distribution performance. Here, we show that tracking gradient trajectories of examples in initial epochs allows for finding large subpopulations of data points. We leverage this observation and propose an importance sampling method that is biased towards selecting smaller subpopulations, and eliminates bias in the large subpopulations. Our experiments confirm the effectiveness of our approach in eliminating spurious biases and learning higher-quality models with superior in- and out-of-distribution performance on various datasets.", Precautionary Unfairness in Self-Supervised Contrastive Pre-training,https://openreview.net/forum?id=l2FXO1RJ5Hs,https://openreview.net/pdf?id=l2FXO1RJ5Hs,,"Recently, self-supervised contrastive pre-training has become the de facto regime, as it allows for efficient downstream fine-tuning. Meanwhile, its fairness issues are barely studied, though they have drawn great attention from the machine learning community, where structured biases in data can lead to biased predictions against under-represented groups. Most existing fairness metrics and algorithms focus on supervised settings, e.g., based on disparities in prediction performance, and they become inapplicable in the absence of supervision. We are thus interested in the challenging question: how does the pre-training representation (un)fairness transfer to the downstream task (un)fairness, and can we define and pursue fairness in unsupervised pre-training? Firstly, we empirically show that imbalanced groups in the pre-training data indeed lead to unfairness in the pre-trained representations, which cannot be easily fixed by fairness-aware fine-tuning without sacrificing efficiency. Secondly, motivated by the observation that the majority group of the pre-training data dominates the learned representations, we design the first unfairness metric that is applicable to self-supervised learning, and leverage it to guide the contrastive pre-training towards fairness-aware representations. Our experiments demonstrate that the previously underestimated representation disparities register surges of over 10% on the proposed metric, and that our algorithm improves 10 out of 13 tasks on the 1%-labeled CelebA dataset. Codes will be released upon acceptance.
","fairness, contrastive learning" Understanding the Role of Nonlinearity in Training Dynamics of Contrastive Learning,https://openreview.net/forum?id=s130rTE3U_X,https://openreview.net/pdf?id=s130rTE3U_X,We analyze the role played by the nonlinearity in the training dynamics nonlinear 2-layer network for contrastive learning. ,"While the empirical success of self-supervised learning (SSL) heavily relies on the usage of deep nonlinear models, existing theoretical works on SSL understanding still focus on linear ones. In this paper, we study the role of nonlinearity in the training dynamics of contrastive learning (CL) on one and two-layer nonlinear networks with homogeneous activation $h(x) = h'(x)x$. We have two major theoretical discoveries. First, the presence of nonlinearity can lead to many local optima even in 1-layer setting, each corresponding to certain patterns from the data distribution, while with linear activation, only one major pattern can be learned. This suggests that models with lots of parameters can be regarded as a \emph{brute-force} way to find these local optima induced by nonlinearity. Second, in the 2-layer case, linear activation is proven not capable of learning specialized weights into diverse patterns, demonstrating the importance of nonlinearity. In addition, for 2-layer setting, we also discover \emph{global modulation}: those local patterns discriminative from the perspective of global-level patterns are prioritized to learn, further characterizing the learning process. Simulation verifies our theoretical findings. ","contrastive learning, self-supervised learning, representation learning, nonlinearity, training dynamics, landscape" Oracle-oriented Robustness: Robust Image Model Evaluation with Pretrained Models as Surrogate Oracle,https://openreview.net/forum?id=xZxK8OG2igG,https://openreview.net/pdf?id=xZxK8OG2igG,"We offer a dynamic evaluation protocol that evaluates vision models' robustness across generic i.i.d benchmarks, without requirement on the prior knowledge of the underlying causal structure depicted by the images and additional human efforts.","Machine learning has demonstrated remarkable performances over finite datasets, yet whether the scores over the fixed benchmarks can sufficiently indicate the model’s performances in the real world is still in discussion. In reality, an ideal robust model will probably behave similarly to the oracle (*e.g.*, the human users), thus a good evaluation protocol is probably to evaluate the models’ behaviors in comparison to the oracle. In this paper, we introduce a new robustness measurement that directly measures the image classification model’s performance compared with a surrogate oracle. Besides, we design a simple method that can accomplish the evaluation beyond the scope of the benchmarks. Our method extends the image datasets with new samples that are sufficiently perturbed to be distinct from the ones in the original sets, but are still bounded within the same causal structure the original test image represents, constrained by a surrogate oracle model pretrained with a large amount of samples. As a result, our new method will offer us a new way to evaluate the models’ robustness performances, free of limitations of fixed benchmarks or constrained perturbations, although scoped by the power of the oracle. 
In addition to the evaluation results, we also leverage our generated data to understand the model's behaviors and our new evaluation strategies.","robustness, distribution shift, reliable machine learning" Certified Robustness on Structural Graph Matching,https://openreview.net/forum?id=bGQw0awHUjQ,https://openreview.net/pdf?id=bGQw0awHUjQ,We are the first to define certified robustness on GM and design a new certification strategy using a joint smoothing distribution to maximize the certified region.,"The vulnerability of graph matching (GM) to adversarial attacks has received increasing attention from emerging empirical studies, while the certified robustness of GM has not been explored. Motivated by randomized smoothing, we are the first to define certified robustness on GM and design a new certification strategy called Structure-based Certified Robustness of Graph Matching (SCR-GM). Structural prior information of nodes is used to construct a joint smoothing distribution matrix with physical significance, which certifies a wider range than those obtained by previous iterative optimization methods. Furthermore, we propose a certified space that can be used to derive a strictly certified radius and two radii for evaluation. Experimental results on graph matching datasets reveal that our strategy achieves state-of-the-art $\ell_{2}$ certified accuracy and regions.","Structural graph matching (GM), certified robustness, randomized smoothing, joint Gaussian distribution" CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis,https://openreview.net/forum?id=iaYcJKpY2B_,https://openreview.net/pdf?id=iaYcJKpY2B_,"We open-source a family of large language models, CodeGen, for program synthesis and propose a multi-turn program synthesis benchmark for evaluation.","Program synthesis strives to generate a computer program as a solution to a given problem specification, expressed with input-output examples or natural language descriptions. The prevalence of large language models advances the state-of-the-art for program synthesis, though limited training resources and data impede open access to such models. To democratize this, we train and release a family of large language models up to 16.1B parameters, called CODEGEN, on natural language and programming language data, and open source the training library JAXFORMER. We show the utility of the trained model by demonstrating that it is competitive with the previous state-of-the-art on zero-shot Python code generation on HumanEval. We further investigate the multi-step paradigm for program synthesis, where a single program is factorized into multiple prompts specifying subproblems. To this end, we construct an open benchmark, Multi-Turn Programming Benchmark (MTPB), consisting of 115 diverse problem sets that are factorized into multi-turn prompts. Our analysis of MTPB shows that the same intent provided to CODEGEN in a multi-turn fashion significantly improves program synthesis over the same intent provided as a single turn.
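The multi-turn setup can be approximated with any causal code LM by accumulating sub-prompts and generations into one growing context; a sketch using a publicly available CodeGen checkpoint (the checkpoint choice and prompts here are illustrative, not the benchmark's protocol):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Salesforce/codegen-350M-mono"  # illustrative checkpoint choice
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

context = ""
turns = ["# Step 1: read a CSV file into a list of rows\n",
         "# Step 2: keep only rows whose first column is numeric\n"]
for turn in turns:
    context += turn  # each sub-prompt extends the shared context
    ids = tok(context, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=64,
                         pad_token_id=tok.eos_token_id)
    context = tok.decode(out[0], skip_special_tokens=True)
print(context)
```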
We make the training library JAXFORMER and the model checkpoints available as an open-source contribution: http://repo.codegen-iclr.org.","Program synthesis, multi-turn generation, code generation, large language models, generative models" Bayesian Optimal Experimental Design for the Survey Bandit Setting,https://openreview.net/forum?id=Db_WALIfbdC,https://openreview.net/pdf?id=Db_WALIfbdC,We develop a Bayesian optimal experimental design procedure for the survey bandit setting.,"The contextual bandit is a classic problem in sequential decision making under uncertainty that finds broad application to tasks in precision medicine, personalized education, and drug discovery. Here, a decision maker repeatedly receives a context, takes an action, and then observes an associated outcome, with the goal of choosing actions that minimize regret. However, in many settings, the context is not given, and the decision maker must instead collect some information to infer a context before proceeding. For example, when a doctor does not have prior information about a patient, they might ask a sequence of questions before recommending a medical treatment. In this paper, we aim to develop methods for this setting—which we refer to as the \emph{survey bandit}—where the decision maker is not given access to the context but can ask a finite sequence of questions to gain information about the context before taking an action and observing an outcome. Using insights from Bayesian optimal experimental design (BOED) and decision-theoretic information theory, we view the interaction with each user as a BOED task, where the goal is to ask a sequence of questions that elicit the most information about the optimal action for this user. Our procedure is agnostic to the choice of probabilistic model, and we demonstrate its usefulness on a few common classes of distributions. Our algorithm achieves significantly better performance on both synthetic and real data relative to existing baseline methods while remaining statistically efficient, interpretable, and computationally friendly.","Bayesian optimal experimental design, contextual bandit, survey" Deep Contrastive Learning Approximates Ensembles of One-Class SVMs with Neural Tangent Kernels,https://openreview.net/forum?id=WePih5bXsNB,https://openreview.net/pdf?id=WePih5bXsNB,,"To demystify (self-supervised) contrastive learning in representation learning, in this paper we show that a model learned by deep contrastive learning with a family of loss functions such as InfoNCE essentially approximates an ensemble of one-class support vector machines (SVMs) with neural tangent kernels (NTKs). This result comes from the fact that each gradient used for a network weight update in contrastive learning can be interpreted, approximately, as the primal solution of a one-class SVM with contrastive gradients as input. From the dual perspective, the Lagrange multipliers provide unique insights into the importance of the anchor-positive-negative triplet samples. In this way, we further propose a novel sequential convex programming (SCP) algorithm for contrastive learning, where each sub-problem is a one-class SVM.
Empirically, we demonstrate that our approach learns better gradients than conventional contrastive learning approaches, which significantly improves performance.","contrastive learning, one-class SVM, neural tangent kernel, sequential convex programming" Synchronized Contrastive Pruning for Efficient Self-Supervised Learning,https://openreview.net/forum?id=aCdREQkEMGk,https://openreview.net/pdf?id=aCdREQkEMGk,Novel sparse training algorithm for self-supervised learning,"Various self-supervised learning (SSL) methods have demonstrated strong capability in learning visual representations from unlabeled data. However, the current state-of-the-art (SoTA) SSL methods largely rely on heavy encoders to achieve performance comparable to their supervised learning counterparts. Despite the well-learned visual representations, the large encoders impede energy-efficient computation, especially for resource-constrained edge computing. Prior works have utilized the sparsity-induced asymmetry to enhance the contrastive learning of dense models, but the general relationship between asymmetric sparsity and self-supervised learning has not been fully explored. Furthermore, transferring supervised sparse learning to SSL is also largely under-explored. To address the research gap in prior works, this paper investigates the correlation between in-training sparsity and SSL. In particular, we propose a novel sparse SSL algorithm, embracing the benefits of contrastiveness while exploiting high sparsity during SSL training. The proposed framework is evaluated comprehensively with various granularities of sparsity, including element-wise sparsity, GPU-friendly N:M structured fine-grained sparsity, and hardware-specific structured sparsity. Extensive experiments across multiple datasets are performed, where the proposed method shows superior performance against the SoTA sparse learning algorithms with various SSL frameworks. Furthermore, the training speedup aided by the proposed method is evaluated with an actual DNN training accelerator model. ","Sparse Training, Self-supervised Learning, Neural Network Pruning" VEHICLE-INFRASTRUCTURE COOPERATIVE 3D DETECTION VIA FEATURE FLOW PREDICTION,https://openreview.net/forum?id=ZLfD0cowleE,https://openreview.net/pdf?id=ZLfD0cowleE,,"Effectively utilizing data from infrastructure could greatly improve autonomous driving safety. Vehicle-Infrastructure Cooperative 3D Object Detection (VIC3D) is an important task to localize and recognize objects surrounding the ego-vehicle by combining the sensor data from both the ego-vehicle and roadside infrastructure. However, there are serious temporal asynchrony problems between vehicle and infrastructure data. To the best of our knowledge, no existing work in the literature effectively solves the asynchrony problem with limited communication bandwidth and computational resources on vehicle-infrastructure devices. This work proposes a novel approach for VIC3D, called Feature Flow Network (FFNet), to effectively address the problem of temporal asynchrony caused by different sensor initialization and latency. Compared with previous feature fusion approaches that only use the current static feature, FFNet transmits the feature flow and generates the future features on the fly, aligned with the ego-vehicle timestamp. We propose a self-supervised method to train the feature flow generation model, and use the pre-trained infrastructure model to extract features from randomly assigned future frames as ground truth.
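In its simplest reading, transmitting a feature flow lets the vehicle side extrapolate the infrastructure feature map to its own timestamp; a toy first-order version of the alignment and its self-supervised objective (the shapes and the linear-motion assumption are ours, not the paper's):

```python
import torch.nn.functional as F

def predict_future_feature(feat_t, flow_t, dt):
    """Toy first-order extrapolation of an infrastructure feature map
    by dt seconds, assuming locally linear feature motion."""
    return feat_t + dt * flow_t

def flow_training_loss(feat_t, flow_t, feat_future, dt):
    # feat_future: features that a frozen infrastructure model extracted
    # from a randomly chosen future frame, used as the self-supervised target.
    return F.mse_loss(predict_future_feature(feat_t, flow_t, dt), feat_future)
```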
Extensive experiments on the DAIR-V2X dataset (a large-scale real-world V2X dataset) show that FFNet establishes a new state of the art, surpassing SOTA methods by up to 5% mAP with comparable transmission cost. In particular, FFNet can make up for almost all of the performance drop caused by temporal asynchrony under a 200 ms delay.", M-L2O: Towards Generalizable Learning-to-Optimize by Test-Time Fast Self-Adaptation,https://openreview.net/forum?id=s7oOe6cNRT8,https://openreview.net/pdf?id=s7oOe6cNRT8,," Learning to Optimize (L2O) has drawn increasing attention as it often remarkably accelerates the optimization procedure of complex tasks by ``overfitting"" a specific task type, leading to enhanced performance compared to analytical optimizers. Generally, L2O develops a parameterized optimization method (i.e., ``optimizer"") by learning from solving sample problems. This data-driven procedure yields L2O that can efficiently solve problems similar to those seen in training, that is, drawn from the same ``task distribution"". However, such learned optimizers often struggle when new test problems deviate substantially from the training task distribution. This paper investigates a potential solution to this open challenge, by meta-training an L2O optimizer that can perform fast test-time self-adaptation to an out-of-distribution task, in only a few steps. We theoretically characterize the generalization of L2O, and further show that our proposed framework (termed M-L2O) provably facilitates rapid task adaptation by locating well-adapted initial points for the optimizer weights. Empirical observations on several classic tasks, such as LASSO and quadratic problems, demonstrate that M-L2O converges significantly faster than vanilla L2O with only $5$ steps of adaptation, echoing our theoretical results. All code will be shared upon acceptance.","L2O, Meta Learning, Generalization" ReG-NAS: Graph Neural Network Architecture Search using Regression Proxy Task,https://openreview.net/forum?id=t7HIN3fUAUu,https://openreview.net/pdf?id=t7HIN3fUAUu,,"Neural Architecture Search (NAS) has been a focus of extensive research in recent years, yielding innovative achievements for architectures such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). However, research on NAS for graph neural networks (GNNs) is still in a preliminary stage. Because of the special structure of graph data, some conclusions drawn from CNNs cannot be directly applied to GNNs. At the same time, for NAS, models' ranking stability is of great importance, as it reflects the reliability of NAS performance. Unfortunately, little research attention has been paid to it, making it a pitfall in the development of NAS research. In this paper, we propose a novel NAS pipeline, ReG-NAS, which balances stability, reliability and time cost to search for the best GNN architecture. Moreover, for the first time, we systematically analyze the factors that affect models' ranking stability in a given search space, which can serve as a guideline for subsequent studies.
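Ranking stability of this kind is commonly quantified with Kendall's tau between the architecture rankings produced by two runs; a minimal check (the scores below are made up for illustration):

```python
from scipy.stats import kendalltau

# Validation scores of the same six architectures under two search runs.
run_a = [0.71, 0.69, 0.75, 0.62, 0.68, 0.73]
run_b = [0.70, 0.66, 0.74, 0.65, 0.69, 0.72]

tau, _ = kendalltau(run_a, run_b)
print(f"Kendall tau = {tau:.3f}  (1.0 means the two rankings agree exactly)")
```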
Our code is available at https://anonymous.4open.science/r/ReG-NAS-4D21","Neural Architecture Search, GNN, Machine Learning" Mesh-Independent Operator Learning for PDEs using Set Representations,https://openreview.net/forum?id=7d-d0BFz6Hf,https://openreview.net/pdf?id=7d-d0BFz6Hf,"We propose an attention-based operator learning model for obtaining the continuous solution of PDEs, independent of the discretization formats.","Operator learning, learning the mapping between infinite-dimensional function spaces, has attracted attention as an alternative to traditional numerical methods for solving partial differential equations (PDEs). In practice, the functions of the physical systems are often observed through sparse or even irregularly distributed measurements; thus the functions are discretized and usually represented by finite structured arrays, which are given as input-output pairs. After training on such arrays, the solution produced by the trained models should be independent of the discretization of the input function and should be continuously queryable at any point. Therefore, architectures for operator learning should be flexibly compatible with arbitrary sizes and locations of the measurements; otherwise, scalability is restricted when the observations differ in measurement format. In this paper, we propose to treat the discretized functions as set-valued data and construct an attention-based model, called the mesh-independent operator learner (MIOL), to provide proper treatment of input functions and query coordinates for the solution functions by detaching the dependencies on input and output meshes. Our models, pre-trained on operator learning benchmark datasets, are evaluated on downstream tasks to demonstrate their ability to generalize to varying discretization formats of the system, a natural characteristic of the continuous solutions of PDEs. ","partial differential equations, operator learning, set representations, attention-based model, implicit neural representation" ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning,https://openreview.net/forum?id=xYlJRpzZtsY,https://openreview.net/pdf?id=xYlJRpzZtsY,We propose a new taxonomy for reasoning errors and a suite of metrics to score step-by-step reasoning in language models.,"Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers. These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness (independent of the final answer) is difficult without reliable methods for automatic evaluation. We simply do not know how often the stated reasoning steps actually support the final end task predictions. In this work, we present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics. To evaluate ROSCOE against baseline metrics, we design a typology of reasoning errors and collect synthetic and human evaluation scores on commonly used reasoning datasets. In contrast with existing metrics, ROSCOE can measure semantic consistency, logicality, informativeness, fluency, and factuality — among other traits — by leveraging properties of step-by-step rationales.
We empirically verify the strength of our metrics on five human-annotated and six programmatically perturbed diagnostic datasets covering a diverse set of tasks that require reasoning skills, and show that ROSCOE consistently outperforms baseline metrics.","step-by-step reasoning, evaluation" Robust Multi-Agent Reinforcement Learning against Adversaries on Observation,https://openreview.net/forum?id=eExA3Mk0Dxp,https://openreview.net/pdf?id=eExA3Mk0Dxp,We propose a training framework that progressively generates adversarial attacks on agents' observations to help agents learn a robust cooperative policy. ,"With the broad application of deep learning to tasks such as image classification, it is becoming increasingly essential to tackle the vulnerability of neural networks to adversarial attacks, which have been widely studied recently. In the cooperative multi-agent reinforcement learning field, which has also shown potential in real-life domains, little work focuses on the problem of adversarial attacks. However, adversarial attacks on observations that can undermine the coordination among agents are likely to occur in actual deployment. This paper proposes a training framework that progressively generates adversarial attacks on agents' observations to help agents learn a robust cooperative policy. The attacker makes decisions in a hybrid action space: it first chooses an agent to attack and then outputs the perturbation vector. The victim policy is then trained against the attackers. Experimental results show that our generated adversarial attacks are diverse enough to improve the agents' robustness against possible disturbances. ","multi-agent reinforcement learning, robust reinforcement learning, cooperative multi-agent systems, adversarial training" Limitations of Piecewise Linearity for Efficient Robustness Certification,https://openreview.net/forum?id=BaWtp9o25zN,https://openreview.net/pdf?id=BaWtp9o25zN,"We show that piecewise linearity imposes fundamental limitations for efficient robustness certification, e.g., Lipschitz-based certification; this imposes additional capacity requirements on networks that must be certified by such techniques.","Certified defenses against small-norm adversarial examples have received growing attention in recent years; however, the certified accuracies of state-of-the-art methods remain far below their non-robust counterparts, despite the fact that benchmark datasets have been shown to be well-separated at far larger radii than the literature generally attempts to certify. In this work, we offer insights that identify potential factors in this performance gap. Specifically, our analysis reveals that piecewise linearity imposes fundamental limitations on the tightness of leading certification techniques. These limitations are felt in practical terms as a greater need for capacity in models that are to be certified efficiently. Moreover, this is _in addition_ to the capacity necessary to learn a robust boundary, studied in prior work.
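For orientation, the Lipschitz-based certificates at issue typically lower-bound the distance to the decision boundary by the prediction margin divided by a Lipschitz constant; a sketch assuming the constant is already known (computing a tight constant is the hard part):

```python
import torch

def lipschitz_certified_radius(logits, lip_const):
    """Standard margin/Lipschitz certificate: if the logit map (a
    torch.Tensor-valued function) is lip_const-Lipschitz in l2, no
    perturbation smaller than margin / (sqrt(2) * lip_const) can
    change the predicted class."""
    top2 = logits.topk(2, dim=-1).values
    margin = top2[..., 0] - top2[..., 1]
    return margin / (2.0 ** 0.5 * lip_const)
```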
However, we argue that addressing the limitations of piecewise linearity through scaling up model capacity may give rise to potential difficulties---particularly regarding robust generalization---and we therefore conclude by suggesting that developing _smooth_ activation functions may be the way forward for advancing the performance of certified neural networks.","robustness, certification, Lipschitz, limitations, adversarial examples" Forces are not Enough: Benchmark and Critical Evaluation for Machine Learning Force Fields with Molecular Simulations,https://openreview.net/forum?id=_V-nKeWvs7p,https://openreview.net/pdf?id=_V-nKeWvs7p,We benchmark machine learning force fields with novel datasets and metrics based on molecular simulations and reveal insights into how they should be evaluated and further improved.,"Molecular dynamics (MD) simulation techniques are widely used for various natural science applications. Increasingly, machine learning (ML) force field (FF) models are beginning to replace ab-initio simulations by predicting forces directly from atomic structures. Despite significant progress in this area, such techniques are primarily benchmarked by their force/energy prediction errors, even though the practical use case would be to produce realistic MD trajectories. We aim to fill this gap by introducing a novel benchmark suite for ML MD simulation. We curate representative MD systems, including water, organic molecules, peptides, and materials, and design evaluation metrics corresponding to the scientific objectives of the respective systems. We benchmark a collection of state-of-the-art (SOTA) ML FF models and illustrate, in particular, how the commonly benchmarked force accuracy is not well aligned with relevant simulation metrics. We demonstrate when and how selected SOTA methods fail, along with offering directions for further improvement. Specifically, we identify stability as a key metric for ML models to improve. Our benchmark suite comes with a comprehensive open source codebase for training and simulation with ML FFs to facilitate further work.","molecular dynamics, machine learning force field, benchmark, simulation-based evaluation, ML for Sciences" Hypothetical Training for Robust Machine Reading Comprehension of Tabular Context,https://openreview.net/forum?id=7GQfA9xAqxN,https://openreview.net/pdf?id=7GQfA9xAqxN,,"Machine Reading Comprehension (MRC) models easily learn spurious correlations from complex contexts such as tabular data. Counterfactual training—using the original and augmented data—has become a promising solution. However, it is costly to construct faithful counterfactual examples because it is tricky to maintain the consistency and dependency of the table entries. In this paper, we take a more economical approach and ask hypothetical questions, e.g., “in which year would the net profit be larger if the revenue in 2019 were $38,298?”, whose effects on the answers are equivalent to those of expensive counterfactual tables. We propose a hypothetical training framework that uses pairs of examples with different hypothetical questions to supervise the direction of the model gradient w.r.t. the input towards the answer change. We conduct experiments on MRC datasets with factual and hypothetical examples.
Performance gains on a newly constructed stress test validate the effectiveness and rationality of our approach.", FlexRound: Learnable Rounding by Element-wise Division for Post-Training Quantization,https://openreview.net/forum?id=-tYCaP0phY_,https://openreview.net/pdf?id=-tYCaP0phY_,,"Post-training Quantization (PTQ) has been gaining popularity for the deployment of deep neural networks on resource-limited devices since, unlike quantization-aware training, neither a full training dataset nor end-to-end training is required at all. As PTQ schemes based on reconstructing each layer or block output turn out to be effective in enhancing quantized model performance, recent works have developed algorithms to devise and learn a new weight-rounding scheme so as to better reconstruct each layer or block output. We notice, however, that such new rounding schemes are built on element-wise addition. In this work, we propose a simple yet effective new rounding mechanism for PTQ, coined FlexRound, based on element-wise division, which learns not only a common quantization grid size but also a different scale for each pre-trained weight. Thanks to the reciprocal rule of derivatives induced by element-wise division, FlexRound is inherently able to exploit the importance of a pre-trained weight when updating its corresponding scale, and thus to flexibly quantize a pre-trained weight depending on its own importance. We empirically validate the efficacy of FlexRound on a wide range of models and tasks. To the best of our knowledge, our work is the first to carry out comprehensive experiments on image classification, natural language understanding, and natural language generation in the per-tensor uniform PTQ setting. Our code will be open-sourced soon.","Efficient Inference, Quantization, Post-Training Quantization" Re-calibrating Feature Attributions for Model Interpretation,https://openreview.net/forum?id=WUWJIV2Yxtp,https://openreview.net/pdf?id=WUWJIV2Yxtp,We propose a re-calibration technique to calibrate existing integral-based attribution methods with valid references for a consistent explanation.,"The ability to interpret machine learning models is critical for high-stakes applications. Due to its desirable theoretical properties, path integration is a widely used scheme for feature attribution to interpret model predictions. However, the methods implementing this scheme currently rely on absolute attribution scores to eventually provide sensible interpretations. This not only contradicts the premise that the features with larger attribution scores are more relevant to the model prediction, but also conflicts with the theoretical settings for which the desirable properties of the attributions are proven. We address this by devising a method to first compute an appropriate reference for the path integration scheme. This reference further helps in identifying valid interpolation points on a desired integration path. The reference is computed in a gradient-ascending direction on the model's loss surface, while the interpolations are performed by analyzing the model gradients and variations between the reference and the input. The eventual integration is effectively performed along a non-linear path. Our scheme can be incorporated into the existing integral-based attribution methods. We also devise an effective sampling and integration procedure that enables employing our scheme with multi-reference path integration efficiently.
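For reference, the path-integration scheme these methods build on is integrated gradients; the textbook version below uses a plain zero reference and a straight path, whereas the method described above derives its reference on the loss surface and integrates along a non-linear path:

```python
import torch

def integrated_gradients(model, x, target, reference=None, steps=32):
    """Textbook integrated gradients along a straight path from a
    reference to x (a baseline sketch, not the re-calibrated scheme)."""
    if reference is None:
        reference = torch.zeros_like(x)
    total = torch.zeros_like(x)
    for k in range(1, steps + 1):
        # Right-endpoint Riemann sum over the straight-line path.
        xk = (reference + k / steps * (x - reference)).detach().requires_grad_(True)
        model(xk)[..., target].sum().backward()
        total += xk.grad
    return (x - reference) * total / steps
```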
We achieve a marked performance boost for a range of integral-based attribution methods on both local and global evaluation metrics by enhancing them with our scheme. Our extensive results also show improved sensitivity, sanity preservation, and model robustness with the proposed re-calibration of the attribution techniques.","Feature Attribution, Explainable Artificial Intelligence" Adversarial Diversity in Hanabi,https://openreview.net/forum?id=uLE3WF3-H_5,https://openreview.net/pdf?id=uLE3WF3-H_5,We produce meaningfully diverse and reasonable joint policies using off-belief learning and adversarial reward shaping.,"Many Dec-POMDPs admit a qualitatively diverse set of ``reasonable'' joint policies. The diversity literature is concerned with generating these joint policies. Unfortunately, existing methods fail to produce teams of agents that are simultaneously diverse, high performing, and ``reasonable''. In this work, we propose a novel approach to diverse policy generation for turn-based Dec-POMDPs with public actions, which relies on off-belief learning to encourage reasonableness and skill, and on ``repulsive'' fictitious transitions to encourage diversity. We use this approach to generate new agents with distinct but ``reasonable'' play styles for the card game Hanabi, as indicated by their non-sabotaging behaviour and the graceful degradation of their performance with ad-hoc partners. We open-source our agents so that they may be used as starting points for a test bed for future research on (ad-hoc) coordination.","coordination, diversity, multi-agent reinforcement learning" 3D UX-Net: A Large Kernel Volumetric ConvNet Modernizing Hierarchical Transformer for Medical Image Segmentation,https://openreview.net/forum?id=wsZsjOSytRA,https://openreview.net/pdf?id=wsZsjOSytRA,We propose a lightweight network 3D UX-Net that simulates hierarchical transformer behavior with large kernel depthwise convolution and introduce pointwise depthwise scaling to scale features with fewer model parameters for volumetric segmentation.,"Vision transformers (ViTs) have quickly superseded convolutional networks (ConvNets) as the current state-of-the-art (SOTA) models for medical image segmentation. Hierarchical transformers (e.g., Swin Transformers) reintroduced several ConvNet priors and further enhanced the practical viability of adapting volumetric segmentation to 3D medical datasets. The effectiveness of hybrid approaches is largely credited to the large receptive field for non-local self-attention and the large number of model parameters. We hypothesize that volumetric ConvNets can simulate the large receptive field behavior of these learning approaches with fewer model parameters using depth-wise convolution. In this work, we propose a lightweight volumetric ConvNet, termed 3D UX-Net, which adapts the hierarchical transformer using ConvNet modules for robust volumetric segmentation. Specifically, we revisit volumetric depth-wise convolutions with large kernel sizes (e.g. starting from $7\times7\times7$) to enable larger global receptive fields, inspired by the Swin Transformer. We further substitute the multi-layer perceptron (MLP) in Swin Transformer blocks with pointwise depth convolutions and enhance model performance with fewer normalization and activation layers, thus reducing the number of model parameters. 3D UX-Net competes favorably with current SOTA transformers (e.g.
SwinUNETR) using three challenging public datasets on volumetric brain and abdominal imaging: 1) MICCAI Challenge 2021 FLARE, 2) MICCAI Challenge 2021 FeTA, and 3) MICCAI Challenge 2022 AMOS. 3D UX-Net consistently outperforms SwinUNETR, with improvements from 0.929 to 0.938 Dice (FLARE2021) and from 0.867 to 0.874 Dice (FeTA2021). We further evaluate the transfer learning capability of 3D UX-Net on AMOS2022 and demonstrate another improvement of $2.27\%$ Dice (from 0.880 to 0.900).","Depth-wise Convolution, Large Kernel Convolution, Convolutional Neural Network, Hierarchical Transformer, Volumetric Segmentation, Medical Image Segmentation" Multi-Reward Fusion: Learning from Other Policies by Distilling ,https://openreview.net/forum?id=1_ZHr9Ha_ZJ,https://openreview.net/pdf?id=1_ZHr9Ha_ZJ,Multi-Reward Fusion: Learn from other policies by distilling ,"Designing rewards is crucial for applying reinforcement learning in practice. However, it is difficult to design a shaping reward which can accelerate agents' learning process without biasing the original task's optimization objective. Moreover, the low-dimensional representation of the reward and value function (i.e., a scalar value) may also be an obstacle during the learning process. This paper contributes towards tackling these challenges by proposing a new method, called Multi-Reward Fusion (MRF). MRF takes as input a list of human-designed rewards, which contains information about the task from multiple perspectives, and learns a separate policy for each component of the reward list. We formulate the problem of learning the target policy as a distillation task, propose a novel method that selectively distills knowledge from the auxiliary policies, and theoretically show the feasibility of this method. We conduct extensive experiments and show that the MRF method performs better than state-of-the-art reward shaping methods.","Energy-Based, Policy Distilling, Reinforcement Learning, Auto Reward Shaping" Push and Pull: Competing Feature-Prototype Interactions Improve Semi-supervised Semantic Segmentation,https://openreview.net/forum?id=39cMBLyo_ia,https://openreview.net/pdf?id=39cMBLyo_ia,,"This paper challenges semi-supervised segmentation with a rethinking of the feature-prototype interaction in the classification head. Specifically, we view each weight vector in the classification head as the prototype of a semantic category. The basic practice in the softmax classifier is to pull a feature towards its positive prototype (i.e., the prototype of its class), as well as to push it away from its negative prototypes. In this paper, we focus on the interaction between the feature and its negative prototypes, which is always “pushing” to make them dissimilar. While the pushing-away interaction is necessary, this paper reveals a new mechanism whereby the contrary interaction of pulling negative prototypes close is also beneficial. We have two insights into this counter-intuitive interaction: 1) some pseudo negative prototypes might actually be positive, so the pulling interaction can help resist pseudo-label noise, and 2) some true negative prototypes might contain contextual information that is beneficial. Therefore, we integrate these two competing interactions into a Push-and-Pull Learning (PPL) method. On the one hand, PPL introduces the novel pulling-close interaction between features and negative prototypes with a feature-to-prototype attention.
On the other hand, PPL reinforces the original pushing-away interaction with multi-prototype contrastive learning. Although PPL is very simple, experiments show that it substantially improves semi-supervised segmentation and sets a new state of the art.","Semi-supervised, Segmentation, Competing Interactions, Classifier Prototype" MaskNeRF: Masked Neural Radiance Fields for Sparse View Synthesis,https://openreview.net/forum?id=jpWa2RnZpIK,https://openreview.net/pdf?id=jpWa2RnZpIK,,"Although Neural Radiance Fields (NeRF) has achieved impressive 3D reconstruction with dense view images, its performance degrades significantly when the training views are sparse. We observe that under the sparse view setting, it is important to learn the correspondence of pixels among different views, i.e., the 3D consistency, to improve the reconstruction quality. To achieve this, we first propose the Hard-Mask, which utilizes depth information to locate pixels with correspondence relationships and then assigns higher loss weights to these pixels. The key idea is to achieve pixel-wise differentiated optimization of NeRF based on the 3D consistency among target views and source views instead of treating each pixel equally. This optimization strategy helps NeRF-based algorithms learn fine-grained object details with limited data. To deal with the absence of accurate depth information, the Soft-Mask is proposed to estimate the correspondence relationship based on the trend of training losses. Our proposed method can serve as a plug-in component for existing NeRF-based view-synthesis models. Extensive experiments on recent representative works, including NeRF, IBRNet and MVSNeRF, show that our method can significantly improve model performance under sparse view conditions (e.g., up to 70\% improvement in PSNR on the DTU dataset). ", Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small,https://openreview.net/forum?id=NpsVSN6o4ul,https://openreview.net/pdf?id=NpsVSN6o4ul,We find a large circuit for a natural language task in GPT-2 small and provide quantitative evaluation of our human-understandable explanation.,"Research in mechanistic interpretability seeks to explain behaviors of ML models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models, or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task that requires logical reasoning: indirect object identification (IOI). Our explanation encompasses 28 attention heads grouped into 7 main classes, which we discovered using a combination of interpretability approaches including causal interventions and projections. To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior ""in the wild"" in a language model. We evaluate the reliability of our explanation using three quantitative criteria: faithfulness, completeness, and minimality. Though these criteria support our explanation, they also point to remaining gaps in our understanding.
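One of the causal interventions referenced here is often called activation patching: run the model on a corrupted prompt while splicing in a component's activation cached from a clean prompt. A generic PyTorch-hook sketch (it assumes the chosen module returns a single tensor of matching shape, which is our simplification):

```python
import torch

def patch_activation(model, module, clean_inputs, corrupt_inputs):
    """Cache one module's activation on clean inputs, then replay it
    during a forward pass on corrupted inputs to measure that
    component's causal contribution to the output."""
    cache = {}
    handle = module.register_forward_hook(lambda m, i, o: cache.update(act=o))
    with torch.no_grad():
        model(clean_inputs)
    handle.remove()

    # A forward hook that returns a value replaces the module's output.
    handle = module.register_forward_hook(lambda m, i, o: cache["act"])
    with torch.no_grad():
        patched_output = model(corrupt_inputs)  # runs with the clean activation
    handle.remove()
    return patched_output
```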
Our work provides evidence that a mechanistic understanding of large ML models is feasible, opening opportunities to scale our understanding to both larger models and more complex tasks.","Mechanistic Interpretability, Transformers, Language Models, Interpretability, Transparency, Science of ML" Equivariant Descriptor Fields: SE(3)-Equivariant Energy-Based Models for End-to-End Visual Robotic Manipulation Learning,https://openreview.net/forum?id=dnjZSPGmY5O,https://openreview.net/pdf?id=dnjZSPGmY5O,We present SE(3)-equivariant energy-based models that can learn robotic manipulation tasks end-to-end from only a few demonstrations without any prior knowledge.,"End-to-end learning for visual robotic manipulation is known to suffer from sample inefficiency, requiring large numbers of demonstrations. Spatial roto-translation equivariance, or SE(3)-equivariance, can be exploited to improve the sample efficiency of learning robotic manipulation. In this paper, we present SE(3)-equivariant models for visual robotic manipulation from point clouds that can be trained fully end-to-end. By utilizing the representation theory of the Lie group, we construct novel SE(3)-equivariant energy-based models that allow highly sample-efficient end-to-end learning. We show that our models can learn from scratch without prior knowledge and yet are highly sample efficient (5~10 demonstrations are enough). Furthermore, we show that our models can generalize to tasks with (i) previously unseen target object poses, (ii) previously unseen target object instances of the category, and (iii) previously unseen visual distractors. We experiment with 6-DoF robotic manipulation tasks to validate our models' sample efficiency and generalizability.","Robotics, Manipulation, Robotic manipulation, Equivariance, SE(3), SO(3), Energy-based model, Lie group, Representation theory, Equivariant robotics, Roto-translation equivariance, End-to-end, Point clouds, Graph neural networks, Imitation learning, Learning from demonstration, Sample efficient, Few shot, Unseen object, Category-level manipulation, MCMC, Langevin dynamics" Anatomical Structure-Aware Image Difference Graph Learning for Difference-Aware Medical Visual Question Answering,https://openreview.net/forum?id=uH_RlkvQMUs,https://openreview.net/pdf?id=uH_RlkvQMUs,Large-scale image-difference medical VQA dataset and expert knowledge-aware graph representation learning ,"To contribute to automating medical vision-language analysis, we propose a novel Chest-Xray Difference Visual Question Answering (VQA) task. Given a pair of main and reference images, this task attempts to answer several questions on the diseases and, more importantly, the differences between the images. This is consistent with radiologists' diagnostic practice of comparing the current image with the reference before concluding the report. For this task, we propose a new dataset, namely MIMIC-Diff-VQA, including 700,821 QA pairs on 109,872 pairs of images. Meanwhile, we also propose a novel expert knowledge-aware graph representation learning model to address this problem. We leverage expert knowledge, such as anatomical structure priors and semantic and spatial knowledge, to construct a multi-relationship graph representing the image differences between the two images for the image-difference VQA task. Our dataset and code will be released upon publication.
We believe this work will further advance medical vision-language models.","Chest Xray, Difference Image VQA, medical dataset, Graph Neural Networks" Explaining Temporal Graph Models through an Explorer-Navigator Framework,https://openreview.net/forum?id=BR_ZhvcYbGJ,https://openreview.net/pdf?id=BR_ZhvcYbGJ,An MCTS-based explainer for temporal graph models.,"While GNN explanation has recently received significant attention, existing works are consistently designed for static graphs. Due to the prevalence of temporal graphs, many temporal graph models have been proposed, but explaining their predictions remains to be explored. To bridge the gap, in this paper, we propose T-GNNExplainer for temporal graph model explanation. Specifically, we regard a temporal graph as constituted by a sequence of temporal events. Given a target event, our task is to find a subset of previously occurring events that lead to the model's prediction for it. To handle this combinatorial optimization problem, T-GNNExplainer includes an explorer to find the event subsets with Monte Carlo Tree Search (MCTS) and a navigator that learns the correlations between events and helps reduce the search space. In particular, the navigator is trained in advance and then integrated with the explorer to speed up searching and achieve better results. To the best of our knowledge, T-GNNExplainer is the first explainer tailored for temporal graph models. We conduct extensive experiments to evaluate the performance of T-GNNExplainer. Experimental results on both real-world and synthetic datasets demonstrate that T-GNNExplainer can achieve superior performance, with up to about 50% improvement in the Area under the Fidelity-Sparsity Curve. ","graph neural networks, gnn explainers, temporal graphs" Tackling Diverse Tasks via Cross-Modal Transfer Learning,https://openreview.net/forum?id=17mPeO4rqGj,https://openreview.net/pdf?id=17mPeO4rqGj,We study how to effectively transfer pretrained models to problems outside the pretraining modalities.,"Fine-tuning large-scale pretrained models has led to remarkable progress in well-studied modalities such as vision and NLP. However, similar gains have not been observed in many other tasks due to an assumed lack of relevant pretrained models for these diverse modalities. In this work, we revisit this assumption by studying the cross-modal transfer ability of large-scale pretrained models. We introduce ORCA, a general cross-modal fine-tuning workflow that enables fast and automatic exploitation of existing pretrained models for diverse tasks. ORCA achieves task-specific adaptation by learning feature embeddings that minimize an optimal transport distance metric to map the data distribution in the end-task modality to the pretraining modality. We test ORCA on 13 tasks with varying modalities and input-output types. ORCA performs the best on 10 of them and is in the top three on the others. We further quantify the importance of embedding distance for downstream performance, highlight ORCA’s utility for data-limited tasks, and demonstrate its compatibility with same-modality transfer.","Cross-modal transfer learning, pretrained models, fine-tuning" Leveraged Asymmetric Loss with Disambiguation for Multi-label Recognition with One-Positive Annotations,https://openreview.net/forum?id=JFf-bPQu5RB,https://openreview.net/pdf?id=JFf-bPQu5RB,,"In the problem of multi-label learning from single positive labels (SPL), we learn the multiple potential labels from one observed positive annotation.
Despite many efforts to solve this problem, an effective algorithm with a sound theoretical understanding is still needed. In this paper, we propose a novel loss function for the SPL problem, called leveraged asymmetric loss with disambiguation (LASD), where we introduce a pair of leverage parameters to address the severe negative-positive imbalance. On the theoretical side, we analyze the SPL problem, for the first time, from the perspective of risk consistency, which links the SPL loss with losses for ordinary multi-label classification. We prove the consistency of our proposed LASD loss to the cost-sensitive Hamming loss, which provides guidance for the empirical choice of our proposed leverage parameters. In experiments, we demonstrate the effectiveness of our proposed LASD loss function over other state-of-the-art methods and empirically verify our theoretical results.", Self-supervised Learning for Cell Segmentation and Quantification in Digital Pathology Images,https://openreview.net/forum?id=ToNvGM_jXlA,https://openreview.net/pdf?id=ToNvGM_jXlA,Identifying cell bodies in brain tissue digital pathology images,"Parkinson’s Disease (PD) is the second most common neurodegenerative disease in humans, impacting 2-3% of people over the age of 65. PD is characterized by the gradual loss of dopaminergic neurons in the Substantia Nigra (a part of the midbrain). At present, the number of dopaminergic neurons in the Substantia Nigra is one of the most important indices for evaluating drug efficacy in PD animal models. Currently, the analysis and quantification of dopaminergic neurons is conducted manually by expert biologists through careful examination of digital pathology images. However, this approach is laborious, time-consuming, and highly subjective, which significantly delays progress in PD research and drug development. As such, a reliable and unbiased automated system is in high demand for the quantification of neurons in digital pathology images. To this end, in this paper, we propose an end-to-end deep learning framework for the segmentation and quantification of dopaminergic neurons in PD animal models. Our framework relies on self-supervised learning advances to handle the limited amount of data for training deep models. Extensive experiments demonstrate the effectiveness of the developed method in quantifying neurons with a small amount of labeled data. As a result, the proposed methodology can lead to reliable data support for PD research and drug discovery by accelerating digital pathology analysis.","Medical Segmentation, Self-supervised Learning, contrastive learning" Mitigating Demographic Bias of Federated Learning Models via Global Domain Smoothing,https://openreview.net/forum?id=uUU05MP7Pv7,https://openreview.net/pdf?id=uUU05MP7Pv7,,"Federated learning (FL) has shown impressive performance in training modern machine learning models from distributed data sources. However, the distributed training process of FL can suffer from a non-trivial bias issue, where the trained models are affected by the imbalanced distribution of the training data on local clients, which eventually leads to a severe bias in the aggregated global model. In this paper, we propose a novel fairness-aware FL training framework, Worst-Fair Domain Smoothing (WFDS), to address the bias issue of FL models from a domain-shifting perspective. Our framework consists of two concurrent components: 1) local worst-fair training, and 2) reference domain smoothing.
The first module is designed to train fair local models and, by performing worst-fair training, enforces the robustness of local fairness against domain shifts from the local distribution to the global distribution. The second module simulates a reference data domain of the studied FL application for all clients, and implicitly reduces the domain discrepancy of training data among different clients. With reduced domain discrepancy, the fairness of each local model will be learned from similar training distributions, even though the models reside on different clients. As such, improved global fairness can be achieved after aggregating the local models into the global model. Evaluation results on multiple real-world datasets show that WFDS can achieve significant performance gains in demographic fairness compared to state-of-the-art baselines.","Federated Learning, Demographic Fairness" Safe Reinforcement Learning with Contrastive Risk Prediction,https://openreview.net/forum?id=4OS-U1a5kB-,https://openreview.net/pdf?id=4OS-U1a5kB-,We propose a contrastive risk prediction method to train safe RL agents with risk preventive trajectory exploration and reward shaping.,"As safety violations can lead to severe consequences in real-world applications, the increasing deployment of Reinforcement Learning (RL) in safety-critical domains such as robotics has propelled the study of safe exploration for reinforcement learning (safe RL). In this work, we propose a risk preventive training method for safe RL, which learns a statistical contrastive classifier to predict the probability of a state-action pair leading to unsafe states. Based on the predicted risk probabilities, we can collect risk preventive trajectories and reshape the reward function with risk penalties to induce safe RL policies. We conduct experiments in robotic simulation environments. The results show the proposed approach has performance comparable to state-of-the-art model-based methods and outperforms conventional model-free safe RL approaches. ","safe reinforcement learning, contrastive risk prediction" Analysis of differentially private synthetic data: a general measurement error approach,https://openreview.net/forum?id=Cn6JkFnKgPX,https://openreview.net/pdf?id=Cn6JkFnKgPX,,"Differentially private (DP) synthetic datasets have been receiving significant attention from academia, industry, and government. However, little is known about how to perform statistical inference using DP synthetic datasets. Naive approaches that do not take into account the induced uncertainty due to the DP mechanism will result in biased estimators and invalid inferences. In this paper, we present a general class of bias-corrected DP estimators with valid asymptotic confidence intervals for parameters in regression settings, by establishing the connection between additive DP mechanisms and measurement error models.
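The measurement-error connection enables a classical correction: when i.i.d. noise of known variance is added to a regressor for DP, the naive OLS slope is attenuated by a known factor and can be rescaled. A one-dimensional sketch of this errors-in-variables adjustment (not the paper's general estimator):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma_dp = 50_000, 1.0
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)               # true slope = 2
x_dp = x + rng.normal(scale=sigma_dp, size=n)  # additive DP noise on x

beta_naive = np.cov(x_dp, y)[0, 1] / np.var(x_dp)
# Attenuation factor is var(x) / (var(x) + sigma_dp^2); invert it using
# the known DP noise variance.
beta_corrected = beta_naive * np.var(x_dp) / (np.var(x_dp) - sigma_dp**2)
print(beta_naive, beta_corrected)              # roughly 1.0 vs 2.0
```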
Our simulation shows that when the sample covariance between DP noises and data is close to zero, our estimator is far superior to the widely used sufficient statistic perturbation algorithm, and the CIs can achieve better coverage compared to the naive CIs obtained by ignoring the DP mechanism.","Measurement Error Model, Differential Privacy, Regression, Statistical Inference" Imbalanced Lifelong Learning with AUC Maximization,https://openreview.net/forum?id=uh93gVo6Veu,https://openreview.net/pdf?id=uh93gVo6Veu,We propose a new approach to empower continual learning with imbalanced data: designing an algorithm to directly maximize one widely used metric in an imbalanced data setting: Area Under the ROC Curve (AUC). ,"Imbalanced data is ubiquitous in machine learning, such as medical or fine-grained image datasets. The existing continual learning methods employ various techniques such as balanced sampling to improve classification accuracy in this setting. However, classification accuracy is not a suitable metric for imbalanced data, and hence these methods may not obtain a good classifier as measured by other metrics (e.g., recall, F1-score, Area under the ROC Curve). In this paper, we propose a solution to enable efficient imbalanced continual learning by designing an algorithm to effectively maximize one widely used metric in an imbalanced data setting: Area Under the ROC Curve (AUC). We find that simply replacing accuracy with AUC will cause a \textit{gradient interference problem} due to the imbalanced data distribution. To address this issue, we propose a new algorithm, namely DIANA, which performs a novel synthesis of model \underline{D}ecoupl\underline{I}ng \underline{AN}d \underline{A}lignment. In particular, the algorithm updates two models simultaneously: one focuses on learning the current knowledge while the other concentrates on reviewing previously-learned knowledge, and the two models gradually align during training. We conduct extensive experiments on datasets across various imbalanced domains, ranging from natural images to medical and satellite images. The results show that DIANA achieves state-of-the-art performance on all the imbalanced datasets compared with several competitive baselines. We further consider standard balanced benchmarks used in lifelong learning to show the effectiveness of DIANA as a general lifelong learning method. ","Imbalanced Lifelong Learning, AUC Maximization" On the Efficacy of Server-Aided Federated Learning against Partial Client Participation,https://openreview.net/forum?id=Dyzhru5NO3u,https://openreview.net/pdf?id=Dyzhru5NO3u,,"Although federated learning (FL) has become a prevailing distributed learning framework in recent years due to its benefits in scalability/privacy, there remain many significant challenges in FL system design. Notably, most existing works in the current FL literature assume either full client or uniformly distributed client participation. Unfortunately, this idealistic assumption rarely holds in practice. It has been frequently observed that some clients may never participate in FL training (aka partial/incomplete participation) due to a mix of system heterogeneity factors. To mitigate the impact of partial client participation, an increasingly popular approach in practical FL systems is the server-aided federated learning (SA-FL) framework, where one equips the server with an auxiliary dataset.
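As an aside, the SA-FL setup just described can be pictured with a toy aggregation rule in which a gradient computed by the server on its auxiliary data is blended into FedAvg. The mixing rule below is an assumption for illustration, not the SAFARI algorithm this abstract proposes later.

```python
import numpy as np

def sa_fedavg_round(global_w, client_grads, server_grad, lr=0.1, mix=0.5):
    """One toy server-aided round: blend the average gradient from the
    participating clients with a gradient the server computes on its
    auxiliary dataset. `mix` trades off the two; the rule is hypothetical.
    """
    client_avg = np.mean(client_grads, axis=0)
    blended = (1 - mix) * client_avg + mix * server_grad
    return global_w - lr * blended
```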
However, despite the fact that SA-FL has been empirically shown to be effective in addressing the partial client participation problem, there remains a lack of theoretical understanding for SA-FL. Worse yet, even the ramifications of partial worker participation are not clearly understood in conventional FL so far. These theoretical gaps motivate us to rigorously investigate SA-FL. To this end, we first reveal that conventional FL is {\em not} PAC-learnable under partial participation in the worst case, which advances our understanding of conventional FL. Then, we show that the PAC-learnability of FL with partial client participation can indeed be revived by SA-FL, which theoretically justifies the use of SA-FL for the first time. Lastly, to further make SA-FL communication-efficient, we propose the SAFARI (server-aided federated averaging) algorithm, which enjoys a convergence guarantee and the same level of communication efficiency and privacy as state-of-the-art FL.", Soft Neighbors are Positive Supporters in Contrastive Visual Representation Learning,https://openreview.net/forum?id=l9vM_PaUKz,https://openreview.net/pdf?id=l9vM_PaUKz,We leverage the soft neighbors to sufficiently explore the correlation information among samples in contrastive learning.,"Contrastive learning methods train visual encoders by comparing views (e.g., often created via a group of data augmentations on the same instance) from one instance to others. Typically, the views created from one instance are set as positive, while views from other instances are negative. This binary instance discrimination is studied extensively to improve feature representations in self-supervised learning. In this paper, we rethink the instance discrimination framework and find the binary instance labeling insufficient to measure correlations between different samples. For an intuitive example, given a random image instance, there may exist other images in a mini-batch whose content meanings are the same (i.e., belonging to the same category) or partially related (i.e., belonging to a similar category). How to treat the images that correlate similarly to the current image instance remains an unexplored problem. We thus propose to support the current image by exploring other correlated instances (i.e., soft neighbors). We first carefully cultivate a candidate neighbor set, which will be further utilized to explore the highly-correlated instances. A cross-attention module is then introduced to predict the correlation score (denoted as positiveness) of other correlated instances with respect to the current one. The positiveness score quantitatively measures the positive support from each correlated instance, and is encoded into the objective for pretext training. In this way, our proposed method benefits SSL by discriminating uncorrelated instances while absorbing correlated ones. We evaluate our soft neighbor contrastive learning method (SNCLR) on standard visual recognition benchmarks, including image classification, object detection, and instance segmentation. The state-of-the-art recognition performance shows that SNCLR is effective in improving feature representations from both ViT and CNN encoders.
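An illustrative sketch of the soft-neighbor idea described above: neighbor instances enter the contrastive objective weighted by a positiveness score, here taken as a given input rather than predicted by the paper's cross-attention module. Names and the exact weighting scheme are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def soft_neighbor_loss(z, z_pos, z_bank, positiveness, tau=0.1):
    """Contrastive loss where bank instances act as soft positives.

    positiveness: (B, K) scores in [0, 1], one per bank instance; the
    paper predicts them with cross-attention, here they are inputs.
    A sketch of the idea, not the authors' exact objective.
    """
    z, z_pos, z_bank = (F.normalize(t, dim=-1) for t in (z, z_pos, z_bank))
    sim_pos = (z * z_pos).sum(-1, keepdim=True) / tau   # (B, 1)
    sim_bank = z @ z_bank.T / tau                       # (B, K)
    log_p = F.log_softmax(torch.cat([sim_pos, sim_bank], dim=1), dim=1)
    w = torch.cat([torch.ones_like(sim_pos), positiveness], dim=1)
    return -((w * log_p).sum(1) / w.sum(1)).mean()
```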
More materials can be found on our project page: anonymous-iclr23snclr.github.io.","contrastive learning, soft neighbors, visual correlation" LA-BALD: An Information-Theoretic Image Labeling Task Sampler,https://openreview.net/forum?id=n9pes83qD1,https://openreview.net/pdf?id=n9pes83qD1,"We propose LA-BALD, an information-theoretic image labeling task sampler that actively selects image and worker pairs to improve labeling accuracy","Large-scale visual recognition datasets with high-quality labels enable many computer vision applications, but also come with enormous annotation costs, especially since multiple annotators are typically queried per image to obtain a more reliable label. Recent work in label aggregation consolidates human annotations by combining them with the predictions of an online-learned predictive model. In this work, we devise an image labeling task sampler that actively selects image-worker pairs to efficiently reduce the noise in the human annotations and improve the predictive model at the same time. We propose an information-theoretic task sampler, Label Aggregation BALD (LA-BALD), to maximize the information contributing to the labeled dataset via human annotations and the model. The simulated experiments on ImageNet100-sandbox show that LA-BALD reduces the number of annotations by 19% and 12% on average compared to the two types of baselines. Our analysis shows that LA-BALD provides both more accurate annotations and a better online-learned predictive model, leading to better labeling efficiency over the baselines. ","active learning, data annotation" Text and Patterns: For Effective Chain of Thought It Takes Two to Tango,https://openreview.net/forum?id=z9fXRC5XdT,https://openreview.net/pdf?id=z9fXRC5XdT,Text and patterns play a complementary role in the success of few-shot prompting.,"In the past decade, we witnessed dramatic gains in natural language processing and an unprecedented scaling of large language models. These developments have been accelerated by the advent of few-shot techniques such as chain of thought (CoT) prompting. Specifically, CoT pushes the performance of large language models in a few-shot setup by augmenting the prompts with intermediate steps. Despite impressive results across various tasks, the reasons behind their success have not been explored. This work uses counterfactual prompting to develop a deeper understanding of CoT-based few-shot prompting mechanisms in large language models. We first systematically identify and define the key components of a prompt: symbols, patterns, and text. Then, we devise and conduct an exhaustive set of deliberate experiments across four different tasks, by querying the model with counterfactual prompts where only one of these components is altered. Our experiments across three large language models (PaLM, GPT-3, and CODEX) reveal several surprising findings and bring into question the conventional wisdom around few-shot prompting. First, the presence of factual patterns in a prompt is practically immaterial to the success of CoT. Second, our results suggest that the primary role of intermediate steps may not be to facilitate learning ""how"" to solve a task. The intermediate steps are rather a beacon for the model to realize ""what"" symbols to replicate in the output to form a factual answer. As such, the patterns are merely a channel to ""trick"" the model into forming sentences that resemble correct answers.
This pathway is facilitated by text, which imbues patterns with commonsense knowledge and meaning. Our empirical and qualitative analysis reveals that a symbiotic relationship between text and patterns explains the success of few-shot prompting: text helps extract commonsense from the question to help patterns, and patterns enforce task understanding and direct text generation. Such systematic understanding of CoT enables us to devise a concise chain of thought, dubbed CCoT, where text and patterns are pruned by over 20%, only retaining their key roles. We achieve this reduction in the number of tokens while delivering an on-par or slightly higher task solve rate. We release datasets and anonymized code for reproducing our results at https://anonymous.4open.science/r/CoTTwoToTango-3106/.","in-context learning, few-shot prompting, chain of thought prompting, large-language models" Offline RL for Natural Language Generation with Implicit Language Q Learning,https://openreview.net/forum?id=aBH_DydEvoH,https://openreview.net/pdf?id=aBH_DydEvoH,"We propose a novel offline RL method, implicit language Q-learning (ILQL), for use on language models.","Large language models distill broad knowledge from text corpora. However, they can be inconsistent when it comes to completing user-specified tasks. This issue can be addressed by finetuning such models via supervised learning on curated datasets, or via reinforcement learning. In this work, we propose a novel offline RL method, implicit language Q-learning (ILQL), designed for use on language models, that combines the flexible utility maximization framework of RL algorithms with the ability of supervised learning to leverage previously collected data, as well as its simplicity and stability. Our method employs a combination of value conservatism alongside an implicit dataset support constraint in learning value functions, which are then used to guide language model generations towards maximizing user-specified utility functions. In addition to empirically validating ILQL, we present a detailed empirical analysis of situations where offline RL can be useful in natural language generation settings, demonstrating how it can be a more effective utility optimizer than prior approaches for end-to-end dialogue, and how it can effectively optimize high variance reward functions based on subjective judgement, such as whether to label a comment as toxic or not.","offline reinforcement learning, natural language processing, dialogue, controlled generation" MoCa: Cognitive Scaffolding for Language Models in Causal and Moral Judgment Tasks,https://openreview.net/forum?id=RdudTla7eIM,https://openreview.net/pdf?id=RdudTla7eIM,"We summarize the main findings of 24 cognitive science papers on human intuitions about causal and moral judgments, and collect a dataset to evaluate large language models.","Human commonsense understanding of the physical and social world is organized around intuitive theories. These theories support making causal and moral judgments. When something bad happens, we naturally ask: who did what, and why? A rich literature in cognitive science has studied people's causal and moral intuitions. These works have revealed a number of factors that systematically influence people's judgments, such as the presence of norms, and whether or not the protagonist in a scenario was aware of their action's potential consequences.
Here, we investigate whether large language models (LLMs) make causal and moral judgments about text-based scenarios that align with those of human participants. We find that without any annotations, LLMs and human participants are not well aligned (17\%-39\% agreement). However, LLMs can accurately annotate what relevant factors are present in a scenario with simple expert-written instructions. We demonstrate how these annotations can be used to bring LLMs into closer alignment with people (36.3\%-47.2\% agreement). These results show how insights from cognitive science can help scaffold language models to more closely match human intuitions in challenging commonsense evaluation tasks.","cognitive science, causal reasoning, moral reasoning, dataset, chain-of-thought, step-by-step, language models" Anchor Sampling for Federated Learning with Partial Client Participation,https://openreview.net/forum?id=VLnODGVVAsL,https://openreview.net/pdf?id=VLnODGVVAsL,"To accelerate the training process and improve the model performance, FedAMD utilizes anchor sampling, which partitions the participating clients into an anchor group (training with large batches) and a miner group (training with small batches). ","In federated learning, the support of partial client participation offers a flexible training strategy, but it deteriorates the model training efficiency. In this paper, we propose a framework FedAMD to improve the convergence property and maintain flexibility. The core idea is anchor sampling, which partitions the participating clients into disjoint anchor and miner groups. Each client in the anchor group aims at its local bullseye, computing gradients with a large batch. Guided by the bullseyes, clients in the miner group steer multiple near-optimal local updates using small batches and update the global model. With the joint efforts from both groups, FedAMD is able to accelerate the training process as well as improve the model performance. Measured by $\epsilon$-approximation and compared to the state-of-the-art first-order methods, FedAMD achieves convergence with up to $O(1/\epsilon)$ fewer communication rounds under non-convex objectives. In particular, we achieve a linear convergence rate under the PL condition. Empirical studies on real-world datasets validate the effectiveness of FedAMD and demonstrate the superiority of our proposed algorithm: not only does it considerably save computation and communication costs, but the test accuracy also improves significantly. ","Federated Learning, Optimization" Lattice Convolutional Networks for Learning Ground States of Quantum Many-Body Systems,https://openreview.net/forum?id=Oh0cnNTn5Di,https://openreview.net/pdf?id=Oh0cnNTn5Di,,"Deep learning methods have been shown to be effective in representing ground-state wave functions of quantum many-body systems. Existing methods use convolutional neural networks (CNNs) for square lattices due to their image-like structures. For non-square lattices, existing methods use graph neural networks (GNNs), in which structure information is not precisely captured, thereby requiring additional hand-crafted sublattice encodings. In this work, we propose lattice convolutions in which a set of proposed operations are used to convert non-square lattices into grid-like augmented lattices on which regular convolution can be applied. Based on the proposed lattice convolutions, we design lattice convolutional networks (LCN) that use self-gating and attention mechanisms.
Experimental results show that our method achieves performance on par with or better than the GNN method on the spin-1/2 $J_1$-$J_2$ Heisenberg model over the square, honeycomb, triangular, and kagome lattices, without using hand-crafted encodings.", CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos,https://openreview.net/forum?id=H-T3F0dMbyj,https://openreview.net/pdf?id=H-T3F0dMbyj,A new method that leverages the pretrained CLIP model and noise invariant training for learning text-queried sound separation with only noisy unlabeled videos,"Recent years have seen progress beyond domain-specific sound separation for speech or music towards universal sound separation for arbitrary sounds. Prior work on universal sound separation has investigated separating a target sound out of an audio mixture given a text query. Such text-queried sound separation systems provide a natural and scalable interface for specifying arbitrary target sounds. However, supervised text-queried sound separation systems require costly labeled audio-text pairs for training. Moreover, the audio provided in existing datasets is often recorded in a controlled environment, causing a considerable generalization gap to noisy audio in the wild. In this work, we aim to approach text-queried universal sound separation by using only unlabeled data. We propose to leverage the visual modality as a bridge to learn the desired audio-textual correspondence. The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model, and the query vector is then used to condition an audio separation model to separate out the target sound. While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting, thanks to the joint language-image embedding learned by the CLIP model. Further, videos in the wild often contain off-screen sounds and background noise that may hinder the model from learning the desired audio-textual correspondence. To address this problem, we further propose an approach called noise invariant training for training a query-based sound separation model on noisy data. Experimental results show that the proposed models successfully learn text-queried universal sound separation using only noisy unlabeled videos, even achieving competitive performance against a supervised model in some settings.","universal sound separation, source separation, contrastive language-image pre-training, multi-modal learning, self-supervised learning" On the Soft-Subnetwork for Few-Shot Class Incremental Learning,https://openreview.net/forum?id=z57WK5lGeHd,https://openreview.net/pdf?id=z57WK5lGeHd,SoftNet jointly learns the model weights and adaptive non-binary soft masks at a base training session in which each mask consists of the major and minor subnetwork.,"Inspired by the Regularized Lottery Ticket Hypothesis, which states that competitive smooth (non-binary) subnetworks exist within a dense network, we propose a few-shot class-incremental learning method referred to as Soft-SubNetworks (SoftNet). Our objective is to learn a sequence of sessions incrementally, where each session only includes a few training instances per class while preserving the knowledge of the previously learned ones.
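A sketch of the soft-masking idea elaborated in the next sentences: per-weight scores induce a binary "major" subnetwork plus a smooth "minor" remainder. The top-fraction thresholding below is an assumed concrete rule for illustration, not the paper's exact construction.

```python
import torch

def soft_subnetwork_mask(scores, major_frac=0.8):
    """Split weights into a binary 'major' subnetwork (mask value 1)
    and a smooth 'minor' remainder (values in (0, 1)). Thresholding the
    top `major_frac` of scores is a hypothetical concrete choice.
    """
    k = max(1, int(major_frac * scores.numel()))
    thresh = scores.flatten().topk(k).values.min()
    major = (scores >= thresh).float()
    minor = torch.sigmoid(scores) * (1 - major)
    return major + minor

# usage sketch: effective_weight = weight * soft_subnetwork_mask(mask_scores)
```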
SoftNet jointly learns the model weights and adaptive non-binary soft masks at a base training session, in which each mask consists of the major and minor subnetwork; the former aims to minimize catastrophic forgetting during training, and the latter aims to avoid overfitting to a few samples in each new training session. We provide comprehensive empirical validations demonstrating that our SoftNet effectively tackles the few-shot incremental learning problem, surpassing the performance of state-of-the-art baselines on benchmark datasets.","Soft-subnetwork, Few-shot class incremental learning (FSCIL)" Fairness-Aware Model-Based Multi-Agent Reinforcement Learning for Traffic Signal Control,https://openreview.net/forum?id=sy0PqUr2fq9,https://openreview.net/pdf?id=sy0PqUr2fq9,A novel Fairness-aware Model-based Multi-agent Reinforcement Learning (FM2Light) method to improve the sample efficiency and fairness for multi-intersection control.,"Poorly timed traffic lights exacerbate traffic congestion and greenhouse gas emissions. Traffic signal control with reinforcement learning (RL) algorithms has shown great potential in dealing with such issues and improving the efficiency of traffic systems. RL-based solutions can perform better than classic rule-based methods, especially in dynamic environments. However, most of the existing RL-based solutions are model-free methods and require a large number of interactions with the environment, which can be very costly or even unacceptable in real-world scenarios. Furthermore, the fairness of multi-intersection control has been ignored in most of the previous works, which may lead to unfair congestion at different intersections. In this work, we propose a novel Fairness-aware Model-based Multi-agent Reinforcement Learning (FM2Light) method to improve sample efficiency, thus addressing data-expensive training, and to handle unfair control in multi-intersection scenarios with a better reward design. With rigorous experiments under different real-world scenarios, we demonstrate that our method can achieve comparable asymptotic performance to model-free RL methods while achieving much higher sample efficiency and greater fairness. ","Traffic signal control, reinforcement learning, fairness" Approximating How Single Head Attention Learns,https://openreview.net/forum?id=8yVy6LdhER4,https://openreview.net/pdf?id=8yVy6LdhER4,"Why do models often attend to salient words, and how does this evolve throughout training? We define a model property, Knowledge to Translate Individual Words, and claim that it drives the learning of the attention.","Why do models often attend to salient words, and how does this evolve throughout training? We approximate model training as a two-stage process: early on in training when the attention weights are uniform, the model learns to translate individual input word `i` to `o` if they co-occur frequently. Later, the model learns to attend to `i` while the correct output is `o` because it knows `i` translates to `o`. To formalize, we define a model property, Knowledge to Translate Individual Words (KTIW) (e.g. knowing that `i` translates to `o`), and claim that it drives the learning of the attention. This claim is supported by the fact that before the attention mechanism is learned, KTIW can be learned from word co-occurrence statistics, but not the other way around.
In particular, we can construct a training distribution that makes KTIW hard to learn; in that case, the learning of the attention fails, and the model cannot even learn the simple task of copying the input words to the output. Our approximation explains why models sometimes attend to salient words, and inspires a toy example where a multi-head attention model can overcome the above hard training distribution by improving learning dynamics rather than expressiveness. We end by discussing the limitations of our approximation framework and suggest future directions.","NLP, training dynamics, attention" Efficient Attention via Control Variates,https://openreview.net/forum?id=G-uNfHKrj46,https://openreview.net/pdf?id=G-uNfHKrj46,"We present a novel analysis of random feature attention based on control variates, which characterizes its gap to full softmax attention and induces a novel efficient variant that significantly improves the approximation while remaining efficient.","Random-feature-based attention (RFA) is an efficient approximation of softmax attention with linear runtime and space complexity. However, the approximation gap between RFA and conventional softmax attention is not well studied. Built upon previous progress of RFA, we characterize this gap through the lens of control variates and show that RFA can be decomposed into a sum of multiple control variate estimators for each element in the sequence. This new framework reveals that exact softmax attention can be recovered from RFA by manipulating each control variate. Besides, it allows us to develop a more flexible form of control variates, resulting in a novel attention mechanism that significantly reduces the approximation gap while maintaining linear complexity. Extensive experiments demonstrate that our model outperforms state-of-the-art efficient attention mechanisms on both vision and language tasks.","attention mechanism, transformers, random features, control variates, importance sampling" Pathfinding Neural Cellular Automata,https://openreview.net/forum?id=CU8BwVAzLme,https://openreview.net/pdf?id=CU8BwVAzLme,We show the algorithmic alignment of Neural Cellular Automata with pathfinding problems using hand-coded networks and learned models,"Pathfinding makes up an important sub-component of a broad range of complex tasks in AI, such as robot path planning, transport routing, and game playing. While classical algorithms can efficiently compute shortest paths, neural networks could be better suited to adapting these sub-routines to more complex and intractable tasks. As a step toward developing such networks, we hand-code and learn models for Breadth-First Search (BFS), i.e. shortest path finding, using the unified architectural framework of Neural Cellular Automata, which are iterative neural networks with equal-size inputs and outputs. Similarly, we present a neural implementation of Depth-First Search (DFS), and outline how it can be combined with neural BFS to produce an NCA for computing the diameter of a graph. We experiment with architectural modifications inspired by these hand-coded NCAs, training networks from scratch to solve the diameter problem on grid mazes while exhibiting strong generalization ability. Finally, we introduce a scheme in which data points are mutated adversarially during training.
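For intuition about the hand-coded BFS NCA mentioned above, a toy flood-fill update suffices: a cell activates when any 4-neighbor is active and it is not a wall, so iterating the rule sweeps a BFS wavefront outward. A minimal numpy analogue, not the paper's implementation:

```python
import numpy as np

def nca_bfs_step(act, walls):
    """One hand-coded NCA update realizing BFS wavefront expansion on a
    grid maze: a cell becomes active if any 4-neighbor is active and it
    is not a wall. Toy analogue only; act and walls are boolean arrays.
    """
    padded = np.pad(act, 1)
    neigh = (padded[:-2, 1:-1] | padded[2:, 1:-1] |
             padded[1:-1, :-2] | padded[1:-1, 2:])
    return (act | neigh) & ~walls

# Starting from act one-hot at the source and iterating to a fixed
# point, the first step at which the target activates equals the
# shortest-path length.
```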
We find that adversarially evolving mazes leads to increased generalization on out-of-distribution examples, while at the same time generating datasets with significantly more complex solutions for reasoning tasks.","pathfinding, neural cellular automata, graph neural networks, algorithmic alignment" Learning to Optimize Quasi-Newton Methods,https://openreview.net/forum?id=EqDnVOyiVX,https://openreview.net/pdf?id=EqDnVOyiVX,We introduce a novel machine learning optimizer which combines learning to optimize with quasi-Newton methodology.,"We introduce a novel machine learning optimizer called LODO, which online meta-learns an implicit inverse Hessian of the loss as a subroutine of quasi-Newton optimization. Our optimizer merges Learning to Optimize (L2O) techniques with quasi-Newton methods to learn neural representations of symmetric matrix-vector products, which are more flexible than those in other quasi-Newton methods. Unlike other L2O methods, ours does not require any meta-training on a training task distribution, and instead learns to optimize on the fly while optimizing on the test task, adapting to the local characteristics of the loss landscape while traversing it. Theoretically, we show that our optimizer approximates the inverse Hessian in noisy loss landscapes and is capable of representing a wide range of inverse Hessians. We experimentally verify our algorithm's performance in the presence of noise, and show that simpler alternatives for representing the inverse Hessians worsen performance. Lastly, we use our optimizer to train a semi-realistic deep neural network with 95k parameters, and obtain competitive results against standard neural network optimizers.","meta-learning, learning to optimize, quasi-Newton, optimization, hypergradients" Toxicity in Multilingual Machine Translation at Scale,https://openreview.net/forum?id=5G_SmGZlXQ,https://openreview.net/pdf?id=5G_SmGZlXQ,"Quantification, analysis and mitigation recommendations of toxicity for 164 languages ","Machine Translation systems can produce different types of errors, some of which get characterized as critical or catastrophic due to the specific negative impact they can have on users. Automatic or human evaluation metrics do not necessarily differentiate between such critical errors and more innocuous ones. In this paper, we focus on one type of critical error: added toxicity. We evaluate and analyze added toxicity when translating a large evaluation dataset (HOLISTICBIAS, over 472k sentences, covering 13 demographic axes) from English into 164 languages. The automatic toxicity evaluation shows that added toxicity across languages varies from 0% to 5%. The output languages with the most added toxicity tend to be low-resource ones, and the demographic axes with the most added toxicity include sexual orientation, gender and sex, and ability. We also perform human evaluation on a subset of 8 directions, confirming the prevalence of true added toxicity. We use a measurement of the amount of source contribution to the translation, where a low source contribution implies hallucination, to interpret what causes toxicity. We observe that the source contribution is somewhat correlated with toxicity but that 45.6% of added toxic words have a high source contribution, suggesting that much of the added toxicity may be due to mistranslations.
Combining the signal of source contribution level with a measurement of translation robustness allows us to flag 22.3% of added toxicity, suggesting that added toxicity may be related to both hallucination and the stability of translations in different contexts. Given these findings, our recommendations to reduce added toxicity are to curate training data to avoid mistranslations, mitigate hallucination, and check unstable translations.","Toxicity, Holistic Bias, Input Attributions, Multilingual Machine Translation, Scale" An Adaptive Policy to Employ Sharpness-Aware Minimization,https://openreview.net/forum?id=6Wl7-M2BC-,https://openreview.net/pdf?id=6Wl7-M2BC-,We design an adaptive policy to employ SAM and propose two efficient algorithms to reduce the fraction of SAM updates.,"Sharpness-aware minimization (SAM), which searches for flat minima by min-max optimization, has been shown to be useful in improving model generalization. However, since each SAM update requires computing two gradients, its computational cost and training time are both doubled compared to standard empirical risk minimization (ERM). Recent state-of-the-art methods reduce the fraction of SAM updates and thus accelerate SAM by switching between SAM and ERM updates randomly or periodically. In this paper, we design an adaptive policy to employ SAM based on the loss landscape geometry. Two efficient algorithms, AE-SAM and AE-LookSAM, are proposed. We theoretically show that AE-SAM has the same convergence rate as SAM. Experimental results on various datasets and architectures demonstrate the efficiency and effectiveness of the proposed method.","Sharpness-aware minimization, model generalization, loss landscape" Semi-Offline Reinforcement Learning for Portfolio Optimization,https://openreview.net/forum?id=jl-zL6aETgQ,https://openreview.net/pdf?id=jl-zL6aETgQ,,"We introduce semi-offline reinforcement learning (RL), a new formalization of the sequential decision-making problem for portfolio optimization. Unlike the standard and the fully-offline RL settings, the unique challenge of semi-offline RL is the limited access to an actively evolving environment. Therefore, existing online/offline RL approaches are incapable of handling the distributional shift between the fixed observations in the training set and those in an out-of-distribution test domain. In this paper, we propose a novel off-policy RL algorithm named \textit{stationarity-constrained MDP} (SC-MDP), which decouples the previously-collected training observations into two streams of \textit{stationary} and \textit{non-stationary} latent variables through a probabilistic inference framework. We demonstrate that in this way, the learned policies can be persistently profitable despite rapidly-changing environment dynamics. Our approach remarkably outperforms the existing online RL algorithms, advanced offline RL methods, and state-of-the-art stock prediction models on three real-world financial datasets.", FedMT: Federated Learning with Mixed-type Labels,https://openreview.net/forum?id=lCzuxqRbThP,https://openreview.net/pdf?id=lCzuxqRbThP,,"In federated learning (FL), classifiers (e.g., deep networks) are trained on datasets from multiple centers without exchanging data across them, which improves sample efficiency. In the classical setting of FL, the same labeling criterion is usually employed across all centers involved in training. This constraint greatly limits the applicability of FL.
For example, standards used for disease diagnosis are likely to differ across clinical centers, which does not match the classical FL setting. In this paper, we consider an important yet under-explored setting of FL, namely FL with mixed-type labels, where different labeling criteria can be employed by various centers, leading to inter-center label space differences and challenging existing FL methods designed for the classical setting. To effectively and efficiently train models with mixed-type labels, we propose a theory-guided and model-agnostic approach that can make use of the underlying correspondence between those label spaces and can be easily combined with various FL methods such as FedAvg. We present convergence analysis based on over-parameterized ReLU networks. We show that the proposed method can achieve linear convergence in label projection, and demonstrate the impact of the parameters of our new setting on the convergence rate. The proposed method is evaluated and the theoretical findings are validated on benchmark and medical datasets.","Federated Learning, Mixed-type Labels, Healthcare Application, Neural Tangent Kernel" A Note on Quantifying the Influence of Energy Regularization for Imbalanced Classification,https://openreview.net/forum?id=tFmjPd8J5o1,https://openreview.net/pdf?id=tFmjPd8J5o1,,"For classification problems where classifiers predict $\bar{p}(y|\mathbf{x})$, namely the probability of label $y$ given data $\mathbf{x}$, an energy value can be defined (e.g. LogSumExp of the logits) and used to evaluate the estimated $\bar{p}(\mathbf{x})$ of the learned model, which is widely used for generative learning. However, previous works overlook the relationship between the estimated $\bar{p}(\mathbf{x})$ and the testing accuracy of a classifier when shifts in $p(\mathbf{x})$ occur from the training set to the testing set, \emph{e.g.,} imbalanced dataset learning. In this paper, we propose to evaluate the influence of the energy value regarding $\bar{p}(\mathbf{x})$ on the testing accuracy via the influence function, which is a standard tool in robust statistics. In particular, we empirically show that the energy value could influence the testing accuracy of the model trained on the imbalanced dataset. Based on our findings, we further propose a technique that regularizes the energy value on the training set to improve imbalanced data learning. We theoretically prove that regularizing the energy value could adjust the margin and re-weight the samples. Experimental results show the effectiveness of our method. In particular, when finetuning with our method for only a few epochs, the testing accuracy could be effectively boosted on popular imbalance classification benchmarks.","Influence Function, Energy based model, Imbalanced Dataset" Penalizing the High-likelihood: A Novel Sampling Method for Open-ended Neural Text Generation via Inverse Probability Weighting,https://openreview.net/forum?id=e9CKiV6pgBD,https://openreview.net/pdf?id=e9CKiV6pgBD,A novel sampling algorithm for neural text generation with improved diversity and novelty compared with top-p/k and temperature sampling.,"Traditional stochastic sampling methods for open-ended neural text generation focus on truncating the low-likelihood part of the predicted distribution. They do not directly manipulate the high-likelihood part, which leads to the likelihood trap that induces repetition and boredom. They also do not directly leverage the fact that humans do not always favor high-likelihood texts.
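A minimal sketch of the inverse-probability-weighted rescaling this abstract goes on to describe: the high-likelihood head is down-weighted in inverse proportion to its probability, the unreliable tail is truncated, and the result is renormalized. The cutoff values and the exact rescaling rule are hypothetical choices, not the paper's.

```python
import numpy as np

def ipw_sample(probs, high_cut=0.1, low_cut=1e-3, rng=np.random.default_rng()):
    """Penalize the high-likelihood head via inverse probability
    weighting, truncate the low-likelihood tail, renormalize, sample.
    The cutoffs and the p -> high_cut**2 / p rule are illustrative.
    """
    p = probs.astype(float).copy()
    head = p > high_cut
    p[head] = high_cut ** 2 / p[head]   # larger p -> stronger penalty
    p[p < low_cut] = 0.0                # drop the unreliable tail
    p /= p.sum()
    return rng.choice(len(p), p=p)
```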
Inspired by these observations, we propose a novel sampling method that rescales the high-likelihood part of the distribution with inverse probability weighting. It increases the diversity by rescaling and penalizing the high-likelihood words, and preserves the fluency by using multi-filtering truncation on the low-likelihood words. We use pre-trained language models to compare our algorithm with traditional sampling methods. Results show that our algorithm can significantly increase the diversity and novelty of generated texts without corrupting the fluency.","neural text generation, sampling algorithm, likelihood trap, diversity and novelty" Unlearning with Fisher Masking,https://openreview.net/forum?id=UifByHdLmmf,https://openreview.net/pdf?id=UifByHdLmmf,We develop a new unlearning strategy based on Fisher Masking which shows strong unlearning performance across different datasets and deep neural network structures.,"Machine unlearning aims to revoke some training data after learning in response to requests from users, model developers, and administrators. Most previous methods are based on direct fine tuning, which may neither remove data completely nor retain full performance on the remaining data. In this work, we find that, by first masking some important parameters before fine tuning, the performance of unlearning can be significantly improved. We propose a new masking strategy tailored to unlearning based on Fisher information. Experiments on various datasets and network structures show the effectiveness of the method: without any fine tuning, the proposed Fisher masking could unlearn almost completely while maintaining most of the performance on the remaining data. It also exhibits stronger stability compared with other unlearning baselines.","machine unlearning, Fisher Information" Augmented Lagrangian is Enough for Optimal Offline RL with General Function Approximation and Partial Coverage,https://openreview.net/forum?id=ZsvWb6mJnMv,https://openreview.net/pdf?id=ZsvWb6mJnMv,We present practical and statistically optimal offline RL algorithms under general function approximation and single-policy concentrability.,"Offline reinforcement learning (RL), which refers to decision-making from a previously-collected dataset of interactions, has received significant attention over the past years. Much effort has focused on improving offline RL practicality by addressing the prevalent issue of partial data coverage through various forms of conservative policy learning. While the majority of algorithms do not have finite-sample guarantees, several provable conservative offline RL algorithms are designed and analyzed within the single-policy concentrability framework that handles partial coverage. Yet, in the nonlinear function approximation setting where confidence intervals are difficult to obtain, existing provable algorithms suffer from computational intractability, prohibitively strong assumptions, and suboptimal statistical rates. In this paper, we leverage the marginalized importance sampling (MIS) formulation of RL and present the first set of offline RL algorithms that are statistically optimal and practical under general function approximation and single-policy concentrability, bypassing the need for uncertainty quantification. We identify that the key to successfully solving the sample-based approximation of the MIS problem is ensuring that certain state occupancy validity constraints are nearly satisfied.
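For readers unfamiliar with the machinery invoked in the next sentence, here is the generic augmented Lagrangian loop for an equality-constrained problem min f(x) s.t. c(x) = 0: it alternates approximate primal minimization of the penalized objective with dual ascent on the multipliers. This is the generic method, not the paper's algorithm; step sizes and iteration counts are arbitrary.

```python
import numpy as np

def augmented_lagrangian(f_grad, c, c_jac, x0, rho=10.0, outer=25, lr=0.01):
    """Generic AL loop for min f(x) s.t. c(x) = 0: inner gradient steps
    on f(x) + lam @ c(x) + (rho / 2) * ||c(x)||^2, then a dual update
    lam += rho * c(x). Illustrative defaults, not tuned for any problem.
    """
    x = x0.astype(float).copy()
    lam = np.zeros_like(c(x))
    for _ in range(outer):
        for _ in range(200):  # approximate primal minimization
            x -= lr * (f_grad(x) + c_jac(x).T @ (lam + rho * c(x)))
        lam += rho * c(x)     # dual ascent on the multipliers
    return x, lam
```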
We enforce these constraints by a novel application of the augmented Lagrangian method and prove the following result: with the MIS formulation, augmented Lagrangian is enough for statistically optimal offline RL. In stark contrast to prior algorithms that induce additional conservatism through methods such as behavior regularization, our approach provably eliminates this need and reinterprets regularizers as ""enforcers of state occupancy validity"" rather than ""promoters of conservatism.""","Offline RL, Pessimism, RL Theory" Bandit Learning with General Function Classes: Heteroscedastic Noise and Variance-dependent Regret Bounds,https://openreview.net/forum?id=fySLokohvj4,https://openreview.net/pdf?id=fySLokohvj4,,"We consider learning a stochastic bandit model, where the reward function belongs to a general class of uniformly bounded functions, and the additive noise can be heteroscedastic. Our model captures contextual linear bandits and generalized linear bandits as special cases. While previous works (Kirschner and Krause, 2018; Zhou et al., 2021) based on weighted ridge regression can deal with linear bandits with heteroscedastic noise, they are not directly applicable to our general model due to the curse of nonlinearity. In order to tackle this problem, we propose a \emph{multi-level learning} framework for the general bandit model. The core idea of our framework is to partition the observed data into different levels according to the variance of their respective reward and perform online learning at each level collaboratively. Under our framework, we first design an algorithm that constructs the variance-aware confidence set based on empirical risk minimization and prove a variance-dependent regret bound. For generalized linear bandits, we further propose an algorithm based on follow-the-regularized-leader (FTRL) subroutine and online-to-confidence-set conversion, which can achieve a tighter variance-dependent regret under certain conditions.", A Semantic Hierarchical Graph Neural Network for Text Classification,https://openreview.net/forum?id=3HX__RcSFZj,https://openreview.net/pdf?id=3HX__RcSFZj,,"The key to the text classification task is language representation and important information extraction, and there are many related studies. In recent years, the research on graph neural networks (GNNs) in text classification has gradually emerged and shown its advantages, but the existing models mainly focus on directly inputting words as graph nodes into the GNN models, ignoring the different levels of semantic structure information in the samples. To address the issue, we propose a new hierarchical graph neural network (HieGNN), which extracts corresponding information from the word-level, sentence-level (sen-level) and document-level (doc-level), respectively. The doc-level focuses on processing samples from a global perspective, while the sen-level and word-level focus on processing samples from the sentences and words themselves.
The model is tested on five datasets, and compared with the pure GNN-based model and the hybrid GNN and BERT model, it achieves better classification results on two datasets and similar results on three datasets, which demonstrates that our model is able to obtain more useful information for classification from samples.", Injecting Image Details into CLIP's Feature Space,https://openreview.net/forum?id=8TKFt2x3Sx,https://openreview.net/pdf?id=8TKFt2x3Sx,"We propose a framework, including a model-agnostic complete cover scheme to obtain image patches, a fusing model, the corresponding query proxy loss and a new text-image retrieval benchmark.","Although CLIP-like Visual Language Models provide a functional joint feature space for image and text, due to the limitation of the CLIP-like model's image input size (e.g., 224), subtle details are lost in the feature representation if we input high-resolution images (e.g., 2240). In this work, we introduce an efficient framework that can produce a single feature representation for a high-resolution image that injects image details and shares the same semantic space as the original CLIP. In the framework, we train a feature fusing model based on CLIP features extracted from a carefully designed image patch method (Complete Cover) that can cover objects of any scale, weakly supervised by image-agnostic class prompted queries. We validate our framework by retrieving images from class prompted queries on the existing real-world and synthetic datasets, showing significant performance improvement on these tasks. Furthermore, to fully demonstrate our framework's detail retrieval ability, we construct a CLEVR-like synthetic dataset called CLEVR-DS, which is fully annotated and has a controllable object scale. ","Text-Based Information Retrieval, Fine-Detail, CLIP, Single Feature, Detail Compression, Complete Cover, Feature Space Alignment, Self-Supervised" Adaptive Sparse Softmax: An Effective and Efficient Softmax Variant for Text Classification,https://openreview.net/forum?id=5cio7DSIXLQ,https://openreview.net/pdf?id=5cio7DSIXLQ,,"Softmax with the cross entropy loss is the standard configuration for current neural text classification models. The gold score for a target class is supposed to be 1, but it is never reachable under the softmax schema. Such a problem makes the training process continue forever and leads to overfitting. Moreover, the “target-approach-1” training goal forces the model to continuously learn all samples, leading to a waste of time in handling some samples which have already been classified correctly with high confidence, while the test goal simply requires the target class of each sample to hold the maximum score. To solve the above weaknesses, we propose the \textbf{A}daptive \textbf{S}parse softmax (AS-Softmax) which designs a reasonable and test-matching transformation on top of softmax. For more purposeful learning, we discard the classes with far smaller scores compared with the actual class during training. Then the model could focus on learning to distinguish the target class from its strong opponents, which is also the great challenge in test. In addition, since the training losses of easy samples will gradually drop to 0 in AS-Softmax, we develop an adaptive gradient accumulation strategy based on the masked sample ratio to speed up training. We verify the proposed AS-Softmax on a variety of multi-class, multi-label and token classification tasks with class sizes ranging from 5 to 5000+.
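A sketch of the sparse-softmax step described above, under one concrete assumed sparsification rule: classes whose logits trail the target logit by more than a margin are masked out before the softmax, so the loss only involves the target and its strong opponents. The additive-margin form is a hypothetical choice.

```python
import torch
import torch.nn.functional as F

def as_softmax_style_loss(logits, target, margin=4.0):
    """Drop far-behind classes before the softmax so training focuses
    on the target's strong opponents. The additive-margin rule is an
    assumption; the paper's sparsification may differ in detail.
    """
    tgt = logits.gather(1, target[:, None])            # (B, 1)
    keep = logits > (tgt - margin)                     # target always kept
    masked = logits.masked_fill(~keep, float("-inf"))
    return F.cross_entropy(masked, target)
```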
The results show that AS-Softmax consistently outperforms softmax and its variants, and the loss of AS-Softmax is remarkably correlated with classification performance in validation. Furthermore, the adaptive gradient accumulation strategy can bring about a 1.2× training speedup compared with the standard softmax while maintaining classification effectiveness.", Stochastic Bridges as Effective Regularizers for Parameter-Efficient Tuning,https://openreview.net/forum?id=x5YkB3b_48o,https://openreview.net/pdf?id=x5YkB3b_48o,"We propose to use stochastic bridges in a latent space to regularize the intermediate states of pre-trained models, and show the effectiveness and generality of this regularization on different tasks, models, and parameter-efficient tuning methods.","Parameter-efficient tuning methods (PETs) have achieved promising results in tuning large pre-trained language models (PLMs). By formalizing frozen PLMs and additional tunable parameters as systems and controls respectively, PETs can be theoretically grounded to optimal control and further viewed as optimizing terminal cost and running cost in the optimal control literature. Despite the elegance of this theoretical grounding, in practice, existing PETs often ignore the running cost and only optimize the terminal cost, i.e., focus on optimizing the loss function of the output state, regardless of the running cost that depends on the intermediate states. Since it is non-trivial to directly model the intermediate states and design a running cost function, we propose to use latent stochastic bridges to regularize the intermediate states and serve as the running cost of PETs. As the first work to propose regularized PETs that use stochastic bridges as the regularizers (running costs) for intermediate states, we show the effectiveness and generality of this regularization across different tasks, PLMs and PETs. In view of the great potential and capacity, we believe more sophisticated regularizers can be designed for PETs and better performance can be achieved in the future.","parameter-efficient tuning, pre-trained model, stochastic process" Continuous Goal Sampling: A Simple Technique to Accelerate Automatic Curriculum Learning,https://openreview.net/forum?id=Vk9RH9aL1Yv,https://openreview.net/pdf?id=Vk9RH9aL1Yv,"We present continuous goal sampling, an extension of goal-conditioned RL that accelerates a wide range of curriculum learning algorithms.","Goal-conditioned reinforcement learning (RL) tackles the problem of training an RL agent to reach multiple goals in an environment, often with sparse rewards only administered upon reaching the goal. In this regard, automatic curriculum learning can improve an agent's learning by sampling goals in a structured order catered to the agent's current ability. This work presents two contributions to improve learning in goal-conditioned RL environments. First, we present a simple, algorithm-agnostic technique to accelerate learning by continuous goal sampling, in which an agent's goals are sampled and changed multiple times within a single episode. Such continuous goal sampling enables faster exploration of the goal space and allows curriculum methods to have a more significant impact on an agent's learning. Second, we propose VDIFF, an automatic curriculum learning method that uses an agent's value function to create a self-paced curriculum by sampling goals on which the agent is demonstrating high learning progress.
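An aside on how the two pieces compose: continuous goal sampling only requires redrawing the goal every few steps inside a rollout, with the curriculum (e.g., a VDIFF-style progress-based sampler) supplying each draw. The environment and policy API below is hypothetical.

```python
def run_episode(env, policy, sample_goal, horizon=200, resample_every=25):
    """Continuous goal sampling sketch: instead of one fixed goal per
    episode, a fresh goal is drawn every `resample_every` steps, so the
    curriculum acts many times per rollout. env.reset/env.step signatures
    are hypothetical.
    """
    obs, goal = env.reset(), sample_goal()
    for t in range(1, horizon + 1):
        if t % resample_every == 0:
            goal = sample_goal()   # mid-episode goal switch
        obs, reward, done, info = env.step(policy(obs, goal))
        if done:
            break
```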
Through results on 17 multi-goal robotic environments and navigation tasks, we show that continuous goal sampling and VDIFF work synergistically and result in performance gains over current state-of-the-art methods.","reinforcement learning, curriculum learning, goal-conditioned reinforcement learning" What do Vision Transformers Learn? A Visual Exploration,https://openreview.net/forum?id=xc5ajsvLzFO,https://openreview.net/pdf?id=xc5ajsvLzFO,,"Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision, yet we understand very little about why they work and what they learn. While existing studies visually analyze the mechanisms of convolutional neural networks, an analogous exploration of ViTs remains challenging. In this paper, we first address the obstacles to performing visualizations on ViTs. Assisted by these solutions, we observe that neurons in ViTs trained with language model supervision (e.g., CLIP) are activated by semantic concepts rather than visual features. We also explore the underlying differences between ViTs and CNNs, and we find that transformers detect image background features, just like their convolutional counterparts, but their predictions depend far less on high-frequency information. On the other hand, both architecture types behave similarly in the way features progress from abstract patterns in early layers to concrete objects in late layers. In addition, we show that ViTs maintain spatial information in all layers except the final layer. In contrast to previous works, we show that the last layer most likely discards the spatial information and behaves as a learned global pooling operation. Finally, we conduct large-scale visualizations on a wide range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin, to validate the effectiveness of our method.","Vision Transformers, Visualization, Interpretability" When do Convolutional Neural Networks Stop Learning?,https://openreview.net/forum?id=ZamTdE_Q7hv,https://openreview.net/pdf?id=ZamTdE_Q7hv,,"Convolutional Neural Networks (CNNs) have demonstrated outstanding performance in computer vision tasks such as image classification, detection, segmentation, and medical image analysis. In general, an arbitrary number of epochs is used to train such neural networks. In a single epoch, the entire training data---divided by batch size---are fed to the network. In practice, validation error with training loss is used to estimate the neural network's generalization, which indicates the optimal learning capacity of the network. Current practice is to stop training when the training loss decreases and the gap between training and validation error increases (i.e., the generalization gap) to avoid overfitting. However, this is a trial-and-error-based approach which raises a critical question: Is it possible to estimate when neural networks stop learning based on training data? This research work introduces a hypothesis that analyzes the data variation across all the layers of a CNN variant to anticipate its near-optimal learning capacity. In the training phase, we use our hypothesis to anticipate the near-optimal learning capacity of a CNN variant without using any validation data. Our hypothesis can be deployed as a plug-and-play to any existing CNN variant without introducing additional trainable parameters to the network. 
We test our hypothesis\footnote{https://github.com/PaperUnderReviewDeepLearning/Optimization} on six different CNN variants and three different datasets (CIFAR10, CIFAR100, and SVHN). The results based on these CNN variants and datasets show that our hypothesis saves 58.49\% of computational time (on average) in training.", Detecting and Mitigating Indirect Stereotypes in Word Embeddings,https://openreview.net/forum?id=e-M4E3Jmnkq,https://openreview.net/pdf?id=e-M4E3Jmnkq,,"Societal biases in the usage of words, including harmful stereotypes, are frequently learned by common word embedding methods. These biases manifest not only between a word and an explicit marker of its stereotype, but also between words that share related stereotypes. This latter phenomenon, sometimes called ``indirect bias,'' has resisted prior attempts at debiasing. In this paper, we propose a novel method to mitigate indirect bias in distributional word embeddings by modifying biased relationships between words before embeddings are learned. This is done by considering how the co-occurrence probability of a given pair of words changes in the presence of words marking an attribute of bias, and using this to average out the effect of a bias attribute. To evaluate this method, we perform a series of common tests and demonstrate that the semantic quality of the word embeddings is retained while measures of bias in the embeddings are reduced. In addition, we conduct novel tests for measuring indirect stereotypes by extending the Word Embedding Association Test (WEAT) with new test sets for indirect binary gender stereotypes. With these tests, we demonstrate that this method can reduce the presence of more subtle stereotypes not properly addressed by previous work.","word embeddings, stereotypes, bias, bias mitigation, indirect bias" OCIM : Object-centric Compositional Imagination for Visual Abstract Reasoning,https://openreview.net/forum?id=X5BlM6YG7j,https://openreview.net/pdf?id=X5BlM6YG7j,"Our model leverages object-centric inductive biases to derive an imagination-based learning framework. We show that it leads to better compositional generalization in a visual abstract reasoning task.","A long-sought property of machine learning systems is the ability to compose learned concepts in novel ways that would enable them to make sense of new situations. Such capacity for imagination -- a core aspect of human intelligence -- is not yet attained for machines. In this work, we show that object-centric inductive biases can be leveraged to derive an imagination-based learning framework that achieves compositional generalization on a series of tasks. Our method, denoted Object-centric Compositional IMagination (OCIM), decomposes visual reasoning tasks into a series of primitives applied to objects without using a domain-specific language. We show that these primitives can be recomposed to generate new imaginary tasks. By training on such imagined tasks, the model learns to reuse the previously-learned concepts to systematically generalize at test time. We test our model on a series of arithmetic tasks where the model has to infer the sequence of operations (programs) applied to a series of inputs.
We find that imagination is key for the model to find the correct solution for unseen combinations of operations.","objects, imagination, visual reasoning, representation learning, inductive biases, compositional generalization" How Weakly Supervised Information helps Contrastive Learning,https://openreview.net/forum?id=YSVbWFBDup,https://openreview.net/pdf?id=YSVbWFBDup,,"Contrastive learning has shown outstanding performance in both supervised and unsupervised learning. However, little is known about when and how weakly supervised information helps improve contrastive learning, especially from the theoretical perspective. The major challenge is that the existing theory of contrastive learning based on supervised learning frameworks fails to distinguish between supervised and unsupervised contrastive learning. Therefore, we turn to the unsupervised learning frameworks, and based on the posterior probability of labels, we translate the weakly supervised information into a similarity graph under the framework of spectral clustering. In this paper, we investigate two typical weakly supervised learning problems, noisy label learning and semi-supervised learning, and analyze their influence on contrastive learning within a unified framework. Specifically, we analyze the effect of weakly supervised information on the augmentation graph of unsupervised contrastive learning, and consequently on its corresponding error bound. Numerical experiments are carried out to verify the theoretical findings.", A computational framework to unify representation similarity and function in biological and artificial neural networks,https://openreview.net/forum?id=Zd4hTGjpMNm,https://openreview.net/pdf?id=Zd4hTGjpMNm,,"Artificial neural networks (ANNs) are versatile tools for studying the neural representation in the ventral visual stream, and knowledge from neuroscience in return inspires ANN models to improve task performance. However, it is still unclear how to merge these two directions into a unified framework. In this study, we propose an integrated framework called Deep Autoencoder with Neural Response (DAE-NR), which incorporates information from ANN and the visual cortex to achieve better image reconstruction performance and higher neural representation similarity between biological and artificial neurons. The same visual stimuli (i.e., natural images) are input to both the mouse brain and DAE-NR. The encoder of DAE-NR jointly learns the dependencies from neural spike encoding and image reconstruction. For the neural spike encoding task, the features derived from a specific hidden layer of the encoder are transformed by a mapping function to predict the ground-truth neural response under the constraint of image reconstruction. Simultaneously, for the image reconstruction task, the latent representation obtained by the encoder is assigned to a decoder to restore the original image under the guidance of neural information. In DAE-NR, the learning processes of the encoder, mapping function and decoder are all implicitly constrained by these two tasks. Our experiments demonstrate that only with joint learning can DAE-NRs improve the performance of visual image reconstruction and increase the representation similarity between biological neurons and artificial neurons. 
The DAE-NR offers a new perspective on the integration of computer vision and neuroscience.", Turning a Curse Into a Blessing: Enabling Data-Free Backdoor Unlearning via Stabilized Model Inversion,https://openreview.net/forum?id=P880C39xAvM,https://openreview.net/pdf?id=P880C39xAvM,,"The effectiveness of many existing backdoor removal techniques crucially relies on access to clean in-distribution data. However, as models are often trained on sensitive or proprietary datasets, it might not be practical to assume the availability of in-distribution samples. To address this problem, we propose a novel approach to reconstruct samples from a backdoored model and then use the reconstructed samples as a proxy for the clean in-distribution data needed by the defenses. We observe an interesting phenomenon that ensuring perceptual similarity between the synthesized samples and the clean training data is \emph{not} adequate to enable effective defenses. We show that the model predictions at such synthesized samples can be unstable to small input perturbations, which misleads downstream backdoor removal techniques to remove these perturbations instead of the underlying backdoor triggers. Moreover, unlike clean samples, the predictions at the synthesized samples can also be unstable to small model parameter changes. To tackle these issues, we design an optimization-based data reconstruction technique that ensures visual quality while promoting stability to perturbations in both data and parameter space. We also observe that although the synthesized samples are reconstructed from a backdoored model, they do not contain backdoors, and further provide a theoretical analysis that sheds light on this observation. Our evaluation shows that our data synthesis technique can lead to state-of-the-art backdoor removal performance without clean in-distribution data access, and the performance is on par with or sometimes even better than using the same amount of clean samples.",Backdoor Defenses Fairness and Accuracy under Domain Generalization,https://openreview.net/forum?id=jBEXnEMdNOL,https://openreview.net/pdf?id=jBEXnEMdNOL,"This paper presents (1) theoretical bounds for fairness and accuracy under domain generalization, and (2) a proposed model that can achieve good fairness and accuracy in an unseen target domain through invariant representation learning.","As machine learning (ML) algorithms are increasingly used in high-stakes applications, concerns have arisen that they may be biased against certain social groups. Although many approaches have been proposed to make ML models fair, they typically rely on the assumption that data distributions in training and deployment are identical. Unfortunately, this is commonly violated in practice, and a model that is fair during training may lead to an unexpected outcome during its deployment. Although the problem of designing robust ML models under dataset shifts has been widely studied, most existing works focus only on the transfer of accuracy. In this paper, we study the transfer of both fairness and accuracy under domain generalization where the data at test time may be sampled from never-before-seen domains. We first develop theoretical bounds on the unfairness and expected loss at deployment, and then derive sufficient conditions under which fairness and accuracy can be perfectly transferred via invariant representation learning. 
Guided by this, we design a learning algorithm such that fair ML models learned with training data still have high fairness and accuracy when deployment environments change. Experiments on real-world data validate the proposed algorithm.","fairness, accuracy, domain generalization, js-divergence, invariant representation, equalized odds, equal opportunity, regularization" DROP: Conservative Model-based Optimization for Offline Reinforcement Learning,https://openreview.net/forum?id=ttfOGx6-_FT,https://openreview.net/pdf?id=ttfOGx6-_FT,,"In this work, we decouple the iterative (bi-level) offline RL optimization from the offline training phase, forming a non-iterative bi-level learning paradigm that avoids the iterative error propagation over two levels. Specifically, this non-iterative paradigm allows us to conduct inner-level optimization in training (i.e., employing policy/value regularization), while performing outer-level optimization in testing (i.e., conducting policy inference). Naturally, such a paradigm raises three core questions (that are not fully answered by prior non-iterative methods): (Q1) What information should we transfer from the inner level to the outer level? (Q2) What should we pay attention to when using the transferred information in outer-level optimization? (Q3) What are the benefits of concurrently conducting outer-level optimization during testing? Motivated by model-based optimization, we propose DROP, which fully answers the above three questions. Particularly, at the inner level, DROP decomposes offline data into multiple subsets, and learns a score model (Q1). To keep exploitation of the score model safe at the outer level, we explicitly learn a behavior embedding and introduce a conservative regularization (Q2). During testing, we show that DROP permits deployment adaptation, enabling an adaptive inference across states (Q3). Empirically, we evaluate DROP on various benchmarks, showing that DROP gains comparable or better performance compared to prior offline RL methods.", Language Models Can Teach Themselves to Program Better,https://openreview.net/forum?id=SaRj2ka1XZ3,https://openreview.net/pdf?id=SaRj2ka1XZ3,"Language Models can be used to generate Programming Puzzles and Solutions, which can be filtered for correctness and used to finetune the LLM to improve its performance.","Recent Language Models (LMs) achieve breakthrough performance in code generation when trained on human-authored problems, even solving some competitive-programming problems. Self-play has proven useful in games such as Go, and thus it is natural to ask whether LMs can generate their own instructive programming problems to improve their performance. We show that it is possible for an LM to synthesize programming problems and solutions, which are filtered for correctness by a Python interpreter. The LM’s performance is then seen to improve when it is fine-tuned on its own synthetic problems and verified solutions; thus the model “improves itself” using the Python interpreter. Problems are specified formally as programming puzzles [Schuster et al., 2021], a code-based problem format where solutions can easily be verified for correctness by execution. In experiments on publicly-available LMs, test accuracy more than doubles. 
This work demonstrates the potential for code LMs, with an interpreter, to generate instructive problems and improve their own performance.","deep learning, natural language processing, program synthesis, large language models" Adaptive Kernel Selection for Convolutional Neural Network,https://openreview.net/forum?id=jPfDKW3nj5q,https://openreview.net/pdf?id=jPfDKW3nj5q,,"Convolutional Neural Networks (CNNs) are used for various applications ranging from computer vision to natural language processing. A kernel, known as the matrix of weights, performs a convolution operation on input data. In general, the optimizer updates the weights of the kernel. Recent research suggests that applying a deterministic kernel after the convolution operation can help a CNN to gain better generalization. However, how to compute the weights of the deterministic kernel is still a field of active research. In this work, we derive a lemma that shows the representativeness of an adaptive deterministic kernel. We construct an adaptive deterministic kernel based on the Gaussian distribution of convoluted data. We generate many sets of kernels by shifting weights to different positions of the initially created Gaussian kernel. We notice a pattern of weight distribution in deterministic kernels constructed from the Gaussian distribution of convoluted data. Using the derived lemma, it is possible to sort out a set of kernels (from the many sets of kernels) that tends to gain better performance in a CNN for the image classification task. The main objective of this research work\footnote{https://github.com/PaperUnderReviewDeepLearning/KernelSet} is to identify these patterns and recommend the better set of kernels to gain performance.","Adaptive, Kernel, CNN, Computer Vision" NAPG: Non-Autoregressive Program Generation for Hybrid Tabular-Textual Question Answering,https://openreview.net/forum?id=Q31C6XQOEvl,https://openreview.net/pdf?id=Q31C6XQOEvl,We present a non-autoregressive program generation model for the numerical reasoning of hybrid question answering to address the exposure bias issue of autoregressive generation and to boost the decoding speed.,"Hybrid tabular-textual question answering (QA) requires reasoning from heterogeneous information, and the types of reasoning are mainly divided into numerical reasoning and span extraction. The current numerical reasoning method uses LSTM to autoregressively decode program sequences, and each decoding step produces either an operator or an operand. However, the step-by-step decoding suffers from exposure bias, and the accuracy of program generation drops sharply with progressive decoding. In this paper, we propose a non-autoregressive program generation framework, which facilitates program generation in parallel. Our framework, which independently generates complete program tuples containing both operators and operands, can significantly boost the speed of program generation while addressing the error accumulation issue. Our experiments on the MultiHiertt dataset show that our model can bring about large improvements (+7.97 EM and +6.38 F1 points) over the strong baseline, establishing new state-of-the-art performance, while being much faster (~21x) in program generation. 
The performance drop of our method is also significantly smaller than that of the baseline as the number of numerical reasoning steps increases.","Tabular-Textual Question Answering, Non-Autoregressive Program Generation, Natural Language Processing" MVP: Multi-task Supervised Pre-training for Natural Language Generation,https://openreview.net/forum?id=Us5in-h2Dp,https://openreview.net/pdf?id=Us5in-h2Dp,We pre-train a model MVP and task-specific prompts for natural language generation tasks with our collected labeled corpora MVPCorpus.,"Pre-trained language models (PLMs) have achieved remarkable success in natural language generation (NLG) tasks. Up to now, most NLG-oriented PLMs are pre-trained in an unsupervised manner using the large-scale general corpus. Meanwhile, an increasing number of models pre-trained with labeled data (i.e., “supervised pre-training”) showcase superior performance compared to unsupervised pre-trained models. Motivated by the success of supervised pre-training, we propose Multi-task superVised Pre-training (MVP) for natural language generation. We collect a large-scale natural language generation corpus, MVPCorpus, from $77$ datasets over $11$ diverse NLG tasks. Then we unify these examples into a general text-to-text format to pre-train the text generation model MVP in a supervised manner. For each task, we further pre-train specific soft prompts to stimulate the model’s capacity to perform a specific task. Extensive experiments have demonstrated the effectiveness and generalizability of our MVP model in a number of NLG tasks, which achieves state-of-the-art performance on $13$ out of $17$ datasets.","Natural language generation, pretrained language models, multi-task learning, prompt learning" Learning Unified Representations for Multi-Resolution Face Recognition,https://openreview.net/forum?id=1EVPT82ttr,https://openreview.net/pdf?id=1EVPT82ttr,We propose Branch-to-Trunk Network to learn discriminative embeddings for multi-resolution face recognition while preserving representation compatibility.,"In this work, we propose Branch-to-Trunk network (BTNet), a novel representation learning method for multi-resolution face recognition. It consists of a trunk network (TNet), namely a unified encoder, and multiple branch networks (BNets), namely resolution adapters. As per the input, a resolution-specific BNet is used and the outputs are implanted as feature maps in the feature pyramid of TNet, at a layer with the same resolution. The discriminability of tiny faces is significantly improved, as the interpolation error introduced by rescaling, especially up-sampling, is mitigated on the inputs. With branch distillation and backward-compatible training, BTNet transfers discriminative high-resolution information to multiple branches while guaranteeing representation compatibility. Our experiments demonstrate strong performance on face recognition benchmarks, both for multi-resolution face verification and face identification, with much less computation and parameter storage. 
We establish a new state-of-the-art on the challenging QMUL-SurvFace 1:N face identification task.","multi-resolution face recognition, deep representation learning" Latent Bottlenecked Attentive Neural Processes,https://openreview.net/forum?id=yIxtevizEA,https://openreview.net/pdf?id=yIxtevizEA,"In this work, we propose Latent Bottlenecked Attentive Neural Processes (LBANPs), a new computationally efficient sub-quadratic NP variant that has a querying computational complexity independent of the number of context datapoints.","Neural Processes (NPs) are popular methods in meta-learning that can estimate predictive uncertainty on target datapoints by conditioning on a context dataset. The previous state-of-the-art method, Transformer Neural Processes (TNPs), achieves strong performance but requires quadratic computation with respect to the number of context datapoints, significantly limiting its scalability. Conversely, existing sub-quadratic NP variants perform significantly worse than TNPs. Tackling this issue, we propose Latent Bottlenecked Attentive Neural Processes (LBANPs), a new computationally efficient sub-quadratic NP variant that has a querying computational complexity independent of the number of context datapoints. The model encodes the context dataset into a constant number of latent vectors on which self-attention is performed. When making predictions, the model retrieves higher-order information from the context dataset via multiple cross-attention mechanisms on the latent vectors. We empirically show that LBANPs achieve results competitive with the state-of-the-art on meta-regression, image completion, and contextual multi-armed bandits. We demonstrate that LBANPs can trade-off the computational cost and performance according to the number of latent vectors. Finally, we show LBANPs can scale beyond existing attention-based NP variants to larger dataset settings.","Neural Processes, Meta-Learning, Uncertainty Estimation" Improving Inductive Link Prediction through Learning Generalizable Node Representations,https://openreview.net/forum?id=ekTnbhhLHg,https://openreview.net/pdf?id=ekTnbhhLHg,"We propose new methods for designing inductive tests on any graph dataset, accompanied by unsupervised pre-training of the node attributes for improved generalizability of the link prediction models.","Link prediction is a core task in graph machine learning, as it is useful in many application domains from social networks to biological networks. Link prediction can be performed under different experimental settings: (1) transductive, (2) semi-inductive, and (3) inductive. The most common setting is the transductive one, where the task is to predict whether two observed nodes have a link. In the semi-inductive setting, the task is to predict whether an observed node has a link to a newly observed node, which was unseen during training. For example, cold start in recommendation systems requires suggesting a known product to a new user. We study the inductive setting, where the task is to predict whether two newly observed nodes have a link. The inductive setting occurs in many real-world applications such as predicting interactions between two poorly investigated chemical structures or identifying collaboration possibilities between two new authors. In this paper, we demonstrate that current state-of-the-art techniques perform poorly under the inductive setting, i.e., when generalizing to new nodes, due to the overlapping information between the graph topology and the node attributes. 
To address this issue and improve the robustness of link prediction models in an inductive setting, we propose new methods for designing inductive tests on any graph dataset, accompanied by unsupervised pre-training of the node attributes. Our experiments show that the inductive test performances of the state-of-the-art link prediction models are substantially lower compared to the transductive scenario. These performances are comparable to, and often lower than, that of a simple multilayer perceptron on the node attributes. Unsupervised pre-training of the node attributes improves the inductive performance, hence the generalizability of the link prediction models.","Link Prediction, Graph Machine Learning, Inductive Learning, Node Embedding, Representation Learning, Generalizability, Open Graph Benchmark" VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment,https://openreview.net/forum?id=26aAV_wjoc,https://openreview.net/pdf?id=26aAV_wjoc,"We introduce VoLTA, Vision-Language Transformer with weakly-supervised local-feature Alignment, a VLP paradigm trained with graph optimal transport (GOT) based image-text matching.","Vision-language pre-training (VLP) has recently proven highly effective for various uni- and multi-modal downstream applications. However, most existing end-to-end VLP methods use high-resolution image-text-box data to perform well on fine-grained region-level tasks, such as object detection, segmentation, and referring expression comprehension. Unfortunately, such high-resolution images with accurate bounding box annotations are expensive to collect and use for supervision at scale. In this work, we propose VoLTA (Vision-Language Transformer with weakly-supervised local-feature Alignment), a new VLP paradigm that only utilizes image-caption data but achieves fine-grained region-level image understanding, eliminating the use of expensive box annotations. VoLTA adopts graph optimal transport-based weakly-supervised alignment on local image patches and text tokens to germinate an explicit, self-normalized, and interpretable low-level matching criterion. In addition, VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training and removes fusion-specific transformer layers, further reducing memory requirements. Extensive experiments on a wide range of vision- and vision-language downstream tasks demonstrate the effectiveness of VoLTA on fine-grained applications without compromising the coarse-grained downstream performance, often outperforming methods using significantly more caption and box annotations.","self-supervision, vision-language pre-training, transformer, patch-word alignment" "Online Min-max Optimization: Nonconvexity, Nonstationarity, and Dynamic Regret",https://openreview.net/forum?id=vONX8wvmAP,https://openreview.net/pdf?id=vONX8wvmAP,We study the online nonconvex-strongly-concave min-max optimization in the nonstationary environment and propose efficient algorithms with optimal theoretical guarantees under novel notion of regret,"Online min-max optimization has recently gained considerable interest due to its rich applications to game theory, multi-agent reinforcement learning, online robust learning, etc. Theoretical understanding in this field has been mainly focused on convex-concave settings. Online min-max optimization with nonconvex geometries, which captures various online deep learning problems, has yet to be studied. 
In this paper, we make the first effort to investigate online nonconvex-strongly-concave min-max optimization in the nonstationary environment. We first introduce a natural notion of local Nash equilibrium (NE)-regret, and then propose a novel algorithm coined SODA to achieve the optimal regret. We further generalize our study to the setting with stochastic first-order feedback, and show that a variation of SODA can also achieve the same optimal regret in expectation. Our theoretical results and the superior performance of the proposed method are further validated by empirical experiments. To the best of our knowledge, this is the first exploration of efficient online nonconvex min-max optimization.","Online Optimization, Nonconvex, Minimax" Embed to Control Partially Observed Systems: Representation Learning with Provable Sample Efficiency,https://openreview.net/forum?id=8oJHwb3Sgp,https://openreview.net/pdf?id=8oJHwb3Sgp,,"Reinforcement learning in partially observed Markov decision processes (POMDPs) faces two challenges. (i) It often takes the full history to predict the future, which induces a sample complexity that scales exponentially with the horizon. (ii) The observation and state spaces are often continuous, which induces a sample complexity that scales exponentially with the extrinsic dimension. Addressing such challenges requires learning a minimal but sufficient representation of the observation and state histories by exploiting the structure of the POMDP. To this end, we propose a reinforcement learning algorithm named Embed to Control (ETC), which learns the representation at two levels while optimizing the policy. (i) For each step, ETC learns to represent the state with a low-dimensional feature, which factorizes the transition kernel. (ii) Across multiple steps, ETC learns to represent the full history with a low-dimensional embedding, which assembles the per-step feature. We integrate (i) and (ii) in a unified framework that allows a variety of estimators (including maximum likelihood estimators and generative adversarial networks). For a class of POMDPs with a low-rank structure in the transition kernel, ETC attains an $O(1/\epsilon^2)$ sample complexity that scales polynomially with the horizon and the intrinsic dimension (that is, the rank). Here $\epsilon$ is the optimality gap. To the best of our knowledge, ETC is the first sample-efficient algorithm that bridges representation learning and policy optimization in POMDPs with infinite observation and state spaces.", Towards Better Selective Classification,https://openreview.net/forum?id=5gDz_yTcst,https://openreview.net/pdf?id=5gDz_yTcst,,"We tackle the problem of Selective Classification where the objective is to achieve the best performance on a predetermined ratio (coverage) of the dataset. Recent state-of-the-art selective methods come with architectural changes either via introducing a separate selection head or an extra abstention logit. In this paper, we challenge the aforementioned methods and confirm that the superior performance of state-of-the-art methods is due to training a more generalizable classifier rather than their proposed selection mechanisms. We argue that the best-performing selection mechanism should instead be rooted in the classifier itself. Our proposed selection strategy uses the classification probabilities and achieves better results by a significant margin, consistently, across all coverages and all datasets, without any added compute cost. 
Furthermore, inspired by semi-supervised learning, we propose an entropy-based regularizer that improves the performance of selective classification methods. Our proposed selection mechanism with the proposed entropy-based regularizer achieves new state-of-the-art results.","Selective Classification, Semi-Supervised Learning" Offline Equilibrium Finding,https://openreview.net/forum?id=tx-KRrFC2b,https://openreview.net/pdf?id=tx-KRrFC2b,,"Offline reinforcement learning (Offline RL) is an emerging field that has recently begun gaining attention across various application domains due to its ability to learn behavior from previously collected datasets. Offline RL has proved very successful, paving a path to solving previously intractable real-world problems, and we aim to generalize this paradigm to a multi-agent or multiplayer-game setting. To this end, we formally introduce the problem of offline equilibrium finding (OEF) and construct multiple datasets across a wide range of games using several established methods. To solve the OEF problem, we design a model-based method that can directly apply any online equilibrium finding algorithm to the OEF setting while making minimal changes. We focus on the three most prominent contemporary online equilibrium finding algorithms and adapt them to the OEF setting, creating three model-based variants: OEF-PSRO and OEF-CFR, which generalize the widely-used algorithms PSRO and Deep CFR to compute Nash equilibria (NEs), and OEF-JPSRO, which generalizes JPSRO to calculate (Coarse) Correlated equilibria ((C)CEs). We further improve their performance by combining the behavior cloning policy with the model-based policy. Extensive experimental results demonstrate the superiority of our approach over multiple model-based and model-free offline RL algorithms and the necessity of the model-based method for solving OEF problems. We hope that our efforts may help to accelerate research in large-scale equilibrium finding. ", ASGNN: Graph Neural Networks with Adaptive Structure,https://openreview.net/forum?id=wQ-Tqt4eYQ,https://openreview.net/pdf?id=wQ-Tqt4eYQ,A novel graph neural network model with adaptive structure that has strong resilience to graph structural attacks,"The graph neural network (GNN) has achieved impressive results in numerous machine learning tasks. However, many existing GNN models are shown to be extremely vulnerable to adversarial attacks, which makes it essential to build robust GNN architectures. In this work, we propose a novel interpretable message passing scheme with adaptive structure (ASMP) to defend against adversarial attacks on graph structure. Layers in ASMP are derived based on optimization steps that minimize an objective function that simultaneously learns the node features and the graph structure. ASMP is adaptive in the sense that the message passing process in different layers can be carried out over different graphs. Such a property allows more fine-grained handling of the noisy graph structure and hence improves the robustness. Integrating ASMP with neural networks can lead to a new family of GNNs with adaptive structure (ASGNN). 
Extensive experiments on semi-supervised node classification tasks demonstrate that the proposed ASGNN outperforms the state-of-the-art GNN architectures with respect to classification performance under various graph adversarial attacks.","Graph neural network, graph adversarial attacks and defenses, adaptive structure" Iteratively Learning Novel Strategies with Diversity Measured in State Distances,https://openreview.net/forum?id=OfaJyiYonBk,https://openreview.net/pdf?id=OfaJyiYonBk,We develop an iterative RL algorithm for discovering diverse high-reward strategies with provable convergence properties. ,"In complex reinforcement learning (RL) problems, policies with similar rewards may have substantially different behaviors. Yet it remains a challenging problem to not only optimize rewards but also discover as many diverse strategies as possible. A natural approach to this task is constrained population-based training (PBT), which simultaneously learns a collection of policies subject to diversity constraints. However, due to the unaffordable computation cost of PBT, we adopt an alternative approach, iterative learning (IL), which repeatedly learns a single novel policy that is sufficiently different from previous ones. We first analyze these two frameworks and prove that, for any policy pool derived by PBT, we can always use IL to obtain another policy pool with the same rewards and competitive diversity scores. In addition, we also present a novel state-based diversity measure with two tractable realizations. Such a metric can impose a stronger and much smoother diversity constraint than existing action-based metrics. Combining IL and the state-based diversity measure, we develop a powerful diversity-driven RL algorithm, State-based Intrinsic-reward Policy Optimization (SIPO), with provable convergence properties. We empirically examine our algorithm in complex multi-agent environments including StarCraft Multi-Agent Challenge and Google Research Football. SIPO is able to consistently derive strategically diverse and human-interpretable policies that cannot be discovered by existing baselines.","diverse behavior, multi-agent reinforcement learning, deep reinforcement learning" Learning Kernelized Contextual Bandits in a Distributed and Asynchronous Environment,https://openreview.net/forum?id=-G1kjTFsSs,https://openreview.net/pdf?id=-G1kjTFsSs,We propose and analyze a communication efficient asynchronous Kernel UCB algorithm with Nystrom approximation.,"Despite the recent advances in communication-efficient distributed bandit learning, most existing solutions are restricted to parametric models, e.g., linear bandits and generalized linear bandits (GLB). In comparison, kernel bandits, which search for non-parametric functions in a reproducing kernel Hilbert space (RKHS), offer higher modeling capacity. But the only existing work in distributed kernel bandits adopts a synchronous communication protocol, which greatly limits its practical use (e.g., every synchronization step requires all clients to participate and wait for data exchange). In this paper, in order to improve the robustness against delays and unavailability of clients that are common in practice, we propose the first asynchronous solution based on approximated kernel regression for distributed kernel bandit learning. A set of effective treatments is developed to ensure approximation quality and communication efficiency. 
Rigorous theoretical analysis of the regret and communication cost is provided, and extensive empirical evaluations demonstrate the effectiveness of our solution.","contextual bandit, kernelized method, asynchronous distributed learning, communication efficiency" ATTRIBUTES RECONSTRUCTION IN HETEROGENEOUS NETWORKS VIA GRAPH AUGMENTATION,https://openreview.net/forum?id=SCk8vEhwKo,https://openreview.net/pdf?id=SCk8vEhwKo,,"Heterogeneous Graph Neural Networks (HGNNs), as an effective tool for mining heterogeneous graphs, have achieved remarkable performance on node classification tasks. Yet, HGNNs are limited in their mining power as they require all nodes to have complete and reliable attributes. This is usually unrealistic, since the attributes of many nodes in reality are inevitably missing or defective. Existing methods usually adopt imputation schemes to complete missing attributes, in which topology information is ignored, leading to suboptimal performance. Some graph augmentation techniques have improved the quality of attributes, but few of them are designed for heterogeneous graphs. In this work, we study data augmentation on heterogeneous graphs, tackling the missing and defective attributes simultaneously, and propose a novel generic architecture—Attributes Reconstruction in Heterogeneous networks via Graph Augmentation (ARHGA)—including random sampling, attribute augmentation and consistency training. In graph augmentation, to ensure that attributes are plausible and accurate, the attention mechanism is adopted to reconstruct attributes under the guidance of the topological relationship between nodes. Our proposed architecture can be easily combined with any GNN-based heterogeneous model, and improves the performance. Extensive experiments on three benchmark datasets demonstrate the superior performance of ARHGA over state-of-the-art baselines on semi-supervised node classification.", Graph Signal Sampling for Inductive One-Bit Matrix Completion: a Closed-form Solution,https://openreview.net/forum?id=G_HSyfLk0m,https://openreview.net/pdf?id=G_HSyfLk0m,,"Inductive 1-bit matrix completion is motivated by modern applications such as recommender systems, where new users would appear at the test stage with the ratings consisting of only ones and no zeros. We propose a unified graph signal sampling framework which enjoys the benefits of graph signal analysis and processing. The key idea is to transform each user's ratings on the items to a function (signal) on the vertices of an item-item graph, then learn structural graph properties to recover the function from its values on certain vertices --- the problem of graph signal sampling. We propose a class of regularization functionals that takes into account discrete random label noise in the graph vertex domain, then develop the GS-IMC approach which biases the reconstruction towards functions that vary little between adjacent vertices for noise reduction. Theoretical results show that accurate reconstructions can be achieved under mild conditions. For the online setting, we develop a Bayesian extension, i.e., BGS-IMC, which considers continuous random Gaussian noise in the graph Fourier domain and builds upon a prediction-correction update algorithm to obtain the unbiased and minimum-variance reconstruction. Both GS-IMC and BGS-IMC have closed-form solutions and thus are highly scalable in large data. 
Experiments show that our methods achieve state-of-the-art performance on public benchmarks.","inductive one-bit matrix completion, graph signal sampling" DocPrompting: Generating Code by Retrieving the Docs,https://openreview.net/forum?id=ZTCxT2t2Ru,https://openreview.net/pdf?id=ZTCxT2t2Ru,We propose to generalize the code generation models to unseen functions and usages through retrieving and reading code documentation,"Publicly available source-code libraries are continuously growing and changing. This makes it impossible for models of code to keep current with all available APIs by simply training these models on existing code repositories. Thus, existing models inherently cannot generalize to using unseen functions and libraries, because these would never appear in the training data. In contrast, when human programmers use functions and libraries for the first time, they frequently refer to textual resources such as code manuals and documentation, to explore and understand the available functionality. Inspired by this observation, we introduce DocPrompting: a natural-language-to-code generation approach that explicitly leverages documentation by (1) retrieving the relevant documentation pieces given an NL intent, and (2) generating code based on the NL intent and the retrieved documentation. DocPrompting is general: it can be applied to any programming language and is agnostic to the underlying neural model. We demonstrate that DocPrompting consistently improves NL-to-code models: DocPrompting improves strong base models such as CodeT5 by 2.85% in pass@1 (52% relative gain) and 4.39% in pass@10 (30% relative gain) in execution-based evaluation on the popular Python CoNaLa benchmark; on a new Bash dataset tldr, DocPrompting improves CodeT5 and GPT-Neo1.3B by up to an absolute 6.9% exact match.","code generation, retrieval-conditioned generation" Comparing semantic and morphological analogy completion in word embeddings,https://openreview.net/forum?id=TuhR4112Ii,https://openreview.net/pdf?id=TuhR4112Ii,,"Word embeddings have prompted great excitement in the NLP community due to their capacity for generalization to unforeseen tasks, including semantic analogy completion. Features such as color and category relationships have been examined by previous work, but this is the first research considering the morphological relationships encoded in word embeddings. We construct several natural experiments examining analogy completion across word stems modified by affixes, and find no evidence that Word2Vec, GloVe, and fastText models encode these morphological relationships. We note that a special case of this problem is part-of-speech transformation, and that the lack of support for part-of-speech analogies is surprising in the context of other successful cases of semantic inference using word embeddings.","machine learning, computational linguistics, word embeddings, morphemes, semantic relationship, analogy completion" LipsFormer: Introducing Lipschitz Continuity to Vision Transformers,https://openreview.net/forum?id=cHf1DcCwcH3,https://openreview.net/pdf?id=cHf1DcCwcH3,We propose a Lipschitz continuous Transformer.,"We present a Lipschitz continuous Transformer, called LipsFormer, to pursue training stability both theoretically and empirically for Transformer-based models. 
In contrast to previous practical tricks that address training instability by learning rate warmup, layer normalization, attention formulation, and weight initialization, we show that Lipschitz continuity is a more essential property to ensure training stability. In LipsFormer, we replace unstable Transformer component modules with Lipschitz continuous counterparts: CenterNorm instead of LayerNorm, spectral initialization instead of Xavier initialization, scaled cosine similarity attention instead of dot-product attention, and weighted residual shortcut. We prove that these introduced modules are Lipschitz continuous and derive an upper bound on the Lipschitz constant of LipsFormer. Our experiments show that LipsFormer allows stable training of deep Transformer architectures without the need for careful learning rate tuning such as warmup, yielding faster convergence and better generalization. As a result, on the ImageNet 1K dataset, LipsFormer-Tiny training for 100 epochs without learning rate warmup attains a top-1 accuracy of 81.6\%, which is higher than Swin Transformer-Tiny training for 300 epochs with warmup. Moreover, LipsFormer-Tiny training for 300 epochs achieves a top-1 accuracy of 83.5\% with 4.7G FLOPs and 24M parameters. ","Lipschitz Continuity, Vision Transformer, Transformer" Automatic Chain of Thought Prompting in Large Language Models,https://openreview.net/forum?id=5NTt8GFjUHkr,https://openreview.net/pdf?id=5NTt8GFjUHkr,We propose an automatic prompting method (Auto-CoT) to elicit chain-of-thought reasoning in large language models without needing manually-designed demonstrations.,"Large language models (LLMs) can perform complex reasoning by generating intermediate reasoning steps. Providing such a series of steps for prompting demonstrations is called chain-of-thought (CoT) prompting. CoT prompting has two major paradigms. One leverages a simple prompt like ``Let’s think step by step'' to facilitate step-by-step thinking before answering each question. The other uses a few step-by-step demonstrations, each composed of a question and a reasoning chain that leads to an answer. In practice, the second paradigm has achieved stronger performance than the first paradigm. However, this superior performance hinges on the hand-crafting of multiple effective task-specific demonstrations. We show that this limitation may be addressed by leveraging pre-existing abilities of LLMs to generate reasoning chains for demonstrations. A key challenge is that these generated chains often come with mistakes. To mitigate the effect of such mistakes, we investigate various principles for automatically constructing demonstrations and find that diversity matters. Inspired by these findings, we propose an automatic CoT prompting method called Auto-CoT. Auto-CoT samples questions with diversity and generates reasoning chains to construct demonstrations. On ten public benchmark reasoning tasks, Auto-CoT performs competitively compared to Manual-CoT, which requires manual designs. ","Chain of Thought Prompting, Large Language Models, In-context Learning, Few-shot Learning, Arithmetic Reasoning, Commonsense Reasoning, Symbolic Reasoning." 
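The Auto-CoT recipe described in the entry above (sample questions with diversity, then elicit zero-shot reasoning chains to build demonstrations) is concrete enough to sketch in code. Below is a minimal illustrative sketch, not the authors' released implementation: `generate` is a hypothetical stand-in for any LLM completion call, and TF-IDF features stand in for the sentence embeddings used in the paper.

```python
# Minimal sketch of Auto-CoT-style demonstration construction:
# cluster questions for diversity, pick one representative per cluster,
# and elicit a zero-shot chain of thought for each representative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np

def generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM completion API call."""
    raise NotImplementedError

def build_demonstrations(questions, k=8):
    # Embed the questions; TF-IDF keeps this sketch dependency-light,
    # whereas the paper uses sentence embeddings.
    X = TfidfVectorizer().fit_transform(questions).toarray()
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    demos = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        # Representative question: the one closest to the cluster centroid.
        dists = np.linalg.norm(X[idx] - km.cluster_centers_[c], axis=1)
        q = questions[idx[np.argmin(dists)]]
        # Zero-shot CoT elicitation; generated chains may contain mistakes,
        # which clustering-based diversity is meant to dilute.
        rationale = generate(f"Q: {q}\nA: Let's think step by step.")
        demos.append(f"Q: {q}\nA: Let's think step by step. {rationale}")
    # Prepend the joined demonstrations to a new test question at inference.
    return "\n\n".join(demos)
```

Clustering before selection is what enforces the diversity the abstract identifies as key; choosing the centroid-nearest question per cluster keeps each demonstration representative of a distinct question type.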
Enforcing zero-Hessian in meta-learning,https://openreview.net/forum?id=5ZarS9RX5I-,https://openreview.net/pdf?id=5ZarS9RX5I-,"This paper argues that linearity in the inner loop is the key to gradient-based meta learning, and suggests algorithms that exploit this prior.","Gradient-Based Meta Learning (GBML) enables us to get task-specific parameters with few labeled datapoints in an inner loop. However, it has not yet been discussed how GBML can adapt to a new task within a few optimization steps with a huge learning rate in the inner loop. We find that the gradient does not change from the beginning to the end of the inner loop, meaning that it behaves like a linear model. In this paper, we argue that this characteristic is an essential key to understanding convergence in inner loops with huge learning rates. Also, we show that gradient-based meta learning can be interpreted as metric-based meta learning when we adopt our hypothesis that linearity in the inner loop is the key to operating GBML. To empirically prove and exploit our hypothesis, we propose a regularization-based algorithm called enforcing Linearity in the Inner Loop (LIL), which exploits our observation and can be applied to any baseline that has the form of GBML. LIL proves its potential by showing boosted performance not only on top of general baselines in various architectures, but also on adverse or Hessian-free baselines. Qualitative experiments are also conducted to explain the performance of LIL.","meta learning, Gradient based meta learning, GBML, kernel gradient descent, metric-based learning, optimization-based meta-learning" An efficient encoder-decoder architecture with top-down attention for speech separation,https://openreview.net/forum?id=fzberKYWKsI,https://openreview.net/pdf?id=fzberKYWKsI,"We propose an encoder-decoder speech separation structure with top-down attention, which can improve separation efficiency while ensuring the separation performance.","Deep neural networks have shown excellent prospects in speech separation tasks. However, obtaining good results while keeping a low model complexity remains challenging in real-world applications. In this paper, we provide a bio-inspired efficient encoder-decoder architecture by mimicking the brain’s top-down attention, called TDANet, with decreased model complexity without sacrificing performance. The top-down attention in TDANet is extracted by the global attention (GA) module and the cascaded local attention (LA) layers. The GA module takes multi-scale acoustic features as input to extract the global attention signal, which then modulates features of different scales by direct top-down connections. The LA layers use features of adjacent layers as input to extract the local attention signal, which is used to modulate the lateral input in a top-down manner. On three benchmark datasets, TDANet consistently achieved separation performance competitive with previous state-of-the-art (SOTA) methods with higher efficiency. Specifically, TDANet’s multiply-accumulate operations (MACs) are only 5% of Sepformer, one of the previous SOTA models, and CPU inference time is only 10% of Sepformer. In addition, a large-size version of TDANet obtained SOTA results on three datasets, with MACs still only 10% of Sepformer and the CPU inference time only 24% of Sepformer. 
Our study suggests that top-down attention can be a more efficient strategy for speech separation.","encoder-decoder, top-down attention, speech separation" Treatment Effect Estimation with Collider Bias and Confounding Bias,https://openreview.net/forum?id=RvV2xvoML7G,https://openreview.net/pdf?id=RvV2xvoML7G,,"To answer causal questions from observational data, it is important to consider the mechanisms that determine which data values are observed and which are missing. Prior work has considered the treatment assignment mechanism and proposed methods to remove the confounding bias from the common causes of treatment and outcome. However, there are other issues in sample selection, commonly overlooked in prior work, that can bias the treatment effect estimation, such as the issue of censored outcome as a form of collider bias. In this paper, we propose the novel Selection Controlled CounterFactual Regression (SC-CFR) to simultaneously address confounding and collider bias. Specifically, we first calculate the magnitude of the collider bias of different instances by estimating the selection model and then add a control term to remove the collider bias while learning a balanced representation to remove the confounding bias when estimating the outcome model. Our method is shown to provide unbiased treatment effect estimates from observational data with confounding and collider bias. Extensive empirical results on both synthetic and real-world datasets show that our method consistently outperforms benchmarks when both types of biases exist.", Adaptive Weight Decay: On The Fly Weight Decay Tuning for Improving Robustness,https://openreview.net/forum?id=ScEfNWshH3B,https://openreview.net/pdf?id=ScEfNWshH3B,"We tune the hyper-parameter for weight decay during each iteration to stabilize training networks with smaller weight-norms, which results in more robustness to adversarial examples and label noise, and less sensitivity to choices of learning rate.","We introduce adaptive weight decay, which automatically tunes the hyper-parameter for weight decay during each training iteration. For classification problems, we propose changing the value of the weight-decay hyper-parameter on the fly based on the strength of updates from the classification loss (i.e., gradient of cross-entropy), and the regularization loss (i.e., $\ell_2$-norm of the weights). We show that this simple modification can result in large improvements in adversarial robustness — an area which suffers from robust overfitting — without requiring extra data. Specifically, our reformulation results in a 20% relative robustness improvement for CIFAR-100, and a 10% relative robustness improvement on CIFAR-10, compared to traditional weight decay. In addition, this method has other desirable properties, such as less sensitivity to learning rate and smaller weight norms, the latter of which contributes to robustness against overfitting to label noise and to pruning.","weight decay, regularization, robust overfitting, adversarial robustness, noisy label, adversarial, pruning" Annealed Training for Combinatorial Optimization on Graphs,https://openreview.net/forum?id=YHCR6CFAK6v,https://openreview.net/pdf?id=YHCR6CFAK6v,A simple but effective annealed training framework for unsupervised learning of combinatorial optimization problems over graphs,"The hardness of combinatorial optimization (CO) problems hinders collecting solutions for supervised learning. 
Learning neural networks for CO problems without labeled data, however, is notoriously difficult, as the training is easily trapped at local optima. In this work, we propose a simple but effective annealed training framework for CO problems. In particular, we transform CO problems into the smoothest unbiased energy-based models (EBMs) by adding carefully selected penalties, then train graph neural networks to approximate the EBMs. We prevent the training from being stuck at local optima near the initialization by introducing an annealed loss function. An experimental evaluation demonstrates that our annealed training framework obtains substantial improvements. In four types of CO problems, our method achieves performance substantially better than other unsupervised neural methods on both synthetic and real-world graphs.","combinatorial optimization, graph neural networks, unsupervised learning, simulated annealing" Machine Unlearning of Federated Clusters,https://openreview.net/forum?id=VzwfoFyYDga,https://openreview.net/pdf?id=VzwfoFyYDga,"We propose the first known unlearning mechanism for federated clustering with privacy criteria that support simple, provable, and efficient data removal at the client and server level.","Federated clustering is an unsupervised learning problem that arises in a number of practical applications, including personalized recommender and healthcare systems. With the adoption of recent laws ensuring the ""right to be forgotten"", the problem of machine unlearning for federated clustering methods has become of significant importance. This work proposes the first known unlearning mechanism for federated clustering with privacy criteria that support simple, provable, and efficient data removal at the client and server level. The gist of our approach is to combine special initialization procedures with quantization methods that allow for secure aggregation of estimated local cluster counts at the server unit. As part of our platform, we introduce secure compressed multiset aggregation (SCMA), which is of independent interest for secure sparse model aggregation. In order to simultaneously facilitate low communication complexity and secret sharing protocols, we integrate Reed-Solomon encoding with special evaluation points into the new SCMA pipeline and derive bounds on the time and communication complexity of different components of the scheme. Compared to completely retraining K-means++ locally and globally for each removal request, we obtain an average speed-up of roughly 84x across seven datasets, two of which contain biological and medical information that is subject to frequent unlearning requests.","federated learning, federated clustering, machine unlearning, secure aggregation" Semi-Supervised Segmentation-Guided Tumor-Aware Generative Adversarial Network for Multi-Modality Brain Tumor Translation,https://openreview.net/forum?id=balnyoGkYfW,https://openreview.net/pdf?id=balnyoGkYfW,Tumor-aware multi-modality brain tumor translation.,"Multi-modality brain tumor images are widely used for clinical diagnosis since they can provide complementary information. Yet, due to considerations such as time, cost, and artifacts, it is difficult to get fully paired multi-modality images. Therefore, most of the brain tumor images are modality-missing in practice and only a few are labeled, due to the large amount of expert knowledge required. To tackle this problem, multi-modality brain tumor image translation has been extensively studied. 
However, existing works often lead to tumor deformation or distortion because they only focus on the whole image. In this paper, we propose a semi-supervised segmentation-guided tumor-aware generative adversarial network called $S^3TAGAN$, which utilizes unpaired brain tumor images with few paired and labeled ones to learn an end-to-end mapping from source modality to target modality. Specifically, we train a semi-supervised segmentation network to get pseudo labels, which aims to help the model focus on the local brain tumor areas. The model can synthesize more realistic images using pseudo tumor labels as additional information to help the global translation. Experiments show that our model achieves competitive results on both quantitative and qualitative evaluations. We also verify the effectiveness of the generated images via the downstream segmentation tasks.","brain tumor translation, multi-modality" Brainformers: Trading Simplicity for Efficiency,https://openreview.net/forum?id=w5q6tHO1dl1,https://openreview.net/pdf?id=w5q6tHO1dl1,,"Transformers are central to recent successes in natural language processing and computer vision. Transformers have a mostly uniform backbone where layers alternate between feed-forward and self-attention in order to build a deep network. Here we investigate this design choice and find that more complex blocks that have different permutations of feed-forward layers, self-attention layers, and gated layers (for routing through sparse structures) can be more efficient. Using this insight, we develop a complex block, named Brainformer, that consists of a diverse set of layers such as sparsely gated feed-forward layers with different gating mechanisms, dense feed-forward layers, attention layers, and various forms of layer normalization and activation functions. Brainformer consistently outperforms the state-of-the-art dense and sparse Transformers, in terms of both quality and efficiency. A Brainformer model with 8 billion activated parameters per token demonstrates 2$\times$ faster training convergence and 5$\times$ faster step time compared to its GLaM counterpart. In downstream task evaluation, Brainformer also demonstrates a 3% higher SuperGLUE score with fine-tuning compared to GLaM with a similar number of activated parameters. Finally, Brainformer largely outperforms a Primer dense model with similar computation per token on one-shot evaluation for three important generative tasks. ", Control Graph as Unified IO for Morphology-Task Generalization,https://openreview.net/forum?id=HcUf-QwZeFh,https://openreview.net/pdf?id=HcUf-QwZeFh,We explore a method for learning a single policy that manipulates various forms of agents to various goal positions by distilling a large amount of proficient behavioral data.,"The rise of generalist large-scale models in natural language and vision has made us expect that a massive data-driven approach could achieve broader generalization in other domains such as continuous control. In this work, we explore a method for learning a single policy that manipulates various forms of agents to solve various tasks by distilling a large amount of proficient behavioral data. In order to align the input-output (IO) interface across multiple tasks and diverse agent morphologies while preserving essential 3D geometric relations, we introduce the control graph, which treats observations, actions, and goals/tasks in a unified graph representation. 
We also develop MxT-Bench for fast large-scale behavior generation, which supports procedural generation of diverse morphology-task combinations with a minimal blueprint and hardware-accelerated simulator. Through efficient representation and architecture selection on MxT-Bench, we find that a control graph representation coupled with a Transformer architecture improves multi-task performance compared to other baselines, including recent discrete tokenization, and provides better prior knowledge for zero-shot transfer or sample efficiency in downstream multi-task imitation learning. Our work suggests that large, diverse offline datasets, a unified IO representation, and policy representation and architecture selection through supervised learning form a promising approach for studying and advancing morphology-task generalization.","Morphology-Task Generalization, Behavior Distillation, Supervised RL, Reinforcement Learning" Learning to Generate Pseudo Anomalies,https://openreview.net/forum?id=3M1JnCdz-5F,https://openreview.net/pdf?id=3M1JnCdz-5F,We propose a learning mechanism to generate pseudo anomalies for one-class classification in anomaly detection.,"Due to the rare occurrence of anomalous events, anomaly detection is often seen as a one-class classification (OCC) problem. In this setting, an autoencoder (AE) is typically trained to reconstruct using only normal data in order to learn normalcy representations. It is expected that, at test time, the AE can well reconstruct normal data while poorly reconstructing anomalous data. However, anomalous data is often well reconstructed as well. This phenomenon can be attributed to the fact that when training an AE with only normal data, the boundary between normal and abnormal data is unknown, consequently resulting in a boundary that includes the abnormal data as well. To alleviate this problem, we utilize pseudo anomalies to limit the reconstruction capability of an AE. Without imposing strong inductive bias, pseudo anomalies are generated by adding noise to the normal data. Moreover, to improve the quality of pseudo anomalies, we propose a learning mechanism to generate noise by exploiting the aforementioned weakness of AE, i.e., reconstructing anomalies too well. Evaluations on Ped2, Avenue, ShanghaiTech, and CIFAR-10 datasets demonstrate the effectiveness of our approach in improving the discriminative capability of AEs for anomaly detection.","anomaly detection, generative model, pseudo anomaly, autoencoder" Effective Self-Supervised Transformers For Sparse Time Series Data,https://openreview.net/forum?id=HUCgU5EQluN,https://openreview.net/pdf?id=HUCgU5EQluN,We propose a Transformer based model for sparse time series that utilizes an input binning scheme to aggregate the time series inputs.,"Electronic health records (EHRs) typically contain a wide range of time series data that is characterized by high sparsity and irregular observations. Self-supervised Transformer architectures have shown outstanding performance in a variety of structured tasks in natural language processing and computer vision. However, their use in modelling sparse irregular time series with tabular data has not been widely explored. One of the major challenges is the quadratic scaling of self-attention layers that can significantly limit the input sequence length. In this work, we introduce TESS, Transformers for EHR data with Self Supervised learning, a self-supervised Transformer-based architecture designed to extract robust representations from EHR data.
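One common way to realize the pseudo-anomaly idea above is to perturb normal inputs with noise and train the AE to map the perturbed input back to its clean version; the sketch below uses fixed Gaussian noise for simplicity, whereas the paper learns the noise, so treat the noise model and the loss weighting as assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pseudo_anomaly_loss(ae, x, noise_scale=0.2, weight=0.5):
    # Reconstruct normal inputs faithfully, and force noise-perturbed
    # "pseudo anomalies" to map back to the clean input, limiting the
    # AE's ability to reproduce anomalous patterns.
    noise = noise_scale * torch.randn_like(x)
    recon_normal = ae(x)
    recon_pseudo = ae(x + noise)
    return F.mse_loss(recon_normal, x) + weight * F.mse_loss(recon_pseudo, x)

# usage with a toy fully-connected autoencoder
ae = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 784))
x = torch.randn(32, 784)
loss = pseudo_anomaly_loss(ae, x)
```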
We propose an input binning scheme that aggregates the time series inputs and sparsity information into a regular sequence with fixed length, enabling the training of larger and deeper Transformers. We demonstrate that significant compression of EHR input data is possible without sacrificing useful information, likely due to the highly correlated nature of observations in small time bins. We then introduce self-supervised prediction tasks that provide rich and informative signals for model pre-training. TESS outperforms state-of-the-art deep learning models on multiple downstream tasks from the MIMIC-IV and PhysioNet-2012 EHR datasets.","Representation learning, Transformers, Sparse Time Series" SpaceEvo: Searching Hardware-Friendly Search Space for Efficient Int8 Inference,https://openreview.net/forum?id=RsSJ2_M2Nk4,https://openreview.net/pdf?id=RsSJ2_M2Nk4,We introduce techniques to search a quantization-friendly search space for a given device,"INT8 quantization is an essential compression tool to deploy a deep neural network (DNN) on resource-limited edge devices. While it greatly reduces model size and memory cost, current edge-regime DNN models cannot utilize INT8 quantization well to reduce inference latency. In this work, we find that the poor INT8 latency performance is due to the quantization-unfriendly issue: the operator and configuration (e.g., channel width) choices in a normal model design space lead to diverse quantization efficiency and can worsen INT8 latency. To alleviate this issue, we propose SpaceEvo to efficiently search a novel hardware-aware, quantization-friendly search space, where its top-tier sub-networks achieve both superior quantization efficiency and accuracy. The key idea is to automatically evolve hardware-preferred operators and configurations guided by a search space quality metric, called Q-T score. However, naively training a candidate space from scratch for Q-T score evaluation brings prohibitive training cost, making it difficult to evolve the search space on large-scale tasks (e.g., ImageNet). We further propose to conduct block-wise training and build an INT8 accuracy lookup table to greatly reduce the cost. On diverse devices, SpaceEvo consistently outperforms existing manually-designed search spaces by producing both tiny and large quantized models with superior ImageNet accuracy and hardware efficiency. The discovered models, named SeqNet, achieve up to 10.1% accuracy improvement under the same latency. Our study addresses the hardware-friendly search space design challenge in NAS and paves the way for searching the search space towards efficient deployment.","Neural Architecture Search, Search Space Design, INT8 Quantization, Edge Hardware" Scalable feature selection via sparse learnable masks,https://openreview.net/forum?id=row6cEJ2aBT,https://openreview.net/pdf?id=row6cEJ2aBT,SLM is an end-to-end feature selection method using a sparse learnable mask and a novel mutual information maximizer.,"We propose a canonical approach for feature selection, sparse learnable masks (SLM). SLM integrates learnable sparse masks into end-to-end training. For the fundamental non-differentiability challenge of selecting a desired number of features, we propose two mechanisms: automatic mask scaling to achieve the desired feature sparsity, and gradual tempering of this sparsity for effective learning. In addition, SLM employs a novel objective that maximizes the mutual information between the selected features and the labels.
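A minimal sketch of a learnable feature mask in the spirit of SLM, assuming a tempered sigmoid for soft selection during training and hard top-k selection at inference; SLM's exact scaling and tempering mechanisms differ, so this is illustrative only.

```python
import torch
import torch.nn as nn

class LearnableFeatureMask(nn.Module):
    # One learnable logit per input feature; a tempered sigmoid yields a
    # soft mask during training and hard top-k selection at inference.
    def __init__(self, num_features, k):
        super().__init__()
        self.scores = nn.Parameter(torch.zeros(num_features))
        self.k = k

    def forward(self, x, temperature=1.0):
        if self.training:
            mask = torch.sigmoid(self.scores / temperature)
        else:
            mask = torch.zeros_like(self.scores)
            mask[self.scores.topk(self.k).indices] = 1.0
        return x * mask

# usage: anneal temperature toward 0 so the soft mask sharpens over training
masker = LearnableFeatureMask(num_features=100, k=10)
x = torch.randn(32, 100)
out = masker(x, temperature=0.5)
```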
Empirically, SLM achieves state-of-the-art results on several benchmark datasets, often by a significant margin, especially on challenging real-world datasets.","Feature selection, mutual information, end-to-end learning, sparse mask" On the Interplay Between Misspecification and Sub-optimality Gap: From Linear Contextual Bandits to Linear MDPs,https://openreview.net/forum?id=bHpOeIXvSX2,https://openreview.net/pdf?id=bHpOeIXvSX2,,"We study linear contextual bandits in the misspecified setting, where the expected reward function can be approximated by a linear function class up to a bounded misspecification level $\zeta>0$. We propose an algorithm based on a novel data selection scheme, which only selects the contextual vectors with large uncertainty for online regression. We show that, when the misspecification level $\zeta$ is dominated by $\tilde O(\Delta / \sqrt{d})$ with $\Delta$ being the minimal sub-optimality gap and $d$ being the dimension of the contextual vectors, our algorithm enjoys the same gap-dependent regret bound $\tilde O ({d^2} /{\Delta})$ as in the well-specified setting up to logarithmic factors. Together with a lower bound adapted from Du et al. (2019); Lattimore et al. (2020), our result suggests an interplay between misspecification level and the sub-optimality gap: (1) the linear contextual bandit model is efficiently learnable when $\zeta \leq \tilde O({\Delta} / \sqrt{d})$; and (2) it is not efficiently learnable when $\zeta \geq \tilde \Omega({\Delta} / {\sqrt{d}})$. We also extend our algorithm to reinforcement learning with linear Markov decision processes (linear MDPs), and obtain a parallel result of gap-dependent regret. Experiments on both synthetic and real-world datasets corroborate our theoretical results.", HSVC: Transformer-based Hierarchical Distillation for Software Vulnerability Classification,https://openreview.net/forum?id=PY1wvNgwhPC,https://openreview.net/pdf?id=PY1wvNgwhPC,,"Software vulnerabilities have diverse characteristics, attacks, and impacts on software systems, stakeholders, and organizations. Such diverse characteristics of vulnerabilities (i.e., CWE-IDs) often lead to more difficulty in handling the label distributions for a Deep Learning model (e.g., addressing a highly imbalanced multi-class classification problem). However, existing vulnerability detection approaches often treat vulnerabilities equally---which does not reflect reality. In this paper, we present a new approach to solving the highly imbalanced software vulnerability classification (SVC) problem by leveraging the hierarchical structure of CWE-IDs and knowledge distillation. Specifically, we split a complex label distribution into sub-distributions based on CWE abstract types (i.e., categorizations that group similar CWE-IDs), so similar CWE-IDs can be grouped and each group will have a more balanced label distribution. We learn TextCNN teachers on each of the simplified distributions; however, each teacher only performs well within its own group. Thus, we build a transformer student model to generalize the performance of TextCNN teachers through our hierarchical knowledge distillation framework. We compare our approach with source code transformer models as well as long-tailed learning approaches proposed in the vision domain. Through an extensive evaluation using 8,636 real-world vulnerabilities, our approach outperforms all of the baselines by 1.97%-13.89%.
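The group-wise teacher-to-student distillation described above can be sketched as follows, assuming each teacher covers one slice of the student's output classes and a standard temperature-scaled KL objective; the slicing scheme and loss weighting are assumptions, not HSVC's exact framework.

```python
import torch
import torch.nn.functional as F

def group_distillation_loss(student_logits, teacher_logits_by_group,
                            group_slices, T=2.0):
    # Each teacher handles one group of CWE-ID classes; the student
    # matches each teacher's softened distribution on its class slice.
    loss = 0.0
    for logits_t, sl in zip(teacher_logits_by_group, group_slices):
        p_t = F.softmax(logits_t / T, dim=-1)
        log_p_s = F.log_softmax(student_logits[:, sl] / T, dim=-1)
        loss = loss + F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T
    return loss / len(group_slices)

# usage: 2 groups covering classes [0:5) and [5:12) of a 12-way student
student = torch.randn(8, 12)
teachers = [torch.randn(8, 5), torch.randn(8, 7)]
slices = [slice(0, 5), slice(5, 12)]
loss = group_distillation_loss(student, teachers, slices)
```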
Our framework can be applied to any transformer-based SVC model such as CodeBERT, GraphCodeBERT, and CodeGPT, with slight modifications. Training code and pre-trained models are available at https://github.com/HSVC-TEAM/HSVC. ","Transformers-based models, Knowledge distillation, Long-tailed learning, Software vulnerability classification" HAS IT REALLY IMPROVED? KNOWLEDGE GRAPH BASED SEPARATION AND FUSION FOR RECOMMENDATION,https://openreview.net/forum?id=Su04-8n0ia4,https://openreview.net/pdf?id=Su04-8n0ia4,,"In this paper, we study knowledge graph (KG) based recommendation systems. We first design a metric to study the relationship between different SOTA models and find that current KG-based recommendation systems have a poor ability to retain collaborative filtering signals, and that higher-order connectivity introduces noise. In addition, we explore collaborative filtering recommendation using GNNs and design experiments to show that GNN models with different numbers of stacked layers learn different information, which offers a new perspective for explaining the unstable performance of GNNs with different depths. According to the above findings, we first design the parameter-free, model-agnostic Cross-Layer Fusion Mechanism to improve the performance of GNNs. Experimental results on three datasets for collaborative filtering show that the Cross-Layer Fusion Mechanism is effective for improving GNN performance. Then we design three independent signal extractors to mine the data from three different perspectives and train them separately. Finally, we use the signal fusion mechanism to fuse different signals. Experimental results on three datasets that incorporate KGs show that our KGSF achieves significant improvements over current SOTA KG based recommendation methods and the results are interpretable.","recommendation, knowledge-graph, graph neural network" Counterfactual Contrastive Learning for Robust Text Classification,https://openreview.net/forum?id=-4Z25gkP7Oi,https://openreview.net/pdf?id=-4Z25gkP7Oi,,"Text classification has recently been promoted by large pre-trained language models (PLMs) which aim to identify target classes with knowledge transferred from sets of reading comprehension tasks. However, derivative models of PLMs still suffer from unstable performance across datasets for multiple reasons, such as cross-domain and label-imbalance problems, from which most models may learn spurious correlations between texts and labels. Existing research requires people to manually add counterfactual samples to the dataset or automatically match so-called counterfactual pairs that are already in the dataset for augmentation. In this paper, we propose a novel LDA-based counterfactual contrastive learning framework and three data augmentation methods, to capture the causal information in texts, which can promote the robustness of text classification. To confirm the effectiveness of our proposed model and methods, we design and conduct several sets of experiments. Experimental results demonstrate that our model works well on five popular text classification datasets on distinct tasks; we find that training with the proposed data augmentation outperforms other augmentation methods on many strong models by 1\% or more.
Moreover, robustness tests on different datasets show competitive performance, which demonstrates the effectiveness of our model and data augmentation methods.","Contrastive Learning, Representation Learning, Structural Causal Model" SAM as an Optimal Relaxation of Bayes,https://openreview.net/forum?id=k4fevFqSQcX,https://openreview.net/pdf?id=k4fevFqSQcX,"We show that SAM can be seen as a relaxation of Bayes, by using Fenchel conjugates.","Sharpness-aware minimization (SAM) and related adversarial deep-learning methods can drastically improve generalization, but their underlying mechanisms are not yet fully understood. Here, we establish SAM as a relaxation of the Bayes objective where the expected negative-loss is replaced by the optimal convex lower bound, obtained by using the so-called Fenchel biconjugate. The connection enables a new Adam-like extension of SAM to automatically obtain reasonable uncertainty estimates, while sometimes also improving its accuracy. By connecting adversarial and Bayesian methods, our work opens a new path to robustness. ","bayesian deep learning, sharpness-aware minimization, variational bayes, convex duality" Denoising MCMC for Accelerating Diffusion-Based Generative Models,https://openreview.net/forum?id=Ogh8umAChpo,https://openreview.net/pdf?id=Ogh8umAChpo,We combine MCMC and diffusion models to accelerate score-based sampling.,"Diffusion models are powerful generative models that simulate the reverse of diffusion processes using score functions to synthesize data from noise. The sampling process of diffusion models can be interpreted as solving the reverse stochastic differential equation (SDE) or the ordinary differential equation (ODE) of the diffusion process, which often requires up to thousands of discretization steps to generate a single image. This has sparked great interest in developing efficient integration techniques for reverse-S/ODEs. Here, we propose an orthogonal approach to accelerating score-based sampling: Denoising MCMC (DMCMC). DMCMC first uses MCMC to produce samples in the product space of data and variance (or diffusion time). Then, a reverse-S/ODE integrator is used to denoise the MCMC samples. Since MCMC traverses close to the data manifold, the computation cost of producing a clean sample for DMCMC is much less than that of producing a clean sample from noise. To verify the proposed concept, we show that Denoising Langevin Gibbs (DLG), an instance of DMCMC, successfully accelerates all six reverse-S/ODE integrators considered in this work on the tasks of CIFAR10 and CelebA-HQ-256 image generation. Notably, combined with integrators of Karras et al. (2022) and pre-trained score models of Song et al. (2021b), DLG achieves state-of-the-art results. In the limited number of score function evaluation (NFE) settings on CIFAR10, we have $3.86$ FID with $\approx 10$ NFE and $2.63$ FID with $\approx 20$ NFE. On CelebA-HQ-256, we have $6.99$ FID with $\approx 160$ NFE, which beats the current best record of Kim et al. (2022) among score-based models, $7.16$ FID with $4000$ NFE.","Markov Chain Monte Carlo, Diffusion Models, Score-Based Models" Learning on Large-scale Text-attributed Graphs via Variational Inference,https://openreview.net/forum?id=q0nmYciuuZN,https://openreview.net/pdf?id=q0nmYciuuZN,"We propose a GLEM framework to effectively fuse GNN and LM with scalability; SOTA results are achieved on OGB datasets.","This paper studies learning on text-attributed graphs (TAGs), where each node is associated with a text description.
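To make the MCMC side of the DMCMC recipe above concrete, here is a minimal unadjusted Langevin update driven by a score function; the toy Gaussian score below is an assumption for illustration, and DMCMC additionally samples in the joint space of data and noise level.

```python
import torch

def langevin_step(x, score_fn, step_size):
    # One unadjusted Langevin update: drift along the (estimated) score
    # of the data distribution plus Gaussian exploration noise.
    noise = torch.randn_like(x)
    return x + 0.5 * step_size * score_fn(x) + step_size ** 0.5 * noise

# usage: sample from a standard Gaussian, whose score is -x
score_fn = lambda x: -x
x = torch.randn(16, 2) * 5  # start far from the target
for _ in range(500):
    x = langevin_step(x, score_fn, step_size=0.1)
```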
An ideal solution for such a problem would integrate both the text and graph structure information with large language models and graph neural networks (GNNs). However, the problem becomes very challenging when graphs are large due to the high computational complexity of large language models and of training GNNs on big graphs. In this paper, we propose an efficient and effective solution to learning on large text-attributed graphs by fusing graph structure and language learning with a variational Expectation-Maximization (EM) framework, called GLEM. Instead of simultaneously training large language models and GNNs on big graphs, GLEM proposes to alternately update the two modules in the E-step and M-step. Such a procedure allows the two modules to be trained separately while still interacting and mutually enhancing each other. Extensive experiments on multiple datasets demonstrate the efficiency and effectiveness of the proposed approach. ","Language Model, Graph Neural Network, Node Classification" Efficient Shapley Values Estimation by Amortization for Text Classification,https://openreview.net/forum?id=QcTbkoBycwk,https://openreview.net/pdf?id=QcTbkoBycwk,We recognize the stability issue in model interpretation for text classifiers and propose an amortized approach to generate stable interpretations efficiently.,"Despite the popularity of Shapley Values in explaining neural text classification models, computing them is prohibitive for large pretrained models because it requires a large number of model evaluations over various perturbed text inputs. In practice, Shapley Values are often estimated stochastically with a smaller number of model evaluations. However, we find that the estimated Shapley Values are quite sensitive to random seeds: the top-ranked features often have little overlap under two different seeds, especially on examples with longer input text. As a result, a much larger number of model evaluations is needed to reduce the sensitivity to an acceptable level. To mitigate the trade-off between stability and efficiency, we develop an amortized model that directly predicts Shapley Values of each input feature without additional model evaluation. It is trained on a set of examples with Shapley Values estimated from a large number of model evaluations to ensure stability. Experimental results on two text classification datasets demonstrate that the proposed amortized model can estimate black-box explanation scores in milliseconds per sample at inference time and is up to 60 times more efficient than traditional methods.","text classification, model interpretation, amortization" SplitMixer: Fat Trimmed From MLP-like Models,https://openreview.net/forum?id=rmU3K_ekONM,https://openreview.net/pdf?id=rmU3K_ekONM,"We present a simple and lightweight isotropic MLP-like architecture for visual recognition that performs on par with existing models with much less computation.","We present SplitMixer, a simple and lightweight isotropic MLP-like architecture, for visual recognition. It contains two types of interleaving convolutional operations to mix information across spatial locations (spatial mixing) and channels (channel mixing). The first one includes sequentially applying two depthwise 1D kernels, instead of a 2D kernel, to mix spatial information.
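The factorized spatial mixing just described can be sketched directly, assuming a k x k depthwise convolution is replaced by a 1 x k followed by a k x 1 depthwise convolution; the kernel size and padding below are illustrative choices.

```python
import torch
import torch.nn as nn

class Factorized1DSpatialMixing(nn.Module):
    # Replace a k x k depthwise conv with a 1 x k followed by a k x 1
    # depthwise conv, cutting spatial-mixing parameters from k*k to 2k
    # per channel.
    def __init__(self, dim, kernel_size=5):
        super().__init__()
        pad = kernel_size // 2
        self.conv_w = nn.Conv2d(dim, dim, (1, kernel_size),
                                padding=(0, pad), groups=dim)
        self.conv_h = nn.Conv2d(dim, dim, (kernel_size, 1),
                                padding=(pad, 0), groups=dim)

    def forward(self, x):
        return self.conv_h(self.conv_w(x))

x = torch.randn(2, 64, 32, 32)
y = Factorized1DSpatialMixing(64)(x)  # same shape as x
```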
The second one is splitting the channels into overlapping or non-overlapping segments, with or without shared parameters, and applying our proposed channel mixing approaches or 3D convolution to mix channel information. Depending on design choices, a number of SplitMixer variants can be constructed to balance accuracy, the number of parameters, and speed. We show, both theoretically and experimentally, that SplitMixer performs on par with the state-of-the-art MLP-like models while having significantly fewer parameters and FLOPs. For example, without strong data augmentation and optimization, SplitMixer achieves around 94\% accuracy on CIFAR-10 with only 0.28M parameters, while ConvMixer achieves the same accuracy with about 0.6M parameters. The well-known MLP-Mixer achieves 85.45\% with 17.1M parameters. On the CIFAR-100 dataset, SplitMixer achieves around 73\% accuracy, on par with ConvMixer, but with about 52\% fewer parameters and FLOPs. We hope that our results spark further research towards finding more efficient vision architectures and facilitate the development of MLP-like models. Code is available at [Masked].","deep learning architectures, MLP-Mixer, model compression, visual classification" Multimedia Generative Script Learning for Task Planning,https://openreview.net/forum?id=IAIrNRktVWR,https://openreview.net/pdf?id=IAIrNRktVWR,"We introduce a new multimedia generative script learning task with a new benchmark; novel visually trackable, inductive, diverse script learning methods; and a new multimodal-retrieval-based metric.","Goal-oriented generative script learning aims to generate subsequent steps to reach a particular goal, which is an essential task to assist robots or humans in performing stereotypical activities in daily life. However, an important aspect of this process is the ability to capture historical states visually, which provides detailed information that is not covered by text and will guide subsequent steps. Therefore, we propose a new task, Multimedia Generative Script Learning, to generate subsequent steps by tracking historical states in both text and vision modalities, as well as presenting the first benchmark containing 5,652 tasks and 79,089 steps with descriptive images. This task is challenging in three aspects: the multimedia challenge of capturing the visual states in images, the induction challenge of performing unseen tasks, and the diversity challenge of covering different information in individual steps. We propose to encode visual state changes through a selective multimedia encoder to address the multimedia challenge, transfer knowledge from previously observed tasks using a retrieval-augmented decoder to overcome the induction challenge, and further present distinct information at each step by optimizing a diversity-oriented contrastive learning objective. We define metrics to evaluate both generation quality and inductive quality. Experiment results demonstrate that our approach significantly outperforms strong baselines.","multimedia generative script learning, contrastive learning, retrieval-augmented generation, selective multimedia encoding, procedure planning" On Assimilating Learned Views in Contrastive Learning,https://openreview.net/forum?id=ml8_xBoTnVA,https://openreview.net/pdf?id=ml8_xBoTnVA,,"Transformations based on domain expertise (expert transformations), such as random-resized-crop and color-jitter, have proven critical to the success of contrastive learning techniques such as SimCLR.
Recently, several attempts have been made to replace such domain-specific, human-designed transformations with generated views that are learned. However, for imagery data, so far none of these view generation methods has been able to outperform expert transformations. In this work, we tackle a different question: instead of replacing expert transformations with generated views, can we constructively assimilate generated views with expert transformations? We answer this question in the affirmative. To do so, we first propose an information-theoretic framework for designing view generation based on the analysis of Tian et al. (2020b) on what makes a ""good"" view in contrastive learning. Then, we present two simple yet effective assimilation methods that together with our view generation mechanisms improve the state-of-the-art by up to approximately 3.5% on four different datasets. Importantly, we conduct a detailed empirical study that systematically analyzes a range of view generation and assimilation methods and provides a holistic picture of the efficacy of learned views in contrastive representation learning.","Contrastive Learning, Self-Supervised Learning, Generative Models, Mutual Information" Upcycled-FL: Improving Accuracy and Privacy with Less Computation in Federated Learning,https://openreview.net/forum?id=10tgIzcC2vY,https://openreview.net/pdf?id=10tgIzcC2vY,We propose a federated learning framework that improves the accuracy-privacy tradeoff with less computation.,"Federated learning (FL) is a distributed learning paradigm that allows multiple decentralized edge devices to collaboratively learn toward a common objective without sharing local data. Although local data is not exposed directly, privacy concerns nonetheless exist as sensitive information can be inferred from intermediate computations. As the same data is repeatedly used over an iterative process, information leakage accumulates substantially over time, making it difficult to balance the trade-off between privacy and accuracy. In this paper, we introduce Upcycled-FL, a novel federated learning framework, where first-order approximation is applied at every even iteration. Under such a scheme, half of the steps incur no privacy loss and require much less computation. Theoretically, we establish the convergence rate of Upcycled-FL and provide privacy analysis based on objective and output perturbations. Experiments on real-world data show that Upcycled-FL consistently outperforms existing methods over heterogeneous data, and significantly improves the privacy-accuracy trade-off, while reducing training time by 48% on average.","Federated Learning, Differential Privacy" Dataset Projection: Finding Target-aligned Subsets of Auxiliary Data,https://openreview.net/forum?id=J3_WcZW3oV1,https://openreview.net/pdf?id=J3_WcZW3oV1,We project datasets to find subsets of auxiliary datasets that are most aligned with a target dataset.,"To obtain more training data for a target task, one can draw upon related but distinct datasets, or auxiliary datasets. We put forth the problem of dataset projection---finding subsets of auxiliary datasets that are most aligned with a target dataset. These so-called projected datasets can be used as training data to improve performance on target tasks while being substantially smaller than the auxiliary dataset.
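One simple way to make the notion of target alignment concrete (an illustrative heuristic, not the framework developed next) is to score each auxiliary example by its similarity to the target set in some embedding space and keep the best-scoring subset:

```python
import numpy as np

def project_dataset(aux_embeds, target_embeds, budget):
    # Greedy nearest-neighbor heuristic: score each auxiliary example by
    # its maximum cosine similarity to any target example, then keep the
    # top `budget` examples as the projected dataset.
    a = aux_embeds / np.linalg.norm(aux_embeds, axis=1, keepdims=True)
    t = target_embeds / np.linalg.norm(target_embeds, axis=1, keepdims=True)
    scores = (a @ t.T).max(axis=1)
    return np.argsort(-scores)[:budget]

aux = np.random.randn(10000, 128)   # embeddings of the auxiliary dataset
tgt = np.random.randn(500, 128)     # embeddings of the target dataset
subset_idx = project_dataset(aux, tgt, budget=1000)
```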
We then develop a framework for solving such dataset projection problems and demonstrate in a variety of vision and language settings that the resulting projected datasets, when compared to the original auxiliary datasets, (1) are closer approximations of target datasets and (2) can be used to improve test performance or provide analysis for the target datasets. ","datasets, auxiliary data, dataset projection" Rethinking Identity in Knowledge Graph Embedding,https://openreview.net/forum?id=bvwZ43dY2xj,https://openreview.net/pdf?id=bvwZ43dY2xj,"We scrutinize the identity relation in knowledge graphs, find that bilinear based models fail to uniquely model it, and propose a solution with other good properties.","Knowledge Graph Embedding (KGE) is a common method to complete real-world Knowledge Graphs (KGs) by learning the embeddings of entities and relations. Beyond specific KGE models, previous work proposes a general framework based on group theory. A group has a special identity element that uniquely corresponds to the identity relation in KGs, which implies that identity should be represented uniquely. However, we find that this uniqueness cannot be modeled by bilinear based models, revealing a gap between the framework and the models. To this end, we study the required conditions and propose a solution named Unit Ball Bilinear Model (UniBi). In addition to the theoretical superiority, UniBi is more robust and interpretable. Experiments demonstrate that UniBi models the uniqueness without the cost of performance and verify its robustness and interpretability. ","Knowledge Graph Embedding, Knowledge Graph Completion, Bilinear Based Models" Eigen Memory Trees,https://openreview.net/forum?id=1DOS0kifqeP,https://openreview.net/pdf?id=1DOS0kifqeP,We create an episodic memory model for online learning and evaluate it for solving contextual bandit problems. ,"This work introduces the Eigen Memory Tree (EMT), a novel online memory model for sequential learning scenarios. EMTs store data at the leaves of a binary tree, and route new samples through the structure using the principal components of previous experiences, facilitating efficient (logarithmic) access to relevant memories. We demonstrate that EMT outperforms existing online memory approaches, and provide a hybridized EMT-parametric algorithm that enjoys drastically improved performance over purely parametric methods with nearly no downsides. Our findings are validated using 206 datasets from the OpenML repository in both bounded and infinite memory budget situations. ","Episodic Memory, Contextual Bandits, Sequential Learning" Energy-based Predictive Representation for Reinforcement Learning,https://openreview.net/forum?id=aCCRmE3Pglv,https://openreview.net/pdf?id=aCCRmE3Pglv,"We propose a novel predictive state representation with energy-based models that shows superior performance on POMDPs.","In real-world applications, it is usually necessary for a reinforcement learning algorithm to handle partial observability beyond Markov decision processes (MDPs). Although the partially observable Markov decision process (POMDP) has been precisely motivated for this requirement, such a formulation raises significant computational and statistical hardness challenges in learning and planning.
In this work, we introduce the Energy-based Predictive Representation (EPR), which leads to a unified framework for practical reinforcement learning algorithm design in both MDP and POMDP settings, handling learning, exploration, and planning in a coherent way. The proposed approach relies on a powerful neural energy-based model to extract a sufficient representation, from which Q-functions can be efficiently approximated. With such a representation, we develop an efficient approach for computing confidence estimates, which allows optimism/pessimism in the face of uncertainty to be efficiently implemented in planning, hence managing the exploration versus exploitation tradeoff. An experimental investigation shows that the proposed algorithm can surpass state-of-the-art performance in both MDP and POMDP settings in comparison to existing baselines.","Energy-based Models, Predictive State Representation, Partially Observable Markov Decision Process, Reinforcement Learning" Which Invariance Should We Transfer? A Causal Minimax Learning Approach,https://openreview.net/forum?id=YP4QEmqh6Ia,https://openreview.net/pdf?id=YP4QEmqh6Ia,"This paper proposes to identify the optimal subset of invariance to transfer, in order to achieve robustness in supervised regression scenario.","A major barrier to deploying current machine learning models lies in their sensitivity to dataset shifts. To resolve this problem, most existing studies attempted to transfer stable information to unseen environments. Among these, graph-based methods causally decomposed the data generating process into stable and mutable mechanisms. By removing the effect of mutable generation, they identified a set of stable predictors. However, a key question regarding robustness remains: which subset of the whole stable information should the model transfer, in order to achieve optimal generalization ability? To answer this question, we provide a comprehensive minimax analysis that fully characterizes conditions for a subset to be optimal. Particularly in general cases, we propose to maximize over mutable mechanisms (i.e., the source of dataset shifts), which provably identifies the worst-case risk over all environments. This enables us to select the optimal subset with the minimal worst-case risk. To reduce computational costs, we propose to search over only equivalence classes in terms of worst-case risk, instead of over all subsets. In cases where the search space is still large, we turn this subset selection problem into a sparse min-max optimization scheme, which enjoys the simplicity and efficiency of implementation. The utility of our methods is demonstrated on the diagnosis of Alzheimer's Disease and gene function prediction. ","robustness, minimax, subset selection, causal model, g-equivalence" Exclusive Supermask Subnetwork Training for Continual Learning,https://openreview.net/forum?id=iMy1hOrqiVE,https://openreview.net/pdf?id=iMy1hOrqiVE,,"Continual Learning (CL) methods mainly focus on avoiding catastrophic forgetting and learning representations that are transferable to new tasks. Recently, Wortsman et al. (2020) proposed a CL method, SupSup, which uses a randomly initialized, fixed base network (model) and finds a supermask for each new task that selectively keeps or removes each weight to produce a subnetwork. They prevent forgetting as the network weights are not being updated. Although there is no forgetting, the performance of the supermask is sub-optimal because fixed weights restrict its representational power.
Furthermore, there is no accumulation or transfer of knowledge inside the model when new tasks are learned. Hence, we propose ExSSNeT (Exclusive Supermask SubNEtwork Training), which performs exclusive and non-overlapping subnetwork weight training. This avoids conflicting updates to the shared weights by subsequent tasks to improve performance while still preventing forgetting. Furthermore, we propose a novel KNN-based Knowledge Transfer (KKT) module that dynamically initializes a new task's mask based on previous tasks for improving knowledge transfer. We demonstrate that ExSSNeT outperforms SupSup and other strong previous methods on both text classification and vision tasks while preventing forgetting. Moreover, ExSSNeT is particularly advantageous for sparse masks that activate 2-10% of the model parameters, resulting in an average improvement of 8.3% over SupSup. Additionally, ExSSNeT scales to a large number of tasks (100), and our KKT module helps to learn new tasks faster while improving the overall performance.", Dual personalization for federated recommendation on devices,https://openreview.net/forum?id=8VvQ4SpvZVi,https://openreview.net/pdf?id=8VvQ4SpvZVi,,"Federated recommendation is a new Internet service architecture that aims to provide privacy-preserving recommendation services in federated settings. Existing solutions typically combine distributed recommendation algorithms with privacy-preserving mechanisms, and thus inherently take the form of heavyweight models at the server, hindering the deployment of on-device intelligent models to end-users. This paper proposes a novel Personalized Federated Recommendation (PFedRec) framework to learn many user-specific lightweight models to be deployed on smart devices rather than a heavyweight model on a server. Moreover, we propose a new dual personalization mechanism to effectively learn fine-grained personalization on both users and items. The overall learning process is formulated into a unified federated optimization framework. Specifically, unlike previous methods that share exactly the same item embeddings across users in a federated system, dual personalization allows mild finetuning of item embeddings for each user to generate user-specific views of item representations, which can be integrated into existing federated recommendation methods to gain improvements immediately. Experiments on multiple benchmark datasets have demonstrated the effectiveness of PFedRec and the dual personalization mechanism. Moreover, we provide visualizations and in-depth analysis of the personalization techniques in item embedding, which shed novel insights on the design of RecSys in federated settings.","federated learning, personalization, recommendation system" Unsupervised Manifold Linearizing and Clustering,https://openreview.net/forum?id=--qiQPsCV94,https://openreview.net/pdf?id=--qiQPsCV94,,"Clustering data lying close to a union of low-dimensional manifolds, with each manifold as a cluster, is a fundamental problem in machine learning. When the manifolds are assumed to be linear subspaces, many methods succeed using low-rank and sparse priors, which have been studied extensively over the past two decades. Unfortunately, most real-world datasets cannot be well approximated by linear subspaces. On the other hand, several works have proposed to identify the manifolds by learning a feature map such that the data transformed by the map lie in a union of linear subspaces, even though the original data are from non-linear manifolds.
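A minimal sketch of the exclusive, non-overlapping update at the heart of ExSSNeT as described above: a task's supermask selects weights, but only weights unclaimed by earlier tasks receive gradient updates; the mask generation and learning-rate details are assumptions here.

```python
import torch

def exclusive_update(weight, grad, task_mask, used_mask, lr=0.1):
    # Train only the weights selected by this task's supermask that no
    # previous task has claimed, so earlier subnetworks are never
    # overwritten (no forgetting) while selected weights keep learning.
    free = task_mask * (1 - used_mask)
    with torch.no_grad():
        weight -= lr * grad * free
    return used_mask.maximum(task_mask)  # mark these weights as claimed

w = torch.randn(4, 4)                        # shared base weights
g = torch.randn(4, 4)                        # gradient for the current task
task_mask = (torch.rand(4, 4) < 0.3).float() # this task's supermask
used = torch.zeros(4, 4)                     # weights claimed so far
used = exclusive_update(w, g, task_mask, used)
```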
However, most works either assume knowledge of the membership of samples to clusters, or are shown to learn trivial representations. In this paper, we propose to simultaneously perform clustering and learn a union-of-subspace representation via Maximal Coding Rate Reduction. Experiments on synthetic and realistic datasets show that the proposed method achieves clustering accuracy comparable with state-of-the-art alternatives, while being more scalable and learning geometrically meaningful representations.","Clustering, Manifold Embedding, Manifold Clustering" Moderate Coreset: A Universal Method of Data Selection for Real-world Data-efficient Deep Learning,https://openreview.net/forum?id=7D5EECbOaf9,https://openreview.net/pdf?id=7D5EECbOaf9,,"Deep learning methods nowadays rely on massive data, resulting in substantial costs of data storage and model training. Data selection is a useful tool to alleviate such costs, where a coreset of massive data is extracted to practically perform on par with full data. Based on carefully-designed score criteria, existing methods first compute the score of each data point and then select the data points whose scores lie in a certain range to construct a coreset. These methods work well in their respective preconceived scenarios but are not robust to changes of scenario, since the optimal range of scores varies as the scenario changes. This issue limits the application of these methods, because realistic scenarios often mismatch preconceived ones, and it is inconvenient or infeasible to tune the criteria and methods accordingly. In this paper, to address the issue, the concept of the moderate coreset is discussed. Specifically, given any score criterion of data selection, different scenarios prefer data points with scores in different intervals. As the score median is a proxy of the score distribution in statistics, the data points with scores close to the score median can be seen as a proxy of the full data that generalizes across scenarios, and are used to construct the moderate coreset. As a proof-of-concept, a universal method that inherits the moderate coreset and uses the distance of a data point to its class center as the score criterion is proposed to meet complex realistic scenarios. Extensive experiments confirm the advantage of our method over prior state-of-the-art methods, leading to a strong baseline for future research.", Effectively Clarify Confusion via Visualized Aggregation and Separation of Deep Representation,https://openreview.net/forum?id=hPdMskOKGX6,https://openreview.net/pdf?id=hPdMskOKGX6,,"Clarifying confusion is the most critical issue for improving classification performance. Confusion occurs with almost all classification models but tends to be ignored in well-performing models. Current mainstream research mainly focuses on resolving confusion in specific cases, such as data insufficiency and class imbalance. We believe that mining the commonalities of the same class and the gaps among different classes will effectively clarify the confusion. In this paper, we propose a novel, simple and intuitive Aggregation Separation Loss (ASLoss), as an adjunct to the classification loss to clarify the confusion in some common cases. The ASLoss aggregates the representations of same-class samples as closely as possible and separates the representations of different classes as far as possible. We use two image classification tasks with three simultaneous confounding characteristics, i.e.,
data insufficiency, class imbalance, and unclear class evidence to demonstrate ASLoss. Representation visualization, confusion comparison and detailed comparison experiments are conducted. The results show that representations in deep spaces extracted by ASLoss are sufficiently clear and distinguishable, that the confusion among different classes is significantly clarified, and that the optimal network using ASLoss reaches the state-of-the-art level.","Data Insufficiency, Class Imbalance, Evidence Unclarity, Confusion, Representation Learning" Time-Transformer AAE: Connecting Temporal Convolutional Networks and Transformer for Time Series Generation,https://openreview.net/forum?id=fI3y_Dajlca,https://openreview.net/pdf?id=fI3y_Dajlca,"A novel time series generative model, bridging Temporal Convolutional Networks and Transformer via a layer-wise parallel structure","Generating time series data is a challenging task due to the complex temporal properties of this type of data. Such temporal properties typically include local correlations as well as global dependencies. Most existing generative models have failed to effectively learn both the local and global properties of time series data. To address this open problem, we propose a novel time series generative model consisting of an adversarial autoencoder (AAE) and a newly designed architecture named `Time-Transformer' within the decoder. We call this generative model `Time-Transformer AAE'. The Time-Transformer first simultaneously learns local and global features in a layer-wise parallel design, combining the abilities of Temporal Convolutional Networks (TCN) and Transformer in extracting local features and global dependencies, respectively. Second, a bidirectional cross-attention is proposed to provide complementary guidance across the two branches and achieve proper fusion between local and global features. Experimental results demonstrate that our model can outperform existing state-of-the-art models in most cases, especially when the data contains both global and local properties. We also show our model's ability to perform a downstream task: data augmentation to support the solution of imbalanced classification problems.","Time Series Generation, Adversarial Autoencoder, Temporal Convolutional Networks, Transformer" A comparison of dataset distillation and active learning in text classification,https://openreview.net/forum?id=UqmL1Oc4bCw,https://openreview.net/pdf?id=UqmL1Oc4bCw,,"Deep learning has achieved great success over the past few years in different aspects ranging from computer vision to natural language processing. However, the huge size of data in deep learning has always been a thorny problem in learning the underlying distribution and tackling various human tasks. To alleviate this problem, knowledge distillation has been proposed to simplify the model, and later dataset distillation was proposed as a new method of reducing dataset sizes, which aims to synthesize a small number of samples that contain all the information of a very large dataset. Meanwhile, active learning is also an effective method to reduce dataset sizes by selecting only the most significant samples for labeling from the original dataset. In this paper, we explore the discrepancies in the principles of dataset distillation and active learning, and evaluate the two algorithms on an NLP classification dataset: the Stanford Sentiment Treebank.
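The aggregation-separation idea behind ASLoss described above can be sketched as a pairwise loss over a batch of deep representations; the margin value and hinge form below are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def aggregation_separation_loss(feats, labels, margin=10.0):
    # Pull same-class representations together and push different-class
    # representations apart (hinge on pairwise distance), computed over
    # all pairs in the batch.
    d = torch.cdist(feats, feats)  # pairwise Euclidean distances
    same = (labels[:, None] == labels[None, :]).float()
    same.fill_diagonal_(0)
    diff = 1.0 - same - torch.eye(len(labels))
    agg = (d * same).sum() / same.sum().clamp_min(1)
    sep = (torch.relu(margin - d) * diff).sum() / diff.sum().clamp_min(1)
    return agg + sep

feats = torch.randn(16, 64)           # deep representations of a batch
labels = torch.randint(0, 4, (16,))   # their class labels
loss = aggregation_separation_loss(feats, labels)
```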
Our experiments show that distilled data amounting to 0.1% of the original text achieves approximately 88% accuracy, while the actively selected data achieves 52% of the performance of the original data. ","knowledge distillation, dataset distillation, active learning, text classification" Temporally Consistent Video Transformer for Long-Term Video Prediction,https://openreview.net/forum?id=NQuCQoHqqSY,https://openreview.net/pdf?id=NQuCQoHqqSY,An efficient temporally consistent video prediction model able to generate long videos referencing hundreds of frames of past context in complex 3D environments and Kinetics-600.,"Generating long, temporally consistent video remains an open challenge in video generation. Primarily due to computational limitations, most prior methods limit themselves to training on a small subset of frames that are then extended to generate longer videos in a sliding-window fashion. Although these techniques may produce sharp videos, they have difficulty retaining long-term temporal consistency due to their limited context length. In this work, we present Temporally Consistent Video Transformer (TECO), a vector-quantized latent dynamics video prediction model that learns compressed representations to efficiently condition on long videos of hundreds of frames during both training and generation. We use a MaskGit prior for dynamics prediction which enables both sharper and faster generations compared to prior work. Our experiments show that TECO outperforms SOTA baselines in a variety of video prediction benchmarks ranging from simple mazes in DMLab to large 3D worlds in Minecraft and complex real-world videos from Kinetics-600. In addition, to better understand the capabilities of video prediction models in modeling temporal consistency, we introduce several challenging video prediction tasks consisting of agents randomly traversing 3D scenes of varying difficulty. This presents a challenging benchmark for video prediction in partially observable environments where a model must understand what parts of the scenes to re-create versus invent depending on its past observations or generations. An anonymized website with samples can be found at https://sites.google.com/view/iclr23-teco","video generation, video prediction, generative modeling, latent dynamics models" Extreme Q-Learning: MaxEnt RL without Entropy,https://openreview.net/forum?id=SJ0Lde3tRL,https://openreview.net/pdf?id=SJ0Lde3tRL,Introduce a novel framework for Q-learning that models the maximal soft-values without needing to sample from a policy and reaches SOTA performance on online and offline RL settings.,"Modern Deep Reinforcement Learning (RL) algorithms require estimates of the maximal Q-value, which are difficult to compute in continuous domains with an infinite number of possible actions. In this work, we introduce a new update rule for online and offline RL which directly models the maximal value using Extreme Value Theory (EVT), inspired by economics. By doing so, we avoid computing Q-values using out-of-distribution actions, which is often a substantial source of error. Our key insight is to introduce an objective that directly estimates the optimal soft-value functions (LogSumExp) in the maximum entropy (MaxEnt) RL setting without needing to sample from a policy.
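A sketch of the kind of Gumbel (linex) regression objective this framework builds on, whose minimizer is the soft value $V(s) = \beta \log \mathbb{E}[\exp(Q(s,a)/\beta)]$ under the sampling distribution, estimated without evaluating extra actions; the clipping constant below is a common numerical-stability trick and an assumption here.

```python
import torch

def gumbel_regression_loss(q_values, v_pred, beta=1.0):
    # Linex / Gumbel regression: minimizing E[exp(z) - z - 1] with
    # z = (Q - V) / beta drives V toward the LogSumExp (soft max) of Q.
    z = (q_values - v_pred) / beta
    z = torch.clamp(z, max=5.0)  # clip the exponent for stability
    return (torch.exp(z) - z - 1.0).mean()

q = torch.randn(256)                       # sampled Q(s, a) estimates
v = torch.zeros(256, requires_grad=True)   # soft-value predictions
loss = gumbel_regression_loss(q, v)
loss.backward()
```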
Finally, our method obtains strong results in the offline D4RL benchmark, outperforming prior works by 10-20 points on some tasks, while offering moderate improvements over SAC and TD3 on online DM Control tasks. ","reinforcement learning, offline reinforcement learning, statistical learning, extreme value analysis, maximum entropy rl, gumbel" Autoencoding Hyperbolic Representation for Adversarial Generation,https://openreview.net/forum?id=pmUH7A8wZz,https://openreview.net/pdf?id=pmUH7A8wZz,A hyperbolic generative network numerically stable for generating complex data.,"With the recent advance of geometric deep learning, neural networks have been extensively used for data in non-Euclidean domains. In particular, hyperbolic neural networks have proved successful in processing hierarchical information of data. However, many hyperbolic neural networks are numerically unstable during training, which precludes using complex architectures. This crucial problem makes it difficult to build hyperbolic generative models for real and complex data. In this work, we propose a hyperbolic generative network in which we design a novel architecture and layers to guarantee stable training. Our proposed network contains three parts: first, a hyperbolic autoencoder (AE) that produces hyperbolic embeddings for input data; second, a hyperbolic generative adversarial network (GAN) for generating the hyperbolic latent embedding of the AE from simple noise; third, a generator that inherits the decoder from the AE and the generator from the GAN. We call this network the hyperbolic AE-GAN, or HAEGAN for short. The architecture of HAEGAN fosters expressive representation in the hyperbolic space, and the specific design of layers ensures numerical stability. Experiments show that HAEGAN is able to generate complex data with state-of-the-art structure-related performance.","deep learning, hyperbolic neural network, numerical stability, generative models" CAREER: Transfer Learning for Economic Prediction of Labor Data,https://openreview.net/forum?id=lyjMArzIxH6,https://openreview.net/pdf?id=lyjMArzIxH6,"We develop CAREER, a transformer-based model of job sequence data, which is pretrained on large resume datasets and fine-tuned to survey datasets widely used by labor economists.","Labor economists regularly analyze employment data by fitting predictive models to small, carefully constructed longitudinal survey datasets. Although modern machine learning methods offer promise for such problems, these survey datasets are too small to take advantage of them. In recent years, large datasets of online resumes have also become available, providing data about the career trajectories of millions of individuals. However, standard econometric models cannot take advantage of their scale or incorporate them into the analysis of survey data. To this end, we develop CAREER, a transformer-based model that uses transfer learning to learn representations of job sequences. CAREER is first fit to large, passively-collected resume data and then fine-tuned to smaller, better-curated datasets for economic inferences. We fit CAREER to a dataset of 24 million job sequences from resumes, and fine-tune its representations on longitudinal survey datasets. We find that CAREER forms accurate predictions of job sequences, achieving state-of-the-art predictive performance on three widely-used economics datasets.
We further find that CAREER can be used to form good predictions of other downstream variables; incorporating CAREER into a wage model provides better predictions than the econometric models currently in use.","economics, transfer learning" Federated Nearest Neighbor Machine Translation,https://openreview.net/forum?id=R1U5G2spbLd,https://openreview.net/pdf?id=R1U5G2spbLd,We propose a novel federated nearest neighbor machine translation framework to build low-overhead privacy-preserving MT systems in FL settings.,"To protect user privacy and meet legal regulations, federated learning (FL) is attracting significant attention. Training neural machine translation (NMT) models with traditional FL algorithms (e.g., FedAvg) typically relies on multi-round model-based interactions. However, this is impractical and inefficient for machine translation tasks due to the vast communication overheads and heavy synchronization. In this paper, we propose a novel federated nearest neighbor (FedNN) machine translation framework that, instead of multi-round model-based interactions, leverages one-round memorization-based interaction to share knowledge across different clients to build low-overhead privacy-preserving systems. The whole approach equips the public NMT model trained on large-scale accessible data with a $k$-nearest-neighbor ($k$NN) classifier and integrates the external datastore constructed from private text data in all clients to form the final FL model. A two-phase datastore encryption strategy is introduced to achieve privacy preservation during this process. Extensive experiments show that FedNN significantly reduces computational and communication costs compared with FedAvg, while maintaining promising performance in different FL settings.","Machine Translation, Federated Learning, Memorization Augmentation" Latent Variable Representation for Reinforcement Learning,https://openreview.net/forum?id=mQpmZVzXK1h,https://openreview.net/pdf?id=mQpmZVzXK1h,"We show how the latent variable model can be used for representation learning in reinforcement learning, with superior empirical performance as well as complete sample complexity analysis.","Deep latent variable models have achieved significant empirical successes in model-based reinforcement learning (RL) due to their expressiveness in modeling complex transition dynamics. On the other hand, it remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of RL. In this paper, we provide a representation view of the latent variable models for state-action value functions, which allows both a tractable variational learning algorithm and an effective implementation of the optimism/pessimism principle in the face of uncertainty for exploration. In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models. Theoretically, we establish the sample complexity of the proposed approach in the online and offline settings.
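The $k$NN-augmented decoding that FedNN builds on (described just above) can be sketched as interpolating the NMT model's distribution with a distribution over tokens retrieved from the datastore; the datastore layout, temperature, and interpolation weight below are illustrative assumptions.

```python
import torch

def knn_mt_probs(model_logits, query, keys, values, vocab_size,
                 k=8, temperature=10.0, lam=0.5):
    # Interpolate the NMT model's next-token distribution with one built
    # from the k nearest datastore entries (hidden-state key -> token).
    dists = ((keys - query) ** 2).sum(-1)
    knn = dists.topk(k, largest=False)
    weights = torch.softmax(-knn.values / temperature, dim=-1)
    p_knn = torch.zeros(vocab_size).scatter_add_(
        0, values[knn.indices], weights)
    p_model = torch.softmax(model_logits, dim=-1)
    return lam * p_knn + (1 - lam) * p_model

keys = torch.randn(1000, 16)              # datastore: decoder hidden states
values = torch.randint(0, 100, (1000,))   # ...and the tokens they emitted
probs = knn_mt_probs(torch.randn(100), torch.randn(16), keys, values, 100)
```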
Empirically, we demonstrate superior performance over current state-of-the-art algorithms across various benchmarks.","Latent Variable Model, Markov Decision Processes, Reinforcement Learning" Look in The Mirror: Molecular Graph Contrastive Learning with Line Graph,https://openreview.net/forum?id=pzH2Sltp2--,https://openreview.net/pdf?id=pzH2Sltp2--,,"Label scarcity in molecular property prediction and drug design has motivated graph contrastive learning. A general contrastive model consists of a view generator, view encoder, and contrastive loss, in which the view mainly controls the encoded information underlying input graphs. Leading contrastive learning works show two kinds of view generators, that is, random or learnable data corruption and domain knowledge incorporation. While effective, the two approaches lead to altered molecular semantics and limited generalization capability, respectively. Thus, a view that fully retains molecular semantics without requiring profound domain knowledge is needed. To this end, we relate molecular graph contrastive learning with the line graph and propose a novel method termed LGCL. Specifically, by contrasting the given graph with the corresponding line graph, the graph encoder can freely encode the molecular semantics without omission. To address the information inconsistency and over-smoothing that arise during learning from the mismatched pace of message passing in the two kinds of graphs, we present a remedy with edge-attribute fusion and two local contrastive losses. Compared with state-of-the-art (SOTA) methods for view generation, superior performance on molecular property prediction suggests the effectiveness of line graphs serving as the contrastive views.", Precision Collaboration for Federated Learning,https://openreview.net/forum?id=pQL-sBfD4I,https://openreview.net/pdf?id=pQL-sBfD4I,This paper investigates a precision collaboration mechanism for federated learning.,"Inherent heterogeneity of local data distributions, which causes inefficient model learning and significant degradation of model performance, has been a key challenge in Federated Learning (FL). So far, plenty of efforts have focused on addressing data heterogeneity by relying on a hypothetical clustering structure or a consistent information sharing mechanism. However, because of the diversity of the real-world local data, these assumptions may be largely violated. In this work, we argue that information sharing is mostly fragmented in the federated network in reality. More specifically, the distribution overlaps are not consistent but scattered in local clients. To this end, we propose the concept ``Precision Collaboration'', which refers to learning from the informative overlaps precisely while avoiding the potential negative transfer induced by others. In particular, we propose to infer the local data manifolds and estimate the exact local data density simultaneously. The learned manifold aims to precisely identify the overlaps from other clients, and the estimated likelihood allows generating samples from the manifold at an optimal sampling density.
Experiments show that our proposed PCFL significantly outperforms baselines on benchmarks and a real-world clinical scenario.","federated learning, personalized federated learning" RLSBench: A Large-Scale Empirical Study of Domain Adaptation Under Relaxed Label Shift,https://openreview.net/forum?id=kLvYYV-YK_j,https://openreview.net/pdf?id=kLvYYV-YK_j,"A large scale study of popular domain adaptation methods under scenarios where both label distribution and conditionals p(x|y) may shift, highlighting the brittleness of existing methods and simple fixes that improve performance.","Despite the emergence of principled methods for domain adaptation under label shift (where only the class balance changes), the sensitivity of these methods to natural-seeming covariate shifts remains precariously underexplored. Meanwhile, popular deep domain adaptation heuristics, despite showing promise on benchmark datasets, tend to falter when faced with shifts in the class balance. Moreover, it is difficult to assess the state of the field owing to inconsistencies among relevant papers in evaluation criteria, datasets, and baselines. In this paper, we introduce \textsc{RLSbench}, a large-scale benchmark for such \emph{relaxed label shift} settings, consisting of 11 vision datasets spanning $>$200 distribution shift pairs with different class proportions. We evaluate 12 popular domain adaptation methods, demonstrating a more widespread susceptibility to failure under extreme shifts in the class proportions than was previously known. We develop an effective meta-algorithm, compatible with most deep domain adaptation heuristics, that consists of the following two steps: (i) \emph{pseudo-balance} the data at each epoch; and (ii) adjust the final classifier with (an estimate of) the target label distribution. Furthermore, we discover that batch-norm adaptation of a model trained on source with the aforementioned corrections offers a strong baseline, largely missing from prior comparisons. We hope that these findings and the availability of \textsc{RLSbench} will encourage researchers to rigorously evaluate proposed methods in relaxed label shift settings. Code is publicly available at https://github.com/ICLR2023Anon. ","domain adaptation, distribution shift, label shift, large scale study" ROCO: A General Framework for Evaluating Robustness of Combinatorial Optimization Solvers on Graphs,https://openreview.net/forum?id=2r6YMqz4Mml,https://openreview.net/pdf?id=2r6YMqz4Mml,,"Solving combinatorial optimization (CO) on graphs has been attracting increasing interest from the machine learning community, whereby data-driven approaches have recently been devised to go beyond traditional manually designed algorithms. In this paper, we study the robustness of a combinatorial solver as a black box, regardless of whether it is classic or learning-based, though the latter is often more interesting to the ML community. Specifically, we develop a practically feasible robustness metric for general CO solvers. A no-worse optimal cost guarantee is developed so that solvers are not required to reach optimal solutions, and we tackle the non-differentiability of input instance disturbance by resorting to black-box adversarial attack methods.
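The attack interface this implies can be sketched with a generic random-search loop (an illustrative stand-in only; ROCO's actual attackers are reinforcement-learning based, and `solver_cost` and `perturb` are hypothetical callables):

```python
# Generic black-box attack sketch: perturb the instance and keep any
# perturbation that degrades the solver's cost under its time limit,
# without touching solver internals or gradients.
import random

def attack(instance, solver_cost, perturb, steps=100, seed=0):
    """solver_cost: black-box instance -> cost; perturb: (instance, rng) -> neighbor."""
    rng = random.Random(seed)
    best, best_cost = instance, solver_cost(instance)
    for _ in range(steps):
        cand = perturb(best, rng)
        cost = solver_cost(cand)
        if cost > best_cost:  # worse solver performance = stronger attack
            best, best_cost = cand, cost
    return best, best_cost
```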
Extensive experiments are conducted on 14 unique combinations of solvers and CO problems, and we demonstrate that the performance of state-of-the-art solvers like Gurobi can degenerate by over 20\% under the given time limit bound on the hard instances discovered by our robustness metric, raising concerns about the robustness of combinatorial optimization solvers. Source code and configuration details will all be made publicly available.","Combinatorial Optimization, Robustness, Graph Neural Networks, Reinforcement Learning" Spatial reasoning as Object Graph Energy Minimization,https://openreview.net/forum?id=XOl_9AU0EV,https://openreview.net/pdf?id=XOl_9AU0EV,,"We propose a model that maps spatial rearrangement instructions to goal scene configurations via gradient descent on a set of relational energy functions over object 2D overhead locations, one per spatial predicate in the instruction. Energy based models over object locations are trained from a handful of examples of object arrangements annotated with the corresponding spatial predicates. Predicates can be binary (e.g., left of, right of, etc.) or multi-ary (e.g., circles, lines, etc.). A language parser maps language instructions to the corresponding set of EBMs, and a visual-language model grounds their arguments on relevant objects in the visual scene. Energy minimization on the joint set of energies iteratively updates the object locations until their final configuration. Then, low-level local policies relocate objects to the inferred goal locations. Our framework shows many forms of strong generalization: (i) joint energy minimization handles zero-shot complex predicate compositions while each EBM is trained only from single predicate instructions, (ii) the model can execute instructions zero-shot, without a need for paired instruction-action training, (iii) instructions can mention novel objects and attributes at test time thanks to the pre-training of the visual language grounding model from large scale passive captioned datasets. We test the model in established instruction-guided manipulation benchmarks, as well as a benchmark of compositional instructions we introduce in this work. We show large improvements over state-of-the-art end-to-end language-to-action policies and planning in large language models, especially for long instructions and multi-ary spatial concepts.","Energy Based Model, Robotics, Goal Generation" Words are all you need? Language as an approximation for representational similarity,https://openreview.net/forum?id=O-G91-4cMdv,https://openreview.net/pdf?id=O-G91-4cMdv,"We show that machine embeddings of text descriptions can predict human similarity judgments better than models trained from images, audio and video.","Human similarity judgments are a powerful supervision signal for machine learning applications based on techniques such as contrastive learning, information retrieval, and model alignment, but classical methods for collecting human similarity judgments are too expensive to be used at scale. Recent methods propose using pre-trained deep neural networks (DNNs) to approximate human similarity, but pre-trained DNNs may not be available for certain domains (e.g., medical images, low-resource languages) and their performance in approximating human similarity has not been extensively tested. We conducted an evaluation of 611 pre-trained models across three domains -- images, audio, video -- and found that there is a large gap in performance between human similarity judgments and pre-trained DNNs.
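One standard way to score this kind of gap (a sketch of a common protocol, not necessarily the paper's exact metric) is to correlate pairwise model similarities with the human judgments:

```python
# Score how well a model's embedding space matches human similarity
# judgments: Spearman correlation between pairwise cosine similarities
# and the judged similarities, over unique item pairs.
import numpy as np
from scipy.stats import spearmanr

def alignment_score(emb: np.ndarray, human_sim: np.ndarray) -> float:
    """emb: (n, d) item embeddings; human_sim: (n, n) human similarity matrix."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    model_sim = emb @ emb.T
    iu = np.triu_indices(len(emb), k=1)  # upper triangle = unique pairs
    return spearmanr(model_sim[iu], human_sim[iu]).correlation
```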
To address this gap, we propose a new class of similarity approximation methods based on language. To collect the language data required by these new methods, we also developed and validated a novel adaptive tag collection pipeline. We find that our proposed language-based methods are significantly cheaper, in the number of human judgments, than classical methods, while still improving performance over the DNN-based methods. Finally, we also develop `stacked' methods that combine language embeddings with DNN embeddings, and find that these consistently provide the best approximations for human similarity across all three of our modalities. Based on the results of this comprehensive study, we provide a concise guide for researchers interested in collecting or approximating human similarity data. To accompany this guide, we also release all of the similarity and language data, a total of 206,339 human judgments, that we collected in our experiments, along with a detailed breakdown of all modeling results.","cognitive science, language, perception, representational similarity" Graph Contrastive Learning with Reinforced Augmentation,https://openreview.net/forum?id=bWhEJ0lF5L9,https://openreview.net/pdf?id=bWhEJ0lF5L9,"In this paper, we design a novel GA2C model that makes the augmented views evolve to energize graph contrastive learning and outperform the SOTA methods.","Graph contrastive learning (GCL), which designs contrastive objectives to learn embeddings from augmented graphs, has become a prevailing method for learning graph embeddings in an unsupervised manner. As an important procedure in GCL, graph data augmentation (GDA) directly affects the model performance on the downstream task. Currently, there are three types of GDA strategies: trial-and-error, precomputed methods, and adversarial methods. However, these strategies ignore the connection between two consecutive augmentation results because GDA is regarded as an independent process. In this paper, we regard the GDA in GCL as a Markov decision process. Based on this view, we propose a reinforced method, i.e., the fourth type of GDA strategy, using a novel Graph Advantage Actor-Critic (GA2C) model for GCL. On 23 graph datasets, the experimental results verify that GA2C outperforms the SOTA GCL models on a series of downstream tasks such as graph classification, node classification, and link prediction.","Graph contrastive learning, graph neural network, graph classification, reinforcement learning" A Novel Fast Exact Subproblem Solver for Stochastic Quasi-Newton Cubic Regularized Optimization,https://openreview.net/forum?id=tWOoSXyW0KE,https://openreview.net/pdf?id=tWOoSXyW0KE,,"In this work we describe an Adaptive Regularization using Cubics (ARC) method for large-scale nonconvex unconstrained optimization using Limited memory Quasi-Newton (LQN) matrices. ARC methods are a relatively new family of second-order optimization strategies that utilize a cubic-regularization (CR) term in place of trust-regions or line-searches. Solving the CR subproblem exactly requires Newton's method, yet using properties of the internal structure of LQN matrices, we are able to find exact solutions to the CR subproblem in a matrix-free manner, providing very large speedups. Additionally, we expand upon previous ARC work and explicitly incorporate first-order updates into our algorithm.
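For reference, the CR subproblem mentioned here is the minimization of the standard cubic model (textbook ARC notation, independent of this paper's specific solver):

```latex
% Cubic-regularized model at iterate x_k: g_k is the gradient, B_k the
% (LQN) Hessian approximation, and sigma_k > 0 the adaptive weight.
\min_{s \in \mathbb{R}^n} \; m_k(s) \;=\; f(x_k) + g_k^\top s
  + \tfrac{1}{2}\, s^\top B_k s + \tfrac{\sigma_k}{3}\, \lVert s \rVert^3
```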
We provide empirical results for different LQN matrices and find that our proposed method compares favorably with or exceeds all tested optimizers with minimal tuning.","optimization, quasi-newton" Block-Diagonal Structure Learning for Subspace Clustering,https://openreview.net/forum?id=sxLL8K3E39G,https://openreview.net/pdf?id=sxLL8K3E39G,,"Finding the informative subspaces of high-dimensional ordered datasets is at the core of innumerable applications in computer vision, where spectral-based subspace clustering is arguably the most commonly studied method due to its strong empirical performance. Such algorithms compute an affinity matrix to construct a self-representation for each sample, utilizing other samples as a dictionary, and spectral clustering is employed to identify the clustering structure based on the affinity matrix. Owing to the ordered nature of the data, the block-diagonal structure learning embedded in self-representation plays a vital role in effective subspace clustering. However, direct optimization with block-diagonal priors is challenging due to the random sparseness and connectivity nature of self-representation, and none of the existing techniques resorts to block-diagonal structure learning of self-representation alone. In this paper, we propose a technique, namely block-diagonal structure representation learning, to solve the optimal clustering of the ordered data directly instead of employing spectral clustering. The proposed algorithm can theoretically achieve the global optimal solution of the proposed discrete non-convex block-diagonal partition problem. We test the proposed clustering method on several types of segmentation databases, such as human face recognition, video scene clip partition, motion tracks, and dynamic 3-D facial expression sequences. The experiments illustrate that the proposed method outperforms state-of-the-art subspace clustering methods. ", Decentralized Federated Learning via Overlapping Data Augmentation,https://openreview.net/forum?id=DSKD610FRN1,https://openreview.net/pdf?id=DSKD610FRN1,This paper studies the scenario of selective partial sharing in federated learning.,"Recently, there have been rising concerns about the heterogeneity among local clients in federated learning, which could lead to inefficient utilization of the data from other clients. To mitigate the adverse effects of heterogeneity, FL research has mostly focused on learning a globally shared initialization under the assumption that the shared information is consistent among all clients. In this paper, we consider a more general scenario, Selective Partial Sharing (SPS), where each pair of clients may share different patterns or distribution components. We propose a novel FL framework named Fed-SPS to exploit the shared knowledge by a partial and pairwise collaboration. Meanwhile, to reduce data traffic and improve computing efficiency, we realize a decentralized learning paradigm for our framework. Due to privacy concerns, one cannot obtain the overlapped distribution components with direct access to the raw data. Since the learned personalized model approximates the local distribution, we propose to identify the selective sharing structure by exploring the vulnerability overlap between local models. With the detected sharing structure, we propose an overlapping data augmentation, which efficiently boosts the use of the overlapped data between clients.
Comprehensive experiments on a suite of benchmark data sets and a real-world clinical data set show that our approach can achieve better generalization compared with existing methods.","federated learning, personalized federated learning, decentralized federated learning" Offline RL of the Underlying MDP from Heterogeneous Data Sources,https://openreview.net/forum?id=AR4rOT4sECN,https://openreview.net/pdf?id=AR4rOT4sECN,This work investigated the problem of learning an underlying MDP with offline datasets from heterogeneous sources and proposed several provably efficient designs.,"Most of the existing offline reinforcement learning (RL) studies assume the available dataset is sampled directly from the target environment. However, in some practical applications, the available data often come from several related but heterogeneous environments. A theoretical understanding of efficient learning from heterogeneous offline datasets remains lacking. In this work, we study the problem of learning a (hidden) underlying Markov decision process (MDP) based on heterogeneous offline datasets collected from multiple randomly perturbed data sources. A novel HetPEVI algorithm is proposed, which jointly considers two types of uncertainties: sample uncertainties from the finite number of data samples per data source, and source uncertainties due to a finite number of data sources. Building on HetPEVI, we further incorporate reference-advantage decompositions and Bernstein-type penalties to propose the HetPEVI-Adv algorithm. Theoretical analysis not only proves the effectiveness of both HetPEVI and HetPEVI-Adv but also demonstrates the advantage of the latter. More importantly, the results explicitly characterize the learning loss due to the finite number of heterogeneously realized environments compared with sampling directly from the underlying MDP. Finally, we extend the study to MDPs with linear function approximation and propose the HetPEVI-Lin algorithm that provides additional efficiency guarantees beyond the tabular case.","RL Theory, Offline RL, Underlying MDP, Heterogeneous Data Sources, Provable Efficiency" An interpretable contrastive logical knowledge learning method for sentiment analysis,https://openreview.net/forum?id=9RDD2hefT94,https://openreview.net/pdf?id=9RDD2hefT94,We present a novel contrastive logical knowledge learning (CLK) method to learn interpretable TPK models and generate explanations for sentiment analysis tasks. ,"Current interpretable sentiment analysis (ISA) methods frequently underperform state-of-the-art models, and few of them cast light on the inner workings of pre-trained models. In this work, we fill the gap by addressing four key research challenges in ISA—knowledge acquisition, knowledge representation, knowledge learning and knowledge reasoning—in one unified framework. Theoretically, we propose a novel contrastive logical knowledge learning (CLK) framework that can visualize the decisions made through deterministic Talmudic public announcement logic semantics. We apply CLK to current popular sentiment analysis models to obtain CLK-based interpretable ones. Empirically, experimental results on both binary sentiment analysis tasks and fine-grained sentiment analysis tasks indicate that CLK can achieve an effective trade-off between accuracy and interpretability. Furthermore, we find that CLK can reduce the uncertainty of logical knowledge for discriminative labels by visualizing the learned feature representations and model output.
In addition, we carry out a case study to investigate the fidelity of model interpretability through knowledge reasoning, which demonstrates that the explanations provided by our method are reasonable and consistent for sentiment analysis tasks. ","interpretable sentiment analysis, Talmudic public announcement logic, contrastive logical knowledge learning, knowledge reasoning" FreeMatch: Self-adaptive Thresholding for Semi-supervised Learning,https://openreview.net/forum?id=PDrUPTXJI_A,https://openreview.net/pdf?id=PDrUPTXJI_A,We propose FreeMatch to define and adjust the confidence threshold in a self-adaptive manner for semi-supervised learning.,"Pseudo labeling and consistency regularization approaches based on confidence thresholding have made great progress in semi-supervised learning (SSL). However, we argue that existing methods might fail to adopt suitable thresholds since they either use a pre-defined / fixed threshold or an ad-hoc threshold adjusting scheme, resulting in inferior performance and slow convergence. We first analyze a motivating example to gain insight into the relationship between the desirable threshold and the model's learning status. Based on the analysis, we propose FreeMatch to define and adjust the confidence threshold in a self-adaptive manner according to the model's learning status. We further introduce a self-adaptive class fairness regularization penalty that encourages the model to produce diverse predictions during the early stages of training. Extensive experimental results indicate the superiority of FreeMatch, especially when the labeled data are extremely rare. FreeMatch achieves 5.78%, 13.59%, and 1.28% error rate reduction over the latest state-of-the-art method FlexMatch on CIFAR-10 with 1 label per class, STL-10 with 4 labels per class, and ImageNet with 100 labels per class, respectively.","Semi-Supervised Learning, Semi-Supervised Classification" The Impact of Neighborhood Distribution in Graph Convolutional Networks,https://openreview.net/forum?id=XUqTyU9VlWp,https://openreview.net/pdf?id=XUqTyU9VlWp,We find the distinguishability of neighborhood distribution plays a more important role in the performance of GCN than homophily and propose GCN-PND to promote neighborhood distinguishability.,"Graph Convolutional Networks (GCNs), which aggregate information from neighbors to learn node representations, have shown excellent ability in processing graph-structured data. However, the belief that the notable performance of GCNs depends on a strong homophily assumption is inaccurate, since GCNs can also perform well on some heterophilous graphs. Thus the impact of homophily on GCNs needs to be reconsidered. In this paper, we study what influences the aggregation of GCNs from the perspective of neighborhood distribution. Theoretical and empirical analysis is provided to reveal that the distinguishability of neighborhood distribution plays a more important role in the performance of GCN than homophily. Furthermore, we show that neighborhood structure and neighborhood range are two key factors for GCNs to promote neighborhood distinguishability. Based on this conclusion, we propose an improved graph convolutional network (GCN-PND) that updates the graph topology based on the similarity between local neighborhood distributions of nodes and designs extensible aggregation from multi-hop neighbors. We conducted extensive experiments on graph benchmark datasets to analyze the superiority of the proposed method.
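To ground the self-adaptive thresholding idea from FreeMatch above, here is a simplified sketch (global threshold only; the method's per-class adjustment and exact update rule are omitted):

```python
# Track an EMA of the model's average confidence on unlabeled data and use
# it as the pseudo-labeling cutoff, so the threshold follows learning status.
import torch

class AdaptiveThreshold:
    def __init__(self, momentum: float = 0.999, init: float = 0.5):
        self.m, self.tau = momentum, init

    def update(self, probs: torch.Tensor) -> torch.Tensor:
        """probs: (B, C) softmax outputs on an unlabeled batch -> bool mask."""
        conf, _ = probs.max(dim=1)
        self.tau = self.m * self.tau + (1 - self.m) * conf.mean().item()
        return conf > self.tau  # which samples receive pseudo-labels
```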
The experimental results demonstrate that GCN-PND is more effective on heterophilous datasets than most existing state-of-the-art GCN methods. ","Graph convolutional networks, graph neural networks, homophily, heterophily" Training image classifiers using Semi-Weak Label Data,https://openreview.net/forum?id=kPL4YzdDqWE,https://openreview.net/pdf?id=kPL4YzdDqWE,,"This paper introduces a new semi-weak label learning paradigm which provides additional information in comparison to weak-label classification. We define semi-weak label data as data where we know the presence or absence of a given class and additionally know the exact count of each class, as opposed to knowing only the label proportions. A three-stage framework is proposed to address the problem of learning from semi-weak labels. It leverages the fact that counting information is naturally non-negative and discrete. Experiments are conducted on generated samples from CIFAR-10, and we compare our model with a fully-supervised baseline, a weakly-supervised baseline, and a learning-from-proportions (LLP) baseline. Our framework not only outperforms both baseline models for the MIL-based weakly supervised setting and the learning-from-proportions setting, but also gives results comparable to the fully supervised model. Further, we conduct thorough ablation studies across datasets and over batch size, losses, architectural changes, bag size, and regularization, thereby demonstrating the robustness of our approach. ", Confidence Estimation Using Unlabeled Data,https://openreview.net/forum?id=sOXU-PEJSgQ,https://openreview.net/pdf?id=sOXU-PEJSgQ,,"Overconfidence is a common issue for deep neural networks, limiting their deployment in real-world applications. To better estimate confidence, existing methods mostly focus on fully-supervised scenarios and rely on training labels. In this paper, we propose the first confidence estimation method for a semi-supervised setting, when most training labels are unavailable. We stipulate that even with limited training labels, we can still reasonably approximate the confidence of the model on unlabeled samples by inspecting the prediction consistency through the training process. We use training consistency as a surrogate function and propose a consistency ranking loss for confidence estimation. On both image classification and segmentation tasks, our method achieves state-of-the-art performance in confidence estimation. Furthermore, we show the benefit of the proposed method through a downstream active learning task. ", Towards Class-Balanced Transductive Few-Shot Learning,https://openreview.net/forum?id=ZA_F5AU-byh,https://openreview.net/pdf?id=ZA_F5AU-byh,We develop transductive fine-tuning with margin-based uncertainty weighting and class-balanced normalization to tackle the issue of class imbalanced predictions in few-shot learning.,"In this work, we present an observation of severe class-imbalanced predictions in few-shot learning and propose solving it by acquiring a more balanced marginal probability through Transductive Fine-tuning with Margin-based uncertainty weighting and Class-balanced normalization (TF-MC). Margin-based uncertainty weighting compresses the utilization of wrong predictions with lower loss weights to stabilize the predicted marginal distribution. Class-balanced normalization adjusts the predicted probability for testing data to pursue class-balanced fine-tuning without directly regularizing the marginal testing distribution.
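One plausible instantiation of such a normalization (a heavily hedged sketch; TF-MC's exact formula may differ) divides each probability by its class's batch average and renormalizes:

```python
# Damp classes the model over-predicts on the query set: divide predicted
# probabilities by their per-class batch mean, then renormalize each row.
import torch

def class_balanced_normalize(probs: torch.Tensor) -> torch.Tensor:
    """probs: (B, C) predicted probabilities for the unlabeled query set."""
    class_mean = probs.mean(dim=0, keepdim=True)  # (1, C) marginal estimate
    adjusted = probs / (class_mean + 1e-8)
    return adjusted / adjusted.sum(dim=1, keepdim=True)
```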
TF-MC effectively improves the class balance in predictions, achieving state-of-the-art performance on in- / out-of-distribution evaluations of Meta-Dataset and surpassing previous transductive methods by a large margin. ","Few-shot classification, Transductive learning, Class-imbalanced Prediction" Spectral Decomposition Representation for Reinforcement Learning,https://openreview.net/forum?id=FBMLeaXpZN,https://openreview.net/pdf?id=FBMLeaXpZN,We propose a new spectral representation learning method that gets rid of the policy dependency and can be easily applied in downstream tasks.,"Representation learning often plays a critical role in avoiding the curse of dimensionality in reinforcement learning. A representative class of algorithms exploits spectral decomposition of the stochastic transition dynamics to construct representations that enjoy strong theoretical properties in idealized settings. However, current spectral methods suffer from limited applicability because they are constructed for state-only aggregation and are derived from a policy-dependent transition kernel, without considering the issue of exploration. To address these issues, we propose an alternative spectral method, Spectral Decomposition Representation (SPEDER), that extracts a state-action abstraction from the dynamics without inducing spurious dependence on the data collection policy, while also balancing the exploration-versus-exploitation trade-off during learning. A theoretical analysis establishes the sample efficiency of the proposed algorithm in both the online and offline settings. In addition, an experimental investigation demonstrates superior performance over current state-of-the-art algorithms across several RL benchmarks.","Spectral Representation, Markov Decision Processes, Reinforcement Learning" On Accelerated Perceptrons and Beyond,https://openreview.net/forum?id=fYzLpCsGZVf,https://openreview.net/pdf?id=fYzLpCsGZVf,"We provide a unified analysis for accelerated Perceptrons, and obtain improved results for a series of other problems.","The classical Perceptron algorithm of Rosenblatt can be used to find a linear threshold function to correctly classify $n$ linearly separable data points, assuming the classes are separated by some margin $\gamma > 0$. A foundational result is that Perceptron converges after $O(1/\gamma^{2})$ iterations. There have been several recent works that managed to improve this rate by a quadratic factor, to $O(\sqrt{\log n}/\gamma)$, with more sophisticated algorithms. In this paper, we unify these existing results under one framework by showing that they can all be described through the lens of solving min-max problems using modern acceleration techniques, mainly through \emph{optimistic} online learning. We then show that the proposed framework also leads to improved results for a series of problems beyond the standard Perceptron setting.
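For reference, the classical mistake-driven update that these accelerated variants build on is textbook material (independent of this paper):

```python
# Classical Perceptron: on each mistake, move the weights toward the
# misclassified example. For unit-norm data with margin gamma, the number
# of mistakes is at most 1 / gamma^2.
import numpy as np

def perceptron(X: np.ndarray, y: np.ndarray, epochs: int = 100) -> np.ndarray:
    """X: (n, d) data points; y: (n,) labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:  # misclassified (or on the boundary)
                w += yi * xi
                mistakes += 1
        if mistakes == 0:  # data separated: done
            break
    return w
```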
Specifically, a) for the margin maximization problem, we improve the state-of-the-art result from $O(\log t/t^2)$ to $O(1/t^2)$, where $t$ is the number of iterations; b) we provide the first result on identifying the implicit bias property of the classical Nesterov's accelerated gradient descent (NAG) algorithm, and show NAG can maximize the margin with an $O(1/t^2)$ rate; c) for the classical $p$-norm Perceptron problem, we provide an algorithm with an $O(\sqrt{(p-1)\log n}/\gamma)$ convergence rate, while existing algorithms suffer an $O({(p-1)}/\gamma^2)$ convergence rate.","Perceptron, game, optimistic online learning, implicit bias, margin maximization" DITTO: Offline Imitation Learning with World Models,https://openreview.net/forum?id=Ix4Ytiwor4U,https://openreview.net/pdf?id=Ix4Ytiwor4U,"Completely offline imitation learning with world models, using RL on a latent matching objective in the model.","We propose DITTO, a fully offline approach to imitation learning which addresses the problem of covariate shift without access to an oracle or any additional online interactions. By unrolling agent policies in the latent space of a learned world model and penalizing drift from expert demonstrations, we can use online reinforcement learning algorithms to learn policies which solve the imitation objective, without access to the underlying environment or reward function. Decoupling policy and world model learning lets us leverage datasets of any quality to learn latent representations which provide a natural reward signal for imitation learning, avoiding the need for complex adversarial or sparse imitation-inducing rewards. Compared to competitive baselines, our method achieves state-of-the-art performance in a variety of challenging environments from pixel observations alone.","world models, imitation learning, reinforcement learning" BAT-Chain: Bayesian-Aware Transport Chain for Topic Hierarchies Discovery,https://openreview.net/forum?id=baRatYtGBXp,https://openreview.net/pdf?id=baRatYtGBXp,"In this paper, we propose a novel model to mine hierarchical topics and document representations under the conditional transport framework.","Topic modeling has been an important tool for text analysis. Traditionally, topics discovered by a model are assumed to be independent. However, as a semantic representation of a concept, a topic is naturally related to others, which motivates the development of learning hierarchical topic structure. Most existing Bayesian models are designed to learn hierarchical structure, but they need non-trivial posterior inference. Although the recent transport-based topic models bypass posterior inference, none of them considers deep topic structures. In this paper, we interpret a document as its word embeddings and propose a novel Bayesian-aware transport chain to discover multi-level topic structures, where each layer learns a set of topic embeddings and the document hierarchical representations are defined as a series of empirical distributions according to the topic proportions and corresponding topic embeddings. To fit such hierarchies, we develop an upward-downward optimizing strategy under the recent conditional transport theory, where document information is first transported via the upward path, and then its hierarchical representations are refined by the downward path under the Bayesian perspective. Extensive experiments on text corpora show that our approach enjoys superior modeling accuracy and interpretability.
Moreover, we also conduct experiments on learning hierarchical visual topics from images, which demonstrates the adaptability and flexibility of our method.","Topic modeling, hierarchical representation, optimal transport, conditional transport, concept learning" On the Importance of Calibration in Semi-supervised Learning,https://openreview.net/forum?id=c-h2XSi-vEM,https://openreview.net/pdf?id=c-h2XSi-vEM,We propose a family of new methods that optimize for calibration in semi-supervised learning and demonstrate improvements on popular vision benchmarks and on real-world applications.,"State-of-the-art (SOTA) semi-supervised learning (SSL) methods have been highly successful in leveraging a mix of labeled and unlabeled data by combining techniques of consistency regularization and pseudo-labeling. During pseudo-labeling, the model's predictions on unlabeled data are used for training and thus, model calibration is important in mitigating confirmation bias. Yet, many SOTA methods are optimized for model performance, with little focus directed toward improving model calibration. In this work, we empirically demonstrate that model calibration is strongly correlated with model performance and propose to improve calibration via approximate Bayesian techniques. We introduce a family of new SSL models that optimizes for calibration and demonstrate their effectiveness across standard vision benchmarks of CIFAR-10, CIFAR-100 and ImageNet, giving up to 15.9\% improvement in test accuracy. Furthermore, we also demonstrate their effectiveness in additional realistic and challenging problems, such as class-imbalanced datasets and in photonics science.", SoftMatch: Addressing the Quantity-Quality Tradeoff in Semi-supervised Learning,https://openreview.net/forum?id=ymt1zQXBDiF,https://openreview.net/pdf?id=ymt1zQXBDiF,"This paper revisits the quantity-quality tradeoff with a unified sample weighting function for the pseudo-labeling/consistency loss. From the analysis, we propose SoftMatch, which better utilizes unlabeled data while reducing the enrolled error rate.","The critical challenge of Semi-Supervised Learning (SSL) is how to effectively leverage the limited labeled data and massive unlabeled data to improve the model's generalization performance. In this paper, we first revisit the popular pseudo-labeling methods via a unified sample weighting formulation and demonstrate the inherent quantity-quality trade-off problem of pseudo-labeling with thresholding, which may hinder learning. To this end, we propose SoftMatch to overcome the trade-off by maintaining both high quantity and high quality of pseudo-labels during training, effectively exploiting the unlabeled data. We derive a truncated Gaussian function to weight samples based on their confidence, which can be viewed as a soft version of the confidence threshold. We further enhance the utilization of weakly-learned classes by proposing a uniform alignment approach. In experiments, SoftMatch shows substantial improvements across a wide variety of benchmarks, including image, text, and imbalanced classification.","Semi-Supervised Learning, Semi-Supervised Classification" Unleashing the Potential of Data Sharing in Ensemble Deep Reinforcement Learning,https://openreview.net/forum?id=g4RxrIB52M9,https://openreview.net/pdf?id=g4RxrIB52M9,,"This work studies a crucial but often overlooked element of ensemble methods in deep reinforcement learning: data sharing between ensemble members.
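The truncated Gaussian weighting in SoftMatch above can be sketched as follows (mu and sigma would be tracked as running statistics of confidence during training; this is an illustration, not the paper's full method):

```python
# Soft confidence weighting: full weight at or above the mean confidence mu,
# Gaussian decay below it, giving a soft version of a hard threshold.
import torch

def softmatch_weight(probs: torch.Tensor, mu: float, sigma: float) -> torch.Tensor:
    """probs: (B, C) softmax outputs -> per-sample loss weights in (0, 1]."""
    conf, _ = probs.max(dim=1)
    below = torch.exp(-((conf - mu) ** 2) / (2 * sigma ** 2))
    return torch.where(conf >= mu, torch.ones_like(conf), below)
```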
We show that data sharing enables peer learning, a powerful learning process in which individual agents learn from each other’s experience to significantly improve their performance. When given access to the experience of other ensemble members, even the worst agent can match or outperform the previously best agent, triggering a virtuous circle. However, we show that peer learning can be unstable when the agents’ ability to learn is impaired due to overtraining on early data. We thus employ the recently proposed solution of periodic resets and show that it ensures effective peer learning. We perform extensive experiments on continuous control tasks from both dense states and pixels to demonstrate the strong effect of peer learning and its interaction with resets.", Certifiably Robust Policy Learning against Adversarial Multi-Agent Communication,https://openreview.net/forum?id=dCOL0inGl3e,https://openreview.net/pdf?id=dCOL0inGl3e,We propose a defense method such that an agent receiving communication in a multi-agent system can be certifiably robust when a subset of communication messages gets (arbitrarily) perturbed.,"Communication is important in many multi-agent reinforcement learning (MARL) problems for agents to share information and make good decisions. However, when deploying trained communicative agents in a real-world application where noise and potential attackers exist, the safety of communication-based policies becomes a severe issue that is underexplored. Specifically, if communication messages are manipulated by malicious attackers, agents relying on untrustworthy communication may take unsafe actions that lead to catastrophic consequences. Therefore, it is crucial to ensure that agents will not be misled by corrupted communication, while still benefiting from benign communication. In this work, we consider an environment with $N$ agents, where the attacker may arbitrarily change the communication from any $C<\frac{N-1}{2}$ agents to a victim agent. For this strong threat model, we propose a certifiable defense by constructing a message-ensemble policy that aggregates multiple randomly ablated message sets. Theoretical analysis shows that this message-ensemble policy can utilize benign communication while being certifiably robust to adversarial communication, regardless of the attacking algorithm. Experiments in multiple environments verify that our defense significantly improves the robustness of trained policies against various types of attacks.","certifiable robustness, reinforcement learning, multi-agent system, adversarial communication, adversarial attack" Node Importance Specific Meta Learning in Graph Neural Networks,https://openreview.net/forum?id=pKRYZpCDr-p,https://openreview.net/pdf?id=pKRYZpCDr-p,This paper focuses on the few-shot node classification problem in graphs; it theoretically studies the influence of node importance on model accuracy and proposes a node importance calculation method to implement in meta learning GNNs.,"While current node classification methods for graphs have enabled significant progress in many applications, they rely on abundant labeled nodes for training. In many real-world datasets, nodes for some classes are always scarce; thus, current algorithms are ill-equipped to handle these few-shot node classes. Some meta learning approaches for graphs have demonstrated advantages in tackling such few-shot problems, but they disregard the impact of node importance on a task.
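The message-ensemble construction in the certifiable defense above can be sketched as a voting procedure (illustrative only; `policy` is a hypothetical base policy over discrete actions):

```python
# Run the base policy on random subsets (ablations) of received messages and
# act by majority vote, so a few corrupted messages cannot flip the action.
import random
from collections import Counter

def ensemble_action(policy, obs, messages, k, n_samples=50, seed=0):
    """policy(obs, msg_subset) -> discrete action; k = ablation subset size."""
    rng = random.Random(seed)
    votes = Counter(
        policy(obs, rng.sample(messages, k)) for _ in range(n_samples)
    )
    return votes.most_common(1)[0][0]
```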
Unique to graph data, the dependencies between nodes convey vital information for determining the importance of nodes beyond node features alone, which poses unique challenges here. In this paper, we investigate the effect of node importance in node classification meta learning tasks. We first theoretically analyze the influence of distinguishing node importance on the lower bound of the model accuracy. Then, based on the theoretical conclusion, we propose a novel Node Importance Meta Learning architecture (NIML) that learns and applies the importance score of each node for meta learning. Specifically, after constructing an attention vector based on the interaction between a node and its neighbors, we train an importance predictor in a supervised manner to capture the distance between a node embedding and the expectation of same-class embeddings. Extensive experiments on public datasets demonstrate the state-of-the-art performance of NIML on few-shot node classification problems.","Meta Learning, Graph Neural Network, Node Importance" Attention-Guided Backdoor Attacks against Transformers,https://openreview.net/forum?id=pNZkow3k3BH,https://openreview.net/pdf?id=pNZkow3k3BH,"We propose a novel Trojan Attention Loss, which enhances the Trojan behavior by directly manipulating the attention pattern.","With the popularity of transformers in natural language processing (NLP) applications, there are growing concerns about their security. Most existing NLP attack methods focus on injecting stealthy trigger words/phrases. In this paper, we focus on the interior structure of neural networks and the Trojan mechanism. Focusing on the prominent NLP transformer models, we propose a novel Trojan Attention Loss (TAL), which enhances the Trojan behavior by directly manipulating the attention pattern. Our loss significantly improves the attack efficacy; it achieves better success rates with a much smaller poisoning rate (i.e., a smaller proportion of poisoned samples). It boosts attack efficacy for not only traditional dirty-label attacks, but also the more challenging clean-label attacks. TAL is also highly compatible with most existing attack methods, and its flexibility enables this loss to be easily adapted to other backbone transformer models. ","Natural Language Processing, Transformer, Backdoor Attack, Trojan Attack, Trojan Attention Loss" Disentangling the Mechanisms Behind Implicit Regularization in SGD,https://openreview.net/forum?id=LE5LxBgjB4V,https://openreview.net/pdf?id=LE5LxBgjB4V,,"A number of competing hypotheses have been proposed to explain why small-batch Stochastic Gradient Descent (SGD) leads to improved generalization over the full-batch regime, with recent work crediting the implicit regularization of various quantities throughout training. However, to date, empirical evidence assessing the explanatory power of these hypotheses is lacking. In this paper, we conduct an extensive empirical evaluation, focusing on the ability of various theorized mechanisms to close the small-to-large batch generalization gap. Additionally, we characterize how the quantities that SGD has been claimed to (implicitly) regularize change over the course of training. By using micro-batches, i.e. disjoint smaller subsets of each mini-batch, we empirically show that explicitly penalizing the gradient norm or the Fisher Information Matrix trace, averaged over micro-batches, in the large-batch regime recovers small-batch SGD generalization, whereas Jacobian-based regularizations fail to do so.
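The micro-batch gradient-norm penalty just described can be sketched in a few lines (a simplified illustration; hyperparameters and the exact norm used in the paper may differ):

```python
# Split a large batch into disjoint micro-batches, average their gradient
# norms, and add that average (times lam) to the loss, mimicking the
# implicit regularization of small-batch SGD.
import torch

def penalized_loss(model, loss_fn, xb, yb, micro_bs=32, lam=0.01):
    base = loss_fn(model(xb), yb)
    norms = []
    for i in range(0, len(xb), micro_bs):
        micro = loss_fn(model(xb[i:i + micro_bs]), yb[i:i + micro_bs])
        grads = torch.autograd.grad(micro, list(model.parameters()),
                                    create_graph=True)  # differentiable norms
        norms.append(torch.sqrt(sum((g ** 2).sum() for g in grads)))
    return base + lam * torch.stack(norms).mean()
```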
This generalization performance is shown to often be correlated with how well the regularized model's gradient norms resemble those of small-batch SGD. We additionally show that this behavior breaks down as the micro-batch size approaches the batch size. Finally, we note that in this line of inquiry, positive experimental findings on CIFAR10 are often reversed on other datasets like CIFAR100, highlighting the need to test hypotheses on a wider collection of datasets.","deep learning, generalization, implicit regularization, sgd" Seq2Seq Pre-training with Dual-channel Recombination for Translation,https://openreview.net/forum?id=hw4XagZJuw,https://openreview.net/pdf?id=hw4XagZJuw,,"Sequence to sequence (\textit{seq2seq}) pre-training has achieved predominant success in natural language generation (NLG). Generally, the powerful encoding and language generation capacities of pre-trained seq2seq models can significantly improve most NLG tasks when fine-tuned with task-specific data. However, as a cross-lingual generation task, machine translation needs an additional ability to transfer representations across languages (i.e., a \textit{translation model}). Fine-tuning the pre-trained models to learn the translation model, which is not covered in the self-supervised pre-training, leads to the \textit{catastrophic forgetting} problem. This paper presents a dual-channel recombination framework for translation (\textsc{DcRT}) to address the aforementioned problem. In the proposed approach, we incorporate two cross-attention networks into the pre-trained seq2seq model to fetch the contextual information and require them to learn the \textit{translation} and \textit{language} models, respectively. Then, the model generates outputs according to the composite representation. Experimental results on multiple translation tasks demonstrate that the proposed \textsc{DcRT} achieves considerable improvements compared to several strong baselines by tuning less than 20\% of the parameters. Further, \textsc{DcRT} can incorporate multiple translation tasks into one model without dropping performance, drastically reducing computation and storage consumption. ","neural machine translation, transformer, sequence to sequence pre-training" Oracles and Followers: Stackelberg Equilibria in Deep Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=Vo1MVffQED,https://openreview.net/pdf?id=Vo1MVffQED,We show a general framework for learning Stackelberg Equilibria in multi-agent reinforcement learning,"Stackelberg equilibria arise naturally in a range of popular learning problems, such as in security games or indirect mechanism design, and have received increasing attention in the reinforcement learning literature. We present a general framework for implementing Stackelberg equilibria search as a multi-agent RL problem, allowing a wide range of algorithmic design choices. We discuss how previous approaches can be seen as specific instantiations of this framework. As a key insight, we note that the design space allows for approaches not previously seen in the literature, for instance by leveraging multitask and meta-RL techniques for follower convergence. We propose one such approach using contextual policies and evaluate it experimentally on standard benchmark domains.
Finally, we illustrate the effect of adopting designs outside the borders of our framework in controlled experiments.","Multi-Agent Reinforcement Learning, Game Theory, Security Games, Mechanism Design, Stackelberg Equilibrium, Indirect Mechanism Design" Structural Code Representation Learning for Auto-Vectorization,https://openreview.net/forum?id=k7qRYoxUlB,https://openreview.net/pdf?id=k7qRYoxUlB,,"The single instruction multiple data (SIMD) capability in modern processors is critical to improving the performance of current compute-intensive programs. SIMD allows architectures to exploit the natural data parallelism that exists in a wide range of real applications (e.g., games, signal processing, etc.) by executing a single instruction on multiple data items simultaneously. Modern compilers use vectorization techniques to exploit the SIMD capability by detecting data parallelism in scalar source code and transforming a group of scalar instructions into vector-based instructions. In this work, we focus on one of the most common vectorization techniques, called \emph{loop-based vectorization}, which targets loops and optimizes their performance by grouping multiple occurrences of the same operation across loop iterations into single SIMD instructions. This is achieved by setting two key parameters: (1) the vectorization factor (VF), and (2) the interleaving factor (IF). Unfortunately, vectorizing loop computations effectively is a key challenge for both programmers and compilers due to the large search space. For example, manual vectorization of each loop puts a huge burden on the programmer, is more error-prone, and/or requires expert knowledge of both the software and the architecture. Alternatively, current compilers use fixed-cost models based on expert heuristics to make automatic vectorization decisions. However, these models often ignore the data dependencies, as well as the underlying computation graph. In this paper, we propose a data-driven graph-based learning framework for automatic vectorization, called \emph{autograph}, which takes an input program, extracts the loops, then learns a structured representation to automatically predict the correct VF/IF factors. Our proposed framework utilizes deep reinforcement learning to learn an optimal policy (mapping observations to actions) for an intelligent agent in a SIMD environment, and automatically injects the predicted vectorization pragmas into the input program. We conducted an extensive evaluation on multiple benchmark datasets and comparisons with state-of-the-art baselines. Our experimental results show that the proposed framework can achieve up to 1.02x-2.26x and 1.06x-4.27x performance improvement, compared to the state-of-the-art baseline and LLVM -O3, respectively.", Overthinking the Truth: Understanding how Language Models process False Demonstrations,https://openreview.net/forum?id=em4xg1Gvxa,https://openreview.net/pdf?id=em4xg1Gvxa,,"Through few-shot learning or chain-of-thought prompting, modern language models can detect and imitate complex patterns in their prompt. This behavior allows language models to complete challenging tasks without fine-tuning, but can be at odds with completion quality: if the context is inaccurate or harmful, then the model may reproduce these defects in its completions. In this work, we show that this harmful context-following appears late in a model's computation--in particular, given an inaccurate context, models perform \emph{better} after zeroing out later layers.
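A common recipe for this kind of per-layer probing is a logit-lens-style readout (a sketch assuming a GPT-2-style model from Hugging Face; GPT-J works analogously, and this is not the paper's exact evaluation code):

```python
# Decode each intermediate hidden state through the final layer norm and
# unembedding to see what the model predicts at every depth.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
    for layer, h in enumerate(out.hidden_states):
        logits = model.lm_head(model.transformer.ln_f(h[:, -1]))  # last token
        print(layer, tok.decode(logits.argmax(-1)))
```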
More concretely, at early layers models have similar performance given either accurate or inaccurate few-shot prompts, but a gap appears at later layers (e.g.~layers 10-14 for GPT-J). This gap appears at a consistent depth across datasets, and coincides with the appearance of “induction heads” that attend to previous answers in the prompt. We restore the performance for inaccurate contexts by ablating a subset of these heads, reducing the gap by 28\% on average across 8 datasets. Our results suggest that studying early stages of computation could be a promising strategy to prevent misleading outputs, and that understanding and editing internal mechanisms can help correct unwanted model behavior.","Large Language Models, Interpretability, Safety, Mechanistic Interpretability, Science of ML" Sequential Attention for Feature Selection,https://openreview.net/forum?id=TTLLGx3eet,https://openreview.net/pdf?id=TTLLGx3eet,"Sequential feature selection using the attention mechanism, with provable guarantees.","Feature selection is the problem of selecting a subset of features for a machine learning model that maximizes model quality subject to a resource budget constraint. For neural networks, prior methods, including those based on $\ell_1$ regularization, attention, and stochastic gates, typically select all of the features in one evaluation round, ignoring the residual value of the features during selection (i.e., the marginal contribution of a feature conditioned on the previously selected features). We propose a feature selection algorithm called Sequential Attention that achieves state-of-the-art empirical results for neural networks. This algorithm is based on an efficient implementation of greedy forward selection and uses attention weights at each step as a proxy for feature importance. We provide theoretical insights into our Sequential Attention algorithm for linear regression models by showing that an adaptation to this setting is equivalent to the classical Orthogonal Matching Pursuit algorithm [PRK1993], and thus inherits all of its provable guarantees. Lastly, our theoretical and empirical analyses provide new explanations towards the effectiveness of attention and its connections to overparameterization, which might be of independent interest.","feature selection, attention" Trusted Aggregation (TAG): Model Filtering Backdoor Defense In Federated Learning,https://openreview.net/forum?id=4j7TG4gD_RM,https://openreview.net/pdf?id=4j7TG4gD_RM,TAG is a novel defense against Backdoor Attacks in Federated Learning,"Federated Learning is a framework for training machine learning models from multiple local data sets without access to the data in aggregate. A shared model is jointly learned through an interactive process between server and clients that combines locally learned model gradients or weights. However, the lack of data transparency naturally raises concerns about model security. Recently, several state-of-the-art backdoor attacks have been proposed, which achieve high attack success rates while simultaneously being difficult to detect, leading to compromised federated learning models. In this paper, motivated by differences in the output layer distribution between models trained with and without the presence of backdoor attacks, we propose a defense method that can prevent backdoor attacks from influencing the model while maintaining the accuracy of the original classification task.
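The greedy selection loop of Sequential Attention above can be sketched as follows (a simplified illustration of attention-gated forward selection; the training loop between selection steps is elided):

```python
# Gate inputs by softmax attention weights over unselected features; after
# training the gated model, fix the highest-weight feature and repeat
# until the budget is met.
import torch

def gate(x: torch.Tensor, logits: torch.Tensor, selected: list) -> torch.Tensor:
    """x: (B, d) inputs; logits: (d,) trainable attention logits."""
    w = torch.softmax(logits, dim=0)  # candidate importances
    hard = torch.zeros_like(w)
    hard[selected] = 1.0              # already-chosen features pass through
    return x * (hard + (1 - hard) * w)

def select_next(logits: torch.Tensor, selected: list) -> int:
    w = torch.softmax(logits, dim=0).clone()
    w[selected] = -1.0                # never re-pick a chosen feature
    return int(w.argmax())
```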
TAG leverages a small validation data set to estimate the largest change that a benign user's local training can make to the output layer of the shared model, which is then used as a cutoff for filtering returned user models. Experimental results on multiple data sets show that TAG defends against backdoor attacks even when 40\% of the user submissions to update the shared model are malicious.","federated learning, backdoor attack, robust aggregation" What Deep Representations Should We Learn? -- A Neural Collapse Perspective,https://openreview.net/forum?id=ZKEhS93FjhR,https://openreview.net/pdf?id=ZKEhS93FjhR,n/a,"For classification problems, when sufficiently large networks are trained until convergence, an intriguing phenomenon, termed neural collapse (NC), has recently been discovered in the last-layer classifiers and features: (i) the intra-class variability of the features collapses to zero, and (ii) the between-class feature means are maximally and equally separated. Despite recent endeavors to understand why NC happens, a fundamental question remains: is NC a blessing or a curse for deep learning? In this work, we investigate the problem under the setting of transfer learning, where we pretrain a model on a large dataset and transfer it to downstream tasks. Through various experiments, our findings on NC are two-fold: (i) when pretraining models, preventing intra-class variability collapse (to a certain extent) better preserves the structure of the data and leads to better model transferability; (ii) when fine-tuning models on downstream tasks, obtaining features with more NC on downstream data results in better test accuracy on the given task. Our findings based upon NC not only explain many widely used heuristics in model pretraining (e.g., data augmentation, projection head, self-supervised learning), but also lead to more efficient and principled transfer learning methods on downstream tasks.","representation learning, neural collapse, transfer learning" Improved Sample Complexity for Reward-free Reinforcement Learning under Low-rank MDPs,https://openreview.net/forum?id=jpsw-KuOi7r,https://openreview.net/pdf?id=jpsw-KuOi7r,"We propose a novel reward-free reinforcement learning algorithm under low-rank MDPs, which improves the sample complexity of previous work. We also provide a lower bound. Finally, we study representation learning via reward-free reinforcement learning.","In reward-free reinforcement learning (RL), an agent explores the environment first without any reward information in order to achieve certain learning goals afterwards for any given reward. While reward-free RL has been well studied under the tabular setting with minimax optimal sample complexity being achieved, theoretical study of reward-free RL with complicated function approximation is still limited. In this paper we focus on reward-free RL under low-rank MDP models, which capture representation learning in RL. We propose a new model-based algorithm, coined RAFFLE, and show that it can both find an $\epsilon$-optimal policy and achieve an $\epsilon$-accurate system identification via reward-free exploration, with a sample complexity of $\tilde{O}(\frac{H^3d^2K(d^2+K)}{\epsilon^2})$, where $d$, $H$ and $K$ respectively denote the representation dimension, episode horizon, and action space cardinality. This significantly improves the sample complexity of $\tilde{O}(\frac{H^{22}K^9d^7}{\epsilon^{10}})$ in Agarwal et al. (2020) for the same learning goals.
We further provide a sample complexity lower bound of $\tilde{\Omega}(\frac{HdK}{\epsilon^2})$ that holds for any reward-free algorithm under low-rank MDPs, which matches our upper bound in the dependence on $\epsilon$, as well as on $K$ in the large $d$ regime. Comparing this lower bound for low-rank MDPs with the upper bound for linear MDPs in Wang et al. (2020) implies that reward-free RL under low-rank MDPs is strictly harder than under linear MDPs. Finally, we complete our study by reusing RAFFLE to learn representations. We estimate the representation individually with only access to the learned transition kernels from RAFFLE and without interacting with the true environment, and then theoretically characterize the closeness between the learned and the ground truth representation. The learned representation can be further used for few-shot RL as in supervised learning (Du et al., 2021b). ","Reward Free Exploration, Representation Learning, Sample Complexity, Model-Based RL" Re-Imagen: Retrieval-Augmented Text-to-Image Generator,https://openreview.net/forum?id=XSEBx0iSjFQ,https://openreview.net/pdf?id=XSEBx0iSjFQ,A text-to-image generation model that can retrieve from an external knowledge base to generate more faithful images.,"Research on text-to-image generation has witnessed significant progress in generating diverse and photo-realistic images, driven by diffusion and auto-regressive models trained on large-scale image-text data. Though state-of-the-art models can generate high-quality images of common entities, they often have difficulty generating images of uncommon entities, such as `Chortai (dog)' or `Picarones (food)'. To tackle this issue, we present the Retrieval-Augmented Text-to-Image Generator (Re-Imagen), a generative model that uses retrieved information to produce high-fidelity and faithful images, even for rare or unseen entities. Given a text prompt, Re-Imagen accesses an external multi-modal knowledge base to retrieve relevant (image, text) pairs, and uses them as references to generate the image. With this retrieval step, Re-Imagen is augmented with the knowledge of high-level semantics and low-level visual details of the mentioned entities, and thus improves its accuracy in generating the entities' visual appearances. We train Re-Imagen on a constructed dataset containing (image, text, retrieval) triples to teach the model to ground on both the text prompt and the retrieval. Furthermore, we develop a new sampling strategy to interleave the classifier-free guidance for the text and retrieval conditions to balance the text and retrieval alignment. Re-Imagen achieves new SoTA FID results on two image generation benchmarks, namely COCO (FID = 5.25) and WikiImage (FID = 5.82), without fine-tuning. To further evaluate the capabilities of the model, we introduce EntityDrawBench, a new benchmark that evaluates image generation for diverse entities, from frequent to rare, across multiple visual domains. Human evaluation on EntityDrawBench shows that Re-Imagen performs on par with the best prior models in photo-realism, but with significantly better real-world faithfulness, especially on less frequent entities.
","Diffusion Model, Information Retrieval, Knowledge Grounding, Image Generation" BiViT: Exploring Binary Vision Transformers,https://openreview.net/forum?id=lXBzOtKn20t,https://openreview.net/pdf?id=lXBzOtKn20t,"We introduce BiViT, a the first fully binary quantized binary vision transformer with both 1-bit weights and activations.","We introduce BiViT, a Binary Vision Transformer that tackles the extremely difficult problem of quantizing both the weights and activations of a ViT model to just 1 bit. Initially, we observe that the techniques used to binarize transformers in NLP don't work on Vision Transformers (ViTs). To address this, we introduce some simple yet critical architectural changes, improving 28% over a baseline binarized ViT. Then, we improve 11% over from-scratch training by employing our normalized BiViT distillation scheme, which we find to be crucial for dense distillation in vision. Overall, BiViT can achieve a 58x reduction in operations and a 20x compression in model size, while bringing top-1 accuracy on ImageNet-1k in line with similar benchmarks for binary transformers in NLP. We hope BiViT can be the first step toward even more powerful binary ViT models.","quantization, binary quantization, vision transformer, distillation, classification, imagenet" Magnum: Tackling High-Dimensional Structures with Self-Organization,https://openreview.net/forum?id=xYNOuQh1Z7Y,https://openreview.net/pdf?id=xYNOuQh1Z7Y,,"A big challenge for dealing with real world problems is scalability. In fact, this is partially the reason behind the success of deep learning over other learning paradigms. Here, we tackle the scalability of a novel learning paradigm proposed in 2021 based solely on self organizing principles. This paradigm consists of only dynamical equations which self-organize with the input to create attractor-repeller points that are related to the patterns found in data. To achieve scalability for such a system, we propose the Magnum algorithm, which utilizes many self-organizing subsystems (SubSigma) each with subsets of the problem's variables. The main idea is that by merging SubSigmas, Magnum builds over time a variable correlation by consensus, capable of accurately predicting the structure of large groups of variables. Experiments show that Magnum surpasses or ties with other unsupervised algorithms in all of the high-dimensional chunking problems, each with distinct types of shapes and structural features. Moreover, SubSigma alone outperforms or ties with other unsupervised algorithms in six out of seven basic chunking problems. Thus, this work sheds light in how self-organization learning paradigms can be scaled up to deal with high dimensional structures and compete with current learning paradigms.","self-organization, chunking, time-series data" Provably Efficient Lifelong Reinforcement Learning with Linear Representation,https://openreview.net/forum?id=Qd0p0bl-A9t,https://openreview.net/pdf?id=Qd0p0bl-A9t,"We study lifelong RL, where the agent needs to solve a streaming sequence of tasks. We propose an algorithm with provable sublinear regret using sublinear number of planning calls for any sequence of tasks.","We theoretically study lifelong reinforcement learning (RL) with linear representation in a regret minimization setting. The goal of the agent is to learn a multi-task policy based on a linear representation while solving a sequence of tasks that may be adaptively chosen based on the agent's past behaviors. 
We frame the problem as a linearly parameterized contextual Markov decision process (MDP), where each task is specified by a context and the transition dynamics is context-independent, and we introduce a new completeness-style assumption on the representation which is sufficient to ensure the optimal multi-task policy is realizable under the linear representation. Under this assumption, we propose an algorithm, called UCB Lifelong Value Distillation (UCBlvd), that provably achieves sublinear regret for any sequence of tasks while using only sublinear planning calls. Specifically, for $K$ task episodes of horizon $H$, our algorithm has a regret bound $\tilde{\mathcal{O}}(\sqrt{(d^3+d^\prime d)H^4K})$ using $\mathcal{O}(dH\log(K))$ planning calls, where $d$ and $d^\prime$ are the feature dimensions of the dynamics and rewards, respectively. This theoretical guarantee implies that our algorithm can enable a lifelong learning agent to internalize experiences into a multi-task policy and rapidly solve new tasks.","Lifelong RL, Contextual MDP, Regret, Planning calls, Computation sharing, Streaming sequence of adversarial tasks" Towards Adversarially Robust Deepfake Detection: An Ensemble Approach,https://openreview.net/forum?id=4bH8SxYNcI,https://openreview.net/pdf?id=4bH8SxYNcI,"We present Disjoint Deepfake Detection (D3), an ensemble-based technique for deepfake detection, and provide theoretical and empirical evidence for its robustness.","Detecting deepfakes remains an open problem. Current detection methods fail against an adversary who adds imperceptible adversarial perturbations to the deepfake to evade detection. We propose Disjoint Deepfake Detection (D3), a deepfake detector designed to improve adversarial robustness beyond de facto solutions such as adversarial training. D3 uses an ensemble of models over disjoint subsets of the frequency spectrum to significantly improve robustness. Our key insight is to leverage a redundancy in the frequency domain and apply a saliency partitioning technique to disjointly distribute frequency components across multiple models. We formally prove that these disjoint ensembles lead to a reduction in the dimensionality of the input subspace where adversarial deepfakes lie. We then empirically validate the D3 method against white-box and black-box attacks and find that D3 significantly outperforms existing state-of-the-art defenses applied to deepfake detection.","Deepfakes, Ensembles, Adversarial Subspace, Frequency, Defense" Fast Adaptation via Human Diagnosis of Task Distribution Shift,https://openreview.net/forum?id=locB7rYBzTw,https://openreview.net/pdf?id=locB7rYBzTw,We develop a human-in-the-loop framework to help humans diagnose and fix goal-conditioned policy failures.,"When agents fail in the world, it is important to understand why they failed. These errors could be due to underlying distribution shifts in the goals desired by the end user or to the environment layouts that impact the policy's actions. In the case of multi-task policies conditioned on goals, this problem manifests as difficulty in disambiguating between goal and policy failures: is the agent failing because it cannot correctly infer what the desired goal is, or because it does not know how to take actions toward achieving the goal? We hypothesize that successfully disentangling these two failure modes holds important implications for selecting a finetuning strategy. 
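Editor's note: the D3 entry above partitions the frequency spectrum disjointly across ensemble members. Below is a minimal sketch of that idea with a simple interleaved bin assignment standing in for the paper's saliency-based partitioning; the function name and the assignment rule are assumptions of this sketch.

```python
import numpy as np

def disjoint_frequency_views(img, num_models=3):
    # img: 2D grayscale array. Split its 2D spectrum into disjoint frequency
    # subsets, one per ensemble member. Each frequency bin is owned by exactly
    # one model, so the subsets are disjoint by construction.
    F = np.fft.fft2(img)
    h, w = F.shape
    owner = (np.arange(h)[:, None] + np.arange(w)[None, :]) % num_models
    views = [np.real(np.fft.ifft2(np.where(owner == m, F, 0)))
             for m in range(num_models)]
    return views  # each ensemble member trains and predicts on its own view

img = np.random.rand(32, 32)
views = disjoint_frequency_views(img)
# Because the masks partition the spectrum, the views sum back to the image.
assert np.allclose(sum(views), img)
```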
In this paper, we explore the feasibility of leveraging human feedback to diagnose what vs. how failures for efficient adaptation. We develop an end-to-end policy training framework that uses attention to produce a human-interpretable representation, a visual masked state, to communicate the agent's intermediate task representation. In experiments with human users in both discrete and continuous control domains, we show that our visual attention mask policy can aid participants in successfully inferring the agent's failure mode significantly better than actions alone. Leveraging this feedback, we show subsequent empirical performance gains during finetuning and discuss implications of using humans to diagnose parameter-level failures of distribution shift.","human-ai interaction, human-in-the-loop" Link Prediction with Non-Contrastive Learning,https://openreview.net/forum?id=9Jaz4APHtWD,https://openreview.net/pdf?id=9Jaz4APHtWD,We evaluate the performance of non-contrastive methods on link prediction and propose a new method to improve its performance in the inductive setting.,"Graph neural networks (GNNs) are prominent in the graph machine learning domain, owing to their strong performance across various tasks. A recent focal area is the space of graph self-supervised learning (SSL), which aims to derive useful node representations without labeled data. Notably, many state-of-the-art graph SSL methods are contrastive methods, which use a combination of positive and negative samples to learn node representations. Owing to challenges in negative sampling (slowness and model sensitivity), recent literature introduced non-contrastive methods, which instead only use positive samples. Though such methods have shown promising performance in node-level tasks, their suitability for link prediction tasks, which are concerned with predicting link existence between pairs of nodes (and have broad applicability to recommendation systems contexts) is yet unexplored. In this work, we extensively evaluate the performance of existing non-contrastive methods for link prediction in both transductive and inductive settings. While most existing non-contrastive methods perform poorly overall, we find that, surprisingly, BGRL generally performs well in transductive settings. However, it performs poorly in the more realistic inductive settings where the model has to generalize to links to/from unseen nodes. We find that non-contrastive models tend to overfit to the training graph and use this analysis to propose T-BGRL, a novel non-contrastive framework that incorporates cheap corruptions to improve the generalization ability of the model. This simple modification strongly improves inductive performance in 5/6 of our datasets, with up to a 120% improvement in Hits@50 - all with comparable speed to other non-contrastive baselines, and up to $14\times$ faster than the best-performing contrastive baseline. 
Our work imparts interesting findings about non-contrastive learning for link prediction and paves the way for future researchers to further expand upon this area.","graph learning, graph neural networks, non-contrastive learning, link prediction" Distributed Differential Privacy in Multi-Armed Bandits,https://openreview.net/forum?id=cw8FeirkIfU,https://openreview.net/pdf?id=cw8FeirkIfU,We achieve pure DP for the first time in the distributed trust model while maintaining the same regret as under the central model,"We consider the standard $K$-armed bandit problem under a distributed trust model of differential privacy (DP), which enables privacy guarantees without a trustworthy server. Under this trust model, previous work largely focuses on achieving privacy using a shuffle protocol, where a batch of users' data is randomly permuted before being sent to a central server. This protocol achieves an ($\epsilon,\delta$) or approximate-DP guarantee by sacrificing an additive $O\!\left(\!\frac{K\log T\sqrt{\log(1/\delta)}}{\epsilon}\!\right)\!$ factor in $T$-step cumulative regret. In contrast, the optimal privacy cost to achieve a stronger ($\epsilon,0$) or pure-DP guarantee under the widely used central trust model is only $\Theta\!\left(\!\frac{K\log T}{\epsilon}\!\right)\!$, where, however, a trusted server is required. In this work, we aim to obtain a pure-DP guarantee under the distributed trust model while sacrificing no more regret than under the central trust model. We achieve this by designing a generic bandit algorithm based on successive arm elimination, where privacy is guaranteed by corrupting rewards with an equivalent discrete Laplace noise ensured by a secure computation protocol. We also show that our algorithm, when instantiated with Skellam noise and the secure protocol, ensures \emph{R\'{e}nyi differential privacy} -- a stronger notion than approximate DP -- under the distributed trust model with a privacy cost of $O\!\left(\!\frac{K\sqrt{\log T}}{\epsilon}\!\right)\!$. Finally, as a by-product of our techniques, we also recover the best-known regret bounds for bandits under the central and local models while using only \emph{discrete privacy noise}, which can avoid the privacy leakage due to floating point arithmetic of continuous noise on finite computers.","Multi-armed Bandits, Differential Privacy" Thrust: Adaptively Propels Large Language Models with External Knowledge,https://openreview.net/forum?id=g_H6fj4OGZ,https://openreview.net/pdf?id=g_H6fj4OGZ,"We design a novel metric, Thrust, that decides whether to use external knowledge for each instance, and observe significant improvements in both cost-efficiency and performance on various knowledge-intensive natural language processing tasks.","Large-scale pre-trained language models (PTLM) have achieved great success in various natural language processing (NLP) tasks. Much evidence shows that PTLMs already encode rich knowledge themselves, but knowledge stored in PTLMs can be opaque and static, making external knowledge retrieval necessary. However, there are two major challenges when using external knowledge. First, knowledge indexing and retrieval on large-scale knowledge bases are time-consuming. Second, retrieved knowledge could be noisy and sometimes misleading. Motivated by the observation that external knowledge is not always required by PTLMs, we investigate an effective and efficient way to apply knowledge only when the knowledge is essential. 
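Editor's note: the bandit entry above builds on successive arm elimination with discrete Laplace noise. The toy, single-machine sketch below illustrates only those two ingredients; the secure multi-party computation protocol that produces the noise in the distributed setting is elided, and the confidence-width constants are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def discrete_laplace(scale):
    # The difference of two i.i.d. geometric variables is discrete-Laplace
    # distributed; `scale` plays the role of (sensitivity / epsilon) per round.
    p = 1.0 - np.exp(-1.0 / scale)
    return int(rng.geometric(p) - rng.geometric(p))

def private_successive_elimination(means, pulls=500, eps=1.0, rounds=6):
    # Each active arm's reward *sum* is privatized with discrete Laplace noise
    # before the elimination test; rewards here are Bernoulli for simplicity.
    active = list(range(len(means)))
    for r in range(1, rounds + 1):
        est = {a: (rng.binomial(pulls, means[a]) + discrete_laplace(1.0 / eps)) / pulls
               for a in active}
        width = np.sqrt(np.log(8 * r * len(means)) / pulls) + 1.0 / (eps * pulls)
        best = max(est.values())
        active = [a for a in active if est[a] >= best - 2 * width]
        if len(active) == 1:
            break
    return active

print(private_successive_elimination([0.9, 0.6, 0.4, 0.2]))  # usually [0]
```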
Specifically, we propose instance-level adaptive propulsion of external knowledge (IAPEK), where we score each instance on whether the PTLM needs the support of external knowledge. To achieve this goal, we design a novel metric, Thrust, which leverages the distribution estimation on seen/training instances. Extensive experiments demonstrate that we can achieve significantly higher cost-efficiency through Thrust compared to the naive usage of external knowledge on 88% of the evaluated tasks, with a 26% average performance improvement. Such findings further shed light on the real-world practice of knowledge-enhanced LMs with a limited budget for knowledge seeking due to computation latency or costs. ","knowledge-intensive natural language processing, pre-trained language models, instance-level adaptive knowledge usage" Progress measures for grokking via mechanistic interpretability,https://openreview.net/forum?id=9XFSbDPmdW,https://openreview.net/pdf?id=9XFSbDPmdW,"We fully reverse engineer how one-layer transformers implement modular addition, and use this knowledge to explain grokking. ","Neural networks often exhibit emergent behavior, in which qualitatively new capabilities arise from scaling up the number of parameters, training data, or even the number of training steps. One approach to understanding emergence is to find the continuous \textit{progress measures} that underlie the seemingly discontinuous qualitative changes. In this work, we argue that progress measures can be found via mechanistic interpretability---that is, by reverse engineering learned models into components and measuring the progress of each component over the course of training. As a case study, we study small transformers trained on a modular arithmetic task with emergent grokking behavior. We fully reverse engineer the algorithm learned by these networks, which uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. After confirming the algorithm via ablation, we then use our understanding of the algorithm to define progress measures that precede the grokking phase transition on this task. We see our result as demonstrating both that it is possible to fully reverse engineer trained networks, and that doing so can be invaluable to understanding their training dynamics. ","interpretability, grokking, progress measures, mechanistic interpretability, circuits" Deep Bayesian Active Learning for Accelerating Stochastic Simulation,https://openreview.net/forum?id=cVEOiBz2Em,https://openreview.net/pdf?id=cVEOiBz2Em,,"Stochastic simulations such as large-scale, spatiotemporal, age-structured epidemic models are computationally expensive at fine-grained resolution. While deep surrogate models can speed up the simulations, doing so for stochastic simulations with active learning approaches is an underexplored area. We propose Interactive Neural Process (INP), a deep Bayesian active learning framework for learning deep surrogate models to accelerate stochastic simulations. INP consists of two components: a spatiotemporal surrogate model built upon the Neural Process (NP) family, and an acquisition function for active learning. For surrogate modeling, we develop Spatiotemporal Neural Process (STNP) to mimic the simulator dynamics. For active learning, we propose a novel acquisition function, Latent Information Gain (LIG), calculated in the latent space of NP-based models. 
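Editor's note: the grokking entry above states that the trained networks compute modular addition by converting it to rotation about a circle via discrete Fourier transforms and trig identities. The worked example below demonstrates that algorithm numerically; the modulus and frequency set are assumptions of this sketch, not values taken from the paper.

```python
import numpy as np

p = 113  # modulus; a small prime commonly used in grokking studies (assumed)

def modadd_via_rotation(a, b, freqs=(1, 2, 3)):
    # Embed a and b as rotations at a few frequencies, compose the rotations
    # using the cos/sin product identities, and return the answer c that
    # maximizes sum_f cos(2*pi*f*(a + b - c)/p), which peaks at c = (a+b) mod p.
    c = np.arange(p)
    logits = np.zeros(p)
    for f in freqs:
        t = 2 * np.pi * f / p
        cos_ab = np.cos(t * a) * np.cos(t * b) - np.sin(t * a) * np.sin(t * b)
        sin_ab = np.sin(t * a) * np.cos(t * b) + np.cos(t * a) * np.sin(t * b)
        logits += cos_ab * np.cos(t * c) + sin_ab * np.sin(t * c)
    return int(np.argmax(logits))

assert all(modadd_via_rotation(a, b) == (a + b) % p
           for a, b in [(5, 9), (100, 50), (112, 112)])
```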
We perform a theoretical analysis and demonstrate that LIG reduces sample complexity compared with random sampling in high dimensions. We also conduct empirical studies on two complex spatiotemporal simulators for reaction diffusion and infectious disease. The results demonstrate that STNP outperforms the baselines in the offline learning setting and LIG achieves the state-of-the-art for Bayesian active learning.", On the Mysterious Optimization Geometry of Deep Neural Networks,https://openreview.net/forum?id=hlwmtSw5KC,https://openreview.net/pdf?id=hlwmtSw5KC,Reveal a mysterious type of geometry in deep learning optimization.,"Understanding why gradient-based algorithms are successful in practical deep learning optimization is a fundamental and long-standing problem. Most existing works promote the explanation that deep neural networks have smooth and amenable nonconvex optimization geometries. In this work, we argue that this may be an oversimplification of practical deep learning optimization by revealing a mysterious and complex optimization geometry of deep networks through extensive experiments. Specifically, we consistently observe two distinct geometric patterns in training various deep networks: a regular smooth geometry and a mysterious zigzag geometry, where gradients computed in adjacent iterations are extremely negatively correlated. Also, such a zigzag geometry exhibits a fractal structure in that it appears over a wide range of geometrical scales, implying that deep networks can be highly non-smooth in certain local parameter regions. Moreover, our results show that a substantial part of the training progress is achieved under such complex geometry. Therefore, the existing smoothness-based explanations do not fully match the practice. ","deep learning, optimization geometry, nonconvex optimization" Implicit regularization via Spectral Neural Networks and non-linear matrix sensing,https://openreview.net/forum?id=EKdBD-1qHW6,https://openreview.net/pdf?id=EKdBD-1qHW6,,"The phenomenon of \textit{implicit regularization} has attracted interest in recent years as a fundamental aspect of the remarkable generalizing ability of neural networks. In a nutshell, it entails that gradient flow dynamics in many neural nets, even without any explicit regularizer in the loss function, converges to the solution of a regularized learning problem. However, known results attempting to theoretically explain this phenomenon focus overwhelmingly on the setting of linear neural nets, and the simplicity of the linear structure is particularly crucial to existing arguments. In this paper, we explore this problem in the context of more realistic neural networks with a general class of non-linear activation functions, and rigorously demonstrate the implicit regularization phenomenon for such networks in the setting of matrix sensing problems. This is coupled with rigorous rate guarantees that ensure exponentially fast convergence of gradient descent, complemented by matching lower bounds which stipulate that the exponential rate is the best achievable. In this vein, we contribute a network architecture called Spectral Neural Networks (\textit{abbrv.} SNN) that is particularly suitable for matrix learning problems. Conceptually, this entails coordinatizing the space of matrices by their singular values and singular vectors, as opposed to by their entries, a potentially fruitful perspective for matrix learning. 
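Editor's note: the zigzag geometry described in the optimization-geometry entry above is defined by strongly negative correlation between gradients at adjacent iterations. The diagnostic below is a minimal sketch of how one could measure exactly that quantity during SGD training; the toy model and data are placeholders.

```python
import torch

def adjacent_gradient_cosine(model, loss_fn, batches, lr=0.1):
    # Record the cosine similarity between full flattened gradients at
    # adjacent SGD steps; values near -1 indicate the zigzag pattern.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    prev, cosines = None, []
    for x, y in batches:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        g = torch.cat([p.grad.flatten() for p in model.parameters()])
        if prev is not None:
            cosines.append(torch.nn.functional.cosine_similarity(g, prev, dim=0).item())
        prev = g.clone()
        opt.step()
    return cosines

model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
data = [(torch.randn(16, 10), torch.randn(16, 1)) for _ in range(20)]
print(adjacent_gradient_cosine(model, torch.nn.functional.mse_loss, data))
```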
We demonstrate that the SNN architecture is inherently much more amenable to theoretical analysis than vanilla neural nets and confirm its effectiveness in the context of matrix sensing, supported via both mathematical guarantees and empirical investigations. We believe that the SNN architecture has the potential to be of wide applicability in a broad class of matrix learning scenarios.", Goal-Space Planning with Subgoal Models,https://openreview.net/forum?id=vPS7yxt6oNE,https://openreview.net/pdf?id=vPS7yxt6oNE,A new approach to model-based RL where planning operates in an abstract subgoal space,"This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated many steps. In this paper, we avoid this limitation by constraining background planning to a set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning and avoids learning the transition dynamics entirely. We show that our GSP algorithm can learn significantly faster than a Double DQN baseline in a variety of situations.","Reinforcement Learning, Model-Based Reinforcement Learning, Planning, Temporal Abstraction" MET : Masked Encoding for Tabular data,https://openreview.net/forum?id=lRXSMYJtXwT,https://openreview.net/pdf?id=lRXSMYJtXwT,Masking based algorithm for SSL on tabular datasets. Key idea: there exists a latent graphical model that captures relations between different coordinates and classification in latent space is easy. Masking based SSL learns this latent structure.,"We propose $\textit{Masked Encoding for Tabular Data (MET)}$ for learning self-supervised representations from $\textit{tabular data}$. Tabular self-supervised learning (tabular-SSL) -- unlike structured domains like images, audio, text -- is more challenging since each tabular dataset can have a completely different structure among its features (or coordinates), which is hard to identify a priori. $\textit{MET}$ attempts to circumvent this problem by assuming the following hypothesis: the observed tabular data features come from a latent graphical model and the downstream tasks are significantly easier to solve in the latent space. Based on this hypothesis, $\textit{MET}$ uses random masking based encoders to learn a positional embedding for each coordinate, which would in turn capture the latent structure between coordinates. Through experiments on a toy dataset from a linear graphical model, we show that $\textit{MET}$ is indeed able to capture the latent graphical model. Practically, through extensive experiments on multiple benchmarks for tabular data, we demonstrate that $\textit{MET}$ significantly outperforms all the baselines. For example, on Criteo -- a large-scale click prediction dataset -- $\textit{MET}$ achieves as much as $5\%$ improvement over the current state-of-the-art (SOTA) while purely supervised learning based approaches have been able to advance SOTA by at most $2\%$ in the last six years. 
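Editor's note: the MET entry above learns a positional embedding per tabular coordinate via random masking and reconstruction. The sketch below is one plausible minimal instantiation of that idea; the transformer depth, embedding size, mask ratio, and reconstruction loss are all assumptions of this sketch rather than MET's actual configuration.

```python
import torch
import torch.nn as nn

class MaskedTabularEncoder(nn.Module):
    # Each coordinate gets a learned positional embedding; a random subset of
    # coordinates is replaced by a mask token and the model reconstructs them.
    def __init__(self, num_features, dim=32):
        super().__init__()
        self.pos = nn.Embedding(num_features, dim)   # one embedding per coordinate
        self.val = nn.Linear(1, dim)
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)

    def forward(self, x, mask_ratio=0.3):
        tok = self.val(x.unsqueeze(-1)) + self.pos.weight        # (B, F, dim)
        mask = torch.rand(x.shape, device=x.device) < mask_ratio
        tok = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tok), tok)
        recon = self.head(self.encoder(tok)).squeeze(-1)
        return ((recon - x)[mask] ** 2).mean()                   # loss on masked cells

model = MaskedTabularEncoder(num_features=12)
print(model(torch.randn(8, 12)).item())
```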
Furthermore, averaged over $\textit{nine}$ datasets, $\textit{MET}$ is around $3.9\%$ more accurate than the next best method, gradient-boosted decision trees -- considered SOTA for the tabular setting.","Tabular Data, Self Supervised Learning, Masked Auto-Encoder" Shortcut Learning Through the Lens of Early Training Dynamics,https://openreview.net/forum?id=5wa-ueGGI33,https://openreview.net/pdf?id=5wa-ueGGI33,Potential shortcuts can be found by monitoring the easy features learned by the initial layers of a DNN early during the training using suitable instance difficulty metrics.,"Deep Neural Networks (DNNs) are prone to learn shortcut patterns that damage the generalization of the DNN during deployment. Shortcut learning is concerning, particularly when the DNNs are applied to safety-critical domains. This paper aims to better understand shortcut learning through the lens of the learning dynamics of the internal neurons during the training process. More specifically, we make the following observations: (1) While previous works treat shortcuts as synonymous with spurious correlations, we emphasize that not all spurious correlations are shortcuts. We show that shortcuts are only those spurious features that are ``easier'' than the core features. (2) We build upon this premise and use instance difficulty methods (like Prediction Depth) to quantify ``easy'' and to identify this behavior during the training phase. (3) We empirically show that shortcut learning can be detected by observing the learning dynamics of the DNN's early layers, irrespective of the network architecture used. In other words, easy features learned by the initial layers of a DNN early during the training are potential shortcuts. We verify our claims on simulated and real medical imaging data and justify the empirical success of our hypothesis by showing the theoretical connections between Prediction Depth and information-theoretic concepts like $\mathcal{V}$-usable information. Lastly, our experiments show the insufficiency of monitoring only accuracy plots during training (as is common in machine learning pipelines), and we highlight the need for monitoring early training dynamics using example difficulty metrics.","shortcut learning, spurious correlations, convolutional neural networks, deep learning, machine learning, computer vision, training dynamics" On $\mathcal{O}(1/K)$ Convergence and Low Sample Complexity for Single-Timescale Policy Evaluation with Nonlinear Function Approximation,https://openreview.net/forum?id=8-aqFHleFyC,https://openreview.net/pdf?id=8-aqFHleFyC,," Learning an accurate value function for a given policy is a critical step in solving reinforcement learning (RL) problems. So far, however, the convergence speed and sample complexity of most existing policy evaluation algorithms remain unsatisfactory, particularly with {\em non-linear} function approximation. This challenge motivates us to develop a new variance-reduced primal-dual method (VRPD) that is able to achieve a fast convergence speed for RL policy evaluation with nonlinear function approximation. To lower the high sample complexity limitation of variance-reduced approaches (due to the periodic full gradient evaluation with all training data), we further propose an enhanced VRPD method with an adaptive-batch adjustment (VRPD$^+$). 
The main features of VRPD include: i) VRPD allows the use of {\em{constant}} step sizes and achieves the $\mathcal{O}(1/K)$ convergence rate to the first-order stationary points of non-convex policy evaluation problems; ii) VRPD is a generic {\em{single}}-timescale algorithm that is also applicable to solving a large class of non-convex strongly-concave minimax optimization problems; iii) by adaptively adjusting the batch size via historical stochastic gradient information, VRPD$^+$ is more sample-efficient in practice without loss of theoretical convergence rate. Our extensive numerical experiments verify our theoretical findings and showcase the high efficiency of the proposed VRPD and VRPD$^+$ algorithms compared with the state-of-the-art methods.", Generating Features with Increased Crop-Related Diversity for Few-shot Object Detection,https://openreview.net/forum?id=4dsIu9DOFNB,https://openreview.net/pdf?id=4dsIu9DOFNB,"We transform the latent space such that the latent norm represents a data property, allowing controllable feature generation. ","Two-stage object detectors generate object proposals and classify them to detect objects in images. These proposals often do not perfectly contain the objects but overlap with them in many possible ways, exhibiting great variability induced by different object scales, object positions (w.r.t. the boxes), object parts, and backgrounds. Training a robust classifier against this variability requires abundant training data, which is not available in few-shot settings. To mitigate this issue, we propose a novel variational autoencoder (VAE) based data generation model, which is capable of generating data with increased crop-related variability. The main idea is to transform the latent space such that latent codes with different norms represent different crop-related variations. This allows us to generate features with increased crop-related diversity by simply varying the latent norm. In particular, each latent code is rescaled such that its norm linearly correlates with the IoU score of the input crop w.r.t. the ground-truth box. Here the IoU score is a proxy that represents the crop-related variation. We train this VAE model on base classes conditioned on the semantic code of each class and then use the trained model to generate features for novel classes. Our experimental results show that our generated features consistently improve state-of-the-art few-shot object detection methods on PASCAL VOC and COCO datasets.",Few-shot Object Detection On the Implicit Bias Towards Depth Minimization in Deep Neural Networks,https://openreview.net/forum?id=MJSIkA72S4k,https://openreview.net/pdf?id=MJSIkA72S4k,,"Recent results in the literature suggest that the penultimate (second-to-last) layer representations of neural networks that are trained for classification exhibit a clustering property called neural collapse (NC). We study the implicit bias of stochastic gradient descent (SGD) in favor of low-depth solutions when training deep neural networks. We characterize a notion of effective depth that measures the first layer for which sample embeddings are separable using the nearest-class-center classifier. Furthermore, we hypothesize and empirically show that SGD implicitly selects neural networks of small effective depths. Moreover, while neural collapse emerges even when generalization should be impossible, we argue that the \emph{degree of separability} in the intermediate layers is related to generalization. 
We derive a generalization bound based on comparing the effective depth of the network with the minimal depth required to fit the same dataset with partially corrupted labels. Remarkably, this bound provides non-trivial estimations of the test performance. Finally, we empirically show that the effective depth of a trained neural network monotonically increases when increasing the number of random labels in data.", Prometheus: Endowing Low Sample and Communication Complexities to Constrained Decentralized Stochastic Bilevel Learning,https://openreview.net/forum?id=OmpIgSvg7-Z,https://openreview.net/pdf?id=OmpIgSvg7-Z,," In recent years, constrained decentralized stochastic bilevel optimization has become increasingly important due to its versatility in modeling a wide range of multi-agent learning problems, such as multi-agent reinforcement learning and multi-agent meta-learning with safety constraints. However, one under-explored and fundamental challenge in constrained decentralized stochastic bilevel optimization is how to achieve low sample and communication complexities, which, if not addressed appropriately, could affect the long-term prospect of many emerging multi-agent learning paradigms that use decentralized bilevel optimization as a bedrock. In this paper, we investigate a class of constrained decentralized bilevel optimization problems, where multiple agents collectively solve a nonconvex-strongly-convex bilevel problem with constraints in the upper-level variables. Such problems arise naturally in many multi-agent reinforcement learning and meta learning problems. In this paper, we propose an algorithm called Prometheus (proximal tracked stochastic recursive estimator) that achieves the first $\mathcal{O}(\epsilon^{-1})$ results in both sample and communication complexities for constrained decentralized bilevel optimization, where $\epsilon>0$ is the desired stationarity error. Collectively, the results in this work contribute to a theoretical foundation for low sample- and communication-complexity constrained decentralized bilevel learning.", SGD and Weight Decay Provably Induce a Low-Rank Bias in Neural Networks,https://openreview.net/forum?id=N7Tv4aZ4Cyx,https://openreview.net/pdf?id=N7Tv4aZ4Cyx,,"We analyze deep ReLU neural networks trained with mini-batch Stochastic Gradient Descent (SGD) and weight decay. We show, both theoretically and empirically, that when training a neural network using SGD with weight decay and small batch size, the resulting weight matrices tend to be of small rank. Our analysis relies on a minimal set of assumptions; the neural networks may be arbitrarily wide or deep and may include residual connections, as well as convolutional layers. The same analysis implies the inherent presence of SGD ``noise'', defined as the inability of SGD to converge to a stationary point. In particular, we prove that SGD noise must always be present, even asymptotically, as long as we incorporate weight decay and the batch size is smaller than the total number of training samples. ", "MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features",https://openreview.net/forum?id=wtr-9AKxCI5,https://openreview.net/pdf?id=wtr-9AKxCI5,"Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features","MobileViT (MobileViTv1) combines convolutional neural networks (CNNs) and vision transformers (ViTs) to create light-weight models for mobile vision tasks. 
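Editor's note: the entry above on SGD and weight decay claims that trained weight matrices tend toward small rank. A convenient way to check this empirically is the stable rank, sketched below; the use of stable rank as the diagnostic is an assumption of this note, not necessarily the paper's own measure.

```python
import torch

def stable_rank(W):
    # ||W||_F^2 / ||W||_2^2: a smooth lower bound on the rank, convenient
    # for tracking a low-rank tendency in trained weight matrices.
    s = torch.linalg.svdvals(W)
    return (s.pow(2).sum() / s[0].pow(2)).item()

# Usage sketch: track stable_rank(layer.weight) over SGD + weight-decay
# training; the claim above predicts it should shrink, especially with small
# batches. For reference, a random n x n Gaussian matrix has stable rank
# around n/4:
print(stable_rank(torch.randn(256, 256)))
```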
Though the main MobileViTv1-block helps to achieve competitive state-of-the-art results, the fusion block inside the MobileViTv1-block creates scaling challenges and poses a complex learning task. We propose simple and effective changes to the fusion block to create the MobileViTv3-block, which addresses the scaling challenges and simplifies the learning task. The MobileViTv3-XXS, XS and S models built from our proposed MobileViTv3-block outperform MobileViTv1 on ImageNet-1k, ADE20K, COCO and PascalVOC2012 datasets. On ImageNet-1K, MobileViTv3-XXS and MobileViTv3-XS surpass MobileViTv1-XXS and MobileViTv1-XS by 2% and 1.9%, respectively. The recently published MobileViTv2 architecture removes the fusion block and uses linear-complexity transformers to perform better than MobileViTv1. We add our proposed fusion block to MobileViTv2 to create MobileViTv3-0.5, 0.75 and 1.0 models. MobileViTv3-0.5 and MobileViTv3-0.75 outperform MobileViTv2-0.5 and MobileViTv2-0.75 by 2.1% and 1.0%, respectively, on the ImageNet-1K dataset. For the segmentation task, MobileViTv3-1.0 achieves 2.07% and 1.1% better mIOU than MobileViTv2-1.0 on the ADE20K and PascalVOC2012 datasets, respectively.","Deep Learning, Vision Transformers, Convolutional Neural Networks, Mobile Vision Transformers, Light-weight neural network" PiFold: Toward effective and efficient protein inverse folding,https://openreview.net/forum?id=oMsN9TYwJ0j,https://openreview.net/pdf?id=oMsN9TYwJ0j,,"How can we design protein sequences folding into the desired structures effectively and efficiently? Structure-based protein design has attracted increasing attention in recent years; however, few methods can simultaneously improve the accuracy and efficiency due to the lack of expressive features and reliance on autoregressive sequence decoders. To address these issues, we propose PiFold, which contains a novel residue featurizer and PiGNN layers to generate protein sequences in a one-shot way with improved recovery. Experiments show that PiFold could achieve 51.66\% recovery on CATH 4.2, while the inference speed is 70 times faster than the autoregressive competitors. In addition, PiFold achieves 58.72\% and 60.42\% recovery scores on TS50 and TS500, respectively. We conduct comprehensive ablation studies to reveal the role of different types of protein features and model designs, inspiring further simplification and improvement.", "A Theoretical Understanding of Vision Transformers: Learning, Generalization, and Sample Complexity",https://openreview.net/forum?id=jClGv3Qjhb,https://openreview.net/pdf?id=jClGv3Qjhb,A theoretical characterization of generalization and sample complexity of training three-layer Vision Transformers.,"Vision Transformers (ViTs) with self-attention modules have recently achieved great empirical success in many vision tasks. Due to non-convex interactions across layers, however, the theoretical learning and generalization analysis is mostly elusive. Based on a data model characterizing both label-relevant and label-irrelevant tokens, this paper provides the first theoretical analysis of training a three-layer ViT, i.e., one self-attention layer followed by a two-layer perceptron, for a classification task. We characterize the sample complexity to achieve a zero generalization error. Our sample complexity bound is positively correlated with the inverse of the fraction of label-relevant tokens, the token noise level, and the initial model error. 
We also prove that a training process using stochastic gradient descent (SGD) leads to a sparse attention map, which is a formal verification of the general intuition about the success of attention. Moreover, this paper indicates that a proper token sparsification can improve the test performance by removing label-irrelevant and/or noisy tokens, including spurious correlations. Empirical experiments on synthetic data and the CIFAR-10 dataset justify our theoretical results and generalize to deeper ViTs. ","Vision transformer, Learning, Generalization, Sample complexity, Token sparsification, Theory" AlphaDesign: A graph protein design method and benchmark on AlphaFold DB,https://openreview.net/forum?id=yC8PKpNl4f,https://openreview.net/pdf?id=yC8PKpNl4f,,"While AlphaFold has remarkably advanced protein folding, the inverse problem, protein design, by which protein sequences are predicted from the corresponding 3D structures, still faces significant challenges. First, there is no large-scale benchmark covering the vast protein space for evaluating methods and models; second, existing methods still suffer from low prediction accuracy and time-inefficient inference. This paper establishes a new benchmark based on AlphaFold DB, one of the world's largest protein structure databases. Moreover, we propose a new baseline method called AlphaDesign, which achieves 5\% higher recovery than previous methods and about a 70-fold inference speed-up in designing long protein sequences. We also reveal AlphaDesign's potential for practical protein design tasks, where the designed proteins achieve good structural compatibility with native structures. The open-source code will be released. ", Transfer Learning with Context-aware Feature Compensation,https://openreview.net/forum?id=c0UQacrBmFB,https://openreview.net/pdf?id=c0UQacrBmFB,,"Transfer learning aims to reuse learnt representations or subnetworks in a new domain with minimal adaptation effort. Here, the challenge lies in the mismatch between the source domain and the target domain, which is the major gap to be tackled by transfer learning. Hence, how to identify the mismatch between the source and target domains becomes a critical problem. We propose an end-to-end framework to learn feature compensation for transfer learning, with soft gating to decide whether and how much feature compensation is needed, accounting for the mismatch between the source and target domains. To identify the position of the input relative to the overall data distribution of the source domain, we first perform clustering to summarize the data distribution in a compact form represented by cluster centers, and then use the similarities between the input and the cluster centers to describe the input's relative position. This acts as the context to indicate whether and how much feature compensation is needed for the input to compensate for the mismatch between the source and target domains. To achieve this, we add only two subnetworks in the form of multilayer perceptrons, one for computing the feature compensation and the other for soft-gating it, both computed from the context. 
The experiments show that such a minor change to the backbone network can result in significant performance improvements compared with the baselines on some widely used benchmarks.", Contrastive Learning Can Find An Optimal Basis For Approximately View-Invariant Functions,https://openreview.net/forum?id=AjC0KBjiMu,https://openreview.net/pdf?id=AjC0KBjiMu,"We show that existing contrastive objectives approximate a ""positive-pair kernel"", and that applying Kernel PCA produces a representation that is provably optimal for supervised learning of functions that assign similar values to positive pairs.","Contrastive learning is a powerful framework for learning self-supervised representations that generalize well to downstream supervised tasks. We show that multiple existing contrastive learning methods can be reinterpreted as learning kernel functions that approximate a fixed *positive-pair kernel*. We then prove that a simple representation obtained by combining this kernel with PCA provably minimizes the worst-case approximation error of linear predictors, under a straightforward assumption that positive pairs have similar labels. Our analysis is based on a decomposition of the target function in terms of the eigenfunctions of a positive-pair Markov chain, and a surprising equivalence between these eigenfunctions and the output of Kernel PCA. We give generalization bounds for downstream linear prediction using our kernel PCA representation, and show empirically on a set of synthetic tasks that applying kernel PCA to contrastive learning models can indeed approximately recover the Markov chain eigenfunctions, although the accuracy depends on the kernel parameterization as well as on the augmentation strength.","contrastive learning, self-supervised learning, representation learning, kernel, kernel PCA, positive definite, eigenfunction, spectral clustering, invariance, Markov chain, minimax optimal" Learning Unsupervised Forward Models from Object Keypoints,https://openreview.net/forum?id=vKEMum01xu,https://openreview.net/pdf?id=vKEMum01xu,,"Object-centric representation is an essential abstraction for forward prediction. Most existing forward models learn this representation through extensive supervision (e.g., object class and bounding box), although such ground-truth information is not readily accessible in reality. To address this, we introduce KINet (Keypoint Interaction Network)---an end-to-end unsupervised framework to reason about object interactions based on a keypoint representation. Using visual observations, our model learns to associate objects with keypoint coordinates and discovers a graph representation of the system as a set of keypoint embeddings and their relations. It then learns an action-conditioned forward model using contrastive estimation to predict future keypoint states. By learning to perform physical reasoning in the keypoint space, our model automatically generalizes to scenarios with a different number of objects, novel backgrounds, and unseen object geometries. 
Experiments demonstrate the effectiveness of our model in accurately performing forward prediction and learning plannable object-centric representations, which can also be used in downstream robotic manipulation tasks.", K-SAM: Sharpness-Aware Minimization at the Speed of SGD,https://openreview.net/forum?id=EW00yKKLiX,https://openreview.net/pdf?id=EW00yKKLiX,We propose an efficient sharpness-aware minimization method by subsampling the training data with the highest k losses in both gradient calculation steps.,"Sharpness-Aware Minimization (SAM) has recently emerged as a robust technique for improving the accuracy of deep neural networks. However, SAM incurs a high computational cost in practice, requiring up to twice as much computation as vanilla SGD. The computational challenge posed by SAM arises because each iteration requires both ascent and descent steps and thus double the gradient computations. To address this challenge, we propose to compute gradients in both stages of SAM on only the top-k samples with the highest loss. K-SAM is simple and extremely easy to implement while providing significant generalization boosts over vanilla SGD at little to no additional cost.","deep learning, efficient training" Copula Conformal Prediction for Multi-step Time Series Forecasting,https://openreview.net/forum?id=jCdoLxMZxf,https://openreview.net/pdf?id=jCdoLxMZxf,"significantly improve efficiency/sharpness of conformal prediction confidence intervals, for time series forecasting, by modeling dependence of time steps using copulas","Accurate uncertainty measurement is a key step to building robust and reliable machine learning systems. Conformal prediction is a distribution-free uncertainty quantification algorithm popular for its ease of implementation, statistical coverage guarantees, and versatility for underlying forecasters. However, existing conformal prediction algorithms for time series are limited to single-step prediction without considering the temporal dependency. In this paper we propose a Copula-based Conformal Prediction algorithm for multivariate, multi-step Time Series forecasting, CopulaCPTS. On several synthetic and real-world multivariate time series datasets, we show that CopulaCPTS produces better-calibrated and sharper confidence intervals for multi-step prediction tasks than existing techniques.","Conformal Prediction, time series, uncertainty quantification, calibration, RNN" Multi-Epoch Matrix Factorization Mechanisms for Private Machine Learning,https://openreview.net/forum?id=Ab8hkaJSJI,https://openreview.net/pdf?id=Ab8hkaJSJI,"We extend factorization-based mechanisms to multiple participations, enabling new SOTA results in private training without amplification.","We introduce new differentially private (DP) mechanisms for gradient-based machine learning (ML) training involving multiple passes (epochs) of a dataset, substantially improving the achievable privacy-utility-computation tradeoffs. Our key contribution is an extension of the online matrix factorization DP mechanism to multiple participations, substantially generalizing the approach of DMRST2022. We first give a non-trivial reduction of the problem with per-iteration vector contributions to the simpler one of scalar contributions. Using this, we formulate the construction of optimal (in total squared error at each iterate) matrix mechanisms for SGD variants as a convex program. We provide a closed-form solution to the dual function, leading directly to an efficient optimization algorithm. 
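Editor's note: the K-SAM entry above specifies the method precisely enough to sketch: both SAM gradient computations use only the top-k highest-loss examples. Below is a minimal PyTorch sketch of one such update under that description; the loss, selection forward pass, and hyperparameters are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def ksam_step(model, x, y, opt, k=64, rho=0.05):
    # One K-SAM update (sketch): both the ascent (perturbation) and descent
    # gradients of SAM are computed on only the k highest-loss examples.
    with torch.no_grad():
        per_sample = F.cross_entropy(model(x), y, reduction="none")
    idx = per_sample.topk(min(k, x.shape[0])).indices
    xk, yk = x[idx], y[idx]

    # Ascent: gradient on the top-k subset, then perturb weights by rho * g/||g||.
    opt.zero_grad()
    F.cross_entropy(model(xk), yk).backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]
    scale = rho / (torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12)
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(g, alpha=scale.item())

    # Descent: gradient at the perturbed point, undo the perturbation, step.
    opt.zero_grad()
    F.cross_entropy(model(xk), yk).backward()
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(g, alpha=scale.item())
    opt.step()

model = torch.nn.Linear(10, 3)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
ksam_step(model, torch.randn(128, 10), torch.randint(0, 3, (128,)), opt)
```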
While tractable, both solving the convex problem offline and computing the necessary noise masks during training can become prohibitively expensive when many training steps are necessary. To address this, we design a Fourier-transform-based mechanism with significantly less computation and only a minor utility decrease. Extensive empirical evaluation on two tasks, example-level DP for image classification and user-level DP for language modeling, demonstrates substantial improvements over the previous state-of-the-art. Though our primary application is to ML, we note our main DP results are applicable to arbitrary linear queries and hence may have much broader applicability.","differential privacy, matrix mechanism, machine learning, artificial intelligence, private learning" Quantum 3D graph structure learning with applications to molecule computing,https://openreview.net/forum?id=pvLMBZ5w9eg,https://openreview.net/pdf?id=pvLMBZ5w9eg,,"Graph representation learning has been extensively studied over the last decades, and recent models have started to focus on an under-explored area of 3D graph learning with 3D spatial positions as well as node attributes. Despite the progress, the ability to understand the physical meaning of the 3D topology data is still a bottleneck for existing models. On the other hand, quantum computing is known to be a promising direction, with theoretically verified supremacy as well as increasing evidence that physical quantum devices will be accessible in the near term. For the first time, we propose a quantum 3D embedding ansatz that learns the latent representation of 3D structures in the Hilbert space composed of the Bloch sphere of each qubit. We convert the 3D Cartesian coordinates of nodes into rotation and torsion angles and then encode them into the form of qubits. Moreover, a Parameterized Quantum Circuit (PQC) is applied to serve as the trainable layers, and we take the output of the PQC as the node embedding. Experimental results on two downstream tasks, molecular property prediction and 3D molecular geometry generation, demonstrate the effectiveness of our model. Though the results are still restricted by computational power, we have shown the capability of our model with very few parameters and the potential to execute on a real quantum device.", Distributional Signals for Node Classification in Graph Neural Networks,https://openreview.net/forum?id=eoqfMQJogx0,https://openreview.net/pdf?id=eoqfMQJogx0,"In this paper, we introduce the new notion of distributional graph signals and use it to design a GNN regularization scheme. ","In graph neural networks (GNNs), both node features and labels are examples of graph signals, a key notion in graph signal processing (GSP). While it is common in GSP to impose signal smoothness constraints in learning and estimation tasks, it is unclear how this can be done for discrete node labels. We bridge this gap by introducing the concept of distributional graph signals. In our framework, we work with the distributions of node labels instead of their values and propose notions of smoothness and non-uniformity of such distributional graph signals. We then propose a general regularization method for GNNs that allows us to encode distributional smoothness and non-uniformity of the model output in semi-supervised node classification tasks. Numerical experiments demonstrate that our method can significantly improve the performance of most base GNN models in different problem settings. 
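Editor's note: the distributional-signals entry above regularizes the smoothness of the model's output distributions over the graph. One simple instantiation of distributional smoothness is the Dirichlet energy of the softmax outputs across edges, sketched below; the exact regularizer in the paper may differ, and all names here are illustrative.

```python
import torch

def distributional_smoothness(logits, edge_index):
    # Encourage the predicted class distributions of adjacent nodes to be
    # close: squared L2 distance between softmax outputs, averaged over edges.
    P = torch.softmax(logits, dim=-1)  # (num_nodes, num_classes)
    src, dst = edge_index              # (2, num_edges) integer tensor
    return (P[src] - P[dst]).pow(2).sum(-1).mean()

# Usage sketch (lam is a tunable weight, assumed):
#   total_loss = supervised_loss + lam * distributional_smoothness(logits, edge_index)
logits = torch.randn(5, 3, requires_grad=True)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])
print(distributional_smoothness(logits, edge_index))
```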
","Graph neural networks, graph signal processing, regularization, smoothness, node classification" Vector Quantized Wasserstein Auto-Encoder,https://openreview.net/forum?id=Z8qk2iM5uLI,https://openreview.net/pdf?id=Z8qk2iM5uLI,,"Learning deep discrete latent presentations offers a promise of better symbolic and summarized abstractions that are more useful to subsequent downstream tasks. Recent work on Vector Quantized Variational Auto-Encoder (VQ-VAE) has made substantial progress in this direction. However, this quantizes latent representations using the online k-means algorithm which suffers from poor initialization and non-stationary clusters. To strengthen the clustering quality for the latent representations, we propose Vector Quantized Wasserstein Auto-Encoder (VQ-WAE) intuitively developed based on the clustering viewpoint of Wasserstein (WS) distance. Specifically, we endow a discrete distribution over the codewords and learn a deterministic decoder that transports the codeword distribution to the data distribution via minimizing a WS distance between them. We develop further theories to connect it with the clustering viewpoint of WS distance, allowing us to have a better and more controllable clustering solution. Finally, we empirically evaluate our method on several well-known benchmarks, where it achieves better qualitative and quantitative performances than the baselines in terms of the codebook utilization and image reconstruction/generation.", Exact Representation of Sparse Networks with Symmetric Nonnegative Embeddings,https://openreview.net/forum?id=N-eul1pdagX,https://openreview.net/pdf?id=N-eul1pdagX,"We expand on previous bounds for exact factorization of undirected graphs and extend them to our proposed model, which is more interpretable.","Many models for undirected graphs are based on factorizing the graph's adjacency matrix; these models find a vector representation of each node such that the predicted probability of a link between two nodes increases with the similarity (dot product) of their associated vectors. Recent work has shown that these models are unable to capture key structures in real-world graphs, particularly heterophilous structures, wherein links occur between dissimilar nodes. In contrast, a factorization with two vectors per node, based on logistic principal components analysis (LPCA), has been proven not only to represent such structures, but also to provide exact low-rank factorization of any graph with bounded max degree. However, this bound has limited applicability to real-world networks, which often have power law degree distributions with high max degree. Further, the LPCA model lacks interpretability since its asymmetric factorization does not reflect the undirectedness of the graph. We address the above issues in two ways. First, we prove a new bound for the LPCA model in terms of arboricity rather than max degree; this greatly increases the bound's applicability to many sparse real-world networks. Second, we propose an alternative graph model whose factorization is symmetric and nonnegative, which allows for link predictions to be interpreted in terms of node clusters. We show that the bounds for exact representation in the LPCA model extend to our new model. On the empirical side, our model is optimized effectively on real-world graphs with gradient descent on a cross-entropy loss. 
We demonstrate its effectiveness on a variety of foundational tasks, such as community detection and link prediction.","graph, network, embeddings, arboricity, factorization, model, community, nonnegative" Skill-Based Reinforcement Learning with Intrinsic Reward Matching,https://openreview.net/forum?id=ZUXy6d49JNJ,https://openreview.net/pdf?id=ZUXy6d49JNJ,We unify unsupervised RL skill pretraining and downstream finetuning phases of learning by leveraging the skill discriminator as a task specifier.,"While unsupervised skill discovery has shown promise in autonomously acquiring behavioral primitives, there is still a large methodological disconnect between task-agnostic skill pretraining and downstream, task-aware finetuning. We present Intrinsic Reward Matching (IRM), which unifies these two phases of learning via the $\textit{skill discriminator}$, a pretraining model component often discarded during finetuning. Conventional approaches finetune pretrained agents directly at the policy level, often relying on expensive environment rollouts to empirically determine the optimal skill. However, often the most concise yet complete description of a task is the reward function itself, and skill learning methods learn an $\textit{intrinsic}$ reward function via the discriminator that corresponds to the skill policy. We propose to leverage the skill discriminator to $\textit{match}$ the intrinsic and downstream task rewards and determine the optimal skill for an unseen task without environment samples, consequently finetuning with greater sample-efficiency. Furthermore, we generalize IRM to sequence skills and solve more complex, long-horizon tasks. We demonstrate that IRM is competitive with previous skill selection methods on the Unsupervised Reinforcement Learning Benchmark and enables us to utilize pretrained skills far more effectively on challenging tabletop manipulation tasks.","Unsupervised Reinforcement Learning, Reinforcement Learning, Deep Learning" A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning,https://openreview.net/forum?id=5U3xzYJoThy,https://openreview.net/pdf?id=5U3xzYJoThy,A new path to improving instruction following agents using pure imitation learning (no RL) and large scale in-domain data augmentation ,"Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments, as a step towards intelligent agents or robots that can follow human instructions. However, given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding. Pretraining on large text and image-text datasets from the web has been extensively explored but the improvements are limited. To address the scarcity of in-domain instruction data, we investigate large-scale augmentation with synthetic instructions. We take 500+ indoor environments captured in densely-sampled 360◦ panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory using Marky (Wang et al., 2022), a high-quality multilingual navigation instruction generator. To further increase the variability of the trajectories, we also synthesize image observations from novel viewpoints using an image-to-image GAN. 
The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets, and contains a wider variety of environments and viewpoints. To efficiently leverage data at this scale, we train a transformer agent with imitation learning for over 700M steps of experience. On the challenging Room-across-Room dataset, our approach outperforms all existing RL agents, improving the state-of-the-art NDTW from 71.1 to 79.1 in seen environments, and from 64.6 to 66.8 in unseen test environments. Self-supervision with synthetic instructions in new environments can improve performance further to 68.6 (vs. human 79.5). Our work points to a new path to improving instruction-following agents, emphasizing large-scale imitation learning and the development of synthetic instruction generation capabilities – which are shown to flow through directly to improved instruction-following performance.",vision and language navigation TuneUp: A Training Strategy for Improving Generalization of Graph Neural Networks,https://openreview.net/forum?id=8xuFD1yCoH,https://openreview.net/pdf?id=8xuFD1yCoH,"We develop a curriculum learning strategy to train GNNs with high generalization performance, especially on tail nodes.","Despite many advances in Graph Neural Networks (GNNs), their training strategies simply focus on minimizing a loss over nodes in a graph. However, such simplistic training strategies may be sub-optimal as they neglect that certain nodes are much harder to make accurate predictions on than others. Here we present TuneUp, a curriculum learning strategy for better training of GNNs. Crucially, TuneUp trains a GNN in two stages. The first stage aims to produce a strong base GNN. Such base GNNs tend to perform well on head nodes (nodes with large degrees) but less so on tail nodes (nodes with small degrees). So, the second stage of TuneUp specifically focuses on improving prediction on tail nodes. Concretely, TuneUp synthesizes additional supervised tail node data by dropping edges from head nodes and reusing the supervision on the original head nodes. TuneUp then minimizes the loss over the synthetic tail nodes to finetune the base GNN. TuneUp is a general training strategy that can be used with any GNN architecture and any loss, making TuneUp applicable to a wide range of prediction tasks. Extensive evaluation of TuneUp on two GNN architectures, three types of prediction tasks, and both inductive and transductive settings shows that TuneUp significantly improves the performance of the base GNN on tail nodes, while often even improving the performance on head nodes, which together leads to up to a 58.5% relative improvement in GNN predictive performance. Moreover, TuneUp significantly outperforms its variants without the two-stage curriculum learning, existing graph data augmentation techniques, as well as other specialized methods for tail nodes.","Graph Neural Networks, Curriculum learning, Tail nodes" Collecting The Puzzle Pieces: Disentangled Self-Driven Human Pose Transfer by Permuting Textures,https://openreview.net/forum?id=upJ3vrFKaL,https://openreview.net/pdf?id=upJ3vrFKaL,We present a self-driven human pose transfer method based on permuting textures.,"Human pose transfer aims to synthesize a new view of a person under a given pose. Recent works achieve this via self-reconstruction, which disentangles pose and texture features from the person image, then combines the two features to reconstruct the person.
Such feature-level disentanglement is a difficult and ill-defined problem that could lead to loss of details and unwanted artifacts. In this paper, we propose a self-driven human pose transfer method that permutes the textures at random, then reconstructs the image with a dual-branch attention mechanism to achieve image-level disentanglement and detail-preserving texture transfer. We find that compared with feature-level disentanglement, image-level disentanglement is more controllable and reliable. Furthermore, we introduce a dual kernel encoder that gives different sizes of receptive fields in order to reduce the noise caused by permutation and thus recover clothing details while aligning pose and textures. Extensive experiments on DeepFashion and Market-1501 show that our model improves the quality of generated images in terms of FID, LPIPS and SSIM over other self-driven methods, and even outperforms some fully-supervised methods. A user study also shows that among self-driven approaches, images generated by our method are preferred in 72\% of cases over prior work.","person image generation, human pose transfer, image synthesis" A Scalable and Exact Gaussian Process Sampler via Kernel Packets,https://openreview.net/forum?id=1sN_4ROgel,https://openreview.net/pdf?id=1sN_4ROgel,,"In view of the widespread use of Gaussian processes (GPs) in machine learning models, generating random sample paths of GPs is crucial for many machine learning applications. Sampling from a GP essentially requires generating high-dimensional Gaussian random vectors, which is computationally challenging if a direct method, such as the one based on Cholesky decomposition, is implemented. We develop a scalable algorithm to sample random realizations of the prior and the posterior of GP models with Matérn correlation functions. Unlike existing scalable sampling algorithms, the proposed approach draws samples from the theoretical distributions exactly. The algorithm exploits a novel structure called the kernel packets (KP), which gives an exact sparse representation of the dense covariance matrices. The proposed method is applicable to one-dimensional GPs, and to multi-dimensional GPs under some conditions, such as separable kernels with full grid designs. Via a series of experiments and comparisons with other recent works, we demonstrate the efficiency and accuracy of the proposed method.", Model ChangeLists: Characterizing Changes in ML Prediction APIs,https://openreview.net/forum?id=4-oNRO0Fqy,https://openreview.net/pdf?id=4-oNRO0Fqy,"In this work, we study MLaaS API updates. We introduce Mocha, a new framework for describing model updates. Then we use Mocha to demonstrate how subtle, but significant, shifts are commonly introduced by updates.","Updates to Machine Learning as a Service (MLaaS) APIs may affect downstream systems that depend on their predictions. However, performance changes introduced by these updates are poorly documented by providers and seldom studied in the literature. As a result, users are left wondering: do model updates introduce subtle performance changes that could adversely affect my system? Ideally, users would have access to a detailed ChangeList specifying the slices of data where model performance has improved and degraded since the update.
But producing a ChangeList is challenging because it requires (1) discovering slices in the absence of detailed annotations or metadata, (2) accurately attributing coherent concepts to the discovered slices, and (3) communicating them to the user in a digestible manner. We introduce Mocha, an interactive framework for building, verifying and releasing ChangeLists that addresses these challenges. Using it, we perform a large-scale analysis of three real-world MLaaS API updates. We produce a ChangeList for each, identifying over 100 coherent data slices on which the model’s performance changed significantly. Notably, we find 63 instances where an update improves performance globally, but hurts performance on a coherent slice – a phenomenon not previously documented at scale in the literature. These findings underscore the importance of producing a detailed ChangeList when the model behind an API is updated.","model evaluation, model comparison, machine learning as a service, robustness" Provably Auditing Ordinary Least Squares in Low Dimensions,https://openreview.net/forum?id=DlpCotqdTy,https://openreview.net/pdf?id=DlpCotqdTy,We develop provable and efficient algorithms for estimating stability of OLS to dropping samples in the low-dimensional regime.,"Auditing the stability of a machine learning model to small changes in the training procedure is critical for engendering trust in practical applications. For example, a model should not be overly sensitive to removing a small fraction of its training data. However, algorithmically validating this property seems computationally challenging, even for the simplest of models: Ordinary Least Squares (OLS) linear regression. Concretely, recent work defines the stability of a regression as the minimum number of samples that need to be removed so that rerunning the analysis overturns the conclusion (Broderick et al., 2020), specifically meaning that the sign of a particular coefficient of the OLS regressor changes. But the only known approach for estimating this metric, besides the obvious exponential-time algorithm, is a greedy heuristic that may produce severe overestimates and therefore cannot certify stability. We show that stability can be efficiently certified in the low-dimensional regime: when the number of covariates is a constant but the number of samples is large, there are polynomial-time algorithms for estimating (a fractional version of) stability, with provable approximation guarantees. Applying our algorithms to the Boston Housing dataset, we exhibit regression analyses where our estimator outperforms the greedy heuristic, and can successfully certify stability even in the regime where a constant fraction of the samples are dropped.","stability, linear regression, ordinary least squares, robustness" Exploring semantic information in disease: Simple Data Augmentation Techniques for Chinese Disease Normalization,https://openreview.net/forum?id=PjT1TJ62vJW,https://openreview.net/pdf?id=PjT1TJ62vJW,A novel data augmentation method in NLP to address the problem of Chinese Disease Normalization.,"Disease is a core concept in the medical field, and the task of normalizing disease names is the basis of all disease-related tasks. However, due to the multi-axis and multi-grain nature of disease names, general text data augmentation techniques often inject incorrect information and harm performance.
To address the above problem, we propose a set of data augmentation techniques that work together as an augmented training task for disease normalization, called Disease Data Augmentation (DDA). Our data augmentation methods are based on both a clinical disease corpus and a standard disease corpus derived from ICD-10 coding. Extensive experiments are conducted to show the effectiveness of our proposed methods. The results demonstrate that our method achieves up to a 3\% performance gain over non-augmented counterparts, and works even better on smaller datasets.","Data Augmentation, Medicine, Disease, Disease Normalization, Deep Learning, Natural Language Processing, Representation Learning" Live in the Moment: Learning Dynamics Model Adapted to Evolving Policy,https://openreview.net/forum?id=f8zhJ1jAfIq,https://openreview.net/pdf?id=f8zhJ1jAfIq,We theoretically analyze how the distribution of historical policies affects the model learning and model rollouts and propose a novel model learning method for model-based RL.,"Model-based reinforcement learning (RL) often achieves higher sample efficiency in practice than model-free RL by learning a dynamics model to generate samples for policy learning. Previous works learn a dynamics model that fits under the empirical state-action visitation distribution for all historical policies, i.e., the sample replay buffer. However, in this paper, we observe that fitting the dynamics model under the distribution for \emph{all historical policies} does not necessarily benefit model prediction for the \emph{current policy} since the policy in use is constantly evolving over time. The evolving policy during training will cause state-action visitation distribution shifts. We theoretically analyze how this distribution shift over historical policies affects the model learning and model rollouts. We then propose a novel dynamics model learning method, named \textit{Policy-adapted Dynamics Model Learning (PDML)}. PDML dynamically adjusts the historical policy mixture distribution to ensure the learned model can continually adapt to the state-action visitation distribution of the evolving policy. Experiments on a range of continuous control environments in MuJoCo show that PDML achieves significant improvement in sample efficiency and higher asymptotic performance when combined with state-of-the-art model-based RL methods.","Reinforcement Learning, Model-based Reinforcement Learning, State-action Visitation Distribution, Distribution Shift, Policy-adapted Dynamics Model Learning" Planning Goals for Exploration,https://openreview.net/forum?id=6qeBuZSo7Pr,https://openreview.net/pdf?id=6qeBuZSo7Pr,We use world models to generate goals for exploration.,"Dropped into an unknown environment, what should an agent do to quickly learn about the environment and how to accomplish diverse tasks within it? We address this question within the goal-conditioned reinforcement learning paradigm, by identifying how the agent should set its goals at training time to maximize exploration. We propose ""planning exploratory goals"" (PEG), a method that sets goals for each training episode to directly optimize an intrinsic exploration reward. PEG first chooses goal commands such that the agent's goal-conditioned policy, at its current level of training, will end up in states with high exploration potential. It then launches an exploration policy starting at those promising states.
To enable this direct optimization, PEG learns world models and adapts sampling-based planning algorithms to ""plan goal commands"". In challenging simulated robotics environments including a multi-legged ant robot in a maze, and a robot arm on a cluttered tabletop, PEG exploration enables more efficient and effective training of goal-conditioned policies relative to baselines and ablations. Our ant successfully navigates a long maze, and the robot arm successfully builds a stack of three blocks upon command. Project website: https://sites.google.com/view/exploratory-goals","model-based reinforcement learning, exploration, goal-conditioned reinforcement learning, planning, intrinsic motivation, reinforcement learning" Deep Direct Discriminative Decoders for High-dimensional Time-series Data Analysis,https://openreview.net/forum?id=xZSRTER-2Qv,https://openreview.net/pdf?id=xZSRTER-2Qv,,"State-space models (SSMs) are widely utilized in the analysis of time-series data. SSMs rely on an explicit definition of the state and observation processes. Characterizing these processes is not always easy and becomes a modeling challenge in many instances, such as when the dimension of the observed data grows or the observed data distribution deviates from the normal distribution. New variants of SSMs try to address these challenges by utilizing deep neural networks (DNNs). Here, we propose a new formulation of SSM for high-dimensional observation processes with heavy-tailed distributions. We call this solution the deep direct discriminative process (D4). D4 utilizes discriminative models like DNNs in characterizing the observation process. With this formulation, we bring DNNs' expressiveness and scalability to the SSM formulation, which lets us optimally estimate the underlying state process through high-dimensional observation signals. We develop the filter solution for D4 and build a training solution to find the model-free parameters. We demonstrate D4 solutions in simulated and real data such as Lorenz attractors, Langevin dynamics, and rat hippocampus spiking neural data, where D4's performance exceeds that of traditional models. D4 can be applied to a broader class of time-series modeling analyses where the connection between high-dimensional observation and the underlying data generation process is complex and characterizing the conditional distribution of observations is a challenging modeling problem.","Dynamical Model, State-space Model (SSM), Neural Decoding, Deep neural network (DNN)" Learning Sparse Group Models Through Boolean Relaxation,https://openreview.net/forum?id=Do9MOlwWHu0,https://openreview.net/pdf?id=Do9MOlwWHu0,,"We introduce an efficient algorithmic framework for learning sparse group models formulated as the natural convex relaxation of a cardinality-constrained program with Boolean variables. We provide theoretical techniques to characterize the equivalent condition under which the relaxation achieves the exact integral optimal solution, as well as a rounding algorithm to produce a feasible integral solution once the optimal relaxation solution is fractional. We demonstrate the power of our equivalent condition by applying it to two ensembles of random problem instances that are challenging and popularly used in the literature and prove that our method achieves exactness with overwhelming probability and nearly optimal sample complexity.
Empirically, we use synthetic datasets to demonstrate that our proposed method significantly outperforms the state-of-the-art group sparse learning models in terms of individual and group support recovery when the number of samples is small. Furthermore, we show that our method outperforms alternatives in cancer drug response prediction.","Structured sparsity, Convex relaxation, Cardinality-constrained program, Small sample size" LVQ-VAE: End-to-end Hyperprior-based Variational Image Compression with Lattice Vector Quantization,https://openreview.net/forum?id=1pGmKJvneD7,https://openreview.net/pdf?id=1pGmKJvneD7,,"Image compression has become an increasingly important research topic. In recent years, learning-based methods have been extensively studied and variational autoencoder (VAE)-based methods using a hyperprior-based context-adaptive entropy model have been reported to be comparable to the latest video coding standard H.266/VVC in terms of RD performance. We believe there is room for improvement in the quantization of latent features by adopting vector quantization (VQ). Many VAE-based methods use scalar quantization for latent features and do not exploit correlations between the features. Although there are methods that incorporate VQ into learning-based methods, to the best of our knowledge, no studies utilize a hyperprior-based VAE with VQ, because incorporating VQ into a hyperprior-based VAE makes it difficult to estimate the likelihood. In this paper, we propose a new VAE-based image compression method using a VQ-based latent representation for a hyperprior-based context-adaptive entropy model to improve the coding efficiency. The proposed method resolves the codebook-size bloat problem faced by conventional VQ-based methods by adopting lattice VQ as the base quantization method, and achieves end-to-end optimization with the hyperprior-based context-adaptive entropy model by approximating the likelihood calculation of latent feature vectors with high accuracy using Monte Carlo integration. Furthermore, in likelihood estimation, we model each latent feature vector with a multivariate normal distribution, including covariance matrix parameters, which improves the likelihood estimation accuracy and RD performance. Experimental results show that the proposed method achieves state-of-the-art RD performance, exceeding existing learning-based methods and the latest video coding standard H.266/VVC by 18.0%.","Image Compression, Variational Autoencoder, Vector Quantization, Lattice" Direct Embedding of Temporal Network Edges via Time-Decayed Line Graphs,https://openreview.net/forum?id=Qamz7Q_Ta1k,https://openreview.net/pdf?id=Qamz7Q_Ta1k,We propose a line graph-based method for temporal networks which directly embeds temporal edges.,"Temporal networks model a variety of important phenomena involving timed interactions between entities. Existing methods for machine learning on temporal networks generally exhibit at least one of two limitations. First, time is assumed to be discretized, so if the time data is continuous, the user must determine the discretization and discard precise time information. Second, edge representations can only be calculated indirectly from the nodes, which may be suboptimal for tasks like edge classification. We present a simple method that avoids both shortcomings: construct the line graph of the network, which includes a node for each interaction, and weigh the edges of this graph based on the difference in time between interactions.
From this derived graph, edge representations for the original network can be computed with efficient classical methods. The simplicity of this approach facilitates explicit theoretical analysis: we can constructively show the effectiveness of our method's representations for a natural synthetic model of temporal networks. Empirical results on real-world networks demonstrate our method's efficacy and efficiency on both edge classification and temporal link prediction.","temporal, networks, graphs, embedding" Neural DAG Scheduling via One-Shot Priority Sampling,https://openreview.net/forum?id=WL8FlAugqQ,https://openreview.net/pdf?id=WL8FlAugqQ,We propose a novel ML scheduler that uses a one-shot neural network encoder to sample node priorities which are converted by list scheduling to the final schedules.,"We consider the problem of scheduling operations/nodes, the dependency among which is characterized by a Directed Acyclic Graph (DAG). Due to its NP-hard nature, heuristic algorithms were traditionally used to acquire reasonably good solutions, and more recent works have proposed Machine Learning (ML) heuristics that can generalize to unseen graphs and outperform the non-ML heuristics. However, it is computationally costly to generate solutions using existing ML schedulers since they adopt the episodic reinforcement learning framework that necessitates multi-round neural network processing. We propose a novel ML scheduler that uses a one-shot neural network encoder to sample node priorities which are converted by list scheduling to the final schedules. Since the one-shot encoder can efficiently sample the priorities in parallel, our algorithm runs significantly faster than existing ML baselines and has comparable run time to fast traditional heuristics. We empirically show that our algorithm generates better schedules than both non-neural and neural baselines across various real-world and synthetic scheduling tasks.","Combinatorial Optimization, Directed Acyclic Graph, Scheduling, Graph Neural Network, Reinforcement Learning" TrajGRU-Attention-ODE: Novel Spatiotemporal Predictive Models,https://openreview.net/forum?id=WHgGpgHFNT,https://openreview.net/pdf?id=WHgGpgHFNT,This paper presents novel deep learning models for spatiotemporal predictive tasks.,"To perform the long-term spatiotemporal sequence prediction (SSP) task with irregular time sampling assumptions, we build sequence-to-sequence models based on the Trajectory Gated Recurrent Unit (TrajGRU) network and our proposed deep learning modules. First, we design a novel attention mechanism, namely Motion-based Attention (MA), and insert it into the TrajGRU network to create the TrajGRU-Attention model. In particular, the TrajGRU-Attention model can alleviate the impact of vanishing gradients, which cause blurry long-term predictions, and can handle both regularly and irregularly sampled time series. Second, leveraging advances in the Neural Ordinary Differential Equation (NODE) technique, we propose the TrajGRU-Attention-ODE model, which can be applied in continuous-time applications. To evaluate the performance of the proposed models, we select three spatiotemporal datasets of increasing complexity: MovingMNIST, MovingMNIST++, and KTH Action. Our models outperform the state-of-the-art NODE model and generate better results than the standard TrajGRU model for SSP tasks under different time sampling conditions.
","Spatiotemporal predictive model, convolutional recurrent neural network, attention mechanism, neural ordinary differential equation, irregularly sampled time series." Learning-Based Radiomic Prediction of Type 2 Diabetes Mellitus Using Image-Derived Phenotypes,https://openreview.net/forum?id=vk96czrH2Y,https://openreview.net/pdf?id=vk96czrH2Y,Tabular learning models can predict patient diabetes risk from a combination of noninvasive physical examination and CT imaging data.,"Early diagnosis of Type 2 Diabetes Mellitus (T2DM) is crucial to enable timely therapeutic interventions and lifestyle modifications. As medical imaging data become more widely available for many patient populations, we sought to investigate whether image-derived phenotypic data could be leveraged in tabular learning classifier models to predict T2DM incidence without the use of invasive blood lab measurements. We show that both neural network and decision tree models that use image-derived phenotypes can predict patient T2DM status with recall scores as high as 87.6%. We also propose the novel use of these same architectures as 'SynthA1c encoders' that are able to output interpretable values mimicking blood hemoglobin A1C empirical lab measurements. Finally, we demonstrate that T2DM risk prediction model sensitivity to small perturbations in input vector components can be used to predict performance on covariates sampled from previously unseen patient populations.","Tabular Data, Disease Prediction, Machine Learning for Health, Out-of-Domain Generalization, Classification" Efficiently Computing Nash Equilibria in Adversarial Team Markov Games,https://openreview.net/forum?id=mjzm6btqgV,https://openreview.net/pdf?id=mjzm6btqgV,Nash equlibrium can be computed effieciently in Markov games where a single player competes against multiple agents who share a common interest.," Computing Nash equilibrium policies is a central problem in multi-agent reinforcement learning that has received extensive attention both in theory and in practice. However, in light of computational intractability barriers in general-sum games, provable guarantees have been thus far either limited to fully competitive or cooperative scenarios or impose strong assumptions that are difficult to meet in most practical applications. In this work, we depart from those prior results by investigating infinite-horizon \emph{adversarial team Markov games}, a natural and well-motivated class of games in which a team of identically-interested players---in the absence of any explicit coordination or communication---is competing against an adversarial player. This setting allows for a unifying treatment of zero-sum Markov games and Markov potential games, and serves as a step to model more realistic strategic interactions that feature both competing and cooperative interests. Our main contribution is the first algorithm for computing stationary $\epsilon$-approximate Nash equilibria in adversarial team Markov games with computational complexity that is polynomial in all the natural parameters of the game, as well as $1/\epsilon$. The proposed algorithm is based on performing independent policy gradient steps for each player in the team, in tandem with best responses from the side of the adversary; in turn, the policy for the adversary is then obtained by solving a carefully constructed linear program. 
Our analysis leverages non-standard techniques to establish the KKT optimality conditions for a nonlinear program with nonconvex constraints, thereby leading to a natural interpretation of the induced Lagrange multipliers.","multiagent-reinforcement-learning, marl, rl, reinforcement-learning, learning-in-games, optimization, game-theory, policy-gradient" Meta Temporal Point Processes,https://openreview.net/forum?id=QZfdDpTX1uM,https://openreview.net/pdf?id=QZfdDpTX1uM,We present a novel approach to train temporal point processes in a meta learning framework.,"A temporal point process (TPP) is a stochastic process whose realization is a sequence of discrete events in time. Recent work on TPPs models the process using a neural network in a supervised learning framework, where a training set is a collection of all the sequences. In this work, we propose to train TPPs in a meta learning framework, where each sequence is treated as a different task, via a novel framing of TPPs as neural processes (NPs). We introduce context sets to model TPPs as an instantiation of NPs. Motivated by attentive NP, we also introduce local history matching to help learn more informative features. We demonstrate the potential of the proposed method on popular public benchmark datasets and tasks, and compare with state-of-the-art TPP methods.","Temporal Point Process, Asynchronous Time Series, Meta-learning" EmbedDistill: A geometric knowledge distillation for information retrieval,https://openreview.net/forum?id=-aEuKX6zQKmr,https://openreview.net/pdf?id=-aEuKX6zQKmr,We propose a novel distillation approach to train dual encoder information retrieval models that goes beyond score-matching and aims to explicitly align embedding spaces of teacher and student models.,"Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval. In this paper, we aim to improve distillation methods that pave the way for the deployment of such models in practice. The proposed distillation approach supports both retrieval and re-ranking stages and crucially leverages the relative geometry among queries and documents learned by the large teacher model. It goes beyond existing distillation methods in the information retrieval literature, which simply rely on the teacher's scalar scores over the training data, on two fronts: providing stronger signals about local geometry via embedding matching and attaining better coverage of the data manifold globally via query generation. Embedding matching provides a stronger signal to align the representations of the teacher and student models. At the same time, query generation explores the data manifold to reduce the discrepancies between the student and teacher where the training data is sparse. Our distillation approach is theoretically justified and applies to both dual encoder (DE) and cross-encoder (CE) models.
Furthermore, for distilling a CE model to a DE model via embedding matching, we propose a novel dual pooling-based scorer for the CE model that facilitates a more distillation-friendly embedding geometry, especially for DE student models.","Knowledge distillation, dual encoder, cross encoder, information retrieval, query generation, embedding matching, retrieval, re-ranking" Graph Neural Network-Inspired Kernels for Gaussian Processes in Semi-Supervised Learning,https://openreview.net/forum?id=flap0Bo6TK_,https://openreview.net/pdf?id=flap0Bo6TK_,"Graph-based Gaussian process kernels are developed based on graph neural networks, showing competitive semi-supervised learning performance and timing advantage.","Gaussian processes (GPs) are an attractive class of machine learning models because of their simplicity and flexibility as building blocks of more complex Bayesian models. Meanwhile, graph neural networks (GNNs) have recently emerged as a promising class of models for graph-structured data in semi-supervised learning and beyond. Their competitive performance is often attributed to a proper capturing of the graph inductive bias. In this work, we introduce this inductive bias into GPs to improve their predictive performance for graph-structured data. We show that a prominent example of GNNs, the graph convolutional network, is equivalent to some GP when its layers are infinitely wide; and we analyze the kernel universality and the limiting behavior in depth. We further present a programmable procedure to compose covariance kernels inspired by this equivalence and derive example kernels corresponding to several interesting members of the GNN family. We also propose a computationally efficient approximation of the covariance matrix for scalable posterior inference with large-scale data. We demonstrate that these graph-based kernels lead to competitive classification and regression performance, as well as advantages in computation time, compared with the respective GNNs.","graph neural network, Gaussian process, semi-supervised learning" Deconstructing Distributions: A Pointwise Framework of Learning,https://openreview.net/forum?id=9IaN4FkVSR1,https://openreview.net/pdf?id=9IaN4FkVSR1,"We propose a new lens for studying the pointwise performance of learning algorithms which reveals new insights into their behavior and goes beyond traditional notions of in-distribution and ""out-of-distribution"" learning. ","In machine learning, we traditionally evaluate the performance of a single model, averaged over a collection of test inputs. In this work, we propose a new approach: we measure the performance of a collection of models when evaluated at a *single input point*. Specifically, we study a point's *profile*: the relationship between models' average performance on the test distribution and their pointwise performance on this individual point. We find that profiles can yield new insights into the structure of both models and data---in and out-of-distribution. For example, we empirically show that real data distributions consist of points with qualitatively different profiles. On one hand, there are ``compatible'' points with strong correlation between the pointwise and average performance. On the other hand, there are points with weak and even *negative* correlation: cases where improving overall model accuracy actually *hurts* performance on these inputs.
As an application, we use profiles to construct a dataset we call CIFAR-10-NEG: a subset of CINIC-10 such that for standard models, accuracy on CIFAR-10-NEG is *negatively correlated* with CIFAR-10 accuracy. This illustrates, for the first time, an OOD dataset that completely inverts ``accuracy-on-the-line'' (Miller et al., 2021).","understanding deep learning, empirical investigation, distribution shift" Logical view on fairness of a binary classification task,https://openreview.net/forum?id=MMiaF8KppTZ,https://openreview.net/pdf?id=MMiaF8KppTZ,The fairness of a binary classifier is a logical phenomenon since its loss is not expressible in the first-order logic of a suitable model.,"Ethical, interpretable/explainable, and responsible AI is an active area of research and an important social initiative. We prove that, with no regard to data, fairness and trustworthiness are algorithmically undecidable for a basic machine learning task, binary classification. Therefore, even the approach based on not only improving but fully solving the three usually assumed issues -- the insufficient quality of measurements, the complex consequences of (mis)measurements, and the limits of existing social theories -- is only a heuristic. We show that, effectively, the fairness of a classifier is not even a (version of bias-variance) trade-off since it is a logical phenomenon. Namely, we reveal a language $L$ and an $L$-theory $T$ for the binary classification task such that the very notion of loss is not expressible as a first-order formula in $L$.","binary classification, fairness, first-order logic, decidability" Revisiting Instance-Reweighted Adversarial Training,https://openreview.net/forum?id=ALbEpTC4hBp,https://openreview.net/pdf?id=ALbEpTC4hBp,We clarify a weakness of previous methods and propose a method to resolve the weakness by transforming margins into an appropriate representation.,"Instance-reweighted adversarial training (IRAT) is a type of adversarial training that assigns large weights to high-importance examples and then minimizes the weighted loss. The importance is often measured using the margins between decision boundaries and each example. In particular, IRAT can alleviate robust overfitting and obtain excellent robustness by computing margins with an estimated probability. However, previous works implicitly dealt with binary classification even in the multi-class cases, because they computed margins with only the true class and the most confusing class. The computed margins can become equal even for examples with different true probabilities, because of the complex decision boundaries in multi-class classification. In this paper, first, we clarify the above problem with a specific example. Then, we propose \textit{margin reweighting}, which can transform the previous margins into appropriate representations for multi-class classification by leveraging the relations between the most confusing class and other classes.
Experimental results on the CIFAR-10/100 datasets demonstrate that the proposed method is effective in boosting robustness against several attacks compared to previous methods.","Adversarial training, Adversarial robustness, Instance-reweighted" Diffusion Models for Causal Discovery via Topological Ordering,https://openreview.net/forum?id=Idusfje4-Wq,https://openreview.net/pdf?id=Idusfje4-Wq,"We use diffusion models for causal discovery by iteratively finding and removing leaves in the causal graph, resulting in an efficient topological ordering algorithm for high-dimensional graphs.","Discovering causal relations from observational data becomes possible with additional assumptions, such as constraining the functional relations to be nonlinear with additive noise. In this case, the \emph{Hessian} of the data log-likelihood can be used for finding leaf nodes in a causal graph. Topological ordering approaches for causal discovery exploit this by performing graph discovery in two steps, first sequentially identifying nodes in reverse order of depth (\emph{topological ordering}), and secondly pruning the potential relations. This is more efficient since the search is performed over a permutation rather than a graph space. However, existing computational methods for obtaining the Hessian still do not scale as the number of variables and the number of samples increase. Therefore, inspired by recent innovations in diffusion probabilistic models (DPMs), we propose \emph{DiffAN}, a topological ordering algorithm that leverages DPMs. Further, we introduce theory for updating the learned Hessian without re-training the neural network, and we show that computing with a subset of samples gives an accurate approximation of the ordering, which allows scaling to datasets with more samples and variables. We show empirically that our method scales exceptionally well to datasets with up to $500$ nodes and up to $10^5$ samples while still performing on par with state-of-the-art causal discovery methods on small datasets.","Diffusion Models, Causal Discovery, Topological Ordering, Score-based Methods" Scalable and Equivariant Spherical CNNs by Discrete-Continuous (DISCO) Convolutions,https://openreview.net/forum?id=eb_cpjZZ3GH,https://openreview.net/pdf?id=eb_cpjZZ3GH,A discrete-continuous (DISCO) spherical CNN framework that is simultaneously rotationally equivariant and computationally scalable and achieves state-of-the-art on numerous benchmarks,"No existing spherical convolutional neural network (CNN) framework is both computationally scalable and rotationally equivariant. Continuous approaches capture rotational equivariance but are often prohibitively computationally demanding. Discrete approaches offer more favorable computational performance but at the cost of equivariance. We develop a hybrid discrete-continuous (DISCO) group convolution that is simultaneously equivariant and computationally scalable to high-resolution. While our framework can be applied to any compact group, we specialize to the sphere. Our DISCO spherical convolutions not only exhibit $\text{SO}(3)$ rotational equivariance but also a form of asymptotic $\text{SO}(3)/\text{SO}(2)$ rotational equivariance, which is more desirable for many applications (where $\text{SO}(n)$ is the special orthogonal group representing rotations in $n$-dimensions). Through a sparse tensor implementation we achieve linear scaling in the number of pixels on the sphere for both computational cost and memory usage.
For 4k spherical images we realize a saving of $10^9$ in computational cost and $10^4$ in memory usage when compared to the most efficient alternative equivariant spherical convolution. We apply the DISCO spherical CNN framework to a number of benchmark dense-prediction problems on the sphere, such as semantic segmentation and depth estimation, on all of which we achieve state-of-the-art performance.","Spherical CNNs, rotational equivariance, efficient algorithms" Towards Solving Industrial Sequential Decision-making Tasks under Near-predictable Dynamics via Reinforcement Learning: an Implicit Corrective Value Estimation Approach,https://openreview.net/forum?id=UawwAryavZI,https://openreview.net/pdf?id=UawwAryavZI,We decouple the data dynamics of industrial sequential decision-making tasks and design a bi-critic framework to solve the state transition uncertainty.,"Learning to plan and schedule is receiving increasing attention for industrial decision-making tasks for its potential to outperform heuristics, especially under dynamic uncertainty, as well as its efficiency in problem-solving, especially with the adoption of neural networks and the GPU computing behind them. Naturally, reinforcement learning (RL) with the Markov decision process (MDP) becomes a popular paradigm. Rather than handling near-stationary environments like Atari games, or the opposite case of open-world dynamics with high uncertainty, in this paper we aim to devise a tailored RL-based approach for the setting in between: near-predictable dynamics, which often hold in many industrial applications, e.g., elevator scheduling and bin packing, as empirical case studies tested in this paper. We formulate a two-stage MDP by decoupling the data dynamics from the industrial environment. Specifically, we design a bi-critic framework for estimating the state value in stages according to the two-stage MDP.", Graph Convolutional Normalizing Flows for Semi-Supervised Classification and Clustering,https://openreview.net/forum?id=3i9EgUss-Vs,https://openreview.net/pdf?id=3i9EgUss-Vs,"A normalizing flow architecture based on graphs is developed for semi-supervised learning, producing high-quality classification and clustering.","Graph neural networks (GNNs) are \emph{discriminative models} that directly model the class posterior $p(y|\mathbf{x})$ for semi-supervised classification of graph data. While being effective for prediction, as a representation learning approach, the node representations extracted from a GNN often miss useful information for effective clustering, because such information is not necessary for good classification. In this work, we replace a GNN layer by a combination of graph convolutions and normalizing flows under a Gaussian mixture representation space, which allows us to build a \emph{generative model} that models both the class conditional likelihood $p(\mathbf{x}|y)$ and the class prior $p(y)$. The resulting neural network, GC-Flow, enjoys two benefits: it not only maintains the predictive power because of the retention of graph convolutions, but also produces high-quality clusters in the representation space, due to the structuring of the representation as a mixture of Gaussians. We demonstrate these benefits on a variety of benchmark data sets. Moreover, we show that additional parameterization, such as that on the adjacency matrix used for graph convolutions, yields additional improvement in classification and clustering.
","graph convolutional network, normalizing flow, generative model" Weakly Supervised Explainable Phrasal Reasoning with Neural Fuzzy Logic,https://openreview.net/forum?id=Hu4r-dedqR0,https://openreview.net/pdf?id=Hu4r-dedqR0,,"Natural language inference (NLI) aims to determine the logical relationship between two sentences, such as Entailment, Contradiction, and Neutral. In recent years, deep learning models have become a prevailing approach to NLI, but they lack interpretability and explainability. In this work, we address the explainability for NLI by weakly supervised logical reasoning, and propose an Explainable Phrasal Reasoning (EPR) approach. Our model first detects phrases as the semantic unit and aligns corresponding phrases in the two sentences. Then, the model predicts the NLI label for the aligned phrases, and induces the sentence label by fuzzy logic formulas. Our EPR is almost everywhere differentiable and thus the system can be trained end to end. In this way, we are able to provide explicit explanations of phrasal logical relationships in a weakly supervised manner. We further show that such reasoning results help textual explanation generation.","Neural Fuzzy Logic, Weakly Supervised Reasoning, Natural Language Inference, Explainability and Interpretability" Simplified State Space Layers for Sequence Modeling,https://openreview.net/forum?id=Ai8Hw3AXqks,https://openreview.net/pdf?id=Ai8Hw3AXqks,"We introduce a new state space sequence modeling layer, building on the recent S4 layer, that increases the state of the art on many long-range benchmark tasks.","Models using structured state space sequence (S4) layers have achieved state-of-the-art performance on long-range sequence modeling tasks. An S4 layer combines linear state space models (SSMs), the HiPPO framework, and deep learning to achieve high performance. We build on the design of the S4 layer and introduce a new state space layer, the S5 layer. Whereas an S4 layer uses many independent single-input, single-output SSMs, the S5 layer uses one multi-input, multi-output SSM. We establish a connection between S5 and S4, and use this to develop the initialization and parameterization used by the S5 model. The result is a state space layer that can leverage efficient and widely implemented parallel scans, allowing S5 to match the computational efficiency of S4 while achieving state-of-the-art performance on several long-range sequence modeling tasks. S5 averages $87.3\%$ on the long range arena benchmark, and $98.5\%$ on the most difficult Path-X task.","sequence models, state space, S4, RNN, transformers, long range arena" Learning Listwise Domain-Invariant Representations for Ranking,https://openreview.net/forum?id=m4f7Wl93fzT,https://openreview.net/pdf?id=m4f7Wl93fzT,We establish a domain adaptation generalization bound for ranking and propose a method based on learning listwise invariant representations.,"Domain adaptation aims to transfer models trained on data-rich domains to low-resource ones, for which a popular method is invariant representation learning. While they have been studied extensively for classification and regression problems, how they would apply to ranking problems, where the metrics and data follow a list structure, is not well understood. Theoretically, we establish a generalization bound for ranking problems under metrics including MRR and NDCG, leading to a method based on learning listwise invariant feature representations. 
The main novelty of our results is that they are tailored to the listwise approach of learning to rank: the invariant representations our method learns are for each list of items as a whole, instead of the individual items they contain. Our method is evaluated on the passage reranking task, where we adapt neural text rankers trained on a general domain to various specialized domains.","learning to rank, domain adaptation, text ranking" DCI-ES: An Extended Disentanglement Framework with Connections to Identifiability,https://openreview.net/forum?id=462z-gLgSht,https://openreview.net/pdf?id=462z-gLgSht,We extend the DCI framework for evaluating disentangled representations and connect it to identifiability.,"In representation learning, a common approach is to seek representations which disentangle the underlying factors of variation. Eastwood & Williams (2018) proposed three metrics for quantifying the quality of such disentangled representations: disentanglement (D), completeness (C) and informativeness (I). In this work, we first connect this DCI framework to two common notions of linear and nonlinear identifiability, thus establishing a formal link between disentanglement and the closely-related field of independent component analysis. We then propose an extended DCI-ES framework with two new measures of representation quality—explicitness (E) and size (S)—and point out how D and C can be computed for black-box predictors. Our main idea is that the functional capacity required to use a representation is an important but thus-far neglected aspect of representation quality, which we quantify using explicitness or ease-of-use (E). We illustrate the relevance of our extensions on the MPI3D and Cars3D datasets.","disentanglement, identifiability, representation learning" Eigenvalue Initialisation and Regularisation for Koopman Autoencoders,https://openreview.net/forum?id=6TugHflAGRU,https://openreview.net/pdf?id=6TugHflAGRU,Using eigenvalues to regularise and initialise Koopman autoencoders improves performance significantly,"Regularising the parameter matrices of neural networks is ubiquitous in training deep models. Typical regularisation approaches suggest initialising weights with small random values and penalising weights to promote sparsity. However, these widely used techniques may be less effective in certain scenarios. Here, we study the Koopman autoencoder model, which includes an encoder, a Koopman operator layer, and a decoder. These models are designed to tackle physics-related problems with interpretable dynamics and an ability to incorporate physics-related constraints. However, the majority of existing work employs standard regularisation practices. In our work, we take a step toward augmenting Koopman autoencoders with initialisation and penalty schemes tailored for physics-related settings. Specifically, we propose the ""eigeninit"" initialisation scheme that samples initial Koopman operators from specific eigenvalue distributions. In addition, we suggest the ""eigenloss"" penalty scheme that penalises the eigenvalues of the Koopman operator during training. We demonstrate the utility of these schemes on two synthetic data sets: a driven pendulum and flow past a cylinder; and two real-world problems: ocean surface temperatures and cyclone wind fields. We find on these datasets that eigenloss and eigeninit improve the convergence rate by a factor of 2 to 5, and that they reduce the cumulative long-term prediction error by up to a factor of 2.5.
Such a finding points to the utility of incorporating similar schemes as an inductive bias in other physics-related deep learning approaches.","koopman, deep learning, dynamical systems, autoencoders, physics-constrained learning, neural networks" Learning from Labeled Images and Unlabeled Videos for Video Segmentation,https://openreview.net/forum?id=Q0XkE_srKnG,https://openreview.net/pdf?id=Q0XkE_srKnG,,"Performance on video object segmentation still lags behind that of image segmentation due to a paucity of labeled videos. Annotations are time-consuming and laborious to collect, and may not be feasibly obtained in certain situations. However, there is a growing amount of freely available unlabeled video data which has spurred interest in unsupervised video representation learning. In this work we focus on the setting in which there is little or no access to labeled videos for video object segmentation. To this end we leverage large-scale image segmentation datasets and adversarial learning to train 2D/3D networks for video object segmentation. We first motivate the treatment of images and videos as two separate domains by analyzing the performance gap of an image segmentation network trained on images and applied to videos. Through studies using several image and video segmentation datasets, we show how an adversarial loss placed at various locations within the network can make feature representations invariant to these domains and improve the performance when the network has access to only labeled images and unlabeled videos. To prevent the loss of discriminative semantic class information we apply our adversarial loss within clusters of features and show this boosts our method's performance within Transformer-based models.","Video, Segmentation, Representation" Score-based Generative 3D Mesh Modeling,https://openreview.net/forum?id=0cpM2ApF9p6,https://openreview.net/pdf?id=0cpM2ApF9p6,Diffusion model on 3D meshes of arbitrary topology by directly parametrizing meshes with tetrahedral grids,"We consider the task of generating realistic 3D shapes, which is useful for a variety of applications such as automatic scene generation and physical simulation. Compared to other 3D representations like voxels and point clouds, meshes are more desirable in practice, because (1) they enable easy and arbitrary manipulation of shapes for relighting and simulation, and (2) they can fully leverage the power of modern graphics pipelines which are mostly optimized for meshes. Existing scalable methods for generating meshes typically rely on sub-optimal post-processing, and they tend to produce overly-smooth or noisy surfaces without fine-grained geometric details. To overcome these shortcomings, we take advantage of the regular graph structure of meshes and use a simple yet very effective generative modeling method to generate 3D meshes. Specifically, we represent meshes in a deformable tetrahedral grid, and then train a diffusion model on this direct parameterization. We demonstrate the effectiveness of our model on multiple generative tasks.","generative model, diffusion model, 3D mesh, shape generation" Faster federated optimization under second-order similarity,https://openreview.net/forum?id=ElC6LYO4MfD,https://openreview.net/pdf?id=ElC6LYO4MfD,,"Federated learning (FL) is a subfield of machine learning where multiple clients try to collaboratively learn a model over a network under communication constraints.
We consider finite-sum federated optimization under a second-order function similarity condition and strong convexity, and propose two new algorithms: SVRP and Catalyzed SVRP. This second-order similarity condition has grown popular recently, and is satisfied in many applications including distributed statistical learning and differentially private empirical risk minimization. The first algorithm, SVRP, combines approximate stochastic proximal point evaluations, client sampling, and variance reduction. We show that SVRP is communication efficient and achieves superior performance to many existing algorithms when function similarity is high enough. Our second algorithm, Catalyzed SVRP, is a Catalyst-accelerated variant of SVRP that achieves even better performance and uniformly improves upon existing algorithms for federated optimization under second-order similarity and strong convexity. In the course of analyzing these algorithms, we provide a new analysis of the Stochastic Proximal Point Method (SPPM) that might be of independent interest. Our analysis of SPPM is simple, allows for approximate proximal point evaluations, does not require any smoothness assumptions, and shows a clear benefit in communication complexity over ordinary distributed stochastic gradient descent.","federated learning, distributed optimization, hessian similarity, client sampling, stochastic proximal point, proximal point method, distributed learning" Assessing Neural Network Robustness via Adversarial Pivotal Tuning of Real Images,https://openreview.net/forum?id=4k95LUAcqi,https://openreview.net/pdf?id=4k95LUAcqi,Utilizing StyleGAN's full capacity to manipulate images semantically so as to fool image classifiers through a process called Adversarial Pivotal Tuning.,"The ability to assess the robustness of image classifiers to a diverse set of manipulations is essential to their deployment in the real world. Recently, semantic manipulations of real images have been considered for this purpose, as they may not arise in standard adversarial settings. However, such semantic manipulations are often limited to style, color or attribute changes. While expressive, these manipulations do not consider the full capacity of a pretrained generator to effect adversarial image manipulations. In this work, we aim at leveraging the full capacity of a pretrained image generator to generate highly detailed, diverse and photorealistic image manipulations. Inspired by recent GAN-based image inversion methods, we propose a method called Adversarial Pivotal Tuning (APT). APT first finds a pivot latent space input to a pretrained generator that best reconstructs an input image. It then adjusts the weights of the generator to create small, but semantic, manipulations which fool a pretrained classifier. Crucially, APT changes both the input and the weights of the pretrained generator, while preserving its expressive latent editing capability, thus allowing the use of its full capacity in creating semantic adversarial manipulations. We demonstrate that APT generates a variety of semantic image manipulations, which preserve the input image class, but which fool a variety of pretrained classifiers.
We further demonstrate that classifiers trained to be robust on other robustness benchmarks are not robust to our generated manipulations, and we propose an approach to improve robustness against them.","Robustness, Adversarial Examples, StyleGAN, Generative Models" REV: Information-Theoretic Evaluation of Free-Text Rationales,https://openreview.net/forum?id=jg9ELHRfHD7,https://openreview.net/pdf?id=jg9ELHRfHD7,This paper proposes an evaluation metric based on conditional $\mathcal{V}$-information to measure the information in free-text rationales.,"Free-text rationales are a promising step towards explainable AI, yet their evaluation remains an open research problem. While existing metrics have mostly focused on measuring the direct association between the rationale and a given label, we argue that an ideal metric should also be able to focus on the new information uniquely provided in the rationale that is otherwise not provided in the input or the label. We investigate this research problem from an information-theoretic perspective using the conditional $\mathcal{V}$-information \citep{hewitt-etal-2021-conditional}. More concretely, we propose a metric called REV (Rationale Evaluation with conditional $\mathcal{V}$-information), that can quantify the new information in a rationale supporting a given label beyond the information already available in the input or the label. Experiments on reasoning tasks across four benchmarks, including few-shot prompting with GPT-3, demonstrate the effectiveness of REV in evaluating different types of rationale-label pairs, compared to existing metrics. Through several quantitative comparisons, we demonstrate the capability of REV in providing more sensitive measurements of new information in free-text rationales with respect to a label. Furthermore, REV is consistent with human judgments on rationale evaluations. Overall, when used alongside traditional performance metrics, REV provides deeper insights into a model's reasoning and prediction processes.","free-text rationales, conditional V-information, evaluation metric, explainable AI" Examining the Difference Among Transformers and CNNs with Explanation Methods,https://openreview.net/forum?id=383GRAoNhzb,https://openreview.net/pdf?id=383GRAoNhzb,,"We propose a methodology that systematically applies deep explanation algorithms on a dataset-wide basis, to compare different types of visual recognition backbones, such as convolutional networks (CNNs), global attention networks, and local attention networks. We examine both qualitative visualizations and quantitative statistics across the dataset, in order to generate intuitions that are not just anecdotal, but are supported by the statistics computed on the whole dataset. Specifically, we propose two methods. The first one, sub-explanation counting, systematically searches for minimally-sufficient explanations of all images and counts the number of sub-explanations for each network. The second one, called cross-testing, computes salient regions using one network and then evaluates the performance by only showing these regions as an image to other networks.
Through a combination of qualitative insights and quantitative statistics, we illustrate that 1) there are significant differences between the salient features of CNNs and attention models; 2) the occlusion-robustness in local attention models and global attention models may come from different decision-making mechanisms.","Explanation, transformers, multiple explanations" A Quasistatic Derivation of Optimization Algorithms' Exploration on Minima Manifolds,https://openreview.net/forum?id=UDbNL0_W-3x,https://openreview.net/pdf?id=UDbNL0_W-3x,,"A quasistatic approach is proposed to derive the optimization algorithms' effective dynamics on the manifold of minima when the iterate oscillates around the manifold. Compared with existing rigorous analyses, our derivation method is simple and intuitive, has wide applicability, and produces easy-to-interpret results. As examples, we derive the manifold dynamics for SGD, SGD with momentum (SGDm) and Adam with different noise covariances, and justify the closeness of the derived manifold dynamics to the true dynamics through numerical experiments. We then use minima manifold dynamics to study and compare the properties of optimization algorithms. For SGDm, we show that scaling up the learning rate and batch size simultaneously accelerates exploration without affecting generalization, which confirms a benefit of large batch training. For Adam, we show that the speed of its manifold dynamics changes with the direction of the manifold, because Adam is not rotationally invariant. This may cause slow exploration in high-dimensional parameter spaces.","Implicit bias, minima manifold, time-scale separation, Adam" Mutual Partial Label Learning with Competitive Label Noise,https://openreview.net/forum?id=EUrxG8IBCrC,https://openreview.net/pdf?id=EUrxG8IBCrC,,"Partial label learning (PLL) is an important weakly supervised learning problem, where each training instance is associated with a set of candidate labels that include both the true label and noise labels. Most existing PLL methods assume the candidate noise labels are randomly chosen, which hardly holds in real-world learning scenarios. In this paper, we consider a more realistic PLL scenario with competitive noise labels that are more difficult to distinguish from the true label than random noise labels. We propose a novel Mutual Learning based PLL approach named ML-PLL to address this challenging problem. ML-PLL learns a prediction-network-based classifier and a class-prototype-based classifier cooperatively through interactive mutual learning and label correction. Moreover, we use a transformation network to model the association relationships between the true label and candidate noise labels, and learn it together with the prediction network to match the observed candidate labels in the training data and enhance label correction. Extensive experiments are conducted on several benchmark PLL datasets, and the proposed ML-PLL approach demonstrates state-of-the-art performance for partial label learning.","Partial label learning, label noise, classification" The Graph Learning Attention Mechanism: Learnable Sparsification Without Heuristics,https://openreview.net/forum?id=0eSq84hbXhe,https://openreview.net/pdf?id=0eSq84hbXhe,"We introduce a drop-in, differentiable graph structure learning layer for use with GNNs.","Graph Neural Networks (GNNs) are local aggregators that derive their expressive power from their sensitivity to network structure.
However, this sensitivity comes at a cost: noisy edges degrade performance. In response, many GNNs include edge-weighting mechanisms that scale the contribution of each edge in the aggregation step. However, to account for neighborhoods of varying size, node-embedding mechanisms must normalize these edge-weights across each neighborhood. As such, the impact of noisy edges cannot be eliminated without removing those edges altogether. Motivated by this issue, we introduce the Graph Learning Attention Mechanism (GLAM): a drop-in, differentiable structure learning layer for GNNs that separates the distinct tasks of structure learning and node embedding. In contrast to existing graph learning approaches, GLAM does not require the addition of exogenous structural regularizers or edge-selection heuristics to learn optimal graph structures. In experiments on citation and co-purchase datasets, we demonstrate that our approach can match state-of-the-art semi-supervised node classification accuracies while inducing an order of magnitude greater sparsity than existing graph learning methods.","graph structure learning, graph attention networks" Partial Label Unsupervised Domain Adaptation with Class-Prototype Alignment,https://openreview.net/forum?id=jpq0qHggw3t,https://openreview.net/pdf?id=jpq0qHggw3t,This is the first partial label learning method that handles partial label learning and unsupervised domain adaptation simultaneously.,"Partial label learning (PLL) tackles the problem where each instance is associated with a set of candidate labels, only one of which is the ground-truth. Most existing PLL approaches assume that both the training and test sets share the same data distribution. However, this assumption does not hold in many real-world scenarios where the training and test data come from different distributions. In this paper, we formalize this learning scenario as a new problem called partial label unsupervised domain adaptation (PLUDA). To address this challenging PLUDA problem, we propose a novel Prototype Alignment based PLUDA method named PAPLUDA, which dynamically refines the pseudo-labels of instances from both the source and target domains by consulting the outputs of a teacher-student model in a moving-average manner, and bridges the domain discrepancy through inter-domain class prototype alignment. In addition, a teacher-student model based contrastive regularization is deployed to enhance prediction stability and hence improve the class prototypes in both domains for PLUDA. Comprehensive experimental results demonstrate that PAPLUDA achieves state-of-the-art performance on the widely used benchmark datasets.","Partial label learning, label noise, domain adaptation" Why Self Attention is Natural for Sequence-to-Sequence Problems? A Perspective from Symmetries,https://openreview.net/forum?id=dNdOnKy9YNs,https://openreview.net/pdf?id=dNdOnKy9YNs,,"In this paper, we show that structures similar to self-attention are natural for learning many sequence-to-sequence problems from the perspective of symmetry. Inspired by language processing applications, we study the orthogonal equivariance of {\it seq2seq functions with knowledge}, which are functions taking two inputs---an input sequence and a ``knowledge''---and outputting another sequence. The knowledge consists of a set of vectors in the same embedding space as the input sequence, containing the information of the language used to process the input sequence.
We show that orthogonal equivariance in the embedding space is natural for seq2seq functions with knowledge, and that under such equivariance the function must take a form close to self-attention. This shows that network structures similar to self-attention are the right structures to represent the target function of many seq2seq problems. The representation can be further refined if a ``finite information principle'' is considered, or a permutation equivariance holds for the elements of the input sequence. ","Self attention, sequence-to-sequence function, orthogonal equivariance, permutation equivariance" simpleKT: A Simple But Tough-to-Beat Baseline for Knowledge Tracing,https://openreview.net/forum?id=9HiGqC9C-KA,https://openreview.net/pdf?id=9HiGqC9C-KA,"We propose \textsc{simpleKT}, a simple but tough-to-beat KT baseline that is simple to implement, computationally friendly and robust to a wide range of KT datasets across different domains","Knowledge tracing (KT) is the problem of predicting students' future performance based on their historical interactions with intelligent tutoring systems. Recently, many works have presented specialized methods for applying deep neural networks to KT from different perspectives, such as model architecture and adversarial augmentation, which make the overall algorithms and systems increasingly complex. Furthermore, due to the lack of a standardized evaluation protocol \citep{liu2022pykt}, there are no widely agreed-upon KT baselines, and published experimental comparisons have become inconsistent and self-contradictory, e.g., the reported AUC scores of DKT on ASSISTments2009 range from 0.721 to 0.821 \citep{minn2018deep,yeung2018addressing}. Therefore, in this paper, we provide a strong but simple baseline method for the KT task, named \textsc{simpleKT}. Inspired by the Rasch model in psychometrics, we explicitly model question-specific variations to capture the individual differences among questions covering the same set of KCs. Furthermore, instead of using sophisticated representations to capture student forgetting behaviors, we use the ordinary dot-product attention function to extract the time-aware information embedded in the student learning interactions. Extensive experiments show that such a simple baseline consistently ranks in the top 3 in terms of AUC scores and achieves 57 wins, 3 ties and 16 losses against 12 DLKT baseline methods on 7 public datasets of different domains. We believe this work serves as a strong baseline for future KT research. Code is available at \url{https://tinyurl.com/5d62cdkt}.","knowledge tracing, assessment, ai for education" Exp-$\alpha$: Beyond Proportional Aggregation in Federated Learning,https://openreview.net/forum?id=TTduM2sE0Ja,https://openreview.net/pdf?id=TTduM2sE0Ja,We theoretically study properties of proportional aggregation and propose a novel aggregation strategy for faster convergence under the non-IID setting.,"Federated Learning (FL) is a distributed learning paradigm, which computes gradients of a model locally on different clients and aggregates the updates to construct a new model collectively. Typically, the updates from local clients are aggregated with weights proportional to the size of clients' local datasets. In practice, clients have different local datasets suffering from data heterogeneity, such as imbalance.
Although proportional aggregation still theoretically converges to the global optimum, it is provably slower when non-IID data is present (under convexity assumptions), an effect that is exacerbated in practice. We posit that this analysis ignores convergence rate, which is especially important in such settings in the more realistic non-convex real world. To account for this, we analyze a generic and time-varying aggregation strategy to reveal a surprising trade-off between convergence rate and convergence error under convexity assumptions. Inspired by the theory, we propose a new aggregation strategy, Exp-$\alpha$, which weights clients differently based on the severity of their data heterogeneity. It achieves stronger convergence rates at the theoretical cost of a non-vanishing convergence error. Through a series of controlled experiments, we empirically demonstrate the superior convergence behavior (both in terms of rate and, in practice, even error) of the proposed aggregation, combined with existing FL algorithms, on three types of data heterogeneity: imbalance, label-flipping, and domain shift. For example, on our imbalance benchmark, Exp-$\alpha$, combined with FedAvg, achieves a relative $12\%$ increase in convergence rate and a relative $3\%$ reduction in error across four FL communication settings. ",Federated Learning Learning Efficient Hybrid Particle-continuum Representations of Non-equilibrium N-body Systems,https://openreview.net/forum?id=n3RFM5cBB4,https://openreview.net/pdf?id=n3RFM5cBB4,We introduce a method for Learning Hybrid Particle-Continuum models that enables an efficient and accurate coupling between fluid and kinetic representations of N-body systems.,"An important class of multi-scale, non-equilibrium, N-body physical systems deals with an interplay between particle and continuum phenomena. These include hypersonic flow and plasma dynamics, materials science, and astrophysics. Hybrid solvers that combine particle and continuum representations could provide an efficient framework to model these systems. However, the coupling between these two representations has been a key challenge, and it is often limited to inaccurate or incomplete prescriptions. In this work, we introduce a method for Learning Hybrid Particle-Continuum (LHPC) models from the data of first-principles particle simulations. LHPC analyzes the local velocity-space particle distribution function and separates it into near-equilibrium (thermal) and far-from-equilibrium (non-thermal) components. The most computationally-intensive particle solver is used to advance the non-thermal particles, whereas a neural network solver is used to efficiently advance the thermal component using a continuum representation. Most importantly, an additional neural network learns the particle-continuum coupling: the dynamical exchange of mass, momentum, and energy between the particle and continuum representations. Training of the different neural network components is done in an integrated manner to ensure global consistency and stability of the LHPC model. We demonstrate our method in an intense laser-plasma interaction problem involving highly nonlinear, far-from-equilibrium dynamics associated with the coupling between electromagnetic fields and multiple particle species. More efficient modeling of these interactions is critical for the design and optimization of compact accelerators for material science and medical applications.
Our method achieves an important balance between accuracy and speed: LHPC is 8 times faster than a classical particle solver and achieves up to a 6.8-fold reduction in long-term prediction error for key quantities of interest compared to deep-learning baselines using uniform representations.","multi-scale, hybrid representation, particle-continuum, n-body, plasma" Towards Large Scale Transfer Learning for Differentially Private Image Classification,https://openreview.net/forum?id=Si_XWk8umO,https://openreview.net/pdf?id=Si_XWk8umO,"We perform a comprehensive exploration of Differentially Private training on ImageNet. Combined with large-scale transfer learning and a few insights, we obtain state-of-the-art private results with minimal computational overhead.","Differentially Private Stochastic Gradient Descent (DP-SGD) has emerged as a popular private training algorithm. Unfortunately, the computational cost of training large-scale models with DP-SGD is substantially higher than that of non-private training. This is further exacerbated by the fact that increasing the number of parameters leads to larger degradation in utility with DP. In this work, we zoom in on the ImageNet dataset and demonstrate that, similar to the non-private case, pre-training over-parameterized models on a large public dataset can lead to substantial gains when the models are finetuned privately. Moreover, by systematically comparing private and non-private models across a range of large batch sizes, we find that, similar to the non-private setting, the choice of optimizer can further improve performance substantially with DP. By using the LAMB optimizer with DP-SGD, we saw improvements of up to 20 percentage points (absolute). We also show that finetuning just the last layer for a \emph{single step} in the full batch setting, combined with extremely small-scale (near-zero) initialization, leads to SOTA results of 81.7$\%$ under a wide privacy budget range of $\epsilon \in [4, 10]$ and $\delta = 10^{-6}$, while minimizing the computational overhead substantially. Finally, we present additional results on CIFAR-10 and CIFAR-100, surpassing the previous state of the art by leveraging transfer learning with our recommendations.","Differential Privacy, Understanding Differential Privacy, Image Classification, Deep Learning" Neural Network Approximation of Lipschitz Functions in High Dimensions with Applications to Inverse Problems,https://openreview.net/forum?id=qSjf5zf5tv,https://openreview.net/pdf?id=qSjf5zf5tv,We provide neural network approximation guarantees for Lipschitz functions on low-complexity sets in high dimensions.,"The remarkable successes of neural networks in a huge variety of inverse problems have fueled their adoption in disciplines ranging from medical imaging to seismic analysis over the past decade. However, the high dimensionality of such inverse problems has simultaneously left current theory, which predicts that networks should scale exponentially in the dimension of the problem, unable to explain why the seemingly small networks used in these settings work as well as they do in practice. To reduce this gap between theory and practice, a general method for bounding the complexity required for a neural network to approximate a Lipschitz function on a high-dimensional set with a low-complexity structure is provided herein.
The approach is based on the observation that the existence of a linear Johnson-Lindenstrauss embedding $\mathbf{A} \in \mathbb{R}^{d \times D}$ of a given high-dimensional set $\mathcal{S} \subset \mathbb{R}^D$ into a low-dimensional cube $[-M,M]^d$ implies that for any Lipschitz function $f : \mathcal{S} \to \mathbb{R}^p$, there exists a Lipschitz function $g : [-M,M]^d \to \mathbb{R}^p$ such that $g(\mathbf{A}\mathbf{x}) = f(\mathbf{x})$ for all $\mathbf{x} \in \mathcal{S}$. Hence, if one has a neural network which approximates $g : [-M,M]^d \to \mathbb{R}^p$, then a layer can be added which implements the JL embedding $\mathbf{A}$ to obtain a neural network which approximates $f : \mathcal{S} \to \mathbb{R}^p$. By pairing JL embedding results with results on the approximation of Lipschitz functions by neural networks, one then obtains results which bound the complexity required for a neural network to approximate Lipschitz functions on high-dimensional sets. The end result is a general theoretical framework which can then be used to better explain the observed empirical successes of smaller networks in a wider variety of inverse problems than current theory allows.","approximation theory, neural networks, deep learning, Johnson-Lindenstrauss embedding, inverse problems" Weighted Ensemble Self-Supervised Learning,https://openreview.net/forum?id=CL-sVR9pvF,https://openreview.net/pdf?id=CL-sVR9pvF,We efficiently ensemble SSL methods and train them with new objectives to get SOTA results on ImageNet-1K SSL evaluations.,"Ensembling has proven to be a powerful technique for boosting model performance, uncertainty estimation, and robustness in supervised learning. Advances in self-supervised learning (SSL) enable leveraging large unlabeled corpora for state-of-the-art few-shot and supervised learning performance. In this paper, we explore how ensemble methods can improve recent SSL techniques by developing a framework that permits data-dependent weighted cross-entropy losses. We refrain from ensembling the representation backbone; this choice yields an efficient ensemble method that incurs a small training cost and requires no architectural changes or computational overhead to downstream evaluation. The effectiveness of our method is demonstrated with two state-of-the-art SSL methods, DINO (Caron et al., 2021) and MSN (Assran et al., 2022). Our method outperforms both in multiple evaluation metrics on ImageNet-1K, particularly in the few-shot setting. We explore several weighting schemes and find that those which increase the diversity of ensemble heads lead to better downstream evaluation results. Thorough experiments yield improved prior-art baselines, which our method still surpasses; e.g., our overall improvement with MSN ViT-B/16 is 3.9 p.p. for 1-shot learning.","self-supervised learning, ensemble, representation learning" DOT: Fast Cell Type Deconvolution by Optimal Transport,https://openreview.net/forum?id=whfYRamFiOL,https://openreview.net/pdf?id=whfYRamFiOL,Fast Optimal Transport for robust cell type mapping in high- and low-resolution spatial data ,"Single-cell RNA sequencing (scRNA-seq) and spatially-resolved imaging/sequencing technologies are the current cutting edge of transcriptomics data generation in biomedical research. On one hand, scRNA-seq data brings rich high-throughput information spanning the entire transcriptome, sacrificing the structural context of the cells.
On the other hand, high-resolution measurements of the spatial context of cells come with a trade-off in throughput and coverage. Combining data from these two modalities facilitates a better understanding of the development and organization of complex tissues, as well as the emerging processes and function of distinct constituent cell types within the tissue. Recent approaches focus only on the expression of genes available in both modalities; they do not incorporate other relevant and available features, especially the spatial context. We propose DOT, a novel optimization framework for assigning cell types to tissue locations, ensuring a high-quality mapping by taking into account relevant but previously neglected features of the data. Our model (i) incorporates ideas from Optimal Transport theory to exploit structural similarities in the data modalities, leveraging not only joint features but also distinct features, i.e. the spatial context, (ii) introduces scale-invariant distance functions to account for differences in the sensitivity of different measurement technologies, (iii) ensures representation of rare cell types using Nash-fairness objectives, and (iv) provides control over the abundance of cell types in the localization. We present a fast implementation based on the Frank-Wolfe algorithm, and we demonstrate the effectiveness of DOT on correctly assigning cell types to spatial data coming from (i) the primary motor cortex of the mouse brain, (ii) the primary somatosensory cortex of the mouse brain, and (iii) the developing human heart.","Optimal Transport, Frank-Wolfe, Cell type deconvolution, Spatial data" Partially Observable RL with B-Stability: Unified Structural Condition and Sharp Sample-Efficient Algorithms,https://openreview.net/forum?id=n05upKp02kQ,https://openreview.net/pdf?id=n05upKp02kQ,"We propose a unified structural condition for sample-efficient partially observable RL (POMDPs/PSRs), and establish substantially sharper learning results than existing ones.","Partial Observability---where agents can only observe partial information about the true underlying state of the system---is ubiquitous in real-world applications of Reinforcement Learning (RL). Theoretically, learning a near-optimal policy under partial observability is known to be hard in the worst case due to an exponential sample complexity lower bound. Recent work has identified several tractable subclasses that are learnable with polynomial samples, such as Partially Observable Markov Decision Processes (POMDPs) with certain revealing or decodability conditions. However, this line of research is still in its infancy, where (1) unified structural conditions enabling sample-efficient learning are lacking; (2) existing sample complexities for known tractable subclasses are far from sharp; and (3) fewer sample-efficient algorithms are available than in fully observable RL. This paper advances all three aspects above for Partially Observable RL in the general setting of Predictive State Representations (PSRs). First, we propose a natural and unified structural condition for PSRs called \emph{B-stability}. B-stable PSRs encompass the vast majority of known tractable subclasses such as weakly revealing POMDPs, low-rank future-sufficient POMDPs, decodable POMDPs, and regular PSRs. Next, we show that any B-stable PSR can be learned with polynomially many samples in the relevant problem parameters. When instantiated in the aforementioned subclasses, our sample complexities improve substantially over the current best ones.
Finally, our results are achieved by three algorithms simultaneously: Optimistic Maximum Likelihood Estimation, Estimation-to-Decisions, and Model-Based Optimistic Posterior Sampling. The latter two algorithms are new for sample-efficient learning of POMDPs/PSRs.","reinforcement learning theory, POMDPs, predictive state representations, partially observable reinforcement learning" Bias Amplification Improves Worst-Group Accuracy without Group Information,https://openreview.net/forum?id=TSqRwmrRiOn,https://openreview.net/pdf?id=TSqRwmrRiOn,We propose a novel two-stage training algorithm that achieves the state-of-the-art worst-group accuracy on test data without group information.,"Neural networks produced by standard training are known to suffer from poor accuracy on rare subgroups despite achieving high accuracy on average, due to the correlations between certain spurious features and labels. Previous approaches based on worst-group loss minimization (\textit{e.g.} Group-DRO) are effective in improving worst-group accuracy but require expensive group annotations for all the training samples. In this paper, we focus on the more challenging and realistic setting where group annotations are only available on a small validation set or are not available at all. We propose \bam, a novel two-stage training algorithm: in the first stage, the model is trained using a \emph{bias amplification} scheme by introducing a learnable \emph{auxiliary variable} for each training sample and adopting a squared loss; in the second stage, we upweight the samples that the bias-amplified model misclassifies, and then continue training the same model on the reweighted dataset. Empirically, \bam leads to consistent improvement over its counterparts in worst-group accuracy, resulting in state-of-the-art performance on spurious correlation benchmarks in computer vision and natural language processing. Moreover, we find a simple stopping criterion that completely removes the need for group annotations, with little or no loss in worst-group accuracy. ","spurious correlation, worst-group accuracy, group robustness" Actionable Recourse Guided by User Preference,https://openreview.net/forum?id=HjzWIMEWipV,https://openreview.net/pdf?id=HjzWIMEWipV,Capturing user preference and suggesting actionable recourse for individuals adversely affected by a machine learning model.,"The growing popularity of machine learning models has led to their increased application in domains directly impacting human lives. In critical fields such as healthcare, banking, and criminal justice, tools that ensure trust and transparency are vital for the responsible adoption of these models. One such tool is \emph{actionable recourse} (AR) for negatively impacted users. AR describes recommendations of cost-efficient changes to a user's \emph{actionable} features to help them obtain favorable outcomes. Existing approaches for providing recourse optimize for properties such as proximity, sparsity, validity, and distance-based costs. However, an often-overlooked but crucial requirement for actionability is a consideration of \emph{User Preference} to guide the recourse generation process. Moreover, existing works considering a user's preferences require users to precisely specify their costs for taking actions. This requirement raises questions about the practicality of the corresponding solutions due to the high cognitive loads imposed.
In this work, we attempt to capture user preferences via soft constraints in three simple forms: \textit{i) scoring continuous features, ii) bounding feature values} and \textit{iii) ranking categorical features}. We propose an optimization framework that is sensitive to {user preference} and a gradient-based approach to identify \emph{User Preferred Actionable Recourse (UP-AR)}. Through extensive experiments, we empirically demonstrate the proposed approach's superiority in adhering to user preference while maintaining competitive performance on traditional metrics.",Actionable recourse Large Learning Rate Matters for Non-Convex Optimization,https://openreview.net/forum?id=JsrvkgM8gO2,https://openreview.net/pdf?id=JsrvkgM8gO2,,"When training neural networks, it has been widely observed that a large step size is essential in stochastic gradient descent (SGD) for obtaining superior models. However, the effect of large step sizes on the success of SGD is not well understood theoretically. Several previous works have attributed this success to the stochastic noise present in SGD. However, we show through a novel set of experiments that the stochastic noise is not sufficient to explain good non-convex training, and that instead the effect of a large learning rate itself is essential for obtaining the best performance. We demonstrate the same effects also in the noiseless case, i.e. for full-batch GD. We formally prove that GD with large step size---on certain non-convex function classes---follows a different trajectory than GD with a small step size, which can lead to convergence to a global minimum instead of a local one. Finally, we also demonstrate the difference in trajectories for small and large learning rates for real neural networks, again observing that large learning rates allow escaping from a local minimum, confirming that this behavior is indeed relevant in practice.","large learning rates, GD, SGD, non-convex optimization" A Deep Learning Framework for Musical Acoustics Simulations,https://openreview.net/forum?id=q_7TgV0ugq,https://openreview.net/pdf?id=q_7TgV0ugq,An open-access/open-source framework designed for the generation of numerical musical acoustics datasets and for the training/benchmarking of acoustics neural operators.,"The acoustic modeling of musical instruments is a computationally heavy process, often bound to the solution of complex systems of partial differential equations (PDEs). Numerical models can achieve a high level of accuracy, but they may take up to several hours to complete a full simulation, especially in the case of intricate musical mechanisms. The application of deep learning, and in particular of neural operators that learn mappings between function spaces, has the potential to revolutionize how acoustics PDEs are solved and noticeably speed up musical simulations. However, such operators require large datasets, capable of exemplifying the relationship between input parameters (excitation) and output solutions (acoustic wave propagation) for each target musical instrument/configuration. With this work, we present an open-access, open-source framework designed for the generation of numerical musical acoustics datasets and for the training/benchmarking of acoustics neural operators. We first describe the overall structure of the framework and the proposed data generation workflow. Then, we detail the first numerical models that were ported to the framework.
Finally, we conclude by sharing some preliminary results obtained by training a state-of-the-art neural operator on a dataset generated via the framework. This work is a first step towards gathering a research community that focuses on deep learning applied to musical acoustics and shares workflows and benchmarking tools.","Datasets, acoustics simulation, numerical modeling, deep learning, neural operators, benchmarking" Domain Generalization via Heckman-type Selection Models ,https://openreview.net/forum?id=fk7RbGibe1,https://openreview.net/pdf?id=fk7RbGibe1,"A non-random sample selection framework for solving domain generalization, and a set of Heckman-type estimators for various types of outcomes.","The domain generalization (DG) setup considers the problem where models are trained on data sampled from multiple domains and evaluated on test domains unseen during training. In this paper, we formulate DG as a sample selection problem where each domain is sampled from a common underlying population through non-random sampling probabilities that correlate with both the features and the outcome. Under this setting, the fundamental iid assumption of empirical risk minimization (ERM) is violated, so it often performs worse on test domains whose non-random sampling probabilities differ from the domains in the training dataset. We propose a Selection-Guided DG (SGDG) framework to learn the selection probability of each domain and the joint distribution of the outcome and domain selection variables. The proposed SGDG is domain generalizable as it aims to minimize the risk under the population distribution. We theoretically prove that, under certain regularity conditions, SGDG can achieve a smaller risk than ERM. Furthermore, we present a class of parametric SGDG (HeckmanDG) estimators applicable to continuous, binary, and multinomial outcomes. We also demonstrate their efficacy empirically through simulations and experiments on a set of benchmark datasets, comparing with other well-known DG methods.","Domain Generalization, Sample Selection, Bias Correction, Heckman" Moving Forward by Moving Backward: Embedding Action Impact over Action Semantics,https://openreview.net/forum?id=vmjctNUSWI,https://openreview.net/pdf?id=vmjctNUSWI,,"A common assumption when training embodied agents is that the impact of taking an action is stable; for instance, executing the ``move ahead'' action will always move the agent forward by a fixed distance, perhaps with some small amount of actuator-induced noise. This assumption is limiting; an agent may encounter settings that dramatically alter the impact of actions: a move ahead action on a wet floor may send the agent twice as far as it expects, and using the same action with a broken wheel might transform the expected translation into a rotation. Instead of assuming that the impact of an action stably reflects its pre-defined semantic meaning, we propose to model the impact of actions on-the-fly using latent embeddings. By combining these latent action embeddings with a novel, transformer-based, policy head, we design an Action Adaptive Policy (AAP). We evaluate our AAP on two challenging visual navigation tasks in the AI2-THOR environment and show that our AAP is highly performant even when faced, at inference time, with missing actions and previously unseen, perturbed action spaces.
We will make the code and models for this work publicly available.","Embodied AI, Adaptation, Visual Navigation" Guiding Safe Exploration with Weakest Preconditions,https://openreview.net/forum?id=zzqBoIFOQ1,https://openreview.net/pdf?id=zzqBoIFOQ1,"We use an online, weakest-precondition-based approach to ensure safety during exploration without interfering with performance.","In reinforcement learning for safety-critical settings, it is often desirable for the agent to obey safety constraints at all points in time, including during training. We present a novel neurosymbolic approach called SPICE to solve this safe exploration problem. SPICE uses a new, online shielding layer based on symbolic weakest preconditions to achieve a more precise safety analysis than existing tools without unduly impacting the training process. We evaluate the approach on a suite of benchmarks from robotics and classic control, and show that it is able to achieve comparable performance to existing safe learning techniques while incurring fewer safety violations. Additionally, we present theoretical results showing that SPICE converges to the optimal safe policy under reasonable assumptions.","reinforcement learning, safe learning, safe exploration" MetaMD: Principled Optimiser Meta-Learning for Deep Learning,https://openreview.net/forum?id=LOMA7vSa2Y,https://openreview.net/pdf?id=LOMA7vSa2Y,"We propose a meta-learning-based algorithm, learning optimisers under the mirror descent framework.","Optimiser design influences learning speed and generalisation in training machine learning models. Several studies have attempted to learn more effective gradient-descent optimisers via solving a bi-level optimisation problem where generalisation error is minimised with respect to optimiser parameters. However, most existing neural-network-oriented optimiser learning methods are intuitively motivated, without clear theoretical support, and focus on learning implicit biases that improve generalisation, rather than speed of convergence. We take a different perspective, starting from mirror descent rather than gradient descent and meta-learning the corresponding Bregman divergence. Within this paradigm, we formalise a novel meta-learning objective of optimising the rate of convergence. The resulting framework, termed Meta Mirror Descent (MetaMD), learns to accelerate optimisation speed. Unlike many meta-learned neural network optimisers, it also supports convergence guarantees and uniquely does so without requiring validation data. We empirically evaluate our framework on a variety of tasks and architectures in terms of convergence rate and generalisation error and demonstrate strong performance.","Meta-learning, Optimiser Learning" A Sample Based Method for Understanding The Decisions of Neural Networks Semantically,https://openreview.net/forum?id=5MR1OGvCtH,https://openreview.net/pdf?id=5MR1OGvCtH,This paper introduces a semantic interpretability framework that is used to understand how CNN models and their robust counterparts manipulate image regions.,"Interpretability in deep learning is one of the largest obstacles to its more widespread adoption in critical applications. A variety of methods have been introduced to understand and explain decisions made by deep models. A class of these methods highlights which features are most influential to model predictions. These methods have some key weaknesses.
First, most of these methods are applicable only to the atomic elements that make up raw inputs to the model (e.g., pixels or words). Second, these methods generally do not distinguish between the importance of features individually and their importance due to interactions with other features. As a result, it is difficult to explore high-level questions about how models use features. We tackle these issues by proposing Sample-Based Semantic Analysis (SBSA). We use Sobol sensitivity analysis as our sample-based method. Sobol-SBSA allows us to quantify the importance of semantic combinations of raw inputs and highlight the extent to which these features are important individually as opposed to through interactions with other features. We demonstrate the ability of Sobol-SBSA to answer a richer class of questions about the behavior of Deep Learning models by exploring how CNN models from AlexNet to DenseNet use regions when classifying images. We present two key findings. 1) The architectural improvements from AlexNet to DenseNet manifested themselves in CNN models utilizing greater levels of region interactions for predictions. 2) Adversarially robust CNNs resist exploiting spurious correlations in ImageNet data because robust training forces these architectures to rely less on region-to-region interactions. Our proposed method is generalizable to a wide variety of network and input types and can help provide greater clarity about model decisions.","Machine Learning Interpretability, Bias, ImageNet, AlexNet, ResNet, VGG-16, Inception, CNNs, Bag of Words" Deep Biological Pathway Informed Pathology-Genomic Multimodal Survival Prediction,https://openreview.net/forum?id=nbGCPw8Rry,https://openreview.net/pdf?id=nbGCPw8Rry,,"The integration of multi-modal data, such as pathological images and genomic data, is essential for understanding cancer heterogeneity and complexity for personalized treatments, as well as for enhancing survival predictions. Despite the progress made in integrating pathology and genomic data, most existing methods cannot mine the complex inter-modality relations thoroughly. Additionally, identifying explainable features from these models that govern preclinical discovery and clinical prediction is crucial for cancer diagnosis, prognosis, and therapeutic response studies. We propose PONET, a novel biological-pathway-informed pathology-genomic deep model that integrates pathological images and genomic data not only to improve survival prediction but also to identify genes and pathways that cause different survival rates in patients. Empirical results on six of The Cancer Genome Atlas (TCGA) datasets show that our proposed method achieves superior predictive performance and reveals meaningful biological interpretations. The proposed method provides insight into how to train biologically informed deep networks on multimodal biomedical data, which will have general applicability for understanding diseases and predicting response and resistance to treatment.", A CMDP-within-online framework for Meta-Safe Reinforcement Learning,https://openreview.net/forum?id=mbxz9Cjehr,https://openreview.net/pdf?id=mbxz9Cjehr,We study the problem of meta-reinforcement learning (meta-RL) for constrained Markov decision processes (CMDPs) through the inexact CMDP-within-online framework.,"Meta-reinforcement learning has widely been used as a learning-to-learn framework to solve unseen tasks with limited experience.
However, the aspect of constraint violations has not been adequately addressed in the existing works, restricting their application in real-world settings. In this paper, we study the problem of meta-safe reinforcement learning (meta-SRL) through the CMDP-within-online framework. We obtain task-averaged regret guarantees for the reward maximization (optimality gap) and constraint violations using gradient-based meta-learning, and show that the task-averaged optimality gap and constraint satisfaction improve with task similarity in a static environment, or task relatedness in a changing environment. Several technical challenges arise when making this framework practical while still retaining strong theoretical guarantees. To address these challenges, we propose a meta-algorithm that performs inexact online learning on the upper bounds of the intra-task optimality gap and constraint violations estimated by off-policy stationary distribution corrections. Furthermore, we enable the learning rates to be adapted for every task and extend our approach to settings with dynamically changing task environments. Finally, experiments are conducted to demonstrate the effectiveness of our approach. The proposed theoretical framework is the first to handle the nonconvexity and stochastic nature of within-task CMDPs, while exploiting inter-task dependency for multi-task safe learning. ","Meta-Reinforcement learning, Constrained MDPs, online learning, safe RL, dynamic regret" Active Sampling for Node Attribute Completion on Graphs,https://openreview.net/forum?id=PuEOL1hhyrF,https://openreview.net/pdf?id=PuEOL1hhyrF,,"Node attributes are a crucial kind of information on graphs, but real-world graphs usually face the attribute-missing problem, where the attributes of some nodes are missing while those of the other nodes are available. It is meaningful to restore the missing attributes so as to benefit downstream graph learning tasks. Popular GNNs are not designed for this node attribute completion problem and are not capable of solving it. The recently proposed Structure-attribute Transformer (SAT) framework decouples the input of graph structures and node attributes via a distribution matching technique, and can work on this problem properly. However, SAT treats all nodes with observed attributes equally and neglects the different contributions of different nodes to learning. In this paper, we propose a novel active sampling algorithm (ATS) to more efficiently utilize the nodes with observed attributes and better restore the missing node attributes. Specifically, ATS contains two metrics that measure the representativeness and uncertainty of each node's information by considering the graph structures, representation similarity and learning bias. Then, these two metrics are linearly combined by a Beta-distribution-controlled weighting scheme to determine which nodes are selected into the training set in the next optimization step. The ATS algorithm can be combined with the SAT framework and is learned in an iterative manner.
Through extensive experiments on four public benchmark datasets and two downstream tasks, we show the superiority of ATS in node attribute completion.","Graph Neural Network, Node Attribute Completion, Active Sampling" Effects of Graph Convolutions in Multi-layer Networks,https://openreview.net/forum?id=P-73JPgRs0R,https://openreview.net/pdf?id=P-73JPgRs0R,Theoretical and empirical insights into the performance of graph convolutions in multi-layer networks,"Graph Convolutional Networks (GCNs) are one of the most popular architectures that are used to solve classification problems accompanied by graphical information. We present a rigorous theoretical understanding of the effects of graph convolutions in multi-layer networks. We study these effects through the node classification problem of a non-linearly separable Gaussian mixture model coupled with a stochastic block model. First, we show that a single graph convolution expands the regime of the distance between the means where multi-layer networks can classify the data by a factor of at least $1/\sqrt[4]{\rm deg}$, where ${\rm deg}$ denotes the expected degree of a node. Second, we show that with a slightly stronger graph density, two graph convolutions improve this factor to at least $1/\sqrt[4]{n}$, where $n$ is the number of nodes in the graph. Finally, we provide both theoretical and empirical insights into the performance of graph convolutions placed in different combinations among the layers of a neural network, concluding that the performance is similar across all placement combinations. We present extensive experiments on both synthetic and real-world data that illustrate our results.","graph neural networks, node classification, classification threshold, contextual stochastic block model" SimPer: Simple Self-Supervised Learning of Periodic Targets,https://openreview.net/forum?id=EKpMeEV0hOo,https://openreview.net/pdf?id=EKpMeEV0hOo,A simple contrastive self-supervised framework for learning periodic targets and tasks.,"From human physiology to environmental evolution, important processes in nature often exhibit meaningful and strong periodic or quasi-periodic changes. Because labels for such processes are inherently scarce, learning useful representations for periodic tasks with limited or no supervision is of great benefit. Yet, existing self-supervised learning (SSL) methods overlook the intrinsic periodicity in data, and fail to learn representations that capture periodic or frequency attributes. In this paper, we present SimPer, a simple contrastive SSL regime for learning periodic information in data. To exploit the periodic inductive bias, SimPer introduces customized augmentations, feature similarity measures, and a generalized contrastive loss for learning efficient and robust periodic representations.
Extensive experiments on common real-world tasks in human behavior analysis, environmental sensing, and healthcare domains verify the superior performance of SimPer compared to state-of-the-art SSL methods, highlighting its intriguing properties including better data efficiency, robustness to spurious correlations, and generalization to distribution shifts.","Periodic learning, Self-supervised learning, Representation learning, Periodic targets, Periodicity" Explaining Patterns in Data with Language Models via Interpretable Autoprompting,https://openreview.net/forum?id=GvMuB-YsiK6,https://openreview.net/pdf?id=GvMuB-YsiK6,"We introduce interpretable autoprompting, a simple approach to *understand a dataset* by finding a semantically meaningful prompt for a large language model.","Large language models (LLMs) have displayed an impressive ability to harness natural language to perform complex tasks. In this work, we explore whether we can leverage this learned ability to find and explain patterns in data. Specifically, given a pre-trained LLM and data examples, we introduce interpretable autoprompting (iPrompt), an algorithm that generates a natural-language string explaining the data. iPrompt iteratively alternates between generating explanations with an LLM and reranking them based on their performance when used as a prompt. Experiments on a wide range of datasets, from synthetic mathematics to natural-language understanding, show that iPrompt can yield meaningful insights by accurately finding ground-truth dataset descriptions. Moreover, the prompts produced by iPrompt are simultaneously human-interpretable and highly effective for generalization: on real-world sentiment classification datasets, iPrompt produces prompts that match or even improve upon human-written prompts for GPT-3. Finally, experiments with an fMRI dataset show the potential for iPrompt to aid in scientific discovery.","Interpretability, explainability, XAI, AI for science" Lipschitz regularized gradient flows and latent generative particles,https://openreview.net/forum?id=vjSKpocWeGf,https://openreview.net/pdf?id=vjSKpocWeGf,"We construct gradient flows, in real and latent spaces, as a generative tool to evolve empirical distributions in terms of a particle algorithm.","Lipschitz regularized $f$-divergences are constructed by imposing a bound on the Lipschitz constant of the discriminator in the variational representation. These divergences interpolate between the Wasserstein metric and $f$-divergences and provide a flexible family of loss functions for non-absolutely continuous (e.g. empirical) distributions, possibly with heavy tails. We first construct Lipschitz regularized gradient flows on the space of probability measures based on these divergences. Examples of such gradient flows are Lipschitz regularized Fokker-Planck and porous medium partial differential equations (PDEs) for the Kullback-Leibler and $\alpha$-divergences, respectively. The regularization corresponds to imposing a Courant–Friedrichs–Lewy numerical stability condition on the PDEs. For empirical measures, the Lipschitz regularization on gradient flows induces a numerically stable transporter/discriminator particle algorithm, where the generative particles are transported along the gradient of the discriminator. The gradient structure leads to a regularized Fisher information, which is the total kinetic energy of the particles and can be used to track the convergence of the algorithm.
The Lipschitz regularized discriminator can be implemented via neural network spectral normalization, and the particle algorithm generates approximate samples from possibly high-dimensional distributions known only from data. Notably, our particle algorithm can generate synthetic data even in small sample size regimes. A new data processing inequality for the regularized divergence allows us to combine our particle algorithm with representation learning, e.g. autoencoder architectures. The resulting particle algorithm in latent space yields markedly improved generative properties in terms of efficiency and quality of the synthetic samples. From a statistical mechanics perspective, the encoding can be interpreted dynamically as learning a better mobility for the generative particles. ","probability divergences, generative models, Lipschitz regularization, gradient flows, autoencoders, particle algorithms" Post-hoc Concept Bottleneck Models,https://openreview.net/forum?id=nA5AZ8CEyow,https://openreview.net/pdf?id=nA5AZ8CEyow,"We present a method to turn any neural network into a concept bottleneck model without sacrificing model performance, retaining interpretability benefits along with easy model editing.","Concept Bottleneck Models (CBMs) map the inputs onto a set of interpretable concepts (``the bottleneck'') and use the concepts to make predictions. A concept bottleneck enhances interpretability since it can be investigated to understand what concepts the model ""sees"" in an input and which of these concepts are deemed important. However, CBMs are restrictive in practice as they require dense concept annotations in the training data to learn the bottleneck. Moreover, CBMs often do not match the accuracy of an unrestricted neural network, reducing the incentive to deploy them in practice. In this work, we address these limitations of CBMs by introducing Post-hoc Concept Bottleneck models (PCBMs). We show that we can turn any neural network into a PCBM without sacrificing model performance while still retaining the interpretability benefits. When concept annotations are not available on the training data, we show that PCBM can transfer concepts from other datasets or from natural language descriptions of concepts via multimodal models. A key benefit of PCBM is that it enables users to quickly debug and update the model to reduce spurious correlations and improve generalization to new distributions. PCBM allows for global model edits, which can be more efficient than previous work on local interventions that fix a specific prediction.","concepts, interpretability, concept bottleneck models, model editing" Undersampling is a Minimax Optimal Robustness Intervention in Nonparametric Classification,https://openreview.net/forum?id=tAfyE2V7oye,https://openreview.net/pdf?id=tAfyE2V7oye,We show that learning is fundamentally constrained by the number of minority group samples in the setting of nonparametric classification with distribution shift. ,"While a broad range of techniques have been proposed to tackle distribution shift, the simple baseline of training on an undersampled balanced dataset often achieves close to state-of-the-art accuracy across several popular benchmarks. This is rather surprising, since undersampling algorithms discard excess majority group data. To understand this phenomenon, we ask if learning is fundamentally constrained by a lack of minority group samples. We prove that this is indeed the case in the setting of nonparametric binary classification.
Our results show that in the worst case, an algorithm cannot outperform undersampling unless there is a high degree of overlap between the train and test distributions (which is unlikely to be the case in real-world datasets), or if the algorithm leverages additional structure about the distribution shift. In particular, in the case of label shift we show that there is always an undersampling algorithm that is minimax optimal. In the case of group-covariate shift we show that there is an undersampling algorithm that is minimax optimal when the overlap between the group distributions is small. We also perform an experimental case study on a label shift dataset and find that in line with our theory, the test accuracy of robust neural network classifiers is constrained by the number of minority samples. ","robustness, distribution shift, nonparametric classification, minimax lower bounds, undersampling, label shift, covariate shift" Emb-GAM: an Interpretable and Efficient Predictor using Pre-trained Language Models,https://openreview.net/forum?id=iEVpHXjV4jj,https://openreview.net/pdf?id=iEVpHXjV4jj,"By using pre-trained language models to extract fixed-size representations, we can learn much more effective linear ngram models without sacrificing interpretability","Deep learning models have achieved impressive prediction performance but often sacrifice interpretability, a critical consideration in high-stakes domains such as healthcare or policymaking. In contrast, generalized additive models (GAMs) can maintain interpretability, but often suffer from poor prediction performance due to their inability to effectively capture feature interactions. In this work, we aim to bridge this gap by using pre-trained large-language models to extract embeddings for each input before learning a linear model in the embedding space. The final model (which we call Emb-GAM) is a transparent, linear function of its input features and feature interactions. Leveraging the language model allows Emb-GAM to learn far fewer linear coefficients, model larger interactions, and generalize well to novel inputs (e.g. unseen ngrams in text). Across a variety of natural-language-processing datasets, Emb-GAM achieves strong prediction performance without sacrificing interpretability. All code for using Emb-GAM and reproducing our results is made available on GitHub.","Interpretability, Explainability, Additivity, Generalized additive model, Linearity" When Source-Free Domain Adaptation Meets Learning with Noisy Labels,https://openreview.net/forum?id=u2Pd6x794I,https://openreview.net/pdf?id=u2Pd6x794I,,"Recent state-of-the-art source-free domain adaptation (SFDA) methods have focused on learning meaningful cluster structures in the feature space, which have succeeded in adapting the knowledge from the source domain to the unlabeled target domain without accessing the private source data. However, existing methods rely on the pseudo-labels generated by source models that can be noisy due to domain shift. In this paper, we study SFDA from the perspective of learning with label noise (LLN). Unlike the label noise in the conventional LLN scenario, we prove that the label noise in SFDA follows a different distribution assumption. We also prove that such a difference makes existing LLN methods that rely on their distribution assumptions unable to address the label noise in SFDA. Empirical evidence suggests that only marginal improvements are achieved when applying the existing LLN methods to solve the SFDA problem. 
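A condensed sketch of the Emb-GAM recipe described above: embed each unigram/bigram with a pre-trained language model, sum the ngram embeddings per document to keep the additive structure, then fit a linear classifier. The backbone name, mean pooling, and the `train_texts`/`train_labels` variables are assumptions, not the released code.

```python
# Sketch of an Emb-GAM-style pipeline (illustrative assumptions throughout).
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # assumed backbone
lm = AutoModel.from_pretrained("bert-base-uncased").eval()

def ngrams(text, n_max=2):
    words = text.split()
    return [" ".join(words[i:i + n]) for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]

@torch.no_grad()
def embed(texts):
    batch = tok(texts, padding=True, return_tensors="pt")
    return lm(**batch).last_hidden_state.mean(dim=1)   # mean pooling (assumption)

def doc_features(text):
    # summing ngram embeddings keeps the model additive over ngrams
    return embed(ngrams(text)).sum(dim=0).numpy()

X = [doc_features(t) for t in train_texts]   # train_texts/train_labels assumed given
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
```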
On the other hand, although there exists a fundamental difference between the label noise in the two scenarios, we demonstrate theoretically that the early-time training phenomenon (ETP), which has been previously observed in conventional label noise settings, can also be observed in the SFDA problem. Extensive experiments demonstrate significant improvements to existing SFDA algorithms by leveraging ETP to address the label noise in SFDA.","Source-Free Domain Adaptation, Unsupervised Domain Adaptation, Noisy Label Learning" Is a Caption Worth a Thousand Images? A Study on Representation Learning,https://openreview.net/forum?id=cYijsVZhb5,https://openreview.net/pdf?id=cYijsVZhb5,Our work performs a systematic investigation into whether additional language supervision (in CLIP) helps models learn more transferable representations.,"The development of CLIP [Radford et al., 2021] has sparked a debate on whether adding language supervision can yield vision models with more transferable representations than traditional image-only methods. Our work studies this question through a carefully controlled comparison of two approaches, in terms of their ability to learn representations that generalize to downstream classification tasks. We find that when the pre-training data meets certain criteria---it is sufficiently large and contains descriptive captions with low variability---image-only methods do not match CLIP's performance even when they are trained with more image data. However, contrary to what one might expect, there are practical settings in which these criteria are not met, wherein added supervision through captions is actually detrimental. Motivated by our findings, we devise simple data and algorithmic interventions to improve the transfer performance of CLIP-style models.","CLIP, transfer learning, contrastive learning, multi-modal" Parameter-Efficient Fine-Tuning Design Spaces,https://openreview.net/forum?id=XSRSWxyJIC,https://openreview.net/pdf?id=XSRSWxyJIC,,"Parameter-efficient fine-tuning aims to achieve performance comparable to fine-tuning with far fewer trainable parameters. Recently, various tuning strategies (e.g., Adapters, Prefix Tuning, BitFit, and LoRA) have been proposed. However, their designs are hand-crafted separately, and it remains unclear whether certain design patterns exist for parameter-efficient fine-tuning. Thus, we present a parameter-efficient fine-tuning design paradigm and discover design patterns that are applicable to different experimental settings. Instead of focusing on designing another individual tuning strategy, we introduce parameter-efficient fine-tuning design spaces that parameterize tuning structures and tuning strategies. Specifically, any design space is characterized by four components: layer grouping, trainable parameter allocation, tunable groups, and strategy assignment. Our comprehensive empirical study leads to the discovery of design patterns: (i) grouping layers in a spindle pattern, (ii) uniformly allocating the number of trainable parameters to layers, (iii) tuning all the groups, and (iv) tuning different groups with proper strategies. Our discovered design patterns result in new parameter-efficient fine-tuning methods. 
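An illustrative sketch of patterns (i) and (ii) just listed. The abstract does not spell out what a "spindle" split looks like, so the narrow-wide-wide-narrow group sizes below are my assumption; uniform allocation simply gives every layer the same trainable-parameter budget.

```python
# Sketch (assumed interpretation): spindle-shaped layer grouping and uniform
# trainable-parameter allocation for parameter-efficient fine-tuning.
def spindle_groups(num_layers, num_groups=4):
    # narrow-wide-wide-narrow, e.g. 12 layers -> sizes [2, 4, 4, 2]
    base = num_layers // (num_groups + 2)
    sizes = [base, 2 * base, 2 * base, base]
    sizes[1] += num_layers - sum(sizes)          # absorb any remainder
    groups, start = [], 0
    for s in sizes:
        groups.append(list(range(start, start + s)))
        start += s
    return groups

def uniform_allocation(budget, num_layers):
    # same number of trainable parameters for every layer
    return [budget // num_layers] * num_layers

print(spindle_groups(12))   # [[0, 1], [2, 3, 4, 5], [6, 7, 8, 9], [10, 11]]
print(uniform_allocation(120_000, 12))
```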
Experiments show that these methods consistently outperform the investigated parameter-efficient fine-tuning strategies across different backbone models and different tasks in natural language processing.","parameter-efficient fine-tuning, design spaces" Concept Gradient: Concept-based Interpretation Without Linear Assumption,https://openreview.net/forum?id=_01dDd3f78,https://openreview.net/pdf?id=_01dDd3f78,Extending concept-based gradient interpretation to non-linear concept functions.,"Concept-based interpretations of black-box models are often more intuitive for humans to understand. The most widely adopted approach for concept-based, gradient interpretation is Concept Activation Vector (CAV). CAV relies on learning a linear relation between some latent representation of a given model and concepts. The premise of meaningful concepts lying in a linear subspace of model layers is usually implicitly assumed but does not hold true in general. In this work we proposed Concept Gradient (CG), which extends concept-based, gradient interpretation methods to non-linear concept functions. We showed that for a general (potentially non-linear) concept, we can mathematically measure how a small change of concept affects the model’s prediction, which is an extension of gradient-based interpretation to the concept space. We demonstrated empirically that CG outperforms CAV in attributing concept importance on real-world datasets and performed a case study on a medical dataset.","Interpretability, Concept-based interpretation, XAI" FedCUAU: Clustered Federated Learning using weight divergence,https://openreview.net/forum?id=TjY9fl2Bcs,https://openreview.net/pdf?id=TjY9fl2Bcs,This work uses the relative weight divergence between each client update and their aggregated update to cluster clients and govern the knowledge transfer between clusters to improve both the initial and personalized performance.,"The majority of federated learning (FL) approaches aim to learn either a high-performing global model or multiple personalized models. Although there has been significant progress in each research direction, the optimization of one often comes at the expense of the other. In this work, we approach this problem by investigating how different clusters of clients with varying degrees of data heterogeneity may impact the single global model. From this empirical analysis, we discover a surprising insight: despite a significant distribution mismatch between clusters, the knowledge shared from low data heterogeneous clusters to high data heterogeneous clusters can significantly boost the latter's personalized accuracy but not vice versa. By building on this observation, we propose a cluster-based approach named FedCUAU, in which clients are clustered based on their degree of data heterogeneity, and knowledge between each cluster is selectively transferred. Experimental results on standard FL benchmarks show that FedCUAU can be plugged into existing FL algorithms to achieve considerable improvement in both the initial and personalized performance. Empirical results show that FedCUAU improves FedAvg initial global accuracy by $1.53\%$ and $1.82\%$ for Cifar10 and FEMNIST respectively, and personalized accuracy by $0.29\%$ and $3.81\%$. 
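A sketch of the clustering signal named in the FedCUAU TL;DR above: the relative weight divergence between each client update and the aggregated update, used here as a scalar proxy for data heterogeneity. The normalization and the use of k-means are assumptions, not the paper's exact procedure.

```python
# Sketch: cluster clients by relative weight divergence from the average update.
import numpy as np
from sklearn.cluster import KMeans

def flatten(update):
    return np.concatenate([w.ravel() for w in update])

def divergences(client_updates):
    avg = np.mean([flatten(u) for u in client_updates], axis=0)
    return np.array([np.linalg.norm(flatten(u) - avg) / (np.linalg.norm(avg) + 1e-12)
                     for u in client_updates])

def cluster_clients(client_updates, n_clusters=2):
    d = divergences(client_updates).reshape(-1, 1)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(d)
    return labels, d   # low-divergence cluster ~ low data heterogeneity
```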
",Federated Learning Constraining Representations Yields Models That Know What They Don't Know,https://openreview.net/forum?id=1w_Amtk67X,https://openreview.net/pdf?id=1w_Amtk67X,We introduce a model class able to provide confidence scores indicating how likely it is that it is making an erroneous prediction.,"A well-known failure mode of neural networks is that they may confidently return erroneous predictions. Such unsafe behaviour is particularly frequent when the use case slightly differs from the training context, and/or in the presence of an adversary. This work presents a novel direction to address these issues in a broad, general manner: imposing class-aware constraints on a model's internal activation patterns. Specifically, we assign to each class a unique, fixed, randomly-generated binary vector - hereafter called class code - and train the model so that its cross-depths activation patterns predict the appropriate class code according to the input sample's class. The resulting predictors are dubbed total activation classifiers (TAC), and TACs may either be trained from scratch, or used with negligible cost as a thin add-on on top of a frozen, pre-trained neural network. The distance between a TAC's activation pattern and the closest valid code acts as an additional confidence score, besides the default unTAC'ed prediction head's. In the add-on case, the original neural network's inference head is completely unaffected (so its accuracy remains the same) but we now have the option to use TAC's own confidence and prediction when determining which course of action to take in an hypothetical production workflow. In particular, we show that TAC strictly improves the value derived from models allowed to reject/defer. We provide further empirical evidence that TAC works well on multiple types of architectures and data modalities and that it is at least as good as state-of-the-art alternative confidence scores derived from existing models.","Rejecting classifiers, Selective classification, Uncertainty estimation, Robust classification, Out-of-distribution detection" Neural Networks Efficiently Learn Low-Dimensional Representations with SGD,https://openreview.net/forum?id=6taykzqcPD,https://openreview.net/pdf?id=6taykzqcPD,"We prove that SGD on neural networks can learn low-dimensional features in certain settings, and use this to derive novel generalization and excess risk bounds.","We study the problem of training a two-layer neural network (NN) of arbitrary width using stochastic gradient descent (SGD) where the input $\boldsymbol{x}\in \mathbb{R}^d$ is Gaussian and the target $y \in \mathbb{R}$ follows a multiple-index model, i.e., $y=g(\langle\boldsymbol{u_1},\boldsymbol{x}\rangle,...,\langle\boldsymbol{u_k},\boldsymbol{x}\rangle)$ with a noisy link function $g$. We prove that the first-layer weights in the NN converge to the $k$-dimensional principal subspace spanned by the vectors $\boldsymbol{u_1},...,\boldsymbol{u_k}$ of the true model, when online SGD with weight decay is used for training. This phenomenon has several important consequences when $k \ll d$. First, by employing uniform convergence on this smaller subspace, we establish a generalization error bound of $\mathcal{O}(\sqrt{{kd}/{T}})$ after $T$ iterations of SGD, which is independent of the width of the NN. 
We further demonstrate that, by recovering the principal direction, SGD-trained ReLU NNs can learn a single-index target of the form $y=f(\langle\boldsymbol{u},\boldsymbol{x}\rangle) + \epsilon$ with a sample complexity linear in $d$ (up to log factors), where $f$ is a monotonic function with at most polynomial growth, and $\epsilon$ is the noise. This is in contrast to the known $d^{\Omega(p)}$ samples required to learn any degree $p$ polynomial in the kernel regime, and shows that SGD-trained NNs can outperform the Neural Tangent Kernel at initialization. Finally, we establish compressibility guarantees for NNs using the fact that SGD produces an approximately rank-$k$ first-layer weight matrix.","feature learning, generalization, compressibility, sgd, neural networks" CoMoE: Contrastive Mixture-of-Experts are Efficient Representation Learners,https://openreview.net/forum?id=VBmeysLYDN,https://openreview.net/pdf?id=VBmeysLYDN,We study scaling contrastive learning with mixture of experts and improve its performance with a novel regularization.,"While Contrastive Learning (CL) achieves great success in many downstream tasks, its good performance heavily relies on a large model capacity. As previous methods focus on scaling dense models, training and inference costs increase rapidly with model sizes, leading to large resource consumption. In this paper, we explore CL with an efficient scaling method, Mixture of Experts (MoE), to obtain a large but sparse model. We start by plugging the state-of-the-art CL method into MoE. However, this naive combination fails to visibly improve performance despite a much larger capacity. A closer look reveals that the naive MoE+CL model has a strong tendency to route two augmented views of the same image token to different subsets of experts: such ``cross-view instability"" breaks the weight-sharing nature in CL and misleads the invariant feature learning. To address this issue, we introduce a new regularization mechanism, by enforcing expert-routing similarity between different views of the same image (or its overlapped patch tokens), while promoting expert-routing diversity of patches from different images. The resultant method, called CoMoE, improves by 1.7 points in terms of 1\% semi-supervised learning accuracy on ImageNet, compared to the naive combination baseline. It further surpasses the state-of-the-art CL methods on ImageNet pre-training of Vision Transformer (ViT) by 2.8 points, at the same computational cost. Our findings validate CoMoE as an effective and efficient image representation learner. Code is included in the supplemental materials.","Contrastive learning, sparse Mixture of Expert" Mixed Federated Learning: Joint Decentralized and Centralized Learning,https://openreview.net/forum?id=eZLdhVUG1hg,https://openreview.net/pdf?id=eZLdhVUG1hg,"Federated learning (FL) is good (better privacy, higher accuracy), and 'mixed FL' (concurrent joint FL and centralized learning) can make it even better, by mitigating distribution shifts and saving bandwidth and compute.","Federated learning (FL) enables learning from decentralized privacy-sensitive data, with computations on raw data confined to take place at edge clients. This paper introduces mixed FL, which incorporates an additional loss term calculated at the coordinating server (while maintaining FL’s private data restrictions). 
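A sketch of how the subspace-recovery claim above can be checked numerically: compare the top-$k$ right singular vectors of the trained first-layer weight matrix $W$ with the ground-truth subspace $\mathrm{span}\{\boldsymbol{u_1},...,\boldsymbol{u_k}\}$. The synthetic $W$ below stands in for an SGD-trained layer.

```python
# Sketch: alignment between first-layer weights and the true principal subspace.
import numpy as np

def subspace_alignment(W, U):
    # W: (width, d) first-layer weights; U: (d, k) orthonormal true directions
    k = U.shape[1]
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    V = Vt[:k].T                    # top-k right singular vectors, shape (d, k)
    # average squared projection of V onto span(U); 1.0 means perfect alignment
    return np.linalg.norm(U.T @ V) ** 2 / k

d, k, width = 50, 3, 200
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(d, k)))
W = rng.normal(size=(width, k)) @ U.T + 0.01 * rng.normal(size=(width, d))
print(subspace_alignment(W, U))     # close to 1 for this synthetic low-rank W
```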
For example, additional datacenter data can be leveraged to jointly learn from centralized (datacenter) and decentralized (federated) training data and better match an expected inference data distribution. Mixed FL also enables offloading some intensive computations (e.g., embedding regularization) to the server, greatly reducing communication and client computation load. For these and other mixed FL use cases, we present three algorithms: PARALLEL TRAINING, 1-WAY GRADIENT TRANSFER, and 2-WAY GRADIENT TRANSFER. We perform extensive experiments of the algorithms on three tasks, demonstrating that mixed FL can blend training data to achieve an oracle’s accuracy on an inference distribution, and can reduce communication and computation overhead by more than 90%. Finally, we state convergence bounds for all algorithms, and give intuition on the mixed FL problems best suited to each. The theory confirms our empirical observations of how the algorithms perform under different mixed FL problem settings.","federated learning, decentralized learning, privacy, security, distribution shift, distribution skew, mobile computing" OTCOP: Learning optimal transport maps via constraint optimizations,https://openreview.net/forum?id=mdECGh-qlK,https://openreview.net/pdf?id=mdECGh-qlK, We integrate a constraint optimization algorithm and neural networks for the computation of optimal transport maps based on the Monge formulation.,"The approximation power of neural networks makes them an ideal tool to learn optimal transport maps. However, existing methods are mostly based on the Kantorovich duality and require regularization and/or special network structures such as Input Convex Neural Networks (ICNN). In this paper, we propose a direct constraint optimization algorithm for the computation of optimal transport maps based on the Monge formulation. We solve this constraint optimization problem by using three different methods: the penalty method, the augmented Lagrangian method, and the alternating direction method of multipliers (ADMM). We demonstrate a significant improvement in the accuracy of learned optimal transport maps on benchmarks. Moreover, we show that our methods reduce the regularization effects and accurately learn the target distributions at lower transport cost. ","optimal transport, constraint optimization, Monge problem" An Extensible Multi-modal Multi-task Object Dataset with Materials,https://openreview.net/forum?id=n70oyIlS4g,https://openreview.net/pdf?id=n70oyIlS4g,"We develop a dataset of Amazon product listings. The dataset includes images, text, price, mass, materials, and categories + others. We also show how to quickly add custom binary attributes to the dataset.","We present EMMa, an Extensible, Multimodal dataset of Amazon product listings that contains rich Material annotations. It contains more than 2 million objects, each with image(s), listing text, mass, price, product ratings, and position in Amazon’s product-category taxonomy. We also design a comprehensive taxonomy of 182 physical materials (e.g., Plastic → Thermoplastic → Acrylic). Objects are annotated with one or more materials from this taxonomy. With the numerous data attributes available for each object, we develop a smart labeling framework to quickly add new binary labels to all objects with only hours of manual labeling effort, making the dataset extensible at scale. 
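A toy sketch in the spirit of the PARALLEL TRAINING variant named above: each round combines the averaged decentralized client gradient with a gradient computed at the server on centralized datacenter data. The linear-regression setup, learning rate, and `mix` weight are all illustrative assumptions.

```python
# Toy sketch of a mixed-FL round: federated gradient + server-side gradient.
import numpy as np

rng = np.random.default_rng(0)
true_theta = rng.normal(size=5)

def make_dataset(n):
    X = rng.normal(size=(n, 5))
    return X, X @ true_theta + 0.1 * rng.normal(size=n)

def grad_mse(theta, data):
    X, y = data
    return 2 * X.T @ (X @ theta - y) / len(y)

def mixed_fl_round(theta, clients, datacenter, lr=0.05, mix=0.5):
    fed_grad = np.mean([grad_mse(theta, c) for c in clients], axis=0)
    dc_grad = grad_mse(theta, datacenter)          # additional server-side loss term
    return theta - lr * ((1 - mix) * fed_grad + mix * dc_grad)

clients = [make_dataset(40) for _ in range(8)]
datacenter = make_dataset(200)
theta = np.zeros(5)
for _ in range(300):
    theta = mixed_fl_round(theta, clients, datacenter)
print(np.linalg.norm(theta - true_theta))          # small: both sources are fit
```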
Each object attribute in our dataset can be included in either the model inputs or outputs, leading to combinatorial possibilities in task formulations. For example, we can train a model to predict the object category from the listing text, or the mass and price from the product listing image. EMMa offers a new benchmark for multi-task learning in computer vision and NLP and allows practitioners to efficiently add new tasks and object attributes at scale.","Multi-task, multi-modal, dataset, materials, weak supervision" Sampling with Mollified Interaction Energy Descent,https://openreview.net/forum?id=zWy7dqOcel,https://openreview.net/pdf?id=zWy7dqOcel,Unconstrained and constrained sampling by minimizing a new class of mollified interaction energies.,"Sampling from a target measure whose density is only known up to a normalization constant is a fundamental problem in computational statistics and machine learning. In this paper, we present a new optimization-based method for sampling called mollified interaction energy descent (MIED). MIED minimizes a new class of energies on probability measures called mollified interaction energies (MIEs). These energies rely on mollifier functions---smooth approximations of the Dirac delta originating from PDE theory. We show that as the mollifier approaches the Dirac delta, the MIE converges to the chi-square divergence with respect to the target measure and the gradient flow of the MIE agrees with that of the chi-square divergence. Optimizing this energy with proper discretization yields a practical first-order particle-based algorithm for sampling in both unconstrained and constrained domains. We show experimentally that for unconstrained sampling problems our algorithm performs on par with existing particle-based algorithms like SVGD, while for constrained sampling problems our method readily incorporates constrained optimization techniques to handle more flexible constraints with strong performance compared to alternatives. ", Does Zero-Shot Reinforcement Learning Exist?,https://openreview.net/forum?id=MYEap_OcQI,https://openreview.net/pdf?id=MYEap_OcQI,"We revisit zero-shot RL based on successor representations, we introduce improved losses and new models and evaluate them systematically on the unsupervised RL benchmark.","A zero-shot RL agent is an agent that can solve any RL task in a given environment, instantly with no additional planning or learning, after an initial reward-free learning phase. This marks a shift from the reward-centric RL paradigm towards controllable agents that can follow arbitrary instructions in an environment. Current RL agents can solve families of related tasks at best, or require planning anew for each task. Strategies for approximate zero-shot RL have been suggested using successor features (SFs) (Borsa et al., 2018) or forward-backward (FB) representations (Touati & Ollivier, 2021), but testing has been limited. After clarifying the relationships between these schemes, we introduce improved losses and new SF models, and test the viability of zero-shot RL schemes systematically on tasks from the Unsupervised RL benchmark (Laskin et al., 2021). To disentangle universal representation learning from exploration, we work in an offline setting and repeat the tests on several existing replay buffers. SFs appear to suffer from the choice of the elementary state features. 
SFs with Laplacian eigenfunctions do well, while SFs based on auto-encoders, inverse curiosity, transition models, low-rank transition matrix, contrastive learning, or diversity (APS), perform inconsistently. In contrast, FB representations jointly learn the elementary and successor features from a single, principled criterion. They perform best and consistently across the board, reaching $85\%$ of supervised RL performance with a good replay buffer, in a zero-shot manner.","controllable agents, zero-shot RL, self-supervised representation learning, successor representation, offline RL" Few-Shot Text Classification with Dual Contrastive Consistency Training,https://openreview.net/forum?id=KQ-ipHOmBc,https://openreview.net/pdf?id=KQ-ipHOmBc,,"In this paper, we explore how to utilize a pre-trained language model to perform few-shot text classification where only a few annotated examples are given for each class. Since using the traditional cross-entropy loss to fine-tune the language model under this scenario causes serious overfitting and leads to sub-optimal generalization of the model, we adopt supervised contrastive learning on the few labeled data and consistency regularization on vast unlabeled data. Moreover, we propose a novel contrastive consistency to further boost model performance and refine sentence representation. After conducting extensive experiments on four datasets, we demonstrate that our model (FTCC) can outperform state-of-the-art methods and has better robustness. ","Few-Shot Learning, Contrastive Learning, Consistency Training" Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability,https://openreview.net/forum?id=nhKHA59gXz,https://openreview.net/pdf?id=nhKHA59gXz,"We explain the mechanism behind the edge of stability phenomenon, where full batch gradient descent non-monotonically decreases the loss in the presence of instability.","Traditional analyses of gradient descent show that when the largest eigenvalue of the Hessian, also known as the sharpness $S(\theta)$, is bounded by $2/\eta$, training is ""stable"" and the training loss decreases monotonically. Recent works, however, have observed that this assumption does not hold when training modern neural networks with full batch or large batch gradient descent. Most recently, Cohen et al. (2021) observed two important phenomena. The first, dubbed \emph{progressive sharpening}, is that the sharpness steadily increases throughout training until it reaches the instability cutoff $2/\eta$. The second, dubbed \emph{edge of stability}, is that the sharpness hovers at $2/\eta$ for the remainder of training while the loss continues decreasing, albeit non-monotonically. We demonstrate that, far from being chaotic, the dynamics of gradient descent at the edge of stability can be captured by a cubic Taylor expansion: as the iterates diverge in direction of the top eigenvector of the Hessian due to instability, the cubic term in the local Taylor expansion of the loss function causes the curvature to decrease until stability is restored. This property, which we call \emph{self-stabilization}, is a general property of gradient descent and explains its behavior at the edge of stability. A key consequence of self-stabilization is that gradient descent at the edge of stability implicitly follows \emph{projected} gradient descent (PGD) under the constraint $S(\theta) \le 2/\eta$. 
Our analysis provides precise predictions for the loss, sharpness, and deviation from the PGD trajectory throughout training, which we verify both empirically in a number of standard settings and theoretically under mild conditions. Our analysis uncovers the mechanism for gradient descent's implicit bias towards stability.","gradient descent, optimization, edge of stability, implicit regularization, implicit bias" Toward Discovering Options that Achieve Faster Planning,https://openreview.net/forum?id=3lr-ESFLUO,https://openreview.net/pdf?id=3lr-ESFLUO,We propose a new objective for option discovery that emphasizes the computational advantage of using options in planning.,"We propose a new objective for option discovery that emphasizes the computational advantage of using options in planning. In a sequential machine, the speed of planning is proportional to the number of elementary operations used to achieve a good policy. For episodic tasks, the number of elementary operations depends on the number of options composed by the policy in an episode and the number of options being considered at each decision point. To reduce the amount of computation in planning, for a given set of episodic tasks and a given number of options, our objective prefers options with which it is possible to achieve a high return by composing few options, and also prefers a smaller set of options to choose from at each decision point. We develop an algorithm that optimizes the proposed objective. In a variant of the classic four-room domain, we show that 1) a higher objective value is typically associated with a smaller number of elementary planning operations used by the option-value iteration algorithm to obtain a near-optimal value function, 2) our algorithm achieves an objective value that matches that achieved by two human-designed options, 3) the amount of computation used by option-value iteration with options discovered by our algorithm matches those with the human-designed options, 4) the options produced by our algorithm also make intuitive sense--they seem to move to and terminate at the entrance of each room.","Option Discovery, Temporal Abstraction, Planning, Reinforcement Learning" Conditional Permutation Invariant Flows,https://openreview.net/forum?id=g-qWfKQlL3,https://openreview.net/pdf?id=g-qWfKQlL3,"We present a novel, conditional generative probabilistic model of set-valued data with a tractable log density.","We present a novel, conditional generative probabilistic model of set-valued data with a tractable log density. This model is a continuous normalizing flow governed by permutation equivariant dynamics. These dynamics are driven by a learnable per-set-element term and pairwise interactions, both parametrized by deep neural networks. We illustrate the utility of this model via applications including (1) complex traffic scene generation conditioned on visually specified map information, and (2) object bounding box generation conditioned directly on images. We train our model by maximizing the expected likelihood of labeled conditional data under our flow, with the aid of a penalty that ensures the dynamics are smooth and hence efficiently solvable. 
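A sketch of how the sharpness $S(\theta)$ and the $2/\eta$ threshold discussed in the self-stabilization abstract above can be estimated in practice: power iteration on Hessian-vector products via double backpropagation. The tiny model and loss below are stand-ins, not the paper's setup.

```python
# Sketch: estimate the top Hessian eigenvalue (sharpness) by power iteration.
import torch

def sharpness(loss, params, iters=50):
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / norm for h in hv]
    hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
    return sum((a * b).sum() for a, b in zip(hv, v)).item()   # Rayleigh quotient

model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
eta = 0.1
print(sharpness(loss, list(model.parameters())), "vs 2/eta =", 2 / eta)
```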
Our method significantly outperforms non-permutation invariant baselines in terms of log likelihood and domain-specific metrics (offroad, collision, and combined infractions), yielding realistic samples that are difficult to distinguish from real data.","Permutation invariance, continuous normalizing flows, traffic scene generation, object location" TILP: Differentiable Learning of Temporal Logical Rules on Knowledge Graphs,https://openreview.net/forum?id=_X12NmQKvX,https://openreview.net/pdf?id=_X12NmQKvX,,"Compared with static knowledge graphs, temporal knowledge graphs (tKG), which can capture the evolution and change of information and knowledge, are more realistic and general. However, due to the complexity from introducing the notion of time, accurate link prediction based on explainable and comprehensible patterns is still a difficult problem. In this paper, we propose TILP, a differentiable framework for temporal logical rule learning. By designing a constrained random walk mechanism and corresponding operators, we ensure the efficiency of our model. Furthermore, we discuss temporal feature modelling in tKGs, e.g., recurrence, temporal order, the interval between a pair of relations, and duration, and incorporate it into our learning process. TILP is compared with state-of-the-art baselines on two benchmark datasets and shows comparable performance. More importantly, we introduce some hard settings to test the robustness of different models, e.g., few training samples, biased data, and time shifting. In these cases, TILP works better than most state-of-the-art methods.", Hyperbolic Deep Reinforcement Learning,https://openreview.net/forum?id=TfBHFLgv77,https://openreview.net/pdf?id=TfBHFLgv77,"We use hyperbolic space to model the latent representations of deep RL algorithms, attaining great performance and generalization benefits.","In deep reinforcement learning (RL), useful information about the state is inherently tied to its possible future successors. Consequently, encoding features that capture the hierarchical relationships between states into the model's latent representations is often conducive to recovering effective policies. In this work, we study a new class of deep RL algorithms that promote encoding such relationships by using hyperbolic space to model latent representations. However, we find that a naive application of existing methodology from the hyperbolic deep learning literature leads to fatal instabilities due to the non-stationarity and variance characterizing common gradient estimators in RL. Hence, we design a new general method that directly addresses such optimization challenges and enables stable end-to-end learning with deep hyperbolic representations. We empirically validate our framework by applying it to popular on-policy and off-policy RL algorithms on the Procgen and Atari 100K benchmarks, attaining near universal performance and generalization benefits. 
Given its natural fit, we hope this work will inspire future RL research to consider hyperbolic representations as a standard tool.","Reinforcement learning, Hyperbolic space, Representation learning, Machine learning" Learning Controllable Adaptive Simulation for Multi-scale Physics,https://openreview.net/forum?id=PbfgkZ2HdbE,https://openreview.net/pdf?id=PbfgkZ2HdbE,We introduce a method that jointly learns the surrogate model and dynamically selects appropriate spatial resolutions that devote more compute to the highly dynamic regions,"Simulating the time evolution of physical systems is pivotal in many scientific and engineering problems. An open challenge in simulating such systems is their multi-scale dynamics: a small fraction of the system is extremely dynamic, and requires very fine-grained resolution, while a majority of the system is changing slowly and can be modeled by coarser spatial scales. Typical learning-based surrogate models use a uniform spatial scale, which needs to resolve to the finest required scale and can waste huge amounts of compute to achieve the required accuracy. In this work, we introduce Learning controllable Adaptive simulation for Multi-scale Physics (LAMP) as the first full deep learning-based surrogate model that jointly learns the evolution model and optimizes appropriate spatial resolutions that devote more compute to the highly dynamic regions. LAMP consists of a Graph Neural Network (GNN) for learning the forward evolution, and a GNN-based actor-critic for learning the policy of spatial refinement and coarsening. We introduce learning techniques that optimize LAMP with a weighted sum of error and computational cost as the objective, which allows LAMP to adapt to the varying relative importance of the error vs. computation tradeoff at inference time. We test our method in a 1D benchmark of nonlinear PDEs and a challenging 2D mesh-based simulation. We demonstrate that our LAMP outperforms state-of-the-art deep learning surrogate models with up to 60.5\% error reduction, and is able to adaptively trade-off computation to improve long-term prediction error.","adaptive, multi-scale, error vs. computation, controllable" "Gated Neural ODEs: Trainability, Expressivity and Interpretability",https://openreview.net/forum?id=ArPM-xtsFrk,https://openreview.net/pdf?id=ArPM-xtsFrk,,"Understanding how the dynamics in biological and artificial neural networks implement the computations required for a task is a salient open question in machine learning and neuroscience. In particular, computations requiring complex memory storage and retrieval pose a significant challenge for these networks to implement or learn. Recently, a family of models described by neural ordinary differential equations (nODEs) has emerged as powerful dynamical neural network models capable of capturing complex dynamics. Here, we extend nODEs by endowing them with adaptive timescales using gating interactions. We refer to these as gated neural ODEs (gnODEs). Using a task that requires memory of continuous quantities, we demonstrate the inductive bias of the gnODEs to learn (approximate) continuous attractors. We further show how reduced-dimensional gnODEs retain their modeling power while greatly improving interpretability, even allowing explicit visualization of the structure of learned attractors. We introduce a novel measure of expressivity which probes the capacity of a neural network to generate complex trajectories. 
Using this measure, we explore how the phase-space dimension of the nODEs and the complexity of the function modeling the flow field contribute to expressivity. We see that a more complex function for modeling the flow field allows a lower-dimensional nODE to capture a given target dynamics. Finally, we demonstrate the benefit of gating in nODEs on several real-world tasks.","Computational Neuroscience, Dynamical Systems, Differential Equations, Neural ODEs, Gating, Interpretability" Value-Based Membership Inference Attack on Actor-Critic Reinforcement Learning,https://openreview.net/forum?id=wKIxJKTDmX-,https://openreview.net/pdf?id=wKIxJKTDmX-,We introduce a new membership inference attack focusing on the value function of the actor-critic algorithm.,"In actor-critic reinforcement learning (RL), the so-called actor and critic, respectively, compute candidate policies and a value function that evaluates the candidate policies. Such RL algorithms may be vulnerable to membership inference attacks (MIAs), a privacy attack that infers the data membership, i.e., whether a specific data record belongs to the training dataset. We investigate the vulnerability of the value function in actor-critic methods to MIAs. We develop \textit{CriticAttack}, a new MIA that targets black-box RL agents by examining the correlation between the expected reward and the value function. We empirically show that \textit{CriticAttack} can correctly infer approximately 90\% of the training data membership, i.e., it achieves 90\% attack accuracy. Such accuracy is far beyond the 50\% random guessing accuracy, indicating a severe privacy vulnerability of the value function. To defend against \textit{CriticAttack}, we design a method called \textit{CriticDefense} that adds uniform noise to the value function. \textit{CriticDefense} can reduce the attack accuracy to 60\% without significantly affecting the agent’s performance.","Privacy, Membership Inference Attack, Value Function, Actor-Critic, Reinforcement Learning" Open-Vocabulary Object Detection upon Frozen Vision and Language Models,https://openreview.net/forum?id=MIMwy4kh9lf,https://openreview.net/pdf?id=MIMwy4kh9lf,We propose a novel open-vocabulary detection approach by building upon frozen vision and language models.,"We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models. F-VLM simplifies the current multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining. Surprisingly, we observe that a frozen VLM: 1) retains the locality-sensitive features necessary for detection, and 2) is a strong region classifier. We finetune only the detector head and combine the detector and VLM outputs for each region at inference time. F-VLM shows compelling scaling behavior and achieves +6.5 mask AP improvement over the previous state of the art on novel categories of the LVIS open-vocabulary detection benchmark. In addition, we demonstrate very competitive results on the COCO open-vocabulary detection benchmark and cross-dataset transfer detection, along with significant training speed-up and compute savings. Code will be released. 
","open-vocabulary recognition, object detection, vision and language" Learned Neural Network Representations are Spread Diffusely with Redundancy,https://openreview.net/forum?id=G2GpzH1l9AC,https://openreview.net/pdf?id=G2GpzH1l9AC,We show that a randomly selected fraction of neurons from a pre-trained representation achieve similar performance as the full representation.,"Representations learned by pre-training a neural network on a large dataset are increasingly used successfully to perform a variety of downstream tasks. In this work, we take a closer look at how features are encoded in such pre-trained representations. We find that learned representations in a given layer exhibit a degree of diffuse redundancy, ie, any randomly chosen subset of neurons in the layer that is larger than a threshold size shares a large degree of similarity with the full layer and is able to perform similarly as the whole layer on a variety of downstream tasks. For example, a linear probe trained on 20% of randomly picked neurons from a ResNet50 pre-trained on ImageNet1k achieves an accuracy within 5% of a linear probe trained on the full layer of neurons for downstream CIFAR10 classification. We conduct experiments on different neural architectures (including CNNs and Transformers) pre-trained on both ImageNet1k and ImageNet21k and evaluate a variety of downstream tasks taken from the VTAB benchmark. We find that the loss & dataset used during pre-training largely govern the degree of diffuse redundancy and the ""critical mass"" of neurons needed often depends on the downstream task, suggesting that there is a task-inherent sparsity-performance Pareto frontier. Our findings shed light on the nature of representations learned by pre-trained deep neural networks and suggest that entire layers might not be necessary to perform many downstream tasks. We investigate the potential for exploiting this redundancy to achieve efficient generalization for downstream tasks and also draw caution to certain possible unintended consequences.","representation learning, redundancy, transfer learning, fairness" Contrastive Graph Representation Learning with Cross-view Reconstruction,https://openreview.net/forum?id=GbFK7JJJVTz,https://openreview.net/pdf?id=GbFK7JJJVTz,Our paper propose a new contrastive learning framework to learn graph representation in accordance with the information bottleneck principle.,"Although different graph self-supervised learning strategies have been proposed to tackle the supervision shortage issue in graph learning tasks, Graph contrastive learning (GCL) has been the most prevalent approach to this problem. Despite the remarkable performances those GCL methods have achieved, existing GCL methods that heavily depend on various manually designed augmentation techniques still struggle to improve model robustness without risking losing task-relevant information. Consequently, the learned representation is either brittle or unilluminating. In light of this, we introduce the GraphCV, which follows the information bottleneck principle to learn minimal yet sufficient representations from graph data. Specifically, our proposed model elicits the predictive (useful for downstream instance discrimination) and other non-predictive features separately. 
In addition to the conventional contrastive loss, which guarantees the consistency and sufficiency of the representations across different augmentation views, we introduce a cross-view reconstruction mechanism to pursue the disentanglement of the two learned representations. Besides, an adversarial global view is added as the third view of contrastive loss to prevent the learned representation from drifting too far away from the original distribution. We empirically demonstrate that our proposed model outperforms the state-of-the-art on the graph classification task over multiple benchmark datasets.","Graph Neural Network, Graph Contrastive Learning" Neural DAEs: Constrained neural networks,https://openreview.net/forum?id=UmU9mydWRV3,https://openreview.net/pdf?id=UmU9mydWRV3,We add constraints to neural networks,"In this article we investigate the effect of explicitly adding auxiliary trajectory information to neural networks for dynamical systems. We draw inspiration from the field of differential-algebraic equations and differential equations on manifolds and implement similar methods in residual neural networks. We discuss constraints through stabilization as well as projection methods, and show when to use which method based on experiments involving simulations of multi-body pendulums and molecular dynamics scenarios. Several of our methods are easy to implement in existing code and have limited impact on performance while giving significant boosts in inference accuracy.","neural networks, differential algebraic equations, constraints" Revisiting the Assumption of Latent Separability for Backdoor Defenses,https://openreview.net/forum?id=_wSHsgrVali,https://openreview.net/pdf?id=_wSHsgrVali,Adaptive Backdoor Attacks against Latent Separation Based Defenses,"Recent studies revealed that deep learning is susceptible to backdoor poisoning attacks. An adversary can embed a hidden backdoor into a model to manipulate its predictions by only modifying a few training data, without controlling the training process. Currently, a tangible signature has been widely observed across a diverse set of backdoor poisoning attacks --- models trained on a poisoned dataset tend to learn separable latent representations for poison and clean samples. This latent separation is so pervasive that a family of backdoor defenses directly take it as a default assumption (dubbed latent separability assumption), based on which to identify poison samples via cluster analysis in the latent space. An intriguing question consequently follows: is the latent separation unavoidable for backdoor poisoning attacks? This question is central to understanding whether the assumption of latent separability provides a reliable foundation for defending against backdoor poisoning attacks. In this paper, we design adaptive backdoor poisoning attacks to present counter-examples against this assumption. Our methods include two key components: (1) a set of trigger-planted samples correctly labeled to their semantic classes (other than the target class) that can regularize backdoor learning; (2) asymmetric trigger planting strategies that help to boost attack success rate (ASR) as well as to diversify latent representations of poison samples. Extensive experiments on benchmark datasets verify the effectiveness of our adaptive attacks in bypassing existing latent separation based backdoor defenses. Moreover, our attacks still maintain a high attack success rate with negligible clean accuracy drop. 
Our studies call for defense designers to take caution when leveraging latent separation as an assumption in their defenses.",Backdoor Attacks Restricted Strong Convexity of Deep Learning Models with Smooth Activations,https://openreview.net/forum?id=PINRbk7h01,https://openreview.net/pdf?id=PINRbk7h01,,"We consider the problem of optimization of deep learning models with smooth activation functions. While there exist influential results on the problem from the ``near initialization'' perspective, we shed considerable new light on the problem. In particular, we make two key technical contributions for such models with $L$ layers, $m$ width, and $\sigma_0^2$ initialization variance. First, for suitable $\sigma_0^2$, we establish a $O(\frac{\text{poly}(L)}{\sqrt{m}})$ upper bound on the spectral norm of the Hessian of such models, considerably sharpening prior results. Second, we introduce a new analysis of optimization based on Restricted Strong Convexity (RSC) which holds as long as the squared norm of the average gradient of predictors is $\Omega(\frac{\text{poly}(L)}{\sqrt{m}})$ for the square loss. We also present results for more general losses. The RSC based analysis does not need the ``near initialization"" perspective and guarantees geometric convergence for gradient descent (GD). To the best of our knowledge, ours is the first result on establishing geometric convergence of GD based on RSC for deep learning models, thus becoming an alternative sufficient condition for convergence that does not depend on the widely-used Neural Tangent Kernel (NTK). We share preliminary experimental results supporting our theoretical advances.", Koopman Neural Operator Forecaster for Time-series with Temporal Distributional Shifts,https://openreview.net/forum?id=kUmdmHxK5N,https://openreview.net/pdf?id=kUmdmHxK5N,,"Temporal distributional shifts, with underlying dynamics changing over time, frequently occur in real-world time series, and pose a fundamental challenge for deep neural networks (DNNs). In this paper, we propose a novel deep sequence model based on the Koopman theory for time series forecasting: Koopman Neural Forecaster (KNF) that leverages DNNs to learn the linear Koopman space and the coefficients of chosen measurement functions. KNF imposes appropriate inductive biases for improved robustness against distributional shifts, employing both a global operator to learn shared characteristics, and a local operator to capture changing dynamics, as well as a specially-designed feedback loop to continuously update the learnt operators over time for rapidly varying behaviors. To the best of our knowledge, this is the first time that Koopman theory is applied to real-world chaotic time series without known governing laws. 
We demonstrate that KNF achieves superior performance compared to the alternatives on multiple time series datasets that are shown to suffer from distribution shifts.","Time series forecasting, Temporal distributional shifts, Koopman Theory" Uncertainty-Driven Exploration for Generalization in Reinforcement Learning,https://openreview.net/forum?id=nulUqBMpBb,https://openreview.net/pdf?id=nulUqBMpBb,We found that exploration is crucial for generalization in contextual MDPs and proposed the first value-based deep RL algorithm that achieves state-of-the-art performance on Procgen.,"Value-based methods tend to outperform policy optimization methods when trained and tested in single environments; however, they significantly underperform when trained on multiple environments with similar characteristics and tested on new ones from the same distribution. We investigate the potential reasons behind the poor generalization performance of value-based methods and discover that exploration plays a crucial role in these settings. Exploration is helpful not only for finding optimal solutions to the training environments, but also for acquiring knowledge that helps generalization to other unseen environments. We show how to make value-based methods competitive with policy optimization methods in these settings by using uncertainty-driven exploration and distributional RL. Our algorithm is the first value-based method to achieve state-of-the-art results on both Procgen and Crafter, two challenging benchmarks for generalization in RL. ","Deep reinforcement learning, exploration, generalization, procgen, crafter" SPIDR: SDF-based Neural Point Fields for Illumination and Deformation,https://openreview.net/forum?id=e3lYU9cD8y,https://openreview.net/pdf?id=e3lYU9cD8y,,"Implicit neural representations such as neural radiance fields (NeRFs) have recently emerged as a promising approach for 3D reconstruction and novel view synthesis. However, NeRF-based methods encode shape, reflectance, and illumination implicitly in their neural representations, and this makes it challenging for users to manipulate these properties in the rendered images explicitly. Existing approaches only enable limited editing of the scene and deformation of the geometry. Furthermore, no existing work enables accurate scene illumination after object deformation. In this work, we introduce SPIDR, a new hybrid neural SDF representation. SPIDR combines point cloud and neural implicit representations to enable the reconstruction of higher quality meshes and surfaces for object deformation and lighting estimation. To more accurately capture environment illumination for scene relighting, we propose a novel neural implicit model to learn environment light. To enable accurate illumination updates after deformation, we use the shadow mapping technique to efficiently approximate the light visibility updates caused by geometry editing. We demonstrate the effectiveness of SPIDR in enabling high quality geometry editing and deformation with accurate updates to the illumination of the scene. In comparison to prior work, we demonstrate significantly better rendering quality after deformation and lighting estimation. 
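One common instantiation of the uncertainty-driven exploration named in the abstract above: act greedily on an ensemble's mean value plus a bonus proportional to the ensemble's disagreement. The ensemble construction and the bonus weight `beta` are assumptions; the paper's exact uncertainty estimator may differ.

```python
# Sketch: optimistic action selection from an ensemble of Q-value heads.
import torch

def select_action(q_ensemble, state, beta=1.0):
    # q_ensemble: list of networks mapping a state vector -> (num_actions,) values
    qs = torch.stack([q(state) for q in q_ensemble])   # (heads, num_actions)
    mean, std = qs.mean(dim=0), qs.std(dim=0)
    return int(torch.argmax(mean + beta * std))        # optimism under uncertainty

heads = [torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(),
                             torch.nn.Linear(32, 4)) for _ in range(5)]
state = torch.randn(8)
print(select_action(heads, state))
```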
", On Convergence of Federated Averaging Langevin Dynamics,https://openreview.net/forum?id=CKTmsDxRPn,https://openreview.net/pdf?id=CKTmsDxRPn,A federated averaging Langevin algorithm (FA-LD) for uncertainty quantification and mean predictions in federated learning.,"We propose a federated averaging Langevin algorithm (FA-LD) for uncertainty quantification and mean predictions with distributed clients. In particular, we generalize beyond normal posterior distributions and consider a general class of models. We develop theoretical guarantees for FA-LD for strongly log-concave distributions with non-i.i.d data and study how the injected noise and the stochastic-gradient noise, the heterogeneity of data, and the varying learning rates affect the convergence. Such an analysis sheds light on the optimal choice of local updates to minimize communication cost. Important to our approach is that the communication efficiency does not deteriorate with the injected noise in the Langevin algorithms. In addition, we examine in our FA-LD algorithm both independent and correlated noise used over different clients. We observe there is a trade-off between the pairs among communication, accuracy, and data privacy. As local devices may become inactive in federated networks, we also show convergence results based on different averaging schemes where only partial device updates are available. In such a case, we discover an additional bias that does not decay to zero.","Langevin dynamics, federated learning, posterior inference, MCMC, stochastic gradient Langevin dynamics, differential privacy" Posthoc Privacy guarantees for neural network queries,https://openreview.net/forum?id=Jw5ivmKS2C,https://openreview.net/pdf?id=Jw5ivmKS2C,We present a framework for achieving formal privacy guarantees in adversarially trained ML models,"Cloud based machine learning inference is an emerging paradigm where users share their data with a service provider. Due to increased concerns over data privacy, recent works have proposed using Adversarial Representation Learning (ARL) to learn a privacy-preserving encoding of sensitive user data before it is shared with an untrusted service provider. Traditionally, the privacy of these encodings is evaluated empirically as they lack formal guarantees. In this work, we develop a new framework that provides formal privacy guarantees for an arbitrarily trained neural network by linking its local Lipschitz constant with its local sensitivity. To utilize local sensitivity for guaranteeing privacy, we extend the Propose-Test-Release(PTR) framework to make it tractable for neural network based queries. We verify the efficacy of our framework experimentally on real-world datasets and elucidate the role of ARL in improving the privacy-utility tradeoff.","data privacy, differential privacy, privacy preserving machine learning, adversarial learning" MetaGL: Evaluation-Free Selection of Graph Learning Models via Meta-Learning,https://openreview.net/forum?id=C1ns08q9jZ,https://openreview.net/pdf?id=C1ns08q9jZ,We present a meta-learning based framework that tackles the new problem of selecting a graph learning model without any evaluation.,"Given a graph learning task, such as link prediction, on a new graph, how can we select the best method as well as its hyperparameters (collectively called a model) without having to train or evaluate any model on the new graph? Model selection for graph learning has been largely ad hoc. 
A typical approach has been to apply popular methods to new datasets, but this is often suboptimal. On the other hand, systematically comparing models on the new graph quickly becomes too costly, or even impractical. In this work, we develop the first meta-learning approach for evaluation-free graph learning model selection, called MetaGL, which utilizes the prior performances of existing methods on various benchmark graph datasets to automatically select an effective model for the new graph, without any model training or evaluations. To quantify similarities across a wide variety of graphs, we introduce specialized meta-graph features that capture the structural characteristics of a graph. Then we design the G-M network, which represents the relations among graphs and models, and develop a graph-based meta-learner operating on this G-M network, which estimates the relevance of each model to different graphs. Extensive experiments show that using MetaGL to select a model for the new graph greatly outperforms several existing meta-learning techniques tailored for graph learning model selection (up to 47% better), while being extremely fast at test time (∼1 sec).","evaluation-free model selection, automatic graph learning, link prediction, meta-learning" FOCUS: Fairness via Agent-Awareness for Federated Learning on Heterogeneous Data,https://openreview.net/forum?id=9bVBH1GD5sr,https://openreview.net/pdf?id=9bVBH1GD5sr,We propose a formal definition of fairness via agent-awareness for FL (FAA) on heterogeneous data and a fair FL training algorithm based on agent clustering (FOCUS) to achieve FAA.,"Federated learning (FL) provides an effective collaborative training paradigm, allowing local agents to train a global model jointly without sharing their local data to protect privacy. On the other hand, due to the heterogeneous nature of local agents, it is challenging to optimize or even define the fairness for agents, which may discourage valuable participation. For instance, the trained global model may sacrifice the performance of a minority user with high-quality data based on loss optimization over all users. Existing work usually considers accuracy equity as fairness for different users in FL, which is limited especially under the heterogeneous setting, since it is intuitively ""unfair"" that agents with low-quality data would achieve similar accuracy. In this work, we aim to address such limitations and propose a formal fairness definition in FL, fairness via agent-awareness (FAA), which takes the heterogeneous data contributions of local agents into account. In addition, we propose a fair FL training algorithm based on agent clustering (FOCUS) to achieve FAA. Theoretically, we prove the convergence and optimality of FOCUS under mild conditions for linear and general convex loss functions with bounded smoothness. We also prove that FOCUS always achieves higher fairness measured by FAA compared with the standard FedAvg protocol under both linear and general convex loss functions. Empirically, we evaluate FOCUS on four datasets, including synthetic data, images, and texts under different settings, and we show that FOCUS achieves significantly higher fairness based on FAA while maintaining similar or even higher prediction accuracy compared with FedAvg and other existing fair FL algorithms. 
","federated learning, fairness, data heterogeneity, clustering, expectation–maximization (EM)" A Simple Unsupervised Data Depth-based Method to Detect Adversarial Images,https://openreview.net/forum?id=RIcaT3C0wP,https://openreview.net/pdf?id=RIcaT3C0wP,We crafted a simple detection method for adversarial samples based on data depths which is especially designed for vision transformers architectures,"Deep neural networks suffer from critical vulnerabilities regarding robustness, which limits their exploitation in many real-world applications. In particular, a serious concern is their inability to defend against adversarial attacks. Although the research community has developed a large amount of effective attacks, the detection problem has received little attention. Existing detection methods either rely on additional training or on specific heuristics at the risk of overfitting. Moreover, they have mainly focused on ResNet architectures while transformers, which are state-of-the-art for vision tasks, have not been properly investigated. In this paper, we overcome these limitations by introducing APPROVED, a simple unsupervised detection method for transformer architectures. It leverages the information available in the logit layer and computes a similarity score with respect to the training distribution. This is accomplished using a data depth that is: (i) computationally efficient; and (ii) non-differentiable, making it harder for gradient-based adversaries to craft malicious samples. Our extensive experiments show that APPROVED consistently outperforms previous detectors on CIFAR10, CIFAR100 and Tiny ImageNet.","Adversarial attacks, Detection, Vision transformers, Safety AI" Co-Evolution As More Than a Scalable Alternative for Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=NTCYXulK9qm,https://openreview.net/pdf?id=NTCYXulK9qm,Evolutionary Algorithms can be competitively used for policy search in multi-agent reinforcement and can scale to a high number of agents.,"In recent years, gradient based multi-agent reinforcement learning is growing in success. One contributing factor is the use of shared parameters for learning policy networks. While this approach scales well with the number of agents during execution it lacks this ambiguity for training as the number of produced samples grows linearly with the number of agents. For a very large number of agents, this could lead to an inefficient use of the circumstantial amount of produced samples. Moreover in single-agent reinforcement learning policy search with evolutionary algorithms showed viable success when sampling can be parallelized on a larger scale. The here proposed method does not only consider sampling in concurrent environments but further investigates sampling diverse parameters from the population in co-evolution in joint environments during training. This co-evolutionary policy search has shown to be capable of training a large number of agents. Beyond that, it has been shown to produce competitive results in smaller environments in comparison to gradient descent based methods. 
This surprising result makes evolutionary algorithms a promising candidate for further research in the context of multi-agent reinforcement learning.","reinforcement learning, multi-agent reinforcement learning, policy search, co-evolution, evolutionary algorithm" Adaptive Parametric Prototype Learning for Cross-Domain Few-Shot Classification,https://openreview.net/forum?id=lTjtY1HOUI6,https://openreview.net/pdf?id=lTjtY1HOUI6,,"Cross-domain few-shot classification induces a much more challenging problem than its in-domain counterpart due to the existence of domain shifts between the training and test tasks. In this paper, we develop a novel Adaptive Parametric Prototype Learning (APPL) method under the meta-learning convention for cross-domain few-shot classification. Different from existing prototypical few-shot methods that use the averages of support instances to calculate the class prototypes, we propose to learn class prototypes from the concatenated features of the support set in a parametric fashion and meta-learn the model by enforcing prototype-based regularization on the query set. In addition, we fine-tune the model in the target domain in a transductive manner using a weighted-moving-average self-training approach on the query instances. We conduct experiments on multiple cross-domain few-shot benchmark datasets. The empirical results demonstrate that APPL yields performance superior to many state-of-the-art methods. ", Minimum Description Length Control,https://openreview.net/forum?id=oX3tGygjW1q,https://openreview.net/pdf?id=oX3tGygjW1q,"We propose a novel framework for multitask reinforcement learning which seeks to distill shared structure among tasks into a low-complexity representation, which is then leveraged to accelerate convergence on new tasks. ","We propose a novel framework for multitask reinforcement learning based on the minimum description length (MDL) principle. In this approach, which we term MDL-control (MDL-C), the agent learns the common structure among the tasks with which it is faced and then distills it into a simpler representation which facilitates faster convergence and generalization to new tasks. In doing so, MDL-C naturally balances adaptation to each task with epistemic uncertainty about the task distribution. We motivate MDL-C via formal connections between the MDL principle and Bayesian inference, derive theoretical performance guarantees, and demonstrate MDL-C's empirical effectiveness on both discrete and high-dimensional continuous control tasks.","multitask reinforcement learning, RL, reinforcement learning, MDL" RainProof: An Umbrella to Shield Text Generator from Out-Of-Distribution Data,https://openreview.net/forum?id=_4F4CDK9Mo,https://openreview.net/pdf?id=_4F4CDK9Mo,Out of distribution detection for natural language generation,"As more and more conversational and translation systems are deployed in production, it is essential to implement and develop effective control mechanisms to ensure their proper functioning and security. An essential component to ensure the safe behavior of the system is out-of-distribution (OOD) detection, which aims to detect whether an input sample is statistically far from the training distribution. While OOD detection is a widely covered topic in classification tasks, it has received much less attention in text generation. This paper addresses the problem of OOD detection for machine translation and dialog generation from an operational perspective.
Our contributions include (i) RAINPROOF, a Relative informAItioN Projection Out OF distribution detection framework, and (ii) a more operational evaluation setting for OOD detection. Surprisingly, we find that OOD detection is not necessarily aligned with task-specific measures. The OOD detector may filter out samples that are well processed by the model and keep samples that are not, leading to weaker performance. Our results show that RAINPROOF breaks this curse and achieves good results in OOD detection while increasing system performance.","NLP, OOD detection, natural language generation" Variance Double-Down: The Small Batch Size Anomaly in Multistep Deep Reinforcement Learning,https://openreview.net/forum?id=6R1unINH63,https://openreview.net/pdf?id=6R1unINH63,"We perform an exhaustive investigation into the interplay of batch size and update horizon and uncover a surprising phenomenon: when increasing the update horizon, it is more beneficial to decrease the batch size","State-of-the-art results in reinforcement learning suggest that multi-step learning is necessary. However, the increased variance that comes with it makes it difficult to increase the update horizon beyond relatively small numbers. In this paper, we report the counterintuitive finding that decreasing the batch size substantially improves performance across a large swath of deep RL agents. It is well-known that gradient variance decreases with increasing batch sizes, so obtaining improved performance by increasing variance on two fronts is a rather surprising finding. We conduct a broad set of experiments to better understand this variance double-down phenomenon.","Reinforcement Learning, Deep Reinforcement Learning, Value based, Batch Size, Multi step learning" PerFedMask: Personalized Federated Learning with Optimized Masking Vectors,https://openreview.net/forum?id=hxEIgUXLFF,https://openreview.net/pdf?id=hxEIgUXLFF, We propose PerFedMask to address both the data and device heterogeneity issues in federated learning.,"Recently, various personalized federated learning (FL) algorithms have been proposed to tackle data heterogeneity. To mitigate device heterogeneity, a common approach is to use masking. In this paper, we first show that using random masking can lead to a bias in the obtained solution of the learning model. To this end, we propose a personalized FL algorithm with optimized masking vectors called PerFedMask. In particular, PerFedMask enables each device to obtain its optimized masking vector based on its computational capability before training. Fine-tuning is performed after training. PerFedMask is a generalization of a recently proposed personalized FL algorithm, FedBABU (Oh et al., 2022). PerFedMask can be combined with other FL algorithms including HeteroFL (Diao et al., 2021) and Split-Mix FL (Hong et al., 2022).
Results based on the CIFAR-10 and CIFAR-100 datasets show that the proposed PerFedMask algorithm provides a higher test accuracy after fine-tuning and a lower average number of trainable parameters when compared with six existing state-of-the-art FL algorithms in the literature.","Computational capability, Data heterogeneity, Masking vectors, Personalized federated learning" Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP,https://openreview.net/forum?id=FELWgMjxZJj,https://openreview.net/pdf?id=FELWgMjxZJj,"For the first time, a zero-shot segmentation model matches the supervised model on ADE-20k without seeing a single training image.","Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training. Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions. We identify the performance bottleneck of this paradigm to be the pre-trained CLIP model, since it does not perform well on masked images. To address this, we propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions. We collect training data by mining an existing image-caption dataset (e.g., COCO Captions), using CLIP to match masked image regions to nouns in the image captions. Compared with the more precise and manually annotated segmentation labels with fixed classes (e.g., COCO-Stuff), we find our noisy but diverse dataset can better retain CLIP's generalization ability. Along with finetuning the entire model, we utilize the ""blank"" areas in masked images using a method we dub mask prompt tuning. Experiments demonstrate mask prompt tuning brings significant improvement without modifying any weights of CLIP, and it can further improve a fully finetuned model. In particular, when trained on COCO and evaluated on ADE20K-150, our best model achieves 29.6% mIoU, which is +8.5% higher than the previous state-of-the-art. For the first time, open-vocabulary generalist models match the performance of 2017 supervised specialist models without dataset-specific adaptations.","vision-language models, open-vocabulary, image segmentation" Variational Latent Branching Model for Off-Policy Evaluation,https://openreview.net/forum?id=3VFQfAG3vwi,https://openreview.net/pdf?id=3VFQfAG3vwi,,"Model-based methods have recently shown great potential for off-policy evaluation (OPE); offline trajectories induced by behavioral policies are fitted to transitions of Markov decision processes (MDPs), which are used to roll out simulated trajectories and estimate the performance of policies. Model-based OPE methods face two key challenges. First, as offline trajectories are usually fixed, they tend to cover limited state and action space. Second, the performance of model-based methods can be sensitive to the initialization of their parameters. In this work, we propose the variational latent branching model (VLBM) to learn the transition function of MDPs by formulating the environmental dynamics as a compact latent space, from which the next states and rewards are then sampled. Specifically, VLBM leverages and extends the variational inference framework with the recurrent state alignment (RSA), which is designed to capture as much of the information underlying the limited training data as possible, by smoothing out the information flow between the variational (encoding) and generative (decoding) parts of VLBM.
Moreover, we introduce the branching architecture to improve the model’s robustness against randomly initialized model weights. The effectiveness of the VLBM is evaluated on the deep OPE (DOPE) benchmark, from which the training trajectories are designed to result in varied coverage of the state-action space. We show that the VLBM outperforms existing state-of-the-art OPE methods in general.","Model-Based Off-policy Evaluation, Reinforcement Learning, Variational Inference" Building compact representations for image-language learning,https://openreview.net/forum?id=3ZGJVocZ2XQ,https://openreview.net/pdf?id=3ZGJVocZ2XQ,,"We propose a method to learn compact vision and language representations, which adaptively and iteratively fuses the multi-modal features. It greatly lowers the FLOPs of the model by effectively combining and reducing the number of tokens used for both text and images. This allows the model to scale without a large increase in FLOPs or memory and leads to data-efficient training. In addition, we propose adaptive pre-training data sampling, which further improves the data efficiency. We achieve competitive performance compared to much larger models, and do so with significantly less data and FLOPs. With only 40M training examples and 39 GFLOPs, our 350M-parameter model outperforms all methods that have used less than 1B examples for pre-training. Code will be released. ", Discretization Invariant Learning on Neural Fields,https://openreview.net/forum?id=pJ9Kg_K8ufd,https://openreview.net/pdf?id=pJ9Kg_K8ufd,We design a discretization invariant framework for learning various tasks on neural fields of arbitrary parameterization.,"While neural fields have emerged as powerful representations of continuous data, there is a need for neural networks that can perform inference on such data without being sensitive to how the field is sampled, a property called discretization invariance. We develop DI-Net, a framework for learning discretization invariant operators on neural fields of any type. Whereas current theoretical analyses of discretization invariant networks are restricted to the limit of infinite samples, our analysis does not require infinite samples and establishes upper bounds on the variation in DI-Net outputs given different finite discretizations. Our framework leads to a family of neural networks driven by numerical integration via quasi-Monte Carlo sampling with discretizations of low discrepancy. DI-Nets enjoy several desirable theoretical properties such as universal approximation of a large class of maps between $L^2$ functions with gradients that are also discretization invariant. DI-Nets can also be seen as generalizations of many existing network families as they bridge discrete and continuous network classes, such as convolutional neural networks (CNNs) and neural operators, respectively.
Experimentally, DI-Nets derived from CNNs are demonstrated to classify and segment visual data represented by neural fields under various discretizations, and sometimes even generalize to new types of discretizations at test time.","discretization invariance, neural fields, universal approximation, numerical integration, quasi-Monte Carlo" Dynamic Pretraining of Vision-Language Models,https://openreview.net/forum?id=QcffIcjq8bl,https://openreview.net/pdf?id=QcffIcjq8bl,We propose a dynamic pretraining resampling of tasks that learns faster and better models," Vision-Language pretraining aims to learn universal cross-modal representations and to create models with broad capabilities. In this paper, we propose a novel dynamic pretraining resampling for a variety of pretraining tasks. Unlike recent large-scale vision-language approaches, we show that a set of diverse self- and weakly-supervised pretraining tasks dynamically sampled according to task difficulty provides strong performance. Further, the approach is sample-efficient, using much less data and compute to address a range of downstream tasks. We show that a single 330M-parameter pretrained model, using only smaller and publicly accessible datasets, achieves competitive or SOTA performance on three diverse groups of tasks: visual question answering, text-based image localization by referring expressions, and video question answering. The code will be released.","pretraining, vision language, sampling, curriculum learning" HEAV: Hierarchical Ensembling of Augmented Views for Image Captioning,https://openreview.net/forum?id=RYRUJEjcCY,https://openreview.net/pdf?id=RYRUJEjcCY,We tackle the problem of how to efficiently and effectively leverage and ensemble heterogeneous views for image captioning,"A great deal of progress has been made in image captioning, driven by research into how to encode the image using pre-trained models. This includes visual encodings (e.g. image grid features or detected objects) and more recently textual encodings (e.g. image tags or text descriptions of image regions). As more advanced encodings are available and incorporated, it is natural to ask: how to efficiently and effectively leverage and ensemble the heterogeneous set of encodings? In this paper, we propose to regard the encodings as augmented views of the input image. The model encodes each view independently with a shared encoder efficiently, and a contrastive loss is incorporated across the encoded views to improve the representation quality, as well as to enable semi-supervised training of image captioning. Our proposed hierarchical decoder then adaptively ensembles the encoded views according to their usefulness by first ensembling within each view at the token level, and then across views at the view level. We demonstrate significant performance improvements of +5.6% CIDEr on MS-COCO compared to state of the art under the same trained-from-scratch setting and +16.8% CIDEr on Flickr30K with semi-supervised training, and conduct rigorous analyses to demonstrate the importance of each part of our design. ","image captioning, vision and language" Tuning Frequency Bias in Neural Network Training with Nonuniform Data,https://openreview.net/forum?id=oLIZ2jGTiv,https://openreview.net/pdf?id=oLIZ2jGTiv,,"Small generalization errors of over-parameterized neural networks (NNs) can be partially explained by the frequency biasing phenomenon, where gradient-based algorithms minimize the low-frequency misfit before reducing the high-frequency residuals.
Using the Neural Tangent Kernel (NTK), one can provide a theoretically rigorous analysis for training where data are drawn from constant or piecewise-constant probability densities. Since most training data sets are not drawn from such distributions, we use the NTK model and a data-dependent quadrature rule to theoretically quantify the frequency biasing of NN training given fully nonuniform data. By replacing the loss function with a carefully selected Sobolev norm, we can further amplify, dampen, counterbalance, or reverse the intrinsic frequency biasing in NN training.","frequency bias, neural networks, training, nonuniform, Sobolev norms, Neural Tangent Kernel" "Global Counterfactual Explanations Are Reliable Or Efficient, But Not Both",https://openreview.net/forum?id=NN1sraxIyZ,https://openreview.net/pdf?id=NN1sraxIyZ,,"Counterfactual explanations have been widely studied in explainability, with a range of application-dependent methods emerging in fairness, recourse and model understanding. The major shortcoming associated with these methods, however, is their inability to provide explanations beyond the local or instance-level. While many works touch upon the notion of a global explanation, typically suggesting to aggregate masses of local explanations in the hope of ascertaining global properties, few provide frameworks that are both reliable and computationally tractable. Meanwhile, practitioners are requesting more efficient and interactive explainability tools. We take this opportunity to investigate existing methods, improving the efficiency of Actionable Recourse Summaries (AReS), one of the only known global recourse frameworks, and proposing Global & Efficient Counterfactual Explanations (GLOBE-CE), a novel and flexible framework that tackles the scalability issues associated with the current state-of-the-art, particularly on higher-dimensional datasets and in the presence of continuous features. Furthermore, we provide a unique mathematical analysis of categorical feature translations, utilising it in our method. Experimental evaluation with real-world datasets and user studies verify the speed, reliability and interpretability improvements of our framework.","Global, counterfactual, explanations, recourse, fairness, efficiency, reliability, black box" Learning Multimodal Data Augmentation in Feature Space,https://openreview.net/forum?id=6SRDbbvU8s,https://openreview.net/pdf?id=6SRDbbvU8s,,"The ability to jointly learn from multiple modalities, such as text, audio, and visual data, is a defining feature of intelligent systems. While there have been promising advances in designing neural networks to harness multimodal data, the enormous success of data augmentation currently remains limited to single-modality tasks like image classification. Indeed, it is particularly difficult to augment each modality while preserving the overall semantic structure of the data; for example, a caption may no longer be a good description of an image after standard augmentations have been applied, such as translation. Moreover, it is challenging to specify reasonable transformations that are not tailored to a particular modality. In this paper, we introduce LeMDA, Learning Multimodal Data Augmentation, an easy-to-use method that automatically learns to jointly augment multimodal data in feature space, with no constraints on the identities of the modalities or the relationship between modalities.
We show that LeMDA can (1) profoundly improve the performance of multimodal deep learning architectures, (2) apply to combinations of modalities that have not been previously considered, and (3) achieve state-of-the-art results on a wide range of applications comprising image, text, and tabular data.", Where to Begin? Exploring the Impact of Pre-Training and Initialization in Federated Learning,https://openreview.net/forum?id=Mpa3tRJFBb,https://openreview.net/pdf?id=Mpa3tRJFBb,Stop worrying about heterogeneity and start from pre-trained weights.,"An oft-cited challenge of federated learning is the presence of heterogeneity. \emph{Data heterogeneity} refers to the fact that data from different clients may follow very different distributions. \emph{System heterogeneity} refers to the fact that client devices have different system capabilities. A considerable number of federated optimization methods address this challenge. In the literature, empirical evaluations usually start federated training from random initialization. However, in many practical applications of federated learning, the server has access to proxy data for the training task that can be used to pre-train a model before starting federated training. We empirically study the impact of starting from a pre-trained model in federated learning using four standard federated learning benchmark datasets. Unsurprisingly, starting from a pre-trained model reduces the training time required to reach a target error rate and enables the training of more accurate models (up to 40\%) than is possible when starting from random initialization. Surprisingly, we also find that starting federated learning from a pre-trained initialization reduces the effect of both data and system heterogeneity. We recommend that future work proposing and evaluating federated optimization methods evaluate the performance when starting from random and pre-trained initializations. We also believe this study raises several questions for further work on understanding the role of heterogeneity in federated optimization.","federated learning, optimization" BigVGAN: A Universal Neural Vocoder with Large-Scale Training,https://openreview.net/forum?id=iTtGCMDEzS_,https://openreview.net/pdf?id=iTtGCMDEzS_,,"Despite recent progress in generative adversarial network (GAN)-based vocoders, where the model generates a raw waveform conditioned on acoustic features, it is challenging to synthesize high-fidelity audio for numerous speakers across various recording environments. In this work, we present BigVGAN, a universal vocoder that generalizes well for various out-of-distribution (OOD) scenarios without fine-tuning. We introduce a periodic activation function and anti-aliased representation into the GAN generator, which brings the desired inductive bias for audio synthesis and significantly improves audio quality. In addition, we train our GAN vocoder at the largest scale up to 112M parameters, which is unprecedented in the literature. We identify and address the failure modes in large-scale GAN training for audio, while maintaining high-fidelity output without over-regularization. Our BigVGAN achieves state-of-the-art performance for various scenarios, including new speakers, novel languages, unseen recording environments, singing voices, music, and instrumental audio.
Code and model will be released.","audio synthesis, speech synthesis, waveform generation, universal neural vocoder" PaLI: A Jointly-Scaled Multilingual Language-Image Model,https://openreview.net/forum?id=mWVoBz4W0u,https://openreview.net/pdf?id=mWVoBz4W0u,,"Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI, a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pretrained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train a large, 4-billion parameter ViT (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art results in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.", Achieving Sub-linear Regret in Infinite Horizon Average Reward Constrained MDP with Linear Function Approximation,https://openreview.net/forum?id=zZhX4eYNeeh,https://openreview.net/pdf?id=zZhX4eYNeeh,We provide the first sub-linear regret and sub-linear constraint violation for constrained MDPs with linear function approximation using a model-free RL algorithm,"We study the infinite horizon average reward constrained Markov Decision Process (CMDP). In contrast to existing works on model-based methods with finite state spaces, we consider the model-free linear CMDP setup. We first propose a computationally inefficient algorithm and show that $\tilde{\mathcal{O}}(\sqrt{d^3T})$ regret and constraint violation can be achieved, in which $T$ is the number of interactions, and $d$ is the dimension of the feature mapping. We also propose an efficient variant based on the primal-dual adaptation of the LSVI-UCB algorithm and show that $\tilde{\mathcal{O}}((dT)^{3/4})$ regret and constraint violation can be achieved. This improves the known regret bound of $\tilde{\mathcal{O}}(T^{5/6})$ for finite state-space model-free constrained RL, which was obtained under a stronger assumption compared to ours. We also develop an efficient policy-based algorithm via a novel adaptation of the MDP-EXP2 algorithm to our primal-dual setup with $\tilde{\mathcal{O}}(\sqrt{T})$ regret and even a zero constraint violation bound under a stronger set of assumptions.","Reinforcement Learning Theory, Infinite horizon Average Reward, Theory of Constrained Reinforcement Learning, Linear MDP, Model-free RL, Soft-max" Causal Imitation Learning via Inverse Reinforcement Learning,https://openreview.net/forum?id=B-z41MBL_tH,https://openreview.net/pdf?id=B-z41MBL_tH,This paper proposes novel inverse reinforcement learning methods to learn effective imitating policies from the expert's demonstrations when unobserved confounders are present.,"One of the most common ways children learn when unfamiliar with the environment is by mimicking adults.
Imitation learning concerns an imitator learning to behave in an unknown environment from an expert's demonstration; reward signals remain latent to the imitator. This paper studies imitation learning through causal lenses and extends the analysis and tools developed for behavior cloning (Zhang, Kumor, Bareinboim, 2020) to inverse reinforcement learning. First, we propose novel graphical conditions that allow the imitator to learn a policy performing as well as the expert's behavior policy, even when the imitator's and the expert's state-action spaces disagree, and unobserved confounders (UCs) are present. When provided with parametric knowledge about the unknown reward function, such a policy may outperform the expert's. Also, our method is easily extensible and allows one to leverage existing IRL algorithms even when UCs are present, including the multiplicative-weights algorithm (MWAL) (Syed & Schapire, 2008) and generative adversarial imitation learning (GAIL) (Ho & Ermon, 2016). Finally, we validate our framework by simulations using real-world and synthetic data.","Causal Inference, Graphical Models" Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale,https://openreview.net/forum?id=zfodIZGVWW,https://openreview.net/pdf?id=zfodIZGVWW,An optimizer that consistently converges faster (<=70% training steps) than AdamW for pre-training Transformer variants.,"We present Amos, a stochastic gradient-based optimizer designed for training deep neural networks. It can be viewed as an Adam optimizer with theoretically supported, adaptive learning-rate decay and weight decay. A key insight behind Amos is that it leverages model-specific information to determine the initial learning-rate and decaying schedules. When used for pre-training BERT variants and T5, Amos consistently converges faster than the state-of-the-art settings of AdamW, achieving better validation loss within <=70% training steps and time, while requiring <=51% memory for slot variables.","optimization, asymptotic behavior of stochastic optimization, learning-rate decay, weight decay, language model pre-training, Transformer pre-training" The Surprising Computational Power of Nondeterministic Stack RNNs,https://openreview.net/forum?id=o58JtGDs6y,https://openreview.net/pdf?id=o58JtGDs6y,"We show that nondeterministic stack RNNs can learn non-CFLs and languages with surprisingly large alphabets, and we propose a new version that models a stack of vector embeddings.","Traditional recurrent neural networks (RNNs) have a fixed, finite number of memory cells. In theory (assuming bounded range and precision), this limits their formal language recognition power to regular languages, and in practice, RNNs have been shown to be unable to learn many context-free languages (CFLs). In order to expand the class of languages RNNs recognize, prior work has augmented RNNs with a nondeterministic stack data structure, putting them on par with pushdown automata and increasing their language recognition power to CFLs. Nondeterminism is needed for recognizing all CFLs (not just deterministic CFLs), but in this paper, we show that nondeterminism and the neural controller interact to produce two more unexpected abilities. First, the nondeterministic stack RNN can recognize not only CFLs, but also many non-context-free languages. Second, it can recognize languages with much larger alphabet sizes than one might expect given the size of its stack alphabet.
Finally, to increase the information capacity in the stack and allow it to solve more complicated tasks with large alphabet sizes, we propose a new version of the nondeterministic stack that simulates stacks of vectors rather than discrete symbols. We demonstrate perplexity improvements with this new model on the Penn Treebank language modeling benchmark.","formal languages, pushdown automata, language modeling, RNN" Critical Initialization of Wide and Deep Neural Networks through Partial Jacobians: General Theory and Applications,https://openreview.net/forum?id=xb333aboIu,https://openreview.net/pdf?id=xb333aboIu,"We introduce a new diagnostic for critical initialization in deep neural networks, and show that a combination of LayerNorm and residual connections leads to everywhere critical architectures.","Deep neural networks are notorious for defying theoretical treatment. However, when the number of parameters in each layer tends to infinity, the network function is a Gaussian process (GP) and a quantitatively predictive description is possible. The Gaussian approximation allows one to formulate criteria for selecting hyperparameters, such as variances of weights and biases, as well as the learning rate. These criteria rely on the notion of criticality defined for deep neural networks. In this work, we describe a new practical way to diagnose criticality. We introduce *partial Jacobians* of a network, defined as derivatives of preactivations in layer $l$ with respect to preactivations in layer $l_0\leq l$. We derive recurrence relations for the norms of partial Jacobians and utilize these relations to analyze criticality of deep fully connected neural networks with LayerNorm and/or residual connections. We derive and implement a simple and cheap numerical test that allows one to select optimal initialization for a broad class of deep neural networks. Using these tools we show quantitatively that proper stacking of the LayerNorm (applied to preactivations) and residual connections leads to an architecture that is critical for any initialization. Finally, we apply our methods to analyze the MLP-Mixer architecture and show that it is everywhere critical.","Criticality, Gaussian Process, Jacobian, LayerNorm, Residual connections" Agnostic Learning of General ReLU Activation Using Gradient Descent,https://openreview.net/forum?id=EnrY5TOrbQ,https://openreview.net/pdf?id=EnrY5TOrbQ,We provide a convergence analysis of gradient descent for the problem of agnostically learning a single ReLU function under Gaussian distributions that achieves loss of O(OPT).,"We provide a convergence analysis of gradient descent for the problem of agnostically learning a single ReLU function under Gaussian distributions. Unlike prior work that studies the setting of zero bias, we consider the more challenging scenario when the bias of the ReLU function is non-zero. Our main result establishes that starting from random initialization, in a polynomial number of iterations, gradient descent outputs, with high probability, a ReLU function that achieves an error that is within a constant factor of the optimal, i.e., it is guaranteed to achieve an error of $O(OPT)$, where $OPT$ is the error of the best ReLU function. This is a significant improvement over existing guarantees for gradient descent, which only guarantee error of $O(\sqrt{d \cdot OPT})$ even in the zero-bias case (Frei et al., 2020). We also provide finite sample guarantees, and obtain similar guarantees for a broader class of marginal distributions beyond Gaussians.
","agnostic learning, learning ReLU, global convergence, learning theory" Parametrizing Product Shape Manifolds by Composite Networks,https://openreview.net/forum?id=F_EhNDSamN,https://openreview.net/pdf?id=F_EhNDSamN,,"Parametrizations of data manifolds in shape spaces can be computed using the rich toolbox of Riemannian geometry. This, however, often comes with high computational costs, which raises the question if one can learn an efficient neural network approximation. We show that this is indeed possible for shape spaces with a special product structure, namely those smoothly approximable by a direct sum of low-dimensional manifolds. Our proposed architecture leverages this structure by separately learning approximations for the low-dimensional factors and a subsequent combination. After developing the approach as a general framework, we apply it to a shape space of triangular surfaces. Here, typical examples of data manifolds are given through datasets of articulated models and can be factorized, for example, by a Sparse Principal Geodesic Analysis (SPGA). We demonstrate the effectiveness of our proposed approach with experiments on synthetic data as well as manifolds extracted from data via SPGA.","shape spaces, product manifolds, nonlinear statistics, low-dimensional data manifolds" CURE: A Pre-training Framework on Large-scale Patient Data for Treatment Effect Estimation,https://openreview.net/forum?id=W0deqi42HD,https://openreview.net/pdf?id=W0deqi42HD,,"Treatment effect estimation (TEE) refers to the estimation of causal effects, and it aims to compare the difference among treatment strategies on important outcomes. Current machine learning based methods are mainly trained on labeled data with specific treatments or outcomes of interest, which can be sub-optimal if the labeled data are limited. In this paper, we propose a novel transformer-based pre-training and fine-tuning framework called CURE for TEE from observational data. CURE is pre-trained on large-scale unlabeled patient data to learn representative contextual patient representations, and then fine-tuned on labeled patient data for TEE. We design a new sequence encoding for longitudinal (or structured) patient data and we incorporate structure and time into patient embeddings. Evaluated on 4 downstream TEE tasks, CURE outperforms the state-of-the-art methods in terms of an average of 3.8\% and 6.9\% absolute improvement in Area under the ROC Curve (AUC) and Area under the Precision-Recall Curve (AUPR), and 15.7\% absolute improvement in Influence function-based Precision of Estimating Heterogeneous Effects (IF-PEHE). We further demonstrate the data scalability of CURE and verify the results with corresponding randomized clinical trials. Our proposed method provides a new machine learning paradigm for TEE based on observational data. ", A Probabilistic Approach to Self-Supervised Learning using Cyclical Stochastic Gradient MCMC ,https://openreview.net/forum?id=GPPmQdU3k7,https://openreview.net/pdf?id=GPPmQdU3k7,,"In this paper we present a practical Bayesian formulation for self-supervised learning method with Cyclical Stochastic Gradient Hamiltonian Monte Carlo (cSGHMC). Within this framework, we place a prior over the parameters of a self-supervised learning model and use cSGHMC to approximate the high dimensional and multimodal posterior distribution over the embeddings. By exploring an expressive posterior over the embeddings, the Bayesian self-supervised learning produces interpretable and diverse representations. 
Marginalising over these representations results in improvements in semi-supervised learning and out-of-distribution detection tasks. We provide experimental results on multiple classification tasks in semi-supervised learning, including CIFAR-10 and CIFAR-100. Moreover, we demonstrate the effectiveness of the proposed method on an out-of-distribution detection task using the SVHN dataset.", "Tabular Data to Image Generation: Benchmark Data, Approaches, and Evaluation",https://openreview.net/forum?id=g7TXnKjn3Y,https://openreview.net/pdf?id=g7TXnKjn3Y,We study the problem of generating a set of images from an arbitrary tabular dataset,"In this work, we study the problem of generating a set of images from an arbitrary tabular dataset. The set of generated images provides an intuitive visual summary of the tabular data that can be quickly and easily communicated and understood by the user. More specifically, we formally introduce this new dataset-to-image generation task and discuss a few motivating applications, including exploratory data analysis and understanding customer segments for creating better marketing campaigns. We then curate a benchmark dataset for training such models, which we release publicly for others to use and develop new models for other important applications of interest. Further, we describe a general and flexible framework that serves as a fundamental basis for studying and developing models for this new task of generating images from tabular data. From the framework, we propose a few different approaches with varying levels of complexity and tradeoffs. One such approach leverages both numerical and textual data as the input to our image generation pipeline. The pipeline consists of an image decoder and a conditional auto-regressive sequence generation model which also includes a pre-trained tabular representation in the input layer. We evaluate the performance of these approaches through several quantitative metrics (FID for image quality and LPIPS scores for image diversity).", Learning Hyper Label Model for Programmatic Weak Supervision,https://openreview.net/forum?id=aCQt_BrkSjC,https://openreview.net/pdf?id=aCQt_BrkSjC,A hyper label model to aggregate weak labels from multiple weak supervision sources to infer the ground-truth labels in a single forward pass,"To reduce human annotation efforts, the programmatic weak supervision (PWS) paradigm abstracts weak supervision sources as labeling functions (LFs) and involves a label model to aggregate the output of multiple LFs to produce training labels. Most existing label models require a parameter learning step for each dataset. In this work, we present a hyper label model that (once learned) infers the ground-truth labels for each dataset in a single forward pass without dataset-specific parameter learning. The hyper label model approximates an optimal analytical (yet computationally intractable) solution of the ground-truth labels. We train the model on synthetic data generated in a way that ensures the model approximates the analytical optimal solution, and build the model upon a Graph Neural Network (GNN) to ensure that the model prediction is invariant (or equivariant) to permutations of LFs (or data points).
On 14 real-world datasets, our hyper label model outperforms the best existing methods in both accuracy (by 1.4 points on average) and efficiency (by six times on average).","Programmatic Weak Supervision, Data Programming, Label Model" "SlenderGNN: Accurate, Robust, and Interpretable GNN, and the Reasons for its Success",https://openreview.net/forum?id=lMgFRIILVB,https://openreview.net/pdf?id=lMgFRIILVB,"We propose SlenderGNN, a linear GNN whose motivations are derived from comprehensive linearization of existing models.","Can we design a GNN that is accurate and interpretable at the same time? Could it also be robust to handle the case of homophily, heterophily, or even noisy edges without network effects? We propose SlenderGNN, which has all the desirable properties: (a) accurate, (b) robust, and (c) interpretable. To understand the reasons for its success, we had to dig deeper: the result is our GNNLIN framework, which highlights the fundamental differences among popular GNN models (e.g., feature combination, structural normalization, etc.) and thus reveals the reasons for the success of our SlenderGNN, as well as the reasons for occasional failures of other GNN variants. Thanks to our careful design, SlenderGNN passes all the 'sanity checks' we propose, and it achieves the highest overall accuracy on 9 real-world datasets of both homophily and heterophily graphs, when compared against 10 recent GNN models. Specifically, SlenderGNN exceeds the accuracy of linear GNNs and matches or exceeds the accuracy of nonlinear models with up to 64 times fewer parameters.","Graph neural networks, Linear models, Node classification, Heterophily graphs, Lightweight models" FedFA: Federated Feature Augmentation,https://openreview.net/forum?id=U9yFP90jU0,https://openreview.net/pdf?id=U9yFP90jU0,"We propose a simple, flexible, effective, and robust method, named FedFA, to solve federated learning from a novel perspective of federated feature augmentation.","Federated learning is a distributed paradigm that allows multiple parties to collaboratively train deep models without exchanging the raw data. However, the data distribution among clients is naturally non-i.i.d., which leads to severe degradation of the learnt model. The primary goal of this paper is to develop a robust federated learning algorithm to address feature shift in clients' samples, which can be caused by various factors, e.g., acquisition differences in medical imaging. To reach this goal, we propose FedFA to tackle federated learning from a distinct perspective of federated feature augmentation. FedFA is based on a major insight that instance-level feature statistics (i.e., mean and standard deviation) represent a special type of ``features'' that encodes domain-specific characteristics; hence, proper manipulations of the local feature statistics in the federation may beget novel domains, which can potentially alleviate local feature shift and benefit collaborative learning. Based on this insight, we model each feature statistic probabilistically via a Gaussian distribution, with the mean corresponding to the original statistic and the variance quantifying the augmentation scope, from which novel feature statistics can be drawn to fulfill augmentation. Key to our approach is the determination of a meaningful Gaussian variance, which is accomplished by taking into account not only the biased data of each individual client but also the underlying feature statistics characterized by all participating clients.
We demonstrate through extensive experiments that FedFA can significantly advance federated learning in diverse scenarios. The code will be released. ","federated learning, feature augmentation" Capsa: A Unified Framework for Quantifying Risk in Deep Neural Networks,https://openreview.net/forum?id=_BSowr-_ED,https://openreview.net/pdf?id=_BSowr-_ED,"A simple, scalable, and composable framework for creating risk-aware models","The modern pervasiveness of large-scale deep neural networks (NNs) is driven by their extraordinary performance on complex problems but is also plagued by their sudden, unexpected, and often catastrophic failures, particularly in challenging scenarios. Unfortunately, existing algorithms to achieve risk-awareness in NNs are complex and ad hoc. Specifically, these methods require significant engineering changes, are often developed only for particular settings, and are not easily composable. Here we present Capsa, a flexible framework for extending models with risk-awareness. Capsa provides a principled methodology for quantifying multiple forms of risk and composes different algorithms together to quantify different risk metrics in parallel. We validate Capsa by implementing state-of-the-art uncertainty estimation algorithms within the Capsa framework and benchmarking them on complex perception datasets. Furthermore, we demonstrate the ability of Capsa to easily compose aleatoric uncertainty, epistemic uncertainty, and bias estimation together in a single function set, and show how this integration provides a comprehensive awareness of NN risk. ","uncertainty estimation, robustness, risk-aware ML" Show and Write: Entity-aware Article Generation with Image Information,https://openreview.net/forum?id=AtWKqgziLF,https://openreview.net/pdf?id=AtWKqgziLF,,"Prior work for article generation has primarily focused on generating articles using a human-written prompt to provide topical context and metadata about the article. However, for many applications, such as generating news stories, these articles are also often paired with images and their captions or alt-text, which in turn are based on real-world events and may reference many different named entities that are difficult for language models to correctly recognize and predict. To address this shortcoming, this paper introduces an ENtity-aware article Generation method with Image iNformation, ENGIN, to incorporate an article's image information into language models. ENGIN represents articles that can be conditioned on metadata used by prior work and information such as captions and named entities extracted from images. Our key contribution is a novel entity-aware mechanism to help our model recognize and predict the entity names in articles. We perform experiments on three public datasets, GoodNews, VisualNews, and WikiText. Quantitative results show that our approach improves generated article perplexity by 4-5 points over the base models. Qualitative results demonstrate that the text generated by ENGIN is more consistent with embedded article images. We also perform article quality annotation experiments on the generated articles to validate that our model produces higher-quality articles.
Finally, we investigate the effect ENGIN has on methods that automatically detect machine-generated articles.","image-to-text generation, language modeling, named entity recognition" Noise$^+$2Noise: Co-taught De-noising Autoencoders for Time-Series Data,https://openreview.net/forum?id=QLVvgqcyuj,https://openreview.net/pdf?id=QLVvgqcyuj,We combine Co-teaching and De-noising Autoencoders to recover clean signals from only noisy data in a time series setting.,"We consider the task of learning to recover clean signals given only access to noisy data. Recent work in computer vision has addressed this problem in the context of images using denoising autoencoders (DAEs). However, to date, DAEs for learning from noisy data have not been explored in the context of time-series data. DAEs for denoising images often rely on assumptions unlikely to hold in the context of time series, \textit{e.g.}, multiple noisy samples of the same example. Here, we adapt DAEs to clean time-series data using noisy samples only. To recover the clean target signal when only given access to noisy target data, we leverage a noise-free auxiliary time-series signal that is related to the target signal. In addition to leveraging the relationship between the target signal and the auxiliary signal, we iteratively filter and learn from clean samples using an approach based on co-teaching. Applied to the task of recovering carbohydrate values for blood glucose management, our approach reduces noise (MSE) in patient-reported carbohydrates from 72$g^2$ (95\% CI: 54,93) to 18$g^2$ (13,25), outperforming the best baseline (MSE = 33$g^2$ (27,43)). We demonstrate strong time-series denoising performance, extending the applicability of DAEs to a previously under-explored setting.","De-noising, Co-teaching, Noise recovery, Time-series, self-supervised, RNN" Adversarial Representation Learning for Canonical Correlation Analysis,https://openreview.net/forum?id=skThRS3MA-0,https://openreview.net/pdf?id=skThRS3MA-0,A reformulation of CCA under the adversarial framework for efficient canonical representation learning.,"Canonical correlation analysis (CCA) provides a framework to map multimodal data into a maximally correlated latent space. The deep version of CCA has replaced linear maps with deep transformations to enable more flexible correlated data representations; however, this approach requires optimization over all samples for each iteration and scales poorly. Here, we present a deep, adversarial approach to CCA, adCCA, that can be efficiently solved by standard mini-batch training. We reformulate CCA under the constraint that the different modalities are embedded with identical latent distributions, derive a tractable deep CCA target, and use an adversarial framework to efficiently learn the canonical representations. A consequence of the new formulation is that adCCA learns maximally correlated representations across modalities while preserving structure within individual modalities. Further, adCCA removes the need for feature transformation and normalization and can be directly applied to diverse modalities and feature encodings. Numerical studies show that the performance of adCCA is robust to data transformations, binary encodings, and corruptions.
Together, adCCA provides a scalable approach to align data across modalities without compromising structure within each modality.","Representation Learning, Canonical Correlation Analysis, Adversarial Learning" BYPASSING THE STABILITY-PLASTICITY TRADEOFF TO REDUCE PREDICTIVE CHURN,https://openreview.net/forum?id=K-3Qq-CC78,https://openreview.net/pdf?id=K-3Qq-CC78,,"The impact of an ML model is largely a function of how much trust users have in its predictions. As more data is gathered over time, the model can be updated to take advantage of a larger sample size and have improved performance. Even when model updates improve aggregate metrics such as accuracy, this can lead to errors on samples the previous model got right, causing apparent regressions in performance known as predictive churn. Such prediction flips erode user trust, thereby reducing the effectiveness of the human-AI team as a whole. Current approaches for reducing predictive churn fall mainly into two categories: ensembles and distillation. While ensembles are the most effective, they come at the cost of having to train and use multiple models for inference. Distillation is much more efficient both in terms of training and inference, but is far less effective at reducing churn. We propose a missing middle-ground solution called StackMem based on accumulating models over time which achieves comparable performance to ensembles without any training time increases or changes to training procedures. Additionally, StackMem can be applied to models which are already deployed, unlike ensembles. We demonstrate the effectiveness of StackMem on several computer vision benchmark datasets, comparing against SOTA churn reduction methods.","Predictive churn, Stability, Distillation, Ensembles" Neural Implicit Manifold Learning for Topology-Aware Generative Modelling,https://openreview.net/forum?id=WA35e2vPlFT,https://openreview.net/pdf?id=WA35e2vPlFT,We propose a new model for probability distributions on topologically complex data manifolds which learns manifolds implicitly as the set of zeros of a neural network and then learns the distribution within using a constrained energy-based model.,"Natural data observed in $\mathbb{R}^n$ is often constrained to an $m$-dimensional manifold $\mathcal{M}$, where $m < n$. Current probabilistic models represent this manifold by mapping an $m$-dimensional latent variable through a neural network $f_\theta: \mathbb{R}^m \to \mathbb{R}^n$. Such procedures, which we call pushforward models, incur a straightforward limitation: manifolds cannot in general be represented with a single parameterization, meaning that attempts to do so will incur either computational instability or the inability to learn probability densities within the manifold. To remedy this problem, we propose to model $\mathcal{M}$ as a neural implicit manifold: the set of zeros of a neural network. To learn the data distribution within $\mathcal{M}$, we introduce constrained energy-based models, which use a constrained variant of Langevin dynamics to train and sample within a learned manifold. The resulting model can be manipulated with an arithmetic of manifolds, which allows practitioners to take unions and intersections of model manifolds.
In experiments on synthetic and natural data, we show that constrained EBMs can learn manifold-supported distributions with complex topologies more accurately than pushforward models.","Manifold Learning, Unsupervised Learning, Density Estimation, Topology, Differential Geometry, Generative Modelling" LT-SNN: Self-Adaptive Spiking Neural Network for Event-based Classification and Object Detection,https://openreview.net/forum?id=oyzMyylgINj,https://openreview.net/pdf?id=oyzMyylgINj,Learnable threshold based spiking neural network.,"Spiking neural networks (SNNs) have received increasing attention due to their high biological plausibility and energy efficiency. The binary spike-based information propagation enables efficient sparse computation in event-based computer vision applications. Prior works investigated direct SNN training algorithms to overcome the non-differentiability of spike generation. However, most of the existing works employ a fixed threshold value for the membrane potential throughout the entire training process, which limits the dynamics of SNNs towards further optimizing the performance. The adaptiveness of the membrane potential threshold and the mismatched mechanism between SNNs and the biological nervous system remain under-explored in prior works. In this work, we propose LT-SNN, a novel SNN training algorithm with a self-adaptive learnable potential threshold to improve SNN performance. LT-SNN optimizes the layer-wise threshold value throughout SNN training, imitating the self-adaptiveness of the biological nervous system. To stabilize SNN training even further, we propose the separate surrogate gradient path (SGP), a simple-yet-effective method that enables a smooth SNN learning process. We validate the proposed LT-SNN algorithm on multiple event-based datasets, including both image classification and object detection tasks. Equipped with high adaptiveness that fully captures the dynamics of SNNs, LT-SNN achieves state-of-the-art performance with compact models. The proposed LT-SNN-based classification network surpasses SoTA methods, achieving 2.71% higher accuracy with a 10.48× smaller model size. Additionally, our LT-SNN-YOLOv2 object detection model demonstrates a 0.11 mAP improvement over the SoTA SNN-based object detector.","Spiking neural networks, efficient neuromorphic computing, spatial-temporal adjustment, separate surrogate gradient path, output regularization, self-adaptive and learnable potential threshold." Characterizing neural representation of cognitively-inspired deep RL agents during an evidence accumulation task,https://openreview.net/forum?id=g05Epey82Ft,https://openreview.net/pdf?id=g05Epey82Ft,,"Evidence accumulation is thought to be fundamental for decision-making in humans and other mammals. It has been extensively studied in neuroscience and cognitive science with the goal of explaining how sensory information is sequentially sampled until sufficient evidence has accumulated to favor one decision over others. Neuroscience studies suggest that the hippocampus encodes a low-dimensional ordered representation of evidence through sequential neural activity. Cognitive modelers have proposed a mechanism by which such sequential activity could emerge through the modulation of recurrent weights with a change in the amount of evidence. This gives rise to neurons tuned to a specific magnitude of evidence, which resemble neurons recorded in the hippocampus.
Here we integrated a cognitive science model inside a deep Reinforcement Learning (RL) agent and trained the agent to perform a simple evidence accumulation task inspired by behavioral experiments on animals. We compared the agent's performance with that of agents equipped with GRUs and RNNs. We found that the agent based on a cognitive model was able to learn much faster and generalize better while having significantly fewer parameters. We also compared the emergent neural activity across agents and found that in some cases, GRU-based agents developed similar neural representations to agents based on a cognitive model. This study illustrates how integrating cognitive models and deep learning systems can lead to brain-like neural representations that can improve learning.", Epistemological Bias As a Means for the Automated Detection of Injustices in News Media,https://openreview.net/forum?id=PWKs1IpMpv,https://openreview.net/pdf?id=PWKs1IpMpv,"We leverage the combined use of a fine-tuned epistemological detection model, two stereotype detection models, and a lexicon-based approach to show that epistemological biases can assist with the automatic detection of injustices in text.","Injustice occurs when someone experiences unfair treatment or their rights are violated. In the context of news media, injustices represent a form of bias through which discriminatory narratives can arise and spread. The automated identification of injustice in text has received little attention, due in part to the fact that underlying stereotypes are rarely explicitly stated and that instances often occur unconsciously due to the pervasive nature of prejudice in society. Here, we leverage the combined use of a fine-tuned BERT-based bias detection model, two stereotype detection models, and a lexicon-based approach to show that epistemological biases (i.e., words which, through their use, presuppose, entail, assert, hedge, or boost text to erode or assert a person's capacity as a knower) can assist with the automatic detection of injustice in text.","testimonial injustice, character injustice, framing bias, epistemological bias, news media" Neural Constraint Inference: Inferring Energy Constraints in Interacting Systems,https://openreview.net/forum?id=oo-X0K4XAn,https://openreview.net/pdf?id=oo-X0K4XAn,"We propose an approach that discovers a set of relational constraints, represented as energy functions, which when optimized reconstruct a given original trajectory.","Systems consisting of interacting agents are prevalent in the world, ranging from dynamical systems in physics to complex biological networks. To build systems which can interact robustly in the real world, it is thus important to be able to infer the precise interactions governing such systems. Existing approaches typically discover such interactions by explicitly modeling the feedforward dynamics of the trajectories. In this work, we propose the Neural Constraint Inference (NCI) model as an alternative approach to discover such interactions: it discovers a set of relational constraints, represented as energy functions, which when optimized reconstruct the original trajectory. We illustrate how NCI can faithfully predict future trajectory dynamics, achieving more consistent long rollouts than existing approaches. We show that the constraints discovered by NCI are disentangled and may be intermixed with constraints from other trajectories.
Finally, we illustrate how those constraints enable the incorporation of external test-time constraints.","relational inference, energy-based models, energy constraints, trajectory prediction, graph neural networks" Self-supervised Continual Learning based on Batch-mode Novelty Detection,https://openreview.net/forum?id=b8f2YGWebo,https://openreview.net/pdf?id=b8f2YGWebo,A unified approach of continual learning and novelty detection. Each new out-of-distribution class is first detected and then merged into the previous knowledge.,"Continual learning (CL) plays a key role in dynamic systems in order to adapt to new tasks, while preserving previous knowledge. Most existing CL approaches focus on learning new knowledge in a supervised manner, while leaving the data gathering phase to the novelty detection (ND) algorithm. Such a presumption limits practical usage where new data needs to be quickly learned without being labeled. In this paper, we propose a unified approach of CL and ND, in which each new class of the out-of-distribution (OOD) data is first detected and then added to previous knowledge. Our method has three unique features: (1) a unified framework seamlessly tackling both ND and CL problems; (2) a self-supervised method for model adaptation, without the requirement of new data annotation; (3) batch-mode data feeding that maximizes the separation of new knowledge vs. previous learning, which in turn enables high accuracy in continual learning. By learning one class at each step, the new method achieves robust continual learning and consistently outperforms state-of-the-art CL methods in the single-head evaluation on the MNIST, CIFAR-10, CIFAR-100, and TinyImageNet datasets.","Continual Learning, Gradients-based, Mahalanobis Distance, Novelty Detection, out-of-distribution, self-supervised" Stable Optimization of Gaussian Likelihoods,https://openreview.net/forum?id=hmuLHC5MrG,https://openreview.net/pdf?id=hmuLHC5MrG,"We analyze the instability of Gaussian likelihood optimization and propose a gradient-based optimizer demonstrating less volatile optimization especially for contextual, multivariate target distributions with full covariances.","Uncertainty-aware modeling has emerged as a key component in modern machine learning frameworks. The de-facto standard approach adopts heteroscedastic Gaussian distributions and minimizes the negative log-likelihood (NLL) under observed data. However, optimizing this objective turns out to be surprisingly intricate, and the current state-of-the-art reports several instabilities. This work breaks down the optimization problem, initially focusing on non-contextual settings where convergence can be analyzed analytically. We show that (1) in this learning scheme, the eigenvalues of the predictive covariance define stability in learning, and (2) the coupling of gradients and predictions builds up errors in both mean and covariance if either is poorly approximated. Building on these insights, we propose Trustable, a novel optimizer that overcomes instabilities methodically by combining systematic update restrictions in the form of trust regions with structured, tractable natural gradients. We demonstrate in several challenging experiments that Trustable outperforms current optimizers in regression with neural networks in terms of the NLL, MSE, and further performance metrics.
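The continual-learning entry above lists Mahalanobis distance among its keywords; a minimal sketch of batch-mode novelty scoring with class-conditional Gaussians in an embedding space follows. The shared-covariance model and the min-over-classes batch score are standard choices assumed here for illustration, not necessarily the paper's exact detection rule.

```python
import numpy as np

def fit_class_stats(feats, labels):
    """Per-class means and a shared inverse covariance over features."""
    classes = np.unique(labels)
    means = {c: feats[labels == c].mean(axis=0) for c in classes}
    centered = np.concatenate([feats[labels == c] - means[c] for c in classes])
    cov = np.cov(centered, rowvar=False) + 1e-6 * np.eye(feats.shape[1])
    return means, np.linalg.inv(cov)

def batch_novelty_score(batch_feats, means, cov_inv):
    """Average (over the batch) of the min Mahalanobis distance to any
    known class; a large score suggests a new out-of-distribution class."""
    dists = [
        min(float((x - m) @ cov_inv @ (x - m)) for m in means.values())
        for x in batch_feats
    ]
    return float(np.mean(dists))
```

Scoring whole batches rather than single samples matches the entry's point that batch-mode feeding sharpens the separation between new and previously learned classes.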
Unlike other optimizers, Trustable yields an improved and more stable fit and can also be applied to multivariate outputs with full covariance matrices.","Gaussian, Likelihood, Optimization, Uncertainty, Stabilization" Break the Wall Between Homophily and Heterophily for Graph Representation Learning,https://openreview.net/forum?id=BAakXAV6Cf,https://openreview.net/pdf?id=BAakXAV6Cf,This work proposes a new GNN model called OGNN (Omnipotent Graph Neural Network) that extracts different aspects of graph representations to generalize well on the whole spectrum of homophily.,"Homophily and heterophily are intrinsic properties of graphs that describe whether two linked nodes share similar properties. Although many Graph Neural Network (GNN) models have been proposed, it remains unclear how to design a model so that it can generalize well to the whole spectrum of homophily. This work addresses the challenge by identifying three graph features, including the ego node feature, the aggregated node feature, and the graph structure feature, that are essential for graph representation learning. It further proposes a new GNN model called OGNN (Omnipotent Graph Neural Network) that extracts all three graph features and adaptively fuses them to achieve generalizability across the whole spectrum of homophily. Extensive experiments on both synthetic and real datasets demonstrate the superiority (average rank 1.56) of our OGNN compared with state-of-the-art methods. Our code will be available at https://*.","Graph Neural Networks, Graph Homophily, Graph Heterophily" Representing Latent Dimensions Using Compressed Number Lines,https://openreview.net/forum?id=KJ8iuccbPB,https://openreview.net/pdf?id=KJ8iuccbPB,,"Humans use log-compressed number lines to represent different quantities, including elapsed time, traveled distance, numerosity, sound frequency, etc. Inspired by recent cognitive science and computational neuroscience work, we developed a neural network that learns to construct log-compressed number lines. The network computes a discrete approximation of a real-domain Laplace transform using an RNN with analytically derived weights, giving rise to a log-compressed timeline of the past. The network learns to extract latent variables from the input and uses them for global modulation of the recurrent weights, turning a timeline into a number line over relevant dimensions. The number line representation greatly simplifies learning on a set of problems that require learning associations in different spaces - problems that humans can typically solve easily. This approach illustrates how combining deep learning with cognitive models can result in systems that learn to represent latent variables in a brain-like manner and exhibit human-like behavior manifested through the Weber-Fechner law.", Efficient Sequence Packing without Cross-contamination: Accelerating Large Language Models without Impacting Performance,https://openreview.net/forum?id=ZAzSf9pzCm,https://openreview.net/pdf?id=ZAzSf9pzCm,"Speed up BERT phase 2 pretraining by 2x (and other models, too) by avoiding padding without impacting accuracy in contrast to existing approaches.","Effective training of today's large language models (LLMs) depends on large batches and long sequences for throughput and accuracy. To handle variable-length sequences on hardware accelerators, it is common practice to introduce padding tokens, so that all sequences in a batch have the same length.
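For the Gaussian-likelihood entry above, the de-facto objective whose instabilities it analyzes is the heteroscedastic negative log-likelihood; a univariate sketch follows (the proposed Trustable optimizer, with trust regions and natural gradients, is not reproduced here).

```python
import torch

def gaussian_nll(mean, log_var, y):
    """Heteroscedastic Gaussian NLL (univariate, up to a constant).
    Note how the gradients of the mean and variance terms are coupled:
    a poor mean estimate inflates the variance gradient and vice versa,
    one source of the instabilities discussed in the entry above."""
    return 0.5 * (log_var + (y - mean) ** 2 / log_var.exp()).mean()
```

PyTorch ships an equivalent built-in, `torch.nn.GaussianNLLLoss`, which takes the variance rather than its logarithm.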
We show in this paper that the variation in sequence lengths in common NLP datasets is such that up to 50% of all tokens can be padding. In less common, but not extreme, cases (e.g. GLUE-COLA with sequence length 128), the ratio is up to 89%. Existing methods to address the resulting inefficiency are complicated by the need to avoid ""cross-contamination"" in self-attention, by a reduction in accuracy when sequence ordering information is lost, or by customized kernel implementations only valid for specific accelerators. This paper introduces a new formalization of sequence packing in the context of the well-studied bin packing problem, and presents new algorithms based on this formulation which, for example, confer a 2x speedup for phase 2 pretraining in BERT while preserving downstream performance. We show how existing models can be adapted to ensure mathematical equivalence between the original and packed models, meaning that packed models can be trained with existing pre-training and fine-tuning practices.","deep learning, BERT, IPU, GPU, hardware-acceleration, padding, Wikipedia, NLP, bin-packing" Cortically motivated recurrence enables task extrapolation,https://openreview.net/forum?id=3mji6eUxzY,https://openreview.net/pdf?id=3mji6eUxzY,Biologically inspired recurrent network solves (easy and) hard instances of a task with (less and) more iterations.,"Feedforward deep neural networks have become the standard class of models in the field of computer vision. Yet, they possess a striking difference relative to their biological counterparts, which predominantly perform “recurrent” computations. Why do biological neurons evolve to employ recurrence pervasively? In this paper, we show that a recurrent network is able to flexibly adapt its computational budget during inference and generalize within-task across difficulties. In this study, we also contribute a recurrent module we call LocRNN, designed based on a prior computational model of local recurrent intracortical connections in primates to support such dynamic task extrapolation. LocRNN learns highly accurate solutions to the challenging visual reasoning problems of Mazes and PathFinder that we use here. More importantly, it is able to flexibly use fewer or more recurrent iterations during inference to zero-shot generalize to less and more difficult instantiations of each task without requiring extra training data, a potential functional advantage of recurrence that biological visual systems capitalize on. Feedforward networks, on the other hand, with their fixed computational graphs, only partially exhibit this trend, potentially owing to image-level similarities across difficulties. We also posit an intriguing tradeoff between recurrent networks’ representational capacity and their stability in the recurrent state space.
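A minimal sketch of the bin-packing view of sequence packing from the entry above, using first-fit decreasing over sequence lengths. The paper's actual algorithms, and the attention masking needed to avoid cross-contamination between packed sequences, are more involved; this shows only the packing step itself.

```python
def pack_sequences(lengths, max_len=512):
    """First-fit decreasing: place each sequence into the first pack
    with enough remaining room, opening a new pack when none fits.
    Returns a list of packs, each a list of sequence indices."""
    packs, room = [], []
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        for p, r in enumerate(room):
            if lengths[idx] <= r:
                packs[p].append(idx)
                room[p] -= lengths[idx]
                break
        else:  # no existing pack has room
            packs.append([idx])
            room.append(max_len - lengths[idx])
    return packs

# Padding ratio with packing:    1 - sum(lengths) / (len(packs)   * max_len)
# Padding ratio without packing: 1 - sum(lengths) / (len(lengths) * max_len)
```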
Our work encourages further study of the role of recurrence in deep learning models – especially in the context of out-of-distribution generalization & task extrapolation – and their properties of task performance and stability.","cognitive science, recurrent neural networks, task extrapolation, out of distribution generalization, visual routines, path integration" Learning Object-Centric Dynamic Modes from Video and Emerging Properties,https://openreview.net/forum?id=TAtAJFo35lc,https://openreview.net/pdf?id=TAtAJFo35lc,"We propose a model for dynamics interpretability and manipulation by means of object-centric dynamic mode decomposition, directly from pixels.","One of the long-term objectives of Artificial Intelligence is to endow machines with the capacity of structuring and interpreting the world as we do. Towards this goal, recent methods have successfully decomposed and disentangled video sequences into their composing objects, attributes and dynamics, in a self-supervised fashion. However, there have been scarce efforts to propose useful decompositions of the dynamics in a scene. We propose a method to decompose a video into moving objects, their attributes and the dynamic modes of their trajectories. We model the objects' dynamics with linear system identification tools, by means of a Koopman mapping and the Koopman operator $\mathcal{K}$. This allows user access and interpretation of the dynamics in the scene. We test our framework on a variety of datasets, while illustrating the novel features that emerge from our dynamic modes decomposition: temporal super-resolution, backwards forecasting, model reduction and video dynamics interpretation and manipulation at test-time. We successfully forecast challenging object trajectories from pixels, achieving competitive performance while drawing useful insights.","Koopman theory, dynamics, video representation learning, dynamic mode decomposition, video manipulation, object-centric decomposition" Code Means More Than Plain Language: Bringing Syntax Structure Awareness To Algorithmic Problem Solution Generation,https://openreview.net/forum?id=RePt5K6wPux,https://openreview.net/pdf?id=RePt5K6wPux,The first work to introduce syntax tree structure in programming synthesis,"Program Synthesis (PS) is the task of building computer programs that satisfy problem specifications. Large-scale pre-trained language models treat PS as a sequence prediction task, an approach that has recently gained popularity. However, these methods heavily rely on conventional Natural Language Processing (NLP) tokenizers, which overlook the rich structural/syntax information in the code. In this work, we posit that syntax structures help generate syntax error-free and algorithmically correct programs. If the program syntax structures can be integrated into the tokenizer, the program representation space could be significantly simplified. To this end, we propose a new end-to-end framework named ASTer, coupled with our novel syntax-aware tokenization design toolkit. More specifically, our tokenizer encodes and decodes the program by its syntax roles and contents, not by what is superficially shown in the strings. The ASTer encompasses a novel sample-wise and token-wise attention mechanism, and leverages the benefits of training with syntactically aligned samples from our tokenization toolkit.
Extensive evaluations show superior performance against state-of-the-art methods, which confirms that bringing syntax knowledge into the language model can help better capture the data structure and simplify the search space. All of our code will be publicly available upon acceptance. ","program synthesis, transformer, syntax structure" Is Adversarial Training Really a Silver Bullet for Mitigating Data Poisoning?,https://openreview.net/forum?id=zKvm1ETDOq,https://openreview.net/pdf?id=zKvm1ETDOq,"We propose an indiscriminative feature-based poisoning approach to substantially degrade adversarial training, which was previously considered to be impossible.","Indiscriminate data poisoning can decrease the clean test accuracy of a deep learning model by slightly perturbing its training samples. There is a consensus that such poisons can hardly harm adversarially-trained (AT) models when the adversarial training budget is no less than the poison budget, i.e., $\epsilon_\mathrm{adv}\geq\epsilon_\mathrm{poi}$. This consensus, however, is challenged in this paper based on our new attack strategy that induces \textit{indiscriminative features} (INF). The existence of indiscriminative features makes the poisoned data less useful for training a model, whether or not AT is applied. In contrast, existing methods are limited to using perturbations as \textit{shortcuts}, which just override the actual image content during model training. We demonstrate that for attacking a CIFAR-10 AT model under a reasonable setting with $\epsilon_\mathrm{adv}=\epsilon_\mathrm{poi}=8/255$, our INF yields an accuracy drop of 13.31%, which is $7\times$ better than existing methods and equal to discarding 83% of the training data. We further show the generalizability of INF to more challenging settings, e.g., higher AT budgets, partial poisoning, unseen model architectures, and stronger (ensemble or adaptive) defenses. We finally provide new insights into the distinct roles of non-robust vs. robust features in poisoning standard vs. AT models and confirm the effectiveness of INF in poisoning standard models.","Data poisoning, adversarial training, indiscriminative features, adaptive defenses, robust vs. non-robust features" Offline Congestion Games: How Feedback Type Affects Data Coverage Requirement,https://openreview.net/forum?id=PXVGer7hmJ,https://openreview.net/pdf?id=PXVGer7hmJ,,"This paper investigates when one can efficiently recover an approximate Nash Equilibrium (NE) in offline congestion games. The existing dataset coverage assumption in offline general-sum games inevitably incurs a dependency on the number of actions, which can be exponentially large in congestion games. We consider three different types of feedback with decreasing revealed information. Starting from the facility-level (a.k.a., semi-bandit) feedback, we propose a novel one-unit deviation coverage condition and present a pessimism-type algorithm that can recover an approximate NE. For the agent-level (a.k.a., bandit) feedback setting, interestingly, we show the one-unit deviation coverage condition is not sufficient. On the other hand, we convert the game to multi-agent linear bandits and show that with a generalized data coverage assumption in offline linear bandits, we can efficiently recover the approximate NE. Lastly, we consider a novel type of feedback, the game-level feedback, where only the total reward from all agents is revealed.
Again, we show the coverage assumption for the agent-level feedback setting is insufficient in the game-level feedback setting, and with a stronger version of the data coverage assumption for linear bandits, we can recover an approximate NE. Together, our results constitute the first study of offline congestion games and imply formal separations between different types of feedback.", Learning with Stochastic Orders,https://openreview.net/forum?id=P3PJokAqGW,https://openreview.net/pdf?id=P3PJokAqGW,"We propose and study discrepancies and distances between probability measures that arise from the convex or Choquet order, which capture dominance constraints and are useful in applications like image generation.","Learning high-dimensional distributions is often done with explicit likelihood modeling or implicit modeling via minimizing integral probability metrics (IPMs). In this paper, we expand this learning paradigm to stochastic orders, namely, the \emph{convex} or \emph{Choquet order} between probability measures. Towards this end, exploiting the relation between convex orders and optimal transport, we introduce the Choquet-Toland distance between probability measures, which can be used as a drop-in replacement for IPMs. We also introduce the \emph{Variational Dominance Criterion} (VDC) to learn probability measures with dominance constraints, which encode the desired stochastic order between the learned measure and a known baseline. We analyze both quantities and show that they suffer from the curse of dimensionality and propose surrogates via input convex maxout networks (ICMNs) that enjoy parametric rates. We provide a min-max framework for learning with stochastic orders and validate it experimentally on synthetic and high-dimensional image generation, with promising results. Finally, our ICMNs class of convex functions and its derived Rademacher complexity are of independent interest beyond their application in convex orders.","optimal transport, stochastic order, Choquet order, convex function, input convex neural network, integral probability metric, image generation, statistical rates" A Deep Reinforcement Learning Approach for Finding Non-Exploitable Strategies in Two-Player Atari Games,https://openreview.net/forum?id=katAmwuUGc,https://openreview.net/pdf?id=katAmwuUGc,,"This paper proposes novel, end-to-end deep reinforcement learning algorithms for learning two-player zero-sum Markov games. Different from prior efforts on training agents to beat a fixed set of opponents, our objective is to find Nash equilibrium policies that are free from exploitation even by adversarial opponents. We propose (1) the Nash DQN algorithm, which integrates DQN with a Nash finding subroutine for the joint value functions; and (2) the Nash DQN Exploiter algorithm, which additionally adopts an exploiter for guiding the agent's exploration. Our algorithms are practical variants of theoretical algorithms that are guaranteed to converge to Nash equilibria in the basic tabular setting.
Experimental evaluation on both tabular examples and two-player Atari games demonstrates the robustness of the proposed algorithms against adversarial opponents, as well as their advantageous performance over existing methods.", MEDFAIR: BENCHMARKING FAIRNESS FOR MEDICAL IMAGING,https://openreview.net/forum?id=6ve2CkeQe5S,https://openreview.net/pdf?id=6ve2CkeQe5S,We develop a fairness benchmark for medical imaging and find that state-of-the-art bias mitigation algorithms do not significantly outperform ERM.,"A multitude of work has shown that machine learning-based medical diagnosis systems can be biased against certain subgroups of people. This has motivated a growing number of bias mitigation algorithms that aim to address fairness issues in machine learning. However, it is difficult to compare their effectiveness in medical imaging for two reasons. First, there is little consensus on the criteria to assess fairness. Second, existing bias mitigation algorithms are developed under different settings, e.g., datasets, model selection strategies, backbones, and fairness metrics, making a direct comparison and evaluation based on existing results impossible. In this work, we introduce MEDFAIR, a framework to benchmark the fairness of machine learning models for medical imaging. MEDFAIR covers eleven algorithms from various categories, nine datasets from different imaging modalities, and three model selection criteria. Through extensive experiments, we find that the under-studied issue of model selection criteria can have a significant impact on fairness outcomes; while in contrast, state-of-the-art bias mitigation algorithms do not significantly improve fairness outcomes over empirical risk minimization (ERM) in both in-distribution and out-of-distribution settings. We evaluate fairness from various perspectives and make recommendations for different medical application scenarios that require different ethical principles. Our framework provides a reproducible and easy-to-use entry point for the development and evaluation of future bias mitigation algorithms in deep learning.","Fairness, Bias Mitigation, Medical Imaging, Benchmark" Does Decentralized Learning with Non-IID Unlabeled Data Benefit from Self Supervision?,https://openreview.net/forum?id=2L9gzS80tA4,https://openreview.net/pdf?id=2L9gzS80tA4,"We study decentralized learning with non-IID unlabeled data, and try to understand the robustness and communication efficiency of decentralized self-supervised learning, through extensive experiments and theoretical analysis.","The success of machine learning relies heavily on massive amounts of data, which are usually generated and stored across a range of diverse and distributed data sources. Decentralized learning has thus been advocated and widely deployed to make efficient use of the distributed datasets, with an extensive focus on supervised learning (SL) problems. Unfortunately, the majority of real-world data are unlabeled and can be highly heterogeneous across sources. In this work, we carefully study decentralized learning with unlabeled data through the lens of self-supervised learning (SSL), specifically contrastive visual representation learning. We study the effectiveness of a range of contrastive learning algorithms under the decentralized learning setting, on relatively large-scale datasets including ImageNet-100, MS-COCO, and a new real-world robotic warehouse dataset.
Our experiments show that the decentralized SSL (Dec-SSL) approach is robust to the heterogeneity of decentralized datasets, and learns useful representations for object classification, detection, and segmentation tasks, even when combined with the simple and standard decentralized learning algorithm of Federated Averaging (FedAvg). This robustness makes it possible to significantly reduce communication and to reduce the participation ratio of data sources with only minimal drops in performance. Interestingly, using the same amount of data, the representation learned by Dec-SSL can not only perform on par with that learned by centralized SSL, which requires communication and excessive data storage costs, but also sometimes outperform representations extracted from decentralized SL, which requires extra knowledge about the data labels. Finally, we provide theoretical insights into why data heterogeneity is less of a concern for Dec-SSL objectives, and introduce feature alignment and clustering techniques to develop a new Dec-SSL algorithm that further improves the performance, in the face of highly non-IID data. Our study presents positive evidence to embrace unlabeled data in decentralized learning, and we hope to provide new insights into whether and why decentralized SSL is effective and/or even advantageous.","Decentralized Learning, Heterogeneous and Unlabeled Data, Federated Learning, Self-Supervised Learning, Representation Learning" Polarity is all you need to learn and transfer faster,https://openreview.net/forum?id=oFoRPrl9CYX,https://openreview.net/pdf?id=oFoRPrl9CYX,Transfer and fix weight polarities to learn faster with less data,"Natural intelligences (NIs) thrive in a dynamic world – they learn quickly, sometimes with only a few samples. In contrast, Artificial intelligence (AI) has achieved supra-human level performance in certain AI settings, typically depending on a prohibitive number of training samples and computational power. What design principle difference between NI and AI could contribute to such a discrepancy? Here, we propose a research avenue based on a simple observation from NIs: post-development, neuronal connections in the brain rarely switch polarity. Why? Our answer is: to learn and transfer more efficiently. We demonstrate with theory and simulations that if weight polarities are adequately set $\textit{a priori}$, then networks learn with less time and data. We extend these findings to image classification tasks and demonstrate that polarity, not weight, is a more effective medium for knowledge transfer between networks. We also explicitly illustrate situations in which $\textit{a priori}$ setting of the weight polarities is disadvantageous for networks. Our work illustrates the value of weight polarities from the perspective of statistical and computational efficiency for both NI and AI.","Weight Polarity, Learning Efficiency, Transfer Learning, Bio-inspired AI" On the Geometry of Reinforcement Learning in Continuous State and Action Spaces,https://openreview.net/forum?id=jIu4hk04776,https://openreview.net/pdf?id=jIu4hk04776,We prove that the state space is a low dimensional manifold and show that DDPG can effectively learn in this low dimensional space,"Advances in reinforcement learning have led to its successful application in complex tasks with continuous state and action spaces. Despite these advances in practice, most theoretical work pertains to finite state and action spaces.
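For the weight-polarity entry above, one simple way to realize "transfer and fix weight polarities" is to reparameterize each weight as a frozen sign times a learned nonnegative magnitude, as in the hedged PyTorch sketch below; the paper's exact construction may differ, and `PolarityFixedLinear` is a name invented here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolarityFixedLinear(nn.Module):
    """Linear layer whose weight polarities are fixed a priori
    (e.g., copied from a donor network); only magnitudes are learned."""
    def __init__(self, sign, bias=True):
        super().__init__()
        self.register_buffer("sign", sign.float())   # +1/-1 pattern, frozen
        self.log_mag = nn.Parameter(torch.zeros_like(self.sign))
        self.bias = nn.Parameter(torch.zeros(sign.shape[0])) if bias else None

    def forward(self, x):
        w = self.sign * self.log_mag.exp()            # polarity can never flip
        return F.linear(x, w, self.bias)

# Hypothetical usage: transfer polarities, not weights, from a trained net.
# layer = PolarityFixedLinear(donor_linear.weight.data.sign())
```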
We propose building a theoretical understanding of continuous state and action spaces by employing a geometric lens. Central to our work is the idea that the transition dynamics induce a low dimensional manifold of reachable states embedded in the high-dimensional nominal state space. We prove that, under certain conditions, the dimensionality of this manifold is at most the dimensionality of the action space plus one. This is the first result of its kind, linking the geometry of the state space to the dimensionality of the action space. We empirically corroborate this upper bound for four MuJoCo environments. We further demonstrate the applicability of our result by learning a policy in this low dimensional representation. To do so, we introduce an algorithm that learns a mapping to a low dimensional representation, as a narrow hidden layer of a deep neural network, in tandem with the policy using DDPG. Our experiments show that a policy learnt this way performs on par or better on four MuJoCo control suite tasks.","geometry, deep reinforcement learning, manifold" Deep Invertible Approximation of Topologically Rich Maps between Manifolds,https://openreview.net/forum?id=p-cx6fK0rW9,https://openreview.net/pdf?id=p-cx6fK0rW9,,"How can we design neural networks that allow for stable universal approximation of maps between topologically interesting manifolds? The answer is with a coordinate projection. Neural networks based on topological data analysis (TDA) use tools such as persistent homology to learn topological signatures of data and stabilize training but may not be universal approximators or have stable inverses. Other architectures universally approximate data distributions on submanifolds but only when the latter are given by a single chart, making them unable to learn maps that change topology. By exploiting the topological parallels between locally bilipschitz maps, covering spaces, and local homeomorphisms, and by using universal approximation arguments from machine learning, we find that a novel network of the form $\mathcal{T} \circ p \circ \mathcal{E}$, where $\mathcal{E}$ is an injective network, $p$ a fixed coordinate projection, and $\mathcal{T}$ a bijective network, is a universal approximator of local diffeomorphisms between compact smooth submanifolds embedded in $\mathbb{R}^n$. We emphasize the case when the target map changes topology. Further, we find that by constraining the projection $p$, multivalued inversions of our networks can be computed without sacrificing universality. As an application, we show that learning a group invariant function with unknown group action naturally reduces to the question of learning local diffeomorphisms for finite groups. Our theory permits us to recover orbits of the group action. We also outline possible extensions of our architecture to address molecular imaging of molecules with symmetries. Finally, our analysis informs the choice of topologically expressive starting spaces in generative problems. ","Manifold Learning, Universality, Inversion, Topology" Malign Overfitting: Interpolation and Invariance are Fundamentally at Odds,https://openreview.net/forum?id=dQNL7Zsta3,https://openreview.net/pdf?id=dQNL7Zsta3,Proof that interpolating classifiers cannot satisfy common invariance and fairness criteria; Provides insight on empirical observations and possible effective solutions,"Learned classifiers should often possess certain invariance properties meant to encourage fairness, robustness, or out-of-distribution generalization.
However, multiple recent works empirically demonstrate that common invariance-inducing regularizers are ineffective in the over-parameterized regime, in which classifiers perfectly fit (i.e. interpolate) the training data. This suggests that the phenomenon of ``benign overfitting'', in which models generalize well despite interpolating, might not favorably extend to settings in which robustness or fairness is desirable. In this work we provide a theoretical justification for these observations. We prove that, even in the simplest of settings, any interpolating classifier (with nonzero margin) will not satisfy these invariance properties. We then propose and analyze an algorithm that, in the same setting, successfully learns a non-interpolating classifier that is provably invariant. We validate our theoretical observations regarding the conflict between interpolation and invariance on simulated data and the Waterbirds dataset.","Invariance, Overparameterization, Fairness, Robustness, Benign Overfitting" Exploring and Exploiting Decision Boundary Dynamics for Adversarial Robustness,https://openreview.net/forum?id=aRTKuscKByJ,https://openreview.net/pdf?id=aRTKuscKByJ,,"The robustness of a deep classifier can be characterized by its margins: the decision boundary's distances to natural data points. However, it is unclear whether existing robust training methods effectively increase the margin for each vulnerable point during training. To understand this, we propose a continuous-time framework for quantifying the relative speed of the decision boundary with respect to each individual point. Through visualizing the moving speed of the decision boundary under Adversarial Training, one of the most effective robust training algorithms, a surprising moving behavior is revealed: the decision boundary moves away from some vulnerable points but simultaneously moves closer to others, decreasing their margins. To alleviate these conflicting dynamics of the decision boundary, we propose Dynamics-aware Robust Training (DyART), which encourages the decision boundary to engage in movement that prioritizes increasing smaller margins. In contrast to prior works, DyART directly operates on the margins rather than their indirect approximations, allowing for more targeted and effective robustness improvement. Experiments on the CIFAR-10 and Tiny-ImageNet datasets verify that DyART alleviates the conflicting dynamics of the decision boundary and obtains improved robustness under various perturbation sizes compared to state-of-the-art defenses.","Adversarial Robustness, margin maximization, dynamical system" Countering the Attack-Defense Complexity Gap for Robust Classifiers,https://openreview.net/forum?id=FDlfFbnI7AR,https://openreview.net/pdf?id=FDlfFbnI7AR,We provide a formal rationale for why attacks are more efficient than defenses and introduce a new defensive technique that sidesteps this asymmetry.,"We consider the decision version of defending and attacking Machine Learning classifiers. We provide a rationale for the well-known difficulties in building robust models: in particular, we prove that, under broad assumptions, attacking a polynomial-time classifier is $NP$-complete, while training a polynomial-time model that is robust on even a single input is $\Sigma_2^P$-complete. We also provide more general bounds for non-polynomial classifiers.
We then show how such a complexity gap can be sidestepped by introducing Counter-Attack (CA), a system that computes on-the-fly robustness certificates for a given input up to an arbitrary distance bound $\varepsilon$. We also prove that, even when attacked with perturbations of magnitude $\varepsilon^\prime > \varepsilon$, CA still provides computational robustness: specifically, while computing a certificate is $NP$-complete, attacking the system beyond its intended robustness is $\Sigma_2^P$-complete. Since the exact form of CA can still be computationally expensive, we introduce a relaxation of this method, which we empirically show to be reliable at identifying non-robust inputs. As part of our work, we introduce UG100, a new dataset obtained by applying a provably optimal attack to six limited-scale networks (three for MNIST and three for CIFAR10), each trained in three different ways.","adversarial attacks, adversarial robustness, computational complexity, dataset" Evaluating Counterfactual Explainers,https://openreview.net/forum?id=iAPs7yMjjyQ,https://openreview.net/pdf?id=iAPs7yMjjyQ,,"Explainability methods have been widely used to provide insight into the decisions made by statistical models, thus facilitating their adoption in various domains within the industry. Counterfactual explanation methods aim to improve our understanding of a model by perturbing samples in a way that would alter its response in an unexpected manner. This information is helpful for users and for machine learning practitioners to understand and improve their models. Given the value provided by counterfactual explanations, there is a growing interest in the research community to investigate and propose new methods. However, we identify two issues that could hinder the progress in this field. (1) Existing metrics do not accurately reflect the value of an explainability method for users. (2) Comparisons between methods are usually performed with datasets like CelebA, where images are annotated with attributes that do not fully describe them and with subjective attributes such as ``Attractive''. In this work, we address these problems by proposing an evaluation method with a principled metric to evaluate and compare different counterfactual explanation methods. The evaluation method is based on a synthetic dataset where images are fully described by their annotated attributes. As a result, we are able to perform a fair comparison of multiple explainability methods in the recent literature, obtaining insights about their performance. We make the code public for the benefit of the research community.","Explainability, Counterfactuals, XAI" SMART: Sentences as Basic Units for Text Evaluation,https://openreview.net/forum?id=OIe3kpwl40D,https://openreview.net/pdf?id=OIe3kpwl40D,,"Widely used evaluation metrics for text generation either do not work well with longer texts or fail to evaluate all aspects of text quality. In this paper, we introduce a new metric called SMART to mitigate such limitations. Specifically, we treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences. Candidate sentences are also compared to sentences in the source documents to allow grounding (e.g., factuality) evaluation.
Our results show that system-level correlations of our proposed metric with a model-based matching function outperform all competing metrics on the SummEval summarization meta-evaluation dataset, while the same metric with a string-based matching function is competitive with current model-based metrics. The latter does not use any neural model, which is useful during model development phases where resources can be limited and fast evaluation is required. SMART also outperforms all factuality evaluation metrics on the TRUE benchmark. Finally, we also conducted extensive analyses showing that our proposed metrics work well with longer summaries and are less biased towards specific models.","summarization, evaluation" A Reinforcement Learning Approach to Estimating Long-term Treatment Effects,https://openreview.net/forum?id=tKMLGb7MWC,https://openreview.net/pdf?id=tKMLGb7MWC,We propose a reinforcement learning approach to estimating the long-term treatment effect from short-term data.,"Randomized experiments (a.k.a. A/B tests) are a powerful tool for estimating treatment effects, to inform decision making in business, healthcare and other applications. In many problems, the treatment has a lasting effect that evolves over time. A limitation of randomized experiments is that they do not easily extend to measure long-term effects, since running long experiments is time-consuming and expensive. In this paper, we take a reinforcement learning (RL) approach that estimates the average reward in a Markov process. Motivated by real-world scenarios where the observed state transition is nonstationary, we develop a new algorithm for a class of nonstationary problems, and demonstrate promising results in two synthetic datasets and one online store dataset.","reinforcement learning, off-policy evaluation, A/B testing" Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier,https://openreview.net/forum?id=OpC-9aBBVJe,https://openreview.net/pdf?id=OpC-9aBBVJe,The combination of a large number of updates and resets drastically improves the sample efficiency of deep RL algorithms.,"Increasing the replay ratio, the number of updates of an agent's parameters per environment interaction, is an appealing strategy for improving the sample efficiency of deep reinforcement learning algorithms. In this work, we show that fully or partially resetting the parameters of deep reinforcement learning agents causes better replay ratio scaling capabilities to emerge. We push the limits of the sample efficiency of carefully-modified algorithms by training them using an order of magnitude more updates than usual, significantly improving their performance in the Atari 100k and DeepMind Control Suite benchmarks. We then provide an analysis of the design choices required for favorable replay ratio scaling to be possible and discuss inherent limits and tradeoffs.","reinforcement learning, sample efficiency, resets" Explaining Image Classification through Knowledge-aware Neuron Interpretation,https://openreview.net/forum?id=VEqj2fNC2Fw,https://openreview.net/pdf?id=VEqj2fNC2Fw,,"Although neural networks have achieved remarkable results, they still face skepticism due to their lack of transparency. To this end, neural network prediction explanation is attracting more and more attention. State-of-the-art methods, however, rarely introduce human-understandable external knowledge, making the explanations hard for humans to interpret.
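The replay-ratio entry above combines two ingredients that fit in a few lines of schematic Python: many gradient updates per environment step, plus periodic parameter resets that keep the replay buffer. The `agent`/`env` interfaces and all constants below are placeholders for illustration, not the paper's API.

```python
def train(agent, env, steps=100_000, replay_ratio=16, reset_every=40_000):
    """Schematic high-replay-ratio loop with periodic resets."""
    buffer, obs = [], env.reset()
    for t in range(steps):
        action = agent.act(obs)
        next_obs, reward, done, info = env.step(action)
        buffer.append((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs
        for _ in range(replay_ratio):       # many updates per interaction
            agent.update(buffer)
        if (t + 1) % reset_every == 0:
            agent.reset_parameters()        # full or partial re-initialization;
                                            # the buffer is kept, so the agent
                                            # relearns quickly and the high
                                            # replay ratio keeps paying off
    return agent
```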
In this paper, we propose a knowledge-aware framework to explain neural network predictions for image scene classification. We introduce two notions of core concepts, with the help of knowledge graphs, to measure the association of concepts with respect to image scenes, and analyze solutions for prediction explanation and model manipulation. In our experiments on two popular scene classification datasets, ADE20k and Opensurfaces, the proposed solutions produce better results than baseline and state-of-the-art methods, e.g., our method produces over a 25% IoU gain on compositional explanation of neuron behaviors. In addition, our core concepts and related explanation metrics can help effectively manipulate the model prediction, further leading to a new training method with a 26.7% performance improvement.", Tier Balancing: Towards Dynamic Fairness over Underlying Causal Factors,https://openreview.net/forum?id=SZdfz5k7cd1,https://openreview.net/pdf?id=SZdfz5k7cd1,We formulate and investigate a long-term fairness notion that captures decision-distribution interplay via a detailed modeling over both observed and latent causal factors.,"The pursuit of long-term fairness involves the interplay between decision-making and the underlying data generating process. In this paper, through causal modeling with a directed acyclic graph (DAG) on the decision-distribution interplay, we investigate the possibility of achieving long-term fairness from a dynamic perspective. We propose \emph{Tier Balancing}, a technically more challenging but more natural notion to achieve in the context of long-term, dynamic fairness analysis. Different from previous fairness notions that are defined purely on observed variables, our notion goes one step further, capturing behind-the-scenes situation changes in the unobserved latent causal factors that directly carry the influence from the current decision to the future data distribution. Under the specified dynamics, we prove that in general one cannot achieve the long-term fairness goal only through one-step interventions. Furthermore, in the effort to approach long-term fairness, we consider the mission of ""getting closer to"" the long-term fairness goal and present possibility and impossibility results accordingly.","Algorithmic Fairness, Causality, Dynamic Modeling, Long-term Fairness" Anamnesic Neural Differential Equations with Orthogonal Polynomial Projections,https://openreview.net/forum?id=xYWqSjBcGMl,https://openreview.net/pdf?id=xYWqSjBcGMl,Long-term memory Neural ODEs architecture using orthogonal polynomial projections.,"Neural ordinary differential equations (Neural ODEs) are an effective framework for learning dynamical systems from irregularly sampled time series data. These models provide a continuous-time latent representation of the underlying dynamical system where new observations at arbitrary time points can be used to update the latent representation of the dynamical system. Existing parameterizations for the dynamics functions of Neural ODEs limit the ability of the model to retain global information about the time series; specifically, a piece-wise integration of the latent process between observations can result in a loss of memory on the dynamic patterns of previously observed data points. We propose PolyODE, a Neural ODE that models the latent continuous-time process as a projection onto a basis of orthogonal polynomials. This formulation enforces long-range memory and preserves a global representation of the underlying dynamical system.
Our construction is backed by favourable theoretical guarantees, and in a series of experiments we demonstrate that it outperforms previous works in the reconstruction of past and future data, and in downstream prediction tasks.","Neural ODEs, Time Series, Orthogonal Polynomials, Long-term memory, Representation Learning" Neural Design for Genetic Perturbation Experiments,https://openreview.net/forum?id=TUBpc5rqGA,https://openreview.net/pdf?id=TUBpc5rqGA,We introduce and analyze many tractable methods for noiseless optimistic arm elimination with applications in genetic perturbation experiments.,"The problem of how to genetically modify cells in order to maximize a certain cellular phenotype has taken center stage in drug development over the last few years (with, for example, genetically edited CAR-T, CAR-NK, and CAR-NKT cells entering cancer clinical trials). Exhausting the search space for all possible genetic edits (perturbations) or combinations thereof is infeasible due to cost and experimental limitations. This work provides a theoretically sound framework for iteratively exploring the space of perturbations in pooled batches in order to maximize a target phenotype under an experimental budget. Inspired by this application domain, we study the problem of batch query bandit optimization and introduce the Optimistic Arm Elimination ($\mathrm{OAE}$) principle, designed to find an almost optimal arm under different functional relationships between the queries (arms) and the outputs (rewards). We analyze the convergence properties of $\mathrm{OAE}$ by relating it to the Eluder dimension of the algorithm's function class and validate that $\mathrm{OAE}$ outperforms other strategies in finding optimal actions in experiments on simulated problems, public datasets well-studied in bandit contexts, and in genetic perturbation datasets when the regression model is a deep neural network. OAE also outperforms the benchmark algorithms in 3 of 4 datasets in the GeneDisco experimental planning challenge. ","genetic perturbation experiments, GeneDisco, optimism, neural optimism" Conceptual SCAN: Learning With and About Rules,https://openreview.net/forum?id=2iu9NhxX23,https://openreview.net/pdf?id=2iu9NhxX23,,"The ability to learn from a mix of rules and examples and to reflect on the learned abstractions is an important aspect of human intelligence. At the same time, there is a lack of benchmarks that systematically test for this ability, which makes it hard to evaluate the degree to which it is present in state-of-the-art ML architectures. We introduce a method to systematically construct such benchmarks by using an example structure that allows us to explicitly provide and ask about rules that are relevant for the given task. We present a simple dataset that is constructed according to this method, and we use it to analyze the performance of a variety of T5-based machine learning models.
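The core mechanism in the PolyODE entry above, projecting a trajectory's history onto an orthogonal polynomial basis, can be illustrated without any neural component: a plain Legendre least-squares fit already yields a compact, global summary of an irregularly sampled series. The coupling with a Neural ODE is omitted; the function below is purely illustrative.

```python
import numpy as np

def legendre_memory(ts, xs, degree=8):
    """Project an irregularly sampled scalar series (ts, xs) onto
    Legendre polynomials over [-1, 1]; the coefficient vector acts as
    a fixed-size, whole-history representation."""
    t = 2.0 * (ts - ts.min()) / (ts.max() - ts.min()) - 1.0
    return np.polynomial.legendre.legfit(t, xs, degree)

# Reconstruction from the compressed memory (same rescaled times t):
# xs_hat = np.polynomial.legendre.legval(t, coeffs)
```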
We identify four challenge areas in this setup: maintaining consistency between learned rules and their application, scaling to larger rule sets, compositional generalization, and dealing with limited training data.","reasoning, compositional generalization, rule learning, semantic parsing, consistency" An alternative approach to train neural networks using monotone variational inequality,https://openreview.net/forum?id=4QTrtR181T,https://openreview.net/pdf?id=4QTrtR181T,"We investigate training neural networks with monotone variational inequality, yielding performance guarantees and competitive/better performance than widely-used stochastic gradient descent methods, especially during initial training phases.","The current paper investigates an alternative approach to neural network training, which is a non-convex optimization problem, through the lens of another, convex problem: solving a monotone variational inequality (MVI), an approach inspired by the recent work of Juditsky and Nemirovski (2019). MVI solutions can be found by computationally efficient procedures, with performance guarantees in the form of $\ell_2$ and $\ell_{\infty}$ bounds on model recovery and prediction accuracy under the theoretical setting of training a single-layer linear neural network. We study the use of MVI for training multi-layer neural networks by proposing a practical and completely general algorithm called \textit{stochastic variational inequality} (\texttt{SVI}). We demonstrate its applicability in training fully-connected neural networks, graph neural networks (GNN), and convolutional networks (CNN) (\texttt{SVI} is completely general for training other network architectures). We show the competitive or better performance of \texttt{SVI} compared to widely-used stochastic gradient descent methods on both synthetic and real network data prediction tasks regarding various performance metrics, especially in terms of improved efficiency in the early stage of training.","monotone variational inequality, graph neural networks, neural network training" Have Missing Data? Make It Miss More! Imputing Tabular Data with Masked Autoencoding,https://openreview.net/forum?id=yzE6LtZSHo,https://openreview.net/pdf?id=yzE6LtZSHo,"We present ReMasker, an extremely simple yet effective method for imputing missing values in tabular data.","We present ReMasker, a novel method for imputing missing values in tabular data by extending the masked autoencoding framework. In contrast to prior work, ReMasker is both {\em simple} -- besides the missing values (i.e., naturally masked), we randomly ``re-mask'' another set of values, optimize the autoencoder by reconstructing this re-masked set, and apply the trained model to predict the missing values; and {\em effective} -- with extensive evaluation on benchmark datasets, we show that ReMasker consistently outperforms state-of-the-art methods in terms of both imputation fidelity and utility under various missingness settings, while its performance advantage often increases with the ratio of missing data. We further explore theoretical justification for its effectiveness, showing that ReMasker tends to learn missingness-invariant representations of tabular data.
Our findings indicate that masked modeling represents a promising direction for further research on tabular data imputation.","Imputation, Tabular Data, Masked Autoencoder" Invertible normalizing flow neural networks by JKO scheme,https://openreview.net/forum?id=-z7O7fk_Cs,https://openreview.net/pdf?id=-z7O7fk_Cs,"We propose JKO-Flow to train a normalizing flow neural ODE model block-wise with time reparametrization, and experimentally show that JKO-Flow reaches competitive performance while greatly reducing computation","Normalizing flow is a class of deep generative models for efficient sampling and density estimation. In practice, the flow often appears as a chain of invertible neural network blocks. To facilitate training, past works have regularized flow trajectories and designed special network architectures. The current paper develops a neural ODE flow network inspired by the Jordan-Kinderlehrer-Otto (JKO) scheme, which allows an efficient \textit{block-wise} training procedure: as the JKO scheme unfolds the dynamics of the gradient flow, the proposed model naturally stacks residual network blocks one-by-one and reduces the memory load as well as the difficulty of training deep networks. We also develop an adaptive time-reparametrization of the flow network with a progressive refinement of the trajectory in probability space, which improves the optimization efficiency and model accuracy in practice. On high-dimensional generative tasks for tabular data, JKO-Flow can process larger data batches and perform as well as or better than continuous and discrete flow models, using 10X fewer iterations (e.g., batches) and significantly less time per iteration. ","Normalizing flow, invertible neural networks, JKO scheme" Unsupervised learning of features and object boundaries from local prediction,https://openreview.net/forum?id=igF77jrKri,https://openreview.net/pdf?id=igF77jrKri,"Using a contrastive loss for a local prediction, we learn a representation of both features and segmentation, which is similar to the human visual system.","A visual system has to learn both which features to extract from images and how to group locations into (proto-)objects. Those two aspects are usually dealt with separately, although predictability is discussed as a cue for both. To incorporate features and boundaries into the same model, we model a layer of feature maps with a pairwise Markov random field model in which each factor is paired with an additional binary variable, which switches the factor on or off. Using one of two contrastive learning objectives, we can learn both the features and the parameters of the Markov random field factors from images without further supervision signals. The features learned by shallow neural networks based on this loss are local averages, opponent colors, and Gabor-like stripe patterns. Furthermore, we can infer connectivity between locations by inferring the switch variables. Contours inferred from this connectivity perform quite well on the Berkeley segmentation database (BSDS500) without any training on contours. Thus, computing predictions across space aids both segmentation and feature learning, and models trained to optimize these predictions show similarities to the human visual system.
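A minimal sketch of the re-masking objective in the ReMasker entry above: on top of the naturally missing cells, hide a random subset of observed cells and train the autoencoder to reconstruct exactly that subset. The encoder/decoder internals and the masking value are left abstract; only the loss construction below follows the entry's description.

```python
import torch

def remasker_step(model, x, nan_mask, remask_frac=0.25):
    """One training step. `nan_mask` is True where values are missing;
    `model` maps a masked table to a reconstructed table."""
    observed = ~nan_mask
    remask = observed & (torch.rand_like(x) < remask_frac)
    x_in = x.clone()
    x_in[nan_mask | remask] = 0.0      # hide both naturally missing and
                                       # artificially re-masked cells
    recon = model(x_in)
    # Supervise only on re-masked cells: held out, but ground truth known.
    return ((recon - x)[remask] ** 2).mean()
```

At test time the trained model is applied once more with only the natural missingness masked, and its outputs at the missing positions serve as imputations.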
We speculate that retinotopic visual cortex might implement such predictions over space through lateral connections.","unsupervised learning, segmentation, prediction, Markov random field, objects" Towards Causal Concepts for Explaining Language Models,https://openreview.net/forum?id=xYy2l4tiOe,https://openreview.net/pdf?id=xYy2l4tiOe,A framework that derives causal and concept-based explanations for complex NLP models,"The emergence of large-scale pretrained language models has posed unprecedented challenges in deriving explanations of why a model has made a given prediction. Stemming from the compositional nature of language, spurious correlations have further undermined the trustworthiness of NLP systems. Thus, there exists an urgent demand for causal explanations to encourage fairness and transparency. To derive more causal, usable, and faithful explanations, we propose a complete framework for interpreting language models by deriving causal concepts. Specifically, we propose a post-hoc method that derives both high-level concepts and surface-level local explanations from hidden layer activations. To ensure causality, we optimize for a causal loss that maximizes the Average Treatment Effect (ATE), where we intervene at the concept level as an innovative substitute for traditional counterfactual interventions. Moreover, we devise several causality evaluation metrics for explanations that can be universally applied. Extensive experiments on real and synthetic tasks demonstrate that our method achieves superior results on causality, usability, and faithfulness compared to the baselines. Our codebase is available at \url{https://anonymous.4open.science/r/CausalConcept}.","NLP explainability, concept-based explanations, causality" "TRIDE: A Temporal, Robust, and Informative Data Augmentation Framework for Disease Progression Modeling",https://openreview.net/forum?id=jbd0I0-sdE,https://openreview.net/pdf?id=jbd0I0-sdE,,"Modeling the progression of a target disease using electronic health records (EHRs), especially predicting disease onset early, is critical for timely and accurate clinical interventions. While numerous deep learning-based prediction models have shown great success in handling sequential multivariate data such as EHRs, they often lack temporal robustness. This is problematic because they may not perform consistently well across different early prediction hours, as training data become scarcer the further into the future they target. Indeed, even one weak time point can significantly restrict the reliability of the models. In this work, we present TRIDE, a temporal, robust and informative data augmentation framework that can learn temporal representations of EHRs and use them to generate diverse and meaningful training samples by optimizing the level of data transformation. We validate TRIDE on modeling the progression of an extremely challenging disease, septic shock, by using real-world EHRs collected from two different medical systems. Our results show that TRIDE significantly outperforms strong baseline models across different prediction times and datasets, and thus enhances temporal robustness.
Further, we provide in-depth analyses of the generated samples and estimated model parameters to clarify the processes.","Temporal robustness, data augmentation, representation learning, language modeling, electronic health records" Multi-Segmental Informational Coding for Self-Supervised Representation Learning,https://openreview.net/forum?id=m8ll6ILyZW,https://openreview.net/pdf?id=m8ll6ILyZW,,"Self-supervised representation learning aims to map high-dimensional data into a compact embedding space, where samples with similar semantics are close to each other. Currently, most representation learning methods maximize the cosine similarity or minimize the distance between different views from the same sample in an $\ell^2$ normalized embedding space, and reduce the feature redundancy via a linear correlation constraint. In this study, we propose MUlti-Segmental Informational Coding (MUSIC) as a new embedding scheme for self-supervised representation learning. MUSIC divides an embedding vector into multiple segments to represent different types of attributes, and each segment automatically learns a set of discrete and complementary attributes. MUSIC enables the estimation of the probability distribution over discrete attributes, and thus the learning process can be directly guided by information measurements, reducing the feature redundancy beyond the linear correlation. Our theoretical analysis guarantees that MUSIC learns transform-invariant, non-trivial, diverse, and discriminative features. MUSIC does not require a special asymmetry design, a very high dimension of embedding features, or a deep projection head, making the training framework flexible and efficient. Extensive experiments demonstrate the superiority of MUSIC. ","self-supervised learning, representation learning, unsupervised learning, deep learning" Rule-based policy regularization for reinforcement learning-based building control,https://openreview.net/forum?id=rDVb_OgcQP,https://openreview.net/pdf?id=rDVb_OgcQP,A unified method to incorporate rule-based policies into online and offline reinforcement learning algorithms,"Rule-based control (RBC) is widely adopted in buildings due to its stability and robustness. It resembles a behavior cloning methodology refined by human expertise. However, it is unlikely for RBC to exceed a reinforcement learning (RL) agent’s performance, since it is challenging to ingest a large number of parameters during decision-making. In this paper, we explore how to incorporate rule-based control into reinforcement learning to learn a more robust policy in both online and offline settings with a unified approach. We start with state-of-the-art online and offline RL methods, TD3 and TD3+BC, then improve on them using a dynamically weighted actor loss function to selectively choose which policy the RL models should learn from at each time step of training.
With experiments across multiple tasks and various weather conditions in both deterministic and stochastic scenarios, we empirically demonstrate that our dynamically weighted rule-based incorporated control regularization (RUBICON) method outperforms representative baseline methods in offline settings by 40.7% in a reward setting combining thermal comfort and energy consumption, and by 49.7% in online settings in building-RL environments.","deep reinforcement learning, batch reinforcement learning, buildings, HVAC" Neural Graphical Models,https://openreview.net/forum?id=UA34f_shAO,https://openreview.net/pdf?id=UA34f_shAO,"A neural network based graphical model with efficient learning, inference and sampling algorithms","Graphs are ubiquitous and are often used to understand the dynamics of a system. Probabilistic Graphical Models comprising Bayesian and Markov networks, and Conditional Independence graphs are some of the popular graph representation techniques. They can model relationships between features (nodes) together with the underlying distribution. Although theoretically these models can represent very complex dependency functions, in practice simplifying assumptions are often made due to computational limitations associated with graph operations. This work introduces Neural Graphical Models (NGMs), which attempt to represent complex feature dependencies with reasonable computational costs. Specifically, given a graph of feature relationships and corresponding samples, we capture the dependency structure between the features along with their complex function representations by using neural networks as a multi-task learning framework. We provide efficient learning, inference and sampling algorithms for NGMs. Moreover, NGMs can fit generic graph structures including directed, undirected and mixed-edge graphs as well as support mixed input data types. We present empirical studies that show NGMs' capability to represent Gaussian graphical models, perform inference analysis on lung cancer data, and extract insights from real-world infant mortality data provided by the CDC.","Graphical models, Deep learning, Learning Representations" AUGMENTING ZERO-SHOT DENSE RETRIEVERS WITH PLUG-IN MIXTURE-OF-MEMORIES,https://openreview.net/forum?id=45FFlw8N47,https://openreview.net/pdf?id=45FFlw8N47,"We explore the potential of augmenting language models with mixture-of-memory and plugging in new corpora during inference, which leads to their enhanced generalization ability on the zero-shot dense retrieval task.","In this paper we improve the zero-shot generalization ability of language models via Mixture-Of-Memory Augmentation (MoMA), a mechanism that retrieves augmentation documents from multiple information corpora (“external memories”), with the option to “plug in” new memory at inference time. We develop a joint learning mechanism that trains the augmentation component with latent labels derived from the end retrieval task, paired with hard negatives from the memory mixture. We instantiate the model in a zero-shot dense retrieval setting by augmenting a strong T5-based retriever with MoMA. Our model, MoMA-DR, obtains strong zero-shot retrieval accuracy on the eighteen tasks included in the standard BEIR benchmark. It outperforms other dense retrieval models of similar scale and achieves accuracy comparable to systems that seek generalization from increased scales in encoder models or vector indices.
Our analysis illustrates the necessity of augmenting with mixture-of-memory for robust generalization, the benefits of joint learning, and how MoMA-DR utilizes the plug-in memory at inference time without changing its parameters. We plan to open-source our code.","Retrieval Augmented Language Model, Zero-shot Dense Retrieval, Mixture of Memory" Efficient Discrete Multi Marginal Optimal Transport Regularization,https://openreview.net/forum?id=R98ZfMt-jE,https://openreview.net/pdf?id=R98ZfMt-jE,"Using a fast algorithm for computing generalized earth mover's distances, we solve practical discrete multi-marginal optimal transport problems in neural network learning applications.","Optimal transport has emerged as a powerful tool for a variety of problems in machine learning, and it is frequently used to enforce distributional constraints. In this context, existing methods often use either a Wasserstein metric, or else they apply concurrent barycenter approaches when more than two distributions are considered. In this paper, we leverage multi-marginal optimal transport (MMOT), where we take advantage of a procedure that computes a generalized earth mover's distance as a subroutine. We show that not only is our algorithm computationally more efficient than other barycenter-based distance methods, but it has the additional advantage that gradients used for backpropagation can be efficiently computed during the forward pass computation itself, which leads to substantially faster model training. We provide technical details about this new regularization term and its properties, and we present experimental demonstrations of faster runtimes when compared to standard Wasserstein-style methods. Finally, on a range of experiments designed to assess effectiveness at enforcing fairness, we demonstrate that our method compares well with alternatives.","optimal transport, multi-marginal, earth mover's distance, fairness" AutoTransfer: AutoML with Knowledge Transfer - An Application to Graph Neural Networks,https://openreview.net/forum?id=y81ppNf_vg,https://openreview.net/pdf?id=y81ppNf_vg,"We propose AutoTransfer, an AutoML solution that improves search efficiency by transferring known architectural design knowledge to the novel task of interest.","AutoML has demonstrated remarkable success in finding an effective neural architecture for a given machine learning task defined by a specific dataset and an evaluation metric. However, most existing AutoML techniques consider each task independently and from scratch, which requires exploring many architectures, leading to high computational cost. Here we propose AutoTransfer, an AutoML solution that improves search efficiency by transferring prior architectural design knowledge to the novel task of interest. Our key innovation includes a task-model bank that captures the model performance over a diverse set of GNN architectures and tasks, and a computationally efficient task embedding that can accurately measure the similarity among different tasks. Based on the task-model bank and the task embeddings, we estimate the design priors of desirable models for the novel task by aggregating a similarity-weighted sum of the top-K design distributions on tasks that are similar to the task of interest. The computed design priors can be used with any AutoML search algorithm. We evaluate AutoTransfer on six datasets in the graph machine learning domain.
Experiments demonstrate that (i) our proposed task embedding can be computed efficiently, and that tasks with similar embeddings have similar best-performing architectures; (ii) AutoTransfer significantly improves search efficiency with the transferred design priors, reducing the number of explored architectures by an order of magnitude. Finally, we release GNN-Bank-101, a large-scale dataset of detailed GNN training information for 120,000 task-model combinations to facilitate and inspire future research.","Graph Neural Networks, AutoML, Knowledge Transfer" Meta-learning from demonstrations improves compositional generalization,https://openreview.net/forum?id=dNyDCl2FsvM,https://openreview.net/pdf?id=dNyDCl2FsvM,We extend meta-seq2seq to grounded environments by generating environment-relevant meta-training supports.,"We study the problem of compositional generalization of language-instructed agents in gSCAN. gSCAN is a popular benchmark which requires an agent to generalize to instructions containing novel combinations of words, which are not seen in the training data. We propose to improve the agent’s generalization capabilities with an architecture inspired by the Meta-Sequence-to-Sequence learning approach (Lake, 2019). The agent receives as context a few examples of pairs of instructions and action trajectories in a given instance of the environment (a support set), and it is tasked with predicting an action sequence for a query instruction for the same environment instance. The context is generated by an oracle and the instructions come from the same distribution as seen in the training data. In each training episode, we also shuffle the indices of the actions and the words of the instructions to make the agent figure out the relations between the actions and the words from the context. Our predictive model has the standard transformer architecture. We show that the proposed architecture can significantly improve the generalization capabilities of the agent on one of the most difficult gSCAN splits: the ``adverb-to-verb'' split H.","meta-learning, grounded language learning, compositional generalization" Deep Dependency Networks for Action Classification in Video,https://openreview.net/forum?id=4zGai1tFQE,https://openreview.net/pdf?id=4zGai1tFQE,A new approach that jointly learns a conditional dependency network and a deep neural network for activity classification in video,"We propose a simple approach that combines the strengths of probabilistic graphical models and deep learning architectures for solving the multi-label action classification task in videos. At a high level, given a video clip, the goal in this task is to infer the set of activities, defined as verb-noun pairs, that are performed in the clip. First, we show that the performance of previous approaches that combine Markov Random Fields with neural networks can be modestly improved by leveraging more powerful methods such as iterative join graph propagation, $\ell_1$-regularization-based structure learning, and integer linear programming. Then we propose a new modeling framework called deep dependency networks, which augment the output layer of a neural network with a dependency network, a model that is easy to train and learns more accurate dependencies but is limited to Gibbs sampling for inference. We show that despite its simplicity, jointly learning this new architecture yields significant improvements in performance over the baseline neural network.
In particular, our experimental evaluation on three video datasets (Charades, Textually Annotated Cooking Scenes (TaCOS), and Wetlab) shows that deep dependency networks are almost always superior to pure neural architectures that do not use dependency networks.","probabilistic graphical models, action classification, multi-label classification, combining probabilistic models with deep learning, end-to-end learning" Temporal Dependencies in Feature Importance for Time Series Prediction,https://openreview.net/forum?id=C0q9oBc3n4,https://openreview.net/pdf?id=C0q9oBc3n4,New explainability method for multivariate time series predictions,"Time series data introduces two key challenges for explainability methods: firstly, observations of the same feature over subsequent time steps are not independent, and secondly, the same feature can have varying importance to model predictions over time. In this paper, we propose Windowed Feature Importance in Time (WinIT), a feature-removal-based explainability approach to address these issues. Unlike existing feature removal explanation methods, WinIT explicitly accounts for the temporal dependence between different observations of the same feature in the construction of its importance score. Furthermore, WinIT captures the varying importance of a feature over time by summarizing its importance over a window of past time steps. We conduct an extensive empirical study on synthetic and real-world data, compare against a wide range of leading explainability methods, and explore the impact of various evaluation strategies. Our results show that WinIT achieves significant gains over existing methods, with more consistent performance across different evaluation metrics.","time series, recurrent, explainability" Peaks2Image: Reconstructing fMRI Statistical Maps from Peaks,https://openreview.net/forum?id=bjMNguuxbH,https://openreview.net/pdf?id=bjMNguuxbH,"Peaks2Image allows the reconstruction of statistical maps from coordinates in neuroscientific studies, and leverages them to decode any concept from the studies' vocabulary.","Neuroscience is striving to overcome the lack of power due to the small sample size of standard studies. An important step forward has been the creation of large-scale public image repositories, such as NeuroVault. Such repositories enable comparing images across studies and automatically associating them with cognitive terms. Yet, this type of meta-analysis faces a major roadblock: the scarcity and inconsistency of image annotations and metadata. Another resource containing rich annotations is the neuroscientific literature. However, it only yields a handful of brain-space coordinates per publication, those of the main activity peaks reported in each study. In this work, we propose Peaks2Image, a neural-network approach to reconstructing continuous spatial representations of brain activity from peak activation tables. Using reconstructions of studies published in the neuroscientific literature, we train a decoder using tf-idf features as labels, leading to a much broader set of decoded terms than current image-based studies.
We validate the decoder on 43,000 NeuroVault images, successfully decoding 58 out of 81 concepts in a zero-shot setting.", Bridging the Gap between Semi-supervised and Supervised Continual Learning via Data Programming,https://openreview.net/forum?id=_o4JUv2lmD,https://openreview.net/pdf?id=_o4JUv2lmD,"We build a semi-supervised continual learning (SSCL) framework that approaches the performance of supervised learning via self-taught data programming. Results show we not only obtain performance similar to fully supervised training, but also outperform existing SSCL methods.","Semi-supervised continual learning (SSCL) has shown its utility in learning cumulative knowledge with partially labeled data per task. However, the state-of-the-art has yet to explicitly address how to reduce the performance gap between using partially labeled data and fully labeled data. In response, we propose a general-purpose SSCL framework, namely DP-SSCL, that uses data programming (DP) to pseudo-label the unlabeled data per task, and then cascades both ground-truth-labeled and pseudo-labeled data to update a downstream supervised continual learning model. The framework includes a feedback loop that brings mutual benefits: On one hand, DP-SSCL inherits guaranteed pseudo-labeling quality from DP techniques to improve continual learning, approaching the performance of using fully supervised data. On the other hand, knowledge transfer from previous tasks facilitates training of the DP pseudo-labeler, taking advantage of cumulative information via self-teaching. Experiments show that (1) DP-SSCL bridges the performance gap, approaching the final accuracy and catastrophic forgetting of training with fully labeled data, (2) DP-SSCL outperforms existing SSCL approaches at low cost, by up to $25\%$ higher final accuracy and lower catastrophic forgetting on standard benchmarks, while reducing memory overhead from the $100$ MB level to the $1$ MB level at the same time complexity, and (3) DP-SSCL is flexible, maintaining steady performance while supporting plug-and-play extensions for a variety of supervised continual learning models.","continual learning, lifelong learning, semi-supervised learning" Characterizing the spectrum of the NTK via a power series expansion,https://openreview.net/forum?id=Tvms8xrZHyR,https://openreview.net/pdf?id=Tvms8xrZHyR,"We characterize the NTK spectrum via a power series representation in terms of the Hermite coefficients of the activation function, the depth, and the effective rank of the input Gram. ","Under mild conditions on the network initialization we derive a power series expansion for the Neural Tangent Kernel (NTK) of arbitrarily deep feedforward networks in the infinite width limit. We provide expressions for the coefficients of this power series which depend on both the Hermite coefficients of the activation function as well as the depth of the network. We observe that faster decay of the Hermite coefficients leads to faster decay in the NTK coefficients. Using this series, we first relate the effective rank of the NTK to the effective rank of the input-data Gram. Second, for data drawn uniformly on the sphere we derive an explicit formula for the eigenvalues of the NTK, which shows that faster decay in the NTK coefficients implies faster decay in its spectrum. From this we recover existing results on eigenvalue asymptotics for ReLU networks and comment on how the activation function influences the RKHS.
Finally, for generic data and activation functions with sufficiently fast Hermite coefficient decay, we derive an asymptotic upper bound on the spectrum of the NTK.","neural tangent kernel, power series, Hermite coefficient, activation function, spectrum, input Gram matrix" Unmasking the Lottery Ticket Hypothesis: What's Encoded in a Winning Ticket's Mask?,https://openreview.net/forum?id=xSsW2Am-ukZ,https://openreview.net/pdf?id=xSsW2Am-ukZ,We provide an error landscape perspective on what information is encoded in a winning ticket's mask and how Iterative Magnitude Pruning finds matching subnetworks.,"Modern deep learning involves training costly, highly overparameterized networks, thus motivating the search for sparser networks that require less compute and memory but can still be trained to the same accuracy as the full network (i.e. matching). Iterative magnitude pruning (IMP) is a state-of-the-art algorithm that can find such highly sparse matching subnetworks, known as winning tickets. IMP operates by iterative cycles of training, masking a fraction of the smallest-magnitude weights, rewinding unmasked weights back to an early training point, and repeating. Despite its simplicity, the underlying principles for when and how IMP finds winning tickets remain elusive. In particular, what useful information does an IMP mask found at the end of training convey to a rewound network near the beginning of training? How does SGD allow the network to extract this information? And why is iterative pruning needed, i.e. why can't we prune to very high sparsities in one shot? We develop answers to these questions in terms of the geometry of the error landscape. First, we find that—at higher sparsities—pairs of pruned networks at successive pruning iterations are connected by a linear path with zero error barrier if and only if they are matching. This indicates that masks found at the end of training convey to the rewind point the identity of an axial subspace that intersects a desired linearly connected mode of a matching sublevel set. Second, we show SGD can exploit this information due to a strong form of robustness: it can return to this mode despite strong perturbations early in training. Third, we show how the flatness of the error landscape at the end of training determines a limit on the fraction of weights that can be pruned at each iteration of IMP. This analysis yields a new quantitative link between IMP performance and the Hessian eigenspectrum. Finally, we show that the role of retraining in IMP is to find a network with new small weights to prune. Overall, these results make progress toward demystifying the existence of winning tickets by revealing the fundamental role of error landscape geometry in the algorithms used to find them.","linear mode connectivity, iterative magnitude pruning, loss landscape geometry, lottery ticket hypothesis, sparsity" A critical look at evaluation of GNNs under heterophily: Are we really making progress?,https://openreview.net/forum?id=tJbbQfw-5wv,https://openreview.net/pdf?id=tJbbQfw-5wv,"We show that popular heterophilous datasets for node classification have serious drawbacks, propose several new ones, and show that, at this moment, standard GNNs outperform most of the specialized models on these datasets.","Node classification is a classical graph representation learning task on which Graph Neural Networks (GNNs) have recently achieved strong results.
However, it is often believed that standard GNNs only work well for homophilous graphs, i.e., graphs where edges tend to connect nodes of the same class. Graphs without this property are called heterophilous, and it is typically considered that specialized methods are required to achieve strong performance on such graphs. Many GNN models for heterophilous graphs have recently been proposed in the literature. However, these models are typically evaluated on the same set of six heterophilous datasets. In this work, we challenge this evaluation setting. First, we show that the popular heterophilous benchmarks have serious drawbacks, the most significant being a large number of duplicate nodes in the Squirrel and Chameleon datasets. We show that some models implicitly use this property of the datasets, and removing duplicate nodes strongly affects their performance. Then, we propose a set of heterophilous graphs with varying properties that we believe can serve as a better benchmark for testing the performance of GNNs under heterophily. We show that, at this moment, standard GNNs achieve competitive results on heterophilous graphs, outperforming most of the specialized models.","GNN, graph, node classification, heterophily, benchmark" Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness,https://openreview.net/forum?id=Wc5bmZZU9cy,https://openreview.net/pdf?id=Wc5bmZZU9cy,,"Neural text-to-SQL models have achieved remarkable performance in translating natural language questions into SQL queries. However, recent studies reveal that text-to-SQL models are vulnerable to task-specific perturbations. Previous curated robustness test sets usually focus on individual phenomena. In this paper, we propose a comprehensive robustness benchmark based on Spider, a cross-domain text-to-SQL benchmark, to diagnose model robustness. We design 17 perturbations on databases, natural language questions, and SQL queries to measure robustness from different angles. In order to collect more diversified natural question perturbations, we utilize large pretrained language models (PLMs) to simulate human behaviors in creating natural questions. We conduct a diagnostic study of the state-of-the-art models on the robustness set. Experimental results reveal that even the most robust model suffers from a 14.0% performance drop overall and a 50.7% performance drop on the most challenging perturbation. We also present a breakdown analysis regarding text-to-SQL model designs and provide insights for improving model robustness.", A Non-monotonic Self-terminating Language Model,https://openreview.net/forum?id=vw-5EgYbJZr,https://openreview.net/pdf?id=vw-5EgYbJZr,We propose a new method to prevent language models from generating non-terminating sequences resulting from incomplete decoding algorithms.,"Recent large-scale neural autoregressive sequence models have shown impressive performance on a variety of natural language generation tasks. However, their generated sequences often exhibit degenerate properties such as non-termination, undesirable repetition, and premature termination when generated with decoding algorithms such as greedy search, beam search, top-k sampling, and nucleus sampling. In this paper, we focus on the problem of non-terminating sequences resulting from an incomplete decoding algorithm. We first define an incomplete probable decoding algorithm, which includes greedy search, top-k sampling, and nucleus sampling, beyond the incomplete decoding algorithm originally put forward by Welleck et al.
(2020). We then propose a non-monotonic self-terminating language model, which significantly relaxes the constraint of monotonically increasing termination probability in the originally proposed self-terminating language model by Welleck et al. (2020), to address the issue of non-terminating sequences when using incomplete probable decoding algorithms. We prove that our proposed model prevents non-terminating sequences when using not only incomplete probable decoding algorithms but also beam search. We empirically validate our model on sequence completion tasks with various architectures.","non-terminating sequences, language modeling, sequence completion, decoding, consistency, self-terminating" Counterfactual Memorization in Neural Language Models,https://openreview.net/forum?id=PvOo1sHKzf,https://openreview.net/pdf?id=PvOo1sHKzf,,"Modern neural language models, widely used in tasks across NLP, risk memorizing sensitive information from their training data. As models continue to scale up in parameters, training data, and compute, understanding memorization in language models is both important from a learning-theoretical point of view and practically crucial in real-world applications. An open question in previous studies of memorization in language models is how to filter out ""common"" memorization. In fact, most memorization criteria strongly correlate with the number of occurrences in the training set, capturing ""common"" memorization such as familiar phrases, public knowledge or templated texts. In this paper, we provide a principled perspective inspired by a taxonomy of human memory in Psychology. From this perspective, we formulate a notion of counterfactual memorization, which characterizes how a model's predictions change if a particular document is omitted during training. We identify and study counterfactually-memorized training examples in standard text datasets. We further estimate the influence of each training example on the validation set and on generated texts, and show that this can provide direct evidence of the source of memorization at test time.","memorization, influence, language models" TT-Rules: Extracting & Optimizing Exact Rules of a CNN-Based Model - Application to Fairness,https://openreview.net/forum?id=5pU6126YRp,https://openreview.net/pdf?id=5pU6126YRp,"In this work, we propose an optimized new CNN-based framework for global and exact interpretability with application to healthcare and fairness tabular datasets.","Most Machine Learning (ML) models are ``black box'' models, but in critical domains such as healthcare, energy, finance, military, or justice, they need to be globally and exactly interpretable. Creating ML models convertible by design into rule-based models is an attractive solution: they produce all the rules (global nature of interpretability) that allow us to obtain exactly the output result (exact nature of interpretability). Today, these rule-based models are mainly decision trees, whose natural interpretability is outweighed by their poor performance and scalability. In this paper, we offer a new three-step framework, TT-rules, that extracts and optimizes exact rules from a recent family of Convolutional Neural Networks (CNNs) called Truth Table nets (TTnets). First, we show how to extract rules $\mathcal{R}$ in Disjunctive Normal Form (DNF) from TTnets, which we adapt and enhance for tabular datasets.
Second, we explain how the TT-rules framework permits the optimization of two key interpretability factors, namely the number of rules and their size, transforming the original set $\mathcal{R}$ into an optimized $\mathcal{R}_{opt}$. Our rule-based model is thus composed of $\mathcal{R}_{opt}$ with a final binary linear regression and allows multi-label classification. Third, we improve the rules' visualization by converting them into Reduced Ordered Binary Decision Diagrams (ROBDD) and enriching them by computing interesting associated probabilities. To evaluate TT-rules' performance, we apply it to two tabular healthcare datasets and two fairness datasets. Our framework reaches competitive results compared to state-of-the-art rule-based models in terms of accuracy, complexity, and statistical parity, while also giving exact and global interpretability. In addition, we show that practitioners can use their domain knowledge to diagnose individual fairness of a given TT-rules model by analyzing and further modifying the rules $\mathcal{R}_{opt}$. As an example of the compactness of our framework's output, we draw all the rules in $\mathcal{R}_{opt}$ for one model on the Adult dataset (only 15 conditions for an 84.6\% accuracy).","global & exact interpretability, convolutional neural-networks, rule-based model for fairness" uGLAD: A deep learning model to recover conditional independence graphs,https://openreview.net/forum?id=dmWMfJeZMM,https://openreview.net/pdf?id=dmWMfJeZMM,An unsupervised deep learning method based on the unrolled algorithm technique to recover conditional independence graphs. ,"Probabilistic Graphical Models are generative models of complex systems. They rely on conditional independence assumptions between variables to learn sparse representations which can be visualized in the form of a graph. Such models are used for domain exploration and structure discovery in poorly understood domains. This work introduces a novel technique to perform sparse graph recovery by optimizing deep unrolled networks. Assuming that the input data $X\in\mathbb{R}^{M\times D}$ comes from an underlying multivariate Gaussian distribution, we apply a deep model on $X$ that outputs the precision matrix $\Theta$. Then, the partial correlation matrix $\mathrm{P}$ is calculated, which can also be interpreted as providing a list of conditional independence assertions holding in the input distribution. Our model, \texttt{uGLAD}, builds upon and extends the state-of-the-art model \texttt{GLAD} to the unsupervised setting. The key benefits of our model are (1) \texttt{uGLAD} automatically optimizes sparsity-related regularization parameters, leading to better performance than existing algorithms, and (2) we introduce a multi-task learning based `consensus' strategy for robust handling of missing data in an unsupervised setting.
We evaluate performance on synthetic Gaussian data and on non-Gaussian data generated from Gene Regulatory Networks, and present case studies in anaerobic digestion and infant mortality.","Graphical Lasso, Deep Learning, Unrolled Algorithms, Conditional Independence graphs, Sparse graphs" Quantifying Memorization Across Neural Language Models,https://openreview.net/forum?id=TatRHT_1cK,https://openreview.net/pdf?id=TatRHT_1cK,"Model size, duplication, and context length all impact how easy it is to extract training data from large language models.","Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized training data verbatim. This is undesirable because memorization violates privacy (exposing user data), degrades utility (repeated easy-to-memorize text is often low quality), and hurts fairness (some texts are memorized over others). We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data. Memorization significantly grows as we increase (1) the capacity of a model, (2) the number of times an example has been duplicated, and (3) the number of tokens of context used to prompt the model. Surprisingly, we find the situation becomes complicated when generalizing these results across model families. On the whole, we find that memorization in LMs is more prevalent than previously believed and will likely get worse as models continue to scale, at least without active mitigations.","memorization, large language models, duplication" Powderworld: A Platform for Understanding Generalization via Rich Task Distributions,https://openreview.net/forum?id=AWZgXGmsbA,https://openreview.net/pdf?id=AWZgXGmsbA,Powderworld is an environment supporting the study of generalization by providing diverse tasks arising from the same core rules.,"One of the grand challenges of reinforcement learning is the ability to generalize to new tasks. However, general agents require a set of rich, diverse tasks to train on. Designing a `foundation environment' for such tasks is tricky -- the ideal environment would support a range of emergent phenomena, an expressive task space, and fast runtime. To take a step towards addressing this research bottleneck, this work presents Powderworld, a lightweight yet expressive simulation environment running directly on the GPU. Within Powderworld, two motivating task distributions are presented, one for world-modelling and one for reinforcement learning. Each contains hand-designed test tasks to examine generalization. Experiments indicate that increasing the environment's complexity improves generalization for world models, yet causes reinforcement learning agents to struggle. Powderworld aims to support the study of generalization by providing a source of diverse tasks arising from the same core rules.","reinforcement learning, environment, generalization, out-of-distribution, multi-task" Federated Self-supervised Learning for Heterogeneous Clients,https://openreview.net/forum?id=bNPth9YMqZ,https://openreview.net/pdf?id=bNPth9YMqZ,We propose a new and systematic framework for performing self-supervised federated learning when the clients are diverse and cannot train models of identical architectures.,"Federated Learning has become an important learning paradigm due to its privacy and computational benefits.
As the field advances, two key challenges remain to be addressed: (1) system heterogeneity, i.e., variability in the compute and/or data resources present on each client, and (2) the lack of labeled data in certain federated settings. Several recent developments have tried to overcome these challenges independently. In this work, we propose a unified and systematic framework, \emph{Heterogeneous Self-supervised Federated Learning} (Hetero-SSFL), for enabling self-supervised learning with federation on heterogeneous clients. The proposed framework allows collaborative representation learning across all the clients without imposing architectural constraints or requiring the presence of labeled data. The key idea in Hetero-SSFL is to let each client train its unique self-supervised model and enable joint learning across clients by aligning the lower-dimensional representations on a common dataset. The entire training procedure can be viewed as self- and peer-supervised, as both the local training and the alignment procedures do not require any labeled data. As in conventional self-supervised learning, the obtained client models are task-independent and can be used for varied end-tasks. We provide a convergence guarantee of the proposed framework for non-convex objectives in heterogeneous settings and also empirically demonstrate that our proposed approach outperforms state-of-the-art methods by a significant margin. ","Federated Learning, Self-supervised Learning" ContraSim -- A Similarity Measure Based on Contrastive Learning,https://openreview.net/forum?id=_fiHdKjxfR,https://openreview.net/pdf?id=_fiHdKjxfR,We develop a new similarity measure based on contrastive learning,"Recent work has compared neural network representations via similarity-based analyses, shedding light on how different aspects (architecture, training data, etc.) affect models' internal representations. The quality of a similarity measure is typically evaluated by its success in assigning a high score to representations that are expected to be matched. However, existing similarity measures perform mediocrely on standard benchmarks. In this work, we develop a new similarity measure, dubbed ContraSim, based on contrastive learning. In contrast to common closed-form similarity measures, ContraSim learns a parameterized measure by using both similar and dissimilar examples. We perform an extensive experimental evaluation of our method, with both language and vision models, on the standard layer prediction benchmark and two new benchmarks that we develop: the multilingual benchmark and the image--caption benchmark. In all cases, ContraSim achieves much higher accuracy than previous similarity measures, even when presented with challenging examples. ","Interpretability, similarity measure, analysis, language models, multilingual, image captioning" Learning to Segment from Noisy Annotations: A Spatial Correction Approach,https://openreview.net/forum?id=Qc_OopMEBnC,https://openreview.net/pdf?id=Qc_OopMEBnC,,"Noisy labels can significantly affect the performance of deep neural networks (DNNs). In medical image segmentation tasks, annotations are error-prone due to the high demands on annotation time and annotator expertise. Existing methods mostly tackle label noise in classification tasks. Their independent-noise assumptions do not fit label noise in segmentation tasks.
In this paper, we propose a novel noise model for segmentation problems that encodes spatial correlation and bias, which are prominent in segmentation annotations. Further, to mitigate such label noise, we propose a label correction method to progressively recover the true labels. We provide theoretical guarantees of the correctness of the proposed method. Experiments show that our approach outperforms current state-of-the-art methods on both synthetic and real-world noisy annotations.", PointConvFormer: Revenge of the Point-Based Convolution,https://openreview.net/forum?id=9jW_Oynp0au,https://openreview.net/pdf?id=9jW_Oynp0au,"We introduce PointConvFormer, a novel building block using transformer attention to improve point convolution.","We introduce PointConvFormer, a novel building block for point-cloud-based deep network architectures. Inspired by generalization theory, PointConvFormer combines ideas from point convolution, where filter weights are only based on relative position, and Transformers, which utilize feature-based attention. In PointConvFormer, attention computed from feature differences between points in the neighborhood is used to modify the convolutional weights at each point. Hence, we preserve the invariances of point convolution, whereas attention helps to select relevant points in the neighborhood for convolution. We experiment on both semantic segmentation and scene flow estimation tasks on point clouds with multiple datasets including ScanNet, SemanticKITTI, FlyingThings3D, and KITTI. Our results show that PointConvFormer substantially outperforms classic convolutions, regular transformers, and voxelized sparse convolution approaches with much smaller and faster networks. Visualizations show that PointConvFormer performs similarly to convolution on flat areas, whereas the neighborhood selection effect is stronger on object boundaries, showing that it gets the best of both worlds. The code will be available with the final version.", Measuring Forgetting of Memorized Training Examples,https://openreview.net/forum?id=7bJizxLKrR,https://openreview.net/pdf?id=7bJizxLKrR,"When models are trained on large datasets, we show that privacy attacks become less effective on examples seen early in training, and investigate why.","Machine learning models exhibit two seemingly contradictory phenomena: training data memorization and various forms of forgetting. In memorization, models overfit specific training examples and become susceptible to privacy attacks. In forgetting, examples which appeared early in training are forgotten by the end. In this work, we connect these phenomena. We propose a technique to measure to what extent models ``forget'' the specifics of training examples, becoming less susceptible to privacy attacks on examples they have not seen recently. We show that, while non-convexity can prevent forgetting from happening in the worst case, standard image and speech models empirically do forget examples over time. We identify nondeterminism as a potential explanation, showing that deterministically trained models do not forget.
Our results suggest that examples seen early when training with extremely large datasets---for instance those examples used to pre-train a model---may enjoy privacy benefits at the expense of examples seen later.","forgetting, memorization, membership inference, canary extraction, nondeterminism, convexity" Leveraging Human Features at Test-Time,https://openreview.net/forum?id=CvnKNdZQsxb,https://openreview.net/pdf?id=CvnKNdZQsxb,,"Machine learning (ML) models can make decisions based on large amounts of data, but they may be missing important context. For example, a model trained to predict psychiatric outcomes may know nothing about a patient’s social support system, and social support may look different for different patients. In this work, we explore strategies for querying for a small, additional set of these human features that are relevant for each specific instance at test time, so as to incorporate this information while minimizing the burden on the user to label feature values. We define the problem of querying users for an instance-specific set of human feature values, and propose algorithms to solve it. We show in experiments on real datasets that our approach outperforms a feature selection baseline that chooses the same set of human features for all instances.", Graduated Non-Convexity for Robust Self-Trained Language Understanding,https://openreview.net/forum?id=XG_LmeoU8Xq,https://openreview.net/pdf?id=XG_LmeoU8Xq,"Robust self-trained language understanding against pseudo-labeling noise, data imbalance, overfitting, and adversarial evaluation data.","Self-training has proven to be an efficient strategy for unsupervised fine-tuning of language models using unlabeled data and model-generated pseudo-labels. However, the performance of self-trained models is unstable under different settings of training and evaluation data, influenced by both data distribution and pseudo-label accuracy. In this work, we propose an outlier-robust self-training method based on graduated non-convexity (GNC) to mitigate the problem. We formulate self-training as a non-convex optimization problem with outlier training examples. The models are self-trained with robust cost functions based on Black-Rangarajan duality. The algorithm learns slack variables as the loss weights for all training samples. The slack variables are used to calibrate the loss terms during training to update the model parameters. The calibrated loss terms lead to more robust self-trained models against different training and evaluation data and tasks. We conducted experiments on few-shot natural language understanding tasks with labeled and unlabeled data examples. Experiment results show that the proposed loss calibration method improves the performance and stability of self-training under different training tasks and data examples, and also benefits the robustness against adversarial evaluation corpora. ", On the Activation Function Dependence of the Spectral Bias of Neural Networks,https://openreview.net/forum?id=TVAFpPEWSn7,https://openreview.net/pdf?id=TVAFpPEWSn7,,"Neural networks are universal function approximators which are known to generalize well despite being dramatically overparameterized. We study this phenomenon from the point of view of the spectral bias of neural networks. Our contributions are two-fold.
First, we provide a theoretical explanation for the spectral bias of ReLU neural networks by leveraging connections with the theory of finite element methods, which is widely used to numerically solve PDEs. Second, based upon this theory, we predict that switching the activation function to a piecewise linear B-spline, namely the Hat function, will remove this spectral bias, which we verify empirically in a variety of settings. This is of particular significance for solving PDEs using neural networks since for such problems it is important to capture all frequencies in the solutions. Our empirical studies also show that neural networks with the Hat activation function are trained significantly faster using stochastic gradient descent and ADAM. Combined with previous work showing that the Hat activation function also improves generalization accuracy on image classification tasks, this indicates that using the Hat activation provides significant advantages over the ReLU on a variety of problems. ", MaskViT: Masked Visual Pre-Training for Video Prediction,https://openreview.net/forum?id=QAV2CcLEDh,https://openreview.net/pdf?id=QAV2CcLEDh,We propose to learn a Transformer-based video prediction model via masked visual modeling. ,"The ability to predict future visual observations conditioned on past observations and motor commands can enable embodied agents to plan solutions to a variety of tasks in complex environments. This work shows that we can create good video prediction models by pre-training transformers via masked visual modeling. Our approach, named MaskViT, is based on two simple design decisions. First, for memory and training efficiency, we use two types of window attention: spatial and spatiotemporal. Second, during training, we mask a variable percentage of tokens instead of a fixed mask ratio. For inference, MaskViT generates all tokens via iterative refinement, where we incrementally decrease the masking ratio following a mask scheduling function. On several datasets we demonstrate that MaskViT outperforms prior work in video prediction, is parameter-efficient, and can generate high-resolution videos ($256 \times 256$). Further, we demonstrate the benefits of inference speedup (up to $512 \times$) due to iterative decoding by using MaskViT for planning on a real robot. Our work suggests that we can endow embodied agents with powerful predictive models by leveraging the general framework of masked visual modeling with minimal domain knowledge. ", Text Summarization with Oracle Expectation,https://openreview.net/forum?id=HehQobsr0S,https://openreview.net/pdf?id=HehQobsr0S,,"Extractive summarization produces summaries by identifying and concatenating the most important sentences in a document. Since most summarization datasets do not come with gold labels indicating whether document sentences are summary-worthy, different labeling algorithms have been proposed to extrapolate oracle extracts for model training. In this work, we identify two flaws with the widely used greedy labeling approach: it delivers suboptimal and deterministic oracles. To alleviate both issues, we propose a simple yet effective labeling algorithm that creates soft, expectation-based sentence labels. We define a new learning objective for extractive summarization which incorporates learning signals from multiple oracle summaries and prove it is equivalent to estimating the oracle expectation for each document sentence.
Without any architectural modifications, the proposed labeling scheme achieves superior performance on a variety of summarization benchmarks across domains and languages, in both supervised and zero-shot settings.","Text Summarization, NLP" MERMADE: $K$-shot Robust Adaptive Mechanism Design via Model-Based Meta-Learning,https://openreview.net/forum?id=8uf1JIb07M,https://openreview.net/pdf?id=8uf1JIb07M,"We propose MERMADE, a deep RL approach to mechanism design that learns a world model together with a meta-learned mechanism that can be quickly adapted to perform well on unseen test agents that learn.","Mechanism design (MD) studies how rules and rewards shape the behavior of intelligent agents, e.g., in auctions or the economy. Simulations with AI agents are powerful tools for MD, but real-world agents may behave and learn differently than simulated agents under a given mechanism. Also, the mechanism designer may not fully observe an agent's learning strategy or rewards, and executing a mechanism may be costly, e.g., enforcing a tax might require extra labor. Hence, it is key to design robust adaptive mechanisms that generalize well to agents with unseen (learning) behavior, are few-shot adaptable, and are cost-efficient. Here, we introduce MERMADE, a model-based meta-learning framework to learn mechanisms that can quickly adapt when facing out-of-distribution agents with different learning strategies and reward functions. First, we show that meta-learning allows adapting to the theoretically known and appropriate Stackelberg equilibrium in a simple matrix game at meta-test time, with few interactions with the agent. Second, with bandit agents, we show empirically that our approach yields strong meta-test time performance against agents with various unseen explore-exploit behaviors. Finally, we outperform baselines that separately use either meta-learning or agent behavior modeling to learn a cost-effective mechanism that is $K$-shot adaptable with only partial information about the agents.","Mechanism design, Robustness, Meta-learning, Adaptive agents, Simulation based learning" Continuous-time identification of dynamic state-space models by deep subspace encoding,https://openreview.net/forum?id=_4n3k3d1ob,https://openreview.net/pdf?id=_4n3k3d1ob,This work proposes a method for the estimation of continuous-time nonlinear state-space models parameterized by ANNs that is robust and theoretically well-motivated.,"Continuous-time (CT) modeling has proven to provide improved sample efficiency and interpretability in learning the dynamical behavior of physical systems compared to discrete-time (DT) models. However, even with numerous recent developments, the CT nonlinear state-space (NL-SS) model identification problem remains to be solved in full, considering common experimental aspects such as the presence of external inputs, measurement noise, latent states, and general robustness. This paper presents a novel estimation method that addresses all these aspects and that can obtain state-of-the-art results on multiple benchmarks with compact fully connected neural networks capturing the CT dynamics.
The proposed estimation method, called the subspace encoder approach (SUBNET), achieves these results by efficiently approximating the complete simulation loss: it evaluates short simulations on subsections of the data, uses an encoder function to estimate the initial state for each subsection, and applies a novel state-derivative normalization to ensure stability and good numerical conditioning of the training process. We prove that the use of subsections increases cost function smoothness, establish the necessary requirements for the existence of the encoder function, and show that the proposed state-derivative normalization is essential for reliable estimation of CT NL-SS models.","Continuous-time, State-space, Artificial Neural Networks" Waveformer: Linear-Time Attention with Forward and Backward Wavelet Transform,https://openreview.net/forum?id=xWwbnbtJd5,https://openreview.net/pdf?id=xWwbnbtJd5,"We propose a model with linear complexity achieving SOTA results on a set of long-range tasks, under a new paradigm that learns attention in wavelet space and boosts the accuracy of various attention methods without increasing time complexity.","We propose Waveformer, which learns the attention mechanism in the wavelet coefficient space, requires only linear time complexity, and enjoys universal approximating power. Specifically, we first apply a forward wavelet transform to project the input sequences to multi-resolution orthogonal wavelet bases, then conduct nonlinear transformations (in this case, a random feature kernel) in the wavelet coefficient space, and finally reconstruct the representation in input space via a backward wavelet transform. We note that other non-linear transformations may be used, hence we name the learning paradigm Wavelet transformatIon for Sequence lEarning (WISE). We emphasize the importance of backward reconstruction in the WISE paradigm — without it, one would be mixing information from both the input space and coefficient space through skip connections, which is not mathematically sound. Compared with the Fourier transform used in recent works, the wavelet transform is more efficient in time complexity and better captures local and positional information; we further support this through our ablation studies. Extensive experiments on seven long-range understanding datasets from the Long Range Arena benchmark and code understanding tasks demonstrate that (1) Waveformer achieves competitive or even better accuracy than a number of state-of-the-art Transformer variants, and (2) WISE can boost the accuracy of various attention approximation methods without increasing the time complexity. These together showcase the superiority of learning attention in a wavelet coefficient space over the input space. ","transformer, efficient attention, long range reasoning" SemSup-XC: Semantic Supervision for Extreme Classification,https://openreview.net/forum?id=1zaoVA_z8Q,https://openreview.net/pdf?id=1zaoVA_z8Q,We propose a new model for extreme classification over very large label spaces and achieve SOTA results on three popular benchmarks.,"Extreme classification (XC) considers the scenario of predicting over a very large number of classes (thousands to millions), with real-world applications including serving search engine results, e-commerce product tagging, and news article classification. The zero-shot version of this task involves the addition of new categories at test time, requiring models to generalize to novel classes without additional training data (e.g.
one may add a new class “fidget spinner” for e-commerce product tagging). In this paper, we develop SEMSUP-XC, a model that achieves state-of-the-art zero-shot (ZS) and few-shot (FS) performance on three extreme classification benchmarks spanning the domains of law, e-commerce, and Wikipedia. SEMSUP-XC builds upon the recently proposed framework of semantic supervision that uses semantic label descriptions to represent and generalize to classes (e.g., “fidget spinner” described as “A popular spinning toy intended as a stress reliever”). Specifically, we use a combination of contrastive learning, a hybrid lexico-semantic similarity module, and automated description collection to train SEMSUP-XC efficiently over extremely large class spaces. SEMSUP-XC significantly outperforms baselines and state-of-the-art models on all three datasets, by up to 6-10 precision@1 points on zero-shot classification and >10 precision points on few-shot classification, with similar gains for recall@10 (3 for zero-shot and 2 for few-shot). Our ablation studies and qualitative analyses demonstrate the relative importance of our various improvements and show that SEMSUP-XC’s automated pipeline offers a consistently efficient method for extreme classification.","Extreme classification, zero-shot inference, few-shot learning" SaMoE: Parameter Efficient MoE Language Models via Self-Adaptive Expert Combination,https://openreview.net/forum?id=HO2q49XYRC,https://openreview.net/pdf?id=HO2q49XYRC,SaMoE is a parameter-efficient MoE architecture design that enables parameter savings on MoE while achieving comparable or better accuracy.,"Recently, Mixture-of-Experts (MoE) has demonstrated success in scaling models to have large amounts of parameters without significant increases in computational cost. However, MoEs have also been reported to be parameter-inefficient, such that larger models do not always lead to better performance. In this work, we study how to build parameter-efficient MoE models. Our analysis identifies that MoE layers exhibit poor gradient flow as the number of experts increases, leading to insufficient training of experts. To overcome this issue, we propose a new MoE architecture design (SaMoE), which improves the parameter efficiency of MoE models by learning a soft combination of a global set of expert layers for each MoE layer. Such a scheme enables substantial parameter savings on MoE while achieving comparable or better accuracy than the standard MoE training baseline. Extensive experiments on billion-scale GPT-3 style autoregressive MoE language models demonstrate that SaMoE significantly improves the parameter efficiency of MoE models by reducing total parameters by up to 5.2X while obtaining superior pre-training and zero-shot generalization results as compared to the baseline. ","Mixture-of-Expert, Autoregressive language model, Parameter efficiency." How to Train your HIPPO: State Space Models with Generalized Orthogonal Basis Projections,https://openreview.net/forum?id=klK17OQ3KB,https://openreview.net/pdf?id=klK17OQ3KB,We develop a new theoretical interpretation of S4 and generalize it to other basis functions,"Linear time-invariant state space models (SSM) are a classical model from engineering and statistics that have recently been shown to be very promising in machine learning through the Structured State Space sequence model (S4).
A core component of S4 involves initializing the SSM state matrix to a particular matrix called a HiPPO matrix, which was empirically important for S4's ability to handle long sequences. However, the specific matrix that S4 uses was actually derived in previous work for a particular *time-varying* dynamical system, and the use of this matrix as a *time-invariant* SSM had no known mathematical interpretation. Consequently, the theoretical mechanism by which S4 models long-range dependencies actually remains unexplained. We derive a more general and intuitive formulation of the HiPPO framework, which provides a simple mathematical interpretation of S4 as a decomposition onto exponentially-warped Legendre polynomials, explaining its ability to capture long-range dependencies. Our generalization introduces a theoretically rich class of SSMs that also lets us derive more intuitive S4 variants for other bases such as the Fourier basis, and explains other aspects of training S4, such as how to initialize the important timescale parameter. These insights improve S4's performance to 86% on the Long Range Arena benchmark, with 96% on the most difficult Path-X task. ","Deep learning, sequence model, state space model, S4, HiPPO" Interpretable Debiasing of Vectorized Language Representations with Iterative Orthogonalization,https://openreview.net/forum?id=TkQ1sxd9P4,https://openreview.net/pdf?id=TkQ1sxd9P4,Our proposed debiasing technique significantly improves the amount of debiasing while retaining relevant information in the embedding representation. It can also be extended to multiple subspace debiasing.,"We propose a new mechanism to augment a word vector embedding representation that offers improved bias removal while retaining the key information—resulting in improved interpretability of the representation. Rather than removing the information associated with a concept that may induce bias, our proposed method identifies two concept subspaces and makes them orthogonal. The resulting representation has these two concepts uncorrelated. Moreover, because they are orthogonal, one can simply apply a rotation to the basis of the representation so that the resulting subspaces correspond with coordinates. This explicit encoding of concepts to coordinates works because they have been made fully orthogonal, which previous approaches do not achieve. Furthermore, we show that this can be extended to multiple subspaces. As a result, one can choose a subset of concepts to be represented transparently and explicitly, while the others are retained in the mixed but extremely expressive format of the representation.","bias, fairness, ethics, debiasing, static embeddings, pre-trained contextualized embeddings, natural language processing" Communication-Optimal Distributed Graph Clustering under Duplication Models,https://openreview.net/forum?id=KajSampr4_,https://openreview.net/pdf?id=KajSampr4_,,"We consider the problem of clustering graph nodes over large-scale distributed graphs, when graph edges, possibly with duplicates, are observed distributively. Although edge duplicates across different sites appear to be beneficial at first glance, in fact they can make the clustering task more complicated, since their processing potentially requires extra computation and communication. We propose the first communication-optimal algorithms for two well-established communication models, namely the message passing and the blackboard models. 
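A rough sketch of the subspace-orthogonalization idea in the debiasing entry above, assuming 1-D concept directions and a plain Gram-Schmidt step in place of the paper's graded rotation; all data here are random stand-ins.

```python
import numpy as np

# Toy version of making two concept subspaces orthogonal and rotating the
# basis so they land on coordinates (simplified: the actual method applies
# a graded rotation rather than plain Gram-Schmidt).
rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 50))                      # toy word embeddings
v1 = rng.normal(size=50); v1 /= np.linalg.norm(v1)   # concept 1 direction
v2 = rng.normal(size=50); v2 /= np.linalg.norm(v2)   # concept 2 direction

v2_orth = v2 - (v2 @ v1) * v1                        # remove overlap with v1
v2_orth /= np.linalg.norm(v2_orth)
# Build an orthonormal basis whose first two axes are the concept directions
# (up to sign), then express all embeddings in that basis.
Q, _ = np.linalg.qr(np.column_stack([v1, v2_orth, rng.normal(size=(50, 48))]))
E_rot = E @ Q        # coordinates 0 and 1 now explicitly encode the concepts
print(E_rot[:2, :2])
```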
Specifically, given a graph on $n$ nodes with edges observed at $s$ sites, our algorithms achieve communication costs $\tilde{O}(ns)$ and $\tilde{O}(n+s)$ ($\tilde{O}$ hides a polylogarithmic factor), which almost match their lower bounds, $\Omega(ns)$ and $\Omega(n+s)$, in the message passing and the blackboard models, respectively. The communication costs are asymptotically the same as those under non-duplication models, under a mild assumption on edge distribution. Our algorithms can also guarantee clustering quality nearly as good as that of centralizing all edges and then applying any standard clustering algorithm.","Graph Clustering, Distributed Computation, Communication Complexity, Duplication Models" Unpacking Large Language Models with Conceptual Consistency,https://openreview.net/forum?id=YsAbPH2VWKE,https://openreview.net/pdf?id=YsAbPH2VWKE,Conceptual consistency measures whether knowledge of relevant background information is consistent with the ability to answer questions correctly in large language models.,"If a Large Language Model (LLM) answers “yes” to the question “Are mountains tall?” then does it know what a mountain is? Can you rely on it responding correctly or incorrectly to other questions about mountains? The success of Large Language Models (LLMs) indicates they are increasingly able to answer queries like these accurately, but that ability does not necessarily imply a general understanding of concepts relevant to the anchor query. We propose conceptual consistency to measure an LLM’s understanding of relevant concepts. This novel metric measures how consistent a model’s responses to queries about conceptually relevant background knowledge are. To compute it, we extract background knowledge by traversing paths between concepts in a knowledge base and then try to predict the model’s response to the anchor query from the background knowledge. We investigate the performance of current LLMs in a commonsense reasoning setting using the CSQA dataset and the ConceptNet knowledge base. While conceptual consistency, like other metrics, does increase with the scale of the LLM used, we find that popular models do not necessarily have high conceptual consistency. Our analysis also shows significant variation in conceptual consistency across different kinds of relations, concepts, and prompts. This serves as a step toward building models that humans can apply a theory of mind to, and thus interact with intuitively.","Conceptual Consistency, Theory of Mind, Zero Shot Prompting, Large Language Models, Semantic Consistency, Unsupervised Question Answering, Background Knowledge Extraction" Graph in Graph Neural Network,https://openreview.net/forum?id=653nhbKy6yE,https://openreview.net/pdf?id=653nhbKy6yE,,"Most existing Graph Neural Networks (GNNs) suffer from two limitations: (i) they can only process graphs whose vertices are represented by vectors or single values; and (ii) they assume each input graph is independent of others during propagation. In this paper, we propose \textbf{the first GNN model (called Graph in Graph Neural Network (GIG)) that can process graphs whose vertices are also represented by graphs}. Considering that the relationship between different graphs may contain crucial task-related cues, we further propose a GIG graph relationship modelling (GRM) strategy that integrates multiple target graph samples as a global graph, each of whose vertices describes a target graph sample. 
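A toy sketch of the GRM idea just described, treating each graph sample as a vertex of a global similarity graph; `embed_graph` below is a hypothetical stand-in for any graph-level encoder.

```python
import numpy as np

# Build a "global graph" whose vertices are graph samples and whose edges
# connect similar samples (mean-pooled features stand in for a real encoder).
def embed_graph(adj, feats):
    return feats.mean(axis=0)                     # graph-level embedding

def build_global_graph(samples, k=2):
    Z = np.stack([embed_graph(a, x) for a, x in samples])   # (num_graphs, d)
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    sim = (Z @ Z.T) / (norms @ norms.T)           # cosine similarity
    A_global = np.zeros_like(sim)
    for i in range(len(samples)):
        nbrs = np.argsort(-sim[i])[1:k + 1]       # k most similar samples
        A_global[i, nbrs] = 1.0
    return Z, A_global

rng = np.random.default_rng(0)
samples = [(None, rng.normal(size=(rng.integers(4, 8), 16))) for _ in range(5)]
Z, A_global = build_global_graph(samples)
print(A_global)
```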
We then apply the GIG model to jointly process the combined graph samples (i.e., the global graph), where additional task-specific relationship cues among graph samples can be extracted in an end-to-end manner. The experimental results show that the proposed GIG model and the GRM strategy generalize well on various graph analysis tasks, providing new state-of-the-art results on five out of seven benchmark graph datasets. Importantly, not only are its vertex/edge updating functions flexible enough to be customized from different existing GNNs, but it is also robust to different settings. Our code is provided in the supplementary material for reproducibility purposes.","Graph Neural Network, Deep Learning, Sub-graph" LSTM-BASED-AUTO-BI-LSTM for Remaining Useful Life (RUL) Prediction: the first round of test results,https://openreview.net/forum?id=ch_t8OpXaa,https://openreview.net/pdf?id=ch_t8OpXaa,The paper describes preliminary test results of the LSTM-BASED-AUTO-BI-LSTM architecture,"The Remaining Useful Life (RUL) is one of the most critical indicators to detect a component’s failure before it effectively occurs. It can be predicted by historical data or direct data extraction by adopting model-based, data-driven, or hybrid methodologies. Data-driven methods have mainly used Machine Learning (ML) approaches, despite several studies still pointing out different challenges in this sense. For instance, traditional ML methods cannot extract features directly from time series and depend, in some cases, on prior knowledge of the system. In this context, this work proposes a DL-based approach called LSTM-based-AUTO-Bi-LSTM. It couples an LSTM-based autoencoder, which performs feature engineering automatically (instead of manually), with Bidirectional Long Short-Term Memory (Bi-LSTM) to predict RUL. We have tested the model using the Turbofan Engine Degradation Simulation Dataset (FD001), an open dataset. It was generated from the Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) from the Prognostics Center of Excellence (PCoE) of the National Aeronautics and Space Administration (NASA). The objective is to release the first round of analytical results and statistical visualisations of the model application, which will guide us in future improvements.","Remaining Useful Life, Predictive Maintenance, Machine Learning, Deep Learning, Autoencoder" Recurrent Real-valued Neural Autoregressive Density Estimator for Online Density Estimation and Classification of Streaming Data,https://openreview.net/forum?id=K1Z-P0Le0DT,https://openreview.net/pdf?id=K1Z-P0Le0DT,,"In contrast with traditional offline learning, where complete data accessibility is assumed, many modern applications involve processing data in a streaming fashion. This online learning setting raises various challenges, including concept drift, hardware memory constraints, etc. In this paper, we propose the Recurrent Real-valued Neural Autoregressive Density Estimator (RRNADE), a flexible density-based model for online classification and density estimation. RRNADE combines a neural Gaussian mixture density module with a recurrent module. This combination allows RRNADE to exploit possible sequential correlations in the streaming task, which are often ignored in the classical streaming setting where each input is assumed to be independent from the previous ones. We showcase the ability of RRNADE to adapt to concept drifts on synthetic density estimation tasks. 
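A minimal sketch in the spirit of the RRNADE description above, pairing a GRU with a Gaussian-mixture density head; this is our own simplification, not the authors' code.

```python
import torch
import torch.nn as nn

# A GRU summarizes the stream so far; a mixture-density head models the
# density of the next input conditioned on that summary.
class RecurrentMDN(nn.Module):
    def __init__(self, x_dim=1, hidden=32, k=5):
        super().__init__()
        self.gru = nn.GRU(x_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, k * (1 + 2 * x_dim))  # weight, mean, log-std
        self.k, self.x_dim = k, x_dim

    def log_prob(self, x):                          # x: (batch, time, x_dim)
        h, _ = self.gru(x)
        h = torch.cat([torch.zeros_like(h[:, :1]), h[:, :-1]], dim=1)  # causal shift
        logits, mu, log_std = self.head(h).split(
            [self.k, self.k * self.x_dim, self.k * self.x_dim], dim=-1)
        mu = mu.view(*x.shape[:2], self.k, self.x_dim)
        std = log_std.view(*x.shape[:2], self.k, self.x_dim).exp()
        comp = torch.distributions.Normal(mu, std)
        log_px = comp.log_prob(x.unsqueeze(2)).sum(-1)          # (b, t, k)
        return torch.logsumexp(torch.log_softmax(logits, -1) + log_px, dim=-1)

model = RecurrentMDN()
loss = -model.log_prob(torch.randn(8, 20, 1)).mean()  # update per new chunk
loss.backward()
```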
We also apply RRNADE to online classification tasks on both real-world and synthetic datasets and compare it with multiple density-based as well as non-density-based online classification methods. In almost all of these tasks, RRNADE outperforms the other methods. Lastly, we conduct an ablation study demonstrating the complementary benefits of the density and the recurrent modules.","density estimation, online learning, streaming data, classification" Out-of-Distribution Detection and Selective Generation for Conditional Language Models,https://openreview.net/forum?id=kJUS5nD0vPB,https://openreview.net/pdf?id=kJUS5nD0vPB,"A simple, fast, effective method for out-of-distribution detection and selective generation for conditional language models.","Machine learning algorithms typically assume independent and identically distributed (IID) samples in training and at test time. Much work has shown that high-performing ML classifiers can degrade significantly and provide overly-confident, wrong classification predictions, particularly for out-of-distribution (OOD) inputs. Conditional language models (CLMs) are predominantly trained to classify the next token in an output sequence, and may suffer even worse degradation on OOD inputs as the prediction is done auto-regressively over many steps. Furthermore, the space of potential low-quality outputs is larger, as arbitrary text can be generated, and it is important to know when to trust the generated output. We present a highly accurate and lightweight OOD detection method for CLMs, and demonstrate its effectiveness on abstractive summarization and translation. We also show how our method can be used under the common and realistic setting of distribution shift for selective generation (analogous to selective prediction for classification) of high-quality outputs, while automatically abstaining from low-quality ones, enabling safer deployment of generative language models.","Out-of-distribution Detection, Natural Language Generation, Selective Generation, Uncertainty" Layer Grafted Pre-training: Bridging Contrastive Learning And Masked Image Modeling For Better Representations,https://openreview.net/forum?id=jwdqNwyREyh,https://openreview.net/pdf?id=jwdqNwyREyh,We propose a simple yet principled combination of MIM and CL that merges the merits of both.,"Recently, both Contrastive Learning (CL) and Mask Image Modeling (MIM) have demonstrated that self-supervision is a powerful way to learn good representations. However, naively combining them is far from successful. In this paper, we start by making the empirical observation that a naive joint optimization of CL and MIM losses leads to conflicting gradient directions - more severe as the layers go deeper. This motivates us to shift the paradigm from combining losses at the end to choosing the proper learning method per network layer. Inspired by experimental observations, we find that MIM and CL are suited to lower and higher layers, respectively. We hence propose to combine them in a surprisingly simple, ``sequential cascade'' fashion: early layers are first trained under one MIM loss, on top of which later layers continue to be trained under another CL loss. The proposed Layer Grafted Pre-training learns good visual representations that demonstrate superior label efficiency in downstream applications, in particular yielding strong few-shot performance besides linear evaluation. 
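A minimal sketch of the "sequential cascade" just described, with placeholder losses standing in for real MIM and CL objectives; the module choices are illustrative.

```python
import torch
import torch.nn as nn

# Lower blocks are trained under a MIM-style loss first, then frozen while
# upper blocks are trained with a contrastive loss on top of them.
blocks = nn.ModuleList([nn.TransformerEncoderLayer(d_model=192, nhead=3,
                                                   batch_first=True)
                        for _ in range(12)])
lower, upper = blocks[:6], blocks[6:]

def run(layers, x):
    for layer in layers:
        x = layer(x)
    return x

x = torch.randn(4, 197, 192)            # (batch, tokens, dim), ViT-ish shape
# Stage 1: optimize `lower` under a MIM loss (placeholder: reconstruct input).
mim_loss = (run(lower, x) - x).pow(2).mean()
mim_loss.backward()

# Stage 2: freeze the grafted lower layers; train `upper` contrastively.
for p in lower.parameters():
    p.requires_grad = False
feats = run(upper, run(lower, x).detach())   # feed a real CL loss from here
```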
For instance, on ImageNet-1k, Layer Grafted Pre-training yields 65.5% Top-1 accuracy under the 1% few-shot setting with ViT-B/16, improving the MIM and CL baselines by 14.4% and 2.1%, respectively, with no bells and whistles. Our code will be released upon acceptance. ","Mask Image Modeling, Contrastive learning" Structural Adversarial Objectives for Self-Supervised Representation Learning,https://openreview.net/forum?id=99XwOpGYAH,https://openreview.net/pdf?id=99XwOpGYAH,,"Within the framework of generative adversarial networks (GANs), we propose objectives that task the discriminator with additional structural modeling responsibilities. In combination with an efficient smoothness regularizer imposed on the network, these objectives guide the discriminator to learn to extract informative representations, while maintaining a generator capable of sampling from the domain. Specifically, we influence the features produced by the discriminator at two levels of granularity. At a coarse scale, we impose a Gaussian assumption encouraging smoothness and diversified representation, while at a finer scale, we group features forming local clusters. Experiments demonstrate that augmenting GANs with these self-supervised objectives suffices to produce discriminators which, evaluated in terms of representation learning, compete with networks trained by state-of-the-art contrastive approaches. Furthermore, operating within the GAN framework frees our system from the reliance on data augmentation schemes that is prevalent across purely contrastive representation learning methods.", VIMA: General Robot Manipulation with Multimodal Prompts,https://openreview.net/forum?id=hzjQWjPC04A,https://openreview.net/pdf?id=hzjQWjPC04A,"We design a transformer agent, VIMA, that ingests *multimodal* prompts and solves a wide variety of robot manipulation tasks.","Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. Yet task specification in robotics comes in various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. They are often considered different tasks and tackled by specialized models. This work shows that we can express a wide spectrum of robot manipulation tasks with *multimodal prompts*, interleaving textual and visual tokens. We design a transformer-based generalist robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively. To train and evaluate VIMA, we develop a new simulation benchmark with thousands of procedurally-generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and four levels of evaluation protocol for systematic generalization. VIMA achieves strong scalability in both model capacity and data size. It outperforms prior SOTA methods in the hardest zero-shot generalization setting by up to 2.9x task success rate given the same training data. With 10x less training data, VIMA still performs 2.7x better than the top competing approach. 
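A toy sketch of the multimodal-prompt interface just described: text tokens and image tokens are interleaved into one sequence, and a transformer predicts the next action; every module here is an illustrative stand-in, not VIMA's architecture.

```python
import torch
import torch.nn as nn

# Interleave text-token embeddings and projected image features into one
# prompt sequence, then read an action off the last position.
d = 64
txt_emb = nn.Embedding(1000, d)
img_proj = nn.Linear(512, d)               # e.g. features of object crops
body = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), 2)
action_head = nn.Linear(d, 7)              # e.g. a pick-and-place pose

prompt = torch.cat([txt_emb(torch.randint(1000, (1, 5))),      # "put the"
                    img_proj(torch.randn(1, 1, 512)),          # [image token]
                    txt_emb(torch.randint(1000, (1, 3)))], 1)  # "on the left"
action = action_head(body(prompt))[:, -1]  # next motor action
```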
Video demos are available at https://iclr3081.github.io/.","Robot Learning, Foundation Model, Transformer, Language Model, Multi-Task Learning" Discovering Latent Knowledge in Language Models Without Supervision,https://openreview.net/forum?id=ETKGuby0hcs,https://openreview.net/pdf?id=ETKGuby0hcs,,"Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4\% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.","AI safety, AI alignment, truthfulness, large language models, honesty, interpretability" ModReduce: A Multi-Knowledge Distillation Framework with Online Learning,https://openreview.net/forum?id=YuXt90f4Kb7,https://openreview.net/pdf?id=YuXt90f4Kb7,,"Deep neural networks have produced revolutionary results in many applications; however, the computational resources required to use such models are expensive in terms of processing power and memory space. Research has been conducted in the field of knowledge distillation, aiming to enhance the performance of smaller models. Knowledge distillation transfers knowledge from large networks into smaller ones. The literature defines three types of knowledge that can be transferred: response-based, relational-based, and feature-based. To the best of our knowledge, only transferring one or two types of knowledge has been studied before, but transferring all three remains unexplored. In this paper, we propose ModReduce, a framework designed to transfer the three knowledge types in a unified manner using a combination of offline and online knowledge distillation. Moreover, we perform an extensive experimental study of the effects of combining different knowledge types on student models’ generalization and overall performance. Our experiments show that ModReduce outperforms state-of-the-art knowledge distillation methods in terms of Average Relative Improvement.","Knowledge distillation, Deep neural networks, Model Compression, Knowledge transfer, Online Learning" Prefix Conditioning Unifies Language and Label Supervision,https://openreview.net/forum?id=u7ugqk7VBP8,https://openreview.net/pdf?id=u7ugqk7VBP8,A prefix-conditioning technique to train vision-language models with image-caption and image-classification datasets.,"Image-classification datasets have been used to pretrain image recognition models. 
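A compact sketch of the consistency-based probe from the "Discovering Latent Knowledge" entry above, using random tensors in place of real language-model activations and omitting the paper's normalization details.

```python
import torch
import torch.nn as nn

# Learn a direction whose probabilities on a statement and its negation
# behave like p and 1 - p (toy activations stand in for LM hidden states).
acts_pos = torch.randn(256, 512)   # hidden states for "x is true"
acts_neg = torch.randn(256, 512)   # hidden states for "x is false"

probe = nn.Linear(512, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(200):
    p_pos = torch.sigmoid(probe(acts_pos))
    p_neg = torch.sigmoid(probe(acts_neg))
    consistency = (p_pos - (1 - p_neg)).pow(2).mean()   # negation consistency
    confidence = torch.min(p_pos, p_neg).pow(2).mean()  # push away from 0.5
    loss = consistency + confidence
    opt.zero_grad(); loss.backward(); opt.step()
```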
Recently, web-scale image-caption datasets have emerged as a powerful pretraining alternative. Image-caption datasets are more ``open-domain'', containing a wider variety of scene types and vocabulary words than traditional classification datasets, and models trained on these datasets have demonstrated strong performance on few- and zero-shot recognition tasks. When naively unifying image-classification and image-caption datasets, we show that such dataset biases negatively affect pre-training: they reduce the generalizability of learned representations and thus jeopardize zero-shot performance, since unification can tailor the model to the classification dataset, making it vulnerable to distribution shift. In this work, we address the problem by disentangling the dataset bias using prefix tokens that inform a language encoder of the type of the input dataset (e.g., image-classification or caption) at training time. This approach allows the language encoder to share knowledge from the two datasets as well as to switch between modes of feature extraction, i.e., a mode tailored to the image-classification dataset or to the image-caption dataset, where we use the image-caption mode in zero-shot evaluation. Our method is generic and can be easily integrated into existing VL pre-training objectives such as CLIP or UniCL. In experiments, we show that this simple technique improves the performance in zero-shot image recognition accuracy and robustness to image-level distribution shift.","Vision-language contrastive learning, Zero-shot recognition" Defending against Reconstruction attacks using Rényi Differential Privacy,https://openreview.net/forum?id=e0GcQ9l4Dh,https://openreview.net/pdf?id=e0GcQ9l4Dh,Quantify the information leakage using better reconstruction bounds backed by experimental testing,"Reconstruction attacks allow an adversary to regenerate data samples of the training set using access to only a trained model. It has been recently shown that simple heuristics can reconstruct data samples from language models, making this threat scenario an important aspect of model release. Differential privacy is a known solution to such attacks, but is often used with a large privacy budget (epsilon > 8) which does not translate to meaningful guarantees. In this paper we show that, for the same mechanism, we can derive privacy guarantees for reconstruction attacks that are better than the traditional ones from the literature. In particular, we show that larger privacy budgets do not provably protect against membership inference, but can still protect against extraction of rare secrets. We design a method to efficiently run reconstruction attacks with lazy sampling and empirically show that we can surface at-risk training samples from non-private language models. We show experimentally that our guarantees hold on real-life language models trained with differential privacy for difficult scenarios, including GPT-2 finetuned on Wikitext-103.","Rényi Differential Privacy, Reconstruction Attacks, Information Theory" Diffusion Adversarial Representation Learning for Self-supervised Vessel Segmentation,https://openreview.net/forum?id=H0gdPxSwkPb,https://openreview.net/pdf?id=H0gdPxSwkPb,,"Vessel segmentation in medical images is an important task in the diagnosis of vascular diseases and in therapy planning. 
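A toy sketch of the prefix-conditioning idea above, where a learned prefix embedding tells the text encoder which dataset type an example comes from; every module here is an illustrative stand-in, not the paper's model.

```python
import torch
import torch.nn as nn

# A per-dataset-type prefix token is prepended before encoding; zero-shot
# evaluation always uses the caption-mode prefix.
vocab, d = 1000, 64
tok_emb = nn.Embedding(vocab, d)
prefix_emb = nn.Embedding(2, d)            # 0 = classification, 1 = caption
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), 2)

def encode_text(token_ids, dataset_type):
    prefix = prefix_emb(torch.full((token_ids.shape[0], 1), dataset_type))
    x = torch.cat([prefix, tok_emb(token_ids)], dim=1)
    return encoder(x)[:, 0]                # pooled text feature

label_feat = encode_text(torch.randint(vocab, (8, 5)), dataset_type=0)
caption_feat = encode_text(torch.randint(vocab, (8, 20)), dataset_type=1)
# Zero-shot evaluation would use dataset_type=1 (caption mode) throughout.
```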
Although learning-based segmentation approaches have been extensively studied, a large number of ground-truth labels is required by supervised methods, and confusing background structures make it hard for neural networks to segment vessels in an unsupervised manner. To address this, here we introduce a novel diffusion adversarial representation learning (DARL) model that leverages a denoising diffusion probabilistic model with adversarial learning, and apply it to vessel segmentation. In particular, for self-supervised vessel segmentation, DARL learns the background image distribution using a diffusion module, which lets a generation module effectively provide vessel representations. Also, by adversarial learning based on the proposed switchable spatially-adaptive denormalization, our model generates synthetic fake vessel images as well as vessel segmentation masks, which further helps the model capture vessel-relevant semantic information. Once the proposed model is trained, it generates segmentation masks in a single step and can be applied to general vascular structure segmentation in coronary angiography and retinal images. Experimental results on various datasets show that our method significantly outperforms existing unsupervised and self-supervised methods in vessel segmentation.","Diffusion model, Adversarial learning, Self-supervised learning, Vessel segmentation" Reconciling Security and Communication Efficiency in Federated Learning,https://openreview.net/forum?id=GUMLIArCIwB,https://openreview.net/pdf?id=GUMLIArCIwB,Uplink communication efficiency with a high privacy and security bar,"Cross-device Federated Learning is an increasingly popular machine learning setting to train a model by leveraging a large population of client devices with high privacy and security guarantees. However, communication efficiency remains a major bottleneck when scaling federated learning to production environments, particularly due to bandwidth constraints during uplink communication. In this paper, we formalize and address the problem of compressing client-to-server model updates under the Secure Aggregation primitive, a core component of Federated Learning pipelines that allows the server to aggregate the client updates without accessing them individually. In particular, we adapt standard scalar quantization and pruning methods to Secure Aggregation and propose Secure Indexing, a variant of Secure Aggregation that supports quantization for extreme compression. We establish state-of-the-art results on LEAF benchmarks in a secure Federated Learning setup with up to 40x compression in uplink communication with no meaningful loss in utility compared to uncompressed baselines.","Federated Learning, Secure Aggregation, Compression, Efficiency, Product Quantization" Semantic Image Manipulation with Background-guided Internal Learning,https://openreview.net/forum?id=1z9VTrxCgf,https://openreview.net/pdf?id=1z9VTrxCgf,,"Image manipulation has attracted a lot of interest due to its wide range of applications. Prior work modifies images either through low-level manipulation, such as image inpainting or manual edits via paintbrushes and scribbles, or through high-level manipulation, employing deep generative networks to output an image conditioned on high-level semantic input. In this study, we propose Semantic Image Manipulation with Background-guided Internal Learning (SIMBIL), which combines high-level and low-level manipulation. 
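A small sketch of why uplink compression must be adapted to Secure Aggregation, as in the entry above: the server only ever sees the sum of client messages, so clients send quantized integer updates that remain meaningful after summation (parameters are illustrative, not the paper's scheme).

```python
import numpy as np

# Clients quantize their updates to integers; because quantization commutes
# with summation here, the server can dequantize the aggregate alone.
def quantize(update, scale=128, clip=1.0):
    clipped = np.clip(update, -clip, clip)
    return np.round(clipped * scale).astype(np.int64)  # integers sum safely

rng = np.random.default_rng(0)
clients = [rng.normal(size=10) * 0.1 for _ in range(5)]
masked_sum = sum(quantize(u) for u in clients)  # SecAgg reveals only this sum
avg_update = masked_sum / (128 * len(clients))  # server dequantizes the sum
print(avg_update)
```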
Specifically, users can edit an image at the semantic level by applying changes to the scene graph. Then our model manipulates the image at the pixel level according to the modified scene graph. There are two major advantages of our approach. First, high-level manipulation requires less manual effort from the user compared to manipulating raw image pixels. Second, our low-level internal learning approach is scalable to images of various sizes without reliance on external visual datasets for training. We outperform the state-of-the-art in a quantitative and qualitative evaluation on the CLEVR and Visual Genome datasets. Experiments show an improvement of around 8 SSIM (RoI) points on CLEVR and around 25% in user evaluation accuracy on Visual Genome, demonstrating the effectiveness of our approach.","semantic image manipulation, internal learning, scene-graph driven image editing" Pretraining the Vision Transformer using self-supervised methods for vision based Deep Reinforcement Learning,https://openreview.net/forum?id=CEhy-i7_KfC,https://openreview.net/pdf?id=CEhy-i7_KfC,,"The Vision Transformer architecture has been shown to be competitive in the computer vision (CV) space, where it has dethroned convolution-based networks in several benchmarks. Nevertheless, Convolutional Neural Networks (CNNs) remain the preferred architecture for the representation module in Reinforcement Learning. In this work, we study pretraining a Vision Transformer using several state-of-the-art self-supervised methods and assess data-efficiency gains from this training framework. We propose a new self-supervised learning method called TOV-VICReg that extends VICReg to better capture temporal relations between observations by adding a temporal order verification task. Furthermore, we evaluate the resulting encoders on Atari games in a sample-efficiency regime. Our results show that the vision transformer, when pretrained with TOV-VICReg, outperforms the other self-supervised methods but still struggles to overcome a CNN. Nevertheless, we were able to outperform a CNN in two of the ten games in a 100k-step evaluation. Ultimately, we believe that such approaches in Deep Reinforcement Learning (DRL) might be the key to achieving new levels of performance as seen in natural language processing and computer vision.","Deep Reinforcement Learning, Transformers, Self-Supervised Learning, Pre-training" Noise Injection Node Regularization for Robust Learning,https://openreview.net/forum?id=gmSZ-GPNY6,https://openreview.net/pdf?id=gmSZ-GPNY6,"We provide analytical and empirical evidence indicating that training using a large amount of adaptive noise injection results in an emergent regularization scheme, improving robustness against a number of tests.","We introduce Noise Injection Node Regularization (NINR), a method of injecting structured noise into Deep Neural Networks (DNN) during the training stage, resulting in an emergent regularizing effect. We present theoretical and empirical evidence for substantial improvement in robustness against various test data perturbations for feed-forward DNNs when trained under NINR. The novelty in our approach comes from the interplay of adaptive noise injection and initialization conditions such that noise is the dominant driver of dynamics at the start of training. 
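A minimal reading of the noise-injection-node idea described above, sketched as extra input nodes that carry pure noise during training; this is not the authors' exact scheme.

```python
import torch
import torch.nn as nn

# Extra input nodes carry noise during training; the first layer must learn
# to ignore or exploit them, which acts as an emergent regularizer.
class NoiseInjectedMLP(nn.Module):
    def __init__(self, in_dim=784, noise_dim=64, hidden=256, classes=10):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(nn.Linear(in_dim + noise_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, classes))

    def forward(self, x, noise_scale=1.0):
        noise = noise_scale * torch.randn(x.shape[0], self.noise_dim)
        return self.net(torch.cat([x, noise], dim=1))

model = NoiseInjectedMLP()
logits = model(torch.randn(32, 784))                        # large noise in training
logits_eval = model(torch.randn(32, 784), noise_scale=0.0)  # none at eval
```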
As it simply requires the addition of external nodes without altering the existing network structure or optimization algorithms, this method can be easily incorporated into many standard problem specifications. We find improved stability against a number of data perturbations, including domain shifts, with the most dramatic improvement obtained for unstructured noise, where our technique in some cases outperforms existing methods such as dropout or $L_2$ regularization. We further show that desirable generalization properties on clean data are generally maintained.","Regularization, Input corruption, Noise Injection, Deep Learning" "Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size",https://openreview.net/forum?id=V5NFgHyNBI8,https://openreview.net/pdf?id=V5NFgHyNBI8,Large-batch optimization for SAC-N allows reducing the size of the Q-ensemble and improves convergence time by 2.5x on average,"Training large neural networks is known to be time-consuming, with learning durations that may stretch to days or weeks. To address this problem, the approach of large-batch optimization was introduced, demonstrating that scaling mini-batch sizes with appropriate learning rate adjustments may speed up the training process by orders of magnitude. While long training time was not typically a major issue for model-free deep offline RL algorithms, recently introduced Q-ensemble methods achieving state-of-the-art performance made this issue more relevant, notably extending the training duration. In this work, we demonstrate how large-batch optimization, typically overlooked in the deep offline RL community, can benefit this class of methods. We show that simply scaling the mini-batch size and naively adjusting the learning rate allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time, effectively shortening training durations by 2.5x on average.","Offline Reinforcement Learning, Q-Ensemble, Large Batch Optimization, Ensemble Based Reinforcement Learning" Efficient Edge Inference by Selective Query,https://openreview.net/forum?id=jpR98ZdIm2q,https://openreview.net/pdf?id=jpR98ZdIm2q,"A low-complexity (edge) model performs poorly on large-scale tasks; for efficient inference, it must learn to identify examples that benefit from querying, i.e., both hard-to-classify examples and those that the cloud model would misclassify.","Edge devices provide inference on predictive tasks to many end-users. However, deploying deep neural networks that achieve state-of-the-art accuracy on these devices is infeasible due to edge resource constraints. Nevertheless, cloud-only processing, the de-facto standard, is also problematic, since uploading large amounts of data imposes severe communication bottlenecks. We propose a novel end-to-end hybrid learning framework that allows the edge to selectively query only those hard examples that the cloud can classify correctly. Our framework optimizes over neural architectures and trains edge predictors and routing models so that the overall accuracy remains high while minimizing the overall latency. Training a hybrid learner is difficult since we lack annotations of hard edge-examples. We introduce a novel proxy supervision in this context and show that our method adapts seamlessly and near optimally across different latency regimes. 
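A toy sketch of the selective-query pattern described in the edge-inference entry above; the edge model, cloud model, and router below are random stand-ins for trained components.

```python
import numpy as np

# The edge answers when its router score is confident; otherwise it pays
# the latency cost of querying the cloud model on the hard example.
def hybrid_predict(x, edge_model, router, cloud_model, threshold=0.5):
    if router(x) < threshold:          # router says the edge is reliable
        return edge_model(x), "edge"
    return cloud_model(x), "cloud"     # hard example: query the cloud

rng = np.random.default_rng(0)
edge_model = lambda x: int(x.sum() > 0)    # weak, cheap predictor
cloud_model = lambda x: int(x.mean() > 0)  # strong, expensive predictor
router = lambda x: rng.random()            # learned in the real system
preds = [hybrid_predict(rng.normal(size=8), edge_model, router, cloud_model)
         for _ in range(5)]
print(preds)
```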
On the ImageNet dataset, our proposed method deployed on a micro-controller unit exhibits a $25\%$ reduction in latency compared to cloud-only processing while suffering no excess loss.","efficient edge inference, low-capacity model, large scale prediction, dynamic neural networks, adaptive neural networks" Learning Intuitive Policies Using Action Features,https://openreview.net/forum?id=nYOlSqq9nv2,https://openreview.net/pdf?id=nYOlSqq9nv2,We show that certain network architectures encourage reinforcement learning algorithms to respect semantic relationships between actions and observations.,"An unaddressed challenge in multi-agent coordination is to enable AI agents to exploit the semantic relationships between the features of actions and the features of observations. Humans take advantage of these relationships in highly intuitive ways. For instance, in the absence of a shared language, we might point to the object we desire or hold up our fingers to indicate how many objects we want. To address this challenge, we investigate the effect of network architecture on the propensity of learning algorithms to exploit these semantic relationships. In a procedurally generated coordination task, we find that attention-based architectures that jointly process a featurized representation of observations and actions have a better inductive bias for zero-shot coordination. Through fine-grained evaluation and scenario analysis, we show that the resulting policies are human-interpretable. Moreover, such agents coordinate with people without training on any human data. ","multi-agent coordination, attention, inductive bias" Differentially Private $L_2$-Heavy Hitters in the Sliding Window Model,https://openreview.net/forum?id=3UHoYrglYkG,https://openreview.net/pdf?id=3UHoYrglYkG,,"The data management of large companies often prioritizes more recent data as a source of higher-accuracy predictions than outdated data. For example, the Facebook data policy retains user search histories for $6$ months while the Google data retention policy states that browser information may be stored for up to $9$ months. These policies are captured by the sliding window model, in which only the most recent $W$ statistics form the underlying dataset. In this paper, we consider the problem of privately releasing the $L_2$-heavy hitters in the sliding window model, which include $L_p$-heavy hitters for $p\le 2$ and in some sense are the strongest possible guarantees that can be achieved using polylogarithmic space, but cannot be handled by existing techniques due to the sub-additivity of the $L_2$ norm. Moreover, existing non-private sliding window algorithms use the smooth histogram framework, which has high sensitivity. To overcome these barriers, we introduce the first differentially private algorithm for $L_2$-heavy hitters in the sliding window model by initiating a number of $L_2$-heavy hitter algorithms across the stream with a significantly lower threshold. Similarly, we augment the algorithms with an approximate frequency tracking algorithm with significantly higher accuracy. We then use smooth sensitivity and statistical distance arguments to show that we can add noise proportional to an estimate of the $L_2$ norm. 
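An illustration (not the paper's mechanism, which relies on smooth sensitivity) of the final step just described: releasing heavy coordinates with noise whose scale tracks an estimate of the $L_2$ norm rather than a worst-case bound.

```python
import numpy as np

# Toy frequency vector with a few L2-heavy coordinates; threshold and noise
# are both calibrated to a (sketch-style) estimate of ||f||_2.
rng = np.random.default_rng(0)
f = np.zeros(1000)
f[:3] = [900.0, 500.0, 300.0]                 # heavy coordinates
f[3:] = rng.integers(0, 5, size=997)          # light tail

l2_est = np.linalg.norm(f) * (1 + 0.05 * rng.standard_normal())  # estimate
eps = 1.0
heavy = np.flatnonzero(f >= 0.1 * l2_est)     # threshold relative to ||f||_2
release = {int(i): f[i] + rng.laplace(scale=0.01 * l2_est / eps)
           for i in heavy}                    # noisy approximate frequencies
print(release)
```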
To the best of our knowledge, our techniques are the first to privately release statistics related to a sub-additive function in the sliding window model, and may be of independent interest for future differentially private algorithm design in the sliding window model.","differential privacy, heavy hitters, streaming algorithms, sliding window model" Scaling Convex Neural Networks with Burer-Monteiro Factorization,https://openreview.net/forum?id=rnN4pHyf6jD,https://openreview.net/pdf?id=rnN4pHyf6jD,"We apply the Burer-Monteiro factorization to two-layer ReLU (fully-connected, convolutional, self-attention) neural networks by leveraging their implicit convexity, and provide insights into stationary points and local optima of these networks.","Recently, it has been demonstrated that a wide variety of (non)linear two-layer neural networks (such as two-layer perceptrons, convolutional networks, and self-attention) can be posed as equivalent convex optimization problems, with an induced regularizer which encourages low rank. However, this regularizer becomes prohibitively expensive to compute at moderate scales, impeding the training of convex neural networks. To this end, we propose applying the Burer-Monteiro factorization to convex neural networks, which for the first time enables a Burer-Monteiro perspective on neural networks with non-linearities. This factorization leads to an equivalent yet computationally tractable non-convex alternative with no spurious local minima. We develop a novel relative optimality bound for stationary points of the Burer-Monteiro factorization, thereby providing verifiable conditions under which any stationary point is a global optimum. Further, for the first time, we show that linear self-attention with sufficiently many heads has no spurious local minima. Our experiments demonstrate the utility and implications of the novel relative optimality bound for stationary points of the Burer-Monteiro factorization. ","burer-monteiro, convex optimization, neural networks, stationary points, global optima, relu activation" Human-level Atari 200x faster,https://openreview.net/forum?id=JtC6yOHRoJJ,https://openreview.net/pdf?id=JtC6yOHRoJJ,"We propose an RL agent 'MEME' that achieves human-level performance on all 57 Atari games within 390M environment frames, only 1/200 of the experience required by Agent57.","The task of building general agents that perform well over a wide range of tasks has been an important goal in reinforcement learning since its inception. The problem has been the subject of a large body of work, with performance frequently measured by observing scores over the wide range of environments contained in the Atari 57 benchmark. Agent57 was the first agent to surpass the human benchmark on all 57 games, but this came at the cost of poor data-efficiency, requiring nearly 80 billion frames of experience. Taking Agent57 as a starting point, we employ a diverse set of strategies to achieve a 200-fold reduction of experience needed to outperform the human baseline, within our novel agent MEME. We investigate a range of instabilities and bottlenecks we encountered while reducing the data regime, and propose effective solutions to build a more robust and efficient agent. We also demonstrate competitive performance with high-performing methods such as Muesli and MuZero. 
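A small sketch of the Burer-Monteiro idea from the entry above: replace a weight matrix W under a low-rank-inducing nuclear-norm penalty by a factorization W = U V^T with Frobenius-norm penalties, which matches the nuclear norm at the optimum when the factor rank is large enough; sizes and data are illustrative.

```python
import torch

# min over U, V of fit + (lam/2)(||U||_F^2 + ||V||_F^2) equals
# min over W of fit + lam * ||W||_* once rank(W) <= r is feasible.
d, m, r = 64, 32, 8
U = torch.randn(d, r, requires_grad=True)
V = torch.randn(m, r, requires_grad=True)
X, Y = torch.randn(128, d), torch.randn(128, m)

opt = torch.optim.Adam([U, V], lr=1e-2)
for _ in range(100):
    W = U @ V.T                                    # factorized weights
    loss = ((X @ W - Y).pow(2).mean()
            + 0.01 * 0.5 * (U.pow(2).sum() + V.pow(2).sum()))
    opt.zero_grad(); loss.backward(); opt.step()
```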
Our contributions aim to achieve faster propagation of learning signals related to rare events, stabilize learning under differing value scales, improve the neural network architecture, and make updates more robust under a rapidly-changing policy.","Reinforcement Learning, Data-efficiency, Exploration, Off-policy" Wide Graph Neural Network,https://openreview.net/forum?id=Ih0fKoIUyEh,https://openreview.net/pdf?id=Ih0fKoIUyEh,"This paper proposes a unified view to understand GNNs, and it motivates a new model called wide graph neural network.","Usually, graph neural networks (GNNs) suffer from several problems, e.g., over-smoothing (in the spatial domain), poor flexibility (in the spectral domain), and low performance on heterophily (in both domains). In this paper, we provide a new GNN framework, called Wide Graph Neural Networks (WGNN), to solve these problems. It is motivated by our proposed unified view of GNNs from the perspective of dictionary learning. In light of this view, we formulate graph learning in GNNs as learning representations from dictionaries, where the fixed graph information is regarded as the dictionary and the trainable parameters are representations. Under this view, the dictionaries of spatial GNNs encode adjacency matrix multiplication, while spectral ones sum its polynomials. In contrast, WGNN directly concatenates all polynomials as the dictionary, where each polynomial is a sub-dictionary. Beyond polynomials, WGNN allows sub-dictionaries of arbitrary size, for instance, the principal components of the adjacency matrix. This wide concatenation structure is effective at avoiding over-smoothing and promoting flexibility, while the supplement of principal components can significantly improve the representation of heterophilic graphs. We provide a detailed theoretical analysis and conduct extensive experiments on eight datasets to demonstrate the superiority of the proposed WGNN. ","Graph neural networks, representation learning, dictionary learning" Taming the Long Tail of Deep Probabilistic Forecasting,https://openreview.net/forum?id=fvvcpsEl3Z6,https://openreview.net/pdf?id=fvvcpsEl3Z6,We propose novel loss augmentation approaches to mitigate long tail in error of deep probabilistic forecasting and achieve significantly better results than the base model and baseline methods.,"Deep probabilistic forecasting is gaining attention in numerous applications from weather prognosis, through electricity consumption estimation, to autonomous vehicle trajectory prediction. However, existing approaches focus on improvements on average metrics without addressing the long-tailed distribution of errors. In this work, we observe long-tail behavior in the error distribution of state-of-the-art deep learning methods for probabilistic forecasting. We present two loss augmentation methods to reduce tailedness: Pareto Loss and Kurtosis Loss. Both methods are related to the concept of moments, which measure the shape of a distribution. Kurtosis Loss is based on a symmetric measure, the fourth moment. Pareto Loss is based on an asymmetric measure of right tailedness and models loss using a Generalized Pareto Distribution (GPD). 
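A sketch of a kurtosis-style loss augmentation as just described, penalizing the fourth standardized moment of per-sample errors; this is our reading of the idea, not the authors' code.

```python
import torch

# Add the kurtosis of the per-sample error distribution to the mean loss,
# discouraging heavy tails in the errors.
def kurtosis_loss(per_sample_err, lam=0.1):
    mu = per_sample_err.mean()
    sigma = per_sample_err.std() + 1e-8
    kurt = (((per_sample_err - mu) / sigma) ** 4).mean()  # fourth moment
    return per_sample_err.mean() + lam * kurt

pred = torch.randn(256, requires_grad=True)
target = torch.randn(256)
err = (pred - target).pow(2)          # per-sample squared errors
loss = kurtosis_loss(err)
loss.backward()
```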
We demonstrate the performance of our methods on several real-world datasets, including time series and spatiotemporal trajectories, achieving significant improvements on tail error metrics, while maintaining and even improving upon average error metrics.","Deep probabilistic forecasting, Long tail error, Time Series forecasting, Trajectory forecasting" Approximate Conditional Coverage via Neural Model Approximations,https://openreview.net/forum?id=ip0ENxmhIja,https://openreview.net/pdf?id=ip0ENxmhIja,"We construct prediction sets over Transformer networks, via KNN-based approximations, obtaining reliable assumption- and parameter-light approximate conditional coverage.","We propose a new approach for constructing prediction sets for Transformer networks via the strong signals for prediction reliability from KNN-based approximations. This enables a data-driven partitioning of the high-dimensional feature space and a new Inductive Venn Predictor for calibration, the Venn-ADMIT Predictor. Our approach obtains approximate conditional coverage more closely than recent work proposing adaptive and localized conformal score functions for deep networks. We analyze coverage on several representative natural language processing classification tasks, including class-imbalanced and distribution-shifted settings.","distribution-free uncertainty quantification, split-conformal prediction sets, Venn Predictors" AUTOJOIN: EFFICIENT ADVERSARIAL TRAINING FOR ROBUST MANEUVERING VIA DENOISING AUTOENCODER AND JOINT LEARNING,https://openreview.net/forum?id=8thVleggPV_,https://openreview.net/pdf?id=8thVleggPV_,,"As a result of increasingly adopted machine learning algorithms and ubiquitous sensors, many ‘perception-to-control’ systems have been developed and deployed. For these systems to be trustworthy, we need to improve their robustness, with adversarial training being one approach. We propose a gradient-free adversarial training technique, called AutoJoin, which is a simple yet effective and efficient approach for producing robust models for image-based maneuvering. In tests on over 5M perturbed and clean images against other SOTA methods, AutoJoin achieves significant performance increases of up to 40% under gradient-free perturbations while improving clean performance by up to 300%. Regarding efficiency, AutoJoin demonstrates strong advantages over other SOTA techniques by saving up to 83% of time per training epoch and 90% of training data. Although not its focus, AutoJoin also demonstrates a strong ability to defend against gradient-based attacks. The core idea of AutoJoin is to attach a decoder to the original regression model, creating a denoising autoencoder within the architecture. This architecture allows the tasks ‘maneuvering’ and ‘denoising sensor input’ to be learned jointly, reinforcing each other’s performance.","autonomous driving, machine learning, robust training" Private Data Stream Analysis for Universal Symmetric Norm Estimation,https://openreview.net/forum?id=zGy_wqpRGTa,https://openreview.net/pdf?id=zGy_wqpRGTa,We provide a differentially private algorithm that approximates an arbitrary number of symmetric norms on a data stream,"We study how to release summary statistics on a data stream subject to the constraint of differential privacy. 
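A minimal sketch of the AutoJoin-style architecture described above, with a decoder attached to a regression backbone so driving and denoising are learned jointly; layer sizes and the noise model are illustrative.

```python
import torch
import torch.nn as nn

# Shared encoder feeds both a steering-regression head and a decoder that
# reconstructs the clean image from the perturbed one.
enc = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
steer = nn.Sequential(nn.Flatten(), nn.Linear(32 * 16 * 16, 1))
dec = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1),
                    nn.ReLU(),
                    nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1))

clean = torch.rand(8, 3, 64, 64)
noisy = (clean + 0.1 * torch.randn_like(clean)).clamp(0, 1)
angle = torch.randn(8, 1)                     # steering labels (toy)

z = enc(noisy)
loss = (steer(z) - angle).pow(2).mean() + (dec(z) - clean).pow(2).mean()
loss.backward()                               # joint maneuvering + denoising
```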
In particular, we focus on releasing the family of \emph{symmetric norms}, which are invariant under sign-flips and coordinate-wise permutations on an input data stream and include $L_p$ norms, $k$-support norms, top-$k$ norms, and the box norm as special cases. Although it may be possible to design and analyze a separate mechanism for each symmetric norm, we propose a general parametrizable framework that differentially privately releases a number of sufficient statistics from which approximations of all symmetric norms can be simultaneously computed. Our framework partitions the coordinates of the underlying frequency vector into different levels based on their magnitude, and releases approximate frequencies for the ``heavy'' coordinates and approximate level sizes for the ``light'' coordinates in important levels. Surprisingly, our mechanism allows for the release of an \emph{arbitrary} number of symmetric norm approximations without any overhead or additional loss in privacy. Moreover, our mechanism permits $(1+\alpha)$-approximation to each of the symmetric norms and can be implemented using sublinear space in the streaming model for many regimes of the accuracy and privacy parameters.","differential privacy, norm estimation" StarGraph: Knowledge Representation Learning based on Incomplete Two-hop Subgraph,https://openreview.net/forum?id=mTOB_VK_BWk,https://openreview.net/pdf?id=mTOB_VK_BWk,knowledge representation learning based on incomplete local subgraphs,"Conventional representation learning algorithms for knowledge graphs (KG) map each entity to a unique embedding vector, ignoring the rich information contained in the neighborhood. We propose a method named StarGraph, which gives a novel way to utilize neighborhood information for large-scale knowledge graphs to obtain entity representations. An incomplete two-hop neighborhood subgraph for each target node is first generated, then processed by a modified self-attention network to obtain the entity representation, which is used to replace the entity embedding in conventional methods. We achieved SOTA performance on ogbl-wikikg2 and obtained competitive results on fb15k-237. The experimental results prove that StarGraph is parameter-efficient, and the improvement on ogbl-wikikg2 demonstrates its effectiveness for representation learning on large-scale knowledge graphs.","Knowledge Representation Learning, Knowledge Graph Embedding, Knowledge Graph, Self-Attention Network" Temporal Domain Generalization with Drift-Aware Dynamic Neural Networks,https://openreview.net/forum?id=sWOsRj4nT1n,https://openreview.net/pdf?id=sWOsRj4nT1n,A novel framework is proposed to dynamically model how neural networks evolve across domains for characterizing the distribution drift across time in temporal domain generalization.,"Temporal domain generalization is a promising yet extremely challenging area where the goal is to learn models under temporally changing data distributions and generalize to unseen data distributions following the trends of the change. The advancement of this area is challenged by: 1) characterizing data distribution drift and its impacts on models, 2) expressiveness in tracking the model dynamics, and 3) theoretical guarantees on the performance. To address them, we propose a Temporal Domain Generalization with Drift-Aware Dynamic Neural Network (DRAIN) framework. 
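A toy sketch of the incomplete two-hop subgraph sampling described in the StarGraph entry above; sampling sizes are illustrative, and the downstream self-attention encoder is omitted.

```python
import numpy as np

# Keep only a few one-hop neighbors and a few of their neighbors, giving a
# small "star" context per target entity instead of the full neighborhood.
def two_hop_subgraph(adj, node, n1=4, n2=2, seed=0):
    rng = np.random.default_rng(seed)
    hop1 = rng.permutation(adj[node])[:n1]
    hop2 = {int(h1): [int(v) for v in rng.permutation(adj[h1])[:n2]]
            for h1 in hop1}
    return {"anchor": node, "hop1": [int(h) for h in hop1], "hop2": hop2}

adj = {0: [1, 2, 3, 4, 5], 1: [0, 6], 2: [0, 7, 8], 3: [0], 4: [0, 9],
       5: [0], 6: [1], 7: [2], 8: [2], 9: [4]}
print(two_hop_subgraph(adj, 0))   # tokens fed to the subgraph encoder
```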
Specifically, we formulate the problem within a Bayesian framework that jointly models the relation between data and model dynamics. We then build a recurrent graph generation scenario to characterize the dynamic graph-structured neural networks learned across different time points. It captures the temporal drift of model parameters and data distributions and can predict models in the future without the presence of future data. In addition, we explore theoretical guarantees of the model performance under the challenging temporal DG setting and provide theoretical analysis, including uncertainty and generalization error. Finally, extensive experiments on several real-world benchmarks with temporal drift demonstrate the proposed method’s effectiveness and efficiency.","Domain Generalization, Sequential Learning Model, Dynamic Neural Network" Leveraging Incompatibility to Defend Against Backdoor Poisoning,https://openreview.net/forum?id=mkJm5Uy4HrQ,https://openreview.net/pdf?id=mkJm5Uy4HrQ,"We observe that training with poisoned data does not improve clean accuracy (and vice-versa), and develop a defense that exploits this property.","As deep learning datasets grow larger and less curated, backdoor data poisoning attacks, which inject malicious poisoned data into the training dataset, have drawn increasing attention in both academia and industry. We identify an incompatibility property of the interaction of clean and poisoned data with the training algorithm, specifically that including poisoned data in the training dataset does not improve model accuracy on clean data and vice-versa. Leveraging this property, we develop an algorithm that iteratively refines subsets of the poisoned dataset to obtain subsets that concentrate around either clean or poisoned data. The result is a partition of the original dataset into disjoint subsets, for each of which we train a corresponding model. A voting algorithm over these models identifies the clean data within the larger poisoned dataset. We empirically evaluate our approach and technique for image classification tasks over the GTSRB and CIFAR-10 datasets. The experimental results show that prior dirty-label and clean-label backdoor attacks in the literature produce poisoned datasets that exhibit behavior consistent with the incompatibility property. The results also show that our defense reduces the attack success rate below 1% on 134 out of 165 scenarios in this setting, with only a 2% drop in clean accuracy on CIFAR-10 (and negligible impact on GTSRB).","data poisoning, defense" Towards Representative Subset Selection for Self-Supervised Speech Recognition,https://openreview.net/forum?id=4wXotzMJ7Wo,https://openreview.net/pdf?id=4wXotzMJ7Wo,A new data subset selection method for self-supervised speech recognition that performs better than existing dataset pruning strategies.,"Self-supervised speech recognition models require considerable labeled training data for learning high-fidelity representations for Automatic Speech Recognition (ASR), which is computationally demanding and time-consuming, thereby hindering the usage of these models in resource-constrained environments. We consider the task of identifying an optimal subset of data to train self-supervised speech models for ASR. We make a surprising observation that the dataset pruning strategies used in vision tasks for sampling the most informative examples do not perform better than random subset selection on the task of fine-tuning self-supervised ASR. 
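A toy sketch of the partition-and-vote defense from the "Leveraging Incompatibility" entry above, with trivial least-squares models standing in for real training; it only illustrates the voting step, not the iterative subset refinement.

```python
import numpy as np

# Train one model per disjoint data subset, then keep points that a
# majority of the models label consistently with the dataset's label.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(600, 10)), rng.integers(0, 2, 600)

subsets = np.array_split(rng.permutation(600), 3)
models = []
for idx in subsets:
    w = np.linalg.lstsq(X[idx], y[idx].astype(float), rcond=None)[0]
    models.append(w)                        # toy linear "model" per subset

votes = np.stack([(X @ w > 0.5).astype(int) for w in models])
agree = (votes == y).sum(axis=0)            # models agreeing with the label
likely_clean = agree >= 2                   # majority vote keeps the point
print(likely_clean.mean())
```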
We then present the COWERAGE algorithm for better subset selection in self-supervised ASR, which is based on our finding that ensuring the coverage of examples based on training Word Error Rate (WER) in the early training epochs leads to better generalization performance. Extensive experiments on the wav2vec 2.0 model and the TIMIT, Librispeech, and LJSpeech datasets show the effectiveness of COWERAGE, with up to 17% absolute WER improvement over existing dataset pruning methods and random sampling. We also demonstrate that the coverage of training instances in terms of WER ensures the inclusion of phonemically diverse examples, which leads to better test accuracy in self-supervised speech recognition models.","subset selection, self-supervised speech recognition, active learning, data pruning" Enforcing Hard Constraints with Soft Barriers: Safe Reinforcement Learning in Unknown Stochastic Environments,https://openreview.net/forum?id=lu6qxw6-QEV,https://openreview.net/pdf?id=lu6qxw6-QEV,,"It is quite challenging to ensure the safety of reinforcement learning (RL) agents in an unknown and stochastic environment under hard constraints that require the system state not to reach certain specified unsafe regions. Many popular safe RL methods, such as those based on the Constrained Markov Decision Process (CMDP) paradigm, formulate safety violations in a cost function and try to constrain the expectation of cumulative cost under a threshold. However, it is often difficult to effectively capture and enforce hard reachability-based safety constraints indirectly with such constraints on safety violation cost. In this work, we leverage the notion of a barrier function to explicitly encode the hard safety constraints, and, given that the environment is unknown, relax them to our design of \emph{generative-model-based soft barrier functions}. Based on such soft barriers, we propose a safe RL approach that can jointly learn the environment and optimize the control policy, while effectively avoiding the unsafe regions with safety probability optimization. Experiments on a set of examples demonstrate that our approach can effectively enforce hard safety constraints and significantly outperform CMDP-based baseline methods in system safety rate measured via simulations. ", Integrating Episodic and Global Novelty Bonuses for Efficient Exploration,https://openreview.net/forum?id=zZXwDQFxwib,https://openreview.net/pdf?id=zZXwDQFxwib,"We study when episodic and global novelty bonuses are useful in contextual MDPs, and find that it depends on the amount of shared structure across contexts; by combining them, we get SOTA results on MiniHack.","Exploration in environments which differ across episodes has received increasing attention in recent years. Current methods use some combination of global novelty bonuses, computed using the agent's entire training experience, and episodic novelty bonuses, computed using only experience from the current episode. However, the use of these two types of bonuses has been ad-hoc and poorly understood. In this work, we first shed light on the behavior of these two kinds of bonuses on hard exploration tasks through easily interpretable examples. We find that the two types of bonuses succeed in different settings, with episodic bonuses being most effective when there is little shared structure between environments and global bonuses being effective when more structure is shared. We also find that combining the two bonuses leads to more robust behavior across both of these settings. 
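A sketch of one way to combine the two bonus types just discussed, using simple visit counts as stand-ins for the function-approximation bonuses studied in the paper; the multiplicative combination is one of several possible choices.

```python
import numpy as np

# Episodic counts reset each episode; global counts persist across training.
global_counts, episodic_counts = {}, {}

def exploration_bonus(state_key):
    episodic_counts[state_key] = episodic_counts.get(state_key, 0) + 1
    global_counts[state_key] = global_counts.get(state_key, 0) + 1
    episodic = 1.0 / np.sqrt(episodic_counts[state_key])
    global_b = 1.0 / np.sqrt(global_counts[state_key])
    return episodic * global_b          # multiplicative combination

def reset_episode():
    episodic_counts.clear()             # episodic term resets, global persists

print([round(exploration_bonus("s0"), 3) for _ in range(3)])
reset_episode()
print(round(exploration_bonus("s0"), 3))   # episodic novelty restored
```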
Motivated by these findings, we then investigate different algorithmic choices for defining and combining function approximation-based global and episodic bonuses. This results in a new algorithm which sets a new state of the art across 18 tasks from the MiniHack suite used in prior work. Our code is public at \url{web-link}. ","reinforcement learning, exploration, generalization" "A Unified Approach to Reinforcement Learning, Quantal Response Equilibria, and Two-Player Zero-Sum Games",https://openreview.net/forum?id=DpE5UYUQzZH,https://openreview.net/pdf?id=DpE5UYUQzZH,A single algorithm for both single-agent reinforcement learning and approximating quantal response and Nash equilibria in two-player zero-sum games.,"Algorithms designed for single-agent reinforcement learning (RL) generally fail to converge to equilibria in two-player zero-sum (2p0s) games. On the other hand, game-theoretic algorithms for approximating Nash and regularized equilibria in 2p0s games are not typically competitive for RL and can be difficult to scale. As a result, algorithms for these two cases are generally developed and evaluated separately. In this work, we show that a single algorithm---a simple extension to mirror descent with proximal regularization that we call magnetic mirror descent (MMD)---can produce strong results in both settings, despite their fundamental differences. From a theoretical standpoint, we prove that MMD converges linearly to quantal response equilibria (i.e., entropy regularized Nash equilibria) in extensive-form games---this is the first time linear convergence has been proven for a first order solver. Moreover, applied as a tabular Nash equilibrium solver via self-play, we show empirically that MMD produces results competitive with CFR in both normal-form and extensive-form games---this is the first time that a standard RL algorithm has done so. Furthermore, for single-agent deep RL, on a small collection of Atari and Mujoco tasks, we show that MMD can produce results competitive with those of PPO. Lastly, for multi-agent deep RL, we show MMD can outperform NFSP in 3x3 Abrupt Dark Hex.","reinforcement learning, quantal response equilibria, two-player zero-sum games, mirror descent, variational inequalities, Nash equilibria, algorithmic game theory, proximal gradient" Dynamics-aware Skill Generation from Behaviourally Diverse Demonstrations,https://openreview.net/forum?id=VHyurNEKJBh,https://openreview.net/pdf?id=VHyurNEKJBh,"Learning a diverse set of policies using states-only demonstrations collected from different individuals, where each individual performs the task differently, being influenced by their own preferences or expertise.","Learning from demonstrations (LfD) provides a data-efficient way for a robot to learn a task by observing humans performing the task, without the need for an explicit reward function. However, in many real-world scenarios (e.g., driving a car) humans often perform the same task in different ways, motivated not only by the primary objective of the task (e.g., reaching the destination safely) but also by their individual preferences (e.g., different driving behaviours), leading to a multi-modal distribution of demonstrations. In this work, we consider an LfD problem, where the reward function for the main objective of the task is known to the learning agent; however, the individual preferences leading to the variations in the demonstrations are unknown. 
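For the magnetic mirror descent (MMD) update in the abstract above, a minimal sketch of one step on the probability simplex may help. Assuming the entropic mirror map, the update has the closed form pi' proportional to (pi_t * rho^(eta*alpha) * exp(eta*q))^(1/(1+eta*alpha)); the step sizes and the uniform magnet below are illustrative choices, and the paper should be consulted for the general derivation.

```python
import numpy as np

def mmd_step(pi, q_values, magnet, eta=0.1, alpha=0.05):
    """One magnetic mirror descent step on the simplex (a sketch).

    pi: current policy; q_values: action values; magnet: the "magnet" policy rho
    toward which the iterates are regularized (e.g., uniform)."""
    log_pi = (np.log(pi) + eta * alpha * np.log(magnet) + eta * q_values) / (1.0 + eta * alpha)
    new_pi = np.exp(log_pi - log_pi.max())  # subtract max for numerical stability
    return new_pi / new_pi.sum()

# Example: one update toward a uniform magnet on a 3-action game.
pi = np.ones(3) / 3
print(mmd_step(pi, q_values=np.array([1.0, 0.0, -1.0]), magnet=np.ones(3) / 3))
```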
We show that current LfD approaches learn policies that either track a single mode or the mean of the demonstration distribution. In contrast, we propose an algorithm to learn a diverse set of policies to perform the task, capturing the different modes in the demonstrations due to the diverse preferences of the individuals. We show that we can build a parameterised solution space that captures different behaviour patterns from the demonstrations. Then, a set of policies can be generated in this solution space, producing a diverse range of behaviours that go beyond the provided demonstrations.","Learning from Demonstration, Reinforcement Learning" Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience,https://openreview.net/forum?id=tC_2Ej6pbaRq,https://openreview.net/pdf?id=tC_2Ej6pbaRq,leveraging prior RL experience for faster inverse RL ,"This paper addresses the problem of inverse reinforcement learning (IRL) – inferring the reward function of an agent from observing its behavior. IRL can provide a generalizable and compact representation for apprenticeship learning, or enable accurately inferring the preferences of a person in order to assist them. However, effective IRL is challenging, because many reward functions can be compatible with an observed behavior. We focus on how prior reinforcement learning (RL) experience can be leveraged to make IRL faster and more efficient. We propose the algorithm BASIS (Behavior Acquisition through Successor-feature Intention inference from Samples), which leverages multi-task RL pre-training and successor features to allow an agent to build a strong basis for intentions that spans the space of possible goals in a given domain. When exposed to just a few expert demonstrations optimizing a novel goal, the agent uses its basis to quickly and effectively infer the reward function. Our experiments reveal that our method is highly effective at inferring and optimizing demonstrated reward functions, accurately inferring reward functions from fewer than 100 trajectories.","inverse reinforcement learning, successor features, multi-task reinforcement learning" DiP-GNN: Discriminative Pre-Training of Graph Neural Networks,https://openreview.net/forum?id=W0VPud1QV69,https://openreview.net/pdf?id=W0VPud1QV69,"We propose a discriminative pre-training framework for graph neural networks (DiP-GNN), where we train a discriminator to distinguish edges generated by a generator from the original graph's edges.","Graph neural network (GNN) pre-training methods have been proposed to enhance the power of GNNs. Specifically, a GNN is first pre-trained on a large-scale unlabeled graph and then fine-tuned on a separate small labeled graph for downstream applications, such as node classification. One popular pre-training method is to mask out a proportion of the edges, and a GNN is trained to recover them. However, such a generative method suffers from graph mismatch. That is, the masked graph input to the GNN deviates from the original graph. To alleviate this issue, we propose DiP-GNN (Discriminative Pre-training of Graph Neural Networks). Specifically, we train a generator to recover identities of the masked edges, and simultaneously, we train a discriminator to distinguish the generated edges from the original graph's edges. The discriminator is subsequently used for downstream fine-tuning.
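A loss-level sketch of this generator/discriminator pre-training may be useful; the tensor shapes, the candidate-endpoint formulation, and the simple sum of the two losses are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def dip_gnn_losses(gen_logits, true_endpoints, disc_logits, is_original):
    """Sketch of the two DiP-GNN objectives (hypothetical shapes).

    gen_logits:     [num_masked_edges, num_candidates] generator scores for
                    each masked edge's endpoint
    true_endpoints: [num_masked_edges] index of each true endpoint
    disc_logits:    [num_edges] discriminator score per edge of the filled-in graph
    is_original:    [num_edges] 1 if the edge comes from the original graph
    """
    gen_loss = F.cross_entropy(gen_logits, true_endpoints)  # recover masked edges
    disc_loss = F.binary_cross_entropy_with_logits(          # original vs generated
        disc_logits, is_original.float())
    return gen_loss + disc_loss                              # joint training signal
```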
In our pre-training framework, the graph seen by the discriminator better matches the original graph because the generator can recover a proportion of the masked edges. Extensive experiments on large-scale homogeneous and heterogeneous graphs demonstrate the effectiveness of the proposed framework. Our code will be publicly available.", Learning to Act through Activation Function Optimization in Random Networks,https://openreview.net/forum?id=fZxdcpfwTQb,https://openreview.net/pdf?id=fZxdcpfwTQb,We optimize parameterized activation functions in fixed random networks to solve reinforcement learning tasks.,"Biological neural networks are characterised by a high degree of neural diversity, a trait that artificial neural networks (ANNs) generally lack. Additionally, learning in ANNs is typically synonymous with only modifying the strengths of connection weights. However, there is much evidence from neuroscience that different classes of neurons each have crucial roles in the information processing done by the network. In nature, each neuron is a dynamical system that is a powerful information processor in its own right. In this paper we ask the question: how well can ANNs learn to perform reinforcement learning tasks only through the optimization of neural activation functions, without any weight optimization? We demonstrate the viability of the method and show that the neural parameters are expressive enough to allow learning three different continuous control tasks without weight optimization. These results open up more possibilities for synergies between synaptic and neural optimization in ANNs in the future.","artificial neural networks, activation functions, neural diversity" Safer Reinforcement Learning with Counterexample-guided Offline Training,https://openreview.net/forum?id=2outcw5N9wH,https://openreview.net/pdf?id=2outcw5N9wH,,"Safe reinforcement learning (RL) aims at addressing the limitation of reinforcement learning in safety-critical scenarios, where failures during learning may incur high costs. Several methods exist to incorporate external knowledge or to use proximal sensor data to limit the exploration of unsafe states. However, dealing with (partially) unknown environments and dynamics, where an agent must discover safety threats during exploration, remains challenging. In this paper, we propose a method to abstract hybrid continuous-discrete systems into compact surrogate models representing the safety-relevant knowledge acquired by the agent at any time during exploration. We exploit probabilistic counterexample generation to synthesise minimal, partial simulation environments from the surrogate model where the agent can train offline to produce heuristic strategies to minimise the risk of visiting unsafe states during subsequent online exploration. We demonstrate our method's effectiveness in increasing the agent's exploration safety on a selection of OpenAI Gym benchmarks. ", Pitfalls of Gaussians as a noise distribution in NCE,https://openreview.net/forum?id=ovZE0KsbM3S,https://openreview.net/pdf?id=ovZE0KsbM3S,We show that using Gaussians as the noise distribution in Noise Contrastive Estimation can lead to exponentially bad statistical and algorithmic complexity.,"Noise Contrastive Estimation (NCE) is a popular approach for learning probability density functions parameterized up to a constant of proportionality.
The main idea is to design a classification problem for distinguishing training data from samples from an (easy-to-sample) noise distribution $q$, in a manner that avoids having to calculate a partition function. It is well known that the choice of $q$ can severely impact the computational and statistical efficiency of NCE. In practice, a common choice for $q$ is a Gaussian that matches the mean and covariance of the data. In this paper, we show that such a choice can result in an exponentially bad (in the ambient dimension) conditioning of the Hessian of the loss, even for very simple data distributions. As a consequence, both the statistical and algorithmic complexity for such a choice of $q$ will be problematic in practice, suggesting that more complex noise distributions are essential to the success of NCE.","NCE, Noise Contrastive Estimation, Generative Models, statistical efficiency, theory" Scaling Laws for a Multi-Agent Reinforcement Learning Model,https://openreview.net/forum?id=ZrEbzL9eQ3W,https://openreview.net/pdf?id=ZrEbzL9eQ3W,We examine scaling laws for AlphaZero.,"The recent observation of neural power-law scaling relations has made a significant impact in the field of deep learning. A substantial amount of attention has consequently been dedicated to the description of scaling laws, although mostly for supervised learning and only to a reduced extent for reinforcement learning frameworks. In this paper we present an extensive study of performance scaling for a cornerstone reinforcement learning algorithm, AlphaZero. On the basis of a relationship between Elo rating, playing strength and power-law scaling, we train AlphaZero agents on the games Connect Four and Pentago and analyze their performance. We find that player strength scales as a power law in neural network parameter count when not bottlenecked by available compute, and as a power of compute when training optimally sized agents. We observe nearly identical scaling exponents for both games. Combining the two observed scaling laws we obtain a power law relating optimal size to compute similar to the ones observed for language models. We find that the predicted scaling of optimal neural network size fits our data for both games. This scaling law implies that previously published state-of-the-art game-playing models are significantly smaller than their optimal size, given the respective compute budgets. We also show that large AlphaZero models are more sample efficient, performing better than smaller models with the same amount of training data.","Neural scaling laws, Multi-agent reinforcement learning, AlphaZero" Risk Control for Online Learning Models,https://openreview.net/forum?id=uqLDy0HGPR7,https://openreview.net/pdf?id=uqLDy0HGPR7,"A flexible tool for constructing uncertainty estimates with a rigorous long-range risk control (such as coverage, false negative rate, or F1 score) in an online learning setting, where the distribution can vary greatly over time.","To provide rigorous uncertainty quantification for online learning models, we develop a framework for constructing uncertainty sets that provably control risk---such as coverage of confidence intervals, false negative rate, or F1 score---in the online setting. This extends conformal prediction to apply to a larger class of online learning problems. Our method guarantees risk control at any user-specified level even when the underlying data distribution shifts drastically, even adversarially, over time in an unknown fashion.
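Returning to the NCE setup in the "Pitfalls of Gaussians" abstract above, the binary objective at issue can be sketched in a few lines; the one-noise-sample-per-data-point setting is an illustrative simplification. The moment-matched Gaussian $q$ is exactly the choice that abstract cautions against.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nce_loss(log_p_model, log_q_noise, data, noise):
    """Binary NCE loss (a sketch): classify data (label 1) vs noise (label 0)
    using the log-ratio log p_model(x) - log q(x) as the classifier's logit."""
    logit_data = log_p_model(data) - log_q_noise(data)
    logit_noise = log_p_model(noise) - log_q_noise(noise)
    return -(np.mean(np.log(sigmoid(logit_data))) +
             np.mean(np.log(1.0 - sigmoid(logit_noise))))
```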
The technique we propose is highly flexible as it can be applied with any base online learning algorithm (e.g., a deep neural network trained online), requiring minimal implementation effort and essentially zero additional computational cost. We further extend our approach to control multiple risks simultaneously, so the prediction sets we generate are valid for all given risks. To demonstrate the utility of our method, we conduct experiments on real-world tabular time-series data sets showing that the proposed method rigorously controls various natural risks. Furthermore, we show how to construct valid intervals for an online image-depth estimation problem that previous sequential calibration schemes cannot handle.","Conformal Prediction, Uncertainty Quantification, Time Series, Online Learning" Generative Adversarial Training for Neural Combinatorial Optimization Models,https://openreview.net/forum?id=nJuzV-izmPJ,https://openreview.net/pdf?id=nJuzV-izmPJ,We propose a general framework to improve the generalization ability of deep learning models for Combinatorial Optimization Problems.,"Recent studies show that deep neural networks can be trained to learn good heuristics for various Combinatorial Optimization Problems (COPs). However, it remains a great challenge for the trained deep optimization models to generalize to distributions different from the training one. To address this issue, we propose a general framework, Generative Adversarial Neural Combinatorial Optimization (GANCO), which is equipped with another deep model to generate training instances for the optimization model, so as to improve its generalization ability. The two models are trained alternately in an adversarial way, where the generation model is trained by reinforcement learning to find instance distributions hard for the optimization model. We apply the GANCO framework to two recent deep combinatorial optimization models, i.e., AM and POMO. Extensive experiments on various COPs such as the Traveling Salesman Problem, the Capacitated Vehicle Routing Problem, and the 0-1 Knapsack Problem show that GANCO can significantly improve the generalization ability of optimization models on various instance distributions.","Vehicle Routing Problems, Combinatorial Optimization, Deep Reinforcement Learning" Federated Learning with Openset Noisy Labels,https://openreview.net/forum?id=jCHRWpXk1pD,https://openreview.net/pdf?id=jCHRWpXk1pD,A framework for openset noisy label classification in federated learning,"Federated learning is a learning paradigm that allows the central server to learn from different data sources while keeping the data private locally. Without controlling and monitoring the local data collection process, it is highly likely that the locally available training labels are noisy, just as in a centralized data collection effort. Moreover, different clients may hold samples within different label spaces. The noisy label space is likely to be different from the unobservable clean label space, resulting in openset noisy labels. In this work, we study the challenge of federated learning from clients with openset noisy labels. We observe that many existing solutions in the noisy label literature, e.g., loss correction, cannot achieve their originally claimed effect in local training. A central contribution of this work is to propose an approach that communicates globally randomly selected ``contrastive labels"" among clients to prevent local models from memorizing the openset noise patterns individually.
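The online risk-control framework above ("Risk Control for Online Learning Models") admits very small implementations in spirit; the following threshold update is a generic sketch in the style of adaptive sequential calibration, not the paper's exact algorithm, and the learning rate is a hypothetical parameter.

```python
def risk_control_step(threshold, realized_loss, target_risk, lr=0.05):
    """One online calibration update (a sketch): enlarge the uncertainty set
    when the realized loss exceeds the user-specified target risk, and shrink
    it otherwise, so the long-run average loss tracks the target."""
    return threshold + lr * (realized_loss - target_risk)
```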
Randomized label generation is applied during label sharing to facilitate access to the contrastive labels while ensuring differential privacy (DP). Both the DP guarantee and the effectiveness of our approach are established theoretically. Compared with several baseline methods, our solution demonstrates its effectiveness on several public benchmarks and real-world datasets under different noise ratios and noise models. ","Federated Learning, OpenSet Classification, Noisy Label" Perfectly Secure Steganography Using Minimum Entropy Coupling,https://openreview.net/forum?id=HQ67mj5rJdR,https://openreview.net/pdf?id=HQ67mj5rJdR,"A scalable, perfect security approach to information-theoretic steganography based on minimum entropy coupling ","Steganography is the practice of encoding secret information into innocuous content in such a manner that an adversarial third party would not realize that there is hidden meaning. While this problem has classically been studied in security literature, recent advances in generative models have led to a shared interest among security and machine learning researchers in developing scalable steganography techniques. In this work, we show that a steganography procedure is perfectly secure under Cachin (1998)'s information-theoretic model of steganography if and only if it is induced by a coupling. Furthermore, we show that, among perfectly secure procedures, a procedure is maximally efficient if and only if it is induced by a minimum entropy coupling. Due to recent breakthroughs in approximate and iterative minimum entropy coupling techniques, these insights yield what are, to the best of our knowledge, the first steganography algorithms to achieve perfect security guarantees with non-trivial efficiency; additionally, these algorithms are highly scalable. To provide empirical validation, we compare a minimum entropy coupling-based approach to three modern baselines---arithmetic coding, Meteor, and adaptive dynamic grouping---using GPT-2 and WaveRNN as communication channels. We find that the minimum entropy coupling-based approach yields superior encoding efficiency, despite its stronger security constraints. In aggregate, these results suggest that it may be natural to view information-theoretic steganography through the lens of minimum entropy coupling.","Information-Theoretic Steganography, Minimum Entropy Coupling" The power of choices in decision tree learning,https://openreview.net/forum?id=AYvLkPnDguL,https://openreview.net/pdf?id=AYvLkPnDguL,"We propose a simple generalization of greedy decision tree learning algorithms which parameterizes the greediness in these algorithms by a parameter $k$, and validate the effectiveness of having this parameter, both theoretically and empirically.","We propose a simple and natural generalization of standard and empirically successful decision tree learning algorithms such as ID3, C4.5, and CART. These classic algorithms, which have been central to machine learning for decades, are greedy in nature: they grow a decision tree by iteratively splitting on the ""best"" attribute. We augment these algorithms with an additional greediness parameter $k$ and our resulting algorithm, Top-$k$, considers the $k$ best attributes as possible splits instead of just the single best attribute. We demonstrate, theoretically and empirically, the power of this simple generalization.
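Since the Top-$k$ rule is easy to state procedurally, here is a small self-contained sketch on binary features; the Gini criterion and the use of training accuracy to choose among the $k$ candidate subtrees are illustrative choices, not necessarily those of the paper.

```python
import numpy as np

def gini_gain(X, y, feat):
    """Impurity decrease of splitting on a 0/1 feature."""
    def gini(labels):
        if len(labels) == 0:
            return 0.0
        p = np.bincount(labels, minlength=2) / len(labels)
        return 1.0 - np.sum(p ** 2)
    mask = X[:, feat] == 1
    n = len(y)
    return gini(y) - (mask.sum() / n) * gini(y[mask]) - ((~mask).sum() / n) * gini(y[~mask])

def predict(tree, X):
    if not isinstance(tree, tuple):           # leaf: a constant label
        return np.full(len(X), tree)
    feat, left, right = tree
    out = np.empty(len(X), dtype=int)
    mask = X[:, feat] == 1
    out[mask], out[~mask] = predict(right, X[mask]), predict(left, X[~mask])
    return out

def top_k_tree(X, y, k, depth):
    """Greedy tree growth, except the split is chosen among the k best-scoring
    attributes at each node rather than only the single best (a sketch)."""
    if depth == 0 or len(set(y)) <= 1:
        return int(np.bincount(y, minlength=2).argmax())   # majority-label leaf
    order = np.argsort([gini_gain(X, y, f) for f in range(X.shape[1])])[::-1][:k]
    best, best_acc = None, -1.0
    for feat in order:
        mask = X[:, feat] == 1
        if mask.all() or (~mask).all():
            continue                                        # degenerate split
        tree = (feat, top_k_tree(X[~mask], y[~mask], k, depth - 1),
                      top_k_tree(X[mask], y[mask], k, depth - 1))
        acc = np.mean(predict(tree, X) == y)
        if acc > best_acc:
            best, best_acc = tree, acc
    return best if best is not None else int(np.bincount(y, minlength=2).argmax())

# Tiny usage example on synthetic binary data.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 8))
y = (X[:, 0] & X[:, 3]).astype(int)
tree = top_k_tree(X, y, k=3, depth=3)
print(np.mean(predict(tree, X) == y))
```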
We first prove a sharp greediness hierarchy theorem showing that for every $k\in \mathbb{N}$, Top-$(k+1)$ can be much more powerful than Top-$k$: there are data distributions for which the former achieves accuracy $1-\epsilon$, whereas the latter only achieves accuracy $\frac{1}{2}+\epsilon$. We then show, through extensive experiments, that Top-$k$ compares favorably with the two main approaches to decision tree learning: classic greedy algorithms and more recent ""optimal decision tree"" algorithms. On one hand, Top-$k$ consistently enjoys significant accuracy gains over the greedy algorithms across a wide range of benchmarks, at the cost of only a mild training slowdown. On the other hand, Top-$k$ is markedly more scalable than optimal decision tree algorithms, and is able to handle dataset and feature set sizes that remain beyond the reach of these algorithms. Taken together, our results highlight the potential practical impact of the power of choices in decision tree learning.","Decision Trees, Decision Tree Learning, Top-k, ID3, Greedy Algorithms" Identifiability of Label Noise Transition Matrix ,https://openreview.net/forum?id=HTKSDFhGYhQ,https://openreview.net/pdf?id=HTKSDFhGYhQ,"This paper provides an understanding of when a label noise transition matrix is identifiable and what factors contribute to its identifiability. ","The noise transition matrix plays a central role in the problem of learning with noisy labels. Among many other reasons, a large number of existing solutions rely on access to it. Identifying and estimating the transition matrix without ground truth labels is a critical and challenging task. When label noise transition depends on each instance, the problem of identifying the instance-dependent noise transition matrix becomes substantially more challenging. Despite recent works proposing solutions for learning from instance-dependent noisy labels, the field lacks a unified understanding of when such a problem remains identifiable. The goal of this paper is to characterize the identifiability of the label noise transition matrix. Building on Kruskal's identifiability results, we show the necessity of multiple noisy labels in identifying the noise transition matrix for the generic case at the instance level. We further instantiate the results to relate to the successes of the state-of-the-art solutions and how additional assumptions alleviate the requirement of multiple noisy labels. Our result also reveals that disentangled features are helpful in the above identification task, and we provide empirical evidence. ","identifiability, label noise transition matrix, noisy labels" Learning from Others: Similarity-based Regularization for Mitigating Artifacts,https://openreview.net/forum?id=MFD2b2cwr5d,https://openreview.net/pdf?id=MFD2b2cwr5d,Similarity regularization reduces intrinsic and extrinsic bias in NLU models,"Common methods for mitigating spurious correlations in natural language understanding (NLU) usually operate in the output space, encouraging a main model to behave differently from a bias model by down-weighting examples where the bias model is confident. While such methods improve out-of-distribution (OOD) performance, it was recently observed that the internal representations of the presumably debiased models are actually more, rather than less, biased.
We propose SimReg, a new method for debiasing internal model components via similarity-based regularization in representation space: we encourage the model to learn representations that are either similar to an unbiased model or different from a biased model. We experiment with three NLU tasks and different kinds of biases. We find that SimReg improves OOD performance, with little in-distribution degradation. Moreover, the representations learned by SimReg are less biased than in other methods. ","NLP, robustness, spurious correlations, Dataset bias, natural language understanding, shortcut learning" Calibrating Transformers via Sparse Gaussian Processes,https://openreview.net/forum?id=jPVAFXHlbL,https://openreview.net/pdf?id=jPVAFXHlbL,This paper proposes to improve the uncertainty calibration for transformers by performing Bayesian inference for the outputs of multi-head attention blocks using sparse Gaussian processes.,"Transformer models have achieved profound success in prediction tasks in a wide range of applications in natural language processing, speech recognition and computer vision. Extending Transformer’s success to safety-critical domains requires calibrated uncertainty estimation, which remains under-explored. To address this, we propose Sparse Gaussian Process attention (SGPA), which performs Bayesian inference directly in the output space of multi-head attention blocks (MHAs) in transformers to calibrate their uncertainty. It replaces the scaled dot-product operation with a valid kernel and uses sparse Gaussian process (SGP) techniques to approximate the posterior processes of MHA outputs. Empirically, on a suite of prediction tasks on text, images and graphs, SGPA-based Transformers achieve competitive predictive accuracy, while noticeably improving both in-distribution calibration and out-of-distribution robustness and detection.","Transformers, Gaussian processes, Bayesian neural networks, uncertainty estimation, variational inference" Representation Learning via Consistent Assignment of Views over Random Partitions,https://openreview.net/forum?id=sRQNwKcE2X,https://openreview.net/pdf?id=sRQNwKcE2X,An unsupervised representation learning method for visual data based on self-supervised clustering. ,"We introduce Consistent Assignment of views over Random Partitions (CARP), a self-supervised clustering method for representation learning of visual features. CARP learns prototypes in an end-to-end online fashion using gradient descent without additional non-differentiable modules to solve the cluster assignment problem. We present a new pretext task based on random partitions of prototypes by enforcing consistency between views' assignments over these random subsets. We use a fast (student) and a slow (teacher) learner to provide stable targets for the assignment task. We present an extensive ablation study and show that our proposed random partition pretext task (1) improves the quality of the learned representations by devising multiple random classification tasks and (2) improves training stability and prevents collapsed solutions in joint-embedding training. CARP achieves top-1 linear accuracy of 71.7% and k-NN performance of 64.8% on ImageNet-1M, surpassing contemporary work under limited training conditions.
When trained for longer, CARP outperforms state-of-the-art methods in the k-NN evaluation and performs comparably in other benchmarks.","representation learning, unsupervised learning, self-supervised learning, computer vision" Model Transferability with Responsive Decision Subjects ,https://openreview.net/forum?id=nIGza1_wxk,https://openreview.net/pdf?id=nIGza1_wxk,This paper studies model transferability when human decision subjects respond to a deployed machine learning model.,"This paper studies model transferability when human decision subjects respond to a deployed machine learning model. In our setting, an agent or a user corresponds to a sample $(X,Y)$ drawn from a distribution $\mathcal{D}$ and will face a model $h$ and its classification result $h(X)$. Agents can modify $X$ to adapt to $h$, which will incur a distribution shift on $(X,Y)$. Therefore, when training $h$, the learner will need to consider the subsequently ``induced"" distribution when the output model is deployed. Our formulation is motivated by applications where the deployed machine learning models interact with human agents, and will ultimately face \emph{responsive} and interactive data distributions. We formalize the discussion of model transferability by studying how the model trained on the available source distribution (data) would translate to the performance on the induced domain. We provide both upper bounds for the performance gap due to the induced domain shift and lower bounds for the trade-offs that a classifier has to suffer on either the source training distribution or the induced target distribution. We provide further instantiated analysis for two popular domain adaptation settings with covariate shift and target shift.","transferability, responsive decision subjects, induced distribution shift, human-ML interaction, performance bound" Red PANDA: Disambiguating Anomaly Detection by Removing Nuisance Factors,https://openreview.net/forum?id=z37tDDHHgi,https://openreview.net/pdf?id=z37tDDHHgi,Proposing a new anomaly detection setting in which the operator specifies a nuisance attribute to be ignored,"Anomaly detection methods strive to discover patterns that differ from the norm in a meaningful way. This goal is ambiguous as different operators may find different attributes meaningful. A data point differing from the norm by an attribute such as pose, age, or gender, may be considered anomalous by some operators while others may consider the attribute irrelevant. Breaking from previous research, we present a new anomaly detection method that allows operators to exclude an attribute when detecting anomalies. Our approach learns representations which do not contain information regarding such nuisance attributes. Anomaly scoring is performed using a density-based approach. Importantly, our approach does not require specifying the attributes where anomalies could appear, which is typically impossible in anomaly detection, but only attributes to ignore. An empirical investigation is presented verifying the effectiveness of our approach.","Anomaly Detection, Disentanglement" Abstracting Imperfect Information Away from Two-Player Zero-Sum Games,https://openreview.net/forum?id=ZljQYfl8SJ,https://openreview.net/pdf?id=ZljQYfl8SJ,A reduction from imperfect information two-player zero-sum games to perfect information two-player zero-sum games,"In their seminal work, Nayyar et al.
(2013) showed that imperfect information can be abstracted away from common-payoff games by having players publicly announce their policies as they play. This insight underpins sound solvers and decision-time planning algorithms for common-payoff games. Unfortunately, a naive application of the same insight to two-player zero-sum games fails because Nash equilibria of the game with public policy announcements may not correspond to Nash equilibria of the original game. As a consequence, existing sound decision-time planning algorithms require complicated additional mechanisms that have unappealing properties. The main contribution of this work is showing that certain regularized equilibria do not possess the aforementioned non-correspondence problem---thus, computing them can be treated as a perfect-information problem. This result yields a simplified framework for decision-time planning in two-player zero-sum games, void of the unappealing properties that plague existing decision-time planning algorithms.","imperfect information, public belief states, decision-time planning, regularized equilibria" Is Attention All That NeRF Needs?,https://openreview.net/forum?id=xE-LtsE-xx,https://openreview.net/pdf?id=xE-LtsE-xx,"We present Generalizable NeRF Transformer (GNT), a pure, unified transformer-based architecture that efficiently reconstructs Neural Radiance Fields (NeRFs) on the fly.","We present Generalizable NeRF Transformer (GNT), a transformer-based architecture that reconstructs Neural Radiance Fields (NeRFs) and learns to render novel views on the fly from source views. While prior works on NeRFs optimize a scene representation by inverting a handcrafted rendering equation, GNT achieves neural representation and rendering that generalizes across scenes using transformers at two stages. (1) The view transformer leverages multi-view geometry as an inductive bias for attention-based scene representation, and predicts coordinate-aligned features by aggregating information from epipolar lines on the neighboring views. (2) The ray transformer renders novel views using attention to decode the features from the view transformer along the sampled points during ray marching. Our experiments demonstrate that when optimized on a single scene, GNT can successfully reconstruct NeRF without an explicit rendering formula due to the learned ray renderer. When trained on multiple scenes, GNT consistently achieves state-of-the-art performance when transferring to unseen scenes and outperforms all other methods by ~10% on average. Our analysis of the attention maps learned for inferring depth and occlusion indicates that attention enables learning a physically-grounded rendering. Our results show the promise of transformers as a universal modelling tool for graphics.","Neural Radiance Field, Transformer, Neural Rendering" STOCHASTIC NO-REGRET LEARNING FOR GENERAL GAMES WITH VARIANCE REDUCTION,https://openreview.net/forum?id=oJZ8bPtCar,https://openreview.net/pdf?id=oJZ8bPtCar,,"We show that a stochastic version of optimistic mirror descent (OMD), a variant of mirror descent with recency bias, converges fast in general games. More specifically, with our algorithm, the individual regret of each player vanishes at a speed of $O(1/T^{3/4})$ and the sum of all players' regret vanishes at a speed of $O(1/T)$, which is an improvement upon the $O(1/\sqrt{T})$ convergence rate of prior stochastic algorithms, where $T$ is the number of interaction rounds.
Because stochastic methods have a per-iteration computational advantage, we significantly improve upon the time complexity of deterministic algorithms for approximating coarse correlated equilibria. To achieve lower time complexity, we equip the stochastic version of OMD in \cite{alacaoglu2021stochastic} with a novel low-variance Monte-Carlo estimator. Our algorithm extends previous works \cite{alacaoglu2021stochastic,carmon2019variance} from two-player zero-sum games to general games. ",game theory The Dark Side of AutoML: Towards Architectural Backdoor Search,https://openreview.net/forum?id=bsZULlDGXe,https://openreview.net/pdf?id=bsZULlDGXe,"This paper presents EVAS, a new attack to leverage NAS to find neural architectures with exploitable backdoor vulnerability.","This paper asks the intriguing question: is it possible to exploit neural architecture search (NAS) as a new attack vector to launch previously improbable attacks? Specifically, we present EVAS, a new attack that leverages NAS to find neural architectures with inherent backdoors and exploits such vulnerability using input-aware triggers. Compared with existing attacks, EVAS demonstrates many interesting properties: (i) it does not require polluting training data or perturbing model parameters; (ii) it is agnostic to downstream fine-tuning or even re-training from scratch; (iii) it naturally evades defenses that rely on inspecting model parameters or training data. With extensive evaluation on benchmark datasets, we show that EVAS features high evasiveness, transferability, and robustness, thereby expanding the adversary's design spectrum. We further characterize the mechanisms underlying EVAS, which are possibly explainable by architecture-level ``shortcuts'' that recognize trigger patterns. This work raises concerns about the current practice of NAS and points to potential directions to develop effective countermeasures.","backdoor attack and defense, neural architecture search" Generalization and Estimation Error Bounds for Model-based Neural Networks,https://openreview.net/forum?id=9F_xlC7sk9,https://openreview.net/pdf?id=9F_xlC7sk9,,"Model-based neural networks provide unparalleled performance for various tasks, such as sparse coding and compressed sensing problems. Due to the strong connection with the sensing model, these networks are interpretable and inherit the prior structure of the problem. In practice, model-based neural networks exhibit higher generalization capability compared to ReLU neural networks. However, this phenomenon has not been addressed theoretically. Here, we leverage complexity measures including the global and local Rademacher complexities, in order to provide upper bounds on the generalization and estimation errors of model-based networks. We show that the generalization abilities of model-based networks for sparse recovery outperform those of regular ReLU networks, and derive practical design rules that allow the construction of model-based networks with guaranteed high generalization. We demonstrate through a series of experiments that our theoretical insights shed light on a few behaviours experienced in practice, including the fact that ISTA and ADMM networks exhibit higher generalization abilities (especially for a small number of training samples), compared to ReLU networks.","Model based neural networks, Generalization error, Estimation error, Local Rademacher complexity."
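To ground the model-based networks just discussed: the ISTA networks analyzed there unroll a sparse-coding iteration into layers, roughly as in this LISTA-style sketch. The weight shapes and the sharing of parameters across layers are assumptions for illustration.

```python
import numpy as np

def soft_threshold(x, theta):
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def ista_network(y, W1, W2, theta, num_layers=10):
    """Unrolled ISTA forward pass for sparse recovery (a LISTA-style sketch).

    y: measurement vector in R^m; W1 (n x m) and W2 (n x n) play the roles of
    the dictionary-dependent matrices; in a learned network, W1, W2 and theta
    would be trainable parameters."""
    x = soft_threshold(W1 @ y, theta)
    for _ in range(num_layers - 1):
        x = soft_threshold(W1 @ y + W2 @ x, theta)
    return x
```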
Isometric Representations in Neural Networks Improve Robustness,https://openreview.net/forum?id=tK9UwBsQK9,https://openreview.net/pdf?id=tK9UwBsQK9,"Adding isometric regularization to the classification loss enforces continuous representations, which improve robustness against adversarial attacks.","Artificial and biological agents are unable to learn given completely random and unstructured data. The structure of data is encoded in the distance or similarity relationships between data points. In the context of neural networks, the neuronal activity within a layer forms a representation reflecting the transformation that the layer implements on its inputs. In order to utilize the structure in the data in a truthful manner, such representations should reflect the input distances and thus be continuous and isometric. Supporting this statement, recent findings in neuroscience propose that generalization and robustness are tied to neural representations being continuously differentiable. However, in machine learning, most algorithms lack robustness and are generally thought to rely on aspects of the data that differ from those that humans use, as is commonly seen in adversarial attacks. During cross-entropy classification, the metric and structural properties of network representations are usually broken both between and within classes. This side effect from training can lead to instabilities under perturbations near locations where such structure is not preserved. One of the standard solutions to obtain robustness is to train specifically by introducing perturbations in the training data. This leads to networks that are particularly robust to specific training perturbations but not necessarily to general perturbations. While adding ad hoc regularization terms to improve robustness has become common practice, to our knowledge, forcing representations to preserve the metric structure of the input data as a stabilising mechanism has not yet been introduced. In this work, we train neural networks to perform classification while simultaneously maintaining the metric structure within each class, leading to continuous and isometric within-class representations. We show that such network representations turn out to be a beneficial component for making accurate and robust inferences about the world. By stacking layers with this property, we provide the community with a network architecture that facilitates hierarchical manipulation of internal neural representations. Finally, we verify that our isometric regularization term improves the robustness to adversarial attacks on MNIST. ","Isometry, Deep Learning, Robustness, Adversarial Attacks, Continuous Representation, Classification" Memory-efficient Trajectory Matching for Scalable Dataset Distillation,https://openreview.net/forum?id=dN70O8pmW8,https://openreview.net/pdf?id=dN70O8pmW8,we propose a memory-efficient method that scales dataset distillation to ImageNet-1K with IPC 10 and 50 for the first time and achieves state of the art performances,"Dataset distillation methods aim to compress a large dataset into a small set of synthetic samples, such that models trained on them achieve performance competitive with regular training on the entire dataset. Among recently proposed methods, Matching Training Trajectories (MTT) achieves state-of-the-art performance on CIFAR-10/100, while having difficulty scaling to the ImageNet-1k dataset due to the large memory requirement when performing unrolled gradient computation through back-propagation.
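A minimal sketch of the within-class isometry regularizer from the "Isometric Representations" abstract above: penalize the mismatch between pairwise input distances and pairwise representation distances. Treating all pairs in the batch as same-class pairs is a simplifying assumption for brevity.

```python
import torch

def isometry_loss(x, z):
    """Penalize deviation of representation distances from input distances (a sketch).

    x: [B, ...] inputs from one class; z: [B, d] their representations."""
    dx = torch.cdist(x.flatten(1), x.flatten(1))  # pairwise input distances
    dz = torch.cdist(z.flatten(1), z.flatten(1))  # pairwise representation distances
    return ((dx - dz) ** 2).mean()
```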
Surprisingly, we show that there exists a procedure to exactly calculate the gradient of the trajectory matching loss with a constant memory requirement (independent of the number of unrolled steps). With this finding, the proposed memory-efficient trajectory matching method can easily scale to ImageNet-1K with roughly 6x memory reduction while introducing only around 2% runtime overhead compared to the original MTT. Further, we find that assigning soft labels to synthetic images is crucial for the performance when scaling to a larger number of categories (e.g., 1,000) and propose a novel soft label version of trajectory matching that facilitates better alignment of model training trajectories on large datasets. The proposed algorithm not only surpasses the previous SOTA on ImageNet-1K under extremely low IPCs (Images Per Class), but also for the first time enables us to scale up to 50 IPCs on ImageNet-1K. Our method (TESLA) achieves 27.9% testing accuracy, a remarkable +18.2% margin over prior art.","dataset condensation, dataset distillation, imagenet-1k, deep learning, dataset synthesis" TAN without a burn: Scaling laws of DP-SGD,https://openreview.net/forum?id=PHtzmXK8am,https://openreview.net/pdf?id=PHtzmXK8am,Computationally friendly hyper-parameter search with DP-SGD for new state-of-the-art performance on ImageNet.,"Differentially Private methods for training Deep Neural Networks (DNNs) have progressed recently, in particular with the use of massive batches and aggregated data augmentations for a large number of steps. These techniques require much more compute than their non-private counterparts, shifting the traditional privacy-accuracy trade-off to a privacy-accuracy-compute trade-off and making hyper-parameter search virtually impossible for realistic scenarios. In this work, we decouple privacy analysis and experimental behavior of noisy training to explore the trade-off with minimal computational requirements. We first use the tools of Rényi Differential Privacy (RDP) to show that the privacy budget, when not overcharged, only depends on the total amount of noise (TAN) injected throughout training. We then derive scaling laws for training models with DP-SGD to optimize hyper-parameters with more than a $100\times$ reduction in computational budget. We apply the proposed method on CIFAR-10 and ImageNet and, in particular, strongly improve the state-of-the-art on ImageNet with a $+9$-point gain in accuracy for a privacy budget $\varepsilon=8$.", A sampling framework for value-based reinforcement learning,https://openreview.net/forum?id=x70_D-KGEMx,https://openreview.net/pdf?id=x70_D-KGEMx," We propose an efficient and scalable sampling framework for reinforcement learning, which enables uncertainty quantification and addresses the local trap issue suffered by the existing training approaches. ","Value-based algorithms have achieved great success in solving Reinforcement Learning problems via minimizing the mean squared Bellman error (MSBE). Temporal-difference (TD) algorithms such as Q-learning and SARSA often use stochastic gradient descent based optimization approaches to estimate the value function parameters, but fail to quantify their uncertainties. In our work, under the Kalman filtering paradigm, we establish a novel and scalable sampling framework based on stochastic gradient Markov chain Monte Carlo, which allows us to efficiently generate samples from the posterior distribution of deep neural network parameters.
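As a concrete instance of the SG-MCMC machinery just mentioned, here is a stochastic gradient Langevin dynamics step. This is the textbook sampler, shown under generic assumptions, not the paper's exact Kalman-filtering-based variant.

```python
import numpy as np

def sgld_step(theta, grad_loss, step_size, rng):
    """One stochastic gradient Langevin dynamics update (a sketch).

    grad_loss: a stochastic gradient of the negative log-posterior at theta,
    e.g., a minibatch gradient of the mean squared Bellman error plus a prior term."""
    noise = rng.standard_normal(theta.shape)
    return theta - 0.5 * step_size * grad_loss + np.sqrt(step_size) * noise
```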
For TD-learning with both linear and nonlinear function approximation, we prove that the proposed algorithm converges to a stationary distribution, which allows us to measure uncertainties of the value function and its parameters.","Reinforcement learning, Bayesian sampling, Stochastic gradient MCMC, Value function approximation" Enhancing Cross-Category Learning in Recommendation Systems with Multi-Layer Embedding Training,https://openreview.net/forum?id=sWSWudSpYy,https://openreview.net/pdf?id=sWSWudSpYy,,"Modern DNN-based recommendation systems rely on training-derived real-valued embeddings of sparse categorical features. Input sparsity makes obtaining high-quality embeddings for rarely-occurring categories harder as their representations are updated infrequently. We demonstrate an effective overparameterization technique for enhancing embedding training by enabling useful cross-category learning. Our scheme trains embeddings using training-time forced factorization of the embedding (linear) layer, with an inner dimension higher than the target embedding dimension. We show that factorization breaks update sparsity via non-homogeneous weighting of dense base embedding matrices. Such weighting controls the magnitude of weight updates in each embedding direction, and is adaptive to training-time embedding singular values. The dynamics of the singular values further explain the puzzling importance of the factorization inner dimension for the learning enhancements. We call the scheme multi-layer embedding training (MLET). For deployment efficiency, MLET converts the trained two-layer embedding into a single-layer one at the conclusion of training, avoiding inference-time model size increase. MLET consistently produces better models when tested on multiple recommendation models for click-through rate (CTR) prediction. At constant model quality, MLET allows embedding dimension reduction by up to 16x, and 5.8x on average, across the models. MLET retains its benefits in combination with other table reduction methods (hashing and quantization). ", StructViT: Learning Correlation Structures for Vision Transformers,https://openreview.net/forum?id=Af43zsue7kw,https://openreview.net/pdf?id=Af43zsue7kw,We introduce structural self-attention (StructSA) that exploits geometric structures of query-key correlations and the proposed network StructViT achieves state-of-the-art results on various image and video classification benchmarks.,"We introduce the structural self-attention (StructSA) mechanism that leverages structural patterns of query-key correlation for visual representation learning. StructSA generates attention by recognizing space-time structures of correlations and performs long-range interactions across entire locations, effectively capturing structural patterns, e.g., spatial layouts, motion, or inter-object relations in images and videos.
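Returning to the MLET abstract above: the scheme reduces to a two-factor embedding during training and a single folded table at deployment. This PyTorch sketch uses hypothetical names and shapes.

```python
import torch
import torch.nn as nn

class MLETEmbedding(nn.Module):
    """Training-time factorized embedding (a sketch of the MLET idea)."""
    def __init__(self, num_categories, target_dim, inner_dim):
        super().__init__()
        assert inner_dim > target_dim, "MLET uses an inner dimension above the target"
        self.inner = nn.Embedding(num_categories, inner_dim)       # V x k factor
        self.proj = nn.Linear(inner_dim, target_dim, bias=False)   # k -> d factor

    def forward(self, idx):
        return self.proj(self.inner(idx))

    def collapse(self):
        """Fold the two factors into a single V x d table for inference,
        so deployment sees no model size increase."""
        with torch.no_grad():
            table = self.proj(self.inner.weight)
            single = nn.Embedding(table.shape[0], table.shape[1])
            single.weight.copy_(table)
        return single
```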
Using StructSA as a main building block, we develop the structural vision transformer (StructViT) and evaluate its effectiveness on both image and video classification tasks, achieving state-of-the-art results on ImageNet-1K, Kinetics-400, Something-Something V1 & V2, Diving-48, and FineGym.","transformer, self-attention, image classification, video classification, correlation" The Curse of Low Task Diversity: On the Failure of Transfer Learning to Outperform MAML and their Empirical Equivalence,https://openreview.net/forum?id=8re-nA0wDxW,https://openreview.net/pdf?id=8re-nA0wDxW,"when the task diversity of few-shot learning benchmarks is low and comparison is fair, MAML and transfer learning perform the same -- opposite of claims that transfer learning is better","Recently, it has been observed that a transfer learning solution might be all we need to solve many few-shot learning benchmarks -- thus raising important questions about when and how meta-learning algorithms should be deployed. In this paper, we seek to clarify these questions by 1. proposing a novel metric -- the {\it diversity coefficient} -- to measure the diversity of tasks in a few-shot learning benchmark and 2. comparing MAML and transfer learning under fair conditions (same architecture, same optimizer, and all models trained to convergence). Using the diversity coefficient, we show that the popular MiniImagenet and Cifar-fs few-shot learning benchmarks have low diversity. This novel insight contextualizes claims that transfer learning solutions are better than meta-learned solutions in the regime of low diversity under a fair comparison. Specifically, we empirically find that a low diversity coefficient correlates with a high similarity between transfer learning and Model-Agnostic Meta-Learning (MAML) learned solutions in terms of accuracy at meta-test time and classification layer similarity (using feature-based distance metrics like SVCCA, PWCCA, CKA, and OPD). To further support our claim, we find that this meta-test accuracy parity holds even as the model size changes. Therefore, we conclude that in the low diversity regime, MAML and transfer learning have equivalent meta-test performance when both are compared fairly. We also hope our work inspires more thoughtful constructions and quantitative evaluations of meta-learning benchmarks in the future.","meta-learning, machine learning, transfer learning, deep learning" Bi-Stride Multi-Scale Graph Neural Network for Mesh-Based Physical Simulation,https://openreview.net/forum?id=ZqK0Hnlqo-,https://openreview.net/pdf?id=ZqK0Hnlqo-,A robust yet simple pooling strategy for improving multi-scale GNNs for predicting physical simulations,"Learning physical systems on unstructured meshes with flat graph neural networks (GNNs) faces the challenge of modeling long-range interactions due to the scaling complexity w.r.t. the number of nodes, limiting generalization under mesh refinement. On regular grids, convolutional neural networks (CNNs) with a U-net structure can resolve this challenge via efficient striding, pooling, and upsampling operations. Nonetheless, these tools are much less developed for GNNs, especially when GNNs are employed for learning large-scale mesh-based physics. The challenges arise from the highly irregular meshes and the lack of effective ways to construct the multi-level structure without losing connectivity.
Inspired by the bipartite graph determination algorithm, we introduce Bi-Stride Multi-Scale Graph Neural Network (BSMS-GNN) by proposing \textit{bi-stride} as a simple pooling strategy for building the multi-level GNN. \textit{Bi-stride} pools nodes by striding every other BFS frontier; it 1) works robustly on any challenging mesh in the wild, 2) avoids using a mesh generator at coarser levels, 3) avoids relying on spatial proximity when building coarser levels, and 4) uses non-parametrized aggregating/returning instead of MLPs during pooling and unpooling. Experiments show that our framework significantly outperforms the state-of-the-art method in computational efficiency in representative physics-based simulation cases.","GNN, physics-based simulation" Spatially Resolved Temporal Networks: Online Unsupervised Representation Learning of High Frequency Time Series,https://openreview.net/forum?id=v71SH44HD8,https://openreview.net/pdf?id=v71SH44HD8,Unsupervised representation learning to generate clinically interpretable waveforms. ,"Univariate high-frequency time series are dominant data sources for many medical, economic and environmental applications. In many of these domains, the time series are tied to real-time changes in state. In the intensive care unit, for example, changes in an electrocardiogram signal can indicate a heart attack, and intracranial pressure waveforms can indicate whether a patient is developing decreased blood perfusion to the brain. However, most representation learning to resolve states is conducted in an offline, batch-dependent manner. In high frequency time-series, high intra-state and inter-sample variability makes offline, batch-dependent learning a relatively difficult task. Hence, we propose Spatially Resolved Temporal Networks (SpaRTeN), a novel composite deep learning model for online, unsupervised representation learning through a spatially constrained latent space. We simultaneously train two distinct blocks: a recurrent neural network ensemble $f_R$ that captures states in high frequency time series, and a spatial block $f_S$ that spatially resolves state changes from the predictions generated by $f_R$. The spatial block $f_S$ identifies the block in $f_R$ that best fits the current state of the time series, and the training procedure for $f_R$ optimizes that block. This procedure corresponds to a minimax framework. When $f_S$ and $f_R$ are deep neural networks, the entire system can be trained via back-propagation. Finally, we demonstrate the application of this framework to online forecasting and interpretable, zero-shot clustering. We show that SpaRTeN outperforms spectral clustering and a Gaussian mixture model.","High frequency time series, Representation Learning, Online Learning" ChordMixer: A Scalable Neural Attention Model for Sequences with Different Length,https://openreview.net/forum?id=E8mzu3JbdR,https://openreview.net/pdf?id=E8mzu3JbdR,,"Sequential data naturally have different lengths in many domains, with some very long sequences. As an important modeling tool, neural attention should capture long-range interaction in such sequences. However, most existing neural attention models admit only short sequences, or they have to employ chunking or padding to enforce a constant input length. Here we propose a simple neural network building block called ChordMixer, which can model the attention for long sequences with variable lengths.
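The bi-stride pooling in the BSMS-GNN abstract above ("striding every other BFS frontier") can be sketched in a few lines; seed selection and the handling of multiple connected components are omitted here for simplicity.

```python
from collections import deque

def bi_stride_pool(adjacency, seed=0):
    """Keep nodes on every other BFS frontier as the coarser level (a sketch).

    adjacency: dict mapping each node to its list of neighbors."""
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    return sorted(node for node, d in depth.items() if d % 2 == 0)

# Example: a path graph 0-1-2-3-4 pools to its even-depth nodes [0, 2, 4].
print(bi_stride_pool({0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}))
```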
Each ChordMixer block consists of a position-wise rotation layer without learnable parameters and an element-wise MLP layer. Repeatedly applying such blocks forms an effective network backbone that mixes the input signals towards the learning targets. We have tested ChordMixer on the synthetic adding problem, long document classification, and DNA sequence-based taxonomy classification. The experimental results show that our method substantially outperforms other neural attention models.","Mixer, Attention, Scalable" Boosting Adversarial Transferability using Dynamic Cues,https://openreview.net/forum?id=SZynfVLGd5,https://openreview.net/pdf?id=SZynfVLGd5,A new approach for optimizing temporal prompts through frozen image models to capture motion dynamics for better transferability,"The transferability of adversarial perturbations between image models has been extensively studied. In this case, an attack is generated from a known surrogate, e.g., an ImageNet-trained model, and transferred to change the decision of an unknown (black-box) model trained on an image dataset. However, attacks generated from image models do not capture the dynamic nature of a moving object or a changing scene due to a lack of temporal cues within image models. This leads to reduced transferability of adversarial attacks from representation-enriched image models such as Supervised Vision Transformers (ViTs), Self-supervised ViTs (e.g., DINO), and Vision-language models (e.g., CLIP) to black-box video models. In this work, we induce dynamic cues within the image models without sacrificing their original performance on images. To this end, we optimize temporal prompts through frozen image models to capture motion dynamics. Our temporal prompts are the result of a learnable transformation that allows optimizing for temporal gradients during an adversarial attack to fool the motion dynamics. Specifically, we introduce spatial (image) and temporal (video) cues within the same source model through task-specific prompts. Attacking such prompts maximizes the adversarial transferability from image-to-video and image-to-image models using the attacks designed for image models. As an example, an iterative attack launched from the image model DeiT-B with temporal prompts reduces generalization (top-1 accuracy) of a video model by 35% on Kinetics-400. Our approach also improves adversarial transferability to image models by 9% on ImageNet w.r.t. the current state-of-the-art approach. Our attack results indicate that the attacker does not need specialized architectures, e.g., divided space-time attention, 3D convolutions, or multi-view convolution networks for different data modalities. Image models are all we need to optimize for an effective surrogate and an adversarial attack to fool black-box models in a changing environment over time. Our code will be made public.","Adversarial attacks, Transferability, Prompt learning, Dynamic video modeling" Taming Policy Constrained Offline Reinforcement Learning for Non-expert Demonstrations,https://openreview.net/forum?id=zMVCSe945x,https://openreview.net/pdf?id=zMVCSe945x,The performance losses of policy constraint-based offline RL algorithms on contaminated datasets can be alleviated by gradient penalty and constraint relaxation.,"A promising paradigm for offline reinforcement learning (RL) is to constrain the learned policy to stay close to the dataset behaviors, known as policy constraint offline RL.
However, existing works heavily rely on the purity of the data, exhibiting performance degradation or even catastrophic failure when learning from contaminated datasets containing impure trajectories of diverse levels, e.g., expert level, medium level, etc., even though contaminated offline data logs are common in the real world. To mitigate this, we first introduce gradient penalty over the learned value function to tackle the exploding Q-function gradients. We then relax the closeness constraints towards non-optimal actions with critic weighted constraint relaxation. Experimental results show that the proposed techniques effectively tame the non-optimal trajectories for policy constraint offline RL methods, evaluated on a set of contaminated D4RL Mujoco and Adroit datasets.","Contaminated Datasets, Robust Offline Reinforcement Learning" Static Prediction of Runtime Errors by Learning to Execute Programs with External Resource Descriptions,https://openreview.net/forum?id=lLp-C5nTdJG,https://openreview.net/pdf?id=lLp-C5nTdJG,"For statically predicting runtime errors, the IPA-GNN scales to complex programs, models exception handling, and executes resource descriptions; it performs well and surprisingly localizes errors despite training without location supervision.","The execution behavior of a program often depends on external resources, such as program inputs or file contents, and so the program cannot be run in isolation. Nevertheless, software developers benefit from fast iteration loops where automated tools identify errors as early as possible, even before programs can be compiled and run. This presents an interesting machine learning challenge: can we predict runtime errors in a ""static"" setting, where program execution is not possible? Here, we introduce a competitive programming dataset and task for predicting runtime errors, which we show is difficult for generic models like Transformers. We approach this task by developing an interpreter-inspired architecture with an inductive bias towards mimicking program executions, which models exception handling and ""learns to execute"" descriptions of external resources. Surprisingly, we show that the model can also predict the locations of errors, despite being trained only on labels indicating error presence or absence and kind. In total, we present a practical and difficult-yet-approachable challenge problem related to learning program execution behavior and we demonstrate promising new capabilities of interpreter-inspired machine learning models for code.","program analysis, graph neural networks, recurrent networks, attention mechanisms, source code, program execution" Attentional Context Alignment for Multimodal Sequential Learning,https://openreview.net/forum?id=wkU9ezzXbHk,https://openreview.net/pdf?id=wkU9ezzXbHk,,"Transformer-based methods have gone mainstream in multimodal sequential learning. The intra- and inter-modality interactions are captured by the query-key associations of multi-head attention, where the calculated multimodal contexts are expected to be relevant to the query modality. However, in existing literature, the alignments between different calculated contextual sequences, which can back-evaluate the effectiveness of multi-head attention, are under-explored. Based on this concern, we propose a new constrained scheme called Multimodal Contextual Contrast (MCC), which could align the attentional sequences from both local and global perspectives, making the attentional capture more accurate.
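A hypothetical sketch of the gradient-penalty idea from the offline RL abstract above: the critic loss adds a penalty on the Q-function's action gradients so that exploding gradients on out-of-distribution actions cannot dominate learning. The squared-norm form and the coefficient are assumptions, not the paper's exact objective.

```python
import torch

def critic_loss_with_gp(q_net, obs, act, td_target, gp_coef=1.0):
    # Standard Bellman regression term on dataset transitions.
    q = q_net(obs, act)
    bellman = ((q - td_target) ** 2).mean()
    # Gradient penalty: differentiate Q w.r.t. the action input and
    # penalize the squared norm ||dQ/da||^2 to keep the critic smooth.
    act_req = act.detach().requires_grad_(True)
    grad = torch.autograd.grad(q_net(obs, act_req).sum(), act_req,
                               create_graph=True)[0]
    penalty = grad.norm(2, dim=-1).pow(2).mean()
    return bellman + gp_coef * penalty
```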
Concretely, the multimodal contexts of different modalities are mapped into a common feature space, where the contexts at the same sequential step are considered a positive group and the remaining sets are negative. From the local perspective, we sample the negative groups for a positive group by randomly changing the sequential step of one specific context and keeping the others the same. From a coarse global perspective, we divide all the contextual groups into two sets (i.e., aligned and unaligned), making the total score of the aligned set relatively large. We extend the vectorial inner product operation to more inputs and calculate the aligned score for each multimodal group. Considering that the computational complexity scales exponentially with the number of modalities, we adopt stochastic expectation approximation (SEA) in the actual implementation. The extensive experimental results on several tasks reveal the effectiveness of our contributions. ","Multimodal, Attentional Context, Alignment" Matching receptor to odorant with protein language and graph neural networks,https://openreview.net/forum?id=q9VherQJd8_,https://openreview.net/pdf?id=q9VherQJd8_,We leverage recent advances in protein representation learning and graph neural networks to predict olfactory receptor-molecule binding.,"Odor perception in mammals is triggered by interactions between volatile organic compounds and a subset of hundreds of proteins called olfactory receptors (ORs). Molecules activate these receptors in a complex combinatorial coding allowing mammals to discriminate a vast number of chemical stimuli. Recently, ORs have gained attention as new therapeutic targets following the discovery of their involvement in other physiological processes and diseases. To date, predicting molecule-induced activation for ORs is highly challenging since $43\%$ of ORs have no identified active compound. In this work, we combine the [CLS] token from protBERT with a molecular graph and propose a tailored GNN architecture incorporating inductive biases from the protein-molecule binding. We abstract the biological process of protein-molecule activation as the injection of a molecule into a protein-specific environment. On a newly gathered dataset of $46\,650$ OR-molecule pairs, this model outperforms standard GNN baselines by $30\%$. Moreover, by incorporating non-bonded interactions the model is able to work with mixtures of compounds. Finally, our predictions reveal a similar activation pattern for molecules within a given odor family, which is in agreement with the theory of combinatorial coding in olfaction.","Olfaction, protein-ligand binding, olfactory receptors, computational biology, protein language modelling, graph neural networks" REAP: A Large-Scale Realistic Adversarial Patch Benchmark,https://openreview.net/forum?id=noJYC9HMP42,https://openreview.net/pdf?id=noJYC9HMP42,We create a realistic benchmark for evaluating adversarial patch attacks and defenses containing over 14k traffic signs on driving scenes where the signs are annotated with realistic geometric and lighting transforms.,"Machine learning models are known to be susceptible to adversarial perturbation. One famous attack is the adversarial patch, a sticker with a particularly crafted pattern that makes the model incorrectly predict the object it is placed on. This attack presents a critical threat to cyber-physical systems that rely on cameras such as autonomous cars.
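A simplified two-modality sketch of the MCC local alignment described above: contexts are projected into a common space, same-step pairs form the positive group, and the negatives are step-displaced contexts, written here as an InfoNCE-style stand-in for the paper's multi-way aligned score. `proj_a`, `proj_b`, and the temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mcc_local_loss(ctx_a, ctx_b, proj_a, proj_b, tau=0.1):
    # ctx_a, ctx_b: (batch, steps, dim) attentional contexts of two modalities.
    za = F.normalize(proj_a(ctx_a), dim=-1)        # map to the common space
    zb = F.normalize(proj_b(ctx_b), dim=-1)
    sim = torch.einsum('btd,bsd->bts', za, zb) / tau
    # Entry (t, s) compares step t of modality A with step s of modality B;
    # the diagonal (same step) is positive, displaced steps act as negatives.
    labels = torch.arange(sim.size(1)).expand(sim.size(0), -1)
    return F.cross_entropy(sim.flatten(0, 1), labels.flatten())

proj_a, proj_b = torch.nn.Linear(128, 64), torch.nn.Linear(96, 64)
loss = mcc_local_loss(torch.randn(4, 12, 128), torch.randn(4, 12, 96),
                      proj_a, proj_b)
```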
Despite the significance of the problem, conducting research in this setting has been difficult; evaluating attacks and defenses in the real world is exceptionally costly while synthetic data are unrealistic. In this work, we propose the REAP (REalistic Adversarial Patch) Benchmark, a digital benchmark that allows the user to evaluate patch attacks on real images, and under real-world conditions. Built on top of the Mapillary Vistas dataset, our benchmark contains over 14,000 traffic signs. Each sign is augmented with a pair of geometric and lighting transformations, which can be used to apply a digitally generated patch realistically onto the sign while matching real-world conditions. Using our benchmark, we perform the first large-scale assessments of adversarial patch attacks under realistic conditions. Our experiments suggest that adversarial patch attacks may present a smaller threat than previously believed and that the success rate of an attack on simpler digital simulations is not predictive of its actual effectiveness in practice.","adversarial examples, adversarial patch, benchmark, traffic sign detection" How does overparametrization affect performance on minority groups?,https://openreview.net/forum?id=Su84ELBdm5U,https://openreview.net/pdf?id=Su84ELBdm5U,,"The benefits of overparameterization for the overall performance of modern machine learning (ML) models are well known. However, the effect of overparameterization at a more granular level of data subgroups is less understood. Recent empirical studies demonstrate encouraging results: (i) when groups are not known, overparameterized models trained with empirical risk minimization (ERM) perform better on minority groups; (ii) when groups are known, ERM on data subsampled to equalize group sizes yields state-of-the-art worst-group-accuracy in the overparameterized regime. In this paper, we complement these empirical studies with a theoretical investigation of the risk of overparameterized random feature models on minority groups. In a setting in which the regression functions for the majority and minority groups are different, we show that overparameterization always improves minority group performance.", Federated Training of Dual Encoding Models on Small Non-IID Client Datasets,https://openreview.net/forum?id=8znaO_qG0H,https://openreview.net/pdf?id=8znaO_qG0H,"Novel approach for training dual encoding models on distributed data composed of many small, non-IID client datasets.","Dual encoding models that encode a pair of inputs are widely used for representation learning. Many approaches train dual encoding models by maximizing agreement between pairs of encodings on centralized training data. However, in many scenarios, datasets are inherently decentralized across many clients (user devices or organizations) due to privacy concerns, motivating federated learning. In this work, we focus on federated training of dual encoding models on decentralized data composed of many small, non-IID (not independent and identically distributed) client datasets. We show that existing approaches that work well in centralized settings perform poorly when naively adapted to this setting using federated averaging. We observe that we can simulate large-batch loss computation on individual clients for loss functions that are based on encoding statistics.
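A hedged sketch of the observation just made above: for losses defined on encoding statistics, each client can ship local sufficient statistics instead of raw samples, and the server can assemble the large-batch cross-correlation. The Barlow Twins-style loss shape below is an assumption about what such a statistic-based objective might look like, not the paper's exact formulation.

```python
import torch

def client_stats(z1, z2):
    # Per-client sufficient statistics of paired-view encodings (n_i, d);
    # only these (d x d) and (d,) tensors leave the device, not samples.
    return z1.T @ z2, (z1 ** 2).sum(0), (z2 ** 2).sum(0)

def server_correlation(per_client, eps=1e-8):
    # Assemble the cross-correlation of the virtual large batch on the server.
    cross = sum(s[0] for s in per_client)
    n1 = sum(s[1] for s in per_client).sqrt() + eps
    n2 = sum(s[2] for s in per_client).sqrt() + eps
    return cross / torch.outer(n1, n2)           # (d, d), entries in [-1, 1]

def correlation_loss(corr, off_diag_weight=5e-3):
    # Illustrative objective on the aggregated matrix: diagonal pulled to 1
    # (invariance), off-diagonal pushed to 0 (redundancy reduction).
    eye = torch.eye(corr.size(0))
    on = ((torch.diagonal(corr) - 1) ** 2).sum()
    off = ((corr * (1 - eye)) ** 2).sum()
    return on + off_diag_weight * off
```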
Based on this insight, we propose a novel federated training approach, Distributed Cross Correlation Optimization (DCCO), which trains dual encoding models using encoding statistics aggregated across clients, without sharing individual data samples. Our experimental results on two datasets demonstrate that the proposed DCCO approach outperforms federated variants of existing approaches by a large margin.","dual encoding models, federated learning, representation learning, self-supervised learning, federated self-supervised learning" Offline Policy Comparison with Confidence: Benchmarks and Baselines,https://openreview.net/forum?id=AsOLzq1S-p,https://openreview.net/pdf?id=AsOLzq1S-p,We introduce a benchmark and baselines to study uncertainty estimation via policy comparisons in offline reinforcement learning datasets.,"Decision makers often wish to use offline historical data to compare sequential-action policies at various world states. Importantly, computational tools should produce confidence values for such offline policy comparison (OPC) to account for statistical variance and limited data coverage. Nevertheless, there is little work that directly evaluates the quality of confidence values for OPC. In this work, we address this issue by creating benchmarks for OPC with Confidence (OPCC), derived by adding sets of policy comparison queries to datasets from offline reinforcement learning. In addition, we present an empirical evaluation of the ""risk versus coverage"" trade-off for a class of model-based baselines. In particular, the baselines learn ensembles of dynamics models, which are used in various ways to produce simulations for answering queries with confidence values. While our results suggest advantages for certain baseline variations, there appears to be significant room for improvement in future work.","offline reinforcement learning, reinforcement learning, benchmark, uncertainty, model based reinforcement learning" PRANC: Pseudo RAndom Networks for Compacting deep models,https://openreview.net/forum?id=sz_iMI6IPM,https://openreview.net/pdf?id=sz_iMI6IPM,A method to compress networks for sharing knowledge and storage,"Compacting deep models has various applications where the communication and/or storage is expensive, including multi-agent learning. We introduce a simple yet effective framework for compacting neural networks. In short, we train our network to be a linear combination of many pseudo-randomly generated frozen models. Then, one can reconstruct the model by communicating or storing the single `seed' scalar used to generate the pseudo-random `basis' networks along with the learned linear mixture coefficients. Our method, denoted as PRANC, learns almost $100\times$ fewer parameters than a deep model and still performs reasonably well on several datasets and architectures. PRANC enables 1) efficient communication of models between agents, 2) efficient model storage, and 3) memory-efficient inference by generating layer-wise weights on the fly.
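A minimal sketch of the PRANC reconstruction described above: the deployed weights are a learned linear combination of frozen pseudo-random basis networks, so only one seed scalar and the mixture coefficients need to be stored or communicated. The flat-vector parametrization below is a simplification for illustration.

```python
import torch

def reconstruct_weights(seed, n_basis, n_params, alpha):
    gen = torch.Generator().manual_seed(seed)
    w = torch.zeros(n_params)
    for j in range(n_basis):                 # stream basis models one by one
        basis_j = torch.randn(n_params, generator=gen)
        w += alpha[j] * basis_j              # linear mixture; alpha is learned
    return w

alpha = torch.randn(100)                     # learned coefficients (100 scalars)
w = reconstruct_weights(seed=42, n_basis=100, n_params=10_000, alpha=alpha)
# Anyone holding (seed=42, alpha) reconstructs exactly the same weights,
# which is what makes communication and storage so cheap.
```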
We test PRANC on CIFAR-10, CIFAR-100, tinyImageNet, and ImageNet-100 with various architectures like AlexNet, LeNet, ResNet18, ResNet20, and ResNet56, and demonstrate a massive reduction in the number of parameters while providing satisfactory performance on these benchmark datasets.","Computer Vision, Compacting Deep Models" SGDA with shuffling: faster convergence for nonconvex-PŁ minimax optimization,https://openreview.net/forum?id=6xXtM8bFFJ,https://openreview.net/pdf?id=6xXtM8bFFJ,We study the convergence bounds of (mini-batch) SGDA with random reshuffling for nonconvex-PŁ and primal-PŁ-PŁ problems.,"Stochastic gradient descent-ascent (SGDA) is one of the main workhorses for solving finite-sum minimax optimization problems. Most practical implementations of SGDA randomly reshuffle components and sequentially use them (i.e., without-replacement sampling); however, there are few theoretical results on this approach for minimax algorithms, especially outside the easier-to-analyze (strongly-)monotone setups. To narrow this gap, we study the convergence bounds of SGDA with random reshuffling (SGDA-RR) for smooth nonconvex-nonconcave objectives with Polyak-Łojasiewicz (PŁ) geometry. We analyze both simultaneous and alternating SGDA-RR for nonconvex-PŁ and primal-PŁ-PŁ objectives, and obtain convergence rates faster than with-replacement SGDA. Our rates also extend to mini-batch SGDA-RR, recovering known rates for full-batch gradient descent-ascent (GDA). Lastly, we present a comprehensive lower bound for two-time-scale GDA, which matches the full-batch rate for the primal-PŁ-PŁ case.","minimax optimization, SGDA, without-replacement sampling, random reshuffling, Polyak-Łojasiewicz" NTFields: Neural Time Fields for Physics-Informed Robot Motion Planning,https://openreview.net/forum?id=ApF0dmi1_9K,https://openreview.net/pdf?id=ApF0dmi1_9K,A physics-informed neural time fields model for robot motion planning.,"Neural Motion Planners (NMPs) have emerged as a promising tool for solving robot navigation tasks in complex environments. However, these methods often require expert data for learning, which limits their application to scenarios where data generation is time-consuming. Recent developments have also led to physics-informed deep neural models capable of representing complex dynamical Partial Differential Equations (PDEs). Inspired by these developments, we propose Neural Time Fields (NTFields) for robot motion planning in cluttered scenarios. Our framework represents a wave propagation model generating continuous arrival time to find path solutions informed by a nonlinear first-order PDE called the Eikonal equation. We evaluate our method in various cluttered 3D environments, including the Gibson dataset, and demonstrate its ability to solve motion planning problems for 4-DOF and 6-DOF robot manipulators where the traditional grid-based Eikonal planners often face the curse of dimensionality.
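A small sketch of simultaneous SGDA with random reshuffling for a finite-sum minimax problem $\min_x \max_y \frac{1}{n}\sum_i f_i(x,y)$, matching the without-replacement sampling analyzed in the SGDA-RR abstract above; the toy components and step sizes are illustrative.

```python
import torch

def sgda_rr(fs, x, y, lr_x=5e-2, lr_y=5e-2, epochs=100):
    n = len(fs)
    for _ in range(epochs):
        for i in torch.randperm(n).tolist():   # one without-replacement pass
            gx, gy = torch.autograd.grad(fs[i](x, y), (x, y))
            with torch.no_grad():
                x -= lr_x * gx                 # descent on the min player
                y += lr_y * gy                 # ascent on the max player
    return x, y

# Toy components f_i(x, y) = 0.5*a_i*x^2 + b_i*x*y - 0.5*y^2 (concave in y).
a, b = torch.rand(8) + 0.5, torch.randn(8)
fs = [lambda x, y, a=a[i], b=b[i]: 0.5 * a * x**2 + b * x * y - 0.5 * y**2
      for i in range(8)]
x = torch.randn(1, requires_grad=True)
y = torch.randn(1, requires_grad=True)
x, y = sgda_rr(fs, x, y)                       # both drift toward (0, 0)
```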
Furthermore, the results show that our method exhibits high success rates and significantly lower computational times than the state-of-the-art methods, including NMPs that require training data from classical planners.","Robotics, Motion Planning, Neural Fields, Implicit Neural Representation, Physics Informed Deep Learning" MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models,https://openreview.net/forum?id=H0HGljkxQFN,https://openreview.net/pdf?id=H0HGljkxQFN,,"This paper presents MOAT, a family of neural networks that build on top of MObile convolution (i.e., inverted residual blocks) and ATtention. The Transformer architectures are gaining popularity in computer vision for their powerful attention mechanism to encode long-range interactions. However, their performance and generalization are worse than convolutional neural networks (ConvNets), especially in the low data regime. Transformers require a huge amount of training images to learn the right inductive biases for vision recognition, some of which (e.g., translation equivariance) are successfully built into ConvNets. The research community has thus attempted to combine the strengths from both architectures. Unlike the current works that simply stack separate mobile convolution and transformer blocks, we effectively merge them into a MOAT block. Starting with a standard Transformer block, we replace its multi-layer perceptron with a mobile convolution block, and further reorder it before the self-attention operation. The mobile convolution block not only enhances the network representation capacity, but also produces better downsampled features. Our conceptually simple MOAT networks are surprisingly effective, achieving 89.1% top-1 accuracy on ImageNet-1K with ImageNet-22K pretraining. Additionally, MOAT can be seamlessly applied to downstream tasks that require large resolution inputs by simply converting the global attention to window attention. Thanks to the mobile convolution that effectively exchanges local information between pixels (and thus cross-windows), MOAT does not need the extra window-shifting mechanism. As a result, on COCO object detection, MOAT achieves 58.5% AP$^{\text{box}}$ with 110M model parameters (single-scale inference, and hard NMS), and on ADE20K semantic segmentation, MOAT attains 57.6% mIoU with 496M model parameters (single-scale inference). Finally, the tiny-MOAT family, obtained by simply reducing the channel sizes, also surprisingly outperforms several mobile-specific transformer-based models on ImageNet. Code will be made publicly available.", Biological connectomes as a representation for the architecture of artificial neural networks,https://openreview.net/forum?id=sW4dg6x_sGv,https://openreview.net/pdf?id=sW4dg6x_sGv,We map the neural circuitry of a nematode taken from neuroscience literature into an ANN and show that it works well on certain tasks but not others.,"Grand efforts in neuroscience are working toward mapping the connectomes of many new species, including the near completion of the Drosophila melanogaster connectome. It is important to ask whether these models could benefit artificial intelligence. In this work we ask two fundamental questions: (1) where and when biological connectomes can be of use in machine learning, (2) which design principles are necessary for extracting a good representation of the connectome. Toward this end, we translate the motor circuit of the C.
elegans nematode into artificial neural networks at varying levels of biophysical realism and evaluate the outcome of training these networks on motor and non-motor behavioral tasks. We demonstrate that biophysical realism need not be upheld to attain the advantages of using biological circuits. We also establish that, even if the exact wiring diagram is not retained, the architectural statistics provide a valuable prior. Finally, we show that while the C. elegans locomotion circuit provides a powerful inductive bias on locomotion problems, its structure may hinder performance on tasks unrelated to locomotion, such as visual classification problems.","neural architecture, biologically inspired, connectomes, neuro-AI" MSQ-BioBERT: Ambiguity Resolution to Enhance BioBERT Medical Question-Answering,https://openreview.net/forum?id=6BdJ5G5wEdp,https://openreview.net/pdf?id=6BdJ5G5wEdp,A way to improve BioBERT Question-answering using multiple synonymous questions.,"Bidirectional Encoder Representations from Transformers (BERT) and its biomedical variation (BioBERT) achieve impressive results on the SQuAD or medical question-answering (QA) datasets, and so they are widely used for a variety of passage-based QA tasks. However, their performances rapidly deteriorate when encountering passage and context ambiguities. This issue is prevalent and unavoidable in many fields, notably the medical field. To address this issue, we introduce a novel approach called the Multiple Synonymous Questions BioBERT (MSQ-BioBERT), which integrates question augmentation, rather than the typical single question used by traditional BioBERT, to elevate performance. Experiments with both an ambiguous medical dataset and open biomedical datasets demonstrate the significant performance gains of the MSQ-BioBERT approach, showcasing a new method for addressing ambiguity in QA tasks. ","Question answering, Question augmentation, BioBERT, Matrix approximation" Part-Based Models Improve Adversarial Robustness,https://openreview.net/forum?id=bAMTaeqluh4,https://openreview.net/pdf?id=bAMTaeqluh4,Using an auxiliary task and richer annotation in the form of part segmentation can improve robustness of neural networks by a large margin.,"We show that combining human prior knowledge with end-to-end learning can improve the robustness of deep neural networks by introducing a part-based model for object classification. We believe that the richer form of annotation helps guide neural networks to learn more robust features without requiring more samples or larger models. Our model combines a part segmentation model with a tiny classifier and is trained end-to-end to simultaneously segment objects into parts and then classify the segmented object. Empirically, our part-based models achieve both higher accuracy and higher adversarial robustness than a ResNet-50 baseline on all three datasets. For instance, the clean accuracy of our part models is up to 15 percentage points higher than the baseline’s, given the same level of robustness. Our experiments indicate that these models also reduce texture bias and yield better robustness against common corruptions and spurious correlations.
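A schematic sketch of the part-based design described in the abstract above: a part segmentation network produces per-part masks, and a tiny classifier maps the downsampled part logits to class scores, with both trained end-to-end. The backbone and head sizes are illustrative stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PartModel(nn.Module):
    def __init__(self, n_parts=10, n_classes=20):
        super().__init__()
        self.segmenter = nn.Sequential(            # stand-in segmentation net
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, n_parts, 1))
        self.classifier = nn.Sequential(           # the "tiny" classifier
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(n_parts * 64, n_classes))

    def forward(self, x):
        part_logits = self.segmenter(x)            # (B, n_parts, H, W)
        return self.classifier(part_logits), part_logits

model = PartModel()
logits, parts = model(torch.randn(2, 3, 64, 64))
# Training would combine a classification loss on `logits` with a part
# segmentation loss on `parts`, so the richer annotation guides the features.
```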
The code is included in the supplementary material.","adversarial robustness, adversarial examples, computer vision, part-based model" Asymmetric Certified Robustness via Feature-Convex Neural Networks,https://openreview.net/forum?id=IXsI73NDuqN,https://openreview.net/pdf?id=IXsI73NDuqN,"We propose a novel, convexity-based learning architecture which enables closed-form adversarial robustness certificates for all norm balls in an asymmetric robustness setting.","Recent works have introduced input-convex neural networks (ICNNs) as learning models with advantageous training, inference, and generalization properties linked to their convex structure. In this paper, we propose a novel feature-convex neural network (FCNN) architecture as the composition of an ICNN with a Lipschitz feature map in order to achieve adversarial robustness. We consider the asymmetric binary classification setting with one ""sensitive"" class, and for this class we prove deterministic, closed-form, and easily-computable certified robust radii for arbitrary $\ell_p$-norms. We theoretically justify the use of these models by characterizing their decision region geometry, extending the universal approximation theorem for ICNN regression to the classification setting, and proving a lower bound on the probability that such models perfectly fit even unstructured uniformly distributed data in sufficiently high dimensions. Experiments on Malimg malware classification as well as subsets of MNIST, CIFAR-10, and ImageNet-scale datasets show that FCNNs can attain orders of magnitude larger certified $\ell_1$-radii than competing methods while maintaining substantial $\ell_2$- and $\ell_{\infty}$-radii.","robustness, certification, convex, machine learning" PGrad: Learning Principal Gradients For Domain Generalization,https://openreview.net/forum?id=CgCmwcfgEdH,https://openreview.net/pdf?id=CgCmwcfgEdH,,"Machine learning models fail to perform well when facing out-of-distribution (OOD) domains, a challenge known as domain generalization (DG). In this work, we develop a novel DG training strategy, which we call PGrad, to learn a robust gradient direction, improving models' generalization ability on unseen domains. The proposed gradient aggregates the principal directions of a sampled roll-out optimization trajectory that measures the training dynamics across all training domains. The PGrad gradient design forces DG training to ignore domain-dependent noise signals and updates all training domains with a robust direction covering main components of parameter dynamics. We further improve PGrad via bijection-based computational refinement and directional plus length-based calibrations. Our theoretical proof connects PGrad to the spectral analysis of the Hessian in training neural networks. Experiments on DomainBed and WILDS benchmarks demonstrate that our approach effectively enables robust DG optimization and leads to smoothly decreased loss curves. Empirically, PGrad achieves competitive results across seven datasets, demonstrating its efficacy across both synthetic and real-world distributional shifts.", Learning Efficient Models From Few Labels By Distillation From Multiple Tasks,https://openreview.net/forum?id=yzHn1QejdT4,https://openreview.net/pdf?id=yzHn1QejdT4,We create an efficient model for a novel task via task similarity-weighted multi-source distillation. ,"We address the challenge of building efficient yet accurate recognition systems that can be trained with limited labels. Many specialized applications of computer vision (e.g.
analyzing X-rays or satellite images) have severe resource constraints both during training and inference. While transfer learning is an effective solution for training on small labeled datasets, it still often requires a large base model for fine-tuning. In this paper, we present a weighted multi-source distillation method; we distill multiple (diverse) source models trained on different domains, weighted by their relevance for the target task, into a single efficient model using limited labeled data. When the goal is accurate recognition under computational constraints, our approach outperforms both transfer learning from strong ImageNet initializations as well as state-of-the-art semi-supervised techniques such as FixMatch. When averaged over 8 diverse target tasks, our method outperforms the baselines by 5.6 and 4.5 percentage points, respectively.","transfer learning, semi-supervised learning, multi-source distillation" Extremely Simple Activation Shaping for Out-of-Distribution Detection,https://openreview.net/forum?id=ndYXTEL6cZz,https://openreview.net/pdf?id=ndYXTEL6cZz,"We develop an extremely simple, post hoc, on-the-fly, and plug-and-play activation shaping method for out-of-distribution detection.","The separation between training and deployment of machine learning models implies that not all scenarios encountered in deployment can be anticipated during training, and therefore relying solely on advancements in training has its limits. Out-of-distribution (OOD) detection is an important area that stress-tests a model’s ability to handle unseen situations: Do models know when they don’t know? Existing OOD detection methods either incur extra training steps, additional data or make nontrivial modifications to the trained network. In contrast, in this work, we propose an extremely simple, post-hoc, on-the-fly activation shaping method, ASH, where a large portion (e.g. 90%) of a sample’s activation at a late layer is removed, and the rest (e.g. 10%) simplified or lightly adjusted. The shaping is applied at inference time, and does not require any statistics calculated from training data. Experiments show that such a simple treatment enhances in-distribution and out-of-distribution sample distinction so as to allow state-of-the-art OOD detection on ImageNet, and does not noticeably deteriorate the in-distribution accuracy.","out-of-distribution, out-of-distribution detection, activation pruning, post hoc, sparsity, activation shaping" ZiCo: Zero-shot NAS via inverse Coefficient of Variation on Gradients,https://openreview.net/forum?id=rwo-ls5GqGn,https://openreview.net/pdf?id=rwo-ls5GqGn,A theoretically grounded SOTA proxy for zero-shot NAS under various inference budgets.,"Neural Architecture Search (NAS) is widely used to automatically design the neural network with the best performance among a large number of candidate architectures. To reduce the search time, zero-shot NAS aims at designing training-free proxies that can predict the test performance of a given architecture. However, as shown recently, none of the zero-shot proxies proposed to date can actually work consistently better than a naive proxy, namely, the number of network parameters (#Params). To improve this state of affairs, as the main theoretical contribution, we first reveal how some specific gradient properties across different samples impact the convergence rate of neural networks. Based on this theoretical analysis, we propose a new zero-shot proxy, ZiCo, the first proxy that works consistently better than #Params.
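A simplified reconstruction of a ZiCo-style proxy, based on the "inverse coefficient of variation on gradients" idea named in the title above: gradients are collected over a few random batches at initialization and aggregated as mean/std per weight. The exact per-layer aggregation in the paper may differ; treat this as a sketch.

```python
import torch

def zico_like_score(net, loss_fn, batches, eps=1e-8):
    # `batches` is a handful (>= 2) of (x, y) pairs at initialization.
    grads = {n: [] for n, p in net.named_parameters() if p.requires_grad}
    for x, y in batches:
        net.zero_grad()
        loss_fn(net(x), y).backward()
        for n, p in net.named_parameters():
            if p.grad is not None:
                grads[n].append(p.grad.detach().abs().clone())
    score = 0.0
    for n, gs in grads.items():
        g = torch.stack(gs)                    # (n_batches, *param_shape)
        icv = g.mean(0) / (g.std(0) + eps)     # inverse coefficient of variation
        score += torch.log(icv.sum() + eps).item()
    return score                               # higher predicts better accuracy
```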
We demonstrate that ZiCo works better than State-Of-The-Art (SOTA) proxies on several popular NAS-Benchmarks (NASBench101, NATSBench-SSS/TSS) for multiple datasets (CIFAR10/100, ImageNet16-120). Finally, we demonstrate that the optimal architectures found via ZiCo are as competitive as the ones found by one-shot and multi-shot NAS methods, but with much less search time. For example, ZiCo-based NAS can find optimal architectures with 78.1%, 79.4%, and 80.4% test accuracy under inference budgets of 450M, 600M, and 1000M FLOPs on ImageNet within 0.4 GPU days. ","Neural Architecture Search, Zero-shot NAS, Gradient Analysis, Training Convergence" Statistical Guarantees for Consensus Clustering,https://openreview.net/forum?id=kQxry8Z6Fd9,https://openreview.net/pdf?id=kQxry8Z6Fd9,"We propose spectral algorithms for aggregating labels from multiple clustering algorithms without knowing the optimal matching between clusters, and we provide theoretical performance bounds. ","Consider the problem of clustering $n$ objects. One can apply multiple algorithms to produce $N$ potentially different clusterings of the same objects, that is, partitions of the $n$ objects into $K$ groups. Even a single randomized algorithm can output different clusterings. This often happens when one samples from the posterior of a Bayesian model, or runs multiple MCMC chains from random initializations. A natural task is then to form a consensus among these different clusterings. The challenge in an unsupervised setting is that the optimal matching between clusters of different inputs is unknown. We model this problem as finding a barycenter (also known as Fr\'{e}chet mean) relative to the misclassification rate. We show that by lifting the problem to the space of association matrices, one can derive aggregation algorithms that circumvent the knowledge of the optimal matchings. We analyze the statistical performance of aggregation algorithms under a stochastic label perturbation model, and show that a $K$-means type algorithm followed by a local refinement step can achieve near optimal performance, with a rate that decays exponentially in $N$. Numerical experiments show the effectiveness of the proposed methods.","consensus clustering, unsupervised label aggregation, spectral clustering, barycenter problem, Frechet mean, semidefinite relaxation" "Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation",https://openreview.net/forum?id=f6cywgfd11,https://openreview.net/pdf?id=f6cywgfd11,"We propose a comprehensive benchmark for holistic evaluation of general-purpose visual representations, as well as a general framework to mitigate gaps among visual tasks and accommodate arbitrary representations","Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding. Existing efforts at general vision models are limited to a narrow range of tasks and offer no overarching framework to perform visual tasks holistically. We present a new comprehensive benchmark, General-purpose Visual Understanding Evaluation (G-VUE), covering the full spectrum of visual cognitive abilities with four disjoint functional domains — Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, from 3D reconstruction to visual reasoning and manipulation. Along with the benchmark, we provide a general encoder-decoder framework for the tasks in G-VUE, to accommodate arbitrary visual representations on all 11 tasks.
With our benchmark and framework, we evaluate 7 typical visual representations and observe that (1) transformers and more data empirically lead to more general-purpose representations, (2) language plays a significant role in learning versatile visual representations, and (3) correlations indicate subtle shared structure among tasks despite their distinctions, which could be evidence of general-purpose ability. We argue that instead of pursuing general-purpose vision models by end-to-end multi-task training, it is more reasonable to evaluate and investigate representations, which helps digest emerging pre-trained vision models and hopefully sheds light on general intelligence.","general-purpose vision, benchmark, visual representation" Expressive Monotonic Neural Networks,https://openreview.net/forum?id=w2P7fMy_RH,https://openreview.net/pdf?id=w2P7fMy_RH,This paper introduces a new method to universally approximate lipschitz functions that are monotonic in any subset of their inputs.,"The monotonic dependence of the outputs of a neural network on some of its inputs is a crucial inductive bias in many scenarios where domain knowledge dictates such behavior. This is especially important for interpretability and fairness considerations. In a broader context, scenarios in which monotonicity is important can be found in finance, medicine, physics, and other disciplines. It is thus desirable to build neural network architectures that implement this inductive bias provably. In this work, we propose a weight-constrained architecture with a single residual connection to achieve exact monotonic dependence in any subset of the inputs. The weight constraint scheme directly controls the Lipschitz constant of the neural network and thus provides the additional benefit of robustness. Compared to currently existing techniques used for monotonicity, our method is simpler in implementation and theoretical foundations, has negligible computational overhead, is guaranteed to produce monotonic dependence, and is highly expressive. We show how the algorithm is used to train powerful, robust, and interpretable discriminators that achieve competitive performance compared to current state-of-the-art methods across various benchmarks, from social applications to the classification of the decays of subatomic particles produced at the CERN Large Hadron Collider.","monotonic, lipschitz" Active Image Indexing,https://openreview.net/forum?id=K9RHxPpjn2,https://openreview.net/pdf?id=K9RHxPpjn2,"In the context of image tracing, instead of watermarking an image with an ID, we slightly modify it to make its representation more indexing-friendly, which makes plain content-based indexing much more robust (62% → 100% accuracy for some settings).","Image copy detection and retrieval from large databases leverage two components. First, a neural network maps an image to a vector representation, that is relatively robust to various transformations of the image. Second, an efficient but approximate similarity search algorithm trades scalability (size and speed) against quality of the search, thereby introducing a source of error. This paper improves the robustness of image copy detection with active indexing, which optimizes the interplay of these two components. We reduce the quantization loss of a given image representation by making imperceptible changes to the image before its release. The loss is back-propagated through the deep neural network back to the image, under perceptual constraints. These modifications make the image more retrievable.
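A hedged sketch of the activation step described above: the image is nudged under a small budget so that its representation moves toward the code it will be indexed under. The perceptual constraint is simplified here to an $\ell_\infty$ ball, and `encoder`/`quantizer` are assumed interfaces, not the paper's implementation.

```python
import torch

def activate_image(img, encoder, quantizer, steps=10, lr=1e-2, eps=8 / 255):
    # `encoder` maps images to representations; `quantizer` returns the
    # reconstruction the index would store for a given representation.
    delta = torch.zeros_like(img, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        z = encoder(img + delta)
        target = quantizer(z.detach())          # fixed reconstruction target
        loss = (z - target).pow(2).sum()        # quantization loss
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)             # imperceptibility budget
    return (img + delta).detach()               # the "activated" image
```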
Our experiments show that the retrieval and copy detection of activated images is significantly improved. For instance, activation improves Recall@1 by $+40\%$ on various image transformations and for several popular indexing structures based on product quantization and locality-sensitive hashing.","Indexing, Copy detection, Image similarity search, Watermarking" Towards Explaining Distribution Shifts,https://openreview.net/forum?id=ppxKnb1SIB,https://openreview.net/pdf?id=ppxKnb1SIB,"We use interpretable distributional mappings to explain how a source distribution shifted to a target distribution, usable on both tabular and image-based distribution shifts.","Distribution shift can have fundamental consequences such as signaling a change in the operating environment or significantly reducing the accuracy of downstream models. Thus, understanding such distribution shifts is critical for examining and hopefully mitigating the effect of such a shift. Most prior work has focused on merely detecting if a shift has occurred and assumes any detected shift can be understood and handled appropriately by a human operator. We hope to aid in these manual mitigation tasks by explaining the distribution shift using interpretable transportation maps from the original distribution to the shifted one. We derive our interpretable mappings from a relaxation of the optimal transport problem, where the candidate mappings are restricted to belong to a set of interpretable mappings. We then use quintessential examples of distribution shift in simulated and real-world cases to showcase how our explanatory mappings provide a better balance between detail and interpretability than the de facto standard mean shift explanation by both visual inspection and our PercentExplained metric.","Distribution Shifts, Explainable AI" Perturbation Analysis of Neural Collapse,https://openreview.net/forum?id=uAmv2zRAWn,https://openreview.net/pdf?id=uAmv2zRAWn,"We propose a new model for exploring practical NC behavior and establish, via perturbation analysis, results that could not have been obtained by existing (idealized) models.","Training deep neural networks for classification often includes minimizing the training loss beyond the zero training error point. In this phase of training, a “neural collapse” behavior has been observed: the variability of features (outputs of the penultimate layer) of within-class samples decreases and the mean features of different classes approach a certain tight frame structure. Recent works analyze this behavior via idealized unconstrained features models where all the minimizers exhibit exact collapse. However, with practical networks and datasets, the features typically do not reach exact collapse, e.g., because deep layers cannot arbitrarily modify intermediate features that are far from being collapsed. In this paper, we propose a richer model that can capture this phenomenon by forcing the features to stay in the vicinity of a predefined features matrix (e.g., intermediate features). We explore the model in the small vicinity case via perturbation analysis and establish results that cannot be obtained by the previously studied models.
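For concreteness, a common way the within-class variability mentioned above is quantified in the neural collapse literature is a ratio of within-class to between-class scatter, e.g. $\mathrm{tr}(\Sigma_W \Sigma_B^{\dagger})/K$; a small sketch of that standard metric (not the paper's code):

```python
import torch

def nc1_metric(features, labels, n_classes):
    # features: (N, d) penultimate-layer outputs; labels: (N,) class ids.
    mu_g = features.mean(0)
    d = features.size(1)
    sw, sb = torch.zeros(d, d), torch.zeros(d, d)
    for k in range(n_classes):
        fk = features[labels == k]
        mu_k = fk.mean(0)
        sw += ((fk - mu_k).T @ (fk - mu_k)) / features.size(0)   # within-class
        sb += torch.outer(mu_k - mu_g, mu_k - mu_g) / n_classes  # between-class
    # Smaller values indicate stronger within-class variability collapse.
    return torch.trace(sw @ torch.linalg.pinv(sb)) / n_classes
```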
For example, we prove reduction in the within-class variability of the optimized features compared to the predefined input features (via analyzing gradient flow on the “central-path” with minimal assumptions), analyze the minimizers in the near-collapse regime, and provide insights on the effect of regularization hyperparameters on the closeness to collapse. We support our theory with experiments in practical deep learning settings.","deep learning theory, neural collapse" Learning Simultaneous Navigation and Construction in Grid Worlds ,https://openreview.net/forum?id=NEtep2C7yD,https://openreview.net/pdf?id=NEtep2C7yD,Position-related representation learning improves DRL consistently when addressing the localization-planning interdependence challenge in the proposed mobile construction tasks.,"We propose to study a new learning task, mobile construction, to enable an agent to build designed structures in 1/2/3D grid worlds while navigating in the same evolving environments. Unlike existing robot learning tasks such as visual navigation and object manipulation, this task is challenging because of the interdependence between accurate localization and strategic construction planning. In pursuit of generic and adaptive solutions to this partially observable Markov decision process (POMDP) based on deep reinforcement learning (RL), we design a Deep Recurrent Q-Network (DRQN) with explicit recurrent position estimation in this dynamic grid world. Our extensive experiments show that pre-training this position estimation module before Q-learning can significantly improve the construction performance measured by the intersection-over-union score, achieving the best results in our benchmark of various baselines including model-free and model-based RL, a handcrafted SLAM-based policy, and human players.","Navigation, Localization, Construction, Deep reinforcement learning, Representation learning" Learning to CROSS exchange to solve min-max vehicle routing problems,https://openreview.net/forum?id=ZcnzsHC10Y,https://openreview.net/pdf?id=ZcnzsHC10Y,,"CROSS exchange (CE), a meta-heuristic that solves various vehicle routing problems (VRPs), improves the solutions of VRPs by swapping the sub-tours of the vehicles. Inspired by CE, we propose Neuro CE (NCE), a fundamental operator of a \textit{learned} meta-heuristic, to solve various min-max VRPs while overcoming the limitations of CE, i.e., the expensive $\mathcal{O}(n^4)$ search cost. NCE employs a graph neural network to predict the cost-decrements (i.e., results of CE searches) and utilizes the predicted cost-decrements to guide the selection of sub-tours for swapping, while reducing the search cost to $\mathcal{O}(n^2)$. As the learning objective of NCE is to predict the cost-decrement, the training can simply be done in a supervised fashion, whose training samples can be easily collected. Despite the simplicity of NCE, numerical results show that NCE trained on the min-max flexible multi-depot VRP (min-max FMDVRP) outperforms the meta-heuristic baselines.
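A schematic sketch of the NCE search loop described above: rather than evaluating all $\mathcal{O}(n^4)$ CROSS-exchange swaps exactly, a learned predictor scores sub-tour pairs and only the most promising swap is applied. `predict_decrement` stands in for the paper's GNN (assumed to return a float) and `apply_cross_exchange` for the actual swap routine.

```python
import itertools

def nce_step(tours, predict_decrement, apply_cross_exchange):
    # Score each pair of vehicle tours with the predicted cost-decrement:
    # an O(n^2) pass over pairs instead of exhaustive swap enumeration.
    scored = [(predict_decrement(a, b), i, j)
              for (i, a), (j, b) in itertools.combinations(enumerate(tours), 2)]
    gain, i, j = max(scored)
    if gain <= 0:
        return tours, False                    # no improving swap predicted
    tours[i], tours[j] = apply_cross_exchange(tours[i], tours[j])
    return tours, True                         # repeat until no gain remains
```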
More importantly, it significantly outperforms the neural baselines when solving distinctive special cases of min-max FMDVRP (e.g., min-max MDVRP, min-max mTSP, min-max CVRP) without additional training.", PandA: Unsupervised Learning of Parts and Appearances in the Feature Maps of GANs,https://openreview.net/forum?id=iUdSB2kK9GY,https://openreview.net/pdf?id=iUdSB2kK9GY,,"Recent advances in the understanding of Generative Adversarial Networks (GANs) have led to remarkable progress in visual editing and synthesis tasks, capitalizing on the rich semantics that are embedded in the latent spaces of pre-trained GANs. However, existing methods are often tailored to specific GAN architectures and are limited to either discovering global semantic directions that do not facilitate localized control, or require some form of supervision through manually provided regions or segmentation masks. In this light, we present an architecture-agnostic approach that jointly discovers factors representing spatial parts and their appearances in an entirely unsupervised fashion. These factors are obtained by applying a semi-nonnegative tensor factorization on the feature maps, which in turn enables context-aware local image editing with pixel-level control. In addition, we show that the discovered appearance factors correspond to saliency maps that localize concepts of interest, without using any labels. Experiments on a wide range of GAN architectures and datasets show that, in comparison to the state of the art, our method is far more efficient in terms of training time and, most importantly, provides much more accurate localized control. Our code is available in the supplementary material.","GANs, interpretability, local image editing" Compositional Law Parsing with Latent Random Functions,https://openreview.net/forum?id=PEuxUXIMLlA,https://openreview.net/pdf?id=PEuxUXIMLlA,,"Human cognition has compositionality. We understand a scene by decomposing the scene into different concepts (e.g. shape and position of an object) and learning the respective laws of these concepts, which may be either natural (e.g. laws of motion) or man-made (e.g. laws of a game). The automatic parsing of these laws indicates the model's ability to understand the scene, which makes law parsing play a central role in many visual tasks. In this paper, we propose a deep latent variable model for Compositional LAw Parsing (CLAP). CLAP achieves human-like compositionality through an encoding-decoding architecture that represents concepts in the scene as latent variables, and further employs concept-specific random functions, instantiated with Neural Processes, in the latent space to capture the law on each concept. Our experimental results demonstrate that CLAP outperforms the compared baseline methods in multiple visual tasks including intuitive physics, abstract visual reasoning, and scene representation.
In addition, CLAP can learn concept-specific laws in a scene without supervision and one can edit laws by modifying the corresponding latent random functions, validating its interpretability and manipulability.", Pink Noise Is All You Need: Colored Noise Exploration in Deep Reinforcement Learning,https://openreview.net/forum?id=hQ9V5QN27eS,https://openreview.net/pdf?id=hQ9V5QN27eS,"Pink noise, a temporally correlated noise type, outperforms other action noise types on standard continuous control benchmarks.","In off-policy deep reinforcement learning with continuous action spaces, exploration is often implemented by injecting action noise into the action selection process. Popular algorithms based on stochastic policies, such as SAC or MPO, inject white noise by sampling actions from uncorrelated Gaussian distributions. In many tasks, however, white noise does not provide sufficient exploration, and temporally correlated noise is used instead. A common choice is Ornstein-Uhlenbeck (OU) noise, which is closely related to Brownian motion (red noise). Both red noise and white noise belong to the broad family of colored noise. In this work, we perform a comprehensive experimental evaluation on MPO and SAC to explore the effectiveness of other colors of noise as action noise. We find that pink noise, which is halfway between white and red noise, significantly outperforms white noise, OU noise, and other alternatives on a wide range of environments. Thus, we recommend it as the default choice for action noise in continuous control. ","reinforcement learning, exploration, action noise, continuous control" LilNetX: Lightweight Networks with EXtreme Model Compression and Structured Sparsification,https://openreview.net/forum?id=NVZvalzCLg,https://openreview.net/pdf?id=NVZvalzCLg,,"We introduce LilNetX, an end-to-end trainable technique for neural networks that enables learning models with a specified accuracy-rate-computation trade-off. Prior works approach these problems one at a time and often require post-processing or multistage training, which becomes less practical and does not scale well to large datasets or architectures. Our method constructs a joint training objective that penalizes the self information of network parameters in a latent representation space to encourage small model size while also introducing priors to increase structured sparsity in the parameter space to reduce computation. When compared with existing state-of-the-art model compression methods, we achieve up to 50% smaller model size and 98% model sparsity on ResNet-20 on the CIFAR-10 dataset as well as 37% smaller model size and 71% structured sparsity on ResNet-50 trained on ImageNet while retaining the same accuracy as those methods. We show that the resulting sparsity can improve the inference time of the models by almost 1.8 times the dense ResNet-50 baseline model. ","Quantization, Model Compression, Sparsity, Pruning" First-order Context-based Adaptation for Generalizing to New Dynamical Systems,https://openreview.net/forum?id=AW0i0lOhzqJ,https://openreview.net/pdf?id=AW0i0lOhzqJ,"We propose FOCA, a learning framework to model sets of systems governed by common but unknown laws that differentiate themselves by their context and train FOCA with a simple and efficient EMA-based method.","In this paper, we propose FOCA (First-Order Context-based Adaptation), a learning framework to model sets of systems governed by common but unknown laws that differentiate themselves by their context.
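A sketch of colored-noise action exploration as recommended in the pink-noise abstract above: white Gaussian noise is shaped in the frequency domain so its power spectrum follows $1/f^{\beta}$ ($\beta=0$ white, $\beta=1$ pink, $\beta=2$ red); the spectral construction below is one standard way to generate such noise, not necessarily the authors' implementation.

```python
import numpy as np

def colored_noise(beta, n_steps, n_dims, rng=np.random.default_rng()):
    white = rng.standard_normal((n_dims, n_steps))
    freqs = np.fft.rfftfreq(n_steps)
    freqs[0] = freqs[1]                         # avoid division by zero at DC
    shaped = np.fft.rfft(white, axis=-1) / freqs ** (beta / 2)
    noise = np.fft.irfft(shaped, n=n_steps, axis=-1)
    return noise / noise.std(axis=-1, keepdims=True)   # unit variance per dim

eps = colored_noise(beta=1.0, n_steps=1000, n_dims=4)  # one pink-noise episode
# At step t the agent would act with a_t = policy(s_t) + sigma * eps[:, t],
# resampling a fresh noise sequence at the start of each episode.
```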
Inspired by classical modeling-and-identification approaches, FOCA learns to represent the common law through shared parameters and relies on online optimization to compute system-specific context. Due to the online optimization-based context inference, the training of FOCA involves a bi-level optimization problem. To train FOCA efficiently, we utilize an exponential moving average (EMA)-based method that allows for fast training using only first-order derivatives. We test FOCA on polynomial regression and time-series prediction tasks composed of three ODEs and one PDE, empirically finding it outperforms baselines.","physical system modeling, differential equation, generalization, context, adaptation" CBP-QSNN: Spiking Neural Networks Quantized Using Constrained Backpropagation,https://openreview.net/forum?id=P7h7UT9uDzb,https://openreview.net/pdf?id=P7h7UT9uDzb,We propose a method to quantize FP32 weights in spiking neural networks using constrained backpropagation.,"Spiking Neural Networks (SNNs) support sparse event-based data processing at high power efficiency when implemented in event-based neuromorphic processors. However, the limited on-chip memory capacity of neuromorphic processors strictly limits the depth and width of the SNNs that can be implemented. A direct solution is the use of quantized SNNs (QSNNs) in place of SNNs with FP32 weights. To this end, we propose a method to quantize the weights using constrained backpropagation (CBP) with the Lagrangian function (conventional loss function plus well-defined weight-constraint functions) as an objective function. This work utilizes CBP as a post-training algorithm for deep SNNs pre-trained using various state-of-the-art methods including direct training (TSSL-BP, STBP, and surrogate gradient) and DNN-to-SNN conversion (SNN-Calibration), validating CBP as a general framework for QSNNs. CBP-QSNNs achieve high accuracy: the worst-case accuracy degradation on CIFAR-10, DVS128 Gesture, and CIFAR10-DVS is less than 1\%. Particularly, CBP-QSNNs for SNN-Calibration-pretrained SNNs on CIFAR-100 show an unexpectedly large accuracy increase of 3.72\% while using little weight memory (3.5\% of the FP32 case).","Quantized spiking neural network, Constrained backpropagation, Binary weight, Lagrange multiplier method, Weight constraint" Leveraging the Third Dimension in Contrastive Learning,https://openreview.net/forum?id=Pqi9ZxxdjM,https://openreview.net/pdf?id=Pqi9ZxxdjM,Depth signal improves contrastive learning,"Self-Supervised Learning (SSL) methods operate on unlabeled data to learn robust representations useful for downstream tasks. Most SSL methods rely on augmentations obtained by transforming the 2D image pixel map. These augmentations ignore the fact that biological vision takes place in an immersive three-dimensional, temporally contiguous environment, and that low-level biological vision relies heavily on depth cues. Using a signal provided by a pretrained state-of-the-art RGB-to-depth model (the Depth Prediction Transformer, Ranftl et al., 2021), we explore two distinct approaches to incorporating depth signals into the SSL framework. First, we evaluate contrastive learning using an RGB+depth input representation. Second, we use the depth signal to generate novel views from slightly different camera positions, thereby producing a 3D augmentation for contrastive learning.
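A hedged sketch of constrained backpropagation for weight binarization as described in the CBP-QSNN abstract above: the objective is a Lagrangian combining the task loss with a weight-constraint function that vanishes only at $w=\pm 1$, with ascent on the multiplier. The constraint form and update schedule are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def cbp_lagrangian(net, task_loss, lam):
    # Constraint term is zero only when every weight sits at +1 or -1.
    constraint = sum(((p ** 2 - 1) ** 2).mean()
                     for p in net.parameters() if p.dim() > 1)
    return task_loss + lam * constraint, constraint

# Inside a training loop (sketch):
#   loss, c = cbp_lagrangian(net, criterion(net(x), y), lam)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
#   lam += eta_lam * c.item()          # multiplier ascent on the constraint
# After convergence, weights cluster near +/-1 and are snapped to sign(w).
```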
We evaluate these two approaches on three different SSL methods---BYOL, SimSiam, and SwAV---using ImageNette (a 10-class subset of ImageNet) and ImageNet-100. We find that both approaches to incorporating depth signals improve the robustness and generalization of the baseline SSL methods, though the first approach (with depth-channel concatenation) is superior.","contrastive learning, depth, self-supervised learning" O-ViT: Orthogonal Vision Transformer,https://openreview.net/forum?id=vk2855b0aT,https://openreview.net/pdf?id=vk2855b0aT,,"Inspired by the tremendous success of self-attention mechanism in natural language processing, the Vision Transformer (ViT) creatively applies it to image patch sequences and achieves incredible performance. However, ViT brings about feature redundancy and low utilization of model capacity. To address this problem, we propose a novel and effective method named Orthogonal Vision Transformer (O-ViT), to optimize ViT from the geometric perspective. O-ViT limits parameters of self-attention blocks to reside on the orthogonal manifold, which can reduce the similarity between trainable parameters and construct a higher degree of distinction between features. Moreover, O-ViT achieves both orthogonal constraints and negligible optimization overhead by adopting a surjective mapping between the orthogonal group and its Lie algebra. Comparative experiments on various image recognition tasks demonstrate the validity of O-ViT. The experimental results show that O-ViT can boost the performance of ViT by up to 6.4%.", STaSy: Score-based Tabular data Synthesis,https://openreview.net/forum?id=1mNssCWt_v,https://openreview.net/pdf?id=1mNssCWt_v,"We design a score-based generative model for tabular data and apply two training strategies, including the self-paced learning and the proposed fine-tuning method, to stabilize the denoising score matching training.","Tabular data synthesis is a long-standing research topic in machine learning. Many different methods have been proposed over the past decades, ranging from statistical methods to deep generative methods. However, it has not always been successful due to the complicated nature of real-world tabular data. In this paper, we present a new model named $\textbf{S}$core-based $\textbf{Ta}$bular data $\textbf{Sy}$nthesis ($\texttt{STaSy}$) and its training strategy based on the paradigm of score-based generative modeling. Despite the fact that score-based generative models have resolved many issues in generative models, there still exists room for improvement in tabular data synthesis. Our proposed training strategy includes a self-paced learning technique and a fine-tuning strategy, which further increases the sampling quality and diversity by stabilizing the denoising score matching training. Furthermore, we also conduct rigorous experimental studies in terms of the generative task trilemma: sampling quality, diversity, and time. In our experiments with 15 benchmark tabular datasets and 7 baselines, our method outperforms existing methods in terms of task-dependent evaluations and diversity. ","Score-based generative model, Tabular data, Self-paced learning" REDUCING OVERSMOOTHING IN GRAPH NEURAL NETWORKS BY CHANGING THE ACTIVATION FUNCTION,https://openreview.net/forum?id=8CDeu0f4i2,https://openreview.net/pdf?id=8CDeu0f4i2,,"The performance of Graph Neural Networks (GNNs) deteriorates as the depth of the network increases.
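A minimal sketch of the O-ViT-style parametrization described above: a free matrix is mapped to a skew-symmetric matrix (the Lie algebra) and the matrix exponential carries it onto the orthogonal group, so the effective weight is orthogonal by construction with no projection step; the linear-layer wrapper is an illustrative setting.

```python
import torch
import torch.nn as nn

class OrthogonalLinear(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.A = nn.Parameter(torch.randn(dim, dim) * 0.01)

    def forward(self, x):
        skew = self.A - self.A.T               # Lie algebra of the orthogonal group
        W = torch.matrix_exp(skew)             # orthogonal: W @ W.T = I
        return x @ W.T

layer = OrthogonalLinear(16)
W = torch.matrix_exp(layer.A - layer.A.T)
print(torch.allclose(W @ W.T, torch.eye(16), atol=1e-5))   # True
```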
That performance drop is mainly attributed to oversmoothing, which leads to similar node representations through repeated graph convolutions. We show that in deep GNNs the activation function plays a crucial role in oversmoothing. We explain theoretically why this is the case and propose a simple modification to the slope of ReLU to reduce oversmoothing. The proposed approach enables deep architectures without the need to change the network architecture or to add residual connections. We verify the theoretical results experimentally and further show that deep networks that do not suffer from oversmoothing are beneficial in the presence of the “cold start” problem, i.e. when there is no feature information about unlabeled nodes.","Graph Neural Networks, Oversmoothing" Disentangled (Un)Controllable Features,https://openreview.net/forum?id=1nZelVKqpks,https://openreview.net/pdf?id=1nZelVKqpks,Separation of Controllable and Uncontrollable Features in Latent Space,"In the context of MDPs with high-dimensional states, reinforcement learning can achieve better results when using a compressed, low-dimensional representation of the original input space. A variety of learning objectives have therefore been used to learn useful representations. However, these representations usually lack interpretability of the different features. We propose a representation learning algorithm that is able to disentangle latent features into a controllable and an uncontrollable part. The resulting representations are easily interpretable and can be used for learning and planning efficiently by leveraging the specific properties of the two parts. To highlight the benefits of the approach, the disentangling properties of the algorithm are illustrated in three different environments.","Representation Learning, Reinforcement Learning" Visual Prompt Tuning For Test-time Domain Adaptation,https://openreview.net/forum?id=3HnIBTjlXTS,https://openreview.net/pdf?id=3HnIBTjlXTS,Vision Transformer can generalize better during testing by tuning a set of visual prompts with only a little unlabeled target domain data.,"Models should have the ability to adapt to unseen data during test-time to avoid performance drops caused by inevitable distribution shifts in real-world deployment scenarios. In this work, we tackle the practical yet challenging test-time adaptation (TTA) problem, where a model adapts to the target domain without accessing the source data. We propose a simple recipe called data-efficient prompt tuning (DePT) with two key ingredients. First, DePT plugs visual prompts into the vision Transformer and only tunes these source-initialized prompts during adaptation. We find such parameter-efficient finetuning can efficiently adapt the model representation to the target domain without overfitting to the noise in the learning objective. Second, DePT bootstraps the source representation to the target domain by memory bank-based online pseudo labeling. A hierarchical self-supervised regularization specially designed for prompts is jointly optimized to alleviate error accumulation during self-training. With much fewer tunable parameters, DePT demonstrates not only state-of-the-art performance on major adaptation benchmarks, but also superior data efficiency, i.e., adaptation with only 1\% or 10\% data without much performance degradation compared to 100\% data.
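A simplified sketch of the prompt-tuning setup described in the DePT abstract above: learnable prompt tokens are prepended to the token sequence of a frozen vision Transformer, and only the prompts receive gradients during test-time adaptation. The token-level interface is an assumption for illustration; the paper additionally uses pseudo labeling and a self-supervised head.

```python
import torch
import torch.nn as nn

class PromptedViT(nn.Module):
    def __init__(self, vit, n_prompts=16, dim=768):
        super().__init__()
        self.vit = vit
        for p in self.vit.parameters():
            p.requires_grad = False            # backbone stays frozen
        self.prompts = nn.Parameter(torch.zeros(1, n_prompts, dim))

    def forward(self, tokens):                 # tokens: (B, N, dim) embeddings
        prompts = self.prompts.expand(tokens.size(0), -1, -1)
        return self.vit(torch.cat([prompts, tokens], dim=1))

# During adaptation only the prompt tokens are optimized:
# model = PromptedViT(vit_backbone)
# opt = torch.optim.Adam([model.prompts], lr=1e-3)
```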
In addition, DePT can also be readily extended to online or multi-source TTA settings.","deep learning, test-time domain adaptation, unsupervised learning, visual prompt tuning, vision transformer, self-supervision" Mitigating Dataset Bias by Using Per-Sample Gradient,https://openreview.net/forum?id=7mgUec-7GMv,https://openreview.net/pdf?id=7mgUec-7GMv,"We solve the dataset bias problem by using the per-sample gradient. Furthermore, we provide the mathematical background of the proposed algorithm.","The performance of deep neural networks is strongly influenced by the training dataset setup. In particular, when attributes having a strong correlation with the target attribute are present, the trained model can provide unintended prejudgments and show significant inference errors (i.e., the dataset bias problem). Various methods have been proposed to mitigate dataset bias, and their emphasis is on weakly correlated samples, called bias-conflicting samples. These methods are based on explicit bias labels provided by humans. However, such methods incur significant human cost. Recently, several studies have tried to reduce human intervention by utilizing the output space values of neural networks, such as feature space, logits, loss, or accuracy. However, these output space values may be insufficient for the model to understand the bias attributes well. In this study, we propose a debiasing algorithm leveraging gradients, called PGD (Per-sample Gradient-based Debiasing). PGD comprises three steps: (1) training a model on uniform batch sampling, (2) setting the importance of each sample in proportion to the norm of the sample gradient, and (3) training the model using importance-batch sampling, whose probability is obtained in step (2). Compared with existing baselines for various datasets, the proposed method showed state-of-the-art accuracy for the classification task. Furthermore, we provide a theoretical understanding of how PGD can mitigate dataset bias. ","Dataset bias, Debiasing, Gradient-norm based debiasing" CAMA: A New Framework for Safe Multi-Agent Reinforcement Learning Using Constraint Augmentation,https://openreview.net/forum?id=jK02XX9ZpJkt,https://openreview.net/pdf?id=jK02XX9ZpJkt,CAMA can be combined with any SOTA non-safe MARL algorithm to ensure it satisfies added constraints, without strong assumptions or complex implementations.,"With the widespread application of multi-agent reinforcement learning (MARL) in real-life settings, the ability to meet safety constraints has become an urgent problem to solve. For example, it is necessary to avoid collisions to reach a common goal in controlling multiple drones. We address this problem by introducing the Constraint Augmented Multi-Agent framework --- CAMA. CAMA can serve as a plug-and-play module for popular MARL algorithms, including centralized training, decentralized execution and independent learning frameworks. In our approach, we represent the safety constraint as the sum of discounted safety costs bounded by a predefined value, which we call the safety budget. Experiments demonstrate that CAMA can converge quickly to a high degree of constraint satisfaction and surpasses other state-of-the-art safety counterpart algorithms in both cooperative and competitive settings. 
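Step (2) of PGD above reduces to computing per-sample gradient norms and normalizing them into sampling probabilities. A minimal sketch, assuming PyTorch; the loop-based computation and function names are illustrative (in practice `torch.func` can vectorize this):

```python
import torch

def sampling_weights(model, loss_fn, inputs, labels):
    """Per-sample gradient norms -> importance-sampling probabilities.

    Each sample's loss is differentiated separately, so the norm
    reflects how strongly that single example pulls on the parameters;
    bias-conflicting samples tend to receive larger norms.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    norms = []
    for x, y in zip(inputs, labels):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        norms.append(torch.sqrt(sum(g.pow(2).sum() for g in grads)))
    norms = torch.stack(norms)
    return norms / norms.sum()  # probabilities for importance-batch sampling, step (3)
```

The returned probabilities can be fed to a weighted sampler for the retraining pass in step (3).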
","Safe, Multi-agent Reinforcement Learning, Augmentation" CWATR: Generating Richer Captions with Object Attributes,https://openreview.net/forum?id=0uHNy9jmR7z,https://openreview.net/pdf?id=0uHNy9jmR7z,We propose a method to generate richer and more grounded image captions by integrating attributes of the objects in the scene to the generated caption.,"Image captioning is a popular yet challenging task which is at the intersection of Computer Vision and Natural Language Processing. Recently, transformer-based unified Vision and Language models advanced the state-of-the-art further on image captioning. However, there are still fundamental problems in these models. Even though the generated captions by these models are grammatically correct and describe the input image fairly good, they might overlook important details in the image. In this paper, we demonstrate these problems in a state-of-the-art baseline image captioning method and analyze the reasoning behind these problems. We propose a novel approach, named CWATR (Captioning With ATtRibutes), to integrate object attributes to the generated captions in order to obtain richer and more detailed captions. Our analyses demonstrate that the proposed approach generates richer and more visually grounded captions by integrating attributes of the objects in the scene to the generated captions successfully.","image captioning, vision and language pretraining, object attributes, machine learning, deep learning, computer vision" Internal Purity: A Differential Entropy based Internal Validation Index for Clustering Validation,https://openreview.net/forum?id=XhgbD4ZNKFA,https://openreview.net/pdf?id=XhgbD4ZNKFA,,"In an effective process of cluster analysis, it is indispensable to validate the goodness of different partitions after clustering. Existing internal validation indices are implemented based on distance, variance and model-selection. The indices based on distance or variance cannnot catpure the real ``density"" of the cluster and the time complexity for distance based indices is usually too high to be applied for large datasets. Moreover, the indices based on model-selection tend to overestimate the number of cluster in clustering validation. Therefore, we propose a novel internal validation index based on the differential entropy, named \textit{internal purity} (IP). The proposed IP index can effectively measure the purity of a cluster without using the external cluster information, and successfully overcome the drawbacks of existing internal indices. Based on six powerful deep pre-trained models and without further fine-tuning using the experimental datasets, we use four different clustering algorithms to compare our index with thirteen other well-known internal indices on five text and five image datasets. The results show that, for 60 test cases in total, our IP index can return the optimal clustering results in 43 cases while the second best index can merely report the optimal partition in 17 cases, which demonstrates the significant superiority of our IP index when validating the goodness of the clustering results. 
Moreover, theoretical analyses of the effectiveness and efficiency of the proposed index are also provided.", Task Regularized Hybrid Knowledge Distillation For Continual Object Detection,https://openreview.net/forum?id=wZxuiDFEtyR,https://openreview.net/pdf?id=wZxuiDFEtyR,Task Regularized Hybrid Knowledge Distillation Method For Class Incremental Object Detection,"Knowledge distillation has been used to overcome catastrophic forgetting in the Continual Object Detection (COD) task. Previous works mainly focus on combining different distillation methods, including feature, classification, location and relation, into a mixed scheme to solve this problem. In this paper, we propose a task regularized hybrid knowledge distillation method for the COD task. First, we propose an image-level hybrid knowledge representation by combining instance-level hard and soft knowledge to use teacher knowledge critically. Second, we propose a task-based regularization distillation loss by taking into account loss and category differences to make continual learning more balanced between old and new tasks. We find that, under appropriate knowledge selection and transfer strategies, using only classification distillation can also relieve knowledge forgetting effectively. Extensive experiments conducted on MS COCO2017 demonstrate that our method achieves state-of-the-art results under various scenarios. We get an absolute improvement of 27.98 at $RelGap$ under the most difficult five-task scenario. Code is in the attachment and will be available on GitHub.","Knowledge Distillation, Continual Learning, Continual Object Detection, Class Incremental Object Detection" Efficient Model Updates for Approximate Unlearning of Graph-Structured Data,https://openreview.net/forum?id=fhcu4FBLciL,https://openreview.net/pdf?id=fhcu4FBLciL,We study the approximate unlearning problem for graph-structured data with theoretical guarantees.,"With the adoption of recent laws ensuring the ``right to be forgotten'', the problem of machine unlearning has become of significant importance. This is particularly the case for graph-structured data, and learning tools specialized for such data, including graph neural networks (GNNs). This work introduces the first known approach for \emph{approximate graph unlearning} with provable theoretical guarantees. The challenges in addressing the problem are two-fold. First, there exist multiple different types of unlearning requests that need to be considered, including node feature, edge and node unlearning. Second, to establish provable performance guarantees, one needs to carefully evaluate the process of feature mixing during propagation. We focus on analyzing Simple Graph Convolutions (SGC) and their generalized PageRank (GPR) extensions, thereby laying the theoretical foundations for unlearning GNNs. Empirical evaluations of six benchmark datasets demonstrate excellent performance/complexity/privacy trade-offs of our approach compared to complete retraining and general methods that do not leverage graph information. For example, unlearning $200$ out of $1208$ training nodes of the Cora dataset only leads to a $0.1\%$ loss in test accuracy, but offers a $4$-fold speed-up compared to complete retraining with a $(\epsilon,\delta)=(1,10^{-4})$ ``privacy cost''. 
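The "feature mixing during propagation" that the unlearning analysis above has to track is exactly the fixed propagation of Simple Graph Convolution. A minimal sketch of that propagation, assuming NumPy and a dense adjacency matrix for clarity:

```python
import numpy as np

def sgc_propagate(adj: np.ndarray, x: np.ndarray, k: int = 2) -> np.ndarray:
    """Feature mixing in SGC: X' = S^k X, with S the symmetrically
    normalized adjacency (self-loops added). Removing one node or edge
    perturbs every entry of S^k it touched, which is what an approximate
    graph-unlearning bound must account for.
    """
    a_hat = adj + np.eye(adj.shape[0])             # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    s = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    for _ in range(k):
        x = s @ x                                  # one hop of mixing per step
    return x
```

Because SGC's propagation has no trainable weights, the unlearning update only needs to correct the linear classifier on top of the precomputed `S^k X`, which is what makes the efficient model updates possible.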
We also exhibit a $12\%$ increase in test accuracy for the same dataset when compared to unlearning methods that do not leverage graph information, with comparable time complexity and the same privacy guarantee.","machine unlearning, graph unlearning, privacy" Risk-aware Bayesian RL for Cautious Exploration,https://openreview.net/forum?id=1tfGKiwnJRJ,https://openreview.net/pdf?id=1tfGKiwnJRJ,,"This paper addresses the problem of maintaining safety during training in Reinforcement Learning (RL), such that the safety constraint violations are bounded at any point during learning. Whilst enforcing safety during training might limit the agent's exploration, we propose a new architecture that handles the trade-off between efficient progress in exploration and safety maintenance. As the agent's exploration progresses, we update Dirichlet-Categorical models of the transition probabilities of the Markov decision process that describes the agent's behavior within the environment by means of Bayesian inference. We then propose a way to approximate moments of the agent's belief about the risk associated with the agent's behavior originating from local action selection. We demonstrate that this approach can be easily coupled with RL, we provide rigorous theoretical guarantees, and we present experimental results to showcase the performance of the overall architecture.","Reinforcement learning, Bayesian inference, Safe learning, Risk, Safety Specification" Change Detection for bi-temporal images classification based on Siamese Variational AutoEncoder and Transfer Learning,https://openreview.net/forum?id=EHi_B2stiNs,https://openreview.net/pdf?id=EHi_B2stiNs,,"Siamese structures empower Deep Learning (DL) models to increase their efficiency by learning how to extract the relevant temporal features from the input data. In this paper, a Siamese Variational Auto-Encoder (VAE) model based on transfer learning (TL) is applied for change detection (CD) using bi-temporal images. The introduced method is trained with a supervised strategy for classification tasks. Firstly, the suggested generative method utilizes two VAEs to extract features from bi-temporal images. Subsequently, it concatenates them into a feature vector. To get a classification map of the source scene, the classifier receives this vector and the ground truth data as input. The source model is fine-tuned to be applied to the target scene with less ground truth data using a TL strategy. Experiments were carried out in two study areas in the arid regions of southern Tunisia. The obtained results reveal that the proposed method outperformed the Siamese Convolution Neural Network (SCNN) by achieving an accuracy of more than 98% in the source scene, and increased the accuracy in the target scene by 1.25% by applying the TL strategy.","Feature extraction, Variational Auto-Encoder, Change Detection, Siamese structure, Transfer Learning, Desertification" G-CEALS: Gaussian Cluster Embedding in Autoencoder Latent Space for Tabular Data Representation,https://openreview.net/forum?id=q_u6UVhenn7,https://openreview.net/pdf?id=q_u6UVhenn7,This paper proposes an unsupervised learning method to improve embedding clustering of tabular data,"The latent representation in an autoencoder achieves dimensionality reduction via self-supervised data reconstruction learning. 
The quality of latent representations has been improved for images by jointly learning a t-distributed embedding with clustering inspired by the neighborhood embedding concept proposed for data visualization. In this paper, we discuss the objectives of clustering and data visualization to present a novel Gaussian Cluster Embedding in Autoencoder Latent Space (G-CEALS) by replacing t-distributions with Gaussian clusters. Unlike current methods, the proposed method defines the Gaussian embedding and the target cluster distribution independently to accommodate any clustering algorithm in representation learning. The proposed G-CEALS method outperforms six baseline clustering and cluster embedding methods on five out of seven tabular data sets and is on par with a cluster embedding method on the sixth data set. In general, G-CEALS outperforms all six methods for clustering tabular data when the data dimensionality is greater than ten. Realizing the superior performance of traditional machine learning with tabular data over deep learning, this paper shows one of the first joint representation learning and clustering methods to improve the clustering of tabular data.","embedding clustering, tabular data, Gaussian clusters, autoencoder, representation learning" Learning Top-k Classification with Label Ranking,https://openreview.net/forum?id=6sQr2-BlARv,https://openreview.net/pdf?id=6sQr2-BlARv,,"Class confusability and the multi-label nature of examples inevitably arise in classification tasks as the number of classes increases, which poses a huge challenge to classification. To mitigate this problem, top-$k$ classification is proposed, where the classifier is allowed to predict $k$ label candidates and the prediction result is considered correct as long as the ground truth label is included in the $k$ labels. However, existing top-$k$ classification methods neglect the ranking of the ground truth label among the predicted $k$ labels, which has high application value. In this paper, we propose a novel three-stage approach to learn top-$k$ classification with label ranking. We first propose an ensemble-based relabeling method and relabel the training data with $k$ labels, which is used to train the top-$k$ classifier. We then propose a novel top-$k$ classification loss function that aims to improve the ranking of the ground truth label. Finally, we have conducted extensive experiments on four text datasets and four image datasets, and the experimental results show that our method can significantly improve the performance of existing methods.", QUANTIZATION AWARE FACTORIZATION FOR DEEP NEURAL NETWORK COMPRESSION,https://openreview.net/forum?id=b1g6e7enW5o,https://openreview.net/pdf?id=b1g6e7enW5o,We propose a novel approach to neural network compression that performs tensor factorization and quantization simultaneously.,"Tensor approximation of convolutional and fully-connected weights is an effective way to reduce parameters and FLOPs in neural networks. Due to memory and power consumption limitations of mobile or embedded devices, the quantization step is usually necessary when pre-trained models are deployed. A conventional post-training quantization approach applied to networks with decomposed weights yields a drop in accuracy. Therefore, our goal is to develop an algorithm that finds tensor approximation directly with quantized factors and thus benefits from both compression techniques while keeping the prediction quality of the model. 
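The top-$k$ objective described above can be made concrete with a margin-style loss that pushes the ground-truth logit above the $(k{+}1)$-th largest one, so the true label both lands inside and ranks high within the $k$ candidates. An illustrative sketch in PyTorch; this is our stand-in, not the paper's exact loss function:

```python
import torch
import torch.nn.functional as F

def topk_ranking_loss(logits: torch.Tensor, target: torch.Tensor,
                      k: int = 5, margin: float = 1.0) -> torch.Tensor:
    """Penalize batches where the true logit does not exceed the
    (k+1)-th largest logit by at least `margin`.

    logits: (batch, num_classes); target: (batch,) class indices.
    """
    true_logit = logits.gather(1, target.unsqueeze(1)).squeeze(1)
    kth_plus_one = logits.topk(k + 1, dim=1).values[:, -1]  # (k+1)-th largest
    return F.relu(margin + kth_plus_one - true_logit).mean()
```

Minimizing this hinge drives the ground truth into the top-$k$ set; shrinking `k` or raising `margin` additionally pressures it toward higher ranks within the set.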
Namely, we propose to use the Alternating Direction Method of Multipliers (ADMM) to approximate a float tensor by a tensor in Canonical Polyadic (CP) format whose factors are close to their quantized versions. This leads to lower approximation error after quantization and a smaller quality drop in model predictions while maintaining the same compression rate. ","tensor methods, compression, quantization" Populating memory in Continual Learning with Consistency Aware Sampling,https://openreview.net/forum?id=AFhaaOZTkKA,https://openreview.net/pdf?id=AFhaaOZTkKA,,"Continual Learning (CL) methods aim to mitigate Catastrophic Forgetting (CF), where knowledge from previously learned tasks is often lost in favor of the new one. Among those algorithms, some have shown the relevance of keeping a rehearsal buffer with previously seen examples, referred to as $memory$. Yet, despite their popularity, limited research has been done to understand which elements are more beneficial to store in memory. It is common for this memory to be populated through random sampling, with few guiding principles that may aid in retaining prior knowledge. In this paper, and consistent with previous work, we found that some storage policies behave similarly given a certain memory size or compute budget, but when these constraints are relevant, results differ considerably. Based on these insights, we propose CAWS (Consistency AWare Sampling), an original storage policy that leverages a learning consistency score (C-Score) to populate the memory with elements that are $easy$ $to$ $learn$ and $representative$ of previous tasks. Because of the impracticality of directly using the C-Score in CL, we propose more feasible and efficient proxies to calculate the score that yield state-of-the-art results on CIFAR-100 and Tiny ImageNet.","Continual Learning, Learning Consistency, Populating Memory, Memory-based Continual Learning" Fairness of Federated Learning with Dynamic Participants,https://openreview.net/forum?id=ZRkHGPMY3dd,https://openreview.net/pdf?id=ZRkHGPMY3dd,,"The concept of fairness has attracted wide attention in Federated Learning (FL). While there have been numerous studies on various notions of fairness in FL in recent years, all of them only consider the case where the training process starts and ends at the same time points for all participants. Actually, participants could be dynamic and they may join and leave the training process at different time points. However, the fact that participants who join the training process at different time points receive similar incentive benefits can be seen as a signal of unfairness. In this paper, we provide the first study on such fairness of FL for dynamic participants. First, we propose a new mathematical definition of the above fairness, namely $\textit{dynamic fairness}$. Briefly speaking, an algorithm is dynamically fair if it satisfies that local agents who participate in the model training longer receive more benefits than those who participate for a shorter time. Second, we develop a simple but novel method, which could be seen as a normalized version of $\textit{Fedavg}$, and theoretically show that it is fairer than $\textit{Fedavg}$. Moreover, we can combine our method with the previous methods in fair FL for static participants to additionally guarantee fair treatment for local agents who join the training process at the same time point by minimizing the discrepancy of benefits they receive. 
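The ADMM splitting behind the quantization-aware factorization above can be shown in miniature: split the variable into a continuous copy and a quantized copy, and alternate between a closed-form proximal step and a projection onto the quantized grid. A sketch, assuming NumPy; the paper applies this inside CP factor updates, whereas we show it on a single array, which is our simplification:

```python
import numpy as np

def uniform_quantize(x: np.ndarray, levels: int = 256) -> np.ndarray:
    """Round onto a uniform grid spanning the array's value range."""
    lo, hi = x.min(), x.max()
    scale = max((hi - lo) / (levels - 1), 1e-12)
    return lo + np.round((x - lo) / scale) * scale

def admm_quantized_fit(target: np.ndarray, rho: float = 1.0,
                       iters: int = 50) -> np.ndarray:
    """ADMM for: min ||target - w||^2  s.t.  w lies on a quantized grid."""
    w = target.copy()
    u = np.zeros_like(target)                       # scaled dual variable
    q = uniform_quantize(w)
    for _ in range(iters):
        q = uniform_quantize(w + u)                 # projection onto quantized set
        w = (target + rho * (q - u)) / (1.0 + rho)  # closed-form proximal step
        u = u + w - q                               # dual ascent
    return q
```

The dual variable accumulates the disagreement between the continuous and quantized copies, steering the factorization toward points where quantization costs little accuracy.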
Finally, we empirically propose a measure for $\textit{dynamic fairness}$ and demonstrate, through extensive experiments on three benchmark datasets, that our method achieves fairer performance under our definition of fairness. ","Federated Learning, Dynamic Fairness, Benefit, Normalized SGD" A Unified Algebraic Perspective on Lipschitz Neural Networks,https://openreview.net/forum?id=k71IGLC8cfc,https://openreview.net/pdf?id=k71IGLC8cfc,"We present a novel algebraic perspective unifying various types of 1-Lipschitz neural networks, and show that AOL and CPL can be re-derived and generalized using exactly the same semidefinite programming (SDP) condition.","Important research efforts have focused on the design and training of neural networks with a controlled Lipschitz constant. The goal is to increase and sometimes guarantee the robustness against adversarial attacks. Recent promising techniques draw inspiration from different backgrounds to design 1-Lipschitz neural networks, just to name a few: convex potential layers derive from the discretization of continuous dynamical systems, Almost-Orthogonal-Layer proposes a tailored method for matrix rescaling. However, it is now important to consider the recent and promising contributions in the field under a common theoretical lens to better design new and improved layers. This paper introduces a novel algebraic perspective unifying various types of 1-Lipschitz neural networks, including the ones previously mentioned, along with methods based on orthogonality and spectral methods. Interestingly, we show that many existing techniques can be derived and generalized via finding analytical solutions of a common semidefinite programming (SDP) condition. We also prove that AOL biases the scaled weights toward those close to the set of orthogonal matrices in a certain mathematical sense. Moreover, our algebraic condition, combined with the Gershgorin circle theorem, readily leads to new and diverse parameterizations for 1-Lipschitz network layers. Our approach, called SDP-based Lipschitz Layers (SLL), allows us to design non-trivial yet efficient generalization of convex potential layers. Finally, the comprehensive set of experiments on image classification shows that SLLs outperform previous approaches on natural and certified accuracy.","Deep Learning, Lipschitz neural networks, Robustness" AudioGen: Textually Guided Audio Generation,https://openreview.net/forum?id=CYK7RfcOzQ4,https://openreview.net/pdf?id=CYK7RfcOzQ4,We propose a text-to-audio generation model,"In this work, we tackle the problem of generating audio samples conditioned on descriptive text captions. We propose AudioGen, an auto-regressive generative model, operating on a learnt discrete audio representation, that generates audio samples conditioned on text inputs. The task of text-to-audio generation poses multiple challenges. Due to the way audio travels through a medium, differentiating ``objects'' can be a difficult task (e.g., separating multiple people simultaneously speaking). This is further complicated by real-world recording conditions (e.g., background noise, reverberation, etc.). Scarce text annotations impose another constraint, limiting the ability to scale models. Finally, modeling high-fidelity audio requires one to operate over extremely long sequences. To alleviate the aforementioned challenges, we propose an augmentation technique that mixes different audio samples, driving the model to internally learn to separate multiple sources. 
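The mixing augmentation just described amounts to overlaying two clips and joining their captions. A minimal sketch, assuming NumPy waveforms; the SNR convention and the caption template are our assumptions, not the paper's exact recipe:

```python
import numpy as np

def mix_clips(wav_a, caption_a, wav_b, caption_b, snr_db: float = 0.0):
    """Overlay two clips at a target SNR and join their captions, so the
    model sees co-occurring sources paired with a compound description.
    """
    n = min(len(wav_a), len(wav_b))
    a = np.asarray(wav_a[:n], dtype=float)
    b = np.asarray(wav_b[:n], dtype=float)
    p_a, p_b = (a ** 2).mean(), (b ** 2).mean() + 1e-12
    gain = np.sqrt(p_a / (p_b * 10.0 ** (snr_db / 10.0)))  # scale b to the target SNR
    return a + gain * b, f"{caption_a} and {caption_b}"
```

Training on such mixtures forces the model to associate each source in the signal with its clause in the caption, i.e., to learn an implicit separation.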
We curated 10 datasets containing different types of audio and text annotations to handle the scarcity of text-audio data points. For faster inference, we explore the use of multi-stream modeling, allowing the use of shorter sequences while maintaining a similar bitrate and perceptual quality. Finally, we apply classifier-free guidance to improve adherence to text. Compared to the evaluated baselines, AudioGen performs better on both objective and subjective metrics. We further conduct an ablation study to gauge the effects of pre-trained text and audio components.","text-to-audio, audio generation" Faster Reinforcement Learning with Value Target Lower Bounding,https://openreview.net/forum?id=WWYHBZ1wWzp,https://openreview.net/pdf?id=WWYHBZ1wWzp,"Lower bounding the Bellman value target turns out to be simple, efficient and effective in improving RL efficiency and RL performance both in theory and practice.","We show that an arbitrary lower bound of the maximum achievable value can be used to improve the Bellman value target during value learning. In the tabular case, value learning using the lower bounded Bellman operator converges to the same optimal value as using the original Bellman operator, at a potentially faster speed. In practice, discounted episodic return in episodic tasks and n-step bootstrapped return in continuing tasks can serve as lower bounds to improve the value target. We experiment on Atari games, FetchEnv tasks and a challenging physically simulated car push and reach task. We see large gains in sample efficiency as well as converged performance over common baselines such as TD3, SAC and Hindsight Experience Replay (HER) in most tasks, and observe a reliable and competitive performance against the stronger n-step methods such as td-lambda, Retrace and optimality tightening. Prior works have already successfully applied a special case of lower bounding (using episodic return), but are limited to a small number of episodic tasks. To the best of our knowledge, we are the first to propose the general method of value target lower bounding (with possibly bootstrapped return), to demonstrate its optimality in theory, and its effectiveness in a wide range of tasks over many strong baselines.","Reinforcement Learning, Bellman equation, value improvement, sample efficiency" Hebbian and Gradient-based Plasticity Enables Robust Memory and Rapid Learning in RNNs,https://openreview.net/forum?id=2WklawyeI08,https://openreview.net/pdf?id=2WklawyeI08,,"Rapidly learning from ongoing experiences and remembering past events with a flexible memory system are two core capacities of biological intelligence. While the underlying neural mechanisms are not fully understood, various lines of evidence support that synaptic plasticity plays a critical role in memory formation and fast learning. Inspired by these results, we equip Recurrent Neural Networks (RNNs) with plasticity rules to enable them to adapt their parameters according to ongoing experiences. In addition to the traditional local Hebbian plasticity, we propose a global, gradient-based plasticity rule, which allows the model to evolve towards its self-determined target. Our models show promising results on sequential and associative memory tasks, illustrating their ability to robustly form and retain memories. In the meantime, these models can cope with many challenging few-shot learning problems. 
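The value-target lower bounding described above is a one-line change to the usual TD target. A minimal sketch in plain Python, with the lower bound supplied externally (e.g., the observed discounted episodic return or an n-step bootstrapped return from the replayed trajectory):

```python
def lower_bounded_target(reward: float, next_value: float, lower_bound: float,
                         gamma: float = 0.99, done: bool = False) -> float:
    """Clip the one-step Bellman target from below.

    Any valid lower bound on the maximum achievable value is admissible;
    in the tabular case the clipped operator converges to the same
    optimum as the original Bellman operator.
    """
    target = reward + (0.0 if done else gamma * next_value)
    return max(target, lower_bound)
```

When the critic underestimates, the bound lifts the target toward the achievable value; when the critic is already above the bound, the update is unchanged, which is why optimality is preserved.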
Comparing different plasticity rules under the same framework shows that Hebbian plasticity is well-suited for several memory and associative learning tasks; however, it is outperformed by gradient-based plasticity on few-shot regression tasks which require the model to infer the underlying mapping.","synaptic plasticity, meta-learning, Hebbian learning, few-shot learning, recurrent neural networks" Towards Minimax Optimal Reward-free Reinforcement Learning in Linear MDPs,https://openreview.net/forum?id=U9HW6vyNClg,https://openreview.net/pdf?id=U9HW6vyNClg,We propose a computationally-efficient algorithm for reward-free exploration in linear MDPs reaching the minimax optimal sample complexity up to $H$ and logarithmic factors.,"We study reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs). In this setting, an agent first interacts with the environment without accessing the reward function in the exploration phase. In the subsequent planning phase, it is given a reward function and asked to output an $\epsilon$-optimal policy. We propose a novel algorithm LSVI-RFE under the linear MDP setting, where the transition probability and reward functions are linear in a feature mapping. We prove an $\widetilde{O}(H^{4} d^{2}/\epsilon^2)$ sample complexity upper bound for LSVI-RFE, where $H$ is the episode length and $d$ is the feature dimension. We also establish a sample complexity lower bound of $\Omega(H^{3} d^{2}/\epsilon^2)$. To the best of our knowledge, LSVI-RFE is the first computationally efficient algorithm that achieves the minimax optimal sample complexity in linear MDP settings up to $H$ and logarithmic factors. Our LSVI-RFE algorithm is based on a novel variance-aware exploration mechanism to avoid overly-conservative exploration in prior works. Our sharp bound relies on the decoupling of UCB bonuses during two phases, and a Bernstein-type self-normalized bound, which remove the extra dependency of sample complexity on $H$ and $d$, respectively.","Reinforcement Learning, Reward-Free Exploration, Linear MDPs, Learning Theory" PromptSum: Planning with Mixed Prompts for Parameter-Efficient Controllable Abstractive Summarization,https://openreview.net/forum?id=FEBCwrGzR3j,https://openreview.net/pdf?id=FEBCwrGzR3j,"A new prompting mechanism which enables controllable, parameter-efficient and data-efficient summarization. ","Prompt tuning (PT), a technique that only tunes the additional prompt embeddings while keeping the backbone pretrained language model frozen, has shown promising results in language understanding tasks, especially in low-resource scenarios. However, effective prompt design methods are still lacking for generation tasks such as summarization. At the same time, summarization guided through instructions (discrete prompts) can achieve a desirable double objective of higher quality and controllability in summary generation. Towards a triple goal of data-efficiency, parameter-efficiency and controllability, we introduce PromptSum, a method combining PT with a multi-task objective and discrete entity prompts for abstractive summarization. 
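The prompt-tuning mechanism that PromptSum builds on is easy to state in code: a handful of learned embeddings are prepended to the input sequence while every backbone parameter stays frozen. A minimal sketch, assuming PyTorch; shapes and the embedding interface are illustrative:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable prompt embeddings prepended to frozen token embeddings.

    Only `self.prompt` receives gradients; the pretrained backbone that
    consumes the concatenated sequence is kept frozen, which is what
    makes the approach parameter-efficient.
    """

    def __init__(self, num_tokens: int, embed_dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_tokens, embed_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, embed_dim)
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)
```

PromptSum additionally mixes such continuous prompts with discrete entity prompts, so the entity chain doubles as a control signal at inference time.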
Our model achieves state-of-the-art results on several popular few-shot benchmarks as well as a strong level of controllability through entities, all while only tuning several orders of magnitude fewer parameters.","summarization, controllability, parameter-efficiency, prompt-tuning, pre-training, multi-tasking" Context and History Aware Other-Shaping,https://openreview.net/forum?id=54F8woU8vhq,https://openreview.net/pdf?id=54F8woU8vhq,A scalable shaping algorithm which can be used in complex environments.,"Cooperation failures, in which self-interested agents converge to collectively worst-case outcomes, are a common failure mode of Multi-Agent Reinforcement Learning (MARL) methods. Methods such as Model-Free Opponent Shaping (MFOS) and The Good Shepherd address this issue by shaping their co-player’s learning into mutual cooperation. However, these methods fail to capture important co-player learning dynamics or do not scale to co-players parameterised by deep neural networks. To address these issues, we propose Context and History Aware Other-Shaping (CHAOS). A CHAOS agent is a meta-learner parameterised by a recurrent neural network that learns to shape its co-player over multiple trials. CHAOS considers both the context (inter-episode information), and history (intra-episode information) to shape co-players successfully. CHAOS also successfully scales to shaping co-players parameterised by deep neural networks. In a set of experiments, we show that CHAOS achieves state-of-the-art shaping in matrix games. We provide extensive ablations, motivating the importance of both context and history. CHAOS also successfully shapes on a complex grid-world-based game, demonstrating CHAOS’s scalability empirically. Finally, we provide empirical evidence that, counterintuitively, the widely-used Coin Game environment does not require history to learn shaping because states are often indicative of past actions. This suggests that the Coin Game is, in contrast to common understanding, unsuitable for investigating shaping in high-dimensional, multi-step environments.","Shaping, Multi-Agent, Reinforcement Learning, Meta Reinforcement Learning" ReD-GCN: Revisit the Depth of Graph Convolutional Network,https://openreview.net/forum?id=tMg5hKRiW-2,https://openreview.net/pdf?id=tMg5hKRiW-2,Extend the depth of GCN from the positive integer domain ($\mathbb{N}+$) to the real number domain ($\mathbb{R}$). A novel problem of automatic GCN depth tuning for graph homophily/heterophily detection is formulated. ,"Finding the proper depth $d$ of a GNN that provides strong representation power has drawn significant attention, yet it nonetheless largely remains an open problem for the graph learning community. Although noteworthy progress has been made, the depth or the number of layers of a corresponding GCN is realized by a series of graph convolution operations, which naturally makes $d$ a positive integer ($d \in \mathbb{N}+$). An interesting question is whether breaking the constraint of $\mathbb{N}+$ by making $d$ a real number ($d \in \mathbb{R}$) can bring new insights into graph learning mechanisms. In this work, by redefining GCN's depth $d$ as a trainable parameter continuously adjustable within $(-\infty,+\infty)$, we open a new door to controlling its expressiveness in graph signal processing to model graph homophily/heterophily (nodes with similar/dissimilar labels/attributes tend to inter-connect). 
A simple and powerful GCN model, ReD-GCN, is proposed that retains the simplicity of GCN while automatically searching for the optimal $d$ without prior knowledge of whether the input graph is homophilic or heterophilic. Negative-valued $d$ intrinsically enables high-pass frequency filtering functionality for graph heterophily. Variants extending the model flexibility/scalability are also developed. The theoretical feasibility of having a real-valued depth with explainable physical meanings is ensured via eigen-decomposition of the graph Laplacian and a properly designed transformation function from the perspective of functional calculus. Extensive experiments demonstrate the superiority of ReD-GCN on node classification tasks for a variety of graphs. Furthermore, by introducing the concept of eigengraph, a novel graph augmentation method is obtained: the optimal $d$ effectively generates a new topology through a properly weighted combination of eigengraphs, which dramatically boosts the performance even for a vanilla GCN.","graph convolutional network, the depth of graph convolutional network" The Influence of Learning Rule on Representation Dynamics in Wide Neural Networks,https://openreview.net/forum?id=nZ2NtpolC5-,https://openreview.net/pdf?id=nZ2NtpolC5-,A theoretical analysis of deep networks and their representations when trained with a variety of learning rules.,"It is unclear how changing the learning rule of a deep neural network alters its learning dynamics and representations. To gain insight into the relationship between learned features, function approximation, and the learning rule, we analyze infinite-width deep networks trained with gradient descent (GD) and biologically-plausible alternatives including feedback alignment (FA), direct feedback alignment (DFA), and error modulated Hebbian learning (Hebb), as well as gated linear networks (GLN). We show that, for each of these learning rules, the evolution of the output function at infinite width is governed by a time varying effective neural tangent kernel (eNTK). In the lazy training limit, this eNTK is static and does not evolve, while in the rich mean-field regime this kernel's evolution can be determined self-consistently with dynamical mean field theory (DMFT). This DMFT enables comparisons of the feature and prediction dynamics induced by each of these learning rules. In the lazy limit, we find that DFA and Hebb can only learn using the last layer features, while full FA can utilize earlier layers with a scale determined by the initial correlation between feedforward and feedback weight matrices. In the rich regime, DFA and FA utilize a temporally evolving and depth-dependent NTK. Counterintuitively, we find that FA networks trained in the rich regime exhibit more feature learning if initialized with smaller correlation between the forward and backward pass weights. GLNs admit a very simple formula for their lazy limit kernel and preserve conditional Gaussianity of their preactivations under gating functions. Error modulated Hebb rules show very small task-relevant alignment of their kernels and perform most task relevant learning in the last layer.","Deep Learning Theory, Learning Rules, Representations" Multiple Modes for Continual Learning,https://openreview.net/forum?id=7sWLxZBLPO5,https://openreview.net/pdf?id=7sWLxZBLPO5,,"Adapting model parameters to incoming streams of data is a crucial factor in deep learning scalability. 
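The real-valued depth in ReD-GCN above rests on functional calculus: with an eigendecomposition of the normalized graph operator, $S^d = U\,\mathrm{diag}(g(\lambda)^d)\,U^\top$ is well defined for any real $d$. A sketch, assuming NumPy; the eigenvalue shift `g` below is our simplification of the paper's transformation function, introduced only to keep $g(\lambda)^d$ real:

```python
import numpy as np

def real_depth_propagation(adj: np.ndarray, x: np.ndarray, d: float) -> np.ndarray:
    """Propagate features with a real-valued (possibly negative) depth d.

    Positive d behaves like low-pass smoothing (homophily); negative d
    amplifies high-frequency components (heterophily).
    """
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    s = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    lam, u = np.linalg.eigh(s)                 # eigenvalues lie in [-1, 1]
    g = np.clip((lam + 1.0) / 2.0, 1e-6, 1.0)  # shift so g(lam)**d stays real
    return u @ np.diag(g ** d) @ u.T @ x
```

Each rank-one term $u_i u_i^\top$ is an eigengraph; the weights $g(\lambda_i)^d$ are exactly the "properly weighted combination of eigengraphs" that the learned depth turns into a new topology.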
Interestingly, prior continual learning strategies in online settings inadvertently anchor their updated parameters to a local parameter subspace to remember old tasks, or else drift away from the subspace and forget. From this observation, we formulate a trade-off between constructing multiple parameter modes and allocating tasks per mode. Mode-Optimized Task Allocation (MOTA), our contributed adaptation strategy, trains multiple modes in parallel, then optimizes task allocation per mode. We empirically demonstrate improvements over baseline continual learning strategies and across varying distribution shifts, namely sub-population, domain, and task shift.", A Theory of Equivalence-Preserving Program Embeddings,https://openreview.net/forum?id=69MODRAL5u8,https://openreview.net/pdf?id=69MODRAL5u8,We develop a theory of program embeddings that preserve semantic equivalence and show when they are tractable to compute,"Program embeddings are used to solve tasks such as \textit{code clone detection} and \textit{semantic labeling}. Solutions to these \textit{semantic tasks} should be invariant to semantics-preserving program transformations. When a program embedding function satisfies this invariance, we call it an \textit{equivalence-preserving program embedding function}. We say a programming language can be \textit{tractably embedded} when we can construct an equivalence-preserving program embedding function that executes in polynomial time in program/input length and produces program embeddings that are proportional to the input length. Determining whether a programming language can be tractably embedded is the \textit{equivalence-preserving program embedding problem}. We formalize this problem and theoretically characterize when programming languages can be tractably embedded. To validate our theoretical results, we use the BERT-Tiny model to learn an equivalence-preserving program embedding function for a programming language that can be tractably embedded and show the model fails to construct an equivalence-preserving program embedding function for a similar language that is intractable to embed. ","Programming Languages, Program Embeddings, Code, Big Code" Multimodal Open-Vocabulary Video Classification via Vision and Language Models,https://openreview.net/forum?id=u2e4grt3aKm,https://openreview.net/pdf?id=u2e4grt3aKm,We propose a method for open-vocabulary video classification leveraging pre-trained vision and language models and multimodal signals like optical flow and audio to improve the performance.,"Utilizing vision and language models (VLMs) pre-trained on internet-scale image-text pairs is becoming a promising paradigm for open-vocabulary vision tasks. This work conducts an extensive study of multimodal open-vocabulary video classification via pre-trained VLMs by leveraging motion and audio that naturally exist in the video. We design an asymmetrical cross-modal fusion mechanism to aggregate multimodal information differently for video and optical flow / audio. Experiments on Kinetics and VGGSound show that introducing more modalities significantly improves the accuracy on seen classes, while enabling better generalization to unseen classes than existing approaches. Despite its simplicity, our method achieves state-of-the-art results on UCF and HMDB zero-shot video action recognition benchmarks, significantly outperforming traditional zero-shot techniques, video-text pre-training methods and recent VLM-based approaches. 
Code and models will be released.","open-vocabulary, multimodal, video, optical flow, audio" On the Data-Efficiency with Contrastive Image Transformation in Reinforcement Learning,https://openreview.net/forum?id=-nm-rHXi5ga,https://openreview.net/pdf?id=-nm-rHXi5ga,CoIT is a learnable image transformation for sample-efficiency improvement.,"Data-efficiency has always been an essential issue in pixel-based reinforcement learning (RL), as the agent learns not only decision-making but also meaningful representations from images. The line of reinforcement learning with data augmentation shows significant improvements in sample-efficiency. However, it is challenging to guarantee an optimality-invariant transformation; that is, the augmented data may be readily recognized as a completely different state by the agent. To this end, we propose a contrastive invariant transformation (CoIT), a simple yet promising learnable data augmentation combined with standard model-free algorithms to improve sample-efficiency. Concretely, the differentiable CoIT leverages original samples together with augmented samples and drives the state encoder toward a contrastive invariant embedding. We evaluate our approach on DeepMind Control Suite and Atari100K. Empirical results verify advances using CoIT, enabling it to outperform the state-of-the-art on various tasks. Source code is available at https://github.com/Kamituna/CoIT.","Reinforcement Learning, Data Augmentation, Self-Supervised Learning, Representation Learning" Energy-based Out-of-Distribution Detection for Graph Neural Networks,https://openreview.net/forum?id=zoz7Ze4STUL,https://openreview.net/pdf?id=zoz7Ze4STUL,We propose an energy-based model as a provably effective OOD discriminator from a GNN classifier trained in semi-supervised learning on graphs,"Representation learning on semi-structured data, e.g., graphs, has become a central problem in the deep learning community as relational structures are pervasive in real situations and induce data inter-dependence that hinders trivial adaptation of existing approaches in other domains where the inputs are assumed to be i.i.d. sampled. However, current models in this regime mostly focus on improving testing performance of in-distribution data and largely ignore the potential risk w.r.t. out-of-distribution (OOD) testing samples that may cause negative outcomes if the model is overconfident in prediction on them. In this paper, we identify a provably effective OOD discriminator based on an energy function directly extracted from a graph neural network trained with standard supervised classification loss. This paves the way for a simple and efficient OOD detection model for GNN-based semi-supervised learning on graphs, which we call GNN-Safe. It also has nice theoretical properties that guarantee an overall distinguishable margin between the detection scores for in-distribution and OOD samples, which, more critically, can be further strengthened by a non-learning-based structured propagation scheme. 
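The energy function extracted from a trained classifier, and the structured propagation that strengthens it, both fit in a few lines. A sketch, assuming PyTorch; the propagation coefficients are illustrative, not GNN-Safe's exact scheme:

```python
import torch

def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Energy-based OOD score from classifier logits:
    E(x) = -T * logsumexp(f(x) / T). Lower energy indicates
    in-distribution; no extra OOD training data is needed.
    """
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

def propagate_scores(adj_norm: torch.Tensor, scores: torch.Tensor,
                     k: int = 2, alpha: float = 0.5) -> torch.Tensor:
    """Non-learning-based structured propagation (sketch): smooth each
    node's score with its neighbors' to widen the ID/OOD margin, since
    connected nodes tend to share distributional status.
    """
    for _ in range(k):
        scores = alpha * scores + (1 - alpha) * adj_norm @ scores
    return scores
```

Thresholding the propagated score then yields the OOD decision, leaving the classifier itself untouched.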
Extensive experiments over five real-world datasets validate the practical efficacy of the proposed model for detecting various OOD instances that are inter-connected in a graph, with up to 17.0% improvement in average AUROC over competitive peer models and without sacrificing in-distribution testing accuracy.","graph neural networks, out-of-distribution detection, semi-supervised node classification, energy model" Theoretical Characterization of Neural Network Generalization with Group Imbalance,https://openreview.net/forum?id=ig4E0Y11pX,https://openreview.net/pdf?id=ig4E0Y11pX,A theoretical characterization of generalization and sample complexity of training neural networks with group imbalance,"Group imbalance has been a known problem in empirical risk minimization (ERM), where the achieved high \textit{average} accuracy could be accompanied by low accuracy in a \textit{minority} group. Despite various algorithmic efforts to improve the minority group accuracy, a theoretical study of the generalization performance of ERM on individual groups remains elusive. By formulating the group imbalance problem with the Gaussian Mixture Model, this paper quantifies the impact of individual groups on the sample complexity, the convergence rate, and the average and group-level testing performance. Although our theoretical framework is centered on binary classification using a one-hidden-layer neural network, to the best of our knowledge, we provide the first theoretical analysis of the group-level generalization of ERM in addition to the commonly studied average generalization performance. Sample insights from our theoretical results include that when all group-level co-variances are in the medium regime and all means are close to zero, the learning performance is most desirable in the sense of a small sample complexity, a fast training rate, and a high average and group-level testing accuracy. Moreover, we show that increasing the fraction of the minority group in the training data does not necessarily improve the generalization performance of the minority group. Our theoretical results are validated on both synthetic and empirical datasets such as CelebA and CIFAR-10 in image classification.","Group imbalance, Sample complexity, Generalization analysis, Gaussian mixture model, Empirical risk minimization" SDMuse: Stochastic Differential Music Editing and Generation via Hybrid Representation,https://openreview.net/forum?id=XlKwprKzOZ,https://openreview.net/pdf?id=XlKwprKzOZ,,"While deep generative models have empowered music generation, it remains a challenging and under-explored problem to edit an existing musical piece at fine granularity. In this paper, we propose SDMuse, a unified stochastic differential music editing and generation framework, which can not only compose a whole musical piece from scratch, but also modify existing musical pieces in many ways, such as combination, continuation, inpainting, and style transfer. The proposed SDMuse follows a two-stage pipeline to achieve music generation and editing on top of a hybrid representation including pianoroll and MIDI-event. In particular, SDMuse first generates/edits pianoroll by iteratively denoising through a stochastic differential equation (SDE) based on a diffusion model generative prior, and then refines the generated pianoroll and predicts MIDI-event tokens auto-regressively. 
We evaluate the music generated by our method on the ailabs1k7 pop music dataset in terms of quality and controllability on various music editing and generation tasks. Experimental results demonstrate the effectiveness of our proposed stochastic differential music editing and generation process, as well as the hybrid representations. ", Masked Autoencoders Enable Efficient Knowledge Distillers,https://openreview.net/forum?id=7QTldIMkkqX,https://openreview.net/pdf?id=7QTldIMkkqX,"This paper studies the potential of distilling knowledge from self-supervised pre-trained models, especially Masked Autoencoders","This paper studies the potential of distilling knowledge from self-supervised pre-trained models, especially Masked Autoencoders. Our approach is simple: in addition to optimizing the pixel reconstruction loss on masked inputs, we minimize the distance between the intermediate feature map of the teacher model and that of the student model. This design leads to a computationally efficient knowledge distillation framework, given 1) only a small visible subset of patches is used, and 2) the teacher model only needs to forward propagate inputs through the first few layers for obtaining intermediate feature maps. Compared to directly distilling fine-tuned models, distilling pre-trained models substantially improves the performance of downstream representation learning, while incurring little extra pre-training cost. For example, by distilling the knowledge from an MAE pre-trained ViT-L into a ViT-B, our method achieves an 84.0% ImageNet top-1 accuracy, outperforming the baseline of distilling a fine-tuned ViT-L by 1.2%, with no extra training time at all. More interestingly, our method can robustly tackle different masking ratios: e.g., by pushing to the extreme 95% masking ratio where merely TEN patches are visible during distillation, our ViT-B still secures a top-1 accuracy of 83.8%, while further reducing total training time by 13% relative to the distilling-during-fine-tuning baseline. ","Transformer, Pretraining, Knowledge Distillation" Formal Interpretability with Merlin-Arthur Classifiers,https://openreview.net/forum?id=7hvbaJ1AbaM,https://openreview.net/pdf?id=7hvbaJ1AbaM,We introduce a new type of interpretable classifier with theoretical guarantees based on the Merlin-Arthur protocol from Interactive Proof Systems.,"We propose a new type of multi-agent interactive classifier that provides, for the first time, provable interpretability guarantees even for complex agents such as neural networks. In our setting, which is inspired by the Merlin-Arthur protocol from Interactive Proof Systems, two agents cooperate to provide a classification: the prover selects a small set of features as a certificate and presents it to the verifier who decides the class. A second, adversarial prover ensures the truthfulness of the system and allows us to connect the game-theoretic equilibrium between the provers and the verifier to guarantees on the exchanged features. We define completeness and soundness metrics to provide a lower bound on the mutual information between the features and the class. 
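The two-term objective of the MAE distillation recipe above is compact enough to write out. A sketch, assuming PyTorch; `proj` is a hypothetical alignment layer (e.g., a linear map) matching student to teacher feature width, which is our assumption about how dimensions are reconciled:

```python
import torch.nn.functional as F

def mae_distill_loss(student_feat, teacher_feat, pred_pixels, target_pixels,
                     proj, beta: float = 1.0):
    """MAE pixel reconstruction on masked patches plus a feature-map
    distance on the small visible subset.

    Only visible patches flow through both encoders, and the teacher is
    truncated after its early layers, which is where the computational
    savings come from.
    """
    recon = F.mse_loss(pred_pixels, target_pixels)                # masked-patch targets
    distill = F.smooth_l1_loss(proj(student_feat), teacher_feat)  # intermediate features
    return recon + beta * distill
```

Because the teacher stops at an intermediate layer and sees only ~5-25% of patches, the distillation term adds little wall-clock cost on top of ordinary MAE pre-training.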
Our experiments demonstrate good agreement between theory and practice using neural network classifiers, and we show how our setup practically prevents manipulation.","interpretability, explainable ai" Quasi-optimal Learning with Continuous Treatments,https://openreview.net/forum?id=O8Vc52xFSUR,https://openreview.net/pdf?id=O8Vc52xFSUR,The paper proposes a novel learning algorithm for reliable continuous action allocations.,"Many real-world applications of reinforcement learning (RL) require making decisions in continuous action environments. In particular, determining the optimal dose level plays a vital role in developing medical treatment regimes. One challenge in adapting existing RL algorithms to medical applications, however, is that the popular infinite-support stochastic policies, e.g., the Gaussian policy, may assign riskily high dosages and seriously harm patients. Hence, it is important to induce a policy class whose support only contains near-optimal actions, and shrink the action-searching area for effectiveness and reliability. To achieve this, we develop a novel \emph{quasi-optimal learning algorithm}, which can be easily optimized in off-policy settings with guaranteed convergence under general function approximations. Theoretically, we analyze the consistency, sample complexity, adaptability, and convergence of the proposed algorithm. We evaluate our algorithm with comprehensive simulated experiments and a real-world dose suggestion application to the Ohio Type 1 diabetes dataset.","Continuous Treatments, Markov Decision Process, Safe Action Allocation" "Generalization Bounds for Federated Learning: Fast Rates, Unparticipating Clients and Unbounded Losses",https://openreview.net/forum?id=-EHqoysUYLx,https://openreview.net/pdf?id=-EHqoysUYLx,,"In federated learning, the underlying data distributions may be different across clients. This paper provides a theoretical analysis of the generalization error of federated learning, which captures both heterogeneity and relatedness of the distributions. In particular, we assume that the heterogeneous distributions are sampled from a meta-distribution. In this two-level distribution framework, we characterize the generalization error not only for clients participating in the training but also for unparticipating clients. We first show that the generalization error for unparticipating clients can be bounded by the participating generalization error and a participation gap caused by client sampling. We further establish fast learning bounds of order $\mathcal{O}(\frac{1}{mn} + \frac{1}{m})$ for unparticipating clients, where $m$ is the number of clients and $n$ is the sample size at each client. To our knowledge, the obtained fast bounds are state-of-the-art in the two-level distribution framework. Moreover, previous theoretical results mostly require the loss function to be bounded. We derive convergence bounds of order $\mathcal{O}(\frac{1}{\sqrt{mn}} + \frac{1}{\sqrt{m}})$ without boundedness assumptions, including for sub-exponential and sub-Weibull losses. 
","Federated learning, Generalization error, Risk bound, Unbounded losses, Learning theory" Contrastive Unsupervised Learning of World Model with Invariant Causal Features,https://openreview.net/forum?id=7DtgxVZGj-y,https://openreview.net/pdf?id=7DtgxVZGj-y,"We present a world model, which learns the causal features using invariance principle and achieves state-of-the-art performance on out-of-distribution generalisation.","In this paper we present a world model, which learns the causal features using invariance principle. We use contrastive unsupervised learning to learn the invariant causal features, which enforces invariance across augmentations of irrelevant parts or styles of the observation. Since the world model based reinforcement learning methods optimize representation learning and policy of the agent independently, contrastive loss collapses due to lack of supervisory signal to the representation learning module. We propose depth reconstruction as an auxiliary task to explicitly enforce the invariance and data augmentation as style intervention on the RGB space to mitigate this issue. Our design help us to leverage state-of-the-art unsupervised representation learning method to learn the world model with invariant causal features, which outperforms current state-of-the-art model-based as well as model-free reinforcement learning methods on out-of-distribution point navigation tasks on Gibson and iGibson dataset at 100k and 500k interaction step benchmarks. Further experiments on DeepMind control suite even without depth reconstruction, our proposed model performs on par with the state-of-the-art counterpart models.","world models, causality, contrastive learning, model-based reinforcement learning, reinforcement learning, out-of-distribution generalisation, sim-to-real transfer, robot navigation" GOING BEYOND 1-WL EXPRESSIVE POWER WITH 1-LAYER GRAPH NEURAL NETWORKS,https://openreview.net/forum?id=jDOE5xirIJb,https://openreview.net/pdf?id=jDOE5xirIJb,A fast and memory-efficient method to enhance the expressive power of GNNs,"Graph neural networks have become the \textit{de facto} standard for representational learning in graphs, and have achieved SOTA in many graph-related tasks such as node classification, graph classification and link prediction. However, it has been shown that the expressive power is equivalent maximally to Weisfeiler-Lehman Test. Recently, there is a line of work aiming to enhance the expressive power of graph neural networks. In this work, we propose a more generalized variant of neural Weisfeiler-Lehman test to enhance structural representation for each node in a graph to uplift the expressive power of any graph neural network. It is shown theoretically our method is strictly more powerful than 1\&2-WL test. The Numerical experiments also demonstrate that our proposed method outperforms the standard GNNs on almost all the benchmark datasets by a large margin in most cases with significantly lower running time and memory consumption compared with other more powerful GNNs. ","graph neural networks, expressivity, memory-efficient" System Identification as a Reinforcement Learning Problem,https://openreview.net/forum?id=2iKvo44-Bya,https://openreview.net/pdf?id=2iKvo44-Bya,System Identification as a Reinforcement Learning Problem,"System identification, also known as learning forward models, transfer functions, system dynamics, etc., has a long tradition both in science and engineering in different fields. 
Particularly, it is a recurring theme in Reinforcement Learning research, where forward models approximate the state transition function of a Markov Decision Process by learning a mapping function from current state and action to the next state. This problem is commonly framed directly as a Supervised Learning problem. This common approach faces several difficulties due to the inherent complexities of the dynamics to learn, for example, delayed effects, high non-linearity, non-stationarity, partial observability and, more importantly, error accumulation when using bootstrapped predictions (predictions based on past predictions) over large time horizons. Here we explore the use of Reinforcement Learning in this problem. We elaborate on why and how this problem fits naturally as a Reinforcement Learning problem, and present experimental results that demonstrate RL is a promising technique to solve these kinds of problems.","System Identification, Reinforcement Learning, Offline Reinforcement Learning, Forward Models" When to Trust Aggregated Gradients: Addressing Negative Client Sampling in Federated Learning,https://openreview.net/forum?id=ZLvwIqjMJtO,https://openreview.net/pdf?id=ZLvwIqjMJtO,"We find that existing federated optimization suffers from unreliable aggregated gradients caused by negative client sampling results, and propose a gradient similarity–aware learning rate adaptation mechanism to address this problem.","Federated Learning has become a widely-used framework which allows learning a global model on decentralized local datasets under the condition of protecting local data privacy. However, federated learning faces severe optimization difficulty when training samples are not independently and identically distributed (non-i.i.d.). In this paper, we point out that the client sampling practice plays a decisive role in the aforementioned optimization difficulty. We find that negative client sampling will cause the merged data distribution of currently sampled clients to be heavily inconsistent with that of all available clients, and further makes the aggregated gradient unreliable. To address this issue, we propose a novel learning rate adaptation mechanism to adaptively adjust the server learning rate for the aggregated gradient in each round, according to the consistency between the merged data distribution of currently sampled clients and that of all available clients. Specifically, we make theoretical deductions to find a meaningful and robust indicator that is positively related to the optimal server learning rate and can effectively reflect the merged data distribution of sampled clients, and we utilize it for the server learning rate adaptation. Extensive experiments on multiple image and text classification tasks validate the great effectiveness of our method.","federated learning, client sampling" More ConvNets in the 2020s: Scaling up Kernels Beyond 51x51 using Sparsity,https://openreview.net/forum?id=bXNl-myZkJl,https://openreview.net/pdf?id=bXNl-myZkJl,"We propose Sparse Large Kernel Network (SLaK), a pure CNN architecture equipped with 51x51 kernels that can perform on par with or better than the state-of-the-art hierarchical Transformers and modern ConvNets.","Transformers have quickly shined in the computer vision world since the emergence of Vision Transformers (ViTs). The dominant role of convolutional neural networks (CNNs) seems to be challenged by increasingly effective transformer-based models. 
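The error-accumulation failure mode of supervised forward models, mentioned above, is easiest to see in a multi-step rollout where each prediction feeds the next. A minimal sketch in plain Python; `model` is any learned one-step predictor:

```python
def rollout(model, state, actions):
    """Bootstrapped multi-step prediction with a learned forward model.

    Each step consumes the previous *prediction* rather than a true
    state, so one-step errors compound over the horizon -- the failure
    mode that motivates treating system identification as a sequential
    decision problem rather than pure supervised learning.
    """
    trajectory = [state]
    for a in actions:
        state = model(state, a)   # prediction based on a past prediction
        trajectory.append(state)
    return trajectory
```

Framing identification as RL lets the training signal reflect the quality of whole rollouts instead of independent one-step pairs.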
Very recently, a couple of advanced convolutional models have struck back with large kernels motivated by the local-window attention mechanism, showing appealing performance and efficiency. While one of them, i.e., RepLKNet, impressively manages to scale the kernel size to 31x31 with improved performance, the performance starts to saturate as the kernel size continues growing, compared to the scaling trend of advanced ViTs such as Swin Transformer. In this paper, we explore the possibility of training convolutions larger than 31x31 and test whether the performance gap can be eliminated by strategically enlarging convolutions. This study ends up with a recipe for applying extremely large kernels from the perspective of sparsity, which can smoothly scale up kernels to 61x61 with better performance. Building on this recipe, we propose Sparse Large Kernel Network (SLaK), a pure CNN architecture equipped with sparse factorized 51x51 kernels that can perform on par with or better than state-of-the-art hierarchical Transformers and modern ConvNet architectures like ConvNeXt and RepLKNet, on ImageNet classification as well as a wide range of downstream tasks including semantic segmentation on ADE20K, object detection on PASCAL VOC 2007, and object detection/segmentation on MS COCO. Code is included in the supplementary material. ","51x51 kernel, Large kernel convolution, convolutional neural networks, sparsity, backbone" Language-Aware Soft Prompting for Vision & Language Foundation Models,https://openreview.net/forum?id=W4HBwaybWedX,https://openreview.net/pdf?id=W4HBwaybWedX,,"This paper is on soft prompt learning for Vision & Language (V&L) models. Similarly to their NLP counterparts, V&L models can be adapted to a downstream task by learning soft continuous prompts using a few training examples. Current methods learn the soft prompts by minimizing a cross-entropy loss using as class weights the features obtained by passing the prompts plus the class names through the text encoder. Such methods, however, significantly overfit the training data, suffering from large accuracy degradation when tested on unseen classes from the same domain. Our main contribution, in this paper, is a surprisingly simple approach to alleviate this problem: we use a second cross-entropy loss to minimize the distance between the learned soft prompts and a set of hand-engineered manual prompts (obtained by prompt engineering). The proposed loss can be interpreted in multiple ways including as a regularizer, as a means for language-based augmentation, and as a way of learning more discriminative class centroids. Importantly, our formulation is inherently amenable to including, during training, virtual classes, i.e., class names for which no visual samples are available, further increasing the robustness of the learned prompts. Through extensive evaluations on 11 datasets, we show that our approach (a) significantly outperforms all prior works on soft prompting, and (b) matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for the majority of the test datasets. Code will be made available.", On Structural Expressive Power of Graph Transformers,https://openreview.net/forum?id=ek6kvkKb7wm,https://openreview.net/pdf?id=ek6kvkKb7wm,Investigating the expressive power of graph Transformers.,"The Graph Transformer (GT) has recently received wide attention in the research community due to its outstanding performance, yet its structural expressive power has not been well analyzed.
Inspired by the connections between the Weisfeiler-Lehman (WL) graph isomorphism test and graph neural networks (GNNs), we introduce the \textbf{GT test}, a generalized graph isomorphism test algorithm, as a powerful theoretical tool for exploring the structural discriminative power of graph Transformers. We theoretically prove that the GT test is an expressivity upper bound on a wide range of graph Transformers, and that the representational power of the GT test can be arbitrarily well approximated by a simple Transformer network under certain conditions. With the GT test, we show how graph Transformers' expressive power is determined by the design of structural encodings, and present conditions that make the expressivity of graph Transformers go beyond the WL test and GNNs. Moreover, motivated by the popular shortest path distance encoding, we follow these theory-oriented principles and develop a provably stronger structural encoding method, Shortest Path Induced Subgraph (\textit{SPIS}) encoding. Our theoretical findings provide a novel and practical paradigm for investigating the expressive power of graph Transformers, and extensive synthetic and real-world experiments empirically verify the strengths of our proposed methods.","graph representation learning, graph isomorphism testing, graph Transformer" Projected Latent Distillation for Data-Agnostic Consolidation in Multi-Agent Continual Learning,https://openreview.net/forum?id=NCNT1r62-UV,https://openreview.net/pdf?id=NCNT1r62-UV,,"Many real-world applications are characterized by non-stationary distributions. In this setting, independent expert models trained on subsets of the data can benefit from each other and improve their generalization and forward transfer by sharing knowledge. In this paper, we formalize this problem as a multi-agent continual learning scenario, where agents are trained independently but can communicate by sharing model parameters after each learning experience. We split the learning problem into two phases: adaptation and consolidation. Adaptation is a learning phase that optimizes the current task, while consolidation prevents forgetting by combining expert models, enabling knowledge sharing. We propose Data-Agnostic Consolidation (DAC), a novel double knowledge distillation method. The method performs distillation in the latent space via a novel Projected Latent Distillation (PLD) loss. Experimental results show state-of-the-art accuracy on SplitCIFAR100 even when a single out-of-distribution image is used as the only source of data during consolidation.","continual learning, knowledge distillation" Few-shot Cross-domain Image Generation via Inference-time Latent-code Learning,https://openreview.net/forum?id=sCYXJr3QJM8,https://openreview.net/pdf?id=sCYXJr3QJM8,Adapt a GAN trained on a single large-scale source dataset to multiple target domains containing very few examples without re-training the pretrained source generator.,"In this work, our objective is to adapt a deep generative model trained on a large-scale source dataset to multiple target domains with scarce data. Specifically, we focus on adapting a pre-trained Generative Adversarial Network (GAN) to a target domain without re-training the generator. Our method draws motivation from the fact that out-of-distribution samples can be `embedded' onto the latent space of a pre-trained source-GAN. We propose to train a small latent-generation network during the inference stage, each time a batch of target samples is to be generated.
These target latent codes are fed to the source-generator to obtain novel target samples. Despite using the same small set of target samples and the source generator, multiple independent training episodes of the latent-generation network result in diverse generated target samples. Our method, albeit simple, can be used to generate data from multiple target distributions using a generator trained on a single source distribution. We demonstrate the efficacy of our surprisingly simple method in generating multiple target datasets with only a single source generator and a few target samples.","generative domain adaptation, generative adversarial network" RLx2: Training a Sparse Deep Reinforcement Learning Model from Scratch,https://openreview.net/forum?id=DJEEqoAq7to,https://openreview.net/pdf?id=DJEEqoAq7to,"We propose a new framework for training an efficient DRL agent from scratch with an ultra-sparse network, achieving strong performance without degradation.","Training deep reinforcement learning (DRL) models usually requires high computation costs. Therefore, compressing DRL models possesses immense potential for training acceleration and model deployment. However, existing methods that generate small models mainly adopt the knowledge distillation-based approach by iteratively training a dense network. As a result, the training process still demands massive computing resources. Indeed, sparse training from scratch in DRL has not been well explored and is particularly challenging due to non-stationarity in bootstrap training. In this work, we propose a novel sparse DRL training framework, “the Rigged Reinforcement Learning Lottery” (RLx2), which builds upon gradient-based topology evolution and is capable of training a sparse DRL model based entirely on a sparse network. Specifically, RLx2 introduces a novel multi-step TD target mechanism with a dynamic-capacity replay buffer to achieve robust value learning and efficient topology exploration in sparse models. It also reaches state-of-the-art sparse training performance in several tasks, showing $7.5\times$-$20\times$ model compression with less than $3\%$ performance degradation and up to $20\times$ and $50\times$ FLOPs reduction for training and inference, respectively.","Deep Reinforcement Learning, Lottery Ticket Hypothesis, Model Compression, Value Learning" Black-Box Adversarial Attack Guided by Model Behavior for Programming Pre-trained Language Models,https://openreview.net/forum?id=6G1MXNU8VtV,https://openreview.net/pdf?id=6G1MXNU8VtV,We use the uncertainty of model outputs to guide the search for adversarial examples via variable name replacement.,"Pre-trained models for programming languages are widely used to solve code tasks in the Software Engineering (SE) community, such as code clone detection and bug identification. Reliability is the primary concern of these machine learning applications in SE because software failure can lead to intolerable loss. However, deep neural networks are known to suffer from adversarial attacks. In this paper, we propose a novel black-box adversarial attack based on model behaviors for pre-trained programming language models, named Representation Nearest Neighbor Search (RNNS). The proposed approach can efficiently identify adversarial examples via variable replacement in an ample search space of real variable names under similarity constraints.
We evaluate RNNS on 6 code tasks (e.g., clone detection), 3 programming languages (Java, Python, and C), and 3 pre-trained code models: CodeBERT, GraphCodeBERT, and CodeT5. The results demonstrate that RNNS outperforms the state-of-the-art black-box attacking method (MHM) in terms of both attack success rate and quality of generated adversarial examples. ","black-box, adversarial attack, pre-trained models for programming languages, code model" Learning Critically in Federated Learning with Noisy and Heterogeneous Clients,https://openreview.net/forum?id=7zxPlqOT5us,https://openreview.net/pdf?id=7zxPlqOT5us,,"Federated learning (FL) is a distributed learning framework for collaboratively training models with privacy guarantees. Class imbalance is a main problem in FL with heterogeneous clients. Besides, label noise is also an inherent problem in practical scenarios, since clients have varied expertise in annotation. However, the co-existence of heterogeneous label noise and class-imbalanced distributions in FL’s small local datasets renders conventional label-noise learning methods ineffective. Thus, in this paper, we propose the FedCNI algorithm, including a noise-resilient local solver and a robust global aggregator, to address the challenges of noisy and highly-skewed data in FL without using an additional clean proxy dataset. For the local solver, we first design a prototypical classifier to detect the noisy samples by evaluating the similarity between samples and prototypes. Then, we introduce a curriculum pseudo-labeling method that cautiously learns from the noisy samples with class-specific thresholds. For the global aggregator, we aggregate critically by switching the re-weighting of aggregation from data size to noise level in different learning periods. Experiments on real-world datasets demonstrate that our method can substantially outperform state-of-the-art solutions and is robust in mixed-heterogeneous FL environments.","Federated learning, Noisy labels, Class imbalance" Rethinking Positive Sampling for Contrastive Learning with Kernel,https://openreview.net/forum?id=-WiOF7FTt-n,https://openreview.net/pdf?id=-WiOF7FTt-n,Improving positive sampling in contrastive learning using kernel," Data augmentation is a crucial component in unsupervised contrastive learning (CL). It determines how positive samples are defined and, ultimately, the quality of the representation. Even though efforts have been made to find efficient augmentations for ImageNet, CL still underperforms compared to supervised methods, and it remains an open problem in other applications, such as medical imaging, or in datasets with easy-to-learn but irrelevant imaging features. In this work, we propose a new way to define positive samples using kernel theory along with a novel loss called \textit{decoupled uniformity}. We propose to integrate prior information, learnt from generative models viewed as feature extractors, or given as auxiliary attributes, into contrastive learning, to make it less dependent on data augmentation. We draw a connection between contrastive learning and the conditional mean embedding theory to derive tight bounds on the downstream classification loss. In an unsupervised setting, we empirically demonstrate that CL benefits from generative models, such as VAEs and GANs, to rely less on data augmentations. We validate our framework on vision and medical datasets including CIFAR10, CIFAR100, STL10, ImageNet100, CheXpert and a brain MRI dataset.
In the weakly supervised setting, we demonstrate that our formulation provides state-of-the-art results.","contrastive learning, kernel theory, representation learning, deep learning" Stationary Deep Reinforcement Learning with Quantum K-spin Hamiltonian Equation,https://openreview.net/forum?id=LVum7knUA7g,https://openreview.net/pdf?id=LVum7knUA7g,,"Instability is a major issue of deep reinforcement learning (DRL) algorithms --- high variance of cumulative rewards over multiple runs. The instability is mainly caused by the existence of \textit{many local minima} and worsened by the \textit{multiple fixed points} issue of Bellman's optimality equation. As a fix, we propose a quantum K-spin Hamiltonian regularization term (called \textit{H-term}) to help a policy network converge to a high-quality local minimum. First, we take a quantum perspective by modeling a policy as a \textit{K-spin Ising model} and employ a Hamiltonian equation to measure the \textit{energy} of a policy. Then, we derive a novel Hamiltonian policy gradient theorem and design a generic actor-critic algorithm that utilizes the H-term to regularize the policy network. Finally, the proposed method significantly reduces the variance of cumulative rewards by $65.2\% \sim 85.6\%$ on six MuJoCo tasks; it achieves an approximation ratio $\leq 1.05$ on over $90\%$ of test cases and reduces the variance by $60.16\% \sim 94.52\%$ on two combinatorial optimization tasks and two non-convex optimization tasks, compared with existing algorithms over $20$ runs, respectively.", Performance Disparities Between Accents in Automatic Speech Recognition,https://openreview.net/forum?id=oqSKdRyYO1g,https://openreview.net/pdf?id=oqSKdRyYO1g,,"Automatic speech recognition (ASR) services are ubiquitous, transforming speech into text for systems like Amazon’s Alexa, Google’s Assistant, and Microsoft’s Cortana. Past research has identified discriminatory ASR performance as a function of racial group and nationality. In this paper, we expand the discussion about nationality and English language ASR by performing an audit of some of the most popular English ASR services using a large and global data set of speech from The Speech Accent Archive. We show that performance disparities exist as a function of whether or not a speaker’s first language is English and, even when controlling for multiple linguistic covariates, that these disparities have a statistically significant relationship to the political alignment of the speaker’s birth country with respect to the United States’ geopolitical power. We discuss this bias in the context of the historical use of language to maintain global and political power.","bias, automatic speech recognition, natural language processing, artificial intelligence, machine learning, accent, dialect, english, language, speech, fairness, audit" Do Perceptually Aligned Gradients Imply Robustness?,https://openreview.net/forum?id=W6topEXC2-v,https://openreview.net/pdf?id=W6topEXC2-v,,"Deep learning-based networks have achieved unprecedented success in numerous tasks, among them image classification. Despite these remarkable achievements, recent studies have demonstrated that such classification networks are easily fooled by small malicious perturbations, also known as adversarial examples. This security weakness led to extensive research aimed at obtaining robust models. Beyond the clear robustness benefits of such models, it was also observed that their gradients with respect to the input align with human perception.
Several works have identified Perceptually Aligned Gradients (PAG) as a byproduct of robust training, but none has considered it as a standalone phenomenon nor studied its own implications. In this work, we focus on this trait and test whether \emph{Perceptually Aligned Gradients imply Robustness}. To this end, we develop a novel objective to directly promote PAG in training classifiers and examine whether models with such gradients are more robust to adversarial attacks. We present both heuristic and principled ways for obtaining target PAGs, which our method aims to learn. Specifically, we harness recent findings in score-based generative modeling as a source for PAG. Extensive experiments on CIFAR-10 and STL validate that models trained with our method have improved robust performance, exposing the surprising bidirectional connection between PAG and robustness.","Adversarial Robustness, Perceptually Aligned Gradients" Sparsity May Cry: Let Us Fail (Current) Sparse Neural Networks Together!,https://openreview.net/forum?id=J6F3lLg4Kdp,https://openreview.net/pdf?id=J6F3lLg4Kdp,"In this work, we assemble a large-scale, difficult and diverse benchmark for sparse neural networks, on which current SOTA sparse networks are actually prone to significant performance degradation, sometimes even at trivial sparsity levels, e.g., 5%.","Sparse Neural Networks (SNNs) have received voluminous attention predominantly due to the growing computational and memory footprints of consistently exploding parameter counts in large-scale models. Similar to their dense counterparts, recent SNNs generalize just as well and are equipped with numerous favorable benefits (e.g., low complexity, high scalability, and robustness), sometimes even better than the original dense networks. As research effort is focused on developing increasingly sophisticated sparse algorithms, it is startling that a comprehensive benchmark to evaluate the effectiveness of these algorithms has been highly overlooked. In the absence of a carefully crafted evaluation benchmark, most, if not all, sparse algorithms are evaluated against fairly simple and naive tasks (e.g., CIFAR-10/100, ImageNet, GLUE, etc.), which can potentially camouflage many advantages as well as unexpected predicaments of SNNs. In pursuit of a more general evaluation and unveiling the true potential of sparse algorithms, we introduce the ""Sparsity May Cry"" Benchmark (SMC-Bench), a collection of 4 carefully curated and diverse tasks with 12 datasets, which accounts for capturing a wide range of domain-specific knowledge. Our systematic evaluation of representative SOTA sparse algorithms reveals an important obscured observation: all of the SOTA sparse algorithms bluntly fail to perform on SMC-Bench, sometimes at trivial sparsity levels as low as 5%, which calls for the immediate attention of the sparsity community to reconsider the highly proclaimed benefits of SNNs. By incorporating these well-thought-out and diverse tasks, SMC-Bench is designed to favor and encourage the development of highly generalizable sparse algorithms. We plan to open-source the SMC-Bench evaluation suite to encourage sparsity researchers and assist them in building next-generation sparse algorithms with the potential to generalize on complex and practical tasks. 
","Sparse Neural Networks, Benchmark, Sparsity, Neural Network Pruning" Multitask Reinforcement Learning by Optimizing Neural Pathways,https://openreview.net/forum?id=H73xwqPfW2f,https://openreview.net/pdf?id=H73xwqPfW2f,Proposing a novel multitask learning framework using task-specific neural pathways for online and offline reinforcement learning.,"Reinforcement learning (RL) algorithms have achieved great success in learning specific tasks, as evidenced by examples such as AlphaGo or fusion control. However, it is still difficult for an RL agent to learn how to solve multiple tasks. In this paper, we propose a novel multitask learning framework, in which multiple specialized pathways through a single network are trained simultaneously, with each pathway focusing on a single task. We show that this approach achieves competitive performance with existing multitask RL methods, while using only 5% of the number of neurons per task. We demonstrate empirically the success of our approach on several continuous control tasks, in both online and offline training.","Multitask Learning, Online Reinforcement Learning, Offline Reinforcement Learning, Neural Pathways" Input Perturbation Reduces Exposure Bias in Diffusion Models,https://openreview.net/forum?id=TFMEqzfFrP_,https://openreview.net/pdf?id=TFMEqzfFrP_,,"Denoising Diffusion Probabilistic Models (DDPMs) are fast becoming one of the dominant generative methods thanks to their high generation quality and diversity. However, one of the main problems of DDPMs is their large computational cost, which is due to the chain of sampling steps. In this paper, we argue that one of the reasons why DDPMs need a long sampling chain is due to an exposure bias problem, similar to the analogous problem in autoregressive text generation. Specifically, we note that there is a discrepancy between training and testing, since the former is conditioned on the ground truth samples, while the latter is conditioned on the previously generated results. In order to alleviate this problem, we propose a very simple but effective training protocol modification, consisting in perturbing the ground truth samples to simulate the inference time prediction errors. We empirically show that the proposed input perturbation leads to a significant improvement of the sample quality and to smoother sampling chains, with a drastic acceleration of the inference time. For instance, in all the tested benchmarks, we observed an acceleration over a state-of-the-art DDPM of 12.5 times.","Generative Models, Diffusion Model" Pruning Parameterization with Bi-level Optimization for Efficient Semantic Segmentation on the Edge,https://openreview.net/forum?id=qRD7kqmr9HJ,https://openreview.net/pdf?id=qRD7kqmr9HJ,We proposed a pruning parameterization method for real-time semantic segmentation on the edge.,"With the ever-increasing popularity of edge devices, it is necessary to implement real-time segmentation on the edge for autonomous driving and many other applications. Vision Transformers (ViTs) have shown considerably stronger results for many vision tasks. However, ViTs with the full-attention mechanism usually consume a large number of computational resources, leading to difficulties for real-time inference on edge devices. In this paper, we aim to derive ViTs with fewer computations and fast inference speed to facilitate the dense prediction of semantic segmentation on edge devices. 
To achieve this, we propose a pruning parameterization method to formulate the pruning problem of semantic segmentation. Then we adopt a bi-level optimization method to solve this problem with the help of implicit gradients. Our experimental results demonstrate that we can achieve 38.9 mIoU on ADE20K val with a speed of 56.5 FPS on a Samsung S21, which is the highest mIoU under the same computation constraint with real-time inference.","Segmentation, efficient deep learning, pruning" How deep convolutional neural networks lose spatial information with training,https://openreview.net/forum?id=cy554rYBzMT,https://openreview.net/pdf?id=cy554rYBzMT,Deep nets perform image classification by aggregating information over space. We investigate the mechanisms by which this is achieved and propose a theory for an artificial scale-detection task.,"A central question of machine learning is how deep nets manage to learn tasks in high dimensions. An appealing hypothesis is that they achieve this feat by building a representation of the data where information irrelevant to the task is lost. For image data sets, this view is supported by the observation that after (and not before) training, the neural representation becomes less and less sensitive to diffeomorphisms acting on images as the signal propagates through the net. This loss of sensitivity correlates with performance, and surprisingly also correlates with a gain of sensitivity to white noise acquired during training. These facts are unexplained, and, as we demonstrate, still hold when white noise is added to the images of the training set. Here, we (i) show empirically for various architectures that stability to image diffeomorphisms is achieved by spatial pooling in the first half of the net, and by channel pooling in the second half, (ii) introduce a scale-detection task for a simple model of data where pooling is learnt during training, which captures all empirical observations above and (iii) compute in this model how stability to diffeomorphisms and noise scales with depth. The scalings are found to depend on the presence of strides in the net architecture. We find that the increased sensitivity to noise is due to the perturbing noise piling up during pooling, after a ReLU non-linearity is applied to the noise in the internal layers.","Deep Learning Theory, Convolutional Neural Networks, Curse of Dimensionality, Representation Learning, Feature Learning, Computer Vision, Pooling, Stability, Diffeomorphisms, Gaussian noise, Image Classification, Learning Invariants" Which Layer is Learning Faster? A Systematic Exploration of Layer-wise Convergence Rate for Deep Neural Networks,https://openreview.net/forum?id=wlMDF1jQF86,https://openreview.net/pdf?id=wlMDF1jQF86,"We empirically show that the shallower layers converge faster than the deeper layers in neural networks, and provide the theoretical justification and practical value of this finding.","The deeply hierarchical structures enable deep neural networks (DNNs) to fit extremely complex target functions. However, the complex interaction between layers also makes the learning process of a particular layer poorly understood. This work demonstrates that the shallower layers of DNNs tend to converge faster than the deeper layers. We call this phenomenon Layer Convergence Bias. We also uncover the fundamental reason behind this phenomenon: Flatter local minima of shallower layers make their gradients more stable and predictive, allowing for faster training.
Another surprising result is that the shallower layers tend to learn the low-frequency components of the target function, while the deeper layers usually learn the high-frequency components. This is consistent with the recent discovery that DNNs learn lower-frequency objects faster.","Deep neural networks, Convergence rate" Joint Embedding Self-Supervised Learning in the Kernel Regime,https://openreview.net/forum?id=eEoSHelICSG,https://openreview.net/pdf?id=eEoSHelICSG,We analyze and derive self-supervised learning algorithms using kernel methods,"The fundamental goal of self-supervised learning (SSL) is to produce useful representations of data without access to any labels for classifying the data. Modern methods in SSL, which form representations based on known or constructed relationships between samples, have been particularly effective at this task. Here, we aim to extend this framework to incorporate algorithms based on kernel methods where embeddings are constructed by linear maps acting on the feature space of a kernel. In this kernel regime, we derive methods to find the optimal form of the output representations for contrastive and non-contrastive loss functions. This procedure produces a new representation space with an inner product, denoted as the induced kernel, which generally correlates points that are related by an augmentation in kernel space and de-correlates points otherwise. We analyze our kernel model on small datasets to identify common features of self-supervised learning algorithms and gain theoretical insights into their performance on downstream tasks.", Linear convergence for natural policy gradient with log-linear policy parametrization,https://openreview.net/forum?id=03sXXjL1um3,https://openreview.net/pdf?id=03sXXjL1um3,,"We analyze the convergence rate of the \emph{unregularized} natural policy gradient algorithm with log-linear policy parametrizations in infinite-horizon discounted Markov decision processes. In the deterministic case, when the Q-value is known and can be approximated by a linear combination of a known feature function up to a bias error, we show that a geometrically-increasing step size yields a linear convergence rate towards an optimal policy. We then consider the sample-based case, when the best representation of the Q-value function among linear combinations of a known feature function is known up to an estimation error. In this setting, we show that the algorithm enjoys the same linear guarantees as in the deterministic case up to an error term that depends on the estimation error, the bias error, and the condition number of the feature covariance matrix. Our results build upon the general framework of policy mirror descent and extend previous findings for the softmax tabular parametrization to the log-linear policy class.", A non-asymptotic analysis of oversmoothing in Graph Neural Networks,https://openreview.net/forum?id=CJd-BtnwtXq,https://openreview.net/pdf?id=CJd-BtnwtXq,We precisely characterize the mechanism of oversmoothing via a non-asymptotic analysis and answer why oversmoothing happens in shallow GNNs.,"A central challenge of building more powerful Graph Neural Networks (GNNs) is the oversmoothing phenomenon, where increasing the network depth leads to homogeneous node representations and thus worse classification performance.
While previous works have only demonstrated that oversmoothing is inevitable when the number of graph convolutions tends to infinity, in this paper, we precisely characterize the mechanism behind the phenomenon via a non-asymptotic analysis. Specifically, we distinguish between two different effects when applying graph convolutions—an undesirable mixing effect that homogenizes node representations in different classes, and a desirable denoising effect that homogenizes node representations in the same class. By quantifying these two effects on random graphs sampled from the Contextual Stochastic Block Model (CSBM), we show that oversmoothing happens once the mixing effect starts to dominate the denoising effect, and the number of layers required for this transition is $O(\log N/\log (\log N))$ for sufficiently dense graphs with $N$ nodes. We also extend our analysis to study the effects of Personalized PageRank (PPR) on oversmoothing. Our results suggest that while PPR mitigates oversmoothing at deeper layers, PPR-based architectures still achieve their best performance at a shallow depth and are outperformed by the graph convolution approach on certain graphs. Finally, we support our theoretical results with numerical experiments, which further suggest that the oversmoothing phenomenon observed in practice may be exacerbated by the difficulty of optimizing deep GNN models.","graph neural networks, oversmoothing, representational power, theory, deep learning" Class-Incremental Learning with Repetition,https://openreview.net/forum?id=ESR6hysKDsW,https://openreview.net/pdf?id=ESR6hysKDsW,,"Real-world data streams naturally include the repetition of previous concepts. From a Continual Learning (CL) perspective, repetition is a property of the environment and, unlike replay, cannot be controlled by the user. Nowadays, Class-Incremental scenarios represent the leading test-bed for assessing and comparing CL strategies. This family of scenarios is very easy to use, but it never allows revisiting previously seen classes, thus completely disregarding the role of repetition. We focus on the family of Class-Incremental with Repetition (CIR) scenarios, where repetition is embedded in the definition of the stream. We propose two stochastic scenario generators that produce a wide range of CIR scenarios starting from a single dataset and a few control parameters. We conduct the first comprehensive evaluation of repetition in CL by studying the behavior of existing CL strategies under different CIR scenarios. We then present a novel replay strategy that exploits repetition and counteracts the natural imbalance present in the stream. On both CIFAR100 and TinyImageNet, our strategy outperforms other replay approaches, which are not designed for environments with repetition.","continual learning, lifelong learning, class-incremental learning, incremental learning" Scaleformer: Iterative Multi-scale Refining Transformers for Time Series Forecasting,https://openreview.net/forum?id=sCrnllCtjoE,https://openreview.net/pdf?id=sCrnllCtjoE,We propose a new framework to improve recent state-of-the-art transformer-based time-series forecasting models.,"The performance of time series forecasting has recently been greatly improved by the introduction of transformers. In this paper, we propose a general multi-scale framework that can be applied to state-of-the-art transformer-based time series forecasting models (FEDformer, Autoformer, etc.).
By iteratively refining a forecasted time series at multiple scales with shared weights, together with architecture adaptations and a specially designed normalization scheme, we are able to achieve significant performance improvements with minimal additional computational overhead. Via detailed ablation studies, we demonstrate the effectiveness of our proposed architectural and methodological innovations. Furthermore, our experiments on various public datasets demonstrate that the proposed method outperforms the corresponding baselines. Depending on the choice of transformer architecture, our multi-scale framework results in mean squared error reductions ranging from 5.5% to 38.5%. Our code is publicly available at https://github.com/Scaleformer/Scaleformer.","Time-series forecasting, Transformers" Theoretical Characterization of How Neural Network Pruning Affects its Generalization,https://openreview.net/forum?id=dn6_PK73hAY,https://openreview.net/pdf?id=dn6_PK73hAY,We study the effect of pruning at different rates on neural network generalization.,"It has been observed in practice that applying pruning-at-initialization methods to neural networks and training the sparsified networks can not only retain the testing performance of the original dense models, but also sometimes even slightly boost the generalization performance. A theoretical understanding of such experimental observations is yet to be developed. This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization. Specifically, this work considers a classification task for overparameterized two-layer neural networks, where the network is randomly pruned according to different rates at initialization. It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero and the network exhibits good generalization performance. More surprisingly, the generalization bound gets better as the pruning fraction gets larger. To complement this positive result, this work further shows a negative result: there exists a large pruning fraction such that while gradient descent is still able to drive the training loss toward zero (by memorizing noise), the generalization performance is no better than random guessing. This further suggests that pruning can change the feature learning process, which leads to the performance drop of the pruned neural network. To our knowledge, this is the first generalization result for pruned neural networks, suggesting that pruning can improve the neural network's generalization. ","pruning, neural network, gradient descent, generalization" Backdoors Stuck At The Frontdoor: Multi-Agent Backdoor Attacks That Backfire,https://openreview.net/forum?id=fkNZtv_-BeW,https://openreview.net/pdf?id=fkNZtv_-BeW,,"Malicious agents in collaborative learning and outsourced data collection threaten the training of clean models. Backdoor attacks, where an attacker poisons a model during training to successfully achieve targeted misclassification, are a major concern for train-time robustness. In this paper, we investigate a multi-agent backdoor attack scenario, where multiple attackers attempt to backdoor a victim model simultaneously. A consistent backfiring phenomenon is observed across a wide range of games, where agents suffer from a low collective attack success rate.
We examine different modes of backdoor attack configurations (non-cooperation / cooperation, joint distribution shifts, and game setups), which return an equilibrium attack success rate at the lower bound. The results motivate the re-evaluation of backdoor defense research for practical environments.", "Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs",https://openreview.net/forum?id=SMa9EAovKMC,https://openreview.net/pdf?id=SMa9EAovKMC,,"The formalization of existing mathematical proofs is a notoriously difficult process. Despite decades of research on automation and proof assistants, writing formal proofs remains arduous and accessible only to a few experts. While previous studies to automate formalization focused on powerful search algorithms, no attempts were made to take advantage of available informal proofs. In this work, we introduce Draft, Sketch, and Prove (DSP), a method that maps informal proofs to formal proof sketches, and uses the sketches to guide an automated prover by directing its search to easier sub-problems. We investigate two relevant setups where informal proofs are either written by humans or generated by a language model. Our experiments and ablation studies show that large language models are able to produce well-structured formal sketches that follow the same reasoning steps as the informal proofs. Guiding an automated prover with these sketches enhances its performance from $20.9\%$ to $39.3\%$ on a collection of mathematical competition problems.", Interpolating Compressed Parameter Subspaces,https://openreview.net/forum?id=EPUWZhBd9Lb,https://openreview.net/pdf?id=EPUWZhBd9Lb,,"Though distribution shifts have caused growing concern for machine learning scalability, solutions tend to specialize towards a specific type of distribution shift. Methods for label shift may not succeed against domain or task shift, and vice versa. We learn that constructing a Compressed Parameter Subspace (CPS), a geometric structure representing distance-regularized parameters mapped to a set of train-time distributions, can maximize average accuracy over a broad range of distribution shifts concurrently. We show that sampling parameters within a CPS can mitigate backdoor, adversarial, permutation, stylization and rotation perturbations. We also show that training a hypernetwork representing a CPS can adapt to seen tasks as well as to unseen interpolated tasks.", Liquid Structural State-Space Models,https://openreview.net/forum?id=g4OTKRKfS7R,https://openreview.net/pdf?id=g4OTKRKfS7R,"We use the recently proposed parametrization and memorization techniques for training state-space models in a linearized version of liquid neural networks, and achieve SOTA on sequence modeling tasks.","A proper parametrization of state transition matrices of linear state-space models (SSMs) followed by standard nonlinearities enables them to efficiently learn representations from sequential data, establishing the state-of-the-art on a large series of long-range sequence modeling benchmarks. In this paper, we show that we can improve further when the structural SSM, such as S4, is given by a linear liquid time-constant (LTC) state-space model. LTC neural networks are causal continuous-time neural networks with an input-dependent state transition module, which makes them learn to adapt to incoming inputs at inference.
We show that by using a diagonal plus low-rank decomposition of the state transition matrix introduced in S4, and a few simplifications, the LTC-based structural state-space model, dubbed Liquid-S4, achieves new state-of-the-art generalization across sequence modeling tasks with long-term dependencies such as image, text, audio, and medical time-series, with an average performance of 87.32% on the Long-Range Arena benchmark. On the full raw Speech Command recognition dataset, Liquid-S4 achieves 96.78% accuracy with a 30% reduction in parameter counts compared to S4. The additional gain in performance is the direct result of Liquid-S4's kernel structure that takes into account the similarities of the input sequence samples during training and inference.","state-space models, liquid neural networks, time series, memory, recurrent neural networks" Equivariant Hypergraph Diffusion Neural Operators,https://openreview.net/forum?id=RiTjKoscnNd,https://openreview.net/pdf?id=RiTjKoscnNd,"In this work, we are inspired by hypergraph diffusion algorithms and design a novel HNN architecture that offers provable expressiveness while remaining efficient.","Hypergraph neural networks (HNNs), which use neural networks to encode hypergraphs, provide a promising way to model higher-order relations in data and further solve relevant prediction tasks built upon such higher-order relations. However, higher-order relations in practice contain complex patterns and are often highly irregular. So, it is often challenging to design an HNN that suffices to express those relations while maintaining computational efficiency. Inspired by hypergraph diffusion algorithms, this work proposes a new HNN architecture named ED-HNN, which provably approximates any continuous equivariant hypergraph diffusion operator that can model a wide range of higher-order relations. ED-HNN can be implemented efficiently by combining star expansions of hypergraphs with standard message passing neural networks. ED-HNN further shows great superiority in processing heterophilic hypergraphs and constructing deep models. We evaluate ED-HNN for node classification on nine real-world hypergraph datasets. ED-HNN uniformly outperforms the best baselines over these nine datasets and achieves more than 2%$\uparrow$ in prediction accuracy over four datasets therein. ","Hypergraph Neural Network, Hypergraph Diffusion, Equivariant Network" Ollivier-Ricci Curvature for Hypergraphs: A Unified Framework,https://openreview.net/forum?id=sPCKNl5qDps,https://openreview.net/pdf?id=sPCKNl5qDps,We introduce a flexible framework for Ollivier-Ricci curvature on hypergraphs.,"Bridging geometry and topology, curvature is a powerful and expressive invariant. While the utility of curvature has been theoretically and empirically confirmed in the context of manifolds and graphs, its generalization to the emerging domain of hypergraphs has remained largely unexplored. On graphs, Ollivier-Ricci curvature measures differences between random walks via Wasserstein distances, thus grounding a geometric concept in ideas from probability and optimal transport. We develop ORCHID, a flexible framework generalizing Ollivier-Ricci curvature to hypergraphs, and prove that the resulting curvatures have favorable theoretical properties.
Through extensive experiments on synthetic and real-world hypergraphs from different domains, we demonstrate that ORCHID curvatures are both scalable and useful for performing a variety of hypergraph tasks in practice.","curvature, hypergraphs, graphs, Wasserstein distance, topological data analysis, random walks, probability measure" gGN: learning to represent nodes in directed graphs as low-rank Gaussian distributions,https://openreview.net/forum?id=FU2FX1wDN2x,https://openreview.net/pdf?id=FU2FX1wDN2x,Representing graph nodes using low-rank Gaussian distributions,"Unsupervised learning of node representations from knowledge graphs is critical for numerous downstream tasks, ranging from large-scale graph analysis to measuring semantic similarity between nodes. This study presents gGN, a novel representation that defines graph nodes as Gaussian distributions. Unlike existing representations that approximate such distributions using diagonal covariance matrices, our proposal approximates them using low-rank perturbations. We demonstrate that this low-rank approximation is more expressive and better suited to represent complex asymmetric relations between nodes. In addition, we provide a computationally affordable algorithm for learning the low-rank representations in an unsupervised fashion. This learning algorithm uses a novel loss function based on the reverse Kullback-Leibler divergence and two ranking metrics whose joint minimization results in node representations that preserve not only node depths but also local and global asymmetric relationships between nodes. We assessed the representation power of the low-rank approximation with an in-depth systematic empirical study. The results show that our proposal was significantly better than the diagonal approximation for preserving graph structures. Moreover, gGN also outperformed 17 methods on the downstream task of measuring semantic similarity between graph nodes.","knowledge graphs, representation learning, low-rank approximation, Gaussian distribution, deep learning" Domain-Unified Prompt Representations for Source-Free Domain Generalization,https://openreview.net/forum?id=M_c03_fU2cl,https://openreview.net/pdf?id=M_c03_fU2cl,,"Domain generalization (DG), aiming to make models work on unseen domains, is a surefire way toward general artificial intelligence. Because current DG datasets are limited in scale and diversity, it is difficult for existing methods to scale to diverse domains in open-world scenarios (e.g., science fiction and pixelate styles). Therefore, the source-free domain generalization (SFDG) task is necessary and challenging. To address this challenge, we propose an approach based on large-scale vision-language pretraining models (e.g., CLIP), which exploits the extensive domain information embedded in them. The proposed scheme generates diverse prompts from a domain bank that contains many more diverse domains than existing DG datasets. Furthermore, our method yields domain-unified representations from these prompts, thus being able to cope with samples from open-world domains.
Extensive experiments on mainstream DG datasets, namely PACS, VLCS, OfficeHome, and DomainNet, show that the proposed method achieves competitive performance compared to state-of-the-art DG methods that require source domain data for training.","Source-free domain generalization, vision-language pretraining model" Biases in Evaluation of Molecular Optimization Methods and Bias Reduction Strategies,https://openreview.net/forum?id=Sh97TNO5YY_,https://openreview.net/pdf?id=Sh97TNO5YY_,"This paper analyzes biases in the evaluation of molecular optimization methods, and methods to alleviate them.","We are interested in in silico evaluation methodology for molecular optimization methods. Given a sample of molecules and their properties of interest, we wish not only to train a generator of molecules that can find those optimized with respect to a target property but also to evaluate its performance accurately. A common practice is to train a predictor of the target property on the sample and use it for both training and evaluating the generator. We theoretically investigate this evaluation methodology and show that it potentially suffers from two biases: one is due to misspecification of the predictor and the other to reusing the same sample for training and evaluation. We discuss bias reduction methods for each of the biases, and empirically investigate their effectiveness. ", Sharper Analysis of Sparsely Activated Wide Neural Networks with Trainable Biases,https://openreview.net/forum?id=G6-oxjbc_mK,https://openreview.net/pdf?id=G6-oxjbc_mK,We study the convergence and generalization of training one-hidden-layer neural networks with sparse activation.,"This work studies training one-hidden-layer overparameterized ReLU networks via gradient descent in the neural tangent kernel (NTK) regime, where, differently from the previous works, the networks' biases are trainable and are initialized to some constant rather than zero. The tantalizing benefit of such initialization is that the neural network will provably have a sparse activation pattern before, during and after training, which can enable fast training procedures and, therefore, reduce the training cost. The first set of results of this work characterizes the convergence of the network's gradient descent dynamics. The required width is provided to ensure that gradient descent can drive the training error towards zero at a linear rate. The contribution over previous work is that not only is the bias allowed to be updated by gradient descent under our setting, but also a finer analysis is given such that the required width to ensure the network's closeness to its NTK is improved. Secondly, the networks' generalization bound after training is provided. A width-sparsity dependence is presented, which yields a sparsity-dependent localized Rademacher complexity and a generalization bound matching previous results (ignoring logarithmic factors). To our knowledge, this is the first sparsity-dependent generalization result via localized Rademacher complexity. As a by-product, if the bias initialization is chosen to be zero, the width requirement improves the previous bound for the shallow networks' generalization. Lastly, since the generalization bound depends on the smallest eigenvalue of the limiting NTK and the bounds from previous works yield vacuous generalization, this work further studies the smallest eigenvalue of the limiting NTK.
Surprisingly, while it is not shown that trainable biases are necessary, trainable biases help to identify a nice data-dependent region where a much finer analysis of the NTK's smallest eigenvalue can be conducted, which leads to a much sharper lower bound than the previously known worst-case bound and, consequently, a non-vacuous generalization bound. Experiments are provided to validate our results. ","Convergence analysis, sparse activation, neural tangent kernel, Rademacher complexity, generalization bound" Hard-Meta-Dataset++: Towards Understanding Few-Shot Performance on Difficult Tasks,https://openreview.net/forum?id=wq0luyH3m4,https://openreview.net/pdf?id=wq0luyH3m4,"We propose (i) a general and computationally efficient algorithm to extract difficult few-shot classification tasks from large-scale vision datasets, and (ii) a new test benchmark of these difficult tasks to stress test few-shot classifiers.","Few-shot classification is the ability to adapt to any new classification task from only a few training examples. The performance of current top-performing few-shot classifiers varies widely across tasks, and they often fail on a subset of `difficult' tasks. This phenomenon has real-world consequences for deployed few-shot systems where safety and reliability are paramount, yet little has been done to understand these failure cases. In this paper, we study these difficult tasks to gain a more nuanced understanding of the limitations of current methods. To this end, we develop a general and computationally efficient algorithm to extract difficult tasks from any large-scale vision dataset. Notably, our algorithm can extract tasks at least 20x faster than existing methods, enabling its use on large-scale datasets. We use our algorithm to extract difficult tasks from Meta-Dataset, a widely-used few-shot classification benchmark, and other challenging large-scale vision datasets including ORBIT, CURE-OR and ObjectNet. These tasks are curated into Hard-Meta-Dataset++, a new few-shot testing benchmark to promote the development of methods that are robust to even the most difficult tasks. We use Hard-Meta-Dataset++ to stress-test an extensive suite of few-shot classification methods and show that state-of-the-art approaches fail catastrophically on difficult tasks. We believe that our extraction algorithm and Hard-Meta-Dataset++ will aid researchers in further understanding failure modes of few-shot classification models.","Few-shot learning, Meta-Dataset, Benchmarks, Evaluation" REVISITING PRUNING AT INITIALIZATION THROUGH THE LENS OF RAMANUJAN GRAPH,https://openreview.net/forum?id=uVcDssQff_,https://openreview.net/pdf?id=uVcDssQff_,,"Pruning neural networks at initialization (PaI) has received an upsurge of interest due to its end-to-end saving potential. PaI is able to find sparse subnetworks at initialization that can achieve comparable performance to the full networks according to different scores. These methods can surpass the trivial baseline of random pruning but suffer from a significant performance gap compared to post-training pruning. Previous approaches firmly rely on weights, gradients, and sanity checks as primary signals when conducting PaI analysis. To better understand the underlying mechanism of PaI, we propose to interpret it through the lens of the Ramanujan Graph - a class of expander graphs that are sparse while being highly connected.
It is believed that there should be a strong correlation between the Ramanujan graph and PaI, since both are about finding sparse and well-connected neural networks. However, the finer-grained link relating highly sparse and connected networks to their relative performance (i.e., the ranking of different sparse structures at the same specific global sparsity) is still missing. We observe that not only does the Ramanujan property for sparse networks show no significant relationship to PaI's relative performance, but maximizing it can also lead to the formation of pseudo-random graphs with no structural meaning. We reveal the underlying cause to be the Ramanujan Graph's highly draconian assumption on the upper bound of the third-largest eigenvalue ($\hat{\mu}$) of layers belonging to highly sparse networks. We hence propose Iterative Mean Difference of Bound (IMDB) as a means to relax the $\hat{\mu}$ upper bound. Likewise, we also show there exists a lower bound for $\hat{\mu}$, which we call the NormAlized Random Coefficient (NaRC), that gives us an accurate assessment of when a sparse but highly connected structure degenerates into naive randomness. Finally, we systematically analyze the behavior of various PaI methods and demonstrate the utility of our proposed metrics in characterizing PaI performance. We show that subnetworks which better preserve the IMDB property correlate with higher performance, while NaRC provides us with a possible means to locate the region where highly connected, highly sparse, and non-trivial Ramanujan expanders exist. Code will be made available upon acceptance. ","pruning at initialization, graph theory, Ramanujan Graph, sparse neural networks" Self-supervised Speech Enhancement using Multi-Modal Data,https://openreview.net/forum?id=OqPD_6kukm,https://openreview.net/pdf?id=OqPD_6kukm,Using clean low-resolution IMU data to supervise the multimodal denoiser,"Modern earphones come equipped with microphones and inertial measurement units (IMU). When a user wears the earphone, the IMU can serve as a second modality for detecting speech signals. Specifically, as humans speak to their earphones (e.g., during phone calls), the throat’s vibrations propagate through the skull to ultimately induce a vibration in the IMU. The IMU data is heavily distorted (compared to the microphone’s recordings), but IMUs offer a critical advantage: they are not interfered with by ambient sounds. This presents an opportunity in multi-modal speech enhancement, i.e., can the distorted but uninterfered IMU signal enhance the user’s speech when the microphone’s signal suffers from strong ambient interference? We combine the best of both modalities (microphone and IMU) by designing a cooperative and self-supervised network architecture that does not rely on clean speech data from the user. Instead, using only noisy speech recordings, the IMU learns to give hints on where the target speech is likely located. The microphone uses this hint to enrich the speech signal, which then trains the IMU to improve subsequent hints. This iterative approach yields promising results, comparable to a supervised denoiser trained on clean speech signals. When clean signals are also available to our architecture, we observe promising SI-SNR improvements.
We believe this result can aid speech-related applications in earphones and hearing aids, and potentially generalize to others, like audio-visual denoising.","multi-modal, self-supervised, denoising, iterative algorithm, attention map, expectation maximization, IMU" Impulse Control Arbitration for A Dual System of Exploitation and Exploration,https://openreview.net/forum?id=jAD0chIdt_,https://openreview.net/pdf?id=jAD0chIdt_,We propose a plug-and-play framework with a learned impulse control switching mechanism for targeted arbitration between exploration and exploitation behaviour.,"Efficient reinforcement learning (RL) involves a trade-off between ""exploitative"" actions that maximise expected reward and ""explorative"" ones that lead to the visitation of ""novel"" states. To encourage exploration, existing work has proposed methods such as injecting stochasticity into action selection, implicit regularisation, and additive synthetic rewards. However, these techniques do not necessarily offer entirely systematic approaches to making this trade-off. Here we introduce SElective Reinforcement EXploration (SEREX), a plug-and-play framework that casts the exploration-exploitation trade-off as a game between an RL agent, the exploiter, which purely exploits task-dependent rewards, and another RL agent, the switcher, which chooses at which states to activate a pure exploration policy that is trained to minimise system uncertainty and to override the exploiter. Using a form of policies known as impulse control, the switcher is able to determine the best set of states at which to switch to the exploration policy, while the exploiter is free to execute its actions everywhere else. We prove that SEREX converges quickly and induces a natural schedule towards pure exploitation. Through extensive empirical studies in both discrete and continuous control benchmarks, we show that with minimal modification, SEREX can be readily combined with existing RL algorithms and yields significant improvement in performance.","reinforcement learning, exploration-exploitation tradeoff, impulse control switching" Sparse MoE with Random Routing as the New Dropout: Training Bigger and Self-Scalable Models,https://openreview.net/forum?id=w1hwFUb_81,https://openreview.net/pdf?id=w1hwFUb_81,"A new plug-and-play strategy for training over-parameterized transformer models; it leverages SMoEs with random routing to empower scaling transformers to better performance in full-capacity settings without collapse.","Exploiting scale to revamp information absorption has recently become central to the success of deep learning, and transformers have become the $\textit{de facto}$ choice, achieving numerous breakthrough results on many real-world applications. Despite their enormous success, gigantic transformers suffer not only from exorbitant computational and memory footprints during training but also from severe collapse, as evidenced by a high degree of parameter redundancy. Recently proposed Sparsely-activated Mixture-of-Experts (SMoE) models have shown promise to mitigate the issue of training efficiency, yet they have some critical limitations. In particular, SMoE models are prone to $\textit{redundant experts}$ due to representational collapse, and to \textit{poor scalability during inference and downstream fine-tuning}, primarily due to overfitting of the learned routing policy to the number of activated experts during training. 
While recent research efforts are predominantly focused on improving routing policies to encourage expert specialization, our work focuses on $\textit{exploring the overlooked scalability bottleneck of SMoEs}$ in order to effectively benefit the scaling of large transformers. To this end, we propose a new plug-and-play training framework, $\textbf{SMoE-Dropout}$, to enable scaling transformers to better accuracy in the full-capacity setting without collapse. Specifically, SMoE-Dropout consists of a $\textit{randomly initialized and fixed}$ router network that activates experts and gradually increases their number as training progresses. SMoE-Dropout naturally provides a $\textbf{``self-slimmable''}$ property, offering consistently boosted performance for transformers with an increase in activated experts during inference and downstream fine-tuning, subject to resource availability. Our extensive experiments across diverse transformer architectures on a variety of tasks validate superior performance and substantial computation savings, compared to densely trained baselines with equivalent parameter counts. More precisely, our trained BERT outperforms its densely trained counterpart with consistent improvements of {$1.03\%$, $0.78\%$, $1.09\%$} on the challenging reasoning tasks {$\texttt{ASDiv-A}$, $\texttt{MAWPS}$, $\texttt{SVAMP}$}, respectively. Codes and models will be publicly released. ","Sparse Mixture-of-Experts, Random Routing, Transformer Training, Dropout" Deep Patch Visual Odometry,https://openreview.net/forum?id=mXwThfu1HQL,https://openreview.net/pdf?id=mXwThfu1HQL,We propose a new deep learning system for monocular Visual Odometry.,"We propose Deep Patch Visual Odometry (DPVO), a new deep learning system for monocular Visual Odometry (VO). DPVO is accurate and robust while running at 2x-5x real-time speeds on a single RTX-3090 GPU using only 4GB of memory. We perform evaluation on standard benchmarks and outperform all prior work (classical or learned) in both accuracy and speed. ","Visual Odometry, SLAM, Simultaneous Localization and Mapping, 3D, Structure from Motion" Compositional Semantic Parsing with Large Language Models,https://openreview.net/forum?id=gJW8hSGBys8,https://openreview.net/pdf?id=gJW8hSGBys8,"Using an extension of least-to-most prompting we demonstrate strong performance on two benchmarks for compositional generalization, CFQ and COGS, and achieve state of the art on CFQ while using only 1% of the training data.","Humans can reason compositionally when presented with new tasks. Previous research shows that appropriate prompting techniques enable large language models (LLMs) to solve artificial compositional generalization tasks such as SCAN. In this work, we identify additional challenges in more realistic semantic parsing tasks with larger vocabularies and refine these prompting techniques to address them. Our best method is based on least-to-most prompting: it decomposes the problem using prompting-based syntactic parsing, then uses this decomposition to select appropriate exemplars and to sequentially generate the semantic parse. This method allows us to set a new state of the art for CFQ while requiring only 1% of the training data used by traditional approaches. 
Due to the general nature of our approach, we expect similar efforts will lead to new results in other tasks and domains, especially for knowledge-intensive applications.","large language models, prompting, compositional generalization, natural language processing" TiAda: A Time-scale Adaptive Algorithm For Nonconvex Minimax Optimization,https://openreview.net/forum?id=zClyiZ5V6sL,https://openreview.net/pdf?id=zClyiZ5V6sL,,"Adaptive gradient methods have shown their ability to adjust the stepsizes on the fly in a parameter-agnostic manner, and empirically achieve faster convergence for solving minimization problems. When it comes to nonconvex minimax optimization, however, current convergence analyses of gradient descent ascent (GDA) combined with adaptive stepsizes require careful tuning of hyper-parameters and the knowledge of problem-dependent parameters. Such a discrepancy arises from the primal-dual nature of minimax problems and the necessity of delicate time-scale separation between the primal and dual updates in attaining convergence. In this work, we propose a single-loop adaptive GDA algorithm called TiAda for nonconvex minimax optimization that automatically adapts to the time-scale separation. Our algorithm is fully parameter-agnostic and can achieve near-optimal complexities simultaneously in deterministic and stochastic settings of nonconvex-strongly-concave minimax problems. The effectiveness of the proposed method is further justified numerically for a number of machine learning applications.","optimization, minimax optimization, adaptive algorithm" Generalization Properties of Retrieval-based Models,https://openreview.net/forum?id=0nroZT5gHsS,https://openreview.net/pdf?id=0nroZT5gHsS,We present a novel theoretical analysis to study the generalization bounds for retrieval-based classification models.,"Many modern high-performing machine learning models such as GPT-3 primarily rely on scaling up models, e.g., transformer networks. Simultaneously, a parallel line of work aims to improve the model performance by augmenting an input instance with other (labeled) instances during inference. Examples of such augmentations include task-specific prompts and similar examples retrieved from the training data by a nonparametric component. Remarkably, retrieval-based methods have enjoyed success on a wide range of problems, ranging from standard natural language processing and vision tasks to protein folding, as demonstrated by many recent efforts, including WebGPT and AlphaFold. Despite growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored. In this paper, we present a formal treatment of retrieval-based models to characterize their generalization ability. In particular, we focus on two classes of retrieval-based classification approaches: First, we analyze a local learning framework that employs an explicit local empirical risk minimization based on retrieved examples for each input instance. Interestingly, we show that breaking down the underlying learning task into local sub-tasks enables the model to employ a low complexity parametric component to ensure good overall accuracy. 
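To make the local-learning framework just described concrete, here is a minimal sketch of retrieval-based classification with explicit local empirical risk minimization: for each query, retrieve its nearest labeled examples and fit a small parametric model on just that neighborhood. The dataset, the choice of kNN retrieval, and the logistic-regression local model are all illustrative stand-ins, not the paper's exact construction.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Toy corpus standing in for the retrieval index (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
index = NearestNeighbors(n_neighbors=50).fit(X)

def local_erm_predict(x_query: np.ndarray) -> int:
    """Retrieve neighbors, then solve a *local* ERM on them."""
    _, idx = index.kneighbors(x_query.reshape(1, -1))
    local_model = LogisticRegression().fit(X[idx[0]], y[idx[0]])  # low-complexity component
    return int(local_model.predict(x_query.reshape(1, -1))[0])

print(local_erm_predict(X[0]))
```

The point of the sketch matches the claim above: each local sub-task is simple enough that a low-complexity parametric component suffices, while retrieval supplies the task-relevant data.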
The second class of retrieval-based approaches we explore learns a global model using kernel methods to directly map an input instance and retrieved examples to a prediction, without explicitly solving a local learning task.","Generalization bounds, retrieval-based models, local empirical risk minimization, semiparametric models, nonparametric models, kernel methods" Semi-Variance Reduction for Fair Federated Learning,https://openreview.net/forum?id=d5LLy8_6_YV,https://openreview.net/pdf?id=d5LLy8_6_YV,We propose two new algorithms for fair federated learning based on variance and semi-variance regularization,"Ensuring fairness in Federated Learning (FL) systems, i.e. ensuring a satisfactory performance for all of the diverse clients in the systems, is an important and challenging problem. There are multiple fair FL algorithms in the literature, which have been relatively successful in providing fairness. However, these algorithms mostly emphasize the loss functions of worst-off clients to improve their performance, which often results in the suppression of well-performing ones. As a consequence, the system's overall average performance is usually sacrificed for achieving fairness. Motivated by this and inspired by two well-known risk modeling methods in finance, Mean-Variance and Mean-Semi-Variance, we propose and study two new fair FL algorithms, Variance Reduction (VRed) and Semi-Variance Reduction (Semi-VRed). VRed encourages equality between clients' loss functions by penalizing their variance. In contrast, Semi-VRed penalizes the discrepancy of only the worst-off clients' loss functions from the average loss. Through extensive experiments on multiple vision and language datasets, we show that Semi-VRed achieves SoTA performance in scenarios with highly heterogeneous data distributions by improving both fairness and the system's overall average performance at the same time.","Federated Learning, Fairness" Multi-Modality Alone is Not Enough: Generating Scene Graphs using Cross-Relation-Modality Tokens,https://openreview.net/forum?id=7AwPeT4XbAh,https://openreview.net/pdf?id=7AwPeT4XbAh,Introducing a novel cross relational multi-modal token generation strategy for scene graphs,"Recent years have seen a growing interest in Scene Graph Generation (SGG), a comprehensive visual scene understanding task that aims to predict the relationships between objects detected in a scene. One of its key challenges is the strong bias of the visual world around us toward a few frequently occurring relationships, leaving a long tail of under-represented classes. Although infusing additional modalities is one prominent way to improve SGG performance on under-represented classes, we argue that using additional modalities alone is not enough. We propose to inject entity relation information (Cross-Relation) and modality dependencies (Cross-Modality) into each embedding token of a transformer, which we term primal fusion. The resulting Cross-RElAtion-Modality (CREAM) token acts as a strong inductive bias for the SGG framework. Our experimental results on the Visual Genome dataset demonstrate that our CREAM model outperforms state-of-the-art SGG models by around 20% while being simpler and requiring substantially less computation. Additionally, to analyse the generalisability of the CREAM model, we also evaluate it on the Open Images dataset. 
Finally, we examine the impact of depth-map quality on SGG performance and empirically show the superiority of our model over the prior state of the art by better capturing the depth data, boosting performance by a margin of around 25%.","scene graphs, transformers, fusion strategies, multi-modal" FaiREE: fair classification with finite-sample and distribution-free guarantee,https://openreview.net/forum?id=shzu8d6_YAR,https://openreview.net/pdf?id=shzu8d6_YAR,We propose a fair classification algorithm which can satisfy the group fairness constraints with finite-sample and distribution-free guarantees.,"Algorithmic fairness plays an increasingly critical role in machine learning research. Several group fairness notions and algorithms have been proposed. However, the fairness guarantee of existing fair classification methods mainly depends on specific data distributional assumptions, often requiring large sample sizes, and fairness could be violated when there is a modest number of samples, which is often the case in practice. In this paper, we propose FaiREE, a fair classification algorithm which can satisfy group fairness constraints with finite-sample and distribution-free theoretical guarantees. FaiREE can be adapted to satisfy various group fairness notions (e.g., Equality of Opportunity, Equalized Odds, Demographic Parity, etc.) and achieve optimal accuracy. These theoretical guarantees are further supported by experiments on both synthetic and real data. FaiREE is shown to have favorable performance over state-of-the-art algorithms.","algorithmic fairness, distribution-free, finite-sample, classification" Bidirectional global to local attention for deep metric learning.,https://openreview.net/forum?id=Xt87fcPFArU,https://openreview.net/pdf?id=Xt87fcPFArU,,"Deep metric learning (DML) provides rich measures of content-based visual similarity, which have become an essential component for many downstream tasks in computer vision and beyond. This paper questions a central paradigm of DML, the process of embedding individual images before comparing their embedding vectors. The embedding drastically reduces image information, removing all spatial information and pooling local image characteristics into a holistic representation. But how can we determine for an individual image the characteristics that would render it similar to a particular other image without having seen the other one? Rather than aiming for the least common denominator and requiring a common embedding space for all training images, our approach identifies for each pair of input images the locations and features that should be considered to compare them. We follow a cross-attention approach to determine these meaningful local features in one image by measuring their correspondences to the other image. Overall image similarity is then a non-linear aggregation of these meaningful local comparisons. 
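A rough sketch of the cross-attention comparison just described: local features of one image attend over the other image's locations, and the per-location correspondences are aggregated into one pair similarity. All shapes, names, and the mean-based aggregation are illustrative simplifications, not the authors' exact architecture.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def pair_similarity(fa: np.ndarray, fb: np.ndarray) -> float:
    """fa, fb: (num_locations, dim) local features of two images."""
    attn = softmax(fa @ fb.T / np.sqrt(fa.shape[1]))      # A locations attend over B
    attended = attn @ fb                                  # B content relevant to each A location
    local = (fa * attended).sum(1) / (
        np.linalg.norm(fa, axis=1) * np.linalg.norm(attended, axis=1) + 1e-8)
    return float(local.mean())                            # aggregation stub (paper: non-linear)

rng = np.random.default_rng(0)
print(pair_similarity(rng.normal(size=(49, 64)), rng.normal(size=(49, 64))))
```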
The experimental evaluation on standard DML benchmarks shows that this approach significantly improves over the state of the art.","Deep Metric Learning, Visual Similarity Learning, Attention" Deep Evidential Reinforcement Learning for Dynamic Recommendations,https://openreview.net/forum?id=eoUsOflG7QD,https://openreview.net/pdf?id=eoUsOflG7QD,we propose a novel deep evidential reinforcement learning (DERL) framework that learns a more effective recommendation policy by integrating both the expected reward and evidence-based uncertainty.,"Reinforcement learning (RL) has been applied to build recommender systems (RS) to capture users' evolving preferences and continuously improve the quality of recommendations. In this paper, we propose a novel deep evidential reinforcement learning (DERL) framework that learns a more effective recommendation policy by integrating both the expected reward and evidence-based uncertainty. In particular, DERL conducts evidence-aware exploration to locate items that will most likely interest a user in the future. Two central components of DERL include a customized recurrent neural network (RNN) and an evidential-actor-critic (EAC) module. The former module is responsible for generating the current state of the environment by aggregating historical information and a sliding window that contains the current user interactions as well as newly recommended items that may encode future interest. The latter module performs evidence-based exploration by maximizing a uniquely designed evidential Q-value to derive a policy that prefers items with good predicted ratings while remaining largely unknown to the system (due to lack of evidence). These two components are jointly trained by supervised learning and reinforcement learning. Experiments on multiple real-world dynamic datasets demonstrate the state-of-the-art performance of DERL and its capability to capture long-term user interests.","recommender system, exploration, actor-critic" Exponential Generalization Bounds with Near-Optimal Rates for $L_q$-Stable Algorithms,https://openreview.net/forum?id=1_jtWjhSSkr,https://openreview.net/pdf?id=1_jtWjhSSkr,We presented a set of sharper and near-optimal exponential generalization bounds for $L_q$-stable learning algorithms,"The \emph{stability} of learning algorithms to changes in the training sample has been actively studied as a powerful proxy for reasoning about generalization. Recently, exponential tail generalization and risk bounds with near-optimal rates have been obtained under the stringent and distribution-free notion of uniform stability~\citep{bousquet2020sharper,klochkov2021stability}. Meanwhile, under the notion of $L_q$-stability, which is weaker and distribution-dependent, exponential generalization bounds are also available, yet so far only with sub-optimal rates. Therefore, a natural question we would like to address in this paper is whether it is possible to derive near-optimal exponential generalization bounds for $L_q$-stable learning algorithms. As the core contribution of the present work, we give an affirmative answer to this question by developing strict analogues of the near-optimal generalization and risk bounds of uniformly stable algorithms for $L_q$-stable algorithms. 
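For readers unfamiliar with the two stability notions contrasted above, the standard definitions from the stability literature are roughly as follows (notation is ours: $A(S)$ is the model learned on sample $S$, $S^{(i)}$ replaces the $i$-th point, and $\ell$ is the loss):

```latex
% Uniform stability: a worst-case, distribution-free requirement.
\sup_{S,\, S^{(i)},\, z}\; \bigl| \ell(A(S), z) - \ell(A(S^{(i)}), z) \bigr| \;\le\; \gamma.

% L_q-stability relaxes the supremum to a q-th moment under the data
% distribution, which is weaker and distribution-dependent:
\Bigl( \mathbb{E}_{S,\, z}\, \bigl| \ell(A(S), z) - \ell(A(S^{(i)}), z) \bigr|^{q} \Bigr)^{1/q} \;\le\; \gamma_q.
```

The paper's question is then whether the near-optimal exponential rates known under the first (stronger) condition carry over to the second.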
We demonstrate the power of our improved bounds by applying them to derive strong sparse excess risk bounds, under mild conditions, for computationally tractable sparsity estimation algorithms such as Iterative Hard Thresholding (IHT).","$L_q$-stability, Uniform stability, Moments inequality, Exponential generalization bound, Excess risk, Sparsity" Disentangling Learning Representations with Density Estimation,https://openreview.net/forum?id=EMvG1Jdhw_8,https://openreview.net/pdf?id=EMvG1Jdhw_8,"We present GCAE, a scalable disentanglement method that uses the dual total correlation criterion","Disentangled learning representations have promising utility in many applications, but they currently suffer from serious reliability issues. We present Gaussian Channel Autoencoder (GCAE), a method which achieves reliable disentanglement via scalable non-parametric density estimation of the latent space. GCAE avoids the curse of dimensionality of density estimation by disentangling subsets of its latent space with the Dual Total Correlation (DTC) metric, thereby representing its high-dimensional latent joint distribution as a collection of many low-dimensional conditional distributions. In our experiments, GCAE achieves highly competitive and reliable disentanglement scores compared with state-of-the-art baselines.","autoencoder, representation learning, disentanglement, density estimation" Coarse-to-fine Knowledge Graph Domain Adaptation based on Distantly-supervised Iterative Training,https://openreview.net/forum?id=-wDaB590pkt,https://openreview.net/pdf?id=-wDaB590pkt,,"Modern supervised learning neural network models require a large amount of manually labeled data, which makes the construction of domain-specific knowledge graphs time-consuming and labor-intensive. In parallel, although there has been much research on named entity recognition and relation extraction based on distantly supervised learning, constructing a domain-specific knowledge graph from large collections of textual data without manual annotations is still an urgent problem to be solved. In response, we propose an integrated framework for adapting and re-learning knowledge graphs from one coarse domain (biomedical) to a finer-grained domain (oncology). In this framework, we apply distant supervision to cross-domain knowledge graph adaptation. Consequently, no manual data annotation is required to train the model. We introduce a novel iterative training strategy to facilitate the discovery of domain-specific named entities and triples. Experimental results indicate that the proposed framework can perform domain adaptation and knowledge graph construction efficiently.","Knowledge Graph Domain Adaptation, Knowledge Graph Construction, Named Entity Recognition, Relationship Extraction" Teacher Guided Training: An Efficient Framework for Knowledge Transfer,https://openreview.net/forum?id=GVSf7Z7DbYL,https://openreview.net/pdf?id=GVSf7Z7DbYL,We propose and theoretically analyze a novel way to improve the training efficiency of compact student models that better leverages the knowledge of pretrained generative (teacher) models compared to standard distillation methods.,"The remarkable performance gains realized by large pretrained models, e.g., GPT-3, hinge on the massive amounts of data they are exposed to during training. Analogously, distilling such large models to compact models for efficient deployment also necessitates a large amount of (labeled or unlabeled) training data. 
In this paper, we propose the teacher-guided training (TGT) framework for training a high-quality compact model that leverages the knowledge acquired by pretrained generative models, while obviating the need to go through a large volume of data. TGT exploits the fact that the teacher has acquired a good representation of the underlying data domain, which typically corresponds to a much lower dimensional manifold than the input space. Furthermore, we can use the teacher to explore the input space more efficiently through sampling or gradient-based methods, thus making TGT especially attractive for limited-data or long-tail settings. We formally capture this benefit of the proposed data-domain exploration in our generalization bounds. We find that TGT can improve accuracy on several image classification benchmarks as well as a range of text classification and retrieval tasks.","Distillation, Semisupervised learning, Efficient machine learning, Generalization bounds, knowledge distillation" Neural Agents Struggle to Take Turns in Bidirectional Emergent Communication,https://openreview.net/forum?id=GULFHQfgw0g,https://openreview.net/pdf?id=GULFHQfgw0g,Neural agents struggle to develop a turn-taking protocol when playing cooperative game for which they have to communicate.,"The spontaneous exchange of turns is a central aspect of human communication. Although turn-taking conventions come to us naturally, artificial dialogue agents struggle to coordinate, and must rely on hard-coded rules to engage in interactive conversations with human interlocutors. In this paper, we investigate the conditions under which artificial agents may naturally develop turn-taking conventions in a simple language game. We describe a cooperative task where success is contingent on the exchange of information along a shared communication channel in which talking over each other hinders communication. Despite these environmental constraints, neural-network-based agents trained to solve this task with reinforcement learning do not systematically adopt turn-taking conventions. However, we find that agents that do agree on turn-taking protocols end up performing better. Moreover, agents that are forced to perform turn-taking can learn to solve the task more quickly. This suggests that turn-taking may help to generate conversations that are easier for speakers to interpret.","language emergence, turn-taking, conversation, communication, neural agents, cooperative game, reinforcement learning" Class Interference of Deep Networks,https://openreview.net/forum?id=tkvyCt1PzpvP,https://openreview.net/pdf?id=tkvyCt1PzpvP,We show that there is a phenomenon of class interference with all deep neural networks.,"Recognizing and telling apart similar objects is hard even for human beings. In this paper, we show that there is a phenomenon of class interference in all deep neural networks. Class interference represents the learning difficulty in data, and it constitutes the largest percentage of generalization errors by deep networks. To understand class interference, we propose cross-class tests, class ego directions and interference models. We show how to use these definitions to study minima flatness and class interference of a trained model. We also show how to detect class interference during training through the label dancing pattern and class dancing notes. 
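One simple way to picture the cross-class tests mentioned above is a class-by-class interference matrix built from a model's predictions; this is our own simplification for illustration, not the authors' exact definition, and all names here are hypothetical.

```python
import numpy as np

def interference_matrix(probs: np.ndarray, labels: np.ndarray, n_cls: int) -> np.ndarray:
    """M[i, j] = average probability mass that class-i samples give to class j.

    Large off-diagonal entries flag pairs of classes that interfere,
    e.g. visually similar objects that the network confuses.
    """
    M = np.zeros((n_cls, n_cls))
    for c in range(n_cls):
        M[c] = probs[labels == c].mean(axis=0)
    return M

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=200)        # stand-in softmax outputs
labels = rng.integers(0, 4, size=200)
print(interference_matrix(probs, labels, 4).round(2))
```

Tracking how such a matrix evolves over training epochs would give a crude picture of the "dancing" behavior the abstract refers to.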
","Minima sharpness, generalization, loss contour, visualization" Observational Robustness and Invariances in Reinforcement Learning via Lexicographic Objectives,https://openreview.net/forum?id=b3k_8yKKdag,https://openreview.net/pdf?id=b3k_8yKKdag,We use lexicographic optimization to induce robustness in RL policies in a safe way.,"Policy robustness in Reinforcement Learning (RL) may not be desirable at any price; the alterations caused by robustness requirements from otherwise optimal policies should be explainable and quantifiable. Policy gradient algorithms that have strong convergence guarantees are usually modified to obtain robust policies in ways that do not preserve algorithm guarantees, which defeats the purpose of formal robustness requirements. In this work we study a notion of robustness in partially observable MDPs where state observations are perturbed by a noise-induced stochastic kernel. We characterize the set of policies that are maximally robust by analysing how the policies are altered by this kernel. We then establish a connection between such robust policies and certain properties of the noise kernel, as well as with structural properties of the underlying MDPs, constructing sufficient conditions for policy robustness. We use these notions to propose a robustness-inducing scheme, applicable to any policy gradient algorithm, to formally trade off the reward achieved by a policy with its robustness level through lexicographic optimization, which preserves convergence properties of the original algorithm. We test the the proposed approach through numerical experiments on safety-critical RL environments, and show how the proposed method helps achieve high robustness when state errors are introduced in the policy roll-out.","Robust Reinforcement Learning, Safe Reinforcement Learning, Lexicographic Optimization" SeedGNN: Graph Neural Network for Supervised Seeded Graph Matching,https://openreview.net/forum?id=iYvbPx8GTta,https://openreview.net/pdf?id=iYvbPx8GTta,We propose a new supervised GNN method for seeded graph matching that can learn from a training set how to match unseen graphs with only a few seeds.,"There have been significant interests in designing Graph Neural Networks (GNNs) for seeded graph matching, which aims to match two (unlabeled) graphs using only topological information and a small set of seeds. However, most previous GNNs for seeded graph matching employ a semi-supervised approach, which requires a large number of seeds and can not learn knowledge transferable to unseen graphs. In contrast, this paper proposes a new supervised approach that can learn from a training set how to match unseen graphs with only a few seeds. At the core of our SeedGNN architecture are two novel modules: 1) a convolution module that can easily learn the capability of counting and using witnesses of different hops; 2) a percolation module that can use easily-matched pairs as new seeds to percolate and match other nodes. We evaluate SeedGNN on both synthetic and real graphs, and demonstrate significant performance improvement over both non-learning and learning algorithms in the existing literature. Further, our experiments confirm that the knowledge learned by SeedGNN from training graphs can be generalized to test graphs with different sizes and categories. 
","seeded graph matching, Graph Neural Network (GNN), percolation, multi-hop witnesses" Siamese DETR,https://openreview.net/forum?id=f_yTeb1v-GW,https://openreview.net/pdf?id=f_yTeb1v-GW,,"Recent self-supervised methods are mainly designed for representation learning with the base model, e.g., ResNets or ViTs. They cannot be easily transferred to DETR, especially the task-specific module Transformer. In this work, we present Siamese DETR, a Siamese self-supervised pretraining approach for the Transformer architecture in DETR. We consider learning view-invariant and detection-oriented representations simultaneously through two complementary tasks, i.e., localization and discrimination, in a novel multi-view learning framework. Two self-supervised pretext tasks are designed: (a) Multi-View Region Detection aims at learning to localize regions-of-interest between augmented views of the input, and (b) Multi-View Semantic Discrimination attempts to improve object-level discrimination for each region. The proposed Siamese DETR achieves state-of-the-art transfer performance on COCO and PASCAL VOC detection using different DETR variants in all setups. Code will be made available.","object detection, unsupervised learning, Transformer" Provable Sharpness-Aware Minimization with Adaptive Learning Rate ,https://openreview.net/forum?id=PYSktOGKBkY,https://openreview.net/pdf?id=PYSktOGKBkY,We present the first convergence guarantee of the adaptive SAM method with a linear speedup property under the non-convex setting.,"Sharpness aware minimization (SAM) optimizer has been extensively explored as it can converge fast and train deep neural networks efficiently via introducing extra perturbation steps to flatten the landscape of deep learning models. A combination of SAM with adaptive learning rate (AdaSAM) has also been explored to train large-scale deep neural networks without theoretical guarantee due to the dual difficulties in analyzing the perturbation step and the coupled adaptive learning rate. In this paper, we try to analyze the convergence rate of AdaSAM in the stochastic non-convex setting. We theoretically show that AdaSAM admit a $\mathcal{O}(1/\sqrt{bT})$ convergence rate and show linear speedup property with respect to mini-batch size b. To best of our knowledge, we are the first to provide the non-trivial convergence rate of SAM with an adaptive learning rate. To decouple the two stochastic gradient steps with the adaptive learning rate, we first introduce the delayed second-order momentum during the convergence to decompose them to make them independent while taking an expectation. Then we bound them by showing the adaptive learning rate has a limited range, which makes our analysis feasible. At last, we conduct experiments on several NLP tasks and they show that AdaSAM could achieve superior performance compared with SGD, AMSGrad, and SAM optimizer.","Adaptive learning rate, Sharpness aware minimization, mini-batch linear speedup" Prompting GPT-3 To Be Reliable,https://openreview.net/forum?id=98p5x51L5af,https://openreview.net/pdf?id=98p5x51L5af,"We establish simple and effective prompting methods to make GPT-3 reliable in terms of: robustness, fairness, calibration, factuality. ","Large language models (LLMs) show impressive abilities via few-shot prompting. Commercialized APIs such as OpenAI GPT-3 further increase their use in real-world language applications. 
However, existing research focuses on models’ accuracy on standard benchmarks and largely ignores their reliability, which is crucial for avoiding catastrophic real-world harms. While reliability is a broad and vaguely defined term, this work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality. We establish simple and effective prompts to demonstrate GPT-3’s reliability in these four aspects: 1) generalizing out-of-domain, 2) balancing demographic distributions to reduce social biases, 3) calibrating language model probabilities, and 4) updating the LLM’s knowledge. We find that by employing appropriate prompts, GPT-3 outperforms smaller-scale supervised models by large margins on all these facets. We will release all processed datasets, evaluation scripts, and model predictions. Our findings not only shed new light on the reliability of prompting LLMs, but more importantly, our prompting strategies can help practitioners more reliably use large language models like GPT-3.","prompting, GPT-3, large language models, reliability, robustness, biases, calibration, knowledge updating" Contrastive Graph Few-Shot Learning,https://openreview.net/forum?id=7Di4aNrBAhv,https://openreview.net/pdf?id=7Di4aNrBAhv,"We propose CGFL, a general and effective framework to mitigate the distribution shift impact for learning more generalizable representations on graph few-shot-learning tasks.","Prevailing supervised deep graph learning models often suffer from the label sparsity issue. Although many graph few-shot learning (GFL) methods have been developed to avoid performance degradation in the face of limited annotated data, they excessively rely on labeled data, where the distribution shift in the test phase might result in impaired generalization ability. Additionally, they lack generality, as their designs are coupled with task- or data-specific characteristics. To this end, we propose a general and effective Contrastive Graph Few-shot Learning framework (CGFL). CGFL leverages a self-distilled contrastive learning procedure to boost GFL. Specifically, our model first pre-trains a graph encoder with contrastive learning using unlabeled data. Later, the trained encoder is frozen as a teacher model to distill a student model with a contrastive loss. The distilled model is finally fed to GFL. CGFL learns data representation in a self-supervised manner, thus mitigating the distribution shift impact for better generalization and making the model task- and data-independent for general graph mining purposes. Furthermore, we introduce an information-based method to quantitatively measure the capability of CGFL. Comprehensive experiments demonstrate that CGFL outperforms state-of-the-art baselines on several graph mining tasks across various datasets in the few-shot scenario. We also provide a quantitative measurement of CGFL’s success.","Graph representation learning, Few-shot learning, Contrastive learning" Domain Generalization in Regression,https://openreview.net/forum?id=8TjyUm_XarL,https://openreview.net/pdf?id=8TjyUm_XarL,We propose a new domain generalization setting in the regression scenario and a weighted meta-learning solution.,"In the context of classification, \textit{domain generalization} (DG) aims to predict the labels of unseen target-domain data using only labeled source-domain data, where the source and target domains usually share \textit{the same label set}. 
However, in the context of regression, DG is not well studied in the literature, and the main reason is that the ranges of the response variable in the two domains are often \textit{different}, and even disjoint under some extreme conditions. In this paper, we study a new problem setting: \textit{domain generalization in regression} (DGR), and propose a weighted meta-learning strategy to obtain an optimal meta-initialization across disjoint domains to help address the DGR problem. The motivation is that when the meta-model performs well on one domain, we hope such a model also performs well in other related domains. To measure domain relatedness in the context of regression, we use the feature discrepancy in the meta-space to calculate the discrepancy between any two domains and treat this discrepancy as the weight of a meta-training task in the meta-learning framework. Extensive regression experiments on standard domain generalization benchmarks demonstrate the superiority of the proposed method.","domain generalization, regression, meta-learning" Adversarial Training of Self-supervised Monocular Depth Estimation against Physical-World Attacks,https://openreview.net/forum?id=LfdEuhjR5GV,https://openreview.net/pdf?id=LfdEuhjR5GV,Use self-supervised adversarial training to harden monocular depth estimation models against physical-world adversarial attacks.,"Monocular Depth Estimation (MDE) is a critical component in applications such as autonomous driving. There are various attacks against MDE networks. These attacks, especially the physical ones, pose a great threat to the security of such systems. The traditional adversarial training method requires ground-truth labels and hence cannot be directly applied to self-supervised MDE that does not have depth ground truth. Some self-supervised model hardening techniques (e.g., contrastive learning) ignore the domain knowledge of MDE and can hardly achieve optimal performance. In this work, we propose a novel adversarial training method for self-supervised MDE models based on view synthesis without using the depth ground truth. We improve adversarial robustness against physical-world attacks using $L_0$-norm-bounded perturbation in training. We compare our method with supervised learning-based and contrastive learning-based methods that are tailored for MDE. Results on two representative MDE networks show that we achieve better robustness against various adversarial attacks with nearly no benign performance degradation.","Adversarial Training, Monocular Depth Estimation, Adversarial Attack, Self-supervised Learning." Sparsity-Constrained Optimal Transport,https://openreview.net/forum?id=yHY9NbQJ5BP,https://openreview.net/pdf?id=yHY9NbQJ5BP,We propose formulations for optimal transport with cardinality constraints and apply them to sparse mixture of experts.,"Regularized optimal transport (OT) is now increasingly used as a loss or as a matching layer in neural networks. Entropy-regularized OT can be computed using the Sinkhorn algorithm but it leads to fully-dense transportation plans, meaning that all sources are (fractionally) matched with all targets. To address this issue, several works have investigated quadratic regularization instead. This regularization preserves sparsity and leads to unconstrained and smooth (semi) dual objectives that can be solved with off-the-shelf gradient methods. Unfortunately, quadratic regularization does not give direct control over the cardinality (number of nonzeros) of the transportation plan. 
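To see what explicit cardinality control on a plan means in practice, here is a greedy toy sketch that keeps at most $k$ nonzero entries per column (e.g., tokens per expert) and renormalizes. This only illustrates the constraint itself; the paper instead solves tractable smoothed (semi-)dual problems with first-order methods, and all names below are hypothetical.

```python
import numpy as np

def topk_plan(scores: np.ndarray, k: int) -> np.ndarray:
    """Build a column-wise k-sparse 'transport plan' from affinity scores.

    Each column keeps only its k largest entries (cardinality constraint)
    and is renormalized to unit mass. Greedy illustration, not a solver.
    """
    plan = np.zeros_like(scores)
    for j in range(scores.shape[1]):
        top = np.argpartition(scores[:, j], -k)[-k:]
        plan[top, j] = scores[top, j]
        plan[:, j] /= plan[:, j].sum() + 1e-12
    return plan

rng = np.random.default_rng(0)
P = topk_plan(rng.random((8, 3)), k=2)
print((P > 0).sum(axis=0))   # -> [2 2 2]: at most 2 tokens per expert
```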
We propose in this paper a new approach for OT with explicit cardinality constraints on the transportation plan. Our work is motivated by an application to sparse mixtures of experts, where OT can be used to match input tokens such as image patches with expert models such as neural networks. Cardinality constraints ensure that at most $k$ tokens are matched with an expert, which is crucial for computational performance reasons. Despite the nonconvexity of cardinality constraints, we show that the corresponding (semi) dual problems are tractable and can be solved with first-order gradient methods. Our method can be thought of as a middle ground between unregularized OT (recovered in the limit case $k=1$) and quadratically-regularized OT (recovered when $k$ is large enough). The smoothness of the objectives increases as $k$ increases, giving rise to a trade-off between convergence speed and sparsity of the optimal plan.","optimal transport, sparsity" A Risk-Averse Equilibrium for Multi-Agent Systems,https://openreview.net/forum?id=5Z1rblK1Be5,https://openreview.net/pdf?id=5Z1rblK1Be5,"We introduce a novel risk-averse solution concept that allows the learner to accommodate low probability actions by finding the strategy with minimum variance, given any level of expected utility. ","In multi-agent systems, intelligent agents are tasked with making decisions that lead to optimal outcomes when actions of the other agents are as expected, whilst also being prepared for their unexpected behaviour. In this work, we introduce a novel risk-averse solution concept that allows the learner to accommodate low-probability actions by finding the strategy with minimum variance, given any level of expected utility. We first prove the existence of such a risk-averse equilibrium, and propose a fictitious-play-type learning algorithm for smaller games that enjoys provable convergence guarantees in game classes including zero-sum and potential games. Furthermore, we propose an approximation method for larger games based on iterative population-based training that generates a population of risk-averse agents. Empirically, our equilibrium is shown to reduce utility variance, in the sense that other agents’ low-probability behaviour is better accounted for by our equilibrium than by other solutions. Importantly, we show that our population of agents that approximate a risk-averse equilibrium is particularly effective against unseen opposing populations, especially in the case of guaranteeing a minimum level of performance, which is critical to safety-aware multi-agent systems.","game theory, safe game theory, risk averse game theory, safe equilibrium, population learning, game theory equilibrium" SuperWeight Ensembles: Automated Compositional Parameter Sharing Across Diverse Architectures,https://openreview.net/forum?id=GF4A49QlqjN,https://openreview.net/pdf?id=GF4A49QlqjN,"A novel efficient ensembling technique for ensembling models of different architectures, enabling anytime inference","Neural net ensembles boost task performance, but have excessive storage requirements. Recent work in efficient ensembling has made the memory cost more tractable by sharing learned parameters between ensemble members. 
Existing efficient ensembles have high predictive accuracy, but they are overly restrictive in two ways: 1) They constrain ensemble members to have the same architecture, limiting their usefulness in applications such as anytime inference, and 2) They reduce the parameter count for a small predictive performance penalty, but do not provide an easy way to trade off parameter count for predictive performance without increasing inference time. In this paper, we propose SuperWeight Ensembles, an approach for architecture-agnostic parameter sharing. SuperWeight Ensembles share parameters between layers which have sufficiently similar computation, even if they have different shapes. This allows anytime prediction of heterogeneous ensembles by selecting a subset of members during inference, which is a flexibility not supported by prior work. In addition, SuperWeight Ensembles provide control over the total number of parameters used, allowing us to increase or decrease the number of parameters without changing model architecture. On the anytime prediction task, our method shows a consistent boost over prior work while allowing for more flexibility in architectures and efficient parameter sharing. SuperWeight Ensembles preserve the performance of prior work in the low-parameter regime, and even outperform fully-parameterized ensembles with 17% fewer parameters on CIFAR-100 and 50% fewer parameters on ImageNet.","efficent ensembles, anytime inference" Human alignment of neural network representations,https://openreview.net/forum?id=ReDQ1OUQR0X,https://openreview.net/pdf?id=ReDQ1OUQR0X,"We evaluate the alignment of neural network representations with human judgments about object similarities in an odd-one-out triplet task, finding that dataset and objective function, but not model size or architecture, have a significant impact.","Today’s computer vision models achieve human or near-human level performance across a wide variety of vision tasks. However, their architectures, data, and learning algorithms differ in numerous ways from those that give rise to human vision. In this paper, we investigate the factors that affect alignment between the representations learned by neural networks and human concept representations. Human representations are inferred from behavioral responses in an odd-one-out triplet task, where humans were presented with three images and had to select the odd-one-out. We find that model scale and architecture have essentially no effect on alignment with human behavioral responses, whereas the training dataset and objective function have a much larger impact. Using a sparse Bayesian model of human conceptual representations, we partition triplets by the concept that distinguishes the two similar images from the odd-one-out, finding that some concepts such as food and animals are well-represented in neural network representations whereas others such as royal or sports-related objects are not. 
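For concreteness, the model-side version of the odd-one-out triplet task described above is straightforward to simulate from embeddings: pick the most similar pair and declare the remaining image the odd one out. A minimal sketch (cosine similarity is our assumption; the paper's evaluation details may differ):

```python
import numpy as np

def odd_one_out(emb: np.ndarray) -> int:
    """emb: (3, dim) embeddings of the three triplet images."""
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = e @ e.T
    pairs = [(0, 1), (0, 2), (1, 2)]
    i, j = max(pairs, key=lambda p: sims[p])   # most similar pair
    return ({0, 1, 2} - {i, j}).pop()          # the leftover image is "odd"

rng = np.random.default_rng(0)
print(odd_one_out(rng.normal(size=(3, 8))))
```

Alignment is then the rate at which such model choices agree with the human responses.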
Overall, although models trained on larger, more diverse datasets achieve better alignment with humans than models trained on ImageNet alone, our results indicate that scaling alone is unlikely to be sufficient to train neural networks with conceptual representations that match those used by humans.","Human Alignment, Robustness, Neural Network Representations, Human Concepts, Object Similarity, Computer Vision" Imitation Learning for Mean Field Games with Correlated Equilibria,https://openreview.net/forum?id=VUdMeSbExWg,https://openreview.net/pdf?id=VUdMeSbExWg,,"Imitation learning (IL) aims at achieving optimal actions by learning from demonstrated behaviors without knowing the reward function or transition kernels. Conducting IL with a large population of agents is challenging as agents' interactions grow exponentially with respect to the population size. Mean field theory provides an efficient tool to study multi-agent problems by aggregating information on the population level. While the approximation is tractable, it is non-trivial to recover mean field Nash equilibria (MFNE) from demonstrations. Importantly, there are many real-world problems that cannot be explained by the classic MFNE concept; this includes the traffic network equilibrium induced by public routing recommendations and the pricing equilibrium of goods generated on e-commerce platforms. In both examples, correlation devices are introduced to the equilibrium due to intervention from the platform. To accommodate this, we propose a novel solution concept named adaptive mean field correlated equilibrium (AMFCE) that generalizes MFNE. On the theory side, we first prove the existence of AMFCE, and establish a novel framework based on IL and AMFCE with entropy regularization (MaxEnt-AMFCE) to recover the AMFCE policy from real-world demonstrations. Signatures from rough path theory are then applied to characterize the mean-field evolution. A significant benefit of MaxEnt-AMFCE is that it can recover both the equilibrium policy and the correlation device from data. We test our MaxEnt-AMFCE against state-of-the-art IL algorithms for MFGs on several tasks (including a real-world traffic flow prediction problem); the results justify the effectiveness of our proposed method and show its potential for predicting and explaining large-population behavior under correlated signals. ","Imitation Learning, Mean Field Games, Correlated Equilibria" EFFECTIVE FREQUENCY-BASED BACKDOOR ATTACKS WITH LOW POISONING RATIOS,https://openreview.net/forum?id=2sAVJZGwQRx,https://openreview.net/pdf?id=2sAVJZGwQRx,,"Backdoor attack has been considered a serious threat to deep learning. Although several seminal backdoor attack methods have been proposed, they often require at least a certain poisoning ratio (\eg, 1\% or more) to achieve a high attack success rate (ASR). However, an attack with a large poisoning ratio may be difficult to hide from human inspection or backdoor defenses, \ie, it has low stealthiness. To tackle the dilemma between high ASR and low stealthiness, we aim to enhance ASR under a low poisoning ratio, \ie, pursuing high ASR and high stealthiness simultaneously. To achieve this goal, we propose a novel frequency-based backdoor attack, where the trigger is generated based on important frequencies that contribute positively to the model prediction with respect to the target class. 
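As a rough picture of what "planting a trigger in selected frequencies" means, the sketch below boosts a few chosen FFT bins of an image and inverts back. The frequency selection is left to the caller; in the actual attack the bins are chosen for their positive contribution to the target class, and everything here (names, strength, bins) is an illustrative assumption.

```python
import numpy as np

def add_freq_trigger(img: np.ndarray, freqs, strength: float = 0.03) -> np.ndarray:
    """img: (H, W) grayscale in [0, 1]; freqs: list of (u, v) FFT bins."""
    H, W = img.shape
    F = np.fft.fft2(img)
    for u, v in freqs:
        F[u, v] += strength * img.size                 # boost the chosen frequency
        F[-u % H, -v % W] += strength * img.size       # mirror bin keeps the output real
    return np.clip(np.real(np.fft.ifft2(F)), 0.0, 1.0)

poisoned = add_freq_trigger(np.zeros((32, 32)), freqs=[(1, 2), (3, 1)])
print(poisoned.max())   # a faint, low-frequency pattern now covers the image
```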
Extensive experiments on four benchmark datasets (CIFAR-10, CIFAR-100, GTSRB, Tiny ImageNet) verify the effectiveness and stealthiness of the proposed method under extremely low poisoning ratios. Specifically, with only a 0.01\% poisoning ratio, our attack achieves ASRs of 80.51\%, 51.3\%, 76.3\%, and 87.2\% on the above four datasets, respectively, while the ASR values of most state-of-the-art (SOTA) attack methods are close to 0. Meanwhile, our method can well evade several SOTA backdoor defense methods; \ie, its ASR values are not significantly affected under defense. ", Turning the Curse of Heterogeneity in Federated Learning into a Blessing for Out-of-Distribution Detection,https://openreview.net/forum?id=mMNimwRb7Gr,https://openreview.net/pdf?id=mMNimwRb7Gr,,"Deep neural networks have witnessed huge successes in many challenging prediction tasks, and yet they often suffer from out-of-distribution (OoD) samples, misclassifying them with high confidence. Recent advances show promising OoD detection performance for centralized training; however, OoD detection in federated learning (FL) is largely overlooked, even though many security-sensitive applications such as autonomous driving and voice recognition authorization are commonly trained using FL for data privacy concerns. The main challenge that prevents previous state-of-the-art OoD detection methods from being incorporated into FL is that they require a large amount of real OoD samples. However, in real-world scenarios, such large-scale OoD training data can be costly or even infeasible to obtain, especially for resource-limited local devices. On the other hand, a notorious challenge in FL is data heterogeneity, where each client collects non-identically and independently distributed (non-iid) data. We propose to take advantage of such heterogeneity and turn the curse into a blessing that facilitates OoD detection in FL. The key is that for each client, non-iid data from other clients (unseen external classes) can serve as an alternative to real OoD samples. Specifically, we propose a novel Federated Out-of-Distribution Synthesizer (FOSTER), which learns a class-conditional generator to synthesize virtual external-class OoD samples, and maintains the data confidentiality and communication efficiency required by FL. Experimental results show that our method outperforms the state of the art by 2.49%, 2.88%, 1.42% AUROC, and 0.01%, 0.89%, 1.74% ID accuracy, on CIFAR-10, CIFAR-100, and STL10, respectively.","out-of-distribution detection, federated learning, heterogeneity" Clustering and Ordering Variable-Sized Sets: The Catalog Problem,https://openreview.net/forum?id=xgFfr5IIuXP,https://openreview.net/pdf?id=xgFfr5IIuXP,"A neural method for predicting an adaptive number of diverse, ordered clusters from any set is introduced and tested on synthetic and real-world datasets, demonstrating top performance on a new and harder formulation of the PROCAT challenge.","Prediction of a varying number of ordered clusters from sets of any cardinality is a challenging task for neural networks, combining elements of set representation, clustering and learning to order. This task arises in many diverse areas, ranging from medical triage through multi-channel signal analysis for petroleum exploration to product catalog structure prediction. This paper focuses on the latter, which exemplifies a number of challenges inherent to adaptive ordered clustering, referred to further as the eponymous Catalog Problem. 
These include learning variable cluster constraints, exhibiting relational reasoning, and managing combinatorial complexity. Despite progress in both neural clustering and set-to-sequence methods, no joint, fully differentiable model exists to date. We develop such a modular architecture, referred to further as Neural Ordered Clusters (NOC), enhance it with a specific mechanism for learning cluster-level cardinality constraints, and provide a robust comparison of its performance in relation to alternative models. We test our method on three datasets, including synthetic catalog structures and PROCAT, a dataset of real-world catalogs consisting of over 1.5M products, achieving state-of-the-art results on a new, more challenging formulation of the underlying problem, which has not been addressed before. Additionally, we examine the network's ability to learn higher-order interactions and investigate its capacity to learn both compositional and structural rulesets.","neural clustering, set-to-sequence, supervised clustering, structure prediction, set representation, learning to order" RangeAugment: Efficient Online Augmentation with Range Learning,https://openreview.net/forum?id=ZbwqqxW2f-G,https://openreview.net/pdf?id=ZbwqqxW2f-G,Efficiently learn the range of magnitudes for each augmentation operation in a constant time,"State-of-the-art automatic augmentation methods (e.g., AutoAugment and RandAugment) for visual recognition tasks diversify training data using a large set of augmentation operations. The range of magnitudes of many augmentation operations (e.g., brightness and contrast) is continuous. Therefore, to make search computationally tractable, these methods use fixed and manually-defined magnitude ranges for each operation, which may lead to sub-optimal policies. To answer the open question on the importance of magnitude ranges for each augmentation operation, we introduce RangeAugment, which allows us to efficiently learn the range of magnitudes for individual as well as composite augmentation operations. RangeAugment uses an auxiliary loss based on image similarity as a measure to control the range of magnitudes of augmentation operations. As a result, RangeAugment has a single scalar parameter for search, image similarity, which we simply optimize via linear search. RangeAugment integrates seamlessly with any model and learns model- and task-specific augmentation policies. With extensive experiments on the ImageNet dataset across different networks, we show that RangeAugment achieves performance competitive with state-of-the-art automatic augmentation methods while using 4-5 times fewer augmentation operations. Experimental results on semantic segmentation and contrastive learning further show RangeAugment's effectiveness.",Automatic Augmentation How Predictors Affect Search Strategies in Neural Architecture Search?,https://openreview.net/forum?id=XWWAvqMMal5,https://openreview.net/pdf?id=XWWAvqMMal5,We study theoretically and empirically the impact of predictors on NAS search strategies.,"Predictor-based Neural Architecture Search (NAS) is an important topic since it can efficiently reduce the computational cost of evaluating candidate architectures. Most existing predictor-based NAS algorithms aim to design different predictors to improve prediction performance. Unfortunately, even a promising performance predictor may suffer from accuracy decline due to long-term and continuous usage, thus leading to degraded performance of the search strategy. 
This naturally gives rise to the following questions: how do predictors affect search strategies, and how should the predictor be used appropriately? In this paper, we take a reinforcement learning (RL)-based search strategy to study theoretically and empirically the impact of predictors on search strategies. We first formulate a predictor-RL-based NAS algorithm as model-based RL and analyze it with a guarantee of monotonic improvement at each trial. Then, based on this analysis, we propose a simple predictor-usage procedure, named mixed batch, which combines ground-truth data and prediction data. The proposed procedure can efficiently reduce the impact of predictor errors on search strategies while maintaining performance growth. Our algorithm, Predictor-based Neural Architecture Search with Mixed batch (PNASM), outperforms traditional NAS algorithms and prior state-of-the-art predictor-based NAS algorithms on three NAS-Bench-201 tasks.","Neural Architecture Search, Predictor-based Neural Architecture Search, Reinforcement Learning" Unbiased Stochastic Proximal Solver for Graph Neural Networks with Equilibrium States,https://openreview.net/forum?id=j3cUWIMsFBN,https://openreview.net/pdf?id=j3cUWIMsFBN,,"Graph Neural Networks (GNNs) are widely used deep learning models that can extract meaningful representations from graph datasets and achieve great success in many machine learning tasks. Among them, graph neural networks with iterative updates, like unfolded GNNs and implicit GNNs, can effectively capture long-range dependencies in graphs and demonstrate superior performance on large graphs, since they can mathematically ensure convergence to some nontrivial solution after many aggregations. However, aggregation in such models is costly, as they need to aggregate over the full graph in each update. This weakness limits the scalability of implicit graph models. To tackle this limitation, we propose two unbiased stochastic proximal solvers, called the USP and USP-VR solvers, inspired by the stochastic proximal gradient descent method and its variance-reduction variant. From the viewpoint of stochastic optimization, we theoretically prove that our solvers are unbiased, i.e., they converge to the same solution as the original solvers for unfolded GNNs and implicit GNNs. Furthermore, the computation complexities for unfolded GNNs and implicit GNNs with our proposed solvers are significantly lower than those of their vanilla versions. Experiments on various large graph datasets show that our proposed solvers are more efficient and can achieve state-of-the-art performance.", Energy Transformer,https://openreview.net/forum?id=4nrZXPFN1c4,https://openreview.net/pdf?id=4nrZXPFN1c4,"We propose a network which describes the forward pass in a transformer as gradient descent on an energy function. ","Transformers have become the de facto models of choice in machine learning, typically leading to impressive performance on many applications. At the same time, the architectural development in the transformer world is mostly driven by empirical findings, and the theoretical understanding of their architectural building blocks is rather limited. In contrast, Dense Associative Memory models or Modern Hopfield Networks have a well-established theoretical foundation, but have not yet demonstrated truly impressive practical results. We propose a transformer architecture that replaces the sequence of feedforward transformer blocks with a single large Associative Memory model. 
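The tldr above ("forward pass as gradient descent on an energy function") can be caricatured in a few lines: token states are repeatedly updated by descending a learned energy, so "layers" become descent steps. The quadratic-plus-interaction energy below is a placeholder of our own, not ET's actual attention-derived energy.

```python
import torch

d, n = 16, 8
W = torch.randn(d, d) * 0.05          # learned interaction weights (stand-in)
x = torch.randn(n, d, requires_grad=True)  # token states

def energy(x: torch.Tensor) -> torch.Tensor:
    # Hopfield-flavored toy energy: a confining quadratic term plus
    # a log-sum-exp interaction between tokens.
    interaction = -torch.logsumexp(x @ W @ x.T, dim=-1).sum()
    return 0.5 * (x ** 2).sum() + interaction

alpha = 0.1
for t in range(20):                   # each "layer" is one descent step
    g, = torch.autograd.grad(energy(x), x)
    x = (x - alpha * g).detach().requires_grad_(True)
print(energy(x).item())               # energy decreases over the forward pass
```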
Our novel architecture, called Energy Transformer (or ET for short), has many of the familiar architectural primitives that are often used in the current generation of transformers. However, it is not identical to the existing architectures. The sequence of transformer layers in ET is purposely designed to minimize a specifically engineered energy function, which is responsible for representing the relationships between the tokens. As a consequence of this computational principle, the attention in ET is different from the conventional attention mechanism. In this work, we introduce the theoretical foundations of ET, explore its empirical capabilities using the image completion task, and obtain strong quantitative results on the graph anomaly detection task.","Transformers, Hopfield Networks, Graph Anomaly Detection" Privacy-Preserving Vision Transformer on Permutation-Encrypted Images,https://openreview.net/forum?id=eL1iX7DMnPI,https://openreview.net/pdf?id=eL1iX7DMnPI,We propose a novel privacy-preserving learning paradigm that removes human-recognizable contents while preserving machine-learnable information.,"Massive human-related data is collected to train neural networks for computer vision tasks. Potential incidents, such as data leakages, expose significant privacy risks to applications. In this paper, we propose an efficient privacy-preserving learning paradigm, where images are first encrypted via one of two encryption strategies: (1) random shuffling to a set of equally-sized patches and (2) mixing-up sub-patches of the images. Then, a permutation-equivariant vision transformer is designed to learn on the encrypted images for vision tasks, including image classification and object detection. Extensive experiments on ImageNet and COCO show that the proposed paradigm achieves comparable accuracy with the competitive methods. Moreover, decrypting the encrypted images amounts to solving an NP-hard jigsaw puzzle or an ill-posed inverse problem, which we empirically show to be intractable even for powerful vision transformer-based attackers. We thus show that the proposed paradigm can destroy human-recognizable contents while preserving machine-learnable information. Code will be released publicly.","vision transformer, privacy" Lightweight CNNs Under A Unifying Tensor View,https://openreview.net/forum?id=PNyvODFNTkZ,https://openreview.net/pdf?id=PNyvODFNTkZ,"A unifying tensor view is introduced, which provides an easy-to-understand graphical illustration of various lightweight CNN components. A novel shift layer pruning scheme is proposed in response to the framework.","Despite the decomposition of convolutional kernels for lightweight CNNs being well studied, previous works that relied on tensor network diagrams or higher dimensional abstraction lacked geometric intuition. Our work captures the CNN kernel as a 3D tensor and explores its various decompositions, allowing for a straightforward graphical and analytical perspective on the relations between different tensor approximation schemes and efficient CNN components, including pointwise and depthwise convolutions. Extensive experiments are conducted, showing that a pointwise-depthwise-pointwise (PDP) configuration via a canonical polyadic decomposition (CPD) initialization can be a viable starting point for lightweight CNNs. The compression ratio of VGG-16 can reach over $50\%$ while it outperforms its randomly initialized counterpart by $>10\%$ in terms of accuracy. 
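To make the first encryption strategy from the Privacy-Preserving Vision Transformer abstract above concrete, here is a minimal numpy sketch (illustrative only; function and parameter names are ours) that splits an image into equally-sized patches and permutes them with a secret key:

import numpy as np

def shuffle_patches(img, patch, key):
    # img: (H, W, C) array with H and W divisible by `patch`.
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    # Cut into gh*gw patches of shape (patch, patch, c).
    patches = (img.reshape(gh, patch, gw, patch, c)
                  .swapaxes(1, 2)
                  .reshape(gh * gw, patch, patch, c))
    perm = np.random.default_rng(key).permutation(gh * gw)  # key-dependent permutation
    shuffled = patches[perm]
    # Reassemble the permuted patches into an image.
    return (shuffled.reshape(gh, gw, patch, patch, c)
                    .swapaxes(1, 2)
                    .reshape(h, w, c))

encrypted = shuffle_patches(np.zeros((224, 224, 3)), patch=16, key=42)

Recovering the original image then requires inverting an unknown permutation over (224/16)^2 = 196 patches, the jigsaw-puzzle problem mentioned above.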
FPGA experiments for the PDP model further demonstrate its hardware efficacy: it is $2.4\times$ faster and $1.4\times$ more energy efficient than the standard conv2d. Furthermore, our framework offers a unique slice-wise illustration and is the first to draw a connection to the shift layer. Such insight inspires a first-of-its-kind pruning method for shift layers, achieving nearly $50\%$ compression with $<1\%$ drop in accuracy for ShiftResNet-20.","compression, tensor decomposition, CNNs, FPGA" DiGress: Discrete Denoising diffusion for graph generation,https://openreview.net/forum?id=UaAD-Nu86WX,https://openreview.net/pdf?id=UaAD-Nu86WX,We propose a discrete denoising diffusion model for generating graphs with categorical node and edge attributes. It is state-of-the-art on both abstract and molecular datasets.,"This work introduces DiGress, a discrete denoising diffusion model for generating graphs with categorical node and edge attributes. Our model defines a diffusion process that progressively edits a graph with noise (adding or removing edges, changing the categories), and a graph transformer network that learns to revert this process. With these two ingredients in place, we reduce distribution learning over graphs to a simple sequence of classification tasks. We further improve sample quality by proposing a new Markovian noise model that preserves the marginal distribution of node and edge types during diffusion, and by adding auxiliary graph-theoretic features derived from the noisy graph at each diffusion step. Finally, we propose a guidance procedure for conditioning the generation on graph-level features. Overall, DiGress achieves state-of-the-art performance on both molecular and non-molecular datasets, with up to 3x validity improvement on a dataset of planar graphs. In particular, it is the first model that scales to the large GuacaMol dataset containing 1.3M drug-like molecules without using a molecule-specific representation such as SMILES or fragments.","Graph generation, Denoising Diffusion Model, Molecule Generation, Permutation Equivariance, Discrete Diffusion" Sample Relationships through the Lens of Learning Dynamics with Label Information,https://openreview.net/forum?id=Eo89g5X1m5g,https://openreview.net/pdf?id=Eo89g5X1m5g,"We propose a new kernel function for neural networks which can take the label information into consideration, and show that it helps to improve generalisation performance.","Although much research has been done on proposing new models or loss functions to improve the generalisation of artificial neural networks (ANNs), less attention has been directed to the data, which is also an important factor for training ANNs. In this work, we start from approximating the interaction between two samples, i.e. how learning one sample would modify the model's prediction on the other sample. Through analysing the terms involved in weight updates in supervised learning, we find that the signs of labels influence the interactions between samples. Therefore, we propose the labelled pseudo Neural Tangent Kernel (lpNTK) which takes label information into consideration when measuring the interactions between samples. We first prove that lpNTK asymptotically converges to the well-known empirical NTK in terms of the Frobenius norm under certain assumptions. Second, we illustrate how lpNTK helps to understand learning phenomena identified in previous work, specifically the learning difficulty of samples and forgetting events during learning. 
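The marginal-preserving noise model in the DiGress abstract above can be illustrated with a small, hedged numpy sketch: a transition matrix of the form Q_t = alpha_t * I + (1 - alpha_t) * 1 m^T leaves the marginal distribution m over categories invariant (the paper's exact schedule may differ):

import numpy as np

def noisy_categories(x_onehot, m, alpha_t, rng):
    # x_onehot: (n, K) one-hot node (or edge) types; m: (K,) marginal distribution.
    K = m.shape[0]
    Q = alpha_t * np.eye(K) + (1.0 - alpha_t) * np.ones((K, 1)) @ m[None, :]
    probs = x_onehot @ Q   # each row is a valid transition distribution
    return np.array([rng.choice(K, p=p) for p in probs])

rng = np.random.default_rng(0)
x = np.eye(4)[[0, 2, 1]]   # three nodes over four categories
print(noisy_categories(x, m=np.full(4, 0.25), alpha_t=0.8, rng=rng))

Since each row of Q sums to one and m Q = m, repeatedly applying such transitions keeps node and edge type frequencies stable during diffusion.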
Finally, we show that lpNTK can help to improve the generalisation performance of ANNs in image classification tasks, compared with using the original whole training sets.","Sample Relationship, Iterated Learning, Generalisation, Neural Networks, Neural Tangent Kernel" Geometric Networks Induced by Energy Constrained Diffusion,https://openreview.net/forum?id=j6zUzrapY3L,https://openreview.net/pdf?id=j6zUzrapY3L,"We introduce an energy constrained diffusion model for semi-supervised representation learning, based on which a new class of neural encoders is derived for efficiently and effectively learning inter-instance latent graphs","Real-world data generation often involves complex inter-dependencies among instances, violating the IID-data hypothesis of standard learning paradigms and posing a challenge for uncovering the geometric structures for learning desired instance representations. To this end, we introduce an energy constrained diffusion model which encodes a batch of instances from a dataset into evolutionary states that progressively incorporate other instances' information by their interactions. The diffusion process is constrained by descent criteria w.r.t.~a principled energy function that characterizes the global consistency of instance representations over latent structures. We provide rigorous theory that implies closed-form optimal estimates for the pairwise diffusion strength among arbitrary instance pairs, which gives rise to a new class of neural encoders, dubbed DIFFormer, with two instantiations: a simple version with linear complexity for prohibitively large instance numbers, and an advanced version for learning complex structures. Experiments highlight the wide applicability of our model as a general-purpose encoder backbone with superior performance in various tasks, such as semi-supervised node classification, image/text classification, and spatial-temporal dynamics prediction.","structured representation learning, diffusion model, optimization-induced model, node prediction" "Neural Lagrangian Schr\""{o}dinger Bridge: Diffusion Modeling for Population Dynamics",https://openreview.net/forum?id=d3QNWD_pcFv,https://openreview.net/pdf?id=d3QNWD_pcFv,,"Population dynamics is the study of temporal and spatial variation in the size of populations of organisms and is a major part of population ecology. One of the main difficulties in analyzing population dynamics is that we can only obtain observation data with coarse time intervals from fixed-point observations due to experimental costs or measurement constraints. Recently, modeling population dynamics by using continuous normalizing flows (CNFs) and dynamic optimal transport has been proposed to infer the sample trajectories from a fixed-point observed population. While the sample behavior in CNFs is deterministic, actual samples in biological systems move in an essentially random yet directional manner. Moreover, when a sample moves from point A to point B in dynamical systems, its trajectory typically follows the principle of least action, in which the corresponding action has the smallest possible value. To satisfy these requirements of the sample trajectories, we formulate the Lagrangian Schrödinger bridge (LSB) problem and propose to solve it approximately by modeling the advection-diffusion process with regularized neural SDE. We also develop a model architecture that enables faster computation of the loss function. 
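A hedged sketch of the label-aware kernel idea from the lpNTK abstract above (the paper's exact construction may differ; this simply signs empirical-NTK entries by label agreement):

import torch

def label_aware_ntk(model, xs, ys):
    # Per-sample parameter gradients of the scalar output.
    grads = []
    for x in xs:
        model.zero_grad()
        model(x.unsqueeze(0)).sum().backward()
        grads.append(torch.cat([p.grad.flatten().clone() for p in model.parameters()]))
    G = torch.stack(grads)                             # (n, n_params)
    sign = torch.outer(ys.float(), ys.float()).sign()  # +1 same label, -1 different
    return sign * (G @ G.T)                            # label-signed empirical NTK

model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
K = label_aware_ntk(model, torch.randn(5, 8), torch.tensor([1., -1., 1., 1., -1.]))

Entries that are large and positive indicate pairs whose updates reinforce each other, matching the sample-interaction view taken above.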
Experimental results show that the proposed method can efficiently approximate the population-level dynamics even for high-dimensional data and that using the prior knowledge introduced by the Lagrangian enables us to estimate the sample-level dynamics with stochastic behavior.","Population Dynamics, Trajectory Inference, Neural SDEs, Stochastic Optimal Transport, Schrödinger Bridge" Jump-Start Reinforcement Learning,https://openreview.net/forum?id=FZCFlj2_c7z,https://openreview.net/pdf?id=FZCFlj2_c7z,Efficiently initializing reinforcement learning policies using a prior policy. ,"Reinforcement learning (RL) provides a theoretical framework for continuously improving an agent’s behavior via trial and error. However, efficiently learning policies from scratch can be very difficult, particularly for tasks that present exploration challenges. In such settings, it might be desirable to initialize RL with an existing policy, offline data, or demonstrations. However, naively performing such initialization in RL often works poorly, especially for value-based methods. In this paper, we present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy, and is compatible with any RL approach. In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks: a guide-policy, and an exploration-policy. By using the guide-policy to form a curriculum of starting states for the exploration-policy, we are able to efficiently improve performance on a set of simulated robotic tasks. We show via experiments that it is able to significantly outperform existing imitation and reinforcement learning algorithms, particularly in the small-data regime. In addition, we provide an upper bound on the sample complexity of JSRL and show that with the help of a guide-policy, one can improve the sample complexity for non-optimism exploration methods from exponential in horizon to polynomial.","reinforcement learning, offline reinforcement learning, fine-tuning" GT-CausIn: a novel causal-based insight for traffic prediction,https://openreview.net/forum?id=6MMOFoiWnDM,https://openreview.net/pdf?id=6MMOFoiWnDM,"A model fusing causal knowledge, space dependency and temporal dependency is proposed in this work.","Traffic forecasting is an important spatiotemporal series prediction problem. Among different methods, graph neural networks have so far achieved the most promising results, and learning the relations between graph nodes thus becomes a crucial task. However, the room for improvement is very limited when these relations are learned in a node-to-node manner. The challenge stems from (1) obscure temporal dependencies between different stations, (2) difficulties in defining variables beyond the node level, and (3) the lack of a ready-made method to validate the learned relations. To confront these challenges, we define legitimate traffic variables to discover the causal structure of the traffic network. The causal relations are carefully checked with statistical tools and case analysis. We then present a novel model named Graph Spatial-Temporal Network Based on Causal Insight (GT-CausIn), where graph diffusion layers and temporal convolutional network (TCN) layers are integrated with causal knowledge to capture dependencies in spatiotemporal space. 
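A minimal, gym-style sketch of the JSRL rollout described above (interfaces are assumed; this is not the authors' implementation): the guide-policy acts for the first h steps, then hands over to the exploration-policy, and shrinking h over training forms the curriculum of starting states:

def jsrl_episode(env, guide_policy, explore_policy, h):
    obs = env.reset()
    done, t, transitions = False, 0, []
    while not done:
        policy = guide_policy if t < h else explore_policy
        action = policy(obs)
        next_obs, reward, done, info = env.step(action)
        if t >= h:   # only the exploration-policy's experience is used for its update
            transitions.append((obs, action, reward, next_obs, done))
        obs, t = next_obs, t + 1
    return transitions

# Curriculum: anneal h from the full horizon toward 0 as the exploration-policy improves.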
Experiments are carried out on two real-world traffic datasets: PEMS-BAY and METR-LA, which show that GT-CausIn significantly outperforms the state-of-the-art models.","spatiotemporal forecasting, causal discovery, graph neural networks" KerDEQ: Optimization induced Deep Equilibrium models via Gaussian Kernel,https://openreview.net/forum?id=fZb-Mg6Wip5,https://openreview.net/pdf?id=fZb-Mg6Wip5,,"Although optimization-induced deep equilibrium models (OptEqs) show the connection between a neural network's structure and its designed hidden optimization problem (the problem that the network's forward procedure tries to solve), we find that the linear kernels used in their hidden optimization problems hinder their performance, since linear kernels cannot extract non-linear feature dependencies from the inputs. Inspired by classical machine learning algorithms, we use the widely used Gaussian kernel to construct the hidden optimization problem of OptEqs and then propose our deep equilibrium model named KerDEQ. With Gaussian kernels, it can extract the input features' non-linear information more efficiently compared with the original OptEqs. Furthermore, KerDEQ can be regarded as a weight-tied neural network with infinite width and depth, and therefore shows better performance. Apart from that, our KerDEQ also shows better uncertainty calibration properties and performs more stably under different corruptions, especially under noise, thanks to the Gaussian-kernel hidden optimization problem and its induced structure. Finally, we conduct various experiments to demonstrate the effectiveness and reliability of our KerDEQ.", AD-NEGF: An End-to-End Differentiable Quantum Transport Simulator for Sensitivity Analysis and Inverse Problems,https://openreview.net/forum?id=LR_KWiUgS8F,https://openreview.net/pdf?id=LR_KWiUgS8F,"We provide to the best of our knowledge the first end-to-end differentiable quantum transport simulator, which can compute differential quantities and perform atomic level device optimization.","Quantum transport theory describes transport phenomena from first principles, which is essential for domains such as semiconductor fabrication. As a representative, the Non-Equilibrium Green Function (NEGF) method achieves superiority in numerical accuracy. However, its tremendous computational cost makes it prohibitive for high-throughput simulation tasks such as sensitivity analysis, inverse design, etc. In this work, we propose AD-NEGF, to the best of our knowledge the first Automatic Differentiation (AD) based quantum transport simulator. AD-NEGF calculates gradient information efficiently by utilizing automatic differentiation and implicit layer techniques, while guaranteeing the correctness of the forward simulation. 
Such gradient information enables accurate and efficient calculation of differential physical quantities and the solution of inverse problems that are intractable with traditional optimization methods.","Quantum Transport, Non-Equilibrium Green Function, Automatic Differentiation, Differentiable Programming, Deep Learning, Sensitivity Analysis, Inverse Design" Incomplete to complete multiphysics forecasting - a hybrid approach for learning unknown phenomena,https://openreview.net/forum?id=QP02DQ-FG-8,https://openreview.net/pdf?id=QP02DQ-FG-8,"This paper proposes a hybrid framework that combines neural network models with an incomplete PDE solver to account for the effects of unknown physics present in the system to predict a long-term temporal evolution of a complete, multiphysics system","Modeling complex dynamical systems where only partial knowledge of their physical mechanisms is available is a crucial problem across all scientific and engineering disciplines. Purely data-driven approaches, which only make use of an artificial neural network and data, often fail to accurately simulate the evolution of the system dynamics over a sufficiently long time and in a physically consistent manner. Therefore, we propose a hybrid approach that uses a neural network model in combination with an incomplete PDE solver that provides known but incomplete physical information. In this study, we demonstrate that the results obtained from the incomplete PDEs can be efficiently corrected at every time step by the proposed hybrid neural network – PDE solver model, so that the effect of the unknown physics present in the system is correctly accounted for. For validation purposes, the obtained simulations of the hybrid model are successfully compared against results coming from the complete set of PDEs describing the full physics of the considered system. We demonstrate the validity of the proposed approach on a reactive flow, an archetypal multi-physics system that combines fluid mechanics and chemistry, the latter being the physics considered unknown. Experiments are conducted on planar and Bunsen-type flames at various operating conditions. The hybrid neural network - PDE approach correctly models the flame evolution of the cases under study for significantly long time windows, yields improved generalization, and allows for larger simulation time steps. ","neural physics simulations, multi-physics systems, reactive flows, differentiable PDE solvers" Bi-Level Dynamic Parameter Sharing among Individuals and Teams for Promoting Collaborations in Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=oap4aDN9yS2,https://openreview.net/pdf?id=oap4aDN9yS2,"We propose a bi-level dynamic parameter sharing mechanism between individuals and teams, which can not only promote agents to learn diversified strategies, but also promote agents to form more stable and complementary cooperative relationships.","Parameter sharing has greatly contributed to the success of multi-agent reinforcement learning in recent years. However, most existing parameter sharing mechanisms are static, and parameters are indiscriminately shared among individuals, ignoring the dynamic environments and different roles of multiple agents. In addition, although a single-level selective parameter sharing mechanism can promote the diversity of strategies, it is hard to establish complementary and cooperative relationships between agents. 
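The per-step correction loop described in the hybrid multiphysics abstract above reduces to a few lines; the sketch below assumes an incomplete_pde_step solver and a trained correction_net (both hypothetical names):

def hybrid_rollout(state, incomplete_pde_step, correction_net, n_steps, dt):
    trajectory = [state]
    for _ in range(n_steps):
        state = incomplete_pde_step(state, dt)   # known but incomplete physics
        state = state + correction_net(state)    # learned correction for unknown physics
        trajectory.append(state)
    return trajectory

Because the network only has to supply the residual between the incomplete and complete physics, the hybrid solver can remain accurate over long horizons and tolerate larger time steps.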
To address these issues, we propose BDPS, a bi-level dynamic parameter sharing mechanism among individuals and teams for promoting effective collaborations. Specifically, at the individual level, we define virtual dynamic roles based on the long-term cumulative advantages of agents and share parameters among agents with the same role. At the team level, we combine agents of different virtual roles and share parameters of agents in the same group. Through the joint efforts of these two levels, we achieve a dynamic balance between the individuality and commonality of agents, enabling agents to learn more complex and complementary collaborative relationships. We evaluate BDPS on a challenging set of StarCraft II micromanagement tasks. The experimental results show that our method outperforms the current state-of-the-art baselines, and we demonstrate the reliability of our proposed structure through ablation experiments.","Multi-Agent Reinforcement Learning, Reinforcement Learning" TCNL: Transparent and Controllable Network Learning Via Embedding Human-Guided Concepts,https://openreview.net/forum?id=gUxSHy2mNUh,https://openreview.net/pdf?id=gUxSHy2mNUh,Propose a novel method to improve interpretability of CNN by merging concept information corresponding to human understanding,"Explaining deep learning models is of vital importance for understanding artificial intelligence systems, improving safety, and evaluating fairness. To better understand and control the CNN model, many methods for transparency-interpretability have been proposed. However, most of these works are unintuitive for human understanding and offer insufficient human control over the CNN model. We propose a novel method, Transparent and Controllable Network Learning (TCNL), to overcome such challenges. Towards the goal of improving transparency-interpretability, in TCNL, we define some concepts for specific classification tasks through a human-intuition study and incorporate concept information into the CNN model. In TCNL, the shallow feature extractor first extracts preliminary features. Then several concept feature extractors are built right after the shallow feature extractor to learn high-dimensional concept representations. The concept feature extractor is encouraged to encode information related to the predefined concepts. We also build the concept mapper to visualize features extracted by the concept extractor in a human-intuitive way. TCNL provides a generalizable approach to transparency-interpretability. Researchers can define concepts corresponding to certain classification tasks and encourage the model to encode specific concept information, which to a certain extent improves the transparency-interpretability and controllability of the CNN model. The datasets (with concept sets) for our experiments will also be released.","Transparency-Interpretability, Human-Guided Concepts" How to prepare your task head for finetuning,https://openreview.net/forum?id=gVOXZproe-e,https://openreview.net/pdf?id=gVOXZproe-e,"Features need mild adaptation during finetuning, so mildly update your task head and then finetune together.","In the era of deep learning, transferring information from a pretrained network to a downstream task by finetuning has many benefits. The choice of task head plays an important role in fine-tuning, as the pretrained and downstream tasks are usually different. Although there exist many different designs for finetuning, a full understanding of when and why these algorithms work has been elusive. 
We analyze how the choice of task head controls feature adaptation and hence influences the downstream performance. By decomposing the feature's learning dynamics, we find that the key aspect is the training accuracy and loss at the beginning of finetuning, which determines the ""energy"" available for the feature's adaptation. We identify a significant trend in the effect of changes in this initial energy on the resulting features after finetuning. Specifically, as the energy increases, the Euclidean and cosine distances between the resulting and original features increase, while their dot product (and the resulting features’ norm) first increases and then decreases. Inspired by this, we give several practical principles that lead to better downstream performance. We analytically prove this trend in an overparameterized linear setting and verify its applicability to different experimental settings.","representation learning, finetune, transfer learning" Gradient-based optimization is not necessary for generalization in neural networks,https://openreview.net/forum?id=QC10RmRbZy9,https://openreview.net/pdf?id=QC10RmRbZy9,We empirically showed that a random optimizer performs just as well as SGD,"It is commonly believed that the implicit regularization of optimizers is needed for neural networks to generalize in the overparameterized regime. In this paper, we observe experimentally that this implicit regularization behavior is {\em generic}, i.e. it does not depend strongly on the choice of optimizer. We demonstrate this by training neural networks using several gradient-free optimizers that do not benefit from properties that are often attributed to gradient-based optimizers. This includes a guess-and-check optimizer that generates uniformly random parameter vectors until one is found that happens to achieve perfect train accuracy, and a zeroth-order pattern search optimizer that uses no gradient computations. In the low sample and few-shot regimes, where zeroth order optimizers are most tractable, we find that these non-gradient optimizers achieve test accuracy comparable to SGD.","generalization, regularization" Uplift Modelling based on Graph Neural Network Combined with Causal Knowledge,https://openreview.net/forum?id=-I2nYWac2Id,https://openreview.net/pdf?id=-I2nYWac2Id,Improve uplift modeling performance through causal knowledge representation and structural neighborhood learning,"Uplift modeling is a crucial method for marketing effect estimation, which is widely used to evaluate the effect of treatment on outcomes. On the one hand, we can find the treatment with the best effect through uplift modeling. On the other hand, we can find customers who tend to make corresponding positive decisions in a given treatment. Past uplift modeling methods are mostly based on the difference-in-differences (DID) framework, combined with a machine learning model as the learner to make an estimation, ignoring the relational and confounding information among features in uplift modeling. We propose a graph neural network-based framework combining causal knowledge as an estimator of uplift value. First, we propose a causal representation method based on conditional average treatment effect (CATE) estimation and adjacency matrix structure learning. Second, we propose an uplift modeling framework based on graph convolutional networks to combine the causal knowledge, which has better scalability. 
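The guess-and-check optimizer described above is simple enough to state in full; this is an illustrative sketch (predict is an assumed model-evaluation callable), not the paper's exact code:

import numpy as np

def guess_and_check(predict, n_params, X, y, scale=1.0, max_tries=100_000, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(max_tries):
        theta = rng.uniform(-scale, scale, size=n_params)  # uniformly random guess
        if np.all(predict(theta, X) == y):                 # perfect train accuracy?
            return theta
    raise RuntimeError('no interpolating parameter vector found')

Any parameter vector returned this way interpolates the training set without ever touching a gradient, which is exactly the point of the comparison with SGD made above.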
Our experimental results show that our method can estimate the uplift value with small errors on general simulation data, and its performance has also been verified on real-world industrial marketing data.","uplift modeling, Graph Neural Network, Knowledge Representation, Structure Learning" Sequence to sequence text generation with diffusion models,https://openreview.net/forum?id=jQj-_rLVXsj,https://openreview.net/pdf?id=jQj-_rLVXsj,We propose DiffuSeq: a diffusion model designed for sequence-to-sequence text generation tasks,"Recently, diffusion models have emerged as a new paradigm for generative models. Despite the success in domains using continuous signals such as vision and audio, adapting diffusion models to natural language is difficult due to the discrete nature of text. We tackle this challenge by proposing DiffuSeq: a diffusion model designed for sequence-to-sequence (Seq2Seq) text generation tasks. Upon extensive evaluation over a wide range of Seq2Seq tasks, we find DiffuSeq achieving comparable or even better performance than six established baselines, including a state-of-the-art model that is based on pre-trained language models. Apart from quality, an intriguing property of DiffuSeq is its high diversity during generation, which is desired in many Seq2Seq tasks. We further include a theoretical analysis revealing the connection between DiffuSeq and autoregressive/non-autoregressive models. Bringing together theoretical analysis and empirical evidence, we demonstrate the great potential of diffusion models in complex conditional language generation tasks.","diffusion model, sequence to sequence, text generation, diversity" Rethinking Deep Spiking Neural Networks: A Multi-Layer Perceptron Approach,https://openreview.net/forum?id=-1x2-lp1eZf,https://openreview.net/pdf?id=-1x2-lp1eZf,"A multi-layer perceptron approach for deep spiking neural network, achieving state-of-the-art results on ImageNet.","By adopting deep convolution architectures, spiking neural networks (SNNs) have recently achieved competitive performances with their artificial counterparts in image classification, meanwhile with much lower computation cost due to event-driven and sparse activation. However, the multiplication-free inference (MFI) principle makes SNNs incompatible with attention or transformer mechanisms which have shown significant performance gains on high resolution vision tasks. Inspired by recent works on multi-layer perceptrons (MLPs), we explore an efficient spiking MLP design using batch normalization instead of layer normalization in both the token and the channel block to be compatible with MFI. We further strengthen the network’s local feature learning ability with a spiking patch encoding layer, which significantly improves the network performance. Based on these building blocks, we explore an optimal skip connection configuration and develop an efficient multi-stage spiking MLP network combining global receptive field and local feature extraction, achieving full spike-based computation. Without pre-training or other advanced SNN training techniques, the spiking MLP network achieves 66.39% top-1 accuracy on the ImageNet-1K dataset, surpassing the state-of-the-art directly trained spiking ResNet-34 by 2.67% under similar model capacity, meanwhile requiring fewer simulation steps and much less computation. Another larger variant of the network achieves 68.84% top-1 accuracy, rivaling the spiking VGG-16 network with 4 times smaller model capacity. 
Our work demonstrates the effectiveness of an alternative deep SNN architecture combining both global and local learning abilities. Interestingly, we further show that the trained receptive fields of our network closely resemble those of cells in the cortex. Code will be publicly available.","spiking neural network, multi-layer perceptron, image classification" Collaborative Symmetricity Exploitation for Offline Learning of Hardware Design Solver,https://openreview.net/forum?id=qF5G70FqURx,https://openreview.net/pdf?id=qF5G70FqURx,We propose an offline learning method with symmetric learning for hardware design,"This paper proposes the \textit{collaborative symmetricity exploitation} (CSE) framework to train a solver for the decoupling capacitor placement problem (DPP) benchmark, one of the significant hardware design problems. Due to the sequentially coupled multi-level property of the hardware design process, the design condition of DPP changes depending on the design of higher-level problems. Also, the online evaluation of real-world electrical performance through simulation is extremely costly. Thus, a data-efficient offline learning method to train a solver (i.e., contextualized policy) with high generalization capability over changing task conditions is necessary. In this paper, we apply the CSE framework to train a DPP solver using a limited number of offline expert data. Leveraging symmetricity for offline learning of a hardware design solver has two major advantages: it increases data-efficiency by reducing the solution space and improves generalization capability by capturing the invariant nature present regardless of changing conditions. The proposed CSE is composed of two learning schemes: expert exploitation and self-exploitation. Expert exploitation induces symmetricity during the imitation learning process with offline expert data, and self-exploitation induces symmetricity during the consistency learning process with self-generated data. Extensive experiments verify that CSE with zero-shot inference outperforms the neural baselines and iterative conventional design methods on the DPP benchmark. Furthermore, CSE shows promising extrapolation capability, as it greatly outperforms the expert method used to generate the offline data for training. The scalability and flexibility of the proposed method are also verified for practical use of CSE in industry.","Symmetricity, Offline learning, Hardware design solver" From ChebNet to ChebGibbsNet,https://openreview.net/forum?id=2a5Ru3JtNe0,https://openreview.net/pdf?id=2a5Ru3JtNe0,,"Recent advancements in Spectral Graph Convolutional Networks (SpecGCNs) have led to state-of-the-art performance in various graph representation learning tasks. To exploit the potential of SpecGCNs, we analyze the corresponding graph filters via polynomial interpolation, the cornerstone of graph signal processing. Different polynomial bases, such as the Bernstein, Chebyshev, and monomial bases, have different convergence rates that affect the error of polynomial interpolation. Although adopting the Chebyshev basis for interpolation can minimize the maximum error, the performance of ChebNet is still weaker than that of GPR-GNN and BernNet. We point out that this is caused by the Gibbs phenomenon, which occurs when the corresponding graph frequency response function approximates the target function. It reduces the approximation ability of a truncated polynomial interpolation. 
In order to mitigate the Gibbs phenomenon, we propose to attach a Gibbs damping factor to each term of the Chebyshev polynomials in ChebNet. As a result, our lightweight approach leads to a significant performance boost. Afterwards, we reorganize ChebNet by decoupling feature propagation and transformation, and name this variant ChebGibbsNet. Our experiments indicate that ChebGibbsNet is superior to other advanced SpecGCNs, such as GPR-GNN and BernNet, on both homogeneous and heterogeneous graphs.","Spectral Graph Convolutional Networks, Gibbs phenomenon, Gibbs damping factors, ChebNet" Policy Expansion for Bridging Offline-to-Online Reinforcement Learning,https://openreview.net/forum?id=-Y34L45JR6z,https://openreview.net/pdf?id=-Y34L45JR6z,Bridging offline-to-online RL with Policy Expansion,"Pre-training with offline data and online fine-tuning using reinforcement learning is a promising strategy for learning control policies by leveraging the best of both worlds in terms of sample efficiency and performance. One natural approach is to initialize the policy for online learning with the one trained offline. In this work, we introduce a policy expansion scheme for this task. After learning the offline policy, we use it as one candidate policy in a policy set, and further learn another policy that will be responsible for further learning as an expansion to the policy set. The two policies will be composed in an adaptive manner for interacting with the environment. With this approach, the policy previously learned offline is fully retained during online learning, thus mitigating potential issues such as destroying the useful behaviors of the offline policy in the initial stage of online learning, while allowing the offline policy to participate in exploration naturally in an adaptive manner. Moreover, new useful behaviors can potentially be captured by the newly added policy through learning. Experiments are conducted on a number of tasks and the results demonstrate the effectiveness of the proposed approach.", On The Implicit Bias of Weight Decay in Shallow Univariate ReLU Networks,https://openreview.net/forum?id=B8_T6-8-tCU,https://openreview.net/pdf?id=B8_T6-8-tCU,"Minimal $\ell_2$-norm interpolation by univariate scalar one layer ReLU networks is completely characterized in terms of the convexity of the learned predictor, giving new sharp generalization bounds on 1d Lipschitz functions.","We give a complete characterization of the implicit bias of infinitesimal weight decay in the modest setting of univariate one layer ReLU networks. Our main result is a surprisingly simple geometric description of all one layer ReLU networks that exactly fit a dataset $\mathcal{D} = \{(x_i, y_i)\}$ with the minimum value of the $\ell_2$-norm of the neuron weights. Specifically, we prove that such functions must be either concave or convex between any two consecutive data sites $x_i$ and $x_{i+1}$. Our description implies that interpolating ReLU networks with weak $\ell_2$-regularization achieve the best possible generalization for learning $1d$ Lipschitz functions, up to universal constants. 
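The damping idea from the ChebGibbsNet abstract above can be sketched numerically. The factors below are the classical Lanczos sigma and Jackson choices used to suppress Gibbs oscillations in truncated Chebyshev expansions; the factors adopted by ChebGibbsNet itself may differ:

import numpy as np

def damped_chebyshev_coeffs(coeffs, kind='lanczos'):
    K = len(coeffs) - 1
    k = np.arange(K + 1)
    if kind == 'lanczos':
        g = np.sinc(k / (K + 1))   # sigma-factors: sin(pi k/(K+1)) / (pi k/(K+1))
    else:                          # Jackson damping
        g = ((K - k + 1) * np.cos(np.pi * k / (K + 1))
             + np.sin(np.pi * k / (K + 1)) / np.tan(np.pi / (K + 1))) / (K + 1)
    return coeffs * g              # attenuate each Chebyshev term's coefficient

print(damped_chebyshev_coeffs(np.ones(6)))

Higher-order terms are smoothly attenuated toward zero, trading a little approximation sharpness for the removal of the overshoot near discontinuities in the target frequency response.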
","theory, implicit bias, generalization, interpolation, theoretical, shallow ReLU networks, ReLU networks, analysis of weight decay" Mitigating Memorization of Noisy Labels via Regularization between Representations,https://openreview.net/forum?id=6qcYDVlVLnK,https://openreview.net/pdf?id=6qcYDVlVLnK,We theoretically show the memorization effect of DNN with resepct to the model capacity and propose a representation-based regularizer to mitigate the memorization effect. ,"Designing robust loss functions is popular in learning with noisy labels while existing designs did not explicitly consider the overfitting property of deep neural networks (DNNs). As a result, applying these losses may still suffer from overfitting/memorizing noisy labels as training proceeds. In this paper, we first theoretically analyze the memorization effect and show that a lower-capacity model may perform better on noisy datasets. However, it is non-trivial to design a neural network with the best capacity given an arbitrary task. To circumvent this dilemma, instead of changing the model architecture, we decouple DNNs into an encoder followed by a linear classifier and propose to restrict the function space of a DNN by a representation regularizer. Particularly, we require the distance between two self-supervised features to be positively related to the distance between the corresponding two supervised model outputs. Our proposed framework is easily extendable and can incorporate many other robust loss functions to further improve performance. Extensive experiments and theoretical analyses support our claims.","learning with noisy labels, representation learning" Graph Neural Networks are Inherently Good Generalizers: Insights by Bridging GNNs and Multi-Layer Perceptrons,https://openreview.net/forum?id=dqnNW2omZL6,https://openreview.net/pdf?id=dqnNW2omZL6,,"Graph neural networks (GNNs), as the de-facto model class for representation learning on graphs, are built upon the multi-layer perceptrons (MLP) architecture with additional message passing layers to allow features to flow across nodes. While the conventional wisdom largely attributes the success of GNNs to their advanced expressivity for learning desired functions on nodes' ego-graphs, we conjecture that this is not the main cause of GNNs' superiority in node prediction tasks. This paper pinpoints the major source of GNNs' performance gain to their intrinsic generalization capabilities, by introducing an intermediate model class dubbed as P(ropagational)MLP, which is identical to standard MLP in training, and then adopt GNN's architecture in testing. Intriguingly, we observe that PMLPs consistently perform on par with (or even exceed) their GNN counterparts across ten benchmarks and different experimental settings, despite the fact that PMLPs share the same (trained) weights with poorly-performed MLP. This critical finding opens a door to a brand new perspective for understanding the power of GNNs, and allow bridging GNNs and MLPs for dissecting their generalization behaviors. As an initial step to analyze PMLP, we show its essential difference with MLP at infinite-width limit lies in the NTK feature map in the post-training stage. 
Moreover, though MLP and PMLP cannot extrapolate non-linear functions for extreme OOD data, PMLP has more freedom to generalize near the training support.","Graph Neural Networks, Generalization" Learning Cut Selection for Mixed-Integer Linear Programming via Hierarchical Sequence Model,https://openreview.net/forum?id=Zob4P9bRNcK,https://openreview.net/pdf?id=Zob4P9bRNcK,,"Cutting planes (cuts) are important for solving mixed-integer linear programs (MILPs), which formulate a wide range of important real-world applications. Cut selection---which aims to select a proper subset of the candidate cuts to improve the efficiency of solving MILPs---heavily depends on (P1) which cuts should be preferred, and (P2) how many cuts should be selected. Although many modern MILP solvers tackle (P1)-(P2) by manually designed heuristics, machine learning offers a promising approach to learn more effective heuristics from MILPs collected from specific applications. However, many existing learning-based methods focus on learning which cuts should be preferred, neglecting the importance of learning the number of cuts that should be selected. Moreover, we observe from extensive empirical results that (P3) what order of selected cuts should be preferred has a significant impact on the efficiency of solving MILPs as well. To address this challenge, we propose a novel hierarchical sequence model (HEM) to learn cut selection policies via reinforcement learning. Specifically, HEM consists of a two-level model: (1) a higher-level model to learn the number of cuts that should be selected, and (2) a lower-level model---that formulates the cut selection task as a sequence to sequence learning problem---to learn policies selecting an ordered subset with the size determined by the higher-level model. To the best of our knowledge, HEM is the first method that can tackle (P1)-(P3) in cut selection simultaneously from a data-driven perspective. Experiments show that HEM significantly improves the efficiency of solving MILPs compared to human-designed and learning-based baselines on both synthetic and large-scale real-world MILPs, including MIPLIB 2017. Moreover, experiments demonstrate that HEM well generalizes to MILPs that are significantly larger than those seen during training.","mixed-integer linear programming, cut selection, deep reinforcement learning, sequence to sequence learning" BSTT: A Bayesian Spatial-Temporal Transformer for Sleep Staging,https://openreview.net/forum?id=ZxdkjTgK_Dl,https://openreview.net/pdf?id=ZxdkjTgK_Dl,,"Sleep staging is helpful in assessing sleep quality and diagnosing sleep disorders. However, how to adequately capture the temporal and spatial relations of the brain during sleep remains a challenge. In particular, existing methods cannot adaptively infer spatial-temporal relations of the brain under different sleep stages. In this paper, we propose a novel Bayesian spatial-temporal relation inference neural network, named Bayesian spatial-temporal transformer (BSTT), for sleep staging. Our model is able to adaptively infer brain spatial-temporal relations during sleep for spatial-temporal feature modeling through a well-designed Bayesian relation inference component. Meanwhile, our model also includes a spatial transformer for extracting brain spatial features and a temporal transformer for capturing temporal features. Experiments show that our BSTT outperforms state-of-the-art baselines on the ISRUC and MASS datasets. 
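A schematic numpy forward pass for the PMLP model class discussed above (shapes and the mean-aggregation choice are illustrative assumptions): the same trained MLP weights are used either with or without a message-passing step:

import numpy as np

def pmlp_forward(X, A_hat, weights, message_passing=False):
    # X: (n, d) node features; A_hat: (n, n) normalized adjacency; weights: list of W.
    h = X
    for i, W in enumerate(weights):
        if message_passing:         # test time: behave like a GNN layer
            h = A_hat @ h
        h = h @ W                   # training time: plain MLP layer
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)  # ReLU
    return h

Training uses message_passing=False; at test time the very same weights are evaluated with message_passing=True, which is the train-as-MLP, test-as-GNN protocol described above.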
In addition, visual analysis shows that the spatial-temporal relations inferred by BSTT offer a degree of interpretability for sleep staging. ","Spatial-Temporal Transformer, Sleep Staging, Bayesian Deep Learning" Self-Guided Noise-Free Data Generation for Efficient Zero-Shot Learning,https://openreview.net/forum?id=h5OpjGd_lo6,https://openreview.net/pdf?id=h5OpjGd_lo6,"This paper proposes a framework to automatically enhance the quality of PLM-generated data for efficient zero-shot learning, without relying on any human annotation.","There is a rising interest in further exploring the zero-shot learning potential of large pre-trained language models (PLMs). A new paradigm called data-generation-based zero-shot learning has achieved impressive success. In this paradigm, the synthesized data from the PLM acts as the carrier of knowledge, which is used to train a task-specific model with orders of magnitude fewer parameters than the PLM, achieving both higher performance and efficiency than prompt-based zero-shot learning methods on PLMs. The main hurdle of this approach is that the synthesized data from the PLM usually contains a significant portion of low-quality samples. Fitting on such data will greatly hamper the performance of the task-specific model, making it unreliable for deployment. Previous methods remedy this issue mainly by filtering synthetic data using heuristic metrics (e.g., output confidence), or refining the data with the help of a human expert, which comes with excessive manual tuning or expensive costs. In this paper, we propose a novel noise-robust re-weighting framework, SunGen, to automatically construct high-quality data for zero-shot classification problems. Our framework features the ability to learn the sample weights indicating data quality without requiring any human annotation. We theoretically and empirically verify the ability of our method to help construct good-quality synthetic datasets. Notably, SunGen-LSTM yields a 9.8% relative improvement over the baseline in average accuracy across eight established text classification tasks.","Pre-Trained Language Model, Prompt-Based Learning, Efficient Zero-Shot Learning" Beyond re-balancing: distributionally robust augmentation against class-conditional distribution shift in long-tailed recognition,https://openreview.net/forum?id=fkM4J9CnJBS,https://openreview.net/pdf?id=fkM4J9CnJBS,Study the problem of unreliable class-conditional distribution estimation in long-tailed recognition and propose a data augmentation method to solve it,"As a fundamental and practical problem, long-tailed recognition has drawn considerable attention. In this paper, we investigate an essential but rarely noticed issue in long-tailed recognition, Class-Conditional Distribution (CCD) shift due to scarce instances, which exhibits a significant discrepancy between the empirical CCDs for training and test data, especially for tail classes. We observe empirical evidence that the shift is a key factor that limits the performance of existing long-tailed learning methods, and provide a novel understanding of these methods in the course of our analysis. Motivated by this, we propose an adaptive data augmentation method, Distributionally Robust Augmentation (DRA), to learn models more robust to CCD shift. The convergence and generalization of DRA are theoretically guaranteed. 
Experimental results verify that DRA outperforms related data augmentation methods without extra training cost and significantly improves the performance of some existing long-tailed recognition methods.","Long-tailed recognition, data augmentation, distributionally robust optimization, imbalance" Improving Deep Policy Gradients with Value Function Search,https://openreview.net/forum?id=6qZC7pfenQm,https://openreview.net/pdf?id=6qZC7pfenQm,"We present a Value Function Search that employs a gradient-free population of perturbed value networks to improve Deep Policy Gradient primitives, leading to higher returns and better sample efficiency.","Deep Policy Gradient algorithms employ value networks to drive the learning of parameterized policies and reduce the variance of the gradient estimates. However, value function approximation gets stuck in local optima and struggles to fit the actual return, limiting the variance reduction efficacy and leading policies to sub-optimal performance. In this paper, we focus on improving value approximation and analyzing the effects on Deep Policy Gradient primitives such as value prediction, variance reduction, and correlation of gradient estimates with the true gradient. To this end, we introduce a Value Function Search that employs a population of perturbed value networks to search for a better approximation. Our framework does not require additional environment interactions, gradient computations, or ensembles, providing a computationally inexpensive approach to enhance the supervised learning task on which value networks train. Crucially, we show that improving Deep Policy Gradient primitives results in improved sample efficiency and policies with higher returns using standard policy gradient methods on common continuous control benchmark domains.","Deep Reinforcement Learning, Deep Policy Gradients" MEDICAL IMAGE UNDERSTANDING WITH PRETRAINED VISION LANGUAGE MODELS: A COMPREHENSIVE STUDY,https://openreview.net/forum?id=txlWziuCE5W,https://openreview.net/pdf?id=txlWziuCE5W,"This paper discusses how to transfer trending vision language models to the medical domain, showing exciting performance on zero-shot and few-shot learning tasks.","The large-scale pre-trained vision language models (VLM) have shown remarkable domain transfer capability on natural images. However, it remains unknown whether this capability can also apply to the medical image domain. This paper thoroughly studies the knowledge transferability of pre-trained VLMs to the medical domain, where we show that well-designed medical prompts are the key to eliciting knowledge from pre-trained VLMs. We demonstrate that by prompting with expressive attributes that are shared between domains, the VLM can carry the knowledge across domains and improve its generalization. This mechanism empowers VLMs to recognize novel objects with fewer or even no image samples. Furthermore, to avoid the laborious manual designing process, we develop three approaches for automatic generation of medical prompts, which can inject expert-level medical knowledge and image-specific information into the prompts for fine-grained grounding. 
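A hedged torch sketch of one Value Function Search step in the spirit of the abstract above (the perturbation scale and selection rule are our illustrative assumptions, not the paper's exact procedure):

import copy
import torch

def value_function_search(value_net, states, returns, pop_size=10, sigma=0.01):
    def fit(net):   # mean-squared error of value predictions against returns
        with torch.no_grad():
            return torch.mean((net(states).squeeze(-1) - returns) ** 2).item()
    best, best_loss = value_net, fit(value_net)
    for _ in range(pop_size):
        cand = copy.deepcopy(value_net)
        with torch.no_grad():
            for p in cand.parameters():
                p.add_(sigma * torch.randn_like(p))   # gradient-free perturbation
        loss = fit(cand)
        if loss < best_loss:
            best, best_loss = cand, loss
    return best

The search reuses the states and returns already collected for the supervised value-fitting step, so no extra environment interactions or gradient computations are needed.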
We conduct extensive experiments on thirteen different medical datasets across various modalities, showing that our well-designed prompts greatly improve the zero-shot performance compared to the default prompts, and our fine-tuned models surpass the supervised models by a significant margin.","Vision Language models, Multimodality, Medical images, Few-shot learning, zero-shot" SPI-GAN: Denoising Diffusion GANs with Straight-Path Interpolations,https://openreview.net/forum?id=9XFX-DdkGp9,https://openreview.net/pdf?id=9XFX-DdkGp9,,"Score-based generative models (SGMs) are a recently proposed paradigm for deep generative tasks and now show state-of-the-art sampling performance. It is known that the original SGM design solves the first two problems of the generative trilemma: i) sampling quality, and ii) sampling diversity. However, the last problem of the trilemma remains unsolved, i.e., their training/sampling complexity is notoriously high. To this end, combining SGMs with simpler models, e.g., generative adversarial networks (GANs), is currently gathering much attention. We present an enhanced denoising method using GANs, called straight-path interpolation GAN (SPI-GAN), which drastically reduces the sampling time while achieving sampling quality and diversity as high as those of SGMs. Our SPI-GAN can be compared to the state-of-the-art shortcut-based denoising method using GANs, called denoising diffusion GAN (DD-GAN). However, our method corresponds to an extreme case that does not use any intermediate shortcut information of the reverse SDE path, a setting in which DD-GAN ($K=1$) fails to obtain good results. Nevertheless, our straight-path interpolation method greatly stabilizes the overall training process. As a result, SPI-GAN is one of the best-balanced models in terms of sampling quality/diversity/time on CIFAR-10, CelebA-HQ-256, and LSUN-Church-256.", Generated Distributions Are All You Need for Membership Inference Attacks Against Generative Models,https://openreview.net/forum?id=Bgcp4BniE-U,https://openreview.net/pdf?id=Bgcp4BniE-U,Our work proposes a generalized membership inference attack against various generative models.,"Generative models have shown promising performance on various real-world tasks, which, at the same time, introduces the threat of leaking private information from their training data. Several membership inference attacks against generative models have been proposed in recent years and have exhibited their effectiveness in different settings. However, these attacks all suffer from their own limitations and cannot be generalized to all generative models under all scenarios. In this paper, we propose the first generalized membership inference attack for generative models, which can be utilized to quantitatively evaluate the privacy leakage of various existing generative models. Compared with previous works, our attack has three main advantages: (i) it only requires black-box access to the target model, (ii) it is computationally efficient, and (iii) it can be generalized to numerous generative models. Extensive experiments show that various existing generative models in a variety of applications are vulnerable to our attack. For example, our attack achieves an AUC of 0.997 (0.997) and 0.998 (0.999) against the generative model of DDPM (DDIM) on the CelebA and CIFAR-10 datasets. 
These results demonstrate that private information can be effectively and efficiently exploited by attackers, which calls on the community to be aware of privacy threats when designing generative models.","generative models, diffusion models, membership inference" Temporal Coherent Test Time Optimization for Robust Video Classification,https://openreview.net/forum?id=-t4D61w4zvQ,https://openreview.net/pdf?id=-t4D61w4zvQ,,"Deep neural networks are likely to fail when the test data is corrupted in real-world deployment (e.g., blur, weather, etc.). Test-time optimization is an effective way to adapt models to generalize to corrupted data during testing, as has been shown in the image domain. However, few techniques exist for improving corruption robustness in video classification. In this work, we propose a Temporal Coherent Test-time Optimization framework (TeCo) to utilize spatio-temporal information in test-time optimization for robust video classification. To exploit information in video with self-supervised learning, TeCo minimizes the entropy of the prediction based on the global content from video clips. Meanwhile, it also feeds local content to regularize the temporal coherence at the feature level. TeCo retains the generalization ability of various video classification models and achieves significant improvements in corruption robustness across Mini Kinetics-C and Mini SSV2-C. Furthermore, TeCo sets a new baseline in video classification corruption robustness via test-time optimization. ","Video Classification, Robustness, Test Time Optimization" Offline Communication Learning with Multi-source Datasets,https://openreview.net/forum?id=R4oodnmxb9m,https://openreview.net/pdf?id=R4oodnmxb9m,,"Scalability and partial observability are two major challenges faced by multi-agent reinforcement learning. Recently, researchers have proposed offline MARL algorithms to improve scalability by reducing online exploration cost, while the problem of partial observability is often ignored in the offline MARL setting. Communication is a promising approach to alleviate the miscoordination caused by partial observability; thus, in this paper we focus on offline communication learning, where agents learn from a fixed dataset. We find that learning communication in an end-to-end manner from a given offline dataset without communication information is intractable, since the correct communication protocol space is too sparse compared with the exponentially growing joint state-action space when the number of agents increases. Besides, unlike offline policy learning, which can be guided by reward signals, offline communication learning is difficult, since communication messages impact the reward only implicitly. Moreover, in real-world applications, offline MARL datasets are often collected from multiple sources, making offline MARL communication learning even more challenging. Therefore, we present a new benchmark which contains a diverse set of challenging offline MARL communication tasks with single/multi-source datasets, and propose a novel Multi-Head structure for Communication Imitation learning (MHCI) algorithm that automatically adapts to the distribution of the dataset. 
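The global-content part of TeCo's objective described above is essentially entropy minimization over a batch of clips; here is a minimal torch sketch (temporal-coherence regularization omitted for brevity, interfaces assumed):

import torch
import torch.nn.functional as F

def test_time_entropy_step(model, optimizer, clips):
    logits = model(clips)                            # (batch, n_classes)
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()   # in practice often restricted to normalization/affine parameters
    return entropy.item()

Each test batch thus nudges the model toward confident, self-consistent predictions on the corrupted distribution without requiring any labels.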
Empirical results show the effectiveness of our method on various tasks of the new offline communication learning benchmark.", A Learning Based Hypothesis Test for Harmful Covariate Shift,https://openreview.net/forum?id=rdfgqiwz7lZ,https://openreview.net/pdf?id=rdfgqiwz7lZ,"We propose the Detectron, a method that detects covariate shift using an ensemble of classifiers trained to disagree with each other on unlabeled samples.","The ability to quickly and accurately identify covariate shift at test time is a critical and often overlooked component of safe machine learning systems deployed in high-risk domains. While methods exist for detecting when predictions should not be made on out-of-distribution test examples, identifying distributional level differences between training and test time can help determine when a model should be removed from the deployment setting and retrained. In this work, we define harmful covariate shift (HCS) as a change in distribution that may weaken the generalization of a predictive model. To detect HCS, we use the discordance between an ensemble of classifiers trained to agree on training data and disagree on test data. We derive a loss function for training this ensemble and show that the disagreement rate and entropy represent powerful discriminative statistics for HCS. Empirically, we demonstrate the ability of our method to detect harmful covariate shift with statistical certainty on a variety of high-dimensional datasets. Across numerous domains and modalities, we show state-of-the-art performance compared to existing methods, particularly when the number of observed test samples is small.","Covariate Shift, Distribution Shift, Trustworthy Machine Learning, Statistics" Less is More: Rethinking Few-Shot Learning and Recurrent Neural Nets,https://openreview.net/forum?id=K2d8p6cjSe5,https://openreview.net/pdf?id=K2d8p6cjSe5,,"The statistical supervised learning framework assumes an input-output set with a joint probability distribution that is reliably represented by the training dataset. The learner is then required to output a prediction rule learned from the training dataset's input-output pairs. In this work, we provide meaningful insights into the asymptotic equipartition property (AEP) \citep{Shannon:1948} in the context of machine learning, and illuminate some of its potential ramifications for few-shot learning. We provide theoretical guarantees for reliable learning under the information-theoretic AEP, and for the generalization error with respect to the sample size. We then focus on a highly efficient recurrent neural net (RNN) framework and propose a reduced-entropy algorithm for few-shot learning. We also propose a mathematical intuition for the RNN as an approximation of a sparse coding solver. We verify the applicability, robustness, and computational efficiency of the proposed approach with image deblurring and optical coherence tomography (OCT) speckle suppression. Our experimental results demonstrate significant potential for improving learning models' sample efficiency, generalization, and time complexity, which can therefore be leveraged for practical real-time applications. ", SynMotor: A Benchmark Suite for Object Attribute Regression and Multi-task Learning,https://openreview.net/forum?id=GpicUyuSdTr,https://openreview.net/pdf?id=GpicUyuSdTr,,"In this paper, we develop a novel benchmark suite including both a 2D synthetic image dataset and a 3D synthetic point cloud dataset.
Our work is a sub-task in the framework of a remanufacturing project, in which small electric motors are used as fundamental objects. Apart from the given detection, classification, and segmentation annotations, the key objects also have multiple learnable attributes with ground truth provided. This benchmark can be used for computer vision tasks including 2D/3D detection, classification, segmentation, and multi-attribute learning. It is worth mentioning that most attributes of the motors are quantified as continuously variable rather than binary, which makes our benchmark well suited for the less explored regression tasks. In addition, appropriate evaluation metrics are adopted or developed for each task and promising baseline results are provided. We hope this benchmark can stimulate more research efforts on the sub-domains of object attribute learning and multi-task learning in the future.","synthetic dataset, benchmark development, 3D point cloud, object attribute regression, multi-task learning" Training via Confidence Ranking,https://openreview.net/forum?id=wVdbbVAVuIH,https://openreview.net/pdf?id=wVdbbVAVuIH,We devise a series of loss functions for training a new model that is better than the deployed one in a real-world machine learning system.,"Model evolution and constantly available data are two common phenomena in large-scale real-world machine learning applications, e.g., ads and recommendation systems. To adapt, real-world systems typically operate both retraining with all available data and \textit{online-learning} with recently available data to update models periodically, with the goal of better serving performance on future data. However, if model and data evolution result in a vastly different training manner, it may negatively impact the online A/B platform. In this paper, we propose a novel framework, named Confidence Ranking, which designs the optimization objective as a ranking function over two different models. Our confidence ranking loss allows directly optimizing the logit outputs for different convex surrogate functions of metrics, e.g., AUC and accuracy, depending on the target tasks and datasets. Armed with our proposed methods, our experiments show that the confidence ranking loss can outperform knowledge distillation baselines in test-set performance on CTR prediction and model compression under various settings.",loss function Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation,https://openreview.net/forum?id=NPrsUQgMjKK,https://openreview.net/pdf?id=NPrsUQgMjKK,Understanding and improving signal propagation in self-attention layers to train deep transformers without skip connections and/or normalisation.,"Skip connections and normalisation layers form two standard architectural components that are ubiquitous for the training of Deep Neural Networks (DNNs), but whose precise roles are poorly understood. Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them, using insights from wide NN kernel theory to improve signal propagation in vanilla DNNs (which we define as networks without skips or normalisation). However, these approaches are incompatible with the self-attention layers present in transformers, whose kernels are intrinsically more complicated to analyse and control.
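The Confidence Ranking abstract above describes its objective only at a high level; the snippet below sketches one plausible instantiation, a logistic (convex) surrogate that pushes the new model's true-class logit to rank above the deployed model's. The exact margin definition and surrogate used in the paper may differ.

```python
# Hedged sketch of a "confidence ranking" style loss between a new model and
# a frozen deployed model: penalize examples where the deployed model is more
# confident on the true class than the new model, via log(1 + exp(old - new)).
import torch
import torch.nn.functional as F

def confidence_ranking_loss(new_logits, old_logits, labels):
    new_conf = new_logits.gather(1, labels[:, None]).squeeze(1)
    old_conf = old_logits.gather(1, labels[:, None]).squeeze(1).detach()
    return F.softplus(old_conf - new_conf).mean()  # convex logistic surrogate

new_logits = torch.randn(16, 5, requires_grad=True)  # new model outputs
old_logits = torch.randn(16, 5)                      # deployed model outputs
labels = torch.randint(0, 5, (16,))
loss = confidence_ranking_loss(new_logits, old_logits, labels)
loss.backward()
print(loss.item())
```

Detaching the deployed model's confidence makes it a fixed reference, so gradients only improve the new model, in the spirit of distillation-style baselines.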
And so the question remains: \emph{is it possible to train deep vanilla transformers?} We answer this question in the affirmative by designing several approaches that use combinations of parameter initialisations, bias matrices and location-dependent rescaling to achieve faithful signal propagation in vanilla transformers. Our methods address various intricacies specific to signal propagation in transformers, including the interaction with positional encoding and causal masking. In experiments on WikiText-103 and C4, our approaches enable deep transformers without normalisation to train at speeds matching their standard counterparts, and deep vanilla transformers to reach the same performance as standard ones after about 5 times more iterations.","signal propagation, neural networks and kernels, deep transformers, self-attention, residual connections, layer normalisation, rank collapse, positional encoding" Towards Understanding Robust Memorization in Adversarial Training,https://openreview.net/forum?id=-4Maz7s3YXz,https://openreview.net/pdf?id=-4Maz7s3YXz,We provide a theoretical understanding of adversarial training by proposing a novel implicit bias called robust memorization.,"Adversarial training is a standard method to train neural networks to be robust to adversarial perturbation. However, in contrast with benign overfitting in the standard deep learning setting, where over-parameterized neural networks surprisingly generalize well on unseen data, adversarial training can achieve low robust training error yet still exhibits a significant robust generalization gap. This prompts us to explore what mechanism leads to robust overfitting during the learning process. In this paper, we propose an implicit bias called $\textit{robust memorization}$ in adversarial training under a realistic data assumption. Using function approximation theory, we prove that ReLU nets of efficient size have the ability to achieve robust memorization, while robust generalization requires exponentially large models. Then, we demonstrate robust memorization in adversarial training from both empirical and theoretical perspectives. In particular, we empirically investigate the dynamics of the loss landscape over the input, and we also provide a theoretical analysis of robust memorization on data under a linear separability assumption. Finally, we prove novel generalization bounds based on robust memorization, which further explain why deep neural networks have both high clean test accuracy and robust overfitting at the same time.","adversarial robustness, adversarial training, robust generalization gap, robust overfitting, deep learning theory" Self-supervised Geometric Correspondence for Category-level 6D Object Pose Estimation in the Wild,https://openreview.net/forum?id=ZKDUlVMqG_O,https://openreview.net/pdf?id=ZKDUlVMqG_O,,"While 6D object pose estimation has wide applications across computer vision and robotics, it remains far from being solved due to the lack of annotations. The problem becomes even more challenging when moving to category-level 6D pose, which requires generalization to unseen instances. Current approaches are restricted by their reliance on annotations from simulation or from human collection. In this paper, we overcome this barrier by introducing a self-supervised learning approach trained directly on large-scale real-world object videos for category-level 6D pose estimation in the wild.
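As context for the adversarial-training setting analyzed above, the following is the standard PGD inner maximization that such analyses typically assume; it is shown purely for reference and is not a contribution of the paper.

```python
# Standard L-infinity PGD inner loop used by adversarial training: ascend the
# loss within an eps-ball around the input, projecting after each step.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.03, alpha=0.01, steps=10):
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)  # random start
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()                # gradient ascent step
        x_adv = x.detach() + (x_adv - x).clamp(-eps, eps)  # project to eps-ball
    return x_adv.detach()

# Toy usage: adversarial training would backprop the loss on x_adv.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
x, y = torch.rand(4, 3, 8, 8), torch.randint(0, 10, (4,))
x_adv = pgd_attack(model, x, y)
print(F.cross_entropy(model(x_adv), y).item())
```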
Our framework reconstructs the canonical 3D shape of an object category and learns dense correspondences between input images and the canonical shape via surface embedding. For training, we propose novel geometrical cycle-consistency losses which construct cycles across 2D-3D spaces, across different instances and different time steps. The learned correspondence can be applied to 6D pose estimation and other downstream tasks such as keypoint transfer. Surprisingly, our method, without any human annotations or simulators, can achieve on-par or even better performance than previous supervised or semi-supervised methods on in-the-wild images. ","Category-level 6D pose estimation, self-supervised learning, correspondence, computer vision" Incorporating Explicit Uncertainty Estimates into Deep Offline Reinforcement Learning,https://openreview.net/forum?id=Uuw51xL-ZHd,https://openreview.net/pdf?id=Uuw51xL-ZHd,,"Most theoretically motivated work in the offline reinforcement learning setting requires precise uncertainty estimates. This requirement restricts the algorithms derived in that work to the tabular and linear settings where such estimates exist. In this work, we develop a novel method for incorporating scalable uncertainty estimates into an offline reinforcement learning algorithm called deep-SPIBB, which extends the SPIBB family of algorithms to environments with larger state and action spaces. We use recent innovations in uncertainty estimation from the deep learning community to get more scalable uncertainty estimates to plug into deep-SPIBB. While these uncertainty estimates do not allow for the same theoretical guarantees as in the tabular case, we argue that the SPIBB mechanism for incorporating uncertainty is more robust and flexible than pessimistic approaches that incorporate the uncertainty as a value function penalty. We bear this out empirically, showing that deep-SPIBB outperforms pessimism-based approaches with access to the same uncertainty estimates and performs at least on par with a variety of other strong baselines across several environments and datasets.", Non-parametric Outlier Synthesis,https://openreview.net/forum?id=JHklpEZqduQ,https://openreview.net/pdf?id=JHklpEZqduQ,,"Out-of-distribution (OOD) detection is indispensable for safely deploying machine learning models in the wild. One of the key challenges is that models lack supervision signals from unknown data, and as a result, can produce overconfident predictions on OOD data. Recent work on outlier synthesis modeled the feature space as a parametric Gaussian distribution, a strong and restrictive assumption that might not hold in reality. In this paper, we propose a novel framework, non-parametric outlier synthesis (NPOS), which generates artificial OOD training data and facilitates learning a reliable decision boundary between ID and OOD data. Importantly, our proposed synthesis approach does not make any distributional assumption on the ID embeddings, thereby offering strong flexibility and generality. We show that our synthesis approach can be mathematically interpreted as a rejection sampling framework.
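Based only on what the NPOS abstract states (non-parametric, boundary ID embeddings, rejection sampling), here is a loose numpy sketch of that recipe: select boundary ID embeddings by k-NN distance, perturb them with Gaussian noise, and reject candidates that fall back near the ID data. The thresholds and selection rules are guesses, not the paper's specification.

```python
# Loose sketch of non-parametric outlier synthesis: propose Gaussian
# perturbations of boundary in-distribution (ID) embeddings and keep only
# those far from the ID set (a simple rejection step).
import numpy as np

def knn_dist(points, ref, k=10):
    """Distance from each point to its k-th nearest reference point."""
    d = np.linalg.norm(points[:, None, :] - ref[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k]  # index k skips self when points are in ref

def synthesize_outliers(ids, n_out=100, sigma=0.3, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    boundary = ids[np.argsort(-knn_dist(ids, ids))[:n_out]]        # sparse-region IDs
    cand = boundary + sigma * rng.standard_normal(boundary.shape)  # Gaussian proposals
    keep = knn_dist(cand, ids) > np.median(knn_dist(ids, ids))     # reject near-ID points
    return cand[keep]

ids = np.random.default_rng(1).standard_normal((400, 8))  # stand-in ID embeddings
print(synthesize_outliers(ids).shape)
```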
Extensive experiments show that NPOS can achieve superior OOD detection performance, outperforming competitive rivals by a significant margin.", SC2EGSet: StarCraft II Esport Replay and Game-state Dataset,https://openreview.net/forum?id=4GtbV7o7-6l,https://openreview.net/pdf?id=4GtbV7o7-6l,"Infrastructure and a dataset crucial for research in the new and developing field of esports.","As a relatively new form of sport, esports offers unparalleled data availability. Despite the vast amounts of data that are generated by game engines, it can be challenging to extract them and verify their integrity for the purposes of practical and scientific use. Our work aims to open esports to a broader scientific community by supplying raw and pre-processed files from StarCraft II esports tournaments. These files can be used in statistical and machine learning modeling tasks and related to various laboratory-based measurements (e.g., behavioral tests, brain imaging). We have gathered publicly available game-engine generated ""replays"" of tournament matches and performed data extraction and cleanup using a low-level application programming interface (API) parser library. Additionally, we open-sourced and published all the custom tools that were developed in the process of creating our dataset. These tools include PyTorch and PyTorch Lightning API abstractions to load and model the data. Our dataset contains replays from major and premier StarCraft II tournaments since 2016. To prepare the dataset, we processed 55 tournament ""replaypacks"" that contained 17930 files with game-state information. Based on an initial investigation of available StarCraft II datasets, we observed that our dataset is the largest publicly available source of StarCraft II esports data upon its publication. Analysis of the extracted data holds promise for further Artificial Intelligence (AI), Machine Learning (ML), psychological, Human-Computer Interaction (HCI), and sports-related studies in a variety of supervised and self-supervised tasks.","StarCraft II, esports, machine learning, dataset" Latent-space disentanglement with untrained generator networks allows to isolate different motion types in video data,https://openreview.net/forum?id=n6CXWcySQPm,https://openreview.net/pdf?id=n6CXWcySQPm,The paper provides a novel approach to isolate different types of motion in video data using untrained generator networks with disentangled latent space variables,"Isolating different types of motion in video data is a highly relevant problem in video analysis. Applications can be found, for example, in dynamic medical or biological imaging, where the analysis and further processing of the dynamics of interest is often complicated by additional, unwanted dynamics, such as motion of the measurement subject. In this work, it is shown that a representation of video data via untrained generator networks, together with a specific technique for latent space disentanglement that uses minimal, one-dimensional information on some of the underlying dynamics, makes it possible to efficiently isolate different, highly non-linear motion types. In particular, such a representation makes it possible to freeze any selection of motion types, and to obtain accurate independent representations of other dynamics of interest. Obtaining such a representation does not require any pre-training on a training data set, i.e., all parameters of the generator network are learned directly from a single video. 
","video analysis, isolation of motion, generator networks, deep image prior" FV-MgNet: Fully Connected V-cycle MgNet for Interpretable Time Series Forecasting,https://openreview.net/forum?id=1o5SGx71kAO,https://openreview.net/pdf?id=1o5SGx71kAO,"By investigating iterative methods for a constrained linear model, we propose a new class of fully connected V-cycle MgNet for long-term time series forecasting.","By investigating iterative methods for a constrained linear model, we propose a new class of fully connected V-cycle MgNet for long-term time series forecasting, one of the most difficult tasks in forecasting problems. MgNet is a CNN model that was proposed for image classification based on the multigrid (MG) methods for solving discretized partial differential equations (PDEs). We replace the convolutional operations with fully connected operations in the existing MgNet and then apply it to the forecasting problems. Motivated by the V-cycle structure in MG, we further propose the FV-MgNet, a V-cycle version of fully connected MgNet, to extract features hierarchically. By evaluating the performance of FV-MgNet on popular data sets and comparing it with state-of-the-art models, we show that the FV-MgNet achieves better results with less memory usage and faster inference speed. In addition, we also develop ablation experiments to demonstrate the structure of FV-MgNet is the best choice among the many variants.","long time series forecasting, multigrid, MgNet, V-cycle, fully connected layer" Prosody-TTS: Self-Supervised Prosody Pretraining with Latent Diffusion For Text-to-Speech,https://openreview.net/forum?id=y6EnaJlhcWZ,https://openreview.net/pdf?id=y6EnaJlhcWZ,"We propose Prosody-TTS, a TTS model to enhance prosody modeling by introducing self-supervised prosody pre-training and generative latent diffusion.","Expressive text-to-speech aims to generate high-quality samples with rich and diverse prosody, which is hampered by two major challenges: 1) considering the one-to-many mapping problem, prosodic attributes in highly dynamic voices are difficult to capture and model without intonation; 2) the TTS model should learn a diverse latent space and prevent producing dull samples with a collapsed prosodic distribution. This paper proposes Prosody-TTS, a two-stage TTS pipeline that improves prosody modeling and sampling by introducing several components: 1) a self-supervised learning model to derive the prosodic representation without relying on text transcriptions or local prosody ground-truth, which ensures the model covers diverse speaking voices, preventing sub-optimal solutions and distribution collapse; and 2) a latent diffusion model to sample and produce diverse patterns within the learned prosodic space, which prevents TTS models from generating the dull samples with mean distribution. Prosody-TTS achieves high-fidelity speech synthesis with rich and diverse prosodic attributes. Experiments results demonstrate that it surpasses the state-of-the-art models in terms of audio quality and prosody naturalness. The downstream evaluation and ablation studies further demonstrate the effectiveness of each design. 
Audio samples are available at https://Prosody-TTS.github.io/.","Text-to-speech, Prosody modeling, Self-supervised learning, Diffusion probabilistic model" Robust Self-Supervised Learning with Lie Groups,https://openreview.net/forum?id=qWt3YobXdwe,https://openreview.net/pdf?id=qWt3YobXdwe,We explicitly model variation in data using Lie groups to improve self-supervised vision models' robustness to pose changes,"Deep learning has led to remarkable advances in computer vision. Even so, today’s best models are brittle when presented with variations that differ even slightly from those seen during training. Minor shifts in the pose, color, or illumination of an object can lead to catastrophic misclassifications. State-of-the-art models struggle to understand how a set of variations can affect different objects. We propose a framework for instilling a notion of how objects vary in more realistic settings. Our approach applies the formalism of Lie groups to capture continuous transformations to improve models’ robustness to distributional shifts. We apply our framework on top of state-of-the-art self-supervised learning (SSL) models, finding that explicitly modeling transformations with Lie groups leads to substantial performance gains of greater than 10% for MAE on both known instances seen in typical poses now presented in new poses, and on unknown instances in any pose. We also apply our approach to ImageNet, finding that the Lie operator improves performance by almost 4%. These results demonstrate the promise of learning transformations to improve model robustness.","robustness, computer vision, generalization, self-supervised learning, out-of-domain generalization" Self-Paced Learning Enhanced Physics-informed Neural Networks for Solving Partial Differential Equations,https://openreview.net/forum?id=QugfmhDu5Y4,https://openreview.net/pdf?id=QugfmhDu5Y4,,"There is a heated discussion on solving partial differential equations with neural networks. The famous PINN (physics-informed neural networks) has drawn worldwide attention since it was put forward. Despite its success in solving nonlinear partial differential equations, the difficulty of convergence and the inefficiency of the training process are definitely huge concerns. Normally, data for PINN are randomly chosen from a given distribution and fitted to the model without any meaningful ordering. Curriculum Learning is a learning strategy that trains a model from easy samples to hard ones, which represents a meaningful human learning order. Self-paced Learning (SPL) is one of the significant branches of Automatic Curriculum Learning, which takes the example-wise training loss as the Difficulty Measurer. SPL is an efficient strategy for enhancing the convergence rate of numerous models. In this paper, we propose a novel SPL-PINN learning framework, with SPL to accelerate the convergence progress of PINN. We demonstrate the effectiveness of SPL-PINN on a typical parabolic equation and the Burgers equation. ", Moving Beyond Handcrafted Architectures in Self-Supervised Learning,https://openreview.net/forum?id=1KaSx3GrBBm,https://openreview.net/pdf?id=1KaSx3GrBBm,,"The current literature on self-supervised learning (SSL) focuses on developing learning objectives to train neural networks more effectively on unlabeled data. The typical development process involves taking well-established architectures, e.g., ResNet demonstrated on ImageNet, and using them to evaluate newly developed objectives on downstream scenarios.
While convenient, this does not take into account the role of architecture, which has been shown to be crucial in the supervised learning literature. In this work, we establish extensive evidence showing that architecture plays a significant role in SSL. We conduct a large-scale study with over 100 variants of ResNet and MobileNet architectures and evaluate them across 11 downstream scenarios in the SSL setting. We show that there is no one network that performs consistently well across the scenarios. Based on this, we propose to learn not only network weights but also architecture topologies in the SSL regime. We show that ``self-supervised architectures'' significantly outperform popular handcrafted architectures (ResNet-50 and MobileNetV2) on major image classification benchmarks (ImageNet-1K, iNat2021, and more). Our results suggest that it is time to consider moving beyond handcrafted architectures in SSL and start thinking about incorporating architecture search into self-supervised learning objectives.","NAS, Self-supervised learning" Approximation and non-parametric estimation of functions over high-dimensional spheres via deep ReLU networks,https://openreview.net/forum?id=r90KYcuB7JS,https://openreview.net/pdf?id=r90KYcuB7JS,"We study the approximation and statistical estimation of deep ReLU feed-forward neural networks, when the functions of interest are from Sobolev spaces over the high-dimensional sphere.","We develop a new approximation and estimation analysis of deep feed-forward neural networks (FNNs) with the Rectified Linear Unit (ReLU) activation. The functions of interest for the approximation and estimation are assumed to be from Sobolev spaces defined over the $d$-dimensional unit sphere with smoothness index $r>0$. In the regime where $r$ is of constant order (i.e., $r=\mathcal{O}(1)$), it is shown that at most $d^d$ active parameters are required for achieving a $d^{-C}$ approximation rate for some constant $C>0$. In contrast, in the regime where the index $r$ grows in the order of $d$ (i.e., $r=\mathcal{O}(d)$) asymptotically, we prove that the approximation error decays at the rate $d^{-d^{\beta}}$ with $0<\beta<1$, up to some constant factor independent of $d$. The required number of active parameters in the networks for the approximation increases polynomially in $d$ as $d\rightarrow{\infty}$. In addition, it is shown that the bound on the excess risk has a $d^d$ factor when $r=\mathcal{O}(1)$, whereas it has a $d^{\mathcal{O}(1)}$ factor when $r=\mathcal{O}(d)$. We emphasize our findings by making comparisons to the results on approximation and estimation errors of deep ReLU FNNs when functions are from Sobolev spaces defined over the $d$-dimensional cube. Here, we show that with the current state-of-the-art result, the $d^{d}$ factor remains in both the approximation and estimation errors, regardless of the order of $r$. ","Approximation Theory, Non-parametric regression, Asymptotic, Deep ReLU Neural networks, High-dimensional sphere" Embedding Fourier for Ultra-High-Definition Low-Light Image Enhancement,https://openreview.net/forum?id=5N0wtJZ89r9,https://openreview.net/pdf?id=5N0wtJZ89r9,"In this paper, we propose a new solution for UHD LLIE based on the characteristics of the Fourier domain. We also propose the first real UHD LLIE dataset with diverse data.","Ultra-High-Definition (UHD) photos have gradually become the standard configuration in advanced imaging devices.
The new standard unveils many issues in existing approaches for low-light image enhancement (LLIE), especially in dealing with the intricate issue of joint luminance enhancement and noise removal while remaining efficient. Unlike existing methods that address the problem in the spatial domain, we propose a new solution, UHDFour, that embeds the Fourier transform into a cascaded network. Our approach is motivated by a few unique characteristics of the Fourier domain: 1) most luminance information concentrates on amplitudes while noise is closely related to phases, and 2) a high-resolution image and its low-resolution version share similar amplitude patterns. Through embedding Fourier into our network, the amplitude and phase of a low-light image are separately processed to avoid amplifying noise when enhancing luminance. Besides, UHDFour is scalable to UHD images by implementing amplitude and phase enhancement under the low-resolution regime and then adjusting the high-resolution scale with few computations. We also contribute the first real UHD LLIE dataset, UHD-LL, that contains 2,150 low-noise/normal-clear 4K image pairs with diverse darkness and noise levels captured in different scenarios. With this dataset, we systematically analyze the performance of existing LLIE methods for processing UHD images and demonstrate the advantage of our solution. We believe our new framework, coupled with the dataset, would push the frontier of LLIE towards UHD. Code and the dataset will be released.","low-light image enhancement, high-resolution image processing, Fourier transform, benchmark" Population-Based Reinforcement Learning for Combinatorial Optimization Problems,https://openreview.net/forum?id=JRFSLFyYAII,https://openreview.net/pdf?id=JRFSLFyYAII,We present a population-based RL method for CO problems: the training procedure makes the agents complementary to maximize the population's performance.,"Applying reinforcement learning to combinatorial optimization problems is attractive as it obviates the need for expert knowledge or pre-solved instances. However, it is unrealistic to expect an agent to solve these (often NP-)hard problems in a single shot at inference due to their inherent complexity; thus, leading approaches are often augmented with additional search strategies, from stochastic sampling and beam-search to explicit fine-tuning. In this paper, we argue for the benefits of learning a population of complementary agents, which can be simultaneously rolled out at inference. To this end, we introduce Poppy, a simple theoretically grounded training procedure for populations. Instead of relying on a predefined or hand-crafted notion of diversity, Poppy induces an unsupervised specialization targeted solely at maximizing the performance of the whole population. We show that Poppy leads to a set of complementary heuristics, and we obtain state-of-the-art results on three popular NP-hard problems: the traveling salesman (TSP), the capacitated vehicle routing (CVRP), and 0-1 knapsack (KP).
On TSP specifically, Poppy reduces the optimality gap by a factor of 5 while cutting the inference time by more than a factor of 10 compared to previous state-of-the-art reinforcement learning approaches.","reinforcement learning, combinatorial optimization, population" Adversarial Attack Detection Through Network Transport Dynamics,https://openreview.net/forum?id=NYjXrU_f20G,https://openreview.net/pdf?id=NYjXrU_f20G,We propose a detector of adversarial attacks inspired by the dynamic viewpoint of neural networks and a regularization that improves detection of adversarial attacks and test accuracy.,"Adversarial attacks are perturbations to the input that don't change its class for a human observer, but fool a neural network into changing its prediction. In this paper, we propose a detector of such attacks that is based on the view of residual networks as discrete dynamical systems. The detector tells clean inputs from abnormal ones by comparing the discrete vector fields they follow throughout the network's layers before the final classification layer. We compare this detector favorably to other detectors on seen and unseen attacks. We also show that regularizing this vector field during training makes the network more regular on the data distribution's support, thus making the network's activations on clean samples more distinguishable from those of abnormal samples. This regularization of the network's dynamics improves the performance of any detection method that uses the internal embeddings as inputs, while also improving the network's test accuracy.","Adversarial Attacks, Deep Learning, Optimal Transport, Residual Networks, Regularization" On the Relationship Between Adversarial Robustness and Decision Region in Deep Neural Networks,https://openreview.net/forum?id=dlQIh4mUtQ8,https://openreview.net/pdf?id=dlQIh4mUtQ8,We propose the novel concept of the Populated Region Set (PRS) and devise a PRS regularizer leveraging the characteristics of PRS to improve the adversarial robustness without adversarial training.,"In general, Deep Neural Networks (DNNs) are evaluated by the generalization performance measured on unseen data excluded from the training phase. Along with the development of DNNs, the generalization performance converges to the state-of-the-art and it becomes difficult to evaluate DNNs solely based on this metric. The robustness against adversarial attacks has been used as an additional metric to evaluate DNNs by measuring their vulnerability. However, few studies have been performed to analyze adversarial robustness in terms of the geometry of DNNs. In this work, we perform an empirical study to analyze the internal properties of DNNs that affect model robustness under adversarial attacks. In particular, we propose the novel concept of the Populated Region Set (PRS), where training samples are populated more frequently, to represent the internal properties of DNNs in a practical setting. From systematic experiments with the proposed concept, we provide empirical evidence to validate that a low PRS ratio has a strong relationship with the adversarial robustness of DNNs.
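The transport-dynamics detector above views a residual network as a discrete dynamical system; the sketch below illustrates that viewpoint by extracting the per-layer displacements f_l(h_l) as a "vector field" profile and scoring a test input by its deviation from clean-data statistics. The comparison statistic here is an assumption; the paper's detector is more elaborate.

```python
# Residual net as a discrete dynamical system: h_{l+1} = h_l + f_l(h_l).
# Collect the displacements f_l(h_l) per layer and compare profiles.
import torch
import torch.nn as nn

class TinyResNet(nn.Module):
    def __init__(self, dim=32, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(depth))

    def vector_field(self, h):
        fields = []
        for block in self.blocks:
            v = block(h)             # discrete vector field at this layer
            fields.append(v)
            h = h + v                # residual update: one Euler step
        return torch.stack(fields)   # (depth, batch, dim)

net = TinyResNet()
clean = net.vector_field(torch.randn(64, 32)).norm(dim=-1).mean(1)  # per-layer norms
test = net.vector_field(torch.randn(1, 32)).norm(dim=-1).squeeze(1)
score = (test - clean).abs().sum()   # deviation from the clean dynamics profile
print(float(score))
```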
We also devise a PRS regularizer leveraging the characteristics of the PRS to improve adversarial robustness without adversarial training.","Decision Region, Adversarial Robustness, Robust Training" Confounder Identification-free Causal Visual Feature Learning,https://openreview.net/forum?id=8e5xTOIQpj7,https://openreview.net/pdf?id=8e5xTOIQpj7,We propose a causal visual representation learning paradigm (CICF) for generalization without requiring the identification of the existing confounders. ,"Confounders in deep learning are in general detrimental to a model's generalization, as they infiltrate feature representations. Therefore, learning causal features that are free of interference from confounders is important. Most previous causal-learning-based approaches employ the back-door criterion to mitigate the adverse effect of a certain specific confounder, which requires the explicit identification of the confounder. However, in real scenarios, confounders are typically diverse and difficult to identify. In this paper, we propose a novel Confounder Identification-free Causal Visual Feature Learning (CICF) method, which obviates the need for identifying confounders. CICF models the interventions among different samples based on the front-door criterion, and then approximates the global-scope intervening effect upon the instance-level interventions from the perspective of optimization. In this way, we aim to find a reliable optimization direction, which avoids the intervening effects of confounders, to learn causal features. Furthermore, we uncover the relation between CICF and the popular meta-learning strategy MAML, and provide an interpretation of why MAML works from the theoretical perspective of causal learning for the first time. Thanks to the effective learning of causal features, our CICF enables models to have superior generalization capability. Extensive experiments on domain generalization benchmark datasets demonstrate the effectiveness of our CICF, which achieves state-of-the-art performance.","Domain Generalization, Causal Learning, Front-door criterion, Confounder identification-free" Enhanced Temporal Knowledge Embeddings with Contextualized Language Representations,https://openreview.net/forum?id=ff_18Qwm13Bp,https://openreview.net/pdf?id=ff_18Qwm13Bp,"We use textual knowledge to enhance the time-evolving knowledge graph embedding, while preserving its temporal nature.","World knowledge exists in both structured (tables, knowledge graphs) and unstructured forms (texts). Recently, there have been extensive research efforts in the integration of structured factual knowledge and unstructured textual knowledge. However, most studies focus on incorporating static factual knowledge into pre-trained language models, while there is less work on enhancing temporal knowledge graph embedding using textual knowledge. Existing integration approaches cannot apply to temporal knowledge graphs (tKGs) since they often assume knowledge embedding is time-invariant. In fact, the entity embedding in tKG embedding models usually evolves over time, which poses the challenge of aligning temporally relevant textual information with entities. To this end, we propose Enhanced Temporal Knowledge Embeddings with Contextualized Language Representations (ECOLA), which uses tKG quadruples as an implicit measure to temporally align textual data and the time-evolving entity representations, and uses a novel knowledge-text prediction task to inject textual information into temporal knowledge embedding.
ECOLA jointly optimizes the knowledge-text prediction objective and the temporal knowledge embedding objective, and thus can simultaneously take full advantage of textual and structured knowledge. Since existing datasets do not provide tKGs with aligned textual data, we introduce three new datasets for training and evaluating ECOLA. Experimental results on the temporal knowledge graph completion task show that ECOLA outperforms state-of-the-art tKG embedding models by a large margin.","structured and unstructured knowledge integration, temporal knowledge embedding, pre-trained language model, temporal reasoning" Learning Adversarial Linear Mixture Markov Decision Processes with Bandit Feedback and Unknown Transition,https://openreview.net/forum?id=sVU54nyaA9K,https://openreview.net/pdf?id=sVU54nyaA9K,We take the first step toward establishing a provably efficient algorithm for adversarial linear mixture MDPs with bandit feedback and unknown transition.,"We study reinforcement learning (RL) with linear function approximation, unknown transition, and adversarial losses in the bandit feedback setting. Specifically, the unknown transition probability function is a linear mixture model \citep{AyoubJSWY20,ZhouGS21,HeZG22} with a given feature mapping, and the learner only observes the losses of the experienced state-action pairs instead of the whole loss function. We propose an efficient algorithm LSUOB-REPS which achieves $\widetilde{O}(dS^2\sqrt{K}+\sqrt{HSAK})$ regret guarantee with high probability, where $d$ is the ambient dimension of the feature mapping, $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the episode length and $K$ is the number of episodes. Furthermore, we also prove a lower bound of order $\Omega(dH\sqrt{K}+\sqrt{HSAK})$ for this setting. To the best of our knowledge, we take the first step toward establishing a provably efficient algorithm with a sublinear regret guarantee in this challenging setting and solve the open problem of \citet{HeZG22}.","Reinforcement learning theory, Reinforcement learning with adversarial losses, Reinforcement learning with linear function approximation" Weakly Supervised Knowledge Transfer with Probabilistic Logical Reasoning for Object Detection,https://openreview.net/forum?id=4yqxDCbzS98,https://openreview.net/pdf?id=4yqxDCbzS98,"In this work, we propose ProbKT, a framework based on probabilistic logical reasoning to train object detection models with weak supervision, by transferring knowledge from a source domain where instance-level annotations are available.","Training object detection models usually requires instance-level annotations, such as the positions and labels of all objects present in each image. Such supervision is unfortunately not always available and, more often, only image-level information is provided, also known as weak supervision. Recent works have addressed this limitation by leveraging knowledge from a richly annotated domain. However, the scope of weak supervision supported by these approaches has been very restrictive, preventing them from using all available information. In this work, we propose ProbKT, a framework based on probabilistic logical reasoning to train object detection models with arbitrary types of weak supervision. We empirically show on different datasets that using all available information is beneficial as our ProbKT leads to significant improvement on the target domain and better generalisation compared to existing baselines.
We also showcase the ability of our approach to handle complex logic statements as the supervision signal.","weak supervision, knowledge transfer, object detection, probabilistic logical reasoning" A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification,https://openreview.net/forum?id=YnkGMIh0gvX,https://openreview.net/pdf?id=YnkGMIh0gvX,We present a holistic perspective on the task of failure detection including a large-scale empirical study for the first time enabling benchmarking of confidence scoring functions w.r.t. all relevant methods and distribution shifts. ,"Reliable application of machine learning-based decision systems in the wild is one of the major challenges currently investigated by the field. A large portion of established approaches aims to detect erroneous predictions by means of assigning confidence scores. This confidence may be obtained by either quantifying the model's predictive uncertainty, learning explicit scoring functions, or assessing whether the input is in line with the training distribution. Curiously, while these approaches all claim to address the same eventual goal of detecting failures of a classifier upon real-life application, they currently constitute largely separated research fields with individual evaluation protocols, which either exclude a substantial part of relevant methods or ignore large parts of relevant failure sources. In this work, we systematically reveal current pitfalls caused by these inconsistencies and derive requirements for a holistic and realistic evaluation of failure detection. To demonstrate the relevance of this unified perspective, we present a large-scale empirical study for the first time enabling benchmarking of confidence scoring functions w.r.t. all relevant methods and failure sources. The revelation of a simple softmax response baseline as the overall best performing method underlines the drastic shortcomings of current evaluation in the plethora of publicized research on confidence scoring. Code and trained models are at https://github.com/kjdhfg/fd-shifts","failure detection, out-of-distribution detection, predictive uncertainty quantification, selective classification, robustness, method evaluation" Signs in the Lottery: Structural Similarities Between Winning Tickets,https://openreview.net/forum?id=3l9mLzLa0BA,https://openreview.net/pdf?id=3l9mLzLa0BA,Winning tickets show structural similarities when taking signs of connections into account.,"Winning tickets are sparse subnetworks of a deep network that can be trained in isolation to the same performance as the full network. Winning tickets have been found in many different contexts; however, their structural characteristics are not well understood. We propose that the signs of the connections in winning tickets play a crucial role. We back this claim by introducing a sign-based structural comparison metric that makes it possible to distinguish winning tickets from other sparse networks. We further analyze typical (signed) patterns in convolutional kernels of winning tickets and find structures that resemble patterns found in trained networks.","lottery ticket hypothesis, sparse networks, structural similarity, deep learning" Computational Doob h-transforms for Online Filtering of Discretely Observed Diffusions,https://openreview.net/forum?id=_02M2MYThLz,https://openreview.net/pdf?id=_02M2MYThLz,,"This paper is concerned with online filtering of discretely observed nonlinear diffusion processes.
Our approach is based on the fully adapted auxiliary particle filter, which involves Doob's $h$-transforms that are typically intractable. We propose a computational framework to approximate these $h$-transforms by solving the underlying backward Kolmogorov equations using nonlinear Feynman-Kac formulas and neural networks. The methodology allows one to train a locally optimal particle filter prior to the data-assimilation procedure. Numerical experiments illustrate that the proposed approach can be orders of magnitude more efficient than the bootstrap particle filter in the regime of highly informative observations, when the observations are extreme under the model, and if the state dimension is large.","diffusion, filtering, monte-carlo, particle-filters, BSDE" Adversarial Attack Detection Under Realistic Constraints,https://openreview.net/forum?id=aco19TMw9gA,https://openreview.net/pdf?id=aco19TMw9gA,"We propose a simple, real-time, softmax-based detection method for adversarial samples.","While adversarial attacks are a serious threat to neural network safety, existing defense mechanisms remain very limited regarding their applicability to real-world settings. Any industrial-driven attack detector is expected to meet three unavoidable requirements: (R1) being adapted to the black-box scenario where the user has only access to the predicted probabilities, (R2) making fast inference and (R3) not involving any training phase. In this paper, we introduce REFEREE, the first detector that meets all these requirements while improving state-of-the-art performance. It leverages the concept of information projections (I-projection), which generalizes ideas coming from out-of-distribution detection and makes it possible to extract relevant information contained in the softmax outputs of a network. Our extensive experiments demonstrate that REFEREE improves upon existing methods while considerably reducing the inference time: it requires less than 0.05 seconds per test input, which is up to 400 times faster than former methods. This makes REFEREE an excellent candidate for adversarial attack detection in real-world applications.","Adversarial attacks, Detection, Vision transformers, Safety AI" Reconciling feature sharing and multiple predictions with MIMO Vision Transformers,https://openreview.net/forum?id=pOq1HuMI8Dz,https://openreview.net/pdf?id=pOq1HuMI8Dz,"We propose MixViT, an inexpensive framework that improves vision transformers by training multiple subnetworks at the end of the model through multi-input multi-output training.","Multi-input multi-output training improves network performance by optimizing multiple subnetworks simultaneously. In this paper, we propose MixViT, the first MIMO framework for vision transformers that takes advantage of ViTs’ innate mechanisms to share features between subnetworks. This is in stark contrast to traditional MIMO CNNs that are limited by their inability to mutualize features. Unlike them, MixViT only separates subnetworks in the last layers thanks to a novel source attribution that ties tokens to specific subnetworks. As such, we retain the benefits of multi-output supervision while training strong features useful to both subnetworks.
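For reference alongside the filtering paper above, this is a bare-bones bootstrap particle filter for a 1-D Ornstein-Uhlenbeck-style model, the baseline that the learned Doob h-transform proposals are designed to improve upon; the model and all constants are toy choices.

```python
# Bootstrap particle filter: propagate particles blindly through the dynamics,
# reweight by the observation likelihood, and resample.
import numpy as np

def bootstrap_pf(obs, n_particles=500, dt=0.1, obs_std=0.5, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.standard_normal(n_particles)
    means = []
    for y in obs:
        # Euler-Maruyama step of the toy diffusion dX = -X dt + dW
        x = x - x * dt + np.sqrt(dt) * rng.standard_normal(n_particles)
        logw = -0.5 * ((y - x) / obs_std) ** 2            # Gaussian likelihood
        w = np.exp(logw - logw.max()); w /= w.sum()
        means.append(np.sum(w * x))                        # filtering mean
        x = x[rng.choice(n_particles, n_particles, p=w)]   # multinomial resampling
    return np.array(means)

obs = np.sin(np.linspace(0, 3, 30)) + 0.5 * np.random.default_rng(1).standard_normal(30)
print(bootstrap_pf(obs)[:5])
```

The inefficiency this baseline exhibits under highly informative observations (most particles receive negligible weight) is exactly what a better, guided proposal mitigates.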
We verify that MixViT leads to significant gains across multiple architectures (ConViT, CaiT) and datasets (CIFAR, TinyImageNet, ImageNet-100, and ImageNet-1k) by fitting multiple subnetworks at the end of a base model.","Deep learning, Computer vision, Vision transformers, Classification, Multi-input multi-output" A Neural Mean Embedding Approach for Back-door and Front-door Adjustment,https://openreview.net/forum?id=rLguqxYvYHB,https://openreview.net/pdf?id=rLguqxYvYHB,,"We consider the estimation of average and counterfactual treatment effects, under two settings: back-door adjustment and front-door adjustment. The goal in both cases is to recover the treatment effect without having access to a hidden confounder. This objective is attained by first estimating the conditional mean of the desired outcome variable given relevant covariates (the ``first stage"" regression), and then taking the (conditional) expectation of this function as a ``second stage"" procedure. We propose to compute these conditional expectations directly using a regression function to the learned input features of the first stage, thus avoiding the need for sampling or density estimation. All functions and features (and in particular, the output features in the second stage) are neural networks learned adaptively from data, with the sole requirement that the final layer of the first stage should be linear. The proposed method is shown to converge to the true causal parameter, and outperforms the recent state-of-the-art methods on challenging causal benchmarks, including settings involving high-dimensional image data. ", Chopping Formers is what you need in Vision,https://openreview.net/forum?id=R4ETr5gcg5v,https://openreview.net/pdf?id=R4ETr5gcg5v,"In this work, we unify prior methods and present a new efficient factorization for a general fully-connected and dynamic layer.","This work presents a new dynamic and fully-connected layer (DFC) that generalizes existing layers and is free from hard inductive biases. It then describes how to factorize the DFC weights efficiently. Using the Einstein convention as a framework, we define the DFC as a fully connected layer with the weight tensor created as a function of the input. The DFC is the non-linear extension of the most general case of a linear layer for neural networks, and therefore all major neural network layers, from convolution to self-attention, are particular cases of DFCs. A stack of DFCs interleaved with non-linearities defines a new super-class of neural networks: \emph{Formers}. The DFC has four major characteristics: it is Dynamic and Spatially Adaptive, it has a Global Receptive Field, and it mixes all the available channels' information. In their complete form, DFCs are powerful layers free from hard inductive biases, but their use is limited in practice by their prohibitive computational cost. To overcome this limitation and deploy DFCs in real computer-vision applications, we propose to use the CP decomposition, showing that it is possible to factorize the DFC layer into smaller, manageable blocks without losing any representational power. Finally, we propose ChoP'D Former, an architecture making use of a new decomposition of the DFC layer into five sequential operations, each incorporating one characteristic of the original DFC tensor. ChoP'D Former leverages dynamic gating and integral images, achieves global spatial reasoning with constant time complexity, and has a receptive field that can adapt depending on the task.
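The two-stage structure described in the mean-embedding abstract above can be illustrated with a deliberately simplified back-door adjustment, where the second stage is a plain empirical average over covariates rather than a learned mean embedding; the data-generating process and regressor below are toy assumptions.

```python
# Toy back-door adjustment: fit m(t, x) ~ E[Y | T=t, X=x] (first stage), then
# average m(t, x_i) over the empirical covariate distribution (second stage)
# to estimate E[Y | do(T=t)] and the average treatment effect.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n = 2000
x = rng.standard_normal(n)                           # observed confounder
t = (x + rng.standard_normal(n) > 0).astype(float)   # treatment depends on x
y = 2.0 * t + x + 0.1 * rng.standard_normal(n)       # true effect of t is 2.0

stage1 = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
stage1.fit(np.c_[t, x], y)                           # first-stage regression

ate = (stage1.predict(np.c_[np.ones(n), x]) -
       stage1.predict(np.c_[np.zeros(n), x])).mean() # adjust over covariates
print(ate)  # should be close to 2.0
```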
Extensive experiments demonstrate that our ChoP'D Former is competitive with state-of-the-art results on three well-known computer vision benchmarks, namely Large-Scale Classification, Object Detection, and Instance Segmentation, obviating the need for expensive architecture search and hyperparameter optimization. ","Transformers, Tensor Decomposition, Deep learning Architectures" Towards Estimating Transferability using Hard Subsets,https://openreview.net/forum?id=hRfJzvTYvD-,https://openreview.net/pdf?id=hRfJzvTYvD-,"We propose HASTE, a strategy that ensures better transferability estimation using just a hard subset of target data.","As transfer learning techniques are increasingly used to transfer knowledge from the source model to the target task, it becomes important to quantify which source models are suitable for a given target task without performing computationally expensive fine-tuning. In this work, we propose HASTE (HArd Subset TransfErability), a new strategy to estimate the transferability of a source model to a particular target task using only a harder subset of target data. By leveraging the model’s internal and output representations, we introduce two techniques – one class-agnostic and another class-specific – to identify harder subsets and show that HASTE can be used with any existing transferability metric to improve their reliability. We further analyze the relation between HASTE and the optimal average log-likelihood as well as negative conditional entropy and empirically validate our theoretical bounds. Our experimental results across multiple source model architectures, target datasets, and transfer learning tasks show that HASTE-modified metrics are consistently better or on par with the state-of-the-art transferability metrics.","Transfer Learning, Transferability, Hard Subsets" Data Pricing Mechanism Based on Property Rights Compensation Distribution,https://openreview.net/forum?id=JInmhyuvn6,This paper proposes the first data valuation mechanism based on modern property rights theory. We integrate ownership to clarify ownership and estimate its value while using the core instead of the Shapley value to assign compensation.,"While machine learning (ML) benefits from data, it also faces the challenges of ambiguous data ownership, including privacy violations and increased costs of using data. Yet existing approaches to data valuation may focus on preventing privacy breaches but do not truly protect data ownership. This is because a data trading marketplace that protects data ownership should achieve this goal: once data is traded, its ownership does not transfer to a new owner but merely enlarges its coverage. Considering that the transfer of property rights in the process of data trading makes compensation necessary, this paper proposes the first data valuation mechanism based on modern property rights theory. Specifically, we propose the integration of property rights to improve the final revenue of the entire workflow, called the “data chain”, while compensating process executors who lost ownership after integration. Then, we consider the expectations of both the integrator and the integrated party during the compensation allocation. For the former, we apply compound interest to assess a total compensation equivalent to the time value for the data chain. For the latter, we respect and meet their expectations as much as possible.
To achieve this, we provide a framework based on the Least-core to assign the compensation and prove that our framework remains effective compared to existing algorithms. Finally, to cope with more complex situations, we adjust the traditional Least-core and demonstrate theoretically and experimentally that the compensation mechanism is feasible and effective in solving the data pricing problem.","data valuation, game theory, data ownership, modern property rights theory" Incremental Unified Parameter Additional Tuning with Basic Memory Replaying,https://openreview.net/forum?id=CS7dB_FnGx,https://openreview.net/pdf?id=CS7dB_FnGx,We propose a novel method for class incremental learning by tuning a unified additional parameter structure and replaying basic memory.,"Class incremental learning (CIL) aims to develop an open intelligence system that can continuously learn new concepts from new tasks while retaining the knowledge to distinguish between new and old concepts. Recently, parameter-additional-tuning methods (PAT) have successfully alleviated catastrophic forgetting by starting from a well-pre-trained model and only allowing a few additional parameters to be trained. However, the contradiction between stability and plasticity and the lack of inter-task features still challenge PAT-based CIL methods. To address these, we propose unified PAT and basic memory replaying (BMR). On the one hand, unified PAT transfers the model to sequentially arriving downstream tasks based on a fixed pre-trained vision transformer by unifying the prompt-based and the adapter-based methods, offering more diversified plastic structures to efficiently capture more useful features without large-scale parameters. On the other hand, BMR synthesizes on-call virtual old samples with a fixed-size basic memory to create a global task that covers all the sub-tasks, which makes inter-task features more learnable without a large memory budget. Extensive experiments prove the effectiveness of our method.","class incremental learning, parameter-additional-tuning, basic memory replaying" Knowledge-Driven Active Learning,https://openreview.net/forum?id=hmpjFiUly1,https://openreview.net/pdf?id=hmpjFiUly1,Exploiting simple domain-knowledge within an active learning strategy to minimize the amount of labelled data and whiten the labelling process.,"The deployment of Deep Learning (DL) models is still precluded in those contexts where the amount of supervised data is limited. To answer this issue, active learning strategies aim at minimizing the amount of labelled data required to train a DL model. Most active strategies are based on the selection of uncertain samples, often restricted to those lying close to the decision boundary. These techniques are theoretically sound, but an understanding of the selected samples based on their content is not straightforward, further driving non-experts to consider DL as a black-box. For the first time, here we propose a different approach, taking into consideration common domain-knowledge and enabling non-expert users to train a model with fewer samples. In our Knowledge-driven Active Learning (KAL) framework, rule-based knowledge is converted into logic constraints and their violation is checked as a natural guide for sample selection. We show that even simple relationships among data and output classes offer a way to spot predictions for which the model needs supervision.
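The KAL abstract above turns rules into logic constraints whose violation guides sample selection; the snippet below sketches that loop for one made-up multi-label rule ("label A implies label B") scored with a product T-norm. The rule, T-norm, and selection budget are illustrative assumptions.

```python
# Knowledge-driven selection sketch: rank unlabeled samples by how strongly
# the model's predictions violate a domain rule, and label the worst offenders.
import numpy as np

def violation(probs, a=0, b=1):
    """Fuzzy violation of 'label A implies label B': high p(A), low p(B)."""
    return probs[:, a] * (1.0 - probs[:, b])

def select_for_labeling(probs, budget=10):
    return np.argsort(-violation(probs))[:budget]  # most-violating samples first

rng = np.random.default_rng(0)
probs = 1.0 / (1.0 + np.exp(-rng.standard_normal((500, 4))))  # multi-label scores
print(select_for_labeling(probs))
```

Unlike uncertainty sampling, this criterion is readable by a domain expert: every selected sample comes with the specific rule it breaks.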
The proposed approach (i) outperforms many active learning strategies in terms of average F1 score, particularly in those contexts where domain knowledge is rich. Furthermore, we empirically demonstrate that (ii) KAL discovers data distributions lying far from the initial training data, unlike uncertainty-based strategies, (iii) it assures domain experts that the provided knowledge is respected by the model on test data, and (iv) it can be employed even when domain-knowledge is not available by coupling it with an XAI technique. Finally, we show that KAL is also suitable for object recognition tasks and that its computational demand is low, unlike many recent active learning strategies.","Active Learning, Knowledge-aided Learning, FOL, Human in the loop, Hybrid Model" FastDiff 2: Dually Incorporating GANs into Diffusion Models for High-Quality Speech Synthesis,https://openreview.net/forum?id=-x5WuMO4APy,https://openreview.net/pdf?id=-x5WuMO4APy,"We propose FastDiff 2, a conditional diffusion model to trade off diversity for quality and speed by incorporating GANs into diffusion models.","FastDiff, as a class of denoising probabilistic models, has recently achieved impressive performances in speech synthesis. It utilizes a noise predictor to learn a tight inference schedule for skipping denoising steps. Despite the successful speedup of FastDiff, there is still room for improvements, e.g., further optimizing the speed-quality trade-off and accelerating the DDPM training procedure. After analyzing GANs and diffusion models in conditional speech synthesis, we find that GANs produce samples but do not cover the whole distribution, and that the coverage degree does not distinctly impact audio quality. Inspired by these observations, we propose to trade off diversity for quality and speed by incorporating GANs into diffusion models, introducing two GAN-empowered modeling perspectives: (1) FastDiff 2 (Diff-GAN), whose denoising distribution is parametrized by conditional GANs; and (2) FastDiff 2 (GAN-Diff), in which the denoising model is treated as a generator in GAN for adversarial training. Unlike the acceleration methods based on skipping the denoising steps, FastDiff 2 provides a principled way to speed up both the training and inference processes. Experimental results demonstrate that both variants of FastDiff 2 enjoy an efficient 4-step sampling process as in FastDiff yet demonstrate superior sample quality. Audio samples are available at https://FastDiff2.github.io/.","Speech synthesis, Neural vocoder, Diffusion probabilistic model, Generative adversarial network" When Neural ODEs meet Neural Operators,https://openreview.net/forum?id=P5ZTXA7zy6,https://openreview.net/pdf?id=P5ZTXA7zy6,,"Differential equation-based neural networks perform well in a variety of deep learning fields. Among those many methods, neural ordinary differential equations (NODEs) are one of the most fundamental works. NODEs have been applied to general downstream tasks such as image classification, time series classification, and image generation. The ODE function of NODEs can be understood as a special type of differential operator, a view that had been overlooked before. In this paper, therefore, we study the feasibility of modeling NODEs (or the ODE function of NODEs) as neural operators. Our neural operator-based methods are more rigorous than existing approaches when it comes to learning the differential operator (or the ODE function). 
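For reference, a minimal sketch of the plain NODE baseline being generalized here: the ODE function f is an MLP and the hidden state is integrated with fixed-step Euler. This is an illustrative stand-in, not the branched Fourier neural operator introduced below.

```python
import torch
import torch.nn as nn

class NeuralODE(nn.Module):
    """Minimal neural ODE: h'(t) = f(h(t), t), integrated with fixed-step Euler."""
    def __init__(self, dim, hidden=64, steps=20, t1=1.0):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim + 1, hidden), nn.Tanh(),
                               nn.Linear(hidden, dim))
        self.steps, self.t1 = steps, t1

    def forward(self, h):
        dt = self.t1 / self.steps
        t = torch.zeros(h.size(0), 1)
        for _ in range(self.steps):
            h = h + dt * self.f(torch.cat([h, t], dim=1))  # Euler step
            t = t + dt
        return h

h0 = torch.randn(8, 16)
print(NeuralODE(16)(h0).shape)   # torch.Size([8, 16])
```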
To this end, we design a new neural operator structure called branched Fourier neural operator (BFNO), which is suitable for modeling the ODE function. It shows improved performance on several general machine learning tasks, compared to various existing NODE models.", TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation,https://openreview.net/forum?id=UVAmFAtC5ye,https://openreview.net/pdf?id=UVAmFAtC5ye,"We propose TranSpeech, a speech-to-speech translation model with bilateral perturbation to address multimodality and parallel decoding to reduce inference latency.","Direct speech-to-speech translation (S2ST) with discrete units leverages recent progress in speech representation learning, where a sequence of discrete representations, derived in a self-supervised manner, is predicted by the model and passed to a vocoder for speech synthesis. It still faces the following challenges: 1) Acoustic multimodality: the discrete units derived from speech with the same content can be nondeterministic due to acoustic properties (e.g., rhythm, pitch, and energy), which causes deterioration of translation accuracy; 2) High latency: current S2ST systems utilize autoregressive models which predict each unit conditioned on the sequence previously generated, failing to take full advantage of parallelism. In this work, we propose TranSpeech, a speech-to-speech translation model with bilateral perturbation. To alleviate the acoustic multimodality problem, we propose bilateral perturbation (BiP), which consists of the style normalization and information enhancement stages, to learn only the linguistic information from speech samples and generate more deterministic representations. With reduced multimodality, we step forward and become the first to establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices and produces high-accuracy results in just a few cycles. Experimental results on three language pairs demonstrate that BiP yields an improvement of 2.9 BLEU on average compared with a baseline textless S2ST model. Moreover, our parallel decoding shows a significant reduction of inference latency, enabling a speedup of up to 21.4x over the autoregressive technique. Audio samples are available at https://TranSpeech.github.io/.","Speech-to-speech translation, Multimodal challenge, Non-autoregressive generation" D4FT: A Deep Learning Approach to Kohn-Sham Density Functional Theory,https://openreview.net/forum?id=aBWnqqsuot7,https://openreview.net/pdf?id=aBWnqqsuot7,This paper proposes a deep learning approach to solving Kohn-Sham Density Functional Theory.,"Kohn-Sham Density Functional Theory (KS-DFT) has been traditionally solved by the Self-Consistent Field (SCF) method. Behind the SCF loop is the physics intuition of solving a system of non-interacting single-electron wave functions under an effective potential. In this work, we propose a deep learning approach to KS-DFT. First, in contrast to the conventional SCF loop, we propose to directly minimize the total energy by reparameterizing the orthogonality constraint as a feed-forward computation. We prove that such an approach has the same expressivity as the SCF method, yet reduces the computational complexity from $\mathcal{O}(N^4)$ to $\mathcal{O}(N^3)$. Second, the numerical integration, which involves a summation over the quadrature grids, can be amortized across the optimization steps. At each step, stochastic gradient descent (SGD) is performed with a sampled minibatch of the grids. 
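The two ingredients just described, a feed-forward orthogonality reparameterization and stochastic quadrature over sampled grid points, can be sketched as follows. The QR-based reparameterization and the toy "energy" integrand are illustrative assumptions, not the paper's actual parameterization or a real KS-DFT functional.

```python
import torch

def orthonormal(W):
    # Feed-forward reparameterization: map an unconstrained matrix to one
    # with orthonormal columns via (differentiable) QR decomposition.
    Q, _ = torch.linalg.qr(W)
    return Q

# Toy setup: 10k quadrature points on [-5, 5]; the integrand below is a
# stand-in for the total-energy density, not real quantum chemistry.
grid = torch.linspace(-5.0, 5.0, 10_000)
w = torch.full((10_000,), 10.0 / 10_000)          # quadrature weights
W = torch.randn(32, 4, requires_grad=True)        # unconstrained parameters
basis = lambda x: torch.sin(x.unsqueeze(1) * torch.arange(1.0, 33.0))
opt = torch.optim.SGD([W], lr=1e-2)

for step in range(200):
    idx = torch.randint(0, grid.numel(), (256,))  # sampled minibatch of grid points
    C = orthonormal(W)                            # orthonormal "orbital" coefficients
    phi = basis(grid[idx]) @ C                    # (256, 4) orbital values
    # Unbiased minibatch estimate of the full quadrature sum.
    energy = (w[idx] * (phi ** 2).sum(dim=1)).sum() * grid.numel() / 256
    opt.zero_grad(); energy.backward(); opt.step()
```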
Extensive experiments are carried out to demonstrate the advantage of our approach in terms of efficiency and stability. In addition, we show that our approach enables us to explore more complex neural-based wave functions. ","AI for Science, Quantum Chemistry, Density Functional Theory, Deep Learning, Kohn-Sham Equation." FFCV: Accelerating Training by Removing Data Bottlenecks,https://openreview.net/forum?id=Ew9gIwAQ7wr,https://openreview.net/pdf?id=Ew9gIwAQ7wr,"We present FFCV, an easy-to-use yet highly optimized library for training machine learning models.","We present FFCV, a library for easy, fast, resource-efficient training of machine learning models. FFCV speeds up model training by eliminating (often subtle) data bottlenecks from the training process. In particular, we combine techniques such as an efficient file storage format, caching, data pre-loading, asynchronous data transfer, and just-in-time compilation to (a) make data loading and transfer significantly more efficient, ensuring that GPUs can reach full utilization; and (b) offload as much data processing as possible to the CPU asynchronously, freeing up GPU capacity for training. Using FFCV, we train ResNet-18 and ResNet-50 on the ImageNet dataset with a state-of-the-art tradeoff between accuracy and training time. For example, across the range of ResNet-50 models we test, we obtain the same accuracy as the best baselines in half the time. We demonstrate FFCV's performance, ease-of-use, extensibility, and ability to adapt to resource constraints through several case studies.","infrastructure, data loading, fast training, library, usability" Warping the Space: Weight Space Rotation for Class-Incremental Few-Shot Learning,https://openreview.net/forum?id=kPLzOfPfA2l,https://openreview.net/pdf?id=kPLzOfPfA2l,This paper introduces the concept of weight space rotation, which changes the parameter space itself to solve the incremental few-shot learning problem.,"Class-incremental few-shot learning, where new sets of classes are provided sequentially with only a few training samples, presents a great challenge due to catastrophic forgetting of old knowledge and overfitting caused by lack of data. During finetuning on new classes, the performance on previous classes deteriorates quickly even when only a small fraction of parameters are updated, since the previous knowledge is broadly associated with most of the model parameters in the original parameter space. In this paper, we introduce WaRP, the \textit{weight space rotation process}, which transforms the original parameter space into a new space so that we can push most of the previous knowledge compactly into only a few important parameters. By properly identifying and freezing these key parameters in the new weight space, we can finetune the remaining parameters without affecting the knowledge of previous classes. As a result, WaRP provides additional room for the model to effectively learn new classes in future incremental sessions. Experimental results confirm the effectiveness of our solution and show the improved performance over the state-of-the-art methods.","incremental few-shot learning, catastrophic forgetting, parameter space, weight space rotation" Learning to Reason and Act in Cascading Processes,https://openreview.net/forum?id=hNmf1gnVllX,https://openreview.net/pdf?id=hNmf1gnVllX,We consider the task of controlling the behavior of a cascading process with an intervention at a single point in time. 
We propose to learn a principled probabilistic scoring function that allows searching efficiently over the space of interventions.,"Training agents to control a dynamic environment is a fundamental task in AI. In many environments the dynamics can be summarized by a small set of events that capture the semantic behavior of the system. Typically, these events form chains or cascades. We often wish to change the system behavior using a single intervention that propagates through the cascade. For instance, one may trigger a biochemical cascade to switch the state of a cell, or reroute a truck in a logistics chain to meet an unexpected, urgent delivery. We introduce a new supervised learning setup called ""Cascade"". An agent observes a system with known dynamics evolving from some initial state. It is given a structured semantic instruction and needs to make an intervention that triggers a cascade of events, such that the system reaches an alternative (counterfactual) behavior. We provide a test-bed for this problem, consisting of physical objects. We combine semantic tree search with an event-driven forward model and devise an algorithm that learns to efficiently search in exponentially large semantic trees of continuous spaces. We demonstrate that our approach learns to effectively follow instructions to intervene in previously unseen complex scenes. When provided an observed cascade of events, it can also reason about alternative outcomes.","cascading processes, intervention, reasoning, tree search" Noether Embeddings: Fast Temporal Association Mining,https://openreview.net/forum?id=d1LQQzVFkSL,https://openreview.net/pdf?id=d1LQQzVFkSL,,"Simple and expressive in representing multi-relational events, temporal knowledge graphs (TKGs) have attracted increasing research interest. While temporal associations (TAs) reveal cause-effect relationships between event pairs across time, to the best of our knowledge, no previous work has paid attention to exploring such basic but meaningful regularities on TKGs. Despite its importance, temporal association mining (TAM) is prohibitively challenging due to its enormous search space. Inspired by Noether’s theorem in theoretical physics, we reduce the problem of TAM to searching for conserved quantities, with a search space invariant to the number of related events. We develop Noether Embeddings, an embedding model that jointly encodes the absolute time of events and the relative interval of event pairs, respectively, by rotating complex vectors. We prove theoretically and show experimentally that our embedding model enforces convergence of the decoding results to a conserved quantity, implying time-translation symmetries within associated event pairs. Using Noether Embeddings, a three-stage TAM framework is developed, consisting of the encoding, decoding, and selection stages. We successfully mine TAs with both semantic interpretability and statistical reliability. 
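A small sketch of the complex-rotation time encoding idea: rotating a base embedding by the event time makes the Hermitian inner product of two embeddings depend only on their relative interval, which is exactly a time-translation symmetry. The dimensionality and frequency parameterization here are hypothetical.

```python
import torch

dim = 32
omega = torch.randn(dim)                        # per-dimension frequencies
base = torch.randn(dim, dtype=torch.cfloat)     # base (complex) event embedding

def embed(t):
    # Rotate the base embedding by time t in each complex plane: e^{i*omega*t} * base.
    return base * torch.exp(1j * omega * t)

# The score depends only on the interval t2 - t1, so it is invariant
# to shifting both events by the same amount.
t1, t2 = 3.0, 7.5
score_a = (embed(t1).conj() * embed(t2)).sum()
score_b = (embed(t1 + 10).conj() * embed(t2 + 10)).sum()
print(torch.allclose(score_a, score_b, atol=1e-3))  # True
```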
Experiments show that our method achieves a 23.7 times speedup over an optimized search algorithm for TAM on the GDELT dataset.","temporal knowledge graph, association rule mining, representation learning, Noether's theorem" Searching optimal adjustment features for treatment effect estimation,https://openreview.net/forum?id=BEpJFTH50iT,https://openreview.net/pdf?id=BEpJFTH50iT,We construct a reinforcement-learning based framework to search the optimal adjustment features for more precise treatment effect estimation.,"Most efforts devoted to causal inference focus on controlling the adjustment features to further alleviate the confounding effect. In realistic applications, the collected covariates often contain variables correlated with only the treatment (e.g., instrumental variables) or only the outcome (e.g., precision variables). Due to the absence of prior knowledge, the brute-force approach for the practitioner is to include every covariate for adjustment. However, previous literature shows that adjusting the former covariates (treatment-only) hurts the treatment effect estimation, while adjusting the latter covariates (outcome-only) brings benefits. Consequently, it is meaningful to find an optimal adjustment set rather than resort to the brute-force approach for more efficient treatment effect estimation. To this end, we establish a computationally tractable variance metric to measure the optimality of the adjustment set. From the non-parametric viewpoint, we theoretically show that our metric is minimized if and only if the adjustment features contain the confounders and the outcome-only variables. As optimizing the proposed variance metric is a combinatorial optimization problem, we incorporate reinforcement learning (RL) to search for the corresponding optimal adjustment set. More specifically, we adopt the encoder-decoder model as the actor to generate the binary feature mask on the original covariates, which serves as the differentiable policy. Meanwhile, the proposed variance metric serves as the reward to guide the policy update. Empirical results on synthetic and real-world datasets demonstrate that (a) our method successfully searches the optimal adjustment sets and (b) the searched adjustment features achieve more precise treatment effect estimation.","treatment effect estimation, covariate separation, confounder balancing" Over-parameterized Model Optimization with Polyak-{\L}ojasiewicz Condition,https://openreview.net/forum?id=aBIpZvMdS56,https://openreview.net/pdf?id=aBIpZvMdS56,This work proposes a new regularized risk minimization for over-parameterized models with a novel PL regularization and implements it via network pruning guided by a PL-based condition number. ,"This work pursues the optimization of over-parameterized deep models for superior training efficiency and test performance. We first theoretically emphasize the importance of two properties of over-parameterized models, i.e., the convergence gap and the generalization gap. Subsequent analyses unveil that these two gaps can be upper-bounded by the ratio of the Lipschitz constant and the Polyak-{\L}ojasiewicz (PL) constant, a crucial term abbreviated as the \emph{condition number}. Such discoveries have led to a structured pruning method with a novel pruning criterion. That is, we devise a gating network that dynamically detects and masks out those poorly-behaved nodes of a deep model during the training session. 
To this end, the gating network is learned by minimizing the \emph{condition number} of the target model, and this process can be implemented as an extra regularization loss term. Experimental studies demonstrate that the proposed method outperforms the baselines in terms of both training efficiency and test performance, exhibiting the potential of generalizing to a variety of deep network architectures and tasks.","Over-parameterized Model, Model Optimization, Polyak-{\L}ojasiewicz Condition." Differentially private Bias-Term Only Fine-tuning of Foundation Models,https://openreview.net/forum?id=zoTUH3Fjup,https://openreview.net/pdf?id=zoTUH3Fjup,"We propose a parameter efficient and private fine-tuning method, that trains 0.1% of the network (only the bias terms) and substantially improves the time/space complexity.","We study the problem of differentially private (DP) fine-tuning of large pre-trained models — a recent privacy-preserving approach suitable for solving downstream tasks with sensitive data. Existing work has demonstrated that high accuracy is possible under strong privacy constraints, yet requires significant computational overhead or modifications to the network architecture. We propose differentially private bias-term fine-tuning (DP-BiTFiT), which matches the state-of-the-art accuracy for DP algorithms and the efficiency of the standard BiTFiT. DP-BiTFiT is model agnostic (not modifying the network architecture), parameter efficient (only training about $0.1\%$ of the parameters), and computation efficient (almost removing the overhead caused by DP, in both the time and space complexity). On a wide range of tasks, DP-BiTFiT is $2\sim 30\times$ faster and uses $2\sim 8\times$ less memory than DP full fine-tuning, even faster than the standard full fine-tuning. This remarkable efficiency enables us to conduct DP fine-tuning on language and vision tasks with long-sequence texts and high-resolution images, which were computationally difficult using existing methods.","deep learning, differential privacy, complexity, parameter efficiency, fine-tuning" Jointly Learning Visual and Auditory Speech Representations from Raw Data,https://openreview.net/forum?id=BPwIgvf5iQ,https://openreview.net/pdf?id=BPwIgvf5iQ,We propose a self-supervised audiovisual approach to jointly learn visual and auditory speech representations.,"We present RAVEn, a self-supervised multi-modal approach to jointly learn visual and auditory speech representations. Our pre-training objective involves encoding masked inputs, and then predicting contextualised targets generated by slowly-evolving momentum encoders. Driven by the inherent differences between video and audio, our design is asymmetric w.r.t. the two modalities' pretext tasks: Whereas the auditory stream predicts both the visual and auditory targets, the visual one predicts only the auditory targets. We observe strong results in low- and high-resource labelled data settings when fine-tuning the visual and auditory encoders resulting from a single pre-training stage, in which the encoders were jointly trained. Notably, combining RAVEn pre-training with self-training using only 30 hours of labelled data achieves competitive performance for visual speech recognition on LRS3, surpassing all self-supervised methods and a recent fully-supervised method trained on 90,000 hours of non-public labelled data. At the same time, we are on par with the state-of-the-art for auditory speech recognition on LRS3. 
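The bias-term-only fine-tuning at the heart of DP-BiTFiT (described above) amounts to freezing every parameter except the biases; a minimal sketch follows. The DP machinery (per-sample clipping and noise addition on the bias gradients) is intentionally omitted.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Bias-term fine-tuning: freeze everything except bias parameters.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith(".bias")

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)   # ['0.bias', '2.bias'] -- a tiny fraction of all parameters
opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)
# A DP variant would additionally clip per-sample bias gradients and add
# calibrated Gaussian noise before each optimizer step (not shown here).
```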
Our findings point to the viability of learning powerful speech representations entirely from raw video and audio, i.e., without relying on handcrafted features. Code and models will be made public.","self-supervised learning, lipreading, speech recognition" Diminishing Return of Value Expansion Methods in Model-Based Reinforcement Learning,https://openreview.net/forum?id=H4Ncs5jhTCu,https://openreview.net/pdf?id=H4Ncs5jhTCu,,"Model-based reinforcement learning is an approach to increase sample efficiency. However, the accuracy of the dynamics models and the resulting compounding error over trajectories are commonly regarded as a limitation of model-based approaches. A natural question to ask is: How much more sample efficiency can be gained by improving the learned dynamics models? Specifically, this paper addresses the value expansion class of model-based approaches. Our empirical study shows that expanding the value function for the critic or actor update increases sample efficiency, but the gain in improvement decreases with each added expansion step. Therefore, longer horizons yield diminishing returns in terms of sample efficiency. In an extensive experimental comparison that uses the oracle dynamics model to avoid compounding model error, we show that short horizons are sufficient to obtain the lowest sample complexity for the given tasks. For long horizons, the improvements are marginal or can even decrease learning performance despite using the oracle dynamics model. Model-free counterparts, which use off-policy trajectories from a replay buffer and introduce no computational overhead, often show on-par performance and serve as a strong baseline. Finally, as we observe the same issues with both oracle and learned models, we conclude that the limitation of model-based value expansion methods is not primarily the accuracy of the learned models.","Model-based Reinforcement Learning, Value Expansion" Differentially Private Optimization on Large Model at Small Cost,https://openreview.net/forum?id=XfQlcpWESqV,https://openreview.net/pdf?id=XfQlcpWESqV,"We propose a new implementation of differentially private deep learning, that substantially improves speed and memory cost to match standard non-private training.","Differentially private (DP) optimization is the standard paradigm to learn large neural networks that are accurate and privacy-preserving. The computational cost for DP deep learning, however, is notoriously heavy due to the per-sample gradient clipping. Existing DP implementations are $2-1000\times$ more costly in time and space complexity than the standard (non-private) training. In this work, we develop a novel Book-Keeping (BK) technique that implements existing DP optimizers (thus achieving the same accuracy), with a substantial improvement in the computational cost. Specifically, BK enables DP training on large models and high dimensional data to be roughly as efficient as the standard training, whereas previous DP algorithms can be inefficient or incapable of training due to memory errors. The computational advantage of BK is supported by the complexity analysis as well as extensive experiments on vision and language tasks. Our implementation achieves state-of-the-art (SOTA) accuracy with very small extra cost: on GPT2 and at the same memory cost, BK has 1.0$\times$ the time complexity of the standard training (0.75$\times$ training speed in practice), and 0.6$\times$ the time complexity of the most efficient DP implementation (1.24$\times$ training speed in practice). 
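To make the value expansion construction studied above concrete, the sketch below computes an H-step expansion target: accumulate model-predicted rewards for H steps, then bootstrap with the value function. The model and value-function interfaces are assumptions for illustration.

```python
import torch

def value_expansion_target(model, value_fn, s, a, horizon, gamma=0.99):
    """H-step value expansion: roll a learned (or oracle) dynamics model
    forward for `horizon` steps, then bootstrap with the value function.

    model(s, a) -> (next_state, reward) and value_fn(s) -> V(s) are
    assumed interfaces, not from any particular codebase.
    """
    target, discount = torch.zeros(s.size(0)), 1.0
    for _ in range(horizon):
        s, r = model(s, a)
        target = target + discount * r
        discount = discount * gamma
        a = torch.zeros_like(a)     # placeholder; in practice a = pi(s)
    return target + discount * value_fn(s)

# Toy check with linear dynamics and a zero value function.
model = lambda s, a: (0.9 * s + a, s.sum(dim=1))
value_fn = lambda s: torch.zeros(s.size(0))
print(value_expansion_target(model, value_fn, torch.ones(4, 3),
                             torch.zeros(4, 3), horizon=3))
```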
We will open-source the codebase for the BK algorithm.","deep learning, differential privacy, complexity, computation efficiency" Universal Graph Neural Networks without Message Passing,https://openreview.net/forum?id=P0bfBJaD4KP,https://openreview.net/pdf?id=P0bfBJaD4KP,,"Message-Passing Graph Neural Networks (MP-GNNs) have become the de facto paradigm for learning on graphs for years. Nevertheless, recent works also obtain promising empirical results with other kinds of architectures like global self-attention and even MLPs. This raises an important theoretical question: what is the minimal prerequisite for an expressive graph model? In this work, we theoretically show that when equipped with proper position encodings, even a simple Bag-of-Nodes (BoN) model (node-wise MLP followed by global readout) can be universal on graphs. We name this model Universal Bag-of-Nodes (UBoN). Synthetic experiments on the EXP dataset show that UBoN indeed achieves expressive power beyond the 1-WL test. On real-world graph classification tasks, UBoN also obtains performance comparable to MP-GNNs while enjoying better training and inference efficiency (50% less training time compared to GCN). We believe that our theoretical and empirical results might inspire more research on simple and expressive GNN architectures. ", CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Alignment,https://openreview.net/forum?id=GNjzMAgawq,https://openreview.net/pdf?id=GNjzMAgawq,,"Pre-trained image-text models, like CLIP, have demonstrated the strong power of vision-language representation learned from a large scale of web-collected image-text data. In light of the well-learned visual features, some works transfer image representations to the video domain and achieve good results. However, adapting image-text pre-trained models to video-text pre-training (i.e., post-pretraining) has not demonstrated a significant advantage yet. In this paper, we tackle this challenge by raising and addressing two questions: 1) what are the factors hindering post-pretraining CLIP from improving performance on video-text tasks, and 2) how to mitigate the impact of these factors. Through a series of comparative experiments and analyses, we find that the data scale and the domain gap between language sources have large impacts. Based on these observations, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin. Our model achieves state-of-the-art results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. We will release the code and model to facilitate future research.", Pre-training via Denoising for Molecular Property Prediction,https://openreview.net/forum?id=tYIMtogyee,https://openreview.net/pdf?id=tYIMtogyee,We describe a technique for pre-training models for molecular property prediction from 3D structures based on denoising and show that it achieves SOTA results for various tasks.,"Many important problems involving molecular property prediction from 3D structures have limited data, posing a generalization challenge for neural networks. In this paper, we describe a pre-training technique based on denoising that achieves a new state-of-the-art in molecular property prediction by utilizing large datasets of 3D molecular structures at equilibrium to learn meaningful representations for downstream tasks. 
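A minimal sketch of the Bag-of-Nodes architecture described above for UBoN: a node-wise MLP over features concatenated with position encodings, followed by a global sum readout, with no message passing. The choice of position encoding (e.g., Laplacian eigenvectors) is an assumption here.

```python
import torch
import torch.nn as nn

class BagOfNodes(nn.Module):
    """Node-wise MLP followed by a global sum readout (no message passing).

    Expressiveness hinges on the position encodings `pe`; Laplacian
    eigenvectors are one common choice (illustrative, not prescribed here).
    """
    def __init__(self, in_dim, pe_dim, hidden, n_classes):
        super().__init__()
        self.node_mlp = nn.Sequential(nn.Linear(in_dim + pe_dim, hidden),
                                      nn.ReLU(), nn.Linear(hidden, hidden))
        self.readout = nn.Sequential(nn.ReLU(), nn.Linear(hidden, n_classes))

    def forward(self, x, pe):
        h = self.node_mlp(torch.cat([x, pe], dim=-1))  # per-node, no neighbors
        return self.readout(h.sum(dim=0))              # global sum readout

x = torch.randn(12, 8)     # 12 nodes, 8 features
pe = torch.randn(12, 4)    # 4-dim position encoding per node
print(BagOfNodes(8, 4, 32, 3)(x, pe).shape)   # torch.Size([3])
```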
Relying on the well-known link between denoising autoencoders and score-matching, we show that the denoising objective corresponds to learning a molecular force field -- arising from approximating the Boltzmann distribution with a mixture of Gaussians -- directly from equilibrium structures. Our experiments demonstrate that using this pre-training objective significantly improves performance on multiple benchmarks, achieving a new state-of-the-art on the majority of targets in the widely used QM9 dataset. Our analysis then provides practical insights into the effects of different factors -- dataset sizes, model size and architecture, and the choice of upstream and downstream datasets -- on pre-training.","Molecular Property Prediction, Pre-training, Graph Neural Networks, Denoising, Molecules" Equivariant Energy-Guided SDE for Inverse Molecular Design,https://openreview.net/forum?id=r0otLtOwYW,https://openreview.net/pdf?id=r0otLtOwYW,Equivariant energy-guided stochastic differential equations for inverse molecular design.,"Inverse molecular design is critical in material science and drug discovery, where the generated molecules should satisfy certain desirable properties. In this paper, we propose equivariant energy-guided stochastic differential equations (EEGSDE), a flexible framework for controllable 3D molecule generation under the guidance of an energy function in diffusion models. Formally, we show that EEGSDE naturally exploits the geometric symmetry in 3D molecular conformation, as long as the energy function is invariant to orthogonal transformations. Empirically, under the guidance of designed energy functions, EEGSDE significantly improves the baseline on QM9, in inverse molecular design targeted to quantum properties and molecular structures. Furthermore, EEGSDE is able to generate molecules with multiple target properties by combining the corresponding energy functions linearly.","score-based model, diffusion model, inverse molecular design, energy guidance, molecule generation, equivariance" Contrastive Value Learning: Implicit Models for Simple Offline RL,https://openreview.net/forum?id=XRPcmvMFFe,https://openreview.net/pdf?id=XRPcmvMFFe,Learning implicit value functions using noise-contrastive estimation in a model-free setting that can be partially pre-trained on reward-free data from related RL tasks.,"Model-based reinforcement learning (RL) methods are appealing in the offline setting because they allow an agent to reason about the consequences of actions without interacting with the environment. Prior methods learn a 1-step dynamics model, which predicts the next state given the current state and action. These models do not immediately tell the agent which actions to take, but must be integrated into a larger RL framework. Can we model the environment dynamics in a different way, such that the learned model does directly indicate the value of each action? In this paper, we propose Contrastive Value Learning (CVL), which learns an implicit, multi-step model of the environment dynamics. This model can be learned without access to reward functions, but nonetheless can be used to directly estimate the value of each action, without requiring any TD learning. Because this model represents the multi-step transitions implicitly, it avoids having to predict high-dimensional observations and thus scales to high-dimensional tasks. 
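The denoising pre-training objective described above can be sketched in a few lines: corrupt equilibrium coordinates with Gaussian noise and train the network to predict that noise, which matches the score (and hence a force field) up to scaling. The stand-in MLP below ignores the equivariance and graph structure a real molecular network would have.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(3, 128), nn.SiLU(), nn.Linear(128, 3))  # stand-in for a GNN
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
sigma = 0.1                                   # noise scale (hypothetical)

def denoise_step(coords):
    """One pre-training step: predict the Gaussian noise added to coordinates."""
    eps = torch.randn_like(coords) * sigma
    pred = net(coords + eps)                  # per-atom noise prediction
    loss = ((pred - eps) ** 2).mean()         # denoising = score matching up to scaling
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

coords = torch.randn(32, 3)                   # 32 "atoms" at toy equilibrium positions
for _ in range(10):
    print(denoise_step(coords))
```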
Our experiments demonstrate that CVL outperforms prior offline RL methods on complex continuous control benchmarks.","reinforcement learning, contrastive learning, implicit density models, reward-free learning, offline reinforcement learning, metaworld" CLIPPING: Distilling CLIP-based Models for Video-Language Understanding,https://openreview.net/forum?id=aqIvCsRsYt,https://openreview.net/pdf?id=aqIvCsRsYt,"In this paper, we propose a novel knowledge distillation method that is specially designed for small vision-language models.","Pre-training a vision-language model and then fine-tuning it on downstream tasks have become a popular paradigm. However, pre-trained vision-language models with the Transformer architecture usually have a large number of parameters and take long inference time. Knowledge distillation has been an efficient technique to transfer the capability of a large model to a small one while maintaining the accuracy, which has achieved remarkable success in natural language processing. However, the collection of the pre-training data for pre-training knowledge distillation requires enormous human effort in multi-modality applications. In this paper, we propose a novel knowledge distillation method, named CLIPPING, where the plentiful knowledge of a large teacher model that has been fine-tuned for video-language tasks with the powerful pre-trained CLIP can be effectively transferred to a small student only at the fine-tuning stage. In particular, a new layer-wise alignment is proposed for knowledge distillation of the intermediate layers from the Transformer to the CNN in CLIPPING, which enables the student model to well absorb the knowledge of the teacher. Besides, we present an effective cross-modality knowledge distillation, which includes both the knowledge of the global video-caption distributions from the teacher model and the knowledge of the local video-caption distributions from the pre-training model (CLIP). Finally, CLIPPING with MobileViT-v2 as the vision encoder without any vision-language pre-training achieves 91.5%-95.3% of the performance of its teacher on three video-language retrieval benchmarks, with its vision encoder being 19.5x smaller. CLIPPING also significantly outperforms a state-of-the-art small baseline (ALL-in-one-B) on the MSR-VTT dataset, obtaining a relative 7.4% performance gain, with 29% fewer parameters and 86.9% fewer FLOPs. Moreover, CLIPPING is comparable or even superior to many large pre-trained models.","Knowledge Distillation, Vision-Language Understanding, Model Compression" To be private and robust: Differentially Private Optimizers Can Learn Adversarially Robust Models,https://openreview.net/forum?id=4WoJDxyCxq,https://openreview.net/pdf?id=4WoJDxyCxq,We show that DP models can be adversarially robust with rigorous proof on linear models and empirical evidence on deep networks.,"Machine learning models have shone in a variety of domains and attracted increasing attention from both the security and the privacy communities. One important yet worrying question is: will training models under the differential privacy (DP) constraint unfavorably impact adversarial robustness? While previous works have postulated that privacy comes at the cost of worse robustness, we give the first theoretical analysis to show that DP models can indeed be robust and accurate, even sometimes more robust than their naturally-trained non-private counterparts. 
We observe three key factors that influence the privacy-robustness-accuracy tradeoff: (1) hyperparameters for DP optimizers are critical; (2) pre-training on public data significantly mitigates the accuracy and robustness drop; (3) the choice of DP optimizers makes a difference. With these factors set properly, we achieve 90\% natural accuracy, 72\% robust accuracy ($+9\%$ over the non-private model) under the $l_2(0.5)$ attack, and 69\% robust accuracy ($+16\%$ over the non-private model) with the pre-trained SimCLRv2 model under the $l_\infty(4/255)$ attack on CIFAR10 with $\epsilon=2$. In fact, we show both theoretically and empirically that DP models are Pareto optimal in terms of accuracy and robustness. Additionally, the robustness of DP models is consistently observed on MNIST, Fashion MNIST and CelebA, with ResNet and Vision Transformer. We believe our encouraging results are a significant step towards training models that are private as well as robust, including deep neural networks.","deep learning, differential privacy, adversarial robustness, Pareto optimality" Vectorial Graph Convolutional Networks,https://openreview.net/forum?id=Uzng0zolM8,https://openreview.net/pdf?id=Uzng0zolM8,," Graph Convolutional Networks (GCNs) have drawn considerable attention recently due to their outstanding performance in processing graph-structured data. However, GCNs are still limited to undirected graphs because they theoretically require a symmetric matrix as the basis for the Laplacian transform. This makes the operator isotropic and reduces its sensitivity to different information. To solve this problem, we generalize the spectral convolution operator to directed graphs by field extension, which lifts the edge representations from scalars to vectors. This brings in the concept of direction: even homogeneous information becomes distinguishable by its differences in direction. In this paper, we propose the Vectorial Graph Convolutional Network (VecGCN) and provide experimental evidence showing its advantages on a variety of directed-graph node classification and link prediction tasks. ","GNN, GCN" Traversing Between Modes in Function Space for Fast Ensembling,https://openreview.net/forum?id=cS45VNtZLW,https://openreview.net/pdf?id=cS45VNtZLW,We propose a novel framework that predicts the outputs for the low-loss subspace to reduce the inference cost of deep ensembles by taking advantage of mode connectivity.,"Deep ensembles are a simple yet powerful way to improve the performance of deep neural networks. Under this motivation, recent works on mode connectivity have shown that parameters of ensembles are connected by low-loss subspaces, and one can efficiently collect ensemble parameters in those subspaces. While this provides a way to efficiently train ensembles, for inference, one should still execute multiple forward passes using all the ensemble parameters, which often becomes a serious bottleneck for real-world deployment. In this work, we propose a novel framework to reduce such costs. Given a low-loss subspace connecting two modes of a neural network, we build an additional neural network predicting outputs of the original neural network evaluated at a certain point in the low-loss subspace. The additional neural network, which we call a “bridge”, is a lightweight network taking minimal features from the original network, and predicting outputs for the low-loss subspace without forward passes through the original network. 
We empirically demonstrate that we can indeed train such bridge networks and significantly reduce inference costs with their help.","deep ensemble, mode connectivity" Poisson Process for Bayesian Optimization,https://openreview.net/forum?id=_QRMikPHXL,https://openreview.net/pdf?id=_QRMikPHXL,,"Bayesian Optimization (BO) is a sample-efficient, model-based method for optimizing black-box functions which can be expensive to evaluate. Traditionally, BO seeks a probabilistic surrogate model, such as the Tree-structured Parzen Estimator (TPE), Sequential Model-based Algorithm Configuration (SMAC), or a Gaussian process (GP), based on the exact observed values. However, compared to the value response, the relative ranking is harder to disrupt with noise, resulting in better robustness. Moreover, ranking is more practical when the exact value responses are intractable but information about candidate preferences can be acquired. Thus, this work introduces an efficient BO framework, named PoPBO, consisting of a novel ranking-based response surface based on the Poisson process and two acquisition functions to accommodate the proposed surrogate model. We show empirically that PoPBO improves efficacy and efficiency on both simulated and real-world benchmarks, including HPO and NAS.", DPMAC: Differentially Private Communication for Cooperative Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=VIwEYmMID9R,https://openreview.net/pdf?id=VIwEYmMID9R,"We make the first attempt to develop a privacy-preserving communication framework for MARL, named \textit{DPMAC}. ","Communication lays the foundation for cooperation in human society and in multi-agent reinforcement learning (MARL). Humans also desire to maintain their privacy when communicating with others, yet such privacy concerns have not been considered in existing works in MARL. We propose the \textit{differentially private multi-agent communication} (DPMAC) algorithm, which protects the sensitive information of individual agents by equipping each agent with a local message sender with a rigorous $(\epsilon, \delta)$-differential privacy (DP) guarantee. In contrast to directly perturbing the messages with predefined DP noise as commonly done in privacy-preserving scenarios, we adopt a stochastic message sender for each agent and incorporate the DP requirement into the sender, which automatically adjusts the learned message distribution to alleviate the instability caused by DP noise. Further, we prove the existence of a Nash equilibrium in cooperative MARL with privacy-preserving communication, which suggests that this problem is game-theoretically learnable. Extensive experiments demonstrate a clear advantage of DPMAC over baseline methods in privacy-preserving scenarios.","Communication in deep multi-agent reinforcement learning, Deep multi-agent reinforcement learning, Differential privacy, Game theory" $Q$-learning with regularization converges with non-linear non-stationary features,https://openreview.net/forum?id=Tg9AvNbTUJo,https://openreview.net/pdf?id=Tg9AvNbTUJo,,"The deep $Q$-learning architecture is a neural network composed of non-linear hidden layers that learn features of states and actions and a final linear layer that learns the $Q$-values of the features. The parameters of both components can possibly diverge. Regularization of the updates is known to solve the divergence problem of fully linear architectures, where features are stationary and known a priori. 
We propose a deep $Q$-learning scheme that uses regularization of the final linear layer of the architecture, updating it on a faster time-scale, and stochastic full-gradient descent updates for the non-linear features on a slower time-scale. We prove that the proposed scheme converges with probability 1. Finally, we provide a bound on the error introduced by regularization of the final linear layer of the architecture.","Q-learning, Reinforcement Learning, Stochastic Approximation" Polite Teacher: Semi-Supervised Instance Segmentation with Mutual Learning and Pseudo-Label Thresholding,https://openreview.net/forum?id=x4XN_dP6mrQ,https://openreview.net/pdf?id=x4XN_dP6mrQ,"We present Polite Teacher, a mutual learning framework with pseudo-label thresholding for single-stage semi-supervised instance segmentation.","We present Polite Teacher, a simple yet effective method for the task of semi-supervised instance segmentation. The proposed architecture relies on the Teacher-Student mutual learning framework. To filter out noisy pseudo-labels, we use confidence thresholding for bounding boxes and mask scoring for masks. The approach has been tested with CenterMask, a single-stage anchor-free detector. Tested on the COCO 2017 val dataset, our architecture significantly (approx. +8 pp. in mask AP) outperforms the baseline under different supervision regimes. To the best of our knowledge, this is one of the first works tackling the problem of semi-supervised instance segmentation and the first one devoted to an anchor-free detector. The code is available.","semi-supervised learning, instance segmentation, semi-supervised instance segmentation" Reducing Forgetting In Federated Learning with Truncated Cross-Entropy,https://openreview.net/forum?id=nd8Z_Xbdrfx,https://openreview.net/pdf?id=nd8Z_Xbdrfx,Inspired by methods in continual learning we propose and analyze a simple approach for supervised federated learning with non-iid data,"In Federated Learning, a global model is learned by aggregating model updates computed from a set of client nodes, each having its own data. A key challenge in federated learning is the heterogeneity of data across clients, whose data distributions differ from one another. Standard federated learning algorithms perform multiple gradient steps before synchronizing the model, which can lead to clients overly minimizing their local objective and diverging from other client solutions, particularly in the supervised learning setting. We demonstrate that in such a setting, individual client models experience the ``catastrophic forgetting"" phenomenon with respect to other client data. We propose a simple yet efficient approach that modifies the cross-entropy objective on a per-client basis such that classes outside a client's label set are shielded from abrupt representation change. Through extensive empirical evaluations, we demonstrate that our approach can greatly alleviate this problem, especially in the most challenging federated learning settings with high heterogeneity, low participation, and large numbers of clients. 
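One plausible reading of the per-client truncated cross-entropy above is to restrict the softmax normalizer to the client's own label set, so classes absent from the client receive no gradient; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def truncated_cross_entropy(logits, targets, client_classes):
    """Cross-entropy with the softmax restricted to a client's label set.

    Classes outside `client_classes` are masked out of the normalizer, so
    their logits (and hence representations) receive zero gradient. This is
    one plausible instantiation of the idea, not the paper's exact loss.
    """
    mask = torch.full_like(logits, float("-inf"))
    mask[:, client_classes] = 0.0
    return F.cross_entropy(logits + mask, targets)

logits = torch.randn(5, 10, requires_grad=True)
targets = torch.tensor([2, 2, 7, 7, 2])
loss = truncated_cross_entropy(logits, targets, client_classes=[2, 7])
loss.backward()
# Columns for classes outside {2, 7} receive zero gradient.
print(logits.grad.abs().sum(dim=0))
```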
","continual learning, catastrophic forgetting, distribution shifts, federated learning" On the Convergence and Calibration of Deep Learning with Differential Privacy,https://openreview.net/forum?id=MpikUXtGQCI,https://openreview.net/pdf?id=MpikUXtGQCI,"We show that differentially private deep learning can be severely mis-calibrated due to the gradient clipping, which can be alleviated by a new clipping method.","Differentially private (DP) neural network achieves the privacy usually at the cost of slower convergence (and thus lower performance) than its non-private counterpart. To analyze the difficulty of DP training, this work gives the first convergence analysis through the lens of training dynamics and the neural tangent kernel (NTK). We successfully characterize the effects of two key components in the DP training: the per-sample gradient clipping (flat or layerwise) and the noise addition. Our analysis not only initiates a general principled framework to understand the DP deep learning with any network architecture and loss function, but also motivates a new clipping method -- the \textit{global clipping}, that significantly improves the convergence, as well as preserves the same DP guarantee and computational efficiency as the existing method, which we term as \textit{local clipping}. Theoretically speaking, we precisely characterize the effect of per-sample clipping on the NTK matrix and show that the noise scale of DP optimizers does not affect the convergence in the \textit{gradient flow} regime. In particular, we shed light on several behaviors that are only guaranteed by our global clipping. For example, the global clipping can preserve the positive semi-definiteness of NTK, which is almost certainly broken by the local clipping; DP gradient descent (GD) with global clipping converges monotonically to zero loss, while the convergence of local clipping can be non-monotone; the global clipping is surprisingly effective at learning \textit{calibrated classifiers}, whereas existing DP classifiers are oftentimes over-confident and unreliable. Notably, our analysis framework easily extends to other optimizers, e.g., DP-Adam. We demonstrate through numerous experiments that DP optimizers equipped with global clipping perform strongly. Implementation-wise, the global clipping can be realized by inserting only one line of code into the Pytorch \texttt{Opacus} library.","deep learning, differential privacy, calibration, convergence, neural tangent kernel" On the Feasibility of Cross-Task Transfer with Model-Based Reinforcement Learning,https://openreview.net/forum?id=KB1sc5pNKFv,https://openreview.net/pdf?id=KB1sc5pNKFv,"We investigate the feasibility of pretraining and cross-task transfer in model-based RL, and improve sample-efficiency substantially over baselines on the Atari100k benchmark.","Reinforcement Learning (RL) algorithms can solve challenging control problems directly from image observations, but they often require millions of environment interactions to do so. Recently, model-based RL algorithms have greatly improved sample-efficiency by concurrently learning an internal model of the world, and supplementing real environment interactions with imagined rollouts for policy improvement. However, learning an effective model of the world from scratch is challenging, and in stark contrast to humans that rely heavily on world understanding and visual cues for learning new skills. 
In this work, we investigate whether internal models learned by modern model-based RL algorithms can be leveraged to solve new, distinctly different tasks faster. We propose Model-Based Cross-Task Transfer (XTRA), a framework for sample-efficient online RL with scalable pretraining and finetuning of learned world models. With proper pretraining and concurrent cross-task online fine-tuning, we achieve substantial improvements over a baseline trained from scratch; we improve the mean performance of the model-based algorithm EfficientZero by $23\%$, and by as much as $71\%$ in some instances.","model-based reinforcement learning, visual reinforcement learning" Fast and Precise: Adjusting Planning Horizon with Adaptive Subgoal Search,https://openreview.net/forum?id=7JsGYvjE88d,https://openreview.net/pdf?id=7JsGYvjE88d,"We propose Adaptive Subgoal Search (AdaSubS), a search algorithm that adjusts the planning horizon to match the local complexity of the solved problems.","Complex reasoning problems contain states that vary in the computational cost required to determine a good action plan. Taking advantage of this property, we propose Adaptive Subgoal Search (AdaSubS), a search method that adaptively adjusts the planning horizon. To this end, AdaSubS generates diverse sets of subgoals at different distances. A verification mechanism is employed to filter out unreachable subgoals swiftly, allowing the search to focus on feasible further subgoals. In this way, AdaSubS benefits from the efficiency of planning with longer subgoals and the fine control of the shorter ones, and thus scales well to difficult planning problems. We show that AdaSubS significantly surpasses hierarchical planning algorithms on three complex reasoning tasks: Sokoban, the Rubik’s Cube, and the inequality-proving benchmark INT.","search, adaptive horizon, verification, deep learning, hierarchical planning" A Simple Yet Powerful Deep Active Learning With Snapshots Ensembles,https://openreview.net/forum?id=IVESH65r0Ar,https://openreview.net/pdf?id=IVESH65r0Ar,,"Given an unlabeled pool of data and the experts who can label them, active learning aims to build an agent that can effectively acquire data to be queried to the experts, maximizing the gain in performance when trained with them. While there are several principles for active learning, a prevailing approach is to estimate uncertainties of predictions for unlabeled samples and use them to define acquisition functions. Active learning with the uncertainty principle works well for deep learning, especially for large-scale image classification tasks with deep neural networks. Still, it is often overlooked how the uncertainty of predictions is estimated, despite the common findings on the difficulty of accurately estimating uncertainties of deep neural networks. In this paper, we highlight the effectiveness of snapshot ensembles for deep active learning. Compared to the previous approaches based on Monte-Carlo dropout or deep ensembles, we show that a simple acquisition strategy based on uncertainties estimated from parameter snapshots gathered from a single optimization path significantly improves the quality of the acquired samples. Based on this observation, we further propose an efficient active learning algorithm that maintains a single learning trajectory throughout all active learning episodes, unlike existing algorithms that train models from scratch for every episode. 
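A sketch of the snapshot-ensemble acquisition just described: average the softmax predictions of parameter snapshots collected along one training trajectory and rank the unlabeled pool by predictive entropy. The entropy criterion and snapshot schedule are illustrative assumptions.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def snapshot_entropy_acquisition(model, snapshots, pool_x, k):
    """Rank unlabeled pool points by entropy of the snapshot-averaged prediction.

    `snapshots` is a list of state_dicts saved along a single training
    trajectory (e.g., at the end of each cyclic-learning-rate cycle).
    """
    probs = 0.0
    for sd in snapshots:
        model.load_state_dict(sd)
        model.eval()
        probs = probs + torch.softmax(model(pool_x), dim=1)
    probs = probs / len(snapshots)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    return entropy.topk(k).indices          # indices of the k most uncertain points

model = nn.Linear(8, 3)
snapshots = [model.state_dict() for _ in range(3)]   # stand-in snapshots
print(snapshot_entropy_acquisition(model, snapshots, torch.randn(100, 8), k=10))
```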
Through an extensive empirical comparison, we demonstrate the effectiveness of snapshot ensembles for deep active learning.","Active learning, Snapshot ensemble, Uncertainty estimation" Normalizing Flows for Interventional Density Estimation,https://openreview.net/forum?id=bTy4D3KHwWU,https://openreview.net/pdf?id=bTy4D3KHwWU,"We propose a novel, fully-parametric deep learning method for estimating densities of potential outcomes, called Interventional Normalizing Flows","Existing machine learning methods for causal inference usually estimate quantities expressed via the mean of potential outcomes (e.g., average treatment effect). However, such quantities do not capture the full information about the distribution of potential outcomes. In this work, we estimate the density of potential outcomes after interventions from observational data. For this, we propose a novel, fully-parametric deep learning method called Interventional Normalizing Flows. Specifically, we combine two normalizing flows, namely (i) a teacher flow for estimating nuisance parameters and (ii) a student flow for a parametric estimation of the density of potential outcomes. We further develop a tractable optimization objective via a one-step bias correction for an efficient and doubly robust estimation of the student flow parameters. As a result, our Interventional Normalizing Flows offer a properly normalized density estimator. Across various experiments, we demonstrate that our Interventional Normalizing Flows are expressive and highly effective, and scale well with both sample size and high-dimensional confounding. To the best of our knowledge, our Interventional Normalizing Flows are the first fully-parametric, deep learning method for density estimation of potential outcomes.","causal inference, normalizing flows, treatment effect estimation, causal machine learning" Backdoor or Feature? A New Perspective on Data Poisoning,https://openreview.net/forum?id=4NT3umNU3D0,https://openreview.net/pdf?id=4NT3umNU3D0,"A new theoretical foundation of data poisoning, with a theory inspired defense algorithm","In a backdoor attack, an adversary adds maliciously constructed (""backdoor"") examples into a training set to make the resulting model vulnerable to manipulation. Defending against such attacks---that is, finding and removing the backdoor examples---typically involves viewing these examples as outliers and using techniques from robust statistics to detect and remove them. In this work, we present a new perspective on backdoor attacks. We argue that without structural information on the training data distribution, backdoor attacks are indistinguishable from naturally-occurring features in the data (and thus impossible to ``detect'' in a general sense). To circumvent this impossibility, we assume that a backdoor attack corresponds to the strongest feature in the training data. Under this assumption---which we make formal---we develop a new framework for detecting backdoor attacks. Our framework naturally gives rise to a corresponding algorithm whose efficacy we show both theoretically and experimentally.", TAPPFL: TASK-AGNOSTIC PRIVACY-PRESERVING REPRESENTATION LEARNING FOR FEDERATED LEARNING AGAINST ATTRIBUTE INFERENCE ATTACKS,https://openreview.net/forum?id=2skHw9HVf3,https://openreview.net/pdf?id=2skHw9HVf3,,"Federated learning (FL), a new collaborative learning paradigm, has been widely studied recently due to its ability to collaboratively train models on data from different sources without needing to share the raw training data. 
Nevertheless, recent studies show that an adversary (e.g., an honest-but-curious server) may still be able to infer private information about the training data, e.g., sensitive attributes such as income, race, and sexual orientation. To mitigate attribute inference attacks, various existing privacy-preserving FL methods can be adopted/adapted. However, all these existing methods have key limitations: they need to know the FL task in advance, or have intolerable computational overheads or utility losses, or do not have provable privacy guarantees. We aim to address all these issues and design a task-agnostic privacy-preserving FL (short for TAPPFL) method against attribute inference attacks from the information-theoretic perspective. Specifically, we formally formulate TAPPFL via two mutual information goals, where one goal learns task-agnostic data representations that contain the least information about the private attribute in each device’s data, and the other goal includes as much information as possible about the training data to maintain utility. However, it is intractable to compute exact mutual information in general. Hence, we derive tractable variational mutual information bounds, and each bound can be parameterized via a neural network. Next, we alternately train these parameterized neural networks to approximate the true mutual information and learn privacy-preserving representations for device data. We also derive theoretical privacy guarantees of our TAPPFL against worst-case attribute inference attacks. Extensive results on multiple datasets and applications validate the effectiveness of our TAPPFL in protecting data privacy and maintaining FL utility while remaining efficient.", A Curriculum Perspective to Robust Loss Functions,https://openreview.net/forum?id=ZsCgBR1qvo,https://openreview.net/pdf?id=ZsCgBR1qvo,,"Learning with noisy labels is a fundamental problem in machine learning. Much work has been done in designing loss functions that are theoretically robust against label noise. However, it remains unclear why robust loss functions can underfit and why loss functions deviating from theoretical robustness conditions can appear robust. To elucidate these questions, we show that most robust loss functions differ only in the sample-weighting curriculums they implicitly define. The curriculum perspective enables straightforward analysis of the training dynamics with each loss function, which has not been considered in existing theoretical approaches. We show that underfitting can be attributed to marginal sample weights during training, and noise robustness can be attributed to larger weights for clean samples than for noisy samples. With a simple fix to the curriculums, robust loss functions that severely underfit can become competitive with the state-of-the-art.", Decoupled Training for Long-Tailed Classification With Stochastic Representations,https://openreview.net/forum?id=bcYZwYo-0t,https://openreview.net/pdf?id=bcYZwYo-0t,We propose a novel classifier re-training algorithm for long-tailed classification.,"Decoupling representation learning and classifier learning has been shown to be effective in classification with long-tailed data. There are two main ingredients in constructing a decoupled learning scheme: 1) how to train the feature extractor for representation learning so that it provides generalizable representations and 2) how to re-train the classifier that constructs proper decision boundaries by handling class imbalances in long-tailed data. 
In this work, we first apply Stochastic Weight Averaging (SWA), an optimization technique for improving the generalization of deep neural networks, to obtain better generalizing feature extractors for long-tailed classification. We then propose a novel classifier re-training algorithm based on stochastic representations obtained from SWA-Gaussian, a Gaussian-perturbed SWA, and a self-distillation strategy that can harness the diverse stochastic representations based on uncertainty estimates to build more robust classifiers. Experiments on ImageNet-LT and iNaturalist-2018 benchmarks show that our proposed method improves upon previous methods both in terms of prediction accuracy and uncertainty estimation.","long-tailed learning, stochastic weight averaging" IT-NAS: Integrating Lite-Transformer into NAS for Architecture Seletion,https://openreview.net/forum?id=HHcl-5chhkt,https://openreview.net/pdf?id=HHcl-5chhkt,"This paper proposes to integrate Lite-Transformer into NAS for architecture selection, and introduces an additional indicator token (IT) to reflect the importance of each candidate operation.","Neural Architecture Search (NAS) aims to search for the best network in a pre-defined search space. However, much work focuses on the search strategy but little on the architecture selection process. Although weight-sharing-based NAS has improved search efficiency, we notice that the architecture selection process is quite unstable or circuitous. For instance, differentiable NAS may derive a suboptimal architecture due to the performance collapse caused by bi-level optimization, while one-shot NAS requires sampling and evaluating a large number of candidate structures. Recently, the self-attention mechanism has achieved better performance in terms of long-range modeling capabilities. Considering that different operations are widely distributed in the search space, we suggest leveraging the self-attention mechanism to extract the relationships among them and to determine which operation is superior to others. Therefore, we integrate Lite-Transformer into NAS for architecture selection. Specifically, we regard the feature map of each candidate operation as distinct patches and feed them into the Lite-Transformer module along with an additional Indicator Token (called IT). The cross attention among various operations can be extracted by the self-attention mechanism, and the importance of each candidate operation is then given by the softmax result between the query of the indicator token (IT) and the values of the other operation tokens. We experimentally demonstrate that our framework can select the truly representative architecture in different search spaces, achieving a 2.39% test error on CIFAR-10 in the DARTS search space and a 24.1% test error on ImageNet in the ProxylessNAS search space, as well as stable and better performance in the NAS-Bench-201 and S1-S4 search spaces, outperforming state-of-the-art NAS methods.","Neural Architecture Search, Transformer, Self-Attention" Fine-Grained Source Code Vulnerability Detection via Graph Neural Networks,https://openreview.net/forum?id=S5RYm-9Q4o,https://openreview.net/pdf?id=S5RYm-9Q4o,,"The number of exploitable vulnerabilities in software continues to increase, while the speed of bug fixes and software updates has not kept pace. It is therefore crucial to analyze the source code and identify vulnerabilities in the early phase of software development.
In this paper, a fine-grained source code vulnerability detection model based on Graph Neural Networks (GNNs) is proposed with the aim of locating vulnerabilities at both the function level and the line level. First, detailed information about the source code is extracted through multi-dimensional program feature encoding to facilitate learning patterns of vulnerability. Second, extensive experiments are conducted on both a public hybrid dataset and our proposed dataset, which is collected entirely from real software projects. It is demonstrated that our proposed model outperforms state-of-the-art methods and achieves significant improvements even when faced with more complex real-project source code. Finally, a novel location module is designed to identify potential key vulnerable lines of code, and the effectiveness of the model and its contribution to reducing human workload in practical production is evaluated.", "Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC",https://openreview.net/forum?id=OboQ71j1Bn,https://openreview.net/pdf?id=OboQ71j1Bn,We show how diffusion models can be composed together to create new models and demonstrate how to make them perform well at this task.,"Since their introduction, diffusion models have quickly become the prevailing approach to generative modeling in many domains. They can be interpreted as learning the gradients of a time-varying sequence of log-probability density functions. This interpretation has motivated classifier-based and classifier-free guidance as methods for post-hoc control of diffusion models. In this work, we build upon these ideas using the score-based interpretation of diffusion models, and explore alternative ways to condition, modify, and reuse diffusion models for tasks involving compositional generation and guidance. In particular, we investigate why certain types of composition fail using current techniques and present a number of solutions. We conclude that the sampler (not the model) is responsible for this failure and propose new samplers, inspired by MCMC, which enable successful compositional generation. Further, we propose an energy-based parameterization of diffusion models which enables the use of new compositional operators and more sophisticated, Metropolis-corrected samplers. Intriguingly, we find these samplers lead to notable improvements in compositional generation across a wide variety of problems such as classifier-guided ImageNet modeling and compositional text-to-image generation.","Generative models, diffusion models, compositional generation" Randomized Adversarial Style Perturbations for Domain Generalization,https://openreview.net/forum?id=wekD3L39fk,https://openreview.net/pdf?id=wekD3L39fk,,"While deep neural networks have shown remarkable progress in various computer vision tasks, they often suffer from weak generalization ability on unseen domains. To tackle performance degradation under such domain shifts, Domain Generalization (DG) aims to learn domain-invariant features applicable to unseen target domains based only on the data in source domains. This paper presents a simple yet effective approach for domain generalization via style perturbation using adversarial attacks. Motivated by the observation that the characteristics of each domain are captured by the feature statistics corresponding to style, we propose a novel domain generalization technique, referred to as Randomized Adversarial Style Perturbations (RASP).
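To make the style-perturbation idea concrete: treating channel-wise feature statistics as style, one adversarial step on those statistics might look like the following hypothetical PyTorch sketch. The head module, the epsilon step size, and the single-step sign update are assumptions for illustration, not the authors' implementation.

# Hypothetical sketch of adversarial style perturbation in the spirit of RASP:
# the "style" of a feature map is its channel-wise mean and std, and we take
# one gradient step on those statistics to push the prediction toward a
# randomly chosen label.
import torch
import torch.nn.functional as F

def perturb_style(feat, head, eps=0.1):
    # feat: (N, C, H, W) intermediate features; head: remainder of the network
    mu = feat.mean(dim=(2, 3), keepdim=True).detach().requires_grad_(True)
    sigma = feat.std(dim=(2, 3), keepdim=True).detach().requires_grad_(True)
    normalized = (feat - feat.mean(dim=(2, 3), keepdim=True)) / (feat.std(dim=(2, 3), keepdim=True) + 1e-6)
    styled = sigma * normalized + mu                 # re-stylized features
    logits = head(styled)
    rand_y = torch.randint(0, logits.shape[1], (feat.size(0),))
    loss = F.cross_entropy(logits, rand_y)           # deceive toward random labels
    g_mu, g_sigma = torch.autograd.grad(loss, [mu, sigma])
    # apply the adversarial step to the style statistics only
    return (sigma - eps * g_sigma.sign()) * normalized + (mu - eps * g_mu.sign())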
The proposed algorithm augments the styles of features to deceive the network into predicting randomly selected labels during training, which prevents the network from being misled by the unexpected styles observed in unseen target domains. While RASP is effective in handling domain shifts, its naïve integration into the training procedure might degrade the capability of learning knowledge from source domains because it places no restriction on the perturbations of representations. This challenge is alleviated by Normalized Feature Mixup (NFM), which facilitates learning the original features while achieving robustness to perturbed representations via their mixup during training. We evaluate the proposed algorithm via extensive experiments on various benchmarks and show that our approach improves domain generalization ability, especially on large-scale benchmarks.","Domain Generalization, Data Augmentation, Adversarial Attacks" Martingale Posterior Neural Processes,https://openreview.net/forum?id=-9PVqZ-IR_,https://openreview.net/pdf?id=-9PVqZ-IR_,"Martingale Posterior Distribution, Neural Processes","A Neural Process (NP) estimates a stochastic process implicitly defined with neural networks given a stream of data, rather than pre-specifying known priors such as Gaussian processes. An ideal NP would learn everything from data without any inductive biases, but in practice, we often restrict the class of stochastic processes for ease of estimation. One such restriction is the use of a finite-dimensional latent variable accounting for the uncertainty in the functions drawn from NPs. Some recent works show that this can be improved with a more “data-driven” source of uncertainty such as bootstrapping. In this work, we take a different approach based on the martingale posterior, a recently developed alternative to Bayesian inference. For the martingale posterior, instead of specifying prior-likelihood pairs, a predictive distribution for future data is specified. Under specific conditions on the predictive distribution, it can be shown that the uncertainty in the generated future data actually corresponds to the uncertainty of the implicitly defined Bayesian posteriors. Based on this result, instead of assuming any form of the latent variables, we equip an NP with a predictive distribution implicitly defined with neural networks and use the corresponding martingale posteriors as the source of uncertainty. The resulting model, which we name the Martingale Posterior Neural Process (MPNP), is demonstrated to outperform baselines on various tasks.", "GuoFeng: A Discourse-aware Evaluation Benchmark for Language Understanding, Translation and Generation",https://openreview.net/forum?id=XIIynqbMXgR,https://openreview.net/pdf?id=XIIynqbMXgR,"A discourse-aware benchmark for evaluating models across language understanding, translation and generation tasks.","Modeling discourse -- the linguistic phenomena that go beyond individual sentences -- is a fundamental and challenging problem in natural language processing (NLP). However, existing evaluation benchmarks mainly focus on the evaluation of intra-sentence properties and overlook important discourse phenomena that cross sentences. To bridge the gap, we propose the GuoFeng benchmark, which can evaluate inter-sentence discourse properties across a diverse set of NLP tasks, covering understanding, translation, and generation. GuoFeng consists of 9 document-level test sets in the literature domain, which contain rich discourse phenomena (e.g.
cohesion and coherence) in Chinese and/or English. For linguistic analysis, we also propose a diagnostic test suite that can examine whether the target models learn discourse knowledge. We evaluate 17 general- and in-domain models based on Transformer and advanced pre-training architectures, showing that fine-grained pretraining based on document-level training data consistently improves the modeling of discourse information. We will release the datasets, pretrained models, and leaderboard, which we hope can significantly facilitate research in this field.","Discourse, Evaluation Benchmark, Pre-trained Models, Natural Language Processing" Multi-View Independent Component Analysis with Shared and Individual Sources,https://openreview.net/forum?id=7WiIzqeqBNL,https://openreview.net/pdf?id=7WiIzqeqBNL,"We investigate the special setting of noisy linear ICA where the observations are split among different views, each of which receives a mixture of shared and individual sources. ","Independent component analysis (ICA) is a blind source separation method for linear disentanglement of independent latent sources from observed data. We investigate the special setting of noisy linear ICA where the observations are split among different views, each receiving a mixture of shared and individual sources. We prove that the corresponding linear structure is identifiable, and that the shared sources can be recovered, provided that sufficiently many diverse views and data points are available. To computationally estimate the sources, we optimize a constrained form of the joint log-likelihood of the observed data among all views. We show empirically that our objective recovers the sources in high-dimensional settings, even when the measurements are corrupted by noise. Finally, we apply the proposed model in a challenging real-life application, where the estimated shared sources from two large transcriptome datasets (observed data) provided by two different labs (two different views) lead to a more plausible representation of the underlying graph structure than existing baselines.","multiview independent component analysis, independent component analysis, blind source separation, multiview representation learning" Centralized Training with Hybrid Execution in Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=u9hnCwX99I1,https://openreview.net/pdf?id=u9hnCwX99I1,,"We introduce hybrid execution in multi-agent reinforcement learning (MARL), a new paradigm in which agents aim to successfully perform cooperative tasks with any communication level at execution time by taking advantage of information-sharing among the agents. Under hybrid execution, the communication level can range from a setting in which no communication is allowed between agents (fully decentralized) to a setting featuring full communication (fully centralized). To formalize our setting, we define a new class of multi-agent partially observable Markov decision processes (POMDPs) that we name hybrid-POMDPs, which explicitly model a communication process between the agents. We contribute MARO, an approach that combines an autoregressive predictive model, used to estimate missing agents' observations, with a dropout-based RL training scheme that simulates different communication levels during the centralized training phase. We evaluate MARO on standard scenarios and extensions of previous benchmarks tailored to emphasize the negative impact of partial observability in MARL.
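A minimal sketch of the dropout-based idea just described, assuming a predictor module whose output matches the shape of the teammates' observations; this is an illustration under those assumptions, not the MARO implementation.

# During centralized training, each agent's view of its teammates is randomly
# masked to simulate communication levels from fully centralized to fully
# decentralized, and a learned predictive model fills in the gaps.
import torch

def mix_observations(own_obs, teammate_obs, predictor, p_drop):
    # teammate_obs: (n_agents-1, batch, obs_dim) observations shared over comms
    # predictor: assumed to map own_obs to a tensor of the same shape as teammate_obs
    mask = (torch.rand(teammate_obs.shape[:2]) > p_drop).float().unsqueeze(-1)
    predicted = predictor(own_obs)        # estimate of the missing observations
    # keep real observations where communication succeeded, predictions otherwise
    return mask * teammate_obs + (1.0 - mask) * predicted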
Experimental results show that our method consistently outperforms baselines, allowing agents to act with faulty communication while successfully exploiting shared information.",Multi-Agent Reinforcement Learning Towards Open Temporal Graph Neural Networks,https://openreview.net/forum?id=N9Pk5iSCzAn,https://openreview.net/pdf?id=N9Pk5iSCzAn,"In this paper, we propose a general and principled learning approach for open temporal graphs where the class set for nodes is open.","Graph neural networks (GNNs) for temporal graphs have recently attracted increasing attention, where a common assumption is that the class set for nodes is closed. However, in real-world scenarios, they often face the open-set problem, with the class set dynamically increasing as time passes. This brings two big challenges to existing dynamic GNN methods: (i) How to dynamically propagate appropriate information in an open temporal graph, where new class nodes are often linked to old class nodes. This case leads to a sharp contradiction: typical GNNs are prone to making the embeddings of connected nodes similar, while we expect the embeddings of these two interacting nodes to be distinguishable since they belong to different classes. (ii) How to avoid catastrophic knowledge forgetting over old classes when learning new classes that occur in temporal graphs. In this paper, we propose a general and principled learning approach for open temporal graphs, called OTGNet, with the goal of addressing the above two challenges. We assume the knowledge of a node can be disentangled into class-relevant and class-agnostic components, and thus explore a new message passing mechanism by extending the information bottleneck principle to only propagate class-agnostic knowledge between nodes of different classes, avoiding the aggregation of conflicting information. Moreover, we devise a strategy to select both important and diverse triad sub-graph structures for effective class-incremental learning. Extensive experiments on three real-world datasets from different domains demonstrate the superiority of our method compared to the baselines.","Temporal Graph Neural Networks, Open Temporal Graphs, Class-Incremental Learning" FedMAE: Federated Self-Supervised Learning with One-Block Masked Auto-Encoder,https://openreview.net/forum?id=3qvEPE6q4L,https://openreview.net/pdf?id=3qvEPE6q4L,A novel federated self-supervised learning framework with a cascade design,"Recent federated learning (FL) methods have started to focus on how to use unlabeled data in clients for training, due to users' privacy concerns, high labeling costs, or lack of expertise. However, current Federated Semi-Supervised/Self-Supervised Learning (FSSL) approaches fail to learn from large-scale images because of the limited computing resources of local clients. In this paper, we introduce a new framework, FedMAE, which stands for Federated Masked AutoEncoder, to address the problem of how to utilize unlabeled large-scale images for FL. Specifically, FedMAE can pre-train a one-block Masked AutoEncoder (MAE) using large images in lightweight client devices, and then cascade multiple pre-trained one-block MAEs in the server to build a multi-block ViT backbone for downstream tasks.
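The cascading step just described could be sketched as follows, assuming each pre-trained client model exposes a single encoder_block attribute (a hypothetical interface, not the paper's actual API).

# Server-side assembly of a deeper backbone from one-block MAEs.
import copy
import torch.nn as nn

def cascade_backbone(one_block_maes):
    # one_block_maes: list of pre-trained one-block MAEs from different clients
    blocks = [copy.deepcopy(mae.encoder_block) for mae in one_block_maes]
    return nn.Sequential(*blocks)   # multi-block ViT-style backbone for downstream tasks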
Theoretical analysis and experimental results on image reconstruction and classification show that our FedMAE achieves superior performance compared to state-of-the-art FSSL methods.","Federated Learning, Self-Supervised Learning, Masked AutoEncoder" Learning Discriminative Representations for Chromosome Classification with Small Datasets,https://openreview.net/forum?id=_GstklGE4l,https://openreview.net/pdf?id=_GstklGE4l,,"Chromosome classification is crucial for karyotype analysis in cytogenetics. Karyotype analysis is a fundamental approach for clinical cytogeneticists to identify numerical and structural chromosomal abnormalities. However, classifying chromosomes accurately and robustly in clinical applications is still challenging due to: 1) rich deformations of chromosome shape, 2) similarity among chromosomes, and 3) imbalanced and insufficient labelled datasets. This paper proposes a novel pipeline for the automatic classification of chromosomes. Unlike existing methods, our approach is primarily based on learning meaningful data representations rather than only finding classification features in given samples. The proposed pipeline comprises three stages. The first stage extracts meaningful visual features of chromosomes by utilizing ResNet with a triplet loss. The second stage optimizes the features from stage one to obtain a linear discriminative representation via maximal coding rate reduction, ensuring that the clusters representing different chromosome types are far away from each other while embeddings of the same type are close to each other within a cluster. The third stage identifies chromosomes. Based on the meaningful feature representation learned in the previous stage, traditional machine learning algorithms such as SVMs are adequate for the classification task. Evaluation results on a publicly available dataset show that our method achieves 97.22% accuracy and is better than state-of-the-art methods.","Chromosome classification, data representation learning, deep neural networks, discriminative representation, maximal coding rate reduction" APLA: Class-imbalanced Semi-supervised Learning with Adapative Pseudo-labeling and Loss Adjustment,https://openreview.net/forum?id=PQfP-d9BWkF,https://openreview.net/pdf?id=PQfP-d9BWkF,We use Class-Aware Pseudo-label Thresholding and Class-Aware Loss Adjustment to improve the performance of existing SSL algorithms in the class-imbalanced setting.,"Semi-supervised learning (SSL) can substantially improve the performance of deep neural networks by utilizing unlabeled data when labeled data is scarce. Existing SSL algorithms implicitly assume that the class distributions of labeled and unlabeled datasets are balanced, meaning that the different classes have the same number of training samples. However, they can hardly perform well on minority classes (the classes with few training examples) when the class distribution of training data is imbalanced, since the pseudo-labels learned from unlabeled data tend to be biased toward majority classes (the classes with a large number of training examples). To alleviate this issue, we propose a method called Adaptive Pseudo-labeling and Loss Adjustment (APLA) for class-imbalanced semi-supervised learning (CISSL), which includes Class-Aware Pseudo-label Thresholding (CAPT), which can utilize the imbalanced unlabeled data by dynamically adjusting the threshold for selecting pseudo-labels, and Class-Aware Loss Adjustment (CALA), which can mitigate the bias in both the supervised loss and the unsupervised loss.
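A hedged sketch of class-aware pseudo-label thresholding in the spirit of CAPT: classes that the model currently predicts less often get lower thresholds, so minority-class pseudo-labels are not crowded out. The frequency-based adjustment rule below is an assumption for illustration, not the paper's exact formula.

import torch

def select_pseudo_labels(probs, base_threshold=0.95):
    # probs: (N, num_classes) softmax outputs on unlabeled data
    conf, pseudo = probs.max(dim=1)
    class_freq = torch.bincount(pseudo, minlength=probs.shape[1]).float()
    rel_freq = class_freq / class_freq.max().clamp(min=1.0)
    thresholds = base_threshold * rel_freq[pseudo]   # rarer class -> lower bar
    keep = conf >= thresholds
    return pseudo[keep], keep                        # selected labels and mask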
Experiments show that APLA delivers much higher accuracy than benchmark methods under various CISSL scenarios.","semi-supervised learning, class-imbalanced learning, class-imbalanced semi-supervised learning" Label-Efficient Online Continual Object Detection in Streaming Video,https://openreview.net/forum?id=kD2J9vcfByo,https://openreview.net/pdf?id=kD2J9vcfByo,"Towards label-efficient online continual object detection in video streams, our Efficient-CLS only uses 25% annotation costs while it still outperforms the best baseline.","To thrive in evolving environments, humans are capable of continual acquisition and transfer of new knowledge, from a continuous video stream, with minimal supervision, while retaining previously learnt experiences. In contrast to human learning, most standard continual learning (CL) benchmarks focus on learning from static i.i.d. images that all have labels for training. Here, we examine a more realistic and challenging problem—Label-Efficient Online Continual Object Detection (LEOCOD) in streaming video. Addressing this problem would greatly benefit many real-world applications (e.g., personalized robots, augmented/virtual reality headsets, etc.) with reduced data annotation costs and model retraining time. To tackle this problem, we seek inspiration from complementary learning systems (CLS) in human brains and propose Efficient-CLS, a plug-and-play module that can be easily inserted into and improve existing continual learners. On two challenging CL benchmarks for streaming real-world videos, we integrate Efficient-CLS into state-of-the-art CL algorithms and achieve significant improvement with minimal forgetting across all supervision levels. Remarkably, with only 25% annotated video frames, our Efficient-CLS still outperforms the base CL learners, which are trained with 100% annotations on all video frames. We will make the source code publicly available upon publication.","Online Continual Learning, Object Detection, Complementary Learning Systems, Streaming Video" ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency,https://openreview.net/forum?id=2XLRBjY46O6,https://openreview.net/pdf?id=2XLRBjY46O6,Discovering text-supervised segmentation masks via multi-view semantic consistency,"Recently, great success has been achieved in learning visual representations from text supervision, facilitating the emergence of text-supervised semantic segmentation. However, existing works focus on pixel grouping and cross-modal semantic alignment while ignoring the correspondence among multiple augmented views of the same image. To overcome this limitation, we propose multi-View Consistent learning (ViewCo) for text-supervised semantic segmentation. Specifically, we first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image. Additionally, we propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision by contrasting the segment features of Siamese visual encoders. The text-to-views consistency benefits the dense assignment of visual features by encouraging different crops to align with the same text, while the cross-view segmentation consistency modeling provides additional self-supervision, overcoming the limitation of ambiguous text supervision for segmentation masks. Trained with large-scale image-text data, our model can directly segment objects of arbitrary categories in a zero-shot manner.
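The text-to-views consistency just described can be illustrated with a symmetric contrastive loss that aligns two augmented views of an image to the same text embedding. A minimal sketch, assuming precomputed (N, D) embeddings; not the authors' full objective.

import torch
import torch.nn.functional as F

def text_to_views_loss(view1_emb, view2_emb, text_emb, tau=0.07):
    # all inputs: (N, D); rows are matched image-text pairs
    losses = []
    for v in (view1_emb, view2_emb):
        logits = F.normalize(v, dim=1) @ F.normalize(text_emb, dim=1).T / tau
        target = torch.arange(v.size(0), device=v.device)
        losses.append(F.cross_entropy(logits, target))  # both crops -> same text
    return sum(losses) / 2.0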
Extensive experiments show that ViewCo outperforms state-of-the-art methods by an average of up to 2.9%, 1.6%, and 2.4% mIoU on PASCAL VOC2012, PASCAL Context, and COCO, respectively.","Zero-shot semantic segmentation, Vision-Language Pretraining, Visual Self-Supervision, Consistent Semantics" Simplicity bias in $1$-hidden layer neural networks,https://openreview.net/forum?id=PxohstFQm9q,https://openreview.net/pdf?id=PxohstFQm9q,Gradient Descent on a 1-hidden-layer neural network learns a function of essentially a lower dimensional projection of the input.,"Recent works \citep{shah2020pitfalls,chen2021intriguing} have demonstrated that neural networks exhibit extreme \emph{simplicity bias} (SB). That is, they learn \emph{only the simplest} features to solve a task at hand, even in the presence of other, more robust but more complex features. Due to the lack of a general and rigorous definition of \emph{features}, these works showcase SB on \emph{semi-synthetic} datasets such as Color-MNIST and MNIST-CIFAR, where defining features is relatively easier. In this work, we rigorously define as well as thoroughly establish SB for \emph{one hidden layer} neural networks. More concretely, (i) we define SB as the network essentially being a function of a low dimensional projection of the inputs, (ii) theoretically, we show that when the data is linearly separable, the network primarily depends on only the linearly separable ($1$-dimensional) subspace even in the presence of an arbitrarily large number of other, more complex features which could have led to a significantly more robust classifier, (iii) empirically, we show that models trained on \emph{real} datasets such as Imagenette and Waterbirds-Landbirds indeed depend on a low dimensional projection of the inputs, thereby demonstrating SB on these datasets, and (iv) finally, we present a natural ensemble approach that encourages diversity in models by training successive models on features not used by earlier models, and demonstrate that it yields models that are significantly more robust to Gaussian noise.","Simplicity Bias, Neural Network, Gradient Descent" FedEED: Efficient Federated Distillation with Ensemble of Aggregated Models,https://openreview.net/forum?id=fCbTxKYJovs,https://openreview.net/pdf?id=fCbTxKYJovs,,"In this paper, we study the key components of knowledge distillation-based model aggregation in federated learning (FL). We first propose a generalized distillation framework in which the process of federated distillation is divided into three key stages. By investigating the contributions of each stage, we introduce a new FL framework, named Federated Efficient Ensemble Distillation (FedEED), where the ensemble teacher is created based on aggregated models. Experimental results show that FedEED outperforms state-of-the-art methods, including FedAvg and FedDF, on the benchmark datasets. Besides performance, FedEED also demonstrates improved scalability and privacy when compared with existing distillation-based aggregation algorithms. In particular, FedEED does not require direct access to users' models, which can protect the users' privacy.
Furthermore, due to the ensemble created from aggregated models, FedEED is highly scalable, and the asymmetric distillation scheme allows parallelism between server-side distillation and client-side local training, which could speed up the training of large-scale learning systems.","Federated Learning, Knowledge Distillation" Critical Batch Size Minimizes Stochastic First-Order Oracle Complexity of Deep Learning Optimizer using Hyperparameters Close to One,https://openreview.net/forum?id=p6qlG1zXs9v,https://openreview.net/pdf?id=p6qlG1zXs9v,Critical batch size minimizes stochastic first-order oracle complexity of deep learning optimizer using hyperparameters close to one.,"Practical results have shown that deep learning optimizers using small constant learning rates, hyperparameters close to one, and large batch sizes can find the model parameters of deep neural networks that minimize the loss functions. We first show theoretical evidence that the momentum method (Momentum) and adaptive moment estimation (Adam) perform well in the sense that the upper bound of the theoretical performance measure is small with a small constant learning rate, hyperparameters close to one, and a large batch size. Next, we show that there exists a batch size, called the critical batch size, minimizing the stochastic first-order oracle (SFO) complexity, which is the stochastic gradient computation cost, and that the SFO complexity increases once the batch size exceeds the critical batch size. Finally, we provide numerical results that support our theoretical results. That is, the numerical results indicate that Adam using a small constant learning rate, hyperparameters close to one, and the critical batch size minimizing SFO complexity converges faster than Momentum and stochastic gradient descent (SGD).","Adam, adaptive method, critical batch size, hyperparameters, learning rate, nonconvex optimization, stochastic first-order oracle complexity" Jointist: Simultaneous Improvement of Multi-instrument Transcription and Music Source Separation via Joint Training,https://openreview.net/forum?id=DClS-1HQ_0P,https://openreview.net/pdf?id=DClS-1HQ_0P,Joint training of music transcription and source separation improves the performance of both,"In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other two modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilizes instrument information and transcription results. The joint training of the transcription and source separation modules serves to improve the performance of both tasks. The instrument module is optional and can be directly controlled by human users. This makes Jointist a flexible user-controllable framework. Our challenging problem formulation makes the model highly useful in the real world, given that modern popular music typically consists of multiple instruments. Its novelty, however, necessitates a new perspective on how to evaluate such a model. In our experiments, we assess the proposed model from various aspects, providing a new evaluation perspective for multi-instrument transcription. A subjective listening test shows that Jointist achieves state-of-the-art performance on popular music, outperforming existing multi-instrument transcription models such as MT3.
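Aside on the critical-batch-size entry above: with K(b) denoting the number of steps needed to reach a target accuracy at batch size b, the SFO complexity is N(b) = b * K(b). The toy form of K(b) below is invented purely to show how N(b) can dip at a critical batch size and grow beyond it; it is not the paper's bound.

def sfo_complexity(b, a=1e6, c=1.0):
    k = a / b**2 + c                # hypothetical iteration complexity K(b)
    return b * k                    # N(b) = b*K(b) = a/b + c*b

batch_sizes = [2 ** i for i in range(4, 14)]
costs = {b: sfo_complexity(b) for b in batch_sizes}
critical = min(costs, key=costs.get)
print(f"critical batch size (toy model): {critical}")  # minimizer of N(b)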
We conducted experiments on several downstream tasks and found that the proposed method improved transcription by more than 1 percentage point (ppt.), source separation by 5 SDR, downbeat detection by 1.8 ppt., chord recognition by 1.4 ppt., and key estimation by 1.4 ppt. when utilizing transcription results obtained from Jointist. ","multi-task learning, automatic music transcription, music source separation, instrument recognition" Where prior learning can and can't work in unsupervised inverse problems,https://openreview.net/forum?id=c2X1Qa9K3bD,https://openreview.net/pdf?id=c2X1Qa9K3bD,,"Linear inverse problems consist in recovering a signal from its noisy observation in a lower dimensional space. Many popular resolution methods rely on data-driven algorithms that learn a prior from pairs of signals and observations to overcome the loss of information. However, these approaches are difficult, if not impossible, to adapt to unsupervised contexts -- where no ground truth data are available -- due to the need for learning from clean signals. This paper studies situations that do or do not allow learning a prior in unsupervised inverse problems. First, we focus on dictionary learning and point out that recovering the dictionary is unfeasible without constraints when the signal is observed through only one measurement operator. It can, however, be learned with multiple operators, provided that they are diverse enough to span the whole signal space. Then, we study methods where weak priors are made available either through optimization constraints or deep learning architectures. We empirically emphasize that they perform better than hand-crafted priors only if they are adapted to the inverse problem. ","Inverse problems, unsupervised learning, dictionary learning, Deep Image Prior, Plug and Play" When are smooth-ReLUs ReLU-like?,https://openreview.net/forum?id=0qnryNf6XwR,https://openreview.net/pdf?id=0qnryNf6XwR,"We parametrize relaxations of ReLU and devise initialization schemes that retain ReLU-like properties while being differentiable, verified experimentally and confirmed during training.","ReLU is one of the most popular activations in deep learning, especially thanks to its stabilizing effect on training. However, because it is non-differentiable at the origin, it complicates the use of analysis methods that examine derivatives, such as the Neural Tangent Kernel (NTK). Many smooth relaxations try to retain the practical benefits of ReLU while increasing network regularity. Although their success has ranged widely, some notable architectures (e.g., the BERT family) do utilize them. We present a theoretical characterization of smooth-ReLUs within fully-connected feed-forward neural networks. In addition to the well-known SWISH and GeLU, we introduce GumbelLU, AlgebraicLU, and GudermanLU as new relaxations. All these activations can be characterized by a positive temperature parameter which we can lower to continuously improve the approximation. By studying the interplay of initialization schemes with temperature, we confirm that when these relaxations converge uniformly to ReLU, the statistical properties of the corresponding neural networks at initialization also converge to those of ReLU networks. Moreover, we derive temperature-dependent critical initialization schemes with which networks based on these activations exhibit stable ReLU-like behavior at any temperature.
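The temperature story above can be checked numerically with the standard SWISH form x * sigmoid(x / t), which converges to ReLU as t -> 0+. (The paper's GumbelLU, AlgebraicLU, and GudermanLU are analogous temperature-parameterized relaxations; SWISH is used here only because its form is standard.)

import math

def swish(x, t):
    return x / (1.0 + math.exp(-x / t))   # x * sigmoid(x / t)

def relu(x):
    return max(0.0, x)

# the worst-case gap on a grid shrinks roughly linearly in the temperature t
for t in (1.0, 0.1, 0.01):
    grid = [i / 10 - 5 for i in range(101)]            # [-5, 5]
    worst = max(abs(swish(x, t) - relu(x)) for x in grid)
    print(f"t={t:5.2f}  max |swish - relu|: {worst:.4f}")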
Finally, we empirically study both classes of networks on MNIST and CIFAR-10 in the full-batch training regime. We show that, while all networks exhibit very similar train loss trajectories at criticality, smooth-ReLU networks feature differentiable NTKs throughout training, whereas ReLU networks exhibit stochastic NTK fluctuations. Our results clarify how smooth-ReLU relaxations reproduce the practical benefits of ReLU in everywhere-smooth neural networks.","ReLU, SWISH, GeLU, Critical Initialization, Fully Connected Neural Networks, Deep Networks" Hypernetwork approach to Bayesian MAML,https://openreview.net/forum?id=Z4Kexjh34vT,https://openreview.net/pdf?id=Z4Kexjh34vT,"In this paper, we propose a novel generalization of Bayesian MAML, which employs Bayesian principles along with Hypernetworks for MAML.","The main goal of Few-Shot learning algorithms is to enable learning from small amounts of data. One of the most popular and elegant Few-Shot learning approaches is Model-Agnostic Meta-Learning (MAML). The main idea behind this method is to learn shared universal weights of a meta-model, which are then adapted for specific tasks. However, due to limited data size, the method suffers from overfitting and poorly quantifies uncertainty. Bayesian approaches could, in principle, alleviate these shortcomings by learning weight distributions in place of point-wise weights. Unfortunately, previous Bayesian modifications of MAML are limited in ways similar to the classic MAML, e.g., task-specific adaptations must share the same structure and cannot diverge much from the universal meta-model. Additionally, task-specific distributions are considered posteriors to the universal distributions working as priors, and optimizing them jointly with gradients is hard and poses a risk of getting stuck in local optima. In this paper, we propose BH-MAML, a novel Bayesian MAML generalization that employs Bayesian principles and Hypernetworks for MAML. We achieve better convergence than the previous methods by classically learning universal weights. Furthermore, Bayesian treatment of the specific tasks enables uncertainty quantification, and high flexibility of task adaptations is achieved using Hypernetworks instead of gradient-based updates. Consequently, the proposed approach not only improves over the previous methods, both classic and Bayesian MAML, on several standard Few-Shot learning benchmarks, but also benefits from the properties of the Bayesian framework.","few-shot learning, MAML, hypernetworks" SpectraNet: multivariate forecasting and imputation under distribution shifts and missing data,https://openreview.net/forum?id=bSuY3hSRJPP,https://openreview.net/pdf?id=bSuY3hSRJPP,We propose a novel encoderless multivariate time-series forecasting model with SoTA performance that is robust to missing data and distribution shifts,"In this work, we tackle two widespread challenges in real applications of time-series forecasting that have been largely understudied: distribution shifts and missing data. We propose SpectraNet, a novel multivariate time-series forecasting model that dynamically infers a latent space spectral decomposition to capture current temporal dynamics and correlations on the recently observed history. A Convolutional Neural Network maps the learned representation by sequentially mixing its components and refining the output.
Our proposed approach can simultaneously produce forecasts and interpolate past observations, and can therefore greatly simplify production systems by unifying imputation and forecasting tasks into a single model. SpectraNet achieves SoTA performance simultaneously on both tasks on five benchmark datasets, compared to forecasting and imputation models, with up to 92% fewer parameters and comparable training times. In settings with up to 80% missing data, SpectraNet has average performance improvements of almost 50% over the second-best alternative.","time series, forecasting, missing values, deep-learning, interpolation, distribution shift" An Evolutionary Approach to Dynamic Introduction of Tasks in Large-scale Multitask Learning Systems,https://openreview.net/forum?id=3Z-xKxKc-R,https://openreview.net/pdf?id=3Z-xKxKc-R,,"Multitask learning assumes that models capable of learning from multiple tasks can achieve better quality and efficiency via knowledge transfer, a key feature of human learning. However, state-of-the-art ML models rely on high customization for each task and leverage size and data scale rather than scaling the number of tasks. Also, continual learning, which adds the temporal aspect to multitask learning, is often focused on the study of common pitfalls such as catastrophic forgetting instead of being studied at a large scale as a critical component for building the next generation of artificial intelligence. We propose an evolutionary method capable of generating large-scale multitask models that support the dynamic addition of new tasks. The generated multitask models are sparsely activated and integrate a task-based routing that guarantees bounded compute cost and fewer added parameters per task as the model expands. The proposed method relies on a knowledge compartmentalization technique to achieve immunity against catastrophic forgetting and other common pitfalls such as gradient interference and negative transfer. We demonstrate empirically that the proposed method can jointly solve and achieve competitive results on 69 public image classification tasks, for example improving the state of the art on a competitive benchmark such as CIFAR-10 by achieving a 15% relative error reduction compared to the best model trained on public data.", Uncertainty and Traffic Light Aware Pedestrian Crossing Intention Prediction,https://openreview.net/forum?id=KiT3-iN8wHJ,https://openreview.net/pdf?id=KiT3-iN8wHJ,We improve pedestrian crossing intention model performance and robustness using traffic light status and predicted uncertainty estimates.,"Predicting Vulnerable Road User (VRU) crossing intention is one of the major challenges in automated driving. Crossing intention prediction systems trained only on pedestrian features underperform in situations that are most obvious to humans, as the latter take additional context features into consideration. Moreover, such systems tend to be over-confident for out-of-distribution samples, making them less reliable for downstream tasks like sensor fusion and trajectory planning in automated vehicles. In this work, we demonstrate that the results of crossing intention prediction systems can be improved by incorporating traffic light status as an additional input. Further, we make the model robust and interpretable by estimating uncertainty. Experiments on the PIE dataset show that the F1-score improved from 0.77 to 0.82 and above for three different baseline systems when considering traffic-light context.
By adding uncertainty estimation, we show increased uncertainty values for out-of-distribution samples, leading to more interpretable and reliable predictions of crossing intention.","deep learning, computer vision, recurrent neural networks, uncertainty estimation, intention prediction, attention mechanism, autonomous driving" Benchmarking Constraint Inference in Inverse Reinforcement Learning,https://openreview.net/forum?id=vINj_Hv9szL,https://openreview.net/pdf?id=vINj_Hv9szL,We design a benchmark with important applications for Inverse Constrained Reinforcement Learning and propose a variational Bayesian approach for modeling the distribution of constraints.,"When deploying Reinforcement Learning (RL) agents into a physical system, we must ensure that these agents are well aware of the underlying constraints. In many real-world problems, however, the constraints followed by expert agents (e.g., humans) are often hard to specify mathematically and unknown to the RL agents. To tackle these issues, Inverse Constrained Reinforcement Learning (ICRL) considers the formalism of Constrained Markov Decision Processes (CMDPs) and estimates constraints from expert demonstrations by learning a constraint function. As an emerging research topic, ICRL does not have common benchmarks, and previous works tested their algorithms with hand-crafted environments (e.g., grid worlds). In this paper, we construct an ICRL benchmark in the context of two major application domains: robot control and autonomous driving. For each environment, we design relevant constraints, generate the corresponding expert trajectories, and empirically justify the importance of these constraints. To recover the constraints from expert demonstrations, previous ICRL methods typically learn a deterministic constraint function, which might miss the true constraint during training. We tackle this issue by proposing a variational Bayesian approach to model the posterior distribution of candidate constraints. Empirical evaluation shows that this method outperforms other baselines in terms of collecting rewards and satisfying constraints. The benchmark, including the instructions for reproducing ICRL algorithms, is available at {\it temporarily hidden due to the anonymity policy}.","Inverse Reinforcement Learning, Constrained Reinforcement Learning, Variational Bayesian Inference" Forward and Backward Lifelong Learning with Time-dependent Tasks,https://openreview.net/forum?id=WK22pk7bSFR,https://openreview.net/pdf?id=WK22pk7bSFR,This paper presents lifelong learning methods based on minimax risk classifiers (LMRCs) that effectively exploit forward and backward learning and account for time-dependent tasks.,"For a sequence of classification tasks that arrive over time, lifelong learning methods can boost the effective sample size of each task by leveraging information from preceding and succeeding tasks (forward and backward learning). However, backward learning is often prone to so-called catastrophic forgetting, in which a task’s performance gets worse as information from succeeding tasks is repeatedly incorporated. In addition, current lifelong learning techniques are designed for i.i.d. tasks and cannot capture the typically higher similarity between consecutive tasks. This paper presents lifelong learning methods based on minimax risk classifiers (LMRCs) that effectively exploit forward and backward learning and account for time-dependent tasks.
In addition, we analytically characterize the increase in effective sample size provided by forward and backward learning in terms of the tasks’ expected quadratic change. The experimental evaluation shows that LMRCs can result in a significant performance improvement, especially for reduced sample sizes. ","Lifelong learning, Continual learning, Supervised Classification, Performance Guarantees, Minimax risk classification" Memory Gym: Partially Observable Challenges to Memory-Based Agents,https://openreview.net/forum?id=jHc8dCx6DDr,https://openreview.net/pdf?id=jHc8dCx6DDr,Memory Gym is a novel challenge especially for memory-based agents.,"Memory Gym is a novel benchmark for challenging Deep Reinforcement Learning agents to memorize events across long sequences, be robust to noise, and generalize. It consists of the partially observable 2D environments Mortar Mayhem, Mystery Path, and Searing Spotlights. These environments are believed to be unsolvable by memory-less agents because they feature strong dependencies on memory and frequent agent-memory interactions. Several commonly used related environments do not share those qualities. Empirical results based on Proximal Policy Optimization (PPO) and Gated Recurrent Unit (GRU) underline the strong memory dependency of the contributed environments. The hardness of these environments can be smoothly scaled, and different levels of difficulty (some of them as yet unsolved) emerge for Mortar Mayhem and Mystery Path. Surprisingly, Searing Spotlights poses a tremendous challenge to GRU-PPO, which remains an open puzzle. Even though the randomly moving spotlights reveal parts of the environment's ground truth, environmental ablations hint that these pose a severe perturbation to agents that leverage recurrent model architectures as their memory.","Deep Reinforcement Learning, Memory, Benchmark, Proximal Policy Optimization, Gated Recurrent Unit, HELM" Token-level Fitting Issues of Seq2seq Models,https://openreview.net/forum?id=cri2n_3PJAw,https://openreview.net/pdf?id=cri2n_3PJAw,We find that seq2seq models trained with early-stopping suffer from overfitting and underfitting at the token level. We identify three major factors that influence token-level fitting.,"Sequence-to-sequence (seq2seq) models have been widely used for natural language processing, computer vision, and other deep learning tasks. We find that seq2seq models trained with early-stopping suffer from issues at the token level. In particular, while some tokens in the vocabulary demonstrate overfitting, others underfit when training is stopped. Experiments show that the phenomena are pervasive across different models, even in fine-tuned large pretrained models. We identify three major factors that influence token-level fitting, which include token frequency, parts-of-speech, and prediction discrepancy. Further, we find that external factors such as language, model size, domain, data scale, and pretraining can also influence the fitting of tokens. ","overfitting, underfitting, seq2seq model" Worst-case Few-shot Evaluation: Are Neural Networks Robust Few-shot Learners?,https://openreview.net/forum?id=53yQBJNQVJu,https://openreview.net/pdf?id=53yQBJNQVJu,,"Neural networks have achieved remarkable performance on various few-shot tasks. However, recent studies reveal that existing few-shot models often exploit spurious correlations between training and test sets, achieving high performance that is hard to generalize.
Motivated by the principle that a robust few-shot learner should accurately classify data given any valid training set, we consider a worst-case few-shot evaluation that computes worst-case generalization errors by constructing a challenging few-shot set. Specifically, we search for the label-balanced subset of a full-size training set that results in the largest expected risk. Since the search space is enormous, we propose an efficient method, NMMD-attack, to optimize the target by maximizing the NMMD distance (maximum mean discrepancy based on the neural tangent kernel). Experiments show that NMMD-attack can successfully attack various architectures. The large gap between average performance and worst-case performance shows that neural networks still suffer from poor robustness. We appeal for more worst-case benchmarks for better robust few-shot evaluation.","Distributional Robustness, few-shot evaluation" Learning Sampling Policy to Achieve Fewer Queries for Zeroth-Order Optimization,https://openreview.net/forum?id=7KdrFjpmJf7,https://openreview.net/pdf?id=7KdrFjpmJf7,TL,"Zeroth-order (ZO) methods, which use the finite difference of two function evaluations (also called the ZO gradient) to approximate the first-order gradient, have recently attracted much attention in machine learning because of their broad applications. The accuracy of the ZO gradient depends highly on how many finite differences are averaged, which is intrinsically determined by the number of perturbations randomly drawn from a distribution. Existing ZO methods try to learn a data-driven distribution for sampling the perturbations to improve the efficiency of ZO optimization (ZOO) algorithms. In this paper, we explore a new and parallel direction, i.e., learning an optimal sampling policy, instead of using a totally random strategy, to generate perturbations based on reinforcement learning (RL) techniques, which makes it possible to approximate the gradient with only two function evaluations. Specifically, we first formulate the problem of learning a sampling policy as a Markov decision process. Then, we propose our ZO-RL algorithm, \textit{i.e.}, using deep deterministic policy gradient, an actor-critic RL algorithm, to learn a sampling policy that can guide the generation of perturbed vectors so that the ZO gradients are as accurate as possible. Importantly, existing ZOO algorithms that learn a distribution can be plugged in to improve the exploration of ZO-RL. Experimental results with different ZO estimators show that our ZO-RL algorithm can effectively reduce the query complexity of ZOO algorithms and converges faster than existing ZOO algorithms, especially in the later stage of the optimization process.","Zeroth-order optimization, reinforcement learning" Discovering Policies with DOMiNO,https://openreview.net/forum?id=kjkdzBW3b8p,https://openreview.net/pdf?id=kjkdzBW3b8p,,"In this work we propose a Reinforcement Learning (RL) agent that can discover complex behaviours in a rich environment with a simple reward function. We define diversity in terms of state-action occupancy measures, since policies with different occupancy measures visit different states on average. More importantly, defining diversity in this way allows us to derive an intrinsic reward function for maximizing the diversity directly. Our agent, DOMiNO, stands for Diversity Optimization Maintaining Near Optimality.
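Aside on the zeroth-order optimization entry above: the classic two-point estimator it builds on uses one random direction and two function evaluations; a learned sampling policy, as in ZO-RL, would choose the direction u instead of drawing it at random. A generic sketch, not the paper's code:

import random

def zo_gradient(f, x, mu=1e-4, u=None):
    # f: callable on a list of floats; x: current point; u: perturbation direction
    if u is None:
        u = [random.gauss(0.0, 1.0) for _ in x]          # random direction
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    scale = (f(x_plus) - f(x_minus)) / (2.0 * mu)        # two evaluations only
    return [scale * ui for ui in u]                      # <grad f, u> * u

# example: for f(x) = sum(x_i^2), a single two-point estimate is a noisy
# projection of the true gradient [2, -4] at the point [1, -2]
print(zo_gradient(lambda v: sum(t * t for t in v), [1.0, -2.0]))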
It is based on maximizing a reward function with two components: the extrinsic reward and a diversity intrinsic reward, which are combined with Lagrange multipliers to balance the quality-diversity trade-off. Any RL algorithm can be used to maximize this reward, and no other changes are needed. We demonstrate that, given simple reward functions in various control domains, like height (stand) and forward velocity (walk), DOMiNO discovers diverse and meaningful behaviours. We also perform extensive analysis of our approach, compare it with other multi-objective baselines, demonstrate that we can control both the quality and the diversity of the set via interpretable hyperparameters, and show that the set is robust to perturbations of the environment.", Practical Real Video Denoising with Realistic Degradation Model,https://openreview.net/forum?id=jMtwOppbKOU,https://openreview.net/pdf?id=jMtwOppbKOU,This paper proposes a new realistic video degradation model for practical real video denoising.,"Existing video denoising methods typically assume noisy videos are degraded from clean videos by adding Gaussian noise. However, deep models trained on such a degradation assumption will inevitably give rise to poor performance on real videos due to degradation mismatch. Although some studies attempt to train deep models on noisy and noise-free video pairs captured by cameras, such models can only work well for specific cameras and do not generalize well to other videos. In this paper, we propose to lift this limitation and focus on the problem of general real video denoising, with the aim of generalizing well to unseen real-world videos. We tackle this problem by first investigating the common behaviors of video noise and observing two important characteristics: 1) downscaling helps to reduce the noise level in the spatial domain, and 2) the information from adjacent frames helps to remove the noise of the current frame in the temporal domain. Motivated by these two observations, we propose a multi-scale recurrent architecture that makes full use of the above two characteristics. Second, we propose a synthetic real noise degradation model that randomly shuffles different noise types to train the denoising model. With a synthesized and enriched degradation space, our degradation model can help to bridge the distribution gap between training data and real-world data. Extensive experiments demonstrate that our proposed method achieves state-of-the-art performance and better generalization ability than existing methods on both synthetic Gaussian denoising and practical real video denoising.","Real Video Denoising, Degradation Model" SpeedyZero: Mastering Atari with Limited Data and Time,https://openreview.net/forum?id=Mg5CLXZgvLJ,https://openreview.net/pdf?id=Mg5CLXZgvLJ,"SpeedyZero is a distributed model-based RL training system based on EfficientZero, featuring fast training speed and high sample efficiency.","Many recent breakthroughs of deep reinforcement learning (RL) are mainly built upon large-scale distributed training of model-free methods using millions to billions of samples. On the other hand, state-of-the-art model-based RL methods can achieve human-level sample efficiency but often take a much longer overall training time than model-free methods. However, high sample efficiency and fast training time are both important to many real-world applications.
We develop SpeedyZero, a distributed RL system built upon a state-of-the-art model-based RL method, EfficientZero, with a dedicated system design for fast distributed computation. We also develop a novel algorithmic technique, Priority Refresh, to stabilize massively parallel model-based training. SpeedyZero maintains on-par sample efficiency compared with EfficientZero while achieving a 20X speedup in wall-clock time, leading to human-level performance on the Atari benchmark within 30 minutes using only 300k samples. In addition, we present an in-depth analysis of the fundamental challenges in further scaling our system, bringing insights to the community.","Reinforcement Learning System, Distributed Training, Model-Based Reinforcement Learning" HT-Net: Hierarchical Transformer based Operator Learning Model for Multiscale PDEs,https://openreview.net/forum?id=UY5zS0OsK2e,https://openreview.net/pdf?id=UY5zS0OsK2e,"We design a hierarchical transformer based operator learning method, so that the accurate, efficient and robust computer simulation of multiscale PDE problems with an ensemble of input parameters becomes feasible.","Complex nonlinear interplays of multiple scales give rise to many interesting physical phenomena and pose major difficulties for the computer simulation of multiscale PDE models in areas such as reservoir simulation, high-frequency scattering, and turbulence modeling. In this paper, we introduce a hierarchical transformer (HT-Net) scheme to efficiently learn the solution operator for multiscale PDEs. We construct a hierarchical architecture with a scale-adaptive interaction range, such that the features can be computed in a nested manner and with a controllable linear cost. Self-attention over a hierarchy of levels can be used to encode and decode the multiscale solution space over all scale ranges. In addition, we adopt an empirical $H^1$ loss function to counteract the spectral bias of the neural network approximation for multiscale functions. In the numerical experiments, we demonstrate the superior performance of the HT-Net scheme compared with state-of-the-art (SOTA) methods for representative multiscale problems. ","hierarchical transformer, operator learning, multiscale PDE, nested self-attention, loss function, generalization error" Multi-Agent Multi-Game Entity Transformer,https://openreview.net/forum?id=cytNlkyjWOq,https://openreview.net/pdf?id=cytNlkyjWOq,,"Building large-scale generalist pre-trained models for many tasks is becoming an emerging and promising direction in reinforcement learning (RL). Works such as Gato and the Multi-Game Decision Transformer have displayed outstanding performance and generalization capabilities on many games and domains. However, there exists a research gap in developing highly capable and generalist models for multi-agent RL (MARL), which could substantially accelerate progress towards general AI. To fill this gap, we propose the Multi-Agent multi-Game ENtity TrAnsformer (MAGENTA), from the entity perspective, as a research direction orthogonal to previous time-sequential modeling. Specifically, to deal with different state/observation spaces in different games, we treat games as analogous to languages, thus training different ""tokenizers"" for various games. The feature inputs are split according to different entities and tokenized in the same continuous space.
Then, two types of transformer-based models are proposed as permutation-invariant architectures to deal with various numbers of entities and capture the attention over different entities. MAGENTA is trained on Honor of Kings, Starcraft II micromanagement, and Neural MMO with a single set of transformer weights. Extensive experiments show that MAGENTA can play games across various categories with arbitrary numbers of agents and increase the efficiency of fine-tuning in new games and scenarios by 50\%-100\%. See our project page at \url{https://sites.google.com/view/rl-magenta}.","reinforcement learning, multi-agent reinforcement learning, transformer, pretrained model" On the Convergence of Gradient Flow on Multi-layer Linear Models,https://openreview.net/forum?id=5ohslQBnxUw,https://openreview.net/pdf?id=5ohslQBnxUw,We study how initialization affects the convergence of gradient flow on multi-layer linear networks,"In this paper, we analyze the convergence of gradient flow on a multi-layer linear model with a loss function of the form $f(W_1W_2\cdots W_L)$. We show that when $f$ satisfies the gradient dominance property, proper weight initialization leads to exponential convergence of the gradient flow to a global minimum of the loss. Moreover, the convergence rate depends on two trajectory-specific quantities that are controlled by the weight initialization: the \emph{imbalance matrices}, which measure the difference between the weights of adjacent layers, and the least singular value of the \emph{weight product} $W=W_1W_2\cdots W_L$. Our analysis provides improved rate bounds for several multi-layer network models studied in the literature, leading to novel characterizations of the effect of weight imbalance on the rate of convergence. Our results apply to most regression losses and extend to classification ones.","Multi-layer Linear Networks, Non-convex optimization, Gradient Flow, Training invariance" Variational Counterfactual Prediction under Runtime Domain Corruption,https://openreview.net/forum?id=aS1Ef2vkIkR,https://openreview.net/pdf?id=aS1Ef2vkIkR,Proposed an upper bound and an adversarially unified variational method for run-time causal inference.,"Causal inference is pivotal to decision-making as it can estimate the potential treatment effect before the decision is actually made on an individual or a population, and its applications are seen in high-stakes areas such as healthcare, education, and e-commerce. So far, various neural methods have been proposed to address causal effect estimation based on observational data, where the counterfactual prediction commonly assumes the same availability of variables at both training and inference (i.e., runtime) stages. In reality, the accessibility of variables is usually impaired, due mainly to privacy and ethical concerns, as with medical records. We term this \textit{runtime domain corruption}, which seriously challenges the generalizability of the trained counterfactual predictor, in addition to the existing confoundedness and selection bias. To counteract runtime domain corruption, we subsume counterfactual prediction under the notion of domain adaptation. Specifically, we upper-bound the error w.r.t. the target domain (i.e., runtime covariates) by the sum of the source domain error and an inter-domain distribution distance.
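The bound referenced at the end of the Variational Counterfactual Prediction abstract above follows the generic shape of classical domain adaptation bounds; a hedged rendering (the paper's exact distance measure and residual term may differ):

```latex
\epsilon_{T}(h) \;\le\; \epsilon_{S}(h) \;+\; d\left(\mathcal{D}_{S}, \mathcal{D}_{T}\right) \;+\; \lambda ,
```

where $\epsilon_{S}$ and $\epsilon_{T}$ are the source (training) and target (runtime) errors of a hypothesis $h$, $d(\cdot,\cdot)$ is an inter-domain distribution distance, and $\lambda$ collects terms independent of $h$.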
In addition, we build an adversarially unified variational causal effect model, named VEGAN, with a novel two-stage adversarial domain adaptation scheme to implicitly reduce the distribution disparity between treated and control groups first, and between training and inference domains afterwards. With extensive experiments on causal inference benchmark datasets, we demonstrate that our proposed method outperforms other state-of-the-art baselines on individual-level causal effect estimation in the presence of runtime domain corruption.","Causal inference, treatment effect estimation, deep learning, variational inference, domain adaptation, domain shift, covariate shift, privacy concern." Neural Architecture Design and Robustness: A Dataset,https://openreview.net/forum?id=p8coElqiSDw,https://openreview.net/pdf?id=p8coElqiSDw,,"Deep learning models have proven to be successful in a wide range of machine learning tasks. Yet, they are often highly sensitive to perturbations on the input data, which can lead to incorrect decisions with high confidence, hampering their deployment for practical use-cases. Thus, finding architectures that are (more) robust against perturbations has received much attention in recent years. Just like the search for well-performing architectures in terms of clean accuracy, this usually involves a tedious trial-and-error process, with one additional challenge: the evaluation of a network's robustness is significantly more expensive than its evaluation for clean accuracy. Thus, the aim of this paper is to facilitate better streamlined research on architectural design choices with respect to their impact on robustness, as well as, for example, the evaluation of surrogate measures for robustness. We therefore borrow one of the most commonly considered search spaces for neural architecture search for image classification, NAS-Bench-201, which contains a manageable set of $6\,466$ non-isomorphic network designs. We evaluate all these networks on a range of common adversarial attacks and corruption types and introduce a database on neural architecture design and robustness evaluations. We further present three exemplary use cases of this dataset, in which we (i) benchmark robustness measurements based on Jacobian and Hessian matrices for their robustness predictability, (ii) perform neural architecture search on robust accuracies, and (iii) provide an initial analysis of how architectural design choices affect robustness. We find that carefully crafting the topology of a network can have a substantial impact on its robustness, where networks with the same parameter count range in mean adversarial robust accuracy from $0.20\%-0.41\%$.","dataset, robustness, architecture design" Does Deep Learning Learn to Abstract? A Systematic Probing Framework,https://openreview.net/forum?id=QB1dMPEXau5,https://openreview.net/pdf?id=QB1dMPEXau5,"We design a systematic probing framework along with a set of controlled probing tasks, providing strong evidence that PLMs have the abstraction capability. We conduct an in-depth analysis and provide insightful conclusions.","Abstraction is a desirable capability for deep learning models: the ability to induce abstract concepts from concrete instances and flexibly apply them beyond the learning context. At the same time, there is a lack of clear understanding about both the presence and further characteristics of this capability in deep learning models.
In this paper, we introduce a systematic probing framework to explore the abstraction capability of deep learning models from a transferability perspective. A set of controlled experiments is conducted based on this framework, providing strong evidence that two probed pre-trained language models (PLMs), T5 and GPT2, have the abstraction capability. We also conduct an in-depth analysis, shedding further light on its characteristics: (1) the whole training phase exhibits a ""memorize-then-abstract"" two-stage process; (2) the learned abstract concepts are gathered in a few middle-layer attention heads, rather than being evenly distributed throughout the model; (3) the probed abstraction capabilities exhibit robustness against concept mutations, and are more robust to low-level/source-side mutations than high-level/target-side ones; (4) PLMs exhibit better abstraction capability with larger model sizes, larger data scales, and higher diversity in data.","Abstraction Capability, Probing Tasks, Deep Learning, Pre-Trained Language Model" Learning to mine approximate network motifs,https://openreview.net/forum?id=XKQU-afvHOd,https://openreview.net/pdf?id=XKQU-afvHOd,An evaluation framework and model for identifying frequent subgraphs with structural flexibility in large datasets.,"Frequent and structurally related subgraphs, also known as network motifs, are valuable features of many datasets. However, strong combinatorial bottlenecks have made it difficult to extract motifs and use them in learning tasks without strong constraints on the motif properties. In this work we propose a representation learning method based on learnable graph coarsening, MotiFiesta, which is the first to be able to extract large and approximate motifs in a fully differentiable manner. We build benchmark datasets and evaluation metrics which test the ability of our proposed and future models to capture different aspects of motif discovery where ground truth motifs are not known. Finally, we explore the notion of exploiting learned motifs as an inductive bias in real-world datasets by showing that motif-based feature sets achieve competitive performance against concurrent architectures on established real-world benchmark datasets.","motif mining, combinatorics, unsupervised learning" Automatic Clipping: Differentially Private Deep Learning Made Easier and Stronger,https://openreview.net/forum?id=72ICa7Wb4ui,https://openreview.net/pdf?id=72ICa7Wb4ui,"We propose automatic DP optimizers that do not need to tune the clipping threshold, with convergence proof and SOTA accuracy.","Per-example gradient clipping is a key algorithmic step that enables practical differentially private (DP) training for deep learning models. The choice of clipping threshold $R$, however, is shown to be vital for achieving high accuracy under DP. We propose an easy-to-use replacement, called automatic clipping, that eliminates the need to tune $R$ for any DP optimizers, including DP-SGD, DP-Adam, DP-LAMB and many others. The automatic variants are as private and computationally efficient as existing DP optimizers, but require no DP-specific hyperparameters and thus make DP training as amenable as standard non-private training.
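A minimal sketch of the automatic clipping idea from the abstract above: per-sample gradients are normalized rather than clipped at a tuned threshold $R$. The stability constant and the omission of noise addition are illustrative assumptions, not the paper's exact recipe:

```python
import torch

def automatic_clip(per_sample_grads, stability=0.01):
    """Replace clip(g, R) with g / (||g|| + stability) for each per-sample
    gradient g, removing the DP-specific threshold hyperparameter R."""
    processed = []
    for g in per_sample_grads:                 # one flattened gradient per example
        processed.append(g / (g.norm(2) + stability))
    summed = torch.stack(processed).sum(dim=0)
    # Gaussian noise addition (the DP mechanism) would follow here as usual.
    return summed
```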
We give a rigorous convergence analysis of automatic DP-SGD in the non-convex setting, which shows that it can enjoy an asymptotic convergence rate that matches that of standard SGD, under a symmetric gradient noise assumption on the per-sample gradients. We also demonstrate on various language and vision tasks that automatic clipping outperforms or matches the state-of-the-art, and can be easily employed with minimal changes to existing codebases.","deep learning, differential privacy, per-sample gradient clipping, convergence" Trust Your $\nabla$: Gradient-based Intervention Targeting for Causal Discovery,https://openreview.net/forum?id=xFnban3-LC,https://openreview.net/pdf?id=xFnban3-LC,"We propose GIT, a novel gradient-based intervention targeting method, which improves the performance of causal discovery, especially in the low data regime.","Inferring causal structure from data is a challenging task of fundamental importance in science. Observational data are often insufficient to identify a system’s causal structure uniquely. While conducting interventions (i.e., experiments) can improve the identifiability, such samples are usually challenging and expensive to obtain. Hence, experimental design approaches for causal discovery aim to minimize the number of interventions by estimating the most informative intervention target. In this work, we propose a novel gradient-based intervention targeting method, abbreviated GIT, that 'trusts' the gradient estimator of a gradient-based causal discovery framework to provide signals for the intervention acquisition function. We provide extensive experiments on simulated and real-world datasets and demonstrate that GIT performs on par with competitive baselines, surpassing them in the low-data regime.","causal discovery, experimental design, active learning, neural networks" Improving Out-of-distribution Generalization with Indirection Representations,https://openreview.net/forum?id=0f-0I6RFAch,https://openreview.net/pdf?id=0f-0I6RFAch,,"We propose a generic module named Indirection Layer (InLay), which leverages indirection and data internal relationships to effectively construct symbolic indirect representations and improve out-of-distribution generalization capabilities of various neural architectures. InLay receives input in the form of a sequence of objects and treats it as a complete weighted graph whose vertices are the objects and whose edge weights are scalars representing relationships between vertices. The input is first mapped via indirection to a symbolic graph with data-independent and trainable vertices. This symbolic graph is then propagated, resulting in new vertex features whose indirection will be used for prediction steps afterward. Theoretically, we show that the distances between indirection representations are bounded by the distances between corresponding graphs, implying that unseen samples with very different surface statistics can still be close in the representation space to the seen samples if they share similar internal relationships. We demonstrate that InLay is consistently effective in improving out-of-distribution generalization throughout a comprehensive suite of experiments, including IQ problems, distorted image classification, and few-shot domain adaptation NLP classification.
We also conduct ablation studies to verify different design choices of InLay.","out-of-distribution generalization, indirection, representation" Accelerating Guided Diffusion Sampling with Splitting Numerical Methods,https://openreview.net/forum?id=F0KTk2plQzO,https://openreview.net/pdf?id=F0KTk2plQzO,We accelerate guided diffusion sampling using splitting numerical methods.,"Guided diffusion is a technique for conditioning the output of a diffusion model at sampling time without retraining the network for each specific task. One drawback of diffusion models, however, is their slow sampling process. Recent techniques can accelerate unguided sampling by applying high-order numerical methods to the sampling process when viewed as differential equations. In contrast, we discover that the same techniques do not work for guided sampling, and little has been explored about its acceleration. This paper explores the culprit of this problem and provides a solution based on operator splitting methods, motivated by our key finding that high-order numerical methods are unsuitable for the conditional function. Our proposed method can re-utilize high-order methods for guided sampling and can generate images with the same quality as a 250-step DDIM baseline using 32-58% less sampling time on ImageNet256. We also demonstrate usage on a wide variety of conditional generation tasks, such as text-to-image generation, colorization, inpainting, and super-resolution.","Splitting Numerical Methods, Guided Diffusion Models" RealSinger: Ultra-Realistic Singing Voice Generation via Stochastic Differential Equations,https://openreview.net/forum?id=ctnmrjv6lU5,https://openreview.net/pdf?id=ctnmrjv6lU5,," Synthesizing high-quality singing voice from a music score is a challenging problem in music generation and has many practical applications. Samples generated by existing singing voice synthesis (SVS) systems can roughly reflect the lyrics, pitch and duration in a given score, but they fail to contain the necessary details. In this paper, based on stochastic differential equations (SDEs), we propose RealSinger to generate 22.05kHz ultra-realistic singing voice conditioned on a music score. Our RealSinger learns to find the stochastic process path from a source of white noise to the target singing voice manifold under the conditional music score, allowing it to sing the music score while maintaining the local voice details of the target singer. During training, our model learns to accurately predict the direction of movement in the ambient Euclidean space onto the low-dimensional singing voice manifold. RealSinger's framework is very flexible. It can either generate intermediate feature representations of the singing voice, such as the mel-spectrogram, or directly generate the final waveform in an end-to-end style, which rectifies defects and accumulated errors introduced by two-stage singing synthesis systems. Extensive subjective and objective tests on benchmark datasets show significant gains in perceptual quality using RealSinger. The mean opinion scores (MOS) obtained with RealSinger are closer to those of the human singer's original high-fidelity singing voice than to those obtained with any state-of-the-art method.
Audio samples are available at https://realsinger.github.io/.", Homeomorphism Alignment in Two Spaces for Unsupervised Domain Adaptation,https://openreview.net/forum?id=8xoV4ZrIgbk,https://openreview.net/pdf?id=8xoV4ZrIgbk,A new approach that uses the homeomorphism property for unsupervised domain adaptation.,"Existing unsupervised domain adaptation methods always align the features from the source and target domains explicitly or implicitly in a common space, i.e., the domain-invariant space. Explicit distribution matching often ignores the discriminability of the learned features, while implicit distribution matching such as self-supervised learning suffers from pseudo-label noise. It is difficult to find a common space which maintains the discriminative structure of the source and target domain data when aligning the data distributions. We propose a novel approach dubbed HomeomorphisM Alignment (HMA), so that the source and target features can be aligned in two different spaces. Specifically, an invertible neural network based homeomorphism is constructed. A distribution matching method is used as a sewing-up tool to connect the homeomorphism mapping between the source and target feature spaces. Theoretically, we show this mapping can preserve the data topological structure, i.e., samples in the same cluster are still in the same projected cluster. Based on this property, we adapt the model via the cross-entropy of transformed and original source features and the prediction consistency between target features and transformed target features. Extensive experiments demonstrate that our method can achieve state-of-the-art results.","Homeomorphism Alignment, Unsupervised Domain Adaptation, Self-supervised Learning" Demystifying Approximate RL with $\epsilon$-greedy Exploration: A Differential Inclusion View,https://openreview.net/forum?id=Ms1Zs8s7rg,https://openreview.net/pdf?id=Ms1Zs8s7rg,"We provide the first framework for analyzing value-based RL methods with function approximation and $\epsilon$-greedy exploration, answering a long-standing open question.","Q-learning and SARSA(0) with $\epsilon$-greedy exploration are leading reinforcement learning methods, and their tabular forms converge to the optimal Q-function under reasonable conditions. However, with function approximation, they exhibit unexpected behaviors beyond the textbook instability, such as i) policy oscillation and chattering, ii) convergence to multiple attractors, and iii) convergence to different attractors (possibly even the worst policy) on different runs. Accordingly, a theory to explain these phenomena has been a long-standing open problem, even for basic linear function approximation (Sutton, 1999). Our work uses differential inclusion theory to provide the first framework for resolving this problem.
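For readers unfamiliar with the tool named at the end of the abstract above: a differential inclusion generalizes an ODE by allowing a set-valued right-hand side, which is exactly what the discontinuity of an $\epsilon$-greedy policy induces in the expected update. The generic form (notation assumed here, not taken from the paper) is

```latex
\dot{\theta}(t) \;\in\; H\big(\theta(t)\big),
```

where $H$ is a set-valued map that replaces the single-valued vector field of an ODE at points where the expected update is discontinuous.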
We further illustrate via numerical examples how this framework helps explain these algorithms' asymptotic behaviors.","differential inclusion, epsilon-greedy exploration, function approximation, value-based RL, Q-learning, SARSA, policy oscillation, chattering, discontinuous policies, stability" Batch Multivalid Conformal Prediction,https://openreview.net/forum?id=Dk7QQp8jHEo,https://openreview.net/pdf?id=Dk7QQp8jHEo,We give algorithms for conformal prediction in the batch setting that have coverage guarantees even when conditioning on group membership for intersecting groups and on the threshold used to produce the prediction set.,"We develop fast distribution-free conformal prediction algorithms for obtaining multivalid coverage on exchangeable data in the batch setting. Multivalid coverage guarantees are stronger than marginal coverage guarantees in two ways: (1) They hold even conditional on group membership---that is, the target coverage level $1-\alpha$ holds conditionally on membership in each of an arbitrary (potentially intersecting) group in a finite collection $\mathcal{G}$ of regions in the feature space. (2) They hold even conditional on the value of the threshold used to produce the prediction set on a given example. In fact, multivalid coverage guarantees hold even when conditioning on group membership and threshold value simultaneously. We give two algorithms: both take as input an arbitrary non-conformity score and an arbitrary collection of possibly intersecting groups $\mathcal{G}$, and then can equip arbitrary black-box predictors with prediction sets. Our first algorithm is a direct extension of quantile regression, needs to solve only a single convex minimization problem, and produces an estimator which has group-conditional guarantees for each group in $\mathcal{G}$. Our second algorithm is iterative, and gives the full guarantees of multivalid conformal prediction: prediction sets that are valid conditionally both on group membership and non-conformity threshold. We evaluate the performance of both of our algorithms in an extensive set of experiments. ","Conformal prediction, multicalibration, uncertainty quantification" Leveraging Online Semantic Point Fusion for 3D-Aware Object Goal Navigation,https://openreview.net/forum?id=W6t8U1eGvSj,https://openreview.net/pdf?id=W6t8U1eGvSj,We propose a two-stage reinforcement learning framework that is powered by an online semantic point fusion algorithm.,"Object goal navigation in unseen environments is a fundamental task for building intelligent embodied agents. Existing works tackle this problem with modular or end-to-end learning-based methods, which implicitly learn from 2D maps, sparse scene graphs or video sequences, ignoring the established fact that objects lie in 3D. Hence, in this work, we propose a dedicated 3D-aware online semantic point fusion algorithm that aggregates 3D points online, along with their semantic predictions from RGB-D observations, to form a highly efficient 3D point-based sparse map, which further enables us to check spatial semantic consistency. To leverage the 3D information for navigation while remaining sample efficient, we then propose a two-stage reinforcement learning framework that decomposes object goal navigation into two complementary sub-tasks, namely exploration and verification, each learned in a different discrete action space.
Thanks to the highly accurate semantic understanding and robust goal verification, our framework achieves the best performance among all modular-based methods on the Matterport3D and Gibson datasets. Furthermore, compared to mainstream RL-based works, our method requires 5-28x less computational cost for training. We will release the source code upon acceptance.","Reinforcement Learning, Robot, Navigation" Transferring Pretrained Diffusion Probabilistic Models,https://openreview.net/forum?id=8u9eXwu5GAb,https://openreview.net/pdf?id=8u9eXwu5GAb,We propose a new tuning approach for transferring pretrained diffusion probabilistic models to new tasks with limited data and training resources.,"Diffusion Probabilistic Models (DPMs) have recently achieved impressive performance in visual generative tasks. However, the success of DPMs heavily relies on large amounts of data and optimization steps, which limits the application of DPMs to small datasets and limited computational resources. In this paper, we investigate transfer learning in DPMs to leverage DPMs pretrained on large-scale datasets for generation with limited data. Firstly, we show that previous methods like training from scratch or determining the transferable parts are not suitable for the DPM, due to its U-Net based denoising architecture with the external denoising timestep input. To address this, we present a condition-based tuning approach that takes full advantage of existing pretrained models. Concretely, we obtain the semantic embeddings of condition images with the pretrained CLIP model, and then inject this semantic information into the pretrained DPM via an ''Attention-NonLinear'' (ANL) module. The adaptation to a new task can be achieved by only tuning the ANL module inserted into the pretrained DPM hierarchically. To further enhance the diversity of generated images, we introduce a masked sampling strategy based on the condition mechanism. Extensive experiments validate the effectiveness and efficiency of our proposed tuning approach in generative task transfer and data augmentation for semi-supervised learning. ","transfer learning, diffusion probabilistic models, cross-attention, fine-tuning" ELBO-ing Stein Mixtures,https://openreview.net/forum?id=gRgCyyYBR4o,https://openreview.net/pdf?id=gRgCyyYBR4o,Stein mixtures can be viewed as matching the variational posterior to the target posterior under the Renyi divergence. This leads to a whole class of inference methods indexed by the Renyi divergence's order.,"Stein variational gradient descent (SVGD) \citep{DBLP:conf/nips/LiuW16} is a particle-based technique for Bayesian inference. SVGD has recently gained popularity because it combines the ability of variational inference to handle tall data with the modeling power of non-parametric inference. Unfortunately, the number of particles required to represent a model adequately grows exponentially with the dimensionality of the model. Stein mixtures \citep{nalisnick2017variational} alleviate the exponential growth in particles by letting each particle parameterize a distribution. However, the inference algorithm proposed by \cite{nalisnick2017variational} can be numerically unstable. We show that their algorithm corresponds to inference with the R\'enyi $\alpha$-divergence for $\alpha=0$ and that using other values for $\alpha$ can lead to more stable inference.
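For reference, the Rényi $\alpha$-divergence that parameterizes the family of inference methods discussed in the ELBO-ing Stein Mixtures abstract above has the standard form

```latex
D_{\alpha}(p \,\|\, q) \;=\; \frac{1}{\alpha - 1}\,\log \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx ,
```

whose $\alpha \to 1$ limit recovers the KL divergence underlying the standard ELBO, consistent with the abstract's observation that $\alpha = 1$ coincides with ELBO-based inference.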
We empirically study the performance of Stein mixtures inferred with different $\alpha$ values on various real-world problems, demonstrating significantly improved results when using $\alpha=1$, which coincides with using the evidence lower bound (ELBO). We call this instance of our algorithm ELBO-within-Stein. A black-box version of the inference algorithm (for arbitrary $\alpha\in\mathbb{R}$) is available in the deep probabilistic programming language NumPyro \citep{phan2019}.","Particle-based variational inference, alpha-indexed Stein mixtures, ELBO-within-Stein" Source-Target Coordinated Training with Multi-head Hybrid-Attention for Domain Adaptive Semantic Segmentation,https://openreview.net/forum?id=oBXFemWGPWN,https://openreview.net/pdf?id=oBXFemWGPWN,," Domain adaptive semantic segmentation aims to densely assign semantic labels for each pixel on the unlabeled target domain by transferring knowledge from the labeled source domain. Due to the domain shift problem, the success of adaptation on the unseen domain depends on the feature alignment between different domains. Hence, this paper focuses on feature alignment for domain adaptive semantic segmentation, i.e., when to align and how to align. Since no label is available in the target domain, aligning the target distribution too early would lead to poor performance due to pseudo-label noise, while aligning too late may cause the model to underfit the target domain. In this paper, we propose a Source-Target Coordinated Training (STCT) framework, where a coordination weight is designed to control the time to align. For the problem of how to align, we design a Multi-head Hybrid-Attention (MHA) module to replace the multi-head self-attention (MSA) module in the transformer. The proposed MHA module consists of intra-domain self-attention and inter-domain cross-attention mechanisms. Compared with the MSA module, the MHA module achieves feature alignment by explicitly constructing interaction between different domains without additional computations and parameters. Moreover, to fully explore the potential of the proposed MHA module, we comprehensively investigate different designs for the MHA module and find some important strategies for effective feature alignment. Our proposed method achieves competitive performance on two challenging synthetic-to-real benchmarks, GTA5-to-CityScapes and SYNTHIA-to-Cityscapes.","domain adaptation, semantic segmentation" On the Role of Self-supervision in Deep Multi-view Clustering,https://openreview.net/forum?id=HNcqEt0zuMo,https://openreview.net/pdf?id=HNcqEt0zuMo,"We investigate self-supervision in deep multi-view clustering, and present several new models and novel findings.","Self-supervised learning is a central component in many recent approaches to deep multi-view clustering (MVC). However, we find large variations in the motivation and design of self-supervision-based methods for deep MVC. To address this, we present DeepMVC, a new, unified framework for deep MVC. Crucially, we show that many recent methods can be regarded as instances of our framework -- allowing us to implement recent methods in a unified and consistent manner. We make key observations about the effect of self-supervision, and in particular, drawbacks of representation alignment. Motivated by these insights, we develop several new DeepMVC instances, with new forms of self-supervision.
We conduct extensive experiments, and find that (i) the popular contrastive alignment degrades performance when the number of views becomes large; (ii) all methods benefit from some form of self-supervision; and (iii) our new instances outperform previous methods on several datasets. Based on our findings, we suggest several promising directions for future research. To enhance the openness of the field, we provide an open-source implementation of DeepMVC, including recent models and our new instances. Our implementation includes a consistent evaluation protocol, facilitating fair and accurate evaluation of methods and components. ","deep learning, multi-view clustering, self-supervised learning" Schedule-Robust Online Continual Learning,https://openreview.net/forum?id=bkxynaG3Vm7,https://openreview.net/pdf?id=bkxynaG3Vm7,"We propose a new continual learning approach that is robust to arbitrary schedules (i.e. permutations of samples in sequences, batch sizes) of a data stream.","A continual learning (CL) algorithm learns from a non-stationary data stream. The non-stationarity is modeled by some schedule that determines how data is presented over time. Most current methods make strong assumptions on the schedule and have unpredictable performance when such requirements are not met. A key challenge in CL is thus to design methods robust against arbitrary schedules over the same underlying data, since in real-world scenarios schedules are often unknown and dynamic. In this work, we introduce the notion of schedule-robustness for CL and a novel approach satisfying this desirable property in the challenging online class-incremental setting. We also present a new perspective on CL, as the process of learning a schedule-robust predictor, followed by adapting the predictor using only replay data. Empirically, we demonstrate that our approach outperforms existing methods on CL benchmarks for image classification by a large margin.","Continual Learning, Online Class-incremental Learning, Meta-Learning" "On the Usefulness of Embeddings, Clusters and Strings for Text Generation Evaluation",https://openreview.net/forum?id=bvpkw7UIRdU,https://openreview.net/pdf?id=bvpkw7UIRdU,We provide a theoretical and empirical analysis of why a recently-proposed automatic evaluation metric for language generators correlates well with human judgments. We identify its use of embeddings from pretrained language models as the main reason.,"A good automatic evaluation metric for language generation ideally correlates highly with human judgements of text quality. Yet, there is a dearth of such metrics, which inhibits the rapid and efficient progress of language generators. One exception is the recently proposed Mauve. In theory, Mauve measures an information-theoretic divergence between two probability distributions over strings: one representing the language generator under evaluation; the other representing the true natural language distribution. Mauve's authors argue that its success comes from the qualitative properties of their proposed divergence. Yet in practice, as this divergence is uncomputable, Mauve approximates it by measuring the divergence between multinomial distributions over clusters instead, where cluster assignments are attained by grouping strings based on a pretrained language model's embeddings. As we show, however, this is not a tight approximation---in either theory or practice. This begs the question: why does Mauve work so well? 
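A hedged sketch of the cluster-based approximation described in the Mauve abstract above: pool embeddings of generated and human-written text, group them into clusters, and compare the two resulting multinomial histograms with a divergence (plain KL is used here for concreteness; Mauve itself uses a divergence frontier, and the pretrained-LM embedding step is not shown):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_divergence(gen_embeds, human_embeds, k=50, eps=1e-8):
    """Quantize both samples into k shared clusters and compare the
    resulting multinomial distributions. Inputs are (n, d) arrays of
    pretrained-LM embeddings (the choice of embedding model is an
    assumption of this sketch)."""
    pooled = np.vstack([gen_embeds, human_embeds])
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(pooled)
    p = np.bincount(labels[: len(gen_embeds)], minlength=k) + eps
    q = np.bincount(labels[len(gen_embeds):], minlength=k) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))  # KL(p || q) over clusters
```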
In this work, we show that Mauve was right for the wrong reasons, and that its newly proposed divergence is not necessary for its high performance. In fact, classical divergences paired with its proposed cluster-based approximation may actually serve as better evaluation metrics. We finish the paper with a probing analysis; this analysis leads us to conclude that---by encoding syntactic- and coherence-level features of text, while ignoring surface-level features---such cluster-based approximations to string distributions may simply be better for evaluating state-of-the-art language generators.","language generation, automatic evaluation, contextual embeddings" "A Simple, Yet Effective Approach to Finding Biases in Code Generation",https://openreview.net/forum?id=U7CMcGV6LYM,https://openreview.net/pdf?id=U7CMcGV6LYM,Code generation models suffer from biases that we can expose with simple tricks,"Recently, scores of high-performing code generation systems have surfaced. Like most natural language tasks, code generation is approached using large language models as a core, trained under the masked or causal language modeling schema. This work shows that current code generation systems exhibit biases inherited from large language model backbones, which might leak into generated code under specific circumstances. To investigate the effect, we propose a framework that automatically removes hints and exposes various biases that these code generation models use. We apply our framework to three coding challenges and test it across top-performing code generation models. Our experiments reveal biases towards specific prompt structures and the exploitation of keywords during code generation. Finally, we demonstrate how to use our framework as a data transformation technique, which we find a promising direction toward more robust code generation.","Code generation, Natural Language Processing, Reasoning, Biases" Attention Enables Zero Approximation Error,https://openreview.net/forum?id=AV_bv4Ydcr9,https://openreview.net/pdf?id=AV_bv4Ydcr9,,"Attention-based architectures have become the core backbone of many state-of-the-art models for various tasks, including language translation and image classification. However, the theoretical properties of attention-based models are seldom considered. In this work, we show that with suitable adaptations, the single-head self-attention transformer with a fixed number of transformer encoder blocks and free parameters is able to generate any desired polynomial of the input with no error. The number of transformer encoder blocks is the same as the degree of the target polynomial. Even more excitingly, we find that the transformer encoder blocks in this model do not need to be trained. As a direct consequence, we show that the single-head self-attention transformer with increasing numbers of free parameters is universal. Also, we show that our proposed model can avoid the classical trade-off between approximation error and sample error in the mean squared error analysis for the regression task if the target function is a polynomial. We conduct various experiments and ablation studies to verify our theoretical results.", Revisiting Activation Function Design for Improving Adversarial Robustness at Scale,https://openreview.net/forum?id=BrKY4Wr6dk2,https://openreview.net/pdf?id=BrKY4Wr6dk2,"ReLU significantly weakens adversarial training, but its smooth approximations can fix this issue","Modern ConvNets typically use the ReLU activation function.
Recently, smooth activation functions have been used to improve their accuracy. Here we study the role of smooth activation functions from the perspective of adversarial robustness. We find that the ReLU activation function significantly weakens adversarial training due to its non-smooth nature. Replacing ReLU with its smooth alternatives allows adversarial training to find harder adversarial training examples and to compute better gradient updates for network optimization. We focus our study on the large-scale ImageNet dataset. On ResNet-50, switching from ReLU to the smooth activation function SILU improves adversarial robustness from 33.0% to 42.3%, while also improving accuracy by 0.9% on ImageNet. Smooth activation functions also scale well with larger networks: they help EfficientNet-L1 achieve 82.2% accuracy and 58.6% robustness, largely outperforming the previous state-of-the-art defense by 9.5% for accuracy and 11.6% for robustness. Models are available at https://rb.gy/qt8jya.","adversarial training, activation function, neural network architecture" Contrastive Hierarchical Clustering,https://openreview.net/forum?id=Hv57u3WQ0WZ,https://openreview.net/pdf?id=Hv57u3WQ0WZ,"Hierarchical clustering model based on deep neural networks, which has been applied to large-scale image data","Deep clustering has been dominated by flat clustering models, which split a dataset into a predefined number of groups. Although recent methods achieve extremely high similarity with the ground truth on popular benchmarks, the information contained in the flat partition is limited. In this paper, we introduce CoHiClust, a Contrastive Hierarchical Clustering model based on deep neural networks, which can be applied to large-scale image data. By employing a self-supervised learning approach, CoHiClust distills the base network into a binary tree without access to any labeled data. The hierarchical clustering structure can be used to analyze the relationship between clusters as well as to measure the similarity between data points. Experiments performed on typical image benchmarks demonstrate that CoHiClust generates a reasonable structure of clusters, which is consistent with our intuition and image semantics. Moreover, by applying the proposed pruning strategy, we can restrict the hierarchy to the requested number of clusters (leaf nodes) and obtain clustering accuracy comparable to the state-of-the-art flat clustering baselines. ","clustering, hierarchical clustering, contrastive learning, soft decision trees" What Does Vision Supervision Bring to Language Models? A Case Study of CLIP,https://openreview.net/forum?id=SdBfRJE9SX-,https://openreview.net/pdf?id=SdBfRJE9SX-,,"Vision-language~(V+L) pre-training has shown promising performance in cross-modal tasks such as image-text retrieval and image captioning. On the other hand, these models surprisingly perform worse than text-only models (e.g., BERT) on widely-used text-only understanding tasks. The conflicting results naturally raise a question: What does vision supervision bring to language models? In this paper, we investigate this under-explored problem with one representative cross-modal model, CLIP. We compare the text encoder of CLIP and widely-used text-only models on a wide range of tasks.
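The activation-function abstract above boils down to a one-line architectural change; a minimal PyTorch sketch of swapping ReLU for its smooth alternative SiLU before adversarial training (the recursive module traversal is a common idiom, not the paper's code):

```python
import torch.nn as nn

def smooth_activations(model: nn.Module) -> nn.Module:
    """Replace every nn.ReLU in `model` with nn.SiLU in place, making the
    network smooth so that adversarial training can compute better
    gradient updates."""
    for name, child in model.named_children():
        if isinstance(child, nn.ReLU):
            setattr(model, name, nn.SiLU())
        else:
            smooth_activations(child)  # recurse into nested submodules
    return model
```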
We design a suite of evaluation tasks across three perception aspects, including the linguistic world featuring syntactic knowledge~(e.g., dependency labeling), the visual world examining visual commonsense knowledge (e.g., color), and the embodied world featuring physical commonsense knowledge (e.g., mass). Experiments demonstrate that text-only models are not always better than CLIP on these perception tasks. Although the text encoder of CLIP falls far behind text-only models in linguistic-related tasks, CLIP achieves better zero-shot results in visual and embodied worlds with only $0.3\%$ of the parameters of OPT-175B (one of the largest text-only models). This shows that CLIP can empower text encoders to learn rich visual and embodied knowledge through vision-text pre-training. Furthermore, qualitative studies show that CLIP pre-training nevertheless restricts the text encoder from learning fine-grained semantics, like understanding ambiguous texts. These results shed light on future directions to improve V+L pre-training. ","Contrastive Language-Image Pre-training, Vision-and-Language, Knowledge Probing" Accurate Bayesian Meta-Learning by Accurate Task Posterior Inference,https://openreview.net/forum?id=sb-IkS8DQw2,https://openreview.net/pdf?id=sb-IkS8DQw2,We show that accurate inference of the task posterior is all you need for accurate Bayesian meta-learning.,"Bayesian meta-learning (BML) enables fitting expressive generative models to small datasets by incorporating inductive priors learned from a set of related tasks. The Neural Process (NP) is a prominent deep neural network-based BML architecture, which has shown remarkable results in recent years. In its standard formulation, NP encodes epistemic uncertainty in an amortized, factorized Gaussian variational (VI) approximation to the BML task posterior (TP) using reparametrized gradients. Prior work studies a range of architectural modifications to boost performance, such as attentive computation paths or improved context aggregation schemes, while the influence of the VI scheme remains under-explored. We aim to bridge this gap by introducing GMM-NP, a novel BML model, which builds on recent work that enables highly accurate, full-covariance Gaussian mixture (GMM) TP approximations by combining VI with natural gradients and trust regions. We show that (i) GMM-NP yields tighter evidence lower bounds, increasing the efficiency of marginal likelihood optimization, leading to (ii) improved epistemic uncertainty estimation and accuracy, (iii) without any complex architectural modifications, resulting in a powerful, yet (iv) conceptually simple BML model. GMM-NP outperforms the state of the art on a range of challenging experiments, which highlight its applicability to settings where data is scarce.","Bayesian Meta-Learning, Neural Processes, Variational Inference" Learning to Decompose Visual Features with Latent Textual Prompts,https://openreview.net/forum?id=wtcud6HroZr,https://openreview.net/pdf?id=wtcud6HroZr,,"Recent advances in pre-training vision-language models like CLIP have shown great potential in learning transferable visual representations. Nonetheless, for downstream inference, CLIP-like models suffer from either 1) degraded accuracy and robustness in the case of inaccurate text descriptions during retrieval-based inference (the challenge for zero-shot protocol); or 2) breaking the well-established vision-language alignment (the challenge for linear probing).
To address them, we propose Decomposed Feature Prompting (DeFo). DeFo leverages a flexible number of learnable embeddings as textual input while maintaining the vision-language dual-model architecture, which enables the model to learn decomposed visual features with the help of feature-level textual prompts. We further use an additional linear layer to perform classification, allowing a scalable size of language inputs. Our empirical study shows DeFo's significance in improving vision-language models. For example, DeFo obtains 73.2% test accuracy on ImageNet with a ResNet-50 backbone without tuning any pretrained weights of either the vision or the language encoder, outperforming zero-shot CLIP by a large margin of 15.0%, and outperforming the state-of-the-art vision-language prompt tuning method by 7.6%.","CLIP, vision-language learning, visual prompt" ML-ViG: Multi-Label Image Recognition with Vision Graph Convolutional Network,https://openreview.net/forum?id=ApV_xBR9UUC,https://openreview.net/pdf?id=ApV_xBR9UUC,The first fully graph convolutional model for the task of multi-label image recognition.,"Multi-Label Image Recognition (MLIR) aims to predict multiple object labels in a single image. Graph representations have been used to model label correlation or visual relationships separately. However, the representations of label embeddings and visual features are not well aligned, which hinders effective representation learning and leads to inferior performance. In this work, we propose the first fully graph convolutional model, termed Multi-Label Vision Graph Convolutional Network (ML-ViG), for the task of MLIR. ML-ViG unifies the representation of visual features and label embeddings, enabling the graph structures to capture the (1) spatial relationship among visual region features, (2) semantic relationship among object labels, and (3) cross-level relationship between labels and regions. In order to effectively pass messages between visual features and labels, a Multi-Label Graph Convolutional Network (MLG) module is proposed. ML-ViG achieves state-of-the-art performance with significantly lower computational costs on the MS-COCO, VOC2007, and VG-500 datasets. Codes and models will be released.","Multi-Label Image Recognition, Graph Convolutional Network" Skill Machines: Temporal Logic Composition in Reinforcement Learning,https://openreview.net/forum?id=4Sp2v2DQcxX,https://openreview.net/pdf?id=4Sp2v2DQcxX,"A framework where an agent first learns a set of base skills in a reward-free setting, and then combines them with the learned skill machine to produce composite behaviours specified by any regular language, such as linear temporal logics.","A major challenge in reinforcement learning is specifying tasks in a manner that is both interpretable and verifiable. One common approach is to specify tasks through reward machines---finite state machines that encode the task to be solved. We introduce skill machines, a representation that can be learned directly from these reward machines that encode the solution to such tasks. We propose a framework where an agent first learns a set of base skills in a reward-free setting, and then combines these skills with the learned skill machine to produce composite behaviours specified by any regular language, such as linear temporal logics. This provides the agent with the ability to map from complex logical task specifications to near-optimal behaviours zero-shot.
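The zero-shot composition claimed in the Skill Machines abstract above builds on the logical-composition literature, where task conjunction and disjunction are commonly approximated by min and max over skill value functions; a speculative sketch under that assumption (not necessarily the paper's exact operators):

```python
import numpy as np

def q_and(q1: np.ndarray, q2: np.ndarray) -> np.ndarray:
    """Approximate task conjunction (AND) by an element-wise minimum
    over skill Q-values -- an assumption borrowed from prior work on
    logical composition in RL."""
    return np.minimum(q1, q2)

def q_or(q1: np.ndarray, q2: np.ndarray) -> np.ndarray:
    """Approximate task disjunction (OR) by an element-wise maximum."""
    return np.maximum(q1, q2)

# Zero-shot greedy policy for "A AND B", given each skill's Q-values at a state:
# action = int(np.argmax(q_and(q_skill_a[state], q_skill_b[state])))
```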
We demonstrate our approach in both a tabular and a high-dimensional video game environment, where an agent is faced with several of these complex, long-horizon tasks. Our results indicate that the agent is capable of satisfying extremely complex task specifications, producing near-optimal performance with no further learning. Finally, we demonstrate that the performance of skill machines can be improved with regular off-policy reinforcement learning algorithms when optimal behaviours are desired.","Reinforcement Learning, Lifelong learning, Multi task learning, Transfer learning, Logical composition, Deep Reinforcement Learning" Surrogate Gradient Design for LIF networks,https://openreview.net/forum?id=WgvAB2ffyR,https://openreview.net/pdf?id=WgvAB2ffyR,"We show how to choose the best surrogate derivative for a non-differentiable spiking operation, by different experimental and theoretical means.","Spiking Neuromorphic Computing uses binary activity to improve Artificial Intelligence energy efficiency. However, the non-smoothness of binary activity requires approximate gradients, known as Surrogate Gradients (SG), to close the performance gap with Deep Learning. Several SG have been proposed in the literature, but it remains unclear how to determine the best SG for a given task and network. Good performance can be achieved with most SG shapes, but only after an extensive search of hyper-parameters that can be costly. Thus, we aim to define, both experimentally and theoretically, the best SG across different stress tests, to reduce the future need for grid search. Here we first show that the derivative of the fast sigmoid outperforms other SG across tasks and networks, for a wide range of learning rates. Secondly, we focus on the Leaky Integrate and Fire (LIF) spiking neural model, and we show that a SG with low dampening, high sharpness, and low tail fatness, systematically leads to higher accuracy. Thirdly, we observe that the Orthogonal initialization leads the LIF to higher accuracy with most SG. Fourthly, we note that high initial firing rates, combined with a sparsity-encouraging loss term, can lead to better generalization, depending on the SG shape. Finally, we provide a theoretical solution, inspired by the Glorot and He initializations, to reduce the need for extensive grid search and to find an SG and initialization that experimentally result in improved accuracy.","Surrogate Gradients, Spiking Networks, Neuromorphic Computing, Glorot Initialization" Context-enriched molecule representations improve few-shot drug discovery,https://openreview.net/forum?id=XrMWUuEevr,https://openreview.net/pdf?id=XrMWUuEevr,We introduce a new architecture for few-shot learning in drug discovery that enriches molecule representations by retrieving from a large set of known molecules.,"A central task in computational drug discovery is to construct models from known active molecules to find further promising molecules for subsequent screening. However, typically only very few active molecules are known. Therefore, few-shot learning methods have the potential to improve the effectiveness of this critical phase of the drug discovery process. We introduce a new method for few-shot drug discovery. Its main idea is to enrich a molecule representation with knowledge about known context or reference molecules. Our novel concept for molecule representation enrichment is to associate molecules from both the support set and the query set with a large set of reference (context) molecules through a modern Hopfield network.
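A minimal sketch of the Hopfield-based enrichment step described above: each molecule embedding retrieves a softmax-weighted mixture of reference (context) molecule embeddings, the standard modern-Hopfield retrieval update. Projection matrices and the downstream few-shot head are omitted, and `beta` is an assumed inverse-temperature parameter:

```python
import numpy as np

def hopfield_enrich(queries: np.ndarray, context: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Associate each query molecule (rows of `queries`, shape (n, d)) with
    a large set of context molecules (rows of `context`, shape (m, d))."""
    scores = beta * queries @ context.T                # (n, m) similarities
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)      # row-wise softmax
    return weights @ context                           # retrieved mixtures

# Example: enrich 4 query molecules against 1000 reference molecules.
enriched = hopfield_enrich(np.random.randn(4, 64), np.random.randn(1000, 64))
```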
Intuitively, this enrichment step is analogous to a human expert who would associate a given molecule with familiar molecules whose properties are known. The enrichment step reinforces and amplifies the covariance structure of the data, while simultaneously removing spurious correlations arising from the decoration of molecules. Our approach is compared with other few-shot methods for drug discovery on the FS-Mol benchmark dataset. On FS-Mol, our approach outperforms all compared methods and therefore sets a new state-of-the-art for few-shot learning in drug discovery. An ablation study shows that the enrichment step of our method is the key to improving the predictive quality. In a domain shift experiment, we further demonstrate the robustness of our method.", The Multiple Subnetwork Hypothesis: Enabling Multidomain Learning by Isolating Task-Specific Subnetworks in Feedforward Neural Networks,https://openreview.net/forum?id=FZAKltxF4y2,https://openreview.net/pdf?id=FZAKltxF4y2,"In this paper, we test our ""Multiple Subnetwork Hypothesis,"" which proposes that it is possible to train unused weights within a pruned feedforward neural network to learn subsequent tasks.","Neural networks have seen an explosion of usage and research in the past decade, particularly within the domains of computer vision and natural language processing. However, only recently have advancements in neural networks yielded performance improvements beyond narrow applications and translated to expanded multitask models capable of generalizing across multiple data types and modalities. Simultaneously, it has been shown that neural networks are overparameterized to a high degree, and pruning techniques have proved capable of significantly reducing the number of active weights within the network while largely preserving performance. In this work, we identify a methodology and network representational structure which allow a pruned network to employ previously unused weights to learn subsequent tasks. We employ these methodologies on well-known benchmarking datasets for testing purposes and show that networks trained using our approaches are able to learn multiple tasks, which may be related or unrelated, in parallel or in sequence without sacrificing performance on any task or exhibiting catastrophic forgetting.","Neural Networks, Multitask Learning, Pruning" Warped Convolutional Networks: Bridge Homography to $\mathfrak{sl}(3)$ algebra by Group Convolution,https://openreview.net/forum?id=BcmrpOpUGN2,https://openreview.net/pdf?id=BcmrpOpUGN2,We propose Warped Convolution Networks to effectively learn the homography on the $\mathfrak{sl}(3)$ algebra with group convolution. ,"Homography has an essential relationship with the special linear group and its embedded Lie algebra structure. Although the Lie algebra representation is elegant, few researchers have established the connection between homography and its algebraic expression in neural networks. In this paper, we propose Warped Convolution Networks (WCN) to effectively learn and represent the homography by the $SL(3)$ group and $\mathfrak{sl}(3)$ algebra with group convolution. To this end, six commutative subgroups within the $SL(3)$ group are composed to form a homography. For each subgroup, a warping function is proposed to bridge the Lie algebra structure to its corresponding parameters in the homography. By taking advantage of the warped convolution, homography learning is formulated into several simple pseudo-translation regressions.
By walking along the Lie topology, our proposed WCN is able to learn features that are invariant to homography. Moreover, it can be easily plugged into other popular CNN-based methods. Extensive experiments on the POT benchmark, S-COCO-Proj, and MNIST-Proj datasets show that our proposed method is effective for planar object tracking, homography estimation, and classification. ","SL(3), Homography Learning, Lie algebra, Equivariance, Group Equivariant Architecture" Delving into the Openness of CLIP,https://openreview.net/forum?id=9OoFFWDPDQ,https://openreview.net/pdf?id=9OoFFWDPDQ,,"Contrastive Language-Image Pre-training (CLIP) has demonstrated great potential in realizing open-vocabulary visual recognition in a matching style, due to its holistic use of natural language supervision that covers unconstrained real-world visual concepts. However, it is, in turn, also difficult to evaluate and analyze the openness of CLIP-like models, since they are in theory open to any vocabulary but the actual accuracy varies. To address the insufficiency of conventional studies on openness, we resort to an incremental perspective and define extensibility, which essentially approximates the model's ability to deal with new visual concepts, by evaluating openness through vocabulary expansions. Our evaluation based on extensibility shows that CLIP-like models are hardly truly open and their performance degrades as the vocabulary expands to different degrees. Further analysis reveals that the over-estimation of openness is not because CLIP-like models fail to capture the general similarity of image and text features of novel visual concepts, but because of the confusion among competing text features; that is, they are not stable with respect to the vocabulary. In light of this, we propose to improve the openness of CLIP in feature space by enforcing the distinguishability of text features. Our method retrieves relevant texts from the pre-training corpus to enhance prompts for inference, which boosts the extensibility and stability of CLIP even without fine-tuning.","Contrastive Language-Image Pre-training, CLIP, Openness, Vision-and-Language" Test-Time Adaptation via Self-Training with Nearest Neighbor Information,https://openreview.net/forum?id=EzLtB4M1SbM,https://openreview.net/pdf?id=EzLtB4M1SbM,This work presents a simple and efficient test-time adaptation method to adapt trained classifiers by utilizing an ensemble of adaptation modules and self-training with nearest neighbor information.,"Test-time adaptation (TTA) aims to adapt a trained classifier using online unlabeled test data only, without any information related to the training procedure. Most existing TTA methods adapt the trained classifier using the classifier's prediction on the test data as a pseudo-label. However, under test-time domain shift, the accuracy of the pseudo-labels cannot be guaranteed, and thus TTA methods often suffer performance degradation in the adapted classifier.
To overcome this limitation, we propose a novel test-time adaptation method, called Test-time Adaptation via Self-Training with nearest neighbor information (TAST), which is composed of the following procedures: (1) adds trainable adaptation modules on top of the trained feature extractor; (2) newly defines a pseudo-label distribution for the test data by using the nearest neighbor information; (3) trains these modules only a few times during test time to match the nearest neighbor-based pseudo-label distribution and a prototype-based class distribution for the test data; and (4) predicts the label of test data using the average predicted class distribution from these modules. The pseudo-label generation is based on the basic intuition that a test sample and its nearest neighbor in the embedding space are likely to share the same label under the domain shift. By utilizing multiple randomly initialized adaptation modules, TAST extracts useful information for the classification of the test data under the domain shift, using the nearest neighbor information. Our experiments on two standard benchmark tasks, domain generalization and image corruption, show that TAST outperforms the state-of-the-art TTA methods.","test-time adaptation, domain adaptation, domain shift" Learning to Counter: Stochastic Feature-based Learning for Diverse Counterfactual Explanations,https://openreview.net/forum?id=FWPLpE981t,https://openreview.net/pdf?id=FWPLpE981t,,"Interpretable machine learning seeks to understand the reasoning process of complex black-box systems that are long notorious for their lack of explainability. One growing interpretability approach is counterfactual explanations, which go beyond why a system arrives at a certain decision to further provide suggestions on what a user can do to alter the outcome. A counterfactual example must be able to counter the original prediction from the black-box classifier, while also satisfying various constraints for practical applications. These constraints involve trade-offs with one another, presenting significant challenges to existing works. To this end, we propose a stochastic learning-based framework that effectively balances the counterfactual trade-offs. The framework consists of a generation and a feature selection module with complementary roles: the former aims to model the distribution of valid counterfactuals whereas the latter serves to enforce additional constraints in a way that allows for differentiable training and amortized optimization. We demonstrate the effectiveness of our method in generating actionable and plausible counterfactuals that are more diverse than those of existing methods, and particularly in a more efficient manner than counterparts of the same capacity.", Accurate Neural Training with 4-bit Matrix Multiplications at Standard Formats,https://openreview.net/forum?id=yTbNYYcopd,https://openreview.net/pdf?id=yTbNYYcopd,A method to quantize all training matrix multiplications in 4 bits with standard formats,"Quantization of the weights and activations is one of the main methods to reduce the computational footprint of Deep Neural Network (DNN) training. Current methods enable 4-bit quantization of the forward phase. However, this constitutes only a third of the training process. Reducing the computational footprint of the entire training process requires the quantization of the neural gradients, i.e., the loss gradients with respect to the outputs of intermediate neural layers.
Previous works separately showed that accurate 4-bit quantization of the neural gradients needs to (1) be unbiased and (2) have a log scale. However, no previous work aimed to combine both ideas, as we do in this work. Specifically, we examine the importance of having unbiased quantization in quantized neural network training, where to maintain it, and how to combine it with logarithmic quantization. Based on this, we suggest a $\textit{logarithmic unbiased quantization}$ (LUQ) method to quantize both the forward and backward phases to 4-bit, achieving state-of-the-art results in 4-bit training without overhead. For example, in ResNet50 on ImageNet, we achieved a degradation of 1.1%. We further improve this to a degradation of only 0.32% after three epochs of high-precision fine-tuning, combined with a variance reduction method---where both these methods add overhead comparable to previously suggested methods. A reference implementation is supplied in the supplementary material.","quantization, 4bit, acceleration, compression" SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient,https://openreview.net/forum?id=-azium0cV9,https://openreview.net/pdf?id=-azium0cV9,"We propose a model-parallel training algorithm designed for poorly connected, heterogeneous, unreliable devices (i.e. preemptible instances or volunteer devices).","Many deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. In this work, we consider alternative setups for training large models: using cheap ``preemptible'' instances or pooling existing resources from multiple regions. We analyze the performance of existing model-parallel algorithms in these conditions and find configurations where training larger models becomes less communication-intensive. Based on these findings, we propose SWARM Parallelism (Stochastically Wired Adaptively Rebalanced Model Parallelism), a model-parallel training algorithm designed for poorly connected, heterogeneous and unreliable devices. SWARM creates temporary randomized pipelines between nodes that are rebalanced in case of failure. We empirically validate our findings and compare SWARM Parallelism with existing large-scale training approaches. Finally, we combine our insights with compression strategies to train a large Transformer language model with 1B shared parameters ($\approx$13B before sharing) on preemptible T4 GPUs with less than 200 Mb/s of network bandwidth.","distributed training, model-parallel training, model parallelism, fault-tolerant training, communication efficiency, volunteer computing" Relative representations enable zero-shot latent space communication,https://openreview.net/forum?id=SrC-nwieGJ,https://openreview.net/pdf?id=SrC-nwieGJ,"Relative representations can be leveraged to enable solving tasks regarding ""latent communication"": from zero-shot model stitching to latent space comparison between diverse settings.","Neural networks embed the geometric structure of a data manifold lying in a high-dimensional space into latent representations. Ideally, the distribution of the data points in the latent space should depend only on the task, the data, the loss, and other architecture-specific constraints. However, factors such as random weight initialization, training hyperparameters, or other sources of randomness in the training phase may induce incoherent latent spaces that hinder any form of reuse. 
Nevertheless, we empirically observe that, under the same data and modeling choices, distinct latent spaces typically differ by an unknown quasi-isometric transformation; that is, the distances between the encodings are the same in each space. In this work, we propose to adopt pairwise similarities as an alternative data representation that can be used to enforce the desired invariance without any additional training. We show how neural architectures can leverage these relative representations to guarantee, in practice, latent isometry invariance, effectively enabling latent space communication: from zero-shot model stitching to latent space comparison between diverse settings. We extensively validate the generalization capability of our approach on different datasets, spanning various modalities (images, text, graphs), tasks (e.g., classification, reconstruction) and architectures (e.g., CNNs, GCNs, transformers).","relative representation, zero-shot, stitching, invariance, latent communication, isometry, representation learning" oViT: An Accurate Second-Order Pruning Framework for Vision Transformers,https://openreview.net/forum?id=zYWtq_HUCoi,https://openreview.net/pdf?id=zYWtq_HUCoi,We have proposed a new framework for efficient compression of Vision Transformers with a novel pruning method leveraging second-order information and optimization of the training procedure.,"Models from the Vision Transformer (ViT) family have recently provided breakthrough results across image classification tasks such as ImageNet. Yet, they still face barriers to deployment, notably the fact that their accuracy can be severely impacted by compression techniques such as pruning. In this paper, we take a step towards addressing this issue by introducing \textit{Optimal ViT Surgeon (oViT)}, a new state-of-the-art method for the weight sparsification of Vision Transformer (ViT) models. At the technical level, oViT introduces a new weight pruning algorithm which leverages second-order information, specifically adapted to be both highly-accurate and efficient in the context of ViTs. We complement this accurate one-shot pruner with an in-depth investigation of gradual pruning, augmentation, and recovery schedules for ViTs, which we show to be critical for successful ViT compression. We validate our method via extensive experiments on classical ViT and DeiT models, as well as on newer variants, such as XCiT, EfficientFormer and Swin. Moreover, our results are even relevant to recently-proposed highly-accurate ResNets. Our results show for the first time that ViT-family models can in fact be pruned to high sparsity levels (e.g. $\geq 75\%$) with low impact on accuracy ($\leq 1\%$ relative drop), and that our approach outperforms prior methods by significant margins at high sparsities. In addition, we show that our method is compatible with structured pruning methods and quantization, and that it can lead to significant speedups on a sparsity-aware inference engine. ","neural network pruning, vision transformer, sparsity, model compression" Learning Basic Interpretable Factors from Temporal Signals via Physics Symmetry,https://openreview.net/forum?id=ifaAztwEHIN,https://openreview.net/pdf?id=ifaAztwEHIN,This study uses physics symmetry as an effective inductive bias to learn interpretable representations from time-series data in a self-supervised fashion. 
,"We have recently seen great progress in learning interpretable music representations, ranging from basic factors, such as pitch and timbre, to high-level concepts, such as chord, texture and melody contour. However, most methods rely heavily on music domain knowledge and it remains an open question how to learn interpretable and disentangled representations using inductive biases that are more general. In this study, we use \textit{physical symmetry} as a self-consistency constraint on the latent space. Specifically, it requires the prior model that characterises the dynamics of the latent states to be \textit{equivariant} with respect to a certain group transformation (say, translation and rotation). We show that our model can learn \textit{linear} pitch factor (that agrees with human music perception) as well as pitch-timbre disentanglement from unlabelled monophonic music audio. In addition, the same methodology can be applied to computer vision, learning the 3D Cartesian space as well as space-colour disentanglement from a simple moving object shot by a single fix camera. Furthermore, applying physical symmetry to the prior model naturally leads to \textit{representation augmentation}, a new learning technique which helps improve sample efficiency. ","Physics Symmetry, Time series data, Self-supervised Learning, Representation Augmentation" Addressing High-dimensional Continuous Action Space via Decomposed Discrete Policy-Critic,https://openreview.net/forum?id=blCpfjAeFkn,https://openreview.net/pdf?id=blCpfjAeFkn,,"Reinforcement learning (RL) methods for discrete action spaces like DQNs are being widely used in tasks such as Atari games. However, they encounter difficulties when addressing continuous control tasks, since discretizing continuous action space incurs the curse-of-dimensionality. To tackle continuous control tasks via discretized actions, we propose a decomposed discrete policy-critic (D2PC) architecture, which was inspired by multi-agent RL (MARL) and associates with each action dimension a discrete policy, while leveraging a single critic network to provide a shared evaluation. Building on D2PC, we advocate soft stochastic D2PC (SD2PC) and deterministic D2PC (D3PC) methods with a discrete stochastic or deterministic policy, which show comparable or superior training performances relative to even continuous actor-critic methods. Additionally, we design a mechanism that allows D3PC to interact with continuous actor-critic methods, contributing to the Q-policy-critic (QPC) algorithm, which inherits the training efficiency of discrete RL and the near-optimal final performance of continuous RL algorithms. Substantial experimental results on several continuous benchmark tasks validate our claims.","reinforcement learning, continuous control, actor-critic, decomposed policy, discretized action" Unsupervised Manifold Alignment with Joint Multidimensional Scaling,https://openreview.net/forum?id=lUpjsrKItz4,https://openreview.net/pdf?id=lUpjsrKItz4,A novel approach for unsupervised manifold alignment that only requires intra-domain pairwise dissimilarities as input.,"We introduce Joint Multidimensional Scaling, a novel approach for unsupervised manifold alignment, which maps datasets from two different domains, without any known correspondences between data instances across the datasets, to a common low-dimensional Euclidean space. 
Our approach integrates Multidimensional Scaling (MDS) and Wasserstein Procrustes analysis into a joint optimization problem to simultaneously generate isometric embeddings of data and learn correspondences between instances from two different datasets, while only requiring intra-dataset pairwise dissimilarities as input. This unique characteristic makes our approach applicable to datasets without access to the input features, such as solving the inexact graph matching problem. We propose an alternating optimization scheme to solve the problem, which can fully benefit from the optimization techniques for MDS and Wasserstein Procrustes. We demonstrate the effectiveness of our approach in several applications, including joint visualization of two datasets, unsupervised heterogeneous domain adaptation, graph matching, and protein structure alignment.","unsupervised manifold alignment, multidimensional scaling, optimal transport, graph matching" Can Single-Pass Contrastive Learning Work for Both Homophilic and Heterophilic Graph?,https://openreview.net/forum?id=XE0cIoi-sZ1,https://openreview.net/pdf?id=XE0cIoi-sZ1,,"Existing graph contrastive learning (GCL) typically requires two forward passes for a single instance to construct the contrastive loss. Despite its remarkable success, it is unclear whether such a dual-pass design is (theoretically) necessary. Moreover, empirical results have hitherto been limited to homophilic graph benchmarks. Then a natural question arises: Can we design a method that works for both homophilic and heterophilic graphs with a performance guarantee? To answer this, we theoretically analyze the concentration property of features obtained by neighborhood aggregation on both homophilic and heterophilic graphs, introduce the single-pass graph contrastive learning loss based on the property, and provide performance guarantees of the minimizer of the loss on downstream tasks. As a direct consequence of our theory, we introduce the Single-Pass Graph Contrastive Learning method (SP-GCL). Empirically, on 14 benchmark datasets with varying degrees of heterophily, the features learned by SP-GCL can match or outperform existing strong baselines with significantly less computational overhead, and empirical results confirm the conclusions derived from our analysis in real-world cases.",Graph Contrastive Learning TOAST: Topological Algorithm for Singularity Tracking,https://openreview.net/forum?id=BN_P4LNiK2,https://openreview.net/pdf?id=BN_P4LNiK2,We develop a multi-scale score that characterises singularities of arbitrary (i.e. non-manifold) data spaces,"The manifold hypothesis, which assumes that data lie on or close to an unknown manifold of low intrinsic dimensionality, is a staple of modern machine learning research. However, recent work has shown that real-world data exhibit distinct non-manifold structures, which result in singularities that can lead to erroneous conclusions about the data. Detecting such singularities is therefore crucial as a precursor to interpolation and inference tasks. We address detecting singularities by developing (i) persistent local homology, a new topology-driven framework for quantifying the intrinsic dimension of a data set locally, and (ii) Euclidicity, a topology-based multi-scale measure for assessing the ‘manifoldness’ of individual points. 
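(As a rough, hedged illustration of scoring ‘manifoldness’ locally, a local-PCA intrinsic-dimension estimate on a toy space with a singular seam; this is a simplified stand-in, not the persistent local homology or Euclidicity measures described above, and the dataset, k, and threshold below are assumptions.)

```python
import numpy as np

def local_pca_dimension(X, idx, k=40, var_threshold=0.95):
    # Local intrinsic dimension at point idx: how many principal components
    # of its k-nearest-neighbour patch explain `var_threshold` of the variance.
    d2 = ((X - X[idx]) ** 2).sum(axis=1)
    nbrs = X[np.argsort(d2)[1:k + 1]]                # exclude the point itself
    evals = np.sort(np.linalg.eigvalsh(np.cov(nbrs.T)))[::-1]
    ratios = np.cumsum(evals) / evals.sum()
    return int(np.searchsorted(ratios, var_threshold) + 1)

rng = np.random.default_rng(0)
# Two 2-D planes glued along a line; points near the seam are singular.
plane1 = np.c_[rng.uniform(-1, 1, 500), rng.uniform(0, 1, 500), np.zeros(500)]
plane2 = np.c_[rng.uniform(-1, 1, 500), np.zeros(500), rng.uniform(0, 1, 500)]
X = np.vstack([plane1, plane2])
interior = local_pca_dimension(X, idx=int(np.argmax(X[:, 1])))   # deep inside plane1
seam = local_pca_dimension(X, idx=int(np.argmin(np.abs(X[:, 1]) + np.abs(X[:, 2]))))
print("interior local dim:", interior, "| seam local dim:", seam)  # typically 2 vs. 3
```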
We show that our approach can reliably identify singularities of complex spaces, while also capturing singular structures in real-world data sets.","topology, persistent homology, topological data analysis, tda, stratified spaces, singularities" Robust Manifold Estimation Approach for Evaluating Fidelity and Diversity,https://openreview.net/forum?id=v4ePDrH91D,https://openreview.net/pdf?id=v4ePDrH91D,,"We propose a robust and reliable evaluation metric for generative models by introducing topological and statistical treatments for a rigorous support manifold estimation. Existing metrics, such as Inception Score (IS), Frechet Inception Distance (FID), and the variants of Precision and Recall (P&R), heavily rely on support manifolds that are estimated from sample features. However, the reliability of their estimation has not been seriously discussed (and has instead been overlooked), even though the quality of the evaluation entirely depends on it. In this paper, we propose Topological Precision and Recall (TopP&R, pronounced “topper”), which provides a systematic approach to estimating support manifolds, retaining only topologically and statistically important features with a certain level of confidence. This not only makes TopP&R robust to noisy features, but also provides statistical consistency. Our theoretical and experimental results show that TopP&R is robust to outliers and non-independent and identically distributed (Non-IID) perturbations, while accurately capturing the true trend of change in samples. To the best of our knowledge, this is the first evaluation metric focused on robust estimation of the support manifold that also provides statistical consistency under noise.", Disentangling Writer and Character Styles for Handwriting Generation,https://openreview.net/forum?id=ApNK_ApJoec,https://openreview.net/pdf?id=ApNK_ApJoec,,"Training machines for synthesizing diverse handwritings is an intriguing task. Recently, some RNN-based methods have been proposed to generate stylized online Chinese characters. But these methods mainly focus on learning a person’s overall writing style and hence neglect the detailed style inconsistencies between characters from the same writer. For example, one person’s handwriting always exhibits an overall uniformity (e.g., character slant and aspect ratios), but there are still small style differences between local regions (e.g., stroke length and curvature) of characters. Motivated by this, in this paper, we propose to disentangle the style representations at both writer and character levels from individual handwritings. Specifically, we propose the style-disentangled transformer (SDT), equipped with two complementary contrastive objectives, to extract the overall writer-wise and detailed character-wise style representations, respectively, which boosts the generation quality of online handwritings. Extensive experiments on various language scripts verify the superiority of SDT. 
In particular, we empirically find that the two learned style representations provide information at different frequency magnitudes, which demonstrates the necessity of separate style extraction.", Exploiting Certified Defences to Attack Randomised Smoothing,https://openreview.net/forum?id=AeTl9sbF-VT,https://openreview.net/pdf?id=AeTl9sbF-VT,"Certified defences can be used to attack the models they certify, yielding smaller adversarial perturbations","Certified guarantees of adversarial robustness play an important role in providing assurances regarding a model's output, irrespective of the behaviour of an attacker. However, while the development of such guarantees has drawn upon an improved understanding of attacker behaviour, so too can certified guarantees be exploited in order to generate more efficient adversarial attacks. Within this work, we explore this heretofore undiscovered additional attack surface, while also considering how previously discovered attacks could be applied to models defended by randomised smoothing. In all bar one experiment, our approach generates smaller adversarial perturbations for more than $70 \%$ of tested samples, reducing the average magnitude of the adversarial perturbation by $13 \%$.","adversarial, attack, certified robustness, machine learning" Simple and Scalable Nearest Neighbor Machine Translation,https://openreview.net/forum?id=uu1GBD9SlLe,https://openreview.net/pdf?id=uu1GBD9SlLe,We propose a simple and scalable nearest neighbor machine translation framework to drastically improve the decoding and storage efficiency of $k$NN-MT,"$k$NN-MT is a straightforward yet powerful approach for fast domain adaptation, which directly equips pre-trained neural machine translation (NMT) models with domain-specific token-level $k$-nearest-neighbor ($k$NN) retrieval to achieve domain adaptation without retraining. Despite being conceptually attractive, $k$NN-MT is burdened with massive storage requirements and high computational complexity since it conducts nearest neighbor searches over the entire reference corpus. In this paper, we propose a simple and scalable nearest neighbor machine translation framework to drastically improve the decoding and storage efficiency of $k$NN-based models while maintaining the translation performance. To this end, we dynamically construct an extremely small datastore for each input via sentence-level retrieval to avoid searching the entire datastore in vanilla $k$NN-MT, based on which we further introduce a distance-aware adapter to adaptively incorporate the $k$NN retrieval results into the pre-trained NMT models. Experiments on machine translation in two general settings, static domain adaptation, and online learning, demonstrate that our proposed approach not only runs at almost 90% of the speed of the NMT model without performance degradation, but also significantly reduces the storage requirements of $k$NN-MT. ","Nearest Neighbor, Machine Translation" On the effectiveness of out-of-distribution data in self-supervised long-tail learning.,https://openreview.net/forum?id=v8JIQdiN9Sh,https://openreview.net/pdf?id=v8JIQdiN9Sh,,"Though Self-supervised learning (SSL) has been widely studied as a promising technique for representation learning, it does not generalize well on long-tailed datasets because the majority classes dominate the feature space. 
Recent work shows that long-tailed learning performance can be boosted by sampling extra in-domain (ID) data for self-supervised training; however, large-scale ID data that can rebalance the minority classes are expensive to collect. In this paper, we propose an alternative but easy-to-use and effective solution, \textbf{C}ontrastive with \textbf{O}ut-of-distribution (OOD) data for \textbf{L}ong-\textbf{T}ail learning (COLT), which can effectively exploit OOD data to dynamically re-balance the feature space. We empirically identify the counter-intuitive usefulness of OOD samples in SSL long-tailed learning and, based on this observation, design a novel SSL method. Concretely, we first localize the `\emph{head}' and `\emph{tail}' samples by assigning a tailness score to each OOD sample based on its neighborhood in the feature space. Then, we propose an online OOD sampling strategy to dynamically re-balance the feature space. Finally, we enforce the model to be capable of distinguishing ID and OOD samples by a distribution-level supervised contrastive loss. Extensive experiments are conducted on various datasets and several state-of-the-art SSL frameworks to verify the effectiveness of the proposed method. The results show that our method improves the performance of SSL on long-tailed datasets by a large margin, and even outperforms previous work which uses external ID data.","self-supervised learning, long-tail learning, out-of-distribution data" Dynamic Update-to-Data Ratio: Minimizing World Model Overfitting,https://openreview.net/forum?id=ZIkHSXzd9O7,https://openreview.net/pdf?id=ZIkHSXzd9O7,,"Early stopping based on the validation set performance is a popular approach to find the right balance between under- and overfitting in the context of supervised learning. However, in reinforcement learning, even for supervised sub-problems such as world model learning, early stopping is not applicable as the dataset is continually evolving. As a solution, we propose a new general method that dynamically adjusts the update to data (UTD) ratio during training based on under- and overfitting detection on a small subset of the continuously collected experience not used for training. We apply our method to DreamerV2, a state-of-the-art model-based reinforcement learning algorithm, and evaluate it on the DeepMind Control Suite and the Atari 100k benchmark. The results demonstrate that one can better balance under- and overfitting by adjusting the UTD ratio with our approach compared to the default setting in DreamerV2 and that it is competitive with an extensive hyperparameter search which is not feasible for many applications. Our method eliminates the need to set the UTD hyperparameter by hand and even leads to a higher robustness with regard to other learning-related hyperparameters, further reducing the amount of necessary tuning.", Deep Leakage from Model in Federated Learning,https://openreview.net/forum?id=aBFFLGhi381,https://openreview.net/pdf?id=aBFFLGhi381,,"Distributed machine learning has been widely used in recent years to tackle large and complex dataset problems. Accordingly, the security of distributed learning has also drawn increasing attention from both academia and industry. In this context, federated learning (FL) was developed as a “secure” form of distributed learning: private training data are maintained locally, and only public model gradients are communicated between parties. 
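(For context, a minimal sketch of this communication pattern in the FedAvg style, where clients keep data local and share only model parameters; the linear model, learning rate, and round count are illustrative assumptions.)

```python
import numpy as np

def local_sgd(w, X, y, lr=0.1, epochs=1):
    # One client's local update on a linear model with squared loss.
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def fedavg_round(w_global, clients):
    # Clients send model weights (FedAvg-style), not raw data; the server
    # averages them, weighted by local dataset size.
    updates = [(local_sgd(w_global.copy(), X, y), len(y)) for X, y in clients]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total

rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
clients = []
for _ in range(4):
    X = rng.normal(size=(50, 5))
    clients.append((X, X @ w_true + 0.01 * rng.normal(size=50)))

w = np.zeros(5)
for t in range(30):
    w = fedavg_round(w, clients)
print("distance to w_true:", np.linalg.norm(w - w_true))
```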
However, to date, a variety of gradient leakage attacks have been proposed for this procedure and have proved that it is insecure. Yet these attacks share a common drawback: they require too much auxiliary information, such as model weights, optimizers, and some hyperparameters (e.g., the learning rate), which are difficult to obtain in real situations. Moreover, many existing FL algorithms, such as FedAvg, avoid transmitting model gradients and instead send model weights, but few have considered the security risk this poses. In this paper, we present two novel frameworks, DLM and DLM+, to demonstrate that transmitting model weights is also likely to leak clients' private local data under the FL scenario. In addition, a variety of experiments are performed to illustrate the effect and generality of our attack frameworks. At the end of this paper, we also introduce two defenses to the proposed attacks and evaluate their protective effects. The proposed attack and defense schemes can also be applied to general distributed learning scenarios, with some appropriate customization.","Federated learning, model leakage, data security" A Universal 3D Molecular Representation Learning Framework,https://openreview.net/forum?id=6K2RM6wVqKu,https://openreview.net/pdf?id=6K2RM6wVqKu,A universal 3D molecular pretraining framework that significantly enlarges the representation ability and application scope in drug design.,"Molecular representation learning (MRL) has gained tremendous attention due to its critical role in learning from limited supervised data for applications like drug design. In most MRL methods, molecules are treated as 1D sequential tokens or 2D topology graphs, limiting their ability to incorporate 3D information for downstream tasks and, in particular, making it almost impossible for 3D geometry prediction/generation. In this paper, we propose a universal 3D MRL framework that significantly enlarges the representation ability and application scope of MRL schemes. The proposed framework contains two pretrained models with the same SE(3)-equivariant transformer architecture: a molecular model pretrained on 209M molecular conformations; a pocket model pretrained on 3M candidate protein pockets. In addition, the framework contains several finetuning strategies to apply the pretrained models to various downstream tasks. By properly incorporating 3D information, the framework outperforms SOTA in 14/15 molecular property prediction tasks. Moreover, the framework achieves superior performance in 3D spatial tasks, including protein-ligand binding pose prediction, molecular conformation generation, etc. The code, model, and data will be made publicly available.","Representation Learning, Large-Scale 3D Molecular Pretraining, Molecular Property, Protein-Ligand Complex" CAPE: Channel-Attention-Based PDE Parameter Embeddings for SciML,https://openreview.net/forum?id=22z1JIM6mwI,https://openreview.net/pdf?id=22z1JIM6mwI,a new parameter embedding module based on channel-attention for scientific machine learning,"Scientific Machine Learning (SciML) designs machine learning methods that predict physical systems governed by partial differential equations (PDEs). These ML-based surrogate models substitute inefficient and often non-differentiable numerical simulation algorithms and find multiple applications, such as weather forecasting, molecular dynamics, and medicine. 
While a number of ML-based methods for approximating the solutions of PDEs have been proposed in recent years, they typically do not consider the parameters of the PDEs, making it difficult for the ML surrogate models to generalize to PDE parameters not seen during training. We propose a new channel-attention-based parameter embedding (CAPE) component for scientific machine learning models and a simple and effective curriculum learning strategy. The CAPE module can be combined with any kind of ML surrogate model, allowing it to adapt to changing PDE parameters without harming the original model's ability to find approximate solutions to PDEs. The curriculum learning strategy provides a seamless transition between teacher-forcing and fully auto-regressive training. We compare CAPE in conjunction with the curriculum learning strategy using a PDE benchmark and obtain consistent and significant improvements over the base models. The experiments also show several advantages of CAPE, such as its increased ability to generalize to unseen PDE parameters without substantially increasing inference time and parameter count. An implementation of the method and experiments are available at \url{https://anonymous.4open.science/r/CAPE-ML4Sci-145B}.","machine learning, partial differential equation, attention, generalization" Topic and Hyperbolic Transformer to Handle Multi-modal Dependencies,https://openreview.net/forum?id=96kgRrpnkgS,https://openreview.net/pdf?id=96kgRrpnkgS,,"While multi-modal search relies on jointly learning image-text representations and has been investigated in the literature, our innovation is Chimera, a framework for learning their representations and similarities. Because the core of multi-modal search is learning the modalities in a shared semantic space and measuring their similarities, search quality depends on which expressive space is utilized in learning. This motivates us to identify the space that can elucidate their semantic and complex relationships with small information loss. Novelty is assured by introducing topic and hyperbolic spaces and by performing contrastive/metric learning tasks to ensure the cooperation of these spaces with the Transformer. Experiments show that Chimera empowers pre-trained models for multi-modal search tasks and demonstrate the effectiveness of the layers it introduces.","Multi-modal search, Hyperbolic space, Hyperbolic geometry, Lorentz model, Transformer, Topic models" Variance Covariance Regularization Enforces Pairwise Independence in Self-Supervised Representations,https://openreview.net/forum?id=Nn-7OXvqmSW,https://openreview.net/pdf?id=Nn-7OXvqmSW,"We study how SSL methods such as VICReg and Barlow Twins enforce pairwise independence of representations via their Variance Covariance regularization (VCReg), improve VICReg using our findings and show VCReg to be beneficial outside of SSL.","Self-Supervised Learning (SSL) methods such as VICReg, Barlow Twins or W-MSE avoid collapse of their joint embedding architectures by constraining or regularizing the covariance matrix of their projector’s output. This study highlights important properties of such a strategy, which we coin Variance-Covariance regularization (VCReg). More precisely, we show that VCReg enforces pairwise independence between the features of the learned representation. This result emerges by bridging VCReg applied on the projector’s output to kernel independence criteria applied on the projector’s input. 
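(A minimal PyTorch sketch of such a variance-covariance regularizer, following the published VICReg form: a hinge keeping per-feature standard deviations above a margin plus a penalty on off-diagonal covariance entries; the margin gamma=1 and batch size are assumptions.)

```python
import torch

def vc_regularizer(z, gamma=1.0, eps=1e-4):
    # Variance-Covariance regularization on a batch of embeddings z: (N, D).
    # Variance term: hinge keeping each feature's std above gamma (anti-collapse).
    # Covariance term: pushes off-diagonal covariance entries to zero, which
    # (per the claim above) encourages pairwise-independent features.
    z = z - z.mean(dim=0)
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = torch.relu(gamma - std).mean()
    N, D = z.shape
    cov = (z.T @ z) / (N - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / D
    return var_loss, cov_loss

z = torch.randn(256, 32, requires_grad=True)
v, c = vc_regularizer(z)
(v + c).backward()          # differentiable, so it can be added to an SSL loss
print(float(v), float(c))
```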
This provides the first theoretical motivations and explanations of VCReg. We empirically validate our findings: (i) we identify which projector characteristics favor pairwise independence; (ii) we use these findings to obtain nontrivial performance gains for VICReg; and (iii) we demonstrate that the scope of VCReg goes beyond SSL by using it to solve Independent Component Analysis. We hope that our findings will support the adoption of VCReg in SSL and beyond.","Self-supervised learning, VICReg, Barlow Twins, HSIC" DEP-RL: Embodied Exploration for Reinforcement Learning in Overactuated and Musculoskeletal Systems,https://openreview.net/forum?id=C-xa_D3oTj6,https://openreview.net/pdf?id=C-xa_D3oTj6,A technique from the self-organization literature is used to improve performance of RL agents on overactuated systems with up to 120 muscle actuators.,"Muscle-actuated organisms are capable of learning an unparalleled diversity of dexterous movements despite their vast number of muscles. Reinforcement learning (RL) on large musculoskeletal models, however, has not been able to show similar performance. We conjecture that ineffective exploration in large overactuated action spaces is a key problem. This is supported by the finding that common exploration noise strategies are inadequate in synthetic examples of overactuated systems. We identify differential extrinsic plasticity (DEP), a method from the domain of self-organization, as being able to induce state-space covering exploration within seconds of interaction. By integrating DEP into RL, we achieve fast learning of reaching and locomotion in musculoskeletal systems, outperforming current approaches in all considered tasks in sample efficiency and robustness.","reinforcement learning, musculoskeletal, correlated exploration" Restricted Generative Projection for One-Class Classification and Anomaly detection,https://openreview.net/forum?id=yBKkp5LT3FX,https://openreview.net/pdf?id=yBKkp5LT3FX,,"We present a novel framework for one-class classification and anomaly detection. The core idea is to learn a mapping to transform the unknown distribution of training (normal) data to a known distribution that is supposed to be different from the transformed distribution of unknown abnormal data. Crucially, the target distribution of training data should be sufficiently simple, compact, and informative. The simplicity is to ensure that we can sample from the distribution easily, the compactness is to ensure that the decision boundary between normal data and abnormal data is clear and reliable, and the informativeness is to ensure that the transformed data preserve the important information of the original data. Therefore, we propose to use a truncated Gaussian, a uniform distribution in a hyperball, a uniform distribution on a hypersphere, or a uniform distribution between hyperspheres as the target distribution. We then minimize the distance between the transformed data distribution and the target distribution while keeping the reconstruction error for the original data small enough. Our model is simple and easy to train, especially compared with those based on generative models. 
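(A hedged sketch of this objective shape: an autoencoder whose latent batch is pulled toward a uniform-on-hypersphere target while the reconstruction error stays small; MMD is assumed as the distance here, and the architecture and weights are illustrative, not the paper's exact choices.)

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    # Biased estimate of squared MMD with an RBF kernel.
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def sphere_samples(n, d):
    # Uniform samples on the unit hypersphere: normalized Gaussian draws.
    g = torch.randn(n, d)
    return g / g.norm(dim=1, keepdim=True)

enc = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))
dec = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.ReLU(), torch.nn.Linear(64, 20))
opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-3)
x = torch.randn(512, 20)                        # stand-in for normal training data

for step in range(200):
    z = enc(x)
    loss = torch.nn.functional.mse_loss(dec(z), x) \
         + rbf_mmd2(z, sphere_samples(len(z), z.shape[1]))
    opt.zero_grad(); loss.backward(); opt.step()

# At test time, the distance of enc(x_new) to the target support can score anomalies.
print("mean latent norm:", float(enc(x).norm(dim=1).mean()))  # drifts toward 1
```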
Comparative studies on a few benchmark datasets verify the effectiveness of our method in comparison to baselines.", Name Your Colour For the Task: Artificially Discover Colour Naming via Colour Quantisation Transformer,https://openreview.net/forum?id=z2kUV2XQBT2,https://openreview.net/pdf?id=z2kUV2XQBT2,a new colour quantisation transformer to artificially discover and evolve a colour-naming system similar to those in human languages,"The long-standing theory that a colour-naming system evolves under the dual pressures of efficient communication and perceptual mechanisms is supported by a growing number of linguistic studies, including the analysis of four decades of diachronic data from the Nafaanra language. This inspires us to explore whether artificial intelligence could evolve and discover a similar colour-naming system via optimising the communication efficiency represented by high-level recognition performance. Here, we propose a novel colour quantisation transformer, CQFormer, that quantises the colour space while maintaining the accuracy of machine recognition on the quantised image. Given an RGB image, the annotation branch maps it into an index map before generating the quantised image with a colour palette, while the palette branch uses a key-point detection approach to find proper palette colours within the whole colour space. By interacting with colour annotation, CQFormer is able to balance both the machine vision accuracy and the colour perceptual structure, such as a distinct and stable colour distribution, for the discovered colour system. Very interestingly, we even observe a consistent evolution pattern between our artificial colour system and basic colour terms across human languages. Besides, our approach offers an efficient quantisation method that effectively compresses image storage while maintaining high performance in high-level recognition tasks such as classification and detection. Extensive experiments demonstrate the superior performance of our method with extremely low bit-rate colours. We will release the source code upon acceptance. ","colour quantisation, image compression, artificial colour naming system" The Generalized Eigenvalue Problem as a Nash Equilibrium,https://openreview.net/forum?id=PEgBEB74JjB,https://openreview.net/pdf?id=PEgBEB74JjB,"We formulate the solution to the generalized eigenvalue problem as the Nash of a game, design an unbiased streaming-style algorithm to solve it, and analyze neural representations 1000x larger than before.","The symmetric generalized eigenvalue problem (SGEP) is a fundamental concept in numerical linear algebra. It captures the solution of many classical machine learning problems such as canonical correlation analysis, independent components analysis, partial least squares, linear discriminant analysis, principal components and others. Despite this, most general solvers are prohibitively expensive when dealing with *streaming data sets* (i.e., minibatches) and research has instead concentrated on finding efficient solutions to specific problem instances. In this work, we develop a game-theoretic formulation of the top-$k$ SGEP whose Nash equilibrium is the set of generalized eigenvectors. We also present a parallelizable algorithm with guaranteed asymptotic convergence to the Nash. Current state-of-the-art methods require $\mathcal{O}(d^2k)$ runtime complexity per iteration, which is prohibitively expensive when the number of dimensions ($d$) is large. 
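(For reference, the dense baseline that streaming formulations aim to avoid: SciPy's symmetric generalized eigensolver on explicit matrices, which assumes A and B fit in memory and scales cubically in $d$.)

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
d = 6
A = rng.normal(size=(d, d)); A = (A + A.T) / 2                            # symmetric
B = rng.normal(size=(d, 2 * d)); B = B @ B.T / (2 * d) + 0.1 * np.eye(d)  # SPD
w, V = eigh(A, B)                       # solves A v = w B v (w ascending)
v, lam = V[:, -1], w[-1]                # top generalized eigenpair
print(np.allclose(A @ v, lam * (B @ v)))        # True
print(np.allclose(V.T @ B @ V, np.eye(d)))      # eigenvectors are B-orthonormal
```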
We show how to modify this parallel approach to achieve $\mathcal{O}(dk)$ runtime complexity. Empirically, we demonstrate that the resulting algorithm is able to solve a variety of SGEP problem instances, including a large-scale analysis of neural network activations.","generalized eigenvalue problem, nash, riemannian optimization, canonical correlation analysis, independent component analysis, distributed computing" learning hierarchical multi-agent cooperation with long short-term intention,https://openreview.net/forum?id=rWW2WAGjfOi,https://openreview.net/pdf?id=rWW2WAGjfOi,This paper proposes a new hierarchical multi-agent cooperation framework which leverages long short-term intention to improve agents' coordination,"Communication is a significant means of alleviating partial observability and non-stationarity in multi-agent systems. However, most existing work requires persistent communication, exchanging partial observations or latent embeddings (intentions), which is unrealistic in real-world settings. To overcome this problem, we propose learning hierarchical multi-agent cooperation with long short-term intention (HLSI), a hierarchical multi-agent cooperation framework. In our work, each agent communicates by sharing its high-level policy's latent embedding (long-term intention), which stays constant until the macro action changes. To make the communication messages contain more useful content, we maximize the mutual information between an agent's macro action and its future trajectory conditioned on the historical trajectory. Each agent integrates these messages through an attention mechanism. Then, a long short-term intention fusion module fuses the long-term intentions received from other agents with the short-term intentions inferred by a behavior inference network to approximate other agents' real short-term intentions, which helps each agent better understand others' next behavior. We provide comprehensive evaluations and ablation studies in multi-agent cooperative settings. The results show that our method achieves better performance than other multi-agent communication and hierarchical multi-agent reinforcement learning baselines.","hierarchical multi-agent reinforcement learning, intention, communication, attention, behavior inference" FEAT: A general framework for Feature-aware Multivariate Time-series Representation Learning ,https://openreview.net/forum?id=n9iRY8XFfXW,https://openreview.net/pdf?id=n9iRY8XFfXW,A self-supervised framework for learning feature-aware multivariate time-series representation,"Multivariate time series are complex and uncertain. The overall temporal patterns change dynamically over time, and each feature is often observed to have a unique pattern. Therefore, it is challenging to design a framework that can flexibly learn dynamically changing temporal patterns as well as feature-specific unique patterns simultaneously. We propose a general framework for FEature-Aware multivariate Time-series representation learning, called FEAT. Unlike previous methods that only focus on training the overall temporal dependencies, we focus on training feature-agnostic as well as feature-specific patterns in a data-driven manner. Specifically, we introduce a feature-wise encoder to explicitly model the feature-specific information and design an element-wise gating layer that learns the influence of feature-specific patterns per dataset in general. 
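(A minimal PyTorch sketch of this general idea, a learnable per-channel gate blending a shared encoding with a feature-wise, grouped-convolution encoding; the module names and dimensions are illustrative assumptions, not the FEAT architecture.)

```python
import torch

class GatedFeatureMix(torch.nn.Module):
    # Sketch of an element-wise gate blending a shared (feature-agnostic)
    # encoding with a per-feature encoding; the gate g is a learnable
    # parameter per channel, so each dataset can decide how much
    # feature-specific information to use.
    def __init__(self, num_features, dim):
        super().__init__()
        self.shared = torch.nn.Conv1d(num_features, dim, kernel_size=3, padding=1)
        # groups=num_features -> one filter per input feature (feature-wise encoder)
        self.per_feat = torch.nn.Conv1d(num_features, num_features, 3, padding=1,
                                        groups=num_features)
        self.proj = torch.nn.Conv1d(num_features, dim, kernel_size=1)
        self.gate = torch.nn.Parameter(torch.zeros(dim, 1))

    def forward(self, x):                      # x: (batch, num_features, time)
        g = torch.sigmoid(self.gate)           # per-channel blend weight in (0, 1)
        return (1 - g) * self.shared(x) + g * self.proj(self.per_feat(x))

m = GatedFeatureMix(num_features=12, dim=64)
out = m(torch.randn(8, 12, 100))
print(out.shape)                               # torch.Size([8, 64, 100])
```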
FEAT outperforms the benchmark models in average accuracy on 29 UEA multivariate classification datasets and in MSE and MAE on four multivariate forecasting datasets. ","multivariate time-series, representation learning, self-supervised learning, contrastive learning, gating, reconstruction" Learning with Auxiliary Activation for Memory-Efficient Training,https://openreview.net/forum?id=YgC62m4CY3r,https://openreview.net/pdf?id=YgC62m4CY3r,The proposed learning rule reduces training memory requirements without reduction in training speed while achieving high performance close to backpropagation.,"While deep learning has achieved great success in various fields, a large amount of memory is necessary to train deep neural networks, which hinders the development of massive state-of-the-art models. The reason is that the conventional learning rule, backpropagation, must temporarily store the input activations of all the layers in the network. To overcome this, recent studies have suggested various memory-efficient implementations of backpropagation. However, those approaches incur computational overhead due to the recomputation of activations, slowing down neural network training. In this work, we propose a new learning rule which significantly reduces memory requirements while closely matching the performance of backpropagation. The algorithm combines auxiliary activation with output activation during forward propagation, while only auxiliary activation is used during backward propagation instead of actual input activation to reduce the amount of data to be temporarily stored. We mathematically show that our learning rule can reliably train networks whose loss landscape is convex if the auxiliary activation satisfies certain conditions. Based on this observation, we suggest candidate auxiliary activations that satisfy those conditions. Experimental results confirm that the proposed learning rule achieves competitive performance compared to backpropagation in various models such as ResNet, Transformer, BERT, ViT, and MLP-Mixer.","Memory Efficient Training, Auxiliary Activation, Backpropagation, Deep Learning" Equivariant 3D-Conditional Diffusion Models for Molecular Linker Design,https://openreview.net/forum?id=cnsHSSLnHVV,https://openreview.net/pdf?id=cnsHSSLnHVV,We propose a conditional diffusion model for generating molecular linkers between disconnected fragments in 3D,"Fragment-based drug discovery has been an effective paradigm in early-stage drug development. An open challenge in this area is designing linkers between disconnected molecular fragments of interest to obtain chemically-relevant candidate drug molecules. In this work, we propose DiffLinker, an E(3)-equivariant 3D-conditional diffusion model for molecular linker design. Given a set of disconnected fragments, our model places missing atoms in between and designs a molecule incorporating all the initial fragments. Unlike previous approaches that are only able to connect pairs of molecular fragments, our method can link an arbitrary number of fragments. Additionally, the model automatically determines the number of atoms in the linker and its attachment points to the input fragments. We demonstrate that DiffLinker outperforms other methods on the standard datasets, generating more diverse and synthetically-accessible molecules. 
In addition, we experimentally test our method in real-world applications, showing that it can successfully generate valid linkers conditioned on target protein pockets.","Molecules, Drug Discovery, Molecular Linker Design, Equivariant, Diffusion Models" Existence of a bad local minimum of neural networks with general smooth activation functions,https://openreview.net/forum?id=Y9gIpiWNvtp,https://openreview.net/pdf?id=Y9gIpiWNvtp,We investigate the existence of a bad local minimum of neural networks with general smooth activation functions.,"Understanding the loss surface of neural networks is essential to the understanding of deep learning. However, the existence of a bad local minimum has not yet been fully established. We investigate the existence of a bad local minimum of $2$-layer and $3$-layer neural networks with general smooth activation functions. We provide a constructive proof using the algebraic nature of the activation functions. We show this for realistic settings where the data $(X,Y)$ have a positive measure. We hope that such results give theoretical foundations for studies related to local minima and loss surfaces.","local minimum, smooth activation, neural networks" Language Modelling with Pixels,https://openreview.net/forum?id=FkSp8VW8RjH,https://openreview.net/pdf?id=FkSp8VW8RjH,"We train PIXEL, a language model that operates solely on images of rendered text, and show that it is possible to transfer representations across languages based on orthographic similarity or the co-activation of pixels.","Language models are defined over a finite set of inputs, which creates a vocabulary bottleneck when we attempt to scale the number of supported languages. Tackling this bottleneck results in a trade-off between what can be represented in the embedding matrix and computational issues in the output layer. This paper introduces PIXEL, the Pixel-based Encoder of Language, which suffers from neither of these issues. PIXEL is a pretrained language model that renders text as images, making it possible to transfer representations across languages based on orthographic similarity or the co-activation of pixels. PIXEL is trained to reconstruct the pixels of masked patches instead of predicting a distribution over tokens. We pretrain the 86M parameter PIXEL model on the same English data as BERT and evaluate on syntactic and semantic tasks in typologically diverse languages, including various non-Latin scripts. We find that PIXEL substantially outperforms BERT on syntactic and semantic processing tasks on scripts that are not found in the pretraining data, but PIXEL is slightly weaker than BERT when working with Latin scripts. Furthermore, we find that PIXEL is more robust than BERT to orthographic attacks and linguistic code-switching, further confirming the benefits of modelling language with pixels.","representation learning, nlp, transformers, language model, masked autoencoder" Sinkhorn Discrepancy for Counterfactual Generalization,https://openreview.net/forum?id=jEV-GgJ6kRO,https://openreview.net/pdf?id=jEV-GgJ6kRO,,"Estimating individual treatment effects from observational data is very challenging due to the existence of treatment selection bias. Most existing representation-based methods mitigate this issue by aligning distributions of different treatment groups in the representation space. 
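(For intuition about such alignment, a generic log-domain Sinkhorn sketch computing an entropy-regularized transport cost between treated and control representations; this is the textbook construction, not the generalized discrepancy or regularizers proposed below.)

```python
import math
import torch

def sinkhorn_distance(x, y, eps=0.1, iters=200):
    # Entropy-regularized OT between two empirical distributions (uniform
    # weights), computed with log-domain Sinkhorn iterations for stability.
    C = torch.cdist(x, y) ** 2                       # squared-Euclidean cost
    log_a = torch.full((x.shape[0],), -math.log(x.shape[0]))
    log_b = torch.full((y.shape[0],), -math.log(y.shape[0]))
    f, g = torch.zeros(x.shape[0]), torch.zeros(y.shape[0])
    for _ in range(iters):
        f = -eps * torch.logsumexp((g[None, :] - C) / eps + log_b[None, :], dim=1)
        g = -eps * torch.logsumexp((f[:, None] - C) / eps + log_a[:, None], dim=0)
    # transport plan and its cost under the original ground metric
    P = torch.exp((f[:, None] + g[None, :] - C) / eps + log_a[:, None] + log_b[None, :])
    return (P * C).sum()

treated = torch.randn(128, 16) + 0.5                 # stand-in representations
control = torch.randn(96, 16)
print(float(sinkhorn_distance(treated, control)))
```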
However, they still suffer from two critical problems: (1) Mini-batch Sampling Effects (MSE), where the alignment easily fails due to the outcome imbalance or outliers in the batch; (2) Unobserved Confounder Effects (UCE), where the unobserved confounders damage the correct alignment. To tackle these problems, we propose a principled approach named Entire Space CounterFactual Regression (ESCFR) based on a generalized Sinkhorn discrepancy for distribution alignment within the stochastic optimal transport framework. Within this framework, we propose a relaxed mass-preserving regularizer to address the MSE issue and design a proximal factual outcome regularizer to handle the UCE issue. Extensive experiments demonstrate that our proposed ESCFR can successfully tackle the treatment selection bias and achieve significantly better performance than state-of-the-art methods.","causal inference, treatment selection bias" Massively Scaling Heteroscedastic Classifiers,https://openreview.net/forum?id=sIoED-yPK9l,https://openreview.net/pdf?id=sIoED-yPK9l,,"Heteroscedastic classifiers, which learn a multivariate Gaussian distribution over prediction logits, have been shown to perform well on image classification problems with hundreds to thousands of classes. However, compared to standard classifiers, they introduce extra parameters that scale linearly with the number of classes. This makes them infeasible to apply to larger-scale problems. In addition, heteroscedastic classifiers introduce a critical temperature hyperparameter that must be tuned. We propose HET-XL, a heteroscedastic classifier whose parameter count, compared to a standard classifier, scales independently of the number of classes. In our large-scale settings, we show that we can remove the need to tune the temperature hyperparameter by directly learning it on the training data. On large image classification datasets with up to 4B images and 30k classes, our method requires 14X fewer additional parameters, does not require tuning the temperature on a held-out set, and performs consistently better than the baseline heteroscedastic classifier. HET-XL improves ImageNet 0-shot classification in a multimodal contrastive learning setup which can be viewed as a 3.5 billion class classification problem.", Vera Verto: Multimodal Hijacking Attack,https://openreview.net/forum?id=xACeXHo4sf,https://openreview.net/pdf?id=xACeXHo4sf,We propose a new multimodal hijacking attack where the adversary can implement a hijacking task from a completely different domain.,"The increasing cost of training machine learning (ML) models has led to the inclusion of new parties in the training pipeline, such as users who contribute training data and companies that provide computing resources. The involvement of such new parties in the ML training process has introduced new attack surfaces for an adversary to exploit. A recent attack in this domain is the model hijacking attack, whereby an adversary hijacks a victim model to implement their own -- possibly malicious -- hijacking tasks. However, the scope of the model hijacking attack is so far limited to computer vision-related tasks. In this paper, we extend the model hijacking attack to a more general multimodal setting, where the hijacking and original tasks are performed on data of different modalities. Specifically, we focus on the setting where an adversary implements a natural language processing (NLP) hijacking task into an image classification model. 
To mount the attack, we propose a novel encoder-decoder based framework, namely the Blender, which relies on advanced image and language models. Experimental results show that our modal hijacking attack achieves strong performance in different settings. For instance, our attack achieves 94%, 94%, and 95% attack success rates when using the Sogou news dataset to hijack STL10, CIFAR-10, and MNIST classifiers.","Hijacking Attack, Modal Hijacking, Computer Vision, Natural Language Processing" On Incremental Learning with Long Short Term Strategy,https://openreview.net/forum?id=Gg5PaJRQbRw,https://openreview.net/pdf?id=Gg5PaJRQbRw,,"Incremental learning aims at mitigating forgetting during the sequential learning of deep neural networks. In the process, a procedure (including distillation, replaying, etc.) is usually adopted to help the model accumulate knowledge. However, we discover that tuning such a procedure can face the ``long short term dilemma'': the optimal procedure for short-term learning is not necessarily optimal for long-term learning, due to their need for different plasticity/stability balances. Existing methods have to trade one off against the other to achieve better overall performance across the incremental tasks. In this paper, we propose a novel LongShortTerm strategy that circumvents the limitations of the widely used single-branch pipeline and brings the model's capability in both the short and long term into full play. To further control the plasticity/stability balance in the LongShortTerm strategy, we discover that, for a ViT backbone, the magnitude of memory augmentation is critical to model retention, and we propose Margin-based Data Augmentation to meet the different balances needed in long- and short-term learning. Extensive experiments on two complex CIL benchmarks, ImageNet-100 and ImageNet-1K, demonstrate the effectiveness of our LongShortTerm strategy, with improvements of 0.59\%-3.10\% over the state-of-the-art solution. ", Joint Attention-Driven Domain Fusion and Noise-Tolerant Learning for Multi-Source Domain Adaptation,https://openreview.net/forum?id=Ki_26lfEmey,https://openreview.net/pdf?id=Ki_26lfEmey,,"Multi-source Unsupervised Domain Adaptation (MUDA) transfers knowledge from multiple source domains with labeled data to an unlabeled target domain. Recently, endeavours have been made in establishing connections among different domains to enable feature interaction. However, these approaches essentially enhance category information and thus lack the transfer of domain-specific information. Moreover, little research has explored the connection between pseudo-label generation and the framework's learning capabilities, which is crucial for ensuring robust MUDA. In this paper, we propose a novel framework, which significantly reduces the domain discrepancy and demonstrates new state-of-the-art performance. In particular, we first propose a Contrary Attention-based Domain Merge (CADM) module to enable interaction among the features so as to achieve a mixture of domain-specific information instead of focusing on category information. Secondly, to enable the network to correct the pseudo labels during training, we propose an adaptive and reverse cross-entropy loss, which can adaptively impose constraints on the pseudo-label generation process. 
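(As a hedged illustration of the reverse-cross-entropy idea, a sketch in the style of symmetric cross-entropy (Wang et al., 2019), where predictions and labels swap roles and log 0 is clamped to a constant; the weights and clamp value are assumptions, and the paper's adaptive variant is not reproduced here.)

```python
import math
import torch
import torch.nn.functional as F

def symmetric_ce(logits, target, alpha=1.0, beta=1.0, A=-4.0):
    # Forward CE fits predictions to labels; the reverse term swaps their
    # roles, so confident predictions can resist noisy hard labels.
    ce = F.cross_entropy(logits, target)
    pred = F.softmax(logits, dim=1)
    one_hot = F.one_hot(target, logits.shape[1]).float()
    # log(0) for off-target classes is clamped to A (i.e., min prob exp(A))
    rce = -(pred * torch.log(one_hot.clamp(min=math.exp(A)))).sum(dim=1).mean()
    return alpha * ce + beta * rce

logits = torch.randn(32, 10, requires_grad=True)
target = torch.randint(0, 10, (32,))
loss = symmetric_ce(logits, target)
loss.backward()                      # differentiable, usable as a training loss
print(float(loss))
```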
We conduct experiments on four benchmark datasets, showing that our approach can efficiently fuse all domains for MUDA while achieving much better performance than prior methods.","Multi-source Unsupervised Domain Adaptation, Attention Mechanism, Noisy Label Learning" Efficient Point Cloud Geometry Compression Through Neighborhood Point Transformer,https://openreview.net/forum?id=zhl5bWOCD4v,https://openreview.net/pdf?id=zhl5bWOCD4v,,"Although the convolutional representation of multiscale sparse tensors has demonstrated superior efficiency in compressing Point Cloud Geometry (PCG) by exploiting cross-scale and same-scale correlations, its capacity remains bounded. This is because 1) the fixed receptive field of the convolution cannot best characterize sparsely and irregularly distributed points; and 2) pretrained convolutions with fixed weights are insufficient to capture dynamic information conditioned on the input. This work proposes the Neighborhood Point transFormer (NPFormer) to replace the existing solutions by taking advantage of both convolution and attention mechanisms to best exploit correlations under the multiscale representation framework for better geometry occupancy probability estimation. With this aim, a Neighborhood Point Attention layer (NPA) is devised and stacked with Sparse Convolution layers (SConvs) to form the NPFormer. In NPA, each point uses its k Nearest Neighbors (kNN) to construct an adaptive local neighborhood and then leverages self-attention to dynamically aggregate information within this neighborhood. Compared with the anchor using standardized G-PCC, our method provides average BD-Rate gains of 17% in lossy mode and bitrate reductions of 14% in lossless mode when compressing LiDAR point clouds (e.g., SemanticKITTI, Ford). There are also 20%-40% lossy BD-Rate improvements and 37%-53% lossless bitrate reductions for the compression of object point clouds (e.g., MVUB, MPEG 8i). Compared with the state-of-the-art solution using an attention-optimized octree codec, our approach requires much less decoding runtime, with an average speedup of about 640x, while still presenting better compression efficiency.", EA-HAS-Bench: Energy-aware Hyperparameter and Architecture Search Benchmark,https://openreview.net/forum?id=n-bvaLSCC78,https://openreview.net/pdf?id=n-bvaLSCC78,We provide the first HAS dataset aware of the overall search energy cost,"The energy consumption for training deep learning models is increasing at an alarming rate due to the growth of training data and model scale, resulting in a negative impact on carbon neutrality. Energy consumption is an especially pressing issue for AutoML algorithms because they usually require repeatedly training large numbers of computationally intensive deep models in the search for optimal configurations. This paper takes one of the most essential steps in developing energy-aware (EA) NAS methods by providing a benchmark that makes EA-NAS research more reproducible and accessible. Specifically, we present the first large-scale energy-aware benchmark that allows studying AutoML methods to achieve better trade-offs between performance and search energy consumption, named EA-HAS-Bench. EA-HAS-Bench provides a large-scale architecture/hyperparameter joint search space, covering diversified configurations related to energy consumption. 
Furthermore, we propose a novel surrogate model specially designed for the large joint search space, which uses a Bézier curve-based model to predict learning curves of unlimited shape and length. Based on the proposed dataset, we propose a novel energy-aware AutoML method that arms existing AutoML algorithms with awareness of the search energy consumption, and our experiments show that the modified energy-aware AutoML methods achieve a better trade-off between energy consumption and model performance.", Breaking the Curse of Dimensionality for Parametric Elliptic PDEs,https://openreview.net/forum?id=3nfMmcditWu,https://openreview.net/pdf?id=3nfMmcditWu,,"Motivated by recent empirical success, we examine how neural network-based ansatz classes can break the curse of dimensionality for high-dimensional, non-linear elliptic partial differential equations (PDEs) with variational structure. The high-dimensionality of the PDEs can either be induced through a high-dimensional physical domain or a high-dimensional parameter space. The latter includes parametric right-hand sides, parametric domains, and material constants. Our main result shows that any scheme that computes neural network-based $W^{1,p}$-approximations leverages the extraordinary approximation capabilities of neural networks and, thus, is able to beat the curse of dimensionality if the ground truth solution is smooth or possesses Barron regularity. Popular examples of $W^{1,p}$-convergent schemes include, e.g., the Deep Ritz Method and physics-informed neural networks. We present numerical experiments supporting our theoretical findings.", RotoGBML: Towards Out-of-Distribution Generalization for Gradient-Based Meta-Learning,https://openreview.net/forum?id=Z4CUw1pIuor,https://openreview.net/pdf?id=Z4CUw1pIuor,,"Gradient-based meta-learning (GBML) algorithms are able to adapt quickly to new tasks by transferring the learned meta-knowledge, while assuming that all tasks come from the same distribution (in-distribution, ID). However, in the real world, they often suffer from an out-of-distribution (OOD) generalization problem, where tasks come from different distributions. OOD exacerbates inconsistencies in the magnitudes and directions of task gradients, which makes it challenging for GBML to optimize the meta-knowledge by minimizing the sum of task gradients in each minibatch. To address this problem, we propose RotoGBML, a novel approach to homogenize OOD task gradients. RotoGBML uses reweighted vectors to dynamically balance diverse magnitudes to a common scale and uses rotation matrices to rotate conflicting directions close to each other. To reduce overhead, we homogenize gradients with the features rather than the network parameters. On this basis, to avoid the intervention of non-causal features (e.g., backgrounds), we also propose an invariant self-information (ISI) module to extract invariant causal features (e.g., the outlines of objects). Finally, task gradients are homogenized based on these invariant causal features.
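The magnitude/direction homogenization described for RotoGBML can be pictured with plain NumPy: rescale every task gradient to a common norm and rotate it part-way toward the mean direction, within the plane the two vectors span. This is only a geometric sketch of the idea; the paper learns its reweighting vectors and rotation matrices rather than using the fixed rule below.

```python
import numpy as np

def homogenize(grads, alpha=0.5):
    """Rescale task gradients to a common norm and rotate each one
    part-way toward the mean direction.

    grads: list of 1-D arrays, one gradient per task.
    alpha: fraction of the angle to the mean direction to remove.
    """
    grads = [np.asarray(g, dtype=float) for g in grads]
    target_norm = np.mean([np.linalg.norm(g) for g in grads])
    ref = np.mean(grads, axis=0)
    u1 = ref / np.linalg.norm(ref)            # mean (reference) direction
    out = []
    for g in grads:
        n = np.linalg.norm(g)
        proj = g @ u1
        w = g - proj * u1                     # component orthogonal to u1
        if np.linalg.norm(w) < 1e-12:         # already (anti-)parallel
            out.append(target_norm * np.sign(proj) * u1)
            continue
        u2 = w / np.linalg.norm(w)
        theta = np.arccos(np.clip(proj / n, -1.0, 1.0))
        phi = (1.0 - alpha) * theta           # shrink the angle to ref
        out.append(target_norm * (np.cos(phi) * u1 + np.sin(phi) * u2))
    return out
```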
Experiments show that RotoGBML outperforms other state-of-the-art methods on various few-shot image classification benchmarks.", UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer,https://openreview.net/forum?id=d77RVuVg-Mf,https://openreview.net/pdf?id=d77RVuVg-Mf,"We propose UniFormerV2, which aims to arm the well-pretrained vision transformer with efficient video UniFormer designs, and achieves state-of-the-art results on 8 popular video benchmarks.","Learning discriminative spatiotemporal representation is the key problem of video understanding. Recently, Vision Transformers (ViTs) have shown their power in learning long-term video dependency with self-attention. Unfortunately, they exhibit limitations in tackling local video redundancy, due to the blind global comparison among tokens. UniFormer has successfully alleviated this issue, by unifying convolution and self-attention as a relation aggregator in the transformer format. However, this model requires a tiresome and complicated image-pretraining phase before being finetuned on videos, which blocks its wide usage in practice. In contrast, open-sourced ViTs are readily available and well-pretrained with rich image supervision. Based on these observations, we propose a generic paradigm to build a powerful family of video networks, by arming the pretrained ViTs with efficient UniFormer designs. We call this family UniFormerV2, since it inherits the concise style of the UniFormer block. But it contains brand-new local and global relation aggregators, which allow for a preferable accuracy-computation balance by seamlessly integrating advantages from both ViTs and UniFormer. Without any bells and whistles, our UniFormerV2 gets state-of-the-art recognition performance on 8 popular video benchmarks, including scene-related Kinetics-400/600/700 and Moments in Time, temporal-related Something-Something V1/V2, untrimmed ActivityNet and HACS. In particular, it is the first model to achieve 90% top-1 accuracy on Kinetics-400, to the best of our knowledge. The models will be released afterward.","Vision Transformer, Action Recognition, Video Learning" Dynamical Equations With Bottom-up Self-Organizing Properties Learn Accurate Dynamical Hierarchies Without Any Loss Function,https://openreview.net/forum?id=ndYrOsNw_B2,https://openreview.net/pdf?id=ndYrOsNw_B2,,"Self-organization is ubiquitous in nature and mind. However, machine learning and theories of cognition still barely touch the subject. The hurdle is that general patterns are difficult to define in terms of dynamical equations, and a system that could learn by reordering itself has yet to be seen. Here, we propose a learning system where patterns are defined within the realm of nonlinear dynamics with positive and negative feedback loops, allowing attractor-repeller pairs to emerge for each pattern observed. Experiments reveal that such a system can map temporal to spatial correlation, enabling hierarchical structures to be learned from sequential data. The results are accurate enough to surpass state-of-the-art unsupervised learning algorithms in seven out of eight experiments as well as on two real-world problems. Interestingly, the dynamic nature of the system makes it inherently adaptive, giving rise to phenomena similar to phase transitions in chemistry/thermodynamics when the input structure changes.
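A bare-bones illustration of such attractor-repeller dynamics: nodes that co-activate move toward a shared attractor while the centroid of the inactive nodes acts as a repeller, with no loss function anywhere. The constants, the update form, and the normalization are all assumptions of this sketch, not the paper's exact equations.

```python
import numpy as np

def attractor_repeller_step(coords, active, lr=0.01):
    """One self-organizing update mapping temporal co-activation to
    spatial proximity.

    coords: (N, D) node embeddings; active: boolean mask of nodes that
    fired together at this timestep.
    """
    if active.all() or not active.any():          # need both groups
        return coords
    pos_center = coords[active].mean(axis=0)      # attractor
    neg_center = coords[~active].mean(axis=0)     # repeller
    coords[active] += lr * (pos_center - coords[active])  # positive feedback
    coords[active] += lr * (coords[active] - neg_center)  # negative feedback
    # renormalize to keep the dynamics bounded
    coords /= np.linalg.norm(coords, axis=1, keepdims=True).clip(min=1e-8)
    return coords
```

Run over a long symbol sequence, repeated updates of this kind cluster nodes that tend to fire together, which is how temporal correlation becomes spatial structure.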
Thus, the work here sheds light on how self-organization can allow for pattern recognition and hints at how intelligent behavior might emerge from simple dynamical equations without an objective/loss function.","self-organization, dynamical systems, continual learning, dynamical hierarchy, adaptation" Multi-Label Knowledge Distillation,https://openreview.net/forum?id=3jBXX9Xb1iz,https://openreview.net/pdf?id=3jBXX9Xb1iz,,"Existing knowledge distillation methods typically work by enforcing the consistency of output logits or intermediate feature maps between the teacher network and student network. Unfortunately, these methods can hardly be extended to the multi-label learning scenario, because each instance is associated with multiple semantic labels and neither the prediction logits nor the feature maps obtained from the whole example can accurately transfer knowledge for each label. In this paper, we propose a novel multi-label knowledge distillation method. On one hand, it exploits the informative semantic knowledge from the logits by label decoupling with the one-versus-all reduction strategy; on the other hand, it enhances the distinctiveness of the learned feature representations by leveraging the structural information of label-wise embeddings. Experimental results on multiple benchmark datasets validate that the proposed method can avoid knowledge counteraction among labels, and achieve superior performance against diverse competing methods.", How and Why We Detect Distribution Shift: Critical Analysis of Methods and Benchmarks,https://openreview.net/forum?id=E94ID_k7CTA,https://openreview.net/pdf?id=E94ID_k7CTA,we aim to provide a consolidated view of the two largest sub-fields: open-set recognition (OSR) and out-of-distribution detection (OOD),"Detecting test-time distribution shift has emerged as a key capability for safely deployed machine learning models, with the question being tackled under various guises in recent years. In this paper, we aim to provide a consolidated view of the two largest sub-fields within the community: open-set recognition (OSR) and out-of-distribution detection (OOD). In particular, we aim to provide rigorous empirical analysis of different methods across settings and provide actionable takeaways for practitioners and researchers. Concretely, we make the following contributions: (i) For the first time, we perform rigorous cross-evaluation between state-of-the-art methods in the OOD and OSR settings and identify a strong correlation between the performances of methods for them; (ii) We propose a new, large-scale benchmark setting which we suggest better disentangles the problem tackled by OOD and OSR; (iii) We thoroughly examine SOTA methods for OOD and OSR on our large-scale benchmark; and (iv) Finally, we find that the best performing method on previous benchmarks struggles on our large-scale benchmark, while magnitude-aware scoring rules consistently show promise.","Open-set Recognition, Out of distribution Detection" ADVERSARY-AWARE PARTIAL LABEL LEARNING WITH LABEL DISTILLATION,https://openreview.net/forum?id=WmvIJJgt8L,https://openreview.net/pdf?id=WmvIJJgt8L,,"To ensure that the data collected from human subjects is entrusted with a secret, rival labels are introduced to deliberately conceal the information provided by the participants. The corresponding learning task can be formulated as a noisy partial-label learning problem.
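For the one-versus-all label decoupling used by the multi-label distillation method above, a minimal sketch looks as follows: each label becomes an independent binary problem, and the student matches the teacher's per-label Bernoulli distribution. The temperature handling follows common KD practice and is an assumption here, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def ova_distill_loss(student_logits, teacher_logits, temp=2.0):
    """One-vs-all distillation for multi-label outputs.

    student_logits, teacher_logits: (B, L) raw per-label scores.
    Each label is reduced to a binary problem via a sigmoid, so the
    knowledge for one label cannot counteract another's.
    """
    s = torch.sigmoid(student_logits / temp)   # student P(label positive)
    t = torch.sigmoid(teacher_logits / temp)   # teacher P(label positive)
    # soft binary cross-entropy between the two Bernoulli distributions
    return F.binary_cross_entropy(s, t).mul(temp ** 2)
```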
However, conventional partial-label learning (PLL) methods are still vulnerable to a high ratio of noisy partial labels, especially in a large labelling space. To learn a more robust model, we present Adversary-Aware Partial Label Learning and introduce the $\textit{rival}$, a set of noisy labels, to the collection of candidate labels for each instance. By introducing the rival label, the predictive distribution of PLL is factorised such that a reasonably good predictive label is achieved with less uncertainty coming from the transition matrix, assuming its generation process is known. Nonetheless, the predictive accuracy is still insufficient to produce an adequately good set of positive samples to minimise the loss function. Moreover, the inclusion of rivals also brings an inconsistency issue for the classifier and risk function due to the intractability of the transition matrix. Consequently, the immature teacher within momentum (ITWM) disambiguation algorithm is proposed to cope with the situation. We utilise the confidence score mapping from the instance space to approximate the intractable term, allowing us to obtain a provably consistent classifier and risk function. Extensive experiments demonstrate that our method achieves promising results on the CIFAR10, CIFAR100 and CUB200 datasets.","weakly supervised learning, partial label learning" Structural Privacy in Graphs,https://openreview.net/forum?id=O7x_ldrlaO7,https://openreview.net/pdf?id=O7x_ldrlaO7,Make the structure of the graph private in addition to the privacy of node features and labels,"Graph Neural Networks (GNNs) have gained popularity for addressing tasks over graph-structured data, which best represents many real-world systems. The privacy of the participants of these systems is at risk if the GNNs are not carefully designed. Existing works in privacy-preserving GNNs primarily ensure the privacy of features and labels of a node. In order to ensure complete privacy related to graph data, its structure also needs to be privatized. We provide a method, SPGraph, to privatize the graph structure by adding noise to the neighborhood data of the node. Our method addresses two challenges in introducing structural privacy in graphs. Applying randomization on the set of actual neighbors to introduce noise leads to a reduction in the degree of a node, which is undesirable. To overcome this first challenge, we introduce the $\lambda$-selector, which samples nodes to be added to the set of neighbors. The second challenge is to denoise the neighborhood so that the noise added in the neighborhood does not significantly impact the accuracy. In this view, we use the $p$-hop neighborhood to compensate for the loss of actual neighbors in the randomization. We continue to use the node and label privacy as implemented in the previous methods for privacy in GNNs. We conduct extensive experiments over real-world datasets to show the impact of perturbation in the graph structure. ","Privacy, Graph Neural Networks, Differential Privacy, Graph Structure" KnowDA: All-in-One Knowledge Mixture Model for Data Augmentation in Low-Resource NLP,https://openreview.net/forum?id=2nocgE1m0A,https://openreview.net/pdf?id=2nocgE1m0A,We propose a Knowledge Mixture Data Augmentation Model (KnowDA) that is trained with diverse NLP task knowledge. KnowDA could generate additional synthetic data to improve model performance in various low-resource NLP tasks.,"This paper focuses on data augmentation for low-resource NLP tasks where the training set is limited.
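The structure-privatization step of SPGraph can be pictured with a small NumPy routine that randomizes one node's neighbor list: true neighbors are dropped at random and a few non-neighbors are mixed in, loosely mimicking the $\lambda$-selector. Function and parameter names are illustrative, and this sketch does not implement a formal differential-privacy mechanism.

```python
import numpy as np

def perturb_neighborhood(neighbors, all_nodes, keep_p=0.8, lam=2):
    """Privatize one node's neighbor list.

    neighbors: list of true neighbor ids; all_nodes: all node ids.
    Each true neighbor is kept with probability keep_p, and `lam`
    non-neighbors are sampled in, so the degree does not collapse.
    """
    rng = np.random.default_rng()
    kept = [v for v in neighbors if rng.random() < keep_p]
    candidates = list(set(all_nodes) - set(neighbors))
    added = rng.choice(candidates, size=min(lam, len(candidates)),
                       replace=False).tolist()
    return kept + added
```

Denoising via the $p$-hop neighborhood, as the abstract describes, would then compensate for the true neighbors this randomization removes.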
The existing solutions either leverage task-independent heuristic rules (e.g., Synonym Replacement) or fine-tune general-purpose pre-trained language models (e.g., GPT2) using the limited training instances to produce new synthetic data. Consequently, they capture only trivial task-specific knowledge and are limited to yielding low-quality synthetic data. To combat this issue, we propose the Knowledge Mixture Data Augmentation Model (KnowDA), a Seq2Seq language model pretrained on a mixture of diverse NLP tasks under a novel framework of Knowledge Mixture Training (KoMT). The goal of KoMT is to condense diverse NLP task-specific knowledge into the single KnowDA model (i.e., all-in-one). The resulting KnowDA can utilize this knowledge to quickly grasp the inherent synthesis law of the target task through limited training instances. Specifically, KoMT reformulates input examples from various heterogeneous NLP tasks into a unified text-to-text format and employs denoising training objectives at different granularities to learn to reconstruct partial or complete samples. To the best of our knowledge, we are the first to attempt to apply 100+ NLP multi-task training for data augmentation. Extensive experiments show that i) the synthetic data produced by KnowDA successfully improves the performance of the strong pre-trained language models (i.e., BERT, ALBERT and DeBERTa) by a large margin on the low-resource NLP benchmarks FewGLUE, CoNLL’03 and WikiAnn; ii) KnowDA successfully transfers task knowledge to NLP tasks whose types are both seen and unseen in KoMT.","Data Augmentation, Low-Resource NLP" Distill Vision Transformers to CNNs via Low-Rank Representation Approximation,https://openreview.net/forum?id=U4llPAUi4z,https://openreview.net/pdf?id=U4llPAUi4z,Distill Vision Transformers to CNNs via Low-Rank Representation Approximation,"Vision Transformers attain state-of-the-art performance in diverse vision tasks due to their scalable modeling of long-range dependencies. Meanwhile, CNNs are still practical and efficient in many industry scenarios, thanks to their inductive biases and mature tiny architectures. Thus it is a challenging yet interesting problem to study the Knowledge Distillation (KD) of these two different architectures, in particular how to transfer global information from Vision Transformers to tiny CNNs. We point out that many current CNN distillation methods are ineffective in the Vision Transformer distillation scenario, which implies that distilling global information is not easy due to the architecture gap. We develop an encoder-decoder representation distillation framework, namely \textbf{L}ow \textbf{R}ank \textbf{R}epresentation \textbf{A}pproximation, to address the problem. The key insight of LRRA is that global information modeling can be seen as finding the most important bases and corresponding codes. This process can be solved by matrix decomposition. Specifically, the student representation is encoded to a low-rank latent representation and used to approximate the teacher representation. The most distinguishable knowledge, i.e., global information, is distilled via the low-rank representation approximation. The proposed method offers a potential closed-form solution without introducing extra learnable parameters or hand-crafted engineering. We benchmark 11 KD methods to demonstrate the usefulness of our approach. Extensive ablation studies validate the necessity of the low-rank structure.
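The low-rank representation approximation at the heart of LRRA can be sketched with a truncated SVD: keep the student's top singular directions and match the teacher in that reduced form. The matching loss and the choice of rank are assumptions of this sketch; the paper derives its own closed-form solution rather than this simplified variant.

```python
import torch
import torch.nn.functional as F

def low_rank_align_loss(student_feat, teacher_feat, rank=8):
    """Match the teacher through a rank-constrained student representation.

    student_feat, teacher_feat: (B, C) pooled features of the same shape.
    The truncated SVD is the best rank-`rank` approximation in the
    Frobenius norm, i.e., the most important bases and codes.
    """
    u, s, vh = torch.linalg.svd(student_feat, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vh[:rank]   # rank-r reconstruction
    return F.mse_loss(low_rank, teacher_feat)
```

Constraining the student to a low-rank subspace forces it to spend its capacity on the globally shared structure, which is the kind of information a ViT teacher carries and a tiny CNN tends to miss.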
","Knowledge Distillation, Low rank approximation, Transformer, Representation Learning" Learning Graph Neural Network Topologies,https://openreview.net/forum?id=tlhsswFz9x,https://openreview.net/pdf?id=tlhsswFz9x,,"Graph convolutional networks (GCNs) enable end-to-end learning on graph structured data. However, many works begin by assuming a given graph structure. As the ideal graph structure is often unknown, this limits applicability. To address this, we present a novel end-to-end differentiable graph-generator which builds the graph topology on the fly. Our module can be readily integrated into existing pipelines involving graph convolution operations, replacing the predetermined or existing adjacency matrix with one that is learned, and optimised, as part of the general objective. As such it is applicable to any GCN. We show that integrating our module into both node classification and trajectory prediction pipelines improves accuracy across a range of datasets and backbones.", Finding the global semantic representation in GAN through Fréchet Mean,https://openreview.net/forum?id=9ImtNIZ7bYx,https://openreview.net/pdf?id=9ImtNIZ7bYx,We propose the global basis for semantics in the latent space of GAN through Fréchet Mean.,"The ideally disentangled latent space in GAN involves the global representation of latent space using semantic attribute coordinates. In other words, in this disentangled space, there exists the global semantic basis as a vector space where each basis component describes one attribute of generated images. In this paper, we propose an unsupervised method for finding this global semantic basis in the intermediate latent space in GANs. This semantic basis represents sample-independent meaningful perturbations that change the same semantic attribute of an image on the entire latent space. The proposed global basis, called Fréchet basis, is derived by introducing Fréchet mean to the local semantic perturbations in a latent space. Fréchet basis is discovered in two stages. First, the global semantic subspace is discovered by the Fréchet mean in the Grassmannian manifold of the local semantic subspaces. Second, Fréchet basis is found by optimizing a basis of the semantic subspace via the Fréchet mean in the Special Orthogonal Group. Experimental results demonstrate that Fréchet basis provides better semantic factorization and robustness compared to the previous methods. Moreover, we suggest the basis refinement scheme for the previous methods. The quantitative experiments show that the refined basis achieves better semantic factorization while generating the same semantic subspace as the previous method.","generative adversarial network, disentanglement, semantic factorization" Identical Initialization: A Universal Approach to Fast and Stable Training of Neural Networks,https://openreview.net/forum?id=qpeAhwxTopw,https://openreview.net/pdf?id=qpeAhwxTopw,A simple and general method for stable training,"A well-conditioned initialization is beneficial for training deep neural networks. However, existing initialization approaches do not simultaneously show robustness and universality. Specifically, even though the widely-used Xavier and Kaiming initialization approaches can generally fit a variety of networks, they fail to train residual networks without Batch Normalization for calculating an inappropriate scale on data-flow. On the other hand, some literature design stable initialization (e.g., Fixup and ReZero) based on dynamical isometry, an efficient learning mechanism. 
Nonetheless, these methods are specifically designed for either a non-residual structure or a residual block only, and even include extra auxiliary components, limiting their range of application. Intriguingly, we find that the identity matrix is a feasible and universal solution to the aforementioned problems, as it adheres to dynamical isometry while remaining applicable to a wide range of models. Motivated by this, we develop Identical Initialization (IDInit), a sufficiently robust, universal, and fast-converging approach based on the identity matrix. Empirical results on a variety of benchmarks show that IDInit is universal across various network types, and practically useful with good performance and fast convergence.","Initialization, Identity Matrix, Dynamical Isometry" Addressing Parameter Choice Issues in Unsupervised Domain Adaptation by Aggregation,https://openreview.net/forum?id=M95oDwJXayG,https://openreview.net/pdf?id=M95oDwJXayG,A method for addressing the issue of hyper-parameter selection in unsupervised domain adaptation.,"We study the problem of choosing algorithm hyper-parameters in unsupervised domain adaptation, i.e., with labeled data in a source domain and unlabeled data in a target domain, drawn from a different input distribution. We follow the strategy of computing several models using different hyper-parameters and subsequently computing a linear aggregation of the models. While several heuristics exist that follow this strategy, methods are still missing that rely on thorough theories for bounding the target error. To this end, we propose a method that extends weighted least squares to vector-valued functions, e.g., deep neural networks. We show that the target error of the proposed algorithm is asymptotically not worse than twice the error of the unknown optimal aggregation. We also perform a large-scale empirical comparative study on several datasets, including text, images, electroencephalogram, body sensor signals and signals from mobile phones. Our method outperforms deep embedded validation (DEV) and importance weighted validation (IWV) on all datasets, setting a new state-of-the-art performance for solving parameter choice issues in unsupervised domain adaptation with theoretical error guarantees. We further study several competitive heuristics, all outperforming IWV and DEV on at least five datasets. However, our method outperforms each heuristic on at least five of seven datasets.","Domain adaptation, parameter choice, model selection, aggregation, importance weighting" MARS: Meta-learning as Score Matching in the Function Space,https://openreview.net/forum?id=WAgXmT8BeRj,https://openreview.net/pdf?id=WAgXmT8BeRj,Meta-learning in the function space by estimating the score function of the data-generating process marginals.,"Meta-learning aims to extract useful inductive biases from a set of related datasets. In Bayesian meta-learning, this is typically achieved by constructing a prior distribution over neural network parameters. However, specifying families of computationally viable prior distributions over the high-dimensional neural network parameters is difficult. As a result, existing approaches resort to meta-learning restrictive diagonal Gaussian priors, severely limiting their expressiveness and performance. To circumvent these issues, we approach meta-learning through the lens of functional Bayesian neural network inference, which views the prior as a stochastic process and performs inference in the function space.
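Returning to the identity-based initialization idea above (IDInit): the essence is to make each layer start out as an identity map, which trivially satisfies dynamical isometry. A minimal sketch for a (possibly non-square) linear layer, not the full IDInit recipe, which also handles convolutions and residual blocks:

```python
import torch

def id_init_(weight):
    """Set a weight matrix so the layer starts as an identity map on its
    leading coordinates (zero-padded when the shape is non-square).
    """
    with torch.no_grad():
        weight.zero_()
        n = min(weight.shape)
        weight[:n, :n] = torch.eye(n)
    return weight

layer = torch.nn.Linear(64, 64)
id_init_(layer.weight)
torch.nn.init.zeros_(layer.bias)   # layer now computes y = x exactly
```

At initialization such a layer neither amplifies nor attenuates signals, so gradients pass through deep stacks without the exploding or vanishing scales that plague unnormalized residual networks.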
Specifically, we view the meta-training tasks as samples from the data-generating process and formalize meta-learning as empirically estimating the law of this stochastic process. Our approach can seamlessly acquire and represent complex prior knowledge by meta-learning the score function of the data-generating process marginals instead of parameter space priors. In a comprehensive benchmark, we demonstrate that our method achieves state-of-the-art performance in terms of predictive accuracy and substantial improvements in the quality of uncertainty estimates.","score estimation, meta-learning, bayesian neural networks" Faster Gradient-Free Methods for Escaping Saddle Points,https://openreview.net/forum?id=KDhFkA6MQsW,https://openreview.net/pdf?id=KDhFkA6MQsW,,"Escaping from saddle points has become an important research topic in non-convex optimization. In this paper, we study the case when calculations of explicit gradients are expensive or even infeasible, and only function values are accessible. Currently, two types of gradient-free (zeroth-order) methods, based on random perturbation and negative curvature finding, have been proposed to escape saddle points efficiently and converge to an $\epsilon$-approximate second-order stationary point. Nesterov's accelerated gradient descent (AGD) method can escape saddle points faster than gradient descent (GD), as has been verified for first-order algorithms. However, whether AGD can accelerate gradient-free methods is still unstudied. To unfold this mystery, in this paper, we propose accelerated variants of the two types of gradient-free methods for escaping saddle points. We show that our algorithms can find an $\epsilon$-approximate second-order stationary point with $\tilde{\mathcal{O}}(1/\epsilon^{1.75})$ iteration complexity and $\tilde{\mathcal{O}}(d/\epsilon^{1.75})$ oracle complexity, where $d$ is the problem dimension. Thus, our methods achieve a convergence rate comparable to their first-order counterparts and have lower oracle complexity than prior derivative-free methods for finding second-order stationary points.", $\textrm{D}^3\textrm{Former}$: Debiased Dual Distilled Transformer for Incremental Learning,https://openreview.net/forum?id=jTfflGKNEjb,https://openreview.net/pdf?id=jTfflGKNEjb,Adapting a hybrid ViT for class incremental learning,"In the class incremental learning (CIL) setting, groups of classes are introduced to a model in each learning phase. The goal is to learn a unified model performant on all the classes observed so far. Given the recent popularity of Vision Transformers (ViTs) in conventional classification settings, an interesting question is to study their continual learning behaviour. In this work, we develop a Debiased Dual Distilled Transformer for CIL dubbed $\textrm{D}^3\textrm{Former}$. The proposed model leverages a hybrid nested ViT design to ensure data efficiency and scalability to small as well as large datasets. In contrast to a recent ViT-based CIL approach, our $\textrm{D}^3\textrm{Former}$ does not dynamically expand its architecture when new tasks are learned and remains suitable for a large number of incremental tasks. The improved CIL behavior of $\textrm{D}^3\textrm{Former}$ stems from two fundamental changes to the ViT design. First, we treat incremental learning as a long-tail classification problem where the majority samples from new classes vastly outnumber the limited exemplars available for old classes.
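The basic primitive behind the gradient-free saddle-point methods above is a zeroth-order gradient estimate built purely from function values. A standard two-point Gaussian-smoothing estimator, with the constants chosen arbitrarily for the sketch:

```python
import numpy as np

def zo_gradient(f, x, mu=1e-3, n_samples=20):
    """Two-point zeroth-order gradient estimate.

    Averages finite differences of f along random Gaussian directions;
    for Gaussian u, E[(grad . u) u] = grad, so the average is an
    unbiased estimate of the gradient of the mu-smoothed function.
    """
    d = x.size
    g = np.zeros(d)
    for _ in range(n_samples):
        u = np.random.randn(d)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / n_samples

# usage: the gradient of 0.5 * ||z||^2 at z = 1 is approximately 1
grad = zo_gradient(lambda z: 0.5 * z @ z, np.ones(5))
```

Each estimate costs 2 * n_samples function evaluations, which is where the factor of $d$ in the oracle complexity of such methods comes from.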
To avoid bias against the minority old classes, we propose to dynamically adjust logits to emphasize retaining the representations relevant to old tasks. Second, we propose to preserve the configuration of spatial attention maps as the learning progresses across tasks. This helps in reducing catastrophic forgetting by constraining the model to retain attention on the most discriminative regions. $\textrm{D}^3\textrm{Former}$ obtains favorable results on incremental versions of the CIFAR-100, MNIST, SVHN, and ImageNet datasets.","Incremental Learning, Transformers" Symmetrical SyncMap for Imbalanced General Chunking Problems,https://openreview.net/forum?id=xnscpQU6lvh,https://openreview.net/pdf?id=xnscpQU6lvh,Null,"Recently, SyncMap (2021) pioneered an approach to learn complex structures from sequences as well as adapt to any changes in underlying structures. This approach, inspired by neuron group behaviors, is achieved using self-organizing dynamical equations without any loss function. Here we propose Symmetrical SyncMap, which goes beyond the original work to show how to create dynamical equations and attractor-repeller points that are stable over the long run, even when dealing with imbalanced continual general chunking problems (CGCPs). The main idea is to apply equal updates from positive and negative feedback loops by symmetrical activation. We then introduce the concept of a memory window to allow for more positive updates. Our algorithm surpasses or ties other unsupervised state-of-the-art baselines in all 12 imbalanced CGCPs with various difficulties, including dynamical ones. To verify its performance in real-world scenarios, we conduct experiments on several well-studied structure learning problems. The proposed method substantially surpasses other methods in all scenarios, suggesting that symmetrical activation plays a critical role in uncovering topological structures and even hierarchies encoded in temporal data.","Self-organization, Adaptive learning, Chunking, New learning paradigm, Bio-inspired learning, Structure learning" Solving Partial Label Learning Problem with Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=BNsuf5g-JRd,https://openreview.net/pdf?id=BNsuf5g-JRd,,"Partial label learning (PLL) deals with classification when a set of candidate labels instead of the true one is given for each training instance. As a weakly supervised learning problem, the main target of PLL is to discover latent relationships within training samples, and utilize such information to disambiguate noisy labels. Many existing methods choose nearest neighbors of each partially-labeled instance in an unsupervised way, such that the obtained instance similarities can be empirically non-optimal and unrelated to the downstream classification task. To address this issue, we propose a novel multi-agent reinforcement learning (MARL) framework which models the connection between each pair of training samples as a reinforcement learning (RL) agent. We use an attention-based graph neural network (GNN) to learn the instance similarity, and adaptively refine it using a deterministic policy gradient approach until some pre-defined scoring function is optimized. Different from two-stage and alternating optimization algorithms whose training procedures are not end-to-end, our RL-based approach directly optimizes the objective function and estimates the instance similarities more precisely.
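The debiasing step that $\textrm{D}^3\textrm{Former}$ motivates, treating CIL as long-tail classification, can be pictured with generic logit adjustment: shift each class logit by its log prior so minority (old) classes are not drowned out by majority (new) classes. This is the textbook long-tail recipe, not the paper's exact dynamic rule:

```python
import torch

def adjusted_logits(logits, class_counts, tau=1.0):
    """Prior-corrected logits for long-tailed training.

    logits: (B, K); class_counts: (K,) number of training samples per
    class (old classes have few exemplars, new classes have many).
    Subtracting tau * log(prior) raises the effective score of rare
    classes during training.
    """
    prior = class_counts.float() / class_counts.sum()
    return logits - tau * torch.log(prior + 1e-12)
```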
The experimental results show that our method outperforms state-of-the-art competitors with higher classification accuracy on both synthetic and real examples.", Uncovering the Effectiveness of Calibration on Open Intent Classification,https://openreview.net/forum?id=_E9ibRUQ1iq,https://openreview.net/pdf?id=_E9ibRUQ1iq,We propose a novel calibration-based open intent classification approach and provide corresponding analyses in public benchmark settings,"Open intent classification aims to simultaneously identify known and unknown intents, and it is one of the challenging tasks in modern dialogue systems. While prior approaches are based on known intent classifiers trained under the cross-entropy loss, we presume this loss function yields a representation overly biased toward the known intents; thus, it negatively impacts the identification of unknown intents. In this study, we propose a novel open intent classification approach that incorporates model calibration into the previously-proposed state-of-the-art. We empirically show that simply making the learning objective more calibrated outperforms the past state-of-the-art. We further find that the underlying reason for the calibrated classifier's superiority lies in the high-level layers of the deep neural network. We also discover that our approach is robust to harsh settings where few training samples per class exist. Consequently, we expect our findings and takeaways to provide practical guidelines for open intent classification, thus helping to inform future model design choices.","open intent classification, model calibration, label smoothing" PMixUp: Simultaneous Utilization of Part-of-Speech Replacement and Feature Space Interpolation for Text Data Augmentation,https://openreview.net/forum?id=O4fNuE8F51T,https://openreview.net/pdf?id=O4fNuE8F51T,We propose a novel text augmentation method that achieves state-of-the-art performance in various benchmark settings.,"Data augmentation has become a de facto technique in various NLP tasks to overcome the lack of a large-scale, qualified training set. Previous studies presented several data augmentation methods, such as replacing tokens with synonyms or interpolating the feature space of a given text input. While they are known to be convenient and promising, several limitations exist. First, prior studies simply treated topic classification and sentiment analysis under the same category of text classification, while we presume they have distinct characteristics. Second, previously-proposed replacement-based methods leave several avenues for improvement, as they utilize heuristics or statistical approaches for choosing synonyms. Lastly, while the feature space interpolation method achieved the current state-of-the-art, prior studies have not comprehensively combined it with replacement-based methods. To mitigate these drawbacks, we first analyzed which POS tags are important in each text classification task, and found that nouns are essential to topic classification, while sentiment analysis regards verbs and adjectives as important POS information. Contrary to this analysis, we discover that augmenting verb and adjective tokens commonly improves text classification performance regardless of task type. Lastly, we propose PMixUp, a novel data augmentation strategy that simultaneously utilizes replacement-based and feature space interpolation methods.
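The calibration study above lists label smoothing among its keywords; the standard smoothed cross-entropy is worth writing down for reference, since it is the simplest way to make a cross-entropy objective "more calibrated". The smoothing strength is an assumed constant:

```python
import torch
import torch.nn.functional as F

def label_smoothing_ce(logits, targets, eps=0.1):
    """Cross-entropy against smoothed targets.

    The true class gets probability 1 - eps and the remaining eps is
    spread uniformly over the other classes, which discourages the
    overconfident logits that plain cross-entropy produces.
    """
    n = logits.size(1)
    soft = torch.full_like(logits, eps / (n - 1))
    soft.scatter_(1, targets.unsqueeze(1), 1.0 - eps)
    return -(soft * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```

Less overconfident known-intent scores leave more headroom for a threshold to flag unknown intents, which is the intuition the abstract builds on.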
We show that PMixUp sets a new state-of-the-art in nine public benchmark settings, especially when few training samples are available. ","text augmentation, part-of-speech, feature space interpolation" SDT: Specific Domain Training in Domain Generalization,https://openreview.net/forum?id=t_OZ5jexnbH,https://openreview.net/pdf?id=t_OZ5jexnbH,We discern the spurious features by specific domain training. ,"Domain generalization (DG) methods aim to achieve generalizability to an unseen target domain by using only training data from the source domains. Although there has been growing interest in learning from multiple training domains by applying different types of invariance across those domains, the improvements compared to empirical risk minimization (ERM) are almost negligible under controlled evaluation protocols. In this paper, we demonstrate that the disentanglement of spurious and invariant features is a tough task in standard training, since ERM simply minimizes the loss and does not exploit invariance among domains. To address the issue, we introduce a simple yet effective method called specific domain training (SDT), which intensifies the trace of spurious features to make them more discernible, and exploits a masking strategy to decrease their effect. We provide theoretical and experimental evidence to show the effectiveness of SDT for out-of-distribution generalization. Notably, SDT outperforms the previous state of the art \citet{cha2021swad} on the DomainNet benchmark by 0.2pp on average. Furthermore, SDT improves accuracy on some domains, such as Sketch in PACS, SUN09 in VLCS and L100 in TerraIncognita, by clear margins of 2.5pp, 3.4pp, and 5.4pp, respectively.","Deep learning, Computer vision, Domain generalization, Spurious features unfolding, Specific domain training" Lossy Compression with Gaussian Diffusion,https://openreview.net/forum?id=jBPvRLKP_n_,https://openreview.net/pdf?id=jBPvRLKP_n_,Theoretical and empirical results on a novel lossy compression approach using diffusion models,"We consider a novel lossy compression approach based on unconditional diffusion generative models, which we call DiffC. Unlike modern compression schemes which rely on transform coding and quantization to restrict the transmitted information, DiffC relies on the efficient communication of pixels corrupted by Gaussian noise. We implement a proof of concept and find that it works surprisingly well despite the lack of an encoder transform, outperforming the state-of-the-art generative compression method HiFiC on ImageNet 64x64. DiffC only uses a single model to encode and denoise corrupted pixels at arbitrary bitrates. The approach further provides support for progressive coding, that is, decoding from partial bit streams. We perform a rate-distortion analysis to gain a deeper understanding of its performance, providing analytical results for multivariate Gaussian data as well as theoretical bounds for general distributions.
Furthermore, we prove that a flow-based reconstruction achieves a 3 dB gain over ancestral sampling at high bitrates.","diffusion, compression, information theory" Score-Based Graph Generative Modeling with Self-Guided Latent Diffusion,https://openreview.net/forum?id=AykEgQNPJEK,https://openreview.net/pdf?id=AykEgQNPJEK,We propose a novel and unified latent-based framework Score-Based Graph Generative Model powered by Self-Guided Latent Diffusion to promote graph generation in different scenarios.,"Graph generation is a fundamental task in machine learning, and it is critical for numerous real-world applications such as biomedical discovery and social science. Existing diffusion-based graph generation methods have two limitations: (i) they conduct the diffusion process directly in the complex graph space (i.e., node feature, adjacency matrix, or both), resulting in hard optimization with many network evaluations; (ii) they usually neglect to sufficiently cover the whole distribution of the target unlabeled graph set and thus fail to achieve semantically controllable generation. In this paper, we first propose a unified latent-based graph generative framework, Score-Based Graph Generative Model (SGGM), powered by Self-Guided Latent Diffusion (SLD) to address both limitations. Specifically, we pretrain a variational graph autoencoder to map raw graphs from the high-dimensional discrete space to a low-dimensional topology-injected latent space, and apply a score-based generative model there, yielding a smoother, faster and more expressive graph generation procedure. To sufficiently cover the whole semantic distribution of the unlabeled graph set, we propose SLD to controllably self-guide the sample generation with gradients from the designed assigning function towards hierarchical pseudo labels, produced by iterative clustering on the latent embeddings. In addition, we periodically update the pseudo labels during training to achieve mutual adaptation between self-guidance and score-based generation. Experiments show that our SGGM powered by SLD outperforms previous graph generation baselines on both generic and molecular graph datasets, demonstrating its generality and extensibility, along with further theoretical proofs.","Generative Model, Diffusion Model, Graph Generation" Gradient-Informed Quality Diversity for the Illumination of Discrete Spaces,https://openreview.net/forum?id=yyygh7OqdCQ,https://openreview.net/pdf?id=yyygh7OqdCQ,We present a method to use gradient information for Quality Diversity in the case where those functions are differentiable and the input variables are discrete.,"Quality Diversity (QD) algorithms have been proposed to search for a large collection of both diverse and high-performing solutions instead of a single set of local optima. While early QD algorithms view the objective and descriptor functions as black-box functions, novel tools have been introduced to use gradient information to accelerate the search and improve overall performance of those algorithms over continuous input spaces. However, a broad range of applications involve discrete spaces, such as drug discovery or image generation. Exploring those spaces is challenging as they are combinatorially large and gradients cannot be used in the same manner as in continuous spaces. We introduce MAP-Elites with a Gradient-Informed Discrete Emitter (ME-GIDE), which extends QD optimisation with differentiable functions over discrete search spaces.
ME-GIDE leverages the gradient information of the objective and descriptor functions with respect to their discrete inputs to propose gradient-informed updates that guide the search towards a diverse set of high-quality solutions. We evaluate our method on challenging benchmarks, including protein design and discrete latent space illumination, and find that it outperforms state-of-the-art QD algorithms on all benchmarks. ","Quality Diversity, Latent Space, Protein" Deep Generative Wasserstein Gradient Flows,https://openreview.net/forum?id=zjSeBTEdXp1,https://openreview.net/pdf?id=zjSeBTEdXp1,We scale Wasserstein gradient flows to high dimensional image generation tasks.,"Deep generative modeling is a rapidly-advancing field with a wealth of modeling choices developed in the past decades. Amongst them, Wasserstein gradient flows (WGF) are a powerful and theoretically rich class of methods. However, their applications to high-dimensional distributions remain relatively underexplored. In this paper, we present Deep Generative Wasserstein Gradient Flows (DGGF), which constructs a WGF between two distributions by minimizing the entropy-regularized $f$-divergence. We demonstrate how to train a deep density ratio estimator that is required for the WGF and apply it to the task of generative modeling. Experiments demonstrate that DGGF is able to synthesize high-fidelity images of resolutions up to $128\times128$, directly in data space. We demonstrate that DGGF has an interpretable diagnostic of sample quality by naturally estimating the KL divergence throughout the gradient flow. Finally, we show DGGF's modularity through composition with external density ratio estimators for conditional generation, as well as for unpaired image-to-image translation with no modifications to the framework.","deep generative modeling, gradient flow" Linear Scalarization for Byzantine-Robust Learning on non-IID data,https://openreview.net/forum?id=dYFg48Ye6rl,https://openreview.net/pdf?id=dYFg48Ye6rl,An enhancing method for current Byzantine defenses when data between clients is unbalanced.,"In this work we study the problem of Byzantine-robust learning when data among clients is heterogeneous. We focus on poisoning attacks targeting the convergence of SGD. Although this problem has received great attention, the main Byzantine defenses rely on the IID assumption, causing them to fail when the data distribution is non-IID, even with no attack. We propose the use of Linear Scalarization (LS) as an enhancing method to enable current defenses to circumvent Byzantine attacks in the non-IID setting. The LS method is based on the incorporation of a trade-off vector that penalizes the suspected malicious clients. Empirical analysis corroborates that the proposed LS variants are viable in the IID setting. For mild to strong non-IID data splits, LS is either comparable to or outperforms current approaches under state-of-the-art Byzantine attack scenarios. ","Byzantine SGD, Distributed Deep Learning, Non-IID" Where to Go Next for Recommender Systems? ID- vs. Modality-based recommender models revisited,https://openreview.net/forum?id=bz3MAU-RhnW,https://openreview.net/pdf?id=bz3MAU-RhnW,,"Recommender models that utilize unique identities (IDs for short) to represent distinct users and items have been the state of the art, dominating the recommender system (RS) literature for over a decade.
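A schematic reading of the Linear Scalarization defense above: aggregate client updates with a trade-off vector that penalizes suspected malicious clients. How the suspicion scores are obtained is left abstract in this sketch, so only the aggregation step is shown, and all names are illustrative.

```python
import torch

def scalarized_update(client_grads, suspicion):
    """Weighted aggregation of flattened client gradients.

    client_grads: list of 1-D tensors, one per client.
    suspicion: (K,) tensor in [0, 1], assumed to come from some upstream
    anomaly score; higher means more likely Byzantine.
    """
    w = 1.0 - suspicion                  # trade-off vector penalizing suspects
    w = w / w.sum()                      # normalize to a convex combination
    return (w.unsqueeze(1) * torch.stack(client_grads)).sum(dim=0)
```

Soft down-weighting of this kind avoids the hard filtering of classical defenses, which is what misfires under non-IID splits where honest clients legitimately disagree.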
In parallel, pre-trained modality encoders, such as BERT and ResNet, are becoming increasingly powerful in modeling raw modality features, e.g., text and images. In light of this, a natural question arises: can recommender models based only on modality (a.k.a. content) features (MoRec) exceed or be on par with ID-only based models (IDRec) when item modality features are available? In fact, this question was answered once, a decade ago, when IDRec beat MoRec with strong advantages in terms of both recommendation accuracy and efficiency. We aim to revisit this `old' question and systematically study MoRec from several aspects. Specifically, we study several sub-questions: (i) Which recommender paradigm, MoRec or IDRec, performs best in various practical scenarios, including regular, cold and new item scenarios? Does this hold for items with different modality features? (ii) Will MoRec benefit from the latest technical advances in corresponding communities, for example, natural language processing and computer vision? (iii) What is an effective way to leverage item modality representations, freezing them or adapting them by fine-tuning on new data? (iv) Are there any other factors that affect the efficacy of MoRec? To answer these questions, we conduct rigorous experiments for item recommendations with two popular modalities, i.e., text and vision. We provide empirical evidence that MoRec with standard end-to-end training is highly competitive and even exceeds IDRec in some cases. Many of our observations imply that the dominance of IDRec in terms of recommendation accuracy does not hold well when items' raw modality features are available. We promise to release all related code & datasets upon acceptance.", Pixel-Level Task Helps Pruned Network Transfer to Downstream Tasks,https://openreview.net/forum?id=f13bbIPM1hG,https://openreview.net/pdf?id=f13bbIPM1hG,,"Pruning well-trained neural networks is effective for achieving a promising accuracy-efficiency trade-off in computer vision regimes. However, most existing pruning algorithms only focus on the classification task defined on the source domain. Different from the strong transferability of the original model, a pruned network is hard to transfer to complicated downstream tasks such as object detection \citet{girish2021lottery}. In this paper, we show that the image-level pretraining task is not capable of pruning models for diverse downstream tasks. To mitigate this problem, we introduce image reconstruction, a pixel-level task, into the traditional pruning framework. Concretely, an autoencoder is trained based on the original model, and then the pruning process is optimized with both autoencoder and classification losses. The empirical study on benchmark downstream tasks shows that the proposed method clearly outperforms state-of-the-art results.", Is Class Incremental Learning Truly Learning Representations Continually?,https://openreview.net/forum?id=aqvU0FfRqT,https://openreview.net/pdf?id=aqvU0FfRqT,,"Class incremental learning (CIL) aims to continually learn a classifier for new object classes from incrementally arriving data while not forgetting the past learned classes. The average test accuracy across all classes learned so far has been a widely used metric to evaluate the CIL algorithms, but we argue that a simple horse race toward maximizing the accuracy may not necessarily lead to developing effective CIL algorithms.
Namely, since a classification model is often used as a backbone model that transfers the learned representations to other downstream tasks, we believe it is also important to ask whether the CIL algorithms are indeed learning representations continually. To that end, we borrow several typical evaluation protocols of representation learning to solely evaluate the quality of encoders learned by the CIL algorithms: 1) fix the encoder and re-train the final linear layer or run the k-nearest neighbor (kNN) classifier using the entire training set obtained for all classes so far and check the test accuracy, and 2) perform transfer learning with the incrementally learned encoder to several downstream tasks and report the test accuracy on those tasks. Our comprehensive experimental results disclose the limitations of the conventional accuracy-based CIL evaluation protocol as follows. First, the state-of-the-art CIL algorithms with high test accuracy do not necessarily perform equally well with respect to our representation-level evaluation; in fact, they sometimes perform even worse than naive baselines. Second, it turns out the high test accuracy of the state-of-the-art CIL algorithms may be largely due to the good quality of the representations learned from the first task, which means those algorithms mainly focus on stability (not forgetting the first task model's capability), but not really on continually learning new tasks, i.e., plasticity, to attain high overall average accuracy. Based on these results, we claim that our representation-level evaluation should be an essential recipe for more objectively evaluating and effectively developing the CIL algorithms. ","continual learning, class-incremental learning, representation learning" "Optimising 2D Pose Representation: Improving Accuracy, Stability and Generalisability in Unsupervised 2D-3D Human Pose Estimation",https://openreview.net/forum?id=2lbtqs4enl,https://openreview.net/pdf?id=2lbtqs4enl,Investigating how the representation of a 2D pose can affect the 3D ordinate predictions during the unsupervised adversarial 2D-3D lifting cycle.,"This paper addresses the problem of 2D pose representation during unsupervised 2D to 3D pose lifting to improve the accuracy, stability and generalisability of 3D human pose estimation (HPE) models. All unsupervised 2D-3D HPE approaches provide the entire 2D kinematic skeleton to a model during training. We argue that this is sub-optimal and disruptive as long-range correlations are induced between independent 2D key points and predicted 3D ordinates during training. To this end, we conduct the following study. With a maximum architecture capacity of 6 residual blocks, we evaluate the performance of 5 models which each represent a 2D pose differently during the adversarial unsupervised 2D-3D HPE process. Additionally, we show the correlations between 2D key points which are learned during the training process, highlighting the unintuitive correlations induced when an entire 2D pose is provided to a lifting model. Our results show that the optimal representation of a 2D pose is that of two independent segments, the torso and legs, with no shared features between each lifting network. This approach decreased the average error by 20% on the Human3.6M dataset when compared to a model with a near-identical parameter count trained on the entire 2D kinematic skeleton.
Furthermore, due to the complex nature of adversarial learning, we show how this representation can also improve convergence during training, allowing an optimal result to be obtained more often.","Unsupervised Learning, 3D Human Pose Estimation, Data Representation, Adversarial Learning" Model Obfuscation for Securing Deployed Neural Networks,https://openreview.net/forum?id=ib482K6HQod,https://openreview.net/pdf?id=ib482K6HQod,"A model obfuscation method to make the AI model ""unreadable"".","More and more edge devices and mobile apps are leveraging deep learning (DL) capabilities. Deploying such models on devices -- referred to as on-device models -- rather than as remote cloud-hosted services, has gained popularity as it avoids transmitting users' data off the device and offers faster response times. However, on-device models can be easily attacked, as they can be accessed by unpacking corresponding apps and the model is fully exposed to attackers. Recent studies show that adversaries can easily generate white-box-like attacks for an on-device model or even invert its training data. To protect on-device models from white-box attacks, we propose a novel technique called model obfuscation. Specifically, model obfuscation hides and obfuscates the key information -- structure, parameters and attributes -- of models by renaming, parameter encapsulation, neural structure obfuscation, shortcut injection, and extra layer injection. We have developed a prototype tool, ModelObfuscator, to automatically obfuscate on-device TFLite models. Our experiments show that this proposed approach can dramatically improve model security by significantly increasing the overhead of extracting models' inner information, without increasing the latency of DL models. Our proposed on-device model obfuscation has the potential to be a fundamental technique for on-device model deployment. Our prototype tool is publicly available at https://github.com/AnonymousAuthor000/Code2536.","model obfuscation, AI safety, AI system" Optimising Event-Driven Spiking Neural Network with Regularisation and Cutoff,https://openreview.net/forum?id=H4xO3doonl-,https://openreview.net/pdf?id=H4xO3doonl-,"Two novel optimisation techniques are presented to consider anytime optimal inference SNNs, AOI-SNNs: a regularisation and a cutoff.","Spiking neural networks (SNNs), a variant of artificial neural networks (ANNs) with the benefit of energy efficiency, have achieved accuracy close to their ANN counterparts on benchmark datasets such as CIFAR10/100 and ImageNet. However, compared with frame-based inputs (e.g., images), event-based inputs from, e.g., a Dynamic Vision Sensor (DVS) can make better use of SNNs thanks to the SNNs' asynchronous working mechanism. In this paper, we strengthen the marriage between SNNs and event-based inputs with a proposal to consider anytime optimal inference SNNs, or AOI-SNNs, which can terminate at any time during inference to achieve an optimal inference result. Two novel optimisation techniques are presented to achieve AOI-SNNs: a regularisation and a cutoff. The regularisation enables the training and construction of SNNs with optimised performance, and the cutoff technique optimises the inference of SNNs on event-driven inputs. We conduct an extensive set of experiments on multiple benchmark event-based datasets, including CIFAR10-DVS, N-Caltech101 and DVS128 Gesture. The experimental results demonstrate that our techniques are superior to the state-of-the-art with respect to accuracy and latency.
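The cutoff technique for AOI-SNNs can be illustrated with a confidence-based early exit over accumulated timestep outputs. The confidence rule and threshold below are assumptions for the sketch; the paper's actual cutoff criterion may differ.

```python
import torch

def infer_with_cutoff(step_logits, threshold=0.9):
    """Anytime SNN inference sketch: accumulate per-timestep outputs
    and stop as soon as the softmax confidence clears a threshold.

    step_logits: non-empty list of (num_classes,) tensors, one per
    timestep of the event stream.
    """
    acc = None
    for t, out in enumerate(step_logits, start=1):
        acc = out if acc is None else acc + out
        conf = torch.softmax(acc / t, dim=0).max()
        if conf >= threshold:             # cutoff: exit early
            return acc.argmax().item(), t
    return acc.argmax().item(), t         # used the full window
```

Easy inputs exit after a few timesteps while hard ones use the full window, which is where the latency savings on event-driven data come from.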
","Spiking Neural Network, Event-driven Neural Network, ANN-to-SNN Conversion" ESP: Exponential Smoothing on Perturbations for Increasing Robustness to Data Corruptions,https://openreview.net/forum?id=U7LLhh3VFxH,https://openreview.net/pdf?id=U7LLhh3VFxH,A high-level data augmentation method to increase model robustness against unforeseen data corruptions.,"Despite the great advances in the machine learning field over the past decade, deep learning algorithms are often vulnerable to data corruption in real-world environments. We propose a simple yet efficient data augmentation method named Exponential Smoothing on Perturbations (ESP) that imposes perturbations on training data to enhance a model’s robustness to unforeseen data corruptions. With the perturbation on the input side, the target label of a sample is smoothed with an exponentially decaying confidence level with respect to the size of the perturbation. ESP enforces a contour-like decision boundary that smoothly encompasses the region around inter-class samples. We theoretically show that perturbations in input space can encourage a model to find a flat minimum on the parameter space, which makes a model robust to domain shifts. In the extensive evaluation on common corruption benchmarks including MNIST-C, CIFAR-10/100-C, and Tiny-ImageNet-C, our method improves the robustness of a model both as a standalone method and in conjunction with the previous state-of-the-art augmentation-based methods. ESP is a model-agnostic algorithm in the sense that it is neither model-specific nor data-specific.","Deep Learning, Model Robustness, Domain Generalization, Common Corruption" MATS: Memory Attention for Time-Series forecasting,https://openreview.net/forum?id=JjEtPDn0eRb,https://openreview.net/pdf?id=JjEtPDn0eRb,,"Long-term time series forecasting (LTSF) is still very challenging in many real-world applications. A fundamental difficulty is in efficiently modeling both the short-term temporal patterns and long-term dependencies. in this paper, we introduce a novel two-stage attention-based LTSF model called Memory Attention for Time-Series forecasting (MATS). In stage I, short-term temporal patterns are extracted to a memory bank such that the input time series is represented by a much shorter sequence of memory attentions. In stage II, a sequence-to-sequence predictor is trained to discover long-term dependencies in the memory attention sequence, and forecast memory attentions corresponding to the time series in the future. The use of attention allows a flexible representation, and its shorter sequence length enables the model to more easily learn long-term dependencies. Extensive experiments on a number of multivariate and univariate benchmark datasets demonstrate that MATS outperforms SOTA LTSF methods almost all the time.", MultiViz: Towards Visualizing and Understanding Multimodal Models,https://openreview.net/forum?id=i2_TvOFmEml,https://openreview.net/pdf?id=i2_TvOFmEml,"MultiViz is a framework for visualizing & understanding multimodal models across unimodal importance, cross-modal interactions, multimodal representations & multimodal prediction that enables model understanding, error analysis & model debugging.","The promise of multimodal models for real-world applications has inspired research in visualizing and understanding their internal mechanics with the end goal of empowering stakeholders to visualize model behavior, perform model debugging, and promote trust in machine learning models. 
However, modern multimodal models are typically black-box neural networks, which makes it challenging to understand their internal mechanics. How can we visualize the internal modeling of multimodal interactions in these models? Our paper aims to fill this gap by proposing MultiViz, a method for analyzing the behavior of multimodal models by scaffolding the problem of interpretability into 4 stages: (1) unimodal importance: how each modality contributes towards downstream modeling and prediction, (2) cross-modal interactions: how different modalities relate to each other, (3) multimodal representations: how unimodal and cross-modal interactions are represented in decision-level features, and (4) multimodal prediction: how decision-level features are composed to make a prediction. MultiViz is designed to operate on diverse modalities, models, tasks, and research areas. Through experiments on 8 trained models across 6 real-world tasks, we show that the complementary stages in MultiViz together enable users to (1) simulate model predictions, (2) assign interpretable concepts to features, (3) perform error analysis on model misclassifications, and (4) use insights from error analysis to debug models. MultiViz is publicly available, will be regularly updated with new interpretation tools and metrics, and welcomes inputs from the community.","multimodal learning, representation learning, interpretation, visualization" How Informative is the Approximation Error from Tensor Decomposition for Neural Network Compression?,https://openreview.net/forum?id=sKHqgFOaFXI,https://openreview.net/pdf?id=sKHqgFOaFXI,"We show empirically that the approximation error resulting from compressing a network layer with tensor decomposition is correlated with the classification error, enabling the choice of layer, decomposition, and rank to be based on the approximation error.","Tensor decompositions have been successfully applied to compress neural networks. The compression algorithms using tensor decompositions commonly minimize the approximation error on the weights. Recent work assumes the approximation error on the weights is a proxy for the performance of the model when compressing multiple layers and fine-tuning the compressed model. Surprisingly, little research has systematically evaluated which approximation errors can be used to make choices regarding the layer, tensor decomposition method, and level of compression. To close this gap, we perform an experimental study to test if this assumption holds across different layers and types of decompositions, and what the effect of fine-tuning is. We include the approximation error on the features resulting from a compressed layer in our analysis to test if this provides a better proxy, as it explicitly takes the data into account. We find the approximation error on the weights has a positive correlation with the performance error, before as well as after fine-tuning. Basing the approximation error on the features does not improve the correlation significantly. While scaling the approximation error is commonly used to account for the different sizes of layers, the average correlation across layers is smaller than across all choices (i.e. layers, decompositions, and level of compression) before fine-tuning. When calculating the correlation across the different decompositions, the average rank correlation is larger than across all choices.
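The approximation error on the weights discussed in this entry is cheap to compute. A sketch of the quantity for a rank-r truncated SVD of a convolution kernel flattened to a matrix, under the assumption that the relative Frobenius error is the scalar being correlated with performance; layer, decomposition, and rank choices would then be compared by this number.

```python
import numpy as np

def weight_approx_error(W, rank):
    """Relative Frobenius error of the best rank-`rank` approximation of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return np.linalg.norm(W - W_hat) / np.linalg.norm(W)

W = np.random.randn(64, 3 * 3 * 32)    # e.g. 64 filters of shape 3x3x32
for r in (4, 8, 16, 32):
    print(r, round(weight_approx_error(W, r), 3))
```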
This means multiple decompositions can be considered for compression and the approximation error can be used to choose between them.","Tensor Decomposition, Convolutional Neural Networks, Compression" DISCO-DANCE: Learning to Discover Skills with Guidance,https://openreview.net/forum?id=IUGwUr5_9wY,https://openreview.net/pdf?id=IUGwUr5_9wY,"This paper proposes a novel unsupervised skill learning algorithm, DISCO-DANCE, which provides direct guidance in order to accelerate the learning of diverse skills by encouraging further exploration.","Unsupervised skill discovery (USD) allows agents to learn diverse and discriminable skills without access to pre-defined rewards, by maximizing the mutual information (MI) between skills and states reached by each skill. The most common problem of MI-based skill discovery is insufficient exploration, because each skill is heavily penalized when it deviates from its initial settlement. Recent works introduced an auxiliary reward to encourage the exploration of the agent via maximizing the state's epistemic uncertainty or entropy. However, we have discovered that the performance of these auxiliary rewards decreases as the environment becomes more challenging. Therefore, we introduce a new unsupervised skill discovery algorithm, skill discovery with guidance (DISCO-DANCE), which (1) selects the guide skill that has the highest potential to reach unexplored states, (2) guides other skills to follow the guide skill, and then (3) diffuses the guided skills to maximize their discriminability in the unexplored states. Empirically, DISCO-DANCE substantially outperforms other USD baselines on challenging environments including two navigation benchmarks and a continuous control benchmark.","Unsupervised skill discovery, Reinforcement Learning" Exploring Generalization of Non-Contrastive self-supervised Learning,https://openreview.net/forum?id=mfPEzfKJL4n,https://openreview.net/pdf?id=mfPEzfKJL4n,We give an upper bound on the generalization error rate of non-contrastive learning methods represented by Barlow Twins and SimSiam.,"Contrastive learning has recently produced results comparable to the state-of-the-art supervised models. Non-contrastive methods do not use negative samples, but separate samples of different classes by explicitly or implicitly optimizing the representation space. Although we have some understanding of the core of the non-contrastive learning method, theoretical analysis of its generalization performance is still missing. Thus, we present a theoretical analysis of the generalizability of non-contrastive models. We focus on the inter-class distance, showing how non-contrastive methods increase the inter-class distance and how this distance affects the generalization performance of the model. We find that the generalization of non-contrastive methods is affected by the output dimension and the number of latent classes. Models with much fewer dimensions than the number of latent classes are not sufficient to generalize well.
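As a concrete instance of the objectives this analysis covers (the entry's tldr names Barlow Twins and SimSiam), here is a minimal Barlow Twins-style loss: the cross-correlation matrix of two views' embeddings is pushed toward the identity. The off-diagonal weight `lam` is an assumed hyperparameter.

```python
import torch

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Non-contrastive loss: align cross-correlation of two views with I."""
    z1 = (z1 - z1.mean(0)) / z1.std(0)
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = (z1.T @ z2) / z1.shape[0]                  # d x d cross-correlation
    on_diag = ((torch.diagonal(c) - 1) ** 2).sum()
    off_diag = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum()
    return on_diag + lam * off_diag

z1, z2 = torch.randn(256, 128), torch.randn(256, 128)  # two augmented views
print(barlow_twins_loss(z1, z2))
```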
We demonstrate our findings through experiments on the CIFAR dataset.","contrastive learning, representation learning" Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN,https://openreview.net/forum?id=LOTGOB5_Xh2,https://openreview.net/pdf?id=LOTGOB5_Xh2,We delve deep into the masked image modeling (MIM) working mechanism and propose a generic pre-training framework (A$^2$MIM) for Transformers and CNNs.,"Masked image modeling (MIM), an emerging self-supervised pre-training method, has shown impressive success across numerous downstream vision tasks with Vision Transformers (ViTs). Its underlying idea is simple: a portion of the input image is randomly masked out and then reconstructed via the pre-text task. However, the working principle behind MIM is not well explained, and previous studies insist that MIM primarily works for the Transformer family but is incompatible with CNNs. In this paper, we first study interactions among patches to understand what knowledge is learned and how it is acquired via the MIM task. We observe that MIM essentially teaches the model to learn better middle-order interactions among patches and extract more generalized features. Based on this fact, we propose an Architecture-Agnostic Masked Image Modeling framework (A$^2$MIM), which is compatible with both Transformers and CNNs in a unified way. Extensive experiments on popular benchmarks show that our A$^2$MIM learns better representations without explicit design and endows the backbone model with a stronger capability to transfer to various downstream tasks for both Transformers and CNNs.","Self-supervised Learning, Vision Transformer, Representation Learning, Unsupervised Learning" Blurring Diffusion Models,https://openreview.net/forum?id=OjDkC57x5sz,https://openreview.net/pdf?id=OjDkC57x5sz,"We show that blurring can equivalently be defined through a Gaussian diffusion process with non-isotropic noise, bridging the gap between inverse heat dissipation and denoising diffusion","Recently, Rissanen et al. (2022) have presented a new type of diffusion process for generative modeling based on heat dissipation, or blurring, as an alternative to isotropic Gaussian diffusion. Here, we show that blurring can equivalently be defined through a Gaussian diffusion process with non-isotropic noise. In making this connection, we bridge the gap between inverse heat dissipation and denoising diffusion, and we shed light on the inductive bias that results from this modeling choice. Finally, we propose a generalized class of diffusion models that offers the best of both standard Gaussian denoising diffusion and inverse heat dissipation, which we call Blurring Diffusion Models. ","blurring, diffusion, generative model" BrGANs: Stabilizing GANs' Training Process with Brownian Motion Control,https://openreview.net/forum?id=0YYQ_KKsIZ,https://openreview.net/pdf?id=0YYQ_KKsIZ,We propose a higher-order Brownian Motion Controller (BMC) for BrGANs to stabilize GANs' training process,"The training process of generative adversarial networks (GANs) is unstable and does not converge globally. In this paper, we propose a universal higher-order noise-based control called Brownian Motion Control (BMC) that is invariant to the GAN framework, so that the training process of GANs is exponentially stable almost surely. Specifically, starting with the prototypical case of Dirac-GANs, we design a BMC and propose Dirac-BrGANs that retrieve exactly the same, but now reachable, optimal equilibrium regardless of the GAN framework.
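The equivalence stated in the Blurring Diffusion Models entry above can be sketched in a few lines: heat dissipation for time t is a per-frequency exponential decay in the DCT basis, to which Gaussian noise is added, giving a non-isotropic diffusion step. The noise scale `sigma_t` and the eigenvalue convention are assumptions for illustration.

```python
import numpy as np
from scipy.fft import dctn, idctn

def blur_and_noise(x, t, sigma_t=0.1):
    """One blurring-diffusion step: heat-kernel decay in DCT space + noise."""
    h, w = x.shape
    fy, fx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    lam = (np.pi * fy / h) ** 2 + (np.pi * fx / w) ** 2  # Laplacian eigenvalues
    x_freq = dctn(x, norm="ortho") * np.exp(-lam * t)    # per-frequency decay
    return idctn(x_freq, norm="ortho") + sigma_t * np.random.randn(h, w)

x = np.random.rand(32, 32)
print(blur_and_noise(x, t=1.0).shape)
```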
The optimal equilibrium of our Dirac-BrGANs' training system is globally unique and always exists. Furthermore, the training process of Dirac-BrGANs achieves exponential stability almost surely for any arbitrary initial value. Then we extend our BMC to normal GAN settings and propose BrGANs. We provide numerical experiments showing that our BrGANs effectively stabilize GANs' training process and obtain state-of-the-art performance compared to other stabilizing methods. ","GAN, stability, control theory, Brownian motion" Detecting Backdoor Attacks via Layer-wise Feature Analysis,https://openreview.net/forum?id=gncu27b4elL,https://openreview.net/pdf?id=gncu27b4elL,"We find that the feature difference between benign and poisoned samples tends to reach its maximum at a critical layer, based on which we propose a simple yet effective method to filter poisoned samples by analyzing the features at that layer.","Training well-performing deep neural networks (DNNs) usually requires massive training data and computational resources, which might not be affordable for some users. For this reason, users may prefer to outsource their training process to a third party or directly exploit publicly available pre-trained models. Unfortunately, doing so opens the possibility of a new dangerous training-time attack (dubbed backdoor attack) against DNNs. Currently, most of the existing backdoor detectors filter poisoned samples based on the latent feature representations generated by convolutional layers. In this paper, we first conduct a layer-wise feature analysis of poisoned and benign samples from the target class. We find that the feature difference between benign and poisoned samples tends to reach its maximum at a critical layer, which is not always the one typically used in existing defenses, namely the layer before the fully-connected layers. In particular, we can locate this critical layer easily based on the behaviors of benign samples. Based on this finding, we propose a simple yet effective method to filter poisoned samples by analyzing the feature differences between suspicious and benign samples at the critical layer. We conduct extensive experiments on two benchmark datasets, which confirm the effectiveness of our backdoor detection.","Backdoor Detection, Backdoor Defense, Backdoor Learning, Trustworthy ML, AI Security" Hyperbolic Self-paced Learning for Self-supervised Skeleton-based Action Representations,https://openreview.net/forum?id=3Bh6sRPKS3J,https://openreview.net/pdf?id=3Bh6sRPKS3J,,"Self-paced learning has been beneficial for tasks where some initial knowledge is available, such as weakly supervised learning and domain adaptation, to select and order the training sample sequence, from easy to complex. However, its applicability remains unexplored in unsupervised learning, whereby the knowledge of the task matures during training. We propose a novel HYperbolic Self-Paced model (HYSP) for learning skeleton-based action representations. HYSP adopts self-supervision: it uses data augmentations to generate two views of the same sample, and it learns by matching one (named online) to the other (the target). We propose to use hyperbolic uncertainty to determine the algorithmic learning pace, under the assumption that less uncertain samples should more strongly drive the training, with a larger weight and pace.
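HYSP's pacing principle (more certain samples drive training harder) admits a small sketch under an assumed convention: in the Poincare ball, embeddings near the boundary (norm close to 1) are treated as certain and receive a larger loss weight. Only the monotone principle is taken from the entry above; the exact weighting in the paper may differ.

```python
import torch

def hyperbolic_pace_weight(z, eps=1e-5):
    """Per-sample weight: low hyperbolic uncertainty -> large pace weight."""
    norm = z.norm(dim=-1).clamp(max=1 - eps)   # points live in the unit ball
    uncertainty = 1.0 - norm                   # small near the boundary
    return 1.0 / (uncertainty + eps)           # certain samples drive training

z = 0.9 * torch.nn.functional.normalize(torch.randn(4, 16), dim=-1)
print(hyperbolic_pace_weight(z))
```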
Hyperbolic uncertainty is a by-product of the adopted hyperbolic neural networks; it matures during training and comes with no extra cost compared to the established Euclidean SSL framework counterparts. When tested on three established skeleton-based action recognition datasets, HYSP outperforms the state-of-the-art on PKU-MMD I, as well as on 2 out of 3 downstream tasks on NTU-60 and NTU-120. Additionally, HYSP only uses positive pairs and therefore bypasses the complex and computationally-demanding mining procedures required for the negatives in contrastive techniques. Code is enclosed in the submission and will be released.", Unfair geometries: exactly solvable data model with fairness implications,https://openreview.net/forum?id=6f47WT-HtuH,https://openreview.net/pdf?id=6f47WT-HtuH,"We propose a generative model, exactly solvable using statistical physics, which emphasizes the impact of data geometry in inducing bias in classification.","Machine learning (ML) may be oblivious to human bias but it is not immune to its perpetuation. Marginalisation and iniquitous group representation are often traceable in the very data used for training, and may be reflected or even enhanced by the learning models. In the present work, we aim at clarifying the role played by data geometry in the emergence of ML bias. We introduce an exactly solvable high-dimensional model of data imbalance, where parametric control over the many bias-inducing factors allows for an extensive exploration of the bias inheritance mechanism. Through the tools of statistical physics, we analytically characterise the typical properties of learning models trained in this synthetic framework and obtain exact predictions for the observables that are commonly employed for fairness assessment. Despite the simplicity of the data model, we retrace and unpack typical unfairness behaviour observed on real-world datasets. We also obtain a detailed analytical characterisation of a class of bias mitigation strategies. We first consider a basic loss-reweighing scheme, which allows for an implicit minimisation of different unfairness metrics, and quantify the incompatibilities between some existing fairness criteria. Then, we consider a novel mitigation strategy based on a matched inference approach, consisting in the introduction of coupled learning models. Our theoretical analysis of this approach shows that the coupled strategy can strike superior fairness-accuracy trade-offs.","statistical physics, statistical mechanics of learning, generalization model, modelling structured data, data imbalance, bias, fairness, bias mitigation" DropAut: Automatic Dropout Approaches to learn and adapt Drop Rates,https://openreview.net/forum?id=snktGNQb-kD,https://openreview.net/pdf?id=snktGNQb-kD,Data-driven extensions of Dropout to automatically detect drop rates,"Over time, it has been shown that Dropout is one of the best techniques to fight overfitting and at the same time improve the overall performance of deep learning models. When training with Dropout, a randomly selected subset of activations is set to zero within each layer based on a hyper-parameter called the drop rate. Finding a suitable drop rate can be very expensive, especially nowadays when modern neural networks contain a large number of parameters. We introduce ""DropAut"", a completely data-driven extension of Dropout which enables the model to learn and adapt the drop rate based on the task and data it is dealing with.
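A minimal sketch of a learnable drop rate in the spirit of DropAut: the rate is the sigmoid of a learnable logit, kept differentiable with a relaxed Bernoulli mask. The parameterization, temperature, and rescaling are our assumptions, not necessarily the paper's exact design.

```python
import torch
import torch.nn as nn

class LearnableDropout(nn.Module):
    """Dropout whose drop rate is a trainable parameter (sketch)."""
    def __init__(self, init_rate=0.5, temp=0.1):
        super().__init__()
        p = torch.tensor(init_rate)
        self.logit = nn.Parameter(torch.log(p / (1 - p)))
        self.temp = torch.tensor(temp)

    def forward(self, x):
        if not self.training:
            return x
        p_drop = torch.sigmoid(self.logit)
        keep = torch.distributions.RelaxedBernoulli(
            self.temp, probs=(1 - p_drop).expand_as(x)).rsample()
        return x * keep / (1 - p_drop)         # inverted-dropout rescaling

layer = LearnableDropout()
layer.train()
print(layer(torch.randn(2, 8)).shape, torch.sigmoid(layer.logit).item())
```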
Both Dropout and ""DropAut"" use the same drop rate for all the units of a layer, which could be a sub-optimal solution since not all units have the same importance. Based on this, we also propose two DropAut extensions called ""UnitsDropAut"" and ""BottleneckDropAut"" which additionally allow the model to learn and use different, unit-specific drop rates within a layer. We first derive a bound on the generalization performance of Dropout, ""DropAut"", ""UnitsDropAut"" and ""BottleneckDropAut"", and then evaluate the proposed approaches using different kinds of neural models on a range of datasets, showing good improvements over Dropout in all the experiments conducted. The code is available at https://github.com/$<$anonymous$>$.","Deep Learning, Neural Networks, Dropout, Automatic Dropout" Understanding Adversarial Transferability in Federated Learning,https://openreview.net/forum?id=nP7f5XW4FVa,https://openreview.net/pdf?id=nP7f5XW4FVa,"This paper proposes a different, simpler but practical setting for evaluating the robustness of federated learning. To understand the robustness of federated models, this paper investigates two core properties that relate to transfer robustness.","With the promises Federated Learning (FL) delivers, various topics regarding its robustness and security issues have been widely studied in recent years: such as the possibility to conduct adversarial attacks (or transferable adversarial attacks) in a white-box setting with full knowledge of the model (or the entire data), or the possibility to conduct poisoning/backdoor attacks during the training process as a malicious client. In this paper, we investigate the robustness and security issues from a different, simpler, but practical setting: a group of malicious clients has impacted the model during training by disguising their identities and acting as benign clients, revealing their adversarial position only after training to conduct transferable adversarial attacks with their data, which is usually a subset of the data that the FL system is trained with. Our aim is to offer a full understanding of the challenges the FL system faces in this setting across a spectrum of configurations. We notice that such an attack is possible, but the federated model is more robust compared with its centralized counterpart when the accuracy on clean images is comparable. Through our study, we hypothesize that the robustness stems from two factors: the decentralized training on distributed data and the averaging operation. Our work has implications for understanding the robustness of federated learning systems and poses a practical question for federated learning applications.","federated learning, adversarial attack, transfer-based black-box attack" RankCSE: Unsupervised Sentence Representations Learning via Learning to Rank,https://openreview.net/forum?id=y_sZyxuuFh3,https://openreview.net/pdf?id=y_sZyxuuFh3,We learn semantically discriminative sentence representations by incorporating ranking consistency and ranking distillation with contrastive learning into a unified framework.,"Unsupervised sentence representation learning is one of the fundamental problems in natural language processing with various downstream applications. Recently, contrastive learning has been widely adopted, which derives high-quality sentence representations by pulling similar semantics closer and pushing dissimilar ones away.
However, these methods fail to capture the fine-grained ranking information among sentences, where each sentence is treated only as either positive or negative. In many real-world scenarios, one needs to distinguish and rank the sentences based on their similarities to a query sentence, e.g., very relevant, moderately relevant, less relevant, irrelevant, etc. In this paper, we propose a novel approach, RankCSE, for unsupervised sentence representation learning, which incorporates ranking consistency and ranking distillation with contrastive learning into a unified framework. In particular, we learn semantically discriminative sentence representations by simultaneously ensuring ranking consistency between two representations with different dropout masks, and distilling listwise ranking knowledge from the teacher. An extensive set of experiments is conducted on both semantic textual similarity (STS) and transfer (TR) tasks. Experimental results demonstrate the superior performance of our approach over several state-of-the-art baselines.","sentence representations, self-supervised learning, learning to rank" Efficient Offline Policy Optimization with a Learned Model,https://openreview.net/forum?id=Yt-yM-JbYFO,https://openreview.net/pdf?id=Yt-yM-JbYFO,We propose a regularized one-step model-based method that outperforms MuZero Unplugged on the Atari benchmark.,"MuZero Unplugged presents a promising approach for offline policy learning from logged data. It conducts Monte-Carlo Tree Search (MCTS) with a learned model and leverages the Reanalyze algorithm to learn purely from offline data. For good performance, MCTS requires accurate learned models and a large number of simulations, thus incurring huge computing time. This paper investigates a few hypotheses where MuZero Unplugged may not work well under offline RL settings, including 1) learning with limited data coverage; 2) learning from offline data of stochastic environments; 3) improperly parameterized models given the offline data; 4) learning with a low compute budget. We propose to use a regularized one-step look-ahead approach to tackle the above issues. Instead of planning with the expensive MCTS, we use the learned model to construct an advantage estimation based on a one-step rollout. Policy improvement moves in the direction that maximizes the estimated advantage, with regularization towards the dataset. We conduct extensive empirical studies with BSuite environments to verify the hypotheses and then run our algorithm on the RL Unplugged Atari benchmark. Experimental results show that our proposed approach achieves stable performance even with an inaccurate learned model. On the large-scale Atari benchmark, the proposed method outperforms MuZero Unplugged by 43%. Most significantly, it uses only 5.6% wall-clock time (i.e., 1 hour) compared to MuZero Unplugged (i.e., 17.8 hours) to achieve a 150% IQM normalized score with the same hardware and software stacks.","Offline RL, Model-based RL"
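The regularized one-step look-ahead described in the entry above can be sketched compactly: a learned model supplies rewards and next states for a one-step rollout per action, and the policy maximizes the resulting advantage while staying close to the dataset (behavior) policy. All networks below are toy stand-ins, and the exact loss form is an assumption.

```python
import torch
import torch.nn.functional as F

S, A = 8, 4
value = torch.nn.Linear(S, 1)
policy = torch.nn.Linear(S, A)
behavior = torch.nn.Linear(S, A)          # estimate of the dataset policy
reward = torch.nn.Linear(S + A, 1)        # learned model: reward head
dynamics = torch.nn.Linear(S + A, S)      # learned model: next-state head

def one_step_policy_loss(s, alpha=1.0, gamma=0.99):
    adv = []
    for a in range(A):                     # one-step rollout per action
        a_onehot = F.one_hot(torch.full((s.shape[0],), a), A).float()
        sa = torch.cat([s, a_onehot], dim=-1)
        adv.append(reward(sa) + gamma * value(dynamics(sa)) - value(s))
    adv = torch.cat(adv, dim=-1)                          # [batch, A]
    pi = F.softmax(policy(s), dim=-1)
    improvement = -(pi * adv.detach()).sum(-1).mean()     # maximize advantage
    keep_close = F.kl_div(F.log_softmax(policy(s), -1),
                          F.softmax(behavior(s), -1), reduction="batchmean")
    return improvement + alpha * keep_close               # regularized update

print(one_step_policy_loss(torch.randn(16, S)))
```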
","Offline RL, Model-based RL" New Insights for the Stability-Plasticity Dilemma in Online Continual Learning,https://openreview.net/forum?id=fxC7kJYwA_a,https://openreview.net/pdf?id=fxC7kJYwA_a,We propose a novel online continual learning framework that utilizes multi-scale feature maps in addition to a structure-wise distillation loss and a stability-plasticity normalization module to maintain high stability and plasticity simultaneously.,"The aim of continual learning is to learn new tasks continuously (i.e., plasticity) without forgetting previously learned knowledge from old tasks (i.e., stability). In the scenario of online continual learning, wherein data comes strictly in a streaming manner, the plasticity of online continual learning is more vulnerable than offline continual learning because the training signal that can be obtained from a single data point is limited. To overcome the stability-plasticity dilemma in online continual learning, we propose an online continual learning framework named multi-scale feature adaptation network (MuFAN) that utilizes a richer context encoding extracted from different levels of a pre-trained network. Additionally, we introduce a novel structure-wise distillation loss and replace the commonly used batch normalization layer with a newly proposed stability-plasticity normalization module to train MuFAN that simultaneously maintains high plasticity and stability. MuFAN outperforms other state-of-the-art continual learning methods on the SVHN, CIFAR100, miniImageNet, and CORe50 datasets. Extensive experiments and ablation studies validate the significance and scalability of each proposed component: 1) multi-scale feature maps from a pre-trained encoder, 2) the structure-wise distillation loss, and 3) the stability-plasticity normalization module in MuFAN. We will release our code upon acceptance.","Continual Learning, Online Continual Learning, Catastrophic Forgetting" MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer,https://openreview.net/forum?id=dRjWsd3gwsm,https://openreview.net/pdf?id=dRjWsd3gwsm,,"The recently proposed data augmentation TransMix employs attention labels to help visual transformers (ViT) achieve better robustness and performance. However, TransMix is deficient in two aspects: 1) The image cropping method of TransMix may not be suitable for vision transformer. 2) At the early stage of training, the model produces unreliable attention maps. TransMix uses unreliable attention maps to compute mixed attention labels that can affect the model. To address the aforementioned issues, we propose MaskMix and Progressive Attention Labeling (PAL) in image and label space, respectively. In detail, from the perspective of image space, we design MaskMix, which mixes two images based on a patch-like grid mask. In particular, the size of each mask patch is adjustable and is a multiple of the image patch size, which ensures each image patch comes from only one image and contains more global contents. From the perspective of label space, we design PAL, which utilizes a progressive factor to dynamically re-weight the attention weights of the mixed attention label. Finally, we combine MaskMix and Progressive Attention Labeling as our new data augmentation method, named MixPro. The experimental results show that our method can improve various ViT-based models at scales on ImageNet classification (73.8% top-1 accuracy based on DeiT-T for 300 epochs). 
After being pre-trained with MixPro on ImageNet, the ViT-based models also demonstrate better transferability to semantic segmentation, object detection, and instance segmentation. Furthermore, compared to TransMix, MixPro also shows stronger robustness on several benchmarks.", Multiple Invertible and Equivariant Transformation for Disentanglement in VAEs,https://openreview.net/forum?id=9ts90B3xUvP,https://openreview.net/pdf?id=9ts90B3xUvP,We improve disentangled representation learning with Multiple Invertible and Equivariant transformation (MIE-transformation) in VAEs.,"Disentanglement learning is a core issue for understanding and re-using trained information in Variational AutoEncoders (VAEs), and effective inductive bias has been reported as a key factor. However, the actual implementation of such bias is still vague. In this paper, we propose a novel method, called MIE-transformation, to inject inductive bias by 1) guaranteeing the invertibility of latent-to-latent vector transformation while preserving a certain portion of equivariance of input-to-latent vector transformation, called IE-transformation, 2) extending the form of prior and posterior in VAE frameworks to an unrestricted form through a learnable conversion to an approximated exponential family, called EF-conversion, and 3) integrating multiple units of IE-transformation and EF-conversion, and their training. In experiments on 3D Cars, 3D Shapes, and dSprites datasets, MIE-transformation improves the disentanglement performance of state-of-the-art VAEs.","Variational AutoEncoder (VAE), Unsupervised Disentanglement Learning, Invertible and Equivariant function, Exponential Family" "StyleMorph: Disentangling Shape, Pose and Appearance through 3D Morphable Image and Geometry Generation ",https://openreview.net/forum?id=Ojpb1y8jflw,https://openreview.net/pdf?id=Ojpb1y8jflw,A deformable 3D-aware photorealistic image generator,"We introduce StyleMorph, a 3D generative model that relies on the 3D morphable model paradigm to disentangle shape, pose, object and scene texture for high-quality image synthesis. We represent 3D shape variability through 3D deformation fields with respect to a canonical object template. Both the deformations and the template are expressed as implicit networks and learned in an unsupervised manner only from 2D image supervision. We connect 3D morphable modelling with deferred neural rendering by performing an implicit surface rendering of “Template Object Coordinates” (TOCS), thereby constructing a purely geometric, deformation-equivariant 2D signal that reflects the compounded geometric effects of non-rigid shape, pose, and perspective projection. We use TOCS maps in tandem with object and background appearance codes to condition a StyleGAN-based deferred neural rendering (DNR) network for high-resolution image synthesis. We show competitive photorealistic image synthesis results on 4 datasets (FFHQ faces, AFHQ Cats, Dogs, Wild), while achieving the joint disentanglement of shape, pose, object and scene texture. ","3D-aware GAN, Template-based, Morphable, Disentanglement, Photorealistic, Neural Radiance Field, StyleGAN" Accelerated Riemannian Optimization: Handling Constraints to Bound Geometric Penalties,https://openreview.net/forum?id=05rBhFU3mLX,https://openreview.net/pdf?id=05rBhFU3mLX,We propose accelerated first-order methods for Riemannian optimization in Hadamard manifolds by using a proximal method that we design.
Our method does not require the undesirable assumptions made by previous accelerated works," We propose a globally-accelerated, first-order method for the optimization of smooth and (strongly or not) geodesically-convex functions in Hadamard manifolds. Our algorithm enjoys the same convergence rates as Nesterov's accelerated gradient descent, up to a multiplicative geometric penalty and log factors. Crucially, we can enforce our method to stay within a compact set we define. Prior fully accelerated works resort to assuming that the iterates of their algorithms stay in some pre-specified compact set, except for two previous methods, whose applicability is limited to local optimization and to spaces of constant curvature, respectively. Achieving global and general Riemannian acceleration without assuming the iterates stay in the feasible set was posed as an open question in (Kim & Yang, 2022), which we solve for Hadamard manifolds. In our solution, we show that we can use a linearly convergent algorithm for constrained strongly g-convex smooth problems to implement a Riemannian inexact proximal point operator that we use as a subroutine, which is of independent interest.","Riemannian optimization, geodesic convexity, first-order accelerated methods" Searching Lottery Tickets in Graph Neural Networks: A Dual Perspective,https://openreview.net/forum?id=Dvs-a3aymPe,https://openreview.net/pdf?id=Dvs-a3aymPe,This paper generalizes the Dual Lottery Ticket Hypothesis (DLTH) to graphs to address the information loss and aggregation failure issues caused by sampling-based GNN pruning algorithms,"Graph Neural Networks (GNNs) have shown great promise in various graph learning tasks. However, the computational overheads of fitting GNNs to large-scale graphs grow rapidly, posing obstacles to GNNs from scaling up to real-world applications. To tackle this issue, the Graph Lottery Ticket (GLT) hypothesis articulates that there always exists a sparse subnetwork/subgraph with admirable performance in GNNs with random initialization. Such a pair of core subgraph and sparse subnetwork (called graph lottery tickets) can be uncovered by iteratively applying a novel sparsification method. While GLT provides new insights for GNN compression, it requires a full pretraining process to obtain graph lottery tickets, which is neither universal nor friendly to real-world applications. Moreover, the graph sparsification in GLT utilizes sampling techniques, which may result in massive information loss and aggregation failure. In this paper, we explore the searching of graph lottery tickets from a complementary perspective -- transforming a random ticket into a graph lottery ticket, which allows us to more comprehensively explore the relationships between the original network/graph and their sparse counterparts. To achieve this, we propose regularization-based network pruning and hierarchical graph sparsification, leading to our Dual Graph Lottery Ticket (DGLT) framework for a joint sparsification of network and graph. Compared to GLT, our DGLT helps achieve a triple-win situation of graph lottery tickets with high sparsity, admirable performance, and good explainability. More importantly, we rigorously prove that our model can eliminate noise and maintain reliable information in substructures using the graph information bottleneck theory.
Extensive experimental results on various graph-related tasks validate the effectiveness of our framework.","Lottery Tickets Hypothesis, Dual Lottery Tickets Hypothesis, Graph pooling, Graph information bottleneck" Video Scene Graph Generation from Single-Frame Weak Supervision,https://openreview.net/forum?id=KLrGlNoxzb4,https://openreview.net/pdf?id=KLrGlNoxzb4,We propose a novel method for the weakly-supervised VidSGG task with only single-frame weak supervision.,"Video scene graph generation (VidSGG) aims to generate a sequence of graph-structure representations for the given video. However, all existing VidSGG methods are fully-supervised, i.e., they need dense and costly manual annotations. In this paper, we propose the first weakly-supervised VidSGG task with only single-frame weak supervision: SF-VidSGG. By ``weakly-supervised"", we mean that SF-VidSGG relaxes the training supervision at two different levels: 1) It only provides single-frame annotations instead of all-frame annotations. 2) The single-frame ground-truth annotation is still a weak image SGG annotation, i.e., an unlocalized scene graph. To solve this new task, we also propose a novel Pseudo Label Assignment based method, dubbed PLA. PLA is a two-stage method, which generates pseudo visual relation annotations for the given video at the first stage, and then trains a fully-supervised VidSGG model with these pseudo labels. Specifically, PLA consists of three modules: an object PLA module, a predicate PLA module, and a future predicate prediction (FPP) module. Firstly, in the object PLA, we localize all objects for every frame. Then, in the predicate PLA, we design two different teachers to assign pseudo predicate labels. Lastly, in the FPP module, we fuse these two predicate pseudo labels using the regularity of relation transitions in videos. Extensive ablations and results on the benchmark Action Genome have demonstrated the effectiveness of our PLA.","computer vision, video scene graph generation, weakly-supervised learning" Planning With Uncertainty: Deep Exploration in Model-Based Reinforcement Learning,https://openreview.net/forum?id=PcR6Lir5mxu,https://openreview.net/pdf?id=PcR6Lir5mxu,Demonstrating deep exploration with MuZero by planning optimistically with epistemic uncertainty,"Deep model-based reinforcement learning has shown super-human performance in many challenging domains. However, low sample efficiency and limited exploration remain leading obstacles in the field. In this paper, we demonstrate deep exploration in model-based RL by incorporating epistemic uncertainty into planning trees, circumventing the standard approach of propagating uncertainty through value learning. We evaluate this approach with the state-of-the-art model-based RL algorithm MuZero, and extend its training process to stabilize learning from explicitly-exploratory decisions. Our results demonstrate that planning with uncertainty is able to achieve effective deep exploration with standard uncertainty estimation mechanisms, and with it significant gains in sample efficiency.","Reinforcement learning, exploration, uncertainty, planning" Unsupervised visualization of image datasets using contrastive learning,https://openreview.net/forum?id=nI2HmVA0hvt,https://openreview.net/pdf?id=nI2HmVA0hvt,,"Visualization methods based on the nearest neighbor graph, such as t-SNE or UMAP, are widely used for visualizing high-dimensional data. Yet, these approaches only produce meaningful results if the nearest neighbors themselves are meaningful.
For images represented in pixel space this is not the case, as distances in pixel space often do not capture our sense of similarity, and neighbors are therefore not semantically close. This problem can be circumvented by self-supervised approaches based on contrastive learning, such as SimCLR, relying on data augmentation to generate implicit neighbors, but these methods do not produce two-dimensional embeddings suitable for visualization. Here, we present a new method, called t-SimCNE, for unsupervised visualization of image data. t-SimCNE combines ideas from contrastive learning and neighbor embeddings, and trains a parametric mapping from the high-dimensional pixel space into two dimensions. We show that the resulting 2D embeddings achieve classification accuracy comparable to the state-of-the-art high-dimensional SimCLR representations, thus faithfully capturing semantic relationships. Using t-SimCNE, we obtain informative visualizations of the CIFAR-10 and CIFAR-100 datasets, showing rich cluster structure and highlighting artifacts and outliers.","data visualization, contrastive learning" On the Expressive Equivalence Between Graph Convolution and Attention Models,https://openreview.net/forum?id=-kzQHkTvyMg,https://openreview.net/pdf?id=-kzQHkTvyMg,,"Graph neural networks (GNNs) have achieved remarkable successes in various graph tasks, and recent years have witnessed a flourishing growth in research regarding GNNs' expressive power. The number of linear regions generated by GNNs is a recently considered metric that quantifies GNNs' capacity. Estimates of the number of linear regions have previously been developed for deep and convolutional neural networks (DNNs and CNNs). In this paper, we compare the expressive power of the classic graph convolution network (GCN) and attention-based models in terms of their capability to generate linear regions. We show that the prediction advantage of attention models can be matched or even surpassed by enhancing GCN with refined graph Ricci curvature, resulting in the so-called high-rank graph convolution network (HRGCN). Thus, the two models are equivalent to each other in terms of expressive power. Experimental results show that the proposed HRGCN model outperforms the state-of-the-art results in various classification and prediction tasks.", Contrastive Consistent Representation Distillation,https://openreview.net/forum?id=NSMlX2F21C7,https://openreview.net/pdf?id=NSMlX2F21C7,We propose Contrastive Consistent Representation Distillation (CoCoRD) to provide consistent representations for efficient contrastive-learning-based distillation.,"The combination of knowledge distillation with contrastive learning has great potential to distill structural knowledge. Most of the contrastive-learning-based distillation methods treat the entire training dataset as the memory bank and maintain two memory banks, one for the student and one for the teacher. Besides, the representations in the two memory banks are updated in a momentum manner, leading to representation inconsistency. In this work, we propose Contrastive Consistent Representation Distillation (CoCoRD) to provide consistent representations for efficient contrastive-learning-based distillation. Instead of momentum-updating the cached representations, CoCoRD updates the encoders in a momentum manner. Specifically, the teacher is equipped with a momentum-updated projection head to generate consistent representations.
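A sketch of the momentum-encoder idea the CoCoRD entry describes: the teacher's projection head is an exponential moving average of the student's, and teacher keys land in a fixed-size FIFO queue that serves as the only memory bank. Dimensions, queue size, and momentum are assumed values.

```python
import torch

d, K = 128, 4096
student_head = torch.nn.Linear(d, d)
teacher_head = torch.nn.Linear(d, d)
teacher_head.load_state_dict(student_head.state_dict())
queue = torch.randn(K, d)                  # small fixed-size memory bank

@torch.no_grad()
def momentum_update(m=0.999):
    """EMA update of the teacher head instead of momentum-updating features."""
    for pt, ps in zip(teacher_head.parameters(), student_head.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)

@torch.no_grad()
def enqueue(keys):
    """Push new teacher keys; oldest keys fall off the end (FIFO)."""
    global queue
    queue = torch.cat([keys, queue])[:K]

momentum_update()
enqueue(teacher_head(torch.randn(256, d)))
print(queue.shape)
```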
These teacher representations are cached in a fixed-size queue, which serves as the only memory bank in CoCoRD and is significantly smaller than the entire training dataset. Additionally, a slow-moving student, implemented as a momentum-based moving average of the student, is built to facilitate contrastive learning. CoCoRD, which utilizes only one memory bank and much fewer negative keys, provides highly competitive results under typical teacher-student settings. On ImageNet, CoCoRD-distilled ResNet50 outperforms the teacher ResNet101 by 0.2% top-1 accuracy. Furthermore, in PASCAL VOC and COCO detection, the detectors whose backbones are initialized by CoCoRD-distilled models exhibit considerable performance improvements.","contrastive learning, knowledge distillation, model compression" PowerQuant: Automorphism Search for Non-Uniform Quantization,https://openreview.net/forum?id=s1KljJpAukm,https://openreview.net/pdf?id=s1KljJpAukm,,"Deep neural networks (DNNs) are nowadays ubiquitous in many domains such as computer vision. However, due to their high latency, the deployment of DNNs hinges on the development of compression techniques such as quantization, which consists in lowering the number of bits used to encode the weights and activations. Growing concerns for privacy and security have motivated the development of data-free techniques, at the expense of accuracy. In this paper, we identify the uniformity of the quantization operator as a limitation of existing approaches, and propose a data-free non-uniform method. More specifically, we argue that to be readily usable without dedicated hardware and implementation, non-uniform quantization shall not change the nature of the mathematical operations performed by the DNN. This leads us to search among the continuous automorphisms of $(\mathbb{R}_+^*,\times)$, which boils down to the power functions defined by their exponent. To find this parameter, we propose to optimize the reconstruction error of each layer: in particular, we show that this procedure is locally convex and admits a unique solution. At inference time, we show that our approach, dubbed PowerQuant, only requires simple modifications in the quantized DNN activation functions. As such, with only negligible overhead, it significantly outperforms existing methods in a variety of configurations.","deep learning, quantization, compression, acceleration, data-free" CLEEGN: A Convolutional Neural Network for Plug-and-Play Automatic EEG Reconstruction,https://openreview.net/forum?id=SWPFPk9Tm81,https://openreview.net/pdf?id=SWPFPk9Tm81,A novel CNN model for training-free online EEG reconstruction with SOTA performance.,"Human electroencephalography (EEG) is a brain monitoring modality that senses cortical neuroelectrophysiological activity with high temporal resolution. One of the greatest challenges posed in applications of EEG is the unstable signal quality, susceptible to inevitable artifacts during recordings. To date, most existing techniques for EEG artifact removal and reconstruction are applicable to offline analysis solely, or require individualized training data to facilitate online reconstruction. We have proposed CLEEGN, a novel convolutional neural network for plug-and-play automatic EEG reconstruction. CLEEGN is based on a subject-independent pre-trained model using existing data and can operate on a new user without any further calibration.
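The PowerQuant entry above reduces non-uniform quantization to a search over power functions. A sketch of that quantizer under assumed exponents: values are uniformly quantized in the |x|^a domain and mapped back with exponent 1/a, yielding a non-uniform dequantized grid; the per-layer exponent would be fitted by minimizing this reconstruction error.

```python
import numpy as np

def power_quant(x, a=0.5, bits=4):
    """Quantize under the automorphism x -> x**a of (R+*, x)."""
    s = np.sign(x)
    y = np.abs(x) ** a
    step = y.max() / (2 ** (bits - 1) - 1)
    q = np.round(y / step)                    # uniform in the power domain
    return s * (q * step) ** (1 / a)          # non-uniform after mapping back

w = np.random.randn(1000)
for a in (1.0, 0.5, 0.25):                    # a=1.0 recovers uniform quant
    err = np.linalg.norm(w - power_quant(w, a)) / np.linalg.norm(w)
    print(a, round(err, 4))
```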
The performance of CLEEGN was validated using multiple evaluations including waveform observation, reconstruction error assessment, and decoding accuracy on well-studied labeled datasets. The results of simulated online validation suggest that, even without any calibration, CLEEGN can largely preserve inherent brain activity and outperforms leading online/offline artifact removal methods in the decoding accuracy of reconstructed EEG data. In addition, visualization of model parameters and latent features exhibits the model behavior and reveals explainable insights related to existing knowledge of neuroscience. We foresee pervasive applications of CLEEGN in prospective works of online plug-and-play EEG decoding and analysis.","EEG, Brain-computer interface, EEG artifact removal, convolutional neural network" Neural Layered Min-sum Decoders for Algebraic Codes,https://openreview.net/forum?id=B7gBcrKQCl4,https://openreview.net/pdf?id=B7gBcrKQCl4,A neural min-sum decoder based on the layered min-sum algorithm with reduced weights and better error rates.,"In this article, we propose low-complexity neural network decoders based on the layered min-sum algorithm to decode binary algebraic codes. By generalizing the layered min-sum algorithm to its neural network counterpart, the number of iterations required for convergence is reduced. Consequently, the number of network weights decreases while retaining good error correction performance. The Bose-Chaudhuri-Hocquenghem (BCH) codes and quadratic residue (QR) codes are selected as two exemplary binary algebraic codes. Simulation results show that the proposed decoders achieve superior performance with lower computational complexity, compared with the decoders proposed by Chen & Ye (2021). Further, a neural decoder incorporating the modified random redundant decoding (mRRD) algorithm is investigated to approach the performance of maximum-likelihood (ML) decoding for some short codes.",Error correction code Deep Gaussian Process State-Space Model for Motion Generation via Stochastic Expectation Propagation,https://openreview.net/forum?id=gB-WcoUyyTN,https://openreview.net/pdf?id=gB-WcoUyyTN,,"Gaussian Processes (GPs) and related unsupervised learning techniques such as Gaussian process dynamical models (GP-DMs) have been very successful in the accurate modeling of high-dimensional data based on limited amounts of training data. Usually, these techniques have the disadvantage of high computational complexity. This makes it difficult to solve the associated learning problems for large data sets, since the related computations, as opposed to neural networks, are not node-local. Combining sparse approximation techniques for GPs and stochastic expectation propagation (SEP), we present a framework for the computationally efficient implementation of deep Gaussian process (state-space) models. We provide implementations of this approach on the GPU as well as on the CPU, present the first implementation of such deep GP-SSMs, and demonstrate the computational efficiency of our GPU implementation.","Deep GP-SSM, probabilistic model, dimension reduction, motion synthesis, Expectation Propagation" On Uni-modal Feature Learning in Multi-modal Learning,https://openreview.net/forum?id=mb7VM83DkyC,https://openreview.net/pdf?id=mb7VM83DkyC,,"We abstract the features of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interaction.
Multi-modal joint training is expected to benefit from cross-modal interaction on the basis of ensuring uni-modal feature learning. However, recent late-fusion training approaches still suffer from insufficient learning of uni-modal features on each modality, and we prove that this phenomenon does hurt the model's generalization ability. Given a multi-modal task, we propose to choose a targeted late-fusion learning method from Uni-Modal Ensemble (UME) and the proposed Uni-Modal Teacher (UMT), according to the distribution of uni-modal and paired features. We demonstrate that, under a simple guiding strategy, we can achieve comparable results to other complex late-fusion or intermediate-fusion methods on multi-modal datasets, including VGG-Sound, Kinetics-400, UCF101, and ModelNet40.",Supervised Multi-modal Late-fusion Learning Unified neural representation model for physical and conceptual spaces,https://openreview.net/forum?id=3_NvTLGjDKy,https://openreview.net/pdf?id=3_NvTLGjDKy,A single model explains how grid-like and concept-specific representations emerge and function in the entorhinal cortex.,"The spatial processing system of the brain uses grid-like neural representations (grid cells) for supporting vector-based navigation. Experiments also suggest that neural representations for concepts (concept cells) exist in the human brain, and conceptual inference relies on navigation in conceptual spaces. We propose a unified model called ``disentangled successor information (DSI)'' that explains neural representations for both physical and conceptual spaces. DSI generates grid-like representations in a 2-dimensional space that highly resemble those observed in the brain. Moreover, the same model creates concept-specific representations from linguistic inputs, corresponding to concept cells. Mathematically, DSI vectors approximate value functions for navigation and word vectors obtained by word embedding methods, thus enabling both spatial navigation and conceptual inference based on vector-based calculation. Our results suggest that a single principle can explain the computation of physical and conceptual spaces in the human brain.","Neuroscience, Grid cell, Concept cell, Spatial navigation, Reinforcement learning, Word embedding" Symbolic Physics Learner: Discovering governing equations via Monte Carlo tree search,https://openreview.net/forum?id=ZTK3SefE8_Z,https://openreview.net/pdf?id=ZTK3SefE8_Z,Proposed a novel Symbolic Physics Learner (SPL) machine to discover the mathematical structure of nonlinear dynamics based on limited measurement data.,"Nonlinear dynamics is ubiquitous in nature and commonly seen in various science and engineering disciplines. Distilling analytical expressions that govern nonlinear dynamics from limited data remains vital but challenging. To tackle this fundamental issue, we propose a novel Symbolic Physics Learner (SPL) machine to discover the mathematical structure of nonlinear dynamics. The key concept is to interpret mathematical operations and system state variables by computational rules and symbols, establish symbolic reasoning of mathematical formulas via expression trees, and employ a Monte Carlo tree search (MCTS) agent to explore optimal expression trees based on measurement data. The MCTS agent obtains an optimistic selection policy through the traversal of expression trees, featuring the one that maps to the arithmetic expression of underlying physics.
Salient features of the proposed framework include search flexibility and enforcement of parsimony for discovered equations. The efficacy and superiority of the SPL machine are demonstrated by numerical examples, compared with state-of-the-art baselines.","symbolic regression, Monte Carlo tree search, governing equations, nonlinear dynamics" The Dynamic of Consensus in Deep Networks and the Identification of Noisy Labels,https://openreview.net/forum?id=-AdWUM183OU,https://openreview.net/pdf?id=-AdWUM183OU,"We propose a new way to detect label noise through the lens of model disagreement, and describe a method that improves the SOTA in supervised learning with noisy labels.","Deep neural networks have incredible capacity and expressibility, and can seemingly memorize any training set. This introduces a problem when training in the presence of noisy labels, as the noisy examples cannot be distinguished from clean examples by the end of training. Recent research has dealt with this challenge by utilizing the fact that deep networks seem to memorize clean examples much earlier than noisy examples. Here we report a new empirical result: for each example, when looking at the time it has been memorized by each model in an ensemble of networks, the diversity seen in noisy examples is much larger than in clean examples. We use this observation to develop a new method for noisy label filtration. The method is based on a statistic of the data, which captures the differences in ensemble learning dynamics between clean and noisy data. We test our method on three tasks: (i) noise amount estimation; (ii) noise filtration; (iii) supervised classification. We show that our method improves over existing baselines in all three tasks using a variety of datasets, noise models, and noise levels. Aside from its improved performance, our method has two other advantages. (i) Simplicity, which implies that no additional hyperparameters are introduced. (ii) Our method is modular: it does not work in an end-to-end fashion, and can therefore be used to clean a dataset for any other future usage.","Noisy Labels, Training Dynamics, Label Noise" Efficient block contrastive learning via parameter-free meta-node approximation,https://openreview.net/forum?id=hTCBqt7pgxf,https://openreview.net/pdf?id=hTCBqt7pgxf,"A simple block contrastive loss approximation technique to efficiently contrast all negative samples, in linear cluster time, at graph level","Contrastive learning has recently achieved remarkable success in many domains including graphs. However, contrastive loss, especially for graphs, requires a large number of negative samples, which is unscalable and computationally prohibitive with a quadratic time complexity. Sub-sampling is not optimal, and incorrect negative sampling leads to sampling bias. In this work, we propose a meta-node based approximation technique that can (a) proxy all negative combinations (b) in quadratic cluster size time complexity, (c) at graph level, not node level, and (d) exploit graph sparsity. By replacing node-pairs with additive cluster-pairs, we compute the negatives in cluster-time at graph level. The resulting Proxy approximated meta-node Contrastive (PamC) loss, based on simple optimized GPU operations, captures the full set of negatives, yet is efficient with a linear time complexity. By avoiding sampling, we effectively eliminate sample bias. We meet the criterion for a larger number of samples, thus achieving block-contrastiveness, which is proven to outperform pair-wise losses.
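The meta-node approximation in the PamC entry admits a short sketch: soft cluster assignments aggregate the n node embeddings into k meta-nodes, and the negative term then runs over the k x k cluster pairs instead of the n x n node pairs. Sizes and temperature are assumed values, and the proxy term stands in for the paper's full loss.

```python
import torch

n, d, k = 5000, 64, 16
Z = torch.nn.functional.normalize(torch.randn(n, d), dim=-1)  # node embeddings
S = torch.softmax(torch.randn(n, k), dim=-1)   # learnt soft cluster assignments

M = S.T @ Z                                    # k additive meta-nodes
neg = torch.exp(M @ M.T / 0.5).sum()           # k*k proxy for n*n negatives
print(M.shape, neg.item())
```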
We use learnt soft cluster assignments for the meta-node construction, and avoid possible heterophily and noise added during edge creation. Theoretically, we show that real-world graphs easily satisfy the conditions necessary for our approximation. Empirically, we show promising accuracy gains over state-of-the-art graph clustering on 6 benchmarks. Importantly, we gain substantially in efficiency: up to 3x in training time, 1.8x in inference time and over 5x in GPU memory reduction. ","Contrastive, approximation, efficient, parameter-free, block, theory" Attribute Alignment and Enhancement for Generalized Zero-Shot Learning,https://openreview.net/forum?id=arg1dQSS6Mh,https://openreview.net/pdf?id=arg1dQSS6Mh,,"Generalized zero-shot learning (GZSL) aims to recognize both seen and unseen classes, which challenges the generalization ability of a model. In this paper, we propose a novel approach to fully utilize attribute information, referred to as the attribute alignment and enhancement (A3E) network. It contains two modules. First, the attribute localization (AL) module utilizes the supervision of class attribute vectors to guide visual localization for attributes through the implicit localization capability within the feature extractor, and the visual features corresponding to the attributes (attribute-visual features) are obtained. Second, the enhanced attribute scoring (EAS) module employs the supervision of the attribute word vectors (attribute semantics) to project input attribute-visual features to the attribute semantic space using a Graph Attention Network (GAT). Based on the constructed attribute relation graph (ARG), the EAS module generates enhanced representations of attributes. Experiments on standard datasets demonstrate that the enhanced attribute representation greatly improves the classification performance, which helps A3E achieve state-of-the-art performance in both ZSL and GZSL tasks.","zero-shot learning, image classification, attribute alignment, graph neural network, attention network" BAYES RISK CTC: CONTROLLABLE CTC ALIGNMENT IN SEQUENCE-TO-SEQUENCE TASKS,https://openreview.net/forum?id=Bd7GueaTxUz,https://openreview.net/pdf?id=Bd7GueaTxUz,A Bayes risk function is applied to each CTC path to express the preference for selected paths and achieve controllable CTC alignment prediction,"Sequence-to-Sequence (seq2seq) tasks transcribe the input sequence to a target sequence. The Connectionist Temporal Classification (CTC) criterion is widely used in multiple seq2seq tasks. Besides predicting the target sequence, a side product of CTC is to predict the alignment, which is the most probable input-long sequence that specifies a hard aligning relationship between the input and target units. As there are multiple potential aligning sequences (called paths) that are considered equally in the CTC formulation, the choice of which path will be the most probable one and become the predicted alignment is always uncertain. In addition, it is usually observed that the alignment predicted by vanilla CTC will drift compared with its reference and rarely provides practical functionalities. Thus, the motivation of this work is to make the CTC alignment prediction controllable and thus equip CTC with extra functionalities. The Bayes risk CTC (BRCTC) criterion is then proposed in this work, in which a customizable Bayes risk function is adopted to enforce the desired characteristics of the predicted alignment.
With the risk function, BRCTC provides a general framework for adopting customizable preferences over the paths in order to concentrate the posterior on a particular subset of them. In applications, we explore one particular preference which yields models with the down-sampling ability and reduced inference costs. By using BRCTC with another preference for early emissions, we obtain an improved performance-latency trade-off for online models. Experimentally, the proposed BRCTC reduces the inference cost of offline models by up to 47% without performance degradation and cuts down the overall latency of online systems to an unseen level.","CTC, alignment, sequence-to-sequence, speech recognition" A Convergent Single-Loop Algorithm for Gromov-Wasserstein in Graph Data ,https://openreview.net/forum?id=0jxPyVWmiiF,https://openreview.net/pdf?id=0jxPyVWmiiF,We propose the first provable single-loop algorithm for computing the Gromov-Wasserstein (GW) distance.,"We propose the first provable single-loop algorithm for computing the Gromov-Wasserstein (GW) distance. We call it the Bregman Alternating Projected Gradient (BAPG) method. Our analysis is based on the surprising observation that the GW problem satisfies a so-called Luo-Tseng error bound condition~\citep{luo1992error}, which relates to estimating the distance of a point to the critical point set of the GW problem based on the optimality residual. Armed with this error-bound condition, we are able to show that BAPG converges to a critical point of the GW problem asymptotically (as the step size goes to infinity), and further give an approximation bound for the distance between the fixed-point set of BAPG and the critical point set of GW explicitly. We conduct comprehensive numerical experiments to validate the effectiveness of BAPG on graph alignment and partition tasks. In terms of both wall-clock time and quality of solutions, the proposed method achieves state-of-the-art results. ","Gromov-Wasserstein, Graph Learning, Optimization" The Importance of Suppressing Complete Reconstruction in Autoencoders for Unsupervised Outlier Detection,https://openreview.net/forum?id=dj_U5MZDia6,https://openreview.net/pdf?id=dj_U5MZDia6,,"Autoencoders are widely used in outlier detection due to their superiority in handling high-dimensional and nonlinear datasets. The reconstruction of any dataset by the autoencoder can be considered a complex regression process. In regression analysis, outliers can usually be divided into high leverage points and influential points. Although the autoencoder has shown good results for the identification of influential points, there are still some problems when detecting high leverage points. Through theoretical derivation, we found that most outliers are detected in the direction corresponding to the worst-recovered principal component, but in the direction of the well-recovered principal components, the anomalies are often ignored. We propose a new loss function which solves the above deficiencies in outlier detection. The core idea of our scheme is that in order to better detect high leverage points, we should suppress the complete reconstruction of the dataset to convert high leverage points into influential points, and it is also necessary to ensure that the differences between the eigenvalues of the covariance matrix of the original dataset and their corresponding reconstructed results in the direction of each principal component are equal.
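A hedged sketch of the eigenvalue condition above, under our own simplifying assumptions (illustrative, not the authors' code): the reconstruction is projected onto the input's principal directions, and the per-direction eigenvalue gaps are penalized for deviating from their mean, so equal gaps incur zero penalty.

```python
# Sketch of an "equal eigenvalue gap" regularizer for autoencoder training.
import torch

def eigengap_regularizer(x, x_hat):
    """x, x_hat: (N, D) batches of inputs and reconstructions."""
    cov = torch.cov(x.t())
    ev, V = torch.linalg.eigh(cov)                 # input principal directions
    var_hat = ((x_hat - x_hat.mean(0)) @ V).var(dim=0, unbiased=False)
    gaps = ev - var_hat                            # per-direction eigenvalue gap
    return ((gaps - gaps.mean()) ** 2).mean()      # equal gaps => zero penalty

x = torch.randn(256, 8)
x_hat = x + 0.1 * torch.randn_like(x)              # stand-in for an AE output
loss = ((x - x_hat) ** 2).mean() + eigengap_regularizer(x, x_hat)
print(loss)
```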
Moreover, we justify the rationality of our scheme through rigorous theoretical derivation. Finally, our experiments on multiple datasets confirm that our scheme significantly improves the accuracy of outlier detection.", FrAug: Frequency Domain Augmentation for Time Series Forecasting,https://openreview.net/forum?id=j83rZLZgYBv,https://openreview.net/pdf?id=j83rZLZgYBv,A frequency domain data augmentation technique for time-series forecasting task,"Data augmentation (DA) has become a de facto solution to expand training data size for deep learning. With the proliferation of deep models for time series analysis, various time series DA techniques are proposed in the literature, e.g., cropping-, warping-, flipping-, and mixup-based methods. However, these augmentation methods are mainly applicable to time series classification and anomaly detection tasks. In time series forecasting (TSF), we need to model the fine-grained temporal relationship within time series segments so that we can generate faithful forecasting results given data in a look-back window. Existing DA solutions in the time domain would break this relationship, leading to poor forecasting accuracy. To tackle this problem, this paper proposes simple yet effective frequency domain augmentation techniques that ensure the semantic consistency of augmented data-label pairs in forecasting, named FrAug. We conduct comprehensive experiments on eight widely-used benchmarks with several state-of-the-art TSF deep models. Our results show that FrAug can boost the forecasting accuracy of existing models in most cases. Moreover, we show that FrAug enables models trained with 1\% of the original training data to achieve similar performance to the ones trained on full training data, which is particularly attractive for cold-start forecasting, which often occurs in real-life applications. ","Time series forecasting, Data augmentation, Few shot learning" A Hierarchical Hyper-rectangle Mass Model for Fine-grained Entity Typing,https://openreview.net/forum?id=jotL-ImpbF,https://openreview.net/pdf?id=jotL-ImpbF,,"Fine-grained entity typing is the task of detecting types of entities inside a given language text. Entity typing models typically transform entities into vectors in high-dimensional space, hyperbolic space, or add additional context information. However, such spaces or feature transformations are not compatible with modeling types' inter-dependencies and diverse scenarios. We study the ability of the hierarchical hyper-rectangle mass model (hRMM), which represents mentions and types as hyper-rectangle masses (hRM) and thus captures the relationships of the ontology in a geometric mass view. Natural language contexts are fed into the encoder and then projected to hyper-rectangle mass embeddings (hRME). We find that hRM perfectly depicts features of mentions and types. With hypervolume indicators and adaptive thresholds, performance improves further.
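The hyper-rectangle view can be illustrated with a generic box-embedding sketch (our simplification, not the hRMM code): a mention box is scored against a type box by the fraction of its volume contained in the type's box.

```python
# Generic box-embedding containment score; names are illustrative only.
import torch

def box_volume(lo, hi, eps=1e-9):
    return torch.clamp(hi - lo, min=eps).prod(dim=-1)

def containment_score(m_lo, m_hi, t_lo, t_hi):
    i_lo = torch.maximum(m_lo, t_lo)              # intersection rectangle
    i_hi = torch.minimum(m_hi, t_hi)
    return box_volume(i_lo, i_hi) / box_volume(m_lo, m_hi)

mention = (torch.tensor([0.2, 0.2]), torch.tensor([0.4, 0.4]))
person  = (torch.tensor([0.0, 0.0]), torch.tensor([1.0, 0.5]))
print(containment_score(*mention, *person))       # ~1.0: mention inside type box
```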
Experiments show that our approach achieves better performance on several entity typing benchmarks and attains state-of-the-art results on two benchmark datasets.","entity typing, hierarchical classification, hRMM, geometric embedding" Bayesian semi-supervised learning with a principled likelihood from a generative model of data curation,https://openreview.net/forum?id=zOHQGKO3WGY,https://openreview.net/pdf?id=zOHQGKO3WGY,"We develop Bayesian semi-supervised learning, by showing that standard SSL objectives can be understood as lower bounds on a principled log-likelihood","We currently do not have an understanding of semi-supervised learning (SSL) objectives such as pseudo-labelling and entropy minimization as log-likelihoods, which precludes the development of e.g. Bayesian SSL. Here, we note that benchmark image datasets such as CIFAR-10 are carefully curated, and we formulate SSL objectives as a log-likelihood in a generative model of data curation. We show that SSL objectives, from entropy minimization and pseudo-labelling, to state-of-the-art techniques similar to FixMatch can be understood as lower-bounds on our principled log-likelihood. We are thus able to introduce Bayesian SSL, which gives considerable improvements over standard SSL in the setting of 40 labelled points on CIFAR-10, with performance of $92.2\pm 0.3\%$ vs $88.6\%$ in the original FixMatch paper. Finally, our theory suggests that SSL is effective in part due to the statistical patterns induced by data curation. This provides an explanation of past results which show SSL performs better on clean datasets without any ``out of distribution'' examples. Confirming these results, we find that SSL gave much larger performance improvements on curated than on uncurated data, using matched curated and uncurated datasets based on Galaxy Zoo 2.","Bayesian deep learning, Bayesian neural networks, principled likelihoods" Continual Learning via Adaptive Neuron Selection,https://openreview.net/forum?id=HiuupcGa-0g,https://openreview.net/pdf?id=HiuupcGa-0g,This paper presents a novel continual learning solution with adaptive neuron selection.,"Continual learning (CL) aims at learning a sequence of tasks without losing previously acquired knowledge. Early efforts have achieved promising results in overcoming the catastrophic forgetting (CF) problem. As a consequence, contemporary studies turn to investigate whether learning a sequence of tasks can be facilitated from the perspective of knowledge consolidation. However, existing solutions either still confront severe forgetting issues or share narrow knowledge between the new and previous tasks. This paper presents a novel Continual Learning solution with Adaptive Neuron Selection (CLANS), which treats the used neurons in earlier tasks as a knowledge pool and makes it scalable via reinforcement learning with a small margin. Subsequently, the adaptive neuron selection enables knowledge consolidation for both old and new tasks in addition to overcoming the CF problem. The experimental results conducted on four datasets widely used in CL evaluations demonstrate that CLANS outperforms the state-of-the-art baselines. ","continual learning, knowledge transfer, neural network, neuron selection, deep learning" Revisiting Fast Adversarial Training,https://openreview.net/forum?id=yf8TiD7HpAN,https://openreview.net/pdf?id=yf8TiD7HpAN,,"Fast Adversarial Training (FAT) not only improves the model robustness but also reduces the training cost of standard adversarial training.
However, FAT often suffers from Catastrophic Overfitting (CO), which results in poor robustness performance. CO describes the phenomenon that model robust accuracy can decrease dramatically and suddenly during the training of FAT. Many effective techniques have been developed to prevent CO and improve the model robustness from different perspectives. However, these techniques adopt inconsistent training settings and require different training costs, i.e., training time and memory costs, resulting in an unfair comparison. In this paper, we first conduct a comprehensive study of more than 10 FAT methods in terms of adversarial robustness and training costs. We revisit the effectiveness and efficiency of FAT techniques in preventing CO from the perspective of model local nonlinearity and propose an effective Lipschitz regularization method for FAT. Furthermore, we explore the effect of data augmentation and weight averaging in FAT and propose a simple yet effective auto weight averaging method to improve robustness further. By assembling these techniques, we propose an FGSM-based fast adversarial training method equipped with Lipschitz regularization and Auto Weight averaging, abbreviated as FGSM-LAW. Experimental evaluations on four benchmark databases demonstrate the superiority of the proposed method.","adversarial training,model robustness, adversarial examples" Ti-MAE: Self-Supervised Masked Time Series Autoencoders,https://openreview.net/forum?id=9AuIMiZhkL2,https://openreview.net/pdf?id=9AuIMiZhkL2,Self-Supervised Masked Time Series Autoencoders for time series forecasting and classification,"Multivariate Time Series forecasting has been an increasingly popular topic in various applications and scenarios. Recently, contrastive learning and Transformer-based models have achieved good performance in many long-term series forecasting tasks. However, there are still several issues in existing methods. First, the training paradigm of contrastive learning and downstream prediction tasks are inconsistent, leading to unsatisfactory prediction accuracy. Second, existing Transformer-based models, which learn similar patterns in historical time series data to predict future values, always induce severe distribution shift problems and do not fully leverage the sequence information compared to self-supervised methods. To address these issues, we propose a novel framework named Ti-MAE, in which the input time series are assumed to follow an integrated distribution. In detail, Ti-MAE randomly masks out embedded time series data and learns an autoencoder to reconstruct them at the point level. Ti-MAE adopts mask modeling as the auxiliary task rather than contrastive learning and bridges the connection between existing representation learning and generative Transformer-based methods, reducing the difference between upstream and downstream forecasting tasks while maintaining the utilization of original time series data. Experiments on several public real-world datasets demonstrate that our framework of masked autoencoding could learn strong representations directly from the raw data, yielding better performance in time series forecasting and classification tasks.
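A minimal sketch of point-level masked autoencoding for time series, in the spirit of the description above (illustrative only; unlike Ti-MAE, this toy encoder sees mask tokens instead of dropping masked points):

```python
# Toy masked time-series autoencoder: mask random time steps, reconstruct them.
import torch
import torch.nn as nn

class TinyMaskedTSAutoencoder(nn.Module):
    def __init__(self, d_model=64, n_vars=7):
        super().__init__()
        self.embed = nn.Linear(n_vars, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(d_model, n_vars)
        self.mask_token = nn.Parameter(torch.zeros(d_model))

    def forward(self, x, mask_ratio=0.75):
        tok = self.embed(x)                                    # (B, T, d)
        mask = torch.rand(x.shape[:2], device=x.device) < mask_ratio
        tok[mask] = self.mask_token                            # hide masked steps
        rec = self.head(self.encoder(tok))                     # reconstruct series
        return ((rec - x) ** 2)[mask].mean()                   # loss on masked only

x = torch.randn(8, 96, 7)                                      # batch of series
print(TinyMaskedTSAutoencoder()(x))
```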
The code will be made public after this paper is accepted.","Time-Series, Autoencoders, Representation Learning, Self-Supervised Learning" E3Bind: An End-to-End Equivariant Network for Protein-Ligand Docking,https://openreview.net/forum?id=sO1QiAftQFv,https://openreview.net/pdf?id=sO1QiAftQFv,An end-to-end equivariant framework for protein-ligand docking through iterative coordinate refinement with careful consideration of the geometric constraints in docking and the local context of the binding site.,"In silico prediction of the ligand binding pose to a given protein target is a crucial but challenging task in drug discovery. This work focuses on blind flexible self-docking, where we aim to predict the positions, orientations and conformations of docked molecules. Traditional physics-based methods usually suffer from inaccurate scoring functions and high inference cost. Recently, data-driven methods based on deep learning techniques are attracting growing interest thanks to their efficiency during inference and promising performance. These methods usually either adopt a two-stage approach by first predicting the distances between proteins and ligands and then generating the final coordinates based on the predicted distances, or directly predict the global roto-translation of ligands. In this paper, we take a different route. Inspired by the resounding success of AlphaFold2 for protein structure prediction, we propose E3Bind, an end-to-end equivariant network that iteratively updates the ligand pose. E3Bind models the protein-ligand interaction through careful consideration of the geometric constraints in docking and the local context of the binding site. Experiments on standard benchmark datasets demonstrate the superior performance of our end-to-end trainable model compared to traditional and recently-proposed deep learning methods.","protein-ligand docking, end-to-end training, iterative refinement framework, geometric deep learning" Deep High-Frequency Extrapolation for Neuronal Spike Restoration,https://openreview.net/forum?id=ZpiJj-PliJs,https://openreview.net/pdf?id=ZpiJj-PliJs,Transformer reconstruction of high-frequency and high-resolution spikes from low-passed neuronal signals,"Recording neuronal activity using multiple electrodes has been widely used for studying functional mechanisms of the brain. However, handling massive amounts of data is still a challenge. In this paper, we propose a novel strategy to restore high-frequency neuronal spikes from small-volume and low-frequency band signals. Inspired by the fact that high-frequency extrapolation is equivalent to super-resolution problems in 2D signals, we applied a Swin transformer to extrapolate high-frequency information from downsampled neuronal signals both in vitro and in vivo. We found that aliasing components of input signals and the spike jittering-based selection of the training batch improved the performance of reconstructing accurate neuronal spikes. As a result, we observed reasonably restored neuronal spiking activity, including the spike timing, waveforms, and network connectivity, even with the $\times 8$ subsampled dataset. 
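Framing spike restoration as 1-D super-resolution can be sketched as follows, with jittered crop positions so spikes do not always fall at the same phase; this is our own illustration of the training-pair setup, not the paper's pipeline:

```python
# Build (low-rate input, full-rate target) pairs from a raw trace.
import numpy as np

def make_pairs(trace, win=2048, factor=8, n=32, rng=np.random.default_rng(0)):
    pairs = []
    for _ in range(n):
        start = rng.integers(0, len(trace) - win)      # jittered crop position
        hi = trace[start:start + win]                  # full-rate target
        lo = hi[::factor]                              # x8 subsampled input
        pairs.append((lo.astype(np.float32), hi.astype(np.float32)))
    return pairs

trace = np.random.randn(100_000)                       # stand-in recording
lo, hi = make_pairs(trace)[0]
print(lo.shape, hi.shape)                              # (256,) (2048,)
```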
","implicit neural representation, neuronal spike reconstruction, super resolution, high frequency extrapolation, in vitro neuronal culture, in vivo neural activity" Improving Model Consistency of Decentralized Federated Learning via Sharpness Aware Minimization and Multiple Gossip Approaches,https://openreview.net/forum?id=NhR0jUSuelq,https://openreview.net/pdf?id=NhR0jUSuelq,,"To mitigate the privacy leakages and reduce the communication burden of Federated Learning (FL), decentralized FL (DFL) discards the central server and each client only communicates with its neighbors in the decentralized communication network. However, existing DFL algorithms tend to feature high inconsistency among local models, which results in severe distribution shifts across clients and inferior performance compared with centralized FL (CFL), especially on heterogeneous data or with sparse connectivity of communication topology. To alleviate this challenge, we propose two DFL algorithms named DFedSAM and DFedSAM-MGS to improve the performance. Specifically, DFedSAM leverages gradient perturbation to generate local flatness models via Sharpness Aware Minimization (SAM), which searches for model parameters with uniformly low loss function values. In addition, DFedSAM-MGS further boosts DFedSAM by adopting the technique of Multiple Gossip Steps (MGS) for a better model consistency, which accelerates the aggregation of local flatness models and better balances the communication complexity and learning performance. In the theoretical perspective, we present the improved convergence rates $\small \mathcal{O}\big(\frac{1}{T}+\frac{1}{T^2(1-\lambda)^2}\big)$ and $\small \mathcal{O}\big(\frac{1}{T}+\frac{\lambda^Q+1}{T^2(1-\lambda^Q)^2}\big)$ in the stochastic non-convex setting for DFedSAM and DFedSAM-MGS, respectively, where $1-\lambda$ is the spectral gap of the gossip matrix $W$ and $Q$ is the gossip steps in MGS. Meanwhile, we empirically confirm that our methods can achieve competitive performance compared with CFL baselines and outperform existing DFL baselines. ", VA-DepthNet: A Variational Approach to Single Image Depth Prediction,https://openreview.net/forum?id=xjxUjHa_Wpa,https://openreview.net/pdf?id=xjxUjHa_Wpa,,"We introduce VA-DepthNet, a simple, effective, and accurate deep neural network approach for a single image depth prediction (SIDP) problem. The proposed approach advocate using classical first-order variational constraints for this problem. While state-of-the-art deep neural network methods for SIDP learn the scene depth from images in a supervised setting, they often overlook the invaluable invariances and priors in the rigid scene space, such as the regularity of the scene. The paper's main contribution is to reveal the benefit of classical and well-founded variational constraints in the neural network design for the SIDP task. It is shown that imposing first-order variational constraints in the scene space together with popular encoder-decoder-based network architecture design provides excellent results for the supervised SIDP task. The imposed first-order variational constraint makes the network aware of the depth gradient in the scene space, i.e., regularity. The paper demonstrates the usefulness of the proposed approach via extensive evaluation and ablation analysis over several benchmark datasets, such as KITTI, NYU Depth V2, and SUN RGB-D. 
The VA-DepthNet at test time shows considerable improvements in depth prediction accuracy compared to the prior art and is also accurate at high-frequency regions in the scene space. At the time of submission, our method, labeled as VA-DepthNet, ranked second on the leaderboard of the official KITTI depth-prediction evaluation set, and its accuracy is the best among published methods.","Single Image Depth Estimation, Variational Approach." Prompt-to-Prompt Image Editing with Cross-Attention Control,https://openreview.net/forum?id=_CDixzkzeyb,https://openreview.net/pdf?id=_CDixzkzeyb,,"Recent large-scale text-driven synthesis diffusion models have attracted much attention thanks to their remarkable capabilities of generating highly diverse images that follow given text prompts. Therefore, it is only natural to build upon these synthesis models to provide text-driven image editing capabilities. However, editing is challenging for these generative models, since an innate property of an editing technique is to preserve some content from the original image, while in the text-based models, even a small modification of the text prompt often leads to a completely different outcome. State-of-the-art methods mitigate this by requiring the users to provide a spatial mask to localize the edit, hence, ignoring the original structure and content within the masked region. In this paper, we pursue an intuitive prompt-to-prompt editing framework, where the edits are controlled by text only. We analyze a text-conditioned model in depth and observe that the cross-attention layers are the key to controlling the relation between the spatial layout of the image and each word in the prompt. With this observation, we propose to control the attention maps along the diffusion process. Our approach enables us to monitor the synthesis process by editing the textual prompt only, paving the way to a myriad of caption-based editing applications such as localized editing by replacing a word, global editing by adding a specification, and even controlling the extent to which a word is reflected in the image. We present our results over diverse images and prompts with different text-to-image models, demonstrating high-quality synthesis and fidelity to the edited prompts.","Image generation, Image editing, Diffusion models, Attention layer, Computer vision, Machine learning" ExtraMix: Extrapolatable Data Augmentation for Regression using Generative Models,https://openreview.net/forum?id=NgEuFT-SIgI,https://openreview.net/pdf?id=NgEuFT-SIgI,"We introduce a new data augmentation method of non-Euclidean data for regression tasks. This method exploits a mixup concept for generating extrapolated samples. Our method can not only generate reliable pseudo-labels, but also improve predictors.","The primary objective of material science is the discovery of novel materials. Because an unseen region can have a high probability of target materials (molecules), high predictive accuracy in out-of-distribution and few-shot regions is essential. However, limited data are available in material science because of high labeling costs. To overcome these difficulties, numerous techniques have been proposed for image and text domains. However, applying these techniques to material data is difficult because the data consists of combinatorial (non-Euclidean) input and continuous labels.
In particular, in mixup-based methods, mixed labels are clustered in the middle range of the training set, which renders structured samples invalid. In this study, a novel data augmentation method is proposed for non-Euclidean input with regression tasks. (1) A mixup technique capable of extrapolation is defined to broaden not only the structure but also the label distribution. In contrast to existing mixup-based methods, the proposed method minimizes label imbalance. (2) The proposed method optimizes the pseudo-labels from the mixup-based approaches using the decoder's knowledge of generative models. We show that the proposed method generates high-quality pseudo data for the ZINC database. Furthermore, the phosphorescent organic light-emitting diode problem was used to demonstrate that the method is effective in real problems with large and highly complex properties. Moreover, this method can improve property prediction models.","mixup, out-of-distribution, optimization, generative models, molecule" Exact Group Fairness Regularization via Classwise Robust Optimization,https://openreview.net/forum?id=Q-WfHzmiG9m,https://openreview.net/pdf?id=Q-WfHzmiG9m,,"Existing group fairness-aware training methods typically employ some heuristics, such as re-weighting underrepresented groups based on some rules or using approximated surrogates for the exact fairness metrics as regularization terms, which result in models with sub-optimal accuracy-fairness trade-offs. The reason for using such heuristics is that the fairness metrics are usually non-differentiable or non-convex, and exactly incorporating those metrics in a tractable learning objective is challenging. To that end, we propose a principled method that indeed can incorporate an $\textit{exact}$ form of a well-justified group fairness metric, Difference of Conditional Accuracy (DCA), as a regularizer using a $\textit{classwise}$ distributionally robust optimization (DRO) framework. Namely, we first show that the DCA is equivalent (up to a constant) to the average (over the classes) of the roots of the $\textit{variances}$ of group losses, then employ the Group DRO formulation for each class $\textit{separately}$ to convert the non-differentiable DCA (or variance) regularized group-balanced empirical risk minimization to a more tractable minimax optimization. We further develop an efficient iterative optimization algorithm and show that our resulting method, dubbed as FairDRO, makes an interesting connection between the re-weighting based and regularization-based fairness-aware learning. Our experiments show that FairDRO is scalable, easily adaptable to diverse applications, and consistently improves the group fairness on several benchmark datasets in terms of the accuracy-fairness trade-off, compared to recent state-of-the-art baselines. ","Group Fairness, DRO" Lightweight Uncertainty for Offline Reinforcement Learning via Bayesian Posterior,https://openreview.net/forum?id=55Eet8WGJTv,https://openreview.net/pdf?id=55Eet8WGJTv,,"Offline Reinforcement Learning (RL) aims to learn optimal policies from fixed datasets. Directly applying off-policy RL algorithms to offline datasets typically suffers from the distributional shift issue and fails to obtain a reliable value estimation for out-of-distribution (OOD) actions. To this end, several methods penalize the value function with uncertainty quantification and achieve tremendous success from both theoretical and empirical perspectives.
However, such uncertainty-based methods typically require estimating the lower confidence bound (LCB) of the $Q$-function based on a large number of ensemble networks, which is computationally expensive. In this paper, we propose a lightweight uncertainty quantifier based on approximate Bayesian inference in the last layer of the $Q$-network, which estimates the Bayesian posterior with minimal parameters in addition to the ordinary $Q$-network. We then obtain the uncertainty quantification from the disagreement of the $Q$-posterior. Moreover, to avoid mode collapse in OOD samples and improve diversity in the $Q$-posterior, we introduce a repulsive force for OOD predictions in training. We show that our method recovers the provably efficient LCB-penalty under linear MDP assumptions. We further compare our method with other baselines on the D4RL benchmark. The experimental results show that our proposed method achieves state-of-the-art performance on most tasks with more lightweight uncertainty quantifiers.","Offline reinforcement learning, Uncertainty quantification, Bayesian neural networks" DiffEdit: Diffusion-based semantic image editing with mask guidance,https://openreview.net/forum?id=3lge0p5o-M-,https://openreview.net/pdf?id=3lge0p5o-M-,,"Image generation has recently seen tremendous advances, with diffusion models able to synthesize convincing images for a large variety of text prompts. In this article, we propose DiffEdit, a method to take advantage of text-conditioned diffusion models for the task of semantic image editing, where the goal is to edit an image based on a text query. Semantic image editing is an extension of image generation, with the additional constraint that the generated image should be as similar as possible to a given input image. Current editing methods based on diffusion models usually require a mask to be provided, making the task much easier by treating it as a conditional inpainting task. In contrast, our main contribution is the ability to automatically generate a mask highlighting regions of the input image that need to be edited, by contrasting predictions of a diffusion model conditioned on different text prompts. Moreover, we rely on latent inference to preserve content in those regions of interest and show excellent synergies with mask-based diffusion. DiffEdit achieves state-of-the-art editing performance on ImageNet. In addition, we evaluate semantic image editing in more challenging settings, using images from the COCO dataset as well as text-based generated images.","computer vision, image editing, diffusion models" Are More Layers Beneficial to Graph Transformers?,https://openreview.net/forum?id=uagC-X9XMi8,https://openreview.net/pdf?id=uagC-X9XMi8,We analyze and solve the depth bottleneck of graph transformers from the perspective of attention capacity.,"Although going deep has proven successful in many neural architectures, the existing graph transformers are relatively shallow. In this work, we explore whether more layers are beneficial to graph transformers, and find that current graph transformers suffer from the bottleneck of improving performance by increasing depth. Our further analysis reveals that deep graph transformers are limited by the vanishing capacity of global attention, which restricts them from focusing on the critical substructure and obtaining expressive features.
To this end, we propose a novel graph transformer model named DeepGraph that explicitly employs substructure tokens in the encoded representation, and applies local attention on related nodes to obtain substructure-based attention encoding. Our model enhances the ability of the global attention to focus on substructures and promotes the expressiveness of the representations, addressing the limitation of self-attention as the graph transformer deepens. Experiments show that our method removes the depth limitation of graph transformers and results in state-of-the-art performance across various graph benchmarks with deeper models.","Transformer, Graph Representation, Depth" Learning Combinatorial Node Labeling Algorithms,https://openreview.net/forum?id=u9sFrzSBRK8,https://openreview.net/pdf?id=u9sFrzSBRK8,We present the combinatorial node labeling framework and an accompanying neural network architecture to solve hard graph optimization problems.,"We present the combinatorial node labeling framework, which generalizes many prior approaches to solving hard graph optimization problems by supporting problems where solutions consist of arbitrarily many node labels, such as graph coloring. We then introduce a neural network architecture to implement this framework. Our architecture builds on a graph attention network with several inductive biases to improve solution quality and is trained using policy gradient reinforcement learning. We demonstrate our approach on both graph coloring and minimum vertex cover. Our learned heuristics match or outperform classical hand-crafted greedy heuristics and machine learning approaches while taking only seconds on large graphs. We conduct a detailed analysis of the learned heuristics and architecture choices and show that they successfully adapt to different graph structures.","graph learning, reinforcement learning, combinatorial optimization" Simplicial Hopfield networks,https://openreview.net/forum?id=_QLsH8gatwx,https://openreview.net/pdf?id=_QLsH8gatwx,"Without increasing the number of parameters, we improve the memory capacity of Hopfield networks by adding setwise connections embedded in a simplicial complex.","Hopfield networks are artificial neural networks which store memory patterns on the states of their neurons by choosing recurrent connection weights and update rules such that the energy landscape of the network forms attractors around the memories. How many stable, sufficiently-attracting memory patterns can we store in such a network using $N$ neurons? The answer depends on the choice of weights and update rule. Inspired by setwise connectivity in biology, we extend Hopfield networks by adding setwise connections and embedding these connections in a simplicial complex. Simplicial complexes are higher dimensional analogues of graphs which naturally represent collections of pairwise and setwise relationships. We show that our simplicial Hopfield networks increase memory storage capacity. Surprisingly, even when connections are limited to a small random subset of equivalent size to an all-pairwise network, our networks still outperform their pairwise counterparts. Such scenarios include non-trivial simplicial topology.
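A toy sketch of a Hopfield energy extended with setwise (here, 3-way) Hebbian couplings on a random subset of triangles, following the construction described above (illustrative only; the stored pattern should typically sit at lower energy than a corrupted copy):

```python
# Hopfield energy with pairwise plus simplicial (triangle) couplings.
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, P = 20, 3
patterns = rng.choice([-1, 1], size=(P, N))            # stored memories
all_tris = list(itertools.combinations(range(N), 3))
tris = [all_tris[i] for i in rng.choice(len(all_tris), 200, replace=False)]

# Hebbian weights for pairwise and 3-way couplings
J2 = sum(np.outer(p, p) for p in patterns) / N
J3 = {t: sum(p[t[0]] * p[t[1]] * p[t[2]] for p in patterns) for t in tris}

def energy(s):
    e2 = -0.5 * s @ J2 @ s
    e3 = -sum(w * s[i] * s[j] * s[k] for (i, j, k), w in J3.items())
    return e2 + e3

noisy = patterns[0].copy(); noisy[:3] *= -1            # corrupt 3 bits
print(energy(noisy), energy(patterns[0]))
```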
We also test analogous modern continuous Hopfield networks, offering a potentially promising avenue for improving the attention mechanism in Transformer models.","Hopfield network, associative memory, attention, computational neuroscience, simplicial complex, topology, memory capacity" Volumetric Disentanglement for 3D Scene Manipulation,https://openreview.net/forum?id=akk2jh4nMN7,https://openreview.net/pdf?id=akk2jh4nMN7,We propose a framework for disentangling a 3D scene into foreground and background volumetric representations and show a variety of downstream applications involving 3D manipulation.,"Recently, advances in differentiable volumetric rendering enabled significant breakthroughs in the photo-realistic and fine-detailed reconstruction of complex 3D scenes, which is key for many virtual reality applications. However, in the context of augmented reality, one may also wish to effect semantic manipulations or augmentations of objects within a scene. To this end, we propose a volumetric framework for (i) disentangling, or separating, the volumetric representation of a given foreground object from the background, and (ii) semantically manipulating the foreground object, as well as the background. Our framework takes as input a set of 2D masks specifying the desired foreground object for training views, together with the associated 2D views and poses, and produces a foreground-background disentanglement that respects the surrounding illumination, reflections, and partial occlusions, which can be applied to both training and novel views. Unlike previous work, our method does not rely on 3D information in the form of 3D object bounding boxes or a scene voxel grid. It correctly captures reflective foreground objects, objects occluded by the background, and objects with noisy and inaccurate masks. Our method enables the separate control of pixel color and depth as well as 3D similarity transformations of both the foreground and background objects. We subsequently demonstrate our framework's applicability on several downstream manipulation tasks, going beyond the placement and movement of foreground objects. These tasks include object camouflage, non-negative 3D object inpainting, 3D object translation, 3D object inpainting, and 3D text-based object manipulation. ","3D Object Editing, Neural Radiance Fields, Disentanglement" Versatile Neural Processes for Learning Implicit Neural Representations,https://openreview.net/forum?id=2nLeOOfAjK,https://openreview.net/pdf?id=2nLeOOfAjK,"We propose a new neural process framework for efficient learning of the implicit neural representations w.r.t. various signals, including complex 3D scenes.","Representing a signal as a continuous function parameterized by a neural network (a.k.a. Implicit Neural Representations, INRs) has attracted increasing attention in recent years. Neural Processes (NPs), which model the distributions over functions conditioned on partial observations (context set), provide a practical solution for fast inference of continuous functions. However, existing NP architectures suffer from inferior modeling capability for complex signals. In this paper, we propose an efficient NP framework dubbed Versatile Neural Processes (VNP), which largely increases the capability of approximating functions. Specifically, we introduce a bottleneck encoder that produces fewer and informative context tokens, relieving the high computational cost while providing high modeling capability.
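A bottleneck encoder of the kind just described can be sketched with a small set of learned queries that cross-attend to a large context set (generic Perceiver-style code of ours, not the VNP architecture); the decoder side is described next.

```python
# Learned latent queries compress a large context set into a few tokens.
import torch
import torch.nn as nn

class BottleneckEncoder(nn.Module):
    def __init__(self, d=64, n_latent=16):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latent, d) * 0.02)
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

    def forward(self, context):                    # context: (B, N, d)
        q = self.latents.expand(context.size(0), -1, -1)
        tokens, _ = self.attn(q, context, context) # (B, n_latent, d)
        return tokens                              # few, informative tokens

ctx = torch.randn(4, 1000, 64)                     # 1000 context points
print(BottleneckEncoder()(ctx).shape)              # torch.Size([4, 16, 64])
```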
At the decoder side, we hierarchically learn multiple global latent variables that jointly model the global structure and the uncertainty of a function, enabling our model to capture the distribution of complex signals. We demonstrate the effectiveness of the proposed VNP on a variety of tasks involving 1D, 2D and 3D signals. Particularly, our method shows promise in learning accurate INRs w.r.t. a 3D scene without further finetuning.","Implicit Neural Representations, Neural Processes, Variational Inference" Supplementing Domain Knowledge to BERT with Semi-structured Information of Documents,https://openreview.net/forum?id=7frgl8pKJpY,https://openreview.net/pdf?id=7frgl8pKJpY,"A new domain adaptation method is proposed, which emphasizes the importance of semi-structured information of documents for BERT capturing domain knowledge.","Adapting BERT on an in-domain text corpus is a good way to boost its performance on domain-specific natural language processing (NLP) tasks. Common domain adaptation methods, however, can be deficient in capturing domain knowledge. Meanwhile, the context fragmentation inherent in Transformer-based models also hinders the acquisition of domain knowledge. Given the semi-structured characteristics of documents and their potential for alleviating these problems, we leverage semi-structured information of documents to supplement domain knowledge to BERT. To this end, we propose a topic-based domain adaptation method, which enhances the capture of domain knowledge at various levels of text granularity. Specifically, a topic masked language model is designed at the paragraph level for pre-training; a topic subsection matching degree dataset is automatically constructed at the subsection level for intermediate fine-tuning. Experiments are conducted over three biomedical NLP tasks across five datasets, and the results highlight the importance of the previously overlooked semi-structured information for domain adaptation. Our method benefits BERT, RoBERTa, BioBERT, and PubMedBERT in nearly all cases and yields significant gains on the topic-related task, question answering, with an average accuracy improvement of 4.8.","Domain adaptation, Semi-structured information, BERT, Pre-trained language model, Biomedical question answering" Window Projection Features are All You Need for Time Series Anomaly Detection,https://openreview.net/forum?id=A6MliD2e5Xp,https://openreview.net/pdf?id=A6MliD2e5Xp,A simple hand-crafted representation combined with a Gaussian estimator obtains SOTA results in time series anomaly detection.,"The challenge of time series anomaly detection has motivated the development of increasingly more complex deep representations and anomaly metrics. In this paper, we demonstrate that a simple approach based on window projection features can achieve better results. Projection features are a common way to discretize multivariate data; they first multiply the data by a projection matrix followed by discretization of each output dimension. We first show that short temporal windows, encoded by projection features, are often already sufficiently expressive for linearly separating normal from anomalous time series. However, we find that while the expressivity of projection features is sufficient, current one-class classification methods are unable to use them effectively to detect anomalies. We hypothesize this is due to the difficulty of density estimation.
The difficulty can be overcome by estimating the probability density of the sample mean, which follows the Gaussian distribution when the conditions of the Central Limit Theorem are met. Simply put, we fit a multivariate Gaussian model to the average of the projection features of adjacent windows within a time series. Despite its simplicity, our method outperforms the state-of-the-art in diverse settings, including five UEA datasets, video trajectory anomaly detection, and standard anomaly segmentation benchmarks. Code is provided.","time series, anomaly detection" DEEP ACCURATE SOLVER FOR THE GEODESIC PROBLEM,https://openreview.net/forum?id=WyXH_H0Pdtv,https://openreview.net/pdf?id=WyXH_H0Pdtv,Deep Local solver for approximating geodesic distances on manifolds. ,"A high-order accurate deep learning method for computing geodesic distances on surfaces is introduced. The proposed method exploits a dynamic programming principle which lends itself to a scheme with quasi-linear computational complexity. We consider two main components for computing distances on surfaces: a numerical solver that locally approximates the distance function and an efficient causal ordering scheme by which points are updated. The quality of the distance approximation is determined by the local solver and is the main focus of this paper. A common approach to computing distances on continuous surfaces is to consider a discretized polygonal mesh approximating the surface and estimate distances on the polygon. With such an approximation, the exact geodesic distances restricted to the polygon are at most second order accurate with respect to the distances on the corresponding continuous surface. Here, by order of accuracy we refer to the rate of convergence as a function of the average distance between sampled points. To improve the rate of convergence, we consider a neural-network-based local solver which implicitly approximates the structure of the continuous surface. The proposed solver circumvents the polyhedral representation by directly consuming sampled mesh vertices for approximation of distances on the sampled continuous surfaces. We provide numerical evidence that the proposed update scheme, with appropriate local numerical support, provides better accuracy compared to the best possible polyhedral approximations and previous learning-based methods. We introduce a trained solver which is third order accurate, with quasi-linear complexity in the number of sampled points.","Geodesic distance, Geometric Deep learning." PBFormer: Capturing Complex Scene Text Shape with Polynomial Band Transformer,https://openreview.net/forum?id=9_pgtXEB652,https://openreview.net/pdf?id=9_pgtXEB652,"This paper presents PBFormer, an efficient yet powerful scene text detector that unifies the transformer with a novel text shape representation, Polynomial Band, which performs well for complex shape or crowded texts.","We present PBFormer, an efficient yet powerful scene text detector that unifies the transformer with a novel text shape representation, Polynomial Band (PB). The representation has four polynomial curves to fit a text's top, bottom, left, and right sides, which can capture a text with a complex shape by varying polynomial coefficients. PB has appealing features compared with conventional representations: 1) It can model different curvatures with a fixed number of parameters, while polygon-points-based methods need to utilize a different number of points.
2) It can distinguish adjacent or overlapping texts as they have apparently different curve coefficients, while segmentation-based methods suffer from adhesive spatial positions. PBFormer combines the PB with the transformer, which can directly generate smooth text contours sampled from predicted curves without interpolation. To leverage the advantage of PB, PBFormer has a parameter-free cross-scale pixel attention module. The module can enlarge text features and suppress irrelevant areas to benefit the detection of texts with diverse scale variations. Furthermore, PBFormer is trained with a shape-contained loss, which not only enforces the piecewise alignment between the ground truth and the predicted curves but also makes the curves' positions and shapes consistent with each other. Without bells and whistles such as text pre-training, our method is superior to the previous state-of-the-art text detectors on the arbitrary-shaped CTW1500 and Total-Text datasets. Code will be made public.","Complex Shape Text Detection, Text Representation, Transformer, Computer Vision, Application" Classically Approximating Variational Quantum Machine Learning with Random Fourier Features,https://openreview.net/forum?id=ymFhZxw70uz,https://openreview.net/pdf?id=ymFhZxw70uz,We show theoretically and experimentally that models built from exponentially large quantum feature space can be classically reproduced by sampling a few frequencies to build an equivalent low dimensional kernel,"Many applications of quantum computing in the near term rely on variational quantum circuits (VQCs). They have been showcased as a promising model for reaching a quantum advantage in machine learning with current noisy intermediate scale quantum computers (NISQ). It is often believed that the power of VQCs relies on their exponentially large feature space, and extensive works have explored the expressiveness and trainability of VQCs in that regard. In our work, we propose a classical sampling method that can closely approximate most VQCs with Hamiltonian encoding, given only the description of their architecture. It uses the seminal proposal of Random Fourier Features (RFF) and the fact that VQCs can be seen as large Fourier series. We show theoretically and experimentally that models built from exponentially large quantum feature space can be classically reproduced by sampling a few frequencies to build an equivalent low dimensional kernel. Precisely, we show that the number of required samples grows favourably with the size of the quantum spectrum. This tool therefore questions the hope for quantum advantage from VQCs in many cases, but conversely helps to narrow the conditions for their potential success. We expect VQCs with various and complex encoding Hamiltonians, or with large input dimension, to become more robust to classical approximations.","Quantum Machine Learning, Variational Quantum Circuits, Random Fourier Features, Kernel Approximation, Quantum Computing" Distributional Meta-Gradient Reinforcement Learning,https://openreview.net/forum?id=LGkmUauBUL,https://openreview.net/pdf?id=LGkmUauBUL,A model-free meta gradient RL algorithm with distributional return,"Meta-gradient reinforcement learning (RL) algorithms have substantially boosted the performance of RL agents by learning an adaptive return.
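The RFF recipe above can be reproduced in a few lines of our own (a generic sketch, not the paper's experiments): sample a small set of frequencies, build cos/sin features, and fit a linear model, approximating a model whose full spectrum would be far larger.

```python
# Random Fourier Features: a few sampled frequencies fit a Fourier-series target.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-np.pi, np.pi, size=(200, 1))
y = np.sin(3 * x[:, 0]) + 0.5 * np.sin(7 * x[:, 0])   # sparse-spectrum target

D = 20                                                # few sampled frequencies
omega = rng.normal(0, 5, size=(1, D))
feats = np.hstack([np.cos(x @ omega), np.sin(x @ omega)])
w, *_ = np.linalg.lstsq(feats, y, rcond=None)         # linear fit in RFF space

print("train MSE:", np.mean((feats @ w - y) ** 2))
```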
All the existing algorithms adhere to the same reward learning regime, where the adaptive return is simply formulated in the form of expected cumulative rewards, upon which the policy and critic update rules are specified under well-adopted distance metrics. In this paper, we present a novel algorithm which builds on the success of meta-gradient RL algorithms and effectively improves such algorithms by following a simple recipe, i.e., going beyond the expected return to formulate and learn the return in a more expressive form, value distributions. To this end, we first formulate a distributional return that could effectively capture bootstrapping and discounting behaviors over distributions, to form an informative distributional return target in the value update. Then we derive an efficient meta update rule to learn the adaptive distributional return with meta-gradients. For empirical evaluation, we first present an illustrative example on a toy two-color grid-world domain, which validates the benefit of learning a distributional return; then we conduct extensive comparisons on the large-scale RL benchmark Atari 2600, where we confirm that our proposed method with distributional return works seamlessly well with the actor-critic framework and leads to state-of-the-art median human normalized score in the meta-gradient RL literature.","Reinforcement Learning, Meta Learning" Towards A Unified Policy Abstraction Theory and Representation Learning Approach in Markov Decision Processes,https://openreview.net/forum?id=fKuGCzLoje,https://openreview.net/pdf?id=fKuGCzLoje,,"Lying at the heart of intelligent decision-making systems, how a policy is represented and optimized is a fundamental problem. The root challenge in this problem is the large scale and the high complexity of policy space, which exacerbates the difficulty of policy learning especially in real-world scenarios. Towards a desirable surrogate policy space, recently policy representation in a low-dimensional latent space has shown its potential in improving both the evaluation and optimization of policy. The key question involved in these studies is by what criterion we should abstract the policy space for desired compression and generalization. However, both the theory on policy abstraction and the methodology on policy representation learning are less studied in the literature. In this work, we make the very first efforts to fill this vacancy. First, we propose a unified policy abstraction theory, containing three types of policy abstraction associated with policy features at different levels. Then, we generalize them to three policy metrics that quantify the distance (i.e., similarity) of policies, for more convenient use in learning policy representation. Further, we propose a policy representation learning approach based on deep metric learning. For the empirical study, we investigate the efficacy of the proposed policy metrics and representations, in characterizing policy difference and conveying policy generalization respectively. Our experiments are conducted in both policy optimization and evaluation problems, containing trust-region policy optimization (TRPO), diversity-guided evolution strategy (DGES) and off-policy evaluation (OPE).
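One natural instantiation of such a policy metric (our illustration; the paper defines three specific metrics) is the average total-variation distance between two policies' action distributions over sampled states:

```python
# Average total-variation distance between two softmax policies.
import numpy as np

def policy_distance(pi1, pi2, states):
    """pi1, pi2: state -> action-probability vector."""
    ds = [0.5 * np.abs(pi1(s) - pi2(s)).sum() for s in states]
    return float(np.mean(ds))

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()
pi1 = lambda s: softmax(W1 @ s)                  # toy linear-softmax policies
pi2 = lambda s: softmax(W2 @ s)
states = rng.normal(size=(100, 3))
print(policy_distance(pi1, pi2, states))
```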
Somewhat naturally, the experimental results indicate that there is no universally optimal abstraction for all downstream learning problems, while the influence-irrelevance policy abstraction can be a generally preferred choice.", Inapplicable Actions Learning for Knowledge Transfer in Reinforcement Learning,https://openreview.net/forum?id=TXH64IwWgS,https://openreview.net/pdf?id=TXH64IwWgS,"This paper presents a framework to use, learn and reuse knowledge about state-dependent inapplicable actions in order to improve the sample efficiency of RL algorithms.","Reinforcement Learning (RL) algorithms are known to scale poorly to environments with many available actions, requiring numerous samples to learn an optimal policy. The traditional approach of considering the same fixed action space in every possible state implies that the agent must learn, while also learning to maximize its reward, to ignore irrelevant actions such as $\textit{inapplicable actions}$ (i.e. actions that have no effect on the environment when performed in a given state). Knowing this information can help reduce the sample complexity of RL algorithms by masking the inapplicable actions from the policy distribution to only explore actions relevant to finding an optimal policy. This is typically done in an ad-hoc manner with hand-crafted domain logic added to the RL algorithm. In this paper, we propose a more systematic approach to introduce this knowledge into the algorithm. We (i) standardize the way knowledge can be manually specified to the agent; and (ii) present a new framework to autonomously learn these state-dependent action constraints jointly with the policy. We show experimentally that learning inapplicable actions greatly improves the sample efficiency of the algorithm by providing a reliable signal to mask out irrelevant actions. Moreover, we demonstrate that thanks to the transferability of the knowledge acquired, it can be reused in other tasks to make the learning process more efficient.","reinforcement learning, transfer learning" CENTROID-BASED JOINT REPRESENTATION FOR HUMAN POSE ESTIMATION AND INSTANCE SEGMENTATION,https://openreview.net/forum?id=ytLA65K3xJ,https://openreview.net/pdf?id=ytLA65K3xJ,,"Joint pose estimation and instance segmentation combines keypoint heatmaps with segmentation masks for multi-person pose and instance-level segmentation. Unlike easy cases with explicit heatmap activation, hard cases with implicit heatmaps due to multi-person entanglement, overlap, and occlusions require joint representation with a segmentation mask in end-to-end training. This paper presents a new centroid-based joint representation method called CENTERFOCUS. It follows a bottom-up paradigm to generate Strong Keypoint Feature Maps for both soft and hard keypoints and improve keypoint detection accuracy as well as the confidence score by introducing KeyCentroids and a Body Heat Map. CENTERFOCUS then uses the high-resolution representation of keypoints as a center of attraction for the pixels in the embedding space to generate MaskCentroids, clustering the pixels to the particular human instance to which they belong, even if 70% of the body is occluded. Finally, we propose a new PoseSeg algorithm that collects the feature representation of a 2D human pose and segmentation for the joint structure of the pose and instance segmentation.
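The centroid-attraction clustering just described can be sketched as nearest-centroid assignment of pixel embeddings (our illustration, not the CENTERFOCUS implementation):

```python
# Assign each pixel embedding to the nearest person centroid in embedding space.
import torch

def cluster_pixels(pixel_emb, centroids):
    """pixel_emb: (H*W, d); centroids: (n_persons, d). Returns instance ids."""
    d = torch.cdist(pixel_emb, centroids)     # distance to each centroid
    return d.argmin(dim=1)                    # nearest-centroid assignment

emb = torch.randn(64 * 64, 8)
centroids = torch.randn(3, 8)                 # three people in the image
ids = cluster_pixels(emb, centroids)
print(torch.bincount(ids, minlength=3))       # pixels per instance
```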
We then experimentally demonstrate the effectiveness and generalization ability of our system in challenging scenarios such as occlusions, entangled limbs, and overlapping people. The experimental results show that CENTERFOCUS outperforms representative models on the challenging MS COCO and OCHuman benchmarks in terms of both accuracy and runtime performance. Ablation experiments analyze the impact of each component of the system. The code will be released publicly.","Representation Learning, Pose Estimation, Instance Segmentation, Clustering, Feature Extraction" Addressing Variable Dependency in GNN-based SAT Solving,https://openreview.net/forum?id=_bP-uQzQ1T,https://openreview.net/pdf?id=_bP-uQzQ1T,We address the variable dependency problem in SAT solving by a new GNN-based model.,"The Boolean satisfiability problem (SAT) is fundamental to many applications. Existing works have used graph neural networks (GNNs) for (approximate) SAT solving. Typical GNN-based end-to-end SAT solvers predict SAT solutions concurrently. We show that for a group of symmetric SAT problems, the concurrent prediction is guaranteed to produce a wrong answer because it neglects the dependency among Boolean variables in SAT problems. We propose AsymSAT, a GNN-based architecture which integrates recurrent neural networks to generate dependent predictions for variable assignments. The experimental results show that dependent variable prediction extends the solving capability of the GNN-based method as it improves the number of solved SAT instances on large test sets.","Boolean Satisfiability, GNN, Variable Dependency" Pairwise Confidence Difference on Unlabeled Data is Sufficient for Binary Classification,https://openreview.net/forum?id=1-B8dz847_,https://openreview.net/pdf?id=1-B8dz847_,"The difference of confidence labels on unlabeled data pairs, as a novel type of weak supervision, is sufficient to train binary classifiers with theoretical guarantees.","Learning with confidence labels is an emerging weakly supervised learning paradigm, where training data are equipped with confidence labels instead of exact labels. Positive-confidence (Pconf) classification is a typical learning problem in this context, where we are given only positive data equipped with confidence. However, pointwise confidence may not be accessible in real-world scenarios. In this paper, we dive into a novel weakly supervised learning problem called confidence-difference (ConfDiff) classification. Instead of pointwise confidence, we are given only unlabeled data pairs equipped with a confidence difference specifying the difference in the probabilities of being positive. An unbiased risk estimator is derived to tackle the problem, and we show that the estimation error bound achieves the optimal convergence rate. Extensive experiments on benchmark data sets validate the effectiveness of our proposed approaches in leveraging the supervision information of the confidence difference.","Weakly supervised learning, binary classification, unbiased risk estimator" Emergent Communication with Attention,https://openreview.net/forum?id=4JoV9g5R1M,https://openreview.net/pdf?id=4JoV9g5R1M,We study emergent language from attention agents with the referential game showing that their language is more compositional and interpretable.,"To develop computational agents that can better communicate with others using their own emergent language, we endow the agents with an ability to focus their attention on particular concepts in the environment.
Humans often understand a thing or scene as a composite of concepts, and those concepts are further mapped onto words. We implement this intuition as attention mechanisms in Speaker and Listener agents in a referential game and show that attention leads to more compositional and interpretable emergent language. We also demonstrate how attention helps us understand the learned communication protocol by investigating the attention weights associated with each message symbol and the alignment of attention weights between Speaker and Listener agents. Overall, our results suggest that attention is a promising mechanism for developing more human-like emergent language.","emergent communication, attention mechanism, compositionality, interpretability" Discovering Bugs in Vision Models using Off-the-shelf Image Generation and Captioning,https://openreview.net/forum?id=v61jhmI2zz,https://openreview.net/pdf?id=v61jhmI2zz,This work demonstrates the utility of large-scale generative models to automatically discover bugs in vision models in an open-ended manner.,"Automatically discovering failures in vision models under real-world settings remains an open challenge. This work shows how off-the-shelf, large-scale, image-to-text and text-to-image models, trained on vast amounts of data, can be leveraged to automatically find such failures. In essence, a conditional text-to-image generative model is used to generate large amounts of synthetic, yet realistic, inputs given a ground-truth label. A captioning model is used to describe misclassified inputs. Descriptions are used in turn to generate more inputs, thereby assessing whether specific descriptions induce more failures than expected. As failures are grounded to natural language, we automatically obtain a high-level, human-interpretable explanation of each failure. We use this pipeline to demonstrate that we can effectively interrogate classifiers trained on ImageNet to find specific failure cases and discover spurious correlations. We also show that we can scale the approach to generate adversarial datasets targeting specific classifier architectures. This work demonstrates the utility of large-scale generative models to automatically discover bugs in vision models in an open-ended manner. We also describe a number of limitations and pitfalls related to this approach.","robustness, failure discovery" MetaFS: An Effective Wrapper Feature Selection via Meta Learning,https://openreview.net/forum?id=zaEiQ2dgLv,https://openreview.net/pdf?id=zaEiQ2dgLv,We propose a meta-learning-based wrapper feature selection framework that doesn't require retraining many models to evaluate different subsets.,"Feature selection is of great importance and applies to many fields, such as medicine and commerce. Wrapper methods, directly comparing the performance of different feature combinations, are widely used in real-world applications. However, selecting effective features faces the following two main challenges: 1) feature combinations are distributed in a huge discrete space; and 2) efficient and precise combination evaluation is hard. To tackle these challenges, we propose a novel deep meta-learning-based feature selection framework, termed MetaFS, containing a Feature Subset Sampler (FSS) and a Meta Feature Estimator (MetaFE), which transforms the discrete search space into a continuous one and adopts meta-learning techniques for effective feature selection.
Specifically, FSS parameterizes the distribution over the discrete search space and applies gradient-based methods to optimize it. MetaFE learns the representations of different feature combinations, and dynamically generates unique models without retraining for efficient and precise combination evaluation. We adopt a bi-level optimization strategy to optimize MetaFS. After optimization, we evaluate multiple feature combinations sampled from the converged distribution (i.e., the condensed search space) and select the optimal one. Finally, we conduct extensive experiments on two datasets, illustrating the superiority of MetaFS over 7 state-of-the-art methods.","Meta Learning, feature selection" "Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models",https://openreview.net/forum?id=F5uYcwABMu,https://openreview.net/pdf?id=F5uYcwABMu,We study the role of implicit bias in language modeling,"Language modeling on large-scale datasets leads to impressive performance gains on various downstream language tasks. The (validation) pre-training loss (or perplexity in autoregressive language modeling) is often used as the evaluation metric when developing language models since the pre-training loss tends to be well-correlated with downstream performance (which is itself difficult to evaluate comprehensively). Contrary to this conventional wisdom, this paper shows that 1) pre-training loss cannot fully explain downstream performance and 2) flatness of the model is well-correlated with downstream performance where pre-training loss is not. On simplified datasets, we identify three ways to produce models with the same (statistically optimal) pre-training loss but different downstream performance: continuing pre-training after convergence, increasing the model size, and changing the training algorithm. These experiments demonstrate the existence of implicit bias of pre-training algorithms/optimizers---among models with the same minimal pre-training loss, they implicitly prefer more transferable ones. Toward understanding this implicit bias, we prove that SGD with standard mini-batch noise implicitly prefers flatter minima in language models, and empirically observe a strong correlation between flatness and downstream performance among models with the same minimal pre-training loss. We also prove in a synthetic language setting that among the models with the minimal pre-training loss, the flattest model transfers best to downstream tasks.","Language Modeling, Implicit Bias" Signal to Sequence Attention-Based Multiple Instance Network for Segmentation Free Inference of RNA Modifications,https://openreview.net/forum?id=-XC_lMynIT,https://openreview.net/pdf?id=-XC_lMynIT,,"Direct RNA sequencing technology works by allowing long RNA molecules to pass through tiny pores, generating an electrical current signal, called a squiggle, that is interpreted as a series of RNA nucleotides through the use of Deep Learning algorithms. The platform has also facilitated computational detection of RNA modifications via machine learning and statistical approaches, as they cause a detectable shift in the current generated as the modified nucleotides pass through the pores. Nevertheless, since modifications only occur in a handful of positions along the molecules, existing techniques require segmentation of the long squiggle in order to filter out irrelevant signals, and this step produces large computational and storage overhead.
Inspired by the recent work in vector similarity search, we introduce a segmentation-free approach that utilizes scaled dot-product attention to perform implicit segmentation and feature extraction of raw signals that correspond to sites of interest. We further demonstrate the feasibility of our approach by achieving significant speedup while maintaining competitive performance in m6A detection against existing state-of-the-art methods.","Multiple Instance Learning, Deep Learning, RNA Modification, Computational Biology" Interval-based Offline Policy Evaluation without Sufficient Exploration or Realizability,https://openreview.net/forum?id=cwf7nnoK5o,https://openreview.net/pdf?id=cwf7nnoK5o,"We characterize the minimax bias of OPE caused by the insufficiency of exploration and the lack of (strong) realizability, and propose a new estimator achieving it.","We study the problem of offline policy evaluation (OPE), where the goal is to estimate the value of a given decision-making policy without interacting with the actual environment. In particular, we consider interval-based OPE, where the output is an interval rather than a point, indicating the uncertainty of the evaluation. Interval-based estimation is especially important in OPE since, when the data coverage is insufficient relative to the complexity of the environmental model, any OPE method can be biased even with infinite sample size. In this paper, we characterize the worst case of such irreducible bias, called the *minimax bias*, in terms of the discrepancy between the target policy and the data-sampling distribution, and show that the marginal-importance-sampling (MIS) estimator achieves the minimax bias with an appropriate importance-weight function. Motivated by this result, we then propose a new interval-based MIS estimator that asymptotically achieves the minimax bias.","Offline policy evaluation, marginal importance sampling, offline reinforcement learning" A Differential Geometric View and Explainability of GNN on Evolving Graphs,https://openreview.net/forum?id=lRdhvzMpVYV,https://openreview.net/pdf?id=lRdhvzMpVYV,,"Graphs are ubiquitous in social networks and biochemistry, where Graph Neural Networks (GNN) are the state-of-the-art models for prediction. Graphs can evolve, and it is vital to formally model and understand how a trained GNN responds to graph evolution. We propose a smooth parameterization of the GNN predicted distributions using axiomatic attribution, where the distributions are on a low-dimensional manifold within a high-dimensional embedding space. We exploit the differential geometric viewpoint to model distributional evolution as smooth curves on the manifold. We reparameterize families of curves on the manifold and design a convex optimization problem to find a unique curve that concisely approximates the distributional evolution for human interpretation.
Extensive experiments on node classification, link prediction, and graph classification tasks with evolving graphs demonstrate the better sparsity, faithfulness, and intuitiveness of the proposed method over state-of-the-art methods.", $\rm A^2Q$: Aggregation-Aware Quantization for Graph Neural Networks,https://openreview.net/forum?id=7L2mgi0TNEP,https://openreview.net/pdf?id=7L2mgi0TNEP,"We propose an Aggregation-Aware mixed-precision Quantization method that fully utilizes the property of GNNs, achieving up to $2\times$ speedup and $11.4\%$ accuracy improvement compared to the state-of-the-art quantization method on GNNs.","As graph data size increases, the vast latency and memory consumption during inference pose a significant challenge to the real-world deployment of Graph Neural Networks (GNNs). While quantization is a powerful approach to reducing GNN complexity, most previous works on GNN quantization fail to exploit the unique characteristics of GNNs, suffering from severe accuracy degradation. Through an in-depth analysis of the topology of GNNs, we observe that the topology of the graph leads to significant differences between nodes, and most of the nodes in a graph appear to have a small aggregation value. Motivated by this, in this paper, we propose the Aggregation-Aware mixed-precision Quantization ($\rm A^2Q$) for GNNs, where an appropriate bitwidth is automatically learned and assigned to each node in the graph. To mitigate the vanishing gradient problem caused by sparse connections between nodes, we propose a Local Gradient method that uses the quantization error of the node features as supervision during training. We also develop a Nearest Neighbor Strategy to deal with generalization to unseen graphs. Extensive experiments on eight public node-level and graph-level datasets demonstrate the generality and robustness of our proposed method. Compared to the FP32 models, our method can achieve up to an $18.8\times$ (i.e., 1.70 bits) compression ratio with negligible accuracy degradation. Moreover, compared to the state-of-the-art quantization method, our method can achieve up to $11.4\%$ and $9.5\%$ accuracy improvements on the node-level and graph-level tasks, respectively, and up to $2\times$ speedup on a dedicated hardware accelerator.","Graph Neural Networks, MPNN framework, Mixed-precision, Quantization" Text-Driven Generative Domain Adaptation with Spectral Consistency Regularization,https://openreview.net/forum?id=w4eQcMZsJa,https://openreview.net/pdf?id=w4eQcMZsJa,,"Combined with the generative prior of pre-trained models and the flexibility of text, text-driven generative domain adaptation can generate images from a wide range of target domains. However, current methods still suffer from overfitting and the mode collapse problem. In this paper, we analyze mode collapse from the geometric point of view and reveal its relationship to the Hessian matrix of the generator. To alleviate it, we propose a spectral consistency regularization to preserve the diversity of the source domain without restricting the semantic adaptation to the target domain. We also design a granularity-adaptive regularization to flexibly control the balance between diversity and stylization for the target model. We conduct experiments on a broad range of target domains, comparing against state-of-the-art methods, along with extensive ablation studies.
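As a rough illustration of the spectral idea above, the following sketch penalizes changes in the adapted generator's local Jacobian spectrum along random latent directions; this is a simplified surrogate for illustration only, not the paper's exact Hessian-derived regularizer.

```python
import torch

def spectral_consistency_penalty(g_src, g_tgt, z, n_dirs: int = 4):
    """Encourage g_tgt to locally preserve g_src's Jacobian spectrum at z.

    g_src, g_tgt: callables mapping latents to images (frozen source
    generator, adapted target generator). Simplified surrogate: match
    Jacobian-vector-product norms along random latent directions.
    """
    loss = z.new_zeros(())
    for _ in range(n_dirs):
        v = torch.randn_like(z)
        _, jvp_src = torch.autograd.functional.jvp(g_src, (z,), (v,))
        _, jvp_tgt = torch.autograd.functional.jvp(g_tgt, (z,), (v,), create_graph=True)
        loss = loss + (jvp_src.norm() - jvp_tgt.norm()).pow(2)
    return loss / n_dirs
```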
The experiments demonstrate the effectiveness of our method in preserving the diversity of the source domain and generating high-fidelity target images.","GAN, StyleGAN, Clip, Domain Adaptation, Style Transfer" Multi-Prompt Alignment for Multi-source Unsupervised Domain Adaptation,https://openreview.net/forum?id=WFBksaezAs,https://openreview.net/pdf?id=WFBksaezAs,,"Most existing methods for multi-source unsupervised domain adaptation (UDA) rely on a common feature encoder to extract domain-invariant features. However, learning such an encoder involves updating the parameters of the entire network, which makes the optimization computationally expensive, particularly when coupled with min-max objectives. Inspired by recent advances in prompt learning that adapts high-capacity deep models for downstream tasks in a computationally economical way, we introduce Multi-Prompt Alignment (MPA), a simple yet efficient two-stage framework for multi-source UDA. Given a source and target domain pair, MPA first trains an individual prompt to minimize the domain gap through a contrastive loss, while tuning only a small set of parameters. Then, MPA derives a low-dimensional latent space through an auto-encoding process that maximizes the agreement of multiple learned prompts. The resulting embedding further facilitates generalization to unseen domains. Extensive experiments show that our method achieves state-of-the-art results on popular benchmark datasets while requiring substantially fewer tunable parameters. To the best of our knowledge, we are the first to apply prompt learning to the multi-source UDA problem, and our method achieves the highest reported average accuracy of 54.1% on DomainNet, the most challenging UDA dataset to date, with only 15.9M parameters trained. More importantly, we demonstrate that the learned embedding space can be easily adapted to novel unseen domains.", Adversarial Examples Guided Pseudo-label Refinement for Decentralized Domain Adaptation,https://openreview.net/forum?id=cKAhbE-woN,https://openreview.net/pdf?id=cKAhbE-woN,,"Unsupervised domain adaptation (UDA) methods usually assume data from multiple domains can be put together for centralized adaptation. Unfortunately, this assumption impairs data privacy, which leads to the failure of traditional methods in practical scenarios. To cope with the above issue, we present a new approach named Adversarial Examples Guided Pseudo-label Refinement for Decentralized Domain Adaptation (AGREE), which conducts target adaptation in an iterative training process during which only models can be delivered across domains. More specifically, to train a promising target model, we leverage Adversarial Examples (AEs) to filter out error-prone predictions of source models towards each target sample based on both robustness and confidence, and then treat the most frequent prediction as the pseudo-label. Besides, to improve central model aggregation, we introduce Knowledge Contribution (KC) to compute reasonable aggregation weights. Extensive experiments conducted on several standard datasets verify the superiority of the proposed AGREE. In particular, our AGREE achieves new state-of-the-art performance on the DomainNet and Office-Caltech10 datasets.
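A minimal sketch of the pseudo-label refinement idea behind AGREE: source-model predictions that fail a robustness check (e.g., are flipped by an adversarial example) are discarded, and the most frequent surviving prediction becomes the pseudo-label. The shapes and voting rule here are illustrative assumptions.

```python
import torch

def refine_pseudo_label(source_logits: torch.Tensor, robust: torch.Tensor):
    """source_logits: (S, C) one prediction per source model.
    robust: (S,) bool mask, True if the prediction passed the AE-based check.
    Returns the majority-vote pseudo-label, or None if no prediction survives.
    """
    preds = source_logits.argmax(dim=1)[robust]
    if preds.numel() == 0:
        return None
    return torch.mode(preds).values.item()  # most frequent surviving prediction
```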
The implementation code will be publicly available.","Domain Adaptation, Federated Learning" Clean-image Backdoor: Attacking Multi-label Models with Poisoned Labels Only,https://openreview.net/forum?id=rFQfjDC9Mt,https://openreview.net/pdf?id=rFQfjDC9Mt,,"Multi-label models have been widely used in various applications including image annotation and object detection. The fly in the ointment is their inherent vulnerability to backdoor attacks due to the adoption of deep learning techniques. However, all existing backdoor attacks exclusively require modifying training inputs (e.g., images), which may be impractical in real-world applications. In this paper, we aim to break this wall and propose the first clean-image backdoor attack, which only poisons the training labels without touching the training samples. Our key insight is that in a multi-label learning task, the adversary can just manipulate the annotations of training samples consisting of a specific set of classes to activate the backdoor. We design a novel trigger exploration method to find covert and effective triggers to enhance the attack performance. We also propose three target label selection strategies to achieve different goals. Experimental results indicate that our clean-image backdoor can achieve a 98% attack success rate while preserving the model's functionality on benign inputs. Besides, the proposed clean-image backdoor can evade existing state-of-the-art defenses.", Dense Correlation Fields for Motion Modeling in Action Recognition,https://openreview.net/forum?id=rPqxwpEm7M,https://openreview.net/pdf?id=rPqxwpEm7M,"In this paper, we present Dense Correlation Fields (DCF) which build up dense visual correlation volumes that preserve both fine local information provided in the lower layer and the high-level semantic information from the deeper layer.","The challenge of action recognition is to capture and reason about motion information. Compared to spatial convolution for appearance, the temporal component provides an additional (and important) clue for motion modeling, as a number of actions can be reliably recognized based on motion information. In this paper, we present an effective and interpretable module, Dense Correlation Fields (DCF), which builds up dense visual correlation volumes at the feature level to model different motion patterns explicitly. To achieve this goal, we rely on a spatially hierarchical architecture that preserves both fine local information provided in the lower layer and the high-level semantic information from the deeper layer. Our method fuses spatial hierarchical correlation and temporal long-term correlation, which is better suited for small objects and large displacements. This module is extensible and can be plugged into many backbone architectures to accurately predict object interactions in the video.
DCF shows consistent improvements over 2D CNN and 3D CNN baseline networks, with gains of 3.7% and 3.0% respectively, on the standard video action benchmark SSV1.","Action recognition, Motion modeling, Video understanding" "Variance Reduction is an Antidote to Byzantines: Better Rates, Weaker Assumptions and Communication Compression as a Cherry on the Top",https://openreview.net/forum?id=pfuqQQCB34,https://openreview.net/pdf?id=pfuqQQCB34,"We propose a new Byzantine-tolerant method with variance reduction, communication compression, and theoretical guarantees superior to previously known results","Byzantine-robustness has been gaining a lot of attention due to the growing interest in collaborative and federated learning. However, many fruitful directions, such as the usage of variance reduction for achieving robustness and communication compression for reducing communication costs, remain weakly explored in the field. This work addresses this gap and proposes Byz-VR-MARINA -- a new Byzantine-tolerant method with variance reduction and compression. A key message of our paper is that variance reduction is key to fighting Byzantine workers more effectively. At the same time, communication compression is a bonus that makes the process more communication efficient. We derive theoretical convergence guarantees for Byz-VR-MARINA outperforming previous state-of-the-art for general non-convex and Polyak-Lojasiewicz loss functions. Unlike the concurrent Byzantine-robust methods with variance reduction and/or compression, our complexity results are tight and do not rely on restrictive assumptions such as boundedness of the gradients or limited compression. Moreover, we provide the first analysis of a Byzantine-tolerant method supporting non-uniform sampling of stochastic gradients. Numerical experiments corroborate our theoretical findings.","byzantine robustness, variance reduction, communication compression" What's Behind the Mask: Estimating Uncertainty in Image-to-Image Problems,https://openreview.net/forum?id=kzqRIEHBgH,https://openreview.net/pdf?id=kzqRIEHBgH,"We show how to estimate uncertainty in image-to-image problems by computing a mask such that the distance between the masked reconstructed image and the masked true image is guaranteed to be less than a specified threshold, with high probability.","Estimating uncertainty in image-to-image networks is an important task, particularly as such networks are being increasingly deployed in the biological and medical imaging realms. In this paper, we introduce a new approach to this problem based on masking. Given an existing image-to-image network, our approach computes a mask such that the distance between the masked reconstructed image and the masked true image is guaranteed to be less than a specified threshold, with high probability. The mask thus identifies the more certain regions of the reconstructed image. Our approach is agnostic to the underlying image-to-image network, and only requires triples of the input (degraded), reconstructed and true images for training. Furthermore, our method is agnostic to the distance metric used. As a result, one can use $L_p$-style distances or perceptual distances like LPIPS, which contrasts with interval-based approaches to uncertainty. Our theoretical guarantees derive from a conformal calibration procedure.
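The conformal calibration step mentioned above reduces to a simple quantile computation; the sketch below shows a split-conformal threshold over per-image nonconformity scores, with the choice of score (e.g., the masked distance between reconstruction and ground truth) left as an assumption.

```python
import numpy as np

def conformal_threshold(cal_scores: np.ndarray, alpha: float = 0.1) -> float:
    """Split-conformal quantile: with probability ~1 - alpha, a fresh test
    score falls below the returned threshold.

    cal_scores: nonconformity scores on a held-out calibration set, e.g. the
    masked distance between reconstructed and true images, one per image.
    """
    n = len(cal_scores)
    rank = int(np.ceil((n + 1) * (1 - alpha)))   # finite-sample correction
    return float(np.sort(cal_scores)[min(rank, n) - 1])

# usage sketch: tau = conformal_threshold(np.array(masked_distances), alpha=0.1)
```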
We evaluate our mask-based approach to uncertainty on image colorization, image completion, and super-resolution tasks, demonstrating high-quality performance on each.","uncertainty estimation, image-to-image" A Time-Consistency Curriculum for Learning from Instance-Dependent Noisy Labels,https://openreview.net/forum?id=4RwkbKZhGV,https://openreview.net/pdf?id=4RwkbKZhGV,,"Many machine learning algorithms are known to be fragile to simple instance-independent noisy labels. However, noisy labels in real-world data are more devastating since they are produced by more complicated mechanisms in an instance-dependent manner. In this paper, we target this practical challenge of \textit{Instance-Dependent Noisy Labels} by jointly training (1) a model reverse-engineering the noise-generating mechanism, which produces an \textit{instance-dependent mapping} between the clean label posterior and the observed noisy label; and (2) a robust classifier that produces clean label posteriors. Compared to previous methods, the former model is novel and enables end-to-end learning of the latter directly from noisy labels. An extensive empirical study indicates that the time-consistency of data is critical to the success of training both models and motivates us to develop a curriculum selecting training data based on their dynamics on the two models' outputs over the course of training. We show that the curriculum-selected data provide both clean labels and high-quality input-output pairs for training the two models. Therefore, it leads to promising and robust classification performance even in notably challenging settings of instance-dependent noisy labels where many SoTA methods could easily fail. Extensive experimental comparisons and ablation studies further demonstrate the advantages and significance of the time-consistency curriculum in learning from instance-dependent noisy labels on multiple benchmark datasets.", Black-box Knowledge Distillation,https://openreview.net/forum?id=x8NPd0MFTf,https://openreview.net/pdf?id=x8NPd0MFTf,We introduce an approach for black-box knowledge distillation via prediction augmentations and multi-level prediction alignment.,"Knowledge Distillation (KD) aims at distilling the knowledge from a large teacher model to a lightweight student model. While effectively enhancing model efficiency, mainstream methods often rely on the assumption that the teacher model is white-box (i.e., visible during distillation). However, this assumption does not always hold due to commercial, privacy, or safety concerns, which hinders these strong methods from being applied. To address this dilemma, in this paper, we consider black-box knowledge distillation, an interesting yet challenging problem which aims at distilling teacher knowledge when merely the teacher predictions are accessible (i.e., the teacher model is invisible). Some early KD methods can be directly applied to black-box knowledge distillation, but the performance appears to be unsatisfactory. In this paper, we propose a simple yet effective approach, which makes better utilization of teacher predictions with prediction augmentation and multi-level prediction alignment. Through this framework, the student model learns from more diverse teacher predictions. Meanwhile, the prediction alignment is conducted not only at the instance level, but also at the batch and class levels, through which the student model learns instance prediction, input correlation, and category correlation simultaneously.
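To illustrate the three alignment levels just described, here is a schematic PyTorch loss; the specific divergences, temperature, and equal weighting are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multilevel_alignment_loss(student_logits, teacher_probs, T: float = 2.0):
    """student_logits: (B, C); teacher_probs: (B, C) black-box predictions."""
    s = F.softmax(student_logits / T, dim=1)
    # instance level: match each student prediction to the teacher's
    inst = F.kl_div(torch.log(s + 1e-8), teacher_probs, reduction="batchmean")
    # batch level: match the (B x B) sample-correlation structure
    batch = F.mse_loss(s @ s.t(), teacher_probs @ teacher_probs.t())
    # class level: match the (C x C) category-correlation structure
    cls = F.mse_loss(s.t() @ s, teacher_probs.t() @ teacher_probs)
    return inst + batch + cls
```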
Extensive experimental results validate that our method enjoys consistently higher performance than previous black-box methods, and even reaches competitive performance with mainstream white-box methods. We promise to release our code and models to ensure reproducibility.","black-box model, knowledge distillation" Open Set Recognition by Mitigating Prompt Bias,https://openreview.net/forum?id=O8EK-eWjUm,https://openreview.net/pdf?id=O8EK-eWjUm,,"Existing open set recognition (OSR) methods are usually performed on relatively small datasets by training a visual model from scratch. OSR on large-scale datasets has rarely been studied due to its great complexity and difficulty. Recently, vision-language (VL) pre-training has promoted closed-set image recognition with prompt engineering on datasets of various scales. However, prompts tuned on the training data often exhibit label bias towards known classes, leading to poor performance in recognizing unknown data in the open environment. In this paper, we aim at developing a new paradigm for OSR on both small and large-scale datasets by prompt engineering on VL models in a divide-and-conquer strategy. Firstly, the closed-set data is processed as the combination of one or more groups. Each group is equipped with a group-specific prompt. Then, we propose Group-specific Contrastive Tuning (GCTu), in which negative label words are introduced into tuning to mitigate the label bias of group-specific prompts. In inference, to achieve comprehensive predictions on both small and large-scale datasets, we propose Group Combined Testing (GCTe). It determines the optimal prediction prompt among the multiple group-specific predictions by focusing on the group-wise closed-set probability distributions. Our method, namely GCT2, achieves excellent performance on both small and large-scale OSR benchmarks. The strong and wide applicability of our method is also verified in ablation studies.", Efficient Personalized Federated Learning via Sparse Model-Adaptation,https://openreview.net/forum?id=3vOtC1t1kF,https://openreview.net/pdf?id=3vOtC1t1kF,"We propose an efficient personalized FL method with theoretical analysis, which adaptively learns sparse local models, and achieves SOTA accuracy and efficiency simultaneously.","Federated Learning (FL) aims to train machine learning models for multiple clients without sharing their own private data. Due to the heterogeneity of clients' local data distributions, recent studies explore personalized FL, which learns and deploys distinct local models with the help of auxiliary global models. However, the clients can be heterogeneous in terms of not only local data distribution, but also their computation and communication resources. The capacity and efficiency of personalized models are restricted by the lowest-resource clients, leading to sub-optimal performance and limited practicality of personalized FL. To overcome these challenges, we propose a novel approach named pFedGate for efficient personalized FL by adaptively and efficiently learning sparse local models. With a lightweight trainable gating layer, pFedGate enables clients to reach their full potential in model capacity by generating different sparse models accounting for both the heterogeneous data distributions and resource constraints. Meanwhile, the computation and communication efficiency are both improved thanks to the adaptability between the model sparsity and clients' resources.
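A minimal sketch of the trainable gating idea described for pFedGate: a small scorer emits a per-block mask under a client-specific sparsity budget, kept differentiable with a straight-through estimator. The architecture and budget handling are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SparseGate(nn.Module):
    """Emit a sparse 0/1 mask over model blocks from a client context vector."""
    def __init__(self, ctx_dim: int, n_blocks: int, sparsity: float = 0.5):
        super().__init__()
        self.scorer = nn.Linear(ctx_dim, n_blocks)
        self.sparsity = sparsity  # fraction of blocks this client keeps active

    def forward(self, ctx: torch.Tensor) -> torch.Tensor:  # ctx: (ctx_dim,)
        scores = self.scorer(ctx)                           # (n_blocks,)
        k = max(1, int(self.sparsity * scores.numel()))
        thresh = scores.topk(k).values.min()
        hard = (scores >= thresh).float()
        # straight-through estimator: hard mask forward, soft gradient backward
        return hard + scores - scores.detach()
```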
Further, we theoretically show that the proposed pFedGate has superior complexity with guaranteed convergence and generalization error. Extensive experiments show that pFedGate achieves superior global accuracy, individual accuracy and efficiency simultaneously over state-of-the-art methods, with up to a 4.53\% accuracy improvement and a 12x smaller model size. We also demonstrate that pFedGate performs better than competitors in the novel-client participation and partial-client participation scenarios, and can learn meaningful sparse local models adapted to different data distributions.","Efficient Federated Learning, Personalization, Sparse Model-Adaptation" Molecule Generation for Target Receptor Binding via Continuous Normalizing Flows,https://openreview.net/forum?id=XwR41dpign,https://openreview.net/pdf?id=XwR41dpign,We derive semi-equivariance conditions to yield invariant conditional distributions of ligand molecules given receptor molecules.,"We propose an algorithm for learning a conditional generative model of a molecule given a target. Specifically, given a receptor molecule that one wishes to bind to, the conditional model generates candidate ligand molecules that may bind to it. The distribution should be invariant to rigid body transformations that act jointly on the ligand and the receptor; it should also be invariant to permutations of either the ligand or receptor atoms. Our learning algorithm is based on a continuous normalizing flow. We establish semi-equivariance conditions on the flow which guarantee the aforementioned invariance conditions on the conditional distribution. We propose a graph neural network architecture which implements this flow, and which is designed to learn effectively despite the vast differences in size between the ligand and receptor. We evaluate our method on the CrossDocked2020 dataset, attaining a 52.7% relative improvement over the current state of the art.","molecule generative models, normalizing flows, equivariance" Momentum Tracking: Momentum Acceleration for Decentralized Deep Learning on Heterogeneous Data,https://openreview.net/forum?id=3eQEil044E,https://openreview.net/pdf?id=3eQEil044E,"In this work, we propose Momentum Tracking, which is the method with momentum acceleration whose convergence rate is proved to be independent of the data-heterogeneity.","SGD with momentum acceleration is one of the key components for improving the performance of neural networks. For decentralized learning, a straightforward approach using momentum acceleration is Distributed SGD (DSGD) with momentum acceleration (DSGDm). However, DSGDm performs worse than DSGD when the data distributions are statistically heterogeneous. Recently, several studies have addressed this issue and proposed methods with momentum acceleration that are more robust to data heterogeneity than DSGDm, although their convergence rates remain dependent on data heterogeneity and decrease when the data distributions are heterogeneous. In this study, we propose Momentum Tracking, which is a method with momentum acceleration whose convergence rate is proven to be independent of data heterogeneity. More specifically, we analyze the convergence rate of Momentum Tracking in the standard deep learning setting, where the objective function is non-convex and the stochastic gradient is used. Then, we show that it is independent of data heterogeneity for any momentum coefficient $\beta \in [0, 1)$.
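As a schematic of how gradient tracking and momentum can combine in this setting, the sketch below performs one synchronous round over n nodes; the recursion is an illustrative combination of standard gradient tracking with heavy-ball momentum, and the paper's exact update may differ.

```python
import numpy as np

def momentum_tracking_round(x, y, m, g_prev, g_new, W, beta=0.9, lr=0.1):
    """x, y, m, g_prev, g_new: (n, d) arrays of node parameters, tracking
    variables, momentum buffers, and local stochastic gradients at the
    previous/current iterates; W: (n, n) doubly stochastic mixing matrix.
    """
    y = W @ y + g_new - g_prev  # tracking: y_i estimates the average gradient
    m = beta * m + y            # heavy-ball momentum on the tracked gradient
    x = W @ x - lr * m          # gossip averaging plus local descent
    return x, y, m
```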
Through image classification tasks, we demonstrate that Momentum Tracking is more robust to data heterogeneity than the existing decentralized learning methods with momentum acceleration and can consistently outperform them when the data distributions are heterogeneous.","Decentralized Optimization, Non-Convex Stochastic Optimization, Momentum Acceleration" Deep Graph-Level Orthogonal Hypersphere Compression for Anomaly Detection,https://openreview.net/forum?id=yCtxVkTaXg,https://openreview.net/pdf?id=yCtxVkTaXg,A deep orthogonal graph-level anomaly detection method and its improvement.,"Graph-level anomaly detection aims to identify abnormal samples of a set of graphs in an unsupervised manner. It is non-trivial to find a reasonable decision boundary between normal data and anomalous data without using any anomalous data in the training stage, especially for graph data. This paper first proposes a novel deep graph-level anomaly detection model, which learns the graph representation with maximum mutual information between substructure features and global structure features while exploring a hypersphere anomaly decision boundary. The deep orthogonal projection layer is adopted to keep the training data distribution consistent with the decision hypersphere, thus avoiding erroneous evaluations. We further propose projecting the normal data into the interval region between two co-centered hyperspheres, which makes the normal data distribution more compact and effectively overcomes the issue of outliers falling close to the center of the hypersphere. The numerical and visualization results on a few graph datasets demonstrate the effectiveness and superiority of our methods in comparison to many baselines and state-of-the-art methods.",Unsupervised learning GPR-Net: Multi-view Layout Estimation via a Geometry-aware Panorama Registration Network,https://openreview.net/forum?id=jSXsRwdux_,https://openreview.net/pdf?id=jSXsRwdux_,,"Reconstructing 3D layouts from multiple $360^{\circ}$ panoramas has received increasing attention recently, as estimating a complete layout of a large-scale and complex room from a single panorama is very difficult. A state-of-the-art method, called PSMNet, introduces the first learning-based framework that jointly estimates the room layout and registration given a pair of panoramas. However, PSMNet relies on an approximate (i.e., ""noisy"") registration as input, and obtaining such a registration is itself a challenging problem in the context of wide-baseline registration. In this work, we present a complete multi-view panoramic layout estimation framework that jointly learns panorama registration and layout estimation given a pair of panoramas without relying on a pose prior. The major improvement over PSMNet comes from a novel Geometry-aware Panorama Registration Network, or GPR-Net, that effectively tackles the wide-baseline registration problem by exploiting the layout geometry and computing fine-grained correspondences on the layout boundaries, rather than in the global pixel space. Our architecture consists of two parts. First, given two panoramas, we adopt a vision transformer to learn a set of compact 1D horizon features sampled on the panorama. These 1D horizon features encode the depths of individual layouts and the correspondence and covisibility maps between layout boundaries. We then exploit a non-linear registration module to convert these 1D horizon features into a set of corresponding 2D boundary points on the layout.
Finally, we estimate the relative camera pose via RANSAC and obtain the complete layout simply by taking the union of the registered layouts. Experimental results indicate that our method achieves state-of-the-art performance in both panorama registration and layout estimation on a large-scale indoor panorama dataset.", Gradient Deconfliction via Orthogonal Projections onto Subspaces For Multi-task Learning,https://openreview.net/forum?id=tF_iDkYA_Z5,https://openreview.net/pdf?id=tF_iDkYA_Z5,"We propose a multi-task learning method which not only solves all conflicts among the tasks, but can also effectively search for diverse solutions towards different trade-off preferences among the tasks.","Although multi-task learning (MTL) has been a preferred approach and successfully applied in many real-world scenarios, MTL models are not guaranteed to outperform single-task models on all tasks, mainly due to the negative effects of conflicting gradients among the tasks. In this paper, we fully examine the influence of conflicting gradients and further emphasize the importance and advantages of achieving non-conflicting gradients, which allows simple but effective trade-off strategies among the tasks with stable performance. Based on our findings, we propose Gradient Deconfliction via Orthogonal Projections onto Subspaces (GradOPS) spanned by other task-specific gradients. Our method not only solves all conflicts among the tasks, but can also effectively search for diverse solutions towards different trade-off preferences among the tasks. Theoretical analysis on convergence is provided, and the performance of our algorithm is thoroughly validated on multiple benchmarks in various domains. Results demonstrate that our method can effectively find multiple state-of-the-art solutions with different trade-off strategies among the tasks on multiple datasets.","multi-task learning, deep learning" Relative Contribution Mechanism: A Unified Paradigm for Disassembling Convolutional Neural Networks,https://openreview.net/forum?id=gPgI6mStqTc,https://openreview.net/pdf?id=gPgI6mStqTc,,"With the tremendous development of CNNs, obtaining a ready-to-use CNN classifier is increasingly challenging due to the massive number of parameters and deep structures. Recently, an emerging model disassembling and assembling task (MDA-Task) has been proposed to obtain new models easily from the perspective of model reuse. However, the existing methods are usually slow or inaccurate. In this paper, we put forward a contribution paradigm for MDA-Task, which unifies existing model disassembling and assembling methods into a universal formulation. We first propose a relative contribution mechanism, in which the prediction results of a CNN classifier are decided by the larger contribution value. Then, we analyze the contribution allocation and aggregation procedures around the above mechanism and present two discoveries. Based on the two discoveries, we introduce a contribution-attribution-based CNN disassembling technique composed of single-layer contribution attribution and backward accumulation attribution, which can effectively find the category-aware components in each layer and associated components in adjacent layers, respectively. In addition, a contribution-rescaling-based CNN assembling technique is devised for assembling the above disassembled category-aware components from different CNN classifiers, which can achieve comparable accuracy performance with the original CNN classifiers.
Experiments on five benchmark datasets with three mainstream CNN classifiers verify the effectiveness of the proposed contribution paradigm and demonstrate that the contribution-attribution-based CNN disassembling and assembling technique can achieve significantly higher accuracy and faster speed than the existing methods.", Pareto Optimization for Active Learning under Out-of-Distribution Data Scenarios,https://openreview.net/forum?id=BGvOEUEMBzE,https://openreview.net/pdf?id=BGvOEUEMBzE,,"Pool-based Active Learning (AL) has achieved great success in minimizing labeling costs by sequentially selecting the most informative unlabeled samples from a large unlabeled data pool and querying their labels from oracles/annotators. However, existing AL sampling schemes might not work well under out-of-distribution (OOD) data scenarios, where the unlabeled data pool contains data samples that do not belong to the pre-defined categories of the target task. Achieving good AL performance under OOD data scenarios is a challenging task due to the natural conflict between AL sampling strategies and OOD sample detection -- both more informative in-distribution (ID) data and OOD data in the unlabeled data pool may be assigned high informativeness scores (e.g., high entropy) during AL processes. In this paper, we propose a sampling scheme, Monte-Carlo Pareto Optimization for Active Learning (POAL), which selects optimal subsets of unlabeled samples with \emph{fixed batch size} from the unlabeled data pool. We cast the AL sampling task as a multi-objective optimization problem and utilize Pareto optimization based on two conflicting objectives: (1) the typical AL sampling scheme (e.g., maximum entropy), and (2) the confidence of not being an OOD data sample. Experimental results show the effectiveness of our POAL on classical Machine Learning (ML) and Deep Learning (DL) tasks.","active learning, pareto optimization, out-of-distribution" Self-Consistent Learning: Cooperation between Generators and Discriminators,https://openreview.net/forum?id=btmflCmNxDl,https://openreview.net/pdf?id=btmflCmNxDl,"This paper presents a self-consistent learning framework with a cooperative closed-loop form, achieving new state-of-the-art results on the sentence semantic matching task in both zero-shot and full-data settings.","Using generated data to improve the performance of downstream discriminative models has recently gained popularity due to the great development of pre-trained language models. In most previous studies, generative models and discriminative models are trained separately and thus cannot adapt to changes in each other. As a result, the generated samples can easily deviate from the real data distribution, while the improvement of the discriminative model quickly reaches saturation. Generative adversarial networks (GANs) train generative models via an adversarial process with discriminative models to achieve joint training. However, the training of standard GANs is notoriously unstable and often falls short of convergence. In this paper, to address these issues, we propose a $\textit{self-consistent learning}$ framework, in which a discriminator and a generator are cooperatively trained in a closed-loop form. The discriminator and the generator enhance each other during multiple rounds of alternating training until a scoring consensus is reached. This framework proves to be easy to train and free from instabilities such as mode collapse and non-convergence.
Extensive experiments on sentence semantic matching demonstrate the effectiveness of the proposed framework: the discriminator achieves an improvement of more than 10 AP in the zero-shot setting and new state-of-the-art performance in the full-data setting.","Cooperative Closed-loop Training, Language Modeling, Sentence Semantic Matching" Learning Dynamical Characteristics with Neural Operators for Data Assimilation,https://openreview.net/forum?id=Sc3Ylriwp4,https://openreview.net/pdf?id=Sc3Ylriwp4,A new deep learning framework is proposed for data assimilation issues.,"Data assimilation refers to a group of algorithms that combine numerical models with observations to obtain an optimal estimation of the system's states. In areas like earth science, numerical models are usually formulated by differential equations, also known as the prior dynamics. It is a great challenge for neural networks to properly exploit the dynamical characteristics for data assimilation, because first, it is difficult to represent complicated dynamical characteristics in neural networks, and second, the dynamics are likely to be biased. State-of-the-art neural networks borrow from traditional methods to introduce dynamical characteristics by optimizing the 4D-Var objective function, in which the dynamics are inherently quantified, but the iterative optimization process leads to high computational cost. In this paper, we develop a novel deep learning framework with neural operators for data assimilation. The key novelty of our proposed approach is that we design a so-called flow operator through self-supervised learning to explicitly learn dynamical characteristics for reconstructed states. Numerical experiments on the Lorenz-63 and Lorenz-96 systems, which are the standard benchmarks for data assimilation performance evaluation, show that the proposed method is at least three times faster than state-of-the-art neural networks, and reduces the dynamic loss by two orders of magnitude. It is also demonstrated that our method is well-adapted to biases in the prior dynamics.","AI for science, data assimilation, generative models" Lost Domain Generalization Is a Natural Consequence of Lack of Training Domains,https://openreview.net/forum?id=9KmnrUpU2DG,https://openreview.net/pdf?id=9KmnrUpU2DG,We show a hardness result for the number of training domains required to achieve a small population error on the test domain.,"We show a hardness result for the number of training domains required to achieve a small population error in the test domain. Although many domain generalization algorithms have been developed under various domain-invariance assumptions, there is significant evidence indicating that the out-of-distribution (o.o.d.) test accuracy of state-of-the-art o.o.d. algorithms is on par with empirical risk minimization and random guessing on domain generalization benchmarks such as DomainBed. In this work, we analyze its cause and attribute the lost domain generalization to the lack of training domains. We show that, in a minimax lower bound fashion, \emph{any} learning algorithm that outputs a classifier with an $\epsilon$ excess error to the Bayes optimal classifier requires at least $\mathrm{poly}(1/\epsilon)$ training domains, even when the number of training data sampled from each training domain is large. Experiments on the DomainBed benchmark demonstrate that o.o.d. test accuracy increases monotonically with the number of training domains.
Our result sheds light on the intrinsic hardness of domain generalization and suggests benchmarking o.o.d. algorithms on datasets with a sufficient number of training domains.","Domain Generalization, Domain Complexity" Graph Neural Networks for Link Prediction with Subgraph Sketching,https://openreview.net/forum?id=m1oqEOAozQU,https://openreview.net/pdf?id=m1oqEOAozQU,A method that solves the expressivity issues that plague most MPNNs for link prediction while being as efficient to run as GCN. This is achieved by passing subgraph sketches as messages.,"Many Graph Neural Networks (GNNs) perform poorly compared to simple heuristics on Link Prediction (LP) tasks. This is due to limitations in expressive power such as the inability to count triangles (the backbone of most LP heuristics) and the inability to distinguish automorphic nodes (those having identical structural roles). Both expressiveness issues can be alleviated by learning link (rather than node) representations and incorporating structural features such as triangle counts. Since explicit link representations are often prohibitively expensive, recent works have resorted to subgraph-based methods, which have achieved state-of-the-art performance for LP, but suffer from poor efficiency due to high levels of redundancy between subgraphs. We analyze the components of subgraph GNN (SGNN) methods for link prediction. Based on our analysis, we propose a novel full-graph GNN called ELPH (Efficient Link Prediction with Hashing) that passes subgraph sketches as messages to approximate the key components of SGNNs without explicit subgraph construction. ELPH is provably more expressive than Message Passing GNNs (MPNNs). It outperforms existing SGNN models on many standard LP benchmarks while being orders of magnitude faster. However, it shares the common GNN limitation that it is only efficient when the dataset fits in GPU memory. Accordingly, we develop a highly scalable model, called BUDDY, which uses feature precomputation to circumvent this limitation without sacrificing predictive performance. Our experiments show that BUDDY also outperforms SGNNs on standard LP benchmarks while being highly scalable and faster than ELPH.","Graph Neural Networks, Link Prediction, Data Sketching" Leveraging Hard Negative Priors for Automatic Medical Report Generation,https://openreview.net/forum?id=kv6N5B_5gx,https://openreview.net/pdf?id=kv6N5B_5gx,,"Recently, automatic medical report generation has become an active research topic in the medical imaging field. It is imperative for the model to identify normal and abnormal regions in a medical image to generate a coherent and diverse report. However, medical datasets are highly biased towards normal regions. This makes most existing models tend to generate generic reports without sufficiently considering the uniqueness of individual images. In this paper, we propose a learning framework to extract distinctive image and report features for each sample by distinguishing it from its closest peer (denoted as a hard negative in this paper) and gradually increasing the difficulty of such a task through synthesizing harder and harder negatives during training. Specifically, a prior hard negative report, which is the report closest to an anchor report in the dataset, is initially identified by using a pre-trained Sentence Transformer.
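The prior hard-negative lookup can be sketched with the sentence-transformers library; the encoder name and the toy reports below are illustrative choices, not taken from the paper.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative encoder choice
reports = [
    "No acute cardiopulmonary process.",
    "Mild cardiomegaly; lungs are clear.",
    "Right lower lobe opacity concerning for pneumonia.",
]
emb = model.encode(reports, convert_to_tensor=True, normalize_embeddings=True)
sim = util.cos_sim(emb, emb)             # pairwise cosine similarities
sim.fill_diagonal_(-1.0)                 # exclude self-matches
prior_hard_negative = sim.argmax(dim=1)  # closest peer report per anchor
```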
To force our report decoder to capture highly distinctive and image-correlated text features, increasingly hard negative reports are continually synthesized by gradually moving the prior hard negative report towards the anchor report in the latent space during training. The harder negative report is used to evaluate a triplet loss that is minimized to enforce the distance between the matched image and report to be smaller than the distance between an image and its synthesized harder negative report. Meanwhile, the associated images of the anchor report and its prior hard negative report form a hard negative image pair, and a cosine similarity loss is used to capture the distinctive features of the anchor image by pushing the hard negative image away. In this way, our model can achieve a subtle representational resolution (i.e., the ability to distinguish two similar samples). As a general method, we demonstrate experimentally that our framework can be readily incorporated into a variety of existing medical report generation models, and significantly improves the corresponding baselines. Our code will be publicly released at","Medical Report Generation, Image Captioning, Hard Negatives" Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning,https://openreview.net/forum?id=Nk2pDtuhTq,https://openreview.net/pdf?id=Nk2pDtuhTq,We propose multitask prompt tuning which learns a single transferable prompt by decomposing and distilling knowledge from multiple task-specific source prompts.,"Prompt tuning, in which a base pretrained model is adapted to each task via conditioning on learned prompt vectors, has emerged as a promising approach for the efficient adaptation of large language models to multiple downstream tasks. However, existing methods typically learn soft prompt vectors from scratch, and it has not been clear how to exploit the rich cross-task knowledge in task-specific prompt vectors to improve performance on target downstream tasks. In this paper, we propose multitask prompt tuning (MPT), which first learns a single transferable prompt by decomposing and distilling knowledge from multiple task-specific source prompts. We then learn multiplicative low rank updates to this shared prompt to efficiently adapt it to each downstream target task. Extensive experiments on 21 NLP datasets demonstrate that our proposed approach outperforms the state-of-the-art methods, including the full finetuning baseline in some cases, despite only tuning $0.035\%$ as many task-specific parameters.","Prompt Tuning, Multitask Learning, Transfer Learning" Style Balancing and Test-Time Style Shifting for Domain Generalization,https://openreview.net/forum?id=7_3oRsaogr,https://openreview.net/pdf?id=7_3oRsaogr,"We propose style balancing and test-time style shifting for domain generalization, to handle the imbalance issues and the issue of a large style gap between source and target domains.","Given a training set that consists of multiple source domains, the goal of domain generalization (DG) is to train the model to have generalization capability on the unseen target domain. Although various solutions have been proposed, existing ideas suffer from severe cross-domain data/class imbalance issues that naturally arise in DG. Moreover, the performance of prior works is degraded in practice when the gap between the style statistics of the source and target domains is large. In this paper, we propose a new strategy to handle these issues in DG.
We first propose style balancing, which strategically balances the number of samples for each class across all source domains in the style space, providing a great platform for the model to be exposed to various styles per class during training. Based on the model trained with our style balancing, we also propose test-time style shifting, which shifts the style of the test sample (that has a large style gap with the source domains) to the nearest source domain that the model is already familiar with, to further improve the prediction performance. Our style balancing and test-time style shifting work in a highly complementary fashion, and can successfully work in conjunction with various other DG schemes. Experimental results on benchmark datasets show the improved performance of our scheme over existing methods.","Domain generalization, Style mixing, Arbitrary style transfer" Least Disagree Metric-based Active Learning,https://openreview.net/forum?id=UgLKEBoE3QP,https://openreview.net/pdf?id=UgLKEBoE3QP,"An uncertainty-based active learning algorithm based on the least disagree metric, which is the smallest perturbation required to alter the sample prediction.","The most popular class of active learners today queries for the labels of the samples for which the prediction is most uncertain and uses the labeled samples to update its prediction. Unfortunately, quantifying uncertainty is an open question. This paper mathematically defines uncertainty in terms of the least disagree metric (LDM), which is the smallest perturbation required to alter the sample prediction. Based on this metric, the predictor is updated by querying the labels of the most uncertain samples. Given a finite-sized training set, the empirical LDM is incorporated into an active learning algorithm and used to approximate the theoretical LDM of each sample. Theoretical convergence properties of the empirical LDM to its mathematical definition are provided. Experimental results show that our algorithm mostly outperforms other high-performing active learning algorithms and leads to state-of-the-art performance on various datasets and deep networks.","active learning, uncertainty, disagree metric, diversity" Personalized Federated Hypernetworks for Privacy Preservation in Multi-Task Reinforcement Learning,https://openreview.net/forum?id=AGLG_ncNp0X,https://openreview.net/pdf?id=AGLG_ncNp0X,We use hypernetworks to aggregate learning across multiple reinforcement learning agents in a microgrid energy demand response setting while preserving privacy.,"Multi-Agent Reinforcement Learning currently focuses on implementations where all data and training can be centralized to one machine. But what if local agents are split across multiple tasks, and need to keep data private between one another? We develop the first application of Personalized Federated Hypernetworks (PFH) to Reinforcement Learning (RL). We then present a novel application of PFH to few-shot transfer, and demonstrate significant initial increases in learning. PFH has never been demonstrated beyond supervised learning benchmarks, so we apply PFH to an important domain: RL price-setting for energy demand response. We consider a general case where agents are split across multiple microgrids, wherein energy consumption data must be kept private within each microgrid.
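To make the hypernetwork idea concrete, here is a minimal sketch in which a shared network maps a client embedding to the weights of that client's linear policy head; all dimensions and the linear head itself are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PolicyHypernetwork(nn.Module):
    """Shared hypernetwork emitting per-client policy-head weights."""
    def __init__(self, n_clients: int, emb_dim: int = 16,
                 obs_dim: int = 8, act_dim: int = 4):
        super().__init__()
        self.client_emb = nn.Embedding(n_clients, emb_dim)
        self.weight_gen = nn.Linear(emb_dim, obs_dim * act_dim + act_dim)
        self.obs_dim, self.act_dim = obs_dim, act_dim

    def forward(self, client_id: torch.Tensor, obs: torch.Tensor) -> torch.Tensor:
        params = self.weight_gen(self.client_emb(client_id))   # flat weight vector
        W = params[: self.obs_dim * self.act_dim].view(self.act_dim, self.obs_dim)
        b = params[self.obs_dim * self.act_dim:]
        return obs @ W.t() + b  # action logits of this client's policy

# usage sketch: logits = hnet(torch.tensor(3), obs_batch)  # client 3's policy
```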
Together, our work explores how the fields of personalized federated learning and RL can come together to make learning efficient across multiple tasks while keeping data secure.","microgrid clusters, energy demand response, transactive energy control, neural networks, multi-agent reinforcement learning, reinforcement learning, multi-task learning, transfer learning, hypernetworks, federated learning, personalized federated learning, microgrids" NSCL: Noise-Resistant Soft Contrastive Learning for Universal Domain Adaptation,https://openreview.net/forum?id=vny63BYDCS,https://openreview.net/pdf?id=vny63BYDCS,,"Domain adaptation (DA) transfers knowledge from label-rich domains to new domains where labels are scarce, to address the problem of generalizing deep neural networks to new domains. Universal domain adaptation (UNDA) assumes the label distributions of labeled and unlabeled data are different and unknowable. In this paper, we concentrate on solving the noise problem in UNDA based on contrastive learning (CL), which includes view noise in data augmentation and label noise in classifier training. The domain differences in UNDA amplify the noise in the views produced by data augmentation, making it challenging to find data augmentation schemes that apply to all domains. In addition, mainstream UNDA classifiers combine closed-set classifiers with open-set classifiers; insufficient competition among open-set classifiers leads to overconfidence, which results in extreme sensitivity to noise in labeled data. Therefore, we propose Noise-Resistant Soft Contrastive Learning (NSCL) to address the above issues. Firstly, we propose a soft contrastive learning loss to avoid the over-response of typical CL losses to noisy samples, thus enabling data augmentation to further improve the performance of UNDA. Secondly, we design an all-in-one (AIO) classifier to improve robustness to noisy labels while introducing multi-category unknown class competition. Extensive experimental results on UNDA and open-set DA demonstrate the advantages of NSCL over existing methods, especially in downstream tasks such as classification and visualization.","Universal Domain Adaptation, Contrastive Learning" Global-Local Bayesian Transformer for Semantic Correspondence,https://openreview.net/forum?id=aGkxJtOxKx,https://openreview.net/pdf?id=aGkxJtOxKx,,"Cost aggregation is the key to finding semantic correspondence between a pair of similar images. Transformer-based cost aggregators have recently shown strong performance in obtaining high-quality correlation maps due to their capability of capturing long-range dependencies between matching points. However, such models are data-hungry and prone to over-fitting when training data is not sufficiently large. Besides, they easily incur incorrect matches when finding correspondences in the local semantic context. To address these issues, we propose a Global-Local Bayesian Transformer (GLBT) for cost aggregation. Specifically, GLBT introduces one global Bayesian self-attention module, whose weights are sampled from a learnable Bayesian posterior distribution, to mitigate over-fitting while modeling the long-range interaction from correlation maps. 
Furthermore, to model the short-range interaction between candidate matches, GLBT introduces another local Bayesian self-attention module, which factorizes both correlation maps and Bayesian attention weights into pairs of patches and conducts matrix multiplication on individual pairs rather than a direct dot-product. The two self-attention modules are joined together to model the long-range and short-range interactions from correlation maps. Ultimately, GLBT is hierarchically aggregated to refine the correlation maps before they are fed to the flow estimator. We conduct extensive experiments to show the superiority of our proposed network over state-of-the-art methods on several datasets, including SPair-71k, PF-PASCAL, and PF-WILLOW.", Semantic Category Discovery with Vision-language Representations,https://openreview.net/forum?id=sQ0TzsZTUn,https://openreview.net/pdf?id=sQ0TzsZTUn,,"Object recognition is the task of identifying the category of an object in an image. While current models report excellent performance on existing benchmarks, most fall short of the task accomplished by the human perceptual system. For instance, traditional classifiers (e.g., those trained on ImageNet) only learn to map an image to a predefined class index, without revealing the actual semantic meaning of the object in the image. Meanwhile, vision-language models like CLIP are able to assign semantic class names to unseen objects in a `zero-shot' manner, though they are once again provided a predefined set of candidate names at test-time. In this paper, we reconsider the recognition problem and bring it closer to a practical setting. Specifically, given only a large (essentially unconstrained) taxonomy of categories as prior information, we task a vision-language model with assigning class names to all images in a dataset. We first use non-parametric methods to establish relationships between images, which allow the model to automatically narrow down the set of possible candidate names. We then propose iteratively clustering the data and voting on class names within clusters, showing that this enables a roughly 50% improvement over the baseline on ImageNet. We demonstrate the efficacy of our method in a number of settings: using different taxonomies as the semantic search space; in unsupervised and partially supervised settings; as well as with coarse-grained and fine-grained evaluation datasets.", Deep Causal Generative Modeling for Tabular Data Imputation and Intervention,https://openreview.net/forum?id=HBLr-G1Zpn,https://openreview.net/pdf?id=HBLr-G1Zpn,,"Tabular data synthesis could overcome tabular data incompleteness and data availability issues. In most prior works, deep generative models are constructed following standard architecture designs. However, these works do not consider the inter-relationships among the features, or the latent variables. To fully leverage these inter-relationships, we develop a novel causal-aware asymmetric variational autoencoder architecture (CAT) for tabular data generation, imputation, and intervention. The developed model, called CAT-MIWAE, learns exogenous causal representations with a pre-defined causal graph in an incomplete-data context. It provides interpretability for partially observed features and can efficiently address the missing value imputation problem. Besides, CAT-MIWAE can sample data from distributions under arbitrary conditions and interventions. 
This merit enables us to actively generate counterfactuals or debiased fair data samples for any subpopulation of interest. To validate the effectiveness of the proposed causally aware models, we conduct extensive experiments on real-world tabular datasets. Experiments show that the proposed models outperform state-of-the-art models. Moreover, we perform CATE estimations to show that the CAT-MIWAE model can appropriately extrapolate any conditional or interventional distributions from the original observed data distribution.","tabular data, generative models, missing value imputation, causal knowledge" CBLab: Scalable Traffic Simulation with Enriched Data Supporting,https://openreview.net/forum?id=5iqzNK-Qeb,https://openreview.net/pdf?id=5iqzNK-Qeb,"We present CBLab, a toolkit for scalable traffic simulation with enriched input data supporting.","Traffic simulation provides interactive data for the optimization of traffic policies. However, existing traffic simulators are limited by their lack of scalability and a shortage of input data, which prevents them from generating interactive data from traffic simulation in the scenarios of real large-scale city road networks. In this paper, we present \textbf{C}ity \textbf{B}rain \textbf{Lab}, a toolkit for scalable traffic simulation. CBLab consists of three components: CBEngine, CBData, and CBScenario. CBEngine is a highly efficient simulator supporting large-scale traffic simulation. CBData includes a traffic dataset with road network data of 100 cities all around the world. We also develop a pipeline to conduct a one-click transformation from raw road networks to input data of our traffic simulation. Combining CBEngine and CBData allows researchers to run scalable traffic simulations in the road network of real large-scale cities. Based on that, CBScenario implements an interactive environment and several baseline methods for two scenarios of traffic policies, with which traffic policies adaptable for large-scale urban traffic can be trained and tuned. To the best of our knowledge, CBLab is the first infrastructure supporting traffic policy optimization in large-scale urban scenarios. The code is available on GitHub:~\url{https://github.com/CityBrainLab/CityBrainLab.git}.","Infrastructure, Traffic Policy, Traffic Simulation, Large-scale Dataset" Personalized Decentralized Bilevel Optimization over Stochastic and Directed Networks,https://openreview.net/forum?id=qr0EbR8lH5,https://openreview.net/pdf?id=qr0EbR8lH5,We propose a gradient-based bilevel optimization as a general approach to personalization and propose a decentralized hyper-gradient estimation algorithm that runs on stochastic and directed communication networks.,"While personalization in distributed learning has been extensively studied, existing approaches employ dedicated algorithms to optimize their specific type of parameters (e.g., client clusters or model interpolation weights), making it difficult to simultaneously optimize different types of parameters to yield better performance. Moreover, their algorithms require centralized or static undirected communication networks, which can be vulnerable to center-point failures or deadlocks. This study proposes optimizing various types of parameters using a single algorithm that runs in more practical communication environments. First, we propose a gradient-based bilevel optimization that reduces most personalization approaches to the optimization of client-wise hyperparameters. 
Second, we propose a decentralized algorithm to estimate gradients with respect to the hyperparameters, which can run even on stochastic and directed communication networks. Our empirical results demonstrate that the gradient-based bilevel optimization enables combining existing personalization approaches, leading to state-of-the-art performance, and confirm that it can perform in multiple simulated communication environments, including a stochastic and directed network.","decentralized stochastic gradient descent, bilevel optimization, hyper-gradient, personalization, directed network, federated learning, distributed learning, fully-decentralized" ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading,https://openreview.net/forum?id=gaIMkuIFwCG,https://openreview.net/pdf?id=gaIMkuIFwCG,"We propose ContextSpeech with a memory reuse mechanism, broadened contextual semantic information, and linearized attention for paragraph-reading TTS.","Although Text-to-Speech (TTS) has made rapid progress in speech quality at the sentence level, it still faces many challenges in paragraph / long-form reading. Synthesizing sentence by sentence in a paragraph and then concatenating the results causes inconsistency issues that affect paragraph-level expressiveness, while directly modelling all the sentences in a paragraph incurs large computation / memory cost. In this paper, we develop a TTS system called ContextSpeech, which models the contextual information in a paragraph for coherence and expressiveness without largely increasing the computation or memory cost. On the one hand, we introduce a memory-cached recurrence mechanism to let the current sentence see more history information on both the text and speech sides. On the other hand, we construct text-based semantic information in a hierarchical structure, which can broaden the horizon and incorporate future information. Additionally, we use a linearized self-attention with compatible relative-position encoding to reduce the computation / memory cost. Experiments show that ContextSpeech significantly improves the paragraph-level voice quality and prosody expressiveness in terms of both subjective and objective evaluation metrics. Furthermore, ContextSpeech achieves better model efficiency in both the training and inference stages.","Text-to-Speech, Contextual Modeling, Efficient Transformer" Learning Object Affordance with Contact and Grasp Generation,https://openreview.net/forum?id=e9-w5aLkZM,https://openreview.net/pdf?id=e9-w5aLkZM,,"Understanding object affordance can help in designing better and more robust robotic grasping. Existing work in the computer vision community formulates object affordance understanding as a grasping pose generation problem, which treats the problem as a black box by learning a mapping between objects and the distributions of possible grasping poses for the objects. On the other hand, in the robotics community, estimating object affordance represented by contact maps is of utmost importance, as localizing the positions of possible affordances can help the planning of grasping actions. In this paper, we propose to formulate object affordance understanding as both contact and grasp pose generation. 
We factorize the learning task into two sequential stages, rather than adopting the black-box strategy: (1) we first infer the contact maps by allowing multi-modal contact generation; (2) assuming that grasping poses are fully constrained given contact maps, we learn a one-to-one mapping from the contact maps to the grasping poses. Further, we propose a penetration-aware partial optimization from the intermediate contacts. It combines local and global optimization for the refinement of the partial poses of the generated grasps exhibiting penetration. Extensive validations on two public datasets show our method outperforms state-of-the-art methods regarding grasp generation on various metrics.","Object Affordance, Hand Grasps Generation, Conditional Variational Autoencoder, Deep Learning" Deep Graph-Level Clustering Using Pseudo-Label-Guided Mutual Information Maximization Network,https://openreview.net/forum?id=6RWJe6lPbQ,https://openreview.net/pdf?id=6RWJe6lPbQ,,"In this work, we study the problem of partitioning a set of graphs into different groups such that the graphs in the same group are similar while the graphs in different groups are dissimilar. This problem was rarely studied previously, although there has been a lot of work on node clustering and graph classification. The problem is challenging because it is difficult to measure the similarity or distance between graphs. One feasible approach is using graph kernels to compute a similarity matrix for the graphs and then performing spectral clustering, but the effectiveness of existing graph kernels in measuring the similarity between graphs is very limited. To solve the problem, we propose a novel method called Deep Graph-Level Clustering (DGLC). DGLC utilizes a graph isomorphism network to learn graph-level representations by maximizing the mutual information between the representations of entire graphs and substructures, under the regularization of a clustering module that ensures discriminative representations via pseudo labels. DGLC achieves graph-level representation learning and graph-level clustering in an end-to-end manner. The experimental results on six benchmark datasets of graphs show that our DGLC has state-of-the-art performance in comparison to many baselines.","Graph-level clustering, Graph representation learning, Deep learning, Unsupervised learning" Deep Generative Model based Rate-Distortion for Image Downscaling Assessment,https://openreview.net/forum?id=kcemndN1Tw,https://openreview.net/pdf?id=kcemndN1Tw,,"In this paper, we propose a novel measure, namely Image Downscaling Assessment by Rate-Distortion (IDA-RD), to quantitatively evaluate image downscaling algorithms. In contrast to image-based methods that measure the quality of downscaled images, ours is process-based, drawing ideas from rate-distortion theory to measure the distortion incurred during downscaling. Our main idea is that downscaling and super-resolution (SR) can be viewed as the encoding and decoding processes in the rate-distortion model, respectively, and that a downscaling algorithm that preserves more details in the resulting low-resolution (LR) images should lead to less distorted high-resolution (HR) images in SR. In other words, the distortion should increase as the downscaling algorithm deteriorates. However, it is non-trivial to measure this distortion as it requires the SR algorithm to be blind and stochastic. 
Our key insight is that such requirements can be met by recent SR algorithms based on deep generative models, which can find all matching HR images for a given LR image on their learned image manifolds. Empirically, we first validate our IDA-RD measure with synthetic downscaling algorithms which simulate distortions by adding various types and levels of degradations to the downscaled images. We then test our measure on traditional downscaling algorithms such as bicubic, bilinear, and nearest-neighbor interpolation, as well as state-of-the-art downscaling algorithms such as DPID, L0-regularized downscaling, and Perceptual downscaling. Experimental results show the effectiveness of our IDA-RD in evaluating image downscaling algorithms.", Selective Classifier Ensemble,https://openreview.net/forum?id=e1WfacHtbj,https://openreview.net/pdf?id=e1WfacHtbj,,"Selective classification allows a machine learning model to abstain on some hard inputs and thus improve the safety of its predictions. In this paper, we study the ensemble of selective classifiers, i.e., the selective classifier ensemble, which combines several weak selective classifiers to obtain a more powerful model. We prove that under some assumptions, the ensemble has a lower selective risk than the individual model under a range of coverage. This is nontrivial since the selective risk is a non-convex function of the model prediction. The assumptions and the theoretical result are supported by systematic experiments on both computer vision and natural language processing tasks. A surprising empirical result is that a simple selective classifier ensemble, namely, the ensemble model with maximum probability as confidence, is the state-of-the-art selective classifier. For instance, on CIFAR-10, using the same VGG-16 backbone model, this ensemble reduces the AURC (Area Under Risk-Coverage Curve) by about 24%, relative to the previous state-of-the-art method.","selective prediction, selective classification, ensemble learning" Better Generative Replay for Continual Federated Learning,https://openreview.net/forum?id=cRxYWKiTan,https://openreview.net/pdf?id=cRxYWKiTan,introduce a new continual federated learning setting with generative replay and solve an important technical problem to make it work,"Federated Learning (FL) aims to develop a centralized server that learns from distributed clients via communications without accessing the clients’ local data. However, existing works mainly focus on federated learning in a single-task scenario with static data. In this paper, we introduce the continual federated learning (CFL) problem, where clients incrementally learn new tasks and history data cannot be stored due to certain reasons, such as limited storage and data retention policies. Generative replay (GR) based methods are effective for continual learning without storing history data. However, naively adapting GR models to this setting fails. By analyzing the behaviors of clients during training, we find the unstable training process caused by distributed training on non-IID data leads to a notable performance degradation. To address this problem, we propose our FedCIL model with two simple but effective solutions: 1. model consolidation and 2. consistency enforcement. 
Experimental results on multiple benchmark datasets demonstrate that our method significantly outperforms baselines.","federated learning, continual learning" Unified Probabilistic Modeling of Image Aesthetic Rating Distributions towards Measuring Subjectivity,https://openreview.net/forum?id=rEKl9rzR7S,https://openreview.net/pdf?id=rEKl9rzR7S,We propose a unified probabilistic framework for modeling and quantifying subjective aesthetic preference.,"Assessing image aesthetics is a challenging computer vision task. One reason is that aesthetic preference is highly subjective and may vary significantly among people for certain images. Thus, it is important to properly model and quantify such subjectivity, but there has not been much effort to resolve this issue. In this paper, we propose a novel unified probabilistic framework that can model and quantify subjective aesthetic preference based on subjective logic. In this framework, the distribution of aesthetic ratings is modeled as a beta distribution, from which the probabilities of being definitely pleasing, being definitely unpleasing, and being uncertain can be obtained. We use the probability of being uncertain to define an intuitive metric of subjectivity. Furthermore, we present a method to learn deep neural networks for prediction of image aesthetics, which is shown to be effective in improving the performance of subjectivity prediction via experiments. We also present an application scenario where the framework is beneficial for aesthetics-based image recommendation.","image aesthetics, probabilistic modeling, subjective logic, subjective preference" Enhancing the Transferability of Adversarial Examples via a Few Queries and Fuzzy Domain Eliminating,https://openreview.net/forum?id=e20T84suZx,https://openreview.net/pdf?id=e20T84suZx,"In this work, we propose a novel query prior-based method and a fuzzy domain eliminating technique to enhance the family of fast gradient sign methods and improve their attack transferability by using a few queries.","Due to the vulnerability of deep neural networks, the black-box attack has drawn great attention from the community. Though transferable priors decrease the number of queries of black-box query attacks in recent efforts, the average number of queries is still larger than 100, which makes these attacks easily affected by query-limit policies. In this work, we propose a novel query prior-based method to enhance the attack transferability of the family of fast gradient sign methods by using a few queries. Specifically, for the untargeted attack, we find that successfully attacked adversarial examples tend to be classified into the wrong categories with higher probability by the victim model. Therefore, the weighted augmented cross-entropy loss is proposed to reduce the gradient angle between the surrogate model and the victim model for enhancing the transferability of the adversarial examples. In addition, the fuzzy domain eliminating technique is proposed to prevent the generated adversarial examples from getting stuck in a local optimum. Specifically, we define the fuzzy domain of the input example $x$ in the $\epsilon$-ball of $x$. Then, temperature scaling and fuzzy scaling are utilized to eliminate the fuzzy domain for enhancing the transferability of the generated adversarial examples. 
Theoretical analysis and extensive experiments demonstrate that our method can significantly improve the transferability of gradient-based adversarial attacks on CIFAR10/100 and ImageNet and outperform black-box query attacks with the same few queries.","adversarial examples, transferability, deep neural network" Analyzing adversarial robustness of vision transformers against spatial and spectral attacks,https://openreview.net/forum?id=NWZOL5kZv6,https://openreview.net/pdf?id=NWZOL5kZv6,We discover that Transformers are vulnerable to adversarial attacks perturbing phase information of images in the frequency domain.,"Vision Transformers have emerged as a powerful architecture that can outperform convolutional neural networks (CNNs) in image classification tasks. Several attempts have been made to understand the robustness of Transformers against adversarial attacks, but existing studies draw inconsistent results, i.e., some conclude that Transformers are more robust than CNNs, while others find that they have similar degrees of robustness. In this paper, we address two issues unexplored in the existing studies examining adversarial robustness of Transformers. First, we argue that the image quality should be simultaneously considered in evaluating adversarial robustness. We find that the superiority of one architecture to another in terms of robustness can change depending on the attack strength expressed by the quality of the attacked images. Second, by noting that Transformers and CNNs rely on different types of information in images, we formulate an attack framework as a tool for implementing flexible attacks, where an image can be attacked in the spectral domain as well as in the spatial domain. This attack perturbs the magnitude and phase information of particular frequency components selectively. Through extensive experiments, we find that Transformers tend to rely more on phase information and low frequency information than CNNs, and thus sometimes they are even more vulnerable under frequency-selective attacks. It is our hope that this work provides new perspectives in understanding the properties and adversarial robustness of Transformers.","Transformers, adversarial attack, Fourier transform" Label-distribution-agnostic Ensemble Learning on Federated Long-tailed Data,https://openreview.net/forum?id=l4f-zJqY2s,https://openreview.net/pdf?id=l4f-zJqY2s,,"Federated Learning (FL) is a distributed machine learning paradigm that enables devices to collaboratively train a shared model. However, the long-tailed distribution of real-world data deteriorates the performance of the global model, and this is difficult to address due to data heterogeneity, e.g., local clients may exhibit diverse imbalanced class distributions. Moreover, existing re-balance strategies generally utilize the label distribution as the class prior, which may conflict with the privacy requirement of FL. To this end, we propose a Label-Distribution-Agnostic Ensemble (LDAE) learning framework to integrate heterogeneous data distributions using multiple experts, which aims to optimize a balanced global objective under privacy protection. In particular, we derive a privacy-preserving proxy from the model updates of clients to guide the grouping and updating of multiple experts. Knowledge from clients can be aggregated via implicit interactions among different expert groups. 
We theoretically and experimentally demonstrate that (1) there is a global objective gap between global and local re-balance strategies\footnote{The local re-balance strategy means that each client utilizes re-balance methods based on the local label distribution, while the global re-balance strategy applies re-balance methods using the global label distribution as the class-wise prior.} and (2) while protecting data privacy, the proxy can be used as an alternative to the label distribution for existing class prior based re-balance strategies. Extensive experiments on long-tailed decentralized datasets demonstrate the effectiveness of our method, showing superior performance to state-of-the-art methods.","federated learning, long-tailed learning" MULTI-VIEW DEEP EVIDENTIAL FUSION NEURAL NETWORK FOR ASSESSMENT OF SCREENING MAMMOGRAMS,https://openreview.net/forum?id=snjmwYRuqh,https://openreview.net/pdf?id=snjmwYRuqh,,"Mammography is an X-ray-based imaging technique widely used for breast cancer screening and early-risk assessment. A large number of mammograms are acquired in regular breast cancer screening programs. The assessment of mammograms is a tedious task and may be difficult to accomplish due to a shortage of expert radiologists in breast imaging. Artificial intelligence-powered algorithms, especially deep learning, could assist radiologists by automating the assessment; however, substantial trust needs to be established before such algorithms can be incorporated in real-world settings. The evidential neural network algorithm provides an interpretable approach using Dempster-Shafer evidential theory that supports network predictive confidence. Recent studies have suggested that multi-view analysis improves the assessment of mammograms. In this study, we advance the multi-view assessment of mammograms by using a deep evidential neural network to address the following questions: 1. What is the effect of various pre-trained convolutional neural networks in extracting features from mammograms? 2. Which fusion strategies work better for the multi-view assessment of mammograms using a deep evidential learning framework? The multi-view deep evidential neural network extracts features from each mammogram view using a pre-trained convolutional neural network. The extracted features are combined using Dempster-Shafer evidence theory for the following two classification tasks: mammogram density assessment in BI-RADS categories, and classification of mammogram findings as benign or malignant. We conducted extensive experiments using two open-sourced digital mammogram datasets, VinDr-mammo and mini-DDSM, with 4,977 and 1,885 patients, each with four mammogram views, respectively. The results suggest that the multi-view approach outperforms the single-view by relative improvements of 2.99% and 2.64% for VinDr-mammo, and 6.51% and 8.75% for mini-DDSM, in terms of F1-score, on the mammogram density assessment and benign/malignant finding classification tasks, respectively. 
Our results show that the multi-view assessment of mammograms using a deep evidential fusion approach not only provides superior performance compared to the single-view assessment but also enhances trust in incorporating artificial intelligence-powered algorithms for the assessment of screening mammograms.","Multi-view fusion, Mammograms, Evidential learning, Deep learning" Data-Free Continual Graph Learning ,https://openreview.net/forum?id=RtB4CXS1Jxv,https://openreview.net/pdf?id=RtB4CXS1Jxv,consider and study an important yet ignored case in existing continual graph learning works ,"Graph Neural Networks (GNNs), which effectively learn from static graph-structured data, become ineffective when directly applied to streaming data in a continual learning (CL) scenario. A few recent works study this so-called “catastrophic forgetting” problem in GNNs, where historical data are not available during the training stage. However, they make a strong assumption that full access to historical data is provided during the inference stage. This assumption could make the graph learning system impractical to deploy due to a number of reasons, such as limited storage and GDPR data retention policies, to name a few. In this work, we study continual graph learning without this strong assumption. Moreover, in practical continual learning, models are sometimes trained with accumulated batch data but required to do on-the-fly inference with a stream of test samples. In this case, without being re-inserted into previous training graphs for inference, streaming test nodes are often very sparsely connected. This makes inference more difficult, as the model is trained on a much denser graph while required to infer on a sparse graph with insufficient neighborhood information. We propose a simple Replay GNN (ReGNN) to jointly solve the above two challenges without memory buffers (i.e., data-free): catastrophic forgetting and poor neighbour information during inference. Extensive experiments demonstrate the effectiveness of our model over baseline models, including competitive baselines with memory buffers.","continual learning, graph representation learning, graph neural networks, lifelong learning" Generative Modelling with Inverse Heat Dissipation,https://openreview.net/forum?id=4PJUBT9f2Ol,https://openreview.net/pdf?id=4PJUBT9f2Ol,"We propose a generative model that iteratively reverses the heat equation, increasing the effective resolution of the image","While diffusion models have shown great success in image generation, their noise-inverting generative process does not explicitly consider the structure of images, such as their inherent multi-scale nature. Inspired by diffusion models and the empirical success of coarse-to-fine modelling, we propose a new model that generates images through iteratively inverting the heat equation, a PDE that locally erases fine-scale information when run over the 2D plane of the image. We interpret a noise-relaxed solution of the forward heat equation as a variational approximation in a diffusion-like latent variable model. Our new model shows emergent qualitative properties not seen in standard diffusion models, such as disentanglement of overall colour and shape in images and data efficiency. 
Spectral analysis on natural images highlights connections to diffusion models and reveals implicit inductive biases in them.","diffusion model, partial differential equation, inductive bias" Self-supervision through Random Segments with Autoregressive Coding (RandSAC),https://openreview.net/forum?id=Ubc74gTVo3,https://openreview.net/pdf?id=Ubc74gTVo3,,"Inspired by the success of self-supervised autoregressive representation learning in natural language (GPT and its variants), and advances in recent visual architecture design with Vision Transformers (ViTs), in this paper, we explore the effects various design choices have on the success of applying such training strategies for visual feature learning. Specifically, we introduce a novel strategy that we call Random Segments with Autoregressive Coding (RandSAC). In RandSAC, we group patch representations (image tokens) into hierarchically arranged segments; within each segment, tokens are predicted in parallel, similar to BERT, while across-segment predictions are sequential, similar to GPT. We illustrate that randomized serialization of the segments significantly improves the performance and results in a distribution over spatially-long (across-segments) and -short (within-segment) predictions which are effective for feature learning. We illustrate the pertinence of these design choices and explore alternatives on a number of datasets (e.g., CIFAR10, ImageNet). While our pre-training strategy works with a vanilla Transformer, we also propose a conceptually simple, but highly effective, addition to the decoder that allows learnable skip-connections to encoder feature layers, which further improves the performance.", Rarity Score : A New Metric to Evaluate the Uncommonness of Synthesized Images,https://openreview.net/forum?id=JTGimap_-F,https://openreview.net/pdf?id=JTGimap_-F,,"Evaluation metrics in image synthesis play a key role in measuring the performance of generative models. However, most metrics mainly focus on image fidelity. Existing diversity metrics are derived by comparing distributions, and thus they cannot quantify the diversity or rarity degree of each generated image. In this work, we propose a new evaluation metric, called `rarity score', to measure both image-wise uncommonness and model-wise diversified generation performance. We first show the empirical observation that common samples are close to each other and rare samples are far from each other in nearest-neighbor distances on latent spaces represented by feature extractor networks such as VGG16. We then use our metric to demonstrate that the extent to which different generative models produce rare images can be effectively compared. We also propose a method to compare rarities between datasets that share the same concept such as CelebA-HQ and FFHQ. Finally, we analyze the use of metrics in different designs of feature extractors to better understand the relationship between feature spaces and resulting high-rarity images. 
Code will be publicly available for the research community.", Benchmarking Approximate k-Nearest Neighbour Search for Big High Dimensional Dynamic Data,https://openreview.net/forum?id=XtRJsuVsLU,https://openreview.net/pdf?id=XtRJsuVsLU,A novel framework for benchmarking Approximate k-Nearest Neighbour (ANN) methods on big high dimensional dynamic data that identifies suitable ANN methods for ML and other applications and will accelerate future ANN research.,"Approximate k-Nearest Neighbour (ANN) methods are commonly used for mining information from big high-dimensional datasets. For each application, the high-level dataset properties and run-time requirements determine which method will provide the most suitable tradeoffs. However, due to a significant lack of comprehensive benchmarking, judicious method selection is not currently possible for ANN applications that involve frequent online changes to datasets. Here we address this issue by building upon existing benchmarks for static search problems to provide a new benchmarking framework for big high dimensional dynamic data. We apply our framework to dynamic scenarios modelled after common real world applications. In all cases we are able to identify a suitable recall-runtime tradeoff to improve upon a worst-case exhaustive search. Our framework provides a flexible solution to accelerate future ANN research and enable researchers in other online data-rich domains to find suitable methods for handling their ANN searches.","Nearest Neighbour Search, Similarity Search, Indexing, Knowledge retrieval, Knowledge discovery, High Dimensional Data, Big Data, Large Scale, Hashing, Graph Traversal, Product Quantisation, Online Learning, Representation Learning, Metric Learning, Robotic Vision" Semi-Supervised Offline Reinforcement Learning with Action-Free Trajectories,https://openreview.net/forum?id=6OxI4WqGr6,https://openreview.net/pdf?id=6OxI4WqGr6,,"Natural agents can effectively learn from multiple data sources that differ in size, quality, and types of measurements. We study this heterogeneity in the context of offline reinforcement learning (RL) by introducing a new, practically motivated semi-supervised setting. Here, an agent has access to two sets of trajectories: labelled trajectories containing state, action, reward triplets at every timestep, along with unlabelled trajectories that contain only state and reward information. For this setting, we develop a simple meta-algorithmic pipeline that learns an inverse-dynamics model on the labelled data to obtain proxy-labels for the unlabelled data, followed by the use of any offline RL algorithm on the true and proxy-labelled trajectories. Empirically, we find this simple pipeline to be highly successful --- on several D4RL benchmarks~\cite{fu2020d4rl}, certain offline RL algorithms can match the performance of variants trained on a fully labeled dataset even when we label only 10\% of trajectories from the low-return regime. 
Finally, we perform a large-scale controlled empirical study investigating the interplay of data-centric properties of the labelled and unlabelled datasets with algorithmic design choices (e.g., inverse dynamics, offline RL algorithm) to identify general trends and best practices for training RL agents on semi-supervised offline datasets.", E-Forcing: Improving Autoregressive Models by Treating it as an Energy-Based One,https://openreview.net/forum?id=UROBiQEOLP,https://openreview.net/pdf?id=UROBiQEOLP,we propose a unique method termed E-Forcing for training autoregressive generative models that takes advantage of a well-designed energy-based learning objective.,"Autoregressive generative models are commonly used to solve tasks involving sequential data. They have, however, been plagued by a slew of inherent flaws due to the intrinsic characteristics of chain-style conditional modeling (e.g., exposure bias or lack of long-range coherence), severely limiting their ability to model distributions properly. In this paper, we propose a unique method termed E-Forcing for training autoregressive generative models that takes advantage of a well-designed energy-based learning objective. By leveraging the extra degree of freedom of the softmax operation, we can make the autoregressive model itself an energy-based model for measuring the likelihood of input without introducing any extra parameters. Furthermore, we show that with the help of E-Forcing, we can alleviate the above flaws of autoregressive models. Extensive empirical results, covering numerous benchmarks, demonstrate the effectiveness of the proposed approach.","autoregressive models, exposure bias, language modeling, neural machine translation" Joint Generator-Ranker Learning for Natural Language Generation,https://openreview.net/forum?id=94WEPuo8D_a,https://openreview.net/pdf?id=94WEPuo8D_a,,"Generate-then-rank is a widely used mechanism for text generation, where a generator produces multiple candidates and a ranker chooses the best one. However, existing methods usually train the generator and the ranker separately, which causes a lack of mutual feedback and a misalignment of their objectives. This results in suboptimal generation quality. To address this issue, we propose JGR, a novel joint training algorithm that integrates the generator and the ranker in a single framework. JGR optimizes the generator with a hybrid objective that combines data likelihood and ranker reward, and trains the ranker with a contrastive loss that compares the generator outputs. By alternately updating the generator and the ranker, JGR can effectively harmonize their learning and enhance their quality jointly. We evaluate JGR on various text generation tasks and demonstrate that it surpasses existing methods on four public datasets across three common generation scenarios. We will make our code and models publicly available for reproducibility. ","natural language processing, natural language generation" The Progressive Alignment-aware Multimodal Fusion with Easy2hard Strategy for Multimodal Neural Machine Translation,https://openreview.net/forum?id=lNQAXpf7rGu,https://openreview.net/pdf?id=lNQAXpf7rGu,,"Multimodal neural machine translation (MNMT) aims to improve text-level machine translation performance in the presence of text-related images. 
Most of the previous works on MNMT have only focused on either multimodal feature fusion or noisy multi-modal representations based on full visual and textual features; however, the degree of multi-modal alignment is often ignored. Generally, fine-grained multi-modal information, such as visual objects and textual entities, is easy to align, but global-level semantic alignment is always difficult. In order to alleviate the challenging problem of multi-modal alignment, this paper proposes a novel progressive multimodal fusion approach with an easy-to-hard (easy2hard) cross-modal alignment strategy that fully exploits visual information for MNMT. We first extract visual and textual features with modal-specific pre-trained models, respectively, and the fine-grained features (e.g., regional visual features, entity features) are roughly aligned as multi-modal anchors based on a cross-modal interactive module. Then a progressive multi-modal fusion framework is employed for MNMT by gradually narrowing the global-level multi-modal semantic gap based on the roughly aligned anchors. We validate our method on the Multi30K dataset. The experimental results show the superiority of our proposed model, which achieves state-of-the-art (SOTA) scores in all En-De, En-Fr and En-Cs translation tasks.","Multimodal neural machine translation, Multi-modal alignment, Easy2hard, Progressive multi-modal fusion, Multi30K" Masked Vector Quantization,https://openreview.net/forum?id=ezgCdnzApo,https://openreview.net/pdf?id=ezgCdnzApo,"We propose Masked Vector Quantization, a novel variant of Vector Quantization, which increases the representational capacity of each code vector by learning mask configurations via a winner-takes-all training regime called Multiple Hypotheses Dropout.","Generative models with discrete latent representations have recently demonstrated an impressive ability to learn complex high-dimensional data distributions. However, their performance relies on a long sequence of tokens per instance and a large number of codebook entries, resulting in long sampling times and considerable computation to fit the categorical posterior. To address these issues, we propose the Masked Vector Quantization (MVQ) framework, which increases the representational capacity of each code vector by learning mask configurations via a stochastic winner-takes-all training regime called Multiple Hypotheses Dropout (MH-Dropout). On ImageNet 64$\times$64, MVQ reduces FID in existing vector quantization architectures by up to $68\%$ at 2 tokens per instance and $57\%$ at 5 tokens. These improvements widen as the number of codebook entries is reduced and allow for a $7\textup{-}45\times$ speed-up in token sampling during inference. As an additional benefit, we find that smaller latent spaces lead to MVQ identifying transferable visual representations, multiple of which can be smoothly combined.","generative models, dropout, vector quantization, autoencoder, discrete representations" On the Importance of the Policy Structure in Offline Reinforcement Learning,https://openreview.net/forum?id=EJPWfoJRba,https://openreview.net/pdf?id=EJPWfoJRba,"We introduce a structure in the policy representation in offline reinforcement learning, which reduces the critic loss during training and improves the resulting policy performance.","Offline reinforcement learning (RL) has attracted a great deal of attention recently as an approach to utilizing past experience to learn a policy. 
Recent studies have reported the challenges of offline RL, such as estimating the values of actions that are out of the data distribution. To mitigate these issues, we propose an algorithm that leverages a mixture of deterministic policies. In our framework, the state-action space is divided by learning discrete latent variables, and sub-policies corresponding to each region are trained. The proposed algorithm, which we call Value-Weighted Variational Auto-Encoder (V2AE), is derived by considering the variational lower bound of the offline RL objective function. The aim of this work is to shed light on the importance of the policy structure in offline RL. We show empirically that the use of the proposed mixture policy can reduce the accumulation of the approximation error in offline RL, which was reported in previous studies. Experimental results also indicate that introducing the policy structure improves the performance on tasks with D4RL benchmarking datasets.","offline reinforcement learning, discrete latent representations" Bandit Learning in Many-to-one Matching Markets with Uniqueness Conditions,https://openreview.net/forum?id=VgCqZL_N0a,https://openreview.net/pdf?id=VgCqZL_N0a,,"An emerging line of research is dedicated to the problem of one-to-one matching markets with bandits, where the preference of one side is unknown and thus we need to match while learning the preference through multiple rounds of interaction. However, in many real-world applications such as online recruitment platforms for short-term workers, one side of the market can select more than one participant from the other side, which motivates the study of the many-to-one matching problem. Moreover, the existence of a unique stable matching is crucial to the competitive equilibrium of the market. In this paper, we first introduce a new, more general \textit{$\tilde{\alpha}$}-condition to guarantee the uniqueness of stable matching in many-to-one matching problems, which generalizes some established uniqueness conditions such as \textit{SPC} and \textit{Serial Dictatorship}, and recovers the known $\alpha$-condition if the problem is reduced to one-to-one matching. Under this new condition, we design an MO-UCB-D4 algorithm with an $O\left(\frac{NK\log(T)}{\Delta^2}\right)$ regret bound, where $T$ is the time horizon, $N$ is the number of agents, $K$ is the number of arms, and $\Delta$ is the minimum reward gap. Extensive experiments show that our algorithm achieves uniformly good performance under different uniqueness conditions.","matching markets, multi-armed bandit, many-to-one setting, uniqueness conditions" Can you Trust your Disentanglement?,https://openreview.net/forum?id=MQ2IvNeZJD,https://openreview.net/pdf?id=MQ2IvNeZJD,"by exposing problems in disentanglement metrics, and introducing new metrics and a new task, we make the case that existing disentangled models actually produce representations that are largely entangled","There has been growing interest, in recent years, in learning disentangled representations of data. These are representations in which distinct features, such as size or shape, are represented by distinct neurons. Measuring disentanglement, i.e., quantifying the extent to which a given representation is disentangled, is not straightforward. Multiple metrics have been proposed. In this paper, we identify two failings of existing metrics, and show how they can assign a high score to a model which is still entangled. We then propose two new metrics which redress these problems. 
Additionally, we introduce the task of recognizing novel combinations of familiar features (NCFF), which we argue is doable if and only if the model is disentangled. As well as being desirable in itself, NCFF provides a tangible downstream task that can help focus the field of disentanglement research, in contrast to the set of bespoke metrics that are currently used. We then show empirically that existing methods perform poorly on our proposed metrics and fail at recognizing NCFF and so, we argue, are not disentangled.","deep learning, disentanglement" TRANSFORMER-PATCHER: ONE MISTAKE WORTH ONE NEURON,https://openreview.net/forum?id=4oYUGeGBPm,https://openreview.net/pdf?id=4oYUGeGBPm,A Sequential Model Editor to correct the model's output on specific inputs.,"Large Transformer-based Pretrained Language Models (PLMs) dominate almost all Natural Language Processing (NLP) tasks. Nevertheless, they still make mistakes from time to time. For a model deployed in an industrial environment, fixing these mistakes quickly and robustly is vital to improve user experiences. Previous works formalize such problems as Model Editing (ME) and mostly focus on fixing one mistake. However, the one-mistake-fixing scenario is not an accurate abstraction of the real-world challenge. In the deployment of AI services, there are ever-emerging mistakes, and the same mistake may recur if not corrected in time. Thus a preferable solution is to rectify the mistakes as soon as they appear, nonstop. Therefore, we extend the existing ME into Sequential Model Editing (SME) to help develop more practical editing methods. Our study shows that current ME methods either fail to make a sequence of edits or fail to remember previous edits. We then introduce Transformer-Patcher, a novel model editor that can shift the behavior of transformer-based models by simply adding and training a few neurons in the last Feed-Forward Network layer. Experimental results on both classification and generation tasks show that Transformer-Patcher can successively correct up to thousands of errors (Reliability) and generalize to their equivalent inputs (Generality) while retaining the model’s accuracy on irrelevant inputs (Locality). Our method outperforms previous fine-tuning and HyperNetwork-based methods and achieves state-of-the-art performance for Sequential Model Editing (SME).",Sequential Model Editing Corrupted Image Modeling for Self-Supervised Visual Pre-Training,https://openreview.net/forum?id=09hVcSDkea,https://openreview.net/pdf?id=09hVcSDkea,,"We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training. CIM uses an auxiliary generator with a small trainable BEiT to corrupt the input image instead of using artificial [MASK] tokens, where some patches are randomly selected and replaced with plausible alternatives sampled from the BEiT output distribution. Given this corrupted image, an enhancer network learns to either recover all the original image pixels, or predict whether each visual token is replaced by a generator sample or not. The generator and the enhancer are simultaneously trained and synergistically updated. After pre-training, the enhancer can be used as a high-capacity visual encoder for downstream tasks. CIM is a general and flexible visual pre-training framework that is suitable for various network architectures. For the first time, CIM demonstrates that both ViT and CNN can learn rich visual representations using a unified, non-Siamese framework. 
Experimental results show that our approach achieves compelling performance on vision benchmarks, such as ImageNet classification and ADE20K semantic segmentation.","Self-supervised Learning, Representation Learning, Vision Transformer" Semi-Implicit Variational Inference via Score Matching,https://openreview.net/forum?id=sd90a2ytrt,https://openreview.net/pdf?id=sd90a2ytrt,A new semi-implicit variational inference method using a score matching training objective,"Semi-implicit variational inference (SIVI) greatly enriches the expressiveness of variational families by considering implicit variational distributions defined in a hierarchical manner. However, due to the intractable densities of variational distributions, current SIVI approaches often use surrogate evidence lower bounds (ELBOs) or employ expensive inner-loop MCMC runs for unbiased ELBOs for training. In this paper, we propose SIVI-SM, a new method for SIVI based on an alternative training objective via score matching. Leveraging the hierarchical structure of semi-implicit variational families, the score matching objective allows a minimax formulation where the intractable variational densities can be naturally handled with denoising score matching. We show that SIVI-SM closely matches the accuracy of MCMC and outperforms ELBO-based SIVI methods in a variety of Bayesian inference tasks. ","Semi-implicit variational inference, denoising score matching, minimax optimization" Sharper Bounds for Uniformly Stable Algorithms with Stationary $\varphi$-mixing Process,https://openreview.net/forum?id=8E5Yazboyh,https://openreview.net/pdf?id=8E5Yazboyh,We develop stability and generalization bounds for learning with mixing sequences.,"Generalization analysis of learning algorithms often builds on a critical assumption that training examples are independently and identically distributed (i.i.d.), which is often violated in practical problems such as time series prediction. In this paper, we use algorithmic stability to study the generalization performance of learning algorithms with $\varphi$-mixing data, where the dependency between observations weakens over time. We show that uniformly stable algorithms guarantee high-probability generalization bounds that improve the state of the art by a factor of $\sqrt{n}$, where $n$ is the sample size. We apply our general result to specific algorithms including regularization schemes, stochastic gradient descent and localized iterative regularization, and develop excess population risk bounds for learning with $\varphi$-mixing data. Our analysis builds on a novel moment bound for weakly-dependent random variables on a mixing sequence and a novel error decomposition of generalization error.","Algorithmic Stability, Non-I.I.D. Learning, Generalization Error, Learning Theory" Few-Shot Anomaly Detection on Industrial Images through Contrastive Fine-Tuning,https://openreview.net/forum?id=iV0r9898C-,https://openreview.net/pdf?id=iV0r9898C-,We propose a few-shot anomaly detection approach towards industrial defect identification,"Detecting abnormal products through imagery data is essential to quality control in manufacturing. Existing approaches towards anomaly detection~(AD) often rely on a substantial amount of anomaly-free samples to train representation and density models. Nevertheless, large anomaly-free datasets may not always be available before the inference stage, and this requires building an anomaly detection framework with only a handful of normal samples, a.k.a. 
few-shot anomaly detection (FSAD). We propose two techniques to address the challenges in FSAD. First, we employ a model pretrained on a large source dataset to initialize the model weights. To ameliorate the covariate shift between source and target domains, we adopt contrastive training on the few-shot target domain data. Second, to encourage learning representations suitable for downstream AD, we further incorporate cross-instance pairs to increase tightness within the normal sample cluster and better separation between normal and synthesized negative samples. Extensive evaluations on six few-shot anomaly detection benchmarks demonstrate the effectiveness of the proposed method.","Anomaly Detection, Transfer Learning, Few-Shot Learning" Rate-Distortion Optimized Post-Training Quantization for Learned Image Compression,https://openreview.net/forum?id=EA6YF_qwVe,https://openreview.net/pdf?id=EA6YF_qwVe,,"Quantizing a floating-point neural network to its fixed-point representation is crucial for Learned Image Compression (LIC) because it ensures decoding consistency for interoperability and reduces space-time complexity for implementation. Existing solutions often have to retrain the network for model quantization, which is time-consuming and impractical. This work suggests the use of Post-Training Quantization (PTQ) to directly process pretrained, off-the-shelf LIC models. We theoretically prove that minimizing the mean squared error (MSE) in PTQ is sub-optimal for the compression task and thus develop a novel Rate-Distortion (R-D) Optimized PTQ (RDO-PTQ) to best retain the compression performance. Such RDO-PTQ just needs to compress a few images (e.g., 10) to optimize the transformation of the weight, bias, and activation of the underlying LIC model from its native 32-bit floating-point (FP32) format to 8-bit fixed-point (INT8) precision for fixed-point inference onwards. Experiments reveal the outstanding efficiency of the proposed method on different LICs, showing the closest coding performance to their floating-point counterparts. Moreover, our method is a lightweight and plug-and-play approach without any need for model retraining, which is attractive to practitioners. ","learned image compression, post-training quantization, rate-distortion optimization" On the Edge of Benign Overfitting: Label Noise and Overparameterization Level,https://openreview.net/forum?id=UrEwJebCxk,https://openreview.net/pdf?id=UrEwJebCxk,Provide a theoretical analysis for the phenomenon that a ResNet model overfits benignly on Cifar10 but not benignly on ImageNet,"Studies on benign overfitting provide insights for the success of overparameterized deep learning models. In this work, we examine whether overfitting is truly benign in real-world classification tasks. We start with the observation that a ResNet model overfits benignly on Cifar10 but not benignly on ImageNet. To understand why benign overfitting fails in the ImageNet experiment, we theoretically analyze benign overfitting under a more restrictive setup where the number of parameters is not significantly larger than the number of data points. Under this mild overparameterization setup, our analysis identifies a phase change: unlike in the previous heavy overparameterization settings, benign overfitting can now fail in the presence of label noise. Our analysis explains our empirical observations, and is validated by a set of control experiments with ResNets. 
Our work highlights the importance of understanding implicit bias in underfitting regimes as a future direction.","generalization, benign overfitting, mild overparameterization, implicit bias" Predictive Inference with Feature Conformal Prediction,https://openreview.net/forum?id=0uRm1YmFTu,https://openreview.net/pdf?id=0uRm1YmFTu,Conformal inference in feature space.,"Conformal prediction is a distribution-free technique for establishing valid prediction intervals. Although conformal prediction is conventionally conducted in the output space, this is not the only possibility. In this paper, we propose feature conformal prediction, which extends the scope of conformal prediction to semantic feature spaces by leveraging the inductive bias of deep representation learning. From a theoretical perspective, we demonstrate that feature conformal prediction provably outperforms regular conformal prediction under mild assumptions. Our approach can be combined with not only vanilla conformal prediction, but also other adaptive conformal prediction methods. Experiments on various predictive inference tasks corroborate the efficacy of our method.","conformal prediction, uncertainty" Measuring Image Complexity as a Discrete Hierarchy using MDL Clustering,https://openreview.net/forum?id=iZ3Qo_akPA,https://openreview.net/pdf?id=iZ3Qo_akPA,"The first image complexity measure that does not assign white noise high complexity, based on clustering and inspired by molecular assembly theory.","Being able to quantify the complexity of data is an important question in machine learning, computer science, and data science. In the case of image data, a number of methods have been proposed. However, existing methods are based only on the degree of variation across the image, and cannot distinguish meaningful content from noise. In particular, existing methods assign a very high complexity to white noise images, despite such images containing no meaningful information. In this paper, we present a method to measure the complexity of images by analyzing them as a discrete hierarchy of patches, using MDL clustering. Beginning with individual pixels, each level of the hierarchy is formed using the cluster labels from the level below. The complexity is the sum, across all levels, of the entropy of cluster labels inside all patches on that level. Clustering is performed using the minimum description length (MDL) principle, which we leverage in a novel way to distinguish signal from noise. We test against existing methods on seven different sets of images, four from public image datasets and three synthetic, and show that ours is the only method that can assign an accurate measure of complexity to all images considered. Every other method measures white noise as highly complex, while our method gives it zero complexity. We then present ablation studies showing the contribution of the components of our method, and further experiments showing robustness to image quality.","image complexity, clustering" Recon: Reducing Conflicting Gradients From the Root For Multi-Task Learning,https://openreview.net/forum?id=ivwZO-HnzG_,https://openreview.net/pdf?id=ivwZO-HnzG_,We propose a very simple yet effective approach to reduce the occurrence of conflicting gradients for multi-task learning.,"A fundamental challenge for multi-task learning is that different tasks may conflict with each other when they are solved jointly, and a cause of this phenomenon is conflicting gradients during optimization.
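For context on the feature conformal prediction abstract above, the following minimal sketch shows vanilla split conformal prediction in the output space, the baseline it generalizes to feature space. The quantile construction is the standard one; variable names are ours.

```python
import numpy as np

def split_conformal(y_cal, y_hat_cal, y_hat_test, alpha=0.1):
    """Vanilla split conformal prediction in the output space.

    Returns intervals with marginal coverage >= 1 - alpha under exchangeability.
    """
    resid = np.abs(y_cal - y_hat_cal)                # calibration residuals
    n = len(resid)
    k = int(np.ceil((n + 1) * (1 - alpha)))          # conformal quantile rank
    q = np.sort(resid)[min(k, n) - 1]
    return y_hat_test - q, y_hat_test + q            # symmetric interval per test point
```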
Recent works attempt to mitigate the influence of conflicting gradients by directly altering the gradients based on some criteria. However, our empirical study shows that ``gradient surgery'' cannot reduce the occurrence of conflicting gradients. In this paper, we take a different approach to reduce conflicting gradients from the root. In essence, we investigate the task gradients w.r.t. each shared network layer, select the layers with high conflict scores, and make them task-specific. Our experiments show that with only a slight increase in model parameters, such a simple approach can effectively reduce the occurrence of conflicting gradients in the remaining shared layers and achieve better performance. We demonstrate the generality of our approach by combining it with state-of-the-art approaches including gradient manipulation methods and branched architecture search methods. Comprehensive experiments on various benchmarks show that our approach can substantially improve their performance.","Multi-task Learning, Conflicting Gradients" OCD: Learning to Overfit with Conditional Diffusion Models,https://openreview.net/forum?id=_p6enPE4Xa,https://openreview.net/pdf?id=_p6enPE4Xa,Local learning with a hypernetwork that employs a diffusion process,"We present a dynamic model in which the weights are conditioned on an input sample $x$ and are learned to match those that would be obtained by finetuning a base model on $x$ and its label $y$. This mapping between an input sample and network weights is shown to be approximated by a linear transformation of the sample distribution, which suggests that a denoising diffusion model can be suitable for this task. The diffusion model we therefore employ focuses on modifying a single layer of the base model and is conditioned on the input, activations, and output of this layer. Our experiments demonstrate the wide applicability of the method for image classification, 3D reconstruction, tabular data, and speech separation. Our code is attached as supplementary.","Local learning, hypernetworks, diffusion processes" Measure the Predictive Heterogeneity,https://openreview.net/forum?id=g2oB_k-18b,https://openreview.net/pdf?id=g2oB_k-18b,"In this work, we propose the predictive heterogeneity to measure the data heterogeneity that affects prediction. Theoretical analysis and empirical results validate the rationality of the proposed measure.","As an intrinsic and fundamental property of big data, data heterogeneity exists in a variety of real-world applications, such as agriculture, sociology, health care, etc. For machine learning algorithms, ignoring data heterogeneity can significantly hurt generalization performance and algorithmic fairness, since the prediction mechanisms among different sub-populations are likely to differ. In this work, we focus on the data heterogeneity that affects the prediction of machine learning models, and first formalize the \emph{Predictive Heterogeneity}, which takes into account the model capacity and computational constraints. We prove that it can be reliably estimated from finite data with PAC bounds even in high dimensions. Additionally, we propose the Information Maximization (IM) algorithm, a bi-level optimization algorithm, to explore the predictive heterogeneity of data.
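To make the layer-selection idea in the Recon abstract above concrete, here is a hedged sketch of one way to score per-layer gradient conflict as the fraction of task pairs whose gradients point in opposing directions; the paper's exact conflict score and selection rule may differ.

```python
import torch
import torch.nn.functional as F

def layer_conflict_scores(model, task_losses):
    """Fraction of task pairs with conflicting (negative-cosine) gradients,
    computed separately for each shared parameter tensor."""
    per_task = []
    for loss in task_losses:                         # one scalar loss per task
        g = torch.autograd.grad(loss, list(model.parameters()), retain_graph=True)
        per_task.append([gi.detach().flatten() for gi in g])
    scores = []
    for p in range(len(per_task[0])):                # one score per parameter tensor
        conflicts, pairs = 0.0, 0
        for i in range(len(per_task)):
            for j in range(i + 1, len(per_task)):
                cos = F.cosine_similarity(per_task[i][p], per_task[j][p], dim=0)
                conflicts += float(cos < 0)
                pairs += 1
        scores.append(conflicts / pairs)             # high score -> make layer task-specific
    return scores
```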
Empirically, the explored predictive heterogeneity provides insights for sub-population divisions in agriculture, sociology, and object recognition, and leveraging such heterogeneity benefits out-of-distribution generalization performance.","data heterogeneity, predictive information, predictive heterogeneity" On the robustness of self-supervised models for generative spoken language modeling,https://openreview.net/forum?id=hT4qiZK0Iv,https://openreview.net/pdf?id=hT4qiZK0Iv,Method to learn robust self-supervised speech representations for generative spoken language modeling,"Self-supervised representations have been extensively studied for discriminative and generative tasks. However, their robustness capabilities have not been extensively investigated. This work focuses on self-supervised representations for spoken generative language models. First, we empirically demonstrate how current state-of-the-art speech representation models lack robustness to basic signal variations that do not alter the spoken information. To overcome this, we propose an effective and efficient method to learn robust self-supervised speech representations for generative spoken language modeling. The proposed approach is based on applying a set of signal transformations to the speech signal and optimizing the model using an iterative pseudo-labeling scheme. Our method significantly improves over the evaluated baselines when considering encoding metrics. We additionally evaluate our method on the speech-to-speech translation task, considering Spanish-English and French-English conversions, and empirically demonstrate the benefits of the proposed approach.","robustness, speech, language modeling, spoken language modeling" Non-equispaced Fourier Neural Solvers for PDEs,https://openreview.net/forum?id=r6a21wSch9p,https://openreview.net/pdf?id=r6a21wSch9p,,"Solving partial differential equations is difficult. Recently proposed neural resolution-invariant models, despite their effectiveness and efficiency, usually require equispaced spatial points of data. However, sampling in the spatial domain is sometimes inevitably non-equispaced in real-world systems, limiting their applicability. In this paper, we propose a Non-equispaced Fourier PDE Solver (\textsc{NFS}) with adaptive interpolation on resampled equispaced points and a variant of Fourier Neural Operators as its components. Experimental results on complex PDEs demonstrate its advantages in accuracy and efficiency. Compared with the spatially-equispaced benchmark methods, it achieves superior performance with a $42.85\%$ improvement in MAE, and is able to handle non-equispaced data with a tiny loss of accuracy. Besides, to the best of our knowledge, \textsc{NFS} is the first ML-based method with mesh-invariant inference ability to successfully model turbulent flows in non-equispaced scenarios, with a minor deviation of the error on unseen spatial points. ","Neural Operators, PDE Solvers" Time to augment visual self-supervised learning,https://openreview.net/forum?id=o8xdgmwCP8l,https://openreview.net/pdf?id=o8xdgmwCP8l,We show that time-based augmentations resulting from ego-motion and object manipulations improve over standard data-augmentation methods on the ability to visually recognize object categories.,"Biological vision systems are unparalleled in their ability to learn visual representations without supervision.
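As a toy illustration of the resampling idea behind NFS described above: interpolate non-equispaced samples onto a uniform grid so that the FFT, and hence a Fourier-operator layer, becomes applicable. NFS learns its interpolation adaptively; the plain linear interpolation below is our stand-in.

```python
import numpy as np

def to_equispaced_spectrum(x_obs, u_obs, n_grid=128):
    """Resample non-equispaced 1-D observations onto a uniform grid, then FFT."""
    x_grid = np.linspace(x_obs.min(), x_obs.max(), n_grid)
    u_grid = np.interp(x_grid, x_obs, u_obs)         # NFS learns this step instead
    return x_grid, np.fft.rfft(u_grid)

x = np.sort(np.random.rand(50))                      # non-equispaced sample locations
u = np.sin(2 * np.pi * 3 * x)                        # toy field values
grid, spectrum = to_equispaced_spectrum(x, u)
```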
In machine learning, self-supervised learning (SSL) has led to major advances in forming object representations in an unsupervised fashion. These systems learn representations invariant to augmentation operations over images, like cropping or flipping. In contrast, biological vision systems exploit the temporal structure of the visual experience. This gives access to ``augmentations'' not commonly used in SSL, like watching the same object from multiple viewpoints or against different backgrounds. Here, we systematically investigate and compare the potential benefits of such time-based augmentations for learning object categories. Our results show that time-based augmentations achieve large performance gains over state-of-the-art image augmentations. Specifically, our analyses reveal that: 1) 3-D object manipulations drastically improve the learning of object categories; 2) viewing objects against changing backgrounds is vital for learning to discard background-related information. Overall, we conclude that time-based augmentations can greatly improve contrastive learning, narrowing the gap between artificial and biological vision systems.","object representations, self-supervised learning, time-based augmentations, data augmentations" Adversarial IV Regression for Demystifying Causal Features on Adversarial Examples,https://openreview.net/forum?id=h2L7xRNh7n_,https://openreview.net/pdf?id=h2L7xRNh7n_,We propose a way of understanding the adversarial vulnerability through a causal perspective.,"The origin of adversarial examples is still inexplicable, and it arouses arguments from various viewpoints, despite comprehensive investigations. In this paper, we propose a way of delving into the unexpected vulnerability in adversarially trained networks from a causal perspective, namely adversarial instrumental variable (IV) regression. By deploying it, we estimate the causal relation of adversarial prediction under an unbiased environment dissociated from unknown confounders. Our approach aims to demystify inherent causal features on adversarial examples by leveraging a zero-sum optimization game between a causal feature estimator (i.e., hypothesis model) and worst-case counterfactuals (i.e., test function) that disturb the search for causal features. Through extensive analyses, we demonstrate that the estimated causal features are highly related to the correct prediction for adversarial robustness, and the counterfactuals exhibit extreme features significantly deviating from the correct prediction. In addition, we present how to effectively inoculate CAusal FEatures (CAFE) into defense networks for improving adversarial robustness.","Adversarial Examples, Causal Feature, Adversarial Robustness" Probable Dataset Searching Method with Uncertain Dataset Information in Adjusting Architecture Hyper Parameter,https://openreview.net/forum?id=UvlCVoLV1i,https://openreview.net/pdf?id=UvlCVoLV1i,,"We study tasks with uncertain dataset information, since different parts of the data may be obtained with different levels of difficulty. For example, in unsupervised learning and domain adaptation, datasets are provided without label information because of the cost of human annotation.
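The time-based augmentation idea above can be made concrete with a small sketch: contrastive positives are drawn from temporally nearby frames of the same visual stream, rather than from crops or flips of a single image. The tensor shape and the `max_shift` parameter below are our assumptions.

```python
import torch

def temporal_positive_pairs(video: torch.Tensor, max_shift: int = 4):
    """video: (T, C, H, W) frames of one continuous visual experience.

    Returns anchor frames and temporally shifted positives: the same object
    seen a moment later, from another viewpoint or against another background.
    """
    T = video.shape[0]
    anchors = torch.arange(T - max_shift)
    shifts = torch.randint(1, max_shift + 1, (T - max_shift,))
    return video[anchors], video[anchors + shifts]
```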
In deep learning, adjusting architecture hyperparameters is important for model performance but also time-consuming, so we try to adjust hyperparameters under two types of uncertain dataset information: (1) dataset labels are obtained late, so hyperparameters need to be adjusted without complete dataset information; (2) hyperparameters are adjusted with a subset of the training dataset, since training models with the complete training dataset is time-consuming. Here, we propose several loss functions to search for a probable dataset when complete dataset information is unavailable. Experiments on 9 real-world datasets demonstrate the performance of our method.", Impact of the Last Fully Connected Layer on Out-of-distribution Detection,https://openreview.net/forum?id=42Xu5gudPL,https://openreview.net/pdf?id=42Xu5gudPL,,"Out-of-distribution (OOD) detection, a task that aims to detect OOD data during deployment, has received considerable research attention recently, due to its importance for the safe deployment of deep models. In this task, a major problem is how to handle the overconfidence problem on OOD data. While this problem has been explored from several perspectives in previous works, such as the measure of OOD uncertainty and the activation function, the connection between the last fully connected (FC) layer and this overconfidence problem is still less explored. In this paper, we find that the weight of the last FC layer of the model trained on in-distribution (ID) data can be an important source of the overconfidence problem, and we propose a simple yet effective OOD detection method that assigns the weight of the last FC layer small values instead of using the original weight trained on ID data. We analyze in Sec. 5 how our proposed method can make the OOD data and the ID data more separable, and thus alleviate the overconfidence problem. Moreover, our proposed method can be flexibly applied to various off-the-shelf OOD detection methods. We show the effectiveness of our proposed method through extensive experiments on the ImageNet dataset, the CIFAR-10 dataset, and the CIFAR-100 dataset.", "Towards Lightweight, Model-Agnostic and Diversity-Aware Active Anomaly Detection",https://openreview.net/forum?id=-vKlt84fHs,https://openreview.net/pdf?id=-vKlt84fHs,,"Active Anomaly Discovery (AAD) is flourishing in the anomaly detection research area, which aims to incorporate analysts’ feedback into unsupervised anomaly detectors. However, existing AAD approaches usually prioritize the samples with the highest anomaly scores for user labeling, which hinders the exploration of anomalies that were initially ranked lower. Besides, most existing AAD approaches are specially tailored for a certain unsupervised detector, making it difficult to extend them to other detection models. To tackle these problems, we propose a lightweight, model-agnostic and diversity-aware AAD method, named LMADA. In LMADA, we design a diversity-aware sample selector powered by Determinantal Point Process (DPP). It considers the diversity of samples in addition to their anomaly scores for feedback querying. Furthermore, we propose a model-agnostic tuner. It approximates diverse unsupervised detectors with a unified proxy model, based on which the feedback information is incorporated by a lightweight non-linear representation adjuster. Through extensive experiments on 8 public datasets, LMADA achieved 74% F1-Score improvement on average, outperforming other comparative AAD approaches.
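As a loose sketch of the last-FC-layer OOD method described above: replace the ID-trained FC weight with small values and score OOD-ness with maximum softmax probability. The paper's precise assignment scheme is its contribution; the small Gaussian placeholder and the `tau` scale below are purely our assumptions.

```python
import torch

@torch.no_grad()
def msp_with_small_fc(features: torch.Tensor, fc: torch.nn.Linear, tau: float = 0.01):
    """Max-softmax OOD score with the last FC layer re-assigned small values
    (placeholder scheme, not the paper's). Higher score -> more likely ID."""
    w_small = tau * torch.randn_like(fc.weight)      # small values, not the ID-trained weight
    logits = features @ w_small.T
    return torch.softmax(logits, dim=-1).max(dim=-1).values
```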
Besides, LMADA achieves significant performance boosts under various unsupervised detectors.","Active Anomaly Discovery, Diversity Sampling, Deep Learning" Multi-Level Contrastive Learning for Dense Prediction Task,https://openreview.net/forum?id=Iwq3HPz96O,https://openreview.net/pdf?id=Iwq3HPz96O,Multi-Level Contrastive Learning is an efficient self-supervised method to learn region-level feature representations for dense prediction tasks.,"In this work, we present Multi-Level Contrastive Learning for Dense Prediction Task (MCL), an efficient self-supervised method to learn region-level feature representations for dense prediction tasks. This approach is motivated by three key factors in detection: localization, scale consistency and recognition. Considering the above factors, we design a novel pretext task, which explicitly encodes absolute position and scale information simultaneously by assembling multi-scale images in a montage manner to mimic a multi-object scenario. Unlike existing image-level self-supervised methods, our method constructs a multi-level contrastive loss by considering each sub-region of the montage image as a singleton, to learn a regional semantic representation for translation and scale consistency, while reducing the pre-training epochs to the same as supervised pre-training. Extensive experiments show that MCL consistently outperforms recent state-of-the-art methods on various datasets with significant margins. In particular, MCL obtains 42.5 AP^bb and 38.3 AP^mk on COCO with the 1x schedule and surpasses MoCo by 4.0 AP^bb and 3.1 AP^mk, when using Mask R-CNN with an R50-FPN backbone pre-trained for 100 epochs. In addition, we further explore the alignment between the pretext task and downstream tasks. We extend our pretext task to supervised pre-training, which achieves a similar performance to self-supervised learning, demonstrating the importance of the alignment between pretext task and downstream tasks. ","Self-supervised learning, Detection, Segmentation" Switching One-Versus-the-Rest Loss to Increase Logit Margins for Adversarial Robustness,https://openreview.net/forum?id=IVE5g1af87,https://openreview.net/pdf?id=IVE5g1af87,We prove that one-versus-the-rest loss (OVR) increases logit margins two times more than cross-entropy and propose switching between cross-entropy and OVR by the criterion of logit margins to improve adversarial robustness.,"Adversarial training is a promising method to improve the robustness against adversarial attacks. To enhance its performance, recent methods impose high weights on the cross-entropy loss for important data points near the decision boundary. However, these importance-aware methods are vulnerable to sophisticated attacks, e.g., Auto-Attack. In this paper, we experimentally investigate the cause of their vulnerability via margins between logits for the true label and the other labels, because these margins should be large enough to prevent the largest logit from being flipped by the attacks. Our experiments reveal that the histogram of the logit margins of naive adversarial training has two peaks. Thus, the levels of difficulty in increasing logit margins are roughly divided into two: difficult samples (small logit margins) and easy samples (large logit margins). On the other hand, only one peak near zero appears in the histogram of importance-aware methods, i.e., they reduce the logit margins of easy samples.
To increase the logit margins of difficult samples without reducing those of easy samples, we propose switching one-versus-the-rest loss (SOVR), which switches from cross-entropy to one-versus-the-rest loss (OVR) for difficult samples. We derive trajectories of logit margins for a simple problem and prove that OVR increases logit margins two times more than the weighted cross-entropy loss. Thus, SOVR increases the logit margins of difficult samples, unlike existing methods. We experimentally show that SOVR achieves better robustness against Auto-Attack than importance-aware methods.","Adversarial examples, Deep learning, Loss function, Adversarial training" Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection,https://openreview.net/forum?id=1T853KDY3t,https://openreview.net/pdf?id=1T853KDY3t,,"We present an approach to efficiently and effectively adapt a masked image modeling (MIM) pre-trained vanilla Vision Transformer (ViT) for object detection, which is based on our two novel observations: (i) A MIM pre-trained vanilla ViT encoder can work surprisingly well in the challenging object-level recognition scenario even with randomly sampled partial observations, e.g., only 25% ~ 50% of the input embeddings. (ii) In order to construct multi-scale representations for object detection from a single-scale ViT, a randomly initialized compact convolutional stem supplants the pre-trained patchify stem, and its intermediate features can naturally serve as the higher resolution inputs of a feature pyramid network without further upsampling or other manipulations. Meanwhile, the pre-trained ViT is regarded only as the third stage of our detector's backbone rather than the whole feature extractor. This naturally results in a ConvNet-ViT hybrid architecture. The proposed detector, named MIMDet, enables a MIM pre-trained vanilla ViT to outperform leading hierarchical architectures such as Swin Transformer, MViTv2 and ConvNeXt on COCO object detection & instance segmentation, and achieves better results compared with the previous best adapted vanilla ViT detector using a more modest fine-tuning recipe while converging 2.8x faster.","Vision Transformer, Object Detection, Instance Segmentation, Representation Learning" Scaled Neural Multiplicative Model for Tractable Optimization,https://openreview.net/forum?id=Z7d-t6wbNy,https://openreview.net/pdf?id=Z7d-t6wbNy,,"Challenging decision problems in retail and beyond are often solved using the predict-then-optimize paradigm. An initial effort to develop and parameterize a model of an uncertain environment is followed by a separate effort to identify the best possible solution of an optimization problem. Linear models are often used to ensure optimization problems are tractable. Remarkably accurate Deep Neural Network (DNN) models have recently been developed for various prediction tasks. Such models have been shown to scale to large datasets without loss of accuracy and with good computational performance. It can, however, be challenging to formulate tractable optimization problems based on DNN models. In this work we consider the problem of shelf space allocation for retail stores using DNN models. We highlight the trade-off between predictive performance and the tractability of optimization problems. We introduce a Scaled Neural Multiplicative Model (SNMM) with shape constraints for demand learning that leads to a tractable optimization formulation.
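A compact sketch of the switching rule in SOVR described above: compute each sample's logit margin and apply the one-versus-the-rest loss when the margin is small, cross-entropy otherwise. The threshold value below is our placeholder, not a value from the paper.

```python
import torch
import torch.nn.functional as F

def sovr_loss(logits, targets, margin_threshold=0.0):
    """Cross-entropy for easy samples, one-versus-the-rest for difficult ones."""
    true_logit = logits.gather(1, targets[:, None]).squeeze(1)
    other_max = logits.scatter(1, targets[:, None], float('-inf')).max(1).values
    margin = true_logit - other_max                  # logit margin per sample
    onehot = F.one_hot(targets, logits.size(1)).float()
    ovr = F.binary_cross_entropy_with_logits(logits, onehot, reduction='none').sum(1)
    ce = F.cross_entropy(logits, targets, reduction='none')
    return torch.where(margin < margin_threshold, ovr, ce).mean()
```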
Although this work focuses on a specific application, the formulation of the models is general enough to be extended to many real-world applications.","Input Convex Neural Networks, Shape Constrained Models, Shelf Space Optimization" Quasi-Taylor Samplers for Diffusion Generative Models based on Ideal Derivatives,https://openreview.net/forum?id=7ks5PS09q1,https://openreview.net/pdf?id=7ks5PS09q1,A Taylor-expansion approach for diffusion generative models is discussed.,"Diffusion generative models have emerged as a new challenger to popular deep neural generative models such as GANs, but have the drawback that they often require a huge number of neural function evaluations (NFEs) during synthesis unless some sophisticated sampling strategies are employed. This paper proposes new efficient samplers based on the numerical schemes derived by the familiar Taylor expansion, which directly solves the ODE/SDE of interest. In general, it is not easy to compute the derivatives that are required in higher-order Taylor schemes, but in the case of diffusion models, this difficulty is alleviated by the trick that the authors call ``ideal derivative substitution,'' in which the higher-order derivatives are replaced by tractable ones. To derive ideal derivatives, the authors argue that the ``single point approximation,'' in which the true score function is approximated by a conditional one, holds in many cases, and consider the derivatives of this approximation. Applying the quasi-Taylor samplers thus obtained to image generation tasks, the authors experimentally confirmed that the proposed samplers could synthesize plausible images in a small number of NFEs, and that the performance was better than or at the same level as DDIM and Runge-Kutta methods. The paper also discusses the relevance of the proposed samplers to the existing ones mentioned above. ","diffusion models, score based models, neural generative models" Group-oriented Cooperation in Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=qPI2SzRjA3,https://openreview.net/pdf?id=qPI2SzRjA3,"We propose an automatic grouping mechanism in cooperative MARL, which dynamically adjusts the grouping of agents as training proceeds and achieves efficient team cooperation by facilitating intra- and inter-group coordination.","Grouping is ubiquitous in natural systems and is essential for promoting efficiency in team coordination. This paper introduces the concept of grouping into multi-agent reinforcement learning (MARL) and provides a novel formulation of Group-oriented MARL (GoMARL). In contrast to existing approaches that attempt to directly learn the complex relationship between the joint action-values and individual values, we empower groups as a bridge to model the connection between a small set of agents and encourage cooperation among them, thereby improving the efficiency of the whole team. In particular, we factorize the joint action-values as a combination of group-wise values, which guide agents to improve their policies in a fine-grained fashion. We propose a flexible grouping mechanism inspired by variable selection and sparse regularization to generate dynamic groups and group action-values. We further propose a hierarchical control for policy learning that drives the agents in the same group to specialize in similar policies and to possess diversified strategies for various groups.
Extensive experiments on a challenging set of StarCraft II micromanagement tasks and Google Research Football scenarios verify our method's effectiveness and learning efficiency. Detailed component studies show how grouping works and enhances performance.","MARL, Multi-Agent Reinforcement Learning, Group-wise Learning" Exploring Temporally Dynamic Data Augmentation for Video Recognition,https://openreview.net/forum?id=fxjzKOdw9wb,https://openreview.net/pdf?id=fxjzKOdw9wb,We propose a novel data augmentation framework for video recognition that extends static image augmentations to be temporally dynamic.,"Data augmentation has recently emerged as an essential component of modern training recipes for visual recognition tasks. However, data augmentation for video recognition has been rarely explored despite its effectiveness. The few existing augmentation recipes for video recognition naively extend image augmentation methods by applying the same operations to all video frames. Our main idea is that the magnitude of augmentation operations for each frame needs to change over time to capture the real-world video's temporal variations. These variations should be generated as diversely as possible using fewer additional hyper-parameters during training. With this motivation, we propose a simple yet effective video data augmentation framework, DynaAugment. The magnitude of augmentation operations on each frame is changed by an effective mechanism, Fourier Sampling, that parameterizes diverse, smooth, and realistic temporal variations. DynaAugment also includes an extended search space suitable for video for automatic data augmentation methods. Our experiments demonstrate that there is additional room for improvement over static augmentations on diverse video models. Specifically, we show the effectiveness of DynaAugment on various video datasets and tasks: large-scale video recognition (Kinetics-400 and Something-Something-v2), small-scale video recognition (UCF-101 and HMDB-51), fine-grained video recognition (Diving-48 and FineGym), video action segmentation on Breakfast, video action localization on THUMOS'14, and video object detection on MOT17Det.","Video Recognition, Data Augmentation" CacheGNN: Enhancing Graph Neural Networks with Global Information Caching,https://openreview.net/forum?id=6KYPBGeYxv,https://openreview.net/pdf?id=6KYPBGeYxv,,"Graph neural networks (GNNs) have achieved impressive results on various graph learning tasks. Most GNNs merely leverage information from a limited range of local neighbors, which makes it difficult to effectively capture global information in the graph. However, utilising global information enables GNNs to capture long-range dependencies and learn more informative node representations. To this end, we propose CacheGNN, an approach that leverages information from globally similar nodes to enhance GNNs. Our CacheGNN uses a cache to store node representations and utilises those cached embeddings to efficiently find globally similar nodes. To make predictions quickly and efficiently at test time, our CacheGNN retrieves globally similar nodes from a set of representative nodes, which is selected from a sparse node selection distribution with a Dirichlet prior. We conduct node classification experiments on seven real-world datasets under inductive and transductive settings.
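To illustrate the Fourier Sampling mechanism named in the DynaAugment abstract above, here is a rough sketch that draws a smooth per-frame magnitude curve as a random mixture of a few low-frequency sinusoids around a base magnitude; the exact parameterization in DynaAugment may differ, and the damping and scaling choices below are ours.

```python
import numpy as np

def fourier_sampled_magnitudes(T: int, base: float, n_freqs: int = 3, seed=None):
    """Smooth, diverse per-frame augmentation magnitudes over T video frames."""
    rng = np.random.default_rng(seed)
    t = np.arange(T) / T
    curve = np.zeros(T)
    for k in range(1, n_freqs + 1):                  # low frequencies only
        amp = rng.normal(0.0, 1.0 / k)               # damp higher frequencies
        phase = rng.uniform(0.0, 2 * np.pi)
        curve += amp * np.sin(2 * np.pi * k * t + phase)
    curve = base * (1.0 + 0.5 * curve / (np.abs(curve).max() + 1e-8))
    return np.clip(curve, 0.0, None)                 # one magnitude per frame
```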
Experimental results verify the effectiveness of our CacheGNN.", Towards Information-Theoretic Pattern Mining in Time Series,https://openreview.net/forum?id=Y3McfgrhX5,https://openreview.net/pdf?id=Y3McfgrhX5,The paper offers a novel method for discovering patterns from time series data.,"Time series pattern discovery is one of the most fundamental tasks in data mining. Existing literature addressing this task often follows a generic paradigm in which a similarity metric is defined a priori and an extensive pattern-matching search is executed to find similar subsequences based on the metric. Algorithms developed under this paradigm therefore risk missing important patterns that do not meet the implicit biases within such pre-defined metrics. To mitigate this risk, we propose a new information-theoretic discovery paradigm that aims to find the most informative patterns on an embedding space that can learn to encode representative statistical variation trends in the time series. This paradigm is achieved by a probabilistic time-to-pattern mining algorithm, T2P, based on a biophysically-inspired adaptation of a variational auto-encoder (VAE). The adapted VAE incorporates a specific design for its latent space that learns to surface the most recurring and informative patterns without the need to run costly pattern-matching searches. Empirically, we demonstrate that our method is more scalable than existing works. Furthermore, T2P can find multiple diverse patterns that more effectively compress and represent the time series without relying on prior knowledge of the data.","Deep learning, Unsupervised learning, Variational Autoencoders" Agent Prioritization with Interpretable Relation for Trajectory Prediction,https://openreview.net/forum?id=4vfv4GDG6G,https://openreview.net/pdf?id=4vfv4GDG6G,,"In this paper, we present a novel multi-agent trajectory prediction model, which discovers interpretable relations among agents and prioritizes agents' motions. Different from existing approaches, our interpretable design is inspired by the fundamental navigation and motion functions of agent movements, which represent 'where' and 'how' the agents move in the scenes. Specifically, it generates a relation matrix, where each element indicates the motion impact from one agent to another. In addition, in highly interactive scenarios, one agent may implicitly gain higher priority to move, while the motion of the other agents may be impacted by the prioritized agents (e.g., a vehicle stopping or reducing its speed due to crossing pedestrians). Based on this intuition, we design a novel motion prioritization module to learn the agent motion priorities based on the inferred relation matrix. Then, a decoder is proposed to sequentially predict and iteratively update the future trajectories of each agent based on their priority orders and the learned relation structures. We first demonstrate the effectiveness of our prediction model on the simulated Charged Particles dataset. Next, extensive evaluations are performed on commonly-used datasets for robot navigation, human-robot interactions, and autonomous agents: real-world NBA basketball and INTERACTION.
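The retrieval step in CacheGNN described above can be sketched as a cosine-similarity lookup into the cache of stored node representations; the value of `k` and the normalization choice are our assumptions.

```python
import torch
import torch.nn.functional as F

def retrieve_global_similar(cache: torch.Tensor, queries: torch.Tensor, k: int = 8):
    """Top-k globally similar cached node embeddings for each query node."""
    sims = F.normalize(queries, dim=-1) @ F.normalize(cache, dim=-1).T
    topk = sims.topk(k, dim=-1)                      # (num_queries, k)
    return cache[topk.indices], topk.values          # retrieved neighbors and similarities
```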
Finally, we show that the proposed model outperforms other state-of-the-art relation-based methods, and is capable of inferring interpretable, meaningful relations among agents.", $z$-SignFedAvg: A Unified Stochastic Sign-based Compression for Federated Learning,https://openreview.net/forum?id=ykql_wKavL,https://openreview.net/pdf?id=ykql_wKavL,This work proposes the first federated averaging algorithm with sign-based compression.,"Federated Learning (FL) is a promising privacy-preserving distributed learning paradigm but suffers from high communication cost when training large-scale machine learning models. Sign-based methods, such as SignSGD \citep{bernstein2018signsgd}, have been proposed as a biased gradient compression technique for reducing the communication cost. However, sign-based algorithms could diverge under heterogeneous data, which thus motivated the development of advanced techniques, such as the error-feedback method and stochastic sign-based compression, to fix this issue. Nevertheless, these methods still suffer from slower convergence rates. Besides, none of them allows multiple local SGD updates like FedAvg \citep{mcmahan2017communication}. In this paper, we propose a novel noisy perturbation scheme with a general symmetric noise distribution for sign-based compression, which not only allows one to flexibly control the tradeoff between gradient bias and convergence performance, but also provides a unified viewpoint on existing stochastic sign-based methods. More importantly, we propose the very first sign-based FedAvg algorithm ($z$-SignFedAvg). Theoretically, we show that $z$-SignFedAvg achieves a faster convergence rate than existing sign-based methods and, under the uniformly distributed noise, can enjoy the same convergence rate as its uncompressed counterpart. Last but not least, we remark that adding random noise to the local gradients has an additional benefit: it protects the clients' privacy via, e.g., differential privacy. Extensive experiments are conducted to demonstrate that $z$-SignFedAvg can achieve competitive empirical performance on real datasets. ","federated averaging, compression, communication efficiency, signSGD" Transfer Learning with Pre-trained Conditional Generative Models,https://openreview.net/forum?id=5-3YJbVPp6m,https://openreview.net/pdf?id=5-3YJbVPp6m,We propose a novel transfer learning method using conditional generative models pre-trained on a source dataset for an inductive transfer learning setting where NN architectures are not consistent.,"Transfer learning is crucial in training deep neural networks on new target tasks. Current transfer learning methods always assume at least one of (i) source and target task label spaces overlap, (ii) source datasets are available, and (iii) target network architectures are consistent with source ones. However, holding these assumptions is difficult in practical settings because the target task rarely has the same labels as the source task, the source dataset access is restricted due to storage costs and privacy, and the target architecture is often specialized to each task. To transfer source knowledge without these assumptions, we propose a transfer learning method that uses deep generative models and is composed of the following two stages: pseudo pre-training (PP) and pseudo semi-supervised learning (P-SSL). PP trains a target architecture with an artificial dataset synthesized by using conditional source generative models.
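The noisy perturbation scheme at the heart of $z$-SignFedAvg admits a one-line sketch: add symmetric noise before taking the sign, so the compressed gradient trades bias for variance as the noise scale grows. Uniform noise below is one symmetric choice the abstract highlights; the scale handling is our simplification.

```python
import numpy as np

def noisy_sign_compress(grad: np.ndarray, z: float, seed=None):
    """Stochastic sign compression with symmetric noise of scale z.

    z = 0 recovers deterministic SignSGD; larger z reduces the bias of the
    sign estimate at the cost of extra variance.
    """
    rng = np.random.default_rng(seed)
    noise = rng.uniform(-z, z, size=grad.shape)
    return np.sign(grad + noise)
```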
P-SSL applies SSL algorithms to labeled target data and unlabeled pseudo samples, which are generated by cascading the source classifier and generative models to condition them on target samples. Our experimental results indicate that our method can outperform the baselines of scratch training and knowledge distillation.","Deep Learning, Transfer Learning, Deep Generative Models, Semi-supervised Learning" DECN: Evolution Inspired Deep Convolution Network for Black-box Optimization,https://openreview.net/forum?id=Ur_qORZ6-9R,https://openreview.net/pdf?id=Ur_qORZ6-9R,,"We design a deep evolutionary convolution network (DECN) for continuous black-box optimization to force a random population to move near the optimal solution, which is the goal of population-based optimization methods such as evolutionary algorithms and evolution strategies. DECN is composed of two modules: a convolution-based reasoning module (CRM) and a selection module (SM), to move from hand-designed search strategies to learned search strategies in population-based optimization. CRM produces a population that is closer to the optimal solution based on the convolution operators, and SM removes poor solutions. We also design a proper loss function to support the training of DECN. The experimental results on unconstrained continuous optimization problems show that DECN can learn search strategies and surpass population-based baselines. Moreover, DECN obtains good performance when transferred to optimization problems unseen during the training stage. In addition, DECN is amenable to acceleration with Graphics Processing Units (GPUs) and runs 102 times faster than the unaccelerated EA when evolving 32 populations, each containing 6400 individuals.","Learning to Optimize, Black-box Optimization" Q-Pensieve: Boosting Sample Efficiency of Multi-Objective RL Through Memory Sharing of Q-Snapshots,https://openreview.net/forum?id=AwWaBXLIJE,https://openreview.net/pdf?id=AwWaBXLIJE,We boost the sample efficiency of multi-objective RL by using Q-snapshots,"Many real-world continuous control problems involve the dilemma of weighing the pros and cons of multiple objectives; multi-objective reinforcement learning (MORL) serves as a generic framework for learning control policies under different preferences over objectives. However, the existing MORL methods either rely on multiple passes of explicit search for finding the Pareto front and therefore are not sample-efficient, or utilize a shared policy network for coarse knowledge sharing among policies. To boost the sample efficiency of MORL, we propose $Q$-Pensieve, a policy improvement scheme that stores a collection of $Q$-snapshots to jointly determine the policy update direction and thereby enables data sharing at the policy level. We show that $Q$-Pensieve can be naturally integrated with soft policy iteration with a convergence guarantee. To substantiate this concept, we propose the technique of the $Q$ replay buffer, which stores the learned $Q$-networks from past iterations, and arrive at a practical actor-critic implementation.
Through extensive experiments and an ablation study, we demonstrate that with far fewer samples, the proposed algorithm can outperform the benchmark MORL methods on a variety of MORL benchmark tasks.","Multi-objective reinforcement learning, sample efficiency" On the Power-Law Hessian Spectra in Deep Learning,https://openreview.net/forum?id=_G7dzxxSKM,https://openreview.net/pdf?id=_G7dzxxSKM,We are the first to demonstrate that the Hessian spectra of well-trained deep neural networks exhibit simple power-law structures and critically relate to multiple behaviors of deep learning.,"It is well-known that the Hessian of the deep loss landscape matters to the optimization, generalization, and even robustness of deep learning. Recent works empirically discovered that the Hessian spectrum in deep learning has a two-component structure that consists of a small number of large eigenvalues and a large number of nearly-zero eigenvalues. However, the mathematical structure behind the Hessian spectra is still under-explored. To the best of our knowledge, we are the first to demonstrate that the Hessian spectra of well-trained deep neural networks exhibit simple power-law structures. Inspired by statistical physics theories, we provide a maximum-entropy theoretical interpretation for explaining why the power-law structure exists. Our extensive experiments using the novel power-law spectral method reveal that the power-law Hessian spectra critically relate to multiple important behaviors of deep learning, including optimization, generalization, overparameterization, and overfitting.","Deep Learning, Loss Landscape, Hessian" Optformer: Beyond Transformer for Black-box Optimization,https://openreview.net/forum?id=sP0p5S-gZ2,https://openreview.net/pdf?id=sP0p5S-gZ2,,"We design a novel Transformer for continuous unconstrained black-box optimization, called Optformer. Inspired by the similarity between the Vision Transformer and evolutionary algorithms (EAs), we modify the Transformer's multi-head self-attention layer, feed-forward network, and residual connection to implement the functions of crossover, mutation, and selection operators. Moreover, we devise an iterated mode to generate and survive potential solutions like EAs. Optformer establishes a mapping from a random population to the optimal population. Compared to baselines such as EAs, Bayesian optimization, and the learning-to-optimize method, Optformer shows the top performance on six black-box functions and one real-world application. We also find that an untrained Optformer can achieve good performance.","Transformer, Black-box optimization" Deformable Graph Transformer,https://openreview.net/forum?id=DL8dTTvCpU,https://openreview.net/pdf?id=DL8dTTvCpU,,"Transformer-based models have recently shown success in representation learning on graph-structured data beyond natural language processing and computer vision. However, the success is limited to small-scale graphs due to the drawbacks of full dot-product attention on graphs, such as the quadratic complexity with respect to the number of nodes and message aggregation from enormous irrelevant nodes. To address these issues, we propose Deformable Graph Transformer (DGT) that performs sparse attention via dynamically sampled relevant nodes for efficiently handling large-scale graphs with a linear complexity in the number of nodes. Specifically, our framework first constructs multiple node sequences with various criteria to consider both structural and semantic proximity.
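A simple way to probe the power-law claim above on any computed spectrum is a least-squares fit of $\log \lambda_i$ against $\log i$; an approximately straight line indicates power-law structure. This log-log regression is only an illustration, not the paper's "power-law spectral method", and maximum-likelihood estimators are generally preferred for rigorous power-law fitting.

```python
import numpy as np

def fit_power_law(eigvals: np.ndarray):
    """Fit lambda_i ~ c * i**(-s) to a spectrum by least squares in log-log space."""
    lam = np.sort(np.abs(eigvals))[::-1]
    lam = lam[lam > 0]                               # keep the strictly positive part
    ranks = np.arange(1, len(lam) + 1)
    slope, intercept = np.polyfit(np.log(ranks), np.log(lam), deg=1)
    return -slope, np.exp(intercept)                 # exponent s and scale c
```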
Then, combined with our learnable Katz Positional Encodings, sparse attention is applied to the node sequences for learning node representations with a significantly reduced computational cost. Extensive experiments demonstrate that our DGT achieves state-of-the-art performance on 7 graph benchmark datasets with 2.5 to 449 times less computational cost compared to transformer-based graph models with full attention.","Graph Transformer, Graph Neural Networks" Exact manifold Gaussian Variational Bayes,https://openreview.net/forum?id=a30kyHbuXfI,https://openreview.net/pdf?id=a30kyHbuXfI,A new optimization algorithm for Bayesian variational inference,"We propose an optimization algorithm for Variational Inference (VI) in complex models. Our approach relies on natural gradient updates where the variational space is a Riemannian manifold. We develop an efficient algorithm for Gaussian Variational Inference that implicitly satisfies the positive definite constraint on the variational covariance matrix. Our Exact manifold Gaussian Variational Bayes (EMGVB) provides exact but simple update rules and is straightforward to implement. Due to its black-box nature, EMGVB stands as a ready-to-use solution for VI in complex models. Over five datasets, we empirically validate our feasible approach on different statistical, econometric, and deep learning models, discussing its performance with respect to baseline methods.","variational inference, Bayes, Riemann, black box, deep learning" SuperMarioDomains: Generalizing to Domains with Evolving Graphics,https://openreview.net/forum?id=BMsqS_XALQU,https://openreview.net/pdf?id=BMsqS_XALQU,SuperMarioDomains is a new challenging Domain Generalization benchmark featuring domains derived from evolving video game graphics.,"Domains in previous Domain Generalization (DG) benchmarks have been sampled from various image collections of different styles such as photographs, sketches, cartoons, paintings, product images, etc. However, from these existing DG datasets, it is still difficult to quantify the magnitude of domain shift between different domains and relate that to the performance gap across domains. It is also unclear how to measure the overlap between different domains. Therefore, we present a new DG dataset, SuperMarioDomains, containing four domains that are derived from four chronological titles in the Mario video game franchise on four generations of video game hardware. The discrepancy between our domains is quantified in terms of image representation complexity, reflecting the hardware evolution in image resolution, color palette, and presence of 3D rendering. We benchmark state-of-the-art DG algorithms under both Multi-Source and Single-Source DG settings on our dataset and find that they can only surpass the random average baseline on our dataset by at most 18.0% and 10.4% respectively. In addition, we show that adding our dataset as part of the pre-training process improves the performance of existing DG algorithms on the PACS benchmark.","Domain Generalization, Domain, Shift, Domain Adaptation" Variance-Aware Sparse Linear Bandits,https://openreview.net/forum?id=tkwP32nsEq,https://openreview.net/pdf?id=tkwP32nsEq,,"It is well-known that for sparse linear bandits, when ignoring the dependency on sparsity which is much smaller than the ambient dimension, the worst-case minimax regret is $\widetilde{\Theta}\left(\sqrt{dT}\right)$ where $d$ is the ambient dimension and $T$ is the number of rounds.
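The positional encoding named in the DGT abstract above presumably builds on the classical Katz index, which has a closed form worth recalling; the sketch below computes that classical quantity directly (the learnable variant in DGT is not specified in the abstract).

```python
import numpy as np

def katz_index(adj: np.ndarray, beta: float = 0.05):
    """Classical Katz matrix: sum_{k>=1} beta^k A^k = (I - beta A)^{-1} - I.

    Converges when beta < 1 / spectral_radius(A).
    """
    eye = np.eye(adj.shape[0])
    return np.linalg.inv(eye - beta * adj) - eye
```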
On the other hand, in the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve $\widetilde{\mathcal O}(1)$ regret, which is (nearly) independent of $d$ and $T$. In this paper, we present the first variance-aware regret guarantee for sparse linear bandits: $\widetilde{\mathcal O}\left(\sqrt{d\sum_{t=1}^T \sigma_t^2} + 1\right)$, where $\sigma_t^2$ is the variance of the noise at the $t$-th round. This bound naturally interpolates the regret bounds for the worst-case constant-variance regime (i.e., $\sigma_t \equiv \Omega(1)$) and the benign deterministic regime (i.e., $\sigma_t \equiv 0$). To achieve this variance-aware regret guarantee, we develop a general framework that converts any variance-aware linear bandit algorithm to a variance-aware algorithm for sparse linear bandits in a ""black-box"" manner. Specifically, we take two recent algorithms as black boxes to illustrate that the claimed bounds indeed hold, where the first algorithm can handle unknown-variance cases and the second one is more efficient.", Multi-Treatment Effect Estimation with Proxy: Contrastive Learning and Rank Weighting,https://openreview.net/forum?id=EpvL_FaLtw,https://openreview.net/pdf?id=EpvL_FaLtw,,"We study the treatment effect estimation problem for continuous and multi-dimensional treatments, in the setting with unobserved confounders where high-dimensional proxy variables for the unobserved confounders are available. Existing methods either directly adjust the relationship between observed covariates and treatments or recover the hidden confounders by probabilistic models. However, they either rely on a correctly specified treatment assignment model or require strong priors on the unobserved confounder distribution. To relax these requirements, we propose a Contrastive Regularizer (CR) to learn a proxy representation that contains all the relevant information in the unobserved confounders. Based on the CR, we propose a novel rank weighting method (RW) to de-bias the treatment assignment. Combining CR and RW, we propose a neural network framework named CRNet to estimate the effects of multiple continuous treatments under unobserved confounders, evaluated by the Average Dose-Response Function. Empirically, we demonstrate that CRNet achieves state-of-the-art performance on both synthetic and semi-synthetic datasets.", CircNet: Meshing 3D Point Clouds with Circumcenter Detection,https://openreview.net/forum?id=zQWqV2tzDv,https://openreview.net/pdf?id=zQWqV2tzDv,We present a deep neural architecture that detects circumcenters of triangles in the dual space to reconstruct 3D point clouds into triangular meshes efficiently,"Reconstructing 3D point clouds into triangle meshes is a key problem in computational geometry and surface reconstruction. Point cloud triangulation solves this problem by providing edge information to the input points. Since no vertex interpolation is involved, it is beneficial for preserving sharp details on the surface. Taking advantage of learning-based techniques in triangulation, existing methods enumerate the complete combinations of candidate triangles, which is both complex and inefficient. In this paper, we leverage the duality between a triangle and its circumcenter, and introduce a deep neural network that detects the circumcenters to achieve point cloud triangulation. Specifically, we introduce multiple anchor priors to divide the neighborhood space of each point.
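The triangle-circumcenter duality that CircNet leverages is concrete enough to sketch: in 2-D the circumcenter has a closed form in terms of the three vertices. CircNet operates on 3-D local neighborhoods; the planar case below is only for intuition about the dual point being detected.

```python
import numpy as np

def circumcenter_2d(a, b, c):
    """Circumcenter of triangle (a, b, c): the point equidistant from all vertices."""
    a, b, c = map(np.asarray, (a, b, c))
    d = 2.0 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    ux = ((a @ a) * (b[1] - c[1]) + (b @ b) * (c[1] - a[1]) + (c @ c) * (a[1] - b[1])) / d
    uy = ((a @ a) * (c[0] - b[0]) + (b @ b) * (a[0] - c[0]) + (c @ c) * (b[0] - a[0])) / d
    return np.array([ux, uy])                        # undefined if the points are collinear
```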
The neural network then learns to predict the presence and location of circumcenters under the guidance of those anchors. We extract the triangles dual to the detected circumcenters to form a primitive mesh, from which an edge-manifold mesh is produced via simple post-processing. Unlike existing learning-based triangulation methods, the proposed method bypasses an exhaustive enumeration of triangle combinations and local surface parameterization. We validate the efficiency, generalization, and robustness of our method on prominent datasets of both watertight and open surfaces. The code and trained models are provided at this link.","Meshing, 3D Point Cloud, Point Cloud Triangulation, Surface Reconstruction, Geometry Processing" In-sample Actor Critic for Offline Reinforcement Learning,https://openreview.net/forum?id=dfDv0WU853R,https://openreview.net/pdf?id=dfDv0WU853R,,"Offline reinforcement learning suffers from the out-of-distribution issue and extrapolation error. Most methods penalize out-of-distribution state-action pairs or regularize the trained policy towards the behavior policy, but cannot guarantee getting rid of extrapolation error. We propose In-sample Actor Critic (IAC), which utilizes sampling-importance resampling to execute in-sample policy evaluation. IAC only uses the target Q-values of the actions in the dataset to evaluate the trained policy, thus avoiding extrapolation error. The proposed method performs unbiased policy evaluation and has a lower variance than importance sampling in many cases. Empirical results show that IAC obtains competitive performance compared to the state-of-the-art methods on Gym-MuJoCo locomotion domains and much more challenging AntMaze domains.",offline reinforcement learning Leveraging Future Relationship Reasoning for Vehicle Trajectory Prediction,https://openreview.net/forum?id=CGBCTp2M6lA,https://openreview.net/pdf?id=CGBCTp2M6lA,We define and model the Future Relationship to better model interaction between vehicles.,"Understanding the interaction between multiple agents is crucial for realistic and plausible vehicle trajectory prediction. Accordingly, existing methods have tried to model and predict the interaction using observed past trajectories of agents with pooling, attention, or graph-based methods. However, we observed that they easily fail under complex road structures. This is because they do not explicitly utilize map information for predicting the relationship, and they model the relationship between vehicles only in a deterministic manner, not a stochastic one. In this paper, we propose a new method to formulate the stochastic future relationship among agents using lane structure. Our method first predicts the probability of lane-level waypoint occupancy of vehicles. Then we utilize the temporal probability of passing the same lanes to learn the interaction between agents. In addition, we model the interaction using a probabilistic distribution, which is trained with the posterior distribution of the interaction derived from the ground-truth (GT) future trajectory. We validate our method on popular trajectory prediction datasets: nuScenes and Argoverse.
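The sampling-importance resampling step that IAC relies on (described above) also admits a short sketch: weight each in-dataset action by the ratio of trained-policy to behavior-policy probabilities, then resample, so policy evaluation never queries Q-values outside the dataset. Variable names and the clipping constant are our assumptions.

```python
import numpy as np

def sir_policy_eval(q_values, pi_probs, beta_probs, n, seed=None):
    """Resample in-dataset actions with weights pi(a|s)/beta(a|s); the Q-values
    of the resampled actions give an in-sample policy evaluation target."""
    w = pi_probs / np.clip(beta_probs, 1e-8, None)
    w = w / w.sum()
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(q_values), size=n, p=w)
    return q_values[idx].mean()                      # in-sample estimate of E_pi[Q]
```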
The code will be made publicly available upon acceptance.","Trajectory prediction, Autonomous driving, Neural relation inference, Stochasticity modeling, Multimodal prediction" DeepTime: Deep Time-index Meta-learning for Non-stationary Time-series Forecasting,https://openreview.net/forum?id=13rQhx37o3u,https://openreview.net/pdf?id=13rQhx37o3u,We propose a deep time-index model which leverages a meta-learning formulation to tackle non-stationary time-series forecasting.,"Advances in I.T. infrastructure have led to the collection of longer sequences of time-series. Such sequences are typically non-stationary, exhibiting distribution shifts over time -- a challenging scenario for the forecasting task, due to the problems of covariate shift and conditional distribution shift. In this paper, we show that deep time-index models possess strong synergies with a meta-learning formulation of forecasting, displaying significant advantages over existing neural forecasting methods in tackling the problems arising from non-stationarity. These advantages include having a stronger smoothness prior, avoiding the problem of covariate shift, and having better sample efficiency. To this end, we propose DeepTime, a deep time-index model trained via meta-learning. Extensive experiments on real-world datasets in the long-sequence time-series forecasting setting demonstrate that our approach achieves competitive results with state-of-the-art methods, and is highly efficient. Code is attached as supplementary material, and will be publicly released.","time-series, forecasting, deep learning, implicit neural representation, meta-learning, time-index, non-stationary" "Non-Parametric State-Space Models: Identifiability, Estimation and Forecasting",https://openreview.net/forum?id=RVgssxlEVfl,https://openreview.net/pdf?id=RVgssxlEVfl,"A flexible state-space model for time series forecasting, inspired by the general structural causal model.","State-space models (SSMs) provide a standard methodology for time series analysis and prediction. While recent works utilize nonlinear functions to parameterize the transition and emission processes to enhance their expressivity, the form of additive noise still limits their applicability in real-world scenarios. In this work, we propose a general formulation of SSMs with a completely non-parametric transition model and a flexible emission model which can account for sensor distortion. Besides, to deal with more general scenarios (e.g., non-stationary time series), we add a higher-level model to capture the time-varying characteristics of the process. Interestingly, we find that even though the proposed model is remarkably flexible, the latent processes are generally identifiable. Given this, we further propose the corresponding estimation procedure and make use of it for the forecasting task. Our model can recover the latent processes and their relations from observed sequential data. Accordingly, the proposed procedure can also be viewed as a method for causal representation learning. We argue that forecasting can benefit from causal representation learning, since the estimated latent variables are generally identifiable.
Empirical comparisons on various datasets validate that our model could not only reliably identify the latent processes from the observed data, but also consistently outperform baselines in the forecasting task.","state-space model, time series forecasting, causal representation learning" ETSformer: Exponential Smoothing Transformers for Time-series Forecasting,https://openreview.net/forum?id=5m_3whfo483,https://openreview.net/pdf?id=5m_3whfo483,"We propose an interpretable Transformer architecture which decomposes forecasts into level, growth, and seasonality components.","Transformers have recently been actively studied for time-series forecasting. While often showing promising results in various scenarios, traditional Transformers are not designed to fully exploit the characteristics of time-series data and thus suffer some fundamental limitations, e.g., they are generally not decomposable or interpretable, and are neither effective nor efficient for long-term forecasting. In this paper, we propose ETSformer, a novel time-series Transformer architecture, which exploits the principle of exponential smoothing methods in improving Transformers for time-series forecasting. Specifically, ETSformer leverages a novel level-growth-seasonality decomposed Transformer architecture which leads to more interpretable and disentangled decomposed forecasts. We further propose two novel attention mechanisms -- the exponential smoothing attention and frequency attention, which are specially designed to overcome the limitations of the vanilla attention mechanism for time-series data. Extensive experiments on various time-series benchmarks validate the efficacy and advantages of the proposed method. Code is attached in the supplementary material, and will be made publicly available. ","time-series, forecasting, transformer, decomposition, season-trend, interpretable" LMSeg: Language-guided Multi-dataset Segmentation,https://openreview.net/forum?id=P44WPn1_aJV,https://openreview.net/pdf?id=P44WPn1_aJV,,"Building a general and inclusive segmentation model that can recognize more categories in various scenarios is a meaningful and attractive topic. A straightforward way is to combine the existing fragmented segmentation datasets and train a multi-dataset network. However, there are two major issues with multi-dataset segmentation: (i) the inconsistent taxonomy demands manual reconciliation to construct a unified taxonomy; (ii) the inflexible one-hot common taxonomy causes time-consuming model retraining and defective supervision of unlabeled categories. In this paper, we investigate multi-dataset segmentation and propose a scalable Language-guided Multi-dataset Segmentation framework, dubbed LMSeg, which supports both semantic and panoptic segmentation. Specifically, we introduce a pretrained text encoder to map the category names to a text embedding space as a unified taxonomy, instead of using inflexible one-hot labels. The model dynamically aligns the segment queries with the category embeddings. Instead of relabeling each dataset with the unified taxonomy, a category-guided decoding module is designed to dynamically guide predictions to each dataset's taxonomy. Furthermore, we adopt a dataset-aware augmentation strategy that assigns each dataset a specific image augmentation pipeline, which can suit the properties of images from different datasets.
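A brief illustration of the exponential smoothing attention named in the ETSformer abstract above: instead of query-key dot products, token t attends to past token j with a fixed, exponentially decaying weight. This is a hedged sketch of the generic mechanism, not the paper's full layer (which also handles the initial state and learns the decay).

```python
import numpy as np

def exponential_smoothing_attention(V, alpha=0.3):
    """Toy exponential smoothing attention: weight on past token j at time t
    is proportional to alpha * (1 - alpha)^(t - j), mirroring classical
    exponential smoothing. V: (T, d) value sequence; returns (T, d)."""
    T = V.shape[0]
    idx = np.arange(T)
    lag = idx[:, None] - idx[None, :]                     # t - j
    A = np.where(lag >= 0, alpha * (1 - alpha) ** lag, 0.0)
    A = A / A.sum(axis=1, keepdims=True)                  # normalize rows
    return A @ V

V = np.random.default_rng(0).normal(size=(8, 4))
print(exponential_smoothing_attention(V).shape)  # (8, 4)
```

The attention matrix here depends only on relative position, which is what makes the mechanism interpretable as a level-smoothing operator.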
Extensive experiments demonstrate that our method achieves significant improvements on four segmentation datasets and three panoptic datasets, while the ablation study evaluates the effectiveness of each component. ","Segmentation, Multi-dataset, Vision-language" Horizon-Free Reinforcement Learning for Latent Markov Decision Processes,https://openreview.net/forum?id=g9VAye0eIKO,https://openreview.net/pdf?id=g9VAye0eIKO,"We studied RL for Latent MDPs under the episodic, context-in-hindsight setting. A SOTA upper bound and the first lower bound were presented.","We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight. We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver. We prove an $\widetilde{O}\left(\sqrt{M \Gamma S A K}\right)$ regret bound where $M$ is the number of contexts, $S$ is the number of states, $A$ is the number of actions, $K$ is the number of episodes, and $\Gamma \le S$ is the maximum transition degree of any state-action pair. The regret bound only scales logarithmically with the planning horizon, thus yielding the first (nearly) horizon-free regret bound for LMDPs. Key in our proof is an analysis of the total variance of alpha vectors, which is carefully bounded by a recursion-based technique. We complement our positive result with a novel $\Omega\left(\sqrt{M S A K}\right)$ regret lower bound with $\Gamma = 2$, which shows that our upper bound is minimax optimal when $\Gamma$ is a constant. Our lower bound relies on new constructions of hard instances and an argument based on the symmetrization technique from theoretical computer science, both of which are technically different from existing lower bound proofs for MDPs, and thus can be of independent interest.","reinforcement learning theory, markov decision process, latent markov decision process" Learning Invariant Features for Online Continual Learning,https://openreview.net/forum?id=PXRN-uxHoIE,https://openreview.net/pdf?id=PXRN-uxHoIE,,"It has been shown recently that learning only discriminative features that are sufficient to separate the classes in a task using a traditional learning method has a major shortcoming for continual learning (CL). This is because many features that are not learned may be necessary for distinguishing classes of some future tasks. When such a future task arrives, these features have to be learned by updating the network, which causes catastrophic forgetting (CF). A recent work on online CL showed that if the learning method can learn as many features as possible from each class, called holistic representations, CF can be significantly reduced to achieve a large performance gain. This paper argues that learning only holistic representations is still insufficient. The learned representations should also be invariant, and those features that are present in the data but are irrelevant to the class (e.g., the background information) should be ignored for better generalization across tasks. This new condition further boosts the performance significantly.
This paper proposes several strategies and a loss to learn holistic and invariant representations and evaluates their effectiveness in online CL.","continual learning, online continual learning" RoPAWS: Robust Semi-supervised Representation Learning from Uncurated Data,https://openreview.net/forum?id=G1H4NSATlr,https://openreview.net/pdf?id=G1H4NSATlr,We propose a robust semi-supervised learning method for uncurated data derived from a novel probabilistic view of learned representations,"Semi-supervised learning aims to train a model using limited labels. State-of-the-art semi-supervised methods for image classification such as PAWS rely on self-supervised representations learned with large-scale unlabeled but curated data. However, PAWS is often less effective when using real-world unlabeled data that is uncurated, e.g., contains out-of-class data. We propose RoPAWS, a robust extension of PAWS that can work with real-world unlabeled data. We first reinterpret PAWS as a generative classifier that models densities using kernel density estimation. From this probabilistic perspective, we calibrate its prediction based on the densities of labeled and unlabeled data, which leads to a simple closed-form solution from Bayes' rule. We demonstrate that RoPAWS significantly improves PAWS for uncurated Semi-iNat by +5.3% and curated ImageNet by +0.4%.","semi-supervised learning, representation learning, uncurated data" Treeformer: Dense Gradient Trees for Efficient Attention Computation,https://openreview.net/forum?id=DWn1TEb2fK,https://openreview.net/pdf?id=DWn1TEb2fK,Efficient Decision Tree based attention computation to reduce FLOPs for self-attention,"Standard inference and training with transformer based architectures scale quadratically with input sequence length. This is prohibitively expensive for a variety of applications, especially web-page translation, query answering, etc. Consequently, several approaches have been developed recently to speed up attention computation by enforcing different attention structures such as sparsity, low rank, or kernel approximations of attention. In this work, we view attention computation as that of nearest neighbor retrieval, and use decision-tree-based hierarchical navigation to reduce the retrieval cost per query token from linear in sequence length to nearly logarithmic. Based on such hierarchical navigation, we design Treeformer, which can use one of two efficient attention layers -- TF-Attention and TC-Attention. TF-Attention computes the attention in a fine-grained style, while TC-Attention is a coarse attention layer which also ensures that the gradients are ""dense"". To optimize such challenging discrete layers, we propose a two-level bootstrapped training method. Using extensive experiments on standard NLP benchmarks, especially for long sequences, we demonstrate that our Treeformer architecture can be almost as accurate as the baseline Transformer while using 30x fewer FLOPs in the attention layer.
Compared to Linformer, the accuracy can be as much as 12% higher while using similar FLOPs in the attention layer.","Transformers, Attention, Decision Trees" Visual Reinforcement Learning with Self-Supervised 3D Representations,https://openreview.net/forum?id=4gUIeq2lyM,https://openreview.net/pdf?id=4gUIeq2lyM,,"A prominent approach to visual Reinforcement Learning (RL) is to learn an internal state representation using self-supervised methods, which has the potential benefit of improved sample-efficiency and generalization through additional learning signal and inductive biases. However, while the real world is inherently 3D, prior efforts have largely been focused on leveraging 2D computer vision techniques as auxiliary self-supervision. In this work, we present a unified framework for self-supervised learning of 3D representations for motor control. Our proposed framework consists of two phases: a \textit{pretraining} phase where a deep voxel-based 3D autoencoder is pretrained on a large object-centric dataset, and a \textit{finetuning} phase where the representation is jointly finetuned together with RL on in-domain data. We empirically show that our method enjoys improved sample efficiency in simulated manipulation tasks compared to 2D representation learning methods. Additionally, our learned policies transfer zero-shot to a real robot setup with only approximate geometric correspondence, and successfully solve motor control tasks that involve grasping and lifting from \textit{a single, uncalibrated RGB camera}. Videos are available at https://3d4rl.github.io/.","Reinforcement Learning, 3D Representation Learning" ODAM: Gradient-based Instance-Specific Visual Explanations for Object Detection,https://openreview.net/forum?id=kJWcI39kXY,https://openreview.net/pdf?id=kJWcI39kXY,ODAM: a gradient-based instance-specific explanation technique for object detectors; ODAM-Train: improve the explanation ability on object discrimination; ODAM-NMS: distinguish the duplicate detected objects with the help of ODAM.,"We propose Gradient-weighted Object Detector Activation Mapping (Grad-ODAM), a visualized explanation technique for interpreting the predictions of object detectors. Utilizing the gradients of detector targets flowing into the intermediate feature maps, Grad-ODAM produces heat maps that show the influence of regions on the detector's decision. Compared to previous classification activation mapping works, Grad-ODAM generates instance-specific explanations rather than class-specific ones. We show that Grad-ODAM is applicable to both one-stage detectors such as FCOS and two-stage detectors such as Faster R-CNN, and produces higher-quality visual explanations than the state of the art, both effectively and efficiently. We next propose a training scheme, ODAM-Train, to improve the explanation ability on object discrimination of the detector by encouraging consistency between explanations for detections on the same object, and distinct explanations for detections on different objects. Based on the heat maps produced by Grad-ODAM with ODAM-Train, we propose ODAM-NMS, which considers the information of the model's explanation for each prediction to distinguish duplicate detected objects. We present a detailed analysis of the visualized explanations of detectors and carry out extensive experiments to validate the effectiveness of the proposed ODAM.
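To make the "attention as nearest-neighbor retrieval" view from the Treeformer abstract concrete, here is a hedged sketch. Treeformer itself learns decision trees; this illustration swaps in an off-the-shelf k-d tree (scipy's cKDTree) as the hierarchical index, and assumes unit-normalized queries and keys so that Euclidean nearest neighbors coincide with maximum inner product. All shapes and names are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def retrieval_attention(Q, K, V, k=8):
    """Toy retrieval-based attention: each query attends only to its k nearest
    keys found via a space-partitioning tree, cutting per-query cost from O(T)
    toward O(log T) for the search. Q: (Tq, d); K, V: (Tk, d)."""
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    K = K / np.linalg.norm(K, axis=1, keepdims=True)
    tree = cKDTree(K)
    _, nbrs = tree.query(Q, k=k)                      # (Tq, k) retrieved key ids
    logits = np.einsum('qd,qkd->qk', Q, K[nbrs])      # scores on retrieved keys only
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)
    return np.einsum('qk,qkd->qd', w, V[nbrs])

rng = np.random.default_rng(0)
out = retrieval_attention(rng.normal(size=(16, 8)), rng.normal(size=(128, 8)),
                          rng.normal(size=(128, 8)))
print(out.shape)  # (16, 8)
```

The sparsity of the retrieved-key softmax is also what motivates Treeformer's dense-gradient TC-Attention variant, since hard retrieval alone gives sparse gradients.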
","instance-specific visual explanation, object detection" Understanding Curriculum Learning in Policy Optimization for Online Combinatorial Optimization,https://openreview.net/forum?id=pYC3W83uwm,https://openreview.net/pdf?id=pYC3W83uwm,"We initiate the study on using reinforcement learning for solving combinatorial optimization problems, focusing on the curriculum learning technique.","Over the recent years, reinforcement learning (RL) starts to show promising results in tackling combinatorial optimization (CO) problems, in particular when coupled with curriculum learning to facilitate training. Despite emerging empirical evidence, theoretical study on why RL helps is still at its early stage. This paper presents the first systematic study on policy optimization methods for online CO problems. We show that online CO problems can be naturally formulated as latent Markov Decision Processes (LMDPs), and prove convergence bounds on natural policy gradient (NPG) for solving LMDPs. Furthermore, our theory explains the benefit of curriculum learning: it can find a strong sampling policy and reduce the distribution shift, a critical quantity that governs the convergence rate in our theorem. For a canonical online CO problem, Secretary Problem, we formally prove that distribution shift is reduced exponentially with curriculum learning even if the curriculum is randomly generated. Our theory also shows we can simplify the curriculum learning scheme used in prior work from multi-step to single-step. Lastly, we provide extensive experiments on Secretary Problem and Online Knapsack to verify our findings.","reinforcement learning theory, curriculum learning" Toward Adversarial Training on Contextualized Language Representation,https://openreview.net/forum?id=xZD10GhCvM,https://openreview.net/pdf?id=xZD10GhCvM,Adversarial training optimized to deviate contextualized language representation with all-powerful performance gain,"Beyond the success story of adversarial training (AT) in the recent text domain on top of pre-trained language models (PrLMs), our empirical results showcase that current AT can appear mediocre or even harmful on certain tasks, e.g. reading comprehension and commonsense reasoning. This paper investigates AT from the perspective of contextualized language representation. We find that the gain from AT does not derive from increasing the training risk, but from deviating the language representation. The fact is that the current AT attack is better at fooling the decoder (i.e. the classifier), but can be trivial to the encoder. Based on the observations, we propose simple yet effective \textit{Contextualized representation-Adversarial Training} (CreAT), in which the attack is explicitly optimized to deviate the contextualized representation and obtains the global worst-case adversarial examples. CreAT is proven to be all-powerful compared to AT, with performance gain covering a wider range of downstream tasks. We apply CreAT to language pre-training. Our CreAT-empowered DeBERTa outperforms naive DeBERTa by a large margin, achieving the new state-of-the-art performances on a wide range of challenging benchmarks, e.g. 
AdvGLUE (59.1 $ \rightarrow $ 61.1), HellaSWAG (93.0 $ \rightarrow $ 94.9), ANLI (68.1 $ \rightarrow $ 69.3), PAWS (50.3 $ \rightarrow $ 54.5).","pre-trained language model, adversarial training" Efficient Method for Bi-level Optimization with Non-smooth Lower-Level Problem,https://openreview.net/forum?id=gsU2MKneFy,https://openreview.net/pdf?id=gsU2MKneFy,We propose a method solving the nonsmooth bilevel problem and give a new analysis,"Bi-level optimization plays a key role in many machine learning applications. Existing state-of-the-art bi-level optimization methods are limited to smooth or some specific non-smooth lower-level problems. Therefore, an efficient algorithm for bi-level problems with a general non-smooth lower-level objective is still an open problem. To address this problem, in this paper, we propose a new bi-level optimization algorithm based on smoothing and penalty techniques. Using the theory of the generalized directional derivative, we derive new conditions for the bi-level optimization problem with a non-smooth, perhaps non-Lipschitz, lower-level problem, and prove that our method converges to points satisfying these conditions. We also compare our method with existing state-of-the-art bi-level optimization methods and demonstrate that our method is superior to the others in terms of accuracy and efficiency.","bilevel optimization, nonsmooth" Estimating Riemannian Metric with Noise-Contaminated Intrinsic Distance,https://openreview.net/forum?id=lc8asG5NwF,https://openreview.net/pdf?id=lc8asG5NwF,,"We extend metric learning by studying the Riemannian manifold structure of the underlying data space induced by dissimilarity measures between data points. The key quantity of interest here is the Riemannian metric, which characterizes the Riemannian geometry and defines straight lines and derivatives on the manifold. Being able to estimate the Riemannian metric allows us to gain insights into the underlying manifold and compute geometric features such as the geodesic curves. We model the observed dissimilarity measures as noisy responses generated from a function of the intrinsic geodesic distance between data points. A new local regression approach is proposed to learn the Riemannian metric tensor and its derivatives based on a Taylor expansion for the squared geodesic distances. Our framework is general and accommodates different types of responses, whether they are continuous, binary, or comparative, extending the existing works which consider a single type of response at a time. We develop the theoretical foundation for our method by deriving the rates of convergence for the asymptotic bias and variance of the estimated metric tensor. The proposed method is shown to be versatile in simulation studies and a real data application involving taxi trip time in New York City.", In Search of Smooth Minima for Purifying Backdoor in Deep Neural Networks,https://openreview.net/forum?id=xWFguIF_hG,https://openreview.net/pdf?id=xWFguIF_hG,,"The success of a deep neural network (DNN) heavily relies on the details of the training scheme; e.g., training data, architectures, hyper-parameters, etc. Recent backdoor attacks suggest that an adversary can take advantage of such training details and compromise the integrity of a DNN. Our studies show that a backdoor model is usually optimized to a bad local minimum, i.e., a sharper minimum compared to that of a benign model.
Intuitively, the backdoor can be purified by re-optimizing the model to a smoother minimum through fine-tuning on a small amount of clean validation data. However, fine-tuning all DNN parameters often requires huge computational costs and can result in sub-par clean test performance. To address this concern, we propose a novel backdoor purification technique, Natural Gradient Fine-tuning (NGF), which focuses on removing the backdoor by fine-tuning only one layer. Specifically, NGF utilizes a loss-surface-geometry-aware optimizer that can successfully overcome the challenge of reaching a smooth minimum under a one-layer optimization scenario. To enhance the generalization performance of our proposed method, we introduce a clean data distribution-aware regularizer based on the knowledge of the loss surface curvature matrix, i.e., the Fisher Information Matrix. To validate the effectiveness of our method, we conduct extensive experiments with four different datasets: CIFAR10, GTSRB, Tiny-ImageNet, and ImageNet; as well as 11 recent backdoor attacks, e.g., Blend, Dynamic, Clean Label, etc. NGF achieves state-of-the-art performance in most of these benchmarks.","AI Security, Backdoor or Trojan Attacks on Deep Networks, Safe and Robust AI" Joint Gaussian Mixture Model for Versatile Deep Visual Model Explanation,https://openreview.net/forum?id=amyZRbMrUA,https://openreview.net/pdf?id=amyZRbMrUA,This paper proposes a GMM-based probabilistic model to explain DCNN representations and inference by proxy models and explanatory examples.,"Post-hoc explanations of deep neural networks improve human understanding of the learned representations, decision-making process, and uncertainty of the model with faithfulness. Explaining deep convolutional neural networks (DCNNs) is especially challenging, due to the high dimensionality of deep features and the complexity of model inference. Most post-hoc explaining methods serve a single form of explanation, restricting the diversity and consistency of the explanation. This paper proposes the joint Gaussian mixture model (JGMM), a probabilistic model that jointly models inter-layer deep features and produces faithful and consistent post-hoc explanations. JGMM explains deep features by a Gaussian mixture model and inter-layer deep feature relations by the posterior distribution on the latent component variables. JGMM enables a versatile explaining framework that unifies interpretable proxy models with global or local explanatory example generation or mining. Experiments are performed on various DCNN image classifiers in comparison with other explaining methods. Results show that JGMM can efficiently produce versatile, consistent, faithful, and understandable explanations.","Explainable AI, interpretable model, pixel attributing, convolutional neural networks" Gromov-Wasserstein Autoencoders,https://openreview.net/forum?id=sbS10BCtc7,https://openreview.net/pdf?id=sbS10BCtc7,"GWAEs, our novel generative models, learn representations based on meta-priors by directly fitting their latent space into the data space.","Variational Autoencoder (VAE)-based generative models offer flexible representation learning by incorporating meta-priors, general premises considered beneficial for downstream tasks. However, the incorporated meta-priors often involve ad-hoc model deviations from the original likelihood architecture, causing undesirable changes in their training.
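On the NGF abstract above: the following is a minimal sketch, under strong simplifying assumptions, of a curvature-aware fine-tuning step for a single softmax layer. It uses a diagonal empirical Fisher as the preconditioner; this is an illustration of natural-gradient-style one-layer fine-tuning in general, not the NGF implementation. `W`, `X`, and `y_onehot` are hypothetical stand-ins for the layer weights, clean validation features, and labels.

```python
import numpy as np

def natural_gradient_step(W, X, y_onehot, lr=0.1, eps=1e-6):
    """One natural-gradient step for a softmax classifier, preconditioning the
    gradient by a diagonal empirical Fisher (a cheap loss-curvature proxy).
    W: (d, c) last-layer weights; X: (n, d) features; y_onehot: (n, c)."""
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    err = p - y_onehot                       # dL/dlogits per sample
    g = X.T @ err / len(X)                   # mean gradient, (d, c)
    # diagonal empirical Fisher: mean of squared per-sample gradients
    F_diag = (X ** 2).T @ (err ** 2) / len(X)
    return W - lr * g / (F_diag + eps)       # precondition by 1/F

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(5, 3))
X = rng.normal(size=(32, 5))
y = np.eye(3)[rng.integers(0, 3, size=32)]
print(natural_gradient_step(W, X, y).shape)  # (5, 3)
```

Dividing by the Fisher diagonal takes larger steps along flat directions and smaller ones along sharp directions, which is the loss-surface-geometry intuition the abstract appeals to.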
In this paper, we propose a novel representation learning method, Gromov-Wasserstein Autoencoders (GWAE), which directly matches the latent and data distributions using the variational autoencoding scheme. Instead of likelihood-based objectives, GWAE models minimize the Gromov-Wasserstein (GW) metric between the trainable prior and given data distributions. The GW metric measures the distance-structure-oriented discrepancy between distributions, even those with different dimensionalities, which provides a direct measure between the latent and data spaces. By restricting the prior family, we can introduce meta-priors into the latent space without changing the objective. The empirical comparisons with VAE-based models show that GWAE models work well under two prominent meta-priors, disentanglement and clustering, with their GW objective unchanged.","representation learning, deep generative models, variational autoencoders, optimal transport, implicit distributions, meta-prior, disentanglement, clustering" Localized Graph Contrastive Learning,https://openreview.net/forum?id=dSYkYNNZkV,https://openreview.net/pdf?id=dSYkYNNZkV,,"Contrastive learning methods based on the InfoNCE loss are popular in node representation learning tasks on graph-structured data. However, their reliance on data augmentation and their quadratic computational complexity might lead to inconsistency and inefficiency problems. To mitigate these limitations, in this paper, we introduce a simple yet effective contrastive model named Localized Graph Contrastive Learning (Local-GCL in short). Local-GCL consists of two key designs: 1) We fabricate the positive examples for each node directly using its first-order neighbors, which frees our method from the reliance on carefully designed graph augmentations; 2) To improve the efficiency of contrastive learning on graphs, we devise a kernelized contrastive loss, which can be approximately computed in linear time and space complexity with respect to the graph size. We provide theoretical analysis to justify the effectiveness and rationality of the proposed method. Experiments on various datasets with different scales and properties demonstrate that in spite of its simplicity, Local-GCL achieves quite competitive performance in self-supervised node representation learning tasks on graphs with various scales and properties.", OrthoReg: Improving Graph-regularized MLPs via Orthogonality Regularization,https://openreview.net/forum?id=5s2v_0F7MG,https://openreview.net/pdf?id=5s2v_0F7MG,,"Graph Neural Networks (GNNs) currently dominate the modeling of graph-structured data, while their high reliance on graph structure for inference significantly impedes their widespread application. By contrast, graph-regularized MLPs (GR-MLPs) implicitly inject the graph structure information into model weights, while their performance can hardly match that of GNNs in most tasks. This motivates us to study the causes of the limited performance of GR-MLPs. In this paper, we demonstrate that node embeddings learned from conventional GR-MLPs suffer from dimensional collapse, a phenomenon in which a few largest eigenvalues dominate the embedding space, and thus the expressive power is constrained. We further propose ORTHO-REG, a novel GR-MLP model, to mitigate the dimensional collapse issue. Through a soft regularization loss on the correlation matrix of node embeddings, ORTHO-REG explicitly encourages orthogonal node representations and thus can naturally avoid dimensionally collapsed representations.
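To illustrate the first design in the Local-GCL abstract above, here is a minimal augmentation-free contrastive loss in which a node's first-order neighbors act as positives. It is written quadratically for clarity; the paper's kernelized loss replaces the full similarity matrix with a linear-time approximation. Names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def local_gcl_loss(z, edge_index, tau=0.5):
    """Toy neighbor-as-positive contrastive loss.
    z: (N, d) node embeddings; edge_index: (2, E) COO edges (i -> j)."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                      # (N, N) similarities
    log_denom = torch.logsumexp(sim, dim=1)    # per anchor node
    src, dst = edge_index
    # -log softmax(sim[i, j]) averaged over neighbor (positive) pairs
    return (log_denom[src] - sim[src, dst]).mean()

z = torch.randn(6, 16, requires_grad=True)
edges = torch.tensor([[0, 1, 2, 3], [1, 0, 3, 2]])
loss = local_gcl_loss(z, edges)
loss.backward()
print(float(loss))
```

Using graph topology itself as the source of positives is what removes the dependence on hand-crafted augmentations that the abstract criticizes.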
Experiments on traditional transductive semi-supervised classification tasks and inductive node classification for cold-start scenarios demonstrate its effectiveness and superiority.", Group-Equivariant Transformers Without Positional Encoding,https://openreview.net/forum?id=0tiMn18oNd,https://openreview.net/pdf?id=0tiMn18oNd,"We propose an effective group-equivariant transformer without positional encoding, replacing point-wise MLPs with group-equivariant convolutions to act as both a group mixer and an implicit positional encoding.","Self-attention is a permutation-equivariant operator in its basic form and can further extend to achieve equivariance for a specific symmetry group by incorporating group-invariant positional encoding. In this work, we propose an effective group-equivariant transformer without positional encoding. Instead of injecting group-invariant position encoding into the transformer, we replace point-wise MLPs with group-equivariant convolutions that act as both a group mixer and an implicit positional encoding. This allows us to reduce the group of self-attention to translation only while preserving group equivariance, resulting in less computation and memory. Our strategy not only retains the dynamic long-range interactions of transformers but also incorporates the static effective kernel learning of convolution, resulting in a significant accuracy gain. We also find that adopting a group-equivariant convolution stem and a translation-equivariant pooling further improves the performance. The proposed method sets a new state of the art in standard benchmarks, outperforming the existing group-equivariant transformers by a large margin.","equivariant, invariant, group-equivariant, self-attention, transformer, group-equivariant convolution, group-equivariant self-attention" CUSTOMIZING PRE-TRAINED DIFFUSION MODELS FOR YOUR OWN DATA,https://openreview.net/forum?id=FOeVcSmRAeQ,https://openreview.net/pdf?id=FOeVcSmRAeQ,We propose a method to utilize pre-trained text-to-image diffusion models to generate a custom dataset.,"Recently, several large-scale text-to-image diffusion models have been released, showing unprecedented performance. Since the shift from learning a task-specific model from scratch to leveraging pre-trained large-scale models is an inevitable trend in deep generative modeling, it is necessary to develop methods to better utilize these models. In this paper, we propose a method dubbed Diffusion model for Your Own Data (DYOD) that can effectively utilize a pre-trained text-to-image diffusion model to approximate the implicit distribution of a custom dataset. Specifically, we first obtain a text prompt that can best represent the custom dataset through optimization in the semantic latent space of the diffusion model. To better control generated image content, in particular the geometry of the objects, we show that the text prompt alone is not sufficient; rather, an informative initialization that can guide the pre-trained diffusion model is necessary. As representative examples, we demonstrate that a learned distribution initialization from the user's dataset, or an image initialization from the user's sketch, photo, etc., serves the goal of customizing the diffusion model for the user's own data.
Experiments show that the customized DYOD outperforms the Stable Diffusion baselines both qualitatively and quantitatively, with accelerated sampling speed.","Diffusion models, score-based models, generative models, personalization" Optimal Activation Functions for the Random Features Regression Model,https://openreview.net/forum?id=ltWade-cpK,https://openreview.net/pdf?id=ltWade-cpK,,"The asymptotic mean squared test error and sensitivity of the Random Features Regression model (RFR) have been recently studied. We build on this work and identify in closed form the family of Activation Functions (AFs) that minimize a combination of the test error and sensitivity of the RFR under different notions of functional parsimony. We find scenarios under which the optimal AFs are linear, saturated linear functions, or expressible in terms of Hermite polynomials. Finally, we show how using optimal AFs impacts well-established properties of the RFR model, such as its double descent curve, and the dependency of its optimal regularization parameter on the observation noise level.","Random Features Regression Model, Learning theory for neural networks, Functional analysis and variational calculus" Deep Learning-based Source Code Complexity Prediction,https://openreview.net/forum?id=9irBKvxsw9,https://openreview.net/pdf?id=9irBKvxsw9,We suggest a deep-learning based approach for estimating the computational (time) complexity of given programs and provide the largest code complexity dataset as the benchmark.,"Deciding the computational complexity of algorithms is a challenging problem even for human algorithm experts. Theoretically, the problem of deciding the computational complexity of a given program is undecidable due to the famous Halting problem. In this paper, we tackle the problem by designing a neural network that comprehends the algorithmic nature of codes and estimates the worst-case complexity. First, we construct a code dataset called CodeComplex that consists of 4,120 Java codes submitted to programming competitions by human programmers and their complexity labels annotated by a group of algorithm experts. As far as we are aware, the CodeComplex dataset is by far the largest code dataset for the complexity prediction problem. Then, we present several baseline algorithms using previous code-understanding neural models such as CodeBERT, GraphCodeBERT, PLBART, and CodeT5. As the previous code-understanding models do not work well on longer codes due to the code length limit, we propose a hierarchical Transformer architecture which takes method-level code snippets instead of whole codes and combines the method-level embeddings into the class-level embedding and ultimately into the code-level embedding. Moreover, we introduce pre-training objectives for the proposed model to induce the model to learn both the intrinsic property of the method-level codes and the relationship between the components.
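Relating to the Optimal Activation Functions abstract above: a hedged sketch of random features regression whose activation is a linear combination of (probabilists') Hermite polynomials, the family in which the paper's optimal activations can be expressed. The coefficients below are arbitrary placeholders, not the derived optimal ones.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def rfr_with_hermite_activation(X, y, coeffs=(0.0, 1.0, 0.0, 0.2),
                                n_feats=256, lam=1e-2, seed=0):
    """Toy RFR: random first-layer weights, a Hermite-series activation
    (sum_k c_k He_k applied elementwise), and a closed-form ridge readout."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(size=(d, n_feats)) / np.sqrt(d)   # random features
    H = hermeval(X @ W, list(coeffs))                # Hermite-series activation
    w = np.linalg.solve(H.T @ H + lam * np.eye(n_feats), H.T @ y)
    return lambda Xnew: hermeval(Xnew @ W, list(coeffs)) @ w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)
predict = rfr_with_hermite_activation(X, y)
print(predict(X[:3]))
```

Parameterizing the activation by Hermite coefficients makes "choosing an activation" a small vector optimization, which is what lets the paper characterize optima in closed form.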
Lastly, we demonstrate that the proposed hierarchical architecture and pre-training objectives achieve state-of-the-art performance in terms of complexity prediction accuracy compared to the previous code-understanding models.","computational complexity, code classification, programming language, data augmentation, code understanding" Learning to Learn with Generative Models of Neural Network Checkpoints,https://openreview.net/forum?id=JXkz3zm8gJ,https://openreview.net/pdf?id=JXkz3zm8gJ,We construct a dataset of neural network checkpoints and train a loss-conditional generative model on the parameters. The generative model can train neural networks with unseen initializations in one step.,"We explore a data-driven approach for learning to optimize neural networks. We construct a dataset of neural network checkpoints and train a generative model on the parameters. In particular, our model is a conditional diffusion transformer that, given an initial input parameter vector and a prompted loss, error, or return, predicts the distribution over parameter updates that achieve the desired metric. At test time, it can optimize neural networks with unseen parameters for downstream tasks in just one update. We find that our approach successfully generates parameters for a wide range of loss prompts. Moreover, it can sample multimodal parameter solutions and has favorable scaling properties. We apply our method to different neural network architectures and tasks in supervised and reinforcement learning. ","diffusion, DDPMs, learning to learn, generative models, transformers" Improving Explanation Reliability through Group Attribution,https://openreview.net/forum?id=BeI1fdNH_X,https://openreview.net/pdf?id=BeI1fdNH_X,We propose group-wise attribution methods to yield more reliable explanations in understanding a model's prediction,"Although input attribution methods are mainstream in understanding predictions of DNNs for straightforward interpretations, the non-linearity of DNNs often makes the attributed scores unreliable in explaining a given prediction, deteriorating the faithfulness of the explanation. However, this challenge could be mitigated by attributing scores to groups of explanatory components instead of individuals, termed group attribution. While a group attribution explains the group-wise contribution more reliably, it does not explain the component-wise contributions, so estimating component-wise scores from it yields a less reliable explanation, indicating the trade-off of group attribution. In this work, we introduce generalized definitions of reliability loss and group attribution, and formulate the optimization problem of the trade-off with these terms. We apply our formalization to Shapley value attribution and propose the optimization method G-SHAP. We show the effectiveness and explanatory benefits of our method through empirical results on image classification tasks.","Explainable AI, Attribution methods, Group attribution, Attribution reliability" SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data,https://openreview.net/forum?id=CtR4H2enl90,https://openreview.net/pdf?id=CtR4H2enl90,This is the first work to explicitly use phonemes/hidden units as the shared space of speech and text modalities for pre-training.,"How to boost speech pre-training with textual data is an unsolved problem due to the fact that speech and text are very different modalities with distinct characteristics.
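On the group attribution abstract above: a minimal Monte Carlo sketch of Shapley values computed for groups of features rather than individual ones. This illustrates generic group attribution, not the G-SHAP optimization; `predict`, `baseline`, and `groups` are assumed, hypothetical inputs.

```python
import numpy as np

def group_shapley(predict, x, baseline, groups, n_perm=200, seed=0):
    """Monte Carlo group-Shapley estimate: sample permutations of the groups
    and average each group's marginal contribution when its features switch
    from a baseline to x. predict: f(batch)->(batch,); groups: index arrays."""
    rng = np.random.default_rng(seed)
    G = len(groups)
    phi = np.zeros(G)
    for _ in range(n_perm):
        order = rng.permutation(G)
        z = baseline.copy()
        prev = predict(z[None])[0]
        for g in order:
            z[groups[g]] = x[groups[g]]       # reveal group g's features
            cur = predict(z[None])[0]
            phi[g] += cur - prev              # marginal contribution
            prev = cur
    return phi / n_perm

f = lambda X: X.sum(axis=1)                   # toy additive model
x = np.array([1.0, 2.0, 3.0, 4.0])
print(group_shapley(f, x, np.zeros(4), [np.array([0, 1]), np.array([2, 3])]))
# for an additive model this recovers the exact group sums: [3., 7.]
```

Coarsening the players from components to groups is exactly what trades component-level detail for the reliability gain the abstract describes.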
In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation. Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities, including phoneme-unit and hidden-unit tokenizers, which can be trained using a small amount of paired speech-text data. Based on the trained tokenizers, we convert the unlabeled speech and text data into tokens of phoneme units or hidden units. The pre-training objective is designed to unify the speech and the text into the same discrete semantic space with a unified Transformer network. Leveraging only 10K text sentences, our SpeechLM achieves a 16\% relative WER reduction over the best base model performance (from 6.8 to 5.7) on the public LibriSpeech ASR benchmark. Moreover, SpeechLM with fewer parameters even outperforms previous SOTA models on CoVoST-2 speech translation tasks. We also evaluate our SpeechLM on various spoken language processing tasks under the universal representation evaluation framework SUPERB, demonstrating significant improvements on content-related tasks. Our code and models are available at https://anonymous.","Speech pre-training, Speech-text joint modeling, Unified tokenizer" Uncertainty Guided Depth Fusion for Spike Camera,https://openreview.net/forum?id=xxhl7l64Nsz,https://openreview.net/pdf?id=xxhl7l64Nsz,We propose a novel uncertainty-guided fusion framework for spike depth estimation.,"Neuromorphic spike cameras capture visual streams with a high frame rate in a bio-inspired way, bringing vast potential in various real-world applications such as autonomous driving. Compared with traditional cameras, spike camera data has an inherent advantage in overcoming motion blur, leading to more accurate depth estimation in high-velocity circumstances. However, depth estimation with a spike camera remains very challenging when using traditional monocular or stereo depth estimation algorithms, which are based on photometric consistency. In this paper, we propose a novel and effective approach for spike depth estimation, which fuses monocular and stereo depth estimation for the spike camera based on the uncertainty of the prediction. Our approach is motivated by the fact that stereo spike depth estimation achieves better results at closer range while monocular spike depth estimation obtains better results at farther range. Therefore, we introduce an Uncertainty-Guided Depth Fusion (UGDF) framework with a joint training strategy and estimate the distributed uncertainty to fuse the monocular and stereo results. In order to demonstrate the advantage of spike depth estimation over traditional camera-based depth estimation, we contribute a spike-depth dataset named CitySpike20K, which contains 20K paired samples, for spike depth estimation. We also introduce the Spike-Kitti dataset to demonstrate the effectiveness and generalization of our method under real-world scenarios. Extensive experiments are conducted to evaluate our method on CitySpike20K and Spike-Kitti. UGDF achieves state-of-the-art results on both CitySpike20K and Spike-Kitti, surpassing all the monocular or stereo spike depth estimation baselines. To the best of our knowledge, our framework is the first end-to-end dual-task fusion framework for spike camera depth estimation.
Code and dataset will be released.","Spike camera, Uncertainty, Depth estimation, Fusion Strategies" Personalized Semantics Excitation for Federated Image Classification,https://openreview.net/forum?id=NxnYzayR2CW,https://openreview.net/pdf?id=NxnYzayR2CW,,"Federated learning casts a light on the collaboration of distributed local clients with privacy protection to attain a more generic global model. However, significant distribution shift in the input/label space across different clients makes it challenging to generalize well to all clients, which motivates personalized federated learning (PFL). Existing PFL methods typically customize the local model by fine-tuning with limited local supervision and the global model regularizer, which secures local specificity but risks ruining the global discriminative knowledge. In this paper, we propose a novel Personalized Semantics Excitation ($\textbf{PSE}$) mechanism to break through this limitation by exciting and fusing $\textit{personalized}$ semantics from the global model during local model customization. Specifically, PSE explores channel-wise gradient differentiation across global and local models to identify important low-level semantics, mostly from convolutional layers, which are embedded into the client-specific training. In addition, PSE deploys the collaboration of global and local models to enrich high-level feature representations and facilitates the robustness of the client classifier through a cross-model attention module. Extensive experiments and analysis on various image classification benchmarks demonstrate the effectiveness and advantage of our method over the state-of-the-art PFL methods.", Intrinsic Motivation via Surprise Memory,https://openreview.net/forum?id=hlsu-HrU7ON,https://openreview.net/pdf?id=hlsu-HrU7ON,Intrinsic motivation emerges through measuring surprise novelty as retrieval errors of a memory network,"We present a new computing model for intrinsic rewards in reinforcement learning that addresses the limitations of existing surprise-driven explorations. The reward is the novelty of the surprise rather than the surprise norm. We estimate the surprise novelty as retrieval errors of a memory network wherein the memory stores and reconstructs surprises. Our surprise memory (SM) augments the capability of surprise-based intrinsic motivators, maintaining the agent's interest in exciting exploration while reducing unwanted attraction to unpredictable or noisy observations. Our experiments demonstrate that the SM combined with various surprise predictors exhibits efficient exploring behaviors and significantly boosts the final performance in sparse reward environments, including Noisy-TV, navigation and challenging Atari games. ","reinforcement learning, intrinsic motivation, exploration, memory" Dr-Fairness: Dynamic Data Ratio Adjustment for Fair Training on Real and Generated Data,https://openreview.net/forum?id=1FVv8PS8LYW,https://openreview.net/pdf?id=1FVv8PS8LYW,"We propose a novel sampling approach called Dr-Fairness that adaptively adjusts data ratios among groups and between real and generated data, which improves group fairness while minimizing accuracy degradation.","Fair visual recognition has become critical for preventing demographic disparity. A major cause of model unfairness is the imbalanced representation of different groups in training data. Recently, several works aim to alleviate this issue using generated data.
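To make the fusion rule in the UGDF abstract concrete, here is a minimal per-pixel inverse-variance weighting sketch: each branch dominates where it is most confident (stereo near, monocular far). This is a simplified stand-in for UGDF's learned fusion, with predicted variances assumed as the uncertainty estimates.

```python
import numpy as np

def fuse_depth(d_mono, var_mono, d_stereo, var_stereo, eps=1e-6):
    """Per-pixel inverse-variance fusion of monocular and stereo depth maps."""
    w_mono = 1.0 / (var_mono + eps)
    w_stereo = 1.0 / (var_stereo + eps)
    return (w_mono * d_mono + w_stereo * d_stereo) / (w_mono + w_stereo)

rng = np.random.default_rng(0)
d_m, d_s = rng.uniform(1, 80, (2, 4, 4))      # toy depth maps (meters)
v_m, v_s = rng.uniform(0.1, 1.0, (2, 4, 4))   # toy predicted variances
print(fuse_depth(d_m, v_m, d_s, v_s).shape)   # (4, 4)
```

Inverse-variance weighting is the minimum-variance combination of two independent unbiased estimates, which is why it is a natural baseline for uncertainty-guided fusion.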
However, these approaches often use generated data to obtain similar amounts of data across groups, which is not optimal for achieving high fairness due to different learning difficulties and generated data qualities across groups. To address this issue, we propose a novel adaptive sampling approach that leverages both real and generated data for fairness. We design a bilevel optimization that finds the optimal data sampling ratios among groups and between real and generated data while training a model. The ratios are dynamically adjusted considering both the model's accuracy as well as its fairness. To efficiently solve our non-convex bilevel optimization, we propose a simple approximation to the solution given by the implicit function theorem. Extensive experiments show that our framework achieves state-of-the-art fairness and accuracy on the CelebA and ImageNet People Subtree datasets. We also observe that our method adaptively relies less on the generated data when it has poor quality.","trustworthy AI, fairness, visual recognition, generated data" Unsupervised Object-Centric Learning with Bi-level Optimized Query Slot Attention,https://openreview.net/forum?id=_-FN9mJsgg,https://openreview.net/pdf?id=_-FN9mJsgg,"With simple adjustments to slot attention, we propose a model that significantly outperforms previous baselines (~10%) on both synthetic and real images, and shows potential in concept binding and generalization.","The ability to decompose complex natural scenes into meaningful object-centric abstractions lies at the core of human perception and reasoning. In the recent culmination of unsupervised object-centric learning, the Slot-Attention module has played an important role with its simple yet effective design and has fostered many powerful variants. These methods, however, have been exceedingly difficult to train without supervision and are ambiguous in the notion of object, especially for complex natural scenes. In this paper, we propose to address these issues by (1) initializing Slot-Attention modules with learnable queries and (2) optimizing the model with bi-level optimization. With simple code adjustments on the vanilla Slot-Attention, our model, BO-QSA, achieves state-of-the-art results on both synthetic and complex real-world datasets in unsupervised image segmentation and reconstruction, outperforming previous baselines by a large margin (~10%). We provide thorough ablative studies to validate the necessity and effectiveness of our design. Additionally, our model exhibits excellent potential for concept binding and zero-shot learning. We hope our effort could provide a single home for the design and learning of slot-based models and pave the way for more challenging tasks in object-centric learning.",unsupervised object-centric learning Set-Level Self-Supervised Learning from Noisily-Labeled Data,https://openreview.net/forum?id=xeg2fW5E2l3,https://openreview.net/pdf?id=xeg2fW5E2l3,This paper proposes set-level self-supervised learning techniques on each training mini-batch to tackle noisy-label learning problems.,"Noisy labels are inevitably present in real-world datasets due to labeling errors or visual content ambiguity. Existing methods generally approach the task of noisy label learning (NLL) by either properly regularizing the model, or reweighting clean/noisy labeled samples.
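On the Dr-Fairness abstract above: the paper solves a bilevel problem via an implicit-function-theorem approximation; the following is a much simpler, hypothetical exponentiated-gradient heuristic for the same quantity, the sampling ratios across groups and data sources, included only to show the shape of a dynamic ratio update. It is not the authors' algorithm.

```python
import numpy as np

def update_sampling_ratios(p, group_losses, eta=0.5):
    """Shift sampling probability toward groups/sources with higher current
    loss, then renormalize. p: (G,) probabilities; group_losses: (G,)."""
    p = p * np.exp(eta * np.asarray(group_losses))
    return p / p.sum()

# toy: 4 strata, e.g., {group A, group B} x {real, generated}
p = np.full(4, 0.25)
for losses in ([0.9, 0.4, 0.7, 0.2], [0.8, 0.4, 0.6, 0.2]):
    p = update_sampling_ratios(p, losses)
print(p)  # mass moves toward the harder strata
```

A fairness-aware variant would feed a disparity metric rather than a raw loss into the update, mirroring the accuracy-plus-fairness objective the abstract describes.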
While self-supervised learning (SSL) has been applied to pre-train deep neural networks without label supervision, downstream tasks like image classification still require clean labeled data. Moreover, most SSL strategies are performed at the instance level, regardless of the correctness of the label. In this paper, we propose set-level self-supervised learning (SLSSL), which performs SSL at the mini-batch level with observed noisy labels. By corrupting the labels of each training mini-batch, our SLSSL encourages the model to exhibit sufficient robustness. Moreover, the proposed SLSSL can also be utilized for sample reweighting. As a result, the proposed learning scheme can be applied as an expectation-maximization (EM) algorithm during model training. Extensive experiments on synthetic and real-world noisy label data confirm the effectiveness of our framework.","Self-supervised learning, Noisy label learning, Meta-learning, EM algorithm" Learning an Invertible Output Mapping Can Mitigate Simplicity Bias in Neural Networks,https://openreview.net/forum?id=zH9GcZ3ZGXu,https://openreview.net/pdf?id=zH9GcZ3ZGXu,"We propose a regularizer that enforces the reconstruction of features from the output logits of neural networks, in order to overcome Simplicity Bias and boost their OOD generalization.","Deep Neural Networks are known to be brittle to even minor distribution shifts compared to the training distribution. While one line of work has demonstrated that \emph{Simplicity Bias} (SB) of DNNs -- bias towards learning only the simplest features -- is a key reason for this brittleness, another recent line of work has surprisingly found that diverse/complex features are indeed learned by the backbone, and their brittleness is due to the linear classification head relying primarily on the simplest features. To bridge the gap between these two lines of work, we first hypothesize and verify that while SB may not altogether preclude learning complex features, it amplifies simpler features over complex ones. Namely, simple features are replicated several times in the learned representations while complex features might not be replicated. This phenomenon, which we term the \emph{Feature Replication Hypothesis}, coupled with the \emph{Implicit Bias} of SGD to converge to maximum-margin solutions in the feature space, leads the models to rely mostly on the simple features for classification. To mitigate this bias, we propose the \emph{Feature Reconstruction Regularizer (FRR)} to ensure that the learned features can be reconstructed back from the logits. The use of \emph{FRR} in linear layer training (\emph{FRR-L}) encourages the use of more diverse features for classification. We further propose to finetune the full network by freezing the weights of the linear layer trained using \emph{FRR-L}, to refine the learned features, making them more suitable for classification. Using this simple solution, we demonstrate up to 15\% gains in OOD accuracy on the recently introduced semi-synthetic datasets with extreme distribution shifts.
Moreover, we demonstrate noteworthy gains over existing SOTA methods on the standard OOD benchmark DomainBed as well.","Simplicity Bias, Out-of-distribution robustness, OOD Generalization, Deep Learning" Theoretical generalization bounds for improving the efficiency of deep online training,https://openreview.net/forum?id=ItUvrU0dQpC,https://openreview.net/pdf?id=ItUvrU0dQpC,,"In the era of data explosion, online machine learning, in which learning models are updated in real time, has become essential due to the growth of data in practice. In particular, it is more challenging to collect and annotate massive new data accurately and in a timely manner compared to traditional offline supervised training settings. Although this online training framework has been shown to be practically beneficial, there has been a lack of theoretical guarantees for the learning performance, especially for the case with noisy labels. This paper aims to investigate a learning theory for both original deep online training and online training with noisy labels. We first introduce theoretical bounds on the gaps of empirical risks and the gaps of generalization risks in micro-batch online training when learning with both clean and noisy labels. These bounds can efficiently help guide the online training scheme when receiving new data. We next analyze the impact of micro-batch size on the learning performance of models with noisy labels through our experimental results on the CIFAR10 and CIFAR100 datasets using different noise levels, which consistently demonstrates the merit of the bounds above in the online training setting.","online training, generalization risk, noisy label" EUCLID: Towards Efficient Unsupervised Reinforcement Learning with Multi-choice Dynamics Model,https://openreview.net/forum?id=xQAjSr64PTc,https://openreview.net/pdf?id=xQAjSr64PTc,"We propose a novel model-fused paradigm for Unsupervised Reinforcement Learning to jointly pre-train the dynamics model and unsupervised exploration policy in the pre-training phase, thus improving the downstream task sampling efficiency.","Unsupervised reinforcement learning (URL) poses a promising paradigm to learn useful behaviors in a task-agnostic environment without the guidance of extrinsic rewards to facilitate the fast adaptation of various downstream tasks. Previous works focused on pre-training in a model-free manner, lacking the study of transition dynamics modeling and thus leaving large room for improving sample efficiency in downstream tasks. To this end, we propose an Efficient Unsupervised Reinforcement Learning Framework with a Multi-choice Dynamics model (EUCLID), which introduces a novel model-fused paradigm to jointly pre-train the dynamics model and unsupervised exploration policy in the pre-training phase, thus better leveraging the environmental samples and improving the downstream task sampling efficiency. However, constructing a generalizable model which captures the local dynamics under different behaviors remains a challenging problem. We introduce the multi-choice dynamics model that covers different local dynamics under different behaviors concurrently, which uses different heads to learn the state transition under different behaviors during unsupervised pre-training and selects the most appropriate head for prediction in the downstream task.
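To illustrate the Feature Reconstruction Regularizer (FRR) described in the abstract above: a minimal sketch in which a linear decoder maps logits back to backbone features and the reconstruction error is added to the classification loss. The linear decoder is one simple choice of reconstruction map assumed here for illustration.

```python
import torch
import torch.nn as nn

class FeatureReconstructionRegularizer(nn.Module):
    """Toy FRR: cross-entropy plus the error of reconstructing features from
    logits, discouraging the head from discarding all but the simplest features."""

    def __init__(self, num_classes, feat_dim, lam=1.0):
        super().__init__()
        self.decoder = nn.Linear(num_classes, feat_dim)   # logits -> features
        self.lam = lam
        self.ce = nn.CrossEntropyLoss()

    def forward(self, logits, features, targets):
        recon = self.decoder(logits)
        return self.ce(logits, targets) + self.lam * (recon - features).pow(2).mean()

# toy usage in a linear-probe setting with frozen backbone features
feat = torch.randn(32, 128)
head = nn.Linear(128, 10)
crit = FeatureReconstructionRegularizer(num_classes=10, feat_dim=128)
loss = crit(head(feat), feat, torch.randint(0, 10, (32,)))
loss.backward()
print(float(loss))
```

If the features were fully recoverable from the logits, the output mapping would be (approximately) invertible on the feature manifold, which is the intuition behind the paper's title.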
Experimental results in the manipulation and locomotion domains demonstrate that EUCLID achieves state-of-the-art performance with high sample efficiency, basically solving the state-based URLB benchmark and reaching a mean normalized score of 104.0±1.2% in downstream tasks with 100k fine-tuning steps, which is equivalent to DDPG's performance at 2M interactive steps with 20× more data. Codes and visualization videos are released on our homepage.","Reinforcement Learning, Unsupervised RL, Model-based RL" A General Framework for Sample-Efficient Function Approximation in Reinforcement Learning,https://openreview.net/forum?id=dqITIpZ5Z4b,https://openreview.net/pdf?id=dqITIpZ5Z4b,We provide a unified framework that nearly includes all model-free and model-based RL classes while maintaining sharp sample efficiency.,"With the increasing need for handling large state and action spaces, general function approximation has become a key technique in reinforcement learning (RL). In this paper, we propose a general framework that unifies model-based and model-free RL, and an Admissible Bellman Characterization (ABC) class that subsumes nearly all Markov decision process (MDP) models in the literature for tractable RL. We propose a novel estimation function with decomposable structural properties for optimization-based exploration and the functional Eluder dimension as a complexity measure of the ABC class. Under our framework, a new sample-efficient algorithm named OPtimization-based ExploRation with Approximation (OPERA) is proposed, achieving regret bounds that match or improve over the best-known results for a variety of MDP models. In particular, for MDPs with low Witness rank, under a slightly stronger assumption, OPERA improves the state-of-the-art sample complexity results by a factor of $dH$. Our framework provides a generic interface to design and analyze new RL models and algorithms.","general function approximation, sample-efficient RL, optimization-based exploration, Eluder dimension, Bellman rank, witness rank, complexity measure, hypothesis class" FedDM: Iterative Distribution Matching for Communication-Efficient Federated Learning,https://openreview.net/forum?id=40RNVzSoCqD,https://openreview.net/pdf?id=40RNVzSoCqD,,"Federated learning (FL) has recently attracted increasing attention from academia and industry, with the ultimate goal of achieving collaborative training under privacy and communication constraints. Existing iterative model-averaging-based FL algorithms require a large number of communication rounds to obtain a well-performing model due to extremely unbalanced and non-i.i.d. data partitioning among different clients. Thus, we propose FedDM to build the global training objective from multiple local surrogate functions, which enables the server to gain a more global view of the loss landscape. In detail, we construct synthetic sets of data on each client to locally match the loss landscape from the original data through distribution matching. FedDM reduces communication rounds and improves model quality by transmitting more informative and smaller synthesized data compared with unwieldy model weights. We conduct extensive experiments on three image classification datasets, and the results show that our method can outperform other FL counterparts in terms of efficiency and model performance.
Moreover, we demonstrate that FedDM can be adapted to preserve differential privacy with the Gaussian mechanism and train a better model under the same privacy budget.","data distillation, federated learning" A Representation Bottleneck of Bayesian Neural Networks,https://openreview.net/forum?id=DWDPhB6Hi7k,https://openreview.net/pdf?id=DWDPhB6Hi7k,We theoretically prove and empirically verify a representation bottleneck of Bayesian neural networks.,"Unlike standard deep neural networks (DNNs), Bayesian neural networks (BNNs) formulate network weights as probability distributions, which results in representation capacities distinct from those of standard DNNs. In this paper, we explore the representation bottleneck of BNNs from the perspective of conceptual representations. It is proven that the logic of a neural network can be faithfully mimicked by a specific sparse causal graph, where each causal pattern can be considered as a concept encoded by the neural network. Then, we formally define the complexity of concepts and prove that, compared to standard DNNs, it is more difficult for BNNs to encode complex concepts. Extensive experiments verify our theoretical proofs. The code will be released when the paper is accepted.","interpretability, Bayesian neural network" Maximizing Spatio-Temporal Entropy of Deep 3D CNNs for Efficient Video Recognition,https://openreview.net/forum?id=lj1Eb1OPeNw,https://openreview.net/pdf?id=lj1Eb1OPeNw,,"3D convolutional neural networks (CNNs) have been the prevailing option for video recognition. To capture the temporal information, 3D convolutions are computed along the sequences, leading to cubically growing and expensive computations. To reduce the computational cost, previous methods resort to manually designed 3D/2D CNN structures with approximations or automatic search, which sacrifice the modeling ability or make training time-consuming. In this work, we propose to automatically design efficient 3D CNN architectures via a novel training-free neural architecture search approach tailored for 3D CNNs, considering the model complexity. To measure the expressiveness of 3D CNNs efficiently, we formulate a 3D CNN as an information system and derive an analytic entropy score, based on the Maximum Entropy Principle. Specifically, we propose a spatio-temporal entropy score (STEntr-Score) with a refinement factor to handle the discrepancy of visual information in spatial and temporal dimensions, by dynamically leveraging the correlation between the feature map size and kernel size depth-wise. Highly efficient and expressive 3D CNN architectures, i.e., entropy-based 3D CNNs (E3D family), can then be efficiently searched by maximizing the STEntr-Score under a given computational budget, via an evolutionary algorithm without training the network parameters. Extensive experiments on Something-Something V1&V2 and Kinetics400 demonstrate that the E3D family achieves state-of-the-art performance with higher computational efficiency.", Cycle to Clique (Cy2C) Graph Neural Network: A Sight to See beyond Neighborhood Aggregation,https://openreview.net/forum?id=7d-g8KozkiE,https://openreview.net/pdf?id=7d-g8KozkiE,,"Graph neural networks have been successfully adapted for learning vector representations of graphs through various neighborhood aggregation schemes. Previous research suggests, however, that they possess limitations in incorporating key non-Euclidean topological properties of graphs.
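On the FedDM abstract above: a bare-bones sketch of local data distillation by distribution matching, where a small synthetic set is optimized so that its embedding statistics match the real data's under randomly initialized feature extractors, and the synthetic set (rather than model weights) is what gets communicated. The architecture and loss below are simplifying assumptions, not the paper's exact objective.

```python
import torch
import torch.nn as nn

def distill_by_distribution_matching(x_real, n_syn=16, steps=200, lr=0.1):
    """Optimize a synthetic set so its embedding mean matches the real data's
    under fresh random frozen networks (a distribution-matching surrogate)."""
    d = x_real.shape[1]
    x_syn = torch.randn(n_syn, d, requires_grad=True)
    opt = torch.optim.SGD([x_syn], lr=lr)
    for _ in range(steps):
        net = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 32))
        for p in net.parameters():
            p.requires_grad_(False)           # random frozen embedding
        loss = (net(x_real).mean(0) - net(x_syn).mean(0)).pow(2).sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return x_syn.detach()

x_real = torch.randn(256, 20) + 2.0
print(distill_by_distribution_matching(x_real).mean().item())  # drifts toward ~2
```

Shipping a handful of synthetic points per round instead of full model weights is the source of the communication savings the abstract claims.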
This paper mathematically characterizes the capability of graph neural networks to classify isomorphism classes of graphs with continuous node attributes up to their local topological properties. In light of these observations, we construct the Cycle to Clique graph neural network, a novel yet simple algorithm which topologically enriches the input data of conventional graph neural networks while preserving their architectural components. This method theoretically outperforms conventional graph neural networks in classifying isomorphism classes of graphs while ensuring comparable time complexity in representing random graphs. Empirical results further support that the novel algorithm produces comparable or enhanced results in classifying benchmark graph data sets compared to contemporary variants of graph neural networks.", Latent State Marginalization as a Low-cost Approach to Improving Exploration,https://openreview.net/forum?id=b0UksKFcTOL,https://openreview.net/pdf?id=b0UksKFcTOL,We show how to efficiently marginalize latent variable policies for MaxEnt RL to enable better exploration and more robust training at very minimal additional cost.,"While the maximum entropy (MaxEnt) reinforcement learning (RL) framework -- often touted for its exploration and robustness capabilities -- is usually motivated from a probabilistic perspective, the use of deep probabilistic models has not gained much traction in practice due to their inherent complexity. In this work, we propose the adoption of latent variable policies within the MaxEnt framework, which can provably approximate any policy distribution and, additionally, arise naturally under the use of world models with a latent belief state. We discuss why latent variable policies are difficult to train, how naive approaches can fail, and subsequently introduce a series of improvements centered around low-cost marginalization of the latent state, allowing us to make full use of the latent state at minimal additional cost. We instantiate our method under the actor-critic framework, marginalizing both the actor and critic. The resulting algorithm, referred to as Stochastic Marginal Actor-Critic (SMAC), is simple yet effective. We experimentally validate our method on continuous control tasks, showing that effective marginalization can lead to better exploration and more robust training.","MaxEnt RL, latent variable modeling, world models" TensorVAE: A Direct Generative Model for Molecular Conformation Generation driven by Novel Feature Engineering,https://openreview.net/forum?id=TtMJJWG_J1j,https://openreview.net/pdf?id=TtMJJWG_J1j,,"Efficient generation of 3D conformations of a molecule from its 2D graph is a key challenge in in-silico drug discovery. Deep learning (DL) based generative modelling has recently become a potent tool for tackling this challenge. However, many existing DL-based methods are either indirect, leveraging inter-atomic distances, or direct but requiring complex feature transformations or numerous sampling steps to generate conformations. In this work, we propose a simple model abbreviated TensorVAE capable of generating conformations directly from a 2D molecular graph in a single step. The main novelty of the proposed method is focused on feature engineering. We develop a novel encoding and feature extraction mechanism relying solely on standard convolution operations to generate a token-like feature vector for each atom.
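The SMAC abstract above hinges on marginalizing the latent state of a latent-variable policy. One standard way to do this cheaply is a Monte Carlo plug-in estimate of the marginal log-likelihood; the interface below (a `prior` producing a latent distribution and a `policy` producing an action distribution) is an assumption for illustration.

```python
# Sketch of low-cost latent marginalization for a latent-variable policy
# pi(a|s) = E_{z ~ p(z|s)}[pi(a|s,z)]: estimate log pi(a|s) with K samples.
import math
import torch

def marginal_log_prob(policy, prior, state, action, K: int = 8):
    zs = prior(state).sample((K,))                 # K latent draws
    logp = torch.stack([policy(state, z).log_prob(action) for z in zs])
    # log (1/K) sum_k pi(a|s,z_k): a consistent (slightly downward-biased,
    # by Jensen) plug-in estimator of the marginal log-likelihood.
    return torch.logsumexp(logp, dim=0) - math.log(K)
```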
These feature vectors are then transformed through standard transformer encoders under a conditional Variational Autoencoder framework for learning to generate conformations directly. We show through experiments on two benchmark datasets that with intuitive and sensible feature engineering, a relatively simple and standard model can provide promising generative capability rivalling recent state-of-the-art models employing more sophisticated and specialized generative architectures.","generative model, feature engineering, molecular conformation generation" Smoothed-SGDmax: A Stability-Inspired Algorithm to Improve Adversarial Generalization,https://openreview.net/forum?id=O0sS_cujvV0,https://openreview.net/pdf?id=O0sS_cujvV0,,"Unlike standard training, deep neural networks can suffer from serious overfitting problems in adversarial settings, which has been studied extensively in empirical work. Recent research (e.g., Xing et al. (2021); Xiao et al. (2022)) shows that SGDmax-based adversarial training algorithms with $1/s(T)$ training loss incur a stability-based generalization bound of $\Theta(c+s(T)/n)$. Here $T$ is the number of iterations, $n$ is the number of samples, $s(T)\rightarrow \infty$ as $T\rightarrow \infty$, and $c$ is an $n$-independent term. This reveals that adversarial training can have nonvanishing generalization errors even if the sample size $n$ goes to infinity. A natural question arises: can we eliminate the nonvanishing term $c$ by designing a more generalizable algorithm? We give an affirmative answer in this paper. First, by an adaptation of the information-theoretic lower bound on the complexity of solving Lipschitz-convex problems using randomized algorithms, we show that a minimax lower bound for the adversarial generalization gap is $\Omega(s(T)/n)$ given training loss $1/s(T)$. This implies that SGDmax does not achieve the lower bound. Next, by observing that the nonvanishing generalization error term for SGDmax comes from the non-smoothness of the adversarial loss function, we employ a smoothing technique to smooth the adversarial loss function. Based on the smoothed loss function, we design a smoothed SGDmax algorithm achieving generalization bound $\mathcal{O}(s(T)/n)$, which matches the minimax lower bound. Experimentally, we show that our algorithm improves adversarial generalization on common datasets.","Adversarial Training, Robust Overfitting, Generalization Bound" Generalizing and Decoupling Neural Collapse via Hyperspherical Uniformity Gap,https://openreview.net/forum?id=inU2quhGdNU,https://openreview.net/pdf?id=inU2quhGdNU,,"The neural collapse (NC) phenomenon describes an underlying geometric symmetry for deep neural networks, where both deeply learned features and classifiers converge to a simplex equiangular tight frame. It has been shown that both cross-entropy loss and mean square error can provably lead to NC. Inspired by how NC characterizes the training target of neural networks, we decouple NC into two objectives: minimal intra-class variability and maximal inter-class separability. We then introduce the concept of hyperspherical uniformity (which characterizes the degree of uniformity on the unit hypersphere) as a unified framework to quantify these two objectives. Finally, we propose a generic objective -- hyperspherical uniformity gap~(HUG), which is defined by the difference between inter-class and intra-class hyperspherical uniformity. HUG not only provably converges to NC, but also decouples NC into two separate objectives.
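The Smoothed-SGDmax abstract above says the adversarial loss is smoothed before applying SGDmax, but does not spell out the smoothing operator here. One generic choice, shown purely as an illustrative assumption, is randomized (Gaussian) smoothing of the loss in weight space, with the inner maximization left to a standard attack.

```python
# Illustrative only: one SGDmax step on a Gaussian-smoothed adversarial loss,
# i.e., a stochastic gradient of E_xi[L_adv(theta + sigma * xi)].
# `attack(model, x, y)` is assumed to return detached adversarial examples.
import torch
import torch.nn.functional as F

def smoothed_sgdmax_step(model, attack, x, y, opt, sigma=0.01, m=4):
    opt.zero_grad()
    for _ in range(m):
        noises = [sigma * torch.randn_like(p) for p in model.parameters()]
        with torch.no_grad():
            for p, nz in zip(model.parameters(), noises):
                p.add_(nz)                       # evaluate at perturbed weights
        x_adv = attack(model, x, y)              # inner maximization (e.g., PGD)
        loss = F.cross_entropy(model(x_adv), y) / m
        loss.backward()                          # accumulates averaged gradient
        with torch.no_grad():
            for p, nz in zip(model.parameters(), noises):
                p.sub_(nz)                       # restore the original weights
    opt.step()
```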
Unlike the cross-entropy loss, which couples intra-class compactness and inter-class separability, HUG enjoys more flexibility and serves as a good alternative loss function. Empirical results show that HUG works well in terms of generalization, calibration, and robustness.", Bias Mimicking: A Simple Sampling Approach for Bias Mitigation,https://openreview.net/forum?id=aKK-Dhm3O4,https://openreview.net/pdf?id=aKK-Dhm3O4,,"Prior work has shown that Visual Recognition datasets frequently under-represent sensitive groups (\eg Female) within a category (\eg Programmers). This dataset bias can lead to models that learn spurious correlations between class labels and sensitive attributes such as age, gender, or race. Most of the recent methods that address this problem require significant architectural changes or expensive hyper-parameter tuning. Alternatively, data re-sampling baselines from the class imbalance literature (\eg Undersampling, Upweighting), which can often be implemented in a single line of code and typically have no hyperparameters, offer a cheaper and more efficient solution. However, we found that some of these baselines were missing from recent bias mitigation benchmarks. In this paper, we show that these simple methods are strikingly competitive with state-of-the-art bias mitigation methods on many datasets. Furthermore, we improve these methods by introducing a new class-conditioned sampling method: Bias Mimicking. In cases where the baseline dataset re-sampling methods do not perform well, Bias Mimicking effectively bridges the performance gap and improves the overall average accuracy of under-represented subgroups by over $3\%$ compared to prior work. ","Fairness, Spurious Correlations, Bias, Sampling Methods." MaskFusion: Feature Augmentation for Click-Through Rate Prediction via Input-adaptive Mask Fusion,https://openreview.net/forum?id=QzbKH8nNq_V,https://openreview.net/pdf?id=QzbKH8nNq_V,Feature Augmentation via Adaptive Mask Fusion,"Click-through rate (CTR) prediction plays an important role in advertisement, recommendation, and retrieval applications. Given the feature set, how to fully utilize the information from the feature set is an active topic in deep CTR model design. There are several existing deep CTR works focusing on feature interactions, feature attentions, and so on. They attempt to capture high-order feature interactions to enhance the generalization ability of deep CTR models. However, these works either suffer from poor high-order feature interaction modeling using DNN or ignore the balance between generalization and memorization during recommendation. To mitigate these problems, we propose an adaptive feature fusion framework called MaskFusion, to dynamically capture explicit interactions between the input features and the existing deep part structure of deep CTR models, in addition to the common feature interactions proposed in existing works. MaskFusion is an instance-aware feature augmentation method, which makes deep CTR models more personalized by assigning each feature an instance-adaptive mask and fusing each feature with each hidden state vector in the deep part structure. MaskFusion can also be integrated into any existing deep CTR models flexibly.
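To make the hyperspherical uniformity gap idea from the HUG abstract above concrete, here is a hedged sketch. The abstract does not fix the uniformity potential, so the common Gaussian-potential uniformity of Wang & Isola (2020) is used as a stand-in; the prototype construction and all names are illustrative assumptions.

```python
# Sketch of a HUG-style loss: maximize the gap between inter-class
# uniformity (class means spread on the hypersphere) and intra-class
# uniformity (features within a class clustered, i.e., NOT uniform).
# Assumes the batch contains at least two classes.
import torch
import torch.nn.functional as F

def uniformity(z: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """Higher value = points more uniformly spread on the unit hypersphere."""
    d2 = torch.pdist(z).pow(2)
    return -torch.log(torch.exp(-t * d2).mean())

def hug_loss(features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    z = F.normalize(features, dim=1)
    classes = labels.unique()
    protos = torch.stack([z[labels == c].mean(0) for c in classes])
    inter = uniformity(F.normalize(protos, dim=1))
    intra = torch.stack([uniformity(z[labels == c]) for c in classes
                         if (labels == c).sum() > 1]).mean()
    return -(inter - intra)    # minimize the negative uniformity gap
```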
MaskFusion achieves state-of-the-art (SOTA) performance on all seven benchmark deep CTR models across three public datasets.","Input-adaptive, Mask Fusion, Feature Augmentation, Click-Through rate prediction" Finite-time Analysis of Single-timescale Actor-Critic on Linear Quadratic Regulator,https://openreview.net/forum?id=9sPDt0z3oL4,https://openreview.net/pdf?id=9sPDt0z3oL4,Finite-time convergence of single-sample single-timescale actor-critic with a global optimality guarantee,"Actor-critic (AC) methods have achieved state-of-the-art performance in many challenging tasks. However, their convergence in most practical applications is still poorly understood. Existing works mostly consider the uncommon double-loop or two-timescale stepsize variants for ease of analysis. We investigate the practical yet more challenging vanilla single-sample single-timescale AC for solving the canonical linear quadratic regulator problem. Specifically, the actor and the critic update only once with a single sample in each iteration using proportional stepsizes. We prove that the vanilla AC can attain an $\epsilon$-optimal solution with a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2})$, which elucidates the practical efficiency of single-sample single-timescale AC. We develop a novel analysis framework that directly bounds the whole interconnected iteration system without the conservative decoupling commonly adopted in previous analyses of AC. Our work presents the first finite-time analysis of single-sample single-timescale AC with a global optimality guarantee.","single-timescale actor-critic, linear quadratic regulator" From Coarse to Fine-grained Concept based Discrimination for Phrase Detection,https://openreview.net/forum?id=w_Pz9Z3DOB6,https://openreview.net/pdf?id=w_Pz9Z3DOB6,,"Phrase detection requires methods to identify if a phrase is relevant to an image and localize it if applicable. A key challenge in training more discriminative phrase detection models is sampling negatives. However, sampling techniques from prior work focus primarily on hard, often noisy, negatives, disregarding the broader distribution of negative samples. To address this problem, we introduce CFCD-Net, a phrase detector that differentiates between phrases through two novel methods. First, we generate groups of semantically similar words that we call concepts (\eg \{dog, cat, horse, ...\} vs.\ \{car, truck, ...\}), and then train our CFCD-Net to discriminate between a region of interest and its unrelated concepts. Second, for phrases containing fine-grained mutually-exclusive words (\eg colors), we force the model into selecting only one applicable phrase for each region using our novel fine-grained module (FGM). We evaluate our approach on the Flickr30K Entities and RefCOCO+ datasets, where we improve mAP over the state-of-the-art by 1.5-2 points. When considering only the phrases affected by our fine-grained reasoning module, we improve by 3-4 points on both datasets.","Phrase Detection, Vision Language Understanding" Scalable 3D Object-centric Learning,https://openreview.net/forum?id=yvF7mAuWv3z,https://openreview.net/pdf?id=yvF7mAuWv3z,,"We tackle the task of unsupervised 3D object-centric representation learning on scenes of potentially unbounded scale. Existing approaches to object-centric representation learning exhibit significant limitations in achieving scalable inference due to their dependence on a fixed global coordinate system.
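The single-sample single-timescale actor-critic setup described above (one actor update and one critic update per iteration, each from a single sample, with proportional stepsizes) can be illustrated with a toy LQR loop. This is a hedged sketch, not the paper's exact scheme: the discount factor, exploration noise, and stepsizes are illustrative assumptions.

```python
# Toy vanilla actor-critic on an LQR instance: linear Gaussian policy
# u = -K x + sigma * noise, quadratic critic V(x) ~= x' P x, and exactly
# one TD(0) critic step plus one policy-gradient actor step per iteration.
import numpy as np

def ac_lqr(A, B, Q, R, steps=50000, a_lr=1e-4, c_lr=1e-3,
           gamma=0.95, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n, m = B.shape
    K = np.zeros((m, n))                      # actor parameters
    P = np.eye(n)                             # critic parameters (cost-to-go)
    x = rng.standard_normal(n)
    for _ in range(steps):
        u = -K @ x + sigma * rng.standard_normal(m)
        cost = x @ Q @ x + u @ R @ u
        x_next = A @ x + B @ u + 0.1 * rng.standard_normal(n)
        td = cost + gamma * (x_next @ P @ x_next) - x @ P @ x
        P += c_lr * td * np.outer(x, x)       # critic: one semi-gradient step
        grad_logpi = -np.outer(u + K @ x, x) / sigma**2
        K -= a_lr * td * grad_logpi           # actor: one single-sample PG step
        x = x_next if np.linalg.norm(x_next) < 1e3 else rng.standard_normal(n)
    return K, P
```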
In contrast, we propose to learn view-invariant 3D object representations in localized object coordinate systems. To this end, we estimate the object pose and appearance representation separately and explicitly project object representations across views. We adopt amortized variational inference to process sequential input and update object representations online. To scale up our model to scenes with an arbitrary number of objects, we further introduce a Cognitive Map that allows the registration and querying of objects on a global map. We employ the object-centric neural radiance field (NeRF) as our 3D scene representation, which is jointly inferred by our unsupervised object-centric learning framework. Experimental results demonstrate that our method can infer and maintain object-centric representations of unbounded 3D scenes. Further combined with a per-object NeRF finetuning process, our model can achieve scalable high-quality object-aware scene reconstruction.", Towards Boosting the Open-Domain Chatbot with Human Feedback,https://openreview.net/forum?id=XGHRFuJ_ue-,https://openreview.net/pdf?id=XGHRFuJ_ue-,"A novel and efficient approach Diamante is proposed to boost the open-domain chatbot, where two kinds of human feedback (including explicit demonstration and implicit preference) are collected and leveraged.","Many open-domain dialogue models pre-trained with social media comments can generate coherent replies but have difficulties producing engaging responses. This phenomenon might mainly result from the deficiency of annotated human-human conversations and the misalignment with human preference. In this paper, we propose a novel and efficient framework, Diamante, to boost the open-domain chatbot, where two kinds of human feedback (including explicit demonstration and implicit preference) are collected and leveraged. By asking annotators to select or amend the model-generated candidate responses, Diamante efficiently collects the human-demonstrated responses and constructs a Chinese chit-chat dataset. To enhance the alignment with human preference, Diamante leverages the implicit preference in the data collection process and introduces the generation-evaluation joint training. Comprehensive experiments indicate that the Diamante dataset and joint training paradigm can significantly boost the performance of pre-trained dialogue models. The overall engagingness of the previous state-of-the-art model has been improved remarkably by 50% in Chinese open-domain conversations.","Dialogue System, Human Feedback" Learning to Generate All Feasible Actions,https://openreview.net/forum?id=P8DHF1Y_dph,https://openreview.net/pdf?id=P8DHF1Y_dph,We propose to train a generative neural network to generate all feasible actions within an interactive environment.,"Several machine learning (ML) applications are characterized by searching for an optimal solution to a complex task. The search space for this optimal solution is often very large, so large in fact that this optimal solution is often not computable. Part of the problem is that many candidate solutions found via ML are actually infeasible and have to be discarded. Restricting the search space to only the feasible solution candidates simplifies finding an optimal solution for the tasks. Further, the set of feasible solutions could be re-used in multiple problems characterized by different tasks. In particular, we observe that complex tasks can be decomposed into subtasks and corresponding skills.
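The generation-evaluation joint training named in the Diamante abstract above can be pictured as a standard language-modeling loss combined with a preference-ranking term. The margin form below and all parameter names are assumptions made for illustration; the paper's exact objective may differ.

```python
# Hedged sketch of a generation-evaluation joint objective: fit the
# human-selected (or amended) response with the usual LM loss, and score
# the demonstrated response above a model-generated candidate.
import torch
import torch.nn.functional as F

def joint_loss(lm_nll, score_demo, score_gen, margin=0.1, alpha=1.0):
    """lm_nll: LM negative log-likelihood of the demonstrated response;
    score_demo / score_gen: scalar evaluation scores for the demonstrated
    and generated responses."""
    rank = F.relu(margin - (score_demo - score_gen)).mean()
    return lm_nll + alpha * rank
```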
We propose to learn a reusable and transferable skill by training an actor to generate all feasible actions. The trained actor can then propose feasible actions, among which an optimal one can be chosen according to a specific task. The actor is trained by interpreting the feasibility of each action as a target distribution. The training procedure minimizes a divergence between the actor's output distribution and this target. We derive the general optimization target for arbitrary f-divergences using a combination of kernel density estimates, resampling, and importance sampling. We further utilize an auxiliary critic to reduce the interactions with the environment. A preliminary comparison to related strategies shows that our approach learns to visit all the modes in the feasible action space, demonstrating the framework's potential for generating multimodal action distributions.","Generative neural networks, feasible actions, f-divergence" Empirical Study of Pre-training a Backbone for 3D Human Pose and Shape Estimation,https://openreview.net/forum?id=8U4joMeLRF,https://openreview.net/pdf?id=8U4joMeLRF,Empirical Study of Pre-training a Backbone for 3D Human Pose and Shape Estimation,"We empirically study unexplored, yet must-know baselines of pre-training a backbone for 3D human pose and shape estimation (3DHPSE). Recently, a few self-supervised representation learning (SSL) methods have been reported to outperform the ImageNet classification pre-training for vision tasks such as object detection. However, its effect on 3DHPSE, whose target is fixed to a single class, the human, remains an open question. In this regard, we inspect the effectiveness of SSL on 3DHPSE and investigate two other pre-training approaches that have received relatively less attention. They are 2D annotation-based pre-training and synthetic data pre-training. Similar to the motivation of SSL to benefit from unlabeled data, they have the potential advantage of exploiting data with lower collection cost than real 3D data. SSL methods underperform the conventional ImageNet classification pre-training on multiple 3DHPSE benchmarks by 7.7% on average. In contrast, despite a much smaller amount of pre-training data, the 2D annotation-based pre-training improves accuracy on all benchmarks and shows faster convergence during fine-tuning. In the semi-supervised setting, the improvement increases up to 8.2%, while SSL decreases accuracy by 10.7%, and synthetic data pre-training shows 0.2% decreased accuracy compared with the classification pre-training. Our observations should prompt the community to think carefully about the current SSL-based pre-training trend for 3DHPSE and to diversify research on pre-training approaches.","pre-training, 3D human pose and shape estimation, self-supervised representation learning" Sparsity by Redundancy: Solving $L_1$ with a Simple Reparametrization,https://openreview.net/forum?id=idY99Ugd5ek,https://openreview.net/pdf?id=idY99Ugd5ek,We propose a simple method for solving general L1 penalized objectives with gradient descent,"We identify and prove a general principle: $L_1$ sparsity can be achieved using a redundant parametrization plus $L_2$ penalty. Our results lead to a simple algorithm, \textit{spred}, that seamlessly integrates $L_1$ regularization into any modern deep learning framework.
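A simplified sketch of the feasible-action training loop described above: treat feasibility as an unnormalized target density, estimate the generator's own density with a kernel density estimate, and minimize an importance-weighted cross-entropy to the target (one special case of the f-divergence construction; all names and the bandwidth are illustrative assumptions).

```python
# Train a generator so its samples cover the whole feasible set: the
# forward-KL / cross-entropy objective below is mode-covering, which matches
# the goal of visiting all modes of the feasible action space.
import torch

def feasibility_step(gen, feasible, opt, n=256, bandwidth=0.1):
    a = gen.sample(n).detach()                   # actions, shape (n, d)
    with torch.no_grad():
        # KDE estimate of the generator density q(a_i) from its own samples.
        d2 = torch.cdist(a, a).pow(2)
        q = torch.exp(-d2 / (2 * bandwidth**2)).mean(dim=1) + 1e-8
        w = feasible(a).float() / q              # importance weights p*(a)/q(a)
        w = w / w.sum()                          # self-normalized
    loss = -(w * gen.log_prob(a)).sum()          # estimates CE(p*, q)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```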
Practically, we (1) demonstrate the efficiency of \textit{spred} in optimizing conventional tasks such as lasso and sparse coding, (2) benchmark the method on six gene-selection tasks for nonlinear feature selection, and (3) illustrate its usage for achieving structured and unstructured sparsity in deep learning in an end-to-end manner. Conceptually, our result bridges the gap in understanding the inductive bias of the redundant parametrization common in deep learning and conventional statistical learning.","L1, sparsity, feature selection, deep learning, landscape" Test-Time Adaptation for Visual Document Understanding,https://openreview.net/forum?id=FwlV6h4KxVE,https://openreview.net/pdf?id=FwlV6h4KxVE,Proposing a novel test-time adaptation approach and three benchmarks for visual document understanding via masked language modeling and pseudo labeling.,"Self-supervised pretraining has been able to produce transferable representations for various visual document understanding (VDU) tasks. However, the ability of such representations to adapt to new distribution shifts at test-time has not been studied yet. We propose DocTTA, a novel test-time adaptation approach for documents that leverages cross-modality self-supervised learning via masked visual language modeling as well as pseudo labeling to adapt models learned on a \textit{source} domain to an unlabeled \textit{target} domain at test time. We also introduce new benchmarks using existing public datasets for various VDU tasks including entity recognition, key-value extraction, and document visual question answering tasks, where DocTTA improves the source model performance by up to 1.79\% (F1 score), 3.43\% (F1 score), and 17.68\% (ANLS score), respectively.","Test-time adaptation, source data-free domain adaptation, visual document understanding" Learned Index with Dynamic $\epsilon$,https://openreview.net/forum?id=UiaUEICawgw,https://openreview.net/pdf?id=UiaUEICawgw,"Based on the theoretically derived prediction error bounds, we propose a mathematically-grounded learned index framework with dynamic $\epsilon$, which is efficient and pluggable to several state-of-the-art learned index methods.","Index structures are fundamental components in databases and facilitate broad data retrieval applications. Recent learned index methods show superior performance by learning hidden yet useful data distributions with the help of machine learning, and provide a guarantee that the prediction error is no more than a pre-defined $\epsilon$. However, existing learned index methods adopt a fixed $\epsilon$ for all the learned segments, neglecting the diverse characteristics of different data localities. In this paper, we propose a mathematically-grounded learned index framework with dynamic $\epsilon$, which is efficient and pluggable to existing learned index methods. We theoretically analyze prediction error bounds that link $\epsilon$ with data characteristics for an illustrative learned index method. Under the guidance of the derived bounds, we learn how to vary $\epsilon$ and improve the index performance with a better space-time trade-off. Experiments with real-world datasets and several state-of-the-art methods demonstrate the efficiency, effectiveness, and usability of the proposed framework.
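The spred principle above rests on a known identity: $|w_i| = \min_{u_i v_i = w_i} (u_i^2 + v_i^2)/2$ (by AM-GM), so an $L_2$ penalty on a redundant product parametrization $w = u \odot v$ induces an $L_1$ penalty on $w$. A minimal sketch solving lasso this way with plain SGD follows; hyperparameters are illustrative.

```python
# Lasso via the spred reparametrization: replace w with u * v and apply
# weight decay (L2) to u and v; any first-order optimizer then solves
# min_w ||X w - y||^2 / (2n) + lam * ||w||_1 at its minima.
import torch

def spred_lasso(X, y, lam=0.1, lr=0.05, steps=5000):
    d = X.shape[1]
    u = torch.randn(d, requires_grad=True)
    v = torch.randn(d, requires_grad=True)
    opt = torch.optim.SGD([u, v], lr=lr)
    for _ in range(steps):
        w = u * v
        loss = ((X @ w - y) ** 2).mean() / 2 \
               + lam * (u.pow(2) + v.pow(2)).sum() / 2
        opt.zero_grad(); loss.backward(); opt.step()
    return (u * v).detach()       # sparse solution, entries shrink to 0
```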
","Learned Index, Dynamic $\epsilon$" Breaking the Curse of Dimensionality in Multiagent State Space: A Unified Agent Permutation Framework,https://openreview.net/forum?id=OxNQXyZK-K8,https://openreview.net/pdf?id=OxNQXyZK-K8,,"The state space in Multiagent Reinforcement Learning (MARL) grows exponentially with the agent number. Such a curse of dimensionality results in poor scalability and low sample efficiency, inhibiting MARL for decades. To break this curse, we propose a unified agent permutation framework that exploits the permutation invariance (PI) and permutation equivariance (PE) inductive biases to reduce the multiagent state space. Our insight is that permuting the order of entities in the factored multiagent state space does not change the information. Specifically, we propose two novel implementations: a Dynamic Permutation Network (DPN) and a Hyper Policy Network (HPN). The core idea is to build separate entity-wise PI input and PE output network modules to connect the entity-factored state space and action space in an end-to-end way. DPN achieves such connections by two separate module selection networks, which consistently assign the same input module to the same input entity (guarantee PI) and assign the same output module to the same entity-related output (guarantee PE). To enhance the representation capability, HPN replaces the module selection networks of DPN with hypernetworks to directly generate the corresponding module weights. Extensive experiments in SMAC, Google Research Football and MPE validate that the proposed methods significantly boost the performance and the learning efficiency of existing MARL algorithms. Remarkably, in SMAC, we achieve 100% win rates in almost all hard and super-hard scenarios (never achieved before).","Multiagent Reinforcement Learning, Permutation Invariance, Permutation Equivariance" LAU: A novel two-parameter learnable Logmoid Activation Unit,https://openreview.net/forum?id=uwBUzlm0GS,https://openreview.net/pdf?id=uwBUzlm0GS,A learnable Activation Unit,"The activation function in deep neural networks has a major impact on the performance of the training stage. In this work, we proposed a novel learnable Logmoid Activation Unit (LAU), $f(x)=x\ln(1+\alpha \textrm{sigmoid}(\beta x))$, with two free parameters $\alpha$ and $\beta$ that can be optimized via back-propagation algorithm. We design quasi-interpolation type neural network operators with Logmoid-1 in a given feed-forward neural network for approximating any continuous function in closed spaces. This provides a theoretical basis for the excellent empirical performance of LAUs in experimental simulations. For instance, compared with ReLUs the proposed LAUs improves Top-1 classification accuracy on ImageNet-200 by $7\%$ respectively in ShuffleNet-V2, on CIFAR-10 by 6$\%$ respectively in EfficientNet-B0, and on CIFAR-100 by 5$\%$ respectively in MobileNet-V2. 
Our simulations show that training deep neural networks end-to-end with learnable Logmoids can increase predictive performance.","Neural network, Learnable activation function, Function approximation, Dilated convolution" 3D Molecular Generation by Virtual Dynamics,https://openreview.net/forum?id=tZmqS73_07,https://openreview.net/pdf?id=tZmqS73_07,"We propose a novel pocket-based 3D molecular generation framework VD-Gen, which consists of a Virtual Dynamics mechanism and several carefully designed stages to generate fine-grained molecules with binding positions in the pocket cavity end-to-end.","Structure-based drug design, i.e., finding molecules with high affinities to the target protein pocket, is one of the most critical tasks in drug discovery. Traditional solutions, like virtual screening, require exhaustively searching a large molecular database, which is inefficient and cannot yield novel molecules beyond the database. The pocket-based 3D molecular generation model, i.e., directly generating a molecule with a 3D structure and binding position in the pocket, is a new promising way to address this issue. However, the method is very challenging due to the complexity brought by the huge continuous 3D space in the pocket cavity. Herein, inspired by Molecular Dynamics, we propose a novel pocket-based 3D molecular generation framework VD-Gen. VD-Gen consists of a Virtual Dynamics mechanism and several carefully designed stages to generate fine-grained 3D molecules with binding positions in the pocket cavity end-to-end. Rather than directly generating or sampling atoms with 3D positions in the pocket like in early attempts, in VD-Gen, we first randomly scatter many virtual particles in the pocket; then with the proposed Virtual Dynamics mechanism, a deep model, acting like a ""force field"", iteratively moves these virtual particles to positions that are highly likely to contain real atoms. After virtual particles are stabilized in 3D space, we extract the atoms from them. Finally, we further refine the 3D positions of atoms by Virtual Dynamics again, to get a fine-grained 3D molecule. Extensive experimental results on pocket-based molecular generation demonstrate that VD-Gen can generate novel 3D molecules to fill the target pocket cavity with high binding affinities, significantly outperforming previous baselines.","3D molecular generation, structure-based drug design" N-Student Learning: An Approach to Model Uncertainty and Combat Overfitting,https://openreview.net/forum?id=N5fNFLO_MyD,https://openreview.net/pdf?id=N5fNFLO_MyD,A pseudo-label based multi-network training setup to help combat the problem of overfitting.,"This work presents N-Student Learning, a pseudo-label based multi-network training setup that can be applied to nearly any supervised learning architecture in order to help combat the problem of overfitting and control the way in which a network models uncertainty in the data. The effectiveness of N-Student Learning relies on the idea that a network's predictions on unseen data are largely independent of any instance-dependent noise in the labels. In N-Student Learning, each student network is assigned a subset of the training dataset such that no data point is in every student's training subset. Unbiased pseudo-labels can thus be generated for every data point in the training set by taking the predictions of appropriate student networks.
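The LAU abstract above gives the activation in closed form, so it translates directly to a module. The per-layer scalar parameters below follow the formula as stated; per-channel parameters (mentioned elsewhere in the literature on learnable activations) would be a straightforward variant.

```python
# Logmoid Activation Unit: f(x) = x * ln(1 + alpha * sigmoid(beta * x)),
# with alpha and beta learned by back-propagation as in the abstract.
import torch
import torch.nn as nn

class LAU(nn.Module):
    def __init__(self, alpha: float = 1.0, beta: float = 1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha))
        self.beta = nn.Parameter(torch.tensor(beta))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # log1p keeps the computation numerically stable; it requires
        # alpha * sigmoid(beta * x) > -1, which holds near alpha ~ 1.
        return x * torch.log1p(self.alpha * torch.sigmoid(self.beta * x))
```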
Training on these unbiased pseudo-labels minimizes the extent to which each network overfits to instance-dependent noise in the data. Furthermore, based on prior knowledge of the domain, we can control how the networks learn to model uncertainty that is present in the dataset by adjusting the way that pseudo-labels are generated. While this method is largely inspired by the general problem of overfitting, a natural application is found in the problem of classification with noisy labels — a domain where overfitting is a significant concern. After developing intuition through a toy classification task, we proceed to demonstrate that N-Student Learning performs favorably on benchmark datasets when compared to state-of-the-art methods in the problem of classification with noisy labels.","Noisy Labels, Pseudo-Labels, Overfitting, Supervised Learning" Wav2Tok: Deep Sequence Tokenizer for Audio Retrieval,https://openreview.net/forum?id=v8Mi8KU6056,https://openreview.net/pdf?id=v8Mi8KU6056,Represent query and target sequences as compressed token sequences for quick retrieval; similarity semantics are learned from sequence pairs,"Search over audio sequences is a fundamental problem. Very efficient solutions exist for text sequences, consisting of discrete tokens chosen from a finite alphabet. These discrete tokens may correspond to words or characters depending on the problem. Audio sequences are composed of continuous-valued samples with a large sampling rate, making similarity search inefficient. This paper proposes Wav2Tok, a deep sequence tokenizer for audio that converts continuous-valued audio sequences to sequences of discrete tokens that are easier to retrieve via sequence queries. The only information available for training Wav2Tok is pairs of similar sequences, i.e., depending on how we form the pairs, the similarity semantics are learned. Wav2Tok compresses the query and target sequences into shorter sequences of tokens that are faster to match. Experiments show the consistent performance of Wav2Tok across various audio retrieval tasks, namely, music search (query by humming) and speech search via audio query. ","sequence representation learning, audio search, music retrieval" Image to Sphere: Learning Equivariant Features for Efficient Pose Prediction,https://openreview.net/forum?id=_2bDpAtr7PI,https://openreview.net/pdf?id=_2bDpAtr7PI,We propose a novel architecture which efficiently describes uncertainty in pose estimation from images by using learned SO(3)-equivariant features to generate complex distributions over SO(3) with the Fourier basis.,"Predicting the pose of objects from a single image is an important but difficult computer vision problem. Methods that predict a single point estimate do not predict the pose of objects with symmetries well and cannot represent uncertainty. Alternatively, some works predict a distribution over orientations in $\mathrm{SO}(3)$. However, training such models can be computation- and sample-inefficient. Instead, we propose a novel mapping of features from the image domain to the 3D rotation manifold. Our method then leverages $\mathrm{SO}(3)$ equivariant layers, which are more sample efficient, and outputs a distribution over rotations that can be sampled at arbitrary resolution. We demonstrate the effectiveness of our method at object orientation prediction, and achieve state-of-the-art performance on the popular PASCAL3D+ dataset. 
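The pseudo-labeling rule in the N-Student description above can be sketched as follows. Aggregation by averaging and the sklearn-style `predict_proba` interface are assumptions for illustration; the setup guarantees every point is excluded from at least one student's subset.

```python
# Unbiased pseudo-labels: for each training point, average only the
# predictions of students that did NOT train on that point, so label noise
# memorized by the other students cannot leak into the pseudo-label.
import numpy as np

def pseudo_labels(students, train_subsets, X):
    """students: models with predict_proba; train_subsets: list of index
    sets, one per student; X: all training inputs (array-like)."""
    n = len(X)
    n_classes = students[0].predict_proba(X[:1]).shape[1]
    out = np.zeros((n, n_classes))
    for i in range(n):
        holdout = [s for s, idx in zip(students, train_subsets)
                   if i not in idx]          # students that never saw point i
        out[i] = np.mean([s.predict_proba(X[i:i + 1])[0] for s in holdout],
                         axis=0)
    return out
```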
Moreover, we show that our method can model complex object symmetries, without any modifications to the parameters or loss function.","equivariance, sample efficiency, pose detection, symmetry, SO(3)" Better handling unlabeled entity problem using PU-learning and negative sampling,https://openreview.net/forum?id=1ndyt02WPo,https://openreview.net/pdf?id=1ndyt02WPo,,"The NER task largely relies on well-annotated data. However, in many scenarios, the entities may not be fully annotated, leading to performance degradation. A common approach for this problem is to distinguish unlabeled entities from negative instances using labeled data. However, the vast differences between entities make such empirical approaches difficult to realize. Our solution is to treat unlabeled entities based on both empirical inference and random sampling. To this end, we propose a simple yet effective two-step method that consists of a novel Positive-Unlabeled (PU-learning) algorithm and negative sampling, in which PU-learning distinguishes part of the unlabeled entities from negative instances based on a confidence threshold. In general, the proposed method can mitigate the impact of unlabeled entities at the outset and can be easily applied to any character-level NER model. We verify the effectiveness of our method on several NER models and datasets, showing a strong ability to deal with unlabeled entities. Finally, in real-world settings, we establish new state-of-the-art results on many benchmark NER datasets.","NER, unlabeled entity problem, PU-learning, negative sampling, self-supervision" PV3D: A 3D Generative Model for Portrait Video Generation,https://openreview.net/forum?id=o3yygm3lnzS,https://openreview.net/pdf?id=o3yygm3lnzS,,"Recent advances in generative adversarial networks (GANs) have demonstrated the capability of generating stunning photo-realistic portrait images. While some prior works have applied such image GANs to unconditional 2D portrait video generation and static 3D portrait synthesis, there are few works that successfully extend GANs to generating 3D-aware portrait videos. In this work, we propose PV3D, the first generative framework that can synthesize multi-view consistent portrait videos. Specifically, our method extends the recent static 3D-aware image GAN to the video domain by generalizing the 3D implicit neural representation to model the spatio-temporal space. To introduce motion dynamics to the generation process, we develop a motion generator by stacking multiple motion layers to generate motion features via modulated convolution. To alleviate motion ambiguities caused by camera/human motions, we propose a simple yet effective camera condition strategy for PV3D, enabling both temporal and multi-view consistent video generation. Moreover, PV3D introduces two discriminators for regularizing the spatial and temporal domains to ensure the plausibility of the generated portrait videos. These elaborated designs enable PV3D to generate 3D-aware motion-plausible portrait videos with high-quality appearance and geometry, significantly outperforming prior works. As a result, PV3D is able to support many downstream applications such as animating static portraits and view-consistent video motion editing.
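The two-step treatment of unlabeled spans described in the PU-learning NER abstract above can be pictured as: (1) promote unlabeled spans whose predicted entity confidence clears a threshold to positives, and (2) use only a random sample of the remaining unlabeled spans as negatives. The interface and ratios below are illustrative assumptions.

```python
# Hedged sketch of PU relabeling plus negative sampling for NER training.
import random

def relabel_and_sample(spans, model_conf, tau=0.9, neg_ratio=0.3, seed=0):
    """spans: list of (span, label_or_None); model_conf(span) -> (label, p)."""
    rng = random.Random(seed)
    positives, candidates = [], []
    for span, label in spans:
        if label is not None:
            positives.append((span, label))
        else:
            pred, p = model_conf(span)
            if p >= tau:                    # PU step: confident unlabeled span
                positives.append((span, pred))
            else:
                candidates.append(span)
    k = int(neg_ratio * len(candidates))    # negative-sampling step
    negatives = rng.sample(candidates, k)
    return positives, negatives
```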
Code and models will be released at the project page https://iclr2023-pv3d.github.io.", k-Median Clustering via Metric Embedding: Towards Better Initialization with Differential Privacy,https://openreview.net/forum?id=TkSRbrUjQf3,https://openreview.net/pdf?id=TkSRbrUjQf3,,"In clustering algorithms, the choice of initial centers is crucial for the quality of the learned clusters. We propose a new initialization scheme for the $k$-median problem in the general metric space (e.g., discrete space induced by graphs), based on the construction of a metric embedding tree structure of the data. From the tree, we propose a novel and efficient search algorithm for good initial centers that can subsequently be used by the local search algorithm. The so-called HST initialization method can produce initial centers achieving lower errors than those from another popular initialization method, $k$-median++, with comparable efficiency. Our HST initialization can also be easily extended to the setting of differential privacy (DP) to generate private initial centers. We show that the error of DP local search with our private HST initialization improves previous results on the approximation error and approaches the lower bound within a small factor. Experiments demonstrate the effectiveness of our proposed methods.","k-Median Clustering, Differential Privacy" Analysis of Error Feedback in Compressed Federated Non-Convex Optimization,https://openreview.net/forum?id=lbzA07xjED,https://openreview.net/pdf?id=lbzA07xjED,,"Communication cost between the clients and the central server could be a bottleneck in real-world Federated Learning (FL) systems. In classical distributed learning, the method of Error Feedback (EF) has been a popular technique to remedy the downsides of biased gradient compression, but the literature on applying EF to FL is still very limited. In this work, we propose a compressed FL scheme equipped with error feedback, named Fed-EF, with two variants depending on the global optimizer. We provide theoretical analysis showing that Fed-EF matches the convergence rate of the full-precision FL counterparts in non-convex optimization under data heterogeneity. Moreover, we initiate the first analysis of EF under partial client participation, which is an important scenario in FL, and demonstrate that the convergence rate of Fed-EF exhibits an extra slow-down factor due to the ``stale error compensation'' effect. Experiments are conducted to validate the efficacy of Fed-EF in practical FL tasks and justify our theoretical findings.", Characterizing the Influence of Graph Elements,https://openreview.net/forum?id=51GXyzOKOp,https://openreview.net/pdf?id=51GXyzOKOp,"Use influence functions to model the influence of elements in graphs, and understand the model behavior of graph convolution networks. ","Influence functions, a method from robust statistics, measure the changes of model parameters, or of functions of model parameters, with respect to the removal or modification of training instances. They are an efficient and useful post-hoc method for studying the interpretability of machine learning models without the need for expensive model re-training. Recently, graph convolutional networks (GCNs), which operate on graph data, have attracted a great deal of attention. However, there is no preceding research on the influence functions of GCNs to shed light on the effects of removing training nodes/edges from an input graph.
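The error-feedback mechanism that Fed-EF builds on (per the abstract above) is classical: each client keeps a residual of what compression discarded and adds it back before compressing the next update. A minimal sketch with a standard top-k compressor follows; flattening gradients to one dimension is assumed for simplicity.

```python
# Classic error-feedback compression step: send only the compressed,
# residual-corrected update and carry the discarded part forward.
import torch

def topk(x: torch.Tensor, k: int) -> torch.Tensor:
    out = torch.zeros_like(x)
    idx = x.abs().topk(k).indices
    out[idx] = x[idx]
    return out

def ef_compress(grad: torch.Tensor, residual: torch.Tensor, k: int):
    """grad, residual: 1-D tensors. Returns (message, updated residual)."""
    corrected = grad + residual          # add back past compression error
    msg = topk(corrected, k)             # biased compressor
    return msg, corrected - msg          # store what was discarded
```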
Since the nodes/edges in a graph are interdependent in GCNs, it is challenging to derive influence functions for GCNs. To fill this gap, we started with the simple graph convolution (SGC) model that operates on an attributed graph, and formulated an influence function to approximate the changes of model parameters when a node or an edge is removed from an attributed graph. Moreover, we theoretically analyzed the error bound of the estimated influence of removing an edge. We experimentally validated the accuracy and effectiveness of our influence estimation function. In addition, we showed that the influence function of an SGC model could be used to estimate the impact of removing training nodes/edges on the test performance of the SGC without re-training the model. Finally, we demonstrated how to use influence functions to effectively guide adversarial attacks on GCNs.","Interpretable Machine Learning, Influence functions, Graph Neural Networks" Adversarially Robust Neural Lyapunov Control,https://openreview.net/forum?id=lV0fWRDJwR,https://openreview.net/pdf?id=lV0fWRDJwR,,"State-of-the-art learning-based stability control methods for nonlinear robotic systems suffer from the reality gap, which stems from the discrepancy of the system dynamics between training and target (test) environments. To mitigate this gap, we propose an adversarially robust neural Lyapunov control (ARNLC) method to improve the robustness and generalization capabilities for Lyapunov theory-based stability control. Specifically, inspired by adversarial learning, we introduce an adversary to simulate the dynamics discrepancy, which is learned through deep reinforcement learning to generate the worst-case perturbations during the controller's training. By alternately updating the controller to minimize the perturbed Lyapunov risk and the adversary to drive the controller away from its objective, the learned control policy enjoys a theoretical guarantee of stability. Empirical evaluations on five stability control tasks with uniform and worst-case perturbations demonstrate that ARNLC not only accelerates convergence to asymptotic stability, but also generalizes better across the entire perturbation space.", MICN: Multi-scale Local and Global Context Modeling for Long-term Series Forecasting,https://openreview.net/forum?id=zt53IDUR1U,https://openreview.net/pdf?id=zt53IDUR1U,"New modeling perspective, new forecasting framework, linear complexity and best performance.","Recently, Transformer-based methods have achieved surprising performance in the field of long-term series forecasting, but the attention mechanism for computing global correlations entails high complexity. Moreover, it does not allow for targeted modeling of local features as CNN structures do. To solve the above problems, we propose to combine local features and global correlations to capture the overall view of time series (e.g., fluctuations, trends). To fully exploit the underlying information in the time series, a multi-scale branch structure is adopted to model different potential patterns separately and purposefully. Each pattern is extracted with down-sampled convolution and isometric convolution for local features and global correlations, respectively. In addition to being more effective, our proposed method, termed the Multi-scale Isometric Convolution Network (MICN), is more efficient with linear complexity with respect to the sequence length.
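The influence-function machinery adapted for SGC above has a well-known generic form (Koh & Liang, 2017): at a convex optimum, removing training point $z$ changes the parameters by approximately $H^{-1}\nabla\ell(z)/n$, where $H$ is the Hessian of the mean regularized loss. Since SGC reduces graph convolution to a linear model on propagated features, the logistic-regression sketch below is a reasonable stand-in; it is not the paper's exact edge-removal formula.

```python
# Approximate parameter change from removing each training point, for
# L2-regularized logistic regression on fixed (e.g., SGC-propagated) features.
import numpy as np

def removal_influence(X, y, theta, lam):
    n, d = X.shape
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    # Hessian of the regularized mean log-loss at theta.
    H = (X.T * (p * (1 - p))) @ X / n + lam * np.eye(d)
    grads = X * (p - y)[:, None]                # per-sample gradients
    return np.linalg.solve(H, grads.T).T / n    # row i ~ delta(theta) w/o point i
```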
Our experiments on five benchmark datasets show that compared with state-of-the-art methods, MICN yields 18.2% and 24.5% relative improvements for multivariate and univariate time series, respectively. Code will be released soon. ","long-term forecasting, local and global context, multi-branch architecture, different potential patterns." EMP: Effective Multidimensional Persistence for Graph Representation Learning,https://openreview.net/forum?id=pBaSwBkHBE,https://openreview.net/pdf?id=pBaSwBkHBE,,"Topological data analysis (TDA) has become increasingly popular in a broad range of machine learning tasks, ranging from anomaly detection and manifold learning to graph classification. Persistent homology (PH), the key approach in TDA, provides a unique topological fingerprint of the data by assessing the evolution of various hidden patterns in the data as we vary a scale parameter. Current PH tools are limited to analyzing the data by filtering with a single parameter, while in many applications several relevant parameters are equally important for obtaining much finer information about the data. In this paper, we overcome this problem by introducing the Effective Multidimensional Persistence (EMP) framework, which enables investigating the data by varying multiple scale parameters simultaneously. The EMP framework provides a highly expressive summary of the data by successfully integrating multiple descriptor functions into the process. EMP naturally adapts to many known single PH summaries and converts them into multidimensional summaries, for example, EMP Landscapes, EMP Silhouettes, EMP Images, and EMP Surfaces. These summaries deliver a multidimensional fingerprint of the data as matrices and arrays that are suitable for various machine learning models. We apply the EMP framework to graph classification tasks and observe that EMP boosts the performance of various single PH descriptors and outperforms state-of-the-art methods on benchmark datasets. We further derive theoretical guarantees of the proposed EMP summary and prove its stability properties.","multiparameter persistence, topological data analysis, machine learning" Class Prototype-based Cleaner for Label Noise Learning,https://openreview.net/forum?id=ZPtEyovpo6,https://openreview.net/pdf?id=ZPtEyovpo6,,"Semi-supervised-learning-based methods are the current SOTA solutions to the noisy-label learning problem, which rely on learning an unsupervised label cleaner first to divide the training samples into a clean labeled set and a noisy unlabeled set. Typically, the cleaner is obtained via fitting a mixture model to the distribution of per-sample training losses. However, the modeling procedure is \emph{class agnostic} and assumes the loss distributions of clean and noisy samples are the same across different classes. Unfortunately, in practice, such an assumption does not always hold due to the varying learning difficulty of different classes, thus leading to sub-optimal label noise partition criteria. In this work, we first reveal this long-ignored problem and propose a simple yet effective solution, named \textbf{C}lass \textbf{P}rototype-based label noise \textbf{C}leaner (\textbf{CPC}). Unlike previous works treating all the classes equally, CPC fully considers loss distribution heterogeneity and applies class-aware modulation to partition the clean and noisy data. CPC takes advantage of loss distribution modeling and intra-class consistency regularization in feature space simultaneously and thus can better distinguish clean and noisy labels.
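To give a feel for the "vary multiple scale parameters simultaneously" idea in the EMP abstract above, here is a deliberately crude two-parameter summary: for a grid of thresholds over two node descriptor functions, count connected components (Betti-0) of the induced subgraph. Real EMP summaries are far richer; this is only an illustrative toy.

```python
# Betti-0 surface over a two-parameter sublevel-set filtration of a graph,
# computed with a small union-find per grid cell.
import numpy as np

def betti0_surface(nodes_f, nodes_g, edges, s_grid, t_grid):
    n = len(nodes_f)
    surf = np.zeros((len(s_grid), len(t_grid)), dtype=int)
    for i, s in enumerate(s_grid):
        for j, t in enumerate(t_grid):
            alive = (nodes_f <= s) & (nodes_g <= t)
            parent = list(range(n))
            def find(a):
                while parent[a] != a:
                    parent[a] = parent[parent[a]]   # path halving
                    a = parent[a]
                return a
            for u, v in edges:
                if alive[u] and alive[v]:
                    parent[find(u)] = find(v)
            surf[i, j] = len({find(v) for v in range(n) if alive[v]})
    return surf    # matrix summary usable as an ML feature
```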
We theoretically justify the effectiveness of our method by explaining it from the Expectation-Maximization (EM) framework. Extensive experiments are conducted on the noisy-label benchmarks CIFAR-10, CIFAR-100, Clothing1M and WebVision. The results show that CPC brings about impressive performance improvement across all benchmarks. Moreover, CPC shows outstanding performance especially in extremely noisy scenarios, and improves accuracy on CIFAR-100 at a 90\% noise rate by as much as 13\% over the SOTAs.",Label Noise Learning Improving Vision Attention with Random Walk Graph Kernel,https://openreview.net/forum?id=LTvSyvRaJO,https://openreview.net/pdf?id=LTvSyvRaJO,"We propose a novel linear attention mechanism based on the random walk graph kernel, which can be widely used in vision transformers with long-sequence inputs","Vision transformers, which propose to tokenize an image and introduce an attention mechanism to learn cross-token relationships, have advanced many computer vision tasks. However, the attention module has quadratic computational complexity and hence suffers from slow computing speed and high memory cost, hindering it from handling long sequences of tokens. Some attempts optimize the quadratic attention with a linear approximation yet observe an undesired performance drop. This work balances the trade-off between the modeling efficiency and capacity of vision attention. We notice that, by treating queries and keys as nodes in a graph, existing algorithms are akin to modeling one-step interaction between nodes. To strengthen the cross-node connection for a more representative attention, we introduce multi-step interaction, which is equivalent to solving an inverse matrix as in the random walk graph kernel. We then come up with a new strategy to construct queries and keys, with the help of a bipartite graph, to ease the calculation of matrix inversion. The effectiveness of our approach is verified on various visual tasks. We also make it possible to learn a vision transformer with extremely long sequences of tokens. We achieve competitive results on the semantic segmentation task with 15% fewer parameters and 10-25% less computation. In addition, the vision-transformer-based quantization method can be applied to 512x512 or even 1024x1024 resolution images. Code will be made publicly available.","vision transformer, long sequence modeling" SWIFT: Rapid Decentralized Federated Learning via Wait-Free Model Communication,https://openreview.net/forum?id=jh1nCir1R3d,https://openreview.net/pdf?id=jh1nCir1R3d,We propose a wait-free decentralized Federated Learning algorithm which achieves SOTA results while massively reducing communication costs.,"The decentralized Federated Learning (FL) setting avoids the role of a potentially unreliable or untrustworthy central host by utilizing groups of clients to collaboratively train a model via localized training and model/gradient sharing. Most existing decentralized FL algorithms require synchronization of client models, where the speed of synchronization depends upon the slowest client. In this work, we propose SWIFT: a novel wait-free decentralized FL algorithm that allows clients to conduct training at their own speed. Theoretically, we prove that SWIFT matches the gold-standard iteration convergence rate $\mathcal{O}(1/\sqrt{T})$ of parallel stochastic gradient descent for convex and non-convex smooth optimization (total iterations $T$).
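The class-aware partitioning idea in the CPC abstract above amounts to fitting the loss mixture model within each class rather than globally. A minimal sketch follows (the prototype-consistency term is omitted for brevity, and the two-component GMM choice mirrors common noisy-label practice rather than the paper's exact cleaner).

```python
# Class-aware loss-based cleaner: per-class two-component GMM on training
# losses; the posterior of the low-mean component is the clean probability.
import numpy as np
from sklearn.mixture import GaussianMixture

def clean_probabilities(losses: np.ndarray, labels: np.ndarray) -> np.ndarray:
    p_clean = np.zeros_like(losses, dtype=float)
    for c in np.unique(labels):
        m = labels == c
        gmm = GaussianMixture(n_components=2, random_state=0)
        gmm.fit(losses[m].reshape(-1, 1))
        low = int(np.argmin(gmm.means_.ravel()))   # low-loss = likely clean
        p_clean[m] = gmm.predict_proba(losses[m].reshape(-1, 1))[:, low]
    return p_clean
```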
Furthermore, we provide theoretical results for IID and non-IID settings without any bounded-delay assumption for slow clients, an assumption required by other asynchronous decentralized FL algorithms. Although SWIFT achieves the same iteration convergence rate with respect to $T$ as other state-of-the-art (SOTA) parallel stochastic algorithms, it converges faster with respect to runtime due to its wait-free structure. Our experimental results demonstrate that SWIFT's runtime is reduced due to a large reduction in communication time per epoch, which falls by an order of magnitude compared to synchronous counterparts. Furthermore, SWIFT reaches given loss levels for image classification, over IID and non-IID data settings, upwards of 50\% faster than existing SOTA algorithms.","Federated Learning, Asynchronous, Decentralized, Wait-Free" Hierarchical Sliced Wasserstein Distance,https://openreview.net/forum?id=CUOaVn6mYEj,https://openreview.net/pdf?id=CUOaVn6mYEj,The paper proposes hierarchical sliced Wasserstein distance which is faster than the conventional sliced Wasserstein distance.,"Sliced Wasserstein (SW) distance has been widely used in different application scenarios since it can be scaled to a large number of supports without suffering from the curse of dimensionality. The value of the sliced Wasserstein distance is the average transportation cost between one-dimensional representations (projections) of the original measures, which are obtained by the Radon Transform (RT). Despite its efficiency in the number of supports, estimating the sliced Wasserstein distance requires a relatively large number of projections in high-dimensional settings. Therefore, for applications where the number of supports is relatively small compared with the dimension, e.g., several deep learning applications where mini-batch approaches are utilized, the complexity of the matrix multiplication in the Radon Transform becomes the main computational bottleneck. To address this issue, we propose to derive projections by linearly and randomly combining a smaller number of projections, named bottleneck projections. We explain the usage of these projections by introducing the Hierarchical Radon Transform (HRT), which is constructed by applying Radon Transform variants recursively. We then formulate the approach into a new metric between measures, named Hierarchical Sliced Wasserstein (HSW) distance. By proving the injectivity of HRT, we derive the metricity of HSW. Moreover, we investigate the theoretical properties of HSW including its connection to SW variants and its computational and sample complexities. Finally, we compare the computational cost and generative quality of HSW with the conventional SW on the task of deep generative modeling using various benchmark datasets including CIFAR10, CelebA, and Tiny ImageNet.","Sliced Wasserstein, Radon Transform, Optimal Transport, Generative Models" Test-time Adaptation for Better Adversarial Robustness,https://openreview.net/forum?id=rUxKM6u8WER,https://openreview.net/pdf?id=rUxKM6u8WER,,"Standard adversarial training and its variants have been widely adopted in practice to achieve robustness against adversarial attacks. However, we show in this work that such an approach does not necessarily achieve near-optimal generalization performance on test samples. Specifically, it is shown that, under suitable assumptions, the Bayes-optimal robust estimator requires test-time adaptation, and such adaptation can lead to a significant performance boost over standard adversarial training.
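The bottleneck-projection idea in the HSW abstract above can be sketched in a few lines: draw a small number of bottleneck directions, combine them linearly and randomly into many projection directions, then average one-dimensional Wasserstein distances. The full HRT construction is more structured; this minimal version only conveys the cost saving (one $k \times d$ matrix product instead of $L \times d$, with $k \ll L$).

```python
# Minimal hierarchical sliced Wasserstein sketch. Assumes x and y contain
# the same number of samples, so the 1D distances reduce to sorted
# differences; the exponent p and sizes k, L are illustrative.
import torch

def hsw_distance(x, y, k=8, L=128, p=2):
    d = x.shape[1]
    theta = torch.randn(k, d)                     # bottleneck projections
    mix = torch.randn(L, k)                       # random linear combinations
    dirs = mix @ theta
    dirs = dirs / dirs.norm(dim=1, keepdim=True)  # unit projection directions
    xp, _ = torch.sort(x @ dirs.T, dim=0)         # sorted 1D projections
    yp, _ = torch.sort(y @ dirs.T, dim=0)
    return (xp - yp).abs().pow(p).mean().pow(1.0 / p)
```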
Motivated by this observation, we propose a practical, easy-to-implement method to improve the generalization performance of adversarially-trained networks via an additional self-supervised test-time adaptation step. We further employ a meta adversarial training method to find a good starting point for test-time adaptation, which incorporates the test-time adaptation procedure into the training phase and strengthens the correlation between the pretext tasks in self-supervised learning and the original classification task. Extensive empirical experiments on CIFAR10, STL10 and Tiny ImageNet using several different self-supervised tasks show that our method consistently improves the robust accuracy of standard adversarial training under different white-box and black-box attack strategies.", AutoShot: A Short Video Dataset and State-of-the-Art Shot Boundary Detection,https://openreview.net/forum?id=u89Eq-_3oE4,https://openreview.net/pdf?id=u89Eq-_3oE4,,"Short-form videos have exploded in popularity and have come to dominate new social media trends. Prevailing short-video platforms, e.g., TikTok, Instagram Reels, and YouTube Shorts, have changed the way we consume and create content. For video content creation and understanding, shot boundary detection (SBD) is one of the most essential components in various scenarios. In this work, we release a new public Short video sHot bOundary deTection dataset, named SHOT, consisting of 853 complete short videos and 11,606 shot annotations, with 2,716 high-quality shot boundary annotations in 200 test videos. Leveraging this new data wealth, we propose to optimize the model design for video SBD, by conducting neural architecture search in a search space encapsulating various advanced 3D ConvNets and Transformers. Our proposed approach, named AutoShot, achieves higher F1 scores than previous state-of-the-art approaches, e.g., outperforming TransNetV2 by 4.2%, when derived and evaluated on our newly constructed SHOT dataset. Moreover, to validate the generalizability of the AutoShot architecture, we directly evaluate it on three other public datasets: ClipShots, BBC and RAI, and the F1 scores of AutoShot outperform previous state-of-the-art approaches by 1.1%, 0.9% and 1.2%, respectively. The SHOT dataset and code will be released.","Shot boundary detection, short video, dataset" Prototypical Calibration for Few-shot Learning of Language Models,https://openreview.net/forum?id=nUsP9lFADUF,https://openreview.net/pdf?id=nUsP9lFADUF,,"In-context learning of GPT-like models has been recognized as fragile across different hand-crafted templates and demonstration permutations. In this work, we propose prototypical calibration to adaptively learn a more robust decision boundary for zero- and few-shot classification, instead of greedy decoding. Concretely, our method first adopts a Gaussian mixture distribution to estimate the prototypical clusters for all categories. Then we assign each cluster to the corresponding label by solving a weighted bipartite matching problem. Given an example, its prediction is calibrated by the likelihood of prototypical clusters. Experimental results show that prototypical calibration yields a substantial improvement on a diverse set of tasks.
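The self-supervised test-time adaptation step mentioned above is generic; one standard pretext task from this literature is rotation prediction. The sketch below updates a shared encoder on a rotation loss computed from the unlabeled test batch before classifying; the choice of pretext task, heads, and step count are assumptions.

```python
# Hedged sketch of self-supervised test-time adaptation: adapt the encoder
# on a rotation-prediction pretext loss, then classify the same batch.
import torch
import torch.nn.functional as F

def adapt_then_predict(encoder, cls_head, rot_head, x, opt, steps=1):
    for _ in range(steps):
        rots = torch.randint(0, 4, (x.size(0),), device=x.device)
        x_rot = torch.stack([torch.rot90(img, int(k), dims=(-2, -1))
                             for img, k in zip(x, rots)])
        loss = F.cross_entropy(rot_head(encoder(x_rot)), rots)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        return cls_head(encoder(x)).argmax(dim=1)
```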
Extensive analysis across different scales also indicates that our method calibrates the decision boundary as expected, greatly improving the robustness of GPT to templates, permutations, and class imbalance.", NERDS: A General Framework to Train Camera Denoisers from Single Noisy Images,https://openreview.net/forum?id=NO0ThzteQdI,https://openreview.net/pdf?id=NO0ThzteQdI,," We aim to train accurate denoising networks for smartphone/digital cameras from single noisy images. Downscaling is commonly used as a practical denoiser for low-resolution images. Based on this processing, we found that the pixel variance of natural images is more robust to downscaling than the pixel variance of camera noise. Intuitively, downscaling removes high-frequency noise more easily than natural textures. To utilize this property, we can adopt noisy/clean image synthesis at low resolution to train camera denoisers. On this basis, we propose a new solution pipeline -- NERDS -- that estimates camera noise and synthesizes noisy-clean image pairs from only noisy images. In particular, it first models the noise in raw-sensor images as a Poisson-Gaussian distribution, then estimates the noise parameters using the difference in pixel variances under downscaling. We formulate the noise estimation as a gradient-descent-based optimization problem through a reparametrization trick. We further introduce a new Image Signal Processor (ISP) estimation method that enables denoiser training in a human-readable RGB space by transforming the synthetic raw images to the style of a given RGB noisy image. The noise and ISP estimations utilize rich augmentation to synthesize image pairs for denoiser training. Experiments show that our NERDS can accurately train CNN-based denoisers (\textit{e.g.}, DnCNN, ResNet-style networks), outperforming previous noise-synthesis-based and self-supervision-based denoisers on real datasets.", Communication-Efficient and Drift-Robust Federated Learning via Elastic Net,https://openreview.net/forum?id=-59_mb1lOf4,https://openreview.net/pdf?id=-59_mb1lOf4,,"Federated learning (FL) is a distributed method to train a global model over a set of local clients while keeping data localized, which reduces privacy and security risks. The FL framework faces important challenges, including expensive communication costs and the client drift problem. Leveraging the elastic net, we propose a communication-efficient and drift-robust FL framework to improve communication efficiency and resolve the client drift problem. We repurpose two types of elastic net regularizers (i.e., $\ell_1$ and $\ell_2$ penalties on the local model updates): (1) the $\ell_1$-norm regularizer sparsifies the local updates to enhance communication efficiency, and (2) the $\ell_2$-norm regularizer attempts to resolve the client drift problem by limiting the impact of drifting local updates due to data heterogeneity. Our framework is general; hence, it can be integrated with prior FL techniques, e.g., FedAvg, FedProx, SCAFFOLD, and FedDyn. We show that our framework effectively resolves the communication cost problem and the client drift problem simultaneously.","Federated learning, Data heterogeneity, Optimization" Hierarchical Protein Representations via Complete 3D Graph Networks,https://openreview.net/forum?id=9X-hgLDLYkQ,https://openreview.net/pdf?id=9X-hgLDLYkQ,,"We consider representation learning for proteins with 3D structures. We build 3D graphs based on protein structures and develop graph networks to learn their representations. 
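(Editor's aside.) A minimal sketch of the first step of such 3D protein graph modeling: building an amino-acid-level radius graph from C-alpha coordinates. The cutoff value is an illustrative assumption, not the paper's setting:

```python
import numpy as np


def radius_graph(coords, cutoff=10.0):
    """Connect residues whose C-alpha atoms lie within `cutoff`
    Angstroms. coords: (N, 3) array of C-alpha positions. Returns a
    directed edge list and pairwise distances as edge features."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    src, dst = np.where((dist < cutoff) & (dist > 0.0))
    edge_index = np.stack([src, dst])       # (2, E)
    edge_attr = dist[src, dst][:, None]     # (E, 1)
    return edge_index, edge_attr
```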
Depending on the level of detail that we wish to capture, protein representations can be computed at different levels, \emph{e.g.}, the amino acid, backbone, or all-atom levels. Importantly, there exist hierarchical relations among the different levels. In this work, we propose a novel hierarchical graph network, known as ProNet, to capture these relations. Our ProNet is very flexible and can be used to compute protein representations at different levels of granularity. By treating each amino acid as a node in graph modeling as well as harnessing the inherent hierarchies, our ProNet is more effective and efficient than existing methods. We also show that, given a base 3D graph network that is complete, our ProNet representations are also complete at all levels. Experimental results show that ProNet outperforms recent methods on most datasets. In addition, results indicate that different downstream tasks may require representations at different levels.", Adversarial Attacks on Adversarial Bandits,https://openreview.net/forum?id=bBpT6dEjeRG,https://openreview.net/pdf?id=bBpT6dEjeRG,,"We study a security threat to adversarial multi-armed bandits, in which an attacker perturbs the loss or reward signal to control the behavior of the victim bandit player. We show that the attacker is able to mislead any no-regret adversarial bandit algorithm into selecting a suboptimal target action in all but a sublinear ($T - o(T)$) number of rounds, while incurring only a sublinear ($o(T)$) cumulative attack cost. This result implies a critical security concern in real-world bandit-based systems, e.g., in online recommendation, an attacker might be able to hijack the recommender system and promote a desired product. Our proposed attack algorithms require knowledge of only the regret rate, and are thus agnostic to the concrete bandit algorithm employed by the victim player. We also derive a theoretical lower bound on the cumulative attack cost that any victim-agnostic attack algorithm must incur. The lower bound matches the upper bound achieved by our attack, which shows that our attack is asymptotically optimal.","adversarial attacks, adversarial bandits, target action, sublinear cumulative attack cost" Multiscale Multimodal Transformer for Multimodal Action Recognition,https://openreview.net/forum?id=aqP3WFwMPbe,https://openreview.net/pdf?id=aqP3WFwMPbe,,"While action recognition has been an active research area for several years, most existing approaches merely leverage the video modality, unlike humans, who efficiently process video and audio cues simultaneously. This limits the usage of recent models to applications where the actions are visually well-defined. On the other hand, audio and video can be perceived in a hierarchical structure, e.g., from the audio signal at each sampling time point to audio activities and the overall category in audio classification. In this work, we develop a multiscale multimodal Transformer (MMT) that employs hierarchical representation learning. In particular, MMT is composed of a novel multiscale audio Transformer (MAT) and a multiscale video Transformer. Furthermore, we propose a set of multimodal supervised contrastive objectives, called audio-video contrastive loss (AVC) and intra-modal contrastive loss (IMC), that specifically align the two modalities for robust multimodal representation fusion. 
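(Editor's aside.) The cross-modal alignment term in such audio-video contrastive objectives is typically a symmetric InfoNCE loss; a hedged sketch follows, where the temperature and symmetric form are illustrative assumptions rather than the paper's exact AVC definition:

```python
import torch
import torch.nn.functional as F


def audio_video_contrastive(audio_emb, video_emb, temperature=0.07):
    """Pull together audio/video embeddings of the same clip; other
    clips in the batch serve as negatives."""
    a = F.normalize(audio_emb, dim=1)           # (B, D)
    v = F.normalize(video_emb, dim=1)           # (B, D)
    logits = a @ v.t() / temperature            # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```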
MMT surpasses previous state-of-the-art approaches by 7.3%, 1.6% and 2.1% on Kinetics-Sounds, Epic-Kitchens-100 and VGGSound in terms of top-1 accuracy without external training data. Moreover, our MAT significantly outperforms AST by 22.2%, 4.4% and 4.7% on the three public benchmark datasets and is 3x more efficient based on the number of FLOPs. Through extensive ablation studies and visualizations, we demonstrate that the proposed MMT can effectively capture semantically more separable feature representations from a combination of video and audio signals.","audio and video classification, multimodal action recognition" Partition Matters in Learning and Learning-to-Learn Implicit Neural Representations,https://openreview.net/forum?id=VN6MxIRez-c,https://openreview.net/pdf?id=VN6MxIRez-c,We use partition techniques to speed up the convergence of learning INRs and learning-to-learn INRs.,"$\textit{Implicit neural representation}$ (INR) aims to learn a $\textit{continuous function}$ (i.e., a neural network) to represent an image, where the input and output of the function are pixel coordinates and RGB/Gray values, respectively. However, images tend to consist of many objects whose colors are not perfectly consistent, resulting in the challenge that an image itself is actually a $\textit{discontinuous piecewise function}$ and cannot be well estimated by a continuous function. In this paper, we empirically show that if a neural network is forced to fit a discontinuous piecewise function (e.g., a step function) to reach a fixed small error $\epsilon$, the time cost will increase exponentially. We name this phenomenon the $\textit{exponential-increase}$ hypothesis. Obviously, handling an image with many objects is almost impossible in INR if this hypothesis is true. To address this issue, we first prove that partitioning a complex signal into several sub-regions and utilizing piecewise INRs to fit that signal can significantly reduce the convergence time, even when the exponential-increase hypothesis is true. Based on this fact, we introduce two partition-based INR methods: one for learning INRs, and the other for learning-to-learn INRs. Both methods are designed to partition an image into different sub-regions and dedicate a smaller network to each part. In addition, we further propose two partition rules based on regular grids and semantic segmentation maps, respectively. Extensive experiments validate the effectiveness of the proposed partitioning methods in terms of learning an INR for a single image (ordinary learning framework) and the learning-to-learn framework.","implicit neural representations, partition, meta-learning" Grounding High Dimensional Representation Similarity by Comparing Decodability and Network Performance,https://openreview.net/forum?id=QHiuyzE69Bx,https://openreview.net/pdf?id=QHiuyzE69Bx,We evaluate representation similarity measures for sensitivity to decoding and network function using ablation on convolutional neural networks.,"To understand and interpret neural networks, representation similarity metrics have been used to compare learned representations between and across networks. Recent experiments have compared these similarity metrics to find the best-performing and most robust metrics, noting that classic baselines perform surprisingly well. These experiments are mostly constrained to studying relatively low-dimensional representations because of the computational cost of prominent representation similarity metrics. 
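(Editor's aside.) One such metric, linear Centered Kernel Alignment (CKA), admits a feature-space formulation that avoids the n x n Gram matrices, which is what makes very wide layers tractable; a minimal sketch, assuming centered activation matrices on the same n stimuli:

```python
import numpy as np


def linear_cka(X, Y):
    """Linear CKA between activations X (n, p1) and Y (n, p2):
    ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F) after centering."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    xty = np.linalg.norm(X.T @ Y, "fro") ** 2
    xtx = np.linalg.norm(X.T @ X, "fro")
    yty = np.linalg.norm(Y.T @ Y, "fro")
    return xty / (xtx * yty)
```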
We extend previous work to test representation similarity metrics on larger convolutional networks processing larger images. To make this possible, we employ reformulated representation similarity metrics for use on very high-dimensional representations. Using these reformulated similarity metrics, we test how well each metric captures changes to representations induced by ablations in two popular convolutional networks. To ground the effects of changes to representations in function, we use linear decoding probes and network performance measures. These measures of function allow us to test how well similarity metrics capture changes in decodable information versus changes in network performance. Linear decoding methods index the information available in the representation, while network performance measures index the information used by the network. We show that all the tested representation similarity metrics significantly predict changes in network function and decodability. Among these metrics, on average, Procrustes and CKA outperform regularized CCA-based methods. All metrics predict decodability changes significantly better than they do network function. Procrustes and CKA do not outperform regularized CCA-based metrics for all network and functionality measure combinations. We add to the growing literature on representational similarity metrics to facilitate the improvement of current metrics for network interpretability.","ablation, representation, semantic decoding, linear decoding, representation similarity, neural network interpretability, activation space" Likelihood adjusted semidefinite programs for clustering heterogeneous data,https://openreview.net/forum?id=BsxMeLGAmU,https://openreview.net/pdf?id=BsxMeLGAmU,We propose an iterative likelihood adjusted SDP for clustering under data heterogeneity.,"Clustering is a widely deployed unsupervised learning tool. Model-based clustering is a flexible framework to tackle data heterogeneity when the clusters have different shapes. Likelihood-based inference for mixture distributions often involves non-convex and high-dimensional objective functions, imposing difficult computational and statistical challenges. The classic expectation-maximization (EM) algorithm is a computationally thrifty iterative method that maximizes a surrogate function minorizing the log-likelihood of observed data in each iteration, which, however, suffers from bad local maxima even in the special case of the standard Gaussian mixture model with common isotropic covariance matrices. On the other hand, recent studies reveal that the unique global solution of a semidefinite programming (SDP) relaxation of $K$-means achieves the information-theoretically sharp threshold for perfectly recovering the cluster labels under the standard Gaussian mixture model. In this paper, we extend the SDP approach to a general setting by integrating cluster labels as model parameters and propose an iterative likelihood adjusted SDP (iLA-SDP) method that directly maximizes the \emph{exact} observed likelihood in the presence of data heterogeneity. By lifting the cluster assignment to group-specific membership matrices, iLA-SDP avoids centroid estimation -- a key feature that allows exact recovery under well-separatedness of centroids without being trapped by their adversarial configurations. Thus iLA-SDP is less sensitive than EM to initialization and more stable on high-dimensional data. 
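(Editor's aside.) For reference, the SDP relaxation of $K$-means invoked above can be written in a few lines of cvxpy; this is a hedged sketch of the classical relaxation (small n only; the rounding from Z back to labels is omitted), not the paper's iLA-SDP itself:

```python
import cvxpy as cp
import numpy as np


def sdp_kmeans(X, k):
    """Relax the K-means membership matrix to the PSD cone:
    maximize <XX^T, Z> s.t. Z PSD, Z >= 0, Z 1 = 1, tr(Z) = k."""
    n = X.shape[0]
    A = X @ X.T                          # inner-product affinity
    Z = cp.Variable((n, n), PSD=True)
    constraints = [Z >= 0,
                   cp.sum(Z, axis=1) == 1,
                   cp.trace(Z) == k]
    cp.Problem(cp.Maximize(cp.trace(A @ Z)), constraints).solve()
    return Z.value                        # round (e.g., spectrally) to labels
```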
Our numerical experiments demonstrate that iLA-SDP achieves lower mis-clustering errors than several widely used clustering methods, including $K$-means, SDP and EM algorithms.","clustering, likelihood inference, semidefinite programming, alternating maximization" RGI: robust GAN-inversion for mask-free image inpainting and unsupervised pixel-wise anomaly detection,https://openreview.net/forum?id=1UbNwQC89a,https://openreview.net/pdf?id=1UbNwQC89a,,"Generative adversarial networks (GANs), trained on a large-scale image dataset, can be good approximators of the natural image manifold. GAN-inversion, using a pre-trained generator as a deep generative prior, is a promising tool for image restoration under corruptions. However, the performance of GAN-inversion can be limited by a lack of robustness to unknown gross corruptions, i.e., the restored image might easily deviate from the ground truth. In this paper, we propose a Robust GAN-inversion (RGI) method with a provable robustness guarantee to achieve image restoration under unknown \textit{gross} corruptions, where a small fraction of pixels are completely corrupted. Under mild assumptions, we show that the restored image and the identified corrupted region mask converge asymptotically to the ground truth. Moreover, we extend RGI to Relaxed-RGI (R-RGI) for generator fine-tuning to mitigate the gap between the GAN-learned manifold and the true image manifold while avoiding trivial overfitting to the corrupted input image, which further improves the image restoration and corrupted region mask identification performance. The proposed RGI/R-RGI method unifies two important applications with state-of-the-art (SOTA) performance: (i) mask-free semantic inpainting, where the corruptions are unknown missing regions and the restored background can be used to recover the missing content; and (ii) unsupervised pixel-wise anomaly detection, where the corruptions are unknown anomalous regions and the retrieved mask can be used as the anomalous region’s segmentation mask.","Robust GAN-inversion, Mask-free Semantic Inpainting, Unsupervised Pixel-wise Anomaly Detection" Coverage-centric Coreset Selection for High Pruning Rates,https://openreview.net/forum?id=QwKvL6wC8Yi,https://openreview.net/pdf?id=QwKvL6wC8Yi,"We study the importance of data coverage in coreset selection and propose a coverage-centric method for coreset selection, which we show achieves significantly better accuracy than SOTA methods with high pruning rates.","One-shot coreset selection aims to select a subset of the training data, given a pruning rate, that can achieve high accuracy for models that are subsequently trained only with that subset. State-of-the-art coreset selection methods typically assign an importance score to each example and select the most important examples to form a coreset. These methods perform well at low pruning rates, but at high pruning rates they have been found to suffer a catastrophic accuracy drop, performing worse than even random coreset selection. In this paper, we explore the reasons for this accuracy drop both theoretically and empirically. We extend previous theoretical results on the bound for model loss in terms of the coverage provided by the coreset. Inspired by these theoretical results, we propose a novel coverage-based metric and, based on this metric, find that coresets selected by importance-based methods at high pruning rates can be expected to perform poorly compared to random coresets because of worse data coverage. 
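(Editor's aside.) A hedged sketch of a coverage-centric selection rule in the spirit of the method proposed next: rather than greedily taking the highest-importance examples, the budget is spread across importance strata so the coreset covers the full score distribution. The bin count and equal allocation are illustrative assumptions:

```python
import numpy as np


def stratified_coreset(scores, budget, n_bins=50, seed=0):
    """Select `budget` indices by stratified sampling over
    importance-score bins (equal-mass strata)."""
    rng = np.random.default_rng(seed)
    order = np.argsort(scores)
    bins = np.array_split(order, n_bins)
    per_bin = budget // n_bins
    chosen = [rng.choice(b, size=min(per_bin, len(b)), replace=False)
              for b in bins]
    return np.concatenate(chosen)
```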
We then propose a new coreset selection method, Coverage-centric Coreset Selection (CCS), where we jointly consider overall data coverage based on the proposed metric as well as the importance of each example. We evaluate CCS on four datasets and show that it achieves significantly better accuracy than state-of-the-art coreset selection methods as well as random sampling under high pruning rates, and comparable performance at low pruning rates. For example, CCS achieves 7.04% better accuracy than random sampling and at least 20.16% better than popular importance-based selection methods on CIFAR10 with a 90% pruning rate. ","Coreset, Data coverage, Data pruning" AIA: learn to design greedy algorithm for NP-complete problems using neural networks,https://openreview.net/forum?id=frwz3TheDeH,https://openreview.net/pdf?id=frwz3TheDeH,,"Algorithm design is an art that heavily requires the intuition and expertise of human designers as well as insights into the problems under consideration. In particular, the design of greedy-selection rules, the core of greedy algorithms, is usually a great challenge to designers: it is relatively easy to understand a greedy algorithm, but it is often difficult to find an effective greedy-selection rule. In this study, we present an approach, called AIA, to learn algorithm design with the aid of neural networks. We consider the minimum weighted set cover (SC) problem, one of the NP-hard problems, as a representative example. Initially, we formulate a given weighted SC problem as a 0-1 integer linear program (ILP): each variable $x_i$ has two options, i.e., $x_i=0$, which denotes abandoning the set $s_i$, and $x_i = 1$, which denotes selecting $s_i$. Each option of a variable leads to a sub-problem with respect to the original ILP problem. Next, we design a generic search framework to find the optimal solution to the ILP problem. At each search step, the value of a variable is determined with the aid of neural networks. The key to our neural network is the loss function: the original ILP problem and the sub-problems generated by assigning a variable $x_i$ should satisfy the Bellman-Ford equation, and the dissatisfaction of the Bellman-Ford equation is evaluated and used as the loss function of our neural network. The trained neural network is used as the greedy-selection rule. Experimental results on representative instances suggest that using the NN-based greedy-selection rule, we can successfully find the optimal solutions. More importantly, the NN-based greedy-selection rule outperforms the well-known Chvátal greedy algorithm, which was designed by a human expert. The basic idea of our approach can be readily extended without significant modification to design greedy algorithms for other NP-hard problems. ", Self-Adaptive Perturbation Radii for Adversarial Training,https://openreview.net/forum?id=JX1OCjfABRj,https://openreview.net/pdf?id=JX1OCjfABRj,,"Adversarial training is the most popular and effective technique to protect models from imperceptible adversarial samples. Despite its success, it is accompanied by significant performance degradation on clean data. To achieve good performance on both clean and adversarial samples, the main effort lies in searching for an adaptive perturbation radius for each training sample, which essentially suffers from a conflict between exact searching and computational overhead. 
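(Editor's aside.) A hedged sketch of per-sample adaptive radii during adversarial training: each sample keeps its own radius, grown when the model survives the perturbation and shrunk otherwise. The FGSM step and the update factors are illustrative assumptions, not the paper's exact rule:

```python
import torch
import torch.nn.functional as F


def fgsm_with_adaptive_radius(model, x, y, eps, grow=1.1, shrink=0.9):
    """One adversarial-training step with per-sample radii eps (B,)."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    x_adv = x + eps.view(-1, 1, 1, 1) * grad.sign()
    with torch.no_grad():
        still_correct = model(x_adv).argmax(dim=1).eq(y)
        eps = torch.where(still_correct, eps * grow, eps * shrink)
    return x_adv.detach(), eps   # train on x_adv, carry eps to next epoch
```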
To address this conflict, in this paper, we first show the superiority of adaptive perturbation radii intuitively and theoretically with respect to accuracy and robustness, respectively. Then we propose our novel self-adaptive adjustment framework for perturbation radii without tedious searching. We also discuss this framework on both deep neural networks (DNNs) and kernel support vector machines (SVMs). Finally, extensive experimental results show that our framework can improve not only natural generalization performance but also adversarial robustness. It is also competitive with existing searching strategies in terms of running time.", Hybrid and Collaborative Passage Reranking,https://openreview.net/forum?id=Iki4ufHeEGN,https://openreview.net/pdf?id=Iki4ufHeEGN,,"In an information retrieval system, the initial passage retrieval results may be unsatisfactory and can be refined by a reranking scheme. Existing solutions to passage reranking focus on enriching the interaction between the query and each passage separately, neglecting the context among the top-ranked passages in the initial retrieval list. To tackle this problem, we propose a Hybrid and Collaborative Passage Reranking (HybRank) method, which leverages the substantial similarity measurements of upstream retrievers for passage collaboration and incorporates the lexical and semantic properties of sparse and dense retrievers for reranking. Besides, built on off-the-shelf retriever features, the flexible plug-in HybRank is capable of enhancing an arbitrary passage list. Extensive experiments demonstrate stable performance improvements over prevalent retrieval methods and verify the effectiveness of the core components of HybRank.", GCINT: Dynamic Quantization Algorithm for Training Graph Convolution Neural Networks Using Only Integers,https://openreview.net/forum?id=cIFtriyX6on,https://openreview.net/pdf?id=cIFtriyX6on,Robust and Adaptive Quantization Algorithm for Training Graph Convolution Neural Networks Using Only Integers,"Quantization approaches can minimize storage costs while decreasing the computational complexity of a model, although there has been little study of quantization networks in the GNN field. We study the four primary reasons why existing quantization approaches cannot be employed extensively with GNNs: (1) quantifying the distinctions between data sources; (2) quantifying the distinctions between data streams; (3) quantifying the distinctions between concentrations; and (4) the limitations of quantization-aware training (QAT). Based on this, we propose GCINT, an efficient quantization framework for GNN training. The entire forward, backward, optimizer, and loss functions are calculated using integer data. We achieve a training acceleration ratio of nearly 10× on RTX 2080 Ti INT8 Tensor Cores compared to FP32 CUDA Cores. Our quantization is independent of the dataset and weight distribution, and more than 2,000 randomized trials have been undertaken on 8 popular GNN benchmark datasets, all achieving errors within 1% of FP32.","Quantization, Graph Neural Networks, Acceleration Training, Integers Networks" ILA-DA: Improving Transferability of Intermediate Level Attack with Data Augmentation,https://openreview.net/forum?id=OM7doLjQbOQ,https://openreview.net/pdf?id=OM7doLjQbOQ,"We propose ILA-DA, a method that employs 3 novel augmentation techniques to improve the transferability of adversarial attacks.","Adversarial attacks aim to generate deceptive inputs that fool a machine learning model. 
In deep learning, an adversarial input created for a specific neural network can also trick other neural networks. This intriguing property is known as the black-box transferability of adversarial examples. To improve black-box transferability, a previously proposed method called Intermediate Level Attack (ILA) fine-tunes an adversarial example by maximizing its perturbation on an intermediate layer of the source model. Meanwhile, it has been shown that simple image transformations can also enhance attack transferability. Based on these two observations, we propose ILA-DA, which employs three novel augmentation techniques to enhance ILA. Specifically, we propose (1) an automated way to apply effective image transformations, (2) an efficient reverse adversarial update technique, and (3) an attack interpolation method to create more transferable adversarial examples. As shown by extensive experiments, ILA-DA outperforms ILA and other state-of-the-art attacks by a large margin. On ImageNet, we attain an average attack success rate of 84.5%, which is 19.5% better than ILA and 4.7% better than the previous state-of-the-art across nine undefended models. For defended models, ILA-DA also surpasses existing attacks and provides further gains when incorporated into more advanced attack methods.","adversarial examples, adversarial transferability, data augmentation" Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning ,https://openreview.net/forum?id=x0BPR9iXc1,https://openreview.net/pdf?id=x0BPR9iXc1,Changing a small number (<7%) of parameters in already trained language and image models can match the performance of full model training for creating CLIP-like models. ,"The creation of contrastive vision-language models has traditionally required aligning a vision model with a language model by updating all of their parameters through gradient descent. It is not known if contrastive vision-language models (e.g., CLIP) can be created by a small number of parameter updates to already-trained language and vision models. The literature describes techniques that can create vision-language models by updating a small number of parameters in a language model, but these require already-aligned visual representations and are non-contrastive, hence unusable for latency-sensitive applications such as neural search. We explore the feasibility and benefits of parameter-efficient contrastive vision-language alignment through transfer learning: creating a model such as CLIP by minimally updating an already-trained vision and language model. We find that a minimal set of parameter updates (<7\%) can achieve the same performance as full-model training, and updating specific components (<1\% of parameters) can match 75\% of full-model training. We describe a series of experiments: we show that existing knowledge is conserved more strongly in parameter-efficient training and that parameter-efficient training scales with model and dataset size. We show evidence of an intriguing asymmetry in the vision and language models, and how it affects alignment. Where paired image-text data is scarce but strong multilingual language models exist (e.g., low-resource languages), parameter-efficient training is even preferable to full-model training. Given a fixed compute budget, parameter-efficient training allows training larger models on the same hardware, achieving equivalent performance in less time. 
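(Editor's aside.) A minimal sketch of one parameter-efficient alignment recipe: lock both towers and leave only bias and normalization parameters trainable (a BitFit/LayerNorm-style rule chosen here for illustration; the paper studies which components matter, not this exact rule), then train with the standard CLIP-style symmetric InfoNCE loss:

```python
import torch
import torch.nn.functional as F


def freeze_except_biases_and_norms(model):
    """Illustrative parameter-efficiency rule: only biases and
    normalization parameters remain trainable."""
    for name, p in model.named_parameters():
        p.requires_grad = ("bias" in name) or ("norm" in name.lower())


def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over an image-text batch."""
    i = F.normalize(img_emb, dim=1)
    t = F.normalize(txt_emb, dim=1)
    logits = i @ t.t() / temperature
    labels = torch.arange(i.size(0), device=i.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))
```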
Parameter-efficient training hence constitutes an energy-efficient and effective training strategy for contrastive vision-language models that may be preferable to the current full-model training paradigm for common use cases. ","vision-language, CLIP, image-text retrieval, transformers" 3EF: Class-Incremental Learning via Efficient Energy-Based Expansion and Fusion,https://openreview.net/forum?id=iP77_axu0h3,https://openreview.net/pdf?id=iP77_axu0h3,A unifying energy-based theory and framework called 3EF to analyze and achieve the goal of class-incremental learning. ,"Neural networks suffer from catastrophic forgetting when sequentially learning tasks phase-by-phase, making them inapplicable in dynamically updated systems. Class-incremental learning (CIL) aims to enable neural networks to learn different categories in multiple stages. Recently, dynamic-structure-based methods have achieved remarkable performance. However, these methods train all modules in a coupled manner and do not consider possible conflicts among modules, resulting in increased training costs and degradation of the eventual predictions. In this work, we propose a unifying energy-based theory and framework called Efficient Energy-Based Expansion and Fusion (3EF) to analyze and achieve the goal of CIL. We demonstrate the possibility of training independent modules in a decoupled manner while achieving bi-directional compatibility among modules through two additionally allocated prototypes, and then integrating them into a unifying classifier with minimal cost. Furthermore, 3EF extends the exemplar set to a more challenging setting in which exemplars are randomly selected and imbalanced; 3EF maintains its performance where prior methods fail dramatically. Extensive experiments on three widely used benchmarks (CIFAR-100, ImageNet-100, and ImageNet-1000) demonstrate that 3EF achieves state-of-the-art performance in both the ordinary and challenging CIL settings. ","EBMs, Compatibility, Continual Learning" Out-of-distribution Representation Learning for Time Series Classification,https://openreview.net/forum?id=gUZWOE42l6Q,https://openreview.net/pdf?id=gUZWOE42l6Q,"We present a novel perspective on time series classification and present algorithms and theory to solve it, with solid experiments.","Time series classification is an important problem in the real world. Due to its non-stationarity, i.e., the distribution changes over time, it remains challenging to build models that generalize to unseen distributions. In this paper, we propose to view time series classification from the distribution perspective. We argue that the temporal complexity of a time series dataset could be attributed to unknown latent distributions inside the time series that need to be characterized. To this end, we propose DIVERSIFY for out-of-distribution (OOD) representation learning. DIVERSIFY is an end-to-end approach that takes an iterative process: it first obtains the ‘worst-case’ distribution scenario via adversarial training, then matches the latent distributions. We further present some theoretical insights. Extensive experiments are conducted on seven datasets with different OOD settings across gesture recognition, speech command recognition, wearable stress and affect detection, and sensor-based human activity recognition. Qualitative and quantitative results demonstrate that DIVERSIFY significantly outperforms other baselines and effectively characterizes the latent distributions. 
","Domain generalization, out-of-distribution generalization, time series classification" A Closer Look at the Calibration of Differentially Private Learners,https://openreview.net/forum?id=fGm87trHel_,https://openreview.net/pdf?id=fGm87trHel_,,"We systematically study the calibration of classifiers trained with differentially private stochastic gradient descent (DP-SGD) and observe miscalibration across a wide range of vision and language tasks. Our analysis identifies per-example gradient clipping in DP-SGD as a major cause of miscalibration, and we show that existing baselines for improving private calibration only provide small improvements in calibration error while occasionally causing large degradation in accuracy. As a solution, we show that differentially private variants of post-processing calibration methods such as temperature calibration and Platt scaling are surprisingly effective and have negligible utility cost to the overall model. Across 7 tasks, temperature calibration and Platt scaling with DP-SGD result in an average 58-fold reduction in the expected calibration error and only incurs an up to 1.59 percent drop in accuracy.","Calibration, Differential Privacy" AVT: Audio-Video Transformer for Multimodal Action Recognition,https://openreview.net/forum?id=yFuHxmSwGus,https://openreview.net/pdf?id=yFuHxmSwGus,,"Action recognition is an essential field for video understanding. To learn from heterogeneous data sources effectively, in this work, we propose a novel multimodal action recognition approach termed Audio-Video Transformer (AVT). AVT uses a combination of video and audio signals to improve action recognition accuracy, leveraging the effective spatio-temporal representation by the video Transformer. For multimodal fusion, simply concatenating multimodal tokens in a cross-modal Transformer requires large computational and memory resources, instead we reduce the cross-modality complexity through an audio-video bottleneck Transformer. To improve the learning efficiency of multimodal Transformer, we integrate self-supervised objectives, i.e., audio-video contrastive learning, audio-video matching, and masked audio and video learning, into AVT training, which maps diverse audio and video representations into a common multimodal representation space. We further propose a masked audio segment loss to learn semantic audio activities in AVT. Extensive experiments and ablation studies on three public datasets and two in-house datasets consistently demonstrate the effectiveness of the proposed AVT. Specifically, AVT outperforms its previous state-of-the-art counterparts on Kinetics-Sounds and Epic-Kitchens-100 datasets by 8% and 1%, respectively, without external training data. AVT also surpasses one of the previous state-of-the-art video Transformers by 10% on the VGGSound dataset by leveraging the audio signal. Compared to one of the previous state-of-the-art multimodal Transformers, AVT is 1.3x more efficient in terms of FLOPs and improves the accuracy by 4.2% on Epic-Kitchens-100. 
Visualization results further demonstrate that the audio provides complementary and discriminative features, and that our AVT can effectively understand the action from a combination of audio and video.","Audio and video classification, multimodal action recognition" Exploring Transformer Backbones for Heterogeneous Treatment Effect Estimation,https://openreview.net/forum?id=C9sU3Tnnki8,https://openreview.net/pdf?id=C9sU3Tnnki8,We propose a general-purpose treatment effect estimator which significantly outperforms competitive baselines on a variety of challenging TEE problems.,"Previous works on Treatment Effect Estimation (TEE) are not in widespread use because they are predominantly theoretical, relying on strong parametric assumptions that are intractable in practical applications. Recent works use Multilayer Perceptrons (MLPs) for modeling causal relationships; however, MLPs lag far behind recent advances in ML methodology, which limits their applicability and generalizability. To extend beyond the single-domain formulation and towards more realistic learning scenarios, we explore model design spaces beyond MLPs, i.e., transformer backbones, which provide flexibility: attention layers govern interactions among treatments and covariates to exploit structural similarities of potential outcomes for confounding control. Through careful model design, Transformers as Treatment Effect Estimators (TransTEE) is proposed. We show empirically that TransTEE can: (1) serve as a general-purpose treatment effect estimator that significantly outperforms competitive baselines on a variety of challenging TEE problems (e.g., discrete, continuous, structured, or dosage-associated treatments) and is applicable both when covariates are tabular and when they consist of structured data (e.g., texts, graphs); and (2) yield multiple advantages: compatibility with propensity score modeling, parameter efficiency, robustness to continuous treatment value distribution shifts, explainability in covariate adjustment, and real-world utility in auditing pre-trained language models. ","Causal Inference, Continuous Treatment Effect, Heterogeneous Treatment Effect" Few-Shot Learning with Representative Global Prototype,https://openreview.net/forum?id=vT2OIobt3pQ,https://openreview.net/pdf?id=vT2OIobt3pQ,,"Few-shot learning is often challenged by low generalization performance due to the assumption that the data distributions of novel classes and base classes are similar while the model is trained only on the base classes. To mitigate the above issues, we propose a few-shot learning method with representative global prototypes. Specifically, to enhance generalization to novel classes, we propose a method to jointly train the base classes and the novel classes, using selected representative and non-representative samples to optimize representative global prototypes, respectively. Additionally, we propose a method that combines base-class samples, conditioned on semantic embeddings, with the original data to generate new samples for novel classes and thus augment the novel-class data. Results show that this training method improves the model's ability to describe novel classes, improving few-shot classification performance. 
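(Editor's aside.) The prototype machinery underlying prototypical-network-style few-shot methods like the one above is compact enough to sketch directly; this is the standard class-mean formulation, not the paper's specific "representative global prototype" optimization:

```python
import torch


def class_prototypes(support_emb, support_labels, n_classes):
    """Prototype = mean embedding of a class's support samples."""
    return torch.stack([support_emb[support_labels == c].mean(dim=0)
                        for c in range(n_classes)])       # (C, D)


def nearest_prototype(query_emb, protos):
    """Classify queries by nearest prototype in embedding space."""
    dists = torch.cdist(query_emb, protos)                 # (Q, C)
    return dists.argmin(dim=1)
```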
Extensive experiments have been conducted on two popular benchmark datasets, and the experimental results show that this method significantly improves classification performance on few-shot learning tasks and achieves state-of-the-art performance.", Important Channel Tuning,https://openreview.net/forum?id=TTMyoOdB9hZ,https://openreview.net/pdf?id=TTMyoOdB9hZ,,"Large vision transformers (ViTs) have achieved tremendous success in various computer vision tasks. These ViT models, pre-trained on large datasets such as ImageNet21K and JFT-300M, enjoy robustness in both low-level and high-level visual representations, and they repeatedly yield performance improvements on multiple downstream tasks. One straightforward way to inherit these robust representations is full fine-tuning. However, full fine-tuning is prone to overfitting the small downstream data by adjusting the massive weights of pre-trained large models. In addition, updating all the parameters of pre-trained large models requires high GPU memory and computation, which limits the application of these large models. To address the above two drawbacks of full fine-tuning, in this paper, we propose a parameter-efficient tuning (PET) method dubbed Important Channel Tuning (ICT). Different from previous PET methods that adopt a trainable module to tune all the channels of a feature map, we hypothesize and corroborate experimentally that not all channels are equal for adaptation. Specifically, we design a tiny external module that determines the most informative channels in the feature map for effective adaptation. In particular, with only a simple linear layer applied to the important channels, our ICT surpasses full fine-tuning on 18 out of 19 datasets in the VTAB-1K benchmark by adding only 0.11M parameters to ViT-B, which is 0.13% of its full fine-tuning counterpart. Moreover, compared with previous PET methods, ICT achieves the state-of-the-art average performance in the VTAB-1K benchmark with ViT and Swin Transformer backbones.", Feature-Driven Talking Face Generation with StyleGAN2,https://openreview.net/forum?id=79xEHFvjx9p,https://openreview.net/pdf?id=79xEHFvjx9p,Audio and image features are extracted through a GAN to generate a talking face.,"In this work, we wish to use a face image to generate a more natural and realistic talking-face animation video. This is not an easy task because facial appearance variation and the semantics of speech are coupled together when the talking face makes micro-movements. Audio features sometimes contain information about expressions, but they are not accurate enough, so a single audio feature cannot fully represent the movement of the face. For the above reasons, we want to use different features to generate talking faces. The StyleGAN series shows good performance in image processing and can also perform portrait style-transfer tasks very well. We find that StyleGAN can be used as a talking face generator. At the same time, we also encode and extract non-identity features and non-lip features, and try to find the subtle relationship between these features and the talking face. 
We also use evaluations and an ablation study to measure the quality of the generated videos and to examine whether our approach is effective and feasible.","Talking face, GAN, Feature Selection" Schema Inference for Interpretable Image Classification,https://openreview.net/forum?id=VGI9dSmTgPF,https://openreview.net/pdf?id=VGI9dSmTgPF,,"In this paper, we study a novel inference paradigm, termed schema inference, that learns to deductively infer explainable predictions by rebuilding the prior deep neural network (DNN) forwarding scheme, guided by the prevalent philosophical cognitive concept of schema. We strive to reformulate the conventional model inference pipeline into a graph matching policy that associates the extracted visual concepts of an image with the pre-computed scene impression, by analogy with the human reasoning mechanism of impression matching. To this end, we devise an elaborate architecture, termed SchemaNet, as a dedicated instantiation of the proposed schema inference concept, which models both the visual semantics of input instances and the learned abstract imaginations of target categories as topological relational graphs. Meanwhile, to capture and leverage the compositional contributions of visual semantics in a global view, we also introduce a universal Feat2Graph scheme in SchemaNet to establish relational graphs that contain abundant interaction information. Both the theoretical analysis and the experimental results on several benchmarks demonstrate that the proposed schema inference achieves encouraging performance and meanwhile yields a clear picture of the deductive process leading to the predictions. Our code and model will be made publicly available.", SimForest: An Efficient Plug-in to Boost Few-Shot Learning Performance,https://openreview.net/forum?id=V9ppvImFso_,https://openreview.net/pdf?id=V9ppvImFso_,"We develop an efficient (i.e: computation-efficient, convenient, universally adaptable) ensemble method which can significantly boost the performance of various few-shot learning algorithms.","Due to a lack of labeled training data and the subsequent unreliability of empirical risk minimization, few-shot learning algorithms usually suffer from high variance and bias in their predictions. Ensembling, on the other hand, combines predictions from multiple predictors, thus alleviating the aforementioned unreliability. Essentially, we believe that ensembling is a simple yet effective solution to tackle the core problems in few-shot learning; therefore, we develop a plug-in (ensemble) method to boost the performance of trained few-shot models. To maximize the performance of the ensemble, we use epoch-training to develop the feature representations used in the plug-in; in contrast, episodic training is used to obtain the feature representations of the original few-shot models. To minimize the extra computation cost induced by ensembling, we adopt a non-deep classifier (e.g., a random forest) for the plug-in, which can complete its training within a few seconds. 
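(Editor's aside.) Such a non-deep plug-in is straightforward to sketch: fit a fast ensemble on frozen feature embeddings and combine its probabilities with the original few-shot model's. The hyperparameters below are illustrative assumptions:

```python
from sklearn.ensemble import RandomForestClassifier


def forest_plugin(train_feats, train_labels, test_feats):
    """Fit a random forest on frozen (epoch-trained) embeddings; its
    class probabilities can then be averaged with the original
    few-shot model's predictions."""
    rf = RandomForestClassifier(n_estimators=200, n_jobs=-1)
    rf.fit(train_feats, train_labels)      # trains in seconds on features
    return rf.predict_proba(test_feats)
```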
Our method achieves substantial improvements for few-shot learning, consistently outperforming all the baseline methods.","few-shot learning, ensemble learning, deep learning, machine learning, image classification" Supernet Training for Federated Image Classification Under System Heterogeneity,https://openreview.net/forum?id=K8oz8DyuJD,https://openreview.net/pdf?id=K8oz8DyuJD,We propose a novel framework to tackle issues of data-heterogeneity and model-heterogeneity simultaneously by referring to supernet training.,"Efficient deployment of deep neural networks across many devices and resource constraints, particularly on edge devices, is one of the most challenging problems in the presence of data-privacy preservation issues. Conventional approaches have evolved to either improve a single global model while keeping each local heterogeneous training dataset decentralized (i.e., data heterogeneity; Federated Learning (FL)) or to train an overarching network that supports diverse architectural settings to address heterogeneous systems equipped with different computational capabilities (i.e., system heterogeneity; Neural Architecture Search). However, few studies have considered both directions simultaneously. This paper proposes the federation of supernet training (FedSup) framework to consider both scenarios simultaneously, in which clients send and receive a supernet that contains all possible architectures sampled from itself. The approach is inspired by the observation that averaging parameters during model aggregation in FL is similar to weight-sharing in supernet training. Thus, the proposed FedSup framework combines a weight-sharing approach widely used for training single-shot models with FL averaging (FedAvg). Furthermore, we develop an efficient algorithm (E-FedSup) that sends a sub-model to clients in the broadcast stage to reduce communication costs and training overhead, including several strategies to enhance supernet training in the FL environment. We verify the proposed approach with extensive empirical evaluations. The resulting framework also ensures data and model heterogeneity robustness on several standard benchmarks.","Federated Learning, Image Classification, Supernet Training, System Heterogeneity" Domain-Specific Risk Minimization for Out-of-Distribution Generalization,https://openreview.net/forum?id=vCVTZYFcmCm,https://openreview.net/pdf?id=vCVTZYFcmCm,"In this paper, we develop a new generalization bound that is independent of hypothesis space choice and measures the adaptivity gap directly. Two test-time adaptation methods are then proposed inspired by the bound. ","Recent domain generalization (DG) approaches typically use the classifier trained on source domains for inference on the unseen target domain. However, such a classifier can be arbitrarily far from the optimal one for the target domain, a discrepancy induced by a gap termed the ``adaptivity gap''. Without exploiting the domain information from the unseen test samples, adaptivity gap estimation and minimization are intractable, which hinders us from robustifying a model to any unknown distribution. In this paper, we first establish a generalization bound that naturally considers the adaptivity gap. Our bound motivates two strategies to reduce the gap: the first is ensembling multiple classifiers to enrich the hypothesis space, and the second is adapting model parameters using online target samples. We thus propose Domain-specific Risk Minimization (DRM) for better domain generalization. 
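(Editor's aside.) The test-time side of such a per-domain ensemble can be sketched as follows; the softmax-weighted gating function is an illustrative assumption standing in for DRM's actual combination rule, and the online-update step is omitted:

```python
import torch


def drm_style_predict(feats, domain_classifiers, domain_gate):
    """Combine per-source-domain classifiers with sample-specific
    weights from a gating head. feats: (B, D); each classifier maps
    (B, D) -> (B, C); domain_gate maps (B, D) -> (B, n_domains)."""
    weights = domain_gate(feats).softmax(dim=1)            # (B, n_domains)
    logits = torch.stack([clf(feats) for clf in domain_classifiers],
                         dim=1)                            # (B, n_domains, C)
    return (weights.unsqueeze(-1) * logits).sum(dim=1)     # (B, C)
```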
During training, DRM models the distribution of each source domain separately; at test time, DRM combines the classifiers dynamically for each target sample, and each arriving unlabeled target sample is used to update our model. Extensive experiments demonstrate the effectiveness of the proposed DRM for domain generalization with the following advantages: 1) it significantly outperforms competitive baselines on different distributional shift settings; 2) it achieves either comparable or superior accuracy on all training domains compared to vanilla empirical risk minimization (ERM); 3) it remains very simple and efficient during training; and 4) it is complementary to invariant learning approaches. ","Out-of-Distribution Generalization, adaptivity gap, hypothesis space enhancement" CircuitNet: A Generic Neural Network to Realize Universal Circuit Motif Modeling,https://openreview.net/forum?id=GUSf17i8RMZ,https://openreview.net/pdf?id=GUSf17i8RMZ,We proposed CircuitNet by modeling the universal circuit motifs and structures in human brains to function as a generic neural network and tested in several machine learning tasks.,"The successes of artificial neural networks (ANNs) are largely attributed to mimicking human brain structures. Recent advances in neuroscience have revealed that neurons interact with each other through various kinds of connectivity patterns to process information, where the common connectivity patterns are called circuit motifs. However, many existing ANNs can only model one or two circuit motifs in their architectures, so their performance may vary drastically among different types of machine learning tasks. In this paper, we propose a new type of neural network inspired by the architectures of neuronal circuits, namely the Circuit Neural Network (CircuitNet). In CircuitNet, a group of densely connected neurons, namely a circuit motif unit (CMU), forms the basic unit of the network and is capable of modeling universal circuit motifs by adjusting the weights within the CMUs. Compared with traditional feed-forward networks, CircuitNet has the ability to model more types of neuron connections, such as feedback and lateral motifs. Inspired by the locally dense and globally sparse structure of the human brain, several iterations of signal transmission among different CMUs are achieved by sparse connections through the input and output ports of different CMUs. Experiments demonstrate that CircuitNet can outperform popular neural network architectures in function approximation, reinforcement learning, image classification, and time series forecasting tasks.","Bio-inspired neural network, Deep Learning" Your Contrastive Learning Is Secretly Doing Stochastic Neighbor Embedding,https://openreview.net/forum?id=XFSCKELP3bp,https://openreview.net/pdf?id=XFSCKELP3bp,"This work proposes a novel perspective that interprets SSCL methods as a type of SNE method, which facilitates both deeper theoretical understanding of SSCL and methodological guidelines for practical improvement.","Contrastive learning, especially self-supervised contrastive learning (SSCL), has achieved great success in extracting powerful features from unlabeled data. In this work, we contribute to the theoretical understanding of SSCL and uncover its connection to the classic data visualization method, stochastic neighbor embedding (SNE), whose goal is preserving pairwise distances. 
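(Editor's aside.) The correspondence can be made concrete with standard forms of the two objectives (notation mine, not the paper's): SNE minimizes a cross-entropy between input-space similarities $p_{ij}$ and embedding-space similarities $q_{ij}$, while InfoNCE has the same form with $p_{ij}$ concentrated on augmentation pairs $(i, i^+)$:

```latex
% SNE: cross-entropy between similarity distributions
\mathcal{L}_{\mathrm{SNE}} = -\sum_{i \neq j} p_{ij} \log q_{ij},
\qquad
q_{ij} = \frac{\exp(-\|z_i - z_j\|^2)}{\sum_{k \neq i} \exp(-\|z_i - z_k\|^2)}.
% InfoNCE: same form, with p_{ij} concentrated on augmented pairs (i, i^+)
\mathcal{L}_{\mathrm{InfoNCE}} = -\sum_{i} \log
\frac{\exp(z_i^{\top} z_{i^+}/\tau)}{\sum_{k \neq i} \exp(z_i^{\top} z_k/\tau)}.
```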
From the perspective of preserving neighborhood information, SSCL can be viewed as a special case of SNE with the input-space pairwise similarities specified by data augmentation. The established correspondence facilitates a deeper theoretical understanding of the features learned by SSCL, as well as methodological guidelines for practical improvement. Specifically, through the lens of SNE, we provide novel analyses of domain-agnostic augmentations, implicit bias, and the robustness of learned features. To illustrate the practical advantage, we demonstrate that the modifications from SNE to t-SNE can also be adopted in the SSCL setting, achieving significant improvement in both in-distribution and out-of-distribution generalization. ","theoretical understanding, contrastive learning, stochastic neighbor embedding" Covariance-Robust Minimax Probability Machines for Algorithmic Recourse,https://openreview.net/forum?id=AO8F51yRk67,https://openreview.net/pdf?id=AO8F51yRk67,We propose a novel pipeline to generate a model-agnostic recourse that is robust to model shifts.,"Algorithmic recourse is rising as a prominent technique to promote the explainability and transparency of predictive models in ethical machine learning. Existing approaches to algorithmic recourse often assume an invariant predictive model; however, this model, in reality, is usually updated over time upon the input of new data. Thus, a recourse that is valid with respect to the present model may become invalid for the future model. To resolve this issue, we propose a pipeline to generate a model-agnostic recourse that is robust to model shifts. Our pipeline first estimates a linear surrogate of the nonlinear (black-box) model using covariance-robust minimax probability machines (MPM); then, the recourse is generated with respect to this robust linear surrogate. We show that the covariance-robust MPM recovers popular regularization schemes, including $\ell_2$-regularization and class reweighting. We also show that our covariance-robust MPM pushes the decision boundary in an intuitive manner, which facilitates an interpretable generation of a robust recourse. The numerical results demonstrate the usefulness and robustness of our pipeline. ","Optimization and Learning under Uncertainty, Algorithmic Recourse, Trustworthy ML and Statistics" Harnessing Mixed Offline Reinforcement Learning Datasets via Trajectory Weighting,https://openreview.net/forum?id=OhUAblg27z,https://openreview.net/pdf?id=OhUAblg27z,We propose a sample selection strategy that enables offline reinforcement learning algorithms to learn a better policy in mixed datasets with sparse high-return trajectories.,"Most offline reinforcement learning (RL) algorithms return a target policy maximizing a trade-off between (1) the expected performance gain over the behavior policy that collected the dataset, and (2) the risk stemming from the out-of-distribution-ness of the induced state-action occupancy. It follows that the performance of the target policy is strongly related to the performance of the behavior policy and, thus, the trajectory return distribution of the dataset. We show that in mixed datasets consisting of mostly low-return trajectories and a minority of high-return trajectories, state-of-the-art offline RL algorithms are overly restrained by the low-return trajectories and fail to exploit the high-performing trajectories to the fullest. 
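(Editor's aside.) Return-weighted re-sampling of this kind is easy to sketch; the exponential weighting below is an illustrative choice, not necessarily the paper's exact scheme:

```python
import numpy as np


def reweighted_indices(returns, n_samples, temperature=1.0, seed=0):
    """Draw trajectory indices with probability increasing in return,
    so the induced 'behavior policy' of the artificial dataset has a
    higher return than uniform sampling."""
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns, dtype=np.float64)
    z = (returns - returns.max()) / max(temperature, 1e-8)
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(returns), size=n_samples, replace=True, p=probs)
```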
To overcome this issue, we show that, in deterministic MDPs with stochastic initial states, the dataset sampling can be re-weighted to induce an artificial dataset whose behavior policy has a higher return. This re-weighted sampling strategy may be combined with any offline RL algorithm. We further show that the opportunity for performance improvement over the behavior policy correlates with the positive-sided variance of the trajectory returns in the dataset. We empirically show that while CQL, IQL, and TD3+BC achieve only a part of this potential policy improvement, these same algorithms combined with our re-weighted sampling strategy fully exploit the dataset. Furthermore, we empirically demonstrate that, despite its theoretical limitation, the approach may still be efficient in stochastic environments. ","offline reinforcement learning, reinforcement learning, sampling, experience replay" Self-Consistency Improves Chain of Thought Reasoning in Language Models,https://openreview.net/forum?id=1PL1NIMMrw,https://openreview.net/pdf?id=1PL1NIMMrw,"We propose a new decoding strategy, self-consistency, that greatly improves chain-of-thought prompting","Chain-of-thought prompting combined with pretrained large language models has achieved encouraging results on complex reasoning tasks. In this paper, we propose a new decoding strategy, self-consistency, to replace the naive greedy decoding used in chain-of-thought prompting. It first samples a diverse set of reasoning paths instead of only taking the greedy one, and then selects the most consistent answer by marginalizing out all possible reasoning paths. Self-consistency leverages the intuition that a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer. Our extensive empirical evaluation shows that self-consistency boosts the performance of chain-of-thought prompting by a striking margin on a range of popular arithmetic and commonsense reasoning benchmarks, including GSM8K (+17.9%), SVAMP (+11.0%), AQuA (+12.2%), StrategyQA (+6.4%) and ARC-challenge (+3.9%).","Language models, natural language processing, reasoning" Ensuring DNN Solution Feasibility for Optimization Problems with Linear Constraints,https://openreview.net/forum?id=QVcDQJdFTG,https://openreview.net/pdf?id=QVcDQJdFTG,This paper proposes a preventive learning framework to ensure DNN solution feasibility for optimization problems with linear constraints without post-processing.,"We propose preventive learning as the first framework to guarantee Deep Neural Network (DNN) solution feasibility for optimization problems with linear constraints without post-processing. Without loss of generality, we focus on problems with only inequality constraints. We systematically calibrate the inequality constraints used in training, thereby anticipating DNN prediction errors and ensuring the obtained solutions remain feasible. We characterize the calibration rate and a critical DNN size, based on which we can directly construct a DNN with a provable solution feasibility guarantee. We further propose an Adversary-Sample Aware training algorithm to improve its optimality performance. We apply the framework to develop DeepOPF+ for solving the essential DC optimal power flow problem in grid operation. 
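(Editor's aside.) The calibration idea reduces to tightening the inequality constraints used during training so that bounded prediction errors cannot leave the true feasible set; a minimal sketch with a uniform calibration rate (an illustrative simplification of the paper's calibrated rates):

```python
import numpy as np


def calibrate_constraints(b, rho=0.05):
    """Tighten A x <= b to A x <= b - rho*|b| for training, so small
    DNN prediction errors stay inside the original feasible region."""
    return b - rho * np.abs(b)


def is_feasible(A, x, b, tol=1e-9):
    """Check a predicted solution against the ORIGINAL constraints."""
    return bool(np.all(A @ x <= b + tol))
```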
Simulation results over IEEE test cases show that it outperforms existing strong DNN baselines in ensuring 100% feasibility and attaining consistent optimality loss (<0.19%) and speedup (up to ×228) in both light-load and heavy-load regimes, as compared to a state-of-the-art solver.","Deep learning, Deep neural network, Constrained optimization, Solution feasibility guarantee, Optimal power flow" SpeedAug: A Simple Co-Augmentation Method for Unsupervised Audio-Visual Pre-training,https://openreview.net/forum?id=ckIyo92KL6,https://openreview.net/pdf?id=ckIyo92KL6,,"We present a speed co-augmentation method for unsupervised audio-visual pre-training. A playback speed is randomly selected and applied to both audio and video data to diversify audio-visual views. By applying this augmentation, we observe an interesting phenomenon that multi-modal co-augmentation leads to data entanglement and even semantic meaning shift (e.g., a sped-up sound from a cat can be mistaken for the sound from a mouse). This differs from the common intuition in single-modality representation learning, where samples are invariant to different augmentations. To combat this, augmented audio-visual views are formulated as a partial relationship via our proposed SoftInfoNCE during unsupervised pre-training. The learned representations are evaluated on three downstream tasks, including action recognition and video retrieval on the UCF101 and HMDB51 datasets, and video-audio retrieval on the Kinetics-Sounds dataset. Extensive experimental results show that we achieve a new state-of-the-art.", EM-Network: Learning Better Latent Variable for Sequence-to-Sequence Models,https://openreview.net/forum?id=oWjudQ3w2y,https://openreview.net/pdf?id=oWjudQ3w2y,We propose a new sequence model that can learn a promising latent variable by allowing the target sequence as the model's additional input. It significantly advances the current SOTA approaches in speech recognition and machine translation tasks.,"In a sequence-to-sequence (seq2seq) framework, the use of an unobserved latent variable, such as latent alignment and representation, is important to address the mismatch problem between the source input and target output sequences. Existing seq2seq literature typically learns the latent space by only consuming the source input, which might produce a sub-optimal latent variable for predicting the target. Extending an expectation-maximization (EM)-like algorithm, we introduce the EM-Network, which can yield a promising latent variable by leveraging the target sequence as the model's additional training input. The target input is used as guidance to provide the target-side context and reduce the candidates of the latent variable. The proposed framework is trained in a new self-distillation setup, allowing the original sequence model to benefit from the latent variable of the EM-Network. Specifically, the EM-Network's prediction serves as a soft label for training the inner sequence model, which only takes the source as input. We theoretically show that our training objective can be a lower bound for the log-likelihood of the sequence model and is justified from the EM perspective. We conduct comprehensive experiments on two sequence learning tasks: speech recognition and machine translation. Experimental results demonstrate that the EM-Network significantly advances the current state-of-the-art self-supervised learning approaches. 
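The EM-Network abstract above describes a self-distillation setup in which a target-aware teacher supplies soft labels for a source-only inner model. A hedged sketch of such an objective (our own reading; the paper's exact loss may differ):

```python
import torch.nn.functional as F

def em_network_style_loss(student_logits, teacher_logits, targets,
                          alpha=0.5, tau=2.0):
    """Illustrative self-distillation objective in the spirit of EM-Network.

    `teacher_logits` come from a network that also consumes the target
    sequence as guidance; `student_logits` come from the source-only inner
    model. The teacher's prediction serves as a soft label for the student.
    A sketch under our own assumptions, not the paper's exact objective.
    """
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits.detach() / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    return alpha * hard + (1 - alpha) * soft
```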
It improves over the best prior work on speech recognition and establishes state-of-the-art performance on the WMT'14 and IWSLT'14 datasets. Moreover, the proposed method even achieves considerable performance improvement for fully supervised learning.","Connectionist temporal classification, Speech recognition, Machine translation" AutoFHE: Automated Adaption of CNNs for Efficient Evaluation over FHE,https://openreview.net/forum?id=Hq16Jk2bVlp,https://openreview.net/pdf?id=Hq16Jk2bVlp,Automated adaption of CNNs to the RNS-CKKS FHE scheme by jointly evolving polynomial activations (EvoReLUs) and searching for placement of bootstrapping operations.,"Secure inference of deep convolutional neural networks (CNNs) was recently demonstrated under the RNS-CKKS fully homomorphic encryption (FHE) scheme. The state-of-the-art solution uses a high-order composite polynomial to approximate non-arithmetic ReLUs and refreshes zero-level ciphertext through bootstrapping. However, this solution suffers from prohibitively high latency, both due to the number of levels consumed by the polynomials ($47\%$) and the inference time consumed by bootstrapping operations ($70\%$). Furthermore, it requires a hand-crafted architecture for homomorphically evaluating CNNs by placing a bootstrapping operation after every Conv-BN layer. To accelerate CNNs on FHE and automatically design a homomorphic evaluation architecture, we propose AutoFHE: Automated adaption of CNNs for evaluation over FHE. AutoFHE exploits the varying sensitivity of approximate activations across different layers in a network and jointly evolves polynomial activations (EvoReLUs) and searches for placement of bootstrapping operations for evaluation under RNS-CKKS. The salient features of AutoFHE include: i) a multi-objective co-evolutionary (MOCoEv) search algorithm to maximize validation accuracy and minimize the number of bootstrapping operations, ii) a gradient-free search algorithm, R-CCDE, to optimize EvoReLU coefficients, and iii) polynomial-aware training (PAT) to fine-tune polynomial-only CNNs for one epoch to adapt trainable weights to EvoReLUs. We demonstrate the efficacy of AutoFHE through the evaluation of ResNets on CIFAR-10 and CIFAR-100 under RNS-CKKS. Experimental results on CIFAR-10 indicate that in comparison to the state-of-the-art solution, AutoFHE reduces inference time (50 images on 50 threads) by 1,000 seconds and amortized inference time (per image) by $28\%$ and $17\%$ for ResNet-20 and ResNet-32, respectively.","Fully Homomorphic Encryption, Multi-Objective Co-Evolutionary Search, RNS-CKKS" REPRESENTATIVE PROTOTYPE WITH CONTRASTIVE LEARNING FOR SEMI-SUPERVISED FEW-SHOT CLASSIFICATION,https://openreview.net/forum?id=zNQ0IywxSuU,https://openreview.net/pdf?id=zNQ0IywxSuU,,"Few-shot learning aims to learn novel classes in the dataset with few samples per class, which is a very challenging task. To mitigate this issue, prior work obtains representative prototypes with semantic embeddings based on prototypical networks. However, these methods require abundant labeled samples, which is at odds with the few-shot setting. Therefore, we propose a new model framework to obtain representative prototypes with semi-supervised learning. Specifically, we introduce a dataset containing unlabeled samples to assist in training the model. More importantly, to fully utilize these unlabeled samples, we adopt a conditional variational autoencoder to construct more representative prototypes. 
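For context on the pipeline the abstract above builds on, the prototypical-network baseline computes one prototype per class as the mean of support embeddings and classifies queries by nearest prototype; the paper then refines these prototypes with unlabeled data and a CVAE. A minimal sketch of the baseline:

```python
import torch

def class_prototypes(support_feats, support_labels, n_classes):
    """Mean-of-embeddings prototypes, the standard prototypical-network step
    that the method above refines with unlabeled data and a CVAE."""
    return torch.stack([
        support_feats[support_labels == c].mean(dim=0)
        for c in range(n_classes)
    ])

def nearest_prototype(query_feats, protos):
    # Negative squared Euclidean distance acts as the classification logit.
    dists = torch.cdist(query_feats, protos) ** 2
    return (-dists).argmax(dim=1)
```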
Simultaneously, we develop a novel contrastive loss to improve the model generalization ability. We evaluate our method on miniImageNet and tieredImageNet benchmarks for both 1-shot and 5-shot settings and achieve better performance than state-of-the-art semi-supervised few-shot methods.",SEMI-SUPERVISED FEW-SHOT CLASSIFICATION Data-efficient Supervised Learning is Powerful for Neural Combinatorial Optimization,https://openreview.net/forum?id=a_yFkJ4-uEK,https://openreview.net/pdf?id=a_yFkJ4-uEK,,"Neural combinatorial optimization (NCO) is a promising learning-based approach to solve difficult combinatorial optimization problems. However, how to efficiently train a powerful NCO solver remains challenging. The widely-used reinforcement learning method suffers from sparse rewards and low data efficiency, while the supervised learning approach requires a large number of high-quality solutions. In this work, we develop efficient methods to extract sufficient supervised information from limited labeled data, which can significantly overcome the main shortcoming of supervised learning. To be specific, we propose a set of efficient data augmentation methods and a novel bidirectional loss to better leverage the equivalent properties of problem instances, which finally leads to a promising supervised learning approach. The thorough experimental studies demonstrate that our proposed method can achieve state-of-the-art performance on the traveling salesman problem (TSP) with only a small set of 50,000 labeled instances, while it also enjoys better generalization performance. We believe this somewhat surprising finding could lead to valuable rethinking of the value of efficient supervised learning for NCO.","combinatorial optimization, data augmentation, neural combinatorial optimization, learning to optimize" Temporally-Weighted Spike Encoding for Event-based Object Detection and Classification,https://openreview.net/forum?id=SEfxlDwL7fR,https://openreview.net/pdf?id=SEfxlDwL7fR,Performing spiking neural network-based classification and object detection using a new spike encoding method for event-based vision sensors.,"Event-based cameras exhibit high dynamic range and temporal precision that could make them ideal for detecting objects with high speeds and low relative luminance. These properties have made event-based cameras especially interesting for use in space domain awareness tasks, such as detecting dim, artificial satellites with high brightness backgrounds using ground-based optical sensors; however, the asynchronous nature of event-based data presents new challenges to performing object detection. While spiking neural networks (SNNs) have been shown to naturally complement the asynchronous and binary properties of event-based data, they also present a number of challenges in their training, such as the spike vanishing problem and the large number of timesteps required for maximizing classification and detection accuracy. Furthermore, the extremely high sampling rate of event-based sensors and the density of noisy space-based data collections can result in excessively large event streams within a short window of recording. We present a temporally-weighted spike encoding that greatly reduces the number of spikes derived from an event-based data stream, enabling the training of larger SNNs with fewer timesteps for maximal accuracy. 
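One plausible reading of the temporally-weighted spike encoding described above is to compress the event stream into a few frames, weighting events by recency within each temporal bin before thresholding into binary spikes. The linear recency weight and relative threshold below are our own illustrative choices, not necessarily the paper's exact scheme:

```python
import numpy as np

def temporally_weighted_encoding(events, n_bins=8, height=34, width=34,
                                 threshold=0.5):
    """Compress an event stream into `n_bins` binary spike frames.

    `events` is an (N, 4) array of (t, x, y, polarity) with x < width and
    y < height. Within each temporal bin, events are weighted by recency
    (later events count more) before thresholding into spikes.
    """
    frames = np.zeros((n_bins, height, width), dtype=np.float32)
    t = events[:, 0].astype(np.float64)
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)    # in [0, 1]
    bin_idx = np.minimum((t_norm * n_bins).astype(int), n_bins - 1)
    recency = 0.5 + 0.5 * (t_norm * n_bins - bin_idx)        # in [0.5, 1.0)
    xs, ys = events[:, 1].astype(int), events[:, 2].astype(int)
    np.add.at(frames, (bin_idx, ys, xs), recency)
    return (frames >= threshold * frames.max()).astype(np.uint8)
```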
We propose using this spike encoding with a variant of a convolutional SNN trained using surrogate spiking-neuron gradients with backpropagation-through-time (BPTT) for both classification and object detection tasks with an emphasis on space-domain awareness. To demonstrate the efficacy of our encoding and SNN approach, we present competitive classification accuracies on benchmark datasets N-MNIST (99.7%), DVS-CIFAR10 (74.0%), and N-Caltech101 (72.8%), as well as state-of-the-art object detection performance on event-based, satellite collections. ","Event-based vision, spiking neural networks, object detection, classification" Spiking Convolutional Neural Networks for Text Classification,https://openreview.net/forum?id=pgU3k7QXuz0,https://openreview.net/pdf?id=pgU3k7QXuz0,Spiking Convolutional Neural Networks for Text Classification,"Spiking neural networks (SNNs) offer a promising pathway to implement deep neural networks (DNNs) in a more energy-efficient manner since their neurons are sparsely activated and inferences are event-driven. However, there have been very few works that have demonstrated the efficacy of SNNs in language tasks, partially because it is non-trivial to represent words in the form of spikes and to deal with variable-length texts by SNNs. This work presents a ""conversion + fine-tuning"" two-step method for training SNNs for text classification and proposes a simple but effective way to encode pre-trained word embeddings as spike trains. We show empirically that after further fine-tuning with surrogate gradients, the converted SNNs achieve comparable results to their DNN counterparts across multiple datasets for both English and Chinese. We also demonstrate that such SNNs are more robust against adversarial attacks than DNNs.","Spiking neural networks, Text classification, Training method" RegCLR: A Self-Supervised Framework for Tabular Representation Learning in the Wild,https://openreview.net/forum?id=qV3g530QHhg,https://openreview.net/pdf?id=qV3g530QHhg,Regularized contrastive learning is effective in self-supervised tabular representation learning.,"Recent advances in self-supervised learning (SSL) using large models to learn visual representations from natural images are rapidly closing the gap between the results produced by fully supervised learning and those produced by SSL on downstream vision tasks. Inspired by this advancement and primarily motivated by the emergence of tabular and structured document image applications, we investigate which self-supervised pretraining objectives, architectures, and fine-tuning strategies are most effective. To address these questions, we introduce RegCLR, a new self-supervised framework that combines contrastive and regularized methods and is compatible with the standard Vision Transformer architecture. Then, RegCLR is instantiated by integrating masked autoencoders as a representative example of a contrastive method and enhanced Barlow Twins as a representative example of a regularized method with configurable input image augmentations in both branches. 
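For the word-embedding-to-spike-train encoding mentioned above, a standard option is Bernoulli rate coding of min-max-normalized embedding values; the paper's exact scheme may differ. A minimal sketch:

```python
import torch

def embeddings_to_spike_trains(emb, n_steps=16):
    """Bernoulli rate coding of word embeddings into spike trains.

    Each embedding dimension is min-max normalized into a firing probability,
    then sampled independently at every timestep. Rate coding is a common
    choice, used here purely for illustration.
    """
    lo, hi = emb.min(), emb.max()
    p = (emb - lo) / (hi - lo + 1e-9)             # firing probabilities
    return torch.bernoulli(p.unsqueeze(0).expand(n_steps, *p.shape))

# Usage: output has shape (n_steps, seq_len, embed_dim), values in {0, 1}.
```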
Several real-world table recognition scenarios (e.g., extracting tables from document images), ranging from standard Word and LaTeX documents to even more challenging electronic health records (EHR) computer screen images, have been shown to benefit greatly from the representations learned from this new framework, with detection average-precision (AP) improving by a relative 4.8% for Table, 11.8% for Column, and 11.1% for GUI objects over a previous fully supervised baseline on real-world EHR screen images.","self-supervised learning, representation learning, computer vision, table detection" Personalized Federated Learning with Feature Alignment and Classifier Collaboration,https://openreview.net/forum?id=SXZr8aDKia,https://openreview.net/pdf?id=SXZr8aDKia,,"Data heterogeneity is one of the most challenging issues in federated learning, which motivates a variety of approaches to learn personalized models for participating clients. One such approach in deep neural network-based tasks is employing a shared feature representation and learning a customized classifier head for each client. However, previous works do not utilize global knowledge during local representation learning and also neglect the fine-grained collaboration between local classifier heads, which limits the model generalization ability. In this work, we conduct explicit local-global feature alignment by leveraging global semantic knowledge for learning a better representation. Moreover, we quantify the benefit of classifier combination for each client as a function of the combining weights and derive an optimization problem for estimating optimal weights. Finally, extensive evaluation results on benchmark datasets with various heterogeneous data scenarios demonstrate the effectiveness of our proposed method.","Federated Learning, Personalization, Collaboration" Distributionally Robust Recourse Action,https://openreview.net/forum?id=E3ip6qBLF7,https://openreview.net/pdf?id=E3ip6qBLF7,Distributionally Robust Recourse Action framework generates a recourse action that has high probability of being valid under a mixture of model shifts.,"A recourse action aims to explain a particular algorithmic decision by showing one specific way in which the instance could be modified to receive an alternate outcome. Existing recourse generation methods often assume that the machine learning model does not change over time. However, this assumption does not always hold in practice because of data distribution shifts, and in this case, the recourse action may become invalid. To redress this shortcoming, we propose the Distributionally Robust Recourse Action (DiRRAc) framework, which generates a recourse action that has high probability of being valid under a mixture of model shifts. We first formulate the robustified recourse setup as a min-max optimization problem, where the max problem is specified by Gelbrich distance over an ambiguity set around the distribution of model parameters. Then we suggest a projected gradient descent algorithm to find a robust recourse according to the min-max objective. We also show that our DiRRAc framework can be extended to hedge against the misspecification of the mixture weights. Numerical experiments on synthetic data and three real-world datasets demonstrate the benefits of our proposed framework over state-of-the-art methods for generating robust recourses. 
","Robust Optimization, Explainable AI, Algorithmic Recourse" Randomized Smoothing with Masked Inference for Adversarially Robust NLP Systems,https://openreview.net/forum?id=zgWbA-AecP,https://openreview.net/pdf?id=zgWbA-AecP,We propose a new framework called randomized smoothing with masked inference (RSMI) for improving adversarial robustness of NLP systems.," Large-scale pre-trained language models have shown outstanding performance in a variety of NLP tasks. However, they are also known to be significantly brittle against specifically crafted adversarial examples, leading to increasing interest in probing the adversarial robustness of NLP systems. We introduce RSMI, a novel two-stage framework that combines randomized smoothing (RS) with masked inference (MI) to improve the adversarial robustness of NLP systems. RS transforms a classifier into a smoothed classifier to obtain robust representations, whereas MI forces a model to exploit the surrounding context of a masked token in an input sequence. RSMI improves adversarial robustness by 2 to 3 times over existing state-of-the-art methods on benchmark datasets. We also perform in-depth qualitative analysis to validate the effectiveness of the different stages of RSMI and probe the impact of its components through extensive ablations. By empirically proving the stability of RSMI, we put it forward as a practical method to robustly train large-scale NLP models. Our code and datasets are available at https://anonymous.4open.science/r/RSMI. ","Natural language processing, textual adversarial example, adversarial attack, adversarial robustness" Rethinking the Structure of Stochastic Gradients: Empirical and Statistical Evidence,https://openreview.net/forum?id=9xlU4lhri9,https://openreview.net/pdf?id=9xlU4lhri9,We rethink the heavy-tail phenomenon and the covariance structure of stochastic gradients via novel empirical and statistical evidences.,"It is well known that stochastic gradients significantly improve both optimization and generalization of deep neural networks (DNNs). Some works attempted to explain the success of stochastic optimization for deep learning by the arguably heavy-tail properties of gradient noise, while other works presented theoretical and empirical evidence against the heavy-tail hypothesis on gradient noise. Unfortunately, formal statistical tests for analyzing the structure and heavy tails of stochastic gradients in deep learning are still under-explored. In this paper, we mainly make two contributions. First, we conduct formal statistical tests on the distribution of stochastic gradients and gradient noise across both parameters and iterations. Our statistical tests reveal that dimension-wise gradients usually exhibit power-law heavy tails, while iteration-wise gradients and stochastic gradient noise caused by minibatch training usually do not exhibit power-law heavy tails. Second, we further discover that the covariance spectra of stochastic gradients have the power-law structures in deep learning. While previous papers believed that the anisotropic structure of stochastic gradients matters to deep learning, they did not expect the gradient covariance can have such an elegant mathematical structure. Our work challenges the existing belief and provides novel insights on the structure of stochastic gradients. 
The novel structure of stochastic gradients may help understand the success of stochastic optimization for deep learning.","Gradient Noise, SGD, Deep Learning" Representing Multi-view Time-series Graph Structures for Multivariate Long-term Time-series Forecasting,https://openreview.net/forum?id=44GCcwJ5X2,https://openreview.net/pdf?id=44GCcwJ5X2,"An efficient, highly accurate, lightweight model for multivariate long-term time series forecasting.","Multivariate long-term time-series forecasting is a very challenging task in real-world application areas, such as electricity consumption and influenza-like illness forecasting. At present, researchers are focusing on designing robust and effective models, and have achieved good results. However, there are several issues with existing models that need to be overcome to ensure they provide optimal performance. First, the lack of a relationship structure between multivariate variables needs to be addressed. Second, most models only have a weak ability to capture local dynamic changes across the entire long-term time-series. Third, current models suffer from high computational complexity and unsatisfactory accuracy. To address these issues, we propose a novel method called Multi-view Time-series Graph Structure Representation (MTGSR) for multivariate long-term time-series forecasting tasks. MTGSR uses graph convolutional networks (GCNs) to construct topological relationships in the multivariate long-term time-series from three different perspectives: time, dimension, and crossing segments. Variation trends in the different dimensions of the multivariate long-term time-series are extracted through a difference operation so as to construct a topological map that reflects the correlations between the different dimensions. Then, to capture the dynamically changing characteristics of the fluctuation correlations between adjacent local sequences, MTGSR constructs a cross graph by calculating the correlation coefficients between adjacent local sequences. Extensive experiments on five different datasets show that MTGSR reduces errors by 20.41% over the state-of-the-art while maintaining linear complexity. Additionally, memory use is decreased by 66.52% and running time is reduced by 78.09%. ","time series forecasting, deep learning, representational learning" Improving Language Model Pretraining with Text Structure Information,https://openreview.net/forum?id=N-S6pJrlkK,https://openreview.net/pdf?id=N-S6pJrlkK,A pretraining task that distinguishes text structure relationships between sentences can improve general-purpose language model pretraining.,"Inter-sentence pretraining tasks learn from sentence relationships and facilitate high-level language understanding that cannot be directly learned in word-level pretraining tasks. However, we have found experimentally that existing inter-sentence methods for general-purpose language pretraining improve performance only at a relatively small scale but not at larger scales. As an alternative, we propose Text Structure Prediction (TSP), a more sophisticated inter-sentence task that uses text structure to provide more abundant self-supervised learning signals to pretraining models at larger scales. TSP classifies sentence pairs over six designed text structure relationships, and it can be seen as an implicit form of learning high-level language understanding by identifying key concepts and relationships in texts. 
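For the power-law tail tests discussed in the stochastic-gradient abstract above, a standard statistic is the Hill estimator of the tail index, computed from the largest-magnitude samples. A minimal sketch (k is an illustrative choice and must be smaller than the sample size):

```python
import numpy as np

def hill_tail_index(samples, k=200):
    """Hill estimator of the power-law tail index from the k largest values.

    A smaller estimate means heavier tails; this is the classical statistic
    underlying formal power-law tests such as those applied to
    dimension-wise gradients.
    """
    x = np.sort(np.abs(np.asarray(samples, dtype=float)))[::-1]
    k = min(k, len(x) - 1)                    # need a pivot at index k
    top, pivot = x[:k], max(x[k], 1e-12)
    return k / np.sum(np.log(top / pivot))
```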
Experiments show that TSP provides improved performance on language understanding tasks for models at various scales. Our approach thus serves as an initial attempt to demonstrate that the exploitation of text structure can facilitate language understanding.","Language Model Pretraining, Representation Learning" Can Language Models Make Fun? A Case Study in Chinese Comical Crosstalk,https://openreview.net/forum?id=9wx-QXt-JaN,https://openreview.net/pdf?id=9wx-QXt-JaN,Testing whether AI could make fun!,"Language is the principal tool for human communication, in which humor is one of the most attractive parts. Producing natural language like humans using computers, a.k.a. Natural Language Generation (NLG), has been widely used for dialogue systems, chatbots, machine translation, as well as computer-aided creation, e.g., idea generation and scriptwriting. However, the humor aspect of natural language is relatively under-investigated, especially in the age of pre-trained language models. In this work, we aim to preliminarily test whether \textit{NLG can generate humor as humans do}. We build a new dataset consisting of numerous digitized \textbf{C}hinese \textbf{C}omical \textbf{C}rosstalk scripts (called \textbf{C}$^3$ in short), scripts of a popular Chinese performing art called `Xiangsheng' or `相声' dating back to the 1800s \footnote{For the convenience of non-Chinese speakers, we use `crosstalk' for `Xiangsheng' in this paper.}. We benchmark various generation approaches including training-from-scratch Seq2seq, fine-tuned middle-scale PLMs, and large-scale PLMs (with and without fine-tuning). Moreover, we also conduct a human assessment, showing that 1) \textit{large-scale pretraining largely improves crosstalk generation quality}; and 2) \textit{even the scripts generated from the best PLM are far from what we expect}. We conclude that humor generation could be largely improved using large-scale PLMs, but it is still in its infancy. The data and benchmarking code are publicly available at \url{https://github.com/anonNo2/crosstalk-generation}.","humor generation, Chinese crosstalk, pre-trained language model, GPT, natural language generation" Chasing Better Deep Image Priors Between Over- and Under-parameterization,https://openreview.net/forum?id=JwUmXwqXhr,https://openreview.net/pdf?id=JwUmXwqXhr,,"Deep Neural Networks (DNNs) are well-known to act as over-parameterized deep image priors (DIP) that regularize various image inverse problems. Meanwhile, researchers also proposed extremely compact, under-parameterized image priors (e.g., deep decoder) that are strikingly competent for image restoration too, despite a loss of accuracy. These two extremes push us to think whether there exists a better solution in the middle: between over- and under-parameterized image priors, can one identify ""intermediate"" parameterized image priors that achieve better trade-offs between performance, efficiency, and even preserving strong transferability? Drawing inspiration from the lottery ticket hypothesis (LTH), we conjecture and study a novel ""lottery image prior"" (LIP) by exploiting DNN inherent sparsity, stated as: given an over-parameterized DNN-based image prior, it will contain a sparse subnetwork that can be trained in isolation, to match the original DNN's performance when being applied as a prior to various image inverse problems. Our results validate the superiority of LIPs: we can successfully locate the LIP subnetworks from over-parameterized DIPs at substantial sparsity ranges. 
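The subnetwork-finding step behind the lottery image priors above is typically iterative magnitude pruning in LTH pipelines. A minimal sketch of one global pruning round, assuming a PyTorch model; the rewind-and-retrain steps are omitted:

```python
import torch

def magnitude_prune(model, sparsity=0.8):
    """One round of global magnitude pruning (lottery-ticket style).

    Returns a {name: 0/1 mask} keeping the largest-magnitude weights. In a
    standard LTH pipeline the surviving weights are then rewound to their
    initial values and retrained in isolation.
    """
    all_w = torch.cat([p.detach().abs().flatten()
                       for p in model.parameters() if p.dim() > 1])
    k = max(1, int(sparsity * all_w.numel()))
    threshold = all_w.kthvalue(k).values
    return {name: (p.detach().abs() > threshold).float()
            for name, p in model.named_parameters() if p.dim() > 1}
```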
Those LIP subnetworks significantly outperform deep decoders under comparably compact model sizes (by often fully preserving the effectiveness of their over-parameterized counterparts), and they also possess high transferability across different images as well as restoration task types. Besides, we also extend LIP to compressive sensing image reconstruction, where a pre-trained GAN generator is used as the prior (in contrast to untrained DIP or deep decoder), and confirm its validity in this setting too. To the best of our knowledge, this is the first time that LTH is demonstrated to be relevant in the context of inverse problems or image priors. Codes will be publicly available upon acceptance.", Generalizable Person Re-identification Without Demographics,https://openreview.net/forum?id=917v6o8fO7,https://openreview.net/pdf?id=917v6o8fO7,,"Domain generalizable person re-identification (DG-ReID) aims to learn a ready-to-use domain-agnostic model directly for cross-dataset/domain evaluation, while current methods mainly explore the demographic information such as domain and/or camera labels for domain-invariant representation learning. However, the above-mentioned demographic information is not always accessible in practice due to privacy and security issues. In this paper, we consider the problem of person re-identification in a more general setting, i.e., domain generalizable person re-identification without demographics (\textbf{DGWD-ReID}). To address the underlying uncertainty of domain distribution, we introduce distributionally robust optimization (DRO) to learn robust person re-identification models that perform well on all possible data distributions within the uncertainty set without demographics. However, directly applying the popular Kullback-Leibler divergence constrained DRO (or KL-DRO) fails to generalize well under the distribution shifts in real-world scenarios, since the convex condition may not hold for overparameterized neural networks. Inspired by this, we analyze and reformulate the popular KL-DRO by applying the change-of-measure technique, and then propose a simple yet efficient approach, \textbf{Unit-DRO}, which minimizes the loss over a new dataset with hard samples upweighted and other samples downweighted. We perform extensive experiments on both domain generalizable and cross-domain person re-identification tasks, and the empirical results on several large-scale benchmarks show that Unit-DRO achieves superior performance compared to all baselines without using demographics. ","Generalizable Person Re-Identification, Distributionally robust optimization, Change-of-measure technique" Simple Yet Effective Graph Contrastive Learning for Recommendation,https://openreview.net/forum?id=FKXVK9dyMM,https://openreview.net/pdf?id=FKXVK9dyMM,A new lightweight graph contrastive learning approach to enhance recommender systems,"Graph neural networks (GNNs) are a powerful learning approach for graph-based recommender systems. Recently, GNNs integrated with contrastive learning have shown superior performance with data augmentation for recommendation, with the aim of dealing with highly sparse data. Despite their success, most existing graph contrastive learning methods either perform stochastic augmentation (e.g., node/edge perturbation) on the user-item interaction graph, or rely on heuristic-based augmentation techniques (e.g., user clustering) for generating contrastive views. 
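The Unit-DRO idea above, upweighting hard samples and downweighting the rest, can be illustrated with an exponential tilting of per-sample losses; the paper's exact change-of-measure weights may differ. A minimal sketch:

```python
import torch

def unit_dro_style_loss(per_sample_loss, temperature=1.0):
    """Hard-sample upweighting in the spirit of Unit-DRO (a sketch).

    Weights are a softmax over detached per-sample losses, so high-loss
    (hard) samples are upweighted and easy ones downweighted.
    """
    w = torch.softmax(per_sample_loss.detach() / temperature, dim=0)
    return (w * per_sample_loss).sum()
```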
We argue that these methods cannot adequately preserve the intrinsic semantic structures and are easily biased by noise perturbations. In this paper, we propose a simple yet effective graph contrastive learning paradigm, LightGCL, that mitigates these issues that negatively impact the generality and robustness of CL-based recommenders. Our model exclusively utilizes singular value decomposition for contrastive augmentation, which enables unconstrained structure refinement with global collaborative relation modeling. Experiments conducted on several benchmark datasets demonstrate that our method significantly improves performance over state-of-the-art methods. Further analyses show the superiority of LightGCL's robustness against data sparsity and popularity bias. The source code of our model is available at https://anonymous.4open.science/r/LightGCL/.","recommender systems, graph neural networks, contrastive learning" Clustering-Assisted Foreground and Background Separation for Weakly-supervised Temporal Action Localization,https://openreview.net/forum?id=i7hWcu3t0S,https://openreview.net/pdf?id=i7hWcu3t0S,,"Weakly-supervised temporal action localization aims to localize action instances in videos with only video-level action labels. Existing methods mainly embrace a localization-by-classification pipeline that optimizes snippet-level prediction with a video classification loss. However, this formulation suffers from the discrepancy between classification and detection, resulting in noisy foreground and background (F&B) snippet separation. To alleviate this problem, we propose to explore the underlying structure among the snippets by unsupervised snippet clustering, rather than heavily relying on the video classification loss. Specifically, we propose a novel clustering-assisted F&B separation network dubbed CASE, which achieves F&B separation by two core components: a snippet clustering component that groups the snippets into multiple latent clusters and a cluster classification component that attempts to further classify the cluster as foreground or background. In the absence of ground-truth labels to train these two components, we propose to adopt an online self-training algorithm that allows online interaction of pseudo-label rectification and model training. More importantly, we propose a distribution-constrained labeling strategy that utilizes different priors to regularize the distribution of the pseudo-labels, so as to reinforce the quality of the pseudo-labels. With the aid of the online self-training algorithm and distribution-constrained labeling strategy, our method is able to exploit the latent clusters that are simultaneously typical to snippets and discriminative to F&B. Thereby, the cluster assignments of the snippets can be associated with their F&B labels to enable the F&B separation. The effectiveness of the proposed CASE is demonstrated by the experimental results on three publicly available benchmarks: THUMOS14, ActivityNet v1.2 and v1.3. 
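LightGCL's augmentation, as described above, replaces stochastic perturbations with a truncated-SVD reconstruction of the interaction graph used as the contrastive view. A minimal sketch, using a dense adjacency purely for illustration:

```python
import torch

def svd_augmented_view(adj, rank=5):
    """Low-rank contrastive view via truncated SVD, in the spirit of LightGCL.

    `adj` is a user-item interaction matrix (dense here for simplicity). The
    rank-`rank` reconstruction captures global collaborative signals and
    serves as the augmented graph for contrastive alignment.
    """
    u, s, v = torch.svd_lowrank(adj, q=rank)
    return u @ torch.diag(s) @ v.T          # reconstructed adjacency view
```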
","weakly-supervised temporal action localization, online clustering" MemoNav: Working Memory Model for Visual Navigation,https://openreview.net/forum?id=9dFQcu9vmX,https://openreview.net/pdf?id=9dFQcu9vmX,"The MemoNav learns three types of scene representations, which contain goal-relevant information and scene-level features and are utilized to improve navigation performance in both 1-goal and multi-goal ImageNav tasks.","We present MemoNav, a novel memory model for image-goal navigation, which utilizes a working memory-inspired pipeline to improve navigation performance. Specifically, the node features on the topological map are stored in the short-term memory (STM), as these features are dynamically updated. The MemoNav retains the informative fraction of the STM via a forgetting module to improve navigation efficiency. To learn a global representation of 3D scenes, we introduce long-term memory (LTM) that continuously aggregates the STM. Afterward, a graph attention module encodes the retained STM and the LTM to generate working memory (WM). After encoding, the WM contains the informative features in the retained STM and the scene-level feature in the LTM and is finally used to generate actions. Consequently, the synergy of these three types of memory increases navigation performance by selectively retaining goal-relevant information and learning a high-level scene feature. When evaluated on multi-goal tasks, the MemoNav outperforms the SoTA methods at all difficulty levels in both Gibson and Matterport3D scenes. The MemoNav also achieves consistent improvements on traditional 1-goal tasks. Moreover, the qualitative results show that our model is less likely to be trapped in a deadlock.","Image-Goal Navigation, Memory mechanism, Embodied visual navigation, Embodied AI" Write and Paint: Generative Vision-Language Models are Unified Modal Learners,https://openreview.net/forum?id=HgQR0mXQ1_a,https://openreview.net/pdf?id=HgQR0mXQ1_a,"The paper proposes a simple, scalable, and versatile seq2seq foundation model, which is capable of vision-language understanding, image-to-text generation, and text-to-image generation with a single unified architecture","Recent advances in vision-language pre-training have pushed the state-of-the-art on various vision-language tasks, making machines more capable of multi-modal writing (image-to-text generation) and painting (text-to-image generation). However, few studies investigate if these two essential capabilities can be learned together and boost each other, making a versatile and powerful multi-modal foundation model. In this work, we disclose the potential of symmetrical generative vision-language pre-training in learning to write and paint concurrently, and propose a new unified modal model, named DaVinci, trained with prefix language modeling and prefix image modeling, a simple generative self-supervised objective on image-text pairs. Thanks to the proposed prefix multi-modal modeling framework, DaVinci is simple to train, scalable to huge data, adaptable to both writing and painting tasks, and also strong on other vision, text, and multi-modal understanding tasks. DaVinci achieves competitive performance on a wide range of 27 generation/understanding tasks and demonstrates the superiority of combining vision/language generative pre-training. Furthermore, we carefully benchmark the performance of different vision-language pre-training objectives on different scales of pre-training dataset on a heterogeneous and broad distribution coverage. 
Our results demonstrate the potential of exploiting self-supervision in both language and vision inputs, and establish new, stronger baselines for future comparisons at different data scales. The code and pre-trained models will be released in the final version.","Foundation model, Multi-modal learning, Vision-language pre-training" Progressive Voronoi Diagram Subdivision Enables Accurate Data-free Class-Incremental Learning,https://openreview.net/forum?id=zJXg_Wmob03,https://openreview.net/pdf?id=zJXg_Wmob03,We show that progressive Voronoi Diagram is a powerful model for Class-incremental Learning.,"Data-free Class-incremental Learning (CIL) is a challenging problem because rehearsing data from previous phases is strictly prohibited, causing catastrophic forgetting of Deep Neural Networks (DNNs). In this paper, we present \emph{iVoro}, a novel framework derived from computational geometry. We find that the Voronoi Diagram (VD), a classical model for space subdivision, is especially powerful for solving the CIL problem, because a VD can be constructed favorably in an incremental manner -- newly added sites (classes) affect only the proximate classes, so non-contiguous classes are hardly forgotten. Furthermore, we bridge DNN and VD using Power Diagram Reduction, and show that the VD structure can be progressively refined along the phases using a divide-and-conquer algorithm. Moreover, our VD construction is not restricted to the deep feature space, but is also applicable to multiple intermediate feature spaces, promoting VD to be multilayer VD that efficiently captures multi-grained features from DNN. Importantly, \emph{iVoro} is also capable of handling uncertainty-aware test-time Voronoi cell assignment and has exhibited high correlations between geometric uncertainty and predictive accuracy (up to ${\sim}0.9$). Putting everything together, \emph{iVoro} achieves up to $25.26\%$, $37.09\%$, and $33.21\%$ improvements on CIFAR-100, TinyImageNet, and ImageNet-Subset, respectively, compared to the state-of-the-art non-exemplar CIL approaches. In conclusion, \emph{iVoro} enables highly accurate, privacy-preserving, and geometrically interpretable CIL that is particularly useful when cross-phase data sharing is forbidden, e.g. in medical applications.","Voronoi Diagram, Computational Geometry" Data Valuation Without Training of a Model,https://openreview.net/forum?id=XIzO8zr-WbM,https://openreview.net/pdf?id=XIzO8zr-WbM,"In this paper, we define a training-free data valuation score, which can be directly computed from data and can effectively quantify the impact of individual instances in optimization and generalization of neural networks.","Many recent works on understanding deep learning try to quantify how much individual data instances influence the optimization and generalization of a model, either by analyzing the behavior of the model during training or by measuring the performance gap of the model when the instance is removed from the dataset. Such approaches reveal characteristics and importance of individual instances, which may provide useful information in diagnosing and improving deep learning. However, most of the existing works on data valuation require actual training of a model, which often demands a high computational cost. In this paper, we provide a training-free data valuation score, called complexity-gap score, which is a data-centric score to quantify the influence of individual instances in generalization of two-layer overparameterized neural networks. 
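The incremental property iVoro exploits above can be illustrated with its simplest instance: Voronoi cell assignment over class sites reduces to nearest-centroid classification, and adding a class merely inserts a new site. A simplified sketch (the paper additionally uses power-diagram reduction and multilayer features):

```python
import numpy as np

class IncrementalVoronoiClassifier:
    """Nearest-site Voronoi cell assignment over class centroids.

    Adding a new class inserts one site and leaves existing sites untouched,
    mirroring the incremental VD construction described above.
    """
    def __init__(self):
        self.sites, self.labels = [], []

    def add_class(self, feats, label):
        self.sites.append(feats.mean(axis=0))   # one site per new class
        self.labels.append(label)

    def predict(self, x):
        d = np.linalg.norm(np.stack(self.sites) - x, axis=1)
        return self.labels[int(d.argmin())]
```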
The proposed score can quantify the irregularity of instances and measure how much each data instance contributes to the total movement of the network parameters during training. We theoretically analyze and empirically demonstrate the effectiveness of the complexity-gap score in finding 'irregular or mislabeled' data instances, and also provide applications of the score in analyzing datasets and diagnosing training dynamics. ","Data valuation, generalization error bounds, complexity-gap score, data pruning, training dynamics" HotProtein: A Novel Framework for Protein Thermostability Prediction and Editing,https://openreview.net/forum?id=YDJRFWBMNby,https://openreview.net/pdf?id=YDJRFWBMNby,A new dataset and a novel learning framework for protein thermostability prediction and editing.,"The molecular basis of protein thermal stability is only partially understood and has major significance for drug and vaccine discovery. The lack of datasets and standardized benchmarks considerably limits learning-based discovery methods. We present $\texttt{HotProtein}$, a large-scale protein dataset with \textit{growth temperature} annotations of thermostability, containing $182$K amino acid sequences and $3$K folded structures from $230$ different species with a wide temperature range $-20^{\circ}\texttt{C}\sim 120^{\circ}\texttt{C}$. Due to functional domain differences and data scarcity within each species, existing methods fail to generalize well on our dataset. We address this problem through a novel learning framework, consisting of ($1$) Protein structure-aware pre-training (SAP) which leverages 3D information to enhance sequence-based pre-training; ($2$) Factorized sparse tuning (FST) that utilizes low-rank and sparse priors as an implicit regularization, together with feature augmentations. Extensive empirical studies demonstrate that our framework improves thermostability prediction compared to other deep learning models. Finally, we propose a novel editing algorithm to efficiently generate positive amino acid mutations that improve thermostability. Codes and datasets will be publicly released. ","Protein Thermostability, Protein Editing, Dataset, Structure-aware Pre-training, Factorized Sparse Tuning" Agent-Controller Representations: Principled Offline RL with Rich Exogenous Information,https://openreview.net/forum?id=gLl0fZQo6Vu,https://openreview.net/pdf?id=gLl0fZQo6Vu,representation learning methods for robustness to exogenous information based new offline RL benchmarks,"Learning to control an agent from data collected offline in a rich pixel-based visual observation space is vital for real-world applications of reinforcement learning (RL). A major challenge in this setting is the presence of input information that is hard to model and irrelevant to controlling the agent. This problem has been approached by the theoretical RL community through the lens of exogenous information, i.e., any control-irrelevant information contained in observations. For example, a robot navigating in busy streets needs to ignore irrelevant information, such as other people walking in the background, textures of objects, or birds in the sky. In this paper, we focus on the setting with visually detailed exogenous information, and introduce new offline RL benchmarks offering the ability to study this problem. We find that contemporary representation learning techniques can fail on datasets where the noise is a complex and time-dependent process, which is prevalent in practical applications. 
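A hedged sketch of the low-rank-plus-sparse prior behind HotProtein's factorized sparse tuning (FST) as we read it: the frozen pre-trained weight is augmented with a trainable low-rank factor pair and a sparsely masked delta. The rank, sparsity level, and random mask below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FactorizedSparseLinear(nn.Module):
    """W_eff = W0 + U V^T + M * S, with W0 frozen (an illustrative sketch).

    U/V are low-rank factors, S is a dense delta masked by a fixed sparse
    0/1 mask M; only U, V, and S receive gradients.
    """
    def __init__(self, w0, rank=4, sparsity=0.99):
        super().__init__()
        out_f, in_f = w0.shape
        self.w0 = nn.Parameter(w0, requires_grad=False)
        self.u = nn.Parameter(torch.zeros(out_f, rank))
        self.v = nn.Parameter(torch.randn(in_f, rank) * 0.01)
        self.s = nn.Parameter(torch.zeros(out_f, in_f))
        self.register_buffer(
            "mask", (torch.rand(out_f, in_f) > sparsity).float())

    def forward(self, x):
        w = self.w0 + self.u @ self.v.T + self.mask * self.s
        return x @ w.T
```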
To address this, we propose to use multi-step inverse models, which have seen a great deal of interest in the RL theory community, to learn Agent-Controller Representations for Offline-RL (ACRO). Despite being simple and requiring no reward, we show theoretically and empirically that the representation created by this objective greatly outperforms baselines. ","offline RL, exogenous information, representation learning, latent state recovery, robustness" RPM: Generalizable Behaviors for Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=HnSceSzlfrY,https://openreview.net/pdf?id=HnSceSzlfrY,,"Despite the recent advancement in multi-agent reinforcement learning (MARL), the MARL agents easily overfit the training environment and perform poorly in the evaluation scenarios where other agents behave differently. Obtaining generalizable policies for MARL agents is thus necessary but challenging mainly due to complex multi-agent interactions. In this work, we model the problem with Markov Games and propose a simple yet effective method, ranked policy memory (RPM), to collect diverse multi-agent trajectories for training MARL policies with good generalizability. The main idea of RPM is to maintain a look-up memory of policies. In particular, we try to acquire various levels of behaviors by saving policies via ranking the training episode return, i.e., the episode return of agents in the training environment; when an episode starts, the learning agent can then choose a policy from the RPM as the behavior policy. This novel self-play training framework leverages agents' past policies and guarantees the diversity of multi-agent interaction in the training data. We implement RPM on top of MARL algorithms and conduct extensive experiments on Melting Pot. It has been demonstrated that RPM enables MARL agents to interact with unseen agents in multi-agent generalization evaluation scenarios and complete given tasks, and it significantly boosts performance by up to 402% on average.
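The ranked policy memory described above can be sketched as a lookup table of policy snapshots keyed by coarse return ranks, from which a behavior policy is sampled at episode start. The bucket granularity here is an illustrative choice:

```python
import random

class RankedPolicyMemory:
    """Sketch of RPM: policy snapshots keyed by coarse training-return ranks.

    Sampling uniformly over ranks (then over snapshots within a rank) keeps
    agents exposed to diverse levels of partner behavior during self-play.
    """
    def __init__(self, bucket_size=10.0):
        self.bucket_size = bucket_size
        self.memory = {}                        # rank -> list of snapshots

    def save(self, episode_return, policy_params):
        rank = int(episode_return // self.bucket_size)
        self.memory.setdefault(rank, []).append(policy_params)

    def sample(self):
        rank = random.choice(list(self.memory))
        return random.choice(self.memory[rank])
```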
Empirically, we show that BPR combined with existing state-of-the-art Offline RL algorithms leads to significant improvements across several offline control benchmarks.","offline reinforcement learning, representation learning" How Does Adaptive Optimization Impact Local Neural Network Geometry?,https://openreview.net/forum?id=h5z_RaWLdG1,https://openreview.net/pdf?id=h5z_RaWLdG1,We study the difference between the local geometry of the training objective in deep learning using Adaptive algorithms and SGD.,"Adaptive optimization methods are well known to achieve superior convergence relative to vanilla gradient methods. The traditional viewpoint in optimization, particularly in convex optimization, explains this improved performance by arguing that, unlike vanilla gradient schemes, adaptive algorithms mimic the behavior of a second-order method by adapting to the global geometry of the loss function. We argue that in the context of neural network optimization, this traditional viewpoint is insufficient. Instead, we advocate for a local trajectory analysis. For iterate trajectories produced by running a generic optimization algorithm OPT, we introduce $R^{\text{OPT}}_{\text{med}}$, a statistic that is analogous to the condition number of the loss Hessian evaluated at the iterates. Through extensive experiments, we show that adaptive methods such as Adam bias the trajectories towards regions where $R^{\text{Adam}}_{\text{med}}$ is small, where one might expect faster convergence. By contrast, vanilla gradient methods like SGD bias the trajectories towards regions where $R^{\text{SGD}}_{\text{med}}$ is comparatively large. We complement these empirical observations with a theoretical result that provably demonstrates this phenomenon in the simplified setting of a two-layer linear network. We view our findings as evidence for the need for a new explanation of the success of adaptive methods, one that is different from the conventional wisdom.","optimization, adaptive algorithms, neural networks" Substructured Graph Convolution for Non-overlapping Graph Decomposition,https://openreview.net/forum?id=NTTc8wZktaT,https://openreview.net/pdf?id=NTTc8wZktaT,A novel graph convolution for non-overlapping graph decomposition.,"Graph convolutional networks have been widely used to solve graph problems such as node classification, link prediction, and recommender systems. It is well known that large graphs require large amounts of memory and time to train graph convolutional networks. To deal with large graphs, many approaches have been explored, such as graph sampling and decomposition. In particular, graph decomposition has the advantage of parallel computation, but information loss occurs in the interface part. In this paper, we propose a novel substructured graph convolution that reinforces the interface part lost by graph decomposition. Numerical results indicate that the proposed method is robust to the number of subgraphs compared to other methods.","Graph convolution, non-overlapping graph decomposition, parallel computation, substructuring method" Concentric Ring Loss for Face Forgery Detection,https://openreview.net/forum?id=PxwqKdOshWI,https://openreview.net/pdf?id=PxwqKdOshWI,,"Due to growing societal concerns about indistinguishable deepfake images, face forgery detection has received an increasing amount of interest in computer vision. 
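The BPR recipe above is a two-phase procedure: behavior-clone to learn the encoder, then freeze it and run any offline RL algorithm on top. A minimal sketch of phase one, assuming continuous actions and an MSE objective:

```python
import torch
import torch.nn.functional as F

def pretrain_bpr_encoder(encoder, action_head, dataset, epochs=10, lr=3e-4):
    """Phase 1 of BPR: learn the state representation by behavior cloning.

    The encoder is trained to predict the dataset action from the raw state;
    it is then frozen so that an off-the-shelf offline RL algorithm can be
    trained on top of phi(s) in phase 2.
    """
    params = list(encoder.parameters()) + list(action_head.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for states, actions in dataset:
            loss = F.mse_loss(action_head(encoder(states)), actions)
            opt.zero_grad()
            loss.backward()
            opt.step()
    for p in encoder.parameters():              # freeze for phase 2
        p.requires_grad = False
    return encoder
```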
Since the differences between real and fake images are frequently small, improving the discriminative ability of learnt features is one of the primary problems in deepfake detection. In this paper, we propose a novel Concentric Ring Loss (CRL) to encourage the model to learn intra-class compressed and inter-class separated features. Specifically, we independently add margin penalties in angular and Euclidean space to force a more significant margin between real and fake images, and hence encourage better discriminating performance. Compared to the softmax loss, CRL explicitly encourages intra-class compactness and inter-class separability. Moreover, a frequency-aware feature learning module is proposed to exploit high-frequency features and further improve the generalization ability of the model. Extensive experiments demonstrate the superiority of our method across different datasets. We show that CRL consistently outperforms the state-of-the-art by a large margin.","face forgery detection, metric learning" MaskConver: A Universal Panoptic and Semantic Segmentation Model with Pure Convolutions,https://openreview.net/forum?id=WW7BJ15ivoo,https://openreview.net/pdf?id=WW7BJ15ivoo,Universal panoptic segmentation model with pure convolutions tailored for mobile devices.,"Universal panoptic segmentation models have achieved state-of-the-art quality by using transformers for predicting masks. However, in mobile applications, transformer models are not computation-friendly due to the quadratic complexity with respect to the input length. In this work, we present MaskConver, a unified panoptic and semantic segmentation model with pure convolutions, which is optimized for mobile devices. We propose a novel lightweight mask embedding decoder to predict mask embeddings. These mask embeddings are used to predict a set of binary masks for both things and stuff classes. MaskConver achieves \textbf{37.2\%} panoptic quality score on COCO validation set, which is \textbf{6.4\%} better than Panoptic DeepLab with the same MobileNet backbone. After mobile-specific optimizations, MaskConver runs at \textbf{30} FPS and delivers 29.7\% panoptic quality score on a Pixel 6, making it a real-time model, which is 10$\times$ faster than Panoptic DeepLab using the same backbone.","panoptic segmentation, semantic segmentation, convolutional networks, mobile models" On the Neural Tangent Kernel of Equilibrium Models,https://openreview.net/forum?id=gnULZPMCPz,https://openreview.net/pdf?id=gnULZPMCPz,,"This work studies the neural tangent kernel (NTK) of the deep equilibrium (DEQ) model, a practical ``infinite-depth'' architecture which directly computes the infinite-depth limit of a weight-tied network via root-finding. Even though the NTK of a fully-connected neural network is stochastic if its width and depth both tend to infinity simultaneously, we show that, in contrast, a DEQ model still enjoys a deterministic NTK despite its width and depth going to infinity at the same time. 
Moreover, such a deterministic NTK can be found efficiently via root-finding.","Equilibrium model, neural tangent kernel" From Play to Policy: Conditional Behavior Generation from Uncurated Robot Data,https://openreview.net/forum?id=c7rM7F7jQjN,https://openreview.net/pdf?id=c7rM7F7jQjN,"We train a transformer-based model on uncurated play data, which can produce targeted real-world robot policies by conditioning on future observations.","While large-scale sequence modelling from offline data has led to impressive performance gains in natural language generation and image generation, directly translating such ideas to robotics has been challenging. One critical reason for this is that uncurated robot demonstration data, i.e. play data, collected from non-expert human demonstrators are often noisy, diverse, and distributionally multi-modal. This makes extracting useful, task-centric behaviors from such data a difficult generative modelling problem. In this work, we present Conditional Behavior Transformers (C-BeT), a method that combines the multi-modal generation ability of Behavior Transformer with future-conditioned goal specification. On a suite of simulated benchmark tasks, we find that C-BeT improves upon prior state-of-the-art work in learning from play data by an average of 45.7%. Further, we demonstrate for the first time that useful task-centric behaviors can be learned on a real-world robot purely from play data without any task labels or reward information. Robot videos are best viewed on our project website: cbet-anon.github.io","behavior generation, robot manipulation, learning from play" Causal Knowledge Transfer from Task Affinity,https://openreview.net/forum?id=lUr4FoC1wCx,https://openreview.net/pdf?id=lUr4FoC1wCx,,"Recent developments in deep representation models through counterfactual balancing have led to a promising framework for estimating Individual Treatment Effects (ITEs) that are essential to causal inference in the Neyman-Rubin potential outcomes framework. While Randomized Control Trials are vital to understanding causal effects, they are sometimes infeasible, costly, or unethical to conduct. Motivated by these potential obstacles to data acquisition, we focus on transferring the causal knowledge acquired in prior experiments to new scenarios for which only limited data is available. To this end, we first observe that the absolute values of ITEs are invariant under the action of the symmetric group on the labels of treatments. Given this invariance, we propose a symmetrized task distance for calculating the similarity of a target scenario with those encountered before. The aforementioned task distance is then used to transfer causal knowledge from the closest of all the available previously learned tasks to the target scenario. We provide upper bounds on the counterfactual loss and ITE error of the target task, indicating the transferability of causal knowledge. Empirical studies are provided for various real-world, semi-synthetic, and synthetic datasets demonstrating that the proposed symmetrized task distance is strongly related to the estimation of the counterfactual loss. Numerical results indicate that transferring causal knowledge reduces the amount of required data by up to 95\% when compared to training from scratch. These results reveal the promise of our method when applied to important albeit challenging real-world scenarios such as transferring the knowledge of treatment effects (e.g., medicine, social policy, personal training, etc.) 
studied on a population to other groups absent in the study.","Causal Inference, Transfer Learning" "Beyond Counting Linear Regions of Neural Networks, Simple Linear Regions Dominate!",https://openreview.net/forum?id=uFWSIObdx5H,https://openreview.net/pdf?id=uFWSIObdx5H,,"Functions represented by a neural network with the widely-used ReLU activation are piecewise linear functions over linear regions (polytopes). Figuring out the properties of such polytopes is of fundamental importance for the development of neural networks. So far, both theoretical and empirical studies of polytopes have stayed at the level of counting their number. Despite successes in explaining the power of depth and so on, counting the number of polytopes puts all polytopes on an equal footing, which is essentially an incomplete characterization of polytopes. Beyond counting, here we study the shapes of polytopes via the number of simplices obtained by triangulations of polytopes. First, we demonstrate the properties of the number of simplices in triangulations of polytopes, and compute the upper and lower bounds of the maximal number of simplices that a network can generate. Next, by computing and analyzing the histogram of simplices across polytopes, we find that a ReLU network has surprisingly uniform and simple polytopes, although these polytopes theoretically can be rather diverse and complicated. This finding is a novel implicit bias that concretely reveals what kind of simple functions a network learns and sheds light on why deep learning does not overfit. Lastly, we establish a theorem to illustrate why polytopes produced by a deep network are simple and uniform. The core idea of the proof is counter-intuitive: adding depth probably does not create a more complicated polytope. We hope our work can inspire more research into investigating polytopes of a ReLU neural network, thereby upgrading the knowledge of neural networks to a new level. ", Recovering Top-Two Answers and Confusion Probability in Multi-Choice Crowdsourcing,https://openreview.net/forum?id=h3vfP9ASoXEK,https://openreview.net/pdf?id=h3vfP9ASoXEK,We propose a computationally and statistically efficient algorithm for multi-choice crowdsourced labeling to recover not only the ground truth but also the most confusing answer with confusion probability.,"We consider multi-choice crowdsourced labeling with the goal of recovering not only the ground truth but also the most confusing answer and the confusion probability. The most confusing answer provides useful information about the task by revealing the most plausible answer other than the ground truth and how plausible it is. To theoretically analyze such scenarios, we propose a model where there are top-two plausible answers for each task, distinguished from the rest of the choices. Task difficulty is quantified by the confusion probability between the top two, and worker reliability is quantified by the probability of giving an answer among the top two. Under this model, we propose a two-stage inference algorithm to infer the top-two answers, where the first stage uses the spectral method to obtain an initial estimate for the top two, and the second stage uses the result of the first stage to refine the estimates based on the maximum likelihood estimator (MLE). We show that our algorithm achieves the minimax optimal convergence rate.
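A toy sketch of the two-stage idea just described, with plurality counting standing in for the spectral initialization and a reliability re-weighting standing in for the MLE refinement; every detail below is a simplifying assumption rather than the paper's algorithm.

```python
import numpy as np

def top_two_inference(A, n_choices, iters=3):
    """Toy estimate of (ground truth, most confusing answer) per task.

    A[i, j] is worker j's answer on task i (-1 if unanswered). Stage 1:
    weighted plurality counts give an initial top-two estimate. Stage 2:
    worker weights are re-estimated from agreement with the current
    top two, and the votes are re-counted.
    """
    n_tasks, n_workers = A.shape
    w = np.ones(n_workers)                        # worker reliability weights
    for _ in range(iters):
        top1 = np.zeros(n_tasks, dtype=int)
        top2 = np.zeros(n_tasks, dtype=int)
        for i in range(n_tasks):
            counts = np.zeros(n_choices)
            for j in range(n_workers):
                if A[i, j] >= 0:
                    counts[A[i, j]] += w[j]
            order = np.argsort(-counts)
            top1[i], top2[i] = order[0], order[1]
        for j in range(n_workers):                # reliability = top-two hit rate
            mask = A[:, j] >= 0
            if mask.any():
                hits = np.mean((A[mask, j] == top1[mask]) | (A[mask, j] == top2[mask]))
                w[j] = max(hits, 1e-3)
    return top1, top2

rng = np.random.default_rng(0)
A = rng.integers(0, 4, size=(20, 10))             # 20 tasks, 10 workers, 4 choices
print(top_two_inference(A, n_choices=4))
```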
We conduct both synthetic and real-data experiments and demonstrate that our algorithm achieves performance near that of the optimal MLE on synthetic datasets and the best performance on real datasets compared to other recent algorithms. This shows that our model explains well real datasets with heterogeneous task difficulties caused by confusion between plausible answers. ","Crowdsourcing, multiple choice, detecting confusion, task difficulty, two-stage inference algorithm, minimax optimal convergence rate" SCALE-UP: An Efficient Black-box Input-level Backdoor Detection via Analyzing Scaled Prediction Consistency,https://openreview.net/forum?id=o0LFPcoFKnr,https://openreview.net/pdf?id=o0LFPcoFKnr,"We reveal an intriguing phenomenon that the predictions of poisoned samples are significantly more consistent when amplifying all pixel values, based on which we design a simple yet effective black-box input-level backdoor detection.","Deep neural networks (DNNs) are vulnerable to backdoor attacks, where adversaries embed a hidden backdoor trigger during the training process for malicious prediction manipulation. These attacks pose great threats to the applications of DNNs under the real-world machine learning as a service (MLaaS) setting, where the deployed model is fully black-box while the users can only query and obtain its predictions. Currently, there are many existing defenses to reduce backdoor threats. However, almost all of them cannot be adopted in MLaaS scenarios since they require access to, or even modification of, the suspicious models. In this paper, we propose a simple yet effective black-box input-level backdoor detection method, called SCALE-UP, which requires only the predicted labels to alleviate this problem. Specifically, we identify and filter malicious testing samples by analyzing their prediction consistency during the pixel-wise amplification process. Our defense is motivated by an intriguing observation (dubbed \emph{scaled prediction consistency}) that the predictions of poisoned samples are significantly more consistent compared to those of benign ones when amplifying all pixel values. Besides, we also provide theoretical foundations to explain this phenomenon. Extensive experiments are conducted on benchmark datasets, verifying the effectiveness and efficiency of our defense and its resistance to potential adaptive attacks.","Backdoor Detection, Backdoor Defense, Backdoor Learning, AI Security, Deep Learning" On the Perils of Cascading Robust Classifiers,https://openreview.net/forum?id=tQG-o3SeipT,https://openreview.net/pdf?id=tQG-o3SeipT,Our work reveals a critical pitfall of cascading certifiably robust models by showing that the seemingly beneficial strategy of cascading can actually hurt the robustness of the resulting ensemble.,"Ensembling certifiably robust neural networks is a promising approach for improving the \emph{certified robust accuracy} of neural models. Black-box ensembles that assume only query-access to the constituent models (and their robustness certifiers) during prediction are particularly attractive due to their modular structure. Cascading ensembles are a popular instance of black-box ensembles that appear to improve certified robust accuracies in practice. However, we show that the robustness certifier used by a cascading ensemble is unsound.
That is, when a cascading ensemble is certified as locally robust at an input $x$ (with respect to $\epsilon$), there can be inputs $x'$ in the $\epsilon$-ball centered at $x$ such that the cascade's prediction at $x'$ is different from its prediction at $x$, and thus the ensemble is not locally robust. Our theoretical findings are accompanied by empirical results that further demonstrate this unsoundness. We present a new attack against cascading ensembles and show that: (1) there exists an adversarial input for up to 88\% of the samples where the ensemble claims to be certifiably robust and accurate; and (2) the accuracy of a cascading ensemble under our attack is as low as 11\% when it claims to be certifiably robust and accurate on 97\% of the test set. Our work reveals a critical pitfall of cascading certifiably robust models by showing that the seemingly beneficial strategy of cascading can actually hurt the robustness of the resulting ensemble.","Certifiable Robustness, Ensemble, Adversarial Attack, Soundness" GENERATIVE OF ORIGIN MODEL DISTRIBUTION MASKED WITH EMOTIONS AND TOPICS DISTRIBUTION IN HYBRID METHOD,https://openreview.net/forum?id=p8ZiYjVPk8g,https://openreview.net/pdf?id=p8ZiYjVPk8g,Embedding constructors for effective representation of natural language,"There is a vast amount of data in the form of natural signals in the world, and sophisticated expression processors are required to analyze such data. Traditional embedding methods are susceptible to generalization failure. In this study, we developed a classification model that creates and approximates an origin hypothesis model using limited emotions and topics. To solve the hypothesis, the proposed model utilizes dynamic learner modules. Using this mechanism, a text-based origin distribution representation learning model was designed. To evaluate generalization, we analyzed experimental results on various natural language datasets and measured the corresponding performance. Thus, we demonstrated that the machine achieves the classification task more effectively by integrating the learned distribution with multiple learning methods.","Masked Distribution, Learning Approximation Representation, Topic Analytics, Sentiment Analytics, Meta Learning" Visual Classification via Description from Large Language Models,https://openreview.net/forum?id=jlAjNL8z5cs,https://openreview.net/pdf?id=jlAjNL8z5cs,"We enhance zero-shot recognition with vision-language models by comparing to category descriptors from GPT-3, enabling better performance in an interpretable setting that also allows for incorporation of new concepts and bias mitigation.","Vision-language models such as CLIP have shown promising performance on a variety of recognition tasks using the standard zero-shot classification procedure -- computing similarity between the query image and the embedded words for each category. By only using the category name, they neglect to make use of the rich context of additional information that language affords. The procedure gives no intermediate understanding of why a category is chosen, and furthermore provides no mechanism for adjusting the criteria used towards this decision. We present an alternative framework for classification with VLMs, which we call classification by description. We ask VLMs to check for descriptive features rather than broad categories: to find a tiger, look for its stripes; its claws; and more.
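A minimal sketch of this classification-by-description scoring, assuming toy random embeddings in place of the CLIP image/text towers; the prompt template and the mean-similarity aggregation rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
_cache = {}

def _embed(key, dim=64):
    # Toy stand-in for a CLIP tower: a fixed random unit vector per input.
    if key not in _cache:
        v = rng.standard_normal(dim)
        _cache[key] = v / np.linalg.norm(v)
    return _cache[key]

def classify_by_description(image_key, descriptors_per_class):
    """Score each class by the mean image-descriptor similarity."""
    img = _embed(("img", image_key))
    scores = {}
    for cls, descriptors in descriptors_per_class.items():
        sims = [img @ _embed(f"{cls}, which has {desc}") for desc in descriptors]
        scores[cls] = float(np.mean(sims))        # average descriptor evidence
    return max(scores, key=scores.get)

descriptors = {"tiger": ["stripes", "claws", "orange fur"],
               "zebra": ["black and white stripes", "hooves"]}
print(classify_by_description("photo_001", descriptors))
```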
By basing decisions on these descriptors, we can provide additional cues that encourage using the features we want to be used. In the process, we can get a clear idea of what the model ``thinks'' it is seeing to make its decision; it gains some level of inherent explainability. We query large language models (e.g., GPT-3) for these descriptors to obtain them in a scalable way. Extensive experiments show our framework has numerous advantages beyond interpretability. We show improvements in accuracy on ImageNet across distribution shifts; demonstrate the ability to adapt VLMs to recognize concepts unseen during training; and illustrate how descriptors can be edited to effectively mitigate bias compared to the baseline. ","vision-language models, CLIP, prompting, GPT-3, large language models, zero-shot recognition, multimodal" Unsupervised Visual Anomaly Detection with Score-Based Generative Model,https://openreview.net/forum?id=j6sAOkvn4GI,https://openreview.net/pdf?id=j6sAOkvn4GI,,"We consider leveraging the deviated outputs and gradient information from generative models due to out-of-distribution samples in anomaly detection (AD). Visual anomaly detection is a critical and widely discussed problem. However, in many application scenarios, abnormal image samples are very rare and difficult to collect. In this paper, we focus on the unsupervised visual anomaly detection and localization task through a score-based generative model applicable to more general cases. Our work is inspired by the fact that noise injected into the original image through the forward diffusion process may reveal image defects in the reverse process (i.e., reconstruction). First, due to the differences between normal pixels in the reconstructed and original images, we propose to use a score-based generative model and associated score values as a metric to gauge defects. Second, to accelerate the inference process, a novel $T$-scales approach is developed that reduces the use of redundant information from adjacent time steps while leveraging the information provided by the score model at different time steps. These two practices allow our model to improve the generalization of AD in an unsupervised manner while maintaining a reasonable speed. We evaluate our method on several datasets to demonstrate its effectiveness.", A Data-Based Perspective on Transfer Learning,https://openreview.net/forum?id=IrUFsuTxVfY,https://openreview.net/pdf?id=IrUFsuTxVfY,"In this work, we present a framework for probing the impact of the source dataset on transfer learning performance.","It is commonly believed that more pre-training data leads to better transfer learning performance. However, recent evidence suggests that removing data from the source dataset can actually help too. In this work, we present a framework for probing the impact of the source dataset's composition on transfer learning performance. Our framework facilitates new capabilities such as identifying transfer learning brittleness and detecting pathologies such as data leakage and the presence of misleading examples in the source dataset.
In particular, we demonstrate that removing detrimental datapoints identified by our framework improves transfer performance from ImageNet on a variety of transfer tasks.","transfer learning, datasets, subpopulations" Contrastive Novelty Learning: Anticipating Outliers with Large Language Models,https://openreview.net/forum?id=gEvzRWqFoCO,https://openreview.net/pdf?id=gEvzRWqFoCO,"We present Contrastive Novelty Learning, a method to improve open-set selective classification by generating probable novel examples with a large language model, then training a classifier for lower relative confidence on generated examples.","In many task settings, text classification models are likely to encounter examples from novel classes on which they cannot predict correctly. Selective prediction, in which models abstain on low-confidence examples, provides a possible solution, but existing models are often overly confident on OOD examples. To remedy this overconfidence, we introduce Contrastive Novelty Learning (CNL), a two-step method that generates OOD examples representative of novel classes, then trains to decrease confidence on them. First, we generate OOD examples by prompting a large language model twice: we prompt it to enumerate novel classes relevant to the label set, then generate examples from each novel class matching the task format. Second, we train our classifier with a novel contrastive objective that encourages lower confidence on generated OOD examples than training examples. When trained with CNL, classifiers improve in their ability to detect and abstain on OOD examples over prior methods by an average of 2.3% AUAC and 5.5% AUROC across 4 NLP datasets, with no cost to in-distribution accuracy.","selective prediction, open-set classification, large language models, NLP" MIMT: Masked Image Modeling Transformer for Video Compression,https://openreview.net/forum?id=j9m-mVnndbm,https://openreview.net/pdf?id=j9m-mVnndbm,draft,"Deep learning video compression outperforms its hand-crafted counterparts with enhanced flexibility and capacity. One key component of the learned video codec is the autoregressive entropy model conditioned on spatial and temporal priors. Operating autoregressively in raster-scan order naively treats the context as unidirectional. This is neither efficient nor optimal, considering that conditional information may be located at the end of the sequence. We thus introduce an entropy model based on a masked image modeling transformer (MIMT) to learn the spatial-temporal dependencies. Video frames are first encoded into sequences of tokens and then processed with the transformer encoder as priors. The transformer decoder learns the probability mass functions (PMFs) \emph{conditioned} on the priors and masked inputs. Then it is capable of selecting optimal decoding orders without a fixed direction. During training, MIMT aims to predict the PMFs of randomly masked tokens by attending to tokens in all directions. This allows MIMT to capture the temporal dependencies from encoded priors and the spatial dependencies from the unmasked tokens, i.e., decoded tokens. At inference time, the model begins by generating PMFs of all masked tokens in parallel and then decodes the frame iteratively, conditioning on previously decoded (i.e., high-confidence) tokens. In addition, we improve the overall performance with additional techniques, e.g., manifold conditional priors that accumulate long-range information and shifted-window attention to reduce complexity.
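The confidence-ordered iterative decoding loop just described can be sketched as follows; the masking convention, the pacing schedule, and the toy PMF function are assumptions for illustration, with the transformer decoder treated as a black box.

```python
import numpy as np

def iterative_decode(pmf_fn, n_tokens, n_steps=4):
    """Decode tokens in confidence order rather than raster order.

    pmf_fn(tokens, mask) -> (n_tokens, vocab) probabilities for every
    position, conditioned on the currently decoded tokens; in MIMT this
    role is played by the transformer decoder (a black box here).
    """
    tokens = np.full(n_tokens, -1)            # -1 marks a masked position
    for step in range(n_steps):
        mask = tokens < 0
        if not mask.any():
            break
        pmf = pmf_fn(tokens, mask)
        conf = pmf.max(axis=1)
        conf[~mask] = -np.inf                 # never re-decode fixed tokens
        k = int(np.ceil(mask.sum() / (n_steps - step)))
        chosen = np.argsort(-conf)[:k]        # highest-confidence positions first
        tokens[chosen] = pmf[chosen].argmax(axis=1)
    return tokens

toy_pmf = lambda tokens, mask: np.random.default_rng(1).dirichlet(
    np.ones(32), size=len(tokens))            # placeholder for the decoder
print(iterative_decode(toy_pmf, n_tokens=12))
```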
Extensive experiments demonstrate that the proposed MIMT framework, equipped with the new transformer entropy model, achieves state-of-the-art performance on HEVC, UVG, and MCL-JCV datasets, generally outperforming VVC in terms of PSNR and SSIM. ","video compression, masked image modeling, transformer, entropy model" Speculative Decoding: Lossless Speedup of Autoregressive Translation,https://openreview.net/forum?id=H-VlwsYvVi,https://openreview.net/pdf?id=H-VlwsYvVi,,"Different from some previous work accelerating autoregressive translation (AT) at the expense of quality, we propose Speculative Decoding (SpecDec) -- a novel decoding paradigm inspired by speculative execution in computer architecture, which combines respective advantages of AT and non-autoregressive translation (NAT) for lossless speedup of translation. At each decoding step, SpecDec first speculatively drafts (i.e. decodes) the next $k$ tokens with an NAT model and then verifies them with an AT model, where only the drafted tokens passing the verification are accepted as decoded tokens, guaranteeing that its translation result is exactly the same as AT's. The collaboration of NAT drafting and AT verification leads to a much higher decoding speed without quality loss due to parallel computing enabled by speculative decoding. We conduct experiments on 4 standard WMT translation benchmarks and confirm that vanilla SpecDec yields exactly the same results as AT greedy decoding with an around $3\times$ speedup, and that its variant (SpecDec++) with an advanced verification strategy not only outperforms AT greedy decoding, but also further improves the decoding speed, resulting in an around $5\times$ speedup over AT. Moreover, SpecDec can be easily generalized for speeding up other seq2seq tasks like Abstractive Summarization, and benefit more from stronger computing devices, demonstrating its potential to become a de facto decoding standard in the future for efficient and lossless seq2seq generation.", CONVOLUTION AND POOLING OPERATION MODULE WITH ADAPTIVE STRIDE PROCESSING EFFECT,https://openreview.net/forum?id=tCPheuUFBC,https://openreview.net/pdf?id=tCPheuUFBC,,"The convolutional neural network is one of the representative models of deep learning and has a wide range of applications. Convolution and pooling are two key operations in convolutional neural networks. They play an important role in extracting input features and mapping low-level semantic features to high-level semantic features. Stride is an important parameter involved in convolution and pooling operations, which refers to the distance of each slide of the convolution kernel (pooling kernel) during the convolution (pooling) operation. The stride has an impact on the granularity of feature extraction and the selection (filtering) of features, thus affecting the performance of convolutional neural networks. At present, in the training of convolutional neural networks, the contents of the convolution kernel and pooling kernel can be determined by gradient-descent-based optimization. However, the stride usually cannot be treated similarly, and can only be selected manually as a hyperparameter. Most of the existing related works choose a fixed stride, for example, a value of 1. In fact, different tasks or inputs may require different strides for better model processing.
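The contrast between fixed-stride and input-adaptive downsampling can be illustrated with a 1-D toy; the change-magnitude selection rule below is a hypothetical stand-in for the learned module described here, not the paper's design.

```python
import numpy as np

def fixed_stride_pool(y, stride=2):
    return y[::stride]                        # equal-interval downsampling

def adaptive_stride_pool(y, keep_ratio=0.5):
    """Keep the positions where the response changes most, instead of
    sampling at fixed intervals."""
    change = np.abs(np.diff(y, prepend=y[0]))
    k = max(1, int(keep_ratio * len(y)))
    idx = np.sort(np.argsort(-change)[:k])    # keep positions in original order
    return y[idx]

x = np.sin(np.linspace(0, 6 * np.pi, 64)) + np.linspace(0, 1, 64)
y = np.convolve(x, np.array([0.25, 0.5, 0.25]), mode="valid")   # smoothed features
print(fixed_stride_pool(y).shape, adaptive_stride_pool(y).shape)
```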
Therefore, this paper views the role of stride in convolution and pooling operations from the perspective of sampling, and proposes a convolution and pooling operation module with an adaptive stride processing effect. The feature of the proposed module is that the feature map obtained by the convolution or pooling operation is no longer limited to equal-interval downsampling (feature extraction) according to a fixed stride, but is instead extracted adaptively according to changes in the input features. We apply the proposed module to many convolutional neural network models, including VGG, AlexNet and MobileNet for image classification, YOLOX-S for object detection, UNet for image segmentation, and so on. Simulation results show that the proposed module can effectively improve the performance.","convolution, pooling, adaptive, stride" Transformer Module Networks for Systematic Generalization in Visual Question Answering,https://openreview.net/forum?id=0bLE93R9d0O,https://openreview.net/pdf?id=0bLE93R9d0O,Investigating whether and how modularity brings benefits to Transformer-based models,"Transformers achieve great performance on Visual Question Answering (VQA). However, their systematic generalization capabilities, i.e., handling novel combinations of known concepts, are unclear. We reveal that Neural Module Networks (NMNs), i.e., question-specific compositions of modules that tackle a sub-task, achieve systematic generalization performance better than or similar to that of conventional Transformers, even though NMNs' modules are CNN-based. In order to address this shortcoming of Transformers with respect to NMNs, in this paper we investigate whether and how modularity can bring benefits to Transformers. Namely, we introduce Transformer Module Network (TMN), a novel NMN based on compositions of Transformer modules. TMNs achieve state-of-the-art systematic generalization performance in three VQA datasets, improving more than 30% over standard Transformers for novel compositions of sub-tasks. We show that not only the module composition but also the module specialization for each sub-task are the key to such performance gain.","Systematic generalization, Neural Module Network, Transformer" Diving into Unified Data-Model Sparsity for Class-Imbalanced Graph Representation Learning,https://openreview.net/forum?id=YCMKCwN4PQw,https://openreview.net/pdf?id=YCMKCwN4PQw,"We propose Graph Decantation, a novel method of discovering unified dynamic sparsity from both GNN model and graph data, to learn balanced graph representations.","Recent research shows that, even when pruned by state-of-the-art network compression methods, deep learning model training still suffers from the demand for massive data usage. In particular, Graph Neural Networks (GNNs) trained upon non-Euclidean graph data often encounter relatively higher time costs, due to their irregular and uneven density properties, compared with data in the regular Euclidean space (e.g., image or text). Another natural property that accompanies graphs is class imbalance, which cannot be alleviated even with massive data and therefore hinders GNNs' ability to generalize. To fully tackle these unpleasant properties, (i) theoretically, we introduce a hypothesis about the extent to which a subset of the training data can approximate the full dataset's learning effectiveness.
This effectiveness is further guaranteed and proved via the distance between the gradients computed on the subset and those computed on the full set; (ii) empirically, we discover that during the learning process of a GNN, some samples in the training dataset are informative in providing gradients to update model parameters. Moreover, the informative subset evolves dynamically during the training process, for samples that are informative in the current training epoch may not be so in the next one. We refer to this observation as dynamic data sparsity. We also notice that sparse subnets pruned from a well-trained GNN sometimes forget the information provided by the informative subset, as reflected in their poor performance on the subset. Based on these findings, we develop a unified data-model dynamic sparsity framework named Graph Decantation (GraphDec) to address the challenges brought by training on massive class-imbalanced graph datasets. The key idea of GraphDec is to identify the informative subset dynamically during the training process by adopting sparse graph contrastive learning. Extensive experiments on multiple benchmark datasets demonstrate that GraphDec outperforms state-of-the-art baselines for the class-imbalanced graph classification and class-imbalanced node classification tasks, with respect to classification accuracy and data usage efficiency.","Graph representation learning, Class-imbalanced data" Cluster and Landmark Attributes Infused Graph Neural Networks for Link prediction,https://openreview.net/forum?id=sKD1LojqWYR,https://openreview.net/pdf?id=sKD1LojqWYR,"We propose a simple representation of positional information using a set of representative nodes called landmarks, and show that the proposed method achieves superior link prediction performance on various datasets.","Learning positional information of nodes in a graph is important for link prediction tasks. We propose a simple representation of positional information using a set of representative nodes called landmarks. The position of a node is represented as a vector of its distances to the landmarks, where the landmarks are selected from the nodes with high degree centrality. We justify this selection strategy by analyzing well-known models of random graphs, and deriving closed-form bounds on the average path lengths involving landmarks. In a model for scale-free networks, we show that the distances to landmarks provide asymptotically accurate information on inter-node shortest distances. Our result is consistent with the small-world phenomenon, i.e., a landmark can provide short paths between nodes as a hub. We apply these theoretical insights to practical networks, and propose Cluster and Landmark Attributes-iNfused graph neural networks (CLAN). CLAN combines graph clustering and landmark selection, in which the graph is partitioned into densely connected clusters, and the local node with the maximum degree in each cluster is selected as a landmark. In addition, CLAN encodes the distances to landmarks using cluster-specific embeddings in order to capture locality among the nodes in a common cluster.
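A minimal sketch of landmark-based positional encoding as just described, assuming global degree-based selection and plain BFS distances; the per-cluster selection and cluster-specific embeddings of CLAN are omitted for brevity.

```python
import numpy as np
from collections import deque

def landmark_features(adj, n_landmarks=3):
    """Positional features: distances from every node to high-degree landmarks.

    adj: adjacency list {node: [neighbors]}. Landmarks are the globally
    highest-degree nodes; distances come from BFS, with inf for
    unreachable nodes.
    """
    degree = {u: len(vs) for u, vs in adj.items()}
    landmarks = sorted(degree, key=degree.get, reverse=True)[:n_landmarks]
    feats = {u: [] for u in adj}
    for lm in landmarks:                       # BFS from each landmark
        dist = {lm: 0}
        queue = deque([lm])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for u in adj:
            feats[u].append(dist.get(u, np.inf))
    return landmarks, feats

adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
print(landmark_features(adj, n_landmarks=2))
```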
Experiments demonstrate that CLAN achieves superior performance and robustness over baseline methods on various datasets.","Graph Neural Networks, Link prediction" Learning Math Reasoning from Self-Sampled Correct and Partially-Correct Solutions,https://openreview.net/forum?id=4D4TSJE6-K,https://openreview.net/pdf?id=4D4TSJE6-K,We propose to let pretrained language models sample additional solutions for each problem and learn from the self-sampled solutions that are correct or partially-correct.,"Pretrained language models (PLMs) have shown superior performance on many natural language processing tasks, yet they still struggle at multi-step formal reasoning tasks like grade school math problems. One key challenge of finetuning PLMs to solve such math reasoning problems is that many existing datasets only contain one reference solution for each problem, despite the fact that there are often alternative solutions resembling different reasoning paths to the final answer. This way, the finetuned models are biased towards the limited reference solutions, which limits their generalization to unseen examples. To mitigate this issue, we propose to let the model perform sampling during training and learn from both self-sampled fully-correct solutions, which yield the correct answer upon execution, and partially-correct solutions, whose intermediate state matches an intermediate state of a known correct solution. We show that our use of self-sampled correct and partially-correct solutions can benefit learning and help guide the sampling process, leading to more efficient exploration of the solution space. Additionally, we explore various training objectives to support learning from multiple solutions per example and find they greatly affect the performance. Experiments on two math reasoning datasets show the effectiveness of our method compared to learning from a single reference solution with MLE, where we improve pass@100 from 35.5% to 44.5% for GSM8K, and 27.6% to 36.2% pass@80 for MathQA. Such improvements are also consistent across different PLM sizes.","mathematical reasoning, multi-target learning, self-sampling, large language models" Adaptive Robust Evidential Optimization For Open Set Detection from Imbalanced Data,https://openreview.net/forum?id=3yJ-hcJBqe,https://openreview.net/pdf?id=3yJ-hcJBqe,We propose adaptive robust uncertainty mass quantification for effective open set detection from imbalanced data. ,"Open set detection (OSD) aims at identifying data samples of an unknown class (i.e., open set) from those of known classes (i.e., close set) based on a model trained from close set samples. However, a close set may involve a highly imbalanced class distribution. Accurately differentiating between open set samples and those from a minority class in the close set poses a fundamental challenge, as the model may be equally uncertain when recognizing samples from the minority class. In this paper, we propose Adaptive Robust Evidential Optimization (AREO) that offers a principled way to quantify sample uncertainty through evidential learning while optimally balancing the model training over all classes in the close set through adaptive distributionally robust optimization (DRO).
To prevent the model from focusing primarily on the most difficult samples, as standard DRO would, adaptive DRO training is performed, governed by a novel multi-scheduler learning mechanism to ensure an optimal model training behavior that gives sufficient attention to the difficult samples and the minority class while remaining capable of learning common patterns from the majority classes. Our experimental results on multiple real-world datasets demonstrate that the proposed model outputs uncertainty scores that can clearly separate samples from close and open sets, respectively, and that the detection results outperform the competitive baselines. ","Open Set Detection, Imbalanced Data" The Modality Focusing Hypothesis: Towards Understanding Crossmodal Knowledge Distillation,https://openreview.net/forum?id=w0QXrZ3N-s,https://openreview.net/pdf?id=w0QXrZ3N-s,We provide a thorough investigation of crossmodal knowledge transfer,"Crossmodal knowledge distillation (KD) extends traditional knowledge distillation to the area of multimodal learning and demonstrates great success in various applications. To achieve knowledge transfer across modalities, a pretrained network from one modality is adopted as the teacher to provide supervision signals to a student network learning from the other modality. In contrast to the empirical success reported in prior works, the working mechanism of crossmodal KD remains a mystery. In this paper, we present a thorough understanding of crossmodal KD. We begin by providing two failure cases and demonstrate that KD is not a universal cure in crossmodal knowledge transfer. We then present the modality Venn diagram to understand modality relationships and the modality focusing hypothesis revealing the decisive factor in the efficacy of crossmodal KD. Experimental results on 6 multimodal datasets help justify our hypothesis, diagnose failure cases, and point directions to improve crossmodal knowledge transfer in the future.","multimodal learning, knowledge distillation" Representational Task Bias in Zero-shot Recognition at Scale,https://openreview.net/forum?id=vbnxKVZDr4,https://openreview.net/pdf?id=vbnxKVZDr4,"We show CLIP image representations are biased towards being used for a specific task a priori, and provide a simple method to cue which task is desired without model retraining.","Research from the last year has demonstrated that vision-language pre-training at scale from incidental supervision on the Internet can result in representations with clear advantages over traditional supervised training for many computer vision tasks. We conduct an in-depth exploration of the CLIP model, and find that the interface that language creates to these learned representations -- by the same token as enabling zero-shot application for many tasks -- leads the model to solve tasks that may not have been intended by the user in realistic scenarios. We call the inherent uncertainty of which task a user intends to solve in zero-shot recognition \textit{task ambiguity}. To evaluate task ambiguity, we construct a dataset of images where each image has labels for multiple semantic recognition tasks. We demonstrate that the representation produced for a given image tends to be strongly biased towards a particular task over others; in other words, representations exhibit \textit{task bias}. Moreover, which task a particular image will be biased towards is unpredictable, with little consistency across images.
Our results show that we can learn visual prompts to serve as effective conditioning mechanisms that indicate which task is desired, and that these prompts can even improve performance on the task when used outside the context of evaluating task ambiguity. ","vision-language models, CLIP, prompting, task representation" AxBERT: An Explainable Chinese Spelling Correction Method Driven by Associative Knowledge Network,https://openreview.net/forum?id=nZGu4Ltnl5,https://openreview.net/pdf?id=nZGu4Ltnl5,The proposed AxBERT as an explainable Chinese spelling correction method can achieve a predictable and regulatable correction process with extraordinary performance.,"Deep learning has shown promising performance on various machine learning tasks. Nevertheless, the unexplainability of deep learning models severely restricts their usage in domains that require feature explanations, such as text correction. Therefore, a novel explainable deep learning model (named AxBERT) is proposed for Chinese spelling correction by aligning with an associative knowledge network (AKN). The AKN is constructed from the co-occurrence relations among Chinese characters, which denote explainable statistical logic, in contrast to the unexplainable logic of BERT. A translator matrix between BERT and AKN is introduced for the alignment and regulation of the attention component in BERT. In addition, a weight regulator is designed to adjust the attention distributions in BERT to appropriately model the sentence semantics. Experimental results on SIGHAN datasets demonstrate that AxBERT can achieve extraordinary performance, especially in model precision compared to baselines. Our explainable analysis, together with qualitative reasoning, can effectively illustrate the explainability of AxBERT.","Chinese spelling correction, explainable deep learning, associative knowledge network, explainable statistic, attention distribution, statistical alignment" Hungry Hungry Hippos: Towards Language Modeling with State Space Models,https://openreview.net/forum?id=COZDy0WYGg,https://openreview.net/pdf?id=COZDy0WYGg,We study the expressivity gap between state space models (SSMs) and attention on language modeling and reduce the hardware barrier between SSMs and attention.,"State space models (SSMs) have demonstrated state-of-the-art sequence modeling performance in some modalities, but underperform attention in language modeling. Moreover, despite scaling nearly linearly in sequence length instead of quadratically, SSMs are still slower than Transformers due to poor hardware utilization. In this paper, we make progress on understanding the expressivity gap between SSMs and attention in language modeling, and on reducing the hardware barrier between SSMs and attention. First, we use synthetic language modeling tasks to understand the gap between SSMs and attention. We find that existing SSMs struggle with two capabilities: recalling earlier tokens in the sequence and comparing tokens across the sequence. To understand the impact on language modeling, we propose a new SSM layer, H3, that is explicitly designed for these abilities. H3 matches attention on the synthetic languages and comes within 0.4 PPL of Transformers on OpenWebText. Furthermore, a hybrid H3-attention model that retains two attention layers surprisingly outperforms Transformers on OpenWebText by 1.0 PPL.
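For intuition about why SSMs scale linearly in sequence length, a generic diagonal state space recurrence (not the H3 layer itself) can be sketched as follows; the parameterization is an illustrative assumption.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Linear-time scan: h_t = a * h_{t-1} + b * x_t, y_t = c . h_t.

    A generic diagonal state space recurrence: the hidden state carries
    information from earlier tokens forward in O(length) time, in
    contrast to attention's quadratic cost.
    """
    h = np.zeros_like(a)
    ys = []
    for x_t in x:
        h = a * h + b * x_t                   # elementwise (diagonal) update
        ys.append(c @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
d = 8
y = ssm_scan(rng.standard_normal(32), a=np.full(d, 0.9),
             b=rng.standard_normal(d), c=rng.standard_normal(d))
print(y.shape)                                # (32,): one output per token
```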
When trained on the Pile at small/medium scale (125M and 355M parameters), hybrid H3-attention language models display promising initial results, achieving lower perplexity than Transformers and outperforming Transformers in zero- and few-shot learning on a majority of tasks in the SuperGLUE benchmark. Next, to improve the efficiency of training SSMs on modern hardware, we propose FlashFFTConv. FlashFFTConv uses a fused block FFT algorithm to improve efficiency on sequences up to 8K, and introduces a novel state passing algorithm that exploits the recurrent properties of SSMs to scale to longer sequences. FlashFFTConv yields 2$\times$ speedup on the long-range arena benchmark and allows hybrid language models to generate text 1.6$\times$ faster than Transformers.","language modeling, state space models, efficiency" FINE: Future-Aware Inference for Streaming Speech Translation,https://openreview.net/forum?id=0VhwJYrZew,https://openreview.net/pdf?id=0VhwJYrZew,Future-aware inference for streaming speech translation,"A popular approach to streaming speech translation is to employ a single offline model together with a \textit{wait-$k$} policy to support different latency requirements. It is a simpler alternative to training multiple online models with different latency constraints. However, there is an apparent mismatch in using a model trained with complete utterances on partial streaming speech during online inference. We demonstrate that there is a significant difference between the speech representations extracted at the end of a streaming input and their counterparts at the same positions when the complete utterance is available. Building on our observation that this problem can be alleviated by introducing a few frames of future speech signals, we propose \textbf{F}uture-aware \textbf{in}ferenc\textbf{e} (FINE) for streaming speech translation, with two different methods to make the model aware of the future. The first method, FINE-Mask, incorporates future context through a trainable masked speech model. The second method, FINE-Wait, simply waits for more actual future audio frames at the cost of extra latency. Experiments on the MuST-C EnDe, EnEs and EnFr benchmarks show that both methods are effective and can achieve better trade-offs between translation quality and latency than strong baselines, and that a hybrid approach combining the two can achieve further improvement. Extensive analyses suggest that our methods can effectively alleviate the aforementioned mismatch problem between offline training and online inference.","Streaming Speech Translation, Future-Aware Inference" PATCH-MIX TRANSFORMER FOR UNSUPERVISED DOMAIN ADAPTATION: A GAME PERSPECTIVE,https://openreview.net/forum?id=YfSFF4WaTj,https://openreview.net/pdf?id=YfSFF4WaTj,,"Endeavors have recently been made to leverage the vision transformer (ViT) for the challenging unsupervised domain adaptation (UDA) task. They typically adopt cross-attention in ViT for direct domain alignment. However, as the performance of cross-attention highly relies on the quality of pseudo labels for the target samples, it becomes less effective when the domain gap becomes larger. We solve this problem from a game-theoretic perspective with a model called PMTrans, which bridges the source and the target domains with an intermediate domain.
Specifically, we propose a novel ViT-based module called PatchMix that effectively builds up the intermediate domain, i.e., probability distribution, by learning to sample patches from both domains based on game-theoretic models. In this way, it learns to mix the patches from the source and target domains to maximize the cross entropy (CE), while exploiting two semi-supervised mixup losses in the feature and label spaces to minimize CE. As such, we interpret the process of UDA as a min-max CE game with three players, including the feature extractor, classifier, and PatchMix, to find the optimal Nash equilibrium solution. Moreover, we leverage attention maps from ViT to re-weight the label of each patch by its importance, making it possible to obtain more domain-discriminative feature representations. We conduct extensive experiments on four benchmark datasets, and the results show that PMTrans significantly surpasses the ViT-based and CNN-based SoTA methods by +1.4% on Office-31, +3.5% on Office-Home, and +17.7% on DomainNet, respectively.","Unsupervised domain adaptation, Game theory, Transformer, Mixup" Dual Diffusion Implicit Bridges for Image-to-Image Translation,https://openreview.net/forum?id=5HLoTvVGDe,https://openreview.net/pdf?id=5HLoTvVGDe,,"Common image-to-image translation methods rely on joint training over data from both source and target domains. This prevents the training process from preserving the privacy of domain data (e.g., in a federated setting), and often means that a new model has to be trained for a new pair of domains. We present Dual Diffusion Implicit Bridges (DDIBs), an image translation method based on diffusion models, that circumvents training on domain pairs. Image translation with DDIBs relies on two diffusion models trained independently on each domain, and is a two-step process: DDIBs first obtain latent encodings for source images with the source diffusion model, and then decode such encodings using the target model to construct target images. Both steps are defined via an ODE, and thus the process is cycle-consistent only up to discretization errors of the ODE solvers. Theoretically, we interpret DDIBs as a concatenation of source-to-latent and latent-to-target Schrödinger bridges, a form of entropy-regularized optimal transport, which explains the efficacy of the method. Experimentally, we apply DDIBs to both synthetic and high-resolution image datasets, to demonstrate their utility in a wide variety of translation tasks as well as their connections to existing optimal transport methods.", HYPERPRUNING: EFFICIENT PRUNING THROUGH LYAPUNOV METRIC HYPERSEARCH,https://openreview.net/forum?id=wFOGJB88Y5,https://openreview.net/pdf?id=wFOGJB88Y5,We proposed a novel method to search over pruning method and hyperparameters based on Lyapunov Spectrum.,"A variety of pruning methods have been introduced for over-parameterized recurrent neural networks to improve efficiency in terms of power and storage. With the advance in pruning methods and their variety, a new problem of ‘hyperpruning’ is becoming apparent: finding a suitable pruning method with an optimal hyperparameter configuration for a particular task and network. Such a search is different from standard hyperparameter search, where the accuracy of the optimal configuration is unknown. In the context of network pruning, the accuracy of the non-pruned (dense) model sets the target for the accuracy of the pruned model. Thereby, the goal of hyperpruning is to reach or even surpass this target.
It is critical to develop efficient strategies for hyperpruning, since a direct search through pruned variants would require time-consuming training without guarantees of improved performance. To address this problem, we introduce a novel distance based on the Lyapunov Spectrum (LS), which provides a means to compare pruned variants with the dense model and, early in training, to estimate the accuracy that pruned variants will achieve after extensive training. The ability to predict performance allows us to incorporate the LS-based distance into Bayesian hyperparameter optimization methods and to propose an efficient and first-of-its-kind hyperpruning approach called LS-based Hyperpruning (LSH), which can reduce the search time by an order of magnitude compared to standard full-training search with the loss (or perplexity) as the accuracy metric. Our experiments on stacked LSTM and RHN language models trained with the Penn Treebank dataset show that, with a given budget of training epochs and a desired pruning ratio, LSH obtains better variants than standard loss-based hyperparameter optimization methods. Furthermore, as a result of the search, LSH identifies pruned variants that outperform state-of-the-art pruning methods and surpass the accuracy of the dense model.","Network Pruning, Efficient Hyperparameter Searching, Lyapunov Spectrum" Relational Curriculum Learning for Graph Neural Networks,https://openreview.net/forum?id=1bLT3dGNS0,https://openreview.net/pdf?id=1bLT3dGNS0,We propose a novel curriculum learning strategy to improve the generalization performance of graph neural network models by gradually involving edges from well-expected to less-expected in training.,"Graph neural networks have achieved great success in representing structured data and in downstream tasks such as node classification. The key idea is to recursively propagate and aggregate information along the edges of a given graph topology. However, edges in real-world graphs often have varying degrees of difficulty, and some edges may even be noisy for the downstream tasks. Therefore, existing graph neural network models may lead to suboptimal learned representations because they usually consider every edge in a given graph topology equally. On the other hand, curriculum learning, which mimics the human learning principle of learning data samples in a meaningful order, has been shown to be effective in improving the generalization ability of representation learners by gradually proceeding from easy to more difficult samples during training. Unfortunately, most existing curriculum learning strategies are designed for i.i.d. data samples and cannot be trivially generalized to handle structured data with dependencies. In order to address these issues, in this paper we propose a novel curriculum learning method for structured data that leverages the varying underlying difficulties of data dependencies to improve the quality of learned representations. Specifically, we design a learning strategy that gradually incorporates edges in a given graph topology into training according to their difficulty from easy to hard, where the degree of difficulty is measured by a self-supervised learning paradigm.
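The easy-to-hard edge schedule just described can be sketched as follows, assuming the per-edge difficulty scores are given and a linear pacing function; both are simplifying assumptions rather than the paper's exact design.

```python
import numpy as np

def edge_curriculum(edges, difficulty, n_epochs, train_step):
    """Admit edges into training gradually, from easy to hard.

    difficulty: one score per edge (lower = easier); each epoch trains
    on the easiest fraction of edges according to a linear schedule.
    """
    order = np.argsort(difficulty)            # easy -> hard
    for epoch in range(1, n_epochs + 1):
        frac = epoch / n_epochs               # linear pacing
        keep = order[: max(1, int(frac * len(edges)))]
        train_step([edges[i] for i in keep], epoch)

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
difficulty = np.array([0.1, 0.9, 0.3, 0.6, 0.2])
edge_curriculum(edges, difficulty, n_epochs=3,
                train_step=lambda es, ep: print(f"epoch {ep}: {len(es)} edges"))
```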
We demonstrate the strength of our proposed method in improving the generalization ability of learned representations through extensive experiments on nine synthetic datasets and seven real-world datasets, with different commonly used graph neural network models as backbone models.","Graph neural networks, Curriculum learning" The World is Changing: Improving Fair Training under Correlation Shifts,https://openreview.net/forum?id=6Fq1-57gff,https://openreview.net/pdf?id=6Fq1-57gff,We analyze fundamental limits in accuracy and fairness of in-processing fair algorithms when the data bias changes with correlation shifts and propose a novel pre-processing step that improves their performances.,"Model fairness is an essential element for Trustworthy AI. While many techniques for model fairness have been proposed, most of them assume that the training and deployment data distributions are identical, which is often not true in practice. In particular, when the bias between labels and sensitive groups changes, the fairness of the trained model is directly influenced and can worsen. We make two contributions for solving this problem. First, we analytically show that existing in-processing fair algorithms have fundamental limits in accuracy and fairness. We introduce the notion of correlation shifts, which can explicitly capture the change of the above bias. Second, we propose a novel pre-processing step that samples the input data to reduce correlation shifts and thus enables the in-processing approaches to overcome their limitations. We formulate an optimization problem for adjusting the data ratio among labels and sensitive groups to reflect the shifted correlation. A key advantage of our approach lies in decoupling the roles of pre-processing and in-processing approaches: correlation adjustment via pre-processing and unfairness mitigation on the processed data via in-processing. Experiments show that our framework effectively improves existing in-processing fair algorithms w.r.t. accuracy and fairness, both on synthetic and real datasets.","trustworthy AI, fairness, correlation shifts" ACMP: Allen-Cahn Message Passing with Attractive and Repulsive Forces for Graph Neural Networks,https://openreview.net/forum?id=4fZc_79Lrqs,https://openreview.net/pdf?id=4fZc_79Lrqs,,Neural message passing is a basic feature extraction unit for graph-structured data that considers neighboring node features when propagating information from one layer to the next. We model this process as an interacting particle system with attractive and repulsive forces, together with the Allen-Cahn force arising in the modeling of phase transitions. The dynamics of the system is a reaction-diffusion process which can separate particles without blowing up. This induces an Allen-Cahn message passing (ACMP) for graph neural networks, where the numerical iteration for the particle system solution constitutes the message passing propagation. ACMP, which has a simple implementation with a neural ODE solver, can propel the network depth to up to one hundred layers with a theoretically proven, strictly positive lower bound on the Dirichlet energy. It thus provides a deep model of GNNs circumventing the common GNN problem of oversmoothing.
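A schematic Euler step of such a reaction-diffusion message passing might look as follows; the scalar coupling and the cubic Allen-Cahn term are illustrative simplifications of the model described above, not its exact formulation.

```python
import numpy as np

def acmp_step(H, A_norm, alpha, beta, dt=0.1):
    """One explicit Euler step of a reaction-diffusion message passing.

    alpha scales the neighbor coupling (alpha < 0 would act repulsively),
    and beta * (H - H**3) is the Allen-Cahn reaction force that drives
    features toward the +/-1 phases instead of letting them collapse.
    """
    diffusion = A_norm @ H - H                # neighbor average minus self
    reaction = beta * (H - H ** 3)
    return H + dt * (alpha * diffusion + reaction)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
A_norm = A / A.sum(axis=1, keepdims=True)
H = 0.1 * rng.standard_normal((3, 4))
for _ in range(100):
    H = acmp_step(H, A_norm, alpha=0.5, beta=1.0)
print(np.round(H, 2))                         # entries saturate near +/-1
```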
GNNs with ACMP achieve state-of-the-art performance on real-world node classification tasks on both homophilic and heterophilic datasets., Average Sensitivity of Decision Tree Learning,https://openreview.net/forum?id=boik01yhssB,https://openreview.net/pdf?id=boik01yhssB,We design decision tree learning algorithms that are stable against perturbations in the training data.,"A decision tree is a fundamental model used in data mining and machine learning. In practice, the training data used to construct a decision tree may change over time or contain noise. A drastic change in the learned tree structure owing to such data perturbation is unfavorable in practice. For example, in data mining, a change in the tree implies that the extracted knowledge can be unstable, which raises the question of whether the extracted knowledge is truly reliable or is only a noisy artifact. To alleviate this issue, we design decision tree learning algorithms that are stable against insignificant perturbations in the training data. Specifically, we adopt the notion of average sensitivity as a stability measure, and design an algorithm with low average sensitivity that outputs a decision tree whose accuracy is nearly equal to that of the optimal decision tree. The experimental results on real-world datasets demonstrate that the proposed algorithm achieves a low average sensitivity with an insignificant decrease in accuracy.","decision tree, average sensitivity, trustworthy machine learning" Minimum Curvature Manifold Learning,https://openreview.net/forum?id=yxj33c6NuX,https://openreview.net/pdf?id=yxj33c6NuX,"We propose a minimum extrinsic curvature principle for manifold regularization and Minimum Curvature Autoencoder (MCAE), a graph-free coordinate-invariant extrinsic curvature minimization framework for autoencoder regularization.","It is widely observed that vanilla autoencoders can have low manifold learning accuracy given a noisy or small training dataset. Recent work has discovered that it is important to regularize the decoder that explicitly parameterizes the manifold, where a neighborhood graph is employed for decoder regularization. However, one caveat of this method is that it is not always straightforward to construct a correct graph. Alternatively, one may consider naive graph-free regularization methods such as minimizing the norm of the decoder's Jacobian or Hessian, but these norms are not coordinate-invariant (i.e. reparametrization-invariant) and hence do not capture any meaningful geometric quantity of the manifold, nor result in geometrically meaningful manifold regularization effects. Another recent approach, isometric regularization, implicitly forces the manifold to have zero intrinsic curvature, resulting in some geometrically meaningful regularization effects. However, since the intrinsic curvature does not capture how the manifold is embedded in the data space from an extrinsic perspective, the regularization effects are often limited. In this paper, we propose a {\it minimum extrinsic curvature principle} for manifold regularization and {\bf Minimum Curvature Autoencoder (MCAE)}, a graph-free coordinate-invariant extrinsic curvature minimization framework for autoencoder regularization.
Experiments with various standard datasets demonstrate that MCAE improves manifold learning accuracy compared to existing methods, especially showing strong robustness to noise.","Autoencoder, Manifold, Curvature, Riemannian geometry" GeONet: a neural operator for learning the Wasserstein geodesic,https://openreview.net/forum?id=cuZNcYohQCX,https://openreview.net/pdf?id=cuZNcYohQCX,We design a neural operator deep learning framework for learning the Wasserstein geodesic from any input pair of distributions.,"Optimal transport (OT) offers a versatile framework to compare complex data distributions in a geometrically meaningful way. Traditional methods for computing the Wasserstein distance and geodesic between probability measures require mesh-dependent domain discretization and suffer from the curse of dimensionality. We present GeONet, a mesh-invariant deep neural operator network that learns the non-linear mapping from the input pair of initial and terminal distributions to the Wasserstein geodesic connecting the two endpoint distributions. In the offline training stage, GeONet learns the saddle-point optimality conditions for the dynamic formulation of the OT problem in the primal and dual spaces, which are characterized by a coupled PDE system. The subsequent inference stage is instantaneous and can be deployed for real-time predictions in the online learning setting. We demonstrate that GeONet achieves comparable testing accuracy to the standard OT solvers on a simulation example and the CIFAR-10 dataset, with inference-stage computational cost reduced by orders of magnitude.","Wasserstein, optimal transport, neural operator, GeONet" Causal Proxy Models For Concept-Based Model Explanations,https://openreview.net/forum?id=oHBgj83w1MB,https://openreview.net/pdf?id=oHBgj83w1MB,"We introduce Causal Proxy Models (CPMs), a novel concept-based explanation method. Our method used counterfactual training data to achieve state-of-the-art explanation performance on the CEBaB benchmark.","Explainability methods for NLP systems encounter a version of the fundamental problem of causal inference: for a given ground-truth input text, we never truly observe the counterfactual texts necessary for isolating the causal effects of model representations on outputs. In response, many explainability methods make no use of counterfactual texts, assuming they will be unavailable. In this paper, we show that robust causal explainability methods can be created using approximate counterfactuals, which can be written by humans to approximate a specific counterfactual or simply sampled using metadata-guided heuristics. The core of our proposal is the Causal Proxy Model (CPM). A CPM explains a black-box model $\mathcal{N}$ because it is trained to have the same \emph{actual} input/output behavior as $\mathcal{N}$ while creating neural representations that can be intervened upon to simulate the \emph{counterfactual} input/output behavior of $\mathcal{N}$.
Furthermore, we show that the best CPM for $\mathcal{N}$ performs comparably to $\mathcal{N}$ in making factual predictions, which means that the CPM can simply replace $\mathcal{N}$, leading to more explainable deployed models.","Explainability, Causality, Concept-Based Explanations, Causal Explanations" Offline Reinforcement Learning with Differential Privacy,https://openreview.net/forum?id=NT51Ty0-Bfu,https://openreview.net/pdf?id=NT51Ty0-Bfu,We present the first provably efficient offline RL algorithms with differential privacy guarantees.,"The offline reinforcement learning (RL) problem is often motivated by the need to learn data-driven decision policies in financial, legal and healthcare applications. However, the learned policy could retain sensitive information of individuals in the training data (e.g., treatment and outcome of patients), and is thus susceptible to various privacy risks. We design offline RL algorithms with differential privacy guarantees which provably prevent such risks. These algorithms also enjoy strong instance-dependent learning bounds under both tabular and linear Markov Decision Process (MDP) settings. Our theory and simulation suggest that the privacy guarantee comes at (almost) no drop in utility compared to the non-private counterpart for a medium-size dataset.","offline reinforcement learning, differential privacy" Relational Attention: Generalizing Transformers for Graph-Structured Tasks,https://openreview.net/forum?id=cFuMmbWiN6,https://openreview.net/pdf?id=cFuMmbWiN6,"We generalize transformer attention to include edge vectors, which are then updated along with the standard node vectors in each layer of a transformer's computation.","Transformers flexibly operate over sets of real-valued vectors representing task-specific entities and their attributes, where each vector might encode one word-piece token and its position in a sequence, or some piece of information that carries no position at all. As set processors, transformers are at a disadvantage in reasoning over more general graph-structured data where nodes represent entities and edges represent relations between entities. To address this shortcoming, we generalize transformer attention to consider and update edge vectors in each transformer layer. We evaluate this relational transformer on a diverse array of graph-structured tasks, including the large and challenging CLRS Algorithmic Reasoning Benchmark. There, it dramatically outperforms state-of-the-art graph neural networks expressly designed to reason over graph-structured data. Our analysis demonstrates that these gains are attributable to relational attention's inherent ability to leverage the greater expressivity of graphs over sets.","Graph Neural Networks, Transformers, Graph Representation Learning, Neural Algorithmic Reasoning" Accelerating Adaptive Federated Optimization with Local Gossip Communications,https://openreview.net/forum?id=o8g9WfWDTF,https://openreview.net/pdf?id=o8g9WfWDTF,,"Recently, adaptive federated optimization methods, such as FedAdam and FedAMSGrad, have gained increasing attention for their fast convergence and stable performance, especially in training models with heavy-tailed stochastic gradient distributions. However, these adaptive federated methods suffer from the dilemma of local steps, i.e., the convergence rate gets worse as the number of local steps increases in partial participation settings, making it challenging to further improve the efficiency of adaptive federated optimization.
In this paper, we propose a novel method to accelerate adaptive federated optimization with local gossip communications when data is heterogeneous. Particularly, we aim to lower the impact of data dissimilarity by gathering clients into disjoint clusters inside which they are connected with local client-to-client links and are able to conduct local gossip communications. We show that our proposed algorithm achieves a faster convergence rate as the local steps increase, thus solving the dilemma of local steps. Specifically, our solution improves the convergence rate from $\mathcal{O}(\sqrt{\tau}/\sqrt{TM})$ in FedAMSGrad to $\mathcal{O}(1/\sqrt{T\tau M})$ in partial participation scenarios for the nonconvex stochastic setting. Extensive experiments and ablation studies demonstrate the effectiveness and broad applicability of our proposed method.","Federated learning, Nonconvex optimization" On the Complexity of Bayesian Generalization,https://openreview.net/forum?id=0DTpO6lLIN,https://openreview.net/pdf?id=0DTpO6lLIN,Correlating the shift between rule- and similarity-based generalization with the subjective complexity of the natural visual world.,"We consider concept generalization at a large scale in a diverse and natural visual spectrum. Established computational modes (\ie, rule-based or similarity-based) are primarily studied in isolation and focus on confined and abstract problem spaces. In this work, we study the two modes when the problem space scales up and the *complexity* of concepts becomes diverse. Specifically, at the **representational level**, we seek to answer how the complexity varies when a visual concept is mapped to the representation space. Prior psychology literature has shown that two types of complexities (*i.e.*, subjective complexity and visual complexity) (Griffiths and Tenenbaum, 2003) build an inverted-U relation (Donderi, 2006; Sun and Firestone, 2021). Leveraging *Representativeness of Attribute* (RoA), we computationally confirm the following observation: Models use attributes with high RoA to describe visual concepts, and the description length falls in an inverted-U relation with the increase in visual complexity. Meanwhile, at the **computational level**, we aim to answer how the complexity of representation affects the shift between the rule- and similarity-based generalization. We hypothesize that category-conditioned visual modeling estimates the co-occurrence frequency between visual and categorical attributes, thus having the potential to serve as the prior for the natural visual world. Experimental results show that representations with relatively high subjective complexity outperform those with relatively low subjective complexity in the rule-based generalization, while the trend is opposite in the similarity-based generalization.","Bayesian Generalization, Rational Analysis" Distilling Model Failures as Directions in Latent Space,https://openreview.net/forum?id=99RpBVpLiX,https://openreview.net/pdf?id=99RpBVpLiX,We present a scalable method for automatically distilling and captioning a model's failure modes as directions in a latent space.,"Existing methods for isolating hard subpopulations and spurious correlations in datasets often require human intervention. This can make these methods labor-intensive and dataset-specific. To address these shortcomings, we present a scalable method for automatically distilling a model's failure modes.
Specifically, we harness linear classifiers to identify consistent error patterns, and, in turn, induce a natural representation of these failure modes as directions within the feature space. We demonstrate that this framework allows us to discover and automatically caption challenging subpopulations within the training dataset. Moreover, by combining our framework with off-the-shelf diffusion models, we can generate images that are especially challenging for the analyzed model, and thus can be used to perform synthetic data augmentation that helps remedy the model's failure modes.","datasets, biases, subpopulations" Stable Target Field for Reduced Variance Score Estimation,https://openreview.net/forum?id=WmIwYTd0YTF,https://openreview.net/pdf?id=WmIwYTd0YTF,We propose a low variance objective to improve the training of score-based models,"Score-based generative models (SGMs) generate samples by reversing a fixed forward diffusion process. Despite impressive empirical results, we observe that the training process leads to unstable outcomes, especially when the reverse-time solvers adopt a large step size. The performance of converged models varies significantly with different random seeds, and they produce noticeable artifacts in generated samples. We suggest that the source of such instability lies in the handling of intermediate noise-variance scales, where multiple modes in the data affect the direction of reverse paths. Thus, the score-matching objective has a large sample variance in this regime, explaining the lower quality of score estimates. We propose to remedy the problem by incorporating a reference batch for minibatch updates where the reference batch is used to calculate weighted conditional scores as the more stable training targets. We show that the procedure indeed helps in the challenging intermediate regime by reducing (the trace of) the covariance of training targets. The new stable targets can be seen as trading bias for reduced variance where the bias vanishes with increasing reference batch size. Empirically, we show that the new objective improves the image quality of state-of-the-art SGMs across datasets with both general ODE and SDE solvers. In particular, our method improves and stabilizes the final performance of SGMs, while also speeding up the training process.","generative model, score-based models, diffusion models, variance reduction" Graph Contrastive Learning Under Heterophily: Utilizing Graph Filters to Generate Graph Views,https://openreview.net/forum?id=NzcUQuhEGef,https://openreview.net/pdf?id=NzcUQuhEGef,"We propose HLCL, a contrastive learning framework that leverages a high-pass graph filter as our augmentation method to generate meaningful representations for heterophily graphs.","Graph Neural Networks have achieved tremendous success in (semi-)supervised tasks for which task-specific node labels are available. However, obtaining labels is expensive in many domains, especially as the graphs grow larger in size. Hence, there has been a growing interest in the application of self-supervised techniques, in particular contrastive learning (CL), to graph data. In general, CL methods work by maximizing the agreement between encoded augmentations of the same example, and minimizing agreement between encoded augmentations of different examples. However, we show that existing graph CL methods perform very poorly on graphs with heterophily, in which connected nodes tend to belong to different classes.
First, we show that this is attributable to the ineffectiveness of existing graph augmentation methods. Then, we leverage graph filters to directly generate augmented graph views for graph CL under heterophily. In particular, instead of explicitly augmenting the graph topology and encoding the augmentations, we use a high-pass filter in the encoder to generate node representations only based on high-frequency graph signals. Then, we contrast the high-pass filtered representations with their low-pass counterparts produced by the same encoder to generate the final representations. Our experimental results confirm that our proposed method, HLCL, outperforms state-of-the-art CL methods on benchmark graphs with heterophily, by up to 10%.","GNN, Contrastive learning, Heterophily, Graph Representation Learning" Continuous pseudo-labeling from the start,https://openreview.net/forum?id=m3twGT2bAug,https://openreview.net/pdf?id=m3twGT2bAug,We show how to perform continuous self-training right from the start without any supervised pre-training.,"Self-training (ST), or pseudo-labeling, has sparked significant interest in the automatic speech recognition (ASR) community recently because of its success in harnessing unlabeled data. Unlike prior semi-supervised learning approaches that relied on iteratively regenerating pseudo-labels (PLs) from a trained model and using them to train a new model, recent state-of-the-art methods perform ‘continuous training’ where PLs are generated using a very recent version of the model being trained. Nevertheless, these approaches still rely on bootstrapping the ST using an initial supervised learning phase where the model is trained on labeled data alone. We believe this has the potential for over-fitting to the labeled dataset in low-resource settings and that ST from the start of training should reduce over-fitting. In this paper, we show how we can do this by dynamically controlling the evolution of PLs during the training process in ASR. To the best of our knowledge, this is the first study that shows the feasibility of generating PLs from the very start of the training. We are able to achieve this using two techniques that avoid instabilities which lead to degenerate models that do not generalize. Firstly, we control the evolution of PLs through a curriculum that uses the online changes in PLs to control the membership of the cache of PLs and improve generalization. Secondly, we find that by sampling transcriptions from the predictive distribution, rather than only using the best transcription, we can stabilize training further. With these techniques, our ST models match prior works without an external language model.","self-training, pseudo-labeling, speech recognition, data selection and filtering" Hybrid Federated Learning for Feature & Sample Heterogeneity: Algorithms and Implementation,https://openreview.net/forum?id=4xzk3zGtz1h,https://openreview.net/pdf?id=4xzk3zGtz1h,"In this paper, we propose the first hybrid federated learning model and algorithm, which deals with partially overlapped features and samples in clients' datasets","Federated learning (FL) is a popular distributed machine learning paradigm dealing with distributed and private data sets. Based on the data partition pattern, FL is often categorized into horizontal, vertical, and hybrid settings. All three settings have many applications, but hybrid FL remains relatively underexplored, because it deals with the challenging situation where both the feature space and the data samples are heterogeneous.
This work designs a novel mathematical model that effectively allows the clients to aggregate distributed data with heterogeneous and possibly overlapping features and samples. Our main idea is to partition each client's model into a feature extractor part and a classifier part, where the former can be used to process the input data, while the latter is used to perform the learning from the extracted features. The heterogeneous feature aggregation is done through building a server model, which assimilates local classifiers and feature extractors through a carefully designed matching mechanism. A communication-efficient algorithm is then designed to train both the client and server models. Finally, we conduct numerical experiments on multiple image classification data sets to validate the performance of the proposed algorithm. To our knowledge, this is the first formulation and algorithm developed for hybrid FL.","Federated Learning, Model Ensemble, Model Design, Algorithm Design" SimA: Simple Softmax-free Attention For Vision Transformers,https://openreview.net/forum?id=rSHfAMCc997,https://openreview.net/pdf?id=rSHfAMCc997,SimA is a simple softmax-free and hardware-friendly attention that has on-par accuracy with SOTA vision transformers.,"Recently, vision transformers have become very popular. However, deploying them in many applications is computationally expensive partly due to the Softmax layer in the attention block. We introduce a simple yet effective Softmax-free attention block, SimA, which normalizes the query and key matrices with a simple $\ell_1$-norm instead of using a Softmax layer. Then, the attention block in SimA is a simple multiplication of three matrices, so SimA can dynamically change the ordering of the computation at test time to achieve computation linear in the number of tokens or the number of channels. We empirically show that SimA applied to three SOTA variations of transformers, DeiT, XCiT, and CvT, results in on-par accuracy compared to the SOTA models, without any need for a Softmax layer. Interestingly, changing SimA from multi-head to single-head has only a small effect on the accuracy, which further simplifies the attention block.","Computer Vision, Vision Transformer, Efficient Vision Transformer, Image Recognition" Bridging the Gap Between Cascade and End-to-End Cross-modal Translation Models: A Zero-Shot Approach,https://openreview.net/forum?id=Y6Gs9DdZGj5,https://openreview.net/pdf?id=Y6Gs9DdZGj5,,"One of the main problems in cross-modal translation, such as Speech Translation or OCR Image Translation, is the mismatch among different modalities. The second problem, scarcity of parallel data covering multiple modalities, means that the end-to-end multi-modal neural network models tend to perform worse than cascade models, although there are exceptions under favorable conditions. To address these problems, we present a differentiable cascade translation model, connecting two pre-trained uni-modality modules in a trainable way. We adapt the Word Rotator’s Distance loss using the Optimal Transport approach, which effectively handles the multi-modal discrepancy. Furthermore, the approach naturally enables zero-shot multi-modal training, reducing the dependence of end-to-end models on large amounts of data, and at the same time allowing end-to-end training when data do become available.
Our comprehensive experiments on the MuSTC benchmarks show that our end-to-end zero-shot approach performs as well as or better than the CTC-based cascade models, and that our end-to-end model with supervised training matches the latest state-of-the-art results.","Zero-Shot, End-to-End, Speech Translation" Policy Architectures for Compositional Generalization in Control,https://openreview.net/forum?id=0W1TQ_hoMFN,https://openreview.net/pdf?id=0W1TQ_hoMFN,,"Several tasks in control, robotics, and planning can be specified through desired goal configurations for entities in the environment. Learning goal-conditioned policies is a natural paradigm to solve such tasks. However, learning and generalizing on complex tasks can be challenging due to variations in the number of entities or the composition of goals. To address this challenge, we introduce the Entity-Factored Markov Decision Process (EFMDP), a formal framework for modeling the entity-based compositional structure in control tasks. Geometrical properties of the EFMDP framework provide theoretical motivation for policy architecture design, particularly Deep Sets and popular relational mechanisms such as graphs and self-attention. These structured policy architectures are flexible and can be trained end-to-end with standard reinforcement and imitation learning algorithms. We study and compare the learning and generalization properties of these architectures on a suite of simulated robot manipulation tasks, finding that they achieve significantly higher success rates with less data compared to standard multilayer perceptrons. Structured policies also enable broader and more compositional generalization, producing policies that extrapolate to different numbers of entities than seen in training, and stitch together (i.e. compose) learned skills in novel ways. Video results can be found at https://sites.google.com/view/comp-gen-anon.","Reinforcement Learning, Imitation Learning, Compositionality" GNNDelete: A General Unlearning Strategy for Graph Neural Networks,https://openreview.net/forum?id=X9yCkmT5Qrl,https://openreview.net/pdf?id=X9yCkmT5Qrl,,"We consider the problem of graph unlearning, wherein a graph neural network (GNN) model is trained to a specified accuracy and then deployed while a sequence of requests arrives to delete graph elements (nodes, edges) from the model. As GNN models are used in real-world implementations, this problem is increasingly vital to address—for example, when a user seeks to hide their connections with others in a social graph or when relationships in a knowledge graph become irrelevant or are not true anymore. To unlearn information from a trained GNN, its influence on both GNN model weights as well as on representations of neighbors in the graph must be deleted from the model. However, existing methods using retraining and weight modification either degrade model weights shared across all nodes or are ineffective because of the strong dependency of deleted edges on their local graph neighborhood. Realizing these pitfalls, we formalize the required properties for graph unlearning in the form of Deleted Edge Consistency and Neighborhood Influence and develop GNNDELETE, a model-agnostic layer-wise operator that optimizes both properties for unlearning tasks. GNNDELETE updates latent representations to delete nodes and edges from the model while keeping the rest of the learned knowledge intact.
Experiments on six real-world graphs and two knowledge graphs show that GNNDELETE outperforms existing graph unlearning models by up to 36.9% in AUC on link prediction tasks and 22.5% in AUC on distinguishing deleted edges from non-deleted edges. GNNDELETE is also efficient; e.g., it takes 12.3x less time and 9.3x less space than retraining from scratch on WordNet18. Our code is available here.","Graph Unlearning, Graph Neural Networks, Knowledge Graphs, Graph Representation Learning, Data Deletion" Lower Bounds for Differentially Private ERM: Unconstrained and Non-Euclidean,https://openreview.net/forum?id=Jqas82UP428,https://openreview.net/pdf?id=Jqas82UP428,,"We consider the lower bounds of differentially private empirical risk minimization (DP-ERM) for convex functions in both constrained and unconstrained cases concerning the general $\ell_p$ norm beyond the $\ell_2$ norm considered by most of the previous works. We provide a simple black-box reduction approach that can generalize lower bounds from the constrained to the unconstrained case. Moreover, for $(\epsilon,\delta)$-DP, we achieve the optimal $\Omega(\frac{\sqrt{d \log(1/\delta)}}{\epsilon n})$ lower bounds for both constrained and unconstrained cases and any $\ell_p$ geometry where $p\geq 1$ by considering $\ell_1$ loss over the $\ell_{\infty}$ ball.", Compound Tokens: Channel Fusion for Vision-Language Representation Learning,https://openreview.net/forum?id=J9Z3MlnPU_f,https://openreview.net/pdf?id=J9Z3MlnPU_f,We provide a new multi-modal fusion method that concatenates tokens along the channel dimension.,"We present an effective method for fusing visual-and-language representations for several question answering tasks including visual question answering and visual entailment. In contrast to prior works that concatenate unimodal representations or use only cross-attention, we compose multimodal representations via channel fusion. By fusing on the channels, the model is able to more effectively align the tokens compared to standard methods. These multimodal representations, which we call compound tokens, are generated with cross-attention transformer layers. First, vision tokens are used as queries to retrieve compatible text tokens through cross-attention. We then chain the vision tokens and the queried text tokens along the channel dimension. We call the resulting representations compound tokens. A second group of compound tokens is generated using an analogous process where the text tokens serve as queries to the cross-attention layer. We concatenate all the compound tokens for further processing with a multimodal encoder. We demonstrate the effectiveness of compound tokens using an encoder-decoder vision-language model trained end-to-end in the open-vocabulary setting. Compound Tokens achieve highly competitive performance across a range of question answering tasks including GQA, VQA2.0, and SNLI-VE. We plan to make the code public.","question answering tasks, multi-modal fusion, vision-language model, representation learning" The Convergence Rate of SGD's Final Iterate: Analysis on Dimension Dependence,https://openreview.net/forum?id=bzpCoHn1Vc_,https://openreview.net/pdf?id=bzpCoHn1Vc_,,"Stochastic Gradient Descent (SGD) is among the simplest and most popular optimization and machine learning methods. Running SGD with a fixed step size and outputting the final iterate is an ideal strategy one can hope for, but it is still not well understood even though SGD has been studied extensively for over 70 years.
Given the $\Theta(\log T)$ gap between current upper and lower bounds for running SGD for $T$ steps, it was then asked by [Koren and Segal, COLT '20] how to characterize the final-iterate convergence of SGD with a fixed step size in the constant dimension setting, i.e., $d=O(1)$. In this paper, we consider the more general setting for any $d\leq T$, proving $\Omega(\log d/\sqrt{T})$ lower bounds for the sub-optimality of the final iterate of SGD in minimizing non-smooth Lipschitz convex functions with standard step sizes. Our results provide the first general dimension-dependent lower bound on the convergence of SGD's final iterate, partially resolving the COLT open question raised by [Koren and Segal, COLT '20]. Moreover, we present a new method in one dimension based on martingales and Freedman’s inequality, which attains the tight $O(1/\sqrt{T})$ upper bound under mild assumptions and recovers the same $O(\log T/\sqrt{T})$ bounds as the previous best results under the standard assumptions.", "Multi-Rate VAE: Train Once, Get the Full Rate-Distortion Curve",https://openreview.net/forum?id=OJ8aSjCaMNK,https://openreview.net/pdf?id=OJ8aSjCaMNK,MR-VAEs can construct the rate-distortion curve in a single training run.,"Variational autoencoders (VAEs) are powerful tools for learning latent representations of data used in a wide range of applications. In practice, VAEs usually require multiple training rounds to choose the amount of information the latent variable should retain. This trade-off between the reconstruction error (distortion) and the KL divergence (rate) is typically parameterized by a hyperparameter $\beta$. In this paper, we introduce Multi-Rate VAE (MR-VAE), a computationally efficient framework for learning optimal parameters corresponding to various $\beta$ in a single training run. The key idea is to explicitly formulate a response function that maps $\beta$ to the optimal parameters using hypernetworks. MR-VAEs construct a compact response hypernetwork where the pre-activations are conditionally gated based on $\beta$. We justify the proposed architecture by analyzing linear VAEs and showing that it can represent response functions for linear VAEs. With the learned hypernetwork, MR-VAEs can construct the rate-distortion curve without additional training and can be deployed with significantly less hyperparameter tuning. Empirically, our approach is competitive and often exceeds the performance of training multiple $\beta$-VAEs, with minimal computation and memory overheads.","Variational Autoencoders, VAEs, Hypernetworks, Response Functions, Hyperparameter Tuning" Are vision transformers more robust than CNNs for Backdoor attacks?,https://openreview.net/forum?id=7P_yIFi6zaA,https://openreview.net/pdf?id=7P_yIFi6zaA,,"Transformer architectures are based on a self-attention mechanism that processes images as a sequence of patches. As their design is quite different compared to CNNs, it is interesting to study whether transformers are vulnerable to backdoor attacks and how different transformer architectures affect attack success rates. Backdoor attacks happen when an attacker poisons a small part of the training images with a specific trigger, or backdoor, which will be activated later. The model performs well on clean test images, but the attacker can manipulate the decision of the model by showing the trigger on an image at test time.
In this paper, we perform a comparative study of state-of-the-art architectures through the lens of backdoor robustness, specifically, how attention mechanisms affect robustness. We show that the popular vision transformer architecture (ViT) is the least robust architecture, and ResMLP, which belongs to a class called Feed Forward Networks (FFN), is the most robust one to backdoor attacks among state-of-the-art architectures. We also find an intriguing difference between transformers and CNNs – interpretation algorithms effectively highlight the trigger on test images for transformers but not for CNNs. Based on this observation, we find that a test-time image blocking defense reduces the attack success rate by a large margin for transformers. We also show that such blocking mechanisms can be incorporated during the training process to improve robustness even further. We believe our experimental findings will encourage the community to understand the building block components in developing novel architectures robust to backdoor attacks.","Backdoor Attacks, Vision Transformers, Robustness" Adaptive Gradient Methods with Local Guarantees,https://openreview.net/forum?id=_tIZQEMcWyv,https://openreview.net/pdf?id=_tIZQEMcWyv,,"Adaptive gradient methods are the method of choice for optimization in machine learning and are used to train the largest deep models. In this paper we study the problem of learning a local preconditioner that can change as the data changes along the optimization trajectory. We propose an adaptive gradient method that has provable adaptive regret guarantees vs. the best local preconditioner. To derive this guarantee, we prove a new adaptive regret bound in online learning that improves upon previous adaptive online learning methods. We demonstrate the robustness of our method in automatically choosing the optimal learning rate schedule for popular benchmarking tasks in vision and language domains. Without the need to manually tune a learning rate schedule, our method can, in a single run, achieve task accuracy that is comparable to, and as stable as, that of a fine-tuned optimizer.", Combinatorial-Probabilistic Trade-Off: P-Values of Community Properties Test in the Stochastic Block Models,https://openreview.net/forum?id=8qjSA5QACb40,https://openreview.net/pdf?id=8qjSA5QACb40,We propose an inferential framework testing the general community combinatorial properties of the stochastic block model and prove the minimax lower bound of the general community property test.,"We propose an inferential framework testing the general community combinatorial properties of the stochastic block model. We aim to test the hypothesis of whether a certain community property is satisfied, e.g., whether a given set of nodes belongs to the same community, and provide p-values for uncertainty quantification. Our framework is applicable to all symmetric community properties. To ease the challenges caused by the combinatorial nature of community properties, we develop a novel shadowing bootstrap method. Utilizing the symmetry, our method can find a shadowing representative of the true assignment, and the number of tested assignments in the alternative is greatly reduced. In theory, we introduce a combinatorial distance between two community classes and show a combinatorial-probabilistic trade-off phenomenon. Our test is honest as long as the product of the combinatorial distance between two communities and the probabilistic distance between two connection probabilities is sufficiently large.
Besides, we show that such a trade-off also exists in the information-theoretic lower bound. We also conduct numerical experiments to show the validity of our method.","combinatorial inference, stochastic block models, community properties, minimax lower bound" RelationCLIP: Training-free Fine-grained Visual and Language Concept Matching,https://openreview.net/forum?id=yPQoijKtdHO,https://openreview.net/pdf?id=yPQoijKtdHO,A simple but effective method that can improve zero-shot performance for CLIP-like models on fine-grained Image-text matching datasets,"Contrastive Language-Image Pretraining (CLIP) has demonstrated great zero-shot performance for image-text matching, because of its holistic use of natural language supervision that covers large-scale, unconstrained real-world visual concepts. However, it is still challenging to adapt CLIP to fine-grained image-text matching between disentangled visual concepts and text semantics without training. Towards a more accurate zero-shot inference of CLIP-like models for fine-grained concept matching, in this paper, we study the image-text matching problem from a causal perspective: the erroneous semantics of individual entities are essentially confounders that cause the matching failure. Therefore, we propose a novel training-free framework, RelationCLIP, by disentangling input images into subjects, objects, and action entities. By exploiting fine-grained matching between visual components and word concepts from different entities, RelationCLIP can mitigate spurious correlations introduced by the pretrained CLIP models and dynamically assess the contribution of each entity when performing image and text matching. Experiments on SVO-Probes and our newly-introduced Visual Genome Concept datasets demonstrate the effectiveness of our plug-and-play method, which boosts the zero-shot inference ability of CLIP even without pre-training or fine-tuning. Our code is available at https://anonymous.4open.science/r/Relation-CLIP.","Zero-shot, Image-text Matching, CLIP" Min-Max Zero-Shot Multi-Label Classification,https://openreview.net/forum?id=gxq1n1f0c7l,https://openreview.net/pdf?id=gxq1n1f0c7l,,"In many classification problems, acquiring labeled examples for many classes is difficult, resulting in high interest in zero-shot learning frameworks. Zero-shot learning (ZSL) is a problem setup where, at test time, a learner observes samples from classes that were not observed/trained in the training phase and is required to predict the category they belong to. Zero-shot learning transfers knowledge from seen classes (observed classes in the training phase) to unseen classes (unobserved classes in the training phase but present in the testing phase), reducing the human labor of data annotation to build new classifiers. However, most zero-shot learning research targets single-label classification (multi-class setting). There are few studies on multi-label zero-shot learning due to the difficulty in modeling complex semantics conveyed by a set of labels. We propose a novel probabilistic model that incorporates more general feature representations (e.g., Word-Net hierarchy, word2vec features, convolutional neural network features (layer-wise), and co-occurrence statistics) and learns the knowledge transfer in terms of data structure and relations. We also investigate the effect of leveraging different CNN layers' features. Our experimental results prove the efficacy of our model in handling unseen labels.
We run additional experiments to analyze whether methods converge to flat or sharp minima, as a factor in generalization. Our study suggests that our proposed method converges to flat minima, resulting in strong generalization.","Transfer learning, zero-shot learning" Meta-Learning in Games,https://openreview.net/forum?id=uHaWaNhCvZD,https://openreview.net/pdf?id=uHaWaNhCvZD,We formalize and study the problem of meta-learning across a wide range of fundamental multi-agent settings.,"In the literature on game-theoretic equilibrium finding, focus has mainly been on solving a single game in isolation. In practice, however, strategic interactions—ranging from routing problems to online advertising auctions—evolve dynamically, thereby leading to many similar games to be solved. To address this gap, we introduce meta-learning for equilibrium finding and learning to play games. We establish the first meta-learning guarantees for a variety of fundamental and well-studied games, including two-player zero-sum games, general-sum games, Stackelberg games, and multiple extensions thereof. In particular, we obtain rates of convergence to different game-theoretic equilibria that depend on natural notions of similarity between the sequence of games encountered, while at the same time recovering the known single-game guarantees when the sequence of games is arbitrary. Along the way, we prove a number of new results in the single-game regime through a simple and unified framework, which may be of independent interest. Finally, we evaluate our meta-learning algorithms on endgames faced by the poker agent Libratus against top human professionals. The experiments show that games with varying stack sizes can be solved significantly faster using our meta-learning techniques than by solving them separately, often by an order of magnitude.","meta-learning, algorithmic game theory, online learning" GLASU: A Communication-Efficient Algorithm for Federated Learning with Vertically Distributed Graph Data,https://openreview.net/forum?id=fR_0uObMTjG,https://openreview.net/pdf?id=fR_0uObMTjG,This paper proposes a GNN model design approach and a communication-efficient algorithm for federated learning on feature-distributed graph data,"Vertical federated learning (VFL) is a distributed learning paradigm, where computing clients collectively train a model based on the partial features of the same set of samples they possess. Current research on VFL focuses on the case where samples are independent, but it rarely addresses an emerging scenario where samples are interrelated through a graph. For graph-structured data, graph neural networks (GNNs) are rather competitive machine learning models, but a naive implementation in the VFL setting causes a significant communication overhead; moreover, the analysis is faced with a challenge caused by the biased stochastic gradients. In this paper, we propose a model splitting method that splits a backbone GNN across the clients and the server and a communication-efficient algorithm, GLASU, to train such a model. GLASU adopts lazy aggregation and stale updates to skip aggregation when evaluating the model and skip feature exchanges during training, greatly reducing communication.
We offer a theoretical analysis and conduct extensive numerical experiments on real-world datasets, showing that the proposed algorithm effectively trains a GNN model, whose performance matches that of the backbone GNN when trained in a centralized manner.","Federated Learning, Graph Neural Network, Feature Distributed Federated Learning" Dynamic Embeddings of Temporal High-Order Interactions via Neural Diffusion-Reaction Processes,https://openreview.net/forum?id=wOzKzPf6BBv,https://openreview.net/pdf?id=wOzKzPf6BBv,We develop a neural diffusion-reaction process model to estimate the dynamic embeddings for the participant entities in tensor decomposition.,"High-order interactions of multiple entities are ubiquitous in practical applications. The associated data often includes the participants, interaction results, and the timestamps when each interaction occurred. While tensor factorization is a popular tool to analyze such data, it often ignores or underuses valuable timestamp information. More importantly, standard tensor factorization only estimates a static representation for each entity and ignores the temporal variation of the representations. However, such variations might reflect important evolution patterns of the underlying properties of the entities. To address these limitations, we propose Dynamical eMbedIngs of TempoRal hIgh-order interactions (DMITRI). We develop a neural diffusion-reaction process model to estimate the dynamic embeddings for the participant entities. Specifically, based on the observed interactions, we build a multi-partite graph to encode the correlation between the entities. We construct a graph diffusion process to co-evolve the embedding trajectories of the correlated entities and use a neural network to construct a reaction process for each individual entity. In this way, our model is able to capture both the commonalities and the individual characteristics in the evolution of the embeddings for different entities. We then use a neural network to model the interaction result as a nonlinear function of the embedding trajectories. For model estimation, we combine ODE solvers to develop a stochastic mini-batch learning algorithm. We propose a simple stratified sampling method to balance the cost of processing each mini-batch so as to improve the overall efficiency. We show the advantage of our approach in both the ablation study and real-world applications.","Embedding Trajectory, Tensor Decomposition" Fair Federated Learning via Bounded Group Loss,https://openreview.net/forum?id=KkI8sjKqtnV,https://openreview.net/pdf?id=KkI8sjKqtnV,,"In federated learning, fair prediction across protected groups is an important constraint for many applications. Unfortunately, prior work studying group fair federated learning lacks formal convergence or fairness guarantees. In this work, we propose a general framework for provably fair federated learning. In particular, we explore and extend the notion of Bounded Group Loss as a theoretically-grounded approach for group fairness. Using this setup, we propose a scalable federated optimization method that optimizes the empirical risk under a number of group fairness constraints. We provide convergence guarantees for the method as well as fairness guarantees for the resulting solution.
Empirically, we evaluate our method across common benchmarks from fair ML and federated learning, showing that it can provide both fairer and more accurate predictions than baseline approaches.","Federated Learning, Group Fairness" "DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking",https://openreview.net/forum?id=kKF8_K-mBbS,https://openreview.net/pdf?id=kKF8_K-mBbS,Molecular docking via non-Euclidean diffusion modeling and confidence estimation,"Predicting the binding structure of a small molecule ligand to a protein---a task known as molecular docking---is critical to drug design. Recent deep learning methods that treat docking as a regression problem have decreased runtime compared to traditional search-based methods but have yet to offer substantial improvements in accuracy. We instead frame molecular docking as a generative modeling problem and develop DiffDock, a diffusion generative model over the non-Euclidean manifold of ligand poses. To do so, we map this manifold to the product space of the degrees of freedom (translational, rotational, and torsional) involved in docking and develop an efficient diffusion process on this space. Empirically, DiffDock obtains a 38% top-1 success rate (RMSD<2Å) on PDBBind, significantly outperforming the previous state-of-the-art of traditional docking (23%) and deep learning (20%) methods. Moreover, DiffDock has fast inference times and provides confidence estimates with high selective accuracy.","molecular docking, protein-ligand binding, diffusion models, score-based models, molecular structure, equivariance, geometric deep learning" An Upper Bound for the Distribution Overlap Index and Its Applications,https://openreview.net/forum?id=WdN2gD6EsXm,https://openreview.net/pdf?id=WdN2gD6EsXm,This paper proposes an easy-to-compute upper bound for the overlap index and applies it for domain shift analysis and one-class classification.,"The overlap index between two probability distributions has various applications in statistics, machine learning, and other scientific research. However, approximating the overlap index is challenging when the probability distributions are unknown (i.e., distribution-free settings). This paper proposes an easy-to-compute upper bound for the overlap index without requiring any knowledge of the distribution models. We first utilize the bound to find the upper limit for the accuracy of a trained machine learning model when a domain shift exists. We additionally employ this bound to study the distribution membership classification of given samples. Specifically, we build a novel, distribution-free, computation-efficient, memory-efficient one-class classifier by converting the bound into a confidence score function. The proposed classifier does not need to train any parameters and requires only a small number of in-class samples to be accurate. The classifier shows its efficacy on various datasets and outperforms many state-of-the-art methods in different one-class classification scenarios, including novelty detection, out-of-distribution detection, and backdoor detection. The obtained results show significant promise toward broadening the applications of overlap-based metrics. 
","Distribution Overlap, Finite-Sample Approximation, One-Class Classification, Domain Shift Analysis" Learning by Distilling Context,https://openreview.net/forum?id=am22IukDiKf,https://openreview.net/pdf?id=am22IukDiKf,"applying context distillation as a general tool to translate language into parameter updates; used to distill instructions, explanations, examples, knowledge, and reasoning procedure.","Language models significantly benefit from context tokens, such as prompts or scratchpads. They perform better when prompted with concrete training examples and abstract statements about the target task (instructions), and they acquire new capabilities to perform complex tasks by generating step-by-step reasoning (scratch-pad) before predicting the final answers. However, they do not internalize these performance gains, which disappear when the context tokens are gone. Consequently, we always need to pay extra computation for this gain, and it is unclear how to transfer the capabilities acquired by context tokens to other tasks, or how to leverage the context tokens when their length exceeds the context window size. Our work proposes to apply context distillation so that a language model can internalize these gains. Concretely, given an input for the target task, we let the model use all relevant context tokens to generate the output, using ``[instructions] + [task-input]'' to predict ``[scratch-pad] + [final answer]''; then we fine-tune the same model to predict the above ``[final answer]'' conditioned on the ``[task-input]'', without seeing the ``[instructions]'' or using the ``[scratch-pad]''. This incentivizes the model to behave as if the context were present, hence updating the parameters to internalize the context information. We show that context distillation can be used as a general method for learning. In particular, we demonstrate that context distillation can effectively internalize 3 types of contexts: 1) abstract task instructions and natural language explanations of why an output is correct or incorrect on Natural-Instructions-V2; 2) step-by-step reasoning on 8-digit addition questions, where we show the model can apply this newly acquired capability to downstream question answering tasks; and 3) concrete training examples on the SPIDER Text-to-SQL dataset, where context distillation outperforms directly learning with gradient descent by 7%.","language models, NLP, prompting, distillation" Target-Free Ligand Scoring via One-Shot Learning,https://openreview.net/forum?id=lGh3JsP0j7k,https://openreview.net/pdf?id=lGh3JsP0j7k,A new method for scoring ligands by activity using a one-shot learning approach.,"Scoring ligands in a library based on their structural similarity to a known hit compound is widely used in drug discovery following high-throughput screening. However, such ""similarity search"" relies on the assumption that structurally similar compounds have similar activities, and will therefore only retrieve ligands with hit-like affinity, requiring resource-intensive tweaking by medicinal chemists to reach a more active lead compound. We propose a novel approach, One-Shot Ligand Scoring (OSLS), that is much more capable of directly retrieving lead-like compounds from a library using a novel one-shot learning technique. For this new task, we design a Siamese-inspired neural architecture using two Transformer encoders without tied weights, a novel positional encoding-like mechanism, and a final prediction head. 
OSLS is able to score ligands by activity against a target without any target-specific knowledge beyond a single known activity value, making it a cost-effective approach to ligand-based or phenotypic drug discovery. We show that OSLS surpasses traditional similarity search as well as modern deep learning baselines on a simulated ligand retrieval task. Furthermore, we demonstrate the applicability of our approach on various drug discovery tasks that also involve ligand scoring, including drug repositioning, precision patient-level drug efficacy prediction, and even molecular generative modeling.","drug discovery, ligand scoring, one-shot learning" Inverse Kernel Decomposition,https://openreview.net/forum?id=vEAHushUBGc,https://openreview.net/pdf?id=vEAHushUBGc,We propose a novel eigen-decomposition-based nonlinear dimensionality reduction method.,"State-of-the-art dimensionality reduction approaches largely rely on complicated optimization procedures. On the other hand, closed-form approaches requiring merely eigen-decomposition do not have enough sophistication and nonlinearity. In this paper, we propose a novel nonlinear dimensionality reduction method---Inverse Kernel Decomposition (IKD)---based on an eigen-decomposition of the sample covariance matrix of data. The method is inspired by Gaussian process latent variable models (GPLVMs) and has comparable performance with GPLVMs. To deal with very noisy data with weak correlations, we propose two solutions---blockwise and geodesic---to make use of locally correlated data points and provide better and numerically more stable latent estimations. We use synthetic datasets and four real-world datasets to show that IKD is a better dimensionality reduction method than other eigen-decomposition-based methods, and achieves comparable performance to optimization-based methods with faster running speeds. An open-source IKD implementation in Python is available at \url{https://anonymous.4open.science/r/ikd-BABC}","Nonlinear dimensionality reduction, GPLVM, eigen-decomposition, kernel, latent estimation" Structured Pruning of CNNs at Initialization,https://openreview.net/forum?id=iA8XoWjDeGK,https://openreview.net/pdf?id=iA8XoWjDeGK,Structured pruning-at-initialization method can be as good as unstructured ones.,"Pruning-at-initialization (PAI) proposes to prune the individual weights of the CNN before training, thus avoiding expensive fine-tuning or retraining of the pruned model. While PAI shows promising results in reducing model size, the pruned model still requires unstructured sparse matrix computation, making it difficult to achieve wall-clock speedups. In this work, we show theoretically and empirically that the accuracy of CNN models pruned by PAI methods only depends on the fraction of remaining parameters in each layer (i.e., layer-wise density), regardless of the granularity of pruning. We formulate the PAI problem as a convex optimization of our newly proposed expectation-based proxy for model accuracy, which leads to finding the optimal layer-wise density of that specific model. Based on our formulation, we further propose a structured and hardware-friendly PAI method, named PreCrop, to prune or reconfigure CNNs in the channel dimension. Our empirical results show that PreCrop achieves a higher accuracy than existing PAI methods on several modern CNN architectures, including ResNet, MobileNetV2, and EfficientNet for both CIFAR-10 and ImageNet.
PreCrop achieves an accuracy improvement of up to $2.7\%$ over the state-of-the-art PAI algorithm when pruning MobileNetV2 on ImageNet. PreCrop also improves the accuracy of EfficientNetB0 by $0.3\%$ on ImageNet with only $80\%$ of the parameters and the same FLOPs.","Pruning, Pruning-at-Initialization, Structured Pruning, Efficient Deep Learning, Efficient Model" Constructive TT-representation of the tensors given as index interaction functions with applications,https://openreview.net/forum?id=yLzLfM-Esnu,https://openreview.net/pdf?id=yLzLfM-Esnu,A method to build tensor-train representations for a wide class of tensors for which an analytical dependence on the indices is given.,"This paper presents a method to build explicit tensor-train (TT) representations. We show that a wide class of tensors can be explicitly represented with sparse TT-cores, obtaining, in many cases, optimal TT-ranks. Numerical experiments show that our method outperforms the existing ones in several practical applications, including game theory problems. Theoretical estimations of the number of operations show that in some problems, such as permanent calculation, our methods are close to the known optimal asymptotics, which are obtained by a completely different type of method.","Tensor approximation, Discrete multivariate functions, Tensor train decomposition, TT-Tucker format, Game theory, Combinatorial problems" Thinking Two Moves Ahead: Anticipating Other Users Improves Backdoor Attacks in Federated Learning,https://openreview.net/forum?id=B7HJ9KLFV9U,https://openreview.net/pdf?id=B7HJ9KLFV9U,,"Federated learning is particularly susceptible to model poisoning and backdoor attacks because individual users have direct control over the training data and model updates. At the same time, the attack power of an individual user is limited because their updates are quickly drowned out by those of many other users. Existing attacks do not account for future behaviors of other users, and thus require many sequential updates, and their effects are quickly erased. We propose an attack that anticipates and accounts for the entire federated learning pipeline, including behaviors of other clients, and ensures that backdoors are effective quickly and persist even after multiple rounds of community updates. We show that this new attack is effective in realistic scenarios where the attacker only contributes to a small fraction of randomly sampled rounds and demonstrate this attack on image classification, next-word prediction, and sentiment analysis.","Privacy, Federated Learning" Continuized Acceleration for Quasar Convex Functions in Non-Convex Optimization,https://openreview.net/forum?id=yYbhKqdi7Hz,https://openreview.net/pdf?id=yYbhKqdi7Hz,,"Quasar convexity is a condition that allows some first-order methods to efficiently minimize a function even when the optimization landscape is non-convex. Previous works develop near-optimal accelerated algorithms for minimizing this class of functions; however, they require a subroutine of binary search which results in multiple calls to gradient evaluations in each iteration, and consequently the total number of gradient evaluations does not match a known lower bound. In this work, we show that a recently proposed continuized Nesterov acceleration can be applied to minimizing quasar convex functions and achieves the optimal bound with a high probability.
Furthermore, we find that the objective functions for training generalized linear models (GLMs) satisfy quasar convexity, which broadens the applicability of the relevant algorithms, while known practical examples of quasar convexity in non-convex learning are sparse in the literature. We also show that if a smooth and one-point strongly convex, Polyak-Lojasiewicz, or quadratic-growth function satisfies quasar convexity, then attaining an accelerated linear rate for minimizing the function is possible under certain conditions, while acceleration is not known in general for these classes of functions. ", Towards Global Optimality in Cooperative MARL with Sequential Transformation,https://openreview.net/forum?id=dZaYbIIW9Cu,https://openreview.net/pdf?id=dZaYbIIW9Cu,,"Policy learning in multi-agent reinforcement learning (MARL) is challenging due to the exponential growth of the joint state-action space with respect to the number of agents. To achieve higher scalability, the paradigm of centralized training with decentralized execution (CTDE) is broadly adopted with factorized structure in MARL. However, we observe that existing CTDE algorithms in cooperative MARL cannot achieve optimality even in simple matrix games. To understand this phenomenon, we analyze two mainstream classes of CTDE algorithms -- actor-critic algorithms and value-decomposition algorithms. Our theoretical and experimental results characterize the weakness of these two classes of algorithms when the optimization method is taken into consideration, which indicates that the centralized training manner currently in use is poorly compatible with decentralized policies. To address this issue, we present a transformation framework that reformulates a multi-agent MDP as a special ""single-agent"" MDP with a sequential structure and allows employing off-the-shelf single-agent reinforcement learning (SARL) algorithms to efficiently learn corresponding multi-agent tasks. After that, a decentralized policy can still be learned by distilling the ""single-agent"" policy. This framework retains the optimality guarantee of SARL algorithms into cooperative MARL. To instantiate this transformation framework, we propose a Transformed PPO, called T-PPO, which can theoretically perform optimal policy learning in finite multi-agent MDPs and significantly outperforms baselines on a large set of cooperative multi-agent tasks.", Sparse tree-based Initialization for Neural Networks,https://openreview.net/forum?id=78xgBm6ckZr,https://openreview.net/pdf?id=78xgBm6ckZr,,"Dedicated neural network (NN) architectures have been designed to handle specific data types (such as CNNs for images or RNNs for text), which ranks them among state-of-the-art methods for dealing with these data. Unfortunately, no architecture has been found for dealing with tabular data yet, for which tree ensemble methods (tree boosting, random forests) usually show the best predictive performance. In this work, we propose a new sparse initialization technique for (potentially deep) multilayer perceptrons (MLP): we first train a tree-based procedure to detect feature interactions and use the resulting information to initialize the network, which is subsequently trained via standard stochastic gradient strategies.
Numerical experiments on several tabular data sets show that this new, simple and easy-to-use method is a solid competitor, in terms of both generalization capacity and computation time, to default MLP initialization and even to existing complex deep learning solutions. In fact, this informed MLP initialization raises the resulting NN methods to the level of a valid competitor to gradient boosting when dealing with tabular data. Besides, such initializations are able to preserve the sparsity of weights introduced in the first layers of the network through training. This fact suggests that this new initializer performs an implicit regularization during NN training, and emphasizes that the first layers act as a sparse feature extractor (as convolutional layers do in CNNs).", Learning Soft Constraints From Constrained Expert Demonstrations,https://openreview.net/forum?id=8sSnD78NqTN,https://openreview.net/pdf?id=8sSnD78NqTN,,"Inverse reinforcement learning (IRL) methods assume that the expert data is generated by an agent optimizing some reward function. However, in many settings, the agent may optimize a reward function subject to some constraints, where the constraints induce behaviors that may be otherwise difficult to express with just a reward function. We consider the setting where the reward function is given, and the constraints are unknown, and propose a method that is able to recover these constraints satisfactorily from the expert data. While previous work has focused on recovering hard constraints, our method can recover cumulative soft constraints that the agent satisfies on average per episode. In IRL fashion, our method solves this problem by adjusting the constraint function iteratively through a constrained optimization procedure, until the agent behavior matches the expert behavior. We demonstrate our approach on synthetic environments and real-world highway driving data.","inverse reinforcement learning, constraint learning" VoGE: A Differentiable Volume Renderer using Gaussian Ellipsoids for Analysis-by-Synthesis,https://openreview.net/forum?id=AdPJb9cud_Y,https://openreview.net/pdf?id=AdPJb9cud_Y,"VoGE is a differentiable renderer based on ray tracing volume densities, which gives better gradients for occlusion reasoning and yields better pose estimation results.","Differentiable rendering allows the application of computer graphics on vision tasks, e.g. object pose and shape fitting, via analysis-by-synthesis, where gradients at occluded regions are important when inverting the rendering process. To obtain those gradients, state-of-the-art (SoTA) differentiable renderers use rasterization to collect a set of nearest components for each pixel and aggregate them based on the viewing distance. In this paper, we propose VoGE, which uses ray tracing to capture nearest components with their volume density distributions on the rays and aggregates via an integral of the volume densities based on Gaussian ellipsoids, which brings more efficient and stable gradients. To efficiently render via VoGE, we propose an approximate closed-form solution for the volume density aggregation and a coarse-to-fine rendering strategy. Finally, we provide a CUDA implementation of VoGE, which gives a competitive rendering speed in comparison to PyTorch3D. Quantitative and qualitative experiment results show VoGE outperforms SoTA counterparts when applied to various vision tasks, e.g., object pose estimation, shape/texture fitting, and occlusion reasoning.
The VoGE code will be publicly available.","Differentiable Rendering, Analysis-by-Synthesis, Pose Estimation" Unravel Structured Heterogeneity of Tasks in Meta-Reinforcement Learning via Exploratory Clustering,https://openreview.net/forum?id=Pe7R48fCkM_,https://openreview.net/pdf?id=Pe7R48fCkM_,We propose a method to automatically discover and utilize the cluster structures of tasks for meta-reinforcement learning.,"Meta-reinforcement learning (meta-RL) is developed to quickly solve new tasks by leveraging knowledge from prior tasks. Previous studies typically assume that tasks are drawn IID, ignoring possible structured heterogeneity among tasks. The non-transferable knowledge caused by structured heterogeneity hinders fast adaptation in new tasks. In this paper, we formulate the structured heterogeneity of tasks via clustering, such that transferable knowledge can be inferred within different clusters and non-transferable knowledge is thereby excluded across clusters. To facilitate this, we develop a dedicated exploratory policy to discover task clusters by reducing uncertainty in posterior inference. Within the identified clusters, the exploitation policy is able to solve related tasks by utilizing knowledge shared within the clusters. Experiments on various MuJoCo tasks show that the proposed method can unravel cluster structures effectively in both rewards and state dynamics, demonstrating strong advantages over a set of state-of-the-art baselines.","Meta Reinforcement Learning, Variational Inference" An Investigation of Domain Generalization with Rademacher Complexity,https://openreview.net/forum?id=Refb0S-paCx,https://openreview.net/pdf?id=Refb0S-paCx,,"The domain generalization (DG) setting challenges a model trained on multiple known data distributions to generalise well on unseen data distributions. Due to its practical importance, many methods have been proposed to address this challenge. However, much work in general-purpose DG is heuristically motivated, as the DG problem is hard to model formally; and recent evaluations have cast doubt on existing methods’ practical efficacy -- in particular compared to a well-tuned empirical risk minimisation baseline. We present a novel learning-theoretic generalisation bound for DG that bounds unseen domain generalisation performance in terms of the model’s empirical risk and Rademacher complexity -- providing a sufficient condition for DG. Based on this insight, we empirically analyze the performance of several methods and show that their performance is indeed influenced by model complexity in practice. Algorithmically, our analysis suggests that tuning for domain generalisation should be achieved by simply performing regularised ERM with a leave-one-domain-out cross-validation objective. Empirical results on the DomainBed benchmark corroborate this.", Towards Efficient Posterior Sampling in Deep Neural Networks via Symmetry Removal,https://openreview.net/forum?id=Xl5Wwp495iC,https://openreview.net/pdf?id=Xl5Wwp495iC,,"Bayesian inference in deep neural networks is challenging due to the high-dimensional, strongly multi-modal posterior landscape. Markov Chain Monte Carlo approaches asymptotically recover the true, intractable posterior but are prohibitively expensive for large modern architectures. Local posterior approximations, while often yielding satisfactory results in practice, crudely disregard the posterior geometry.
We propose to exploit well-known parameter symmetries induced by neuron interchangeability and output activation to retrieve a drastically reduced -- yet exact -- posterior over uniquely identified parametrizations. To this end, we provide an algorithm for explicit symmetry removal and develop an upper bound on the Monte Carlo samples required to capture the reduced posterior. Our experiments suggest that efficient sampling from the functionally relevant part of the posterior is indeed possible, opening up a promising path to faithful uncertainty quantification in deep learning.", Local Stochastic Bilevel Optimization with Momentum-Based Variance Reduction,https://openreview.net/forum?id=6i6ajdIinJm,https://openreview.net/pdf?id=6i6ajdIinJm,,"Bilevel Optimization has witnessed notable progress recently with new emerging efficient algorithms and has been applied to many machine learning tasks such as data cleaning, few-shot learning, and neural architecture search. However, little attention has been paid to solving bilevel problems under the distributed setting. Federated learning (FL) is an emerging paradigm which solves machine learning tasks over distributed data. FL problems are challenging to solve due to data heterogeneity and communication bottlenecks. However, it is unclear how these challenges will affect the convergence of Bilevel Optimization algorithms. In this paper, we study Federated Bilevel Optimization problems. Specifically, we first propose FedBiO, a deterministic gradient-based algorithm, and show that it requires $O(\epsilon^{-1.5})$ steps/communication rounds to reach an $\epsilon$-stationary point. Then we propose FedBiOAcc to accelerate FedBiO with the momentum-based variance-reduction technique under the stochastic scenario. We show that FedBiOAcc needs $O(\epsilon^{-1.5})$ steps and $O(\epsilon^{-1})$ communication steps, which matches the best-known rate for single-level stochastic federated algorithms. Finally, we validate our proposed algorithms via the important Fair Federated Learning task. More specifically, we define a bilevel-based group fair FL objective. Our algorithms show superior performance compared to other baselines in numerical experiments.", FedDA: Faster Framework of Local Adaptive Gradient Methods via Restarted Dual Averaging,https://openreview.net/forum?id=Z4QNXXyLhGN,https://openreview.net/pdf?id=Z4QNXXyLhGN,,"Federated learning (FL) is an emerging learning paradigm to tackle massively distributed data. In Federated Learning, a set of clients jointly perform a machine learning task under the coordination of a server. The FedAvg algorithm is one of the most widely used methods to solve Federated Learning problems. In FedAvg, the learning rate is a constant rather than changing adaptively. Adaptive gradient methods show superior performance over a constant learning rate schedule; however, there is still no general framework to incorporate adaptive gradient methods into the federated setting. In this paper, we propose \textbf{FedDA}, a novel framework for local adaptive gradient methods. The framework adopts a restarted dual averaging technique and is flexible with various gradient estimation methods and adaptive learning rate formulations. In particular, we analyze \textbf{FedDA-MVR}, an instantiation of our framework, and show that it achieves gradient complexity $\tilde{O}(\epsilon^{-1.5})$ and communication complexity $\tilde{O}(\epsilon^{-1})$ for finding an $\epsilon$-stationary point.
This matches the best-known rate for first-order FL algorithms, and \textbf{FedDA-MVR} is the first adaptive FL algorithm that achieves this rate. We also perform extensive numerical experiments to verify the efficacy of our method.", On Emergence of Activation Sparsity in Trained Transformers,https://openreview.net/forum?id=TJ2nxciYCk-,https://openreview.net/pdf?id=TJ2nxciYCk-,"Learned Transformers for NLP (e.g., T5) and Vision (e.g., ViT) tasks produce sparse representations in their MLP layers. The sparsity may be leveraged to improve robustness, calibration, and computational efficiency of Transformer models.","This paper reveals a curious observation that modern large-scale machine learning models with Transformer architectures have sparse activation maps. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by “sparse” we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to the MLP. Moreover, larger Transformers with more layers and wider MLP hidden dimensions are sparser as measured by the percentage of nonzero entries. Through extensive experiments we demonstrate that the emergence of sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks, on both training and evaluation data, for Transformers of various configurations, at layers of all depth levels. We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers. Moreover, we demonstrate, perhaps surprisingly, that enforcing an even sparser activation via Top-$k$ thresholding with a small value of $k$ brings a collection of desired but missing properties for Transformers, namely less sensitivity to noisy training data, more robustness to input corruptions, and better calibration for their prediction confidence.","Transformers, Sparse, Calibration, Robustness, Label Noise, Efficiency" Explainable Recommender with Geometric Information Bottleneck,https://openreview.net/forum?id=I_IJf5oDRo,https://openreview.net/pdf?id=I_IJf5oDRo,"To consider user-item interactions for an interpretable recommender system, we propose to incorporate the geometric regularisation derived from user-item interaction graphs to learn the latent factors of review text in a variational network.","Explainable recommender systems have attracted much interest in recent years as they can explain their recommendation decisions, enhancing user trust in the systems. Most explainable recommender systems rely on human-generated rationales or annotated aspect features from user reviews to train models for rationale generation or extraction. The rationales produced are often confined to a single review. To avoid the expensive human annotation process and to generate explanations beyond individual reviews, we propose an explainable recommender system trained on user reviews by developing a transferable Geometric Information Bottleneck (GIANT), which leverages the prior knowledge acquired through clustering on a user-item graph built on user-item rating interactions, since graph nodes in the same cluster tend to share common characteristics or preferences. We then feed user reviews and item reviews into a variational network to learn latent topic distributions, which are regularised by user/item distributions estimated from their distances to the various cluster centroids of the user-item graph.
By iteratively refining the instance-level review latent topics with GIANT, our method learns a robust latent space from the text for rating prediction and explanation generation. Experimental results on three e-commerce datasets show that our model significantly improves the interpretability of a variational recommender using a standard Gaussian prior, in terms of coherence, diversity and faithfulness, while achieving performance comparable to existing content-based recommender systems in terms of rating prediction accuracy. ","Interpretability, Recommender System, Information Extraction" Near-optimal Policy Identification in Active Reinforcement Learning,https://openreview.net/forum?id=3OR2tbtnYC-,https://openreview.net/pdf?id=3OR2tbtnYC-,We propose a novel kernelized LSVI algorithm for active reinforcement learning which provably identifies a near-optimal policy uniformly over the entire state space.,"Many real-world reinforcement learning tasks require control of complex dynamical systems that involve both costly data acquisition processes and large state spaces. In cases where the expensive transition dynamics can be readily evaluated at specified states (e.g., via a simulator), agents can operate in what is often referred to as planning with a \emph{generative model}. We propose the AE-LSVI algorithm for best policy identification, a novel variant of the kernelized least-squares value iteration (LSVI) algorithm that combines optimism with pessimism for active exploration (AE). AE-LSVI provably identifies a near-optimal policy \emph{uniformly} over an entire state space and achieves polynomial sample complexity guarantees that are independent of the number of states. When specialized to the recently introduced offline contextual Bayesian optimization setting, our algorithm achieves improved sample complexity bounds. Experimentally, we demonstrate that AE-LSVI outperforms other RL algorithms in a variety of environments when robustness to the initial state is required. ","reinforcement learning, contextual bayesian optimization, kernelized least-squares value iteration" FixEval: Execution-based Evaluation of Program Fixes for Competitive Programming Problems,https://openreview.net/forum?id=XEQ2pdweH9q,https://openreview.net/pdf?id=XEQ2pdweH9q,"We introduce FixEval, a novel dataset consisting of competitive programming submissions that incorporates additional program contexts, such as time and space requirements, to evaluate code generated by deep learning models for automatic bug fixing.","The increasing complexity of software has led to a drastic rise in the time and costs of identifying and fixing bugs. Various approaches have been explored in the literature to generate fixes for buggy code automatically. However, due to the large combinatorial space of possible fixes for a particular bug, few tools and datasets are available to evaluate model-generated fixes effectively. In this work, we introduce FixEval, a benchmark comprising buggy code submissions to competitive programming problems and their respective fixes. FixEval includes a rich test suite for assessing the correctness of model-generated program fixes, along with further information on time and memory constraints and verdict-based acceptance. We consider two Transformer language models pretrained on programming languages as our baselines and compare them using match-based and execution-based evaluation metrics. Our experiments show that match-based metrics do not accurately reflect the correctness of model-generated program fixes. In contrast, execution-based methods evaluate programs on the full set of test cases and scenarios designed for each problem.
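A minimal sketch of such an execution-based check (the test-case format, interpreter invocation, and timeout are illustrative assumptions, not FixEval's exact harness):

```python
import os
import subprocess
import tempfile

def passes_tests(source, test_cases, timeout=2.0):
    """Run a candidate Python fix on (stdin, expected stdout) pairs.
    A timeout counts as a failing verdict, mirroring competitive judging."""
    with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as f:
        f.write(source)
        path = f.name
    try:
        for stdin_text, expected in test_cases:
            try:
                result = subprocess.run(['python', path], input=stdin_text,
                                        capture_output=True, text=True,
                                        timeout=timeout)
            except subprocess.TimeoutExpired:
                return False
            if result.stdout.strip() != expected.strip():
                return False
        return True
    finally:
        os.unlink(path)

candidate = "a, b = map(int, input().split())\nprint(a + b)\n"
print(passes_tests(candidate, [("1 2", "3"), ("5 7", "12")]))  # True
```

Unlike exact-match or BLEU-style comparison against a reference fix, this verdict accepts any program that behaves correctly, however differently it is written.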
Therefore, we believe FixEval provides a step towards real-world automatic bug fixing and model-generated code evaluation. The dataset and models are open-sourced.\footnote{\url{https://github.com/FixEval/FixEval_official}}","automated program repair, language models, dataset, deep learning, bug fixing" Algorithmic Determination of the Combinatorial Structure of the Linear Regions of ReLU Neural Networks,https://openreview.net/forum?id=3IFO8Jii0vI,https://openreview.net/pdf?id=3IFO8Jii0vI,,"We algorithmically determine the regions and facets of all dimensions of the canonical polyhedral complex, the universal object into which a ReLU network decomposes its input space. We show that the locations of the vertices of the canonical polyhedral complex, along with their signs with respect to layer maps, determine the full facet structure across all dimensions. We present an algorithm which calculates this full combinatorial structure, making use of our theorems that the dual complex to the canonical polyhedral complex is cubical and that it possesses a multiplication compatible with its facet structure. The resulting algorithm is numerically stable, polynomial time in the number of intermediate neurons, and obtains accurate information across all dimensions. This permits us to obtain, for example, the true topology of the decision boundaries of networks with low-dimensional inputs. We run experiments on such networks at initialization, finding that width alone does not increase observed topology, but width in the presence of depth does. ","ReLU networks, algebraic topology, linear regions, computational geometry" Are Neurons Actually Collapsed? On the Fine-Grained Structure in Neural Representations,https://openreview.net/forum?id=TTSyyMBNUjd,https://openreview.net/pdf?id=TTSyyMBNUjd,"We provide compelling empirical evidence that there exist fine-grained structures in the last-layer representations of a well-trained neural network, as a complement to the existing Neural Collapse hypothesis.","Recent work has observed an intriguing ""Neural Collapse"" phenomenon in well-trained neural networks, where the last-layer representations of training samples with the same label collapse into each other. This suggests that the last-layer representations are completely determined by the labels, and do not depend on the intrinsic structure of the input distribution. We provide evidence that this is not a complete description, and that the apparent collapse hides important fine-grained structure in the representations. Specifically, even when representations apparently collapse, the small amount of remaining variation can still faithfully and accurately capture the intrinsic structure of the input distribution. As an example, if we train on CIFAR-10 using only 5 coarse-grained labels (by combining two classes into one super-class) until convergence, we can reconstruct the original 10-class labels from the learned representations via unsupervised clustering. The reconstructed labels achieve $93\%$ accuracy on the CIFAR-10 test set, nearly matching the normal CIFAR-10 accuracy for the same architecture. Our findings show concretely how the structure of input data can play a significant role in determining the fine-grained structure of neural representations, going beyond what Neural Collapse predicts.
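The coarse-to-fine reconstruction experiment just described can be sketched as follows, assuming scikit-learn and SciPy; the per-coarse-class clustering and the Hungarian matching are an illustrative setup rather than the authors' exact pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

def recover_fine_labels(features, coarse_labels, fine_per_coarse=2):
    """Cluster the (apparently collapsed) last-layer features within each
    coarse class to recover candidate fine-grained labels."""
    recovered = np.empty(len(features), dtype=int)
    for c in np.unique(coarse_labels):
        idx = np.where(coarse_labels == c)[0]
        km = KMeans(n_clusters=fine_per_coarse, n_init=10).fit(features[idx])
        recovered[idx] = c * fine_per_coarse + km.labels_
    return recovered

def cluster_accuracy(pred, true):
    """Accuracy under the best one-to-one matching of cluster ids to labels
    (Hungarian algorithm), since cluster ids are arbitrary."""
    k = int(max(pred.max(), true.max())) + 1
    cost = np.zeros((k, k))
    for p, t in zip(pred, true):
        cost[p, t] -= 1
    rows, cols = linear_sum_assignment(cost)
    return -cost[rows, cols].sum() / len(true)
```

If Neural Collapse were a complete description, the within-class variation would be pure noise and this clustering would perform at chance; the reported 93% accuracy is the evidence that it is not.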
","Neural Collapse, Representation Learning, Neural Networks" FoSR: First-order spectral rewiring for addressing oversquashing in GNNs,https://openreview.net/forum?id=3YjQfCLdrzz,https://openreview.net/pdf?id=3YjQfCLdrzz,We propose a graph rewiring algorithm that prevents oversquashing in GNNs via spectral expansion while retaining the original graph via a relational structure that prevents oversmoothing.,"Graph neural networks (GNNs) are able to leverage the structure of graph data by passing messages along the edges of the graph. While this allows GNNs to learn features depending on the graph structure, for certain graph topologies it leads to inefficient information propagation and a problem known as oversquashing. This has recently been linked with the curvature and spectral gap of the graph. On the other hand, adding edges to the message-passing graph can lead to increasingly similar node representations and a problem known as oversmoothing. We propose a computationally efficient algorithm that prevents oversquashing by systematically adding edges to the graph based on spectral expansion. We combine this with a relational architecture, which lets the GNN preserve the original graph structure and provably prevents oversmoothing. We find experimentally that our algorithm outperforms existing graph rewiring methods in several graph classification tasks.","oversquashing, oversmoothing, graph rewiring, graph neural networks, GNN, relational GNN, spectral expansion" Early Stopping for Deep Image Prior,https://openreview.net/forum?id=JIl_kij_aov,https://openreview.net/pdf?id=JIl_kij_aov,,"Deep image prior (DIP) and its variants have shown remarkable potential for solving inverse problems in computational imaging (CI), needing no separate training data. Practical DIP models are often substantially overparameterized. During the learning process, these models learn the desired visual content first and then pick up the potential modeling and observational noise, i.e., overfitting. Thus, the practicality of DIP hinges on early stopping (ES) that captures the transition period. In this regard, the majority of prior DIP works for CI tasks only demonstrate the potential of the models---reporting the peak performance against the groundtruth but providing no clue about how to operationally obtain near-peak performance without access to the groundtruth. In this paper, we set to break this practicality barrier of DIP, and propose an efficient ES strategy that consistently detects near-peak performance across several CI tasks and DIP variants. Simply based on the running variance of DIP intermediate reconstructions, our ES method not only outpaces the existing ones---which only work in very narrow regimes, but also remains effective when combined with methods that try to mitigate overfitting.","early stopping, deep image prior, deep generative models, overparametrization, overfitting" ON COMPLEX-DOMAIN CNN REPRESENTATIONS FOR CLASSIFYING REAL/COMPLEX-VALUED DATA,https://openreview.net/forum?id=SmjW4kKLjuU,https://openreview.net/pdf?id=SmjW4kKLjuU,We address the contradictory answers present in the literature for the following question: CV-CNN performs better or worse than RV-CNN for classification task?,"This paper is about complex-valued CNNs (CV-CNNs) for computer vision that use representations that are complex-valued instead of real-valued. We divide input data into three categories: inherently real-valued, inherently complex-valued, and complex-valued obtained by transforming real-valued. 
We study whether the complex-valued representations of CV-CNNs offer any advantages over the commonly used real-valued CNNs (RV-CNNs). For concreteness, we focus on the classification task. The existing literature offers contradictory answers to our question. We find that this is mainly because (a) prior studies seldom employ a common performance measure (e.g., a CV-CNN compared against an RV-CNN with a similar network structure vs. a similar number of parameters); (b) the diversity of evaluation datasets used is limited (e.g., datasets in which magnitude information is more, less or as important as phase information); and (c) little effort has been devoted to reducing the randomness in training between CV-CNNs and RV-CNNs. Towards this, we propose performance measures based on similar network structure, number of parameters and number of MAC operations. Also, we consider diverse datasets with varying magnitude/phase information, and deal with the randomness in training. As a result, we expect that any observed performance differences will be independent of the above disparities, and arise from the use of real vs complex representations. Theoretically, we show that, unlike RV-CNNs, CV-CNNs can preserve magnitude and phase through intermediate stages of processing. Our main experimental findings are the following. (1) As network depth decreases, the performance of CV-CNNs improves with respect to similar network structure; the performances of CV-CNN and RV-CNN having a similar number of parameters become more comparable; and the performance of RV-CNNs improves with respect to a similar number of MAC operations; (2) The above performance differences diminish as the network depth increases. (3) With respect to data diversity, performance depends on whether the dataset has dominant magnitude or phase, i.e., whether reconstruction error is lower using only magnitude or only phase. If complex-valued data has dominant magnitude, instead of providing real and imaginary parts as input, providing the magnitude part produces a significant performance gain, whereas if the data has dominant phase, providing both real and imaginary parts is important. This holds true for different network depths.","Complex-valued neural networks, Complex-valued representations, Complex-value CNN, Classification, Complex numbers" FAME: Fast Adaptive Moment Estimation based on Triple Exponential Moving Average,https://openreview.net/forum?id=WUs4tNgcBYe,https://openreview.net/pdf?id=WUs4tNgcBYe,We propose the first optimizer that is based on the Triple Exponential Moving Average for deep learning,"Network optimization is a key step in deep learning, which broadly impacts different domains (e.g. natural language, computer vision). Over the years, several optimizers have been developed - some are adaptive and converge quickly, while others are not adaptive but may be more accurate. However, because most current optimizers use a simple Exponential Moving Average, gradient trends and their rapid changes may not be accurately identified, resulting in sub-optimal network performance. In this paper, we propose the first deep optimizer based on the Triple Exponential Moving Average (TEMA), a technical indicator originally developed to predict stock market trends. TEMA adds richer multi-level information about data changes and trends compared to the simple Exponential Moving Average. As a result, the gradient moments are better estimated.
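The TEMA estimate itself is standard: with e1 an EMA of the signal, e2 an EMA of e1, and e3 an EMA of e2, it is 3*e1 - 3*e2 + e3. A minimal sketch of applying it to a gradient stream (bias correction and the surrounding optimizer update are omitted; this is not the full FAME rule):

```python
import numpy as np

class TEMA:
    """Triple Exponential Moving Average of a gradient stream.
    Compared to a single EMA, TEMA reduces lag when the trend changes quickly."""
    def __init__(self, beta=0.9):
        self.beta = beta
        self.e1 = self.e2 = self.e3 = None

    def update(self, grad):
        g = np.asarray(grad, dtype=float)
        if self.e1 is None:
            self.e1 = self.e2 = self.e3 = np.zeros_like(g)
        b = self.beta
        self.e1 = b * self.e1 + (1 - b) * g        # EMA of the gradients
        self.e2 = b * self.e2 + (1 - b) * self.e1  # EMA of e1
        self.e3 = b * self.e3 + (1 - b) * self.e2  # EMA of e2
        return 3 * self.e1 - 3 * self.e2 + self.e3

tema = TEMA(beta=0.9)
for step in range(5):
    print(tema.update([0.1 * step]))  # tracks an accelerating trend with little lag
```

Plugging such an estimate in place of the first moment of an Adam-style update is one plausible way to use it, though the exact placement in FAME is the paper's contribution.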
Furthermore, instead of using TEMA in the same way as in the stock domain, we use it as part of a continuously updated average during the optimization procedure. We extensively validated our method on five benchmarks (CIFAR-10, CIFAR-100, PASCAL-VOC, MS-COCO, and Cityscapes), 14 different learning architectures, five different optimizers, and various vision tasks (detection, segmentation, and classification). The results clearly indicate that the robustness and accuracy of our FAME optimizer are superior to those of others.","Optimization, Computer Vision, Deep learning, Triple Exponential Moving Average" Progressive Transformation Learning For Leveraging Virtual Images in Training,https://openreview.net/forum?id=IUXAo-N9AGh,https://openreview.net/pdf?id=IUXAo-N9AGh,We introduce progressive transformation learning (PTL) that progressively expands the training set with realistically transformed virtual images while addressing the large domain gap in training a transformation generator.,"To effectively interrogate UAV-based images for detecting objects of interest, such as humans, it is essential to acquire large-scale UAV-based datasets that include human instances with various poses captured from widely varying viewing angles. As a viable alternative to laborious and costly data curation, we introduce Progressive Transformation Learning (PTL), which gradually augments a training dataset by adding transformed virtual images with enhanced realism. Generally, a virtual2real transformation generator in the conditional GAN framework suffers from quality degradation when a large domain gap exists between real and virtual images. To deal with the domain gap, PTL takes a novel approach that progressively iterates the following three steps: 1) select a subset from a pool of virtual images according to the domain gap, 2) transform the selected virtual images to enhance realism, and 3) add the transformed virtual images to the training set while removing them from the pool. In PTL, accurately quantifying the domain gap is critical. To do that, we theoretically demonstrate that the feature representation space of a given object detector can be modeled as a multivariate Gaussian distribution, from which the Mahalanobis distance between a virtual object and the Gaussian distribution of each object category in the representation space can be readily computed. Experiments show that PTL results in a substantial performance increase over the baseline, especially in the small-data and cross-domain regimes.","progressive learning, virtual image, synthetic image, low-shot learning, cross-domain detection, UAV-based human detection" In-Context Policy Iteration,https://openreview.net/forum?id=9jXqR128vKs,https://openreview.net/pdf?id=9jXqR128vKs,We present a novel algorithm for performing policy iteration through in-context adaptation,"This work presents In-Context Policy Iteration, an algorithm for performing Reinforcement Learning (RL), in-context, using foundation models. While the application of foundation models to RL has received considerable attention, most approaches rely on either (1) the curation of expert demonstrations (either through manual design or task-specific pretraining) or (2) adaptation to the task of interest using gradient methods (either fine-tuning or training of adapter layers). Both of these techniques have drawbacks.
Collecting demonstrations is labor-intensive, and algorithms that rely on them do not outperform the experts from which the demonstrations were derived. Gradient techniques, meanwhile, are inherently slow, sacrificing the “few-shot” quality that made in-context learning attractive to begin with. In this work, we present an algorithm, ICPI, that learns to perform RL tasks without expert demonstrations or gradients. Instead, we present a policy-iteration method in which the prompt content is the entire locus of learning. ICPI iteratively updates the contents of the prompt from which it derives its policy through trial-and-error interaction with an RL environment. In order to eliminate the role of in-weights learning (on which approaches like Decision Transformer rely heavily), we demonstrate our algorithm using Codex (Chen et al., 2021b), a language model with no prior knowledge of the domains on which we evaluate it.","Reinforcement Learning, In-Context Learning, Foundation Models" Learning to Grow Pretrained Models for Efficient Transformer Training,https://openreview.net/forum?id=cDYRS5iZ16f,https://openreview.net/pdf?id=cDYRS5iZ16f,"Learning to grow smaller, extant models to enable faster training of newer, larger transformers.","Scaling transformers has led to significant breakthroughs in many domains, leading to a paradigm in which larger versions of existing models are trained and released on a periodic basis. Curiously, new instances of such models are typically trained completely from scratch, despite the fact that they are often just scaled-up versions of their smaller counterparts. How can we use the implicit knowledge in the parameters of smaller, extant models to enable faster training of newer, larger models? This paper describes an approach for accelerating transformer training by learning to grow pretrained transformers, where we learn to linearly map the parameters of the smaller model to initialize the larger model. For tractable learning, we factorize the linear transformation as a composition of (linear) width- and depth-growth operators, and further employ a Kronecker factorization of these growth operators to encode architectural knowledge. Experiments across both language and vision transformers demonstrate that our LEarning to GrOw (LEGO) approach can save around $50\%$ of the computational cost of training from scratch, while also consistently outperforming strong baselines that also reuse smaller pretrained models to initialize larger models. Our code will be made publicly available.","Transformer, Efficient Training, Model Reuse" Generative Modeling Helps Weak Supervision (and Vice Versa),https://openreview.net/forum?id=3OaBBATwsvP,https://openreview.net/pdf?id=3OaBBATwsvP,,"Many promising applications of supervised machine learning face hurdles in the acquisition of labeled data in sufficient quantity and quality, creating an expensive bottleneck. To overcome such limitations, techniques that do not depend on ground truth labels have been studied, including weak supervision and generative modeling. While these techniques would seem to be usable in concert, improving one another, how to build an interface between them is not well understood. In this work, we propose a model fusing programmatic weak supervision and generative adversarial networks and provide theoretical justification motivating this fusion. The proposed approach captures discrete latent variables in the data alongside the weak-supervision-derived label estimate.
Alignment of the two allows for better modeling of sample-dependent accuracies of the weak supervision sources, improving the estimate of unobserved labels. It is the first approach to enable data augmentation through weakly supervised synthetic images and pseudolabels. Additionally, its learned latent variables can be inspected qualitatively. The model outperforms baseline weak supervision label models on a number of multiclass image classification datasets, improves the quality of generated images, and further improves end-model performance through data augmentation with synthetic samples.","generative model, weak supervision" What does a platypus look like? Generating customized prompts for zero-shot image classification,https://openreview.net/forum?id=3ly9cG9Ql9h,https://openreview.net/pdf?id=3ly9cG9Ql9h,Using GPT-3 to generate better CLIP prompts,"Open vocabulary models are a promising new paradigm for image classification. Unlike traditional classification models, open vocabulary models classify among any arbitrary set of categories specified with natural language during inference. This natural language, called ""prompts"", typically consists of a set of hand-written templates (e.g., ""a photo of a {}"") which are completed with each of the category names. This work introduces a simple method to generate higher-accuracy prompts, without relying on any explicit knowledge of the task domain and with far fewer hand-constructed sentences. To achieve this, we combine open vocabulary models with large language models (LLMs) to create Customized Prompts via Language models (CuPL, pronounced ""couple""). In particular, we leverage the knowledge contained in LLMs in order to generate many descriptive sentences that are customized for each object category. We find that this straightforward and general approach improves accuracy on a range of zero-shot image classification benchmarks, including over one percentage point gain on ImageNet. Finally, this simple baseline requires no additional training and remains completely zero-shot.","zero-shot, image classification, prompts, open vocabulary models" Hyperbolic Contrastive Learning for Visual Representations beyond Objects,https://openreview.net/forum?id=7kpmIkHVpHu,https://openreview.net/pdf?id=7kpmIkHVpHu,"We use a hyperbolic objective to learn scene-object hypernymy, and show significant improvements for multiple datasets across multiple SSL tasks.","Although self-/un-supervised methods have led to rapid progress in visual representation learning, these methods generally treat objects and scenes through the same lens. In this paper, we focus on learning representations for objects and scenes that preserve the structure among them. Motivated by the observation that visually similar objects are close in the representation space, we argue that the scenes and objects should instead follow a hierarchical structure based on their compositionality. To exploit such a structure, we propose a contrastive learning framework where a Euclidean loss is used to learn object representations and a hyperbolic loss is used to encourage representations of scenes to lie close to representations of their constituent objects in a hyperbolic space. This novel hyperbolic objective encourages the scene-object hypernymy among the representations by optimizing the magnitude of their norms.
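A sketch of such a scene-object objective using the Poincaré-ball distance; the contrastive form, temperature, and projection onto the ball are illustrative assumptions rather than the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def poincare_distance(u, v, eps=1e-5):
    """Geodesic distance in the Poincare ball (all norms must stay < 1)."""
    sq = lambda x: x.pow(2).sum(-1)
    num = 2 * sq(u - v)
    den = (1 - sq(u)).clamp_min(eps) * (1 - sq(v)).clamp_min(eps)
    return torch.acosh(1 + num / den + eps)

def scene_object_loss(scenes, objects, tau=0.1):
    """Contrastive loss: scenes[i] should be hyperbolically close to its
    constituent objects[i] and far from objects[j != i]."""
    d = poincare_distance(scenes[:, None, :], objects[None, :, :])  # (B, B) pairwise
    logits = -d / tau
    target = torch.arange(len(scenes))
    return F.cross_entropy(logits, target)

# project raw embeddings inside the unit ball before computing the loss
scenes = 0.3 * F.normalize(torch.randn(8, 16), dim=-1)
objects = 0.6 * F.normalize(torch.randn(8, 16), dim=-1)
loss = scene_object_loss(scenes, objects)
```

Because hyperbolic distance grows rapidly near the ball's boundary, minimizing it implicitly arranges the norms of scene and object embeddings, which is how the hypernymy structure gets encoded.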
We show that when pretraining on the COCO and OpenImages datasets, the hyperbolic loss improves downstream performance of several baselines across multiple datasets and tasks, including image classification, object detection, and semantic segmentation. We also show that the properties of the learned representations allow us to solve various vision tasks that involve the interaction between scenes and objects in a zero-shot way.","Self Supervised learning for Scene images, Hyperbolic objective, Hierarchical scene-object structure." Provable Memorization Capacity of Transformers,https://openreview.net/forum?id=8JCg5xJCTPR,https://openreview.net/pdf?id=8JCg5xJCTPR,We characterize the memorization capacity of the Transformer architecture for sequence inputs.,"Quantifying memorization capacity is essential for understanding the expressiveness and generalizability of deep learning model architectures. However, the memorization capacity of the Transformer architecture has yet to be explored. In this work, we present the first study of the memorization capacity of the Transformer architecture. We prove that Transformers are capable of memorizing $N$ sequence-to-sequence mappings of length $n$ with $d$-dimensional input tokens using $\tilde{O}(d + n + \sqrt{nN})$ parameters. Our theory supports memorization both with and without permutation equivariance, utilizing positional encodings in the latter case. Building on our theory, we also analyze the memorization capacity of Transformers in the sequence classification task. To verify these theoretical findings, we conduct experiments analyzing the memorization capacity of Transformers in the natural language domain.","Transformer, Expressiveness, Memorization, Deep learning theory, contextual mapping, permutation equivariance" Knowledge-Driven New Drug Recommendation,https://openreview.net/forum?id=83xscrmnw6Q,https://openreview.net/pdf?id=83xscrmnw6Q,recommendation for new drugs with limited historical prescription data,"Drug recommendation assists doctors in prescribing personalized medications to patients based on their health conditions. Existing drug recommendation solutions adopt the supervised multi-label classification setup and only work with existing drugs with sufficient prescription data from many patients. However, newly approved drugs do not have much historical prescription data and cannot leverage existing drug recommendation methods. To address this, we formulate new drug recommendation as a few-shot learning problem. Yet, directly applying existing few-shot learning algorithms faces two challenges: (1) complex relations among diseases and drugs and (2) numerous false-negative patients who were eligible but did not yet use the new drugs. To tackle these challenges, we propose EDGE, which can quickly adapt to the recommendation for a new drug with limited prescription data from a few support patients. EDGE maintains a drug-dependent multi-phenotype few-shot learner to bridge the gap between existing and new drugs. Specifically, EDGE leverages the drug ontology to link new drugs to existing drugs with similar treatment effects and learns ontology-based drug representations. Such drug representations are used to customize the metric space of the phenotype-driven patient representations, which are composed of a set of phenotypes capturing complex patient health status. Lastly, EDGE eliminates the false-negative supervision signal using an external drug-disease knowledge base.
We evaluate EDGE on two real-world datasets: the public EHR data (MIMIC-IV) and private industrial claims data. Results show that EDGE achieves a 7.3% improvement in ROC-AUC score over the best baseline.","drug recommendation, medication recommendation, healthcare, electronic health record, few-shot learning" Beyond Traditional Transfer Learning: Co-finetuning for Action Localisation,https://openreview.net/forum?id=E4-uRvmKkeB,https://openreview.net/pdf?id=E4-uRvmKkeB,"Instead of pretraining on ""upstream"" datasets and then finetuning on ""downstream"" tasks, we simultaneously train on all datasets, achieving significant performance improvements across all tasks, and particularly on rare classes.","Transfer learning is the predominant paradigm for training deep networks on small target datasets. Models are typically pretrained on large “upstream” datasets for classification, as such labels are easy to collect, and then finetuned on “downstream” tasks such as action localisation, which are smaller due to their finer-grained annotations. In this paper, we question this approach, and propose co-finetuning -- simultaneously training a single model on multiple “upstream” and “downstream” tasks. We demonstrate that co-finetuning outperforms traditional transfer learning when using the same total amount of data, and also show how we can easily extend our approach to multiple “upstream” datasets to further improve performance. In particular, co-finetuning significantly improves the performance on rare classes in our downstream task, as it has a regularising effect, and enables the network to learn feature representations that transfer between different datasets. Finally, we show that by co-finetuning with public video classification datasets, we are able to achieve state-of-the-art results for spatio-temporal action localisation on the challenging AVA and AVA-Kinetics datasets, outperforming recent works which develop intricate models.","transformer, video, action recognition, action detection, multi-task learning, co-training, transfer learning" Output Distribution over the Entire Input Space: A Novel Perspective to Understand Neural Networks,https://openreview.net/forum?id=TntbHxxGd6j,https://openreview.net/pdf?id=TntbHxxGd6j,We draw a connection between energy in physics and the output of neural networks and propose an efficient sampler for better understanding the input-output mapping relationship of (binary) neural classifiers.,"Understanding the input-output mapping relationship in the \emph{entire input space} contributes a novel perspective to a comprehensive understanding of deep neural networks. In this paper, we focus on binary neural classifiers and propose to first uncover the histogram of the number of inputs that are mapped to certain output values, and then scrutinize representative inputs from a certain output range of interest, such as the positive-logit region that corresponds to one of the classes. A straightforward solution is uniform sampling (or exhaustive enumeration) in the entire input space, but when the inputs are high-dimensional, it can take prohibitively long to converge. We connect the output histogram to the \emph{density of states} in physics by making an analogy between the energy of a system and the neural network output. Inspired by the Wang-Landau algorithm designed for sampling the density of states, we propose an efficient sampler that is driven to explore the under-explored output values through a gradient-based proposal.
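A rough sketch of a Wang-Landau-style flat-histogram sampler over a binary classifier's logits with a gradient-informed proposal; the bin range, step size, and fixed modification factor are illustrative simplifications, not the authors' exact sampler:

```python
import torch

def wang_landau_logits(model, x0, n_bins=20, lo=-10.0, hi=10.0, steps=2000, eta=0.05):
    """Estimate a log density-of-states over logit bins. Bins with a large
    estimate are penalized at acceptance time, pushing the chain toward
    under-explored output values; the proposal follows the logit gradient."""
    log_g = torch.zeros(n_bins)
    x = x0.clone()

    def bin_of(v):
        i = int((float(v) - lo) / (hi - lo) * n_bins)
        return min(max(i, 0), n_bins - 1)

    for _ in range(steps):
        xg = x.detach().requires_grad_(True)
        logit = model(xg[None]).squeeze()
        grad, = torch.autograd.grad(logit, xg)
        sign = 1.0 if torch.rand(()) < 0.5 else -1.0       # move toward higher or lower logits
        x_new = x + sign * eta * grad.sign() + 0.01 * torch.randn_like(x)
        with torch.no_grad():
            logit_new = model(x_new[None]).squeeze()
        # flat-histogram acceptance: favour bins with small estimated density
        if torch.rand(()).log() < log_g[bin_of(logit)] - log_g[bin_of(logit_new)]:
            x, logit = x_new, logit_new
        log_g[bin_of(logit)] += 1.0                         # fixed modification factor
    return log_g
```

After enough steps, exp(log_g) approximates how many inputs map into each logit bin, which is the histogram the abstract describes.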
Compared with the random proposal in the Wang-Landau algorithm, our gradient-based proposal converges faster, as it can propose inputs corresponding to the under-explored output values. Extensive experiments have verified the accuracy of the histogram generated by our sampler and also demonstrated interesting findings. For example, the models map many human-unrecognizable images to very negative logit values. These properties of a neural model are revealed for the first time through our sampled statistics. We believe that our approach opens a new avenue for neural model evaluation and should be further explored in future work. ","model evaluation, comprehensive input-output mapping relation" Learning Control Policies for Region Stabilization in Stochastic Systems,https://openreview.net/forum?id=vm3jAx_pLCV,https://openreview.net/pdf?id=vm3jAx_pLCV,We learn policies and certificates for proving region stabilization in control systems,"We consider the problem of learning control policies in stochastic systems which guarantee that the system stabilizes within some specified stabilization region with probability 1. Our approach is based on the novel notion of stabilizing ranking supermartingales (sRSMs) that we introduce in this work. Our sRSMs overcome the limitation of methods proposed in previous works, whose applicability is restricted to systems in which the stabilizing region cannot be left once entered under any control policy. We present a learning procedure that learns a control policy together with an sRSM that formally certifies probability 1 stability, both learned as neural networks. Our experimental evaluation shows that our learning procedure can successfully learn provably stabilizing policies in practice.","Stability, learning for control, martingale, verification" InCoder: A Generative Model for Code Infilling and Synthesis,https://openreview.net/forum?id=hQwb-lbM6EL,https://openreview.net/pdf?id=hQwb-lbM6EL,"An infilling-capable code completion model, evaluated on tasks including language-to-code, type inference, and comment generation.","Code is seldom written in a single left-to-right pass and is instead repeatedly edited and refined. We introduce InCoder, a unified generative model that can perform program synthesis (via left-to-right generation) as well as editing (via masking and infilling). InCoder is trained to generate code files from a large corpus of permissively licensed code, where regions of code have been randomly masked and moved to the end of each file, allowing code infilling with bidirectional context. Our model is the first large generative code model that is able to infill arbitrary regions of code, which we evaluate in a zero-shot setting on challenging tasks such as type inference, comment generation, and variable re-naming. We find that the ability to condition on bidirectional context substantially improves performance on these tasks, while still performing comparably on standard program synthesis benchmarks in comparison to left-to-right-only models pretrained at similar scale. Our models and code will be publicly released.","code generation, program synthesis, language to code" Bridge the Inference Gaps of Neural Processes via Expectation Maximization,https://openreview.net/forum?id=A7v2DqLjZdq,https://openreview.net/pdf?id=A7v2DqLjZdq,,"The neural process (NP) is a family of computationally efficient models for learning distributions over functions. However, NPs suffer from under-fitting and show suboptimal performance in practice.
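The mask-and-move-to-end training transform that the InCoder entry above describes can be sketched for a single span as follows; the sentinel token names are illustrative, not the model's actual vocabulary:

```python
import random

def causal_mask(code, span_frac=0.2, seed=None):
    """Single-span version of InCoder-style causal masking: cut a random span,
    replace it with a sentinel, and append the span after the sentinel at the
    end, so infilling becomes ordinary left-to-right generation."""
    rng = random.Random(seed)
    n = len(code)
    span_len = max(1, int(n * span_frac))
    start = rng.randrange(0, n - span_len + 1)
    span = code[start:start + span_len]
    masked = code[:start] + "<MASK:0>" + code[start + span_len:]
    return masked + "<MASK:0>" + span + "<EOM>"

src = "def add(a, b):\n    return a + b\n"
print(causal_mask(src, seed=0))
```

At inference time, the model sees the masked file followed by the sentinel and generates the missing span with both left and right context available, which is the bidirectional conditioning the abstract credits for the zero-shot gains.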
Researchers have primarily focused on incorporating diverse structural inductive biases, e.g. attention or convolution, in modeling. The topics of inference suboptimality and an analysis of the NP from the optimization objective perspective have hardly been studied in earlier work. To fix this issue, we propose a surrogate objective for the target log-likelihood of the meta-dataset within the expectation-maximization framework. The resulting model, referred to as the Self-normalized Importance weighted Neural Process (SI-NP), can learn a more accurate functional prior and has an improvement guarantee concerning the target log-likelihood. Experimental results show the competitive performance of SI-NP over other NP objectives and illustrate that structural inductive biases, such as attention modules, can also augment our method to achieve SOTA performance.", Contrastive Prompt Tuning Improves Generalization in Vision-Language Models,https://openreview.net/forum?id=g4JB0ksCrKe,https://openreview.net/pdf?id=g4JB0ksCrKe,We introduce contrastive prompt tuning for improved generalization in vision-language models by optimizing for the learned prompts to be consistent with the image space.,"Prompt tuning, which focuses on learning continuous text prompts for adapting large vision-language models, has attracted much attention in recent years. While prior works show promising performance over hand-crafted prompts, they typically use cross-entropy loss for learning prompts, which limits their generalization capability in many real-world scenarios. Motivated by the effectiveness of contrastive learning for improved generalization, we introduce Contrastive Prompt Tuning (CPT), an incredibly simple yet highly efficient framework that explicitly optimizes for the learned prompts to be consistent with the image space. In particular, combined with cross-entropy loss, our contrastive losses help learn prompts so that the model makes consistent predictions across different views of an image, while also maintaining the consistency of pairwise similarities among different images. Extensive experiments on a battery of datasets demonstrate that our proposed method significantly outperforms existing methods in improving the model's generalization, while also achieving consistent improvements in few-shot in-domain performance for a wide variety of vision-language models.","Prompt tuning, Vision-language models, Contrastive learning" Decentralized Robust V-learning for Solving Markov Games with Model Uncertainty,https://openreview.net/forum?id=QTbAoQ5yMCg,https://openreview.net/pdf?id=QTbAoQ5yMCg,Robust reinforcement learning algorithm for Markov games,"The Markov game is a popular reinforcement learning framework for modeling competitive players in a dynamic environment. However, most existing works on Markov games focus on computing a certain equilibrium following uncertain interactions among the players, but ignore the uncertainty of the environment model, which is ubiquitous in practical scenarios. In this work, we develop a tractable solution to Markov games with model uncertainty. Specifically, we propose a new and tractable notion of robust correlated equilibrium for Markov games with environment model uncertainty. In particular, we prove that the robust correlated equilibrium has a simple modification structure, and that its characterization of equilibrium critically depends on the environment model uncertainty.
Moreover, we propose the first fully-decentralized sample-based algorithm for computing such a robust correlated equilibrium. Our analysis proves that the algorithm achieves the polynomial sample complexity $\widetilde{\mathcal{O}}( SA^2 H^5 p_{\min}^{-2}\epsilon^{-2})$ for computing an approximate robust correlated equilibrium with $\epsilon$ accuracy. ","Machine Learning, Reinforcement Learning, Markov Games" Generated Graph Detection,https://openreview.net/forum?id=8gU_8IdHN9g,https://openreview.net/pdf?id=8gU_8IdHN9g,We propose a general framework to detect generated graphs using GNN-based methods.,"Graph generative models are becoming increasingly effective for data distribution approximation and data augmentation. Although still in sandboxes, they have aroused public concerns about their malicious misuse or misinformation broadcasts, just as Deepfake visual and auditory media have been delivering to society. It is never too early to regulate the prevalence of generated graphs. As a preventive response, we pioneer the formulation of the generated graph detection problem, distinguishing generated graphs from real ones. We propose the first framework to systematically investigate a set of sophisticated models and their performance in four classification scenarios. Each scenario switches between seen and unseen datasets/generators during testing to get closer to real-world settings and progressively challenge the classifiers. Extensive experiments show that all the models are qualified for generated graph detection, with specific models having advantages in specific scenarios. Given the validated generality of the classifiers and their robustness to unseen datasets/generators, we draw a safe conclusion that our solution can hold up for a decent while to curb generated graph misuse.","Generated Graph, Graph Neural Network, Contrastive Learning, Metric Learning" Neural Embeddings for Text,https://openreview.net/forum?id=4-aEhZnvNnk,https://openreview.net/pdf?id=4-aEhZnvNnk,We propose a new kind of embedding for natural language text that deeply represents semantic meaning.,"We propose a new kind of embedding for natural language text that deeply represents semantic meaning. Standard text embeddings use the vector output of a pretrained language model. In our method, we let a language model learn from the text and then literally pick its brain, taking the actual weights of the model's neurons to generate a vector. We call this representation of the text a neural embedding. With analysis of its behavior on several datasets, we confirm the ability of this representation to reflect the semantics of the text. We also compare neural embeddings with GPT sentence (SGPT) embeddings. We observe that neural embeddings achieve comparable performance with a far smaller model, and that the embeddings respond to semantics differently.","text embedding, semantic embedding, neural embedding, neural text representation" Find Your Friends: Personalized Federated Learning with the Right Collaborators,https://openreview.net/forum?id=5g4FC-SHkaV,https://openreview.net/pdf?id=5g4FC-SHkaV,We propose a novel personalized decentralized federated learning framework for heterogeneous client data by collaborating with the right clients.,"In the traditional federated learning setting, a central server coordinates a network of clients to train one global model. However, the global model may serve many clients poorly due to data heterogeneity.
Moreover, there may not exist a trusted central party that can coordinate the clients to ensure that each of them can benefit from the others. To address these concerns, we present a novel decentralized framework, FedeRiCo, where each client can learn as much or as little from other clients as is optimal for its local data distribution. Based on expectation-maximization, FedeRiCo estimates the utilities of other participants’ models on each client’s data so that everyone can select the right collaborators for learning. As a result, our algorithm outperforms other federated, personalized, and/or decentralized approaches on several benchmark datasets, being the only approach that consistently performs better than training with local data only.","Federated learning, Personalized federated learning, Decentralized federated learning" Masked Vision and Language Modeling for Multi-modal Representation Learning,https://openreview.net/forum?id=ZhuXksSJYWn,https://openreview.net/pdf?id=ZhuXksSJYWn,,"In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose to build joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help of the other modality. This is motivated by the nature of image-text paired data: both the image and the text convey almost the same information, but in different formats. The masked signal reconstruction of one modality conditioned on another modality can also implicitly learn cross-modal alignment between language tokens and image patches. Our experiments on various V+L tasks show that the proposed method, along with common V+L alignment losses, not only achieves state-of-the-art performance by using a large amount of data but also outperforms the other competitors by a significant margin in the regimes of limited training data. ","Vision and language, multi-modal learning" Quantum Fourier Networks for solving Parametric PDEs,https://openreview.net/forum?id=ySQeVdXOcx0,https://openreview.net/pdf?id=ySQeVdXOcx0,"We provide three new quantum circuits that reproduce the Fourier Neural Operator, in order to learn PDE solutions, and test them on practical use cases.","Many real-world problems like modelling environment dynamics, physical processes, time series etc., involve solving Partial Differential Equations (PDEs) parameterized by problem-specific conditions. Recently, a deep learning architecture called Fourier Neural Operator (FNO) proved to be capable of learning solutions of given PDE families, for any initial conditions as input. Given the advancements in quantum hardware and the recent results in quantum machine learning methods, we propose three quantum circuits, inspired by the FNO, to learn this functional mapping for PDEs. The proposed algorithms are distinguished based on the trade-off between depth and their similarity to the classical FNO. At their core, we make use of the unary encoding paradigm and orthogonal quantum layers, and introduce a new quantum Fourier transform in the unary basis. With respect to the number of samples, our quantum algorithm is proven to be substantially faster than the classical counterpart.
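Returning to FedeRiCo's expectation-maximization step above, a minimal sketch of turning cross-client model evaluations into collaborator weights (a softmax over negative losses; the paper's exact E-step may differ):

```python
import numpy as np

def collaborator_weights(local_losses):
    """E-step sketch: local_losses[i, j] is the loss of client j's model
    evaluated on client i's data. Lower loss -> higher mixture weight."""
    logits = -np.asarray(local_losses, dtype=float)
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)           # rows sum to 1

losses = np.array([[0.3, 1.2, 2.0],
                   [1.1, 0.4, 1.9],
                   [2.2, 2.0, 0.5]])
print(collaborator_weights(losses).round(2))  # each client upweights compatible peers
```

In a decentralized round, each client would then aggregate peer model updates with its own weight row, learning "as much or as little" from each collaborator as its local data supports.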
We benchmark our proposed algorithms on three PDE families, namely Burgers’ equation, Darcy’s flow equation, and the Navier-Stokes equations, and the results show that our quantum methods are comparable in performance to the classical FNO. We also present an analysis of image classification tasks where our proposed algorithms are able to match the accuracy of the CNNs, thereby showing their applicability to other domains.","quantum computing, quantum machine learning, quantum deep learning, fourier transform, fourier neural operator, PDE, partial differential equation" Agent-based Graph Neural Networks,https://openreview.net/forum?id=8WTAh0tj2jC,https://openreview.net/pdf?id=8WTAh0tj2jC,We present a new agent-based sublinear and expressive GNN architecture for graph-level tasks.,"We present a novel graph neural network we call AgentNet, which is designed specifically for graph-level tasks. AgentNet is inspired by sublinear algorithms, featuring a computational complexity that is independent of the graph size. The architecture of AgentNet differs fundamentally from the architectures of traditional graph neural networks. In AgentNet, some trained \textit{neural agents} intelligently walk the graph, and then collectively decide on the output. We provide an extensive theoretical analysis of AgentNet: We show that the agents can learn to systematically explore their neighborhood, and that AgentNet can distinguish some structures that are even indistinguishable by 3-WL. Moreover, AgentNet is able to separate any two graphs which are sufficiently different in terms of subgraphs. We confirm these theoretical results with synthetic experiments on hard-to-distinguish graphs and real-world graph classification tasks. In both cases, we compare favorably not only to standard GNNs but also to computationally more expensive GNN extensions.","Graph Neural Networks, GNN, Graph Classification, Expressive Graph Neural Networks, Sublinear algorithms" Generating Adversarial Examples with Task Oriented Multi-Objective Optimization,https://openreview.net/forum?id=UFKW7EVrJAm,https://openreview.net/pdf?id=UFKW7EVrJAm,,"Deep learning models, even the state-of-the-art ones, are highly vulnerable to adversarial examples. Adversarial training is one of the most efficient methods to improve the model's robustness. The key factor for the success of adversarial training is the capability to generate qualified and diverse adversarial examples which satisfy some objectives/goals (e.g., finding adversarial examples that maximize the model losses for simultaneously attacking multiple models). Therefore, multi-objective optimization (MOO) is a natural tool for adversarial example generation, where we search for adversarial examples that simultaneously maximize several objectives/goals. However, we observe that a naive application of MOO tends to maximize all objectives/goals equally, without caring if an objective/goal has been achieved yet. This leads to wasted effort on further improving the goal-achieved tasks, while putting less focus on the goal-unachieved tasks. In this paper, we propose \emph{Task Oriented MOO} to address this issue, in the context where we can explicitly define the goal achievement for a task. Our principle is to only maintain the goal-achieved tasks, while letting the optimizer spend more effort on improving the goal-unachieved tasks. We conduct comprehensive experiments for our Task Oriented MOO on various adversarial example generation schemes.
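A minimal sketch of the task-oriented principle just described: per-task goals gate which objectives contribute to the next ascent step. The FGSM-style update and the equal weighting over active tasks are illustrative assumptions, not the paper's exact scheme:

```python
import torch

def task_oriented_step(x, loss_fns, goals, lr=0.01):
    """One ascent step on a multi-objective adversarial example: tasks whose
    goal is already met are dropped from the update, so effort concentrates
    on the goal-unachieved tasks."""
    x = x.detach().requires_grad_(True)
    active = []
    for loss_fn, goal in zip(loss_fns, goals):
        value = loss_fn(x)
        if value.item() < goal:          # goal not achieved yet: keep maximizing
            active.append(value)
    if not active:
        return x.detach()                 # every goal achieved; stop updating
    total = torch.stack(active).mean()    # equal weight over the active tasks
    grad, = torch.autograd.grad(total, x)
    return (x + lr * grad.sign()).detach()
```

Here loss_fns could be the cross-entropy losses of several victim models on the true label, with each goal set to the loss level that counts as a successful attack on that model.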
The experimental results firmly demonstrate the merit of our proposed approach.","Multi-Objective Optimization, Adversarial Machine Learning" On the Performance of Temporal Difference Learning With Neural Networks,https://openreview.net/forum?id=6JMXLWX68Kj,https://openreview.net/pdf?id=6JMXLWX68Kj,,"Neural Temporal Difference (TD) Learning is an approximate temporal difference method for policy evaluation that uses a neural network for function approximation. Analysis of Neural TD Learning has proven to be challenging. In this paper we provide a convergence analysis of Neural TD Learning with a projection onto $B(\theta_0, \omega)$, a ball of fixed radius $\omega$ around the initial point $\theta_0$. We show an approximation bound of $O(\epsilon + 1/\sqrt{m})$ where $\epsilon$ is the approximation quality of the best neural network in $B(\theta_0, \omega)$ and $m$ is the width of all hidden layers in the network. ", Certified Defences Against Adversarial Patch Attacks on Semantic Segmentation,https://openreview.net/forum?id=b0JxQC7JLWh,https://openreview.net/pdf?id=b0JxQC7JLWh,"We propose the first approach for certified recovery and certified detection against adversarial patch attacks on semantic segmentation, which is based on novel masking schemes and image inpainting.","Adversarial patch attacks are an emerging security threat for real world deep learning applications. We present Demasked Smoothing, the first approach (to our knowledge) to certify the robustness of semantic segmentation models against this threat model. Previous work on certifiably defending against patch attacks has mostly focused on the image classification task and often required changes in the model architecture and additional training, which is undesirable and computationally expensive. In Demasked Smoothing, any segmentation model can be applied without particular training, fine-tuning, or restriction of the architecture. Using different masking strategies, Demasked Smoothing can be applied both for certified detection and certified recovery. In extensive experiments we show that Demasked Smoothing can on average certify 63% of the pixel predictions for a 1% patch in the detection task and 46% against a 0.5% patch for the recovery task on the ADE20K dataset. ","certified defences, adversarial robustness, adversarial patch attacks" Markup-to-Image Diffusion Models with Scheduled Sampling,https://openreview.net/forum?id=81VJDmOE2ol,https://openreview.net/pdf?id=81VJDmOE2ol,,"Building on recent advances in image generation, we present a fully data-driven approach to rendering markup into images. The approach is based on diffusion models, which parameterize the distribution of data using a sequence of denoising operations on top of a Gaussian noise distribution. We view the diffusion denoising process as a sequential decision-making process, and show that it exhibits compounding errors similar to exposure bias issues in imitation learning problems. To mitigate these issues, we adapt the scheduled sampling algorithm to diffusion training. We conduct experiments on four markup datasets: formulas (LaTeX), table layouts (HTML), sheet music (LilyPond), and molecular images (SMILES). These experiments each verify the effectiveness of diffusion and the use of scheduled sampling to fix generation issues. 
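To make the scheduled-sampling idea concrete, here is a minimal numpy sketch of one way to construct training inputs: with some probability, the noisy input at level t is obtained by re-noising the model's own prediction from a more-noised level, rather than by the forward process alone. The toy forward process and stand-in denoiser are assumptions added for self-containment, not the paper's parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_sample(x0, t, T):
    """Toy forward process: interpolate toward Gaussian noise as t -> T."""
    alpha = 1.0 - t / T
    return alpha * x0 + np.sqrt(1.0 - alpha**2) * rng.normal(size=x0.shape)

def denoiser(x_t, t, T):
    """Stand-in for the learned denoising network (here: a crude shrinkage)."""
    return x_t * (1.0 - t / T)

def training_input(x0, t, T, p_sched):
    """Scheduled sampling: with probability p_sched, build x_t by rolling the
    model forward from a more-noised x_{t+1} instead of sampling the forward
    process directly, exposing training to the model's own predictions."""
    if rng.random() < p_sched and t + 1 <= T:
        x_next = q_sample(x0, t + 1, T)
        x0_pred = denoiser(x_next, t + 1, T)   # model's own estimate of x0
        return q_sample(x0_pred, t, T)         # re-noise the prediction to level t
    return q_sample(x0, t, T)                  # standard teacher-forced input

x0 = rng.normal(size=8)
print(training_input(x0, t=5, T=10, p_sched=0.5).shape)  # (8,)
```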
These results also show that the markup-to-image task presents a useful controlled compositional setting for diagnosing and analyzing generative image models.", ADVERSARIALLY BALANCED REPRESENTATION FOR CONTINUOUS TREATMENT EFFECT ESTIMATION,https://openreview.net/forum?id=hXsBCVNJu_v,https://openreview.net/pdf?id=hXsBCVNJu_v,,"Estimating the individual treatment effect (ITE) requires covariate balance among different treatment groups, and machine learning models have shown great promise in learning a balanced representation of covariates. In contrast with binary treatments for which learning such a representation has been widely studied, the more practical yet complicated continuous treatment setting has remained relatively under-explored. Adopting an information-theoretic approach, we introduce a novel mutual information (MI)-based objective for continuous treatment effect estimation. Leveraging variational approximation to optimize MI terms in our objective, we propose a method called Adversarial CounterFactual Regression (ACFR). ACFR aligns the representation of covariates through an adversarial game and predicts the potential outcomes using a contribution-constraining hypothesis network. Comparison of ACFR against state-of-the-art methods on semi-synthetic datasets demonstrates its superiority in individual-level metrics.", Efficient Reward Poisoning Attacks on Online Deep Reinforcement Learning,https://openreview.net/forum?id=eG14tR9lssZ,https://openreview.net/pdf?id=eG14tR9lssZ,,"We study reward poisoning attacks on online deep reinforcement learning (DRL), where the attacker is oblivious to the learning algorithm used by the agent and does not necessarily have full knowledge of the environment. We demonstrate the intrinsic vulnerability of state-of-the-art DRL algorithms by designing a general, black-box reward poisoning framework called adversarial MDP attacks. We instantiate our framework to construct several new attacks which only corrupt the rewards for a small fraction of the total training timesteps and make the agent learn a low-performing policy. Our key insight is that state-of-the-art DRL algorithms strategically explore the environment to find a high-performing policy. Our attacks leverage this insight to construct a corrupted environment where (a) the agent learns a high-performing policy that has low performance in the original environment and (b) the corrupted environment is similar to the original one so that the attacker's budget is reduced. We provide a theoretical analysis of the efficiency of our attack and perform an extensive evaluation. Our results show that our attacks efficiently poison agents learning with a variety of state-of-the-art DRL algorithms, such as DQN, PPO, SAC, etc., under several popular classical control and MuJoCo environments.", How Much Space Has Been Explored? Measuring the Chemical Space Covered by Databases and Machine-Generated Molecules,https://openreview.net/forum?id=Yo06F8kfMa1,https://openreview.net/pdf?id=Yo06F8kfMa1,We propose a novel evaluation framework for measures of the chemical space in the context of drug discovery.,"Forming a molecular candidate set that contains a wide range of potentially effective compounds is crucial to the success of drug discovery. While most databases and machine-learning-based generation models aim to optimize particular chemical properties, there is limited literature on how to properly measure the coverage of the chemical space by those candidates included or generated. 
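One natural family of coverage measures can be illustrated with a greedy packing count: how many candidates can be kept so that every pair stays farther apart than a threshold. This is only a hedged sketch in the spirit of the #Circles measure identified below; the Euclidean metric, random stand-in fingerprints, and threshold value are illustrative assumptions.

```python
import numpy as np

def packing_coverage(X, t):
    """Greedy lower bound on a packing-style coverage measure: the number of
    points that can be kept so that every retained pair is more than distance
    t apart. Larger values indicate wider coverage at scale t."""
    centers = []
    for x in X:
        if all(np.linalg.norm(x - c) > t for c in centers):
            centers.append(x)
    return len(centers)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))       # stand-ins for molecular fingerprints
print(packing_coverage(X, t=4.0))
```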
This problem is challenging due to the lack of formal criteria to select good measures of the chemical space. In this paper, we propose a novel evaluation framework for measures of the chemical space based on two analyses: an axiomatic analysis with two intuitive axioms that a good measure should obey, and an empirical analysis on the correlation between a measure and a proxy gold standard. Using this framework, we are able to identify a novel chemical space coverage measure, #Circles, superior to existing measures both analytically and empirically. We further evaluate how well the existing databases and generation models cover the chemical space in terms of #Circles. The results suggest that many generation models fail to explore a larger space over existing databases, which leads to new opportunities for improving generation models by encouraging exploration. ","molecular generation, drug discovery, coverage measures, chemical space measures" Semantic Video Synthesis from Video Scene Graphs,https://openreview.net/forum?id=C__2aUY0_3w,https://openreview.net/pdf?id=C__2aUY0_3w,"A video scene graph-to-video synthesis framework is proposed with a pre-trained video scene graph encoder, VQ-VAE and auto-regressive Transformer.","Video synthesis has recently attracted a lot of attention, as the natural extension to the image synthesis task. Most image synthesis works use class labels or text as guidance. However, neither labels nor text can provide explicit temporal guidance, such as when action starts or ends. To overcome this limitation, we introduce video scene graphs as input for video synthesis, as they represent the spatial and temporal relationships between objects in the scene. Since video scene graphs are usually temporally discrete annotations, we propose a video scene graph (VSG) encoder that not only encodes the existing video scene graphs but also predicts the graph representations for unlabeled frames. The VSG encoder is pre-trained with different contrastive multi-modal losses. A video scene graph-to-video synthesis framework (SGVS) based on the pre-trained VSG encoder, VQ-VAE, and auto-regressive Transformer is proposed to synthesize a semantic video given an initial scene image and a non-fixed number of video scene graphs. We evaluate SGVS and other state-of-the-art video synthesis models on the Action Genome dataset and demonstrate the benefit of video scene graphs for video synthesis.","video synthesis, scene graph, scene understanding" D-CIPHER: Discovery of Closed-form Partial Differential Equations,https://openreview.net/forum?id=q89i5jKql38,https://openreview.net/pdf?id=q89i5jKql38,,"Closed-form differential equations, including partial differential equations and higher-order ordinary differential equations, are one of the most important tools used by scientists to model and better understand natural phenomena. Discovering these equations directly from data is challenging because it requires modeling relationships between various derivatives that are not observed in the data (equation-data mismatch) and it involves searching across a huge space of possible equations. Current approaches make strong assumptions about the form of the equation and thus fail to discover many well-known systems. Moreover, many of them resolve the equation-data mismatch by estimating the derivatives, which makes them inadequate for noisy and infrequently sampled systems. 
To this end, we propose D-CIPHER, which is robust to measurement artifacts and can uncover a new and very general class of differential equations. We further design a novel optimization procedure, CoLLie, to help D-CIPHER search through this class efficiently. Finally, we demonstrate empirically that it can discover many well-known equations that are beyond the capabilities of current methods.","differential equations, symbolic regression" "Towards Identification of Microaggressions in real-life and Scripted conversations, using Context-Aware Machine Learning Techniques.",https://openreview.net/forum?id=z7FfWq2iaW4,https://openreview.net/pdf?id=z7FfWq2iaW4,"Classification of microaggressions from text data (extracted from real-life conversations, social media platforms, and classic TV shows) using SVMs and RoBERTa, considering the impact of varying amounts of context on overall models' performance. ","The advent and rapid proliferation of social media have brought with it an exponential growth in hate speech and overt offensive language, with one of the most subtle yet pervasive subcategories of hate speech being Microaggressions (MA). MAs are unintentional, hostile, derogatory, or negative prejudicial slights and insults toward any group, particularly culturally marginalized communities, and a growing body of research links long-term MA exposure to serious health problems. The scarcity of studies leveraging AI techniques to identify MAs in text and in spoken conversations, coupled with the lack of investigative analysis on the impact of context on the performance of algorithms used for this task, makes this a relevant topic for the AI community. In this paper, we explore how effectively MAs, often found in spoken human communication, can be detected across various contexts (e.g., workplace, social media, conversations) using Machine Learning models. We further examine the extent to which art may imitate life by comparing the ability of these models, trained on real-life conversations, to infer MAs occurring in scripted television shows. We apply a Support Vector Machine (SVM) classifier using N-grams, and a contextual modeling representation using the Robustly Optimized Bidirectional Encoder Representation for Transformers (RoBERTa) model, whose performance is evaluated based on its pretraining size and ability to accurately detect hate speech, with comparative results from BERT base-uncased and the HateBERT model, respectively. Overall, the results show that contextual transformer models outperform simpler context-free approaches to classifying MAs collected from surveys and online blogs. We also found that these models trained on real-life conversations could infer MAs in scripted TV settings, though at reduced levels and equal rates, suggesting there may be a disconnect between contexts of MA found in art and those from real life.","Microaggression, social conversations, Natural language Processing, Contextual Model, RoBERTa, Support Vector Machines." "UNIFIED-IO: A Unified Model for Vision, Language, and Multi-modal Tasks",https://openreview.net/forum?id=E01k9048soZ,https://openreview.net/pdf?id=E01k9048soZ,,"We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical computer vision tasks, including pose estimation, object detection, depth estimation and image generation; vision-and-language tasks such as region captioning and referring expression; and natural language processing tasks such as question answering and paraphrasing. 
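The token-homogenization idea can be illustrated on bounding boxes: quantize each coordinate into discrete location bins that live in the same vocabulary as text tokens. The bin count and vocabulary offset below are assumptions for illustration, not Unified-IO's actual values.

```python
def box_to_tokens(box, image_size, n_bins=1000, vocab_offset=32000):
    """Homogenize a bounding box into discrete vocabulary tokens by
    quantizing each coordinate into one of n_bins location bins
    (illustrative scheme; offsets and bin counts are assumptions)."""
    w, h = image_size
    tokens = []
    for value, extent in zip(box, (w, h, w, h)):          # (x1, y1, x2, y2)
        bin_id = min(int(value / extent * n_bins), n_bins - 1)
        tokens.append(vocab_offset + bin_id)
    return tokens

def tokens_to_box(tokens, image_size, n_bins=1000, vocab_offset=32000):
    """Invert the quantization (up to bin resolution)."""
    w, h = image_size
    return [
        (tok - vocab_offset + 0.5) / n_bins * extent
        for tok, extent in zip(tokens, (w, h, w, h))
    ]

toks = box_to_tokens((40, 60, 200, 180), image_size=(640, 480))
print(toks, tokens_to_box(toks, image_size=(640, 480)))
```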
Developing a single unified model for such a large variety of tasks poses unique challenges due to the heterogeneous inputs and outputs pertaining to each task, including RGB images, per-pixel maps, binary masks, bounding boxes, and language. We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens. This common representation across all tasks allows us to train a single transformer-based architecture, jointly on over 90 diverse datasets in the vision and language fields. Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark and produces strong results across 16 diverse benchmarks like NYUv2-Depth, ImageNet, VQA2.0, OK-VQA, Swig, VizWizGround, BoolQ, and SciTail, with no task-specific fine-tuning. Code and pre-trained models will be made publicly available.", Benchmarking Offline Reinforcement Learning on Real-Robot Hardware,https://openreview.net/forum?id=3k5CUGDLNdd,https://openreview.net/pdf?id=3k5CUGDLNdd,We propose new robotics datasets for dexterous manipulation and benchmark offline RL algorithms on them.,"Learning policies from previously recorded data is a promising direction for real-world robotics tasks, as online learning is often infeasible. Dexterous manipulation in particular remains an open problem in its general form. The combination of offline reinforcement learning with large diverse datasets, however, has the potential to lead to a breakthrough in this challenging domain analogously to the rapid progress made in supervised learning in recent years. To coordinate the efforts of the research community toward tackling this problem, we propose a benchmark including: i) a large collection of data for offline learning from a dexterous manipulation platform on two tasks, obtained with capable RL agents trained in simulation; ii) the option to execute learned policies on a real-world robotic system and a simulation for efficient debugging. We evaluate prominent open-sourced offline reinforcement learning algorithms on the datasets and provide a reproducible experimental setup for offline reinforcement learning on real systems.","offline reinforcement learning, robotic manipulation, dexterous manipulation, TriFinger platform" CUDA: Curriculum of Data Augmentation for Long-tailed Recognition,https://openreview.net/forum?id=RgUPdudkWlN,https://openreview.net/pdf?id=RgUPdudkWlN,"We propose a class-wise data augmentation method by designing the curriculum of data augmentation, which is based on our findings that stronger augmentation on major classes improves the performance on long-tailed recognition.","Class imbalance problems frequently occur in real-world tasks, and conventional deep learning algorithms are well known for performance degradation on imbalanced training datasets. To mitigate this problem, many approaches have aimed to balance among given classes by re-weighting or re-sampling training samples. These re-balancing methods increase the impact of minority classes and reduce the influence of majority classes on the output of models. However, the extracted representations may be of poor quality owing to the limited number of minority samples. To handle this restriction, several methods have been developed that increase the representations of minority samples by leveraging the features of the majority samples. Despite extensive recent studies, no deep analysis has been conducted on determining which classes to augment and how strong the augmentation should be. 
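A minimal sketch of what a per-class augmentation curriculum can look like follows; the threshold rule, step sizes, and cap are illustrative assumptions, not CUDA's exact criterion.

```python
import numpy as np

def update_strengths(strengths, aug_correct_rate, threshold=0.6, step=1):
    """Curriculum rule (toy version): if a class still classifies its
    strongly-augmented samples well, it can tolerate more augmentation;
    otherwise back off one level."""
    strengths = strengths.copy()
    for c, rate in enumerate(aug_correct_rate):
        strengths[c] += step if rate >= threshold else -step
    return np.clip(strengths, 0, 30)

strengths = np.zeros(3, dtype=int)                 # per-class augmentation levels
for rates in ([0.9, 0.7, 0.3], [0.8, 0.5, 0.2]):   # per-epoch class accuracies
    strengths = update_strengths(strengths, rates)
print(strengths)   # well-classified (majority) classes climb the curriculum faster
```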
In this study, we first investigate the correlation between the degree of augmentation and class-wise performance, and find that the proper degree of augmentation must be allocated for each class to mitigate class imbalance problems. Motivated by this finding, we propose CUDA: CUrriculum of Data Augmentation, a simple and efficient novel curriculum designed to find the appropriate per-class strength of data augmentation for long-tailed recognition. CUDA can simply be integrated into existing long-tailed recognition methods. We present the results of experiments showing that CUDA effectively achieves better generalization performance compared to the state-of-the-art method on various imbalanced datasets such as CIFAR-100-LT, ImageNet-LT, and iNaturalist 2018. ","Long-tailed recognition, class imbalance" Understanding new tasks through the lens of training data via exponential tilting,https://openreview.net/forum?id=DBMttEEoLbw,https://openreview.net/pdf?id=DBMttEEoLbw,,"Deploying machine learning models on new tasks is a major challenge due to differences in distributions of the train (source) data and the new (target) data. However, the training data likely captures some of the properties of the new task. We consider the problem of reweighting the training samples to gain insights into the distribution of the target task. Specifically, we formulate a distribution shift model based on the exponential tilt assumption and learn train data importance weights minimizing the KL divergence between labeled train and unlabeled target datasets. The learned train data weights can then be used for downstream tasks such as target performance evaluation, fine-tuning, and model selection. We demonstrate the efficacy of our method on the Waterbirds and Breeds benchmarks.","Out-of-distribution generalization, model selection, subpopulation shift, concept drift" Neighborhood Gradient Clustering: An Efficient Decentralized Learning Method for Non-IID Data Distributions,https://openreview.net/forum?id=9qgOs_IwRS3,https://openreview.net/pdf?id=9qgOs_IwRS3,We propose a novel decentralized learning algorithm that improves performance over non-IID data distributions by manipulating local gradients.,"Decentralized learning algorithms enable the training of deep learning models over large distributed datasets generated at different devices and locations, without the need for a central server. In practical scenarios, the distributed datasets can have significantly different data distributions across the agents. The current state-of-the-art decentralized algorithms mostly assume the data distributions to be Independent and Identically Distributed (IID). This paper focuses on improving decentralized learning over non-IID data distributions with minimal compute and memory overheads. We propose Neighborhood Gradient Clustering (NGC), a novel decentralized learning algorithm that modifies the local gradients of each agent using self- and cross-gradient information. Cross-gradients for a pair of neighboring agents are the derivatives of the model parameters of an agent with respect to the dataset of the other agent. 
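A toy numpy sketch of the gradient-mixing step follows; the single mixing weight alpha, the flat average over cross-gradients, and plain SGD are simplifying assumptions, and the extra communication round for the data-variant terms is abstracted away.

```python
import numpy as np

def ngc_update(theta, self_grad, model_variant_cgs, data_variant_cgs, alpha, lr=0.1):
    """One local step (sketch): replace the local gradient by a weighted mean
    of the self-gradient, the model-variant cross-gradients, and the
    data-variant cross-gradients, then take an SGD step."""
    cross = model_variant_cgs + data_variant_cgs       # concatenate both lists
    mixed = (1 - alpha) * self_grad + alpha * np.mean(cross, axis=0)
    return theta - lr * mixed

rng = np.random.default_rng(0)
d = 4
theta = rng.normal(size=d)
self_grad = rng.normal(size=d)
mv = [rng.normal(size=d) for _ in range(2)]   # neighbors' models on local data
dv = [rng.normal(size=d) for _ in range(2)]   # local model on neighbors' data
print(ngc_update(theta, self_grad, mv, dv, alpha=0.5))
```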
In particular, the proposed method replaces the local gradients of the model with the weighted mean of the self-gradients, model-variant cross-gradients (derivatives of the received neighbors’ model parameters with respect to the local dataset - computed locally), and data-variant cross-gradients (derivatives of the local model with respect to its neighbors’ datasets - received through communication). The data-variant cross-gradients are aggregated through an additional communication round without breaking the privacy constraints of the decentralized setting. Further, we present CompNGC, a compressed version of NGC that reduces the communication overhead by $32 \times$ by compressing the cross-gradients. We demonstrate the empirical convergence and efficiency of the proposed technique over non-IID data distributions sampled from the CIFAR-10 dataset on various model architectures and graph topologies. Our experiments demonstrate that NGC and CompNGC outperform the existing state-of-the-art (SoTA) decentralized learning algorithm over non-IID data by $1-5\%$ with significantly lower compute and memory requirements. Further, we show that the proposed NGC method outperforms the baseline by $5-40\%$ with no additional communication. ","Federated Learning, Distributed Machine Learning, Decentralized Learning, Communication Efficient, Energy Efficient, Non-IID Data Distribution, Convergence" Equilibrium-finding via exploitability descent with learned best-response functions,https://openreview.net/forum?id=ltCuqJpZl7S,https://openreview.net/pdf?id=ltCuqJpZl7S,We propose a new method for equilibrium finding based on the idea of learned best-response functions.,"There has been great progress on equilibrium-finding research over the last 20 years. Most of that work has focused on games with finite, discrete action spaces. However, many games involving space, time, money, etc. have continuous action spaces. We study the problem of computing approximate Nash equilibria of games with continuous strategy sets. The main measure of closeness to Nash equilibrium is exploitability, which measures how much players can benefit from unilaterally changing their strategy. We propose a new method that minimizes an approximation of exploitability with respect to the strategy profile. This approximation is computed using learned best-response functions, which take the current strategy profile as input and return learned best responses. The strategy profile and best-response functions are trained simultaneously, with the former trying to minimize exploitability while the latter try to maximize it. We evaluate our method on various continuous games, showing that it outperforms prior methods.","equilibrium finding, game solving, best-response function, computational game theory" A Unified Framework for Comparing Learning Algorithms,https://openreview.net/forum?id=dYQnWPqCCAs,https://openreview.net/pdf?id=dYQnWPqCCAs,A unified framework for comparing models trained with two different learning algorithms based on how models use training data to make predictions,"We propose a framework for {\em (learning) algorithm comparisons}, wherein the goal is to find similarities and differences between models trained with two different learning algorithms. We begin by formalizing the goal of algorithm comparison as finding {\em distinguishing feature transformations}, input transformations that change the predictions of models trained with one learning algorithm but not the other. 
We then present a two-stage method for algorithm comparisons based on comparing how models use the training data, leveraging the recently proposed datamodel representations [Ilyas et al., 2022]. We demonstrate our framework through three case studies that compare models trained with/without standard data augmentation, with/without pre-training, and with different optimizer hyperparameters. ","algorithm comparison, model comparison, data-centric, influence, datamodels, data augmentation, pretraining, learning rate" Neural Network Approximations of PDEs Beyond Linearity: Representational Perspective,https://openreview.net/forum?id=70BaDC5ceIO,https://openreview.net/pdf?id=70BaDC5ceIO,,"A burgeoning line of research has developed deep neural networks capable of approximating the solutions to high dimensional PDEs, opening related lines of theoretical inquiry focused on explaining how it is that these models appear to evade the curse of dimensionality. However, most theoretical analyses thus far have been limited to simple linear PDEs. In this work, we take a step towards studying the representational power of neural networks for approximating solutions to nonlinear PDEs. We focus on a class of PDEs known as nonlinear elliptic variational PDEs, whose solutions minimize an Euler-Lagrange energy functional $\mathcal{E}(u) = \int_\Omega L(\nabla u) dx$. We show that if composing a function with Barron norm $b$ with $L$ produces a function of Barron norm at most $B_L b^p$, the solution to the PDE can be $\epsilon$-approximated in the $L^2$ sense by a function with Barron norm $O\left(\left(dB_L\right)^{p^{\log(1/\epsilon)}}\right)$. By a classical result due to \cite{barron1993universal}, this correspondingly bounds the size of a 2-layer neural network needed to approximate the solution. Treating $p, \epsilon, B_L$ as constants, this quantity is polynomial in dimension, thus showing neural networks can evade the curse of dimensionality. Our proof technique involves neurally simulating (preconditioned) gradient descent in an appropriate Hilbert space, which converges exponentially fast to the solution of the PDE while allowing us to bound the increase of the Barron norm at each iterate. Our results subsume and substantially generalize analogous prior results for linear elliptic PDEs. ","PDE, Partial Differential Equations, Deep Learning Theory, Universal Approximation" Calibrating Sequence likelihood Improves Conditional Language Generation,https://openreview.net/forum?id=0qSOodKmJaN,https://openreview.net/pdf?id=0qSOodKmJaN,"A proposed sequence likelihood calibration stage improves fine-tuned conditional language models, leading to new state-of-the-art results in abstractive summarization, question generation, abstractive question answering and data-to-text.","Conditional language models are predominantly trained with maximum likelihood estimation (MLE), giving probability mass to sparsely observed target sequences. While MLE trained models assign high probability to plausible sequences given the context, the model probabilities often do not accurately rank-order generated sequences by quality. This has been empirically observed in beam search decoding as output quality degrading with large beam sizes, and decoding strategies benefiting from heuristics such as length normalization and repetition-blocking. In this work, we introduce sequence likelihood calibration (SLiC) where the likelihood of model generated sequences are calibrated to better align with reference sequences in the model’s latent space. 
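The calibration idea can be sketched as a pairwise rank loss: candidates that are more similar to the reference should receive higher sequence likelihood under the model. This is a hedged toy version; SLiC's actual calibration losses and its latent-space similarity computation differ in detail.

```python
def rank_calibration_loss(seq_logprobs, similarities, margin=1.0):
    """Toy SLiC-style calibration: for every ordered candidate pair, the
    candidate more similar to the reference should also get the higher
    model log-likelihood; penalize violations with a hinge (margin assumed)."""
    loss, n_pairs = 0.0, 0
    for i in range(len(seq_logprobs)):
        for j in range(len(seq_logprobs)):
            if similarities[i] > similarities[j]:
                loss += max(0.0, margin - (seq_logprobs[i] - seq_logprobs[j]))
                n_pairs += 1
    return loss / max(n_pairs, 1)

# Candidates' total log-probs under the model, and their similarity to the
# reference (stand-in numbers): candidate 0 is most similar but not most likely.
print(rank_calibration_loss(seq_logprobs=[-12.0, -9.5, -15.0],
                            similarities=[0.8, 0.6, 0.2]))
```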
With SLiC, decoding heuristics become unnecessary and decoding candidates’ quality significantly improves regardless of the decoding method. Furthermore, SLiC shows no sign of diminishing returns with model scale, and presents alternative ways to improve quality with limited training and inference budgets. With SLiC, we exceed or match SOTA results on a wide range of generation tasks spanning abstractive summarization, question generation, abstractive question answering and data-to-text generation, even with modest-sized models.","Natural Language Processing, conditional language models, sequence-to-sequence, text generation" Masked inverse folding with sequence transfer for protein representation learning,https://openreview.net/forum?id=2EO8eQ2vySB,https://openreview.net/pdf?id=2EO8eQ2vySB,,"Self-supervised pretraining on protein sequences has led to state-of-the-art performance on protein function and fitness prediction. However, sequence-only methods ignore the rich information contained in experimental and predicted protein structures. Meanwhile, inverse folding methods reconstruct a protein's amino-acid sequence given its structure, but do not take advantage of sequences that do not have known structures. In this study, we train a masked inverse folding protein language model parameterized as a structured graph neural network. We then show that using the outputs from a pretrained sequence-only protein masked language model as input to the inverse folding model further improves pretraining perplexity. We evaluate both of these models on downstream protein engineering tasks and analyze the effect of using information from experimental or predicted structures on performance. ", Convolutions are competitive with transformers for protein sequence pretraining,https://openreview.net/forum?id=ukveBtI9lnk,https://openreview.net/pdf?id=ukveBtI9lnk,"For proteins, large pretrained CNNs are competitive with large pretrained transformers. ","Pretrained protein sequence language models largely rely on the transformer architecture. However, transformer run-time and memory requirements scale quadratically with sequence length. We investigate the potential of a convolution-based architecture for protein sequence masked language model pretraining and subsequent finetuning. CNNs are competitive on the pretraining task with transformers across several orders of magnitude in parameter size while scaling linearly with sequence length. More importantly, CNNs are competitive with and occasionally superior to transformers across an extensive set of downstream evaluations, including structure prediction, zero-shot mutation effect prediction, and out-of-domain generalization. We also demonstrate strong performance on sequences longer than the positional embeddings allowed in the current state-of-the-art transformer protein masked language models. Finally, we close with a call to disentangle the effects of pretraining task and model architecture when studying pretrained protein sequence models. ","protein, pretrain, convolution" Learning differentiable solvers for systems with hard constraints,https://openreview.net/forum?id=vdv6CmGksr0,https://openreview.net/pdf?id=vdv6CmGksr0,We propose a method to solve partial differential equations (PDEs) through enforcing constraints in neural networks.,"We introduce a practical method to enforce linear partial differential equation (PDE) constraints for functions defined by neural networks (NNs), up to a desired tolerance. 
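For the special case of a finite set of linear equality constraints, the hard-constraint idea can be illustrated by orthogonal projection onto the feasible set, itself a differentiable operation. This is a minimal sketch of the projection idea only, not the paper's PDE-constrained layer.

```python
import numpy as np

def project_onto_constraints(u, A, b):
    """Orthogonally project the network output u onto the affine feasible
    set {u : A u = b}, so the constraints hold up to numerical tolerance."""
    correction = A.T @ np.linalg.solve(A @ A.T, A @ u - b)
    return u - correction

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 10))       # 3 linear constraints on a 10-dim output
b = rng.normal(size=3)
u = rng.normal(size=10)            # unconstrained network output
u_proj = project_onto_constraints(u, A, b)
print(np.max(np.abs(A @ u_proj - b)))   # ~1e-15: constraints satisfied
```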
By combining methods in differentiable optimization and applications of the implicit function theorem to NN models, we develop a differentiable PDE-constrained layer that can be incorporated into an NN. Inspired by dictionary learning, our model learns a family of functions, each of which defines a mapping from PDE parameters to PDE solutions. At inference time, the model finds an optimal linear combination of the functions in the learned family by solving a PDE-constrained optimization problem. Our method provides continuous solutions over the domain of interest that accurately satisfy desired physical constraints. Our results show that incorporating hard constraints directly into the NN architecture achieves much lower test error when compared to training on an unconstrained objective.","differentiable optimization, PDEs, physics, neural networks, differentiable constraints, dictionary learning" FedDAR: Federated Domain-Aware Representation Learning,https://openreview.net/forum?id=6P9Y25Pljl6,https://openreview.net/pdf?id=6P9Y25Pljl6,,"Cross-silo Federated learning (FL) has become a promising tool in machine learning applications for healthcare. It allows hospitals/institutions to train models with sufficient data while the data is kept private. To make sure the FL model is robust when facing heterogeneous data among FL clients, most efforts focus on personalizing models for clients. However, the latent relationships between clients' data are ignored. In this work, we focus on a special non-iid FL problem, called Domain-mixed FL, where each client's data distribution is assumed to be a mixture of several predefined domains. Recognizing the diversity of domains and the similarity within domains, we propose a novel method, FedDAR, which learns a domain shared representation and domain-wise personalized prediction heads in a decoupled manner. For simplified linear regression settings, we have theoretically proved that FedDAR enjoys a linear convergence rate. For general settings, we have performed intensive empirical studies on both synthetic and real-world medical datasets, which demonstrate its superiority over prior FL methods. ","federated learning, healthcare, fairness, personalization" KL-Entropy-Regularized RL with a Generative Model is Minimax Optimal,https://openreview.net/forum?id=Xq2J1kiZeHE,https://openreview.net/pdf?id=Xq2J1kiZeHE,We show that KL-entropy-regularized value iteration is minimax-optimal under the generative model setting.,"In this work, we consider and analyze the sample complexity of model-free reinforcement learning with a generative model. Particularly, we analyze mirror descent value iteration (MDVI) by Geist et al. (2019) and Vieillard et al. (2020a), which uses the Kullback-Leibler divergence and entropy regularization in its value and policy updates. Our analysis shows that it is nearly minimax-optimal for finding an ε-optimal policy when ε is sufficiently small. 
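A tabular sketch of the KL- and entropy-regularized policy update at MDVI's core follows. The coefficient convention below is one common parameterization (maximize <pi, q> - tau*KL(pi || prev_pi) + kappa*H(pi)) and is stated as an assumption; the paper's exact formulation may differ.

```python
import numpy as np

def regularized_policy_update(q, prev_pi, tau, kappa):
    """Closed-form maximizer of <pi, q> - tau*KL(pi || prev_pi) + kappa*H(pi):
    pi ~ prev_pi^(tau/(tau+kappa)) * exp(q/(tau+kappa)), computed row-wise."""
    alpha = tau / (tau + kappa)
    logits = alpha * np.log(prev_pi) + q / (tau + kappa)
    logits -= logits.max(axis=-1, keepdims=True)   # stabilize the softmax
    pi = np.exp(logits)
    return pi / pi.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 3))                        # Q-values: 4 states, 3 actions
pi = np.full((4, 3), 1 / 3)                        # uniform initial policy
for _ in range(5):                                 # repeated regularized updates
    pi = regularized_policy_update(q, pi, tau=0.1, kappa=0.05)
print(pi.round(3))
```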
This is the first theoretical result that demonstrates that a simple model-free algorithm without variance reduction can be nearly minimax-optimal under the considered setting.","Reinforcement Learning, Minimax-Optimality, Generative Model, KL Regularization, Entropy Regularization" Learning to Estimate Shapley Values with Vision Transformers,https://openreview.net/forum?id=5ktFNz_pJLK,https://openreview.net/pdf?id=5ktFNz_pJLK,A learning-based approach to efficiently calculate Shapley values for ViTs,"Transformers have become a default architecture in computer vision, but understanding what drives their predictions remains a challenging problem. Current explanation approaches rely on attention values or input gradients, but these provide a limited understanding of a model’s dependencies. Shapley values offer a theoretically sound alternative, but their computational cost makes them impractical for large, high-dimensional models. In this work, we aim to make Shapley values practical for vision transformers (ViTs). To do so, we first leverage an attention masking approach to evaluate ViTs with partial information, and we then develop a procedure for generating Shapley value explanations via a separate, learned explainer model. Our experiments compare Shapley values to many baseline methods (e.g., attention rollout, GradCAM, LRP), and we find that our approach provides more accurate explanations than existing methods for ViTs.","ViTs, Shapley values, amortization, explainability" No Double Descent in PCA: Training and Pre-Training in High Dimensions,https://openreview.net/forum?id=ieWqvOiKgz2,https://openreview.net/pdf?id=ieWqvOiKgz2,We analyse PCA with linear regression for its generalization with high dimensional data and extend the setting to training the two model parts on two different data sets to establish connections to pre-training theory.,"With the recent body of work on overparameterized models, the gap between theory and practice in contemporary machine learning is shrinking. While many of the present state-of-the-art models have an encoder-decoder architecture, there is little theoretical work for this model structure. To improve our understanding in this direction, we consider linear encoder-decoder models, specifically PCA with linear regression on data from a low-dimensional manifold. We present an analysis for fundamental guarantees of the risk and asymptotic results for isotropic data when the model is trained in a supervised manner. The results are also verified in simulations. Furthermore, we extend our analysis to the popular setting where parts of the model are pre-trained in an unsupervised manner by pre-training the PCA encoder with subsequent supervised training of the linear regression. We show that the overall risk depends on the estimates of the eigenvectors in the encoder and present a sample complexity requirement through a concentration bound. The results highlight that using more pre-training data decreases the overall risk only if it improves the eigenvector estimates. 
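A minimal simulation of the analyzed pipeline, pre-training a PCA encoder on unlabeled data and then fitting least squares on the encoded features, can be written directly; the dimensions, noise levels, and linear manifold below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_pre, n_sup = 50, 5, 2000, 200

# Unlabeled pre-training data from a low-dimensional (here linear) manifold.
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]
X_pre = rng.normal(size=(n_pre, k)) @ basis.T + 0.1 * rng.normal(size=(n_pre, d))

# Pre-train the encoder: top-k principal directions of the unlabeled data.
_, _, Vt = np.linalg.svd(X_pre - X_pre.mean(0), full_matrices=False)
encoder = Vt[:k].T                                  # (d, k)

# Supervised stage: ordinary least squares on the encoded features.
Z_sup = rng.normal(size=(n_sup, k))
X_sup = Z_sup @ basis.T + 0.1 * rng.normal(size=(n_sup, d))
y_sup = Z_sup @ rng.normal(size=k) + 0.05 * rng.normal(size=n_sup)
w, *_ = np.linalg.lstsq(X_sup @ encoder, y_sup, rcond=None)

resid = X_sup @ encoder @ w - y_sup
print(float(np.mean(resid**2)))  # risk hinges on how well `encoder` spans `basis`
```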
Therefore, we stress that the eigenvalue distribution determines whether more pre-training data is useful or not.","PCA, generalization theory, overparameterization, pre-training theory, encoder-decoder models, double descent, linear regression" Predicting Drug Repurposing Candidates and Their Mechanisms from A Biomedical Knowledge Graph,https://openreview.net/forum?id=YycrpoVQB4G,https://openreview.net/pdf?id=YycrpoVQB4G,We predict drug repurposing candidates and their path-based mechanisms of action based on a large biomedical knowledge graph.,"Computational drug repurposing is a cost- and time-efficient method to identify new indications of approved or experimental drugs/compounds. It is especially critical for emerging and/or orphan diseases due to its lower investment and shorter research cycle compared with traditional wet-lab drug discovery approaches. However, the underlying mechanisms of action between repurposed drugs and their target diseases remain largely unknown, which is still an unsolved issue in existing repurposing methods. As such, computational drug repurposing has not been widely adopted in clinical settings. In this work, based on a massive biomedical knowledge graph, we propose a computational drug repurposing framework that not only predicts the treatment probabilities between drugs and diseases but also predicts the path-based, testable mechanisms of action (MOAs) as their biomedical explanations. Specifically, we utilize the GraphSAGE model in an unsupervised manner to integrate each entity’s neighborhood information and employ a Random Forest model to predict the treatment probabilities between pairs of drugs and diseases. Moreover, we train an adversarial actor-critic reinforcement learning model to predict the potential MOA for explaining drug repurposing. To encourage the model to find biologically reasonable paths, we utilize the curated molecular interactions of drugs and a PubMed-publication-based concept distance to extract potential drug MOA paths from the knowledge graph as "demonstration paths" to guide the model during the process of path-finding. Comprehensive experiments and case studies show that the proposed framework outperforms state-of-the-art baselines in both predictive performance of drug repurposing and explanatory performance of recapitulating human-curated DrugMechDB-based paths.","Drug Repurposing, Reinforcement Learning, Biomedical Knowledge Graph" ProGen2: Exploring the Boundaries of Protein Language Models,https://openreview.net/forum?id=ZOn4HXehSJ6,https://openreview.net/pdf?id=ZOn4HXehSJ6,Exploration of billion-scale model and dataset sizes to examine the boundaries of protein language models,"Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant for artificial intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional fine-tuning. 
As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. We release the ProGen2 models and code at https://github.com/anonymized-research/progen2.", Interval Bound Interpolation for Few-shot Learning with Few Tasks,https://openreview.net/forum?id=gwTP_sA-aj-,https://openreview.net/pdf?id=gwTP_sA-aj-,A method to densify the task distribution for few-task few-shot learning using task interpolation within interval-arithmetic-based bounds,"Few-shot learning aims to transfer the knowledge acquired from training on a diverse set of tasks to unseen tasks from the same task distribution, with a limited amount of labeled data. The underlying requirement for effective few-shot generalization is to learn a good representation of the task manifold. This becomes more difficult when only a limited number of tasks are available for training. In such a few-task few-shot setting, it is beneficial to explicitly preserve the local neighborhoods from the task manifold and exploit this to generate artificial tasks for training. To this end, we introduce the notion of interval bounds from the provably robust training literature to few-shot learning. The interval bounds are used to characterize neighborhoods around the training tasks. These neighborhoods can then be preserved by minimizing the distance between a task and its respective bounds. We then use a novel strategy to artificially form new tasks for training by interpolating between the available tasks and their respective interval bounds. We apply our framework to both model-agnostic meta-learning as well as prototype-based metric-learning paradigms. The efficacy of our proposed approach is evident from the improved performance on several datasets from diverse domains in comparison to recent methods.","meta-learning, metric learning, task interpolation, Interval Bound Propagation" A framework for benchmarking Class-out-of-distribution detection and its application to ImageNet,https://openreview.net/forum?id=Iuubb9W6Jtk,https://openreview.net/pdf?id=Iuubb9W6Jtk,"We present a framework for benchmarking the performance of image classifiers in detecting OOD. We apply it to benchmark 525 pretrained ImageNet classifiers, and analyze their performance resulting in interesting conclusions","When deployed for risk-sensitive tasks, deep neural networks must be able to detect instances with labels from outside the distribution for which they were trained. In this paper we present a novel technique to benchmark image classifiers' ability to detect class-out-of-distribution instances (i.e., instances whose true labels the model does not recognize) at various levels of detection difficulty. We apply this technique to ImageNet, and benchmark 525 pretrained, publicly available, ImageNet-1k classifiers. We will provide the code to generate a benchmark for any ImageNet-1k classifier, along with the benchmarks prepared for the above-mentioned 525 models. 
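Benchmarks of this kind typically score a confidence function by the AUROC of separating in-distribution from C-OOD samples. The following self-contained sketch uses stand-in max-softmax confidences; the score distributions are invented for illustration.

```python
import numpy as np

def auroc(pos_scores, neg_scores):
    """Rank-based AUROC: the probability that a random in-distribution sample
    scores higher than a random C-OOD sample (ties counted as half)."""
    pos, neg = np.asarray(pos_scores), np.asarray(neg_scores)
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

rng = np.random.default_rng(0)
# Stand-in max-softmax confidences: in-distribution tends to be more confident.
conf_in = np.clip(rng.normal(0.85, 0.10, size=1000), 0, 1)
conf_ood = np.clip(rng.normal(0.65, 0.15, size=1000), 0, 1)
print(round(float(auroc(conf_in, conf_ood)), 3))  # higher = better separation
```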
Additionally, we analyze the results from benchmarking these models and make numerous observations, including: (1) knowledge distillation consistently improves \emph{class-out-of-distribution} (C-OOD) detection performance; (2) a subset of ViTs performs better C-OOD detection than any other model; (3) the language-vision CLIP model achieves good zero-shot detection performance, with its best instance outperforming 96% of all other models evaluated; (4) accuracy and in-distribution ranking are positively correlated to C-OOD detection; and (5) we compare various confidence functions for C-OOD detection.","benchmarking, out of distribution, class out of distribution, OOD, OOD detection" Data Poisoning Attacks Against Multimodal Encoders,https://openreview.net/forum?id=7qSpaOSbRVO,https://openreview.net/pdf?id=7qSpaOSbRVO,,"Traditional machine learning (ML) models, e.g., image classifiers, usually rely on large-scale labeled datasets to achieve strong performance. However, such labeled datasets are often challenging and expensive to obtain. Also, the predefined categories limit the model's ability to generalize to other visual concepts as additional labeled data is required. On the contrary, the newly emerged multimodal model, which contains both visual and linguistic modalities, learns the concept of images from the raw text. It is a promising way to solve the above problems as it can use easy-to-collect image-text pairs to construct the training dataset and the raw texts contain almost unlimited categories according to their semantics. However, learning from a large-scale unlabeled dataset also exposes the model to the risk of potential poisoning attacks, whereby the adversary aims to perturb the model's training dataset to trigger malicious behaviors in it. Previous work mainly focuses on the visual modality. In this paper, we instead focus on answering two questions: (1) Is the linguistic modality also vulnerable to poisoning attacks? and (2) Which modality is most vulnerable? To answer the two questions, we conduct three types of poisoning attacks against CLIP, the most representative multimodal contrastive learning framework. Extensive evaluations on different datasets and model architectures show that all three attacks can perform well on the linguistic modality with only a relatively low poisoning rate and limited epochs. Also, we observe that the poisoning effect differs between different modalities, i.e., with lower MinRank in the visual modality and with higher Hit@K when K is small in the linguistic modality. To mitigate the attacks, we propose both pre-training and post-training defenses. We empirically show that both defenses can significantly reduce the attack performance while preserving the model's utility.", SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models,https://openreview.net/forum?id=TFbwV6I0VLg,https://openreview.net/pdf?id=TFbwV6I0VLg,We propose a general Transformer-based dynamic model to enable consistent future prediction in object-centric models,"Understanding dynamics from visual observations is a challenging problem that requires disentangling individual objects from the scene and learning their interactions. While recent object-centric models can successfully decompose a scene into objects, modeling their dynamics effectively still remains a challenge. We address this problem by introducing SlotFormer -- a Transformer-based autoregressive model operating on learned object-centric representations. 
Given a video clip, our approach reasons over object features to model spatio-temporal relationships and predicts accurate future object states. In this paper, we successfully apply SlotFormer to perform video prediction on datasets with complex object interactions. Moreover, the unsupervised SlotFormer's dynamics model can be used to improve the performance on supervised downstream tasks, such as Visual Question Answering (VQA), and goal-conditioned planning. Compared to past works on dynamics modeling, our method achieves significantly better long-term synthesis of object dynamics, while retaining high quality visual generation. In addition, SlotFormer enables VQA models to reason about the future without object-level labels, even outperforming counterparts that use ground-truth annotations. Finally, we show its ability to serve as a world model for model-based planning, which is competitive with methods designed specifically for such tasks.","Object-centric learning, dynamics modeling, Transformer" CEPD: Co-Exploring Pruning and Decomposition for Compact DNN Models,https://openreview.net/forum?id=PrRWSVT2htx,https://openreview.net/pdf?id=PrRWSVT2htx,,"Pruning and decomposition are two important techniques to compress deep neural network (DNN) models. To date, these two popular yet distinct approaches are typically used separately, while their efficient integration for better compression performance is little explored. In this paper, we perform systematic co-exploration on pruning and decomposition toward compact DNN models. We first investigate and analyze several important design factors for joint pruning and decomposition, including operational sequence, decomposition format, and optimization procedure. Based on the observations from our analysis, we then propose CEPD, a unified DNN compression framework that can simultaneously capture the benefits of pruning and decomposition in an efficient way. Empirical experiments demonstrate the promising performance of our proposed solution. Notably, on the CIFAR-10 dataset, CEPD brings 0.72% and 0.45% accuracy increase over the baseline ResNet-56 and MobileNetV2 models, respectively, and meanwhile the computational costs are reduced by 43.0% and 44.2%, respectively. On the ImageNet dataset, our approach can enable 0.10% and 1.39% accuracy increase over the baseline ResNet-18 and ResNet-50 models with 59.4% and 54.6% fewer parameters, respectively. ", "Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective",https://openreview.net/forum?id=MQcmfgRxf7a,https://openreview.net/pdf?id=MQcmfgRxf7a,"We present a joint objective for latent space model based RL which lower bounds the RL objective. Maximising this bound jointly with the encoder, model and the policy matches the performance of SOTA methods, while being 6-10 times faster. ","While reinforcement learning (RL) methods that learn an internal model of the environment have the potential to be more sample efficient than their model-free counterparts, learning to model raw observations from high dimensional sensors can be challenging. Prior work has addressed this challenge by learning low-dimensional representations of observations through auxiliary objectives, such as reconstruction or value prediction. However, the alignment between these auxiliary objectives and the RL objective is often unclear. 
In this work, we propose a single objective which jointly optimizes a latent-space model and policy to achieve high returns while remaining self-consistent. This objective is a lower bound on expected returns. Unlike prior bounds for model-based RL on policy exploration or model guarantees, our bound is directly on the overall RL objective. We demonstrate that the resulting algorithm matches or improves the sample-efficiency of the best prior model-based and model-free RL methods. While such sample-efficient methods are typically computationally demanding, our method attains the performance of SAC in about 50% less wall-clock time.","Latent-space models, objective mismatch, model based RL" Tessellated Neural Networks: A Robust Defence against Adversarial Attacks,https://openreview.net/forum?id=_NlE9YiyXKb,https://openreview.net/pdf?id=_NlE9YiyXKb,," Data-driven deep learning approaches for image classification are prone to adversarial attacks. An adversarial image that is sufficiently close to (visually indistinguishable from) a true image of its representative class can often be misclassified as a member of a different class. It is possible for attackers to exploit the high dimensionality of image representations, as learned by the neural models, to identify adversarial perturbations. To mitigate this problem, we propose a novel divide-and-conquer based approach of tessellating a base network architecture (e.g., a ResNet used in our experiments). The tessellated network learns the parameterized representations of each non-overlapping sub-region or tile within an image, independently, and then learns how to combine these representations to finally estimate the class of the input image. We investigate two different modes of tessellation, namely periodic, comprised of regular square-shaped tiles, and aperiodic, comprised of rectangles of different dimensions. Experiments demonstrate that the tessellated extension of two standard deep neural models leads to a better defence against a number of standard adversarial attacks. We observed that the decrease in post-attack accuracy values relative to the accuracy of the uncompromised networks is smaller for our proposed tessellated approach. ","AI safety, fairness, privacy, robustness" Retrieval-based Controllable Molecule Generation,https://openreview.net/forum?id=vDFA1tpuLvk,https://openreview.net/pdf?id=vDFA1tpuLvk,We propose a first-of-its-kind retrieval-based framework for controllable molecule generation which can effectively extrapolate beyond the retrieval database and achieves state-of-the-art performance on various benchmarks.,"Generating new molecules with specified chemical and biological properties via generative models has emerged as a promising direction for drug discovery. However, existing methods require extensive training/fine-tuning with a large dataset, often unavailable in real-world generation tasks. In this work, we propose a new retrieval-based framework for controllable molecule generation. We use a small set of exemplar molecules, i.e., those that (partially) satisfy the design criteria, to steer the pre-trained generative model towards synthesizing molecules that satisfy the given design criteria. We design a retrieval mechanism that retrieves and fuses the exemplar molecules with the input molecule, which is trained by a new self-supervised objective that predicts the nearest neighbor of the input molecule. 
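The retrieval step can be sketched with bit-vector fingerprints and Tanimoto similarity, a common choice in cheminformatics assumed here for illustration; the paper's fusion module is model-specific and omitted.

```python
import numpy as np

def tanimoto(query_fp, db_fps):
    """Tanimoto similarity between a query bit-fingerprint and a database."""
    inter = (query_fp & db_fps).sum(axis=1)
    union = (query_fp | db_fps).sum(axis=1)
    return inter / np.maximum(union, 1)

def retrieve_exemplars(query_fp, db_fps, k=5):
    """Return the indices and similarities of the k exemplar molecules most
    similar to the input; a fusion module would then merge them with the
    input molecule before decoding."""
    sims = tanimoto(query_fp, db_fps)
    idx = np.argsort(-sims)[:k]
    return idx, sims[idx]

rng = np.random.default_rng(0)
db = rng.integers(0, 2, size=(10_000, 256)).astype(bool)   # exemplar fingerprints
query = rng.integers(0, 2, size=256).astype(bool)
idx, sims = retrieve_exemplars(query, db, k=5)
print(idx, sims.round(3))
```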
We also propose an iterative refinement process to dynamically update the generated molecules and retrieval database for better generalization. Our approach is agnostic to the choice of generative models and requires no task-specific fine-tuning. On various tasks ranging from simple design criteria to a challenging real-world scenario for designing lead compounds that bind to the SARS-CoV-2 main protease, we demonstrate our approach extrapolates well beyond the retrieval database, and achieves better performance and wider applicability than previous methods.","controllable molecule generation, retrieval mechanism, exemplar molecules, drug discovery" ELRT: Towards Efficient Low-Rank Training for Compact Neural Networks,https://openreview.net/forum?id=TC39w69m8bB,https://openreview.net/pdf?id=TC39w69m8bB,,"Low-rank compression, a popular model compression technique that produces compact convolutional neural networks (CNNs) with low rankness, has been well studied in the literature. On the other hand, low-rank training, as an alternative way to train low-rank CNNs from scratch, is little explored yet. Unlike low-rank compression, low-rank training does not need pre-trained full-rank models and the entire training phase is always performed on the low-rank structure, bringing attractive benefits for practical applications. However, the existing low-rank training solutions are still very limited and do not demonstrate their effectiveness for training modern low-rank CNN models on large-scale datasets from scratch. In this paper, we perform a systematic investigation on low-rank CNN training. By identifying the proper low-rank format and performance-improving strategy, we propose ELRT, an efficient low-rank training solution for high-accuracy high-compactness low-rank CNN models. Our extensive evaluation results for training various CNNs on different datasets demonstrate the effectiveness of ELRT.", InfoOT: Information Maximizing Optimal Transport,https://openreview.net/forum?id=nG08xiRT2As,https://openreview.net/pdf?id=nG08xiRT2As,"We present InfoOT, an information-theoretic extension of optimal transport that combines the geometry with the mutual information between domains.","Optimal transport aligns samples across distributions by minimizing the transportation cost between them, e.g., the geometric distances. Yet, it ignores coherence structure in the data such as clusters, does not handle outliers well, and cannot integrate new data points. To address these drawbacks, we propose InfoOT, an information-theoretic extension of optimal transport that maximizes the mutual information between domains while minimizing geometric distances. The resulting objective can still be formulated as a (generalized) optimal transport problem, and can be efficiently solved by projected gradient descent. This formulation yields a new projection method that is robust to outliers and generalizes to unseen samples. Empirically, InfoOT improves the quality of alignments across benchmarks in domain adaptation, cross-domain retrieval, and single-cell alignment.","Optimal Transport, Unsupervised Alignment, Domain Adaptation, Mutual Information" To be robust and to be fair: aligning fairness with robustness,https://openreview.net/forum?id=G7iioWGldQ7,https://openreview.net/pdf?id=G7iioWGldQ7,bridging adversarial robustness of fairness and accuracy in a unified framework,"Adversarial training has been shown to be reliable in improving robustness against adversarial samples. 
However, the problem of adversarial training in terms of fairness has not yet been properly studied, and the relationship between fairness attacks and accuracy attacks remains unclear. Can we simultaneously improve robustness w.r.t. both fairness and accuracy? To address this question, in this paper we study the problem of adversarial training and adversarial attacks w.r.t. both metrics. We propose a unified structure for fairness attacks which brings together common notions in group fairness, and we theoretically prove the equivalence of fairness attacks against different notions. We show the alignment of fairness and accuracy attacks in disadvantaged subgroups, and we theoretically demonstrate that robustness of samples w.r.t. adversarial attack against one metric also benefits from robustness of samples w.r.t. adversarial attack against the other metric. Our work unifies adversarial training and attacks w.r.t. fairness and accuracy, where both metrics benefit from robustness of the other metric under adversarial attack. Our study suggests a novel way to incorporate adversarial training with fairness, and experimental results show that our proposed method achieves better performance in terms of robustness w.r.t. both fairness and accuracy.","fairness, adversarial robustness" Posterior Sampling Model-based Policy Optimization under Approximate Inference,https://openreview.net/forum?id=jwgnijhdF3V,https://openreview.net/pdf?id=jwgnijhdF3V,We proposed an improved posterior factorization for PSRL under approximate inference; and two sampling strategies.,"Model-based reinforcement learning algorithms (MBRL) hold tremendous promise for improving the sample efficiency in online RL. However, many existing popular MBRL algorithms cannot deal with exploration and exploitation properly. Posterior sampling reinforcement learning (PSRL) serves as a promising approach for automatically trading off exploration and exploitation, but the theoretical guarantees only hold under exact inference. In this paper, we show that adopting the same methodology as in exact PSRL can be fairly suboptimal under approximate inference. Motivated by the analysis, we propose an improved factorization for the posterior distribution of policies by removing the conditional independence between the policy and data given the model. By adopting such a posterior factorization, we further propose a general algorithmic framework for PSRL under approximate inference and a practical instantiation of it. Empirically, our algorithm can surpass the baseline methods by a significant margin on both dense-reward and sparse-reward tasks from the DM Control suite, OpenAI Gym, and Metaworld benchmarks.","Reinforcement learning, Posterior, Model-based reinforcement learning" Causal discovery from conditionally stationary time series,https://openreview.net/forum?id=Whf5OGxibGR,https://openreview.net/pdf?id=Whf5OGxibGR,,"Causal discovery, i.e., inferring underlying causal relationships from observational data, has been shown to be highly challenging for AI systems. In the time-series modeling context, traditional causal discovery methods mainly consider constrained scenarios with fully observed variables and/or data from stationary time-series. We develop a causal discovery approach to handle a wide class of non-stationary time-series that are conditionally stationary, where the non-stationary behaviour is modeled as stationarity conditioned on a set of (possibly hidden) state variables.
Named state-dependent causal inference (SDCI), our approach is able to recover the underlying causal dependencies, provably with fully-observed states and empirically with hidden states. The latter is confirmed by experiments on synthetic linear-system and nonlinear particle-interaction data, where SDCI achieves superior performance over baseline causal discovery methods. Improved results over non-causal RNNs on modeling NBA player movements demonstrate the potential of our method and motivate the use of causality-driven methods for forecasting.","causal discovery, temporal data, graph neural network, time series, non-stationary, probabilistic modelling" Fair Clustering via Equalized Confidence,https://openreview.net/forum?id=8vz6hO1S4o5,https://openreview.net/pdf?id=8vz6hO1S4o5,fair clustering based on equality of predicted confidence between different demographic groups,"Fair clustering aims at eliminating the effects of sensitive information in clustering assignment. Existing work on fair clustering addresses this problem as vanilla clustering with constraints that the distribution of protected groups on each cluster should be similar. However, existing criteria for fair clustering do not take into account clustering accuracy, and may restrain the performance of clustering algorithms. To tackle this problem, in this work, we propose a novel metric, equalized confidence, for fair clustering based on the predicted clustering confidence. Instead of enforcing similar distribution of sensitive attributes across different clusters, equalized confidence requires similar predicted confidence across different sensitive groups, bypassing the problem of disparities in statistical features across demographic groups. In light of the new metric, we propose a fair clustering method to learn a fair and good representation for clustering. Compared with conventional methods on fair clustering, which try to adjust the clustering assignment, our method focuses on learning a fair representation for downstream tasks. Our method eliminates the disparities of predicted soft labels of samples in different demographic groups using the Sinkhorn divergence, while also learning clustering-favorable representations. Experimental results show that our method performs better than or comparably to state-of-the-art methods, and that our proposed metric aligns better with clustering accuracy.","fair clustering, Sinkhorn divergence, equalized confidence" Learning for Edge-Weighted Online Bipartite Matching with Robustness Guarantees,https://openreview.net/forum?id=cwiFbXPW4G0,https://openreview.net/pdf?id=cwiFbXPW4G0,This paper proposes a novel reinforcement learning approach to solve edge-weighted online bipartite matching with robustness guarantees.,"Many real-world problems, such as online ad display, can be formulated as online bipartite matching. The crucial challenge lies in the nature of sequentially-revealed online item information, based on which we make irreversible matching decisions at each step. While numerous expert online algorithms have been proposed with bounded worst-case competitive ratios, they may not offer satisfactory performance in average cases. On the other hand, reinforcement learning (RL) has been applied to improve the average performance, but the learned policies lack robustness and can perform arbitrarily badly.
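A hedged sketch of the equalized-confidence idea from the fair-clustering entry above: penalize distributional differences between the predicted confidences of sensitive groups. The paper uses a Sinkhorn divergence; since confidences are scalars, scipy's 1D Wasserstein distance serves as a simple stand-in here.

```python
# Illustrative penalty, not the paper's exact loss: sum pairwise distribution
# distances between per-group confidence samples.
import numpy as np
from scipy.stats import wasserstein_distance

def equalized_confidence_penalty(confidences, groups):
    """confidences: (n,) max soft-label probabilities; groups: (n,) group ids."""
    ids = np.unique(groups)
    pen = 0.0
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            pen += wasserstein_distance(confidences[groups == ids[i]],
                                        confidences[groups == ids[j]])
    return pen

conf = np.array([0.9, 0.8, 0.95, 0.6, 0.55, 0.7])
grp = np.array([0, 0, 0, 1, 1, 1])
print(equalized_confidence_penalty(conf, grp))   # larger = less fair
```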
In this paper, we propose a novel RL-based approach to edge-weighted online bipartite matching with robustness guarantees (LOMAR), achieving both good average-case and good worst-case performance. The key novelty of LOMAR is a new online switching operation which, based on a judiciously-designed condition to hedge against future uncertainties, decides whether to follow the expert's decision or the RL decision for each online item arrival. We prove that for any $\rho \in [0,1]$, LOMAR is $\rho$-competitive against any given expert online algorithm. To improve the average performance, we train the RL policy by explicitly considering the online switching operation. Finally, we run empirical experiments to demonstrate the advantages of LOMAR compared to existing baselines.","Robustness, online bipartite matching, reinforcement learning" Tangential Wasserstein Projections,https://openreview.net/forum?id=Th98b8dH4yr,https://openreview.net/pdf?id=Th98b8dH4yr,,"We develop a notion of projections between sets of probability measures using the geometric properties of the $2$-Wasserstein space. It is designed for general multivariate probability measures, is computationally efficient to implement, and provides a unique solution in regular settings. The idea is to work on regular tangent cones of the Wasserstein space using generalized geodesics. Its structure and computational properties make the method applicable in a variety of settings, from causal inference to the analysis of object data. An application to estimating causal effects yields a generalization of the notion of synthetic controls for systems with general heterogeneity described via multivariate probability measures, as well as a way to estimate optimal weights jointly over all time periods.","Optimal Transport, Wasserstein, Generalized geodesics, Projection, Tangent Cone, Causal Inference" Data Drift Correction via Time-varying Importance Weight Estimator,https://openreview.net/forum?id=5b9uVL3l1T4,https://openreview.net/pdf?id=5b9uVL3l1T4,Data gradually evolves over time in the real-world applications. This paper proposes a simple yet effective way to detect gradual shifts in data.,"Real-world deployment of machine learning models is challenging when data evolves over time. And data does evolve over time. While no model can work when data evolves in an arbitrary fashion, if there is some pattern to these changes, we might be able to design methods to address it. This paper addresses situations when data evolves gradually. We introduce a novel time-varying importance weight estimator that can detect gradual shifts in the distribution of data. Such an importance weight estimator allows the training method to selectively sample past data---not just similar data from the past like a standard importance weight estimator would but also data that evolved in a similar fashion in the past. Our time-varying importance weight is quite general. We demonstrate different ways of implementing it that exploit some known structure in the evolution of data. 
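An illustrative reading of LOMAR's online switching from the entry above: follow the RL decision only while doing so keeps the accumulated reward competitive against the expert, otherwise fall back to the expert. The paper's actual condition also hedges against future uncertainties; this simplified check is an assumption for exposition only.

```python
# Schematic switching loop in the spirit of LOMAR; `reward`, `expert_policy`,
# and `rl_policy` are hypothetical callables supplied by the caller.
def switched_decisions(items, expert_policy, rl_policy, reward, rho=0.8):
    total_rl, total_expert = 0.0, 0.0
    decisions = []
    for item in items:
        e_action, r_action = expert_policy(item), rl_policy(item)
        total_expert += reward(item, e_action)   # the expert's benchmark reward
        # Follow RL only if it keeps us rho-competitive against the expert so far.
        if total_rl + reward(item, r_action) >= rho * total_expert:
            action = r_action
        else:
            action = e_action
        total_rl += reward(item, action)
        decisions.append(action)
    return decisions
```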
We demonstrate and evaluate this approach on a variety of problems, ranging from supervised learning tasks (multiple image classification datasets), where the data undergoes a sequence of gradual shifts of our design, to reinforcement learning tasks (robotic manipulation and continuous control), where data undergoes a shift organically as the policy or the task changes.","distribution shift, data drift over time, propensity scoring" Analytical Composition of Differential Privacy via the Edgeworth Accountant,https://openreview.net/forum?id=2_BsVZ6R-ef,https://openreview.net/pdf?id=2_BsVZ6R-ef,We developed an efficient analytical tool via the Edgeworth expansion with finite-sample bounds to keep track of DP guarantees with a large number of compositions.,"Many modern machine learning algorithms are composed of simple private algorithms; thus, an increasingly important problem is to efficiently compute the overall privacy loss under composition. In this study, we introduce the Edgeworth Accountant, an analytical approach to composing differential privacy guarantees of private algorithms. The Edgeworth Accountant starts by losslessly tracking the privacy loss under composition using the $f$-differential privacy framework, which allows us to express the privacy guarantees using privacy-loss log-likelihood ratios (PLLRs). As the name suggests, this accountant next uses the Edgeworth expansion to upper- and lower-bound the probability distribution of the sum of the PLLRs. Moreover, by relying on a technique for approximating complex distributions using simple ones, we demonstrate that the Edgeworth Accountant can be applied to the composition of any noise-addition mechanism. Owing to certain appealing features of the Edgeworth expansion, the $(\epsilon, \delta)$-differential privacy bounds offered by this accountant are non-asymptotic, with essentially no extra computational cost, as opposed to the prior approaches, wherein the running times increase with the number of compositions. Finally, we demonstrate that our upper and lower $(\epsilon, \delta)$-differential privacy bounds are tight in federated analytics and certain regimes of training private deep learning models.","Differential Privacy, f-Differential Privacy, Edgeworth Expansion, PLLR, Edgeworth Accountant" Policy-Induced Self-Supervision Improves Representation Finetuning in Visual RL,https://openreview.net/forum?id=9Q7wZ0Uq4Z6,https://openreview.net/pdf?id=9Q7wZ0Uq4Z6,"We study the transfer of visual representations in RL, show that they can be partially frozen, and propose a self-supervised method to accelerate their finetuning.","We study how to transfer representations pretrained on source tasks to target tasks in visual percept-based RL. We analyze two popular approaches: freezing or finetuning the pretrained representations. Empirical studies on a set of popular tasks reveal several properties of pretrained representations. First, finetuning is required even when pretrained representations perfectly capture the information required to solve the target task. Second, finetuned representations improve learnability and are more robust to noise. Third, pretrained bottom layers are task-agnostic and readily transferable to new tasks, while top layers encode task-specific information and require adaptation.
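For the time-varying importance weight estimator from the data-drift entry above, one standard way to realize such weights is the classifier-based density-ratio trick with time appended as a feature; this is a generic sketch under that assumption, not necessarily the paper's exact estimator.

```python
# A probabilistic classifier separating "current" from "past" data yields the
# ratio p_now(x)/p_past(x) = p(now|x)/p(past|x); the appended time feature lets
# the weight depend on when a past sample was collected.
import numpy as np
from sklearn.linear_model import LogisticRegression

def time_varying_weights(X_past, t_past, X_now, t_now):
    Xp = np.hstack([X_past, t_past[:, None]])
    Xn = np.hstack([X_now, np.full((len(X_now), 1), t_now)])
    clf = LogisticRegression().fit(
        np.vstack([Xp, Xn]), np.r_[np.zeros(len(Xp)), np.ones(len(Xn))])
    p = clf.predict_proba(Xp)[:, 1]
    return p / (1 - p + 1e-12)      # importance weight for each past sample

rng = np.random.default_rng(0)
w = time_varying_weights(rng.normal(size=(50, 4)), np.arange(50.0),
                         rng.normal(size=(10, 4)), 50.0)
```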
Building on these insights, we propose a self-supervised objective that \emph{clusters representations according to the policy they induce}, as opposed to traditional representation similarity measures which are policy-agnostic (\eg Euclidean norm, cosine similarity). Together with freezing the bottom layers, this objective results in significantly better representations than frozen, finetuned, and self-supervised alternatives on a wide range of benchmarks.","visual reinforcement learning, representation learning, transfer learning" Deep Generative Symbolic Regression,https://openreview.net/forum?id=o7koEEMA1bR,https://openreview.net/pdf?id=o7koEEMA1bR,,"Symbolic regression (SR) aims to discover concise closed-form mathematical equations from data, a task fundamental to scientific discovery. However, the problem is highly challenging because closed-form equations lie in a complex combinatorial search space. Existing methods, ranging from heuristic search to reinforcement learning, fail to scale with the number of input variables. We make the observation that closed-form equations often have structural characteristics and invariances (e.g., the commutative law) that could be further exploited to build more effective symbolic regression solutions. Motivated by this observation, our key contribution is to leverage pre-trained deep generative models to capture the intrinsic regularities of equations, thereby providing a solid foundation for subsequent optimization steps. We show that our novel formalism unifies several prominent approaches of symbolic regression and offers a new perspective to justify and improve on the previous ad hoc designs, such as the usage of cross-entropy loss during pre-training. Specifically, we propose an instantiation of our framework, Deep Generative Symbolic Regression (DGSR). In our experiments, we show that DGSR achieves a higher recovery rate of true equations in the setting of a larger number of input variables, and it is more computationally efficient at inference time than state-of-the-art RL symbolic regression solutions.","Symbolic Regression, Deep Generative Model, Deep Symbolic Regression" What Can we Learn From The Selective Prediction And Uncertainty Estimation Performance Of 523 Imagenet Classifiers?,https://openreview.net/forum?id=p66AzKi6Xim,https://openreview.net/pdf?id=p66AzKi6Xim,What are the best DNNs and training regimes for eliciting superior uncertainty estimation? Analyzing 523 DNNs in order to provide insights that practitioners and researchers can use to maximize the potential of current methods and discover new ones,"When deployed for risk-sensitive tasks, deep neural networks must include an uncertainty estimation mechanism. Here we examine how deep architectures and their respective training regimes relate to their selective prediction and uncertainty estimation performance. We consider some of the most popular estimation performance metrics previously proposed, including AUROC, ECE, and AURC, as well as coverage for a selective accuracy constraint. We present a novel and comprehensive study of selective prediction and the uncertainty estimation performance of 523 existing pretrained deep ImageNet classifiers that are available in popular repositories. We identify numerous and previously unknown factors that affect uncertainty estimation and examine the relationships between the different metrics.
We find that distillation-based training regimes consistently yield better uncertainty estimations than other training schemes, such as vanilla training, pretraining on a larger dataset, and adversarial training. Moreover, we find a subset of ViT models that outperform all other models in terms of uncertainty estimation performance. For example, we discovered an unprecedented 99% top-1 selective accuracy on ImageNet at 47% coverage (and 95% top-1 accuracy at 80% coverage) for a ViT model, whereas a competing EfficientNet-V2-XL cannot obtain these accuracy constraints at any level of coverage.","selective prediction, selective classification, reject option, risk coverage trade-off, deep learning, neural networks" Solving and Learning non-Markovian Stochastic Control problems in continuous-time with Neural RDEs,https://openreview.net/forum?id=NGMAKE75_N7,https://openreview.net/pdf?id=NGMAKE75_N7,We propose a novel framework for solving non-Markovian stochastic control problems in continuous-time using Neural RDEs,"We propose a novel framework for solving continuous-time, non-Markovian stochastic control problems with the use of neural rough differential equations (Neural RDEs). By parameterising the control process as the solution of a Neural RDE driven by the state process, we show that the control-state joint dynamics are governed by an uncontrolled RDE with structured vector fields, allowing for efficient trajectory simulation, Monte-Carlo estimation of the value function, and backpropagation. To deal with input paths of infinite 1-variation, we refine the existing universal approximation result to a probabilistic density result for Neural RDEs driven by random rough paths. Experiments on various non-Markovian problems indicate how the proposed framework is time-resolution-invariant and capable of learning optimal solutions with higher accuracy than traditional RNN-based approaches. Finally, we discuss possible extensions of this framework to the setting of non-Markovian continuous-time reinforcement learning and provide promising empirical evidence in this direction.","stochastic control, neural RDEs, rough paths, reinforcement learning" Spatio-temporal Self-Attention for Egocentric 3D Pose Estimation,https://openreview.net/forum?id=F_P8Dtg43vF,https://openreview.net/pdf?id=F_P8Dtg43vF,spatio-temporal egocentric pose estimation using transformers.,"Vision-based ego-centric 3D human pose estimation (ego-HPE) is essential to support critical applications of xR-technologies. However, severe self-occlusions and strong distortion introduced by the fish-eye view from the head-mounted camera make ego-HPE extremely challenging. While current state-of-the-art (SOTA) methods try to address the distortion, they still suffer from large errors in the most critical joints (such as hands) due to self-occlusions. To this end, we propose a spatio-temporal transformer model that can attend to semantically rich feature maps obtained from popular convolutional backbones. Leveraging the complex spatio-temporal information encoded in ego-centric videos, we design a spatial concept called feature map tokens (FMT), which can attend to all the other spatial units in our spatio-temporal feature maps. Powered by this FMT-based transformer, we build the Egocentric Spatio-Temporal Self-Attention Network (Ego-STAN), which uses heatmap-based representations and spatio-temporal attention specialized to address distortions and self-occlusions in ego-HPE.
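The selective-prediction quantities reported in the 523-classifiers entry above can be computed directly from confidence scores; a minimal sketch follows (conventions for coverage and related metrics such as AURC vary slightly across papers).

```python
# Selective accuracy at a given coverage: keep only the most-confident
# fraction of predictions and measure accuracy on that subset.
import numpy as np

def selective_accuracy(confidence, correct, coverage):
    order = np.argsort(-confidence)              # most confident first
    k = max(1, int(coverage * len(correct)))
    return correct[order[:k]].mean()

conf = np.array([0.99, 0.95, 0.6, 0.4, 0.9])
corr = np.array([1, 1, 0, 0, 1])
print(selective_accuracy(conf, corr, coverage=0.6))   # 1.0 on the top 60%
```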
Our quantitative evaluation on the contemporary sequential xR-EgoPose dataset achieves a 38.2% improvement on the highest-error joints against the SOTA ego-HPE model, while accomplishing a 22% decrease in the number of parameters. Finally, we also demonstrate the generalization capabilities of our model to real-world HPE tasks beyond ego-views.","pose estimation, egocentric vision, computer vision, self-attention, spatio-temporal data analysis" MAE are Secretly Efficient Learners,https://openreview.net/forum?id=dGRP5SfwkgY,https://openreview.net/pdf?id=dGRP5SfwkgY,we significantly accelerate MAE training by 59x or more,"Masked Autoencoders (MAE), introduced by He et al. (2022), provide a strong framework to pre-train Vision Transformers (ViTs). In this paper, we accelerate MAE training by 59× or more with little performance drop. Our changes are simple and straightforward: in the pre-training stage, we aggressively increase the masking ratio, decrease the training epochs, and reduce the decoder depth, to lower the pre-training cost; in the fine-tuning stage, we reveal that layer-wise learning rate decay plays a vital role in unleashing the power of pre-trained models. With this setup, we are able to pre-train a ViT-B in 12.6 hours using a single NVIDIA A100 GPU, which competitively attains 83.0% top-1 accuracy on the downstream ImageNet classification task. We additionally verify the speed acceleration on another MAE extension, SupMAE.","self-supervised learning, masked autoencoder, efficient training" RNAS-CL: Robust Neural Architecture Search by Cross-Layer Knowledge Distillation,https://openreview.net/forum?id=l5XHUBGrBkD,https://openreview.net/pdf?id=l5XHUBGrBkD,,"Deep Neural Networks are vulnerable to adversarial attacks. Neural Architecture Search (NAS), one of the driving tools of deep neural networks, demonstrates superior performance in prediction accuracy in various machine learning applications. However, it is unclear how it performs against adversarial attacks. Given the presence of a robust teacher, it would be interesting to investigate whether NAS would produce robust neural architectures by inheriting robustness from the teacher. In this paper, we propose Robust Neural Architecture Search by Cross-Layer Knowledge Distillation (RNAS-CL), a novel NAS algorithm that improves the robustness of NAS by learning from a robust teacher through cross-layer knowledge distillation. Unlike previous knowledge distillation methods that encourage close student/teacher output only in the last layer, RNAS-CL automatically searches for the best teacher layer to supervise each student layer. Experimental results evidence the effectiveness of RNAS-CL and show that it produces small and robust neural architectures.","Robust Neural Architecture Search, Knowledge Distillation, Efficient Deep Learning Model" Multi-Agent Policy Transfer via Task Relationship Modeling,https://openreview.net/forum?id=KaeYRGTaODt,https://openreview.net/pdf?id=KaeYRGTaODt,We propose to model task relationships by learning effect-based task representations for more efficient multi-agent policy transfer.,"Team adaptation to new cooperative tasks is a hallmark of human intelligence, which has yet to be fully realized in learning agents. Previous works on multi-agent transfer learning accommodate teams of different sizes, but heavily rely on the generalization ability of neural networks for adapting to unseen tasks. We posit that the relationship among tasks provides the key information for policy adaptation.
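The MAE entry above singles out layer-wise learning rate decay at fine-tuning time. A common way to realize it (an assumption here, not the paper's exact recipe) is geometric per-block rates set through optimizer parameter groups:

```python
# Later (deeper) blocks get the base rate; earlier blocks get geometrically
# smaller rates, so pre-trained low-level features change more slowly.
import torch

def lr_decay_param_groups(blocks, base_lr=1e-3, decay=0.75):
    n = len(blocks)
    return [
        {"params": block.parameters(), "lr": base_lr * decay ** (n - 1 - i)}
        for i, block in enumerate(blocks)
    ]

blocks = torch.nn.ModuleList(torch.nn.Linear(8, 8) for _ in range(4))
optimizer = torch.optim.AdamW(lr_decay_param_groups(blocks))
```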
To utilize this relationship for efficient transfer, we discover and exploit the knowledge shared among tasks from different teams, propose to learn effect-based task representations as a common latent space among tasks, and use it to build an alternately fixed training scheme. We demonstrate that the task representation can capture the relationship among teams and generalize to unseen tasks. As a result, the proposed method can help transfer learned cooperation knowledge to new tasks after training on a few source tasks, and the learned transferred policies can also help solve tasks that are hard to learn from scratch.","Multi-agent reinforcement learning, cooperative transfer learning" When does Bias Transfer in Transfer Learning?,https://openreview.net/forum?id=r7bFgAGRkpL,https://openreview.net/pdf?id=r7bFgAGRkpL,We study a potential vulnerability of transfer learning where biases and other vulnerabilities from the source dataset remain present in downstream models.,"Using transfer learning to adapt a pre-trained ""source model"" to a downstream ""target task"" can dramatically increase performance with seemingly no downside. In this work, we demonstrate that there can exist a downside after all: bias transfer, or the tendency for biases of the source model to persist even after adapting the model to the target task. Through a combination of synthetic and natural experiments, we show that bias transfer both (a) arises in realistic settings (such as when pre-training on ImageNet or other standard datasets) and (b) can occur even when the target dataset is explicitly de-biased. As transfer-learned models are increasingly deployed in the real world, our work highlights the importance of understanding the limitations of pre-trained source models.","transfer learning, bias" Predictor-corrector algorithms for stochastic optimization under gradual distribution shift,https://openreview.net/forum?id=2SV2dlfBuE3,https://openreview.net/pdf?id=2SV2dlfBuE3,,"Time-varying stochastic optimization problems frequently arise in machine learning practice (e.g., gradual domain shift, object tracking, strategic classification). Often, the underlying process that drives the distribution shift is continuous in nature. We exploit this underlying continuity by developing predictor-corrector algorithms for time-varying stochastic optimization that anticipate changes in the underlying data generating process through a predictor-corrector term in the update rule. The key challenge is the estimation of the predictor-corrector term; a naive approach based on sample-average approximation may lead to non-convergence. We develop a general moving-average based method to estimate the predictor-corrector term and provide error bounds for the iterates, both in the presence of exact and noisy access to queries of the relevant derivatives of the loss function. Furthermore, we show (theoretically and empirically in several examples) that our method outperforms non-predictor-corrector methods that do not anticipate changes in the data generating process.", AIM: Adapting Image Models for Efficient Video Understanding,https://openreview.net/forum?id=CIoSZ_HKHS7,https://openreview.net/pdf?id=CIoSZ_HKHS7,We propose a new method to adapt frozen image pre-trained model for efficient video action recognition,"Recent vision transformer based video models mostly follow the ``image pre-training then finetuning"" paradigm and have achieved great success on multiple video benchmarks.
However, fully finetuning such a video model can be computationally expensive and unnecessary, given that pre-trained image transformer models have demonstrated exceptional transferability. In this work, we propose a novel method to Adapt pre-trained Image Models (AIM) for efficient video understanding. By freezing the pre-trained image model and adding a few lightweight Adapters, we introduce spatial adaptation, temporal adaptation, and joint adaptation to gradually equip an image model with spatiotemporal reasoning capability. We show that our proposed AIM can achieve competitive or even better performance than prior art with substantially fewer tunable parameters on four video action recognition benchmarks. Thanks to its simplicity, our method is also generally applicable to different image pre-trained models, which has the potential to leverage more powerful image foundation models in the future. ","Video action recognition, efficient finetuning" Impossibly Good Experts and How to Follow Them,https://openreview.net/forum?id=sciA_xgYofB,https://openreview.net/pdf?id=sciA_xgYofB,,"We consider the sequential decision making problem of learning from an expert that has access to more information than the learner. For many problems this extra information will enable the expert to achieve greater long-term reward than any policy without this privileged information access. We call these experts ``Impossibly Good'' because no learning algorithm will be able to reproduce their behavior. However, in these settings it is reasonable to attempt to recover the best policy possible given the agent's restricted access to information. We provide a set of necessary criteria on the expert that will allow a learner to recover the optimal policy in the reduced information space from the expert's advice alone. We also provide a new approach called Elf Distillation (Explorer Learning from Follower) that can be used in cases where these criteria are not met and environmental rewards must be taken into account. We show that this algorithm performs better than a variety of strong baselines on a challenging suite of minigrid environments.","Imitation Learning, Reinforcement Learning, Experts, Distillation" On Convergence of Average-Reward Off-Policy Control Algorithms in Weakly-Communicating MDPs,https://openreview.net/forum?id=NMoeVEwekzC,https://openreview.net/pdf?id=NMoeVEwekzC,Showing average-reward off-policy control algorithms converge in weakly-communicating MDPs,"We show that two average-reward off-policy control algorithms, Differential Q Learning (Wan, Naik, \& Sutton 2021a) and RVI Q Learning (Abounadi, Bertsekas, \& Borkar 2001), converge in weakly-communicating MDPs. Weakly-communicating MDPs are the most general class of MDPs for which a learning algorithm with a single stream of experience can guarantee obtaining a policy that achieves the optimal reward rate. The original convergence proofs of the two algorithms require that all optimal policies induce unichains, which is not necessarily true for weakly-communicating MDPs. To the best of our knowledge, our results are the first showing that average-reward off-policy control algorithms converge in weakly-communicating MDPs. As a direct extension, we show that the average-reward options algorithms introduced by Wan, Naik, \& Sutton (2021b) converge if the Semi-MDP induced by options is weakly-communicating.
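A minimal bottleneck Adapter in the spirit of the AIM entry above: the pre-trained backbone stays frozen and only small residual modules train. The dimensions, placement, and zero-initialized up-projection are illustrative assumptions, not the paper's exact design.

```python
# Freeze the backbone; train only the lightweight residual Adapter.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)     # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.nn.functional.gelu(self.down(x)))

backbone = nn.Linear(128, 128)             # stand-in for a frozen ViT block
for p in backbone.parameters():
    p.requires_grad = False                # pre-trained weights stay fixed
adapted = nn.Sequential(backbone, Adapter(128))
```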
","Reinforcement Learning, Average-Reward, Off-Policy, Convergence" Distributionally Robust Post-hoc Classifiers under Prior Shifts,https://openreview.net/forum?id=3KUfbI9_DQE,https://openreview.net/pdf?id=3KUfbI9_DQE,We propose a method for scaling the model predictions at test-time for improved distribution robustness to prior shifts. ,"The generalization ability of machine learning models degrades significantly when the test distribution shifts away from the training distribution. We investigate the problem of training models that are robust to shifts caused by changes in the distribution of class-priors or group-priors. The presence of skewed training priors can often lead to the models overfitting to spurious features. Unlike existing methods, which optimize for either the worst or the average performance over classes or groups, our work is motivated by the need for finer control over the robustness properties of the model. We present an extremely lightweight post-hoc approach that performs scaling adjustments to predictions from a pre-trained model, with the goal of minimizing a distributionally robust loss around a chosen target distribution. These adjustments are computed by solving a constrained optimization problem on a validation set and applied to the model during test time. Our constrained optimization objective is inspired from a natural notion of robustness to controlled distribution shifts. Our method comes with provable guarantees and empirically makes a strong case for distributional robust post-hoc classifiers. ","Distributional robustness, post-hoc scaling, group robustness, class imbalance, spurious correlations" Transformer Meets Boundary Value Inverse Problems,https://openreview.net/forum?id=HnlCZATopvr,https://openreview.net/pdf?id=HnlCZATopvr,"We argue that, from both theoretical and experimental perspective, the attention mechanism is a structure-conforming neural architecture for learning the PDE-based boundary value inverse problems.","A Transformer-based deep direct sampling method is proposed for solving a class of boundary value inverse problem. A real-time reconstruction is achieved by evaluating the learned inverse operator between carefully designed data and the reconstructed images. An effort is made to give a specific example to a fundamental but critical question: whether and how one can benefit from the theoretical structure of a mathematical problem to develop task-oriented and structure-conforming deep neural network? Specifically, inspired by direct sampling methods for inverse problems, the 1D boundary data are preprocessed by a partial differential equation-based feature map to yield 2D harmonic extensions in different frequencies as different input channels. Then, by introducing learnable non-local kernel, the approximation of direct sampling is recast to a modified attention mechanism. The proposed method is then applied to electrical impedance tomography, a well-known severely ill-posed nonlinear inverse problem. The new method achieves superior accuracy over its predecessors and contemporary operator learners, as well as shows robustness with respect to noise. 
This research strengthens the insight that the attention mechanism, despite being invented for natural language processing tasks, offers great flexibility to be modified in conformity with a priori mathematical knowledge, ultimately leading to the design of more physics-compatible neural architectures.","inverse problems, attention, operator learning, Transformer, partial differential equations" Transferability Between Regression Tasks,https://openreview.net/forum?id=LB6KMRUqng2,https://openreview.net/pdf?id=LB6KMRUqng2,,"We consider the problem of estimating how well deep neural network regression models would transfer from source to target tasks. We focus on regression tasks, which have received little previous attention, and develop novel transferability estimation methods that are simple, computationally efficient, yet effective and theoretically grounded. We propose two families of transferability estimators, both of which utilize the mean squared error of a regularized linear regression model to estimate the transferability. We prove novel theoretical bounds connecting our methods with the expected risk of the optimal target models obtained from the actual transfer learning process. We test our methods extensively in various challenging, practical scenarios and show that they significantly outperform existing state-of-the-art regression task transferability estimators, in both accuracy and efficiency.","Transferability estimation, Transfer learning" Diagnosing and exploiting the computational demands of videos games for deep reinforcement learning,https://openreview.net/forum?id=ElI9znK_eUz,https://openreview.net/pdf?id=ElI9znK_eUz,Strategies for improving deep reinforcement learning agents can be predicted from their generalization performance.,"Humans learn by interacting with their environments and perceiving the outcomes of their actions. A landmark in artificial intelligence has been the development of deep reinforcement learning (dRL) algorithms capable of doing the same in video games, on par with or better than humans. However, it remains unclear whether the successes of dRL models reflect advances in visual representation learning, the effectiveness of reinforcement learning algorithms at discovering better policies, or both. To address this question, we introduce the Learning Challenge Diagnosticator (LCD), a tool that separately measures the perceptual and reinforcement learning demands of a task. We use LCD to discover a novel taxonomy of challenges in the Procgen benchmark, and demonstrate that these predictions are both highly reliable and can instruct algorithmic development. More broadly, the LCD reveals multiple failure cases that can occur when optimizing dRL algorithms over entire video game benchmarks like Procgen, and provides a pathway towards more efficient progress.","Cognitive Science, Deep Reinforcement Learning, Perceptual Grouping, Neuroscience" NeuralPCG: Learning Preconditioner for Solving Partial Differential Equations with Graph Neural Network,https://openreview.net/forum?id=IDSXUFQeZO5,https://openreview.net/pdf?id=IDSXUFQeZO5,,"Fast and accurate partial differential equation (PDE) solvers empower scientific and engineering research. Classic numerical solvers provide unparalleled accuracy but often require extensive computation time. Machine learning solvers are significantly faster but lack convergence and accuracy guarantees.
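A hedged sketch of the regression-transferability idea above: score a source model by the (negative) validation MSE of a regularized linear regression from its frozen features to the target labels. The estimator family, hyperparameters, and split below are illustrative assumptions; the paper proposes two such families with theoretical bounds.

```python
# Higher score = features that linearly predict target labels well = better
# expected transfer; a stand-in for the paper's exact estimators.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def transferability_score(source_features, target_labels, alpha=1.0):
    Xtr, Xva, ytr, yva = train_test_split(
        source_features, target_labels, test_size=0.3, random_state=0)
    model = Ridge(alpha=alpha).fit(Xtr, ytr)
    return -np.mean((model.predict(Xva) - yva) ** 2)

rng = np.random.default_rng(0)
print(transferability_score(rng.normal(size=(100, 16)), rng.normal(size=100)))
```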
We present Neural-Network-Preconditioned Conjugate Gradient, or NeuralPCG, a novel linear second-order PDE solver that combines the benefits of classic iterative solvers and machine learning approaches. Our key observation is that both neural-network PDE solvers and classic preconditioners excel at obtaining fast but inexact solutions. NeuralPCG proposes to use neural network models to \emph{precondition} PDE systems in classic iterative solvers. Compared with neural-network PDE solvers, NeuralPCG achieves convergent and accurate solutions (e.g., 1e-12 precision) by construction. Compared with classic solvers, NeuralPCG is faster via data-driven preconditioners. We demonstrate the efficacy and generalizability of NeuralPCG by conducting extensive experiments on various 2D and 3D linear second-order PDEs.","Physics Simulation, Graph Neural Network, Applied Mathematics" Learning Dynamic Query Combinations for Transformer-based Object Detection and Segmentation,https://openreview.net/forum?id=FI5IysDR8pG,https://openreview.net/pdf?id=FI5IysDR8pG,,"Transformer-based detection and segmentation methods use a list of learned detection queries to retrieve information from the transformer network and learn to predict the location and category of one specific object from each query. We empirically find that random convex combinations of the learned queries are still good queries for the corresponding models. We then propose to learn a convex combination with dynamic coefficients based on the high-level semantics of the image. The generated dynamic queries better capture the prior of object locations and categories in the different images. Equipped with our dynamic queries, a wide range of DETR-based models achieve consistent and superior performance across multiple tasks (object detection, instance segmentation, panoptic segmentation) and on different benchmarks (MS COCO, CityScapes, YoutubeVIS).", Cross-Quality Few-Shot Transfer for Alloy Yield Strength Prediction: A New Material Science Benchmark and An Integrated Optimization Framework,https://openreview.net/forum?id=uYFRjvSJXbQ,https://openreview.net/pdf?id=uYFRjvSJXbQ,,"Discovering high-entropy alloys (HEAs) with high yield strength is an important yet challenging task in material science. However, the yield strength can only be accurately measured by very expensive and time-consuming real-world experiments, and hence cannot be acquired at scale. Learning-based methods could facilitate the discovery process, but the lack of a comprehensive dataset on HEA yield strength has created barriers. We present X-Yield, a large-scale material science benchmark with 240 experimentally measured (""high-quality"") and over 100K simulated (imperfect or ""low-quality"") HEA yield strength annotations. Due to the scarcity of experimental annotations and the quality gap in imperfectly simulated data, existing transfer learning methods cannot generalize well on our dataset. We address this cross-quality few-shot transfer problem by leveraging model sparsification ""twice"" --- as a noise-robust feature learning regularizer at the pre-training stage, and as a data-efficient learning regularizer at the few-shot transfer stage. While the workflow already performs decently with ad-hoc sparsity patterns tuned independently for either stage, we take a step further by proposing a bi-level optimization framework termed Bi-RPT that jointly learns optimal masks and automatically allocates sparsity levels for both stages.
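The NeuralPCG entry above keeps the classic preconditioned conjugate gradient loop and swaps in a learned preconditioner. Below, a plain PCG implementation where the preconditioner application M^{-1} r is an arbitrary callable; a Jacobi preconditioner stands in for the learned network, which is the assumption in this sketch.

```python
# Textbook preconditioned conjugate gradient; NeuralPCG's idea is to replace
# `apply_precond` with a neural network while keeping CG's guarantees.
import numpy as np

def pcg(A, b, apply_precond, tol=1e-12, max_iter=500):
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_precond(r)
    p = z.copy()
    for _ in range(max_iter):
        Ap = A @ p
        alpha = (r @ z) / (p @ Ap)
        x += alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            break
        z_new = apply_precond(r_new)
        beta = (r_new @ z_new) / (r @ z)
        p = z_new + beta * p
        r, z = r_new, z_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = pcg(A, b, apply_precond=lambda r: r / np.diag(A))   # Jacobi stand-in
```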
The optimization problem is solved efficiently using gradient unrolling, which is seamlessly integrated with the training process. The effectiveness of Bi-RPT is validated through extensive experiments on our new challenging X-Yield dataset, alongside other synthesized testbeds. Specifically, we achieve a reduction of 8.9-19.8% in test mean squared error and an improvement of 0.98-1.53% in test accuracy, using merely 5-10% of the experimental data. Code and sample data are in the supplement.", Parameter-varying neural ordinary differential equations with partition-of-unity networks,https://openreview.net/forum?id=heDr8wIYmw_,https://openreview.net/pdf?id=heDr8wIYmw_,Parameter-varying neural ODEs with spectrally represented model parameters using partition-of-unity networks,"In this study, we propose parameter-varying neural ordinary differential equations (NODEs) where the evolution of model parameters is represented by partition-of-unity networks (POUNets), a mixture-of-experts architecture. The proposed variant of NODEs, synthesized with POUNets, learns a meshfree partition of space and represents the evolution of ODE parameters using sets of polynomials associated with each partition. We demonstrate the effectiveness of the proposed method for three important tasks: data-driven dynamics modeling of (1) hybrid systems, (2) switching linear dynamical systems, and (3) latent dynamics for dynamical systems with varying external forcing.","neural ordinary differential equations, partition-of-unity networks, parameter-varying neural networks" Robust Reinforcement Learning with Distributional Risk-averse formulation,https://openreview.net/forum?id=ZMxVNpd76mw,https://openreview.net/pdf?id=ZMxVNpd76mw,,"The purpose of robust reinforcement learning is to make predictions more robust to changes in the dynamics or rewards of the system. This problem is particularly important when the dynamics and rewards of the environment are estimated from data. However, without constraints, this problem is intractable. In this paper, we approximate robust reinforcement learning constrained with an $f$-divergence using an approximate risk-averse formulation. We show that the classical reinforcement learning formulation can be robustified using a standard deviation penalization of the objective. Two algorithms based on distributional reinforcement learning, one for discrete and one for continuous action spaces, are proposed and tested in a classical Gym environment to demonstrate the robustness of the algorithms.","Robust Reinforcement Learning, Risk-Averse Reinforcement Learning, Deep Reinforcement Learning" Unicom: Universal and Compact Representation Learning for Image Retrieval,https://openreview.net/forum?id=3YFDsSRSxB-,https://openreview.net/pdf?id=3YFDsSRSxB-,,"Modern image retrieval methods typically rely on fine-tuning pre-trained encoders to extract image-level descriptors. However, the most widely used models are pre-trained on ImageNet-1K with limited classes. The pre-trained feature representation is therefore not universal enough to generalize well to the diverse open-world classes. In this paper, we first cluster the large-scale LAION dataset into one million pseudo classes based on the joint textual and visual features extracted by the CLIP model. Due to the confusion of label granularity, the automatically clustered dataset inevitably contains heavy inter-class conflicts. To alleviate such conflicts, we randomly select partial inter-class prototypes to construct the margin-based softmax loss.
To further enhance the low-dimensional feature representation, we randomly select partial feature dimensions when calculating the similarities between embeddings and class-wise prototypes. The dual random partial selections are with respect to the class dimension and the feature dimension of the prototype matrix, respectively, making the classification conflict-robust and the feature embedding compact. Our method outperforms state-of-the-art unsupervised and supervised image retrieval approaches on multiple benchmarks with substantial improvement under different dimension constraints. Pre-processed data, training code, and pre-trained models will be released to reproduce our results.","Cluster Discrimination, Image Retrieval" The Reward Hypothesis is False,https://openreview.net/forum?id=M4UxoupR3az,https://openreview.net/pdf?id=M4UxoupR3az,"We argue that the reward hypothesis is false, by providing several counterexamples. We also provide necessary and sufficient conditions for when a MORL problem can be reduced to ordinary RL, and describe a new way to express tasks for RL agents.","The reward hypothesis is the hypothesis that ""all of what we mean by goals and purposes can be well thought of as the maximisation of the expected value of the cumulative sum of a received scalar signal"". In this paper, we will argue that this hypothesis is false. We will look at three natural classes of reinforcement learning tasks (multi-objective reinforcement learning, risk-averse reinforcement learning, and modal reinforcement learning), and then prove mathematically that these tasks cannot be expressed using any scalar, Markovian reward function. We thus disprove the reward hypothesis by providing many examples of tasks which are both natural and intuitive to describe, but which are nonetheless impossible to express using reward functions. In the process, we provide necessary and sufficient conditions for when a multi-objective reinforcement learning problem can be reduced to ordinary, scalar reward reinforcement learning. We also call attention to a new class of reinforcement learning problems (namely those we call ""modal"" problems), which have so far not been given any systematic treatment in the reinforcement learning literature.","the reward hypothesis, reward functions, multi-objective reinforcement learning, MORL" Convergence of Generative Deep Linear Networks Trained with Bures-Wasserstein Loss,https://openreview.net/forum?id=FQvAlf6xwTy,https://openreview.net/pdf?id=FQvAlf6xwTy,We prove convergence of gradient descent optimization of generative deep linear networks trained with the Bures-Wasserstein loss.,"We consider a deep matrix factorization model of covariance matrices trained with the Bures-Wasserstein distance. While recent works have made important advances in the study of the optimization problem for overparametrized low-rank matrix approximation, much emphasis has been placed on discriminative settings and the square loss. In contrast, our model considers another interesting type of loss and connects with the generative setting. We characterize the critical points and minimizers of the Bures-Wasserstein distance over the space of rank-bounded matrices. For low-rank matrices, the Hessian of this loss can blow up, which creates challenges for analyzing the convergence of optimization methods.
We establish convergence results for gradient flow using a smooth perturbative version of the loss, as well as convergence results for finite-step-size gradient descent under certain assumptions on the initial weights.","deep linear network, low-rank approximation, Bures-Wasserstein distance, optimal transport, implicit generative model, critical points" Diffusion Probabilistic Fields,https://openreview.net/forum?id=ik91mY-2GN,https://openreview.net/pdf?id=ik91mY-2GN,A diffusion model that can learn distribution over fields,"Diffusion probabilistic models have quickly become a major approach for generative modeling of images, 3D geometry, video, and other domains. However, to adapt diffusion generative modeling to these domains, the denoising network needs to be carefully designed for each domain independently, oftentimes under the assumption that data lives on a Euclidean grid. In this paper we introduce Diffusion Probabilistic Fields (DPF), a diffusion model that can learn distributions over continuous functions defined over metric spaces, commonly known as fields. We extend the formulation of diffusion probabilistic models to deal with this field parametrization in an explicit way, enabling us to define an end-to-end learning algorithm that side-steps the requirement of representing fields with latent vectors as in previous approaches. We empirically show that, while using the same denoising network, DPF effectively deals with different modalities like 2D images and 3D geometry, in addition to modeling distributions over fields defined on non-Euclidean metric spaces.","Generative Models, Field Representation, Diffusion Models" Improving Information Retention in Large Scale Online Continual Learning,https://openreview.net/forum?id=mAazgkPutZ,https://openreview.net/pdf?id=mAazgkPutZ,,"Given a stream of data sampled from non-stationary distributions, online continual learning (OCL) aims to adapt efficiently to new data while retaining existing knowledge. The typical approach to address information retention (the ability to retain previous knowledge) is keeping a replay buffer of a fixed size and computing gradients using a mixture of new data and the replay buffer. Surprisingly, recent work (Cai et al., 2021) suggests that information retention remains a problem in large-scale OCL even when the replay buffer is unlimited, \emph{i.e.}, the gradients are computed using all past data. This paper focuses on this peculiarity to understand and address information retention. To pinpoint the source of this problem, we theoretically show that, given limited computation budgets at each time step, even without a strict storage limit, naively applying SGD with constant or constantly decreasing learning rates fails to optimize information retention in the long term. We propose using a moving-average family of methods to improve optimization for non-stationary objectives. Specifically, we design an adaptive moving average (AMA) optimizer and a moving-average-based learning rate schedule (MALR). We demonstrate the effectiveness of AMA+MALR on large-scale benchmarks, including Continual Localization (CLOC), Google Landmarks, and ImageNet.
Code will be released upon publication.","online continual learning, moving average, geo-localization" Shape Analysis by Shadow Synthesis,https://openreview.net/forum?id=J1fysSeRdk,https://openreview.net/pdf?id=J1fysSeRdk,We propose a method to reconstruct a 3D object from just its shadow by inverting an implicit 3D generative model,"3D reconstruction is a fundamental problem in computer vision, and the task is especially challenging when the object to reconstruct is partially or fully occluded. We introduce a method that uses the shadows cast by an unobserved object in order to infer the possible 3D volumes under occlusion. We create a differentiable image formation model that allows us to jointly infer the 3D shape of an object, its pose, and the position of a light source. Since the approach is end-to-end differentiable, we are able to integrate learned priors of object geometry in order to generate realistic 3D shapes of different object categories. Experiments and visualizations show that the method is able to generate multiple possible solutions that are consistent with the observation of the shadow. Our approach works even when the position of the light source and the object pose are both unknown, and it is also robust to real-world images where the ground-truth shadow mask is unknown.","3D Reconstruction, Shadow, Differentiable Rendering, Neural Fields" Landscape Learning for Neural Network Inversion,https://openreview.net/forum?id=yQpZ4WnRZM,https://openreview.net/pdf?id=yQpZ4WnRZM,We learn an easy-to-optimize loss landscape for neural network inversion problems such as GAN inversion and adversarial defense.,"Many machine learning methods operate by inverting a neural network at inference time, which has become a popular technique for solving inverse problems in computer vision, robotics, and graphics. However, these methods often involve gradient descent through a highly non-convex loss landscape, causing the optimization process to be unstable and slow. We introduce a method that learns a loss landscape where gradient descent is efficient, bringing massive improvement and acceleration to the inversion process. We demonstrate this advantage on a number of methods for both generative and discriminative tasks, including GAN inversion, adversarial defense, and 3D human pose reconstruction.","Neural Network Inversion, Loss Landscape, Optimization" Stochastic Multi-Person 3D Motion Forecasting,https://openreview.net/forum?id=_s1N-DnxdyT,https://openreview.net/pdf?id=_s1N-DnxdyT,"We introduce a new task of stochastic multi-person 3D motion forecasting, and propose a dual-level generative modeling framework to address this task.","This paper aims to address the real-world complexity ignored by prior work on human motion forecasting, emphasizing the social properties of multi-person motion, the diversity of motion and social interaction, and the complexity of articulated motion. To this end, we introduce a novel task of stochastic multi-person 3D motion forecasting. We propose a dual-level generative modeling framework that separately models independent individual movements at the local level and social interactions at the global level. Notably, this dual-level modeling mechanism can be achieved within a shared generative model, through introducing learnable latent codes that represent intents of future movement and switching the codes' modes of operation at different levels. Our framework is general, and we instantiate it with various multi-person forecasting models.
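The baseline procedure that the landscape-learning entry above aims to accelerate is plain gradient descent on a latent code through a fixed network; a minimal sketch with a stand-in generator follows (the paper instead learns a remapped latent space where this descent is fast and stable).

```python
# Invert a fixed generator G by optimizing the latent z to match a target.
import torch

G = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.Tanh(),
                        torch.nn.Linear(64, 32))     # stand-in generator
x_target = torch.randn(32)
z = torch.zeros(16, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(G(z), x_target)
    loss.backward()                                   # descent through G
    opt.step()
```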
Extensive experiments on CMU-Mocap, MuPoTS-3D, and SoMoF show that our approach produces diverse and accurate multi-person predictions, significantly outperforming the state of the art.","stochastic forecasting, multi-person 3D motion, dual-level generative modeling" ON INJECTING NOISE DURING INFERENCE,https://openreview.net/forum?id=XHiC52N24ox,https://openreview.net/pdf?id=XHiC52N24ox,,"We study activation noise in a generative energy-based modeling setting during training for the purpose of regularization. We prove that activation noise is a general form of dropout. Then, we analyze the role of activation noise at inference time and demonstrate that it amounts to sampling. Thanks to the activation noise, we observe about a 200% improvement in performance (classification accuracy). We then not only observe, but also prove, that the best performance is achieved when the activation noise follows the same distribution during both training and inference. To explicate this phenomenon, we provide theoretical results that illuminate the roles of activation noise during training and inference, and their mutual influence on performance. To further confirm our theoretical results, we conduct experiments on five datasets and seven distributions of activation noise.","activation noise, energy-based modeling" LEARNING THE SPECTROGRAM TEMPORAL RESOLUTION FOR AUDIO CLASSIFICATION,https://openreview.net/forum?id=HOF3CTk2WH6,https://openreview.net/pdf?id=HOF3CTk2WH6,"This paper proposes DiffRes, which enables differentiable temporal resolution learning on audio spectrograms (as opposed to common fixed-hop-size approaches) to improve the performance of audio classification models.","The audio spectrogram is a time-frequency representation that has been widely used for audio classification. The temporal resolution of a spectrogram depends on the hop size. Previous works generally assume the hop size should be a constant value, such as ten milliseconds. However, a fixed hop size or resolution is not always optimal for different types of sound. This paper proposes a novel method, DiffRes, that enables differentiable temporal resolution learning to improve the performance of audio classification models. Given a spectrogram calculated with a fixed hop size, DiffRes merges non-essential time frames while preserving important frames. DiffRes acts as a ""drop-in"" module between an audio spectrogram and a classifier, and can be end-to-end optimized. We evaluate DiffRes on the mel-spectrogram, followed by state-of-the-art classifier backbones, and apply it to five different subtasks. Compared with using the fixed-resolution mel-spectrogram, the DiffRes-based method can achieve the same or better classification accuracy with at least 25% fewer temporal dimensions at the feature level, which alleviates the computational cost at the same time. Starting from a high-temporal-resolution spectrogram, such as one with a one-millisecond hop size, we show that DiffRes can improve classification accuracy with the same computational complexity.
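A minimal module matching the activation-noise entry above, which finds that the same noise distribution should be used during both training and inference; the Gaussian form and scale are illustrative assumptions, not the paper's exact setting.

```python
# Unlike Dropout, this layer deliberately stays active at inference time
# (no self.training guard), matching the train/test-consistency finding.
import torch
import torch.nn as nn

class ActivationNoise(nn.Module):
    def __init__(self, sigma=0.1):
        super().__init__()
        self.sigma = sigma

    def forward(self, x):
        return x + self.sigma * torch.randn_like(x)

net = nn.Sequential(nn.Linear(10, 32), ActivationNoise(0.1), nn.Linear(32, 2))
```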
","audio classification, differentiable temporal resolution, feature dimension reduction" Beyond calibration: estimating the grouping loss of modern neural networks,https://openreview.net/forum?id=6w1k-IixnL8,https://openreview.net/pdf?id=6w1k-IixnL8,"We provide an estimator to evaluate confidence scores beyond calibration, revealing the subgroups heterogeneities that undermine individual predicted probabilities.","Good decision making requires machine-learning models to provide trustworthy confidence scores. To this end, recent work has focused on miscalibration, i.e, the over or under confidence of model scores. Yet, contrary to widespread belief, calibration is not enough: even a classifier with the best possible accuracy and perfect calibration can have confidence scores far from the true posterior probabilities. This is due to the grouping loss, created by samples with the same confidence scores but different true posterior probabilities. Proper scoring rule theory shows that given the calibration loss, the missing piece to characterize individual errors is the grouping loss. While there are many estimators of the calibration loss, none exists for the grouping loss in standard settings. Here, we propose an estimator to approximate the grouping loss. We use it to study modern neural network architectures in vision and NLP. We find that the grouping loss varies markedly across architectures, and that it is a key model-comparison factor across the most accurate, calibrated, models. We also show that distribution shifts lead to high grouping loss.","calibration, grouping loss, decision making, model evaluation" Hybrid RL: Using both offline and online data can make RL efficient,https://openreview.net/forum?id=yyBis80iUuU,https://openreview.net/pdf?id=yyBis80iUuU,"We propose a new hybrid RL framework with access to both offline dataset and online interaction, and design a hybrid RL algorithm that is statistically and computationally efficient.","We consider a hybrid reinforcement learning setting (Hybrid RL), in which an agent has access to an offline dataset and the ability to collect experience via real-world online interaction. The framework mitigates the challenges that arise in both pure offline and online RL settings, allowing for the design of simple and highly effective algorithms, in both theory and practice. We demonstrate these advantages by adapting the classical Q learning/iteration algorithm to the hybrid setting, which we call Hybrid Q-Learning or Hy-Q. In our theoretical results, we prove that the algorithm is both computationally and statistically efficient whenever the offline dataset supports a high-quality policy and the environment has bounded bilinear rank. Notably, we require no assumptions on the coverage provided by the initial distribution, in contrast with guarantees for policy gradient/iteration methods. 
In our experimental results, we show that Hy-Q with neural network function approximation outperforms state-of-the-art online, offline, and hybrid RL baselines on challenging benchmarks, including Montezuma’s Revenge.","reinforcement learning theory, hybrid reinforcement learning, online reinforcement learning, offline reinforcement learning" Spotting Expressivity Bottlenecks and Fixing Them Optimally ,https://openreview.net/forum?id=xBeGd7sAND,https://openreview.net/pdf?id=xBeGd7sAND,We quantify the concept of lack of expressivity in neural networks and propose an algorithm to fix them by appropriately adding neurons.,"Machine learning tasks are generally formulated as optimization problems, where one searches for an optimal function within a certain functional space. In practice, parameterized functional spaces are considered, in order to be able to perform gradient descent. Typically, a neural network architecture is chosen and fixed, and its parameters (connection weights) are optimized, yielding an architecture-dependent result. This way of proceeding, however, forces the evolution of the function during training to lie within the realm of what is expressible with the chosen architecture, and prevents any optimization across possible architectures. Costly architectural hyper-parameter optimization is often performed to compensate for this. Instead, we propose to adapt the architecture on the fly during training. We show that the information about desirable architectural changes, due to expressivity bottlenecks when attempting to follow the functional gradient, can be extracted from the back-propagation. To do this, we propose a new mathematically well-grounded method to detect expressivity bottlenecks on the fly and solve them by adding suitable neurons when and where needed. Thus, while the standard approach requires large networks, in terms of number of neurons per layer, for expressivity and optimization reasons, we are able to start with very small neural networks and let them grow appropriately. As a proof of concept, we show convincing results on the MNIST dataset, matching large neural network accuracy, with competitive training time, while removing the need for standard architectural hyper-parameter optimization. ","neural network, expressivity, optimization, quadratic programming, layer growth, tangent space, gradient" Scalable and Privacy-enhanced Graph Generative Model for Graph Neural Networks,https://openreview.net/forum?id=yFQjggu62T,https://openreview.net/pdf?id=yFQjggu62T,"We propose a novel, modern graph generation problem to enable generating privacy-controlled, synthetic substitutes of large-scale real-world graphs that can be effectively used to evaluate GNN models.","As the field of Graph Neural Networks (GNN) continues to grow, it experiences a corresponding increase in the need for large, real-world datasets to train and test new GNN models on challenging, realistic problems. Unfortunately, such graph datasets are often generated from online, highly privacy-restricted ecosystems, which makes research and development on these datasets hard, if not impossible. This greatly reduces the number of benchmark graphs available to researchers, causing the field to rely only on a handful of publicly-available datasets. To address this dilemma, we introduce a novel graph generative model, Computation Graph Transformer (CGT), that can learn and reproduce the distribution of real-world graphs in a privacy-enhanced way.
Our proposed model (1) generates effective benchmark graphs on which GNNs show similar task performance as on the source graphs, (2) scales to process large-scale real-world graphs, and (3) guarantees privacy for end-users. Extensive experiments across a vast body of graph generative models show that only our model can successfully generate privacy-controlled, synthetic substitutes of large-scale real-world graphs that can be effectively used to evaluate GNN models.","graph generative model, graph neural networks, graph convolutional networks, benchmark graph generation" Model ensemble instead of prompt fusion: a sample-specific knowledge transfer method for few-shot prompt tuning,https://openreview.net/forum?id=p0yrSRbN5Bu,https://openreview.net/pdf?id=p0yrSRbN5Bu,"For improving few-shot prompt tuning, we propose a Sample-specific Ensemble of Source Models to transfer knowledge from soft prompts trained on source tasks to target tasks by adjusting the contribution of source models for each target sample.","Prompt tuning approaches, which learn task-specific soft prompts for a downstream task conditioning on frozen pre-trained models, have attracted growing interest due to their parameter efficiency. With large language models and sufficient training data, prompt tuning performs comparably to full-model tuning. However, with limited training samples in few-shot settings, prompt tuning fails to match the performance of full-model fine-tuning. In this work, we focus on improving the few-shot performance of prompt tuning by transferring knowledge from soft prompts of source tasks with abundant training samples. Recognizing the good generalization capabilities of ensemble methods in the low-data regime, we first experiment and show that a simple ensemble of model predictions based on different source prompts outperforms existing multi-prompt knowledge transfer approaches such as source prompt fusion in the few-shot setting. Motivated by this observation, we further investigate model ensembles and propose Sample-specific Ensemble of Source Models (SESoM). SESoM learns to adjust the contribution of each source model for each target sample separately when ensembling source model outputs. In this way, SESoM inherits the superior generalization of ensemble methods and simultaneously captures the sample-specific competence of each source prompt. We conduct experiments across a diverse set of eight NLP tasks using models of different scales (T5-\{base, large, XL\}) and find that SESoM consistently outperforms existing models of the same as well as larger parametric scale by a large margin.","prompt tuning, natural language processing, few-shot learning" Entropy-Regularized Model-Based Offline Reinforcement Learning,https://openreview.net/forum?id=bBBA-8ELXcF,https://openreview.net/pdf?id=bBBA-8ELXcF,We propose a single model that learns a pessimistic MDP for offline RL scenarios which is regularized for transitions that are outside of the data support.,"Model-based approaches to offline Reinforcement Learning (RL) aim to remedy the problem of sample complexity in offline learning via first estimating a pessimistic Markov Decision Process (MDP) from offline data, followed by freely exploring in the learned model for policy optimization. Recent advances in model-based RL techniques mainly rely on an ensemble of models to quantify the uncertainty of the empirical MDP, which is leveraged to penalize out-of-distribution state-action pairs during policy learning.
However, the performance of ensembles for uncertainty quantification highly depends on how they are implemented in practice, which can be a limiting factor. In this paper, we propose a systematic way to measure the epistemic uncertainty and present an Entropy-regularized Model-based Offline RL approach that provides a smooth error estimate when leaving the support of the data toward uncertain areas. Subsequently, we optimize a single neural architecture that maximizes the likelihood of the offline data distribution while regularizing the transitions that are outside of the data support. Empirical results demonstrate that our framework achieves competitive performance compared to state-of-the-art offline RL methods on D4RL benchmark datasets.","Reinforcement learning, model-based, offline RL, entropy regularization" Reward-free Policy Learning through Active Human Involvement,https://openreview.net/forum?id=N3fc0aKFB-0,https://openreview.net/pdf?id=N3fc0aKFB-0,We propose a reward-free policy learning method called Proxy Value Propagation that conveys human intents explicitly to the learning policy through active human involvement,"Despite the success of reinforcement learning (RL) in many control tasks, the behaviors of the learned agents are largely limited by the hand-crafted reward function in the environment, which might not truthfully reflect human intents and preferences. This work proposes a reward-free policy learning method called Proxy Value Propagation that conveys human intents explicitly to the learning policy through active involvement. We adopt an interactive learning setting where human subjects can actively intervene and demonstrate to the agent. Our key insight is that a latent value function can be learned from active human involvement, which in return guides the learning policy to emulate human behaviors. The proposed method first relabels and propagates the proxy values of human demonstrations to other states, and then optimizes the policies to comply with the human intents expressed through the proxy value function. The proposed method can be incorporated into many existing RL algorithms with minimal modifications. Experiments on various tasks and human control devices demonstrate the generality and efficiency of our method. A theoretical guarantee on learning safety is also provided. Demo video and code are available in the supplementary material. ","Human-in-the-loop Reinforcement Learning, Safety, Sample Efficiency, Reward-free" Automaton Distillation: A Neuro-Symbolic Transfer Learning Approach for Deep RL,https://openreview.net/forum?id=C6CEY8xiA7v,https://openreview.net/pdf?id=C6CEY8xiA7v,Transfer reinforcement learning using symbolic knowledge extracted from a teacher,"Reinforcement learning is a powerful tool for finding optimal policies in sequential decision processes. However, deep learning methods suffer from two weaknesses: collecting the amount of agent experience required for practical RL problems is prohibitively expensive, and the learned policies exhibit poor generalization on tasks outside the training distribution. To mitigate these issues, we introduce automaton distillation, a form of neuro-symbolic transfer learning in which Q-value estimates from a teacher are distilled into a low-dimensional representation in the form of an automaton.
We then propose two methods for generating Q-value estimates: static transfer, which reasons over an abstract MDP constructed based on prior knowledge, and dynamic transfer, where symbolic information is extracted from a DQN teacher. The resulting Q-value estimates from either method are used to bootstrap learning in the target environment via a modified DQN loss function. We list several failure modes of existing automaton-based transfer methods and demonstrate that both static and dynamic automaton distillation decrease the time required to find optimal policies for various decision tasks.","reinforcement learning, transfer learning, neuro-symbolic, formal languages, automaton" Sign and Basis Invariant Networks for Spectral Graph Representation Learning,https://openreview.net/forum?id=Q-UHqMorzil,https://openreview.net/pdf?id=Q-UHqMorzil,"We develop neural networks invariant to the symmetries of eigenvectors, which are theoretically expressive and empirically improve performance in geometric learning tasks.","We introduce SignNet and BasisNet---new neural architectures that are invariant to two key symmetries displayed by eigenvectors: (i) sign flips, since if v is an eigenvector then so is -v; and (ii) more general basis symmetries, which occur in higher dimensional eigenspaces with infinitely many choices of basis eigenvectors. We prove that under certain conditions our networks are universal, i.e., they can approximate any continuous function of eigenvectors with the desired invariances. When used with Laplacian eigenvectors, our networks are provably more expressive than existing spectral methods on graphs; for instance, they subsume all spectral graph convolutions, certain spectral graph invariants, and previously proposed graph positional encodings as special cases. Experiments show that our networks significantly outperform existing baselines on molecular graph regression, learning expressive graph representations, and learning neural fields on triangle meshes.","Invariance, Equivariance, Eigenvectors, Spectral, Neural Networks" Certification of Attribution Robustness for Euclidean Distance and Cosine Similarity Measure,https://openreview.net/forum?id=sDNuHPA7Ib4,https://openreview.net/pdf?id=sDNuHPA7Ib4,,"Model attribution is a critical component of deep neural networks (DNNs), as it provides interpretability for complex models. Recent works draw attention to the security of attributions, as they are vulnerable to attribution attacks that generate similar images with dramatically different attributions. Prior studies have worked on empirically improving the robustness of DNNs against those attacks. However, due to the lack of certification, the actual robustness of the model for a testing point is not known. In this work, we define \emph{certified attribution robustness} for the first time, which upper bounds the dissimilarity of attributions after the samples are perturbed by any noise within a certain region while the classification results remain the same. Based on the definition, we propose different approaches to certify the attributions using Euclidean distance and cosine similarity under both $\ell_2$ and $\ell_\infty$-norm perturbation constraints. The bounds developed by our theoretical study are validated on three datasets (MNIST, Fashion-MNIST and CIFAR-10), and two different types of attacks (PGD attack and IFIA attribution attack).
The experimental results show that the bounds certify the model effectively.","interpretation, robustness" Diffusing Graph Attention,https://openreview.net/forum?id=4QIgPD5BLnv,https://openreview.net/pdf?id=4QIgPD5BLnv,,"The dominant paradigm for machine learning on graphs uses Message Passing Graph Neural Networks~(MP-GNNs), in which node representations are updated by aggregating information in their local neighborhood. Recently, there have been increasingly more attempts to adapt the Transformer architecture to graphs in an effort to solve some known limitations of MP-GNNs. A challenging aspect of designing Graph Transformers is integrating the arbitrary graph structure into the architecture. We propose \emph{Graph Diffuser}~(GD) to address this challenge. GD learns to extract structural and positional relationships between distant nodes in the graph, which it then uses to direct the Transformer's attention and node representation. We demonstrate that existing GNNs and Graph Transformers struggle to capture long-range interactions, and show how Graph Diffuser does so while admitting intuitive visualizations. Experiments on eight benchmarks show Graph Diffuser to be a highly competitive model, outperforming the state-of-the-art in a diverse set of domains.","Graph Transformer, graph neural networks, transformers, long-range context" Sequential Latent Variable Models for Few-Shot High-Dimensional Time-Series Forecasting,https://openreview.net/forum?id=7C9aRX2nBf2,https://openreview.net/pdf?id=7C9aRX2nBf2,We present the very first step toward few-shot high-dimensional sequence forecasting by a Bayesian meta-learning model that learns the process of learning latent dynamics that changes with the small number of observations that are available.,"Modern applications increasingly require learning and forecasting latent dynamics from high-dimensional time-series. Compared to univariate time-series forecasting, this adds a new challenge of reasoning about the latent dynamics of an unobserved abstract state. Sequential latent variable models (LVMs) present an attractive solution, although existing works either struggle with long-term forecasting or have difficulty learning across diverse dynamics. In this paper, we first present a conceptual framework of sequential LVMs to unify existing works, contrast their fundamental limitations, and identify an intuitive solution to long-term forecasting for diverse dynamics via meta-learning. We then present the first framework of few-shot forecasting for high-dimensional time-series: instead of learning a single dynamic function, we leverage data of diverse dynamics and learn to adapt latent dynamic functions to few-shot support series. This is realized via Bayesian meta-learning underpinned by: 1) a latent dynamic function conditioned on knowledge derived from few-shot support series, and 2) a meta-model that learns to extract such dynamic-specific knowledge via feed-forward embedding of the support set. We compared the presented framework with a comprehensive set of baseline models trained 1) globally on the large meta-training set with diverse dynamics, and 2) individually on single dynamics, both with and without fine-tuning to the k-shot support series used by the meta-models.
We demonstrated that the presented framework is agnostic to the latent dynamic function of choice and, at meta-test time, is able to forecast for new dynamics given a variable number of support series.","Time series, generative models, Bayesian meta-learning" Code Translation with Compiler Representations,https://openreview.net/forum?id=XomEU3eNeSQ,https://openreview.net/pdf?id=XomEU3eNeSQ,We leverage compiler intermediate representations (IR) for the unsupervised neural machine translation of programming languages and get state-of-the-art results,"In this paper, we leverage low-level compiler intermediate representations (IR) to improve code translation. Traditional transpilers rely on syntactic information and handcrafted rules, which limits their applicability and produces unnatural-looking code. Applying neural machine translation (NMT) approaches to code has successfully broadened the set of programs on which one can get a natural-looking translation. However, they treat the code as sequences of text tokens, and still do not differentiate well enough between similar pieces of code which have different semantics in different languages. The consequence is low-quality translation, reducing the practicality of NMT, and stressing the need for an approach significantly increasing its accuracy. Here we propose to augment code translation with IRs, specifically LLVM IR, with results on the C++, Java, Rust, and Go languages. Our method improves upon the state of the art for unsupervised code translation, increasing the number of correct translations by 11% on average, and up to 79% for the Java → Rust pair with greedy decoding. With beam search, it increases the number of correct translations by 5.5% on average. We extend previous test sets for code translation, by adding hundreds of Go and Rust functions. Additionally, we train models with high performance on the problem of IR decompilation, generating programming source code from IR, and study using IRs as an intermediary pivot for translation.","Neural Machine Translation, Unsupervised Learning, Programming Languages, LLVM, Decompilation, Deep Learning, Program Translation" GAIN: On the Generalization of Instructional Action Understanding,https://openreview.net/forum?id=RlPmWBiyp6w,https://openreview.net/pdf?id=RlPmWBiyp6w,,"Despite the great success achieved in instructional action understanding by deep learning and massive amounts of data, deploying trained models to unseen environments still remains a great challenge, since it requires strong generalizability of models from in-distribution training data to out-of-distribution (OOD) data. In this paper, we introduce a benchmark, named GAIN, to analyze the GeneralizAbility of INstructional action understanding models. In GAIN, we reassemble steps of existing instructional video training datasets to construct the OOD tasks and then collect the corresponding videos. We evaluate the generalizability of models trained on in-distribution datasets with the performance on OOD videos and observe a significant performance drop. We further propose a simple yet effective approach, which cuts off the excessive contextual dependency of action steps by performing causal inference, to provide a potential direction for enhancing the OOD generalizability. In the experiments, we show that this simple approach can improve several baselines on both instructional action segmentation and detection tasks.
We expect the introduction of the GAIN dataset will promote future in-depth research on the generalization of instructional video understanding.","Action Analysis, Instructional Video, OOD Generalization" Deep Reinforcement learning on Adaptive Pairwise Critic and Asymptotic Actor,https://openreview.net/forum?id=nd3yVgRYKVJ,https://openreview.net/pdf?id=nd3yVgRYKVJ,,"Maximum entropy deep reinforcement learning has displayed great potential on a range of challenging continuous tasks. Maximum entropy encourages policy exploration; however, it entails a tradeoff between efficiency and stability, especially when employed on large-scale tasks with high state and action dimensionality. Sometimes the temperature hyperparameter of the maximum entropy term is limited to keep training stable, at the cost of slower and lower convergence. In addition, the function approximation errors in actor-critic learning are known to induce estimation errors and suboptimal policies. In this paper, we propose an algorithm that combines adaptive pairwise critics with adaptive asymptotic maximum entropy. Specifically, we add a trainable state-dependent weight factor to build an adaptive pairwise target Q-value to serve as the surrogate policy objective. Then we adopt a state-dependent adaptive temperature to smooth the entropy policy exploration, which introduces an asymptotic maximum entropy. The adaptive pairwise critics can effectively improve the value estimation, preventing overestimation or underestimation errors. Meanwhile, the adaptive asymptotic entropy can adapt to the tradeoff between efficiency and stability, which provides more exploration and flexibility. We evaluate our method on a set of Gym tasks, and the results show that the proposed algorithm performs better than several baselines on continuous control.", Model-based Value Exploration in Actor-critic Deep Reinforcement Learning,https://openreview.net/forum?id=PTZhYSD8aUv,https://openreview.net/pdf?id=PTZhYSD8aUv,,"Off-policy methods have demonstrated great potential in model-free deep reinforcement learning due to their sample-efficiency advantage. However, they suffer extra instability due to mismatched distributions arising from the observations. Model-free on-policy counterparts usually have poor sample efficiency. Model-based algorithms, in contrast, are highly dependent on the quality of expert demonstrations or the learned dynamics. In this work, we propose a method which involves training the dynamics model to accelerate and gradually stabilize learning without adding sample complexity. The dynamics model prediction can provide effective target value exploration, which is essentially different from on-policy exploration methods, by adding valid diversity to the transitions. Despite the existence of model bias, the model-based prediction can avoid the overestimation and distribution mismatch errors in off-policy learning, as the learned dynamics model is asymptotically accurate. Besides, to generalize the solution to large-scale reinforcement learning problems, we use global Gaussian and deterministic function approximation to model the transition probability and reward function, respectively. To minimize the negative impact of potential model bias brought by the estimated dynamics, we adopt one-step global prediction for the model-based part of the target value. Through analyses and proofs, we show how the model-based prediction provides value exploration and asymptotic performance to the overall network.
It can also be concluded that the convergence of the proposed algorithm depends only on the accuracy of the learned dynamics model.", Omnigrok: Grokking Beyond Algorithmic Data,https://openreview.net/forum?id=zDiHoIWa0q1,https://openreview.net/pdf?id=zDiHoIWa0q1,"We aim to understand grokking through the lens of neural loss landscapes, and show grokking can occur for various datasets beyond algorithmic datasets.","Grokking, the unusual phenomenon for algorithmic datasets where generalization happens long after overfitting the training data, has remained elusive. We aim to understand grokking by analyzing the loss landscapes of neural networks, identifying the mismatch between training and test losses as the cause for grokking. We refer to this as the ""LU mechanism"" because training and test losses (against model weight norm) typically resemble ""L"" and ""U"", respectively. This simple mechanism can nicely explain many aspects of grokking: data size dependence, weight decay dependence, the emergence of representations, etc. Guided by the intuitive picture, we are able to induce grokking on tasks involving images, language and molecules, although the grokking signals are sometimes less dramatic. We attribute the dramatic nature of grokking for algorithmic datasets to representation learning.","grokking, loss landscape, neural dynamics, representation learning, initialization" ManyDG: Many-domain Generalization for Healthcare Applications,https://openreview.net/forum?id=lcSfirnflpW,https://openreview.net/pdf?id=lcSfirnflpW,"New ""many-domain generalization"" setting and new approach ManyDG for the setting in healthcare applications","The vast amount of health data has been continuously collected for each patient, providing opportunities to support diverse healthcare predictive tasks such as seizure detection and hospitalization prediction. Existing models are mostly trained on other patients’ data and evaluated on new patients. Many of them might suffer from poor generalizability. One key reason can be overfitting due to the unique information related to patient identities and their data collection environments, referred to as patient covariates in the paper. These patient covariates usually do not contribute to predicting the targets but are often difficult to remove. As a result, they can bias the model training process and impede generalization. In healthcare applications, most existing domain generalization methods assume a small number of domains. In this paper, considering the diversity of patient covariates, we propose a new setting by treating each patient as a separate domain (leading to many domains). We develop a new domain generalization method, ManyDG, that can scale to such many-domain problems. Our method identifies the patient domain covariates by mutual reconstruction, and removes them via an orthogonal projection step. Extensive experiments show that ManyDG can boost the generalization performance on multiple real-world healthcare tasks (e.g., 3.7% Jaccard improvements on MIMIC drug recommendation) and support realistic but challenging settings such as insufficient data and continuous learning.","Patient covariate shift, Domain Generalization, Healthcare, EEG, EHR" Adversarial Detector for Decision Tree Ensembles Using Representation Learning,https://openreview.net/forum?id=yLv6eSBmA-,https://openreview.net/pdf?id=yLv6eSBmA-,,"Research on adversarial evasion attacks focuses mainly on neural network models.
Among other reasons, this is because of their popularity in certain fields (e.g., computer vision and NLP) and the models' properties, making it easier to search for adversarial examples with minimal input change. Decision trees and tree ensembles are still very popular due to their high performance in fields dominated by tabular data and their explainability. In recent years, several works have defined new adversarial attacks targeting decision trees and tree ensembles. As a result, several papers were published focusing on robust versions of tree ensembles. This research aims to create an adversarial detector for attacks on an ensemble of decision trees. While several previous works have demonstrated the generation of more robust tree ensembles, the process of considering evasion attacks during ensemble generation can affect model performance. We demonstrate a method to detect adversarial samples without affecting either the target model structure or its original performance. We show that by using representation learning based on the structure of the trees, we achieve better detection rates than the state-of-the-art technique, and better than using the original representation of the dataset to train an adversarial detector.","Machine Learning, Representation Learning, Adversarial Learning, Evasion Attacks, Adversarial Detection, Tree Ensembles, Decision Trees" Learning with Instance-Dependent Label Noise: Balancing Accuracy and Fairness,https://openreview.net/forum?id=W8UYLEvvYeR,https://openreview.net/pdf?id=W8UYLEvvYeR,We propose an approach for instance-dependent label noise and demonstrate its ability to balance discriminative performance and fairness in a variety of settings.,"Incorrect labels hurt model performance when the model overfits to noise. Many state-of-the-art approaches that address label noise assume that label noise is independent from the input features. In practice, however, label noise is often feature or instance-\textit{dependent}, and therefore is biased (i.e., some instances are more likely to be mislabeled than others). Approaches that ignore this dependence can produce models with poor discriminative performance, and depending on the task, can exacerbate issues around fairness. In light of these limitations, we propose a two-stage approach to learn from datasets with instance-dependent label noise. Our approach utilizes \textit{anchor points}, a small subset of data for which we know the ground truth labels. On many tasks, our approach leads to consistent improvements over the state-of-the-art in discriminative performance (AUROC) while balancing model fairness (area under the equalized odds curve, AUEOC). For example, when predicting acute respiratory failure onset on the MIMIC-III dataset, the harmonic mean of the AUROC and AUEOC of our approach is 0.84 (SD 0.01) while that of the next best baseline is 0.81 (SD 0.01).
Overall, our approach leads to learning more accurate and fair models compared to existing approaches in the presence of instance-dependent label noise.","noisy labels, supervised learning" Flow Annealed Importance Sampling Bootstrap,https://openreview.net/forum?id=XCTVFJwS9LJ,https://openreview.net/pdf?id=XCTVFJwS9LJ,We train normalizing flows to fit multi-modal target distributions by generating samples where the flow is a poor approximation of the target using an annealed importance sampling bootstrap procedure.,"Normalizing flows are tractable density models that can approximate complicated target distributions, e.g.~Boltzmann distributions of physical systems. However, current methods for training flows either suffer from mode-seeking behavior, use samples from the target generated beforehand by expensive MCMC simulations, or use stochastic losses that have high variance. To avoid these problems, we augment flows with annealed importance sampling (AIS) and minimize the mass-covering $\alpha$-divergence with $\alpha=2$, which minimizes importance weight variance. Our method, Flow AIS Bootstrap (FAB), uses AIS to generate samples in regions where the flow is a poor approximation of the target, facilitating the discovery of new modes. We apply FAB to complex multimodal targets and show that we can approximate them very accurately where previous methods fail. To the best of our knowledge, we are the first to learn the Boltzmann distribution of the alanine dipeptide molecule using only the unnormalized target density, without access to samples generated via Molecular Dynamics (MD) simulations: FAB produces better results than training via maximum likelihood on MD samples while using 100 times fewer target evaluations. After reweighting samples with importance weights, we obtain unbiased histograms of dihedral angles that are almost identical to the ground truth.","Normalizing flow, Boltzmann distribution, Boltzmann generator, Annealed Importance Sampling, Approximate Inference" Learning with MISELBO: The Mixture Cookbook,https://openreview.net/forum?id=ULkdnAqaZTx,https://openreview.net/pdf?id=ULkdnAqaZTx,"We demonstrate the power and flexibility of GMMs as variational approximations in SGD-based VI in terms of density estimation and representation learning, provide a cookbook for implementing them, and point to novel connections between VI and AIS.","Mixture models in variational inference (VI) are an active field of research. Recent works have established their connection to multiple importance sampling (MIS) through the MISELBO and advanced the use of ensemble approximations for large-scale problems. However, as we show here, learning the ensemble components independently can lead to suboptimal diversity. Hence, we study the effect of instead using MISELBO as an objective function for learning mixtures, and we propose the first ever mixture of variational approximations for a normalizing flow-based hierarchical variational autoencoder (VAE) with VampPrior and a PixelCNN decoder network. Two major insights led to the construction of this novel \textit{composite model}. First, mixture models have the potential to be off-the-shelf tools for practitioners to obtain more flexible posterior approximations in VAEs. Therefore, we make them more accessible by demonstrating how to apply them to four popular architectures. Second, the mixture components cooperate in order to cover the target distribution while trying to maximize their diversity when MISELBO is the objective function.
We explain this cooperative behavior by drawing a novel connection between VI and adaptive importance sampling. Finally, we demonstrate the superiority of the Mixture VAEs' learned feature representations on both image and single-cell transcriptome data, and obtain state-of-the-art results among VAE architectures in terms of negative log-likelihood on the MNIST and FashionMNIST datasets. Code available here: \url{gitlink}.","Variational Autoencoders, Mixture VAEs, Bayesian Inference, GMMs, ELBO, VI, Adaptive Importance Sampling, Deep Ensembles, Deep Mixtures, Density Estimation" DecAF: Joint Decoding of Answers and Logical Forms for Question Answering over Knowledge Bases,https://openreview.net/forum?id=XHc5zRPxqV9,https://openreview.net/pdf?id=XHc5zRPxqV9,"We propose a novel KBQA framework that jointly generates both direct answers and logical forms, and then combines them to obtain the final answers.","Question answering over knowledge bases (KBs) aims to answer natural language questions with factual information such as entities and relations in KBs. Previous methods either generate logical forms that can be executed over KBs to obtain final answers or predict answers directly. Empirical results show that the former often produces more accurate answers, but it suffers from non-execution issues due to potential syntactic and semantic errors in the generated logical forms. In this work, we propose a novel framework, DecAF, that jointly generates both logical forms and direct answers, and then combines their merits to obtain the final answers. Moreover, different from most of the previous methods, DecAF is based on simple free-text retrieval without relying on any entity linking tools --- this simplification eases its adaptation to different datasets. DecAF achieves new state-of-the-art accuracy on WebQSP, FreebaseQA, and GrailQA benchmarks, while getting competitive results on the ComplexWebQuestions benchmark.","knowledge base, question answering, information retrieval, semantic parsing" NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis,https://openreview.net/forum?id=elDEe8LYW7-,https://openreview.net/pdf?id=elDEe8LYW7-,"This paper introduces a unified voice synthesis framework that tackles four tasks: zero-shot voice conversion, text-to-speech, singing voice synthesis, and voice designing.","Various applications of voice synthesis have been developed independently despite the fact that they all generate “voice” as output. In addition, most voice synthesis models still require a large amount of audio data paired with annotated labels (e.g., text transcription and music score) for training. To this end, we propose a unified framework for synthesizing and manipulating voice signals from analysis features, dubbed NANSY++. The backbone network of NANSY++ is trained in a self-supervised manner that does not require any annotations paired with audio. After training the backbone network, we efficiently tackle four voice applications - i.e. voice conversion, text-to-speech, singing voice synthesis, and voice designing - by partially modeling the analysis features required for each task. Extensive experiments show that the proposed framework offers competitive advantages such as controllability, data efficiency, and fast training convergence, while providing high quality synthesis.
Audio samples: tinyurl.com/8tnsy3uc.","voice synthesis, integrated framework, zero-shot voice conversion, text-to-speech, singing voice synthesis, voice designing" Robust Attention for Contextual Biased Visual Recognition,https://openreview.net/forum?id=8XqDnrmZQNF,https://openreview.net/pdf?id=8XqDnrmZQNF,,"Visual attention does not always capture the essential object representation desired for robust predictions. Attention modules tend to underline not only the target object but also the common co-occurring context that the module considers helpful during training. The problem is rooted in the confounding effect of the context, which leads to incorrect causalities between objects and predictions and is further exacerbated by visual attention. In this paper, to learn causal object features robust to contextual bias, we propose a novel attention module named Interventional Dual Attention (IDA) for visual recognition. Specifically, IDA adopts two attention layers with multiple-sampling intervention, which protects the attention against the confounding context. Note that our method is model-agnostic and thus can be implemented on various backbones. Extensive experiments show our model obtains significant improvements in classification and detection with lower computation. In particular, we achieve state-of-the-art results in multi-label classification on MS-COCO and PASCAL-VOC. The codes will be publicly available.","Causal Inference, Object Recognition, Attention Mechanism, Confounding Context, Interventional Dual Attention" A unified optimization framework of ANN-SNN Conversion: towards optimal mapping from activation values to firing rates,https://openreview.net/forum?id=83piwkGNzOP,https://openreview.net/pdf?id=83piwkGNzOP,,"Spiking Neural Networks (SNNs) have attracted great attention as a primary candidate for running large-scale deep artificial neural networks (ANNs) in real time due to their distinctive properties of energy efficiency and event-driven fast computation. Training an SNN directly from scratch is usually difficult because of the discreteness of spikes. Converting an ANN to an SNN, i.e., ANN-SNN conversion, is an alternative method to obtain deep SNNs. The performance of the converted SNN is determined by both the ANN performance and the conversion error. The existing ANN-SNN conversion methods usually redesign the ANN with a new activation function instead of the regular ReLU, train the tailored ANN and convert it to an SNN. The performance loss between the regular ANN with ReLU and the tailored ANN has never been considered, yet it is inherited by the converted SNN. In this work, we formulate the ANN-SNN conversion as a unified optimization problem which considers the performance loss between the regular ANN and the tailored ANN, as well as the conversion error, simultaneously. Following the unified optimization framework, we propose the SlipReLU activation function to replace the regular ReLU activation function in the tailored ANN. The SlipReLU is a weighted sum of the threshold-ReLU and the step function, which improves the performance of either as an activation function alone. The SlipReLU method covers a family of activation functions mapping from activation values in source ANNs to firing rates in target SNNs; most of the state-of-the-art optimal ANN-SNN conversion methods are special cases of our proposed SlipReLU method.
We demonstrate through two theorems that the expected conversion error between SNNs and ANNs can theoretically be zero on a range of shift values $\delta \in [-\frac{1}{2},\frac{1}{2}]$ rather than a fixed shift term $\frac{1}{2}$, enabling us to achieve converted SNNs with high accuracy and ultra-low latency. We evaluate our proposed SlipReLU method on the CIFAR-10 dataset, and the results show that the SlipReLU outperforms state-of-the-art ANN-SNN conversion methods in both accuracy and latency. To our knowledge, this is the first work to explore a high-performance ANN-SNN conversion method that considers the ANN performance and the conversion error simultaneously.",ANN-SNN conversion "Multi-Objective Reinforcement Learning: Convexity, Stationarity and Pareto Optimality",https://openreview.net/forum?id=TjEzIsyEsQ6,https://openreview.net/pdf?id=TjEzIsyEsQ6,We propose a linear scalarization based algorithm that has the potential to find the entire Pareto front.,"In recent years, single-objective reinforcement learning (SORL) algorithms have received a significant amount of attention and seen some strong results. However, it is generally recognized that many practical problems have intrinsic multi-objective properties that cannot be easily handled by SORL algorithms. Although there have been many multi-objective reinforcement learning (MORL) algorithms proposed, there has been little recent exploration of the fundamental properties of the spaces we are learning in. In this paper, we perform a rigorous analysis of policy induced value functions and use the insights to distinguish three views of Pareto optimality. The results imply the convexity of the induced value function's range for stationary policies and suggest that any point of its Pareto front can be achieved by training a policy using linear scalarization (LS). We show that the problem that leads to the suboptimal performance of LS can be solved by adding strongly concave terms to the immediate rewards, which motivates us to propose a new vector reward-based Q-learning algorithm, CAPQL. Combined with an actor-critic formulation, our algorithm achieves state-of-the-art performance on multiple MuJoCo tasks in the preference-agnostic setting. Furthermore, we empirically show that, in contrast to other LS-based algorithms, our approach is significantly more stable, achieving similar results across various random seeds. ", Point-based Molecular Representation Learning from Conformers,https://openreview.net/forum?id=pjePBJjlBby,https://openreview.net/pdf?id=pjePBJjlBby,This paper proposes a point-based deep network for molecular representation learning from three-dimensional conformers. ,"Molecular representation learning (MRL) aims to embed molecules into vectors in a high dimensional latent space, which can be used (and reused) for the prediction of various molecular properties. Most current MRL models exploit the SMILES (Simplified Molecular-Input Line-Entry System) strings or molecular graphs as the input format of molecules. As a result, these methods may not capture the full information encoded in the three-dimensional (3D) molecular conformations (also known as the conformers). With mature algorithms for generating 3D molecular conformers, we propose to engage the abundant geometric information in the molecular conformers by representing molecules as point sets, and adapt the point-based deep neural network for MRL.
Specifically, we designed an atom-shared elemental operation that extracts features from individual atoms as well as atomic interactions (including covalent bonds and non-covalent interactions), and a mini-network that ensures the representation is invariant to rotations and translations of the molecular conformers. We trained the deep neural network (referred to as Mol3DNet) on a variety of molecular property prediction tasks using benchmark datasets. The experimental results demonstrated that Mol3DNet achieves state-of-the-art performance on these classification and regression tasks, except for one task (solubility prediction) where all deep learning models underperform a customized machine learning model. ",molecular representation learning Continual Unsupervised Disentangling of Self-Organizing Representations,https://openreview.net/forum?id=ih0uFRFhaZZ,https://openreview.net/pdf?id=ih0uFRFhaZZ,We proposed a novel generative model describing a topologically-connected mixture of spike-and-slab distributions in the latent space for continual unsupervised learning and disentangling representations.,"Limited progress has been made in continual unsupervised learning of representations, especially in reusing, expanding, and continually disentangling learned semantic factors across data environments. We argue that this is because existing approaches treat continually-arrived data independently, without considering how they are related based on the underlying semantic factors. We address this with a new generative model describing a topologically-connected mixture of spike-and-slab distributions in the latent space, learned end-to-end in a continual fashion via principled variational inference. The learned mixture is able to automatically discover the active semantic factors underlying each data environment and to accumulate their relational structure based on that. This distilled knowledge of different data environments can further be used for generative replay and guiding continual disentangling of new semantic factors. We tested the presented method on a split version of 3DShapes to provide the first quantitative disentanglement evaluation of continually learned representations, and further demonstrated its ability to continually disentangle new representations in benchmark datasets.","continual disentanglement, generative model, VAE, SOM" Inducing Gaussian Process Networks,https://openreview.net/forum?id=S0v71vsLBYhM,https://openreview.net/pdf?id=S0v71vsLBYhM,We introduce a new method to efficiently learn the kernel and inducing points for Gaussian processes.,"Gaussian processes (GPs) are powerful but computationally expensive machine learning models, requiring an estimate of the kernel covariance matrix for every prediction. In large and complex domains, such as graphs, sets, or images, the choice of a suitable kernel can also be non-trivial to determine, providing an additional obstacle to the learning task. Over the last decade, these challenges have resulted in significant advances being made in terms of scalability and expressivity, exemplified by, e.g., the use of inducing points and neural network kernel approximations. In this paper, we propose inducing Gaussian process networks (IGN), a simple framework for simultaneously learning the feature space as well as the inducing points.
The inducing points, in particular, are learned directly in the feature space, enabling a seamless representation of complex structured domains while also facilitating scalable gradient-based learning methods. We consider both regression and (binary) classification tasks and report on experimental results for real-world data sets showing that IGNs provide significant advances over state-of-the-art methods. We also demonstrate how IGNs can be used to effectively model complex domains using neural network architectures.","Gaussian processes, Kernel Methods, Classification, Regression" Causal Inference via Nonlinear Variable Decorrelation in Healthcare,https://openreview.net/forum?id=5MUJsSRuylD,https://openreview.net/pdf?id=5MUJsSRuylD,,"Causal inference and model interpretability research are gaining increasing attention, especially in the domains of healthcare and bioinformatics. Despite recent successes in this field, decorrelating features in nonlinear environments with human-interpretable representations has not been adequately investigated. To address this issue, we introduce a novel method with a variable decorrelation regularizer to handle both linear and nonlinear confounding. Moreover, we employ association rules as new representations, using association rule mining based on the original features, to further approximate human decision patterns and increase model interpretability. Extensive experiments are conducted on four healthcare datasets (one synthetically generated and three real-world collections on different diseases). Quantitative results in comparison to baseline approaches on parameter estimation and causality computation indicate the model's superior performance. Furthermore, expert evaluation given by healthcare professionals validates the effectiveness and interpretability of the proposed model.", Monotonicity and Double Descent in Uncertainty Estimation with Gaussian Processes,https://openreview.net/forum?id=fxkACnJZmy_,https://openreview.net/pdf?id=fxkACnJZmy_,"We prove marginal likelihood for optimally-tuned Gaussian processes increases monotonically with input dimension, contrasting with posterior predictive losses that can exhibit double descent.","The quality of many modern machine learning models improves as model complexity increases, an effect that has been quantified—for predictive performance—with the non-monotonic double descent learning curve. Here, we address the overarching question: is there an analogous theory of double descent for models which estimate uncertainty? We provide a partially affirmative and partially negative answer in the setting of Gaussian processes (GP). Under standard assumptions, we prove that higher model quality for optimally-tuned GPs (including uncertainty prediction) under marginal likelihood is realized for larger input dimensions, and therefore exhibits a monotone learning curve. After showing that marginal likelihood does not naturally exhibit double descent in the input dimension, we highlight related forms of posterior predictive loss that do.
Finally, we verify empirically that our results hold for real data, beyond our considered assumptions, and explore unusual consequences involving synthetic covariates.","double descent, Gaussian processes, Bayesian statistics" Fooling SHAP with Stealthily Biased Sampling,https://openreview.net/forum?id=J4mJjotSauh,https://openreview.net/pdf?id=J4mJjotSauh,"We show that Shapley-based explanation techniques commonly used in ML can be manipulated to show false compliance (e.g., during an algorithmic fairness audit) and that this type of attack can be hard to detect.","SHAP explanations aim at identifying which features contribute the most to the difference in model prediction at a specific input versus a background distribution. Recent studies have shown that they can be manipulated by malicious adversaries to produce arbitrary desired explanations. However, existing attacks focus solely on altering the black-box model itself. In this paper, we propose a complementary family of attacks that leave the model intact and manipulate SHAP explanations using stealthily biased sampling of the data points used to approximate expectations w.r.t. the background distribution. In the context of a fairness audit, we show that our attack can reduce the importance of a sensitive feature when explaining the difference in outcomes between groups while remaining undetected. These results highlight the manipulability of SHAP explanations and encourage auditors to treat them with skepticism.","Explainability, Robustness, SHAP, Stealthily Sampling" Towards Realtime Distributed Virtual Flow Meter via Compressed Continual Learning,https://openreview.net/forum?id=H6T7AAoTUsR,https://openreview.net/pdf?id=H6T7AAoTUsR,,"A robust, accurate estimation of fluid flow is the main building block of a distributed virtual flow meter. Unfortunately, a big leap in algorithm development would be required for this objective to come to fruition, mainly due to the inability of current machine learning algorithms to make predictions outside the training data distribution. To improve predictions outside the training distribution, we explore the Continual Learning (CL) paradigm for accurately estimating the characteristics of fluid flow in pipelines. A significant challenge facing CL is the concept of catastrophic forgetting. In this paper, we provide a novel approach to the forgetting problem: compressing the distributed sensor data with a compressive learning algorithm to increase the capacity of the CL memory bank. Through extensive experiments, we show that our approach provides around 8% accuracy improvement compared to other CL algorithms on the real-field distributed sensor dataset.
Our proposed approach also achieves noticeable accuracy improvements on the CL benchmark datasets, reaching state-of-the-art accuracies of 94.95% and 77.27% for the MNIST and CIFAR-10 datasets, respectively.","continual learning, distributed sensor, compressed learning" Asynchronous Gradient Play in Zero-Sum Multi-agent Games,https://openreview.net/forum?id=vPXp7K_Yhre,https://openreview.net/pdf?id=vPXp7K_Yhre,This work provides the first set of algorithms and analyses on understanding asynchronous gradient play in zero-sum multi-agent polymatrix games under a wide range of delay assumptions.,"Finding equilibria via gradient play in competitive multi-agent games has been attracting a growing amount of attention in recent years, with emphasis on designing efficient strategies where the agents operate in a decentralized and symmetric manner with guaranteed convergence. While significant efforts have been made in understanding zero-sum two-player matrix games, the performance in zero-sum multi-agent games remains inadequately explored, especially in the presence of delayed feedbacks, leaving the scalability and resiliency of gradient play open to questions. In this paper, we make progress by studying asynchronous gradient plays in zero-sum polymatrix games under delayed feedbacks. We first establish that the last iterate of the entropy-regularized optimistic multiplicative weight updates (OMWU) method converges linearly to the quantal response equilibrium (QRE), the solution concept under bounded rationality, in the absence of delays. The linear convergence continues to hold even when the feedbacks are randomly delayed under mild statistical assumptions, albeit at a slower rate. Moving beyond random delays, we further demonstrate entropy-regularized OMWU with two-timescale learning rates enjoys faster last-iterate convergence under fixed delays, and continues to converge provably even when the delays are arbitrarily bounded. Our methods also lead to finite-time guarantees to approximate the Nash equilibrium (NE) by moderating the amount of regularization. To the best of our knowledge, this work is the first that aims to understand asynchronous gradient play in zero-sum polymatrix games under a wide range of delay assumptions. ","asynchronous gradient play, OMWU, zero-sum polymatrix games" Novel View Synthesis with Diffusion Models,https://openreview.net/forum?id=HtoA0oT30jC,https://openreview.net/pdf?id=HtoA0oT30jC,Novel View Synthesis with diffusion models from as few as a single image,"We present 3DiM (pronounced ""three-dim""), a diffusion model for 3D novel view synthesis from as few as a single image. The core of 3DiM is an image-to-image diffusion model -- 3DiM takes a single reference view and its pose as inputs, and generates a novel view via diffusion. 3DiM can then generate a full 3D consistent scene following our novel stochastic conditioning sampler: the output frames of the scene are generated autoregressively, and during the reverse diffusion process of each individual frame, we select a random conditioning frame from the set of previous frames at each denoising step. We demonstrate that stochastic conditioning yields much more 3D consistent results compared to the naive sampling process which only conditions on a single previous frame. We compare 3DiMs to prior work on the SRN ShapeNet dataset, demonstrating that 3DiM's generated videos from a single view achieve much higher fidelity while being approximately 3D consistent.
We also introduce a new evaluation methodology, 3D consistency scoring, to measure the 3D consistency of a generated object by training a neural field on the model's output views. 3DiMs are geometry-free, do not rely on hyper-networks or test-time optimization for novel view synthesis, and allow a single model to easily scale to a large number of scenes.","3D, diffusion, ddpm, novel, view, synthesis, generative, models" DM-NeRF: 3D Scene Geometry Decomposition and Manipulation from 2D Images,https://openreview.net/forum?id=C_PRLz8bEJx,https://openreview.net/pdf?id=C_PRLz8bEJx,"In this paper, we study the problem of 3D scene geometry decomposition and manipulation from 2D views.","In this paper, we study the problem of 3D scene geometry decomposition and manipulation from 2D views. By leveraging the recent implicit neural representation techniques, particularly the appealing neural radiance fields, we introduce an object field component to learn unique codes for all individual objects in 3D space only from 2D supervision. The key to this component is a series of carefully designed loss functions to enable every 3D point, especially in non-occupied space, to be effectively optimized even without 3D labels. In addition, we introduce an inverse query algorithm to freely manipulate any specified 3D object shape in the learned scene representation. Notably, our manipulation algorithm can explicitly tackle key issues such as object collisions and visual occlusions. Our method, called DM-NeRF, is among the first to simultaneously reconstruct, decompose, manipulate and render complex 3D scenes in a single pipeline. Extensive experiments on three datasets clearly show that our method can accurately decompose all 3D objects from 2D views, allowing any interested object to be freely manipulated in 3D space, with operations such as translation, rotation, size adjustment, and deformation.","3D Scene Decomposition, Object Manipulation, Neural Rendering" Robust Neural ODEs via Contractivity-promoting Regularization,https://openreview.net/forum?id=n8toFjHwyjq,https://openreview.net/pdf?id=n8toFjHwyjq,We use contraction theory for dynamical systems to design regularizers improving the robustness of neural ODEs.,"Neural networks can be fragile to input noise and adversarial attacks. In this work, we consider Neural Ordinary Differential Equations (NODEs) – a family of continuous-depth neural networks represented by dynamical systems – and propose to use contraction theory to improve their robustness. A dynamical system is contractive if two trajectories starting from different initial conditions converge to each other exponentially fast. Contractive NODEs can enjoy increased robustness as slight perturbations of the features do not cause a significant change in the output. Contractivity can be induced during training by using a regularization term involving the Jacobian of the system dynamics. To reduce the computational burden, we show that it can also be promoted using carefully selected weight regularization terms for a class of NODEs with slope-restricted activation functions, including convolutional networks commonly used in image classification. 
The performance of the proposed regularizers is illustrated through benchmark image classification tasks on the MNIST and FashionMNIST datasets, where images are corrupted by different kinds of noise and attacks.","contraction theory, neural ODEs, robustness, adversarial attacks, convolutional neural networks" Analyzing the Effects of Classifier Lipschitzness on Explainers,https://openreview.net/forum?id=I89hkzP0U4y,https://openreview.net/pdf?id=I89hkzP0U4y,Theoretical work in support of the intuition that robust classifiers lend themselves to robust explainers,"Machine learning methods are getting increasingly better at making predictions, but at the same time they are also becoming more complicated and less transparent. As a result, explainers are often relied on to provide interpretability to these \textit{black-box} prediction models. As crucial diagnostic tools, it is important that these explainers themselves are reliable. In this paper, we focus on one particular aspect of reliability, namely that an explainer should give similar explanations for similar data inputs. We formalize this notion by introducing and defining \textit{explainer astuteness}, analogous to astuteness of classifiers. Our formalism is inspired by the concept of \textit{probabilistic Lipschitzness}, which captures the probability of local smoothness of a function. For a variety of explainers (e.g., SHAP, RISE, CXPlain), we provide lower bound guarantees on the astuteness of these explainers given the Lipschitzness of the prediction function. These theoretical results imply that locally smooth prediction functions lend themselves to locally robust explanations. We evaluate these results empirically on simulated as well as real datasets.","Explainers, Explanation, Robustness, Astuteness, Lipschitz, Blackbox, Classifiers" Complex-Target-Guided Open-Domain Conversation based on offline reinforcement learning,https://openreview.net/forum?id=JEY4Rgx62Vt,https://openreview.net/pdf?id=JEY4Rgx62Vt,,"Previous target-guided open-domain dialogue systems mostly take one keyword as the target, which is very limiting and cannot characterize the dialogue target well. In this paper, we introduce a new target representation model which uses a verb-noun pair to represent a complex-target. To this end, we implement a new dialogue guidance procedure comprising Verb graph and Noun graph construction, a dialogue encoder, a verb-noun selection model, and a response generator. Machine metrics and human evaluation both show that our model outperforms previous target-guided dialogue systems. In addition, different from previous target-guided dialogue systems which use online reinforcement learning to make decisions, we integrate an offline reinforcement learning method to gradually reduce the training time while maintaining high performance.","target-guided dialogue, offline RL" Trading Information between Latents in Hierarchical Variational Autoencoders,https://openreview.net/forum?id=eWtMdr6yCmL,https://openreview.net/pdf?id=eWtMdr6yCmL,We generalize the rate/distortion theory of VAEs and analyze both theoretically and analytically how manipulating each individual layer's rate affects performance.,"Variational Autoencoders (VAEs) were originally motivated as probabilistic generative models in which one performs approximate Bayesian inference. 
The proposal of $\beta$-VAEs breaks this interpretation and generalizes VAEs to application domains beyond generative modeling (e.g., representation learning, clustering, or lossy data compression) by introducing an objective function that allows practitioners to trade off between the information content (``bit rate'') of the latent representation and the distortion of reconstructed data. In this paper, we reconsider this rate/distortion trade-off in the context of hierarchical VAEs, i.e., VAEs with more than one layer of latent variables. We propose a method to control each layer's contribution to the rate independently. We identify the most general class of inference models to which our proposed method is applicable, and we derive theoretical bounds on the performance of downstream tasks as functions of the individual layers' rates. Our experiments demonstrate that the proposed method allows us to better tune hierarchical VAEs for a diverse set of practical use cases.","VAE, hierarchical VAE, rate distortion theory, information theory" """Why did the Model Fail?"": Attributing Model Performance Changes to Distribution Shifts",https://openreview.net/forum?id=b7jXzuQMq8W,https://openreview.net/pdf?id=b7jXzuQMq8W,We propose a method to attribute model performance changes to distribution shifts in causal mechanisms.,"Performance of machine learning models may differ between training and deployment for many reasons. For instance, model performance can change between environments due to changes in data quality, observing a different population than the one in training, or changes in the relationship between labels and features. These manifest as changes to the underlying data generating mechanisms, and thereby result in distribution shifts across environments. Attributing performance changes to specific shifts, such as covariate or concept shifts, is critical for identifying sources of model failures, and for taking mitigating actions that ensure robust models. In this work, we introduce the problem of attributing performance differences between environments to shifts in the underlying data generating mechanisms. We formulate the problem as a cooperative game and derive an importance weighting method for computing the value of a coalition (or a set) of distributions. The contribution of each distribution to the total performance change is then quantified as its Shapley value. We demonstrate the correctness and utility of our method on two synthetic datasets and two real-world case studies, showing its effectiveness in attributing performance changes to a wide range of distribution shifts.","distribution shifts, Shapley attribution, model robustness" VC Theoretical Explanation of Double Descent,https://openreview.net/forum?id=I7Mvqi0p9Xj,https://openreview.net/pdf?id=I7Mvqi0p9Xj,,"There has been growing interest in generalization performance of large multilayer neural networks that can be trained to achieve zero training error, while generalizing well on test data. This regime is known as ‘second descent’ and it appears to contradict the conventional view that optimal model complexity should reflect an optimal balance between underfitting and overfitting, i.e., the bias-variance tradeoff. This paper presents a VC-theoretical analysis of double descent and shows that it can be fully explained by classical VC-generalization bounds. 
We illustrate an application of analytic VC-bounds for modeling double descent for classification problems, using empirical results for several learning methods, such as SVM, Least Squares, and Multilayer Perceptron classifiers. In addition, we discuss several reasons for the misinterpretation of VC-theoretical results in the Deep Learning community.","Double Descent, Deep Learning, SVM, Least Squares, VC Dimension, VC Generalization Bounds, Structural Risk Minimization" Points2NeRF: Generating Neural Radiance Fields from 3D point cloud,https://openreview.net/forum?id=Db8XXy9RCL,https://openreview.net/pdf?id=Db8XXy9RCL,We convert 3D point clouds into Neural Radiance Fields (NeRFs).,"Neural Radiance Fields (NeRFs) offer state-of-the-art quality in synthesising novel views of complex 3D scenes from a small subset of base images. For NeRFs to perform optimally, the registration of base images has to follow certain assumptions, including maintaining a constant distance between the camera and the object. We can address this limitation by training NeRFs with 3D point clouds, instead of images, yet a straightforward substitution is impossible due to the sparsity of 3D clouds in the under-sampled regions, which leads to incomplete reconstructions output by NeRFs. To solve this problem, here we propose an auto-encoder-based architecture that leverages a hypernetwork paradigm to transfer 3D points with the associated color values through a lower-dimensional latent space and generate the weights of a NeRF model. This way we are able to accommodate the sparsity of 3D point clouds and fully exploit the potential of point cloud data. As a side benefit, our method offers an implicit way of representing 3D scenes and objects that can be employed to condition NeRFs and hence generalize the models beyond objects seen during training. Empirical evaluation confirms the advantages of our method over conventional NeRFs and proves its superiority in practical applications.","NeRF, Neural Radiance Fields, 3D point clouds" Imitation Improvement Learning for Large-scale Capacitated Vehicle Routing Problems,https://openreview.net/forum?id=K5UfKyHIBS,https://openreview.net/pdf?id=K5UfKyHIBS,We propose an imitation learning and a clockwise clustering framework to efficiently solve large-scale capacitated vehicle routing problems,"Recent works using deep reinforcement learning (RL) to solve routing problems such as the capacitated vehicle routing problem (CVRP) have focused on improvement learning-based methods, which involve improving a given solution until it becomes near-optimal. Although adequate solutions can be achieved for small problem instances, their efficiency degrades for large-scale ones. In this work, we propose a new improvement learning-based framework based on imitation learning where classical heuristics serve as experts to encourage the policy model to mimic and produce similar and better solutions. Moreover, to improve scalability, we propose Clockwise Clustering, a novel augmented framework for decomposing large-scale CVRP into subproblems by sequentially clustering nodes in clockwise order, and then learning to solve them simultaneously. Our approaches enhance state-of-the-art CVRP solvers while attaining competitive solution quality on several well-known datasets, including real-world instances with sizes up to 30,000 nodes. Our best methods are able to achieve new state-of-the-art solutions for several large instances and generalize to a wide range of CVRP variants and solvers. 
We also contribute new datasets and results to test the generalizability of our deep RL algorithms.","capacitated vehicle routing, deep reinforcement learning, imitation learning, clockwise clustering" Enhance Local Consistency for Free: A Multi-Step Inertial Momentum Approach,https://openreview.net/forum?id=nsz2ZxnD9D,https://openreview.net/pdf?id=nsz2ZxnD9D,"We propose a novel federated learning algorithm, named FedMIM, which adopts multi-step inertial momentum on the edge devices and enhances local consistency for free during training to improve robustness to heterogeneity.","Federated learning (FL), as a collaborative distributed training paradigm with several edge computing devices under the coordination of a centralized server, is plagued by inconsistent local stationary points due to the heterogeneity of the partially participating local clients, which precipitates local client-drift problems and causes unstable and slow convergence, especially on highly heterogeneous datasets. To address these issues, we propose a novel federated learning algorithm, named FedMIM, which adopts multi-step inertial momentum on the edge devices and enhances local consistency for free during training to improve robustness to heterogeneity. Specifically, we incorporate weighted global gradient estimations as inertial correction terms to guide both the local iterates and the stochastic gradient estimation, which naturally accounts for the global objective on the edge devices' heterogeneous datasets and maintains the required local iteration consistency. Theoretically, we show that FedMIM achieves the $\mathcal{O}\big({1}/{\sqrt{SKT}}\big)$ convergence rate with a linear speedup property with respect to the number of selected clients $S$ and proper local interval $K$ in each communication round under the nonconvex setting. Empirically, we conduct comprehensive experiments on various real-world datasets and demonstrate the efficacy of the proposed FedMIM against several state-of-the-art baselines.","federated learning, optimization" SYNG4ME: Model Evaluation using Synthetic Test Data,https://openreview.net/forum?id=J7CTp-jNyJ,https://openreview.net/pdf?id=J7CTp-jNyJ,Evaluating ML supervised models by generating synthetic test data,"Model evaluation is a crucial step in ensuring reliable machine learning systems. Currently, predictive models are evaluated on held-out test data, quantifying aggregate model performance. Limitations of available test data make it challenging to evaluate model performance on small subgroups or when the environment changes. Synthetic test data provides a unique opportunity to address this challenge; instead of evaluating predictive models on real data, we propose to use synthetic data. This brings two advantages. First, supplementing and increasing the amount of evaluation data can lower the variance of model performance estimates compared to evaluation on the original test data. This is especially true for local performance evaluation in low-density regions, e.g. minority or intersectional groups. Second, generative models can be conditioned so as to induce a shift in the synthetic data distribution, allowing us to evaluate how supervised models could perform in different target settings. In this work, we propose SYNG4ME: an automated suite of synthetic data generators for model evaluation. 
The smart synthetic data sets generated by SYNG4ME give data practitioners a new tool for exploring how supervised models may perform on subgroups of the data, and how robust methods are to distributional shifts. We show experimentally that SYNG4ME achieves more accurate performance estimates compared to using the test data alone.","Model Evaluation, Synthetic data" "Take One Gram of Neural Features, Get Enhanced Group Robustness",https://openreview.net/forum?id=24quGic59-,https://openreview.net/pdf?id=24quGic59-,"We improve group robustness without group annotations by introducing GramClust, a two-stage method which (1) partitions a dataset into groups based on feature Gram matrices and (2) applies robust optimization based on these pseudo-groups. ","Predictive performance of machine learning models trained with empirical risk minimization (ERM) can degrade considerably under distribution shifts. In particular, the presence of spurious correlations in training datasets leads ERM-trained models to display high loss when evaluated on minority groups not presenting such correlations in test sets. Extensive attempts have been made to develop methods improving worst-group robustness. However, they require group information for each training input or, at least, a validation set with group labels to tune their hyperparameters, which may be expensive to get or unknown a priori. In this paper, we address the challenge of improving group robustness without group annotations during training. To this end, we propose to automatically partition the training dataset into groups based on Gram matrices of features extracted from an identification model and to apply robust optimization based on these pseudo-groups. In the realistic context where no group labels are available, our experiments show that our approach not only improves group robustness over ERM but also outperforms all recent baselines.","group robustness, distribution shift, spurious correlations, Gram matrices" LMC: Fast Training of GNNs via Subgraph Sampling with Provable Convergence,https://openreview.net/forum?id=5VBBA91N6n,https://openreview.net/pdf?id=5VBBA91N6n,We propose a novel and efficient subgraph-wise sampling method with a convergence guarantee by Local Message Compensation (LMC).,"Message passing-based graph neural networks (GNNs) have achieved great success in many real-world applications. However, training GNNs on large-scale graphs suffers from the well-known neighbor explosion problem, i.e., the exponentially increasing dependencies of nodes with the number of message passing layers. Subgraph-wise sampling methods---a promising class of mini-batch training techniques---discard messages outside the mini-batches in backward passes to avoid the neighbor explosion problem at the expense of gradient estimation accuracy. This poses significant challenges to their convergence analysis and convergence speeds, which seriously limits their reliable real-world applications. To address this challenge, we propose a novel subgraph-wise sampling method with a convergence guarantee, namely Local Message Compensation (LMC). To the best of our knowledge, LMC is the {\it first} subgraph-wise sampling method with provable convergence. The key idea of LMC is to retrieve the discarded messages in backward passes based on a message passing formulation of backward passes. By efficient and effective compensations for the discarded messages in both forward and backward passes, LMC computes accurate mini-batch gradients and thus accelerates convergence. 
We further show that LMC converges to first-order stationary points of GNNs. Experiments on large-scale benchmark tasks demonstrate that LMC significantly outperforms state-of-the-art subgraph-wise sampling methods in terms of efficiency.","Graph Neural Networks, Scalable Training, Provable Convergence, Local Message Compensation" DEEPER-GXX: DEEPENING ARBITRARY GNNS,https://openreview.net/forum?id=r_vnM5H9Fm,https://openreview.net/pdf?id=r_vnM5H9Fm,,"Recently, motivated by real applications, a major research direction in graph neural networks (GNNs) is to explore deeper structures. For instance, the graph connectivity is not always consistent with the label distribution (e.g., the closest neighbors of some nodes are not from the same category). In this case, GNNs need to stack more layers, in order to find the same categorical neighbors in a longer path for capturing the class-discriminative information. However, two major problems hinder deeper GNNs from obtaining satisfactory performance, i.e., vanishing gradient and over-smoothing. On the one hand, stacking layers makes the neural network hard to train as the gradients of the first few layers vanish. Moreover, when simply addressing vanishing gradient in GNNs, we discover the shading neighbors effect (i.e., stacking layers inappropriately distorts the non-IID information of graphs and degrades the performance of GNNs). On the other hand, deeper GNNs aggregate much more information from common neighbors such that individual node representations share more overlapping features, which makes the final output representations non-discriminative (i.e., overly smoothed). In this paper, for the first time, we address both problems to enable deeper GNNs, and propose Deeper-GXX, which consists of the Weight-Decaying Graph Residual Connection module (WDG-ResNet) and Topology-Guided Graph Contrastive Loss (TGCL). Extensive experiments on real-world data sets demonstrate that Deeper-GXX outperforms state-of-the-art deeper baselines.","Graph Neural Networks, Deep Learning" Music-to-Text Synaesthesia: Generating Descriptive Text from Music Recordings,https://openreview.net/forum?id=1FsLDqHivn4,https://openreview.net/pdf?id=1FsLDqHivn4,,"In this paper, we consider a novel research problem, music-to-text synaesthesia. Different from the classical music tagging problem that classifies a music recording into pre-defined categories, music-to-text synaesthesia aims to generate descriptive texts from music recordings for further understanding. Although this is a new and interesting application to the machine learning community, to the best of our knowledge, the existing music-related datasets do not contain semantic descriptions of music recordings and cannot serve the music-to-text synaesthesia task. In light of this, we collect a new dataset that contains 1,955 aligned pairs of classical music recordings and text descriptions. Based on this, we build a computational model to generate sentences that can describe the content of the music recording. To tackle highly non-discriminative classical music, we design a group topology-preservation loss in our computational model, which considers more samples as a group reference and preserves the relative topology among different samples. 
Extensive experimental results qualitatively and quantitatively demonstrate the effectiveness of our proposed model over five heuristic or pre-trained competitive methods and their variants on our collected dataset.","Multi-modal Learning, Music Description, Text Generation" ISAAC Newton: Input-based Approximate Curvature for Newton's Method,https://openreview.net/forum?id=0paCJSFW7j,https://openreview.net/pdf?id=0paCJSFW7j,,"We present ISAAC (Input-baSed ApproximAte Curvature), a novel method that conditions the gradient using selected second-order information and has an asymptotically vanishing computational overhead, assuming a batch size smaller than the number of neurons. We show that it is possible to compute a good conditioner based only on the input to the respective layer without a substantial computational overhead. The proposed method allows effective training even in small-batch stochastic regimes, which makes it competitive with first-order as well as second-order methods.", Learning Human-Compatible Representations for Case-Based Decision Support,https://openreview.net/forum?id=r0xte-t40I,https://openreview.net/pdf?id=r0xte-t40I,We combine metric learning and supervised classification to learn human-compatible decision-focused representation.,"Algorithmic case-based decision support provides examples to help humans make sense of predicted labels and aid humans in decision-making tasks. Despite the promising performance of supervised learning, representations learned by supervised models may not align well with human intuitions: what models consider as similar examples can be perceived as distinct by humans. As a result, they have limited effectiveness in case-based decision support. In this work, we incorporate ideas from metric learning with supervised learning to examine the importance of alignment for effective decision support. In addition to instance-level labels, we use human-provided triplet judgments to learn human-compatible decision-focused representations. Using both synthetic data and human subject experiments in multiple classification tasks, we demonstrate that such representations are better aligned with human perception than representations solely optimized for classification. Human-compatible representations identify nearest neighbors that are perceived as more similar by humans and allow humans to make more accurate predictions, leading to substantial improvements in human decision accuracies (17.8% in butterfly vs. moth classification and 13.2% in pneumonia classification).","human-compatible representation learning, human triplet judgments" Long-Tailed Learning Requires Feature Learning,https://openreview.net/forum?id=S-h1oFv-mq,https://openreview.net/pdf?id=S-h1oFv-mq,We study the importance of learning features in order to achieve good generalization when the data distribution has a long tail. ,"We propose a simple data model inspired by natural data such as text or images, and use it to study the importance of learning features in order to achieve good generalization. Our data model follows a long-tailed distribution in the sense that some rare and uncommon subcategories have few representatives in the training set. 
In this context, we provide evidence that a learner succeeds if and only if it identifies the correct features, and moreover derive non-asymptotic generalization error bounds that precisely quantify the penalty that one must pay for not learning features.","deep learning theory, generalization, long-tailed data distribution" Understanding Hindsight Goal Relabeling Requires Rethinking Divergence Minimization,https://openreview.net/forum?id=SxO-qoAwVM,https://openreview.net/pdf?id=SxO-qoAwVM,,"Hindsight goal relabeling has become a foundational technique for multi-goal reinforcement learning (RL). The idea is quite simple: any arbitrary trajectory can be seen as an expert demonstration for reaching the trajectory's end state. Intuitively, this procedure trains a goal-conditioned policy to imitate a sub-optimal expert. However, this connection between imitation and hindsight relabeling is not well understood. Modern imitation learning algorithms are described in the language of divergence minimization, and yet it remains an open problem how to recast hindsight goal relabeling into that framework. In this work, we develop a unified objective for goal-reaching that explains such a connection, from which we can derive goal-conditioned supervised learning (GCSL) and the reward function in hindsight experience replay (HER) from first principles. Experimentally, we find that despite recent advances in goal-conditioned behaviour cloning (BC), multi-goal Q-learning can still outperform BC-like methods; moreover, a vanilla combination of both actually hurts model performance. Under our framework, we study when BC is expected to help, and empirically validate our findings. Our work further bridges goal-reaching and generative modeling, illustrating the nuances and new pathways of extending the success of generative models to RL.","reinforcement learning, multi-goal reinforcement learning, imitation learning" DoE2Vec: Representation Learning for Exploratory Landscape Analysis,https://openreview.net/forum?id=LVujM8Yxsi,https://openreview.net/pdf?id=LVujM8Yxsi,"We propose DoE2Vec, a variational autoencoder (VAE)-based methodology to learn optimization landscape characteristics for downstream meta-learning tasks.","We propose DoE2Vec, a variational autoencoder (VAE)-based methodology to learn optimization landscape characteristics for downstream meta-learning tasks, e.g., automated selection of optimization algorithms. Principally, using large training data sets generated with a random function generator, DoE2Vec self-learns an informative latent representation for any design of experiments (DoE). Unlike the classical exploratory landscape analysis (ELA) method, our approach does not require any feature engineering and is easily applicable to high-dimensional search spaces. For validation, we inspect the quality of latent reconstructions and analyze the latent representations using different experiments. The latent representations not only show promising potential in identifying similar (cheap-to-evaluate) surrogate functions, but can also boost performance when used as a complement to the ELA features in classification tasks.","autoencoder, optimization, exploratory landscape analysis, representation learning" How to Exploit Hyperspherical Embeddings for Out-of-Distribution Detection?,https://openreview.net/forum?id=aEFaE0W5pAd,https://openreview.net/pdf?id=aEFaE0W5pAd,,"Out-of-distribution (OOD) detection is a critical task for reliable machine learning. 
Recent advances in representation learning give rise to distance-based OOD detection, where testing samples are detected as OOD if they are relatively far away from the centroids or prototypes of in-distribution (ID) classes. However, prior methods directly take off-the-shelf contrastive losses that suffice for classifying ID samples, but are not optimally designed when test inputs contain OOD samples. In this work, we propose CIDER, a novel representation learning framework that exploits hyperspherical embeddings for OOD detection. CIDER jointly optimizes two losses to promote strong ID-OOD separability: a dispersion loss that promotes large angular distances among different class prototypes, and a compactness loss that encourages samples to be close to their class prototypes. We analyze and establish the unexplored relationship between OOD detection performance and the embedding properties in the hyperspherical space, and demonstrate the importance of dispersion and compactness. CIDER establishes superior performance, outperforming the latest rival by 19.36% in FPR95.", Inferring Causal Relations between Temporal Events,https://openreview.net/forum?id=8is5PNk68ql,https://openreview.net/pdf?id=8is5PNk68ql,A method to infer causal relations between temporal events,"Due to the popularity of event-based data, causal inference from event datasets has attracted increasing interest. However, inferring causalities from observational event sequences is challenging because of the heterogeneous and irregular nature of event-based data. Existing work on causal inference for temporal events disregards the event durations, and is thus unable to capture their impact on the causal relations. In the present paper, we overcome this limitation by proposing a new modeling approach for temporal events that captures and utilizes event durations. Based on this new temporal model, we propose a set of novel Duration-based Event Causality (DEC) scores, including the Duration-based Necessity and Sufficiency Trade-off score, and the Duration-based Conditional Intensity Rates scores that take into consideration event durations when inferring causal relations between temporal events. We prove that the proposed scores follow the causality hypothesis testing framework. We conduct an extensive experimental evaluation using both synthetic datasets and two real-world event datasets in the medical and environmental domains to evaluate our proposed scores, and compare them against the closest baseline. The experimental results show that our proposed scores outperform the baseline by a large margin on the popular evaluation metric Hits@K.","causality, temporal event" AnyDA: Anytime Domain Adaptation,https://openreview.net/forum?id=yyLvxYBJV1B,https://openreview.net/pdf?id=yyLvxYBJV1B,,"Unsupervised domain adaptation is an open and challenging problem in computer vision. While existing research shows encouraging results in addressing cross-domain distribution shift on common benchmarks, these methods are often limited to testing under a specific target setting. This can limit their impact for many real-world applications that present different resource constraints. In this paper, we introduce a simple yet effective framework for anytime domain adaptation that is executable with dynamic resource constraints to achieve accuracy-efficiency trade-offs under domain shifts. 
We achieve this by training a single shared network using both labeled source data and unlabeled target data, with switchable depth, width and input resolutions on the fly to enable testing under a wide range of computation budgets. Starting with a teacher network trained from a label-rich source domain, we utilize bootstrapped recursive knowledge distillation as a nexus between source and target domains to jointly train the student network with switchable subnetworks. Extensive experiments on several diverse benchmark datasets clearly demonstrate the superiority of our proposed approach over state-of-the-art methods.","Efficient Domain Adaptation, Anytime Prediction, Knowledge Distillation, Resource-constrained Learning" Improving Deep Regression with Ordinal Entropy,https://openreview.net/forum?id=raU07GpP0P,https://openreview.net/pdf?id=raU07GpP0P,"We observe that many regression problems are preferably formulated as classification tasks, and we provide a theoretical analysis to explain this phenomenon, and then propose an ordinal entropy loss to improve the performance of regression.","In computer vision, it is often observed that formulating regression problems as classification tasks yields better performance. We investigate this curious phenomenon and provide a derivation to show that classification, with the cross-entropy loss, outperforms regression with a mean squared error loss in its ability to learn high-entropy feature representations. Based on the analysis, we propose an ordinal entropy loss to encourage higher-entropy feature spaces while maintaining ordinal relationships to improve the performance of regression tasks. Experiments on synthetic and real-world regression tasks demonstrate the importance and benefits of increasing entropy for regression.","regression, classification, entropy, depth estimation, counting, age estimation" Revisiting Pretraining Objectives for Tabular Deep Learning,https://openreview.net/forum?id=kjPLodRa0n,https://openreview.net/pdf?id=kjPLodRa0n,"We identify best practices for pretraining tabular DL models and show a significant increase in downstream performance, which often leads to superiority over GBDTs.","Recent deep learning models for tabular data compete with the traditional ML models based on decision trees (GBDT). Unlike GBDT, deep models can additionally benefit from pretraining, which is a workhorse of DL for vision and NLP. For tabular problems, several pretraining methods were proposed, but it is not entirely clear if pretraining provides consistent noticeable improvements and what method should be used, since the methods are often not compared to each other or the comparison is limited to the simplest MLP architectures. In this work, we aim to identify the best practices to pretrain tabular DL models that can be universally applied to different datasets and architectures. Among our findings, we show that using the object target labels during the pretraining stage is beneficial for the downstream performance and advocate several target-aware pretraining objectives. 
Overall, our experiments demonstrate that properly performed pretraining significantly increases the performance of tabular DL models, which often leads to their superiority over GBDTs.","Deep Learning, Tabular Data, Pretraining" OoD-Control: Out-of-Distribution Generalization for Adaptive UAV Flight Control,https://openreview.net/forum?id=cH4MVZsScm,https://openreview.net/pdf?id=cH4MVZsScm,,"Data-driven control methods have demonstrated precise and agile control of Unmanned Aerial Vehicles (UAVs) in turbulent environments. However, they are relatively weak at handling out-of-distribution (OoD) data, i.e., they encounter the generalization problem when faced with unknown environments whose data distributions differ from the training set. Many studies have designed algorithms to reduce the impact of the OoD problem, a common but tricky problem in machine learning. To tackle the OoD generalization problem in control, we propose a theoretically guaranteed approach: OoD-Control. We provide proof that for any perturbation within some range on the states, the control error can be upper bounded by a constant. In this paper, we present our OoD-Control generalization algorithm for online adaptive flight control and evaluate it on two instances. Experiments show that systems trained by the proposed OoD-Control algorithm perform better in environments quite different from those seen during training. The control method is extensible and broadly applicable to different dynamical models. OoD-Control is validated on UAV dynamic models, and we find it achieves state-of-the-art performance on positioning stability and trajectory tracking problems.", AdaptFSP: Adaptive Fictitious Self Play,https://openreview.net/forum?id=1kTxYvMRR8N,https://openreview.net/pdf?id=1kTxYvMRR8N,Use deep RL to modify FSP for better performance in continuous control games,"Fictitious Self-Play (FSP) is an iterative algorithm capable of learning approximate Nash equilibria in many types of two-player zero-sum games. In FSP, at each iteration, a best response is learned to the opponent's meta strategy. However, FSP can be slow to converge in continuous control games in which two embodied agents compete against one another. We propose Adaptive FSP (AdaptFSP), a deep reinforcement learning (RL) algorithm inspired by FSP. The main idea is that instead of training a best response only against the meta strategy, we additionally train against an adaptive deep RL agent that can adapt to the best response. In four test domains, two tabular cases--random normal-form matrix games and Leduc poker--and two continuous control tasks--Thou Shall Not Pass and a soccer environment--we show that AdaptFSP achieves lower exploitability more quickly than vanilla FSP.","Deep reinforcement learning, game theory, exploitability" A Robust Stacking Framework for Training Deep Graph Models with Multifaceted Node Features,https://openreview.net/forum?id=2bhXOpq53RP,https://openreview.net/pdf?id=2bhXOpq53RP,,"Graph Neural Networks (GNNs) with numerical node features and graph structure as inputs have demonstrated superior performance on various supervised learning tasks with graph data. However, the numerical node features utilized by GNNs are commonly extracted from raw data which is of text or tabular (numeric/categorical) type in most real-world applications. The best models for such data types in most standard supervised learning settings with IID (non-graph) data are not simple neural network layers and thus are not easily incorporated into a GNN. 
Here we propose a robust stacking framework that fuses graph-aware propagation with arbitrary models intended for IID data, which are ensembled and stacked in multiple layers. Our layer-wise framework leverages bagging and stacking strategies to enjoy strong generalization, in a manner which effectively mitigates label leakage and overfitting. Across a variety of graph datasets with tabular/text node features, our method achieves comparable or superior performance relative to both tabular/text and graph neural network models, as well as existing state-of-the-art hybrid strategies that combine the two. ", VLG: General Video Recognition with Web Textual Knowledge,https://openreview.net/forum?id=Fp0CMUtBtw,https://openreview.net/pdf?id=Fp0CMUtBtw,"We build a comprehensive video benchmark of Kinetics-GVR including closed-set, long-tail, few-shot and open-set, and present a unified video-text framework (VLG) with web textual knowledge to achieve SOTA performance under different settings.","Video recognition in an open world is quite challenging, as we need to handle different settings such as closed-set, long-tail, few-shot and open-set. By leveraging semantic knowledge from noisy text descriptions crawled from the Internet, we focus on the general video recognition (GVR) problem of solving different recognition tasks within a unified framework. The contribution of this paper is twofold. First, we build a comprehensive video recognition benchmark of Kinetics-GVR, including four sub-task datasets to cover the mentioned settings. To facilitate the research of GVR, we propose to utilize external textual knowledge from the Internet and provide multi-source text descriptions for all action classes. Second, inspired by the flexibility of language representation, we present a unified visual-linguistic framework (VLG) to solve the problem of GVR by devising an effective two-stage training paradigm. Our VLG is first pre-trained on video and language datasets to learn a shared feature space, and then devises a flexible bi-modal attention head to integrate high-level semantic concepts under different settings. Extensive results show that our VLG obtains state-of-the-art performance under all four settings. The superior performance demonstrates the effectiveness and generalization ability of our proposed VLG framework. We hope our work makes a step towards general video recognition and could serve as a baseline for future research.","Video Recognition, Multi Modality, Video-language representation learning" Unified Discrete Diffusion for Simultaneous Vision-Language Generation,https://openreview.net/forum?id=8JqINxA-2a,https://openreview.net/pdf?id=8JqINxA-2a,"We propose a Unified Discrete Denoising Diffusion model, which allows us to construct a joint vision-language probability distribution, leading to a capability of simultaneously generating cross-domain results. ","The recently developed discrete diffusion model performs extraordinarily well in generation tasks, especially in the text-to-image task, showing great potential for modeling multimodal signals. In this paper, we leverage these properties and present a unified multimodal generation model, which can perform text-based, image-based, and even vision-language simultaneous generation using a single model. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified Markov transition matrix and a unified objective. 
Moreover, we design a multimodal mutual attention module to highlight the inter-modal linkages, which is vital for multimodal generation. Extensive experiments indicate that our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.","Multi-modal, Image generation, Image Caption." Take 5: Interpretable Image Classification with a Handful of Features,https://openreview.net/forum?id=9EcAsB7wgM,https://openreview.net/pdf?id=9EcAsB7wgM,,"Deep Neural Networks use thousands of mostly incomprehensible features to identify a single class, a decision no human can follow. We propose an interpretable sparse and low-dimensional final decision layer in a deep neural network with measurable aspects of interpretability and demonstrate it on fine-grained image classification. We argue that a human can only understand the decision of a machine learning model if the input features are interpretable and only very few of them are used for a single decision. To that end, the final layer has to be sparse and – to make interpreting the features feasible – low-dimensional. We call a model with a Sparse Low-Dimensional Decision “SLDD-Model”. We show that an SLDD-Model is easier to interpret locally and globally than a dense high-dimensional decision layer while being able to maintain competitive accuracy. Additionally, we propose a loss function that improves a model’s feature diversity and accuracy. Our interpretable SLDD-Model only uses 5 out of just 50 features per class, while maintaining 97% to 100% of the accuracy on four common benchmark datasets compared to the baseline model with 2048 features.","xAI, interpretability, fine-grained image classification, sparsity, image classification, interpretability by design" Uncertainty-based Multi-Task Data Sharing for Offline Reinforcement Learning,https://openreview.net/forum?id=u1Vj68CJZP,https://openreview.net/pdf?id=u1Vj68CJZP,We propose an uncertainty-based multi-task data sharing approach that shares the entire dataset without data selection.,"Offline Reinforcement Learning (RL) has shown promising results in learning a task-specific policy from a fixed dataset. However, successful offline RL often relies heavily on the coverage and quality of the given dataset. In scenarios where the dataset for a specific task is limited, a natural approach is to improve offline RL with datasets from other tasks, namely, to conduct Multi-Task Data Sharing (MTDS). Nevertheless, directly sharing datasets from other tasks exacerbates the distribution shift in offline RL. As a remedy, previous attempts share only partial data that maximizes a conservative value function. However, such attempts are inefficient in calculating the policy constraints and abandon a large portion of the datasets, which could be potentially informative. In this paper, we propose an uncertainty-based MTDS approach that shares the entire dataset without data selection. We further provide a theoretical analysis, which shows that the optimality gap of our method is only related to the expected data coverage of the shared dataset, thus resolving the distribution shift issue in data sharing. Empirically, we construct an MTDS benchmark and collect datasets from three challenging domains. 
The experimental results show that our algorithm outperforms the previous state-of-the-art methods in challenging MTDS problems.","multi-task data sharing, offline reinforcement learning, uncertainty quantification" On the Fast Convergence of Unstable Reinforcement Learning Problems,https://openreview.net/forum?id=j3mm8mci4u,https://openreview.net/pdf?id=j3mm8mci4u,We propose new methods to effectively improve the convergence of policy gradient methods for unstable reinforcement learning problems.,"For many reinforcement learning applications, the system is assumed to be inherently stable, with bounded reward, state, and action spaces. These are key requirements for the convergence of classical reinforcement learning objectives with discounted rewards. Unfortunately, these assumptions do not hold true for many real-world problems such as an unstable linear–quadratic regulator (LQR). In this work, we propose new methods to stabilize and speed up the convergence of unstable reinforcement learning problems with policy gradient methods. We provide theoretical insights on the efficiency of our methods. In practice, our methods achieve good experimental results on multiple examples where the vanilla methods mostly fail to converge due to system instability.","unstable reinforcement learning, LQR, optimization" Iterative Patch Selection for High-Resolution Image Recognition,https://openreview.net/forum?id=QCrw0u9LQ7,https://openreview.net/pdf?id=QCrw0u9LQ7,"We propose a simple, memory-efficient method that selects the most salient patches from a high-resolution image and then aggregates them into a global representation for image recognition.","High-resolution images are prevalent in various applications, such as autonomous driving and computer-aided diagnosis. However, training neural networks on such images is computationally challenging and easily leads to out-of-memory errors even on modern GPUs. We propose a simple method, Iterative Patch Selection (IPS), which decouples the memory usage from the input size and thus enables the processing of arbitrarily large images under tight hardware constraints. IPS achieves this by selecting only the most salient patches, which are then aggregated into a global representation for image recognition. For both patch selection and aggregation, a cross-attention based transformer is introduced, which exhibits a close connection to Multiple Instance Learning. Our method demonstrates strong performance and has wide applicability across different domains, training regimes and image sizes while using minimal accelerator memory. For example, we are able to finetune our model on whole-slide images consisting of up to 250k patches (>16 gigapixels) with only 5 GB of GPU VRAM at a batch size of 16.","high-resolution images, memory-efficient deep learning, multiple instance learning, transformer, image recognition, computer vision" HyperMAML: Few-Shot Adaptation of Deep Models with Hypernetworks,https://openreview.net/forum?id=uAb5lQqdeHd,https://openreview.net/pdf?id=uAb5lQqdeHd,We use Hypernetworks for generating weight updates for novel tasks in Few-Shot learning.,"The aim of Few-Shot learning methods is to train models which can easily adapt to previously unseen tasks, based on small amounts of data. One of the most popular and elegant Few-Shot learning approaches is Model-Agnostic Meta-Learning (MAML). 
The main idea behind this method is to learn the general weights of the meta-model, which are further adapted to specific problems in a small number of gradient steps. However, the model’s main limitation lies in the fact that the update procedure is realized by gradient-based optimisation. In consequence, MAML cannot always modify the weights to the required extent in one or even a few gradient iterations. On the other hand, using many gradient steps results in a complex and time-consuming optimization procedure, which is hard to train in practice, and may lead to overfitting. In this paper, we propose HyperMAML, a novel generalization of MAML, where the training of the update procedure is also part of the model. Namely, in HyperMAML, instead of updating the weights with gradient descent, we use a trainable Hypernetwork for this purpose. Consequently, in this framework, the model can generate significant updates whose range is not limited to a fixed number of gradient steps. Experiments show that HyperMAML outperforms MAML in most cases and performs comparably to other state-of-the-art techniques in a number of standard Few-Shot learning benchmarks.","deep learning, hypernetworks, maml, few-shot learning, meta-learning, adaptation, hypermaml, few-shot, meta, learning, mini-imagenet" Conditional Antibody Design as 3D Equivariant Graph Translation,https://openreview.net/forum?id=LFHFQbjxIiP,https://openreview.net/pdf?id=LFHFQbjxIiP,,"Antibody design is valuable for therapeutic usage and biological research. Existing deep-learning-based methods encounter several key issues: 1) incomplete context for Complementarity-Determining Region (CDR) generation; 2) inability to capture the entire 3D geometry of the input structure; 3) inefficient autoregressive prediction of the CDR sequences. In this paper, we propose Multi-channel Equivariant Attention Network (MEAN), an end-to-end model that is able to co-design 1D sequences and 3D structures of CDRs. To be specific, MEAN formulates antibody design as a conditional graph translation problem by importing extra components including the target antigen and the light chain of the antibody. Then, MEAN resorts to E(3)-equivariant message passing along with a proposed attention mechanism to better capture the geometrical correlation between different components. Finally, it outputs both the 1D sequences and 3D structure via a multi-round progressive full-shot scheme, which is more efficient than previous autoregressive approaches. Our method significantly surpasses state-of-the-art models in sequence and structure modeling, antigen-binding antibody design, and binding affinity optimization. Specifically, the relative improvement over baselines is about 23% in antigen-binding CDR design and 34% for affinity optimization.","conditional antibody generation, equivariant, multi-channel attention" Robust Constrained Reinforcement Learning,https://openreview.net/forum?id=KzfhxLoh6s0,https://openreview.net/pdf?id=KzfhxLoh6s0,,"Constrained reinforcement learning aims to maximize the reward subject to constraints on utilities/costs. However, in practice it is often the case that the training environment is not the same as the test one, due to, e.g., modeling error, adversarial attacks, or non-stationarity, resulting in severe performance degradation and, more importantly, constraint violation in the test environment. 
To address this challenge, we formulate the framework of robust constrained reinforcement learning under model uncertainty, where the MDP is not fixed but lies in some uncertainty set. The goal is twofold: 1) to guarantee that constraints on utilities/costs are satisfied for all MDPs in the uncertainty set, and 2) to maximize the worst-case reward performance over the uncertainty set. We design a robust primal-dual approach, and further develop theoretical guarantees on its convergence, complexity, and robust feasibility. We then investigate a concrete example of the $\delta$-contamination uncertainty set, design an online and model-free algorithm and theoretically characterize its sample complexity. ", Differentiable Meta-Logical Programming,https://openreview.net/forum?id=iUCI3mQ8KkR,https://openreview.net/pdf?id=iUCI3mQ8KkR,We realize a differentiable logical meta interpreter (DLMI) using differentiable forward-chaining reasoning in first-order logic. ,"Deep learning uses an increasing amount of computation and data to solve very specific problems. By stark contrast, human minds solve a wide range of problems using a fixed amount of computation and limited experience. One ability that seems crucial to this kind of general intelligence is meta-reasoning, i.e., our ability to reason about reasoning. To make deep learning do more from less, we propose the differentiable logical meta interpreter (DLMI). The key idea is to realize a meta-interpreter using differentiable forward-chaining reasoning in first-order logic. This directly allows DLMI to reason and even learn about its own operations. This is different from performing object-level deep reasoning and learning, which refers in some way to entities external to the system. In contrast, DLMI is able to reflect or introspect, i.e., to shift from meta-reasoning to object-level reasoning and vice versa. Among many other experimental evaluations, we illustrate this behavior using the novel task of ""repairing Kandinsky patterns"", i.e., how to edit the objects in an image so that it agrees with a given logical concept.","meta interpreter, differentiable forward chaining inference, first order logic" FaceMAE: Privacy-Preserving Face Recognition via Masked Autoencoders,https://openreview.net/forum?id=4Xd_aAqNe7h,https://openreview.net/pdf?id=4Xd_aAqNe7h,A novel method for privacy-preserving face recognition.,"Face recognition, as one of the most successful applications in artificial intelligence, has been widely used in security, administration, advertising, and healthcare. However, the privacy issues of public face datasets have attracted increasing attention in recent years. Previous works simply mask most areas of faces or synthesize samples using generative models to construct privacy-preserving face datasets, which overlook the trade-off between privacy protection and data utility. In this paper, we propose a novel framework, FaceMAE, where face privacy and recognition performance are considered simultaneously. Firstly, randomly masked face images are used to train the reconstruction module in FaceMAE. We tailor the instance relation matching (IRM) module to minimize the distribution gap between real faces and FaceMAE-reconstructed ones. During the deployment phase, we use the trained FaceMAE to reconstruct images from masked faces of unseen identities without extra training. The risk of privacy leakage is measured based on face retrieval between reconstructed and original datasets. 
Experiments show that the identities of reconstructed images are difficult to retrieve. We also conduct extensive privacy-preserving face recognition experiments on several public face datasets (i.e. CASIA-WebFace and WebFace260M). Compared to the previous state of the art, FaceMAE consistently \textbf{reduces the error rate by at least 50\%} on LFW, CFP-FP and AgeDB.",FaceMAE: Privacy-Preserving Face Recognition via Masked Autoencoders Fuzzy Alignments in Directed Acyclic Graph for Non-Autoregressive Machine Translation,https://openreview.net/forum?id=LSz-gQyd0zE,https://openreview.net/pdf?id=LSz-gQyd0zE,"We introduce a fuzzy alignment objective in Directed Acyclic Graph for NAT, setting a new state of the art for NAT on the raw training data.","Non-autoregressive translation (NAT) reduces the decoding latency but suffers from performance degradation due to the multi-modality problem. Recently, the structure of Directed Acyclic Graph has achieved great success in NAT, which tackles the multi-modality problem by introducing dependency between vertices. However, training it with Negative Log Likelihood loss implicitly requires a strict alignment between reference tokens and vertices, weakening its ability to handle multiple translation modalities. In this paper, we hold the view that all paths in the graph are fuzzily aligned with the reference sentence. We do not require the exact alignment but train the model to maximize a fuzzy alignment score between the graph and reference, which takes captured translations in all modalities into account. Extensive experiments on major WMT benchmarks show that our method substantially improves translation performance and increases prediction confidence, setting a new state of the art for NAT on the raw training data.","Machine translation, Non-autoregressive generation, Fuzzy alignment" Efficient Federated Domain Translation ,https://openreview.net/forum?id=uhLAcrAZ9cJ,https://openreview.net/pdf?id=uhLAcrAZ9cJ,,"A central theme in federated learning (FL) is the fact that client data distributions are often not independent and identically distributed (IID), which has strong implications on the training process. While most existing FL algorithms focus on the conventional non-IID setting of class imbalance or missing classes across clients, in practice, the distribution differences could be more complex, e.g., changes in class conditional (domain) distributions. In this paper, we consider this complex case in FL wherein each client has access to only one domain distribution. For tasks such as domain generalization, most existing learning algorithms require access to data from multiple clients (i.e., from multiple domains) during training, which is prohibitive in FL. To address this challenge, we propose a federated domain translation method that generates pseudodata for each client that could be useful for multiple downstream learning tasks. We empirically demonstrate that our translation model is more resource-efficient (in terms of both communication and computation) and easier to train in an FL setting than standard domain translation methods. 
Furthermore, we demonstrate that the learned translation model enables the use of state-of-the-art domain generalization methods in a federated setting, which enhances accuracy and robustness to increases in the synchronization period compared to existing methods.", EIT: Enhanced Interactive Transformer for Sequence Generation,https://openreview.net/forum?id=9RHjy5oHmfe,https://openreview.net/pdf?id=9RHjy5oHmfe,,"In this work, we tackle the head degradation problem in attention. We propose an \textbf{E}nhanced \textbf{I}nteractive \textbf{T}ransformer (\textsc{Eit}) architecture in which the standard multi-head self-attention is replaced with the enhanced multi-head attention (EMHA). EMHA removes the one-to-one mapping constraint among queries and keys in multiple subspaces and allows each query to attend to multiple keys. On top of that, we develop a method to make full use of many-to-many mapping by introducing two interaction models, namely Inner-Subspace Interaction and Cross-Subspace Interaction. Extensive experiments on a wide range of sequence generation tasks (e.g., machine translation, abstractive summarization, and grammar correction) show its superiority, with a very modest increase in model size. ","Transformer, Multi-head self-attention, Sequence Generation, Machine Translation" Single-Stage Open-world Instance Segmentation with Cross-task Consistency Regularization,https://openreview.net/forum?id=l02pjIT6JWy,https://openreview.net/pdf?id=l02pjIT6JWy,"This paper proposed a single-stage open-world instance segmentation framework with a cross-task consistency loss, achieving superior performance.","Open-World Instance Segmentation (OWIS) is an emerging research topic that aims to segment class-agnostic object instances from images. The mainstream approaches use a two-stage segmentation framework, which first locates the candidate object bounding boxes and then performs instance segmentation. In this work, we instead promote a single-stage framework for OWIS. We argue that the end-to-end training process in the single-stage framework can be more convenient for directly regularizing the localization of class-agnostic object pixels. Based on the single-stage instance segmentation framework, we propose a regularization model to predict foreground pixels and use its relation to instance segmentation to construct a cross-task consistency loss. We show that such a consistency loss could alleviate the problem of incomplete instance annotation -- a common problem in the existing OWIS datasets. We also show that the proposed loss lends itself to an effective solution to semi-supervised OWIS, which could be considered an extreme case in which all object annotations are absent for some images. Our extensive experiments demonstrate that the proposed method achieves impressive results in both fully-supervised and semi-supervised settings. Compared to SOTA methods, the proposed method significantly improves the $AP_{100}$ score by 4.75\% in the UVO$\rightarrow$UVO setting and 4.05\% in the COCO$\rightarrow$UVO setting. In the case of semi-supervised learning, our model, learned with only 30\% labeled data, even outperforms its fully-supervised counterpart trained with 50\% labeled data. 
The code will be released soon.","Open world, instance segmentation, Cross-task Consistency Regularization" What can be learnt with wide convolutional neural networks?,https://openreview.net/forum?id=m3QhpKNXU6-,https://openreview.net/pdf?id=m3QhpKNXU6-,theoretical study of generalisation rates for deep CNNs in the kernel regime,"Understanding how convolutional neural networks (CNNs) can efficiently learn high-dimensional functions remains a fundamental challenge. A popular belief is that these models harness the local and hierarchical structure of natural data such as images. Yet, we lack a quantitative understanding of how such structure affects performance, e.g. the rate of decay of the generalisation error with the number of training samples. In this paper, we study deep CNNs in the kernel regime. First, we show that the spectrum of the corresponding kernel inherits the hierarchical structure of the network, and we characterise its asymptotics. Then, we use this result together with generalisation bounds to prove that deep CNNs adapt to the spatial scale of the target function. In particular, we find that if the target function depends on low-dimensional subsets of adjacent input variables, then the rate of decay of the error is controlled by the effective dimensionality of these subsets. Conversely, if the teacher function depends on the full set of input variables, then the error rate is inversely proportional to the input dimension. We conclude by computing the rate when a deep CNN is trained on the output of another deep CNN with randomly-initialised parameters. Interestingly, we find that despite their hierarchical structure, the functions generated by deep CNNs are too rich to be efficiently learnable in high dimension.","hierarchical models, kernel learning, neural tangent kernel, theory of deep learning, generalization, convolutional neural networks" 3D Segmenter: 3D Transformer based Semantic Segmentation via 2D Panoramic Distillation,https://openreview.net/forum?id=4dZeBJ83oxk,https://openreview.net/pdf?id=4dZeBJ83oxk,Distilling knowledge from a strong 2D model to enhance 3D semantic segmentation,"Recently, 2D semantic segmentation has witnessed a significant advancement thanks to the huge number of 2D image datasets available. Therefore, in this work, we propose the first 2D-to-3D knowledge distillation strategy to enhance a 3D semantic segmentation model with the knowledge embedded in the latent space of powerful 2D models. Specifically, unlike standard knowledge distillation, where teacher and student models take the same data as input, we use 2D panoramas properly aligned with corresponding 3D rooms to train the teacher network and use the learned knowledge from the 2D teacher to guide the 3D student. To facilitate our research, we create a large-scale, finely annotated 3D semantic segmentation benchmark, containing voxel-wise semantic labels and aligned panoramas of 5175 scenes. Based on this benchmark, we propose a 3D volumetric semantic segmentation network, which adapts the Video Swin Transformer as the backbone and introduces a skip-connected linear decoder. Achieving state-of-the-art performance, our 3D Segmenter is computationally efficient and only requires $3.8\%$ of the parameters compared to the prior art. 
Our code and data will be released upon acceptance.","3D semantic segmentation, knowledge distillation" Towards Skilled Population Curriculum for MARL,https://openreview.net/forum?id=GbsvQSaJV-6,https://openreview.net/pdf?id=GbsvQSaJV-6,,"Recent advances in multi-agent reinforcement learning (MARL) allow agents to coordinate their behaviors in complex environments. However, common MARL algorithms still suffer from scalability and sparse reward issues. One promising approach to resolve them is automated curriculum learning (ACL), where a student (curriculum learner) trains on tasks of increasing difficulty controlled by a teacher (curriculum generator). Unfortunately, in spite of its success, ACL’s applicability is restricted due to: (1) the lack of a general student framework to deal with the varying number of agents across tasks and the sparse reward problem, and (2) the non-stationarity in the teacher’s task due to the ever-changing student strategies. To remedy these issues, we introduce a novel automatic curriculum learning framework, Skilled Population Curriculum (SPC), which adapts curriculum learning to multi-agent coordination. To be specific, we endow the student with population-invariant communication and a hierarchical skill set. Thus, the student can learn cooperation and behavior skills from distinct tasks with a varying number of agents. In addition, we model the teacher as a contextual bandit conditioned on student policies. As a result, a team of agents can change its size while retaining previously acquired skills. We also analyze the inherent non-stationarity of this multi-agent automatic curriculum teaching problem and provide a corresponding regret bound. Empirical results show that our method improves scalability, sample efficiency, and generalization in multiple MARL environments. The source code and the video can be found at https://sites.google.com/view/marl-spc/.","multi-agent reinforcement learning, multi-agent cooperation" Logit Clipping for Robust Learning against Label Noise,https://openreview.net/forum?id=IJV0augCyk,https://openreview.net/pdf?id=IJV0augCyk,"We propose to clamp the norm of the logit output, which can enhance the noise-robustness of existing loss functions with a theoretical guarantee.","In the presence of noisy labels, designing robust loss functions is critical for securing the generalization performance of deep neural networks. Cross Entropy (CE) loss has been shown not to be robust to noisy labels due to its unboundedness. To alleviate this issue, existing works typically design specialized robust losses with the symmetric condition, which usually lead to underfitting. In this paper, our key idea is to induce a loss bound at the logit level, thus universally enhancing the noise robustness of existing losses. Specifically, we propose logit clipping (LogitClip), which clamps the norm of the logit vector to ensure that it is upper bounded by a constant. In this manner, CE loss equipped with our LogitClip method is effectively bounded, mitigating overfitting to examples with noisy labels. Moreover, we present theoretical analyses to certify the noise-tolerant ability of LogitClip. 
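The clamping operation just described is simple enough to render directly. Below is a minimal NumPy sketch of the idea (rescale each logit vector so its norm is at most an upper bound before computing the loss); the threshold name `tau` and the choice of the L2 norm are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def logit_clip(logits: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Rescale each logit vector so its L2 norm is at most tau.
    Vectors already inside the norm ball are left unchanged."""
    norms = np.linalg.norm(logits, axis=-1, keepdims=True)
    scale = np.minimum(1.0, tau / np.maximum(norms, 1e-12))  # guard zero norm
    return logits * scale

# Toy usage: a batch of 2 examples with 3 classes each.
z = np.array([[3.0, 0.0, 4.0],   # norm 5 -> rescaled down to norm tau
              [0.1, 0.2, 0.1]])  # norm < tau -> unchanged
print(np.linalg.norm(logit_clip(z, tau=1.0), axis=-1))  # [1.0, ~0.245]
```

Because every clipped logit vector lies in a bounded set, the cross-entropy computed on it is bounded as well, which is the property the abstract relies on.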
Extensive experiments show that LogitClip not only significantly improves the noise robustness of CE loss, but also broadly enhances the generalization performance of popular robust losses.","noisy labels, robust loss functions, logit clipping, overfitting" Clifford Neural Layers for PDE Modeling,https://openreview.net/forum?id=okwxL_c4x84,https://openreview.net/pdf?id=okwxL_c4x84,"We introduce neural network layers on composite objects of scalars, vectors, and higher order objects such as bivectors.","Partial differential equations (PDEs) see widespread use in sciences and engineering to describe the simulation of physical processes as scalar and vector fields interacting and coevolving over time. Due to the computationally expensive nature of their standard solution methods, neural PDE surrogates have become an active research topic to accelerate these simulations. However, current methods do not explicitly take into account the relationship between different fields and their internal components, which are often correlated. Viewing the time evolution of such correlated fields through the lens of multivector fields allows us to overcome these limitations. Multivector fields consist of scalar, vector, as well as higher-order components, such as bivectors and trivectors. Their algebraic properties, such as multiplication, addition, and other arithmetic operations, can be described by Clifford algebras. To our knowledge, this paper presents the first usage of such multivector representations together with Clifford convolutions and Clifford Fourier transforms in the context of deep learning. The resulting Clifford neural layers are universally applicable and will find direct use in the areas of fluid dynamics, weather forecasting, and the modeling of physical systems in general. We empirically evaluate the benefit of Clifford neural layers by replacing convolution and Fourier operations in common neural PDE surrogates by their Clifford counterparts on 2D Navier-Stokes and weather modeling tasks, as well as 3D Maxwell equations. For similar parameter count, Clifford neural layers consistently improve generalization capabilities of the tested neural PDE surrogates.","Geometric Deep Learning, PDE modeling, multivector fields, Clifford algebra, Clifford convolution, Clifford Fourier transform" GOOD: Exploring geometric cues for detecting objects in an open world,https://openreview.net/forum?id=W-nZDQyuy8D,https://openreview.net/pdf?id=W-nZDQyuy8D,We propose incorporating geometric cues into open-world object detector training and make significant improvements on various benchmarks.,"We address the task of open-world class-agnostic object detection, i.e., detecting every object in an image by learning from a limited number of base object classes. State-of-the-art RGB-based models suffer from overfitting to the training classes and often fail at detecting novel-looking objects. This is because RGB-based models primarily rely on appearance similarity to detect novel objects and are also prone to overfitting short-cut cues such as textures and discriminative parts. To address these shortcomings of RGB-based object detectors, we propose incorporating geometric cues such as depth and normals, predicted by general-purpose monocular estimators. Specifically, we use the geometric cues to train an object proposal network for pseudo-labeling unannotated novel objects in the training set. 
Our resulting Geometry-guided Open-world Object Detector (GOOD) significantly improves detection recall for novel object categories and already performs well with only a few training classes. Using a single ``person'' class for training on the COCO dataset, GOOD surpasses SOTA methods by 5.0% AR@100, a relative improvement of 24%.","open world, object detection, geometric cues, mid-level representations" Enhancing Robustness of Deep Networks Based on a Two-phase Model of Their Training with Noisy Labels,https://openreview.net/forum?id=ay4xkpMnyE,https://openreview.net/pdf?id=ay4xkpMnyE,,"In this study, we explicitly model the learning behavior of deep neural networks (DNNs) trained with noisy labels in image classification. Specifically, we show theoretically and experimentally that the training process can be divided into two phases: a learning phase in which the outputs of DNNs converge to a hidden noise distribution regardless of whether the training samples are clean or noisy; and a memorization phase in which DNNs start to overfit until the output for each training sample converges to its corresponding noisy one-hot label. This two-phase model enables us to develop two simple yet accurate methods that rely on the outputs of DNNs to estimate the noise transition matrix (NTM). It also enables us to resolve a pitfall of many existing methods for robust training under noisy labels based on the small-loss assumption, namely that clean samples have smaller loss than noisy samples in the early training phase. We show that these methods fail when the NTM is not a column diagonally-maximum matrix and that this pitfall can be fixed by modifying the small-loss assumption based on our NTM estimation methods.","Image classification, Noisy label, Robust training" Bringing Saccades and Fixations into Self-supervised Video Representation Learning,https://openreview.net/forum?id=PQXP4WZNcM,https://openreview.net/pdf?id=PQXP4WZNcM,"In this paper, we propose a self-supervised video representation learning method by taking inspiration from cognitive science and neuroscience on human visual perception. ","In this paper, we propose a self-supervised video representation learning (video SSL) method by taking inspiration from cognitive science and neuroscience on human visual perception. Different from previous methods that mainly start from the inherent properties of videos, we argue that humans learn to perceive the world through the self-awareness of the semantic change or consistency in the input stimuli in the absence of labels, accompanied by representation reorganization during the post-learning rest periods. To this end, we first exploit the presence of saccades as an indicator of semantic change in a contrastive learning framework to mimic the self-awareness in human representation learning, where the saccades are generated without eye-tracking data. Second, we model the semantic consistency by minimizing the prediction error between the predicted and the true state of another time point during a fixation. Third, we later incorporate prototypical contrastive learning to reorganize the learned representations such that perceptually similar representations are associated more closely. Compared to previous counterparts, our method can capture finer-grained semantics from video instances, and the associations among similar ones are further strengthened. 
Experiments show that the proposed bio-inspired video SSL method significantly improves the Top-1 video retrieval accuracy on UCF101 and achieves superior performance on downstream tasks such as action recognition under comparable settings.","Self-supervised learning, video self-supervised learning, bio-inspired" Improve learning combining crowdsourced labels by weighting Areas Under the Margin,https://openreview.net/forum?id=dGzgbdQbgwm,https://openreview.net/pdf?id=dGzgbdQbgwm,"We introduced the weighted Areas Under the Margin to identify ambiguous tasks in crowdsourced learning scenarios","In supervised learning -- for instance in image classification -- modern massive datasets are commonly labelled by a crowd of workers. The obtained labels in this crowdsourcing setting are then aggregated for training. The aggregation step generally leverages a per-worker trust score. Yet, such worker-centric approaches disregard the ambiguity of each task. Some intrinsically ambiguous tasks might even fool expert workers, which could eventually be harmful to the learning step. In a standard supervised learning setting -- with one label per task and balanced classes -- the Area Under the Margin (AUM) statistic is tailored to identify mislabeled data. We adapt the AUM to identify ambiguous tasks in crowdsourced learning scenarios, introducing the Weighted AUM (WAUM). The WAUM is an average of AUMs weighted by worker and task dependent scores. We show that the WAUM can help discard ambiguous tasks from the training set, leading to better generalization or calibration performance. We report improvements with respect to feature-blind aggregation strategies both for simulated settings and for the CIFAR-10H crowdsourced dataset.","crowdsourcing, ambiguity, area under the margin, aggregation, noisy labels" Distraction is All You Need For Fairness,https://openreview.net/forum?id=etPgCVMh8IB,https://openreview.net/pdf?id=etPgCVMh8IB,,"Bias in training datasets must be managed for various groups in classification tasks to ensure parity or equal treatment. With the recent growth in artificial intelligence models and their expanding role in automated decision-making, ensuring that these models are not biased is vital. There is an abundance of evidence suggesting that these models could contain or even amplify the bias present in the data on which they are trained, inherent to their objective function and learning algorithms. Many researchers direct their attention to this issue in different directions, namely, changing data to be statistically independent, adversarial training for restricting the capabilities of a particular competitor who aims to maximize parity, etc. These methods result in information loss and do not provide a suitable balance between accuracy and fairness, or do not ensure that biases are limited during training. To this end, we propose a powerful strategy for training deep learning models called the Distraction module, which we theoretically prove effective in preventing bias from affecting the classification results. This method can be utilized with different data types (e.g., tabular, image, graph). We demonstrate the potency of the proposed method by testing it on the UCI Adult and Heritage Health datasets (tabular), the POKEC-Z, POKEC-N and NBA datasets (graph), and the CelebA dataset (vision). 
Using state-of-the-art methods proposed in the fairness literature for each dataset, we show that our model is superior to these methods in minimizing bias while maintaining accuracy.","Fairness, Deep Learning, Neural Networks, Adversarial Learning" Learning Diverse and Effective Policies with Non-Markovian Rewards,https://openreview.net/forum?id=w7ds3UKtQJl,https://openreview.net/pdf?id=w7ds3UKtQJl,"We propose a diversity matrix to quantify policy diversity and theoretically prove that if the diversity matrix is positive definite, then the diversity of policies can be achieved without sacrificing their effectiveness.","Learning a set of diverse and high-quality policies is a difficult problem in Reinforcement Learning since the diversity of policies must be achieved without dampening their effectiveness. This problem becomes more challenging when the rewards are non-Markovian, i.e., the rewards depend on the history of states and actions, and are quite sparse and returned over a long period. The sparse supervision signals and the non-Markovian properties of the rewards hinder the learning of policy embeddings and thus the learning of diverse and high-quality policies. In this paper, we propose to use a diversity matrix to quantify policy diversity and theoretically prove that if the diversity matrix is positive definite, then the diversity of policies can be achieved without sacrificing their effectiveness. The policy diversity matrix stems from policy embeddings. To obtain high-quality embeddings, we adopt a transformer to capture mutual dependencies between states and actions and design pseudo tasks to overcome sparse rewards. Experimental results show that our method can achieve a set of policies with more effective diversity and better performance than multiple recently proposed baseline methods in a variety of non-Markovian and Markovian environments.","policy diversity, non-Markovian Rewards, reinforcement learning" Emergent world representations: Exploring a sequence model trained on a synthetic task,https://openreview.net/forum?id=DeG07_TcZvT,https://openreview.net/pdf?id=DeG07_TcZvT,,"Language models show a surprising range of capabilities, but the source of their apparent competence is unclear. Do these networks just memorize a collection of surface statistics, or do they rely on internal representations of the process that generates the sequences they see? We investigate this question by applying a variant of the GPT model to the task of predicting legal moves in a simple board game, Othello. Although the network has no a priori knowledge of the game or its rules, we uncover evidence of an emergent nonlinear internal representation of the board state. Interventional experiments indicate this representation can be used to control the output of the network and create ""latent saliency maps"" that can help explain predictions in human terms.","world representation, GPT" "Programmatically Grounded, Compositionally Generalizable Robotic Manipulation",https://openreview.net/forum?id=rZ-wylY5VI,https://openreview.net/pdf?id=rZ-wylY5VI,"We parse and execute semantically grounded neural programs for robotic manipulation, enabling better zero-shot and compositional generalization to new manipulation behaviors.","Robots operating in the real world require both rich manipulation skills and the ability to semantically reason about when to apply those skills. 
Towards this goal, recent works have integrated semantic representations from large-scale pretrained vision-language (VL) models into manipulation models, imparting them with more general reasoning capabilities. However, we show that the conventional {\it pretraining-finetuning} pipeline for integrating such representations entangles the learning of domain-specific action information and domain-general visual information, leading to less data-efficient training and poor generalization to unseen objects and tasks. To this end, we propose \ours, a {\it modular} approach to better leverage pretrained VL models by exploiting the syntactic and semantic structures of an input language instruction. Our framework uses a semantic parser to recover an executable program that is composed of functional modules grounded on vision and action across different modalities. Each functional module is realized as a combination of deterministic computation and learnable neural networks. The execution of the program produces parameters for general manipulation primitives for a robotic end-effector. The entire modular network can be trained with end-to-end imitation learning objectives. Experiments show that our model successfully disentangles action and perception, translating to improved zero-shot and compositional generalization in a variety of manipulation behaviors. Project webpage at: \url{https://progport.github.io}","Vision-Language-Action Grounding, Zero-Shot Generalization, Compositional Generalization, Neurosymbolic Learning" M$^3$Video: Masked Motion Modeling for Self-Supervised Video Representation Learning,https://openreview.net/forum?id=_lyO4HOLDF,https://openreview.net/pdf?id=_lyO4HOLDF,"We propose a masked motion modeling task, where the model is asked to predict the motion of masked moving objects, for self-supervised video representation learning","We study self-supervised video representation learning that seeks to learn video features from unlabeled videos, which is widely used for video analysis as labeling videos is labor-intensive. Current methods often mask some video regions and then train a model to reconstruct spatial information in these regions (\eg, original pixels). However, the model can easily reconstruct this information by considering content in a single frame. As a result, it may neglect to learn the interactions between frames, which are critical for video analysis. In this paper, we present a new self-supervised learning task, called Masked Motion Modeling (M$^3$Video), for learning representations by forcing the model to predict the motion of moving objects in the masked regions. To generate motion targets for this task, we track the objects using optical flow. The motion targets consist of position transitions and shape changes of the tracked objects. To predict these trajectory motion targets, the model has to consider multiple frames comprehensively. Besides, to help the model capture fine-grained motion details, we force the model to predict trajectory motion targets at high temporal resolution from a video at low temporal resolution. After pre-training using our M$^3$Video task, the model is able to anticipate fine-grained motion details even when taking a sparsely sampled video as input. We conduct extensive experiments on four benchmark datasets to evaluate the effectiveness of our M$^3$Video. 
Remarkably, when pre-training for 400 epochs, we improve the accuracy from 67.6\% to 69.2\% and from 78.8\% to 79.7\% on the Something-Something V2 and Kinetics-400 datasets, respectively.","Self-supervised Video Representation Learning, Action Recognition, Masked Motion Modeling" ObPose: Leveraging Pose for Object-Centric Scene Inference and Generation in 3D,https://openreview.net/forum?id=tc2UP4qhplB,https://openreview.net/pdf?id=tc2UP4qhplB,We present an object-centric scene inference and generation model that learns 3D structured latent representations from RGB-D scenes.,"We present ObPose, an unsupervised object-centric inference and generation model which learns 3D-structured latent representations from RGB-D scenes. Inspired by prior art in 2D representation learning, ObPose considers a factorised latent space, separately encoding object location (where) and appearance (what). ObPose further leverages an object's pose (i.e. location and orientation), defined via a minimum volume principle, as a novel inductive bias for learning the where component. To achieve this, we propose an efficient, voxelised approximation approach to recover the object shape directly from a neural radiance field (NeRF). As a consequence, ObPose models each scene as a composition of NeRFs, richly representing individual objects. To evaluate the quality of the learned representations, ObPose is evaluated quantitatively on the YCB and CLEVR datasets for unsupervised scene segmentation, outperforming the current state-of-the-art in 3D scene inference (ObSuRF) by a significant margin. Generative results qualitatively demonstrate that the same ObPose model can both generate novel scenes and flexibly edit the objects in them. These capacities again reflect the quality of the learned latents and the benefits of disentangling the where and what components of a scene. Key design choices made in the ObPose encoder are validated with ablations. ","Object-centric scene inference, Object-centric representation learning, Scene generation, 3D" FedCL: Critical Learning Periods-aware Adaptive Client Selection in Federated Learning,https://openreview.net/forum?id=QCtizuT48D,https://openreview.net/pdf?id=QCtizuT48D,,"Federated learning (FL) is a distributed optimization paradigm that learns from data samples distributed across a number of clients. Adaptive client selection that is cognizant of the training progress of clients has become a major trend to improve FL efficiency, but it is not yet well understood. Most existing FL methods such as FedAvg and its state-of-the-art variants implicitly assume that all learning phases during the FL training process are equally important. Unfortunately, this assumption has been revealed to be invalid due to recent findings on critical learning (CL) periods, in which small gradient errors may lead to an irrecoverable deficiency in final test accuracy. In this paper, we develop FedCL, a CL-periods-aware FL framework, and reveal that when existing FL methods are adaptively augmented with CL periods and client selection is guided by the discovered CL periods, the resulting performance is significantly improved. Experiments based on various machine learning models and datasets validate that the proposed FedCL framework consistently achieves improved model accuracy while maintaining comparable or even better communication efficiency as compared to state-of-the-art methods, demonstrating a promising and easily adopted method for tackling the heterogeneity of FL training. 
","Critical Learning Periods, Federated Learning, Client Selection" TabCaps: A Capsule Neural Network for Tabular Data Classification with BoW Routing,https://openreview.net/forum?id=OgbtSLESnI,https://openreview.net/pdf?id=OgbtSLESnI,We proposed a capsule neural network for tabular data classification.,"The instances in a table are represented by a collection of heterogeneous tabular features. Previous work often made predictions for such instances in a paradigm that processed tabular features as operating units, which requires to well cope with the heterogeneity. In this paper, we propose to encapsulate all tabular features of an instance into vectorial features and process them collectively rather than have to deal with individual ones, which directly captures the representations at the instance level and benefits robust performances. Specifically, we adopt ""capsules"" to organize tabular features of the instance into vectorial features, and devise a novel capsule neural network called TabCaps to process the vectorial features for classification. In TabCaps, a tabular instance is respectively encoded into several vectorial features by some optimizable multivariate Gaussian kernels in the primary capsule layer, where each vectorial feature represents a specific ""profile"" of the input instance and is transformed into senior capsule layer under the guidance of a novel straightforward routing algorithm. The design of routing algorithm is motivated by the Bag-of-Words (BoW) model, which performs capsule feature grouping straightforwardly and efficiently, in lieu of the computationally complex clustering of previous routing algorithms. Comprehensive experiments show that TabCaps achieves competitive and robust performances in tabular data classification tasks.",capsule neural network Learning Instance-Solution Operator For Optimal Control,https://openreview.net/forum?id=nfMQJ4pz2Y,https://openreview.net/pdf?id=nfMQJ4pz2Y,,"Optimal control problems (OCPs) aim at finding a control function for a dynamical system such that a cost functional is optimized. These problems are central to physical system research in both academia and industry. In this paper, we propose a novel instance-solution operator learning perspective, which solves OCPs in a one-shot manner with no dependence on the explicit expression of dynamics or iterative optimization processes. The design is in principle endowed with substantial speedup in running time, and the model reusability is guaranteed by high-quality in- and out-of-distribution generalization. We theoretically validate the perspective by presenting the approximation bounds for the instance-solution operator learning. Extensive experiments on 6 physical systems verify the effectiveness and efficiency of our approach. The source code will be made publicly available.", CorruptEncoder: Data Poisoning Based Backdoor Attacks to Contrastive Learning,https://openreview.net/forum?id=gZ1UIqjW1Q,https://openreview.net/pdf?id=gZ1UIqjW1Q,We propose data poisoning based backdoor attacks to contrastive learning that achieves the state-of-the-art performance.,"Contrastive learning (CL) pre-trains general-purpose encoders using an unlabeled pre-training dataset, which consists of images (called single-modal CL) or image-text pairs (called multi-modal CL). CL is vulnerable to data poisoning based backdoor attacks (DPBAs), in which an attacker injects poisoned inputs into the pre-training dataset so the pre-trained encoder is backdoored. 
However, existing DPBAs achieve limited effectiveness. In this work, we propose CorruptEncoder, a new DPBA against CL. Our experiments show that CorruptEncoder substantially outperforms existing DPBAs for both single-modal and multi-modal CL. Moreover, we propose a defense, called localized cropping, to defend single-modal CL against DPBAs. Our results show that our defense can reduce the effectiveness of DPBAs, but it sacrifices the utility of the encoder, highlighting the need for new defenses. We will release our code upon paper acceptance.","Backdoor Attack, Contrastive Learning" Learning Transferable Spatiotemporal Representations from Natural Script Knowledge,https://openreview.net/forum?id=dOHZkdrtnrD,https://openreview.net/pdf?id=dOHZkdrtnrD,"A video pre-training method that learns transferable spatiotemporal representations from large-scale uncurated data, exhibiting strong out-of-the-box capabilities.","Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years. Despite some progress, existing methods are mostly limited to highly curated datasets (e.g., K400) and exhibit unsatisfactory out-of-the-box representations. We argue that this is because they only capture pixel-level knowledge rather than spatiotemporal commonsense, which is far from cognition-level video understanding. Inspired by the great success of image-text pre-training (e.g., CLIP), we take the first step to exploit language semantics to boost transferable spatiotemporal representation learning. We introduce a new pretext task, Turning to Video for Transcript Sorting (TVTS), which sorts shuffled ASR scripts by attending to learned video representations. We do not rely on descriptive captions and learn purely from video, i.e., leveraging the natural transcribed speech knowledge to provide noisy but useful semantics over time. Furthermore, rather than the simple concept learning in vision-caption contrast, we encourage cognition-level temporal commonsense reasoning via narrative reorganization. These advantages enable our model to contextualize what is happening, as humans do, and to apply seamlessly to large-scale uncurated video data in the real world. Note that our method differs from ones designed for video-text alignment (e.g., Frozen) and multimodal representation learning (e.g., Merlot). Our method demonstrates strong out-of-the-box spatiotemporal representations on diverse video benchmarks, e.g., +13.6% gains over VideoMAE on SSV2 via linear probing.","Spatiotemporal Representation Learning, Video Pre-training, Action Recognition" Heterogeneous Continual Learning,https://openreview.net/forum?id=f7VHa2mwDEq,https://openreview.net/pdf?id=f7VHa2mwDEq,A novel framework and a solution to tackle the continual learning problem with progressive evolution of neural networks.,"We propose a novel framework and a solution to tackle the continual learning (CL) problem with progressive evolution of neural networks. Most CL methods focus on adapting a single network to a new task/class by modifying its weights. However, with rapid progress in architecture design, the problem of adapting existing solutions to novel architectures becomes relevant. For the first time, we propose Heterogeneous Continual Learning (HCL) to address this problem, where a wide range of evolving network architectures emerge continually together with novel data/tasks. 
As a solution, we build on top of the distillation family of techniques and adapt it to a new setting where a weaker model takes the role of the teacher while a new, stronger architecture acts as the student. Furthermore, we consider a setup of limited access to previous data and propose Quick Deep Inversion (QDI) to recover prior task visual features to support knowledge transfer. QDI significantly reduces computational costs compared to previous solutions and improves overall performance. In summary, we propose a new setup for CL with a modified knowledge distillation paradigm and design a quick data inversion method to enhance distillation. Our evaluation on various benchmarks shows that the proposed method can successfully progress across various networks while outperforming state-of-the-art methods, with a 2x improvement in accuracy.","Continual learning, representational learning, deep learning, model progression" Decentralized Online Bandit Optimization on Directed Graphs with Regret Bounds,https://openreview.net/forum?id=Js4vqB4bWVh,https://openreview.net/pdf?id=Js4vqB4bWVh,,"We consider a decentralized multiplayer game, played over $T$ rounds, with a leader-follower hierarchy described by a directed acyclic graph. For each round, the graph structure dictates the order of the players and how players observe the actions of one another. By the end of each round, all players receive a joint bandit-reward based on their joint action that is used to update the player strategies towards the goal of minimizing the joint pseudo-regret. We present a learning algorithm inspired by the single-player multi-armed bandit problem and show that it achieves sub-linear joint pseudo-regret in the number of rounds for both adversarial and stochastic bandit rewards. Furthermore, we quantify the cost incurred due to the decentralized nature of our problem compared to the centralized setting. ","Bandit optimization, multi-agent learning, decentralized learning, joint bandit-rewards" BAMBI: Vertical Federated Bilevel Optimization with Privacy-Preserving and Computation Efficiency,https://openreview.net/forum?id=pO7KggcbMiP,https://openreview.net/pdf?id=pO7KggcbMiP,"To the best of our knowledge, this is the first work on bilevel optimization under the VFL setting.","Vertical federated learning (VFL) has shown promise in meeting the vast demands of multi-party privacy-preserving learning. However, existing VFL methods are not applicable to popular machine learning tasks falling under bilevel programming, such as hyper-representation learning and hyperparameter tuning. A desirable solution is adopting bilevel optimization (BO) into VFL, but off-the-shelf BO methods are shackled by the difficulty of computing hypergradients in a privacy-preserving and computation-efficient manner under the VFL setting. To address this challenge, this paper proposes a stochastic Bilevel optimizAtion Method with a desirable JacoBian estImator (BAMBI), which constructs a novel zeroth-order (ZO) estimator to locally approximate the Jacobian matrix. This approximation enables BAMBI to compute the hypergradients in a privacy-preserving and computation-efficient manner. We prove that BAMBI converges at a rate of $\mathcal{O}(1/\sqrt{K})$ ($K$ is the total number of upper-level iterations) under the nonconvex-strongly-convex setting, which covers most practical scenarios. 
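The abstract does not spell out BAMBI's estimator, but the generic two-point zeroth-order gradient estimator that this line of work builds on can be sketched as follows. Everything here (function names, the smoothing radius `mu`, the number of random directions) is an illustrative assumption, not the paper's construction.

```python
import numpy as np

def zo_gradient(f, x: np.ndarray, mu: float = 1e-3, n_dirs: int = 20,
                rng=np.random.default_rng(0)) -> np.ndarray:
    """Two-point zeroth-order estimate of grad f(x): average the directional
    finite differences along random Gaussian directions. Only function
    evaluations are needed, never analytic gradients, which is what makes
    such estimators attractive when gradients cannot be shared."""
    g = np.zeros_like(x)
    for _ in range(n_dirs):
        u = rng.normal(size=x.size)
        g += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return g / n_dirs

# Toy usage: f(x) = ||x||^2 has gradient 2x.
f = lambda x: float(x @ x)
x0 = np.array([1.0, -2.0, 0.5])
print(zo_gradient(f, x0))  # roughly [2, -4, 1], up to sampling noise
```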
This convergence rate is comparable with that of algorithms without the ZO estimator, which justifies our advantage in privacy preservation without sacrificing the convergence rate. Moreover, we design a BAMBI-DP method to further mitigate concerns about label privacy by leveraging the differential privacy (DP) technique. Extensive experiments fully support our algorithms. The code will be released publicly. To the best of our knowledge, this is the first work on bilevel optimization under the VFL setting.","Vertical federated learning, zeroth-order estimation" Revitalize Region Feature for Democratizing Video-language Pre-training of Retrieval,https://openreview.net/forum?id=8JRQza2MaO4,https://openreview.net/pdf?id=8JRQza2MaO4,,"Recent dominant methods for video-language pre-training (VLP) learn transferable representations from the raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language retrieval. Despite the impressive results, VLP research becomes extremely expensive with the need for massive data and a long training time, preventing further explorations. In this work, we revitalize region features of sparsely sampled video clips to significantly reduce both spatial and temporal visual redundancy, democratizing VLP research while at the same time achieving state-of-the-art results. Specifically, to fully explore the potential of region features, we introduce a novel bidirectional region-word alignment regularization that properly optimizes the fine-grained relations between regions and certain words in sentences, eliminating the domain/modality disconnections between pre-extracted region features and text. Extensive results of downstream video-language retrieval tasks on four datasets demonstrate the superiority of our method on both effectiveness and efficiency, e.g., our method achieves competitive results with 80% less data and 85% less pre-training time compared to the most efficient VLP method so far.", Local Attention Layers for Vision Transformers,https://openreview.net/forum?id=tm8p3x5Rf8N,https://openreview.net/pdf?id=tm8p3x5Rf8N,,"Attention layers in transformer networks have contributed to state-of-the-art results on many vision tasks. Still, attention layers leave room for improvement because relative position information is not learned, and locality constraints are typically not enforced. To mitigate both issues, we propose a convolution-style attention layer, the LA-layer, as a replacement for traditional attention layers. LA-layers implicitly learn the position information in a convolutional manner. Given an input feature map, keys in the kernel region deform in a designated constrained region, which results in a larger receptive field with locality constraints. Queries and keys are processed by a novel aggregation function that outputs attention weights for the values. The final result is an aggregation of the attention weights and values. In our experiments, we replace ResNet's convolutional layers with LA-layers and address image recognition, object detection and instance segmentation tasks. We consistently demonstrate performance gains with LA-layers over the state-of-the-art, despite having fewer floating point operations and training parameters. These results suggest that LA-layers more effectively and efficiently extract features. 
They can replace convolutional and attention layers across a range of networks.", MESSAGENET: MESSAGE CLASSIFICATION USING NATURAL LANGUAGE PROCESSING AND META-DATA,https://openreview.net/forum?id=PFUIHZGE4DS,https://openreview.net/pdf?id=PFUIHZGE4DS,We propose a deep neural network based on blocks for message classification using meta-data inputs,"In this paper we propose a new Deep Learning (DL) approach for message classification. Our method is based on state-of-the-art Natural Language Processing (NLP) building blocks, combined with a novel technique for infusing the meta-data input that is typically available in messages, such as sender information, timestamps, attached images, audio, affiliations, and more. As we demonstrate throughout the paper, going beyond the mere text by leveraging all available channels in the message could yield an improved representation and higher classification accuracy. To achieve message representation, each type of input is processed in a dedicated block in the neural network architecture that is suitable for the data type. Such an implementation enables training all blocks together simultaneously and forming cross-channel features in the network. We show in the Experiments Section that in some cases a message’s meta-data holds additional information that cannot be extracted from the text alone, and that using this information achieves better performance. Furthermore, we demonstrate that our multi-modality block approach outperforms other approaches for injecting the meta-data into the text classifier.",Message classification · Meta data injection · Deep learning · Natural language processing Koopman neural operator for learning non-linear partial differential equations,https://openreview.net/forum?id=kciCTrtbzVl,https://openreview.net/pdf?id=kciCTrtbzVl,,"The lack of analytic solutions for diverse partial differential equations (PDEs) has given rise to a series of computational techniques for numerical solutions. In machine learning, many recent advances in solver design have been achieved by developing neural operators, a family of mesh-free approximators of the infinite-dimensional operators that map between different parameterization spaces of equation solutions. Although neural operators exhibit generalization capacities for learning an entire PDE family simultaneously, they become less accurate and less explainable when learning long-term behaviours of non-linear PDE families. In this paper, we propose the Koopman neural operator (KNO), a new neural operator, to overcome these challenges. With the same objective of learning an infinite-dimensional mapping between Banach spaces that serves as the solution operator of the target PDE family, our approach differs from existing models by formulating the equation solution as a non-linear dynamic system. By approximating the Koopman operator, an infinite-dimensional linear operator governing all possible observations of the dynamic system, to act on the flow mapping of the dynamic system, we can equivalently learn the solution of an entire non-linear PDE family by solving simple linear prediction problems. 
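The "simple linear prediction" viewpoint can be made concrete with a dynamic-mode-decomposition-style least-squares fit, a common way to approximate the Koopman operator from snapshot pairs. This generic sketch is not KNO's neural architecture; in particular, the lifting map is taken to be the identity here for simplicity, whereas a learned lifting would normally be used.

```python
import numpy as np

def fit_linear_operator(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Least-squares fit of K such that Y ~ K X, where columns of X are
    observations g(x_t) and columns of Y are g(x_{t+1}). With a suitable
    lifting g, K approximates the (finite-section) Koopman operator."""
    return Y @ np.linalg.pinv(X)

# Toy usage: a known linear system x_{t+1} = A x_t is recovered exactly.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
X = np.random.default_rng(0).normal(size=(2, 50))  # 50 snapshot states
Y = A @ X                                          # their one-step successors
K = fit_linear_operator(X, Y)
print(np.allclose(K, A))                           # True
print(np.linalg.matrix_power(K, 10) @ X[:, :1])    # 10-step linear rollout
```

Long-term prediction then reduces to repeated multiplication by K, which is the efficiency argument behind Koopman-based formulations.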
In zero-shot prediction and long-term prediction experiments on representative PDEs (e.g., the Navier-Stokes equation), KNO exhibits notable advantages in breaking the tradeoff between accuracy and efficiency (e.g., model size), where previous state-of-the-art models are limited.","Neural Operator, Koopman Theory, Partial Differential Equations, Dynamic System" Regularizing hard examples improves robustness,https://openreview.net/forum?id=5L1ctJ223ML,https://openreview.net/pdf?id=5L1ctJ223ML,We study the negative effect of hard examples on generalization in adversarial training and propose a new method to mitigate the effect of hard examples.,"Recent studies have validated that pruning hard-to-learn examples from training improves the generalization performance of neural networks (NNs). In this study, we investigate this intriguing phenomenon—the negative effect of hard examples on generalization—in adversarial training. Particularly, we theoretically demonstrate that the increase in the difficulty of hard examples in adversarial training is significantly greater than the increase in the difficulty of easy examples. Furthermore, we verify that hard examples are only fitted through memorization of the label in adversarial training and that the memorization of hard examples is attributed to the significant increase in their difficulty. We find that this increased difficulty causes hard examples to function as label-corrupted data in adversarial training, thereby leading to their memorization and a deterioration of robustness. Based upon these observations, we propose a new approach, difficulty proportional label smoothing (DPLS), to mitigate the negative effect of hard examples, thereby improving the adversarial robustness of NNs. Notably, our experimental results indicate that our method can successfully leverage hard examples while circumventing the negative effect.","deep learning, adversarial robustness, adversarial examples" Universal approximation and model compression for radial neural networks ,https://openreview.net/forum?id=lJX9okHBVMb,https://openreview.net/pdf?id=lJX9okHBVMb,,"We introduce a class of fully-connected neural networks whose activation functions, rather than being pointwise, rescale feature vectors by a function depending only on their norm. We call such networks radial neural networks, extending previous work on rotation equivariant networks that considers rescaling activations in less generality. We prove universal approximation theorems for radial neural networks, including in the more difficult cases of bounded widths and unbounded domains. Our proof techniques are novel, distinct from those in the pointwise case. Additionally, radial neural networks exhibit a rich group of orthogonal change-of-basis symmetries on the vector space of trainable parameters. Factoring out these symmetries leads to a practical lossless model compression algorithm. 
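The defining operation of a radial layer (rescale a feature vector by a function of its norm alone) admits a very small sketch. The particular rescaling function `h` below is an arbitrary choice for illustration, since the definition only requires dependence on the norm; the equivariance check at the end illustrates the orthogonal change-of-basis symmetry mentioned in the abstract.

```python
import numpy as np

def radial_activation(x: np.ndarray, h=np.tanh) -> np.ndarray:
    """Radial rescaling: x -> h(||x||) * x / ||x||. The direction of each
    feature vector is preserved; only its length is transformed, which is
    why the activation commutes with orthogonal changes of basis."""
    r = np.linalg.norm(x, axis=-1, keepdims=True)
    return np.where(r > 0, h(r) * x / np.maximum(r, 1e-12), x)

# Toy usage: an orthogonal transform Q commutes with the activation.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix
x = rng.normal(size=(3, 4))                   # batch of 3 feature vectors
print(np.allclose(radial_activation(x @ Q.T),
                  radial_activation(x) @ Q.T))  # True
```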
Optimization of the compressed model by gradient descent is equivalent to projected gradient descent for the full model.","universal approximation, model compression, radial functions, parameter space symmetries, projected gradient descent" Momentum Diminishes the Effect of Spectral Bias in Physics-Informed Neural Networks,https://openreview.net/forum?id=YrmoVzxBLSa,https://openreview.net/pdf?id=YrmoVzxBLSa,,"Physics-informed neural network (PINN) algorithms have shown promising results in solving a wide range of problems involving partial differential equations (PDEs). However, PINNs often fail to converge to desirable solutions, even for the simplest PDEs, when the target function contains high-frequency modes, due to a phenomenon known as spectral bias. In the present work, we exploit neural tangent kernels (NTKs) to investigate the training dynamics of PINNs evolving under stochastic gradient descent with momentum (SGDM). We demonstrate that SGDM significantly reduces the effect of spectral bias. We have also examined why training a model via the Adam optimizer can accelerate the convergence while reducing the spectral bias. Moreover, our numerical experiments have confirmed that wide-enough networks using SGDM or Adam still converge to desirable solutions, even in the presence of high-frequency features.", MULTILEVEL XAI: VISUAL AND LINGUISTIC BONDED EXPLANATIONS,https://openreview.net/forum?id=ngg86sSNDn,https://openreview.net/pdf?id=ngg86sSNDn,"We propose a novel XAI methodology to explain DNN predictions in a multilevel manner (i.e., visual and linguistic) without requiring per-image annotations.","Applications of deep neural networks are booming in more and more fields but lack transparency due to their black-box nature. Explainable Artificial Intelligence (XAI) is therefore of paramount importance, where strategies are proposed to understand how these black-box models function. The research so far mainly focuses on producing, for example, class-wise saliency maps, highlighting parts of a given image that affect the prediction the most. However, this does not fully represent the way humans explain their reasoning, and validating these maps is quite complex, generally requiring subjective interpretation. In this article, we approach XAI differently by proposing a new XAI methodology in a multilevel (i.e., visual and linguistic) manner. By leveraging the interplay between the learned representations, i.e., image features and linguistic attributes, the proposed approach can provide salient attributes and attribute-wise saliency maps, which are far more intuitive than the class-wise maps, without requiring per-image ground-truth human explanations. It introduces self-interpretable attributes to overcome the current limitations in XAI and bring XAI towards a human-like level. The proposed architecture is simple to use and can reach surprisingly good performance in both prediction and explainability for deep neural networks thanks to the low-cost per-class attributes.","Deep neural networks, Black box, Explainable Artificial Intelligence, Saliency maps" Efficient Evaluation of Adversarial Robustness for Deep Hashing based Retrieval,https://openreview.net/forum?id=4I3vW2sInc,https://openreview.net/pdf?id=4I3vW2sInc,,"Deep hashing has been extensively applied to massive image retrieval due to its efficiency and effectiveness. Recently, several adversarial attacks have been presented to reveal the vulnerability of deep hashing models to adversarial examples. 
However, existing attack methods suffer from degraded performance or inefficiency because they underutilize the semantic relations between original samples or spend considerable time learning from these samples. In this paper, we propose a novel Pharos-guided Attack, dubbed \textbf{PgA}, to evaluate the adversarial robustness of deep hashing networks efficiently. Specifically, we design the \textit{pharos code} to represent the semantics of the benign image, which preserves the similarity with semantically related samples and dissimilarity with irrelevant examples. It is proven that we can quickly calculate the pharos code via a simple math formula rather than time-consuming iterative procedures. Thus, PgA can directly conduct a reliable and efficient attack on deep hashing-based retrieval by maximizing the similarity between the hash code of the adversarial example and the pharos code. Extensive experiments on the benchmark datasets verify that the proposed algorithm outperforms the prior state of the art in both attack strength and speed.","Adversarial Attack, Adversarial Training, Deep Hashing, Similarity Retrieval" An Exact Poly-Time Membership-Queries Algorithm for Extracting a Three-Layer ReLU Network,https://openreview.net/forum?id=-CoNloheTs,https://openreview.net/pdf?id=-CoNloheTs,A first polynomial-time algorithm to extract the parameters and architecture of two- and three-layer neural networks using membership-queries,"We consider the natural problem of learning a ReLU network from queries, which was recently remotivated by model extraction attacks. In this work, we present a polynomial-time algorithm that can learn a depth-two ReLU network from queries under mild general position assumptions. We also present a polynomial-time algorithm that, under mild general position assumptions, can learn a rich class of depth-three ReLU networks from queries. For instance, it can learn most networks where the number of first layer neurons is smaller than the dimension and the number of second layer neurons. These two results substantially improve the state of the art: until our work, polynomial-time algorithms were only shown to learn depth-two networks from queries under the assumption that either the underlying distribution is Gaussian (Chen et al. (2021)) or that the weights matrix rows are linearly independent (Milli et al. (2019)). For depth three or more, there were no known poly-time results.","Learning With Queries, ReLU Networks, Model Extraction" Neural Discrete Reinforcement Learning,https://openreview.net/forum?id=YPHIlC3K4J,https://openreview.net/pdf?id=YPHIlC3K4J,Discrete all types of action spaces in decision making and utilize arbitrary DRL algorithm to solve them.,"Designing effective action spaces for complex environments is a fundamental and challenging problem in reinforcement learning (RL). Some recent works have revealed that naive RL algorithms utilizing well-designed handcrafted discrete action spaces can achieve promising results even when dealing with high-dimensional continuous or hybrid decision-making problems. However, elaborately designing such action spaces requires comprehensive domain knowledge. In this paper, we systematically analyze the advantages of discretization for different action spaces and then propose a unified framework, Neural Discrete Reinforcement Learning (NDRL), to automatically learn how to effectively discretize almost arbitrary action spaces. 
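For contrast with the learned discretization that NDRL targets, the "handcrafted" baseline the abstract alludes to can be sketched as uniform per-dimension binning of a continuous action box. The bin count, the bounds, and all names below are illustrative assumptions.

```python
import numpy as np

def make_uniform_action_grid(low, high, bins_per_dim: int):
    """Handcrafted discretization: each continuous action dimension is split
    into `bins_per_dim` evenly spaced values, and a discrete action picks one
    value per dimension independently."""
    low, high = np.asarray(low, float), np.asarray(high, float)
    grids = [np.linspace(l, h, bins_per_dim) for l, h in zip(low, high)]

    def decode(indices):
        # indices: one bin index per action dimension.
        return np.array([g[i] for g, i in zip(grids, indices)])

    return decode, bins_per_dim ** len(grids)  # decoder and number of actions

# Toy usage: a 2-d torque action in [-1, 1]^2 with 5 bins per dimension.
decode, n_actions = make_uniform_action_grid([-1, -1], [1, 1], 5)
print(n_actions)       # 25 joint discrete actions
print(decode([0, 4]))  # [-1.  1.] -> boundary actions are representable
```

The combinatorial growth of the joint action count with the number of dimensions is precisely what motivates learning a compact latent discretization instead.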
Specifically, we propose the Action Discretization Variational AutoEncoder (AD-VAE), an action representation learning method that can learn compact latent action spaces while maintaining the essential properties of original environments, such as boundary actions and the relationship between different action dimensions. Moreover, we uncover a key issue: parallel optimization of the AD-VAE and online RL agents is often unstable. To address it, we further design several techniques to adapt RL agents to learned action representations, including latent action remapping and ensemble Q-learning. Quantitative experiments and visualization results demonstrate the efficiency and stability of our proposed framework for complex action spaces in various environments. ","Deep Reinforcement Learning, Representation Learning, Action Space" CAB: Comprehensive Attention Benchmarking on Long Sequence Modeling,https://openreview.net/forum?id=S80I3NwbbpS,https://openreview.net/pdf?id=S80I3NwbbpS,We propose Comprehensive Attention Benchmark (CAB) with seven real-world tasks from different research areas to evaluate efficient attentions under four fine-grained attention patterns.,"The Transformer has achieved remarkable success in language, image, and speech processing. Recently, various efficient attention architectures have been proposed to improve the Transformer's efficiency while largely preserving its efficacy, especially in modeling long sequences. A widely-used benchmark to test these efficient methods' capability on long-range modeling is Long Range Arena (LRA). However, LRA only focuses on the standard bidirectional (or noncausal) self-attention, and completely ignores cross attentions and unidirectional (or causal) attentions, which are equally important to downstream applications. Although designing cross and causal variants of an attention method is straightforward for vanilla attention, it is often challenging for efficient attentions with subquadratic time and memory complexity. In this paper, we propose Comprehensive Attention Benchmark (CAB) under a fine-grained attention taxonomy with four distinguishable attention patterns, namely, noncausal self, causal self, noncausal cross, and causal cross attentions. CAB collects seven real-world tasks from different research areas to evaluate efficient attentions under the four attention patterns. Among these tasks, CAB validates efficient attentions in eight backbone networks to show their generalization across neural architectures. We conduct exhaustive experiments to benchmark the performances of nine widely-used efficient attention architectures designed with different philosophies on CAB. Extensive experimental results also shed light on the fundamental problems of efficient attentions, such as efficiency length against vanilla attention, performance consistency across attention patterns, the benefit of attention mechanisms, and interpolation/extrapolation on long-context language modeling.","Long Sequence Modeling, Benchmark, Efficient Attention" miCSE: Mutual Information Contrastive Learning for Low-shot Sentence Embeddings,https://openreview.net/forum?id=wle-ah5tiY,https://openreview.net/pdf?id=wle-ah5tiY,,"This paper presents miCSE, a mutual information-based Contrastive learning framework that significantly advances the state-of-the-art in few-shot sentence embedding. The proposed approach imposes alignment between the attention patterns of different views during contrastive learning.
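As a rough illustration of the view-alignment idea in the miCSE entry above, the sketch below penalizes disagreement between the attention maps of two augmented views of the same sentence. The paper's actual objective is mutual-information based; a plain cosine-agreement term is substituted here for brevity, and all shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def attention_alignment_loss(attn_view1, attn_view2):
    """Schematic alignment regularizer between the attention patterns of two
    augmented views (each of shape [batch, heads, seq, seq])."""
    a1 = F.normalize(attn_view1.flatten(1), dim=-1)
    a2 = F.normalize(attn_view2.flatten(1), dim=-1)
    return (1.0 - (a1 * a2).sum(-1)).mean()   # mean cosine disagreement
```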
Learning sentence embeddings with miCSE entails enforcing syntactic consistency across augmented views for every single sentence, making contrastive self-supervised learning more sample efficient. As a result, the proposed approach shows strong performance in the few-shot learning domain. While it achieves superior results compared to state-of-the-art methods on multiple benchmarks in few-shot learning, it is comparable in the full-shot scenario. The proposed approach is conceptually simple, easy to implement and optimize, yet empirically powerful. This study opens up avenues for efficient self-supervised learning methods that are more robust than current contrastive methods for sentence embedding.", Formal Conceptual Views in Neural Networks,https://openreview.net/forum?id=Zdpvtif5nPZ,https://openreview.net/pdf?id=Zdpvtif5nPZ,Conceptual structures allow for global insights into neural network models resulting in novel approaches for explainable AI. ," Explaining neural network models is a challenging task that remains unsolved in its entirety to this day. This is especially true for high dimensional and complex data. With the present work, we introduce two notions for conceptual views of a neural network, specifically a many-valued and a symbolic view. Both provide novel analysis methods to enable a human AI analyst to grasp deeper insights into the knowledge that is captured by the neurons of a network. We test the conceptual expressivity of our novel views through different experiments on the ImageNet and Fruit-360 data sets. Furthermore, we show to what extent the views allow us to quantify the conceptual similarity of different learning architectures. Finally, we demonstrate how conceptual views can be applied for abductive learning of human-comprehensible rules from neurons. In summary, with our work, we contribute to the most relevant task of globally explaining neural network models.","Conceptual Scaling, Explainable AI, Global Explanation, Formal Concept Analysis, Lattices" Towards Understanding and Mitigating Dimensional Collapse in Heterogeneous Federated Learning,https://openreview.net/forum?id=EXnIyMVTL8s,https://openreview.net/pdf?id=EXnIyMVTL8s,"We show data heterogeneity in federated learning causes dimensional collapse for trained models, and propose FedDecorr to mitigate such problem.","Federated learning aims to train models collaboratively across different clients without the sharing of data for privacy considerations. However, one major challenge for this learning paradigm is the data heterogeneity problem, which refers to the discrepancies between the local data distributions among various clients. To tackle this problem, we first study how data heterogeneity affects the representations of the globally aggregated models. Interestingly, we find that heterogeneous data results in the global model suffering from severe dimensional collapse, in which representations tend to reside in a lower-dimensional space instead of the ambient space. Moreover, we observe a similar phenomenon on models locally trained on each client and deduce that the dimensional collapse on the global model is inherited from local models. In addition, we theoretically analyze the gradient flow dynamics to shed light on how data heterogeneity results in dimensional collapse for local models. To remedy this problem caused by the data heterogeneity, we propose FedDecorr, a novel method that can effectively mitigate dimensional collapse in federated learning.
Specifically, FedDecorr applies a regularization term during local training that encourages different dimensions of representations to be uncorrelated. FedDecorr, which is implementation-friendly and computationally-efficient, yields consistent improvements over baselines on standard benchmark datasets. Code will be released.","federated learning, representation learning, data heterogeneity, dimensional collapse" A New Paradigm for Federated Structure Non-IID Subgraph Learning,https://openreview.net/forum?id=Qyz2cMy-ty6,https://openreview.net/pdf?id=Qyz2cMy-ty6,The first attempt to investigate the structure non-iid problem in federated subgraph learning.,"Federated graph learning (FGL), a distributed training framework for graph neural networks (GNNs), has attracted much attention for breaking the assumptions of centralized machine learning. Despite its effectiveness, the differences in data collection perspectives and quality lead to the challenges of heterogeneity, especially when a domain-specific graph is partitioned into subgraphs across different institutions. However, existing FGL methods implement graph data augmentation or personalization with community split, which follows cluster homogeneity assumptions. Hence we investigate the above issues and suggest that subgraph heterogeneity essentially amounts to structure variations. From the observations on FGL, we first define the structure non-independent and identically distributed (Non-IID) problem, which presents covariate shift challenges among client-wise subgraphs. Meanwhile, we propose a new paradigm for general federated data settings called Adaptive Federated Graph Learning (AdaFGL). The motivation behind it is to implement adaptive propagation mechanisms based on federated global knowledge and non-parametric label propagation. We conduct extensive experiments with community split and structure Non-IID settings, and our approach achieves state-of-the-art performance on five benchmark datasets.","graph neural network, federated learning, structure non-iid subgraphs" An Intrinsic Dimension Perspective of Transformers for Sequential Modeling,https://openreview.net/forum?id=0UzYWLzPBjA,https://openreview.net/pdf?id=0UzYWLzPBjA,The analysis of transformers applied to sequential modeling from a perspective of intrinsic dimension.,"Transformers have gained great popularity for sequential modeling, especially in fields such as natural language processing (NLP). Recently, numerous architectures based on the Transformer framework have been proposed, leading to great achievements in applications. However, the working principles behind them still remain mysterious. In this work, we numerically investigate the geometrical properties of data representation learned by Transformers, via a mathematical concept called intrinsic dimension (ID), which can be viewed as the minimal number of parameters required for modeling. A series of experiments, mainly focusing on text classification tasks, backs up the following empirical claims on relationships among embedding dimension, depth, respective ID per layer, and task performance. First, we surprisingly observe that a higher ID (of terminal features extracted by Transformers) typically implies a lower classification error rate. This is contrary to CNNs (and other models) on image classification tasks. In addition, it is shown that the ID per layer tends to decrease as the depth increases, and this reduction usually appears more significant for deeper architectures.
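One plausible instantiation of the FedDecorr regularizer described above: penalize the off-diagonal entries of the correlation matrix of a batch of representations. The exact formulation in the paper may differ; this is a sketch under that assumption.

```python
import torch

def decorrelation_penalty(z):
    """FedDecorr-style regularizer: z is a (batch, dim) tensor of
    representations; push the dimension-wise correlations toward zero."""
    z = (z - z.mean(0)) / (z.std(0) + 1e-5)          # standardize each dimension
    corr = z.T @ z / z.shape[0]                       # (dim, dim) correlation matrix
    off_diag = corr - torch.diag(torch.diag(corr))    # drop the diagonal
    return (off_diag ** 2).sum()
```

During local training this term would be added, with a small coefficient, to the client's usual task loss.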
Moreover, we give numerical evidence on geometrical structures of data representation learned by Transformers, where only nonlinear dimension reduction can be achieved. Finally, we explore the effect of sequence lengths on the ID and task performance, which guarantees the validity of data reduction in training. We hope that these findings can play a guiding role in hyper-parameter selection and dimension/data reduction for Transformers on text classification and other mainstream NLP tasks.","intrinsic dimension, transformer, text classification, NLP" SketchKnitter: Vectorized Sketch Generation with Diffusion Models,https://openreview.net/forum?id=4eJ43EN2g6l,https://openreview.net/pdf?id=4eJ43EN2g6l,,"We show vectorized sketch generation can be identified as a reversal of the stroke deformation process. This relationship was established by means of a diffusion model that learns data distributions over the stroke-point locations and pen states of real human sketches. Given randomly scattered stroke-points, sketch generation becomes a process of deformation-based denoising, where the generator rectifies positions of stroke points at each timestep to converge at a recognizable sketch. A key innovation was to embed recognizability into the reverse time diffusion process. It was observed that the estimated noise during the reversal process is strongly correlated with sketch classification accuracy. An auxiliary recurrent neural network (RNN) was consequently used to quantify recognizability during data sampling. It follows that, based on the recognizability scores, a sampling shortcut function can also be devised that renders better quality sketches with fewer sampling steps. Finally, it is shown that the model can be easily extended to a conditional generation framework, where given incomplete and unfaithful sketches, it yields sketches that are more visually appealing and have higher recognizability.", Evidential Uncertainty and Diversity Guided Active Learning for Scene Graph Generation,https://openreview.net/forum?id=xI1ZTtVOtlz,https://openreview.net/pdf?id=xI1ZTtVOtlz,We propose an Active Learning framework for Scene Graph Generation.,"Scene Graph Generation (SGG) has already shown its great potential in various downstream tasks, but it comes at the price of a prohibitively expensive annotation process. To reduce the annotation cost, we propose using Active Learning (AL) for sampling the most informative data. However, directly porting current AL methods to the SGG task poses the following challenges: 1) unreliable uncertainty estimates, and 2) data bias problems. To deal with these challenges, we propose EDAL (\textbf{E}vidential Uncertainty and \textbf{D}iversity Guided Deep \textbf{A}ctive \textbf{L}earning), a novel AL framework tailored for the SGG task. For challenge 1), we start with Evidential Deep Learning (EDL) coupled with a global relationship mining approach to estimate uncertainty, which can effectively overcome the perturbations of open-set relationships and background-relationships to obtain reliable uncertainty estimates. To address challenge 2), we adopt a diversity-based method and design the Context Blocking Module (CBM) and Image Blocking Module (IBM) to alleviate context-level bias and image-level bias, respectively. Experiments show that our AL framework can approach the performance of a fully supervised SGG model with only about $10\%$ annotation cost.
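The intrinsic-dimension entry above does not name a specific estimator; the TwoNN estimator (Facco et al., 2017) is one common choice, sketched here as a point of reference for how such per-layer ID numbers can be computed.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_intrinsic_dimension(X):
    """TwoNN estimator of intrinsic dimension for features X (n_points, dim):
    the ratio of second- to first-nearest-neighbor distances follows a Pareto
    law whose exponent is the ID."""
    nn = NearestNeighbors(n_neighbors=3).fit(X)   # columns: self, 1st NN, 2nd NN
    dist, _ = nn.kneighbors(X)
    mu = dist[:, 2] / dist[:, 1]                   # distance ratios
    return len(mu) / np.log(mu).sum()              # maximum-likelihood estimate
```

Applying this to the features after each Transformer block would produce the per-layer ID curves the entry discusses.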
Furthermore, our ablation studies indicate that introducing AL into SGG faces many challenges not observed in other vision tasks, which are successfully overcome by our new modules. ","Active learning, Scene graph generation, Uncertainty estimation" ErGOT: entropy-regularized graph optimal transport,https://openreview.net/forum?id=9czfKu1QqcN,https://openreview.net/pdf?id=9czfKu1QqcN,,"Graph comparison is a fundamental task, which not only relates to graph matching, an NP-hard problem, but also has various applications in graph learning. We tackle this task by studying optimal graph representation and the entropy-regularized optimal transport between graphs (ErGOT). First, we analytically derive a family of Gaussian variables that optimally represent graph topology and node relations. Second, we realize graph comparison by formulating ErGOT, a framework with low sample complexity, on represented graph information. Third, we control biases in the solution by defining ErGOT with a 2-Sinkhorn divergence, whose closed-form expression can be derived on the manifold of Gaussian variables. As the Gaussian geometry changes with entropy regularization magnitude, ErGOT defined with 2-Sinkhorn divergence wanders between pure optimal transport and maximum mean discrepancy among graphs. We demonstrate that these statistically efficient, principally unbiased, and in-between properties ensure theoretically faster convergence of our approach and empirically higher performance than state-of-the-art algorithms on graph alignment, sketching, and retrieval tasks. ","Graph comparison, entropy-regularized optimal transport, NP-hard problem, graph matching, graph alignment, graph sketching, graph retrieval" Test-time recalibration of conformal predictors under distribution shift based on unlabeled examples,https://openreview.net/forum?id=KrGgylZ0tw_,https://openreview.net/pdf?id=KrGgylZ0tw_,We propose a novel test-time recalibration method for conformal prediction based on unlabeled examples that provides excellent uncertainty estimates under natural distribution shifts.,"Modern image classifiers achieve high predictive accuracy, but the predictions typically come without reliable uncertainty estimates. Conformal prediction algorithms provide uncertainty estimates by predicting a set of classes based on the probability estimates of the classifier (for example, the softmax scores). To provide such sets, conformal prediction algorithms often rely on estimating a cutoff threshold for the probability estimates, and this threshold is chosen based on a calibration set. Conformal prediction methods guarantee reliability only when the calibration set is from the same distribution as the test set. Therefore, the methods need to be recalibrated for new distributions. However, in practice, labeled data from new distributions is rarely available, making calibration infeasible. In this work, we consider the problem of predicting the cutoff threshold for a new distribution based only on unlabeled examples.
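For context on the ErGOT entry above: the generic entropy-regularized OT primitive it builds on is usually computed with Sinkhorn iterations, sketched below. ErGOT's own contribution, a closed-form 2-Sinkhorn divergence on the manifold of Gaussian variables, is a specialized computation not shown here.

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.1, iters=200):
    """Standard Sinkhorn iterations for entropy-regularized optimal transport:
    C is the (n, m) cost matrix, a and b the source/target marginals."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)                # alternate row/column scaling
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan
```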
While it is impossible in general to guarantee reliability when calibrating based on unlabeled examples, we show that our method provides excellent uncertainty estimates under natural distribution shifts.","classification, uncertainty estimation, conformal prediction" TabDDPM: Modelling Tabular Data with Diffusion Models,https://openreview.net/forum?id=EJka_dVXEcr,https://openreview.net/pdf?id=EJka_dVXEcr,Proposed a new state-of-the-art approach for tabular data generation using diffusion models,"Denoising diffusion probabilistic models are currently becoming the leading paradigm of generative modeling for many important data modalities. Being the most prevalent in the computer vision community, diffusion models have also recently gained some attention for other domains, including speech, NLP, and graph-like data. In this work, we investigate whether the framework of diffusion models can be advantageous for general tabular problems, where datapoints are typically represented by vectors of heterogeneous features. The inherent heterogeneity of tabular data makes it quite challenging for accurate modeling, since the individual features can be of a completely different nature, i.e., some of them can be continuous and some of them can be discrete. To address such data types, we introduce TabDDPM --- a diffusion model that can be universally applied to any tabular dataset and handles features of any type. We extensively evaluate TabDDPM on a wide set of benchmarks and demonstrate its superiority over existing GAN/VAE alternatives, which is consistent with the advantage of diffusion models in other fields. Additionally, we show that TabDDPM can be successfully used in privacy-oriented setups, where the original datapoints cannot be shared.","tabular data, diffusion models, generative modelling" BED: Boundary-Enhanced Decoder for Chinese Word Segmentation,https://openreview.net/forum?id=d_iQXvrt9KN,https://openreview.net/pdf?id=d_iQXvrt9KN, An optimized decoder for the CWS model called Boundary-Enhanced Decoder.,"Chinese Word Segmentation (CWS) is an essential, fundamental step in the Chinese NLP processing pipeline. In recent years, with the development of deep learning and pre-training language models, many CWS models based on pre-training models, e.g., BERT and RoBERTa, have been proposed, and the performance of CWS models has been dramatically improved. However, CWS remains an open problem that deserves further study, such as the poor performance on OOV words. To our knowledge, the currently proposed CWS approaches mainly focus on optimizing the encoder part of the model, such as incorporating more word information into the encoder or doing pre-training related to the CWS task, etc., and there has been no attempt to improve the decoder's performance in the CWS model. This paper proposes an optimized decoder for the CWS model called Boundary-Enhanced Decoder (BED). It brings improvements of 0.05% and 0.69% in Average-F1 and OOV Average-F1 on four benchmark datasets when using a model with a BERT encoder and standard softmax decoder. We also publish our implementation of BED. ","Chinese Word Segmentation, deep learning, natural language processing" Gradient Inversion via Over-parameterized Convolutional Network in Federated Learning,https://openreview.net/forum?id=aBHTGMkisy-,https://openreview.net/pdf?id=aBHTGMkisy-,,"The main premise of federated learning is that local clients can upload gradients instead of data during collaborative learning, hence preserving data privacy.
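To make the conformal-recalibration entry above concrete, here is the standard labeled-calibration step it starts from: the cutoff threshold is a quantile of true-class nonconformity scores on a calibration set. The paper's contribution is estimating this threshold from unlabeled data instead; this sketch only shows the baseline computation.

```python
import numpy as np

def conformal_threshold(softmax_cal, labels_cal, alpha=0.1):
    """Split-conformal calibration: softmax_cal is (n, classes), labels_cal
    the true labels. Returns the cutoff for prediction sets
    {k : 1 - softmax(x)_k <= qhat}, valid at level 1 - alpha."""
    n = len(labels_cal)
    scores = 1.0 - softmax_cal[np.arange(n), labels_cal]   # nonconformity scores
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)    # finite-sample correction
    return np.quantile(scores, level, method="higher")
```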
But the development of gradient inversion methods puts this premise under severe challenge: a third party could still reconstruct the original training images through the uploaded gradients. While previous works are mostly conducted on relatively low-resolution images and small batch sizes, in this paper we show that image reconstruction from complex datasets like ImageNet is still possible, even with large batch sizes and high resolutions. Success of the proposed method is built upon three key factors: a convolutional network to implicitly create an image prior, an over-parameterized network to guarantee the feasibility of image generation and gradient matching, and a properly-designed architecture to create pixel intimacy. We conduct a series of practical experiments to demonstrate that the proposed algorithm can outperform SOTA algorithms and reconstruct the underlying original training images more effectively. Source code is available at: (to be released upon publication).","Gradient Inversion, Federated Learning" Memory-Augmented Variational Adaptation for Online Few-Shot Segmentation,https://openreview.net/forum?id=0h-YwriPUI,https://openreview.net/pdf?id=0h-YwriPUI,"We propose a memory-augmented variational adaptation mechanism, which learns to adapt the model to every new sample that arrives sequentially.","We investigate online few-shot segmentation, which learns to make dense predictions for novel classes while observing samples sequentially. The main challenge in such an online scenario is the sample diversity in the sequence, resulting in models that do not generalize well to future samples. To this end, we propose a memory-augmented variational adaptation mechanism, which learns to adapt the model to every new sample that arrives sequentially. Specifically, we first introduce a prototype memory, which retains category knowledge from previous samples to facilitate the model adaptation to future samples. The adaptation to each new sample is then formulated as a variational Bayesian inference problem, which strives to generate sample-specific model parameters by conditioning on the sample and the prototype memory. Furthermore, we propose memory-augmented segmentation to learn sample-specific feature representation for better adaptation to the segmentation of each sample. With extensive experiments, we show that a simple extension of existing few-shot segmentation methods tends to converge to over-smoothed, averaged masks of lesser performance. By contrast, the proposed method achieves considerably better online few-shot segmentation performance.","Online few-shot segmentation, Variational inference, Memory-augmented." Tailoring Language Generation Models under Total Variation Distance,https://openreview.net/forum?id=VELL0PlWfc,https://openreview.net/pdf?id=VELL0PlWfc,We analyze total variation distance (TVD) as a metric robust to outliers and devise a new training objective based on TVD to alleviate text degeneration and improve the generation quality.,"The standard paradigm of neural language generation adopts maximum likelihood estimation (MLE) as its optimization method. From a distributional view, MLE in fact minimizes the Kullback-Leibler divergence (KLD) between the distribution of the real data and that of the model. However, this approach forces the model to distribute non-zero (sometimes large) probability mass to all training samples regardless of their quality.
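Returning to the gradient-inversion entry above: its core objective is the standard gradient-matching term, sketched below. The paper's novelty (the over-parameterized convolutional prior and architecture design) sits around this term rather than replacing it.

```python
import torch

def gradient_matching_loss(model, loss_fn, x_hat, y, leaked_grads):
    """Make the gradients induced by the reconstruction x_hat match the
    gradients leaked by the client; minimized w.r.t. x_hat (or the
    parameters of a generator producing x_hat)."""
    loss = loss_fn(model(x_hat), y)
    grads = torch.autograd.grad(loss, list(model.parameters()), create_graph=True)
    return sum(((g - g_leak) ** 2).sum() for g, g_leak in zip(grads, leaked_grads))
```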
Moreover, in the attempt to cover the low-probability regions in the data distribution, the model systematically overestimates the probability of corrupted text sequences, which we conjecture is one of the main reasons for text degeneration during autoregressive decoding. To remedy this problem, we leverage the total variation distance (TVD) with its robustness to outliers, and develop practical bounds to apply it to language generation. Then, we introduce the TaiLr objective that balances the tradeoff of estimating TVD. Intuitively, TaiLr downweights real data samples that have low model probabilities with tunable penalization intensity. Experimental results show that our method alleviates the overestimation of degenerated sequences without sacrificing diversity and improves generation quality on a wide range of text generation tasks.","language generation, maximum likelihood estimation, total variation distance, text degeneration" SeqSHAP: Subsequence Level Shapley Value Explanations for Sequential Predictions,https://openreview.net/forum?id=0h4_YLDhf4K,https://openreview.net/pdf?id=0h4_YLDhf4K,,"With the increasing demands of interpretability in real-world applications, various methods for explainable artificial intelligence (XAI) have been proposed. However, most of them overlook the interpretability in sequential scenarios, which have a wide range of applications, e.g., online transactions and sequential recommendations. In this paper, we propose a Shapley value based explainer named SeqSHAP to explain the model predictions in sequential scenarios. Compared to existing methods, SeqSHAP provides more intuitive explanations at a subsequence level, which explicitly models the effect of contextual information among the related elements in a sequence. We propose to calculate subsequence-level feature attributions instead of element-wise attributions to utilize the information embedded in sequence structure, and provide a distribution-based segmentation method to obtain reasonable subsequences. Extensive experiments on two online transaction datasets from a real-world e-commerce platform show that the proposed method could provide valid and reliable explanations for sequential predictions.","XAI, Explainability, SHAP, Sequential Predictions" Newton Losses: Efficiently Including Second-Order Information into Gradient Descent,https://openreview.net/forum?id=FPeVU4Y_Lo6,https://openreview.net/pdf?id=FPeVU4Y_Lo6,Applying Newton to the loss and gradient descent to the neural network.,"We present Newton losses, a method for incorporating second-order information of losses by approximating them with quadratic functions. The presented method is applied only to the loss function and allows training the neural network with gradient descent. As loss functions are usually substantially cheaper to compute than the neural network, Newton losses can be used at a relatively small additional cost. 
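A schematic of the TaiLr-style weighting described above: the per-token negative log-likelihood is downweighted when the model assigns low probability to the target, with a tunable intensity gamma. This mirrors the abstract's description; the paper's exact objective and weighting function may differ.

```python
import torch
import torch.nn.functional as F

def tailr_style_loss(logits, targets, gamma=0.1):
    """Weighted MLE: logits (batch, vocab), targets (batch,).
    Tokens with low model probability get proportionally less gradient."""
    logp = F.log_softmax(logits, dim=-1)
    logp_t = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    p_t = logp_t.exp().detach()                      # weight carries no gradient
    weight = p_t / (gamma + (1.0 - gamma) * p_t)     # -> 1 for confident tokens
    return -(weight * logp_t).mean()
```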
We find that they yield superior performance, especially when applied to non-convex and hard-to-optimize loss functions such as algorithmic losses, which have been popularized in recent research.","differentiable algorithms, backpropagation, differentiable" BPFL: Towards Efficient Byzantine-Robust and Provably Privacy-Preserving Federated Learning,https://openreview.net/forum?id=3FdmckXo3cN,https://openreview.net/pdf?id=3FdmckXo3cN,We propose to address both Byzantine (security) attacks and data reconstruction (privacy) attacks against federated learning.,"Federated learning (FL) is an emerging distributed learning paradigm in which participating clients' private data is never shared. However, existing works show that FL is vulnerable to both Byzantine (security) attacks and data reconstruction (privacy) attacks. Existing FL defenses only address one of the two attacks, and also face efficiency issues. We propose BPFL, an efficient Byzantine-robust and provably privacy-preserving FL method that addresses all the issues. Specifically, we draw on the state-of-the-art Byzantine-robust FL method and use similarity metrics to measure the robustness of each participating client in FL. The validity of clients is formulated as circuit constraints on similarity metrics and verified via a zero-knowledge proof. Moreover, the client models are masked by a shared random vector, which is generated based on homomorphic encryption. In doing so, the server receives the masked client models rather than the true ones, which are proven to be private. BPFL is also efficient due to the usage of non-interactive zero-knowledge proofs. Experimental results on various datasets show that our BPFL is efficient, Byzantine-robust, and privacy-preserving.","federated learning, Byzantine-robust, privacy-preserving" Understanding Masked Image Modeling via Learning Occlusion Invariant Feature,https://openreview.net/forum?id=JSZvTDggUvz,https://openreview.net/pdf?id=JSZvTDggUvz,,"Recently, Masked Image Modeling (MIM) has achieved great success in self-supervised visual recognition. However, as a reconstruction-based framework, it is still an open question to understand how MIM works, since MIM appears very different from previous well-studied siamese approaches such as contrastive learning. In this paper, we propose a new viewpoint: MIM implicitly learns occlusion-invariant features, which is analogous to other siamese methods while the latter learn other invariances. By relaxing MIM formulation into an equivalent siamese form, MIM methods can be interpreted in a unified framework with conventional methods, among which only a) data transformations, i.e. what invariance to learn, and b) similarity measurements are different. Furthermore, taking MAE (He et al., 2021) as a representative example of MIM, we empirically find that the success of MIM models relates little to the choice of similarity functions, but rather to the learned occlusion-invariant feature introduced by masked images, which turns out to be a favored initialization for vision transformers, even though the learned feature could be less semantic.
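A minimal sketch of the Newton-losses idea from the entry above, under the simplifying assumption of an elementwise loss (so the Hessian w.r.t. the predictions is diagonal): precondition the loss gradient by its curvature at the prediction level, then backpropagate that direction through the network with ordinary gradient descent. The paper's quadratic-fit construction may be more general.

```python
import torch

def newton_loss_direction(y, loss_fn, eps=1e-6):
    """Gradient of loss_fn w.r.t. predictions y, preconditioned by the
    (diagonal) second derivative; assumes loss_fn sums elementwise terms."""
    y = y.detach().requires_grad_(True)
    loss = loss_fn(y)
    (g,) = torch.autograd.grad(loss, y, create_graph=True)
    (h,) = torch.autograd.grad(g.sum(), y)       # exact diagonal for elementwise losses
    return g / h.clamp_min(eps)                   # guard against flat/negative curvature

# Usage sketch: preds = model(x); preds.backward(gradient=newton_loss_direction(preds, loss_fn))
```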
We hope our findings could inspire researchers to develop more powerful self-supervised methods in the computer vision community.", Anisotropic Message Passing: Graph Neural Networks with Directional and Long-Range Interactions,https://openreview.net/forum?id=socffUzSIlx,https://openreview.net/pdf?id=socffUzSIlx,Modified message passing for the efficient description of long-range and directional interactions with applications to quantum-chemical systems.,"Graph neural networks have shown great potential for the description of a variety of chemical systems. However, standard message passing does not explicitly account for long-range and directional interactions, for instance due to electrostatics. In this work, an anisotropic state based on Cartesian multipoles is proposed as an addition to the existing hidden features. With the anisotropic state, message passing can be modified to explicitly account for directional interactions. Compared to existing models, this modification results in relatively little additional computational cost. Most importantly, the proposed formalism offers as a distinct advantage the seamless integration of (1) anisotropic long-range interactions, (2) interactions with surrounding fields and particles that are not part of the graph, and (3) the fast multipole method. As an exemplary use case, the application to quantum mechanics/molecular mechanics (QM/MM) systems is demonstrated.","Message Passing, Graph Neural Networks, Directional, Long-Range, Equivariant, Quantum Chemistry, QM/MM" Learn Low-dimensional Shortest-path Representation of Large-scale and Complex Graphs,https://openreview.net/forum?id=p4bZLgHUB6L,https://openreview.net/pdf?id=p4bZLgHUB6L,"We propose an efficient and interpretable shortest-path representation method for fast, accurate and scalable shortest-path distance queries.","Estimation of shortest-path (SP) distance lies at the heart of network analysis tasks. Along with the rapid emergence of large-scale and complex graphs, approximate SP-representing algorithms that transform a graph into compact and low-dimensional representations are critical for fast and scalable online analysis. Among different approaches, learning-based representation methods have made a breakthrough both in response time and accuracy. Several competitive works in learning-based methods heuristically leverage truncated random walks and optimization on arbitrary linkages for SP representation learning. However, they have limitations on both exploration range and distance preservation. We propose in this paper an efficient and interpretable SP representation method called Betweenness Centrality-based Distance Resampling (BCDR). First, we prove that betweenness centrality-based random walks can occupy a wider exploration range of distances due to their awareness of high-order path structures. Second, we leverage distance resampling to simulate random shortest paths from original paths and prove that the optimization on such shortest paths preserves distance relations via implicitly decomposing the SP distance-based similarity matrix.
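A sketch of the betweenness-biased walk just described in the BCDR entry: each step moves to a neighbor with probability proportional to its betweenness centrality. The smoothing constant and the exact bias used in the paper are assumptions here.

```python
import networkx as nx
import numpy as np

def bc_random_walk(G, start, length, seed=0):
    """Random walk biased toward high-betweenness neighbors, in the spirit
    of BCDR's exploration strategy."""
    rng = np.random.default_rng(seed)
    bc = nx.betweenness_centrality(G)
    walk = [start]
    for _ in range(length - 1):
        nbrs = list(G.neighbors(walk[-1]))
        probs = np.array([bc[v] + 1e-6 for v in nbrs])   # smooth zero centralities
        walk.append(nbrs[rng.choice(len(nbrs), p=probs / probs.sum())])
    return walk
```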
BCDR yields an average improvement of 25% in accuracy and 25-30% in query speed compared to all existing approximate methods when evaluated on a broad class of real-world and synthetic graphs with diverse sizes and structures.","shortest-path distance, graph representation learning, random walk" SYNC: SAFETY-AWARE NEURAL CONTROL FOR STABILIZING STOCHASTIC DELAY-DIFFERENTIAL EQUATIONS,https://openreview.net/forum?id=_8mS2NE-HXN,https://openreview.net/pdf?id=_8mS2NE-HXN,"We propose a new class of neural control policies for stabilizing stochastic delay-differential equations with safety guarantee, named SYNC, including both deterministic and stochastic controllers that outperform existing methods.","Stabilization of systems described by stochastic delay-differential equations (SDDEs) is a challenging task in the control community. Here, to achieve this task, we leverage neural networks to learn control policies using the information of the controlled systems in some prescribed regions. The two learned control policies, the neural deterministic controller (NDC) and the neural stochastic controller (NSC), work effectively because the learning procedures use, respectively, the well-known LaSalle-type theorem and our newly-established theorem for guaranteeing stochastic stability in SDDEs. We theoretically investigate the performance of the proposed NDC and NSC in terms of convergence time and energy cost. More practically and significantly, we improve our learned control policies by considering the situation where the controlled trajectories can only evolve in some specific safety set. Such successful stabilization based on neural networks restricted to a safety set is attributed to our further-developed theory for safety verification of SDDEs using the stochastic control barrier function, and we name it SYNC ($\textbf{S}$afet$\textbf{Y}$-aware $\textbf{N}$eural $\textbf{C}$ontrol). The efficacy of all the articulated control policies, including SYNC, is demonstrated systematically by using representative control problems.","Stochastic delay-differential equations, safety guarantee, stochastic stabilization, neural networks" Byzantine-robust Decentralized Learning via ClippedGossip,https://openreview.net/forum?id=qxcQqFUTIpQ,https://openreview.net/pdf?id=qxcQqFUTIpQ,"In this paper, we study the challenging task of Byzantine-robust decentralized training on arbitrary communication graphs. ","In this paper, we study the challenging task of Byzantine-robust decentralized training on arbitrary communication graphs. Unlike federated learning where workers communicate through a server, workers in the decentralized environment can only talk to their neighbors, making it harder to reach consensus and benefit from collaborative training. To address these issues, we propose an algorithm, termed ClippedGossip, for Byzantine-robust consensus and optimization, which is the first to provably converge to a $O(\delta_{\max}\zeta^2/\gamma^2)$ neighborhood of the stationary point for non-convex objectives under standard assumptions. Finally, we demonstrate the encouraging empirical performance of ClippedGossip under a large number of attacks.","Byzantine-robustness, distributed machine learning, robustness, optimization, decentralized learning" From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models,https://openreview.net/forum?id=Ck1UtnVukP8,https://openreview.net/pdf?id=Ck1UtnVukP8,,"Large language models (LLMs) have demonstrated excellent zero-shot generalization to new tasks.
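A sketch of one gossip step in the spirit of the ClippedGossip entry above: each worker averages norm-clipped differences to its neighbors, so a Byzantine neighbor's pull is bounded by the clipping radius tau. The mixing weights and clipping rule details are illustrative.

```python
import numpy as np

def clipped_gossip_step(x_i, neighbor_params, weights, tau):
    """Aggregate neighbors' parameters with per-neighbor norm clipping.
    x_i: own parameters; neighbor_params: list of neighbor vectors;
    weights: nonnegative mixing weights summing to at most 1."""
    agg = np.zeros_like(x_i)
    for w, x_j in zip(weights, neighbor_params):
        diff = x_j - x_i
        scale = min(1.0, tau / (np.linalg.norm(diff) + 1e-12))
        agg += w * scale * diff            # bounded contribution per neighbor
    return x_i + agg
```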
However, effective utilization of LLMs for zero-shot visual question-answering (VQA) remains challenging, primarily due to the modality disconnection and task disconnection between LLMs and the VQA task. End-to-end training on vision and language data may bridge the disconnections, but is inflexible and computationally expensive. To address this issue, we propose \emph{Img2Prompt}, a plug-and-play module that provides prompts that can bridge the aforementioned modality and task disconnections, so that LLMs can perform VQA tasks without end-to-end training. In order to provide such prompts, we further employ LLM-agnostic models to generate descriptions of image content and self-constructed question-answer pairs, which can effectively guide the LLM to perform VQA tasks. Img2Prompt offers the following benefits: 1) It is LLM-agnostic and can work with any LLM to perform VQA. 2) It renders end-to-end training unnecessary and significantly reduces the cost of deploying LLMs for VQA tasks. 3) It achieves comparable or better performance than methods relying on end-to-end training. On the challenging A-OKVQA dataset, our method outperforms some few-shot methods by as much as 20\%.","Large Language Model, Visual Question Answer, Prompts, Zero-Shot" A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning,https://openreview.net/forum?id=S07feAlQHgM,https://openreview.net/pdf?id=S07feAlQHgM,,"Real-world applications require the classification model to adapt to new classes without forgetting old ones. Correspondingly, Class-Incremental Learning (CIL) aims to train a model with limited memory size to meet this requirement. Typical CIL methods tend to save representative exemplars from former classes to resist forgetting, while recent works find that storing models from history can substantially boost the performance. However, the stored models are not counted into the memory budget, which implicitly results in unfair comparisons. We find that when counting the model size into the total budget and comparing methods with aligned memory size, saving models does not consistently work, especially for the case with limited memory budgets. As a result, we need to holistically evaluate different CIL methods at different memory scales and simultaneously consider accuracy and memory size for measurement. On the other hand, we dive deeply into the construction of the memory buffer for memory efficiency. By analyzing the effect of different layers in the network, we find that shallow and deep layers have different characteristics in CIL. Motivated by this, we propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel. MEMO extends specialized layers based on the shared generalized representations, efficiently extracting diverse representations with modest cost and maintaining representative exemplars. Extensive experiments on benchmark datasets validate MEMO's competitive performance. ",class-incremental learning Reinforcement learning for instance segmentation with high-level priors,https://openreview.net/forum?id=C3ukgkqJuh0,https://openreview.net/pdf?id=C3ukgkqJuh0,Instance segmentation can be learned from high-level rules only for objects following a regular shape prior.,"Instance segmentation is a fundamental computer vision problem which remains challenging despite impressive recent advances due to deep learning-based methods.
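To make the Img2Prompt entry above concrete, here is a schematic of how such a textual prompt could be assembled from a caption and self-constructed QA pairs. The template is hypothetical; the paper's actual prompt format may differ.

```python
def build_vqa_prompt(caption, qa_pairs, question):
    """Assemble an Img2Prompt-style zero-shot VQA prompt: image content as
    text plus synthetic QA pairs acting as in-context demonstrations."""
    demos = "\n".join(f"Question: {q} Answer: {a}" for q, a in qa_pairs)
    return f"Context: {caption}\n{demos}\nQuestion: {question} Answer:"

# Usage sketch with hypothetical inputs:
# build_vqa_prompt("a dog catching a frisbee in a park",
#                  [("What animal is shown?", "a dog")],
#                  "What is the dog doing?")
```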
Given sufficient training data, fully supervised methods can yield excellent performance, but annotation of ground-truth data remains a major bottleneck, especially for biomedical applications where it has to be performed by domain experts. The number of labels required can be drastically reduced by using rules derived from prior knowledge to guide the segmentation. However, these rules are in general not differentiable and thus cannot be used with existing methods. Here, we remove this requirement by using stateless actor-critic reinforcement learning, which permits non-differentiable rewards. We formulate the instance segmentation problem as graph partitioning, where the actor-critic predicts the edge weights driven by rewards, which are based on the conformity of segmented instances to high-level priors on object shape, position or size. The experiments on toy and real data demonstrate that a good set of priors is sufficient to reach excellent performance without any direct object-level supervision.","Instance Segmentation, Reinforcement Learning, Biomedical Imaging" Differentiable Mathematical Programming for Object-Centric Representation Learning,https://openreview.net/forum?id=1J-ZTr7aypY,https://openreview.net/pdf?id=1J-ZTr7aypY,,"We propose topology-aware feature partitioning into $k$ disjoint partitions for given scene features as a method for object-centric representation learning. To this end, we propose to use minimum $s$-$t$ graph cuts as a partitioning method, which is represented as a linear program. The method is topologically aware since it explicitly encodes neighborhood relationships in the image graph. To solve the graph cuts, our solution relies on an efficient, scalable, and differentiable quadratic programming approximation. Optimizations specific to cut problems allow us to solve the quadratic programs and compute their gradients significantly more efficiently compared with the general quadratic programming approach. Our results show that our approach is scalable and outperforms existing methods on object discovery tasks with textured scenes and objects.", Transformers are Sample-Efficient World Models,https://openreview.net/forum?id=vhFu1Acb0xb,https://openreview.net/pdf?id=vhFu1Acb0xb,We introduce a data-efficient agent that learns in a world model composed of a discrete autoencoder and an autoregressive Transformer.,"Deep reinforcement learning agents are notoriously sample inefficient, which considerably limits their application to real-world problems. Recently, many model-based methods have been designed to address this issue, with learning in the imagination of a world model being one of the most prominent approaches. However, while virtually unlimited interaction with a simulated environment sounds appealing, the world model has to be accurate over extended periods of time. Motivated by the success of Transformers in sequence modeling tasks, we introduce IRIS, a data-efficient agent that learns in a world model composed of a discrete autoencoder and an autoregressive Transformer. With the equivalent of only two hours of gameplay in the Atari 100k benchmark, IRIS achieves a mean human normalized score of 1.046, and outperforms humans on 10 out of 26 games, setting a new state of the art for methods without lookahead search. To foster future research on Transformers and world models for sample-efficient reinforcement learning, we release our codebase at this https URL.
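For reference on the differentiable-programming entry above: the exact, non-differentiable minimum $s$-$t$ cut that its quadratic program relaxes can be computed directly, for example with networkx. The tiny graph below is purely illustrative.

```python
import networkx as nx

# Exact (non-differentiable) s-t cut on a toy capacitated graph.
G = nx.DiGraph()
G.add_edge("s", "a", capacity=3.0)
G.add_edge("a", "t", capacity=1.0)
G.add_edge("s", "t", capacity=0.5)

cut_value, (source_side, sink_side) = nx.minimum_cut(G, "s", "t")
print(cut_value, source_side, sink_side)   # 1.5 {'s', 'a'} {'t'}
```

The entry's contribution is replacing this combinatorial solve with a differentiable approximation whose gradients can flow back into the feature extractor.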
For the review process, we provide the code and visualizations in the supplementary materials.","deep learning, reinforcement learning, model-based reinforcement learning, world models, learning in imagination, transformers, discrete autoencoders, generative modeling, sequence modeling" Considering Layerwise Importance in the Lottery Ticket Hypothesis,https://openreview.net/forum?id=aHwehiwz6YW,https://openreview.net/pdf?id=aHwehiwz6YW,Using different importance measures in the LTH procedure to determine properties of the resulting LTs.,"The recently-introduced Lottery Ticket Hypothesis (LTH) posits that it is possible to extract a sparse trainable subnetwork from a dense network using iterative magnitude pruning. By iteratively training the model, removing the connections with the lowest global weight magnitude and rewinding the remaining connections, sparse networks can be extracted that, when fully trained, reach a similar or better performance than their dense counterpart. Intuitively, this approach of comparing connection weights globally removes a lot of context about the connection weights and their relations to other connections in their layer, as the weight distributions in layers throughout the network often differ significantly. In this paper, we study a number of different approaches that try to recover some of this layer distributional context by computing an importance value for each connection that is dependent on the weights of the other connections in the same layer. We then generalise the LTH to use weight importances rather than weight magnitudes. Experiments using these importance metrics on several architectures and datasets reveal interesting aspects of the structure and emergence of lottery tickets. We find that given a repeatable training procedure, applying different importance metrics leads to distinct, performant lottery tickets with few overlapping connections.",Lottery Ticket Hypothesis Generalized Sum Pooling for Metric Learning,https://openreview.net/forum?id=TtxOsdYU92d,https://openreview.net/pdf?id=TtxOsdYU92d,We generalize global average pooling and propose a learnable generalized sum pooling method which effectively chooses a subset of the features to be re-weighted in aggregation.,"A common architectural choice for deep metric learning is a convolutional neural network followed by global average pooling (GAP). Albeit simple, GAP is a highly effective way to aggregate information. One possible explanation for the effectiveness of GAP is considering each feature vector as representing a different semantic entity and GAP as a convex combination of them. Following this perspective, we generalize GAP and propose a learnable generalized sum pooling method (GSP). GSP improves GAP with two distinct abilities: i) the ability to choose a subset of semantic entities, effectively learning to ignore nuisance information, and ii) learning the weights corresponding to the importance of each entity. Formally, we propose an entropy-smoothed optimal transport problem and show that it is a strict generalization of GAP, i.e., a specific realization of the problem gives back GAP. We show that this optimization problem enjoys analytical gradients, enabling us to use it as a direct learnable replacement for GAP. We further propose a zero-shot loss to ease the learning of GSP. We show the effectiveness of our method with extensive evaluations on 4 popular metric learning benchmarks.
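A sketch of the layerwise-importance idea from the LTH entry above: score each connection relative to its own layer before the global pruning cut, instead of comparing raw magnitudes across layers. Normalizing by the layer's standard deviation is just one of several possible importance metrics the entry alludes to.

```python
import numpy as np

def layerwise_importance_masks(weights_per_layer, sparsity=0.8):
    """Compute per-layer keep-masks: connections are scored by |w| normalized
    within their layer, then a single global threshold prunes the lowest
    `sparsity` fraction of normalized scores."""
    scores = [np.abs(w) / (w.std() + 1e-12) for w in weights_per_layer]
    flat = np.concatenate([s.ravel() for s in scores])
    thresh = np.quantile(flat, sparsity)          # global cut on normalized scores
    return [s > thresh for s in scores]           # boolean masks, one per layer
```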
Code is available at: GSP-DML Framework","Metric learning, feature selection, global average pooling, zero-shot regularization" SAAL: Sharpness-Aware Active Learning,https://openreview.net/forum?id=FpkVnbE_h6i,https://openreview.net/pdf?id=FpkVnbE_h6i,"We propose Sharpness-Aware Active Learning, or SAAL, which adopts the loss sharpness for the acquisition score.","While modern deep neural networks play significant roles in many research areas, they are also prone to overfitting problems under limited data instances. Particularly, this overfitting, or generalization issue, could be a problem in the framework of active learning because it selects a few data instances for learning over time. To consider the generalization, this paper introduces the first active learning method to incorporate the sharpness of loss space in the design of the acquisition function, inspired by sharpness-aware minimization (SAM). SAM intends to maximally perturb the training dataset, so the optimization can be led to a flat minimum, which is known to have better generalization ability. Specifically, our active learning, Sharpness-Aware Active Learning (SAAL), constructs its acquisition function by selecting unlabeled instances whose perturbed loss becomes maximum. In adapting SAM to SAAL, we design a pseudo-labeling mechanism to anticipate the perturbed loss w.r.t. the ground-truth label. Furthermore, we present a theoretical analysis relating SAAL to recent active learning methods, showing that these works reduce to SAAL under specific conditions. We conduct experiments on various benchmark datasets for vision-based tasks in image classification and object detection. The experimental results confirm that SAAL outperforms the baselines by selecting instances that have the potentially maximal perturbation on the loss.","active learning, loss sharpness, SAM" Scalable Subset Sampling with Neural Conditional Poisson Networks,https://openreview.net/forum?id=p8hMBcPtvju,https://openreview.net/pdf?id=p8hMBcPtvju,,"A number of problems in learning can be formulated in terms of the basic primitive of sampling $k$ elements out of a universe of $n$ elements. This subset sampling operation cannot directly be included in differentiable models, and approximations are essential. Current approaches take an \emph{order sampling} approach to sampling subsets and depend on differentiable approximations of the Top-$k$ operator for selecting the largest $k$ elements from a set. We present a simple alternative method for sampling subsets based on \emph{conditional Poisson sampling}. Unlike order sampling approaches, the parallel complexity of the proposed method is independent of the subset size, which makes the method scalable to large subset sizes. We adapt the procedure to make it efficient and amenable to discrete gradient approximations for use in differentiable models. Furthermore, the method also allows the subset size parameter $k$ to be differentiable. We demonstrate our approach on model explanation, image sub-sampling and stochastic $k$-nearest neighbor tasks, outperforming existing methods in accuracy, efficiency and scalability.", High probability error bounds of SGD in unbounded domain,https://openreview.net/forum?id=unKdm72T5wP,https://openreview.net/pdf?id=unKdm72T5wP,,"This paper studies the high probability convergence behaviour of the stochastic gradient descent (SGD) method applied to convex problems.
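A hedged sketch of the SAAL acquisition score described above: pseudo-label an unlabeled input, take a one-step SAM-style ascent in weight space, and use the perturbed loss as the score. The perturbation radius rho and the argmax pseudo-labeling rule follow the abstract only loosely.

```python
import torch

def saal_score(model, x, loss_fn, rho=0.05):
    """Acquisition score for an unlabeled batch x: loss after a one-step
    sharpness-style weight perturbation, using pseudo labels."""
    with torch.no_grad():
        pseudo = model(x).argmax(-1)                      # pseudo label
    loss = loss_fn(model(x), pseudo)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    with torch.no_grad():
        for p, g in zip(params, grads):
            p += rho * g / norm                           # ascend to perturbed weights
        score = loss_fn(model(x), pseudo).item()          # perturbed (sharp) loss
        for p, g in zip(params, grads):
            p -= rho * g / norm                           # restore original weights
    return score
```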
The existing tail-bound analysis of SGD relies crucially on assuming the domain of the problem to be bounded. In this work, we show that the bounded domain assumption can be removed for free. That is, we prove SGD in an unbounded domain enjoys the same high probability error bound as that established in the bounded domain; SGD converges at rate $O(\log(1/\delta)/\epsilon^2)$ whether the problem domain is bounded or not. As a by-product, we also prove that the trajectory of SGD is guaranteed to stay in a neighbourhood of the initialization with almost bounded diameter. As simple extensions of our analysis, we further establish the high probability error bounds of the last iterate of SGD and SGD with momentum, respectively.", Improved Convergence of Differential Private SGD with Gradient Clipping,https://openreview.net/forum?id=FRLswckPXQ5,https://openreview.net/pdf?id=FRLswckPXQ5,,"Differentially private stochastic gradient descent (DP-SGD) with gradient clipping (DP-SGD-GC) is an effective optimization algorithm that can train machine learning models with a privacy guarantee. Despite the popularity of DP-SGD-GC, its convergence in unbounded domains without the Lipschitz continuity assumption is less understood; existing analyses of DP-SGD-GC either impose additional assumptions or end up with a utility bound that involves a non-vanishing bias term. In this work, for smooth and unconstrained problems, we improve the current analysis and show that DP-SGD-GC can achieve a vanishing utility bound without any bias term. Furthermore, when the noise generated from subsampled gradients is light-tailed, we prove that DP-SGD-GC can achieve nearly the same utility bound as DP-SGD applied to Lipschitz continuous objectives. As a by-product, we propose a new clipping technique, called value clipping, to mitigate the computational overhead caused by classic gradient clipping. Experiments on standard benchmark datasets are conducted to support our analysis.", Learning Inductive Object-Centric Slot Initialization via Clustering,https://openreview.net/forum?id=YtghWaAhboM,https://openreview.net/pdf?id=YtghWaAhboM,,"Object-centric representations using slots have shown advances towards efficient, flexible and interpretable abstraction from low-level perceptual features in a compositional scene. Current approaches randomize the initial state of slots, followed by iterative refinement. As we show in this paper, the random slot initialization significantly affects the accuracy of the final slot prediction. Moreover, current approaches require a predetermined number of slots from prior knowledge of the data, which limits the applicability in the real world. In our work, we initialize the slot representations with clustering algorithms conditioned on the perceptual input features. This requires an additional layer in the architecture to initialize the slots given the identified clusters. We design permutation invariant and permutation equivariant versions of this layer to enable exchangeable slot representations after clustering. Additionally, we employ mean-shift clustering to automatically identify the number of slots for a given scene. We evaluate our method on object discovery and novel view synthesis tasks with various datasets.
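A minimal sketch of the cluster-conditioned slot initialization just described: run mean-shift on the flattened perceptual features and take the cluster centers as initial slots, so the number of slots is discovered per scene rather than fixed a priori. The permutation-(in/equi)variant initialization layer from the entry is omitted here.

```python
import numpy as np
from sklearn.cluster import MeanShift

def init_slots(features):
    """features: (num_locations, dim) array of perceptual features for one
    scene. Returns (num_slots, dim) initial slot vectors, with num_slots
    determined automatically by mean-shift."""
    ms = MeanShift().fit(features)
    return ms.cluster_centers_
```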
The results show that our method outperforms prior works consistently, especially for complex scenes.","Slot representation, Clustering, Unsupervised learning, Object discovery, Novel view synthesis" Group-level Brain Decoding with Deep Learning,https://openreview.net/forum?id=lEFM4OTz62c,https://openreview.net/pdf?id=lEFM4OTz62c,We propose a neuroscientifically interpretable deep learning model capable of jointly decoding multiple subjects in neuroimaging data aided by subject embeddings.,"Decoding experimental variables from brain imaging data is gaining popularity, with applications in brain-computer interfaces and the study of neural representations. Decoding is typically subject-specific and does not generalise well over subjects. Here, we propose a method that uses subject embedding, analogous to word embedding in Natural Language Processing, to learn and exploit the structure in between-subject variability as part of a decoding model, our adaptation of the WaveNet architecture for classification. We apply this to magnetoencephalography data, where 15 subjects viewed 118 different images, with 30 examples per image; we classify images using the entire 1 s window following image presentation. We show that the combination of deep learning and subject embedding is crucial to closing the performance gap between subject- and group-level decoding models. Importantly, group models outperform subject models on low-accuracy subjects (but impair high-accuracy subjects) and can be helpful for initialising subject models. The potential of such group modelling is even higher with bigger datasets. To better enable physiological interpretation at the group level, we demonstrate the use of permutation feature importance, developing insights into the spatio-temporal and spectral information encoded in the models. All code is available on GitHub.","deep learning, transfer learning, decoding, neuroimaging, MEG, permutation feature importance" QUANTILE-LSTM: A ROBUST LSTM FOR ANOMALY DETECTION,https://openreview.net/forum?id=k5e6oQP2zHx,https://openreview.net/pdf?id=k5e6oQP2zHx,,"Anomalies refer to the departure of systems and devices from their normal behaviour in standard operating conditions. An anomaly in an industrial device can indicate an upcoming failure, often in the temporal direction. In this paper, we make two contributions: 1) we estimate conditional quantiles, and consider three different ways to define anomalies based on the estimated quantiles, and 2) we use a new learnable activation function in the popular Long Short Term Memory (LSTM) architecture to model temporal long-range dependency. In particular, we propose the Parametric Elliot Function (PEF) as an activation function inside the LSTM, which saturates later than sigmoid and tanh. The proposed algorithms are compared with other well-known anomaly detection algorithms, such as Isolation Forest (iForest), Elliptic Envelope, Autoencoder, and modern deep learning models such as Deep Autoencoding Gaussian Mixture Model (DAGMM), Generative Adversarial Networks (GANs), etc. The algorithms are evaluated in terms of various performance metrics, such as precision and recall. The algorithms are evaluated on multiple industrial time-series datasets, such as Yahoo, AWS, GE, and machine sensor data. We find that the LSTM-based quantile algorithms are very effective and outperform the existing algorithms in identifying anomalies.
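Two small sketches tied to the quantile-LSTM entry above: a plausible parametrization of the Elliot activation (the abstract does not give the exact PEF form, so the parameter placement is an assumption), and the standard pinball loss used to learn conditional quantiles.

```python
import numpy as np

def parametric_elliot(x, a=1.0):
    """One plausible Parametric Elliot Function: a scaled variant of the
    Elliot activation x / (1 + |x|), which saturates more slowly."""
    return x / (1.0 + np.abs(a * x))

def pinball_loss(y_true, y_pred, q=0.95):
    """Quantile (pinball) loss: minimizing it makes y_pred estimate the
    q-th conditional quantile of y_true."""
    e = y_true - y_pred
    return np.mean(np.maximum(q * e, (q - 1.0) * e))
```

Anomalies would then be flagged, for instance, when observations fall outside an interval spanned by a low and a high estimated quantile.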
","Anomaly, Quantile, LSTM, Activation Function" Mutual Information-guided Knowledge Transfer for Open-World Semi-Supervised Learning,https://openreview.net/forum?id=7IMneQViz6h,https://openreview.net/pdf?id=7IMneQViz6h,,"We tackle the open-world semi-supervised learning problem, aiming to cluster novel classes and classify seen classes in unlabeled data based on labeled data from seen classes. The main challenge is to transfer knowledge contained in seen class data to unseen ones. Previous methods mostly transfer knowledge through sharing representation space. However, they learn the seen and unseen classes classifier in a disjoint manner, neglecting the underlying relation between predictions on the seen and unseen classes. Therefore, the learned representations and classifiers are less effective for clustering unseen classes. In this paper, we propose a novel and general method to transfer knowledge between seen and unseen classes. Our insight is to utilize mutual information to measure the generic statistical dependency between seen and unseen classes in the classifier output space, which couple the learning of classifier and promote transferring knowledge between two data sets. To validate the effectiveness and generalization of our method, we conduct extensive experiments on several benchmarks, including CIFAR10/100, Imagenet100, Oxford-IIIT Pet and FGVC-Aicraft datasets. Our results show that the proposed method outperforms previous SOTA by a significant margin on almost all benchmarks. ","Novel Class Discovery, Open-world Semi-supervised learning, Knowledge Transfer, Mutual Information" RegQ: Convergent Q-Learning with Linear Function Approximation using Regularization,https://openreview.net/forum?id=ZtW4gh_q5AC,https://openreview.net/pdf?id=ZtW4gh_q5AC,This paper develops convergent Q-learning algorithm when linear function approximation is used.," Q-learning is widely used algorithm in reinforcement learning community. Under the lookup table setting, its convergence is well established. However, its behavior is known to be unstable with the linear function approximation case. This paper develops a new Q-learning algorithm, called RegQ, that converges when linear function approximation is used. We prove that simply adding an appropriate regularization term ensures convergence of the algorithm. We prove its stability using a recent analysis tool based on switching system models. Moreover, we experimentally show that RegQ converges in environments where Q-learning with linear function approximation has known to diverge. We also provide an error bound on the solution where the algorithm converges.","reinforcement learning, Q-learning, reinforcement learning theory" Neural Field Discovery Disentangles Equivariance in Interacting Dynamical Systems,https://openreview.net/forum?id=wZRgC1McxyU,https://openreview.net/pdf?id=wZRgC1McxyU,"We disentangle global fields effects from local object interactions in interacting dynamical systems, and propose neural fields to discover underlying fields.","Systems of interacting objects often evolve under the influence of underlying field effects that govern their dynamics, \emph{e.g.} electromagnetic fields in physics, or map topologies and traffic rules in traffic scenes. While the interactions between objects depend on local information, the underlying fields depend on global states. Pedestrians and vehicles in traffic scenes, for example, follow different traffic rules and social norms depending on their absolute geolocation. 
The entanglement of global and local effects makes recently popularized equivariant networks inapplicable, since they fail to capture global information. To address this, we propose to \emph{disentangle} local object interactions --which are equivariant to global roto-translations and depend on relative positions and orientations-- from external global field effects --which depend on absolute positions and orientations. We theorize the presence of latent fields, which we aim to discover \emph{without} observing them directly, inferring them instead from the dynamics alone. We propose neural fields to learn the latent fields, and model the interactions with equivariant graph networks operating in local coordinate frames. We combine the two components in a graph network that transforms field effects into local frames and operates solely there. Our experiments show that we can accurately discover the underlying fields in charged-particle settings, traffic scenes, and gravitational n-body problems, and effectively use them to learn the system and forecast future trajectories.","Interacting Dynamical systems, Graph Neural Networks, Neural Fields, Equivariance" DIMENSION-REDUCED ADAPTIVE GRADIENT METHOD,https://openreview.net/forum?id=Xp-__WzXiBy,https://openreview.net/pdf?id=Xp-__WzXiBy,,"Adaptive gradient methods, such as Adam, have shown faster convergence than SGD across various kinds of network models. However, adaptive algorithms often suffer from worse generalization performance than SGD. Though much effort has been invested in combining Adam and SGD to solve this issue, adaptive methods still fail to attain as good generalization as SGD. In this work, we propose a Dimension-Reduced Adaptive Gradient Method (DRAG) to eliminate the generalization gap. DRAG makes an elegant combination of SGD and Adam by adopting a trust-region-like framework. We observe that 1) Adam adjusts stepsizes for each gradient coordinate according to some loss curvature, and indeed decomposes the $n$-dimensional gradient into $n$ independent directions to search, in which each direction inherits one coordinate element from the gradient and sets the remaining coordinate positions to zero; 2) SGD uniformly scales the gradient for all coordinates and actually has only one descent direction to minimize along. Accordingly, DRAG reduces the high degree of freedom of Adam and also improves the flexibility of SGD via optimizing the loss along $k\ (\ll \! n)$ descent directions, e.g. the gradient direction and momentum direction used in this work. Then, per iteration, DRAG finds the best stepsizes for the $k$ descent directions by solving a trust-region subproblem whose computational overhead is negligible, since the trust-region subproblem is low-dimensional, e.g. $k=2$ in this work. DRAG is compatible with the common deep learning training pipeline without introducing extra hyper-parameters and with negligible extra computation. Moreover, we prove the convergence of DRAG for non-convex stochastic problems that often occur in deep learning training. 
Experimental results on representative benchmarks testify to the fast convergence and superior generalization of DRAG.",Deep learning optimizer Learning to Estimate Single-View Volumetric Flow Motions without 3D Supervision,https://openreview.net/forum?id=2vmGv5wPDBZ,https://openreview.net/pdf?id=2vmGv5wPDBZ,We train a network to estimate 3D motions and densities from single-view videos of smoke without using 3D ground truth.,"We address the challenging problem of jointly inferring the 3D flow and volumetric densities moving in a fluid from a monocular input video with a deep neural network. Despite the complexity of this task, we show that it is possible to train the corresponding networks without requiring any 3D ground truth for training. In the absence of ground truth data, we can train our model with observations from real-world capture setups instead of relying on synthetic reconstructions. We make this unsupervised training approach possible by first generating an initial prototype volume which is then moved and transported over time without the need for volumetric supervision. Our approach relies purely on image-based losses, an adversarial discriminator network, and regularization. Our method can estimate long-term sequences in a stable manner, while achieving closely matching targets for inputs such as rising smoke plumes.", Towards the Out-of-Distribution Generalization of Contrastive Self-Supervised Learning,https://openreview.net/forum?id=n0Pb9T5kmb,https://openreview.net/pdf?id=n0Pb9T5kmb,"This paper studies the out-of-distribution generalization of contrastive self-supervised learning, and proposes an augmentation-robust contrastive learning algorithm to improve OOD performance.","Self-supervised learning has attracted much attention recently, since, in contrast to supervised learning, it does not require labeled data for training. Empirical studies also observe that it has better transfer ability than supervised learning. However, the theoretical study of the out-of-distribution (OOD) generalization ability of self-supervised learning is still limited. In this paper, by focusing on the data augmentation used in SSL, we establish a theoretical framework for the OOD performance of contrastive-based self-supervised learning. Although some recent work claims that contrastive learning learns more robust representations than supervised learning, our results suggest that this superiority mainly comes from the data augmentation used, i.e., more data are fed to the model. In the face of more challenging OOD scenarios, standard contrastive learning still suffers from the same generalization problem as empirical risk minimization (ERM). Based on our theoretical results, we propose an augmentation-robust contrastive learning approach, named ArCL, which significantly improves the OOD performance of contrastive learning on several datasets. ","Contrastive learning, out-of-distribution generalization" Online Policy Optimization for Robust MDP,https://openreview.net/forum?id=cYZupNY8DS4,https://openreview.net/pdf?id=cYZupNY8DS4,,"Reinforcement learning (RL) has exceeded human performance in many synthetic settings such as video games and Go. However, real-world deployment of end-to-end RL models is rare, as RL models can be very sensitive to slight perturbations of the environment. The robust Markov decision process (MDP) framework---in which the transition probabilities belong to an uncertainty set around a nominal model---provides one way to develop robust models. 
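To make the low-dimensional trust-region subproblem from the DRAG entry above concrete, here is a sketch with k=2 search directions (gradient and momentum); the quadratic model, the regularized solve, and the radius value are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def drag_step(w, grad, momentum, hvp, radius=0.5):
    """Sketch of a DRAG-style trust-region step with k=2 search directions.
    hvp(v) returns a Hessian-vector product; the quadratic model and the
    naive clip-to-radius solve below are assumptions for illustration."""
    D = np.stack([grad, momentum], axis=1)                     # (dim, 2) basis
    g = D.T @ grad                                             # projected gradient (2,)
    H = D.T @ np.stack([hvp(D[:, 0]), hvp(D[:, 1])], axis=1)   # projected 2x2 Hessian
    alpha = np.linalg.solve(H + 1e-8 * np.eye(2), -g)          # unconstrained minimizer
    if np.linalg.norm(alpha) > radius:                         # keep inside trust region
        alpha *= radius / np.linalg.norm(alpha)
    return w + D @ alpha

# Toy usage on f(w) = 0.5 * ||A w - b||^2.
A = np.random.default_rng(0).normal(size=(10, 4)); b = np.ones(10)
w, m = np.zeros(4), np.zeros(4)
grad = A.T @ (A @ w - b)
w = drag_step(w, grad, m + grad, hvp=lambda v: A.T @ (A @ v))
```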
While previous analyses show that RL algorithms are effective assuming access to a generative model, it remains unclear whether RL can be efficient in a more realistic online setting, which requires carefully balancing exploration and exploitation. In this work, we consider online robust MDPs, in which the agent interacts with an unknown nominal system. We propose a robust optimistic policy optimization algorithm that is provably efficient. To address the additional uncertainty caused by an adversarial environment, our model features a new optimistic update rule derived via Fenchel conjugates. Our analysis establishes the first regret upper bound for online robust MDPs. ", Toeplitz Neural Network for Sequence Modeling,https://openreview.net/forum?id=IxmWsm4xrua,https://openreview.net/pdf?id=IxmWsm4xrua,An efficient method that uses Toeplitz matrices to model sequences.,"Sequence modeling has important applications in natural language processing and computer vision. Recently, transformer-based models have shown strong performance on various sequence modeling tasks, relying on attention to capture pairwise token relations and on position embedding to inject positional information. While showing good performance, the transformer models are inefficient to scale to long input sequences, mainly due to the quadratic space-time complexity of attention. To overcome this inefficiency, we propose to model sequences with a relative-position-encoded Toeplitz matrix and use a Toeplitz matrix-vector product trick to reduce the space-time complexity of sequence modeling to log-linear. A lightweight sub-network called the relative position encoder is proposed to generate relative position coefficients with a fixed budget of parameters, enabling the proposed Toeplitz neural network to deal with varying sequence lengths. In addition, despite being trained on 512-token sequences, our model can extrapolate to input sequences of up to 14K tokens in inference with consistent performance. Extensive experiments on autoregressive and bidirectional language modeling, image modeling, and the challenging Long-Range Arena benchmark show that our method achieves better performance than its competitors on most downstream tasks while being significantly faster.","Toeplitz Matrix, Sequence Modeling, Relative position" An Adaptive Entropy-Regularization Framework for Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=zzL_5WoI3I,https://openreview.net/pdf?id=zzL_5WoI3I,This paper proposes an adaptive entropy-regularization framework for multi-agent reinforcement learning to learn the adequate amount of exploration for each agent based on the degree of required exploration.,"In this paper, we propose an adaptive entropy-regularization framework (ADER) for multi-agent reinforcement learning (RL) to learn the adequate amount of exploration for each agent based on the degree of required exploration. In order to handle instability arising from updating multiple entropy temperature parameters for multiple agents, we disentangle the soft value function into two types: one for pure reward and the other for entropy. By applying multi-agent value factorization to the disentangled value function of pure reward, we obtain a relevant metric to assess the necessary degree of exploration for each agent. Based on this metric, we propose the ADER algorithm, built on maximum entropy RL, which controls the necessary level of exploration across agents over time by learning the proper target entropy for each agent. 
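The log-linear complexity claimed in the Toeplitz Neural Network entry above rests on the classical Toeplitz matrix-vector product trick: embed the Toeplitz matrix in a circulant matrix and multiply via FFT. A self-contained NumPy sketch:

```python
import numpy as np

def toeplitz_matvec(first_col, first_row, x):
    """O(n log n) Toeplitz matrix-vector product via circulant embedding + FFT,
    the trick behind the Toeplitz neural network's log-linear complexity."""
    n = len(x)
    # Embed the n x n Toeplitz matrix into a 2n-point circulant's first column.
    c = np.concatenate([first_col, [0.0], first_row[1:][::-1]])
    # Circular convolution of c with zero-padded x equals the circulant matvec.
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x, len(c)))
    return y[:n].real

rng = np.random.default_rng(0)
col, row = rng.normal(size=8), rng.normal(size=8)
row[0] = col[0]
x = rng.normal(size=8)
# Dense reference: T[i, j] = col[i-j] below the diagonal, row[j-i] above it.
T = np.array([[col[i - j] if i >= j else row[j - i] for j in range(8)] for i in range(8)])
assert np.allclose(T @ x, toeplitz_matvec(col, row, x))
```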
Experimental results show that the proposed scheme significantly outperforms current state-of-the-art multi-agent RL algorithms. ","Multi-Agent Reinforcement Learning, Entropy Regularization, Exploration-Exploitation Tradeoff" Relative Positional Encoding Family via Unitary Transformation,https://openreview.net/forum?id=xMWFqb5Uyk,https://openreview.net/pdf?id=xMWFqb5Uyk,"Introducing unitary relative positional encoding, a principled design for relative position encoding, applicable to both linear and vanilla transformers.","Relative position encoding is widely used in vanilla and linear transformers to represent positional information. However, the existing encoding methods of a vanilla transformer are not always directly applicable to a linear transformer, because the latter requires a decomposition of the query and key representations into separate kernel functions. Nevertheless, principles for designing encoding methods suitable for linear transformers remain under-studied. In this work, we put together a variety of existing encoding methods under a canonical form and further propose a family of relative positional encodings via unitary transformation. Our formulation leads to a principled framework that can be used to develop new relative positional encoding methods that preserve linear space-time complexity. Equipped with different parameters, the proposed unitary relative positional encoding family (URPE) derives effective encodings for various applications. Experiments show that, compared with existing encoding methods, unitary encoding achieves competitive performance on language modeling and various challenging downstream tasks, such as machine translation and text classification. In the meantime, it highlights a general paradigm for designing a broader range of relative positional encoding methods, applicable to both linear and vanilla transformers.","Linear Transformer, Relative positional encoding, Unitary transformation" Revisiting Feature Acquisition Bias for Few-Shot Fine-Grained Image Classification,https://openreview.net/forum?id=i4z90HQjBZa,https://openreview.net/pdf?id=i4z90HQjBZa,,"Recent work on metric-learning based few-shot fine-grained image classification (FSFGIC) has achieved promising success in classification accuracy, where various convolutional neural networks with different similarity measures are utilized to learn a common feature representation for each category. In this paper, we identify and analyze for the first time a fundamental problem of existing metric-learning based FSFGIC methods: they fail to effectively address the bias in the features obtained from each input image, which causes misclassification. To solve this problem, we present a robust feature acquisition network (RFANet) that can effectively address the bias in the feature information obtained from each input image and guide convolution-based embedding models to significantly increase accuracy. Our proposed architecture can be easily embedded into any episodic training mechanism for end-to-end training from scratch. 
Extensive experiments on FSFGIC tasks demonstrate the superiority of the proposed method over the state of the art.", ColoristaNet for Photorealistic Video Style Transfer,https://openreview.net/forum?id=Z6XKjKM2zBA,https://openreview.net/pdf?id=Z6XKjKM2zBA,"In this paper, we propose a novel photorealistic video style transfer network called ColoristaNet, which can conduct color style transfer in videos without introducing painterly spatial distortions or inconsistent flickering artifacts.","Photorealistic style transfer aims to transfer the artistic style of an image onto an input image or video while keeping photorealism. In this paper, we argue that it is the summary-statistics matching scheme in existing algorithms that leads to unrealistic stylization. To avoid employing the popular Gram loss, we propose a self-supervised style transfer framework, which contains a style removal part and a style restoration part. The style removal network removes the original image styles, and the style restoration network recovers image styles in a supervised manner. Meanwhile, to address the problems in current feature transformation methods, we propose decoupled instance normalization to decompose feature transformation into style whitening and restylization. It works quite well in ColoristaNet and can transfer image styles efficiently while keeping photorealism. To ensure temporal coherency, we also incorporate optical flow methods and ConvLSTM to embed contextual information. Experiments demonstrate that ColoristaNet can achieve better stylization effects when compared with state-of-the-art algorithms.","Photorealistic Video Style Transfer, ColoristaNet, Style Removal-and-Reconstruction, Decoupled Instance Normalization" Auto-Encoding Adversarial Imitation Learning,https://openreview.net/forum?id=HpEfFkzHUgt,https://openreview.net/pdf?id=HpEfFkzHUgt,This paper presents a new adversarial imitation learning method based on auto-encoding.,"Reinforcement learning (RL) provides a powerful framework for decision-making, but its application in practice often requires a carefully designed reward function. Adversarial Imitation Learning (AIL) sheds light on automatic policy acquisition without access to the reward signal from the environment. In this work, we propose Auto-Encoding Adversarial Imitation Learning (AEAIL), a robust and scalable AIL framework. To induce expert policies from demonstrations, AEAIL utilizes the reconstruction error of an auto-encoder as a reward signal, which provides more information for optimizing policies than the prior discriminator-based ones. Subsequently, we use the derived objective functions to train the auto-encoder and the agent policy. Experiments show that AEAIL performs favorably compared to state-of-the-art methods in the MuJoCo environments. More importantly, AEAIL shows much better robustness when the expert demonstrations are noisy. Specifically, our method achieves $11\%$ and $50.7\%$ relative improvements overall compared to the best baselines, GAIL and PWIL, on clean and noisy expert data, respectively. Video results, open-source code and dataset are available in the supplementary materials. 
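The AEAIL entry above uses an auto-encoder's reconstruction error as the imitation reward; the sketch below is a minimal stand-in with a random linear auto-encoder and an assumed exp(-error) shaping, not the paper's trained network or exact reward form.

```python
import numpy as np

class AEReward:
    """Auto-encoder-based reward for adversarial imitation (AEAIL sketch).
    The linear encoder/decoder and the exp(-error) shaping are assumptions;
    in AEAIL the auto-encoder is a neural network trained adversarially."""

    def __init__(self, dim, code=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(code, dim))   # encoder
        self.V = rng.normal(scale=0.1, size=(dim, code))   # decoder

    def recon_error(self, sa):
        return np.sum((sa - self.V @ (self.W @ sa)) ** 2)

    def reward(self, sa):
        # Low reconstruction error (expert-like state-action) => high reward.
        return np.exp(-self.recon_error(sa))

r = AEReward(dim=6)
print(r.reward(np.zeros(6)))  # 1.0 for a perfectly reconstructed input
```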
","imitation learning, reinforcement learning, auto-encoders" $\Delta$-PINNs: physics-informed neural networks on complex geometries,https://openreview.net/forum?id=5P96KWeULzE,https://openreview.net/pdf?id=5P96KWeULzE,We encode the geometry using the Laplace-Beltrami eigenfunctions to solve partial differential equations with physics-informed neural networks on complex geometries.,"Physics-informed neural networks (PINNs) have demonstrated promise in solving forward and inverse problems involving partial differential equations. Despite recent progress on expanding the class of problems that can be tackled by PINNs, most of existing use-cases involve simple geometric domains. To date, there is no clear way to inform PINNs about the topology of the domain where the problem is being solved. In this work, we propose a novel positional encoding mechanism for PINNs based on the eigenfunctions of the Laplace-Beltrami operator. This technique allows to create an input space for the neural network that represents the geometry of a given object. We approximate the eigenfunctions as well as the operators involved in the partial differential equations with finite elements. We extensively test and compare the proposed methodology against traditional PINNs in complex shapes, such as a coil, a heat sink and a bunny, with different physics, such as the Eikonal equation and heat transfer. We also study the sensitivity of our method to the number of eigenfunctions used, as well as the discretization used for the eigenfunctions and the underlying operators. Our results show excellent agreement with the ground truth data in cases where traditional PINNs fail to produce a meaningful solution. We envision this new technique will expand the effectiveness of PINNs to more realistic applications.","deep learning, Laplace-Beltrami, physics-informed neural networks, partial differential equations" On the Nonconvex Convergence of SGD,https://openreview.net/forum?id=OmGZ7ymnSno,https://openreview.net/pdf?id=OmGZ7ymnSno,"This paper shows that the $\epsilon$-stationary point exists in the final iterates of SGDs in minimizing nonconvex objectives, not just anywhere in the entire range of iterates---A much stronger result than the existing one.","Stochastic gradient descent (SGD) and its variants are the main workhorses for solving large-scale optimization problems with nonconvex objective functions. Although the convergence of SGDs in the (strongly) convex case is well-understood, their convergence for nonconvex functions stands on weak mathematical foundations. Most existing studies on the nonconvex convergence of SGD show the complexity results based on either the minimum of the expected gradient norm or the functional sub-optimality gap (for functions with extra structural property) by searching over the entire range of iterates. Hence the last iterations of SGDs do not necessarily maintain the same complexity guarantee. This paper shows that the $\epsilon$-stationary point exists in the final iterates of SGDs, not just anywhere in the entire range of iterates---A much stronger result than the existing one. Additionally, our analyses allow us to measure the \emph{density of the $\epsilon$-stationary points} in the final iterates of SGD, and we recover the classical $O(\frac{1}{\sqrt{T}})$ asymptotic rate under various existing assumptions on the regularity of the objective function and the bounds on the stochastic gradient. 
","Stochastic gradient descent, nonconvex optimization, nonsmooth optimization, random-reshuffling stochastic gradient descent" BiTAT: Neural Network Binarization with Task-Dependent Aggregated Transformation,https://openreview.net/forum?id=fwt8vkm_9Hn,https://openreview.net/pdf?id=fwt8vkm_9Hn,,"Neural network quantization aims to transform high-precision weights and activations of a given neural network into low-precision weights/activations for reduced memory usage and computation, while preserving the performance of the original model. However, extreme quantization (1-bit weight/1-bit activations) of compactly-designed backbone architectures (e.g., MobileNets) often used for edge-device deployments results in severe performance degeneration. This paper proposes a novel Quantization-Aware Training (QAT) method that can effectively alleviate performance degeneration even with extreme quantization by focusing on the inter-weight dependencies, between the weights within each layer and across consecutive layers. To minimize the quantization impact of each weight on others, we perform an orthonormal transformation of the weights at each layer by training an input-dependent correlation matrix and importance vector, such that each weight is disentangled from the others. Then, we quantize the weights based on their importance to minimize the loss of the information from the original weights/activations. We further perform progressive layer-wise quantization from the bottom layer to the top, so that quantization at each layer reflects the quantized distributions of weights and activations at previous layers. We validate the effectiveness of our method on various benchmark datasets against strong neural quantization baselines, demonstrating that it alleviates the performance degeneration on ImageNet and successfully preserves the full-precision model performance on CIFAR-100 with compact backbone networks.", Dynamic Loss for Learning with Label Noise,https://openreview.net/forum?id=J_kUIC1DNHJ,https://openreview.net/pdf?id=J_kUIC1DNHJ,"To handle the mismatch between the statics of robust loss functions and the dynamics of DNNs learning with label noise, we propose a dynamic loss function which improves robustness gradually.","Label noise is verified seriously harmful to deep neural networks (DNNs). A simple and scalable strategy to handle this problem is to design robust loss functions, which improve generalization in the presence of label noise by reconciling fitting ability with robustness. However, the widely-used static trade-off between the two contradicts the dynamics of DNNs learning with label noise, leading to an inferior performance. Therefore, in this paper, we propose a dynamic loss function to solve this problem. Specifically, DNNs tend to first learn generalized patterns, then gradually overfit label noise. In light of this, we make fitting ability stronger initially, then gradually increase the weight of robustness. Moreover, we let DNNs put more emphasis on easy examples than hard ones at the later stage since the former are correctly labeled with a higher probability, further reducing the negative impact of label noise. Extensive experimental results on various benchmark datasets demonstrate the state-of-the-art performance of our method. 
We will open-source our code very soon.","label noise, robust loss function, dynamic" Memory of Unimaginable Outcomes in Experience Replay,https://openreview.net/forum?id=-5fSvp1ofdd,https://openreview.net/pdf?id=-5fSvp1ofdd,"This paper proposes techniques to add only the most relevant experiences to the replay buffer, using model uncertainty as the selection criterion.","Model-based reinforcement learning (MBRL) applies a single-shot dynamics model to imagined actions to select those with the best expected outcome. The dynamics model is an unfaithful representation of the environment physics, and its capacity to predict the outcome of a future action varies as it is trained iteratively. An experience replay buffer collects the outcomes of all actions executed in the environment and is used to iteratively train the dynamics model. With growing experience, it is expected that the model becomes more accurate at predicting the outcome and expected reward of imagined actions. However, training times and memory requirements drastically increase with the growing collection of experiences. Indeed, it would be preferable to retain only those experiences that could not be anticipated by the model while interacting with the environment. We argue that doing so results in a lean replay buffer with diverse experiences that correspond directly to the model's predictive weaknesses at a given point in time. We propose strategies for: i) determining reliable predictions of the dynamics model with respect to the imagined actions, ii) retaining only the unimaginable experiences in the replay buffer, and iii) training further only when sufficient novel experience has been acquired. We show that these contributions lead to lower training times, a drastic reduction of the replay buffer size, fewer updates to the dynamics model, and less catastrophic forgetting, all of which enable the effective implementation of continual-learning agents using MBRL.","Transfer Multitask and Meta-learning, Robotics, Model-Based Reinforcement Learning, Batch/Offline RL, Deep RL, Continuous Action RL" Temperature Schedules for self-supervised contrastive methods on long-tail data,https://openreview.net/forum?id=ejHUr4nfHhD,https://openreview.net/pdf?id=ejHUr4nfHhD,Simple temperature schedules in self-supervised contrastive learning improve representation learning on long-tail distributions,"Most approaches for self-supervised learning (SSL) are optimised on curated balanced datasets, e.g. ImageNet, despite the fact that natural data usually exhibits long-tail distributions. In this paper, we analyse the behaviour of one of the most popular variants of SSL, i.e. contrastive methods, on imbalanced data. In particular, we investigate the role of the temperature parameter $\tau$ in the contrastive loss, by analysing the loss through the lens of average distance maximisation, and find that a large $\tau$ emphasises group-wise discrimination, whereas a small $\tau$ leads to a higher degree of instance discrimination. While $\tau$ has thus far been treated exclusively as a constant hyperparameter, in this work, we propose to employ a dynamic $\tau$ and show that a simple cosine schedule can yield significant improvements in the learnt representations. Such a schedule results in a constant `task switching' between an emphasis on instance discrimination and group-wise discrimination and thereby ensures that the model learns both group-wise features, as well as instance-specific details. 
Since frequent classes benefit from the former, while infrequent classes require the latter, we find this method to consistently improve separation between the classes in long-tail data without any additional computational cost. ","contrastive learning, long-tail data, self-supervised learning, temperature, analysis" Deep Learning on Implicit Neural Representations of Shapes,https://openreview.net/forum?id=OoOIW-3uadi,https://openreview.net/pdf?id=OoOIW-3uadi,,"Implicit Neural Representations (INRs) have emerged in the last few years as a powerful tool to continuously encode a variety of different signals, such as images, videos, audio and 3D shapes. When applied to 3D shapes, INRs allow us to overcome the fragmentation and shortcomings of the popular discrete representations used so far. Yet, considering that INRs consist of neural networks, it is not clear whether and how it may be possible to feed them into deep learning pipelines aimed at solving a downstream task. In this paper, we put forward this research problem and propose inr2vec, a framework that can compute a compact latent representation for an input INR in a single inference pass. We verify that inr2vec can effectively embed the 3D shapes represented by the input INRs and show how the produced embeddings can be fed into deep learning pipelines to solve several tasks by processing exclusively INRs.", Continual Vision-Language Representaion Learning with Off-Diagonal Information,https://openreview.net/forum?id=X1-0f_y88F9,https://openreview.net/pdf?id=X1-0f_y88F9,"We explore the feasibility of training CLIP models continuously on streaming data, identify the cause of cognitive disorder in continual CLIP training, and propose a new framework, Mod-X, to alleviate it.","Multimodal pre-trained methods with a contrastive learning framework (like CLIP) have recently achieved consistent advantages on various cross-modal downstream tasks. However, they usually require a large amount of image-text samples and a vast computing budget for training, which makes re-training expensive when training data is collected continuously (a widespread phenomenon in real scenarios). In this paper, we discuss the feasibility of continuously training CLIP models on discrete streaming data. We find that the multimodal retrieval performance of CLIP in a continual training setting is significantly lower than that in a joint training setting. We name this phenomenon Cognitive Disorder (CD). By tracking the directional changes of the representation vectors in the continuously updated CLIP model, we explore and summarize the spatial variation of the modal encoders within CLIP: Intra-modal Rotation and Inter-modal Deviation. Intra-modal Rotation means that the vision and language representation spaces in CLIP rotate greatly around the center of a high-dimensional unit sphere during continual training, accompanied by a relatively small change in the topology of the representation space. Inter-modal Deviation happens when the intra-modal rotations of vision and language are unsynchronized. Moreover, we empirically and theoretically demonstrate how intra-modal rotation and inter-modal deviation lead to CD. In order to alleviate CD in continual CLIP training, we propose a new continual training framework, Mod-X: Maintain off-diagonal information matrix. 
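The cosine temperature schedule proposed in the temperature-schedules entry above is easy to state in code; the endpoints and period below are assumed values, and the InfoNCE term is written for a single anchor given precomputed similarities.

```python
import numpy as np

def tau_cosine(step, period=1000, tau_min=0.1, tau_plus=1.0):
    """Cosine temperature schedule: tau sweeps between instance discrimination
    (small tau) and group-wise discrimination (large tau). The endpoints and
    period are assumed values, not the paper's exact settings."""
    return tau_min + 0.5 * (tau_plus - tau_min) * (1 + np.cos(2 * np.pi * step / period))

def info_nce(sim_pos, sim_all, tau):
    """InfoNCE loss for one anchor given cosine similarities to all candidates."""
    return -(sim_pos / tau) + np.log(np.sum(np.exp(sim_all / tau)))

sims = np.array([0.9, 0.2, -0.1, 0.4])     # sims[0] is the positive pair
for step in (0, 250, 500):                  # tau sweeps 1.0 -> 0.55 -> 0.1
    tau = tau_cosine(step)
    print(step, round(tau, 2), round(info_nce(sims[0], sims, tau), 3))
```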
By selectively aligning the off-diagonal information distribution of the contrastive matrices, Mod-X helps the model not only better fit the newly trained data domain but also maintain its multimodal cognitive ability on the old data domain during continual large-scale training (Section \ref{experiments}).","representation learning, continual learning" Learning Counterfactually Invariant Predictors,https://openreview.net/forum?id=ERjQnrmLKH4,https://openreview.net/pdf?id=ERjQnrmLKH4,"We propose a new technique to train predictors that are counterfactually invariant, i.e., robust to interventions on specified covariates.","We propose a method to learn predictors that are invariant under counterfactual changes of certain covariates. This method is useful when the prediction target is causally influenced by covariates that should not affect the predictor output. For instance, this could prevent an object recognition model from being influenced by the position, orientation, or scale of the object itself. We propose a model-agnostic regularization term based on conditional kernel mean embeddings to enforce counterfactual invariance during training. We prove the soundness of our method, which can handle mixed categorical and continuous multivariate attributes. Empirical results on synthetic and real-world data demonstrate the efficacy of our method in a variety of settings.","causality, kernel mean embeddings, counterfactual fairness, counterfactual invariance" Deep Reinforcement Learning for Cryptocurrency Trading: Practical Approach to Address Backtest Overfitting,https://openreview.net/forum?id=2U_AM7TcRQK,https://openreview.net/pdf?id=2U_AM7TcRQK,A practical approach to address backtest overfitting for cryptocurrency trading using deep reinforcement learning.,"Designing profitable and reliable trading strategies is challenging in the highly volatile cryptocurrency market. Existing works have applied deep reinforcement learning methods and optimistically reported increased profits in backtesting, which may suffer from the \textit{false positive} issue due to overfitting. In this paper, we propose a practical approach to address backtest overfitting for cryptocurrency trading using deep reinforcement learning. First, we formulate the detection of backtest overfitting as a hypothesis test. Then, we train the DRL agents, estimate the probability of overfitting, and reject the overfitted agents, increasing the chance of good trading performance. Finally, on 10 cryptocurrencies over a testing period from 05/01/2022 to 06/27/2022 (during which the crypto market \textbf{crashed two times}), we show that the less overfitted deep reinforcement learning agents have a higher return than more overfitted agents, an equal-weight strategy, and the S\&P DBM Index (market benchmark), offering confidence in possible deployment to a real market.","Computing methodologies, Markov decision processes, Neural networks, Reinforcement learning" ImaginaryNet: Learning Object Detectors without Real Images and Annotations,https://openreview.net/forum?id=9MbhFHqrti9,https://openreview.net/pdf?id=9MbhFHqrti9,"This paper proposes ImaginaryNet, which obtains about 70% of the performance of its real-data-trained counterpart in object detection without real images or annotations, and further improves performance by incorporating real images and annotations.","Humans can easily detect a known concept without the need for training on real examples. 
Equipping deep learning with this ability may allow neural networks to learn complex vision models, e.g., object detection, without collecting and annotating real images. In this paper, we define a novel paradigm, Imaginary-Supervised Object Detection (ISOD), where no real images or manual annotations are used for training object detectors. To resolve this challenge, we propose ImaginaryNet, a framework to learn object detectors by combining a pretrained language model with text-to-image synthesis models. In particular, photo-realistic images can be generated by the text-to-image model, and class labels can be obtained from the text generated by the language model. Then, weakly supervised object detection is leveraged to learn the detector without real images and manual annotations. By gradually introducing real images and manual annotations, ImaginaryNet can collaborate with other supervision settings to further boost detection performance. Experiments show that ImaginaryNet can (i) obtain about 70% of the performance in ISOD achieved by the weakly supervised counterpart of the same backbone trained on real data, and (ii) significantly improve the baseline while achieving state-of-the-art or comparable performance by incorporating real images and manual annotations.","Object detection, Visual synthesis, Generative model" Don't Throw Your Old Policies Away: Knowledge-based Policy Recycling Protects Against Adversarial Attacks,https://openreview.net/forum?id=oCx90Ezdox_,https://openreview.net/pdf?id=oCx90Ezdox_,Incorporating domain knowledge over auxiliary tasks enhances deep reinforcement learning policy robustness against adversarial attacks in both Atari games and a high-dimensional Robot Food Court environment.,"Recent work has shown that Deep Reinforcement Learning (DRL) is vulnerable to adversarial attacks, in which minor perturbations of input signals cause agents to behave inappropriately and unexpectedly. Humans, on the other hand, appear robust to these particular sorts of input variations. We posit that part of this robustness stems from accumulated knowledge about the world. In this work, we propose to leverage prior knowledge to defend against adversarial attacks in RL settings using a framework we call Knowledge-based Policy Recycling (KPR). Different from previous defense methods such as adversarial training and robust learning, KPR incorporates domain knowledge over a set of auxiliary task policies and learns relations among them from interactions with the environment via a Graph Neural Network (GNN). KPR can use any relevant policy as an auxiliary policy and, importantly, does not assume access to, or information regarding, the adversarial attack. Empirically, KPR results in policies that are more robust to various adversarial attacks in Atari games and a simulated Robot Food Court environment. ","Domain Knowledge, Knowledge Representation, Representation Learning, Policy Ensemble" NÜWA-LIP: Language-guided Image Inpainting with Defect-free VQGAN,https://openreview.net/forum?id=HYfD5CoCjDX,https://openreview.net/pdf?id=HYfD5CoCjDX,This paper proposes NÜWA-LIP, which incorporates DF-VQGAN with MP-S2S to address receptive spreading and information loss in language-guided image inpainting.,"Language-guided image inpainting aims to fill in the defective regions of an image under the guidance of text while keeping the non-defective regions unchanged. 
However, the encoding process of existing models suffers from either receptive spreading of defective regions or information loss in non-defective regions, giving rise to visually unappealing inpainting results. To address these issues, this paper proposes NÜWA-LIP, which incorporates a defect-free VQGAN (DF-VQGAN) with a multi-perspective sequence-to-sequence module (MP-S2S). In particular, DF-VQGAN introduces relative estimation to control receptive spreading and adopts symmetrical connections to protect information. MP-S2S further enhances visual information from complementary perspectives, including both low-level pixels and high-level tokens. Experiments show that DF-VQGAN performs much more robustly than VQGAN. To evaluate language-guided inpainting, we build three open-domain benchmarks, on which NÜWA-LIP is also superior to recent strong baselines.","Language-guided image inpainting, Vector quantization, Visual synthesis, Generative model" "Contextual bandits with concave rewards, and an application to fair ranking",https://openreview.net/forum?id=UT-_SVOyD1H,https://openreview.net/pdf?id=UT-_SVOyD1H,"We show a reduction of concave multi-reward contextual bandits to classical single-reward bandits, and apply this reduction to a fair ranking problem.","We consider Contextual Bandits with Concave Rewards (CBCR), a multi-objective bandit problem where the desired trade-off between the rewards is defined by a known concave objective function, and the reward vector depends on an observed stochastic context. We present the first algorithm with provably vanishing regret for CBCR without restrictions on the policy space, whereas prior works were restricted to finite policy spaces or tabular representations. Our solution is based on a geometric interpretation of CBCR algorithms as optimization algorithms over the convex set of expected rewards spanned by all stochastic policies. Building on Frank-Wolfe analyses in constrained convex optimization, we derive a novel reduction from the CBCR regret to the regret of a \emph{scalar-reward} bandit problem. We illustrate how to apply the reduction off-the-shelf to obtain algorithms for CBCR with both linear and general reward functions, in the case of non-combinatorial actions. Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives, leading to the first algorithm with regret guarantees for contextual combinatorial bandits with fairness of exposure.","bandits, concave rewards, fairness, learning to rank" Contrastive Adversarial Loss for Point Cloud Reconstruction,https://openreview.net/forum?id=Oj1ceY_qohC,https://openreview.net/pdf?id=Oj1ceY_qohC,Learn a point cloud reconstruction loss via a contrastive constraint and adversarial training,"For point cloud reconstruction-related tasks, reconstruction losses that evaluate the shape differences between reconstructed results and the ground truths are typically used to train the task networks. The Chamfer Distance (CD) and Earth Mover's Distance (EMD) are two widely-used reconstruction losses, which first use predefined strategies to match points in two point clouds and then apply the average distances from points to their matched neighbors as differentiable measurements of shape differences. However, the predefined matching rules may deviate from the real shape differences and cause defective reconstructed results. 
To solve the above problem, we propose a learning-based Contrastive Adversarial Loss (CALoss) to train a reconstruction-related task network without predefined matching rules. CALoss learns to evaluate shape differences by combining a contrastive constraint with an adversarial strategy. Specifically, we use the contrastive constraint to help CALoss learn shape similarity, while we introduce the adversarial strategy to help CALoss mine differences between reconstructed results and ground truths. According to experiments on reconstruction-related tasks, CALoss can help task networks improve reconstruction performance and learn more representative features.","Point clouds, reconstruction loss, learning-based" Low-complexity Deep Video Compression with A Distributed Coding Architecture,https://openreview.net/forum?id=QiORiW-NNqr,https://openreview.net/pdf?id=QiORiW-NNqr,"We design the first end-to-end distributed deep video compression framework based on the distributed coding paradigm, which outperforms traditional distributed video codecs and achieves competitive performance with H.264.","Prevalent video compression methods follow a $predictive\;coding$ architecture that relies on a heavy encoder to exploit statistical redundancy, which makes it challenging to deploy them on resource-constrained devices. Meanwhile, as early as the 1970s, distributed source coding theory, namely the Slepian-Wolf and Wyner-Ziv theorems, indicated that efficient compression of correlated sources can be achieved by exploiting the source statistics at the decoder only, with the help of effective side information (SI). This has inspired a $distributed\;coding$ architecture that is promising for reducing encoder complexity. While there have been some attempts to develop practical distributed video coding systems, traditional methods suffer from a substantial performance gap to the predictive coding architecture. Inspired by the recent successes of deep learning in enhancing image and video compression, we propose the first end-to-end distributed deep video compression (Distributed DVC) framework with neural network-based modules that can be optimized to improve the rate-distortion performance. A key ingredient is an effective SI generation module at the decoder, which helps to effectively exploit the inter-frame correlation without computation-intensive encoder-side motion estimation and compensation. Experiments show that Distributed DVC significantly outperforms conventional distributed video coding methods and H.264. Meanwhile, it enjoys a $6\sim7\times$ encoding speedup over DVC with only a 1.61\% increase in bitrate for 1080P test videos on the UVG dataset.","Deep video compression, distributed coding, low encoder complexity" When Few-shot Meets Cross-domain Object Detection: Learning Instance-level Class Prototypes for Knowledge Transfer,https://openreview.net/forum?id=cpiQjF6I7XA,https://openreview.net/pdf?id=cpiQjF6I7XA,,"Adversarial learning is typically used to tackle cross-domain object detection, which transfers a well-trained model from a source domain to a new target domain, assuming that extensive unlabeled training data from the new domain can be easily obtained. However, such an assumption may not hold in many access-constrained target scenarios such as biomedical applications. 
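For reference, the predefined nearest-neighbour matching that the CALoss entry above sets out to replace is exactly the Chamfer Distance; a minimal NumPy version:

```python
import numpy as np

def chamfer_distance(P, Q):
    """Chamfer Distance between point clouds P (n, 3) and Q (m, 3): the
    predefined nearest-neighbour matching that CALoss aims to replace with
    a learned loss."""
    d2 = np.sum((P[:, None, :] - Q[None, :, :]) ** 2, axis=-1)  # (n, m) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

rng = np.random.default_rng(0)
P = rng.normal(size=(128, 3))
print(chamfer_distance(P, P))                                   # 0.0 for identical clouds
print(chamfer_distance(P, P + rng.normal(scale=0.05, size=P.shape)))
```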
To this end, we study the Few-Shot Domain Adaptation Object Detection (FSDAOD) problem, where only a few labeled instances from the target domain are available. These few instances can hardly represent the target domain data distribution comprehensively, making adversarial feature alignment based on them insufficient to transfer complete knowledge from source to target. Benefiting from the success of prototype-based meta-learning in the few-shot learning community, we propose an Instance-level Prototype learning Network (IPNet) for addressing the FSDAOD problem. The IPNet first develops an Instance-level Prototypical Meta-alignment (IPM) module, which fuses instances from both domains to learn domain-invariant prototypical representations, boosting the adaptation model’s discriminability between different classes. A Feature-level Spatial Attention Transfer (FSAT) module is then developed, which employs the instance-level prototypes to discriminate the salience of various features in one domain so that the model attends to foreground regions, and then transfers such attention extraction knowledge to the target images via adversarial learning. Extensive experiments are conducted on several datasets with domain variances, including cross-weather changes, cross-sensor differences, and cross-style variances, and the results show consistent accuracy gains of the IPNet over state-of-the-art methods, e.g., a 9.7% mAP increase on the Cityscapes-to-FoggyCityscapes setting and a 3.0% mAP increase on the Sim10k-to-Cityscapes setting.","Domain Adaptive Object Detection, Few-shot Object Detection, Domain Adaptation" Gradient Boosting Performs Gaussian Process Inference,https://openreview.net/forum?id=3VKiaagxw1S,https://openreview.net/pdf?id=3VKiaagxw1S,"We prove that gradient boosting converges to a Gaussian process' posterior mean and can be transformed into a sampler from the posterior, which leads to improved knowledge uncertainty estimates.","This paper shows that gradient boosting based on symmetric decision trees can be equivalently reformulated as a kernel method that converges to the solution of a certain Kernel Ridge Regression problem. Thus, we obtain convergence to a Gaussian process' posterior mean, which, in turn, allows us to easily transform gradient boosting into a sampler from the posterior, providing better knowledge uncertainty estimates through Monte-Carlo estimation of the posterior variance. We show that the proposed sampler allows for better knowledge uncertainty estimates, leading to improved out-of-domain detection.","gradient boosting, gaussian process, knowledge uncertainty, kernel gradient boosting" Constrained Reinforcement Learning for Safety-Critical Tasks via Scenario-Based Programming,https://openreview.net/forum?id=BLOkjU9iS24,https://openreview.net/pdf?id=BLOkjU9iS24,"A novel technique for incorporating domain-expert knowledge to train a constrained DRL agent, based on the scenario-based programming paradigm; we validated our method on the popular robotic mapless navigation problem, both physically and in simulation.","Deep reinforcement learning (DRL) has achieved groundbreaking successes in various applications, including robotics. A natural consequence is the adoption of this paradigm for safety-critical tasks, where human safety and expensive hardware can be involved. In this context, it is crucial to optimize the performance of DRL-based agents while providing guarantees about their behavior. 
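The gradient-boosting entry above turns boosting into an approximate posterior sampler; the sketch below imitates the idea with an ensemble of independently randomized scikit-learn GBDTs, a stand-in for the paper's symmetric-tree construction, using predictive variance as a knowledge-uncertainty proxy.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Monte-Carlo knowledge-uncertainty sketch: several independently randomized
# boosted models act as approximate posterior samples; their predictive
# variance flags out-of-domain inputs. (sklearn's GBDT is an assumed stand-in;
# the paper works with symmetric-tree boosting and a principled sampler.)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=500)

ensemble = [GradientBoostingRegressor(subsample=0.5, random_state=i).fit(X, y)
            for i in range(10)]
X_test = np.array([[0.0], [5.0]])            # in-domain vs far out-of-domain
preds = np.stack([m.predict(X_test) for m in ensemble])
print(preds.var(axis=0))                     # variance is typically larger off-domain
```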
This paper presents a novel technique for incorporating domain-expert knowledge into a constrained DRL training loop. Our technique exploits the scenario-based programming paradigm, which is designed to specify such knowledge in a simple and intuitive way. While our approach can be considered general purpose, we validated our method by performing experiments on a synthetic set of benchmark environments and on the popular robotic mapless navigation problem, both in simulation and on the actual platform. Our results demonstrate that using our approach to leverage expert knowledge dramatically improves the safety and performance of the agent.","Constrained Reinforcement Learning, Scenario Based Programming, Safety, Robotic Navigation" TGP: Explainable Temporal Graph Neural Networks for Personalized Recommendation,https://openreview.net/forum?id=EGobBwPc1J-,https://openreview.net/pdf?id=EGobBwPc1J-,,"The majority of item retrieval algorithms in typical ""retrieval-rank-rerank"" structured recommendation systems can be separated into three categories: deep latent, sequential and graph-based recommenders, which collect collaborative-filtering, sequential and homogeneous signals respectively. However, there is a conceptual overlap between sequential and graph recommenders regarding a user's past interacted items. This triggers the idea that the sequential, collaborative-filtering and homogeneous signals can be included in one temporal-graph data structure, and that the sequential, latent and graph learning algorithms can be summarized as one temporal graph encoder. In this paper, the Temporal Graph Plugin (TGP) is proposed as such an explainable temporal graph encoder, supplementing deep latent algorithms with aggregated $k$-hop temporal neighborhood messages via a local attention module. We conduct extensive experiments on two public datasets, Reddit and Wikipedia, where TGP exceeds SOTA sequential, latent and graph algorithms by $1.1\%$, $52.8\%$ and $98.9\%$ respectively, partially verifying the proposed hypothesis. Code will be made public upon acceptance.","deep learning, graph neural networks, temporal graph, retrieval models, recommendation system" When is Adversarial Robustness Transferable?,https://openreview.net/forum?id=mWJ0QKcPgzX,https://openreview.net/pdf?id=mWJ0QKcPgzX,We study how adversarial robustness can be preserved during transfer from a source domain to a target domain by using randomized smoothing and adversarial attacks to analyze different training and target-retraining procedures.,"Knowledge transfer is an effective tool for learning, especially when labeled data is scarce or when training from scratch is prohibitively costly. The overwhelming majority of the transfer learning literature is focused on obtaining accurate models, neglecting the issue of adversarial robustness. Yet, robustness is essential, particularly when transferring to safety-critical domains. We analyze and compare how different training procedures on the source domain and different fine-tuning strategies on the target domain affect robustness. More precisely, we study 10 training schemes for source models and 3 for target models, including normal, adversarial, contrastive and Lipschitz-constrained variants. We quantify model robustness via randomized smoothing and adversarial attacks. Our results show that improving model robustness on the source domain increases robustness on the target domain. Target retraining has a minor influence on target model robustness. 
These results indicate that model robustness is preserved during target retraining and transferred from the source domain to the target domain. ","transfer learning, adversarial robustness" COFS: COntrollable Furniture layout Synthesis,https://openreview.net/forum?id=khF4d1SRrGH,https://openreview.net/pdf?id=khF4d1SRrGH,Language Models need an order. Layouts have no order. We show how to modify a Language Model to be order-equivariant.,"Realistic, scalable, and controllable generation of furniture layouts is essential for many applications in virtual reality, augmented reality, game development and synthetic data generation. The most successful current methods tackle this problem as a sequence generation problem, which imposes a specific ordering on the elements of the layout, making it hard to exert fine-grained control over the attributes of a generated scene. Existing methods provide control through object-level conditioning, or scene completion, where generation can be conditioned on an arbitrary subset of furniture objects. However, attribute-level conditioning, where generation can be conditioned on an arbitrary subset of object attributes, is not supported. We propose COFS, a method to generate furniture layouts that enables fine-grained control through attribute-level conditioning. For example, COFS allows specifying only the scale and type of objects that should be placed in the scene while the generator chooses their positions and orientations; or the position that should be occupied by objects can be specified while the generator chooses their type, scale, orientation, etc. Our results show both qualitatively and quantitatively that we significantly outperform existing methods on attribute-level conditioning.","generative modelling, conditional generation, layouts, transformers" Distribution Shift Detection for Deep Neural Networks,https://openreview.net/forum?id=yZCpZrUqzK0,https://openreview.net/pdf?id=yZCpZrUqzK0,,"To deploy and operate deep neural models in production, the quality of their predictions, which might be contaminated benignly or manipulated maliciously by input distributional deviations, must be monitored and assessed. Specifically, we study the case of monitoring the healthy operation of a deep neural network (DNN) receiving a stream of data, with the aim of detecting input distributional deviations under which the quality of the network's predictions is potentially damaged. Using selective prediction principles, we propose a distribution deviation detection method for DNNs. The proposed method is derived from a tight coverage generalization bound computed over a sample of instances drawn from the true underlying distribution. Based on this bound, our detector continuously monitors the operation of the network over a test window and fires off an alarm whenever a deviation is detected. This novel detection method consistently and significantly outperforms the state of the art with respect to the CIFAR-10 and ImageNet datasets, thus establishing a new performance bar for this task, while being substantially more efficient in time and space complexities.","Selective classification, Window based shift detection" "Learning Zero-Shot Cooperation with Humans, Assuming Humans Are Biased",https://openreview.net/forum?id=TrwE8l9aJzs,https://openreview.net/pdf?id=TrwE8l9aJzs,,"There is a recent trend of applying multi-agent reinforcement learning (MARL) to train an agent that can cooperate with humans in a zero-shot fashion without using any human data. 
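Randomized smoothing, used to quantify robustness in the transferability study above, reduces to a Monte-Carlo vote over Gaussian perturbations; below is a minimal sketch for a binary classifier, where the toy classifier and the value of sigma are assumptions.

```python
import numpy as np

def smoothed_predict(f, x, sigma=0.25, n=1000, seed=0):
    """Monte-Carlo randomized smoothing: classify n Gaussian perturbations of
    x and return the majority class with its empirical probability; a higher
    probability implies a larger certifiable robustness radius."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(2, dtype=int)                # assume a binary classifier
    for _ in range(n):
        votes[f(x + sigma * rng.normal(size=x.shape))] += 1
    top = int(votes.argmax())
    return top, votes[top] / n

f = lambda z: int(z.sum() > 0)                    # toy linear classifier (assumption)
print(smoothed_predict(f, np.array([0.3, -0.1])))
```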
The typical workflow is to first repeatedly run self-play (SP) to build a policy pool and then train the final adaptive policy against this pool. A crucial limitation of this framework is that every policy in the pool is optimized w.r.t. the environment reward function, which implicitly assumes that the testing partners of the adaptive policy will be precisely optimizing the same reward function as well. However, human objectives are often substantially biased according to their own preferences, which can differ greatly from the environment reward. We propose a more general framework, Hidden-Utility Self-Play (HSP), which explicitly models human biases as hidden reward functions in the self-play objective. By approximating the reward space as linear functions, HSP adopts an effective technique to generate an augmented policy pool with biased policies. We evaluate HSP on the Overcooked benchmark. Empirical results show that our HSP method produces higher rewards than baselines when cooperating with learned human models, manually scripted policies, and real humans. The HSP policy is also rated as the most assistive policy based on human feedback.","Human-AI Collaboration, Multi-Agent Reinforcement Learning" SUG: Single-dataset Unified Generalization for 3D Point Cloud Classification,https://openreview.net/forum?id=kN4IkQvvrBD,https://openreview.net/pdf?id=kN4IkQvvrBD,,"In recent years, research on zero-shot domain adaptation, namely Domain Generalization (DG), which aims to adapt a well-trained source domain model to unseen target domains without accessing any target sample, has been fast-growing in 2D image tasks such as classification and object detection. However, its exploration on 3D point cloud data is still insufficient and challenged by more complex and uncertain cross-domain variances with irregular point data structures and uneven inter-class modality distribution. In this paper, different from previous 2D DG works, we focus on the 3D DG problem, and propose a Single-dataset Unified Generalization (SUG) framework that only leverages the source domain data to alleviate the unforeseen domain differences faced by the well-pretrained source model. Specifically, we first design a Multi-grained Sub-domain Alignment (MSA) method that can constrain the learned representations to be domain-agnostic and discriminative, by performing a multi-grained feature alignment process between the split sub-domains from the single source dataset. Then, a Sample-level Domain-aware Attention (SDA) strategy is presented, which can selectively enhance easy-to-adapt samples from different sub-domains according to the sample-level inter-domain distance, to avoid negative transfer. Extensive experiments are conducted on three common 3D point cloud benchmarks. The experimental results demonstrate that the SUG framework effectively boosts the model generalization ability for unseen target domains, even outperforming the existing unsupervised domain adaptation methods that have to access extensive target domain data, where we significantly improve classification accuracy by 7.7% on the ModelNet-to-ScanNet setting and 2.3% on the ShapeNet-to-ScanNet setting.
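The exact MSA objective from the entry above is not reproduced here; as one plausible realization of aligning sub-domains split from a single source dataset, the sketch below uses an RBF-kernel maximum mean discrepancy loss. The kernel choice, bandwidth, and feature sizes are assumptions for illustration.

```python
import torch

def rbf_mmd(x, y, bandwidth=1.0):
    """Squared maximum mean discrepancy between two feature batches
    under an RBF kernel, one plausible sub-domain alignment loss."""
    def kernel(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * bandwidth ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

# Align features of two sub-domains split from one source dataset.
feats_a = torch.randn(32, 256)  # sub-domain A features (hypothetical)
feats_b = torch.randn(32, 256)  # sub-domain B features (hypothetical)
align_loss = rbf_mmd(feats_a, feats_b)
```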
Our code will be available.","3D Point Cloud Classification, Domain Adaptation" Efficient Policy Space Response Oracles,https://openreview.net/forum?id=XxnMFuv-y3h,https://openreview.net/pdf?id=XxnMFuv-y3h,,"Policy Space Response Oracle methods (PSRO) provide a general solution to approximate Nash equilibrium in two-player zero-sum games but suffer from two drawbacks: (1) the \textit{computational inefficiency} due to consistent meta-game evaluation via simulations, and (2) the \textit{exploration inefficiency} due to learning best responses against fixed meta-strategies. In this work, we propose Efficient PSRO (EPSRO) that considerably improves the efficiency of the above two steps. Central to our development is the novel subroutine of \textit{no-regret optimization} on solving \textit{unrestricted-restricted (URR)} games. By modeling EPSRO as URR game solving, one can compute the best responses and meta-strategies in a single forward pass without extra simulations. Theoretically, we prove that the proposed optimization procedures of EPSRO guarantee monotonic improvement in exploitability, which is absent in existing research on PSRO. Furthermore, we prove that the no-regret optimization has a regret bound of $\mathcal{O}(\sqrt{T\log{[(k^2+k)/2]}})$, where $k$ is the size of the restricted policy set. The pipeline of EPSRO is highly parallelized, making policy-space exploration more affordable in practice and thus enabling more behavioral diversity. Empirical evaluations on various games report that EPSRO achieves a 50x speedup in wall-time and 2.5x data efficiency while obtaining comparable exploitability against existing PSRO methods.","reinforcement learning, multi-agent reinforcement learning" An Optimal Transport Perspective on Unpaired Image Super-Resolution,https://openreview.net/forum?id=0g1JdUJF7Fr,https://openreview.net/pdf?id=0g1JdUJF7Fr,,"Real-world image super-resolution (SR) tasks often do not have paired datasets, which limits the application of supervised techniques. As a result, the tasks are usually approached by unpaired techniques based on Generative Adversarial Networks (GANs), which yield complex training losses with several regularization terms, e.g., content or identity losses. We theoretically investigate optimization problems which arise in such models and find two surprising observations.
At the same time, it provides nearly state-of-the-art performance on the large-scale unpaired AIM19 dataset.","optimal transport, unpaired image super-resolution" A Functional Perspective on Multi-Layer Out-of-Distribution Detection,https://openreview.net/forum?id=gyTuMfkOney,https://openreview.net/pdf?id=gyTuMfkOney,We propose an original approach to OOD detection based on a functional view of the network that exploits the sample's trajectories through the various layers and their statistical dependencies.,"A crucial component for implementing reliable classifiers is detecting examples far from the reference (training) distribution, referred to as out-of-distribution (OOD) samples. A key feature of OOD detection is to exploit the network by extracting statistical patterns and relationships through the pre-trained multi-layer classifier. Despite achieving solid results, state-of-the-art methods require either additional OOD examples, expensive computation of gradients, or are tied to a particular architecture, limiting their applications. This work adopts an original approach based on a functional view of the network that exploits the sample's trajectories through the various layers and their statistical dependencies. In this new framework, OOD detection translates into detecting samples whose trajectories differ from the typical behavior characterized by the training set. Our method significantly decreases the OOD detection error of classifiers trained on ImageNet and outperforms the state-of-the-art methods on average AUROC and TNR at 95% TPR. We demonstrate that the functional signature left by a sample in a network carries relevant information for OOD detection. ","Out-of-distribution detection, Deep Learning, Safety AI" The Continuous CNN: from Task-Specific to Unified CNN Architecture,https://openreview.net/forum?id=ZW5aK4yCRqU,https://openreview.net/pdf?id=ZW5aK4yCRqU,"We formulate a CNN architecture that can be used across input resolutions, lengths and dimensionalities (1D, 2D, 3D) showing its viability across several 1D, 2D and 3D tasks.","Performant Convolutional Neural Network (CNN) architectures must be tailored to specific tasks in order to incorporate considerations such as input length, resolution, and dimensionality of the data. To overcome the need for such problem-specific CNN architectures, and the fragmentation they represent to the field, we introduce the \textit{Continuous Convolutional Neural Network} (CCNN): a single CNN architecture that can be used for tasks on data of arbitrary resolution, dimensionality and length without structural changes. The key component of the CCNN is its \textit{continuous convolutional kernel} which models long-range dependencies at every layer and removes the need for downsampling and task-dependent depths used in current CNN architectures. We demonstrate the generality of our CCNN by deploying the \emph{same architecture} to tasks on sequential ($1{\rm D}$), visual ($2{\rm D}$), and point-cloud ($3{\rm D}$) data. 
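To make the continuous-kernel idea from the CCNN entry above concrete, here is a minimal 1D sketch: the kernel weights are produced by a small MLP over continuous relative positions, so one module can build a kernel for any input length. The MLP shape and kernel span are assumptions, not the paper's verified architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContinuousKernelConv1d(nn.Module):
    """Convolution whose kernel is generated by an MLP over continuous
    relative positions, so the same module handles any input length."""
    def __init__(self, in_ch, out_ch, hidden=32):
        super().__init__()
        self.in_ch, self.out_ch = in_ch, out_ch
        self.kernel_net = nn.Sequential(
            nn.Linear(1, hidden), nn.GELU(),
            nn.Linear(hidden, in_ch * out_ch),
        )

    def forward(self, x):                       # x: (batch, in_ch, length)
        ks = x.shape[-1]
        ks = ks if ks % 2 == 1 else ks - 1      # odd size for 'same' padding
        pos = torch.linspace(-1.0, 1.0, ks).unsqueeze(-1)
        w = self.kernel_net(pos)                # (ks, in_ch * out_ch)
        kernel = w.view(ks, self.in_ch, self.out_ch).permute(2, 1, 0)
        return F.conv1d(x, kernel, padding="same")

conv = ContinuousKernelConv1d(3, 8)
y_short = conv(torch.randn(2, 3, 100))          # same module, any length
y_long = conv(torch.randn(2, 3, 500))
```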
Experiments show that the CCNN matches and often outperforms the current state-of-the-art across the tasks considered.","convolutional neural networks, continuous convolutional kernels, CNNs, continuous parameterizations, sequential data, visual data, point-cloud data" Target-specific Peptide Design by Multi-fragments Autoregressive Generative Model,https://openreview.net/forum?id=2hTj3upOnNf,https://openreview.net/pdf?id=2hTj3upOnNf,"We present the first purely data-driven approach for designing target-specific peptides. A multi-fragment autoregressive generative model is developed. Whether or not anchors are provided, our approach performs remarkably well in peptide design. ","Therapeutic peptides are a new class of pharmaceutical molecules composed of a series of amino acids. De novo design of peptides that can bind to any protein target is a more challenging task in comparison to designing single proteins. Most current approaches heavily rely on physical-chemical energy functions. To address this problem, we propose a purely data-driven machine learning method to achieve sequence-structure co-design of peptides that can bind to any given protein target. Our model characterizes an important biological intuition that there are only a few “anchor” amino acids on the peptide interacting with the functional site of the binding protein and determining the binding affinity. We also define a new optimization algorithm that can connect all the anchors and fragments into one feasible peptide. Experiments show that our model could recover the native peptide sequence with about 26.7% accuracy. The generated 3D peptide structures have also been shown to have lower physical and chemical energy with high sequence-structure consistency.","Protein design, Machine learning, Deep learning, Graph neural network, Autoregressive generative model" Ahead-of-Time P-Tuning,https://openreview.net/forum?id=8IBtyLQ8GKw,https://openreview.net/pdf?id=8IBtyLQ8GKw,"A novel method for parameter-efficient fine-tuning. Can perform multi-task inference like P-Tuning, but up to 1.3x faster.","This paper proposes a new parameter-efficient method for fine-tuning, AoT P-Tuning. This method adds input-dependent biases before evaluating the Transformer layer, reducing the required evaluation time when compared to P-Tuning. As with P-Tuning, AoT P-Tuning allows multi-task inference with a single backbone model for evaluating different tasks in a single batch. We experimented with the proposed method on the GLUE and SuperGLUE benchmarking datasets using RoBERTa-Base, RoBERTa-Large, and DeBERTa-XL backbone models. Our observations show that AoT P-Tuning performed on par with or better than P-Tuning v2 while being up to $1.3\times$ faster during inference.","Efficient Fine-Tuning, P-Tuning, Multi-Task Inference, Transformers, GLUE, SuperGLUE" MAXENT LOSS: CONSTRAINED MAXIMUM ENTROPY FOR CALIBRATING DEEP NEURAL NETWORKS,https://openreview.net/forum?id=w4ojb-FIq72,https://openreview.net/pdf?id=w4ojb-FIq72,"A novel loss function involving constraints, used to improve model calibration on OOD data.","Miscalibration distorts the relationship between a model's confidence and its correctness, making models unreliable for real-world deployment. In general, we want dependable and meaningful probabilistic estimations of our model's uncertainty, which are essential in real-world applications. 
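For the Ahead-of-Time P-Tuning entry above, here is one minimal way the input-dependent bias idea could look; the per-token lookup-table parameterization is an assumption for illustration, not the paper's confirmed design.

```python
import torch
import torch.nn as nn

class AoTBias(nn.Module):
    """Input-dependent bias added to hidden states before a frozen
    Transformer layer: one learned bias vector per vocabulary token,
    gathered from the input ids (a sketch with a simple per-token
    lookup; the real method may differ)."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.bias = nn.Embedding(vocab_size, hidden_size)
        nn.init.zeros_(self.bias.weight)  # start as a no-op

    def forward(self, hidden_states, input_ids):
        return hidden_states + self.bias(input_ids)

h = torch.randn(2, 16, 768)             # (batch, seq, hidden)
ids = torch.randint(0, 30000, (2, 16))
h = AoTBias(30000, 768)(h, ids)         # cheap: one gather plus one add
```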
This includes inputs that are out-of-distribution (OOD), which can differ widely from the given training distribution. Motivated by the Principle of Maximum Entropy, we show that -- compared to conventional cross-entropy loss and focal loss -- training neural networks with additional statistical constraints can help improve neural calibration whilst retaining recognition accuracy. We evaluate our method extensively on different augmented and in-the-wild OOD computer vision datasets and show that our MaxEnt loss achieves state-of-the-art calibration in all cases. Our code will be made available upon acceptance.","Calibration, Out-of-distribution, Loss function, Machine learning safety, Overconfidence, Robustness, Distribution shifts" "Unsupervised Threshold Learning with ""$L$""-trend Prior For Visual Anomaly Detection",https://openreview.net/forum?id=s6QuERBSrRc,https://openreview.net/pdf?id=s6QuERBSrRc,Propose a new perspective for unsupervised visual anomaly detection,"This paper considers unsupervised threshold learning, a practical yet under-researched module of anomaly detection (AD) for image data. AD comprises two separate modules: score generation and threshold learning. Most existing studies focus on the first part. It is often assumed that if the scoring module is good, estimating an accurate threshold is within easy reach. However, we argue that in the context of computer vision, some challenges in high-dimensional space make threshold estimation a non-trivial problem. In this paper, we leverage the inherent difference between normal instances and anomalies by ranking their anomaly scores, which reveals a phenomenon involving two distinct trends. We term this the ""$L$""-trend prior. With that finding, we utilize an adaptive polynomial regression model to determine the threshold. Unlike classic threshold learners, which rely on sufficient training samples or statistical assumptions, this method is plug-and-play and can be combined with different anomaly score functions across various datasets. The evaluation results also demonstrate a clear improvement.","Unsupervised, visual anomaly detection, threshold learning" Planckian Jitter: countering the color-crippling effects of color jitter on self-supervised training,https://openreview.net/forum?id=Pia70sP2Oi1,https://openreview.net/pdf?id=Pia70sP2Oi1,,"Several recent works on self-supervised learning are trained by mapping different augmentations of the same image to the same feature representation. The data augmentations used are of crucial importance to the quality of learned feature representations. In this paper, we analyze how the color jitter traditionally used in data augmentation negatively impacts the quality of the color features in learned feature representations. To address this problem, we propose a more realistic, physics-based color data augmentation - which we call Planckian Jitter - that creates realistic variations in chromaticity and produces a model robust to illumination changes that can be commonly observed in real life, while maintaining the ability to discriminate image content based on color information. Experiments confirm that such a representation is complementary to the representations learned with the currently-used color jitter augmentation and that a simple concatenation leads to significant performance gains on a wide range of downstream datasets. 
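In the spirit of the Planckian Jitter entry above, here is a rough sketch of illuminant-style color augmentation: scale the RGB channels by a white point sampled along a coarse black-body approximation. The hardcoded white-point table is a crude approximation for illustration, not the paper's physics model.

```python
import random
import torch

# Approximate RGB white points of a black-body radiator (coarse values).
_WHITE_POINTS = {
    3000: (1.00, 0.71, 0.42),   # warm / tungsten-like
    4500: (1.00, 0.86, 0.73),
    6500: (1.00, 0.99, 1.00),   # near daylight
    10000: (0.79, 0.85, 1.00),  # cool / sky-like
}

def planckian_style_jitter(img):
    """Re-illuminate an RGB image (C, H, W in [0, 1]) with a randomly
    sampled black-body-like white point; a sketch of physics-based
    chromaticity augmentation, not the reference implementation."""
    r, g, b = _WHITE_POINTS[random.choice(list(_WHITE_POINTS))]
    gains = torch.tensor([r, g, b]).view(3, 1, 1)
    return (img * gains).clamp(0.0, 1.0)

augmented = planckian_style_jitter(torch.rand(3, 224, 224))
```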
In addition, we present a color sensitivity analysis that documents the impact of different training methods on model neurons and shows that the performance of the learned features is robust with respect to illuminant variations.","Contrastive Learning, Self-Supervised Learning, Color Features, Illuminant Invariance" Efficient and Stealthy Backdoor Attack Triggers are Close at Hand,https://openreview.net/forum?id=wLFTV-Nv2ZR,https://openreview.net/pdf?id=wLFTV-Nv2ZR,A new strategy for developing the trigger pattern of backdoor attacks with great efficiency and stealthiness using benign training data.,"A backdoor attack aims to inject a backdoor into a deep model so that the model performs normally on benign samples while maliciously predicting the input as the attacker-defined target class when the backdoor is activated by a predefined trigger pattern. Most existing backdoor attacks use a pattern that rarely occurs in benign data as the trigger pattern. In this way, the impact of the attack on the label prediction of benign data can be mitigated. However, this practice also allows the attack to be defended against, with little performance degradation on benign data, simply by preventing the trigger pattern from being activated. In this work, we present a new attack strategy to solve this dilemma. Unlike the conventional strategy, our strategy extracts the trigger pattern from benign training data: a pattern that frequently occurs in samples of the target class but rarely occurs in samples of the other classes. Compared with the prevailing strategy, our proposed strategy has two advantages. First, it can improve the efficiency of the attack because learning on benign samples of the target class can facilitate the fitting of the trigger pattern. Second, it increases the difficulty or cost of identifying the trigger pattern and preventing its activation, since many benign samples of the target class contain the trigger pattern. We empirically evaluate our strategy on four benchmark datasets. The experimental studies show that attacks performed with our strategy can achieve much better performance when poisoning only 0.1\% or more of the training data, and achieve better performance against several benchmark defense algorithms.","Backdoor Attack, Deep Neural Networks" SimST: A GNN-Free Spatio-Temporal Learning Framework for Traffic Forecasting,https://openreview.net/forum?id=2ppuWD3dkie,https://openreview.net/pdf?id=2ppuWD3dkie,,"Traffic forecasting is a crucial and challenging problem in smart city efforts. Spatio-Temporal Graph Neural Networks (STGNNs) have demonstrated great promise and become the de facto solution in this field. While successful, they require the message passing scheme of GNNs to construct spatial dependencies between nodes, and thus inevitably inherit the notorious inefficiency of GNNs. Given these facts, in this paper, we propose a simple yet effective GNN-free spatio-temporal learning framework, entitled SimST. Specifically, our framework replaces GNNs with two feasible and efficient spatial context injectors, which provide proximity and position information, respectively. SimST is also compatible with various temporal encoding backbones and involves a tailored training strategy. We conduct extensive experiments on five popular traffic benchmarks to assess the capability of SimST in terms of effectiveness and efficiency. Experimental results show that such a simple baseline performs surprisingly well. 
Using far fewer parameters, SimST not only achieves comparable or better performance than more sophisticated state-of-the-art STGNNs, but also obtains substantial throughput improvements.","Traffic Forecasting, Spatio-Temporal Graph Neural Networks" Property Inference Attacks Against t-SNE Plots,https://openreview.net/forum?id=q5ZwEiLzDft,https://openreview.net/pdf?id=q5ZwEiLzDft,We show for the first time that t-SNE plots can serve as a new valid side channel for property inference attacks,"With the prevalence of machine learning (ML), researchers have shown that ML models are also vulnerable to various privacy and security attacks. As one of the representative attacks, the property inference attack aims to infer the private/sensitive properties of the training data (e.g., race distribution) given the output of ML models. In this paper, we present a new side channel for property inference attacks, i.e., t-SNE plots, which are widely used to show feature distribution or demonstrate model performance. We show for the first time that the private/sensitive properties of the data that are used to generate the plot can be successfully predicted. Briefly, we leverage the publicly available model as the shadow model to generate t-SNE plots with different properties. We use those plots to train an attack model, which is a simple image classifier, to infer the specific property of a given t-SNE plot. Extensive evaluation on four datasets shows that our proposed attack can effectively infer the undisclosed property of the data presented in the t-SNE plots, even when the shadow model is different from the target model used to generate the t-SNE plots. We also reveal that the attacks are robust in various scenarios, such as constructing the attack with fewer t-SNE plots/different density settings and attacking t-SNE plots generated by fine-tuned target models. The simplicity of our attack method indicates that the potential risk of leaking sensitive properties in t-SNE plots is largely underestimated. As possible defenses, we observe that adding noise to the image embeddings or t-SNE coordinates effectively mitigates attacks but can be bypassed by adaptive attacks, which prompts the need for more effective defenses.","Property Inference Attacks, t-SNE" Physically Plausible and Conservative Solutions to Navier-Stokes Equations Using Physics-Informed CNNs,https://openreview.net/forum?id=5zaWBdMxcF1,https://openreview.net/pdf?id=5zaWBdMxcF1,Solving Navier-Stokes Equations Using PICNN,"Physics-informed Neural Network (PINN) is an emerging approach for efficiently solving partial differential equations (PDEs) using neural networks. PICNN, a variant of PINN enhanced by convolutional neural networks (CNNs), has achieved better results on a series of PDEs since the parameter-sharing property of CNNs is effective for learning spatial dependencies. However, applying existing PICNN-based methods to solve Navier-Stokes equations can generate oscillating predictions, which are inconsistent with the laws of physics and the conservation properties. To address this issue, we propose a novel method that combines PICNN with the finite volume method to obtain physically plausible and conservative solutions to Navier-Stokes equations. We derive the second-order upwind difference scheme of Navier-Stokes equations using the finite volume method. Then we use the derived scheme to calculate the partial derivatives and construct the physics-informed loss function. 
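The general recipe in the PICNN entry above, building a discrete residual of the governing equation with an upwind scheme and penalizing it, can be illustrated with a toy sketch. The 1D steady linear advection equation and the first-order (rather than second-order) upwind difference below are deliberate simplifications, not the paper's Navier-Stokes discretization.

```python
import torch

def upwind_advection_residual(u, c=1.0, dx=0.01):
    """First-order upwind residual of steady 1D advection c * du/dx = 0
    on a uniform grid; for c > 0 the difference uses the left neighbor."""
    dudx = (u[1:] - u[:-1]) / dx
    return c * dudx

u_pred = torch.randn(128, requires_grad=True)   # stand-in network output
physics_loss = upwind_advection_residual(u_pred).pow(2).mean()
physics_loss.backward()                         # gradients flow to the model
```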
The proposed method is assessed by experiments on steady-state Navier-Stokes equations under different scenarios, including convective heat transfer, lid-driven cavity flow, etc. The experimental results demonstrate that our method can effectively improve the plausibility and the accuracy of the predicted solutions from PICNN.","Finite volume method, Navier-Stokes equation, Partial differential equation, Physics-informed convolutional neural network" GAMR: A Guided Attention Model for (visual) Reasoning,https://openreview.net/forum?id=iLMgk2IGNyv,https://openreview.net/pdf?id=iLMgk2IGNyv,A framework for a memory and attention based architecture demonstrating the capability of sample efficient learning and generalization capability on complex visual reasoning tasks aligned with the theory of visual routines. ,"Humans continue to outperform modern AI systems in their ability to flexibly parse and understand complex visual scenes. Here, we present a novel transformer-based module for visual reasoning, the Guided Attention Model for (visual) Reasoning ($\textit{GAMR}$), which instantiates an active vision theory -- positing that the brain solves complex visual reasoning problems dynamically -- via sequences of attention shifts to select and route task-relevant visual information into memory. Experiments on an array of visual reasoning tasks and datasets demonstrate GAMR's ability to learn visual routines in a robust and sample-efficient manner. In addition, GAMR is shown to be capable of zero-shot generalization on completely novel reasoning tasks. Overall, our work provides computational support for cognitive theories that postulate the need for a critical interplay between attention and memory to dynamically maintain and manipulate task-relevant visual information to solve complex visual reasoning tasks.","abstract visual reasoning, visual routines, out-of-distribution generalization, external memory, zero shot generalization, compositional learning" On the Connection between Fisher's Criterion and Shannon's Capacity: Theoretical Concepts and Implementation,https://openreview.net/forum?id=hPg-z_yBlcr,https://openreview.net/pdf?id=hPg-z_yBlcr,A feature selection scheme is developed by relating Fisher's criterion to Shannon's channel capacity.,"Fisher's criterion is arguably among the most widely used tools in machine learning for feature selection. The higher the value of Fisher's criterion, the more favorable a feature is. A rather different but nevertheless very important tool is Shannon's channel capacity. With Shannon’s capacity, one can determine the maximum rate at which information can flow across a channel. Fisher's criterion and Shannon’s capacity appear to be unrelated, yet both capture in their unique way the separation between probability distributions. In this study, we investigate whether Fisher's class separation criterion and Shannon’s capacity can be related to each other. We formulate our research problem as a binary classification task and derive analytic expressions to determine if there is a potential link between Fisher's criterion and Shannon's capacity. It turns out that Fisher's class separation criterion and Shannon’s channel capacity are intimately connected through two principal assumptions. Using this result, we develop a divergence measure for feature selection. 
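For reference alongside the entry above, this is the textbook two-class Fisher criterion used for feature scoring (not the paper's derived divergence measure); higher scores indicate more favorable features.

```python
import numpy as np

def fisher_scores(X, y):
    """Per-feature Fisher criterion for a binary classification problem:
    (mu0 - mu1)^2 / (var0 + var1), computed column-wise on X of shape (n, d)."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12
    return num / den

X = np.random.randn(200, 5)
y = np.random.randint(0, 2, 200)
X[y == 1, 0] += 2.0                   # make feature 0 discriminative
print(fisher_scores(X, y).argmax())   # -> 0, the most favorable feature
```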
Additionally, we show how our results can be used to solve classification problems and conduct a proof-of-concept experiment to demonstrate the viability of our approach.","Feature selection, Fisher's Criterion, Shannon's Capacity, Neural Networks." Pixel-Aligned Non-parametric Hand Mesh Reconstruction,https://openreview.net/forum?id=UVgv6goRFND,https://openreview.net/pdf?id=UVgv6goRFND,,"Non-parametric mesh reconstruction has recently shown significant progress in 3D hand and body applications. In these methods, mesh vertices and edges are visible to neural networks, enabling the possibility to establish a direct mapping between 2D image pixels and 3D mesh vertices. In this paper, we seek to establish and exploit this mapping with a simple and compact architecture. The network is designed with these considerations: 1) aggregating both local 2D image features from the encoder and 3D geometric features captured in the mesh decoder; 2) decoding coarse-to-fine meshes along the decoding layers to make the best use of the hierarchical multi-scale information. Specifically, we propose an end-to-end pipeline for hand mesh recovery tasks which consists of three phases: a 2D feature extractor constructing multi-scale feature maps, a feature mapping module transforming local 2D image features to 3D vertex features via 3D-to-2D projection, and a mesh decoder combining graph convolution and self-attention to reconstruct the mesh. The decoder aggregates both local image features in pixels and geometric features in vertices. It also regresses the mesh vertices in a coarse-to-fine manner, which can leverage multi-scale information. By exploiting the local connection and designing the mesh decoder, our approach achieves state-of-the-art performance for hand mesh reconstruction on the public FreiHAND dataset.","3D Hand Reconstruction, Mesh Reconstruction, Graph Convolution Network, Attention, Deep Learning" Voint Cloud: Multi-View Point Cloud Representation for 3D Understanding ,https://openreview.net/forum?id=IpGgfpMucHj,https://openreview.net/pdf?id=IpGgfpMucHj,"We propose voint cloud, a novel 3D data structure, that combines multi-view and point clouds for robust 3D understanding tasks.","Multi-view projection methods have demonstrated promising performance on 3D understanding tasks like 3D classification and segmentation. However, it remains unclear how to combine such multi-view methods with the widely available 3D point clouds. Previous methods use unlearned heuristics to combine features at the point level. To this end, we introduce the concept of the multi-view point cloud (Voint cloud), representing each 3D point as a set of features extracted from several view-points. This novel 3D Voint cloud representation combines the compactness of 3D point cloud representation with the natural view-awareness of multi-view representation. Naturally, we can equip this new representation with convolutional and pooling operations. We deploy a Voint neural network (VointNet) to learn representations in the Voint space. Our novel representation achieves state-of-the-art performance on 3D classification, shape retrieval, and robust 3D part segmentation on standard benchmarks (ScanObjectNN, ShapeNet Core55, and ShapeNet Parts). 
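To make the Voint-cloud tensor layout from the entry above concrete: each 3D point carries one feature vector per viewpoint, and a pooling operation collapses the view axis. The shapes, the max-pool choice, and the attention weights below are illustrative assumptions.

```python
import torch

# Voint-cloud-style tensor: every point holds one feature vector per
# viewpoint: (batch, n_points, n_views, feat_dim).
voints = torch.randn(4, 1024, 6, 64)

# View pooling collapses the view axis back to an ordinary per-point
# feature map (batch, n_points, feat_dim); max over views is one choice.
point_feats = voints.max(dim=2).values

# A view-aware alternative: weight views by (here random, normally
# learned) attention scores before summing.
attn = torch.softmax(torch.randn(6), dim=0)
point_feats_attn = (voints * attn.view(1, 1, 6, 1)).sum(dim=2)
```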
Further analysis shows that VointNet improves the robustness to occlusion compared to other methods.","multi-view, point cloud, 3D understanding" Is the Deep Model Representation Sparse and Symbolic with Causal Patterns?,https://openreview.net/forum?id=-0jbdOhFn4g,https://openreview.net/pdf?id=-0jbdOhFn4g,"This paper shows that the inference logic of a deep model can usually be represented as a sparse causal graph, and the faithfulness of such a symbolic representation is theoretically guaranteed.","This paper aims to show that the inference logic of a deep model can be faithfully approximated as a sparse, symbolic causal graph. Such a causal graph potentially bridges the gap between connectionism and symbolism. To this end, the faithfulness of the causal graph is theoretically guaranteed, because we show that the causal graph can well mimic the model's output on an exponential number of different masked samples. Besides, such a causal graph can be further simplified and re-written as an And-Or graph (AOG), which explains the logical relationship between interactive concepts encoded by the deep model, without losing much explanation accuracy. The code will be released when the paper is accepted.","Representation Learning, Deep Learning Theory, Explainable AI" Learning QUBO Forms in Quantum Annealing,https://openreview.net/forum?id=isiQ5KIXbjj,https://openreview.net/pdf?id=isiQ5KIXbjj,,"Modern quantum annealers can find high-quality solutions to combinatorial optimization objectives given as quadratic unconstrained binary optimization (QUBO) problems. Unfortunately, obtaining suitable QUBO forms in computer vision remains challenging and currently requires problem-specific analytical derivations. Moreover, such explicit formulations impose tangible constraints on solution encodings. In stark contrast to prior work, this paper proposes to learn QUBO forms from data through gradient backpropagation instead of deriving them. As a result, the solution encodings can be chosen flexibly and compactly. Furthermore, our methodology is general and virtually independent of the specifics of the target problem type. We demonstrate the advantages of learned QUBOs on the diverse problem types of graph matching, 2D point cloud alignment, and 3D rotation estimation. Our results are competitive with the previous quantum state of the art while requiring much fewer logical and physical qubits, enabling our method to scale to larger problems. The code and the new dataset will be open-sourced. ", Understanding Gradient Regularization in Deep Learning: Efficient Finite-Difference Computation and Implicit Bias,https://openreview.net/forum?id=gHi_bIxFdDZ,https://openreview.net/pdf?id=gHi_bIxFdDZ,Gradient Regularzation works efficiently by a certain finite-difference computation and has a desirable implicit bias in theory,"Gradient regularization (GR) is a method that penalizes the gradient norm of the training loss during training. Although some studies have reported that GR improves generalization performance in deep learning, little attention has been paid to it from the algorithmic perspective, that is, the algorithms of GR that efficiently improve performance. In this study, we first reveal that a specific finite-difference computation, composed of both gradient ascent and descent steps, reduces the computational cost for GR. In addition, this computation empirically achieves better generalization performance. 
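The ascent-then-descent finite-difference computation described in the gradient-regularization entry above can be sketched as follows; the step sizes and the exact update combination are assumptions in the spirit of the entry, not its verified algorithm.

```python
import torch

def finite_difference_gr_step(model, loss_fn, data, target, opt,
                              lam=0.1, eps=0.01):
    """One step on loss + lam/2 * ||grad loss||^2, where the penalty's
    gradient is estimated via one gradient ascent step of size eps
    followed by a second gradient evaluation, avoiding double backprop."""
    params = list(model.parameters())

    loss = loss_fn(model(data), target)
    g1 = torch.autograd.grad(loss, params)
    gnorm = torch.sqrt(sum(g.pow(2).sum() for g in g1)) + 1e-12

    with torch.no_grad():                       # ascent step along g/|g|
        for p, g in zip(params, g1):
            p.add_(eps * g / gnorm)

    loss2 = loss_fn(model(data), target)        # gradient at ascended point
    g2 = torch.autograd.grad(loss2, params)

    with torch.no_grad():
        for p, a, b in zip(params, g1, g2):
            p.sub_(eps * a / gnorm)             # undo the ascent step
            # Total gradient: dL + lam * |g| * (g2 - g1) / eps ~ dL + lam * H g
            p.grad = a + lam * gnorm * (b - a) / eps
    opt.step()
    opt.zero_grad()
```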
Next, we theoretically analyze a solvable model, a diagonal linear network, and clarify that GR has a desirable implicit bias. Learning with GR chooses better minima in a certain problem, and the finite-difference GR chooses even better ones as the ascent step size becomes larger. Finally, we demonstrate that finite-difference GR is closely related to some other algorithms based on iterative ascent and descent steps for exploring flat minima: sharpness-aware minimization and the flooding method. In particular, we reveal that flooding performs finite-difference GR in an implicit way. Thus, this work broadens our understanding of GR in both practice and theory.","Gradient regularization, Implicit bias, Gradient ascent and descent, Diagonal Linear Network" Approximate Nearest Neighbor Search through Modern Error-Correcting Codes,https://openreview.net/forum?id=-jP_rDkyfpI,https://openreview.net/pdf?id=-jP_rDkyfpI,"Using modern error-correcting codes, we present an improved method of using locality-sensitive hash functions for approximate nearest-neighbor search..","A locality-sensitive hash (or LSH) is a function that can efficiently map dataset points into a latent space while preserving pairwise distances. Such LSH functions have been used in approximate nearest-neighbor search (ANNS) in the following classic way, which we call classic hash clustering (CHC): first, the dataset points are hashed into a low-dimensional binary space using the LSH function; then, the points are clustered by these hash values. Upon receiving a query, its nearest neighbors are sought within its hash-cluster and nearby hash-clusters (i.e., multi-probe). However, CHC mandates a low-dimensional latent space for the LSH function, which distorts distances from the (high-dimensional) original real space; this results in inferior recall. This is often mitigated through using multiple hash tables at additional storage and memory costs. In this paper, we introduce a better way of using LSH functions for ANNS. Our method, called the Polar Code Nearest-Neighbor (PCNN) algorithm, uses modern error-correcting codes (specifically polar codes) to maintain a manageable number of clusters inside a high-dimensional latent space. Allowing the LSH function to embed into this high-dimensional latent space results in higher recall, as the embedding faithfully captures distances in the original space. The crux of PCNN is using polar codes for probing: we present a multi-probe scheme for PCNN which uses efficient list-decoding methods for polar codes, with time complexity independent of the dataset size. 
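For contrast with the PCNN entry above, here is a compact sketch of the classic-hash-clustering (CHC) baseline it improves on: random-hyperplane LSH buckets plus simple Hamming-distance-one multi-probing. The polar-code probing itself is not reproduced; sizes and the 16-bit code are illustrative.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
data = rng.standard_normal((10_000, 64))
planes = rng.standard_normal((16, 64))          # 16-bit hyperplane LSH

def lsh_code(x):
    """Sign pattern of x against the random hyperplanes, packed to an int."""
    bits = (x @ planes.T > 0).astype(np.int64)
    return int(bits @ (1 << np.arange(16)))

buckets = defaultdict(list)
for i, x in enumerate(data):
    buckets[lsh_code(x)].append(i)

def query(q, multiprobe=True):
    """Search the query's own bucket and, for multi-probe, all buckets
    at Hamming distance one from its hash code."""
    code = lsh_code(q)
    cand = list(buckets.get(code, []))
    if multiprobe:
        for b in range(16):
            cand += buckets.get(code ^ (1 << b), [])
    if not cand:
        return None
    dists = np.linalg.norm(data[cand] - q, axis=1)
    return cand[int(dists.argmin())]

print(query(data[42] + 0.01 * rng.standard_normal(64)))  # likely 42
```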
Fixing the choice of LSH, experimental results demonstrate significant performance gains of PCNN over CHC; in particular, PCNN with a single table outperforms CHC with multiple tables, obviating the need for large memory and storage.","Similarity Search, Nearest-Neighbor Search, Polar Codes, Locality-Sensitive Hashing, LSH" Social and environmental impact of recent developments in machine learning on biology and chemistry research,https://openreview.net/forum?id=luajgSjRlew,https://openreview.net/pdf?id=luajgSjRlew,,"Potential societal and environmental effects of recent developments in machine learning are a current topic of discussion and scientific publication: rapidly increasing resource use and the associated environmental impact; reproducibility issues and exclusivity; the privatization of ML research, leading to a public-research brain drain; a narrowing of research effort caused by a focus on deep learning; and the introduction of biases through a lack of sociodemographic diversity in data and personnel. However, these discussions and publications focus mainly on computer science-adjacent fields, including computer vision and natural language processing or basic ML research. Using bibliometric analysis of the complete literature and full-text analysis of the open-access literature, we show that the same observations can be made for applied machine learning in chemistry and biology. These developments can potentially affect basic and applied research, such as drug discovery and development, beyond the known issue of biased data sets.", TransformMix: Learning Transformation and Mixing Strategies for Sample-mixing Data Augmentation,https://openreview.net/forum?id=-1vpxBUtP0B,https://openreview.net/pdf?id=-1vpxBUtP0B,"We propose an automated approach, TransformMix, to learn better transformation and mixing augmentation strategies from data","Data augmentation improves the generalization power of deep learning models by synthesizing more training samples. Sample-mixing is a popular data augmentation approach that creates additional training samples by combining existing images. Recent sample-mixing methods, like Mixup and Cutmix, adopt simple mixing operations to blend multiple input images. Although such a heuristic approach shows certain performance gains in some computer vision tasks, it mixes the images blindly and does not adapt to different datasets automatically. A mixing strategy that is effective for a particular dataset often does not generalize well to other datasets. If not properly configured, the methods may create misleading mixed images, which jeopardize the effectiveness of sample-mixing augmentations. In this work, we propose an automated approach, TransformMix, to learn better transformation and mixing augmentation strategies from data. In particular, TransformMix applies learned transformations and mixing masks to create compelling mixed images that contain correct and important information for the target tasks. We demonstrate the effectiveness of TransformMix on multiple datasets under the direct and transfer settings. 
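For reference, this is the standard Mixup baseline named in the TransformMix entry above, in its usual formulation; the Beta parameter is a typical choice.

```python
import torch

def mixup(x, y, alpha=0.2, num_classes=10):
    """Standard Mixup: blend a batch with a shuffled copy of itself and
    mix the one-hot labels with the same coefficient."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]
    y_onehot = torch.nn.functional.one_hot(y, num_classes).float()
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix

images = torch.randn(32, 3, 32, 32)
labels = torch.randint(0, 10, (32,))
mixed_images, mixed_labels = mixup(images, labels)
```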
Experimental results show that our method achieves better top-1 and top-5 accuracy as well as efficiency when compared with strong sample-mixing baselines.","Data Augmentation, Automated Data Augmentation, Sample-mixing, Computer Vision" When to Make and Break Commitments?,https://openreview.net/forum?id=q8vgHfPdoQP,https://openreview.net/pdf?id=q8vgHfPdoQP,,"In many scenarios, decision-makers must commit to long-term actions until their resolution before receiving the payoff of said actions, and usually, staying committed to such actions incurs continual costs. For instance, in healthcare, a newly-discovered treatment cannot be marketed to patients until a clinical trial is conducted, which both requires time and is also costly. Of course in such scenarios, not all commitments eventually pay off. For instance, a clinical trial might end up failing to show efficacy. Given the time pressure created by the continual cost of keeping a commitment, we aim to answer: When should a decision-maker break a commitment that is likely to fail—either to make an alternative commitment or to make no further commitments at all? First, we formulate this question as a new type of optimal stopping/switching problem called the optimal commitment problem (OCP). Then, we theoretically analyze OCP, and based on the insights we gain, propose a practical algorithm for solving it. Finally, we empirically evaluate the performance of our algorithm in running clinical trials with subpopulation selection.","optimal stopping/switching, sequential hypothesis testing, adaptive experimentation" Generalization bounds and algorithms for estimating the effect of multiple treatments and dosage,https://openreview.net/forum?id=IIxe8wlXwb0,https://openreview.net/pdf?id=IIxe8wlXwb0,"We propose generalization bounds for the counterfactual error in treatment effect estimation in the context of multiple treatments and dosage parameters, and regularization techniques for training prediction models inspired by these bounds.","Estimating conditional treatment effects has been a longstanding challenge for fields of study such as epidemiology or economics that require a treatment-dosage pair to make decisions, but may not be able to run randomized trials to precisely quantify their effect. In the context of representation learning, there is an extensive literature relating model architectures with regularization techniques to solve this problem using observational data. However, theoretically motivated loss functions and bounds on generalization errors only exist in select circumstances, such as in the presence of binary treatments. In this paper, we introduce new bounds on the counterfactual generalization error in the context of multiple treatments and continuous dosage parameters, which subsume existing results. This result, in a principled manner, guides the definition of new learning objectives that can be used to train representation learning algorithms. We show empirically new state-of-the-art performance results across several benchmark datasets for this problem, including in comparison to doubly-robust estimation methods.",Treatment effect estimation DENSE RGB SLAM WITH NEURAL IMPLICIT MAPS,https://openreview.net/forum?id=QUK1ExlbbA,https://openreview.net/pdf?id=QUK1ExlbbA,,"There is an emerging trend of using neural implicit functions for map representation in Simultaneous Localization and Mapping (SLAM). Some pioneer works have achieved encouraging results on RGB-D SLAM. 
In this paper, we present a dense RGB SLAM method with a neural implicit map representation. To reach this challenging goal without depth input, we introduce a hierarchical feature volume to support the implicit map decoder. This design effectively fuses shape cues across different scales to aid map reconstruction. Our method simultaneously solves for the camera motion and the neural implicit map by matching the rendered and input video frames. To facilitate optimization, we further propose a photometric warping loss in the spirit of multi-view stereo to better constrain the camera pose and scene geometry. We evaluate our method on commonly used benchmark datasets and compare with modern RGB and RGB-D SLAM systems. Our method achieves more favorable results than previous methods and even surpasses some recent RGB-D SLAM methods. Our source code will be publicly available.","dense RGB SLAM, implicit function, RGB VO" Monocular Scene Reconstruction with 3D SDF Transformers,https://openreview.net/forum?id=-iADdfa4GKH,https://openreview.net/pdf?id=-iADdfa4GKH,,"Monocular scene reconstruction from posed images is challenging due to the complexity of a large environment. Recent volumetric methods learn to directly predict the TSDF volume and have demonstrated promising results in this task. However, most methods focus on how to extract and fuse 2D features into a 3D feature volume, but none of them improves how the 3D volume is aggregated. In this work, we propose an SDF transformer network, which replaces the 3D CNN for better 3D feature aggregation. To reduce the explosive computational complexity of 3D multi-head attention, we propose a sparse window attention module, where the attention is only calculated between the non-empty voxels within a local window. Then a top-down-bottom-up 3D attention network is built for 3D feature aggregation, where a dilate-attention structure is proposed to prevent geometry degeneration, and two global modules are employed to provide global receptive fields. The experiments on multiple datasets show that this 3D transformer network generates a more accurate and complete reconstruction, which outperforms previous methods by a large margin. Remarkably, the mesh accuracy is improved by 41.8%, and the mesh completeness is improved by 25.3% on the ScanNet dataset. The code of our method will be made public.","3D Reconstruction, Monocular Scene Reconstruction, 3D Transformer, TSDF volume" HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression,https://openreview.net/forum?id=3QdSdm6Oqat,https://openreview.net/pdf?id=3QdSdm6Oqat,,"Transformers have attained superior performance in natural language processing and computer vision tasks. Their self-attention and feedforward layers are overparameterized, limiting inference speed and energy efficiency. Tensor decomposition is a promising technique to reduce parameter redundancy by leveraging tensor algebraic properties to express the parameters in an efficiently factorized form. Prior efforts used manual or heuristic settings without hardware-aware customization, resulting in poor hardware efficiencies and large performance degradations. In this work, we propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible tensor decompositions and automates the choice of tensorization shape and decomposition rank. 
We jointly investigate tensor contraction path optimizations and a fused Einsum mapping strategy to bridge the gap between theoretical benefits and real hardware efficiency improvement. Our two-stage knowledge distillation flow resolves the trainability bottleneck and thus significantly boosts the final accuracy of factorized Transformers. Overall, we experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss and achieve a better efficiency-accuracy Pareto frontier than hand-tuned and heuristic baselines.", Learning Heterogeneous Interaction Strengths by Trajectory Prediction with Graph Neural Network,https://openreview.net/forum?id=qU6NIcpaSi-,https://openreview.net/pdf?id=qU6NIcpaSi-,"We propose a neural architecture that infers a continuous interaction graph rather than a conventional discrete one, solely from trajectories in an unsupervised manner.","Dynamical systems with interacting agents are universal in nature, commonly modeled by a graph of relationships between their constituents. Recently, various works have been presented to tackle the problem of inferring those relationships from the system trajectories via deep neural networks, but most of the studies assume binary or discrete types of interactions for simplicity. In the real world, the interaction kernels often involve continuous interaction strengths, which cannot be accurately approximated by discrete relations. In this work, we propose the relational attentive inference network (RAIN) to infer continuously weighted interaction graphs without any ground-truth interaction strengths. Our model employs a novel pairwise attention (PA) mechanism to refine the trajectory representations and a graph transformer to extract heterogeneous interaction weights for each pair of agents. We show that our RAIN model with the PA mechanism accurately infers continuous interaction strengths for simulated physical systems in an unsupervised manner. Further, RAIN with PA successfully predicts trajectories from motion capture data with an interpretable interaction graph, demonstrating the virtue of modeling unknown dynamics with continuous weights.","relational learning, complex systems, dynamic systems, graph learning" From $t$-SNE to UMAP with contrastive learning,https://openreview.net/forum?id=B8a1FcY0vi,https://openreview.net/pdf?id=B8a1FcY0vi,We show that UMAP is effectively negative sampling applied to the t-SNE loss function.,"Neighbor embedding methods $t$-SNE and UMAP are the de facto standard for visualizing high-dimensional datasets. Motivated from entirely different viewpoints, their loss functions appear to be unrelated. In practice, they yield strongly differing embeddings and can suggest conflicting interpretations of the same data. The fundamental reasons for this and, more generally, the exact relationship between $t$-SNE and UMAP have remained unclear. In this work, we uncover their conceptual connection via a new insight into contrastive learning methods. Noise-contrastive estimation can be used to optimize $t$-SNE, while UMAP relies on negative sampling, another contrastive method. We find the precise relationship between these two contrastive methods, and provide a mathematical characterization of the distortion introduced by negative sampling. Visually, this distortion results in UMAP generating more compact embeddings with tighter clusters compared to $t$-SNE. 
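The negative-sampling objective that (per the entry above) underlies UMAP can be sketched on raw embedding pairs as follows; the Cauchy similarity kernel is the one commonly used in these neighbor-embedding methods, and the batch sizes and negative count are illustrative.

```python
import torch

def neg_sampling_loss(emb, pos_i, pos_j, m_negatives=5):
    """Negative-sampling objective for a neighbor embedding: attract
    positive pairs, repel randomly drawn pairs, with the Cauchy
    similarity q(a, b) = 1 / (1 + ||a - b||^2)."""
    def q(a, b):
        return 1.0 / (1.0 + (a - b).pow(2).sum(dim=-1))

    attract = -torch.log(q(emb[pos_i], emb[pos_j]) + 1e-12).mean()
    neg = torch.randint(0, emb.size(0), (pos_i.numel(), m_negatives))
    repel = -torch.log(1.0 - q(emb[pos_i].unsqueeze(1), emb[neg]) + 1e-12).mean()
    return attract + repel

emb = torch.randn(1000, 2, requires_grad=True)   # 2-D embedding coordinates
i = torch.randint(0, 1000, (256,))
j = torch.randint(0, 1000, (256,))
neg_sampling_loss(emb, i, j).backward()
```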
We exploit this new conceptual connection to propose and implement a generalization of negative sampling, allowing us to interpolate between (and even extrapolate beyond) $t$-SNE and UMAP and their respective embeddings. Moving along this spectrum of embeddings leads to a trade-off between discrete/local and continuous/global structures, mitigating the risk of over-interpreting ostensible features of any single embedding. We provide a PyTorch implementation.","visualization, dimensionality reduction, t-SNE, UMAP" On the optimal precision of GANs,https://openreview.net/forum?id=qOV5REmPOM,https://openreview.net/pdf?id=qOV5REmPOM,,"Generative adversarial networks (GANs) are known to face model misspecification when learning disconnected distributions. Indeed, continuous mapping from a unimodal latent distribution to a disconnected one is impossible, so GANs necessarily generate samples outside of the support of the target distribution. In this paper, we make the connection between the performance of GANs and their latent space configuration. In particular, we raise the following question: what is the latent space partition that minimizes the measure of out-of-manifold samples? Building on a recent result of geometric measure theory, we prove there exist optimal GANs when the dimension of the latent space is larger than the number of modes. In particular, we show that these generators structure their latent space as a `simplicial cluster' - a Voronoi partition where centers are equally distant. We derive both an upper and a lower bound on the optimal precision of GANs learning disconnected manifolds. Interestingly, these two bounds have the same order of decrease: $\sqrt{\log m}$, $m$ being the number of modes. Finally, we perform several experiments to exhibit the geometry of the latent space and experimentally show that GANs have a geometry with similar properties to the theoretical one.","Deep learning theory, GANs, Deep generative modelling" Disentangled Knowledge Transfer: A New Perspective for Personalized Federated Learning,https://openreview.net/forum?id=l1U_oTRQX62,https://openreview.net/pdf?id=l1U_oTRQX62,"We present pFedC, a novel training method for personalized federated learning that can avoid irrelevant knowledge aggregation from other clients.","Personalized federated learning (pFL) collaboratively trains non-identical machine learning models for different clients to adapt to their heterogeneously distributed datasets. State-of-the-art pFL approaches pay much attention to exploiting clients' inter-similarities to facilitate the collaborative learning process, yet can barely escape the irrelevant knowledge pooling that is inevitable during the aggregation phase (e.g., due to inconsistent classes among clients), which hinders optimization convergence and degrades personalization performance. To resolve this conflict between facilitating collaboration and promoting personalization, we propose a novel pFL framework, dubbed pFedC, to disentangle the globally aggregated knowledge into several compositional branches and only aggregate relevant branches for supporting conflicts-aware collaboration among contradictory clients. Specifically, by reconstructing each local model into a shared feature extractor and multiple disentangled task-specific classifiers, the training on each client transforms into a mutually reinforced and relatively independent multi-task learning process, which provides a new perspective for pFL. 
In addition, we devise a personalized aggregation mechanism over the disentangled classifiers, quantifying per-client combination weights to capture clients' common prior while mitigating potential conflicts from the divergent knowledge caused by heterogeneous data. Extensive empirical experiments are conducted over various models and datasets to verify the effectiveness and superior performance of the proposed algorithm. ","Personalized Federated Learning, Model Disentanglement, Multi-task Learning" D4AM: A General Denoising Framework for Downstream Acoustic Models,https://openreview.net/forum?id=5fvXH49wk2,https://openreview.net/pdf?id=5fvXH49wk2,We propose a general denoising framework for various downstream acoustic models (D4AM) by adopting an effective joint training scheme with the regression (denoising) objective and the classification (ASR) objective.,"The performance of acoustic models degrades notably in noisy environments. Speech enhancement (SE) can be used as a front-end strategy to serve automatic speech recognition (ASR) systems. However, the training objectives of existing SE approaches do not consider the generalization ability to unseen ASR systems. In this study, we propose a general denoising framework for various downstream acoustic models, called D4AM. Our framework fine-tunes the SE model with the backward gradient according to a specific acoustic model and the corresponding classification objective. At the same time, our method aims to take the regression objective as an auxiliary loss to make the SE model generalize to other unseen acoustic models. To jointly train an SE unit with regression and classification objectives, D4AM uses an adjustment scheme to directly estimate suitable weighting coefficients instead of going through a grid search process with additional training costs. The adjustment scheme consists of two parts: gradient calibration and regression objective weighting. Experimental results show that D4AM can consistently and effectively provide improvements to various unseen acoustic models and outperforms other combination setups. To the best of our knowledge, this is the first work that deploys an effective combination scheme of regression (denoising) and classification (ASR) objectives to derive a general pre-processor applicable to various unseen ASR systems.","audio processing, speech enhancement, robust automatic speech recognition, auxiliary task learning" Fully Continuous Gated Recurrent Units For processing Time Series,https://openreview.net/forum?id=7TKYqsMjNh,https://openreview.net/pdf?id=7TKYqsMjNh,Previous GRU-based models are piece-wise continuous. We propose the first fully continuous GRU.,"For a long time, RNN-based models, such as RNNs, LSTMs, and GRUs, have been used to process time series data. However, RNN-based models do not fit well with real-world sporadically observed data. As a result, many researchers have suggested various enhancements to overcome the limitation. Among them, differential equation-based models, such as GRU-ODE-Bayes and ODE-RNN, show good accuracy in many cases. Those methods try to continuously model the hidden state of RNNs (or GRUs). However, existing methods' hidden states are piece-wise continuous. In this paper, we represent GRUs as delay differential equations and present fully continuous GRUs. To our knowledge, we propose the first model that continuously generalizes all the parts of GRUs, including their hidden state and various gates. 
After reconstructing a continuous path $x(t)$ from discrete time series observations $\{(x_i, t_i)\}_{i=0}^{N-1}$ (with an appropriate interpolation algorithm), we calculate the time derivatives of the reset gate $r(t)$, the update gate $z(t)$, the update vector $g(t)$, and the hidden state $h(t)$. Then, we develop an augmented delay differential equation (DDE) that continuously generalizes all the parts. In our experiments with 3 real-world datasets and 13 baselines, our fully continuous GRU method outperforms existing baselines by non-trivial margins. ","Time Series forecasting, Continuous GRU" Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning ,https://openreview.net/forum?id=lq62uWRJjiY,https://openreview.net/pdf?id=lq62uWRJjiY,We propose MARVEL to adaptively allocate the parameter budget among weight matrices according to their importance. ,"Fine-tuning large pre-trained language models on downstream tasks has become an important paradigm in NLP. However, common practice fine-tunes all of the parameters in a pre-trained model, which becomes prohibitive when a large number of downstream tasks are present. Therefore, many fine-tuning methods are proposed to learn incremental updates of pre-trained weights in a parameter-efficient way, e.g., low-rank increments. These methods often evenly distribute the budget of incremental updates across all pre-trained weight matrices, and overlook the varying importance of different weight parameters. As a consequence, the fine-tuning performance is suboptimal. To bridge this gap, we propose MARVEL, which adaptively allocates the parameter budget among weight matrices according to their importance score. In particular, MARVEL parameterizes the incremental updates in the form of singular value decomposition. Such a novel approach allows us to effectively prune the singular values of unimportant updates, which essentially reduces their parameter budget while circumventing intensive exact SVD computations. We conduct extensive experiments with several pre-trained models on natural language processing, question answering, and natural language generation to validate the effectiveness of MARVEL. Results demonstrate that MARVEL manifests notable improvement over baselines, especially in low-budget settings. Our code will be publicly available. ","Adaptive budget allocation, Parameter-efficient fine-tuning, Natural language processing" On Intriguing Layer-Wise Properties of Robust Overfitting in Adversarial Training,https://openreview.net/forum?id=Rumwc_raZvE,https://openreview.net/pdf?id=Rumwc_raZvE,,"Adversarial training has proven to be one of the most effective methods to defend against adversarial attacks. Nevertheless, robust overfitting is a common obstacle in adversarial training of deep networks. There is a common belief that the features learned by different network layers have different properties; however, existing works generally investigate robust overfitting by considering a DNN as a single unit, and hence the impact of different network layers on robust overfitting remains unclear. In this work, we divide a DNN into a series of layers and investigate the effect of different network layers on robust overfitting. We find that different layers exhibit distinct properties with respect to robust overfitting, and in particular, robust overfitting is mostly related to the optimization of the latter parts of the network. 
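The SVD-form incremental update described in the MARVEL entry above can be sketched as follows: the update is parameterized as P diag(lam) Q, and zeroing entries of lam prunes rank. The dimensions and the magnitude-based pruning rule are assumptions (the entry itself refers to an importance score).

```python
import torch
import torch.nn as nn

class SVDAdapter(nn.Module):
    """Incremental update Delta W = P @ diag(lam) @ Q added to a frozen
    weight; zeroing entries of `lam` prunes rank (and budget) without
    an exact SVD of the full matrix."""
    def __init__(self, d_in, d_out, rank=8):
        super().__init__()
        self.P = nn.Parameter(torch.randn(d_out, rank) * 0.01)
        self.lam = nn.Parameter(torch.ones(rank))
        self.Q = nn.Parameter(torch.randn(rank, d_in) * 0.01)

    def forward(self, x, frozen_weight):
        delta = self.P @ torch.diag(self.lam) @ self.Q
        return x @ (frozen_weight + delta).T

    def prune_to(self, budget):
        """Keep only the `budget` largest |singular values| (a simple
        magnitude proxy for importance)."""
        keep = self.lam.abs().topk(budget).indices
        mask = torch.zeros_like(self.lam)
        mask[keep] = 1.0
        with torch.no_grad():
            self.lam.mul_(mask)

W = torch.randn(64, 128)                  # frozen pre-trained weight
adapter = SVDAdapter(128, 64, rank=8)
y = adapter(torch.randn(4, 128), W)
adapter.prune_to(budget=4)
```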
Based upon the observed effect, we propose a robust adversarial training (RAT) prototype: in a mini-batch, we optimize the front parts of the network as usual, and adopt additional measures to regularize the optimization of the latter parts. Based on the prototype, we design two realizations of RAT, and extensive experiments demonstrate that RAT can eliminate robust overfitting and boost adversarial robustness over standard adversarial training.", Does Federated Learning Really Need Backpropagation?,https://openreview.net/forum?id=TYEY9qBqgfF,https://openreview.net/pdf?id=TYEY9qBqgfF,BAFFLE is a backpropagation-free and memory-efficient federated learning framework that only executes forward propagation during training.,"Federated learning (FL) provides general principles for decentralized clients to train a server model collectively without sharing local data. FL is a promising framework with practical applications, but its standard training paradigm requires the clients to backpropagate through the model to compute gradients. Since these clients are typically edge devices and not fully trusted, executing backpropagation on them incurs computational and storage overhead as well as white-box vulnerability. In light of this, we develop backpropagation-free federated learning, dubbed BAFFLE, in which backpropagation is replaced by multiple forward processes to estimate gradients. BAFFLE is 1) memory-efficient and easily fits the uploading bandwidth; 2) compatible with inference-only hardware optimization and model quantization or pruning; and 3) well-suited to trusted execution environments, because the clients in BAFFLE only execute forward propagation and return a set of scalars to the server. In experiments, we demonstrate that BAFFLE-trained models can achieve empirically comparable performance to conventional FL models.","Federated Learning, Backpropagation-Free Training" Teaching Others is Teaching Yourself Regularization For Controllable Language Models,https://openreview.net/forum?id=Wfvm3hYjwnC,https://openreview.net/pdf?id=Wfvm3hYjwnC,,"Large-scale pre-trained language models have achieved great success on natural language generation tasks. However, it is difficult to control the pre-trained language models to generate sentences with the expected attribute such as topic and sentiment. Recent efforts on controllable language generation, which employ an additional attribute classifier to guide the generation of large-scale pre-trained language models, have been shown to be effective. These methods are named ''classifier-guided language models'' (CGLMs). However, we find that the probabilities predicted by the attribute classifiers usually approach 0 or 1, which makes it hard to distinguish sentences with different degrees of match to the expected attribute. The problem is named \textit{the biased probability distribution} (BPD) problem. To address the problem, we investigate different methods for adjusting the probability distribution and propose a ''Teaching Others is Teaching Yourself'' (TOTY) regularization method to smooth the probability distribution. 
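A minimal PyTorch sketch of the RAT prototype's front/latter split described above. The split point and the use of stronger weight decay as the extra regularizer on the latter layers are hypothetical choices for illustration, since the abstract does not specify the exact "additional measures".

```python
import torch
import torch.nn as nn

model = nn.Sequential(              # toy network standing in for a DNN
    nn.Linear(32, 64), nn.ReLU(),   # "front" part
    nn.Linear(64, 64), nn.ReLU(),   # "latter" part
    nn.Linear(64, 10),
)

# Hypothetical split: the first two modules are "front", the rest are "latter".
front = list(model[:2].parameters())
latter = list(model[2:].parameters())

# Optimize the front part as usual; regularize the latter part more strongly
# (stronger weight decay is one possible stand-in for RAT's extra measures).
opt = torch.optim.SGD([
    {"params": front, "weight_decay": 0.0},
    {"params": latter, "weight_decay": 5e-3},
], lr=0.1, momentum=0.9)

x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(model(x), y)  # adversarial examples would go here
opt.zero_grad(); loss.backward(); opt.step()
```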
Experiments on sentiment control and topic control tasks show that CGLMs achieve better performance with guiding classifiers trained with TOTY.", Prompt Generation Networks for Efficient Adaptation of Frozen Vision Transformers,https://openreview.net/forum?id=9TpJYSI1n9t,https://openreview.net/pdf?id=9TpJYSI1n9t,"A novel method for adapting frozen pretrained vision transformer models by adding prompts that vary based on each input, which can surpass even full-finetuning.","Large-scale pretrained models, especially those trained from vision-language data, have demonstrated the tremendous value that can be gained from both larger training datasets and models. Thus, in order to benefit from these developments, there is renewed interest in transfer learning and adapting models from large-scale general pretraining to particular downstream tasks. However, the continuously increasing size of the models means that even the classic approach of finetuning is becoming infeasible for all but big institutions. Prompt learning has emerged as a flexible way to adapt models by solely learning additional inputs to a model that is kept frozen, but so far its performance has remained inferior to finetuning. To address this, we propose the Prompt Generation Network (PGN) that generates input-dependent prompts by sampling from a learned library of tokens. We show the PGN is effective in adapting pretrained models to various new datasets. It surpasses previous prompt-learning methods by a large margin and even full-finetuning on 5 out of 12 datasets while requiring 100x fewer parameters. PGN can even be used for training and inferring on multiple datasets simultaneously and learns to allocate tokens between domains. Given these findings, we conclude that PGN is a viable and scalable approach for downstream adaptation of frozen models.","model adaptation, pretrained models, prompting, vision transformers" Saliency-guided Vision Transformer for Few-shot Keypoint Detection,https://openreview.net/forum?id=bnRBltYQboI,https://openreview.net/pdf?id=bnRBltYQboI,,"One attractive property of the Vision Transformer (ViT) is its ability to capture long-range dependencies among image patches, which helps improve few-shot keypoint detection (FSKD) but has not yet been explored in the literature. However, the simple application of ViT brings in irrelevant features outside of the region of interest due to the global attention matrix, thus degrading similarity learning between support and query features. In this paper, we present a novel saliency-guided vision transformer, dubbed \emph{SalViT}, for few-shot keypoint detection. Our SalViT enjoys a uniquely designed masked self-attention and a morphology learner, where the former introduces the saliency map as a soft mask into ViT and constrains the self-attention to foregrounds, while the latter leverages so-called power normalization to adjust the morphology of the saliency map, realizing a dynamically changing receptive field. With the SalViT, we explore the use of ViT features together with CNN features to model both local and long-range dependencies, providing more informative representations for similarity learning. We apply SalViT to FSKD in inductive and transductive settings, and show that it outperforms other methods. 
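To make the BPD problem above concrete, here is a generic smoothing stand-in (not TOTY itself): temperature scaling of the attribute classifier's probabilities. With temperature 1 the scores saturate near 0 or 1; a larger temperature flattens them so candidate sentences with different degrees of attribute match become distinguishable. The `guidance_probs` function and its parameters are assumptions for illustration.

```python
import torch

def guidance_probs(logits: torch.Tensor, temperature: float = 2.0):
    """Temperature-smoothed attribute probabilities for classifier guidance.

    A generic smoothing stand-in for illustrating the BPD problem; TOTY's
    actual regularization is applied while training the classifier.
    """
    return torch.softmax(logits / temperature, dim=-1)

logits = torch.tensor([[8.0, -8.0], [2.0, -2.0]])  # two candidate sentences
print(guidance_probs(logits, 1.0))  # saturated: both positives look near-certain
print(guidance_probs(logits, 4.0))  # smoothed: the two candidates now separate
```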
","Saliency-guided vision transformer, few-shot learning, few-shot keypoint detection, masked self-attention, morphology learning" Towards Learning Imperceptible Adversarial Distribution for Black-Box Attacks,https://openreview.net/forum?id=ecVbozYsBmw,https://openreview.net/pdf?id=ecVbozYsBmw,A black-box attack method which can learn imperceptible adversarial distribution ,"An effective black-box threat model should find a sweet spot that balances well across success rate, perceptual quality, and query efficiency. In this paper, we propose PadvFlow, a black-box attack method that achieves the desirable property. Instead of searching for examples in a conventional $\ell_p$ space, PadvFlow leverages the use of normalizing flows (NFs) to model the density distribution of natural and indistinguishable adversarial examples in a perceptual space. The expressive NFs can reduce the perceptible noises. Meanwhile, searching for adversarial samples via the perceptual space improves details of generation. Thus, PadvFlow can generate perceptually-natural adversarial examples. Our comprehensive experiments show that PadvFlow not only successfully attacks 6 undefended and 4 defended image classifiers on CIFAR-10 and SVHN, but also can be scaled up to attack ImageNet of pixel size $299\times299$. The effectiveness of PadvFlow is also validated for a different modality by attacking an automatic speech recognition system.","adversarial attack, robustness, image classification, automatic recognition system" Active Learning with Partial Labels,https://openreview.net/forum?id=1VuBdlNBuR,https://openreview.net/pdf?id=1VuBdlNBuR,"we propose a new problem setting named active learning with partial labels, where the oracle provides partial labels to the selected samples.","In this paper, we for the first time study a new problem setting called active learning with partial labels (ALPL), where an oracle provides the query samples with a set of candidate labels that contains the true label. Such a setting relaxes the oracle from the demanding labeling process. To address ALPL, we firstly propose a firm and intuitive baseline by directly adapting a state-of-the-art method for learning with partial labels to train the predictor, which can be seamlessly incorporated into existing AL frameworks. Inspired by human inference in cognitive science, we propose to improve the baseline by exploiting and exploring counter examples (CEs) to relieve the overfitting caused by a few training samples in ALPL. Specifically, we propose to construct CEs by reversing the partial labels for each instance, learning from which we propose a simple but effective WorseNet. By leveraging the distribution gap between WorseNet and the predictor, both the predictor itself and the sample selection process can be improved. Experimental results on five real-world datasets and four benchmark datasets show that our proposed methods achieve comprehensive improvements over ten representative AL frameworks, highlighting the superiority and effectiveness of CEs and WorseNet. ","weakly supervised learning, active learning, partial label learning" Specialization of Sub-paths for Adaptive Depth Networks,https://openreview.net/forum?id=yCGgOFC0bG,https://openreview.net/pdf?id=yCGgOFC0bG,We present an adaptive depth network that does not requires intermediate classifiers or decision networks. ,"We present a novel approach to anytime networks that can control network depths instantly at runtime to provide various accuracy-efficiency trade-offs. 
While controlling the depth of a network is an effective way to obtain actual inference speed-up, previous adaptive depth networks require either additional intermediate classifiers or decision networks, which are challenging to train properly. Unlike previous approaches, our adaptive depth networks require virtually no architectural changes from baseline networks. Instead, we introduce a novel training method that enforces some sub-paths of the baseline networks to have a special property, with which the sub-paths do not change the semantic level of input features, but only refine them to reduce prediction errors. Those specialized sub-paths can be skipped at test time, if needed, to save computation at a marginal loss of prediction accuracy. We first formally present the rationale behind the sub-path specialization, and based on that, we propose a simple and practical training method to specialize sub-paths for adaptive depth networks. While only minimal architectural changes and training effort are required, we demonstrate that our approach significantly outperforms non-adaptive baselines in various tasks, including ImageNet classification, COCO object detection and instance segmentation. Further, we show that the smallest sub-networks of our adaptive depth networks achieve a competitive model compression effect compared to recent state-of-the-art techniques.","convolution neural network, anytime network, adaptive network, accuracy-efficiency trade-offs, imagenet, coco" Towards Effective and Interpretable Human-Agent Collaboration in MOBA Games: A Communication Perspective,https://openreview.net/forum?id=q3F0UBAruO,https://openreview.net/pdf?id=q3F0UBAruO,We propose an efficient and interpretable Meta-Command Communication-based (MCC) framework for accomplishing effective human-AI collaboration in MOBA games. ,"MOBA games, e.g., Dota2 and Honor of Kings, have been actively used as testbeds for recent AI research on games, and various human-level AI systems have been developed so far. However, these AI systems mainly focus on how to compete with humans, less on exploring how to collaborate with humans. To this end, this paper makes the first attempt to investigate human-agent collaboration in MOBA games. In this paper, we propose to enable humans and agents to collaborate through explicit communication by designing an efficient and interpretable Meta-Command Communication-based framework, dubbed MCC, for accomplishing effective human-agent collaboration in MOBA games. The MCC framework consists of two pivotal modules: 1) an interpretable communication protocol, i.e., the Meta-Command, to bridge the communication gap between humans and agents; 2) a meta-command value estimator, i.e., the Meta-Command Selector, to select a valuable meta-command for each agent to achieve effective human-agent collaboration. Experimental results in Honor of Kings demonstrate that MCC agents can collaborate reasonably well with human teammates and even generalize to collaborate with different levels and numbers of human teammates. 
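A schematic PyTorch sketch of the skippable sub-paths in the adaptive depth approach above, with `nn.Linear` standing in for convolutional blocks. If a residual block only refines its input without changing the semantic level (the property the training method enforces), it can be skipped at test time; the alternating skip pattern below is an arbitrary illustration.

```python
import torch
import torch.nn as nn

class SkippableBlock(nn.Module):
    """A residual block trained to only *refine* its input features."""
    def __init__(self, dim: int):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, dim))

    def forward(self, x, skip: bool = False):
        # Skipping preserves the input, trading a little accuracy for speed.
        return x if skip else x + self.body(x)

blocks = nn.ModuleList([SkippableBlock(16) for _ in range(4)])
x = torch.randn(2, 16)
# Full depth at training time; here we skip alternate blocks for a fast pass.
for i, blk in enumerate(blocks):
    x = blk(x, skip=(i % 2 == 1))
print(x.shape)
```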
Videos are available at https://sites.google.com/view/mcc-demo.","game playing, multi-agent, human-ai communication, human-ai collaboration, deep reinforcement learning" Fine-Grained Image Retrieval with Neighbor-Attention Label Correction,https://openreview.net/forum?id=lOkOPfpeSl,https://openreview.net/pdf?id=lOkOPfpeSl,A Neighbor-Attention Label Correction model trained by nested optimization is proposed to correct noisy labels in fine-grained image retrieval,"This paper studies noise-resistant deep model training for the fine-grained image retrieval task, which has an unconstrained target label space and suffers from the difficulty of acquiring accurate fine-grained labels. A Neighbor-Attention Label Correction (NALC) model is proposed based on the meta-learning framework to correct labels during the training stage. A training batch and a validation batch are sampled from the training set, which allows us to optimize the NALC model by referring to the validation batch. We also propose a novel nested optimization for the meta-learning framework to enhance the optimization efficiency. The training procedure consistently boosts the label accuracy in the training batch, which in turn ensures a more accurate training set. Experimental results show that our method boosts the label accuracy from 70% to over 97% and outperforms recent works by up to 11.5% in rank-1 accuracy on various fine-grained image retrieval tasks, e.g., fine-grained instance retrieval on CUB200 and CARS, as well as person re-identification, respectively. Ablation studies also show that NALC generalizes well to different types of noise, e.g., asymmetric, pair-flip, and pattern noise.","noisy label, meta learning, fine-grained image retrieval" How Normalization and Weight Decay Can Affect SGD? Insights from a Simple Normalized Model,https://openreview.net/forum?id=k-JvYGkA9o,https://openreview.net/pdf?id=k-JvYGkA9o,Theoretical and empirical analysis on the learning dynamics of neural networks with normalization and weight decay.,"Recent works (Li et al., 2020; Wan et al., 2021) characterize an important mechanism of normalized models trained with SGD and WD (Weight Decay), called Spherical Motion Dynamics (SMD), confirming its widespread effects in practice. However, no theoretical study on the influence of SMD on the training process of normalized models is available in the literature. In this work, we seek to understand the effect of SMD by theoretically analyzing a simple normalized model, named Noisy Rayleigh Quotient (NRQ). On NRQ, we theoretically prove that SMD can dominate the whole training process by controlling the evolution of the angular update (AU), an essential feature of SMD. Specifically, we show: 1) within the equilibrium state of SMD, the convergence rate and limiting risk of NRQ are mainly determined by the theoretical value of AU; and 2) beyond the equilibrium state, the evolution of AU can interfere with the optimization trajectory, causing odd phenomena such as ``escape'' behavior. We further show that the insights drawn from NRQ are consistent with empirical observations in experiments on real datasets. 
We believe our theoretical results shed new light on the role of normalization techniques during the training of modern deep learning models.","normalization, stochastic gradient descent, optimization" Closing the Performance Gap between Cumbersome and Lightweight Contrastive Models,https://openreview.net/forum?id=9CGiwZeCAd,https://openreview.net/pdf?id=9CGiwZeCAd,"We successfully improve the linear evaluation results from 36.3\% to 62.3\% for MobileNet-V3-Large and from 42.2\% to 65.8\% for EfficientNet-B0 on ImageNet, closing the accuracy gap to ResNet-50, which contains $5\times$ the parameters.","While self-supervised contrastive learning has made continuous progress utilizing big models, the performance lags far behind when the model size decreases. A common practice to address this problem requires a two-stage training procedure, where a larger model is pretrained in a self-supervised manner first, then its representational knowledge is transferred to a smaller model in the second stage. Despite its effectiveness, this method is highly time-consuming and is inapplicable to some resource-limited scenarios. In this work, we aim at directly training a lightweight contrastive model with satisfactory performance in the absence of a pretrained teacher model. Specifically, by empirically exploring the training recipes (e.g., the MLP head, lower temperature, etc.), we boost the accuracy of different lightweight models by a large margin. Besides, we observe that smaller models are more sensitive to noisy labels, and propose a smooth version of the InfoNCE loss to alleviate this problem. With these combined techniques, we successfully improve the linear evaluation results from 36.3\% to 62.3\% for MobileNet-V3-Large and from 42.2\% to 65.8\% for EfficientNet-B0 on ImageNet, closing the accuracy gap to ResNet-50, which contains $5\times$ the parameters. These results suggest the feasibility of training lightweight self-supervised models without distillation. ","self-supervised learning, contrastive learning, lightweight model" DCAPS: Dual Cross-Attention Coupled with Stabilizer for Few-Shot Common Action Localization,https://openreview.net/forum?id=s0JAnAOS24,https://openreview.net/pdf?id=s0JAnAOS24,"For few-shot common action localization where no class cue of support videos is given, we mainly suggest a 3-stage cross-attention to align a long untrimmed query and trimmed support videos without losing compatibility among the support videos.","The goal of this paper is to localize action instances in a long untrimmed query video using just a few trimmed support videos representing a common action whose class information is not given. In this task, it is crucial not only to correctly align a temporal segment (proposal) of the query video and the support videos, but also to increase the compatibility among the support videos. The latter has been understudied, even though the context (e.g., background, camera angle) varies across the support videos. To address both points, we design a dual cross-attention coupled with a stabilizer (DCAPS). First, we develop an attention mechanism based on cross-correlation, and apply it independently to each support video (with the query videos) in order to manage the heterogeneity among the support videos. Next, we devise a stabilizer to increase the compatibility among the support videos. Then, the cross-attention is used again to make the stabilized support videos attend to and enhance the query proposals. 
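One plausible form of the "smooth version of InfoNCE" mentioned in the lightweight-contrastive abstract above is InfoNCE with label smoothing: instead of a one-hot target on the positive pair, a small amount of probability mass is spread over the negatives. This is a sketch of that idea; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def smoothed_info_nce(z1, z2, temperature=0.2, smoothing=0.1):
    """InfoNCE with label smoothing over the similarity logits.

    z1, z2: L2-normalized embeddings of two augmented views, (batch, dim).
    Positives sit on the diagonal; `smoothing` mass is spread over negatives,
    softening the supervision that small models were observed to be
    sensitive to.
    """
    logits = z1 @ z2.t() / temperature               # (batch, batch)
    targets = torch.full_like(logits, smoothing / (logits.size(1) - 1))
    targets.fill_diagonal_(1.0 - smoothing)
    return -(targets * F.log_softmax(logits, dim=1)).sum(1).mean()

z1 = F.normalize(torch.randn(8, 32), dim=1)
z2 = F.normalize(torch.randn(8, 32), dim=1)
print(smoothed_info_nce(z1, z2))
```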
Finally, we also develop a relational classifier head based on the query and support video representations. Hence, our contributions better utilize the few support videos for representing query proposals and thus attain precise common action localization. We show the effectiveness of our work with state-of-the-art performance on benchmark datasets (ActivityNet1.3 and THUMOS14), and analyze each component extensively.","Few-shot action localization, common action localization, commonality" Generalize Learned Heuristics to Solve Large-scale Vehicle Routing Problems in Real-time,https://openreview.net/forum?id=6ZajpxqTlQ,https://openreview.net/pdf?id=6ZajpxqTlQ,Propose a zero-shot method to generalize the data-driven heuristics trained on small-scale VRPs to solve large-scale VRPs in real-time,"Large-scale Vehicle Routing Problems (VRPs) are widely used in logistics, transportation, supply chains, and robotic systems. Recently, data-driven VRP heuristics have been proposed to generate real-time VRP solutions with up to 100 nodes. However, current heuristics for large-scale VRPs still face three challenges: 1) it is hard to generalize the heuristics learned on small-scale VRPs to large-scale VRPs in a zero-shot way; 2) it is hard to generate real-time solutions for large-scale VRPs; 3) it is hard to embed global constraints in learned heuristics. We contribute in three directions: We propose a Two-stage Divide Method (TAM) to generate a sub-route sequence rather than a node sequence, generalizing the heuristics learned on small-scale VRPs to solve large-scale VRPs in real time. A two-step reinforcement learning method with new reward and padding techniques is proposed to train our TAM. A global mask function is proposed to keep the global constraints satisfied when dividing a large-scale VRP into several small-scale Traveling Salesman Problems (TSPs). As a result, we can solve the small-scale TSPs in parallel quickly. The experiments on synthetic and real-world large-scale VRPs show our method can generalize the learned heuristics trained on datasets of VRP 100 to solve VRPs with over 5000 nodes in real time, with solution quality better than data-driven heuristics and competitive with traditional heuristics. ","Learning, Vehicle Routing Problem, Large-scale Vehicle Routing Problem, Generalization, Combinatorial Optimization, Reinforcement Learning, Attention" MUTUAL EXCLUSIVE MODULATOR FOR LONG-TAILED RECOGNITION,https://openreview.net/forum?id=Y-PoPmNuLHZ,https://openreview.net/pdf?id=Y-PoPmNuLHZ,,"Long-tailed recognition (LTR) is the task of learning high-performance classifiers given extremely imbalanced training samples across categories. Most of the existing works address the problem by either enhancing the features of tail classes or re-balancing the classifiers to reduce the inductive bias. In this paper, we try to look into the root cause of the LTR task, i.e., that training samples for each class are greatly imbalanced, and propose a straightforward solution. We split the categories into three groups, i.e., many, medium and few, according to the number of training images. The three groups of categories are predicted separately to reduce the difficulty of classification. This idea naturally raises a new problem: how do we assign a given sample to the right class group? We introduce a mutual exclusive modulator which can estimate the probability of an image belonging to each group. In particular, the modulator consists of a light-weight module and is learned with a mutually exclusive objective. 
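A minimal sketch of a global feasibility mask in the spirit of TAM's global mask function described above, assuming a capacity-constrained VRP; the exact formulation in the paper may differ. A node can be selected only if it is unvisited and its demand fits the vehicle's remaining capacity, while the depot stays available so a sub-route can be closed.

```python
import numpy as np

def global_mask(demands, visited, remaining_capacity):
    """Return a boolean mask over nodes that are feasible to visit next."""
    mask = (~visited) & (demands <= remaining_capacity)
    mask[0] = True  # the depot (index 0) is always available
    return mask

demands = np.array([0, 3, 5, 2, 7])   # node 0 is the depot
visited = np.array([False, True, False, False, False])
print(global_mask(demands, visited, remaining_capacity=4))
# -> [ True False False  True False]: only node 3 (and the depot) fits.
```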
Hence, the output probabilities of the modulator encode the data volume clues of the training dataset. They are further utilized as prior information to guide the prediction of the classifier. We conduct extensive experiments on multiple datasets, e.g., ImageNet-LT, Place-LT and iNaturalist, to evaluate the proposed approach. Our method achieves competitive performance compared to the state-of-the-art benchmarks.","long-tailed recognition, mutual exclusive modulator, soft routing" RetinexUTV: ROBUST RETINEX MODEL WITH UNFOLDING TOTAL VARIATION,https://openreview.net/forum?id=BUewet8vCFr,https://openreview.net/pdf?id=BUewet8vCFr,,"Digital images are often underexposed due to poor scene lighting or hardware limitations, which reduces visibility and the level of detail in the image and affects subsequent high-level tasks and image aesthetics. Therefore, it is of great practical significance to enhance low-light images. Among existing low-light image enhancement techniques, retinex-based methods are a major focus today. However, most retinex methods either ignore or poorly handle noise during enhancement, which can produce unpleasant visual effects in low-light image enhancement and affect high-level tasks. In this paper, we propose a robust low-light image enhancement method, RetinexUTV, which aims to enhance low-light images well while suppressing noise. In RetinexUTV, we propose an adaptive illumination estimation unfolded total variation network, which approximates the noise level of the real low-light image by learning the balance parameter of the total variation regularization term of the model, and obtains the noise level map and the smooth noise-free sub-map of the image. The initial illumination map is then estimated by obtaining the illumination information of the smooth sub-map. The initial reflection map is obtained through the initial illumination map and the original image. Under the guidance of the noise level map, the noise of the reflection map is suppressed, and finally it is multiplied by the adjusted illumination map to obtain the final enhancement result. We test our method on the real low-light datasets LOL and VELOL, and experiments demonstrate that our method outperforms state-of-the-art methods.","low light image enhancement, retinex, noise suppression, total variation" Adapting Pre-trained Language Models for Quantum Natural Language Processing,https://openreview.net/forum?id=-NAi1oQJbA3,https://openreview.net/pdf?id=-NAi1oQJbA3,,"The emerging classical-quantum transfer learning paradigm has brought decent performance to quantum computational models in many tasks, such as computer vision, by enabling a combination of quantum models and classical pre-trained neural networks. However, using quantum computing with pre-trained models has not yet been explored in natural language processing (NLP). Due to the high linearity constraints of the underlying quantum computing infrastructures, existing Quantum NLP models are limited in performance on real tasks. We fill this gap by pre-training a sentence state with a complex-valued BERT-like architecture, and adapting it to the classical-quantum transfer learning scheme for sentence classification. 
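A minimal PyTorch sketch of the mutual exclusive modulator idea above: a light-weight head predicts mutually exclusive many/medium/few group probabilities via a softmax, which then re-weight group-wise classifier scores. The architecture and group boundaries below are hypothetical instantiations for illustration.

```python
import torch
import torch.nn as nn

class MutualExclusiveModulator(nn.Module):
    """Light-weight head predicting many/medium/few group probabilities."""
    def __init__(self, feat_dim: int, n_groups: int = 3):
        super().__init__()
        self.head = nn.Linear(feat_dim, n_groups)

    def forward(self, feats):
        # Softmax makes the group probabilities mutually exclusive.
        return torch.softmax(self.head(feats), dim=-1)

feat_dim, n_cls = 64, 10
groups = [slice(0, 4), slice(4, 7), slice(7, 10)]   # many / medium / few
modulator = MutualExclusiveModulator(feat_dim)
classifier = nn.Linear(feat_dim, n_cls)

feats = torch.randn(2, feat_dim)
probs = classifier(feats).softmax(-1)
g = modulator(feats)                                 # (batch, 3) group priors
scores = torch.cat([probs[:, idx] * g[:, i:i + 1]    # modulate each group
                    for i, idx in enumerate(groups)], dim=1)
print(scores.shape)  # (2, 10)
```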
In quantum simulation experiments, the pre-trained representation can bring 50% to 60% increases in the capacity of end-to-end quantum models.","Quantum Computing, Complex-valued Neural Network, Pre-trained Language Model" Towards the Generalization of Contrastive Self-Supervised Learning,https://openreview.net/forum?id=XDJwuEYHhme,https://openreview.net/pdf?id=XDJwuEYHhme,This paper presents a theoretical understanding of contrastive learning and provides an upper bound on the downstream classification error.,"Recently, self-supervised learning has attracted great attention, since it only requires unlabeled data for model training. Contrastive learning is one popular method for self-supervised learning and has achieved promising empirical performance. However, the theoretical understanding of its generalization ability is still limited. To this end, we define a $(\sigma,\delta)$-measure to mathematically quantify the data augmentation, and then provide an upper bound on the downstream classification error rate based on the measure. It reveals that the generalization ability of contrastive self-supervised learning is related to three key factors: *alignment* of positive samples, *divergence* of class centers, and *concentration* of augmented data. The first two factors can be optimized by contrastive algorithms, while the third one is determined a priori by the pre-defined data augmentation. With the above theoretical findings, we then study two canonical contrastive losses, InfoNCE and cross-correlation, to see how they satisfy the first two factors. Furthermore, we conduct various experiments to study the third factor, and observe that the downstream performance is highly correlated with the concentration of augmented data.","deep learning theory, contrastive learning, generalization error" Towards Controllable Policy through Goal-Masked Transformers,https://openreview.net/forum?id=VYaTFO2Myi5,https://openreview.net/pdf?id=VYaTFO2Myi5,,"Offline goal-conditioned supervised learning (GCSL) can learn to achieve various goals from purely offline datasets without reward information, enhancing control over the policy. However, we argue that learning a composite policy that can switch seamlessly among different goals should be an essential task for obtaining a controllable policy. This feature should be learnable if the dataset contains enough data about such switches. Unfortunately, most existing datasets either partially or entirely lack such switching demonstrations. Current GCSL approaches that use hindsight information concentrate primarily on reachability at the state or return level. They might not work as expected when the goal is changed within an episode. To this end, we present Goal-Masked Transformers (GMT), an efficient GCSL algorithm based on transformers with goal masking. GMT makes use of trajectory-level hindsight information, which is automatically gathered and can be adjusted for various statistics of interest. Due to the autoregressive nature of GMT, we can change the goal and control the policy at any time. We empirically evaluate GMT on MuJoCo continuous control benchmarks and Atari discrete control games with image states to compare GMT against baselines. We illustrate that GMT can infer the missing switching processes from the given dataset and thus switch smoothly among different goals. 
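The first two factors in the contrastive-generalization abstract above can be probed empirically. A minimal sketch with simple Euclidean proxies, assuming embeddings of two augmented views and class labels; the paper's exact definitions of alignment and divergence may differ.

```python
import numpy as np

def alignment(z_a, z_b):
    """Mean distance between positive-pair embeddings (lower = better aligned)."""
    return np.linalg.norm(z_a - z_b, axis=1).mean()

def divergence(z, labels):
    """Smallest pairwise distance between class centers (higher = more divergent)."""
    centers = np.stack([z[labels == c].mean(0) for c in np.unique(labels)])
    d = np.linalg.norm(centers[:, None] - centers[None], axis=-1)
    return d[~np.eye(len(centers), dtype=bool)].min()

rng = np.random.default_rng(0)
z_a = rng.normal(size=(100, 16))           # view-1 embeddings
z_b = z_a + 0.1 * rng.normal(size=(100, 16))  # view-2 embeddings (positives)
labels = rng.integers(0, 5, size=100)
print(alignment(z_a, z_b), divergence(z_a, labels))
```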
As a result, GMT demonstrates its ability to control the policy and succeeds on all the tasks with low variance, while existing GCSL methods can hardly succeed at goal switching.", Fed-CBS: Heterogeneity-Aware Client Sampling Mechanism for Federated Learning via Class-Imbalance Reduction,https://openreview.net/forum?id=rktxwkgbbPB,https://openreview.net/pdf?id=rktxwkgbbPB,,"Due to the limited communication capacities of edge devices, most existing federated learning (FL) methods randomly select only a subset of devices to participate in training for each communication round. Compared with engaging all the available clients, the random-selection mechanism can lead to significant performance degradation on non-IID (independent and identically distributed) data. In this paper, we present our key observation that the essential reason for such performance degradation is the class imbalance of the grouped data from the randomly selected clients. Based on this observation, we design an efficient heterogeneity-aware client sampling mechanism, i.e., Federated Class-balanced Sampling (Fed-CBS), which can effectively reduce the class imbalance of the grouped data from the intentionally selected clients. In particular, we propose a measure of class-imbalance and then employ homomorphic encryption to derive this measure in a privacy-preserving way. Based on this measure, we also design a computation-efficient client sampling strategy, such that the actively selected clients will generate a more class-balanced grouped dataset. Extensive experimental results demonstrate that Fed-CBS outperforms the status quo approaches. Furthermore, it achieves comparable or even better performance than the ideal setting where all the available clients participate in the FL training. In addition, we provide the theoretical convergence guarantee of Fed-CBS.", Comparative Analysis between Vision Transformers and CNNs from the view of Neuroscience,https://openreview.net/forum?id=jrrokKkjVsz,https://openreview.net/pdf?id=jrrokKkjVsz,"Neural sparsity of Transformers and CNNs is defined and calculated, leading to a striking conclusion.","Neuroscience has provided many inspirations for the development of artificial intelligence, especially for neural networks for computer vision tasks. Recent research on animals' visual systems builds a connection between neural sparsity and animals' levels of evolution, based on which comparisons between the two most influential vision architectures, the Transformer and the CNN, are carried out. In particular, the sparsity of attention in Transformers is comprehensively studied, and previous knowledge on the sparsity of neurons in CNNs is reviewed. In addition, a novel metric for neural sparsity is defined and ablation experiments are launched on various types of Transformer and CNN models. 
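A minimal sketch of the Fed-CBS-style idea above, assuming a simple squared-deviation-from-uniform measure of class imbalance and a greedy selection rule; the paper's exact measure, its probabilistic sampling strategy, and the homomorphic-encryption machinery are omitted here.

```python
import numpy as np

def imbalance(counts):
    """Squared deviation of the pooled label distribution from uniform."""
    p = counts / max(counts.sum(), 1e-9)
    return float(((p - 1.0 / len(counts)) ** 2).sum())

def greedy_select(client_counts, k):
    """Greedily pick k clients whose pooled data is most class-balanced."""
    chosen, pooled = [], np.zeros(client_counts.shape[1])
    for _ in range(k):
        best = min((i for i in range(len(client_counts)) if i not in chosen),
                   key=lambda i: imbalance(pooled + client_counts[i]))
        chosen.append(best)
        pooled += client_counts[best]
    return chosen, pooled

rng = np.random.default_rng(1)
client_counts = rng.integers(0, 50, size=(10, 5)).astype(float)  # 10 clients, 5 classes
print(greedy_select(client_counts, k=3))
```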
Finally, we draw the conclusion that more layers in a model result in higher sparsity; however, too many heads in Transformers may cause a reduction in sparsity, which we attribute to the significant overlap among the effects of attention units.","Vision Transformer, CNN, neuroscience, sparsity" Uncertainty-Aware Meta-Learning for Multimodal Task Distributions,https://openreview.net/forum?id=bd7tj6MoZn,https://openreview.net/pdf?id=bd7tj6MoZn,"We present a novel meta-learning algorithm that makes probabilistic predictions efficiently, detects out-of-distribution context data, and performs well on heterogeneous, multimodal task distributions.","Meta-learning, or learning to learn, is a popular approach for learning new tasks with limited data (i.e., few-shot learning) by leveraging the commonalities among different tasks. However, meta-learned models can perform poorly when context data is limited, or when data is drawn from an out-of-distribution (OoD) task. Especially in safety-critical settings, this necessitates an uncertainty-aware approach to meta-learning. In addition, the often multimodal nature of task distributions can pose unique challenges to meta-learning methods. In this work, we present UnLiMTD (Uncertainty-aware meta-Learning for Multimodal Task Distributions), a novel method for meta-learning that (1) makes probabilistic predictions on in-distribution tasks efficiently, (2) is capable of detecting OoD context data at test time, and (3) performs well on heterogeneous, multimodal task distributions. To achieve this goal, we take a probabilistic perspective and train a parametric, tuneable distribution over tasks on the meta-dataset. We construct this distribution by performing Bayesian inference on a linearized neural network, leveraging Gaussian process theory. We demonstrate that UnLiMTD’s predictions compare favorably to, and in most cases outperform, the standard baselines, especially in the low-data regime. Furthermore, we show that UnLiMTD is effective in detecting data from OoD tasks. Finally, we confirm that both of these findings continue to hold in the multimodal task-distribution setting.","Meta-learning, Bayesian inference, neural network linearization, uncertainty estimation, Gaussian Process, NTK" Neural Operator Variational Inference based on Regularized Stein Discrepancy for Deep Gaussian Processes,https://openreview.net/forum?id=AONW9iXn22,https://openreview.net/pdf?id=AONW9iXn22,,"A Deep Gaussian Process (DGP) model is a hierarchical composition of GP models that provides a deep Bayesian nonparametric approach to infer the posterior. Exact Bayesian inference is usually intractable for DGPs, motivating the use of various approximations. We theoretically demonstrate that the traditional alternative of mean-field Gaussian assumptions across the hierarchy leads to a lack of expressiveness and efficacy in DGP models, whilst stochastic approximation often incurs a significant computational cost. To address this issue, we propose Neural Operator Variational Inference (NOVI) for Deep Gaussian Processes, where a sampler is obtained from a neural generator by minimizing the Regularized Stein Discrepancy in L2 space between the approximate distribution and the true posterior. The resulting minimax problem is solved by Monte Carlo estimation and subsampling stochastic optimization. 
We experimentally demonstrate the effectiveness and efficiency of the proposed model by applying it to a more flexible and wider class of posterior approximations on data ranging in size from hundreds to tens of thousands. By comparison, NOVI is superior to previous methods in both classification and regression.","Deep Gaussian processes, Operator variational inference, Stein discrepancy" On the complexity of nonsmooth automatic differentiation,https://openreview.net/forum?id=uqg3FhRZaq,https://openreview.net/pdf?id=uqg3FhRZaq,Backpropagation of nonsmooth gradients is proved to be a fast/cheap process for the vast class of locally Lipschitz semi-algebraic functions.,"Using the notion of conservative gradient, we provide a simple model to estimate the computational costs of the backward and forward modes of algorithmic differentiation for a wide class of nonsmooth programs. The complexity overhead of the backward mode turns out to be independent of the dimension when using programs with locally Lipschitz semi-algebraic or definable elementary functions. This considerably extends the Baur-Strassen smooth cheap gradient principle. We illustrate our results by establishing fast backpropagation results of conservative gradients through feedforward neural networks with standard activation and loss functions. Nonsmooth backpropagation's cheapness contrasts with concurrent forward approaches, which have, to this day, dimension-dependent worst-case overhead estimates. We provide further results suggesting the superiority of backward propagation of conservative gradients. Indeed, we relate the complexity of computing a large number of directional derivatives to that of matrix multiplication, and we show that finding two subgradients in the Clarke subdifferential of a function is an NP-hard problem.","Automatic differentiation, nonsmooth derivatives, computational complexity, cheap derivatives, conservative gradients" CO3: Cooperative Unsupervised 3D Representation Learning for Autonomous Driving,https://openreview.net/forum?id=QUaDoIdgo0,https://openreview.net/pdf?id=QUaDoIdgo0,"We propose CO3, namely {Co}operative {Co}ntrastive Learning and {Co}ntextual Shape Prediction, to learn 3D representation for outdoor-scene point clouds in an unsupervised manner.","Unsupervised contrastive learning for indoor-scene point clouds has achieved great success. However, unsupervised representation learning on outdoor-scene point clouds remains challenging because previous methods need to reconstruct the whole scene and capture partial views for the contrastive objective. This is infeasible in outdoor scenes with moving objects, obstacles, and sensors. In this paper, we propose CO3, namely {Co}operative {Co}ntrastive Learning and {Co}ntextual Shape Prediction, to learn 3D representation for outdoor-scene point clouds in an unsupervised manner. CO3 has several merits compared to existing methods. (1) It utilizes LiDAR point clouds from the vehicle side and the infrastructure side to build views that differ sufficiently while maintaining common semantic information for contrastive learning, which are more appropriate than views built by previous methods. (2) Alongside the contrastive objective, we propose contextual shape prediction to bring more task-relevant information for unsupervised 3D point cloud representation learning, and we also provide a theoretical analysis for this pre-training goal. 
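The cheap gradient principle in the nonsmooth-AD abstract above can be felt numerically: one reverse-mode sweep yields the full (conservative) gradient of a nonsmooth program built from ReLUs, whereas forward mode needs one sweep per input direction. A minimal timing sketch using `torch.func` (the sizes and depth are arbitrary illustrative choices):

```python
import time
import torch
from torch.func import grad, jvp

# A deep scalar-valued program f: R^n -> R built from nonsmooth primitives.
n, depth = 500, 50
Ws = [torch.randn(n, n) / n ** 0.5 for _ in range(depth)]

def f(x):
    for W in Ws:
        x = torch.relu(W @ x)   # ReLU is locally Lipschitz semi-algebraic
    return x.sum()

x = torch.randn(n)

t0 = time.perf_counter()
g = grad(f)(x)                  # one reverse-mode sweep: the whole gradient
t1 = time.perf_counter()
# Forward mode needs one sweep per direction: n sweeps for the full gradient.
for e in torch.eye(n)[:10]:     # only 10 of the n directions, already slower
    _, d = jvp(f, (x,), (e,))
t2 = time.perf_counter()
print(f"reverse (full gradient): {t1 - t0:.3f}s, "
      f"forward (10/{n} directions): {t2 - t1:.3f}s")
```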
(3) Compared to previous methods, the representation learned by CO3 can be transferred to different outdoor-scene datasets collected by different types of LiDAR sensors. (4) CO3 improves current state-of-the-art methods on the Once, KITTI and NuScenes datasets by up to 2.58 mAP on the 3D object detection task and 3.54 mIoU on the LiDAR semantic segmentation task. Codes and models will be released.","Cooperative Contrastive Learning, Contextual Shape Prediction, Unsupervised Representation Learning, Autonomous Driving" Bag of Tricks for Unsupervised Text-to-Speech,https://openreview.net/forum?id=SbR9mpTuBn,https://openreview.net/pdf?id=SbR9mpTuBn,We introduce a bag of tricks to enable effective unsupervised TTS using low-quality and multi-speaker unpaired data.,"Unsupervised text-to-speech (TTS) aims to train TTS models for a specific language without any paired speech-text training data in that language. Existing methods either use speech and corresponding pseudo text generated by an unsupervised automatic speech recognition (ASR) model as training data, or employ the back-translation technique. Though effective, they suffer from low robustness to low-quality data and heavy dependence on the lexicon of a language, which is sometimes unavailable, leading to difficulty in convergence, especially in low-resource language scenarios. In this work, we introduce a bag of tricks to enable effective unsupervised TTS. Specifically, 1) we carefully design a voice conversion model to normalize the variable and noisy information in the low-quality speech data while preserving the pronunciation information; 2) we employ the non-autoregressive TTS model to overcome the robustness issue; and 3) we explore several tricks applied in back-translation, including curriculum learning, length augmentation and auxiliary supervised loss, to stabilize the back-translation and improve its effectiveness. Through experiments, we demonstrate that our method achieves better intelligibility and audio quality than all previous methods, and that these tricks are essential to the performance gain.","speech synthesis, unsupervised learning" "FedSpeed: Larger Local Interval, Less Communication Round, and Higher Generalization Accuracy",https://openreview.net/forum?id=bZjxxYURKT,https://openreview.net/pdf?id=bZjxxYURKT,A novel and practical federated learning method with theoretical analysis guarantees achieves higher performance in the common federated settings.,"Federated learning (FL) is an emerging distributed machine learning framework which jointly trains a global model via a large number of local devices with data privacy protections. Its performance suffers from the non-vanishing biases introduced by inconsistent local optima and the severe client drift caused by local over-fitting. In this paper, we propose a novel and practical method, FedSpeed, to alleviate the negative impacts posed by these problems. Concretely, FedSpeed applies a prox-correction term to the current local updates to efficiently reduce the bias introduced by the prox-term, a necessary regularizer for maintaining strong local consistency. Furthermore, FedSpeed merges the vanilla stochastic gradient with a perturbation computed from an extra gradient ascent step in the neighborhood, thereby alleviating the issue of local over-fitting. 
Our theoretical analysis indicates that the convergence rate is related to both the communication rounds $T$ and local intervals $K$, with a tighter upper bound $\mathcal{O}(\frac{1}{T})$ if $K=\mathcal{O}(T)$. Moreover, we conduct extensive experiments on real-world datasets to demonstrate the efficiency of our proposed FedSpeed, which converges significantly faster than several baselines, including FedAvg, FedProx, FedCM, FedAdam, SCAFFOLD, FedDyn, FedADMM, etc., and achieves state-of-the-art (SOTA) performance in the general FL experimental settings.",federated learning Holistically Explainable Vision Transformers,https://openreview.net/forum?id=jw37FUa_Aw9,https://openreview.net/pdf?id=jw37FUa_Aw9,"We propose B-cos ViTs, which are inherently interpretable transformer models.","Transformers increasingly dominate the machine learning landscape across many tasks and domains, which increases the importance of understanding their outputs. While their attention modules provide partial insight into their inner workings, the attention scores have been shown to be insufficient for explaining the models as a whole. To address this, we propose B-cos transformers, which inherently provide holistic explanations for their decisions. Specifically, we formulate each model component—such as the multi-layer perceptrons, attention layers, and the tokenisation module—to be dynamic linear, which allows us to faithfully summarise the entire transformer via a single linear transform. We apply our proposed design to Vision Transformers (ViTs) and show that the resulting models, dubbed Bcos-ViTs, are highly interpretable and perform competitively with baseline ViTs on ImageNet. Code will be available at: github.com/anonymous/authors.","Explainable Deep Neural Networks, Vision Transformers, XAI" Delving into Discrete Normalizing Flows on SO(3) Manifold for Probabilistic Rotation Modeling,https://openreview.net/forum?id=pvrkJUkmto,https://openreview.net/pdf?id=pvrkJUkmto,,"Normalizing flows (NFs) provide a powerful tool to construct an expressive distribution by a sequence of tractable transformations of a base distribution, forming a probabilistic model of the underlying data. Rotation, as an important quantity in computer vision, graphics and robotics, can exhibit many ambiguities when occlusion and symmetry occur and thus demands such probabilistic models. Though various NFs in Euclidean space have been proposed, there are no normalizing flows tailored for the SO(3) manifold. Given the unique non-Euclidean properties of the rotation manifold, adapting the existing NFs to the SO(3) manifold is non-trivial. In this paper, we propose a novel normalizing flow on SO(3) by combining a Möbius transformation-based layer and a quaternion affine transformation. With our proposed rotation normalizing flows, one can not only effectively express arbitrary distributions on SO(3), but also conditionally build the target distribution given input observations. Extensive experiments show that our rotation normalizing flows significantly outperform the baselines on both unconditional and conditional tasks.", Neural Volumetric Mesh Generator,https://openreview.net/forum?id=W6cTWszOQSo,https://openreview.net/pdf?id=W6cTWszOQSo,,"Deep generative models have shown success in generating 3D shapes with different representations. In this work, we propose Neural Volumetric Mesh Generator (NVMG), which can generate novel and high-quality volumetric meshes. 
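A heavily simplified sketch of one FedSpeed-style local step described above, combining the vanilla gradient with a gradient taken at a nearby ascent point and a prox term toward the global model. The merging weight, the form of the `correction` stand-in for the prox-correction term, and the toy quadratic objective are all assumptions for illustration, not the paper's exact algorithm.

```python
import torch

def fedspeed_local_step(w, w_global, grad_fn, lr=0.1, rho=0.05, lam=0.1,
                        correction=None):
    """One illustrative local update: merged gradient + prox term."""
    g = grad_fn(w)
    g_adv = grad_fn(w + rho * g / (g.norm() + 1e-12))  # extra ascent step
    g_merged = 0.5 * (g + g_adv)                       # merge the two gradients
    prox = lam * (w - w_global)                        # pull toward global model
    if correction is not None:
        prox = prox - correction                       # prox-correction stand-in
    return w - lr * (g_merged + prox)

# Toy quadratic local objective: f(w) = 0.5 * ||w - target||^2.
target = torch.tensor([1.0, -2.0])
grad_fn = lambda w: w - target
w = torch.zeros(2)
for _ in range(100):
    w = fedspeed_local_step(w, w_global=torch.zeros(2), grad_fn=grad_fn)
print(w)  # settles between the local optimum and the global model
```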
Unlike previous 3D generative models for point clouds, voxels, and implicit surfaces, the volumetric mesh representation is a ready-to-use representation in industry, with details on both the surface and the interior. Generating such highly structured data thus brings a significant challenge. We first propose a diffusion-based generative model to tackle this problem by generating voxelized shapes with close-to-reality outlines and structures. From the voxelized shape, we can simply obtain a tetrahedral mesh as a template. Further, we use a voxel-conditional neural network to predict the smooth implicit surface conditioned on the voxels, and progressively project the tetrahedral mesh onto the predicted surface under regularization. The regularization terms are carefully designed so that they can (1) eliminate defects such as flipping and high distortion; and (2) enforce the regularity of the interior and surface structure during the deformation procedure for a high-quality final mesh. As shown in the experiments, our pipeline can generate high-quality artifact-free volumetric and surface meshes from random noise or a reference image without any post-processing. Compared with the state-of-the-art voxel-to-mesh deformation method, we show more robustness and better performance when taking generated voxels as input.", PathFusion: Path-consistent Lidar-Camera Deep Feature Fusion,https://openreview.net/forum?id=2t7L0lcDqAr,https://openreview.net/pdf?id=2t7L0lcDqAr,,"Fusing cameras with LiDAR is a promising technique to improve the accuracy of 3D detection due to their complementary physical properties. While most existing methods focus on fusing camera features directly with raw LiDAR point clouds or shallow 3D features, it is observed that direct deep 3D feature fusion achieves inferior accuracy due to feature mis-alignment. The mis-alignment, which originates from the feature aggregation across large receptive fields, becomes increasingly severe for deep network stages. In this paper, we propose PathFusion to enable path-consistent LiDAR-camera deep feature fusion. PathFusion introduces a path consistency loss between shallow and deep features, which encourages the 2D backbone and its fusion path to transform 2D features in a way that is semantically aligned with the transform of the 3D backbone. We apply PathFusion to the prior-art fusion baseline, Focals Conv, and observe more than 1.2% mAP improvements on the nuScenes test split consistently with and without test-time augmentations. Moreover, PathFusion also improves KITTI AP 3D (R11) by more than 0.6% on the moderate level.", DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization,https://openreview.net/forum?id=Siln8xpTMrZ,https://openreview.net/pdf?id=Siln8xpTMrZ,We introduce a novel decentralized asynchronous accelerated stochastic first-order algorithm to minimize a sum of smooth and strongly convex functions over a time-varying connectivity network.,"DADAO is a novel decentralized asynchronous stochastic first-order algorithm to minimize a sum of $L$-smooth and $\mu$-strongly convex functions distributed over a time-varying connectivity network of size $n$. We model the local gradient updates and gossip communication procedures with separate independent Poisson Point Processes, decoupling the computation and communication steps in addition to making the whole approach completely asynchronous. 
Our method employs primal gradients and does not use a multi-consensus inner loop or other ad-hoc mechanisms such as Error Feedback, Gradient Tracking, or a Proximal operator. By relating the inverse of the smallest positive eigenvalue $\chi^*_1$ and the effective resistance $\chi_2^*$ of our graph to a necessary minimal communication rate between nodes of the network, we show that our algorithm requires $\mathcal{O}(n\sqrt{\frac{L}{\mu}}\log \frac{1}{\epsilon})$ local gradients and only $\mathcal{O}(n\sqrt{\chi_1^*\chi_2^*}\sqrt{\frac{L}{\mu}}\log \frac{1}{\epsilon})$ communications to reach a precision $\epsilon$. If SGD with uniform noise $\sigma^2$ is used, we reach a precision $\epsilon$ with the same speed, up to a bias term in $\mathcal{O}(\frac{\sigma^2}{\sqrt{\mu L}})$. This improves upon the bounds obtained with current state-of-the-art approaches, with our simulations validating the strength of our relatively unconstrained method.","Decentralized Asynchronous Optimization, Convex Optimization, Time-Varying Networks" Enabling Probabilistic Inference on Large-Scale Spiking Neural Networks,https://openreview.net/forum?id=EBJG0A0PUo1,https://openreview.net/pdf?id=EBJG0A0PUo1,,"Deep spiking neural networks have achieved success in many machine learning tasks. However, most existing works consider deterministic SNNs, which ignore the inherent randomness of neurons. On the other hand, existing works on stochastic SNNs are limited to small networks and are hard to scale to larger SNN topologies. We introduce Noisy SNNs (NSNNs), built upon a stochastic noisy LIF neuron model, to enable probabilistic inference on large-scale SNN topologies. By viewing NSNN as a Bayesian network, we derive a three-factor learning rule called noise-driven learning (NDL) for synaptic optimization. The post-synaptic factor in NDL is obtained using the neuronal membrane noise statistics, avoiding the problematic derivative of the Heaviside spiking function and providing an explanation for surrogate gradients from the standpoint of random noise. NDL is backpropagation-compatible, enabling NSNNs to be extended to any SNN topology through modular replacement (code is available at https://cutt.ly/9CxT5jI). Evaluations on CIFAR-10/100 and DVS-CIFAR show that NSNNs achieve competitive performance in clean test scenarios. Furthermore, NSNNs exhibit high robustness against challenging perturbations like adversarial perturbation and spike-level disturbance.","spiking neural networks, SNNs" Less is More: Identifying the Cherry on the Cake for Dynamic Networks,https://openreview.net/forum?id=HHPEkUi5POw,https://openreview.net/pdf?id=HHPEkUi5POw,"We reveal the contradiction between the human brain and dynamic networks, then propose and validate the Cherry Hypothesis to show that a partially dynamic network (PAD-Net) could advance the performance in dynamic networks.","Dynamic networks, e.g., Dynamic Convolution (DY-Conv) and the Mixture of Experts (MoE), have been extensively explored as they can considerably improve the model's representation power with acceptable computational cost. The common practice in implementing dynamic networks is to convert given static layers into fully dynamic ones where all parameters are dynamic (at least within a single layer) and vary with the input. Recent studies empirically show the trend that more dynamic layers contribute to ever-increasing performance. 
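A minimal escape-noise sketch of the noisy LIF neuron behind NSNN above: with Gaussian membrane noise of scale `sigma`, the probability of firing at potential `v` is the Gaussian CDF `Phi((v - theta) / sigma)`, whose derivative is a smooth bump that plays the role of the post-synaptic factor replacing the Heaviside derivative. The discrete-time formulation and parameter values here are assumptions for illustration; NSNN's exact neuron model may differ.

```python
import numpy as np
from math import erf, sqrt

def noisy_lif(inputs, tau=0.8, theta=1.0, sigma=0.3, seed=0):
    """Discrete-time noisy LIF neuron with escape-noise stochastic firing."""
    rng = np.random.default_rng(seed)
    v, spikes, fire_prob = 0.0, [], []
    for x in inputs:
        v = tau * v + x                                   # leaky integration
        p = 0.5 * (1.0 + erf((v - theta) / (sigma * sqrt(2.0))))  # P(v + noise > theta)
        s = float(rng.random() < p)                       # stochastic spike
        spikes.append(s); fire_prob.append(p)
        v *= 1.0 - s                                      # reset after a spike
    return np.array(spikes), np.array(fire_prob)

spikes, probs = noisy_lif(np.full(20, 0.5))
print(spikes)
print(probs.round(2))  # smooth firing probabilities usable for gradients
```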
However, such a fully dynamic setting 1) may cause redundant parameters and high deployment costs, limiting the applicability of dynamic networks to a broader range of tasks and models, and more importantly, 2) contradicts the previous discovery in the human brain that \textit{when human brains process an attention-demanding task, only part of the neurons in the task-specific areas are activated by the input, while the rest remain in a baseline state.} Critically, there is no effort to understand and resolve the above contradictory finding, leaving the primary question -- to make the computational parameters fully dynamic or not? -- unanswered. The main contributions of our work are to challenge the basic common practice in dynamic networks, and to propose and validate the \textsc{cherry hypothesis} -- \textit{A fully dynamic network contains a subset of dynamic parameters such that, when the other dynamic parameters are transformed into static ones, the network can maintain or even exceed the performance of the original network.} Technically, we propose a brain-inspired partially dynamic network, namely PAD-Net, to transform the redundant dynamic parameters into static ones. We further design Iterative Mode Partition to partition the dynamic and static subnets, which alleviates the redundancy in traditional fully dynamic networks. Our hypothesis and method are comprehensively supported by large-scale experiments with two typical advanced dynamic methods, i.e., DY-Conv and MoE, on both image classification and GLUE benchmarks. Encouragingly, we surpass the fully dynamic networks by $+0.7\%$ top-1 acc with only $30\%$ dynamic parameters for ResNet-50 and $+1.9\%$ average score in language understanding tasks with only $50\%$ dynamic parameters for BERT-base. ","Dynamic Networks, Cherry Hypothesis, Efficient Architecture Designation." Advancing Radiograph Representation Learning with Masked Record Modeling,https://openreview.net/forum?id=w-x7U26GM7j,https://openreview.net/pdf?id=w-x7U26GM7j,We propose to learn radiograph representations via masked record modeling.,"Modern studies in radiograph representation learning (R$^2$L) rely on either self-supervision to encode invariant semantics or associated radiology reports to incorporate medical expertise, while the complementarity between them is barely noticed. To explore this, we formulate the self- and report-completion as two complementary objectives and present a unified framework based on masked record modeling (MRM). In practice, MRM reconstructs masked image patches and masked report tokens following a multi-task scheme to learn knowledge-enhanced semantic representations. With MRM pre-training, we obtain pre-trained models that can be well transferred to various radiography tasks. Specifically, we find that MRM offers superior performance in label-efficient fine-tuning. For instance, MRM achieves 88.5% mean AUC on CheXpert using 1% labeled data, outperforming previous R$^2$L methods with 100% labels. On NIH ChestX-ray, MRM outperforms the best-performing counterpart by about 3% under small labeling ratios. 
In addition, MRM surpasses self- and report-supervised pre-training in identifying the pneumonia type and the pneumothorax area, sometimes by large margins.","Representation Learning, Radiograph, Self-supervised Learning, Medical Imaging" Instance-wise Batch Label Restoration via Gradients in Federated Learning,https://openreview.net/forum?id=FIrQfNSOoTr,https://openreview.net/pdf?id=FIrQfNSOoTr,We propose an analytic method to perform instance-wise batch label restoration and enhance the existing gradient inversion attacks.,"Gradient inversion attacks have posed a serious threat to the privacy of federated learning. The attacks search for the optimal pair of input and label best matching the shared gradients, and the search space of the attacks can be reduced by pre-restoring labels. Recently, label restoration techniques have allowed labels to be extracted from gradients analytically, but even the state of the art remains limited to identifying the presence of categories (i.e., class-wise label restoration). This work considers the more realistic setting, where there are multiple instances of each class in a training batch. An analytic method is proposed to perform instance-wise batch label restoration from only the gradient of the final layer. On the basis of the approximately recovered class-wise embeddings and post-softmax probabilities, we establish linear equations relating the gradients, probabilities and labels to derive the Number of Instances (NoI) per class via the Moore-Penrose pseudoinverse algorithm. Our experimental evaluations reach over 99% Label existence Accuracy (LeAcc) and exceed 96% Label number Accuracy (LnAcc) in most cases on three image datasets and four classification models. The two metrics are used to evaluate class-wise and instance-wise label restoration accuracy, respectively. The recovery remains feasible even with a batch size of 4096 and partially negative activations (e.g., Leaky ReLU and Swish). Furthermore, we demonstrate that our method facilitates the existing gradient inversion attacks by exploiting the recovered labels, with an increase of 6-7 in PSNR on both MNIST and CIFAR100.","federated learning, batch label restoration, gradient inversion attack." Self Check-in: Tight Privacy Amplification for Practical Distributed Learning,https://openreview.net/forum?id=xq-CQz6-gfg,https://openreview.net/pdf?id=xq-CQz6-gfg,"A more practical/realistic protocol of differentially private federated learning, with emphasis given to the privacy analysis ","Recent studies of distributed computation with formal privacy guarantees, such as differentially private (DP) federated learning, leverage random sampling of clients in each round (privacy amplification by subsampling) to achieve satisfactory levels of privacy. Achieving this, however, requires precise and uniform subsampling of clients as well as a highly trusted orchestrating server, strong assumptions which may not hold in practice. In this paper, we explore a more practical protocol, self check-in, to resolve the aforementioned issues. The protocol relies on each client making an independent and random decision to participate in the computation, freeing the requirement of server-initiated subsampling and enabling robust modelling of client dropouts. Our protocol has immediate application to employing intermediate trust models, i.e., shuffle and distributed DP models, for realizing distributed learning in practice. 
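The core linear relation behind the instance-wise label restoration described above is easy to verify: for softmax cross-entropy with mean reduction, the final-layer bias gradient equals the mean of (probabilities minus one-hot labels), so per-class instance counts follow from a linear equation. In the real attack the summed probabilities must themselves be approximated (the paper recovers them from class-wise embeddings); the sketch below uses the true probabilities purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
B, C = 64, 10
logits = rng.normal(size=(B, C))
labels = rng.integers(0, C, size=B)
probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
onehot = np.eye(C)[labels]

# Gradient of the mean cross-entropy loss w.r.t. the final-layer bias:
grad_b = (probs - onehot).mean(0)

# grad_b = (sum_i p_i - counts) / B  =>  counts = sum_i p_i - B * grad_b.
counts_est = probs.sum(0) - B * grad_b
counts_true = onehot.sum(0)
print(np.round(counts_est))  # matches the true per-class instance counts
print(counts_true)
```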
To this end, we present a novel analysis based on R{\'e}nyi differential privacy (RDP) that improves the privacy guarantee over analyses based on approximate DP's strong composition across various parameter regimes of self check-in. We also provide a numerical approach to track the privacy of a generic shuffling mechanism, including distributed learning with the Gaussian mechanism, which may be of independent interest since, to our knowledge, it is the first evaluation of a generic mechanism within the local/shuffle model under the distributed setting. Empirical studies demonstrate the efficacy of learning as well.","differential privacy, federated learning, privacy amplification" Re-parameterizing Your Optimizers rather than Architectures,https://openreview.net/forum?id=B92TMCG_7rp,https://openreview.net/pdf?id=B92TMCG_7rp,Modify gradient flow to incorporate model-specific prior knowledge into the optimizers for training simple and efficient models.,"The well-designed structures in neural networks reflect the prior knowledge incorporated into the models. However, though different models have various priors, we are used to training them with model-agnostic optimizers such as SGD. In this paper, we propose to incorporate model-specific prior knowledge into optimizers by modifying the gradients according to a set of model-specific hyper-parameters. Such a methodology is referred to as Gradient Re-parameterization, and the optimizers are named RepOptimizers. For extreme simplicity of model structure, we focus on a VGG-style plain model and showcase that such a simple model trained with a RepOptimizer, which is referred to as RepOpt-VGG, performs on par with or better than the recent well-designed models. From a practical perspective, RepOpt-VGG is a favorable base model because of its simple structure, high inference speed and training efficiency. Compared to Structural Re-parameterization, which adds priors into models via constructing extra training-time structures, RepOptimizers require no extra forward/backward computations and solve the problem of quantization. We hope to spark further research beyond the realms of model structure design. We will make the code and models publicly available.","Deep Learning, Model Architecture, Optimizer, Re-parameterization" Protein Representation Learning via Knowledge Enhanced Primary Structure Reasoning,https://openreview.net/forum?id=VbCMhg7MRmj,https://openreview.net/pdf?id=VbCMhg7MRmj,We perform protein knowledge encoding by learning to exploit knowledge graphs for protein primary structure reasoning.,"Protein representation learning has primarily benefited from the remarkable development of language models (LMs). Accordingly, pre-trained protein models also suffer from a problem shared with LMs: a lack of factual knowledge. A recent solution models the relationships between proteins and associated knowledge terms as the knowledge encoding objective. However, it fails to consider the semantic gap between protein sequences and natural language, and the resulting feature misalignment may adversely affect representation learning. To mitigate this, we propose Knowledge-exploited Auto-encoder for Proteins (KeAP), which performs implicit knowledge encoding by learning to exploit knowledge for protein primary structure reasoning. 
In practice, the protein representation iteratively queries the associated knowledge terms to extract and integrate helpful information for restoring missing amino acids via attention, avoiding a direct comparison between the two modalities. We show that KeAP can consistently outperform the previous counterpart on 9 representative downstream applications, sometimes surpassing it by large margins. These results suggest that KeAP provides an alternative yet effective way to perform knowledge encoding in protein representation learning.","Protein Science, Representation Learning, Knowledge Graph" Provable Unsupervised Data Sharing for Offline Reinforcement Learning,https://openreview.net/forum?id=MTTPLcwvqTt,https://openreview.net/pdf?id=MTTPLcwvqTt,We propose a principled way to leverage unlabeled offline RL dataset with guarantees in linear MDPs and it outperforms previous methods.,"Self-supervised methods play a vital role in fueling the progress of deep learning using supervision from the data itself, obviating the need for expensive annotations. The same merit applies to offline reinforcement learning (RL), which conducts RL in a supervised manner, but it is unclear how to utilize such unlabeled data to improve offline RL in a principled way. In this paper, we examine the theoretical benefit of unlabeled data in the context of linear MDPs and propose a novel and Provable Data Sharing algorithm, which we refer to as PDS, to utilize such unlabeled data for offline RL. PDS utilizes additional penalties upon the reward function learned from labeled data to avoid potential overestimation of the reward. We show that such a penalty is crucial to keep the algorithm conservative, and PDS achieves a provable benefit from unlabeled data under mild conditions. We conduct extensive experiments on various offline RL tasks and show that PDS can significantly improve offline RL algorithms with unlabeled data.","offline reinforcement learning, unsupervised learning, data sharing" Federated Learning for Inference at Anytime and Anywhere,https://openreview.net/forum?id=CRhzJqLhnwU,https://openreview.net/pdf?id=CRhzJqLhnwU,,"Federated learning has been predominantly concerned with collaborative training of deep networks from scratch, and especially the many challenges that arise, such as communication cost, robustness to heterogeneous data, and support for diverse device capabilities. However, there is no unified framework that addresses all these problems together. This paper studies the challenges and opportunities of exploiting pre-trained Transformer models in FL. In particular, we propose to efficiently adapt such pre-trained models by injecting a novel attention-based adapter module at each transformer block that both modulates the forward pass and makes an early prediction. Training only the lightweight adapter by FL leads to fast and communication-efficient learning even in the presence of heterogeneous data and devices. 
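The adapter idea just described can be sketched as follows; the single-head attention, bottleneck width, and mean-pooled early-exit head are illustrative assumptions, not the paper's exact design:

```python
# Minimal sketch of an attention-based adapter attached to a transformer block
# that (a) modulates the block's forward pass and (b) makes an early prediction.
import torch
import torch.nn as nn

class AttentionAdapter(nn.Module):
    def __init__(self, dim, num_classes, bottleneck=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.down = nn.Linear(dim, bottleneck)   # lightweight bottleneck
        self.up = nn.Linear(bottleneck, dim)
        self.head = nn.Linear(dim, num_classes)  # early-exit classifier

    def forward(self, x):                        # x: (B, T, dim) block output
        mod, _ = self.attn(x, x, x)              # adapter self-attention
        x = x + self.up(torch.relu(self.down(mod)))  # residual modulation
        early_logits = self.head(x.mean(dim=1))       # early prediction
        return x, early_logits

# In FL, only the adapter parameters would be trained and communicated:
# trainable = [p for m in adapters for p in m.parameters()]
```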
Extensive experiments on standard FL benchmarks, including CIFAR-100, FEMNIST, and SpeechCommandsv2, demonstrate that this simple framework provides fast and accurate FL while supporting heterogeneous device capabilities, efficient personalization, and scalable-cost anytime inference.",Federated Learning Modeling Sequential Sentence Relation to Improve Cross-lingual Dense Retrieval,https://openreview.net/forum?id=-bVsNeR56KS,https://openreview.net/pdf?id=-bVsNeR56KS,,"Recently, multi-lingual pre-trained language models (PLMs) such as mBERT and XLM-R have achieved impressive strides in cross-lingual dense retrieval. Despite these successes, they are general-purpose PLMs, while multilingual PLMs tailored for cross-lingual retrieval remain unexplored. Motivated by an observation that the sentences in parallel documents are approximately in the same order, which is universal across languages, we propose to model this sequential sentence relation to facilitate cross-lingual representation learning. Specifically, we propose a multilingual PLM called masked sentence model (MSM), which consists of a sentence encoder to generate the sentence representations, and a document encoder applied to a sequence of sentence vectors from a document. The document encoder is shared for all languages to model the universal sequential sentence relation across languages. To train the model, we propose a masked sentence prediction task, which masks and predicts the sentence vector via a hierarchical contrastive loss with sampled negatives. Comprehensive experiments on four cross-lingual retrieval tasks show MSM significantly outperforms existing advanced pre-training models, demonstrating the effectiveness and stronger cross-lingual retrieval capabilities of our approach. Code and model will be available.", Boosting Discriminative Visual Representation Learning with Scenario-Agnostic Mixup,https://openreview.net/forum?id=km2lP70ds-0,https://openreview.net/pdf?id=km2lP70ds-0,We propose a learnable scenario-agnostic mixup (SAMix) method for both self-supervised and supervised vision representation learning.,"Mixup is a popular data-dependent augmentation technique for deep neural networks, which contains two sub-tasks: mixup generation and classification. The community typically confines mixup to supervised learning (SL), and the objective of the generation sub-task is fixed to the selected sample pair instead of considering the whole data manifold. To overcome such limitations, we systematically study the mixup generation objective and propose Scenario-Agnostic Mixup for both SL and Self-supervised Learning (SSL) scenarios, named SAMix. Specifically, we hypothesize and verify the objective function of mixup generation as optimizing local smoothness between two mixed classes subject to global discrimination from other classes. Therefore, we propose an η-balanced mixup loss for complementary learning of the two sub-objectives. Meanwhile, we parameterize the generation sub-task as a learnable sub-network, Mixer, with mixing attention, which avoids trivial solutions and improves transferable abilities. To eliminate the computational cost of online training, we introduce a pre-trained version, SAMix$^P$, that achieves efficient performance in various tasks. 
Extensive experiments on SL and SSL benchmarks demonstrate that SAMix consistently outperforms leading methods.","Data Augmentation, Mixup, Image Classification, Self-supervised Learning, Representation Learning" A Robustly and Effectively Optimized Pretraining Approach for Masked Autoencoder,https://openreview.net/forum?id=LHBiPX5BOwZ,https://openreview.net/pdf?id=LHBiPX5BOwZ,,"Recently, Masked Image Modeling (MIM) has increasingly reshaped the status quo of self-supervised visual pre-training. This paper does not describe a novel MIM method, but instead unravels several fundamental ingredients for robustly and effectively pre-training a Masked AutoEncoder (MAE), with improved downstream performance as a byproduct. We highlight the great significance of encouraging high-variance interactions across different tokens for the whole autoencoder, while simultaneously smoothing the inter-patch variances of the reconstruction target. First, at the decoding phase, we apply standard dropout upon the attention probabilities as noise to randomly mask out the edge connections across different tokens. Otherwise, their shortcut interactions might hinder the emergence of meaningful contextual representations. Second, we point out that the per-patch normalization will fail unless the patch pixels rely on some population statistics to reduce inter-patch variance and thus smooth the reconstruction. Third, we show that autoencoders with different capacities encounter the issue to varying degrees, and that learnable masked tokens can be employed to manipulate the variance depending on their inserted position and ratio in the model. The proposed techniques are simple and effective, stabilizing the pre-training of a masked autoencoder and obtaining superior performance across different downstream tasks. ", Diffusion Posterior Sampling for General Noisy Inverse Problems,https://openreview.net/forum?id=OnD9zGAGT0k,https://openreview.net/pdf?id=OnD9zGAGT0k,We propose a diffusion model-based general inverse problem solver that scales to nonlinear problems and different noise statistics.,"Diffusion models have been recently studied as powerful generative inverse problem solvers, owing to their high-quality reconstructions and the ease of combining existing iterative solvers. However, most works focus on solving simple linear inverse problems in noiseless settings, which significantly under-represents the complexity of real-world problems. In this work, we extend diffusion solvers to efficiently handle general noisy (non)linear inverse problems via the Laplace approximation of the posterior sampling. Interestingly, the resulting posterior sampling scheme is a blended version of diffusion sampling with the manifold constrained gradient, without a strict measurement consistency projection step, yielding a more desirable generative path in noisy settings compared to previous studies. 
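The blended scheme just described can be pictured as a standard reverse-diffusion update plus a likelihood-gradient correction through the posterior-mean estimate; the forward operator A, the Gaussian-noise fidelity term, and the step size zeta below are placeholders, not the paper's exact algorithm:

```python
# Sketch of a measurement-guided reverse step: after an unconditional denoising
# update, nudge the sample along the gradient of the data-fidelity term
# evaluated at the posterior-mean estimate x0_hat.
import torch

def guided_reverse_step(x_t, y, denoiser_step, predict_x0, A, zeta=1.0):
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = predict_x0(x_t)                            # posterior-mean (Tweedie) estimate
    residual = torch.linalg.vector_norm(y - A(x0_hat))  # Gaussian-noise fidelity
    grad = torch.autograd.grad(residual, x_t)[0]
    x_prev = denoiser_step(x_t).detach()                # unconditional reverse update
    return x_prev - zeta * grad                         # likelihood-guided correction
```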
Our method demonstrates that diffusion models can incorporate various measurement noise statistics such as Gaussian and Poisson, and also efficiently handle noisy nonlinear inverse problems such as Fourier phase retrieval and non-uniform deblurring.","Diffusion model, Inverse problem, Posterior sampling" Low-Rank Graph Neural Networks Inspired by the Weak-balance Theory in Social Networks,https://openreview.net/forum?id=ufCQZeAMZzf,https://openreview.net/pdf?id=ufCQZeAMZzf,"Inspired by the global low-rank structures of signed networks, we propose to explicitly model the coefficient matrix as a low-rank matrix, based on which the aggregation and propagation are performed.","Graph Neural Networks (GNNs) have achieved state-of-the-art performance on node classification tasks by exploiting both the graph structures and node features. Generally, most existing GNNs depend on the implicit homophily assumption that nodes belonging to the same class are more likely to be connected. However, GNNs may fail to model heterophilious graphs where nodes with different labels tend to be linked, as shown in recent studies. To address this issue, we propose a generic GNN applicable to both homophilious and heterophilious graphs, namely Low-Rank Graph Neural Network (LRGNN). In detail, we aim at computing a coefficient matrix such that the sign of each coefficient reveals whether the corresponding two nodes belong to the same class, which is similar to the sign inference problem. In Signed Social Networks (SSNs), the sign inference problem can be modeled as a low-rank matrix factorization (LRMF) problem due to the global low-rank structure described by the weak balance theory. In this paper, we show that signed graphs are naturally generalized weakly-balanced when considering node classification tasks. Motivated by this observation, we propose to leverage LRMF to recover a coefficient matrix from a partially observed signed adjacency matrix. To effectively capture the node similarity, we further incorporate the low-rank representation (LRR) method. Our theoretical result shows that under our update rule of node representations, LRR obtained by solving a subspace clustering problem can recover the subspace structure of node representations. To solve the corresponding optimization problem, we utilize an iterative optimization algorithm with a convergence guarantee and develop a neural-style initialization manner that enables fast convergence. Finally, extensive experimental evaluation on both real-world and synthetic graphs has validated the superior performance of LRGNN over various state-of-the-art GNNs. In particular, LRGNN can offer clear performance gains in scenarios where the node features are not informative enough.","graph neural networks, heterophily, social theory, low rank" Do We Need Neural Collapse? Learning Diverse Features for Fine-grained and Long-tail Classification,https://openreview.net/forum?id=5gri-cs4RVq,https://openreview.net/pdf?id=5gri-cs4RVq,Neural collapse is not what you need: Deep features with within-class diversity improve the performance of fine-grained and long-tail learning,"Feature extractors learned from supervised training of deep neural networks have demonstrated superior performance over handcrafted ones. Recently, it has been shown that such learned features exhibit a neural collapse property, where within-class features collapse to the class mean and different class means are maximally separated. 
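Collapse of this kind can be checked with a simple diagnostic that compares within-class scatter to between-class scatter of penultimate-layer features; this NC1-style ratio is a common measure and is given here only for illustration:

```python
# Diagnostic for within-class feature collapse: a ratio near zero indicates
# that features have collapsed to their class means.
import numpy as np

def collapse_ratio(features, labels):
    """features: (N, D) array, labels: (N,) integer array."""
    global_mean = features.mean(axis=0)
    within, between = 0.0, 0.0
    for c in np.unique(labels):
        fc = features[labels == c]
        mu_c = fc.mean(axis=0)
        within += ((fc - mu_c) ** 2).sum()                       # scatter around class mean
        between += len(fc) * ((mu_c - global_mean) ** 2).sum()   # class-mean separation
    return within / between
```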
This paper examines the neural collapse property in the context of fine-grained classification tasks, where a feature extractor pretrained on a classification task with coarse labels is used for generating features for a downstream classification task with fine-grained labels. We argue that the within-class feature collapse is an undesirable property for fine-grained classification. Hence, we introduce a geometric arrangement of features called the maximal-separating-cone, where within-class features lie in a cone of nontrivial radius instead of collapsing to the class mean, and cones of different classes are maximally separated. We present a technique based on classifier weight and training loss design to produce such an arrangement. Experimentally, we demonstrate an improved fine-grained classification performance with a feature extractor pretrained by our method. Moreover, our technique also provides benefits for classification on data with a long-tail distribution over classes. Our work may motivate future efforts on the design of better geometric arrangements of deep features.","Neural Collapse, Diverse deep learning features, Fine-grained transfer learning" Node-Level Membership Inference Attacks Against Graph Neural Networks,https://openreview.net/forum?id=XKHNu4OF6wn,https://openreview.net/pdf?id=XKHNu4OF6wn,We perform the first comprehensive analysis of node-level membership inference attacks against GNNs.,"Many real-world data are graphs, such as social networks and protein structures. To fully utilize the information contained in graph data, graph neural networks (GNNs) have been introduced. Previous studies have shown that machine learning models are vulnerable to privacy attacks. However, most of the current efforts concentrate on ML models trained on images and texts. On the other hand, privacy risks stemming from GNNs remain largely unstudied. In this paper, we fill the gap by performing the first comprehensive analysis of node-level membership inference attacks against GNNs. We systematically define the threat models and propose eight node-level membership inference attacks based on an adversary's background knowledge. Our evaluation on four GNN structures and four benchmark datasets shows that GNNs are vulnerable to node-level membership inference even when the adversary has minimal background knowledge. Besides, we show that node degree, graph density, and feature similarity have major impacts on the attack's success. We further investigate three defense mechanisms and show that differential privacy (DP) can better protect membership privacy while preserving the model's utility.","Graph Neural Network, Membership Inference Attack" HRBP: Hardware-friendly Regrouping towards Block-wise Pruning for Sparse Training,https://openreview.net/forum?id=OSS-yWzE9Yu,https://openreview.net/pdf?id=OSS-yWzE9Yu,"This paper proposes a novel block-wise pruning algorithm, which accelerates the sparse training of convolutional neural networks at both forward and backward pass. ","Recently, pruning at initialization and training a sparse network from scratch (sparse training) have become increasingly popular. However, most sparse training literature addresses only unstructured sparsity, which in practice brings little benefit to training acceleration on GPUs due to the irregularity of non-zero weights. In this paper, we work on sparse training with fine-grained structured sparsity, by extracting a few dense blocks from unstructured sparse weights. 
For Convolutional Neural Networks (CNNs), however, the extracted dense blocks are broken in backpropagation due to the shape transformation of convolution filters implemented by GEMM. Thus, previous block-wise pruning methods can only be used to accelerate the forward pass of sparse CNN training. To address this, we propose Hardware-friendly Regrouping towards Block-based Pruning (HRBP), where the grouping is conducted on the kernel-wise mask. With HRBP, extracted dense blocks are preserved in backpropagation. We further propose HRBP++ to reduce zero kernels by extracting common sparse kernel patterns on all kernels within one block. Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet demonstrate that HRBP (HRBP++) can almost match the accuracy of unstructured sparse training methods while achieving a huge acceleration on hardware. ","efficient training, sparse training, fine-grained structured sparsity, grouping algorithm" MAGA: Modeling a Group Action,https://openreview.net/forum?id=U9FkAFUBzu2,https://openreview.net/pdf?id=U9FkAFUBzu2,We make a new generative model that is capable of combinatorial generalization.,"Combinatorial generalization, an ability to collect various attributes from diverse data and assemble them to generate novel unexperienced data, is considered an essential traversal point to achieve human-level intelligence. Previous unsupervised approaches mainly focused on learning disentangled representations, as in the variational autoencoder. However, recent studies discovered that the disentangled representation is insufficient for combinatorial generalization and is not even correlated with it. In this regard, we propose a novel framework of data generation that can robustly generalize under these distribution shift situations. The model, simulating the group action, carries out combinatorial generalization by discovering the fundamental transformations between the data. We conducted experiments on two settings: Recombination-to-Element and Recombination-to-Range. The experiments demonstrated that our method has quantitatively and qualitatively superior generalizability and generates better images than traditional models. ","Generative Model, Generalization, Deep Learning, Representation Learning" Learning in Compressed Domain via Knowledge Transfer,https://openreview.net/forum?id=AcyZ0Q5p6G8,https://openreview.net/pdf?id=AcyZ0Q5p6G8,We propose learning in compressed domain by transferring the knowledge learned in pixel domain.,"Learning in compressed domain aims to perform vision tasks directly on compressed latent representations instead of reconstructed images. Existing reports show that learning in compressed domain can achieve performance comparable to that in pixel domain for certain compression models. However, we observe that when using the state-of-the-art learned compression models, the performance gap between compressed-domain and pixel-domain vision tasks is still large due to the lack of some natural inductive biases of pixel-domain convolutional neural networks. In this paper, we attempt to address this problem by transferring knowledge from pixel domain to compressed domain. We first modify neural networks for pixel-domain vision tasks to better suit compressed-domain inputs. In addition, we propose a knowledge transfer loss to narrow the gap between compressed domain and pixel domain. 
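A knowledge transfer loss of this kind can be pictured as feature matching against a frozen pixel-domain teacher; the MSE form and the 1x1 projection below are assumptions for illustration, not the paper's exact loss:

```python
# Sketch of pixel-to-compressed-domain knowledge transfer: align the
# compressed-domain student's features with a frozen pixel-domain teacher's.
import torch.nn as nn
import torch.nn.functional as F

def transfer_loss(student_feats, teacher_feats, proj: nn.Conv2d):
    """student_feats: (B, Cs, H, W) from the compressed-domain network,
       teacher_feats: (B, Ct, H, W) from the frozen pixel-domain network."""
    aligned = proj(student_feats)                     # 1x1 conv aligns channels
    return F.mse_loss(aligned, teacher_feats.detach())

# total = task_loss + lambda_kt * transfer_loss(fs, ft, proj)  # lambda_kt assumed
```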
Experimental results on classification and instance segmentation show that the proposed method significantly improves the accuracy of compressed-domain vision tasks, even outperforming learning on reconstructed images while avoiding the computational cost of image reconstruction.","compressed-domain vision, image compression, knowledge transfer" DepthFL : Depthwise Federated Learning for Heterogeneous Clients,https://openreview.net/forum?id=pf8RIZTMU58,https://openreview.net/pdf?id=pf8RIZTMU58,DepthFL is a new federated learning framework based on depth scaling to tackle system heterogeneity.,"Federated learning is for training a global model without collecting private local data from clients. As clients repeatedly need to upload locally updated weights or gradients instead, they require sufficient computation and communication resources to participate in learning, but in reality their resources are heterogeneous. To enable resource-constrained clients to train smaller local models, width scaling techniques have been used, which reduce the channels of a global model. Unfortunately, width scaling suffers from heterogeneity of local models when averaging them, leading to a lower accuracy than when simply excluding resource-constrained clients from training. This paper proposes a new approach based on depth scaling called DepthFL. DepthFL defines local models of different depths by pruning the deepest layers off the global model, and allocates them to clients depending on their available resources. Since many clients do not have enough resources to train deep local models, this would make deep layers partially trained with insufficient data, unlike shallow layers that are fully trained. DepthFL alleviates this problem by mutual self-distillation of knowledge among the classifiers of various depths within a local model. Our experiments show that depth-scaled local models build a global model better than width-scaled ones, and that self-distillation is highly effective in training data-insufficient deep layers.","Federated Learning, Heterogeneity" Masked Image Modeling with Denoising Contrast,https://openreview.net/forum?id=1fZd4owfJP6,https://openreview.net/pdf?id=1fZd4owfJP6,"We first treat masked patch prediction as denoising contrastive learning in self-supervised image pre-training, achieving state-of-the-art results.","Throughout the development of self-supervised visual representation learning from contrastive learning to masked image modeling (MIM), the essence has remained the same: designing proper pretext tasks for vision dictionary look-up. MIM recently dominates this line of research with state-of-the-art performance on vision Transformers (ViTs), where the core is to enhance the patch-level visual context capturing of the network via a denoising auto-encoding mechanism. Rather than tailoring image tokenizers with extra training stages as in previous works, we unleash the great potential of contrastive learning on denoising auto-encoding and introduce a pure MIM method, ConMIM, to produce simple intra-image inter-patch contrastive constraints as the sole learning objectives for masked patch prediction. We further strengthen the denoising mechanism with asymmetric designs, including image perturbations and model progress rates, to improve the network pre-training. 
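The intra-image inter-patch contrastive constraint can be sketched as an InfoNCE-style loss whose dictionary is the image's own patches; the temperature and the source of the target representations are assumptions here:

```python
# Sketch of an intra-image inter-patch contrastive objective: each masked
# position's prediction should match the corresponding patch of the unmasked
# view, with the image's other patches serving as negatives.
import torch
import torch.nn.functional as F

def interpatch_contrastive_loss(pred, target, masked_idx, tau=0.1):
    """pred:   (N, D) predictions at the masked positions of one image
       target: (M, D) patch representations of the same image (the dictionary)
       masked_idx: (N,) long, index of each prediction's true patch in target"""
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = pred @ target.t() / tau        # (N, M) patch similarities
    return F.cross_entropy(logits, masked_idx)
```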
ConMIM-pretrained models with various scales achieve competitive results on downstream image classification, semantic segmentation, object detection, and instance segmentation tasks; e.g., on ImageNet-1K classification, we achieve 83.9% top-1 accuracy with ViT-Small and 85.3% with ViT-Base without extra data for pre-training.","masked image modeling, self-supervised learning, image pre-training" Holding Monotonic Improvement and Generality for Multi-Agent Proximal Policy Optimization,https://openreview.net/forum?id=cALu06i7JJH,https://openreview.net/pdf?id=cALu06i7JJH,,"Proximal Policy Optimization (PPO) has achieved empirical successes in the field of single-agent reinforcement learning thanks to guaranteed monotonic improvement. The theoretical support makes its extension to multi-agent systems very attractive. However, existing PPO-based algorithms in cooperative multi-agent reinforcement learning (MARL) either lack the theoretical monotonic improvement guarantee or have inevitably restrictive settings, which greatly limit their applicable scenarios. In this paper, we propose a theoretically-justified and general multi-agent PPO algorithm for cooperative MARL called Full-Pipeline PPO (FP3O). The core idea of FP3O is to dynamically allocate agents to different optimization pipelines and perform the proposed one-separation trust region optimization for each pipeline. We prove in theory the monotonicity of joint policy improvement when executing the policy iteration procedure of FP3O. In addition, FP3O enjoys high generality since it avoids the restrictive factors that could arise in other existing PPO-based algorithms. In our experiments, FP3O outperforms other strong baselines on Multi-Agent MuJoCo and StarCraftII Multi-Agent Challenge benchmarks and also demonstrates its generality to the common network types (i.e., full parameter sharing, partial parameter sharing, and non-parameter sharing) and various multi-agent tasks.", Monkeypox with Cross Infection Hypothesis via Epidemiological Mode,https://openreview.net/forum?id=UFaOH39SZ4y,https://openreview.net/pdf?id=UFaOH39SZ4y,,"Monkeypox, a re-emerging infectious disease of 2022 that is structurally related to smallpox, is induced by the monkeypox virus and has caused 59,606 active cases with 18 deaths up to September 15, 2022. To end this ongoing epidemic, there is a need for population-wide control policies such as reducing social interaction through social distancing, treating infected individuals, and restricting contact with animals. We forecast the progression of the epidemic and come up with an efficient control mechanism by formulating a mathematical model. The biological feasibility and dynamical behavior of the proposed model are then investigated together with sensitivity analysis to obtain the effect of various epidemic parameters on mitigating the spread of the disease. Subsequently, taking non-pharmaceutical and pharmaceutical intervention strategies as control measures, optimal control theory is applied to mitigate the fatality of the disease; to minimize the infectious population and reduce the cost of controls, we construct an objective functional and solve it using Pontryagin’s maximum principle. 
Finally, extensive numerical simulations are performed to show the impact of the application of intervention mechanisms in controlling the transmission of the monkeypox epidemic.", LPMARL: Linear Programming based Implicit Task Assignment for Hierarchical Multi-agent Reinforcement Learning,https://openreview.net/forum?id=gcjxr_g48GU,https://openreview.net/pdf?id=gcjxr_g48GU,Linear programming-based optimal agent-task allocation for hierarchical multi-agent reinforcement learning.,"Training a multi-agent reinforcement learning (MARL) model with sparse reward is notoriously difficult because the terminal reward is induced by numerous interactions among agents. In this study, we propose linear programming-based hierarchical MARL (LPMARL) to learn effective cooperative strategies among agents. LPMARL is composed of two hierarchical decision-making schemes: (1) solving an agent-task assignment problem and (2) solving a local cooperative game among agents that are assigned to the same task. For the first step, LPMARL formulates the agent-task assignment problem as linear programming (LP) using the state-dependent cost parameters generated by a graph neural network (GNN). Solving the LP can be seen as assigning tasks to agents, which decomposes the original problem into a set of task-dependent sub-problems. After solving the formulated LP, LPMARL employs a general MARL strategy to derive a lower-level policy to solve each sub-task in a cooperative manner. We train the LP-parameter generating GNN layer and the low-level MARL policy network, which are the essential components for making hierarchical decisions, in an end-to-end manner using the implicit function theorem. We empirically demonstrate that LPMARL learns an optimal agent-task allocation and the subsequent local cooperative control policy among agents in sub-groups for solving various mixed cooperative-competitive environments.","Linear programming, Multi-agent reinforcement learning, Hierarchical multi-agent reinforcement learning, Implicit deep learning" Transmission Dynamics of Hepatitis B: Analysis and Control,https://openreview.net/forum?id=YmVcNC2oCzq,https://openreview.net/pdf?id=YmVcNC2oCzq,,"Hepatitis B infection attacks the liver and can produce acute and chronic disease; it is a major, life-threatening health problem around the globe. Controlling this infection is difficult for several reasons, such as variation in human behavior, proper medication, vaccination, and the existence of a large number of carriers, but understanding the dynamics of the infection helps to design appropriate control strategies. Thus, a proper complex dynamical system is needed to find the stability conditions and propose intervention strategies for forecasting the control of hepatitis B virus transmission. We formulate a model that will be helpful to investigate the temporal dynamics and suggest control strategies for hepatitis B infection. The well-posedness of the proposed model is shown and used to find the threshold parameter for analyzing the model equilibria and their stability. We also perform sensitivity analysis of the threshold quantity to identify the most sensitive epidemic parameters. 
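As a toy illustration of such a threshold quantity, here is a minimal SIR-style compartmental sketch, not the authors' hepatitis B model: the ratio R0 = beta/gamma decides whether the infection grows or dies out, and all parameter values are invented:

```python
# Generic compartmental sketch with a reproduction-number threshold.
import numpy as np
from scipy.integrate import solve_ivp

beta, gamma = 0.30, 0.10           # transmission and recovery rates (assumed)
R0 = beta / gamma                  # threshold quantity; > 1 means outbreak

def sir(t, y):
    S, I, R = y
    return [-beta * S * I, beta * S * I - gamma * I, gamma * I]

sol = solve_ivp(sir, (0, 200), [0.99, 0.01, 0.0], max_step=0.5)
print(f"R0 = {R0:.2f}, peak infected fraction = {sol.y[1].max():.3f}")
```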
Based on the temporal dynamics and sensitivity, we investigate effective methods to minimize the infection of hepatitis B, and develop algorithms to support the theoretical results with the help of numerical simulations.", Mass-Editing Memory in a Transformer,https://openreview.net/forum?id=MkbcAHIYgyS,https://openreview.net/pdf?id=MkbcAHIYgyS,An algorithm that can make tens of thousands of edits to an autoregressive transformer's memory.,"Recent work has shown exciting promise in updating large language models with new memories, so as to replace obsolete information or add specialized knowledge. However, this line of work is predominantly limited to updating single associations. We develop MEMIT, a method for directly updating a language model with many memories, demonstrating experimentally that it can scale up to thousands of associations for GPT-J (6B) and GPT-NeoX (20B), exceeding prior work by an order of magnitude. Our code and data will be open-sourced upon publication.","language models, GPT, transformers, model editing, factual associations, memory" Enhancement and Numerical Assessment of Novel SARS-CoV-2 Virus Transmission Model,https://openreview.net/forum?id=iQYFdEL6KeS,https://openreview.net/pdf?id=iQYFdEL6KeS,,"The recent coronavirus pandemic, which started in December 2019, has affected almost all groups of humankind. In this regard, accurate epidemic models are not only crucial for demonstrating the mitigation of the current pandemic but also helpful for forecasting its future dynamics. In this work, we propose a model for SARS-CoV-2 virus transmission to forecast the temporal dynamics of the novel coronavirus disease by considering the characteristics of the disease and the recent literature. Due to the nondeterministic and stochastic nature of the novel coronavirus disease, we present the model with the aid of stochastic differential equations by considering two infectious phases, pre-symptomatic and symptomatic, because both are significant in SARS-CoV-2 transmission. We ensure that the model is well-posed and identify the necessary conditions for disease eradication by proving existence and uniqueness and performing extinction analysis. The efficacy of the model and the importance of the current study are demonstrated using actual data. Finally, the model is simulated using the Euler-Maruyama and Milstein numerical schemes to support the theoretical findings and show the significance of the results obtained.", GoBigger: A Scalable Platform for Cooperative-Competitive Multi-Agent Interactive Simulation,https://openreview.net/forum?id=NnOZT_CR26Z,https://openreview.net/pdf?id=NnOZT_CR26Z,,"The emergence of various multi-agent environments has motivated powerful algorithms to explore agents' cooperation or competition. Even though this has greatly promoted the development of multi-agent reinforcement learning (MARL), it is still not enough to support further exploration of the behavior of swarm intelligence between multiple teams, and of cooperation between multiple agents, due to their limited scalability. To alleviate this, we introduce GoBigger, a scalable platform for cooperative-competitive multi-agent interactive simulation. GoBigger is an enhanced environment for the Agar-like game, enabling the simulation of multiple scales of agent intra-team cooperation and inter-team competition. 
Compared with existing multi-agent simulation environments, our platform supports multi-team games with more than two teams simultaneously, which dramatically expands the diversity of agent cooperation and competition, and can more effectively simulate swarm-intelligence agent behavior. Besides, in GoBigger, the cooperation between the agents in a team can lead to much higher performance. We offer a diverse set of challenging scenarios, built-in bots, and visualization tools for best practices in benchmarking. We evaluate several state-of-the-art algorithms on GoBigger and demonstrate the potential of the environment. We believe this platform can inspire various emerging research directions in MARL, swarm intelligence, and large-scale agent interactive learning. Both GoBigger and its related benchmark are open-sourced. More information can be found at anonymized-gobigger.github.io.","Reinforcement Learning, Environment, Cooperation, Competition, Scalable" Masked Unsupervised Self-training for Label-free Image Classification ,https://openreview.net/forum?id=ZAKkiVxiAM9,https://openreview.net/pdf?id=ZAKkiVxiAM9,We propose a new label-free classification method which significantly improves upon CLIP by unsupervised adaptation.,"State-of-the-art computer vision models are mostly trained with supervised learning using human-labeled images, which limits their scalability due to the expensive annotation cost. While self-supervised representation learning has achieved impressive progress, it still requires a second stage of finetuning on labeled data. On the other hand, models pre-trained with large-scale text supervision (e.g., CLIP) have enabled zero-shot transfer to downstream image classification tasks. However, the zero-shot performance of CLIP-like models is often insufficient for real-world adoption. In this paper, we aim to leverage the abundant unlabeled data from a target domain to improve the performance of a pre-trained zero-shot classifier, by unsupervised finetuning of the pre-trained model. We propose Masked Unsupervised Self-Training (MUST), a new approach which leverages two different and complementary sources of training signals: pseudo-labels and raw images. MUST jointly optimizes three objectives to learn both class-level global features and pixel-level local features, and enforces a regularization between the two. We demonstrate the efficacy of MUST on 8 downstream tasks across a variety of domains, where it improves upon CLIP by a large margin. MUST also outperforms supervised few-shot adaptation methods. It achieves a top-1 accuracy of 77.7% on ImageNet using ViT-B, +9.4% higher than CLIP, and +6.2% higher than 16-shot CLIP adaptation. Our code is submitted in the supplementary material.","zero-shot classification, unsupervised learning, self-training, CLIP, masked image modeling" Recursion of Thought: Divide and Conquer Reasoning with Language Models,https://openreview.net/forum?id=PTUcygUoxuc,https://openreview.net/pdf?id=PTUcygUoxuc,"We unleash the reasoning capability of language models, which has been constrained by the maximum size of a single context, by letting them recursively create and utilize multiple contexts.","With the recent advances in language models, attempts are being made to apply them to solving multi-step reasoning problems. A major breakthrough in this line of research is to let language models generate intermediate steps, often called Chain of Thought (CoT), before producing a final answer. 
However, language models have an upper bound on the context size, i.e., the number of input tokens, such as 2048 for the recent GPT-3 and PaLM. Although several thousand tokens are enough to handle various tasks, solving more complex reasoning tasks can require orders of magnitude more tokens. Therefore, the context limit imposes a fundamental limit on the model's reasoning capability. Inspired by humans' incredible reasoning ability based on abstraction and recursion, we propose Recursion of Thought (RoT) as a model-agnostic framework with the novel paradigm of teaching a language model to divide and conquer complex problems by recursively creating multiple contexts. Since RoT casts the context-related operations as tokens, a language model can trigger the recursion operations by simply producing the corresponding tokens. On multiple arithmetic and algorithmic reasoning tasks, we demonstrate that RoT dramatically improves the ability of the recent large-scale language model GPT-3 to solve extremely complex problems. Moreover, RoT can make tiny, randomly initialized Transformers or LSTMs solve problems that even humans find daunting.","reasoning, language models, chain of thought" GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis,https://openreview.net/forum?id=YfwMIDhPccD,https://openreview.net/pdf?id=YfwMIDhPccD,,"Generating photo-realistic video portraits from arbitrary speech audio is a crucial problem in film-making and virtual reality. Recently, several works explore the usage of the neural radiance field in this task to improve 3D realness and image fidelity. However, the generalizability of previous NeRF-based methods is limited by the small scale of training data. In this work, we propose GeneFace, a generalized and high-fidelity NeRF-based talking face generation method, which can generate natural results corresponding to various out-of-domain audio. Specifically, we learn a variational motion generator on a large lip-reading corpus, and introduce a domain-adaptive post-net to calibrate the result. Moreover, we learn a NeRF-based renderer conditioned on the predicted motion. A head-aware torso-NeRF is proposed to eliminate the head-torso separation problem. Extensive experiments show that our method achieves more generalized and high-fidelity talking face generation compared to previous methods. Video samples are available at https://geneface.github.io .","Talking Face Generation, Neural Radiance Field" Environment Partitioning For Invariant Learning By Decorrelation,https://openreview.net/forum?id=z9FyE7jHXkN,https://openreview.net/pdf?id=z9FyE7jHXkN,A method Decorr that does algorithmic environment partitioning and makes IRM more generally applicable.,"Invariant learning methods try to find an invariant predictor across several environments and have become popular in OOD generalization. However, in situations where environments do not naturally exist in the data, they have to be decided by practitioners manually. Environment partitioning, which splits the whole training dataset into environments by algorithms, significantly influences the performance of invariant learning but has been left undiscussed. A good environment partitioning method can bring invariant learning to applications with more general settings and improve its performance. We propose to split the dataset into several environments by finding low-correlated data subsets. Theoretical interpretations and algorithm details are both introduced in the paper. 
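An illustrative stand-in for such decorrelation-based partitioning is a simple random search over splits, scored by the off-diagonal mass of each subset's feature correlation matrix; this is not the paper's actual algorithm, only a sketch of the objective:

```python
# Partition data into environments whose within-subset feature correlations
# are small (low-correlated data subsets).
import numpy as np

def decorrelation_score(x):
    c = np.corrcoef(x, rowvar=False)
    return (c ** 2).sum() - np.trace(c ** 2)  # off-diagonal squared mass

def split_by_decorrelation(x, n_env=2, trials=200, rng=np.random.default_rng(0)):
    best, best_score = None, np.inf
    for _ in range(trials):
        assign = rng.integers(n_env, size=len(x))
        if any((assign == e).sum() < 2 for e in range(n_env)):
            continue  # skip degenerate splits
        score = sum(decorrelation_score(x[assign == e]) for e in range(n_env))
        if score < best_score:
            best, best_score = assign, score
    return best
```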
Through experiments on both synthetic and real data, we show that our Decorr method can achieve outstanding performance, while some other partitioning methods may lead to poor, even below-ERM, results under the same IRM training scheme.","OOD Generalization, IRM, Environment Partitioning, Decorrelation" Learning the Positions in CountSketch,https://openreview.net/forum?id=iV9Cs8s8keU,https://openreview.net/pdf?id=iV9Cs8s8keU,We propose the first learning-based algorithms that also optimize the locations of the non-zero entries of the CountSketch matrix.,"We consider sketching algorithms which first compress data by multiplication with a random sketch matrix, and then apply the sketch to quickly solve an optimization problem, e.g., low-rank approximation and regression. In the learning-based sketching paradigm proposed by Indyk et al., the sketch matrix is found by choosing a random sparse matrix, e.g., CountSketch, and then the values of its non-zero entries are updated by running gradient descent on a training data set. Despite the growing body of work on this paradigm, a noticeable omission is that the locations of the non-zero entries of previous algorithms were fixed, and only their values were learned. In this work, we propose the first learning-based algorithms that also optimize the locations of the non-zero entries. Our first proposed method is based on a greedy algorithm. However, one drawback of the greedy algorithm is its slow training time. We fix this issue and propose approaches for learning a sketching matrix for both low-rank approximation and Hessian approximation for second-order optimization. The latter is helpful for a range of constrained optimization problems, such as LASSO and matrix estimation with a nuclear norm constraint. Both approaches achieve good accuracy with a fast running time. Moreover, our experiments suggest that our algorithm can still reduce the error significantly even if we only have a very limited number of training matrices.","learning-augmented sketches, count-sketch, low-rank approximation, iterative Hessian sketch" Towards the gradient adjustment by loss status for Neural Network Optimization,https://openreview.net/forum?id=tG6xz50IHMD,https://openreview.net/pdf?id=tG6xz50IHMD,," Gradient descent-based algorithms are crucial in neural network optimization, and most of them only depend on local properties such as the first- and second-order momentum of gradients to determine the local optimization directions. As a result, such algorithms often converge slowly in the case of a small gradient and easily fall into local optima. Since the goal of optimization is to minimize the loss function, the status of the loss indicates the overall progress of the optimization but has not been fully explored. In this paper, we propose a loss-aware gradient adjusting strategy (LGA) based on the loss status. LGA automatically adjusts the update magnitude of parameters to accelerate convergence and escape local optima by introducing a loss-incentive correction term that monitors the loss and adapts the gradient experience. The proposed strategy can be applied to various gradient descent-based optimization algorithms. 
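A hypothetical sketch of what such a loss-aware adjustment could look like follows; the specific scaling rule is our assumption, not the paper's correction term:

```python
# Hypothetical loss-aware SGD: boost the step size when the loss stagnates
# (to help escape flat regions or local optima) and shrink it back toward the
# base rate while the loss is still dropping.
import torch

class LossAwareSGD:
    def __init__(self, params, lr=0.1, ema=0.9):
        self.params, self.lr, self.ema = list(params), lr, ema
        self.loss_ema = None

    @torch.no_grad()
    def step(self, loss_value: float):
        prev = self.loss_ema if self.loss_ema is not None else loss_value
        self.loss_ema = self.ema * prev + (1 - self.ema) * loss_value
        # Small relative improvement -> larger multiplicative boost (assumed rule).
        improvement = max(prev - loss_value, 0.0) / (abs(prev) + 1e-12)
        scale = 1.0 + 0.5 / (1.0 + 10.0 * improvement)
        for p in self.params:
            if p.grad is not None:
                p.add_(p.grad, alpha=-self.lr * scale)
```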
We provide a theoretical analysis of the convergence rate and empirical evaluations on different datasets to demonstrate the effectiveness of our method.", On the Necessity of Disentangled Representations for Downstream Tasks,https://openreview.net/forum?id=pjyM7D09M5,https://openreview.net/pdf?id=pjyM7D09M5,We show that dimension-wise disentangled representations are not necessary for downstream tasks using deep neural networks with learned representations as input. ,"A disentangled representation encodes generative factors of data in a separable and compact pattern. Thus it is widely believed that such a representation format benefits downstream tasks. In this paper, we challenge the necessity of disentangled representations in downstream applications. Specifically, we show that dimension-wise disentangled representations are not necessary for downstream tasks using neural networks that take learned representations as input. We provide extensive empirical evidence against the necessity of disentanglement, covering multiple datasets, representation learning methods, and downstream network architectures. Moreover, our study reveals that the informativeness of representations best accounts for downstream performance. The positive correlation between informativeness and disentanglement explains the claimed usefulness of disentangled representations in previous works. ","representation disentanglement, representation learning downstream task" Grouped self-attention mechanism for a memory-efficient Transformer,https://openreview.net/forum?id=jAsM_UoXUeP,https://openreview.net/pdf?id=jAsM_UoXUeP,We propose Grouped Self-Attention (GSA) and Compressed Cross-Attention (CCA) modules to solve the quadratic-order problem of the self-attention mechanism in Transformers.,"Time-series data analysis is important because numerous real-world tasks such as forecasting weather, electricity consumption, and stock markets involve predicting data that vary over time. Time-series data are generally recorded over a long period of observation with long sequences owing to their periodic characteristics and long-range dependencies over time. Thus, capturing long-range dependency is an important factor in time-series data forecasting. To solve these problems, we propose two novel modules, Grouped Self-Attention (GSA) and Compressed Cross-Attention (CCA). With both modules, we achieve a computational space and time complexity of order $O(l)$ with a sequence length $l$ under small hyperparameter limitations, and can capture locality while considering global information. Experiments conducted on time-series datasets show that our proposed model exhibits reduced computational complexity and achieves performance comparable to or better than existing methods. ","time-series data, Transformer, memory-efficient structure, self-attention, computational and space complexity, long-range dependency" Linear Video Transformer with Feature Fixation,https://openreview.net/forum?id=2VWa8qj2vd0,https://openreview.net/pdf?id=2VWa8qj2vd0,,"Vision Transformers have achieved impressive performance in video classification, while suffering from the quadratic complexity caused by the Softmax attention mechanism. Some studies alleviate the computational costs by reducing the number of tokens attended in attention calculation, but the complexity is still quadratic. Another promising way is to replace Softmax attention with linear attention, which has linear complexity but exhibits a clear performance drop. 
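The linear attention referred to here replaces the softmax with a kernel feature map so that the key-value summary can be computed once in O(l) time and memory; a standard sketch follows, where the elu+1 feature map is one common choice, assumed here:

```python
# Kernelized linear attention: attention(Q,K,V) ~ phi(Q)(phi(K)^T V), with a
# normalizer phi(Q)(phi(K)^T 1), avoiding the (L x L) attention matrix.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """q, k: (B, H, L, D), v: (B, H, L, Dv)"""
    q, k = F.elu(q) + 1, F.elu(k) + 1                # positive feature map
    kv = torch.einsum("bhld,bhlv->bhdv", k, v)       # key-value summary, O(L)
    z = 1.0 / (torch.einsum("bhld,bhd->bhl", q, k.sum(dim=2)) + eps)
    return torch.einsum("bhld,bhdv,bhl->bhlv", q, kv, z)
```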
We find that such a drop in linear attention results from the lack of attention concentration on critical features. Therefore, we propose a feature fixation module to reweight the feature importance of the query and key before computing linear attention. Specifically, we regard the query, key, and value as latent representations of the input token, and learn the feature fixation ratio by aggregating Query-Key-Value information. This is beneficial for measuring the feature importance comprehensively. Furthermore, we improve the feature fixation by neighborhood association, which leverages additional guidance from spatial and temporal neighboring tokens. Our proposed method significantly improves the linear attention baseline, and achieves state-of-the-art performance among linear video Transformers on three popular video classification benchmarks. Our performance is even comparable to some quadratic Transformers with fewer parameters and higher efficiency.", Neural Frailty Machine: Beyond proportional hazard assumption in neural survival regressions,https://openreview.net/forum?id=1mU6ADbjk-c,https://openreview.net/pdf?id=1mU6ADbjk-c,A flexible framework of neural survival regression with provable statistical guarantees,"We present neural frailty machine (NFM), a powerful and flexible neural modeling framework for survival regressions. The NFM framework utilizes the classical idea of multiplicative frailty in survival analysis to capture unobserved heterogeneity among individuals, at the same time being able to leverage the strong approximation power of neural architectures for handling nonlinear covariate dependence. Two concrete models are derived under the framework, extending neural proportional hazard models and nonparametric hazard regression models. Both models allow efficient training under the likelihood objective. Theoretically, for both proposed models, we establish statistical guarantees of neural function approximation with respect to nonparametric components by characterizing their rates of convergence. Empirically, we provide synthetic experiments that verify our theoretical statements. We also conduct experimental evaluations over 6 benchmark datasets of different scales, showing that the proposed NFM models outperform state-of-the-art survival models in terms of predictive performance. ","survival analysis, sieve method, theory" A Closer Look at Dual Batch Normalization and Two-domain Hypothesis In Adversarial Training With Hybrid Samples,https://openreview.net/forum?id=TMnxVoWdX_M,https://openreview.net/pdf?id=TMnxVoWdX_M,,"There is a growing concern about applying batch normalization (BN) in adversarial training (AT), especially when the model is trained on both \textit{adversarial} samples and \textit{clean} samples (termed Hybrid-AT). With the assumption that \textit{adversarial} and \textit{clean} samples are from two different domains, a common practice in prior works is to adopt dual BN, where BN$_{adv}$ and BN$_{clean}$ are used for adversarial and clean branches, respectively. A popular belief for motivating dual BN is that estimating normalization statistics of this mixture distribution is challenging and thus disentangling it for normalization achieves stronger robustness. In contrast to this belief, we reveal that what makes dual BN effective mainly lies in its two sets of affine parameters. 
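A minimal sketch of the dual BN construction under discussion follows; note that the two branches carry not only separate normalization statistics but also two separate sets of affine parameters (gamma, beta), which the analysis above identifies as the main source of the benefit:

```python
# Dual batch normalization for Hybrid-AT: one BN branch for clean samples,
# one for adversarial samples, selected by a flag at forward time.
import torch.nn as nn

class DualBN2d(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.bn_clean = nn.BatchNorm2d(channels)  # BN_clean: stats + affine params
        self.bn_adv = nn.BatchNorm2d(channels)    # BN_adv: stats + affine params

    def forward(self, x, adversarial: bool):
        return self.bn_adv(x) if adversarial else self.bn_clean(x)
```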
Moreover, we demonstrate that the domain gap between adversarial and clean samples is actually not very large, which is counter-intuitive considering the significant influence of adversarial perturbation on the model. Overall, our work sheds new light on understanding the mechanism of dual BN in Hybrid-AT as well as its underlying two-domain hypothesis. ","Adversarial training, batch normalization" Generative Recorrupted-to-Recorrupted: An Unsupervised Image Denoising Network for Arbitrary Noise Distribution,https://openreview.net/forum?id=vfa7--yvtYh,https://openreview.net/pdf?id=vfa7--yvtYh,,"With the great breakthrough of supervised learning in the field of denoising, more and more works focus on end-to-end learning to train denoisers. The premise for this approach to be effective is sufficient data support, but in practice it is particularly difficult to obtain labels for the training data. To this end, some unsupervised denoisers have emerged in recent years; however, these methods are effective only when the noise model is known in advance, which limits the practical use of unsupervised denoising. In addition, inaccurate noise priors from noise estimation algorithms cause low denoising accuracy. Therefore, we design a more practical denoiser that requires neither clean images as training labels nor noise model assumptions. Our method also relies on a noise model; the difference is that the model is generated from a residual image and a random mask during network training. The input and target of the network are then generated from a single noisy image and the noise model, while an unsupervised module and a pseudo-supervised module are trained simultaneously. Extensive experiments demonstrate the effectiveness of our framework, which even surpasses the accuracy of supervised denoising.","Image denoising, Unsupervised, Deep learning" Provably Learning Diverse Features in Multi-View Data with Midpoint Mixup,https://openreview.net/forum?id=YdFkY-QHkPl,https://openreview.net/pdf?id=YdFkY-QHkPl,"Mixup with appropriate hyperparameters can learn multiple predictive features for each class in a dataset, even when empirical risk minimization fails to do so.","Mixup is a data augmentation technique that relies on training using random convex combinations of data points and their labels. In recent years, Mixup has become a standard primitive used in the training of state-of-the-art image classification models due to its demonstrated benefits over empirical risk minimization with regard to generalization and robustness. In this work, we try to explain some of this success from a feature learning perspective. We focus our attention on classification problems in which each class may have multiple associated features (or views) that can be used to predict the class correctly. Our main theoretical results demonstrate that, for a non-trivial class of data distributions with two features per class, training a 2-layer convolutional network using empirical risk minimization can lead to learning only one feature for almost all classes while training with a specific instantiation of Mixup succeeds in learning both features for every class. 
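A sketch of the mixup primitive with the mixing coefficient pinned at the midpoint lambda = 1/2, which is our reading of the specific instantiation referred to in the title (standard mixup instead samples lambda from a Beta distribution):

```python
# Midpoint mixup: convex-combine each example with a random partner at
# lambda = 1/2, mixing the one-hot labels the same way.
import torch

def midpoint_mixup(x, y_onehot):
    """x: (B, ...) inputs, y_onehot: (B, C) one-hot labels."""
    perm = torch.randperm(x.size(0))
    x_mix = 0.5 * x + 0.5 * x[perm]               # midpoint of the two inputs
    y_mix = 0.5 * y_onehot + 0.5 * y_onehot[perm]
    return x_mix, y_mix
```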
We also show empirically that these theoretical insights extend to the practical settings of image benchmarks modified to have additional synthetic features.","mixup, data augmentation, theory, optimization, multi-view, feature learning, generalization, deep learning" Understanding Catastrophic Overfitting in Fast Adversarial Training From a Non-robust Feature Perspective,https://openreview.net/forum?id=UO8UP_xDMwD,https://openreview.net/pdf?id=UO8UP_xDMwD,We provide a new perspective on catastrophic overfitting in fast adversarial training,"To make adversarial training (AT) computationally efficient, FGSM AT has attracted significant attention. The fast speed, however, is achieved at the cost of catastrophic overfitting (CO), whose cause remains unclear. Prior works mainly study the phenomenon of a significant PGD accuracy (Acc) drop to understand CO while paying less attention to its FGSM Acc. We highlight an intriguing CO phenomenon that FGSM Acc is higher than accuracy on clean samples, and attempt to apply the non-robust feature (NRF) perspective to understand it. Our investigation of CO, extending the existing NRF notion into a fine-grained categorization, suggests that there exists a certain type of NRF whose usefulness increases after an FGSM attack, and that CO in FGSM AT can be seen as a dynamic process of learning such NRFs. Therefore, the key to preventing CO lies in reducing its usefulness under FGSM AT, which sheds new light on understanding the success of a SOTA technique for mitigating CO. ","Fast Adversarial Training, Catastrophic Overfitting, Non-robust Feature" AutoDisc: Automatic Distillation Schedule for Large Language Model Compression,https://openreview.net/forum?id=YEVJI3Uqkux,https://openreview.net/pdf?id=YEVJI3Uqkux,,"Driven by the teacher-student paradigm, knowledge distillation is one of the de facto ways for language model compression. Recent studies have uncovered that conventional distillation is less effective when facing a large capacity gap between the teacher and the student, and introduced teacher assistant-based distillation to bridge the gap. As a connection, the scale and the performance of the teacher assistant are crucial for transferring the knowledge from the teacher to the student. However, existing teacher assistant-based methods manually select the teacher assistant, requiring many trials before identifying the optimal teacher assistant. To this end, we propose an Automatic Distillation Schedule (AutoDisc) for large language model compression to discover the optimal teacher assistant in only one trial. In particular, motivated by the finding that the performance of the student is positively correlated to the scale-performance tradeoff of the teacher assistant, AutoDisc designs a $\lambda$-Tradeoff to measure the optimality of the teacher assistant. AutoDisc then yields the $\lambda$-Tradeoffs of all teacher assistant candidates in a once-for-all optimization with two approximations. The optimal teacher assistant can be automatically selected by uncovering the best $\lambda$-Tradeoff. AutoDisc is evaluated with an extensive set of experiments on the language understanding benchmark GLUE. Experimental results demonstrate the improved efficiency with similar or even better effectiveness of our AutoDisc compared to several state-of-the-art baselines. 
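One plausible reading of the tradeoff-based selection is sketched below, with an assumed linear score; the paper's exact $\lambda$-Tradeoff definition is not reproduced here:

```python
# Hypothetical teacher-assistant selection: score each candidate by its
# dev-set performance minus a lambda-weighted size penalty, keep the best.
def pick_teacher_assistant(candidates, lam=0.5):
    """candidates: list of (name, num_params, dev_score) tuples."""
    def tradeoff(c):
        _, num_params, dev_score = c
        return dev_score - lam * num_params / 1e9  # size penalty per billion params
    return max(candidates, key=tradeoff)

ta = pick_teacher_assistant([("6L", 66e6, 82.1), ("12L", 110e6, 83.4)])
```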
We further apply AutoDisc to a language model with over one billion parameters and show the scalability of AutoDisc.", Lifting the Curse of Capacity Gap in Distilling Large Language Models,https://openreview.net/forum?id=CMsuT6Cmfvs,https://openreview.net/pdf?id=CMsuT6Cmfvs,,"Large language models (LLMs) have shown compelling performance on various downstream tasks, but unfortunately require a tremendous amount of inference compute. Knowledge distillation finds a path to compress LLMs to small ones with a teacher-student paradigm. However, when the capacity gap between the teacher and the student is large, a curse of capacity gap appears, invoking a deficiency in distilling LLMs. While a few studies have investigated ways to fill the gap, the curse is not yet well tackled. To this end, we aim to lift the curse of capacity gap by enlarging the capacity of the student without notably increasing the inference compute. Largely motivated by the sparse activation regime of mixture of experts (MoE), we propose a mixture of minimal experts (MiniMoE), which adds extra parameters to the student but introduces almost no additional inference compute. Experimental results on GLUE and CoNLL demonstrate that the curse of capacity gap is lifted by the magic of MiniMoE to a large extent. MiniMoE also achieves state-of-the-art performance at small FLOPs compared with a range of competitive baselines. With compression of as much as ~50x, MiniMoE preserves 95% of the teacher's GLUE score.", BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers,https://openreview.net/forum?id=VB75Pi89p7,https://openreview.net/pdf?id=VB75Pi89p7,,"Masked image modeling (MIM) has demonstrated impressive results in self-supervised representation learning by recovering corrupted image patches. However, most existing studies operate on low-level image pixels, which hinders the exploitation of high-level semantics for representation models. In this work, we propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction, providing a systematic way to promote MIM from pixel-level to semantic-level. Specifically, we propose vector-quantized knowledge distillation to train the tokenizer, which discretizes a continuous semantic space to compact codes. We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches. Furthermore, we introduce a patch aggregation strategy which associates discrete image patches to enhance global semantic representation. Experiments on image classification and semantic segmentation show that BEiT v2 outperforms all compared MIM methods. On ImageNet-1K (224 size), the base-size BEiT v2 achieves $85.5\%$ top-1 accuracy for fine-tuning and $80.1\%$ top-1 accuracy for linear probing. The large-size BEiT v2 obtains $87.3\%$ top-1 accuracy for ImageNet-1K (224 size) fine-tuning, and $56.7\%$ mIoU on ADE20K for semantic segmentation. The code can be found in the supplementary materials.", Geo-NN: An End-to-End Framework for Geodesic Mean Estimation on the Manifold of Symmetric Positive Definite Matrices,https://openreview.net/forum?id=h-UkhDzFFj,https://openreview.net/pdf?id=h-UkhDzFFj,"We propose an end-to-end deep learning framework, the Geo-NN, to efficiently compute the geodesic mean of a collection of matrices lying on the SPD manifold","The manifold of symmetric positive definite (SPD) matrices plays a key role in many domains, from network science to differential geometry to signal and image processing.
However, leveraging the SPD manifold geometry during inference is challenging, as simple operations, such as mean estimation, do not have a closed-form or easily computable solution. In this paper, we propose an end-to-end deep learning framework, which we call a Geometric Neural Network (Geo-NN), to efficiently compute the geodesic mean of a collection of matrices lying on the SPD manifold. Geo-NN utilizes a Matrix-Autoencoder (MAE) architecture with intersecting fully connected layers as its backbone. We illustrate that the matrix-normal equation arising from Fr\'echet mean estimation can be converted into a loss function for optimizing the Geo-NN, which in turn approximates the geodesic mean of a collection of SPD matrices. We demonstrate the efficacy of our framework in both synthetic and real-world scenarios, as compared to commonly used alternative methods. Our simulated experiments demonstrate that Geo-NN is robust to various noise conditions and is scalable to increasing dataset size and dimensionality. Our real-world application of Geo-NN to functional connectomics data allows us to extract network patterns associated with patient/control differences.","Symmetric Positive Definite Manifolds, Geodesic Mean, Matrix Autoencoder" HIVE: HIerarchical Volume Encoding for Neural Implicit Surface Reconstruction,https://openreview.net/forum?id=LnQn5-rN-LR,https://openreview.net/pdf?id=LnQn5-rN-LR,,"Neural implicit surface reconstruction has become a new trend in reconstructing a detailed 3D shape from images. In previous methods, however, the 3D scene is only encoded by the MLPs which do not have an explicit 3D structure. To better represent 3D shapes, we introduce a volume encoding to explicitly encode the spatial information. We further design hierarchical volumes to encode the scene structures in multiple scales. The high-resolution volumes capture the high-frequency geometry details since spatially varying features could be learned from different 3D points, while the low-resolution volumes enforce the spatial consistency to keep the shape smooth since adjacent locations possess the same low-resolution feature. In addition, we adopt a sparse structure to reduce the memory consumption at high-resolution volumes, and two regularization terms to enhance the smoothness of the results. This hierarchical volume encoding could be appended to any implicit surface reconstruction method as a plug-and-play module, and can generate a smooth and clean reconstruction with more details. Superior performance is demonstrated in DTU, EPFL, and BlendedMVS datasets with significant improvement on the standard metrics. The code of our method will be made public.","neural implicit surface reconstruction, multi-view surface reconstruction, hierarchical volume encoding" Progressive Image Synthesis from Semantics to Details with Denoising Diffusion GAN,https://openreview.net/forum?id=5RxmkAFVs_V,https://openreview.net/pdf?id=5RxmkAFVs_V,We propose a novel progressive method for image synthesis from semantics to details with diffusion denoising GAN.,"Image generation has been dominated by generative adversarial networks (GANs) due to their superior ability to generate realistic images. Recently, by decomposing the image generation process into a sequence of denoising steps, denoising diffusion probabilistic models (DDPMs) have shown remarkable sample quality and diversity in image generation.
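As a point of reference for the Geo-NN entry above, the classical log-Euclidean mean gives a cheap closed-form surrogate for the Fréchet (geodesic) mean on the SPD manifold; the sketch below is that baseline, not the paper's learned network.

```python
import numpy as np
from scipy.linalg import expm, logm

def log_euclidean_mean(mats):
    """Log-Euclidean mean of SPD matrices: expm of the averaged matrix
    logarithms. logm of an SPD matrix is real; we take the real part
    defensively against numerical round-off."""
    logs = [np.real(logm(A)) for A in mats]
    return expm(np.mean(logs, axis=0))

# Toy usage: random SPD matrices of the form B B^T + eps * I.
rng = np.random.default_rng(0)
mats = [(lambda B: B @ B.T + 1e-2 * np.eye(4))(rng.standard_normal((4, 4)))
        for _ in range(5)]
print(log_euclidean_mean(mats))
```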
However, DDPMs typically face two main challenges that GANs do not: a time-expensive sampling process and a semantically meaningless latent space. Although these two challenges have started to draw attention in recent works on DDPMs, they are often addressed separately. In this paper, by interpreting the sampling process of DDPMs in a new way with a special noise scheduler, we propose a novel progressive training pipeline to address these two challenges simultaneously. Concretely, we choose to decompose the sampling process into two stages: generating semantics first and then refining details progressively. As a result, we are able to interpret the sampling process of DDPMs as a refinement process instead of a denoising process, when the DDPMs try to predict the real images at each time step. Motivated by such new interpretation, we present a novel training pipeline that progressively transforms the attention from semantics to sample quality during training. Extensive results on two benchmarks show that our proposed diffusion model achieves competitive results with as few as two sampling steps on unconditional image generation. Importantly, the latent space of our diffusion model is shown to be semantically meaningful, which can be exploited on various downstream tasks (e.g., attribute manipulation).","Image generation, GANs, diffusion model, progressive generation" Communication-Efficient Federated Learning with Accelerated Client Gradient,https://openreview.net/forum?id=de-_FHXQ4--,https://openreview.net/pdf?id=de-_FHXQ4--,"We present, FedACG, a novel federated learning framework which reduce the gap between global and local losses by incorporating the global momentum to guide client updates.","Federated learning often suffers from slow and unstable convergence due to heterogeneous characteristics of participating client datasets. Such a tendency is aggravated when the client participation ratio is low since the information collected from the clients is prone to have large variations. To tackle this challenge, we propose a novel federated learning framework, which improves the consistency across clients and facilitates the convergence of the server model. This is achieved by making the server broadcast a global model with a gradient acceleration. By adopting this strategy, the proposed algorithm conveys the projective global update information to participants effectively with no extra communication cost and relieves the clients from storing the previous models. We also regularize local updates by aligning each of the clients with the overshot global model to reduce bias and improve the stability of our algorithm. We perform comprehensive empirical studies on real data under various settings and demonstrate remarkable performance gains of the proposed method in terms of accuracy and communication efficiency compared to the state-of-the-art methods, especially with low client participation rates. We will release our code to facilitate and disseminate our work.","Federated learning, Data heterogeneity, Deep Neural Networks, Distributed Optimization" DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection,https://openreview.net/forum?id=3mRwyG5one,https://openreview.net/pdf?id=3mRwyG5one,"We present a state-of-the-art end-to-end object detector, the first DETR-like model on top of the COCO detection leader board.","We present DINO (DETR with Improved deNoising anchOr boxes), a strong end-to-end object detector.
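A rough sketch of the accelerated-broadcast idea in the FedACG entry above: the server sends a momentum-overshot model so clients start near where the global model is heading, at no extra communication cost. The update rule, the lam value, and the fake clients below are illustrative guesses, not the paper's exact algorithm.

```python
import numpy as np

def server_round(w_global, momentum, client_update_fn, lam=0.85):
    """One communication round with a lookahead (overshot) broadcast."""
    w_send = w_global + lam * momentum           # broadcast overshot model
    deltas = client_update_fn(w_send)            # one local update per client
    momentum = lam * momentum + np.mean(deltas, axis=0)
    return w_global + momentum, momentum

# Toy usage with fake clients that return small random updates.
rng = np.random.default_rng(1)
fake_clients = lambda w: [0.01 * rng.standard_normal(w.shape) for _ in range(4)]
w, m = np.zeros(10), np.zeros(10)
for _ in range(3):
    w, m = server_round(w, m, fake_clients)
print(np.linalg.norm(w))
```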
DINO improves over previous DETR-like models in performance and efficiency by using a contrastive approach to denoising training, a look-forward-twice scheme for box prediction, and a mixed query selection method for anchor initialization. DINO achieves 49.4AP in 12 epochs and 51.3AP in 24 epochs on COCO with a ResNet-50 backbone and multi-scale features, yielding a significant improvement of +6.0AP and +2.7AP, respectively, compared to DN-DETR, the previous best DETR-like model. DINO scales well in both model size and data size. Without bells and whistles, after pre-training on the Objects365 dataset with a SwinL backbone, DINO obtains the best results on both COCO val2017 (63.2AP) and test-dev (63.3AP) with model size under 1 billion parameters. Compared to other models on the leaderboard, DINO significantly reduces its model size and pre-training data size while achieving better results. The code will be available.","Object Detection, Detection Transformer, End-to-End Detector" Ranking-Enhanced Unsupervised Sentence Representation Learning,https://openreview.net/forum?id=g77JafrHWyy,https://openreview.net/pdf?id=g77JafrHWyy,,"Previous unsupervised sentence embedding studies have focused on data augmentation methods such as dropout masking and rule-based sentence transformation methods. However, these approaches are limited in controlling the fine-grained semantics of augmented views of a sentence. This results in inadequate supervision signals for capturing the semantic similarity of similar sentences. In this work, we find that using neighbor sentences enables capturing a more accurate semantic similarity between similar sentences. Based on this finding, we propose RankEncoder, which uses relations between an input sentence and sentences in a corpus for training unsupervised sentence encoders. We evaluate RankEncoder from three perspectives: 1) the semantic textual similarity performance, 2) the efficacy on similar sentence pairs, and 3) the universality of RankEncoder. Experimental results show that RankEncoder achieves 80.07% Spearman's correlation, a 1.1% absolute improvement compared to the previous state-of-the-art performance. The improvement is even more significant, a 1.73% improvement, on similar sentence pairs. Also, we demonstrate that RankEncoder is universally applicable to existing unsupervised sentence encoders.","Unsupervised Sentence Embedding, Sentence Embedding, Semantic Textual Similarity, Natural Language Processing" Simultaneously Learning Stochastic and Adversarial Markov Decision Process with Linear Function Approximation,https://openreview.net/forum?id=Oys81jfesjQ,https://openreview.net/pdf?id=Oys81jfesjQ,,"Reinforcement learning (RL) has been commonly used in practice. To deal with the numerous states and actions in real applications, the function approximation method has been widely employed to improve the learning efficiency, among which the linear function approximation has attracted great interest both theoretically and empirically. Previous works on the linear Markov Decision Process (MDP) mainly study two settings: the stochastic setting, where the reward is generated in a stochastic way, and the adversarial setting, where the reward can be chosen arbitrarily by an adversary. All these works treat these two environments separately. However, the learning agents often have no idea of how rewards are generated and a wrong reward type can severely disrupt the performance of those specially designed algorithms.
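One way to read the RankEncoder idea above is that two sentences are similar when they rank a corpus of neighbor sentences similarly. The sketch below compares corpus-similarity rank vectors with Spearman correlation; it is only an illustration of that reading, with random embeddings standing in for a real encoder.

```python
import numpy as np
from scipy.stats import spearmanr

def rank_similarity(u, v, corpus):
    """Compare two sentence embeddings u, v by how similarly they rank
    the sentences in a (n, d) corpus embedding matrix."""
    def sims(x):
        x = x / np.linalg.norm(x)
        C = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
        return C @ x                      # cosine similarity to each neighbor
    return spearmanr(sims(u), sims(v)).correlation

rng = np.random.default_rng(0)
corpus = rng.standard_normal((100, 16))   # stand-in corpus embeddings
u, v = rng.standard_normal(16), rng.standard_normal(16)
print(rank_similarity(u, v, corpus))
```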
A natural question, then, is whether an algorithm can be derived that efficiently learns in both environments without knowing the reward type. In this paper, we first consider this best-of-both-worlds problem for linear MDPs with known transitions. We propose an algorithm and prove it can simultaneously achieve $O(\text{poly} \log K)$ regret in the stochastic setting and $O(\sqrt{K})$ regret in the adversarial setting, where $K$ is the horizon. To the best of our knowledge, it is the first such result for linear MDPs. ", Statistical Efficiency of Score Matching: The View from Isoperimetry,https://openreview.net/forum?id=TD7AnQjNzR6,https://openreview.net/pdf?id=TD7AnQjNzR6,We show a tight connection between the statistical efficiency of score matching and the isoperimetric properties (e.g. log-Sobolev constant) of the distribution being estimated," Deep generative models parametrized up to a normalizing constant (e.g. energy-based models) are difficult to train by maximizing the likelihood of the data because the likelihood and/or gradients thereof cannot be explicitly or efficiently written down. Score matching is a training method, whereby instead of fitting the likelihood $\log p(x)$ for the training data, we instead fit the score function $\nabla_x \log p(x)$ --- obviating the need to evaluate the partition function. Though this estimator is known to be consistent, it is unclear whether (and when) its statistical efficiency is comparable to that of maximum likelihood --- which is known to be (asymptotically) optimal. We initiate this line of inquiry in this paper, and show a tight connection between statistical efficiency of score matching and the isoperimetric properties of the distribution being estimated --- i.e. the Poincar\'e, log-Sobolev and isoperimetric constant --- quantities which govern the mixing time of Markov processes like Langevin dynamics. Roughly, we show that the score matching estimator is statistically comparable to the maximum likelihood when the distribution has a small isoperimetric constant. Conversely, if the distribution has a large isoperimetric constant --- even for simple families of distributions like exponential families with rich enough sufficient statistics --- score matching will be substantially less efficient than maximum likelihood. We suitably formalize these results both in the finite sample regime, and in the asymptotic regime. Finally, we identify a direct parallel in the discrete setting, where we connect the statistical properties of pseudolikelihood estimation with approximate tensorization of entropy and the Glauber dynamics. ","score matching, log-Sobolev inequality, isoperimetry, relative efficiency, sample complexity" Quadratic models for understanding neural network dynamics,https://openreview.net/forum?id=GNFimGDfEiV,https://openreview.net/pdf?id=GNFimGDfEiV,Quadratic models capture properties of wide neural networks in both optimization and generalization. ,"In this work, we show that recently proposed quadratic models capture optimization and generalization properties of wide neural networks that cannot be captured by linear models. In particular, we prove that quadratic models for shallow ReLU networks exhibit the ""catapult phase"" from Lewkowycz et al. (2020) that arises when training such models with large learning rates. We then empirically show that the behaviour of quadratic models parallels that of neural networks in generalization, especially in the catapult phase regime.
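A worked toy example of score matching as described in the entry above: for a 1-D Gaussian, the implicit score matching objective $J(\theta) = \mathbb{E}[\partial_x s_\theta(x) + \tfrac{1}{2} s_\theta(x)^2]$ never touches the partition function, and minimizing it recovers the same parameters as maximum likelihood in this well-conditioned case.

```python
import numpy as np
from scipy.optimize import minimize

# Gaussian score: s(x) = -(x - mu) / sigma^2, so s'(x) = -1 / sigma^2.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=5000)

def sm_objective(params):
    """Implicit score matching loss E[s'(x) + 0.5 * s(x)^2] on the sample."""
    mu, log_sigma = params
    sigma2 = np.exp(2 * log_sigma)
    score = -(data - mu) / sigma2
    return np.mean(-1.0 / sigma2 + 0.5 * score**2)

res = minimize(sm_objective, x0=[0.0, 0.0])
print(res.x[0], np.exp(res.x[1]))  # close to (2.0, 1.5), matching MLE here
```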
Our analysis further demonstrates that quadratic models are an effective tool for the analysis of neural networks. ","quadratic models, wide neural networks, catapult phase, optimization dynamics" Improving Adversarial Transferability with Worst-case Aware Attacks,https://openreview.net/forum?id=jh9nlYWJ5kk,https://openreview.net/pdf?id=jh9nlYWJ5kk,,"Generating adversarial examples with high transferability is key to practical black-box attack scenarios, where the attacker has limited or no information about target models. While previous works mainly deal with input transformations or the optimization process to reduce overfitting on a surrogate model and enhance transferability, we find that well-designed model manipulation can provide complementary gain to existing methods. We propose Worst-case Aware Attack (WAA), a simple yet effective method that provides access to a virtual ensemble of models to mitigate overfitting on a specific model during the adversarial example generation process. Specifically, WAA formulates max-min optimization to seek adversarial examples that are robust against the worst-case models, which are created by adding per-example weight perturbations to the source model towards the direction of weakening the adversarial sample in question. Unlike other model manipulation methods, WAA does not require multiple surrogate models or architecture-specific knowledge. Experimental results on ImageNet demonstrate that WAA can be incorporated with a variety of existing methods to consistently improve transferability over different settings, including naturally trained models, adversarially trained models, and adversarial defenses.", TCFimt: Temporal Counterfactual Forecasting from Individual Multiple Treatment Perspective,https://openreview.net/forum?id=zgYefCxXmQe,https://openreview.net/pdf?id=zgYefCxXmQe,We propose a new method to forecast future outcomes given multiple treatments in a counterfactual fashion.,"Determining the causal effects of temporal multi-interventions assists decision making. Restricted by time-varying bias, selection bias, and interactions of multiple interventions, the disentanglement and estimation of multiple treatment effects from individual temporal data is still rare. To tackle these challenges, we propose a comprehensive framework of temporal counterfactual forecasting based on balanced representation from an individual multiple treatment perspective (TCFimt). TCFimt constructs adversarial tasks in a seq2seq framework to alleviate selection and time-varying bias and designs a contrastive learning-based block to decouple a mixed treatment effect into separate main treatment effects and causal interactions, which further improves estimation accuracy. In experiments on two real-world datasets from distinct fields, the proposed method outperforms state-of-the-art methods in predicting future outcomes under specific treatments and in choosing the optimal treatment type and timing.","causal inference, balanced representation, temporal prediction" Curved Representation Space of Vision Transformers,https://openreview.net/forum?id=DH4v0nW7yJ,https://openreview.net/pdf?id=DH4v0nW7yJ,"We analyze how the representation space of Transformers is shaped, based on which their characteristics in terms of adversarial robustness, model calibration, and difficulty of training are explained.","Neural networks with self-attention (a.k.a.
Transformers) like ViT and Swin have emerged as a better alternative to traditional convolutional neural networks (CNNs) for computer vision tasks. However, our understanding of how the new architecture works is still limited. In this paper, we focus on the phenomenon that Transformers show higher robustness against corruptions than CNNs, while not being overconfident (in fact, we find Transformers are actually underconfident). This is contrary to the intuition that robustness increases with confidence. We resolve this contradiction by investigating how the output of the penultimate layer moves in the representation space as the input data moves within a small area. In particular, we show the following. (1) While CNNs exhibit fairly linear relationship between the input and output movements, Transformers show nonlinear relationship for some data. For those data, the output of Transformers moves in a curved trajectory as the input moves linearly. (2) When a data point is located in a curved region, it is hard to move it out of its decision region since the output moves along a curved trajectory instead of a straight line to the decision boundary, resulting in high robustness of Transformers. (3) If a data point is slightly modified to jump out of the curved region, the movements afterwards become linear and the output goes to the decision boundary directly. Thus, Transformers can be attacked easily after a small random jump and the perturbation in the final attacked data remains imperceptible. In other words, there does exist a decision boundary near the data, which is hard to find only because of the curved representation space. This also explains the underconfident prediction of Transformers. (4) The curved regions in the representation space start to form at an early training stage and grow throughout the training course. Some data are trapped in the regions, obstructing Transformers from reducing the training loss.","Vision transformers, representation space, robustness, calibration, decision boundary" Revisiting Graph Adversarial Attack and Defense From a Data Distribution Perspective,https://openreview.net/forum?id=dSYoPjM5J_W,https://openreview.net/pdf?id=dSYoPjM5J_W,We revisit graph adversarial attack and defense from a data distribution perspective.," Recent studies have shown that structural perturbations are significantly effective in degrading the accuracy of Graph Neural Networks (GNNs) in the semi-supervised node classification (SSNC) task. However, why the gradient-based methods are so destructive is rarely explored. In this work, we discover an interesting phenomenon: the adversarial edges are not uniformly distributed on the graph. Nearly all perturbations are generated around the training nodes in the poisoning attack setting. Building on this phenomenon, we provide an explanation for the effectiveness of the gradient-based attack method from a data distribution perspective and revisit both poisoning attack and evasion attack in SSNC. From this new perspective, we empirically and theoretically discuss some other attack tendencies. Based on the analysis, we provide nine practical tips on both attack and defense and meanwhile leverage them to improve existing attack and defense methods. Moreover, we design a fast attack method and a self-training defense method, which outperform the state-of-the-art methods and can effectively scale to large graphs like ogbn-arxiv.
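The curved-trajectory observation in the Curved Representation Space entry above can be probed with a simple measurement: move the input linearly and record how far the penultimate features deviate from the straight chord between the endpoint features. The sketch below applies this to a stand-in random MLP rather than an actual ViT or CNN.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((32, 8)), rng.standard_normal((16, 32))

def features(x):
    """Penultimate features of a tiny random tanh MLP (toy stand-in)."""
    return np.tanh(W2 @ np.tanh(W1 @ x))

def trajectory_curvature(x, direction, steps=11, eps=0.5):
    """Max deviation of the feature trajectory from the straight chord as
    the input moves linearly from x to x + eps * direction."""
    ts = np.linspace(0.0, 1.0, steps)
    feats = np.stack([features(x + t * eps * direction) for t in ts])
    chord = feats[0] + ts[:, None] * (feats[-1] - feats[0])  # linear reference
    return np.max(np.linalg.norm(feats - chord, axis=1))

x = rng.standard_normal(8)
d = rng.standard_normal(8); d /= np.linalg.norm(d)
print(trajectory_curvature(x, d))  # 0 would mean a perfectly linear response
```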
We conduct extensive experiments on four benchmark datasets to verify our claims.","Graph Adversarial Attack, Robustness, Data Distribution" Gated Domain Units for Multi-source Domain Generalization,https://openreview.net/forum?id=hPRxEcEZJyp,https://openreview.net/pdf?id=hPRxEcEZJyp,,"Distribution shift (DS) is a common problem that deteriorates the performance of learning machines. To tackle this problem, we postulate that real-world distributions are composed of elementary distributions that remain invariant across different environments. We call this an invariant elementary distribution (I.E.D.) assumption. The I.E.D. assumption implies an invariant structure in the solution space that enables knowledge transfer to unseen domains. To exploit this property in domain generalization (DG), we developed a modular neural network layer that consists of Gated Domain Units (GDUs). Each GDU learns an embedding of an individual elementary distribution that allows us to encode the domain similarities during the training. During inference, the GDUs compute similarities between an observation and each of the corresponding elementary distributions, which are then used to form a weighted ensemble of learning machines. Because our layer is trained with backpropagation, it can naturally be integrated into existing deep learning frameworks. Our evaluation on image, text, graph, and time-series data shows a significant improvement in the performance on out-of-training target domains without domain information and any access to data from the target domains. This finding supports the practicality of the I.E.D. assumption and demonstrates that our GDUs can learn to represent these elementary distributions.","Robust machine learning, domain generalization, out-of-distribution generalization, kernel theory, distribution shift, deep learning" CooPredict : Cooperative Differential Games For Time Series Prediction,https://openreview.net/forum?id=5uH745DalVx,https://openreview.net/pdf?id=5uH745DalVx,We propose a novel framework on time series prediction as an application of cooperative differential games. ,"Modeling time series dynamics with neural differential equations has become a major line of research that opened new ways to handle various real-world scenarios (e.g., missing observations, irregular times). Despite the progress, most existing methods still face challenges in providing an explainable rationale on temporal association, which tells how past observations affect future states. To tackle this challenge, we introduce novel multi-agent based neural stochastic differential equations and analyze the time series prediction through the lens of cooperative differential game. Our framework provides an explainable method that can reveal the underlying temporal relevance of the data and fully utilizes this information to systemically solve the prediction problem. We develop the gradient descent based deep neural fictitious play to approximate the Nash equilibrium and theoretical results assure the convergence. Through experiments on various datasets, we demonstrate the superiority of our framework over all the benchmarks in modeling time series prediction by capitalizing on the underlying temporal dynamics without any inductive bias.
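A minimal sketch of the GDU-style inference described in the entry above: score an observation against learned embeddings of the elementary distributions, then form a similarity-weighted ensemble of per-domain learners. The dot-product similarity here is an assumption; the paper develops kernel-based similarities.

```python
import numpy as np

def gdu_ensemble(x, domain_keys, learners, temp=1.0):
    """Similarity-weighted ensemble over per-domain learners.

    domain_keys: (k, d) embeddings of elementary distributions;
    learners: list of k callables mapping x -> scalar prediction."""
    sims = domain_keys @ x / temp
    w = np.exp(sims - sims.max()); w /= w.sum()   # softmax weights
    preds = np.stack([f(x) for f in learners])
    return w @ preds

rng = np.random.default_rng(0)
keys = rng.standard_normal((3, 5))
# Each toy learner is a fixed random linear predictor.
learners = [lambda x, w=rng.standard_normal(5): w @ x for _ in range(3)]
print(gdu_ensemble(rng.standard_normal(5), keys, learners))
```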
An ablation study shows that the neural agents of the proposed framework learn intrinsic temporal relevance to make accurate time series predictions.","time series forecasting, time series prediction, neural stochastic differential equations, cooperative differential game" Learning large-scale Kernel Networks,https://openreview.net/forum?id=kBR2bM0tmwe,https://openreview.net/pdf?id=kBR2bM0tmwe,A new scalable algorithm for training kernel networks (generalization of RBF networks)," This paper concerns large-scale training of *Kernel Networks*, a generalization of kernel machines that allows the model to have arbitrary centers. We propose a scalable training algorithm -- EigenPro 3.0 -- based on alternating projections with preconditioned SGD for the alternating steps. In contrast to classical kernel machines, but similar to neural networks, our algorithm enables decoupling the learned model from the training set. This empowers kernel models to take advantage of modern methodologies in deep learning, such as data augmentation. We demonstrate the promise of EigenPro 3.0 on several experiments over large datasets. We also show data augmentation can improve performance of kernel models.","Kernel machines, RBF networks, large-scale datasets, data augmentation" Self-Architectural Knowledge Distillation for Spiking Neural Networks,https://openreview.net/forum?id=QwFw-CcUb10,https://openreview.net/pdf?id=QwFw-CcUb10,"We propose a Self-Architectural Knowledge Distillation framework (SAKD), which matches the knowledge (i.e., the features and logits) of ANNs to that of SNNs with the same architecture. ","Brain-inspired spiking neural networks (SNNs) have drawn wide attention recently since they are biologically plausible and neuromorphic hardware-friendly. To obtain low-latency (i.e., a small number of timesteps) SNNs, the surrogate gradients (SG) method has been widely applied. However, SNNs trained by the SG method still have a huge performance gap from artificial neural networks (ANNs). In this paper, we find that the knowledge distillation paradigm can effectively alleviate the performance gap by transferring the knowledge from ANNs (teacher) to SNNs (student), but it remains a problem to find the architecture of teacher-student pairs. We introduce neural architecture search (NAS) and find that the performance is insensitive to the architectures of SNNs. Hence, we choose the same architecture for ANN-teacher and SNN-student since it is easy to implement and the student can initialize its weights from the teacher. We thus propose a Self-Architectural Knowledge Distillation framework (SAKD), which matches the knowledge (i.e., the features and logits) of ANNs to that of SNNs with the same architecture. Although adopting a teacher model in training, SNNs trained via our SAKD still keep ultra-low latency (T=4) compared with other methods and achieve state-of-the-art performance on a variety of datasets (e.g., CIFAR-10, CIFAR-100, ImageNet, and DVS-CIFAR10), and we demonstrate that this simple training strategy can provide a new training paradigm of SNNs. ","Spiking Neural Networks, Knowledge Distillation, Neural Architecture Search" Provable Sim-to-real Transfer in Continuous Domain with Partial Observations,https://openreview.net/forum?id=S31oTB72m0G,https://openreview.net/pdf?id=S31oTB72m0G,," Sim-to-real transfer, which trains RL agents in the simulated environments and then deploys them in the real world, has been widely used to overcome the limitations of gathering samples in the real world.
Despite the empirical success of sim-to-real transfer, its theoretical foundation is much less understood. In this paper, we study sim-to-real transfer in continuous domains with partial observations, where the simulated environments and real-world environments are modeled by linear quadratic Gaussian (LQG) systems. We show that a popular robust adversarial training algorithm is capable of learning a policy from the simulated environment that is competitive to the optimal policy in the real-world environment. To achieve our results, we design a new algorithm for infinite-horizon average-cost LQGs and establish a regret bound that depends on the intrinsic complexity of the model class. Our algorithm crucially relies on a novel history clipping scheme, which might be of independent interest.","sim-to-real, RL theory, partial observations" Local Coefficient Optimization in Federated Learning,https://openreview.net/forum?id=xQXHkKI_7TG,https://openreview.net/pdf?id=xQXHkKI_7TG,,"Federated learning emerges as a promising approach to build a large-scale cooperative learning system among multiple clients without sharing their raw data. However, given a specific global objective, finding the optimal sampling weights for each client remains largely unexplored. This is particularly challenging when clients' data distributions are non-i.i.d. and clients partially participate. In this paper, we model the above task as a bi-level optimization problem which takes the correlations among different clients into account. We present a double-loop primal-dual-based algorithm to solve the bi-level optimization problem. We further provide rigorous convergence analysis for our algorithm under mild assumptions. Finally, we perform extensive empirical studies under both toy examples and learning models from real datasets to verify the effectiveness of the proposed method. ","Federated Learning, Bilevel optimization" Outcome-directed Reinforcement Learning by Uncertainty \& Temporal Distance-Aware Curriculum Goal Generation,https://openreview.net/forum?id=v69itrHLEu,https://openreview.net/pdf?id=v69itrHLEu,,"Current reinforcement learning (RL) often suffers when solving a challenging exploration problem where the desired outcomes or high rewards are rarely observed. Even though curriculum RL, a framework that solves complex tasks by proposing a sequence of surrogate tasks, shows reasonable results, most of the previous works still have difficulty in proposing a curriculum due to the absence of a mechanism for obtaining calibrated guidance toward the desired outcome states without any prior domain knowledge. To alleviate this, we propose an uncertainty \& temporal distance-aware curriculum goal generation method for outcome-directed RL via solving a bipartite matching problem. It not only provides precisely calibrated guidance of the curriculum to the desired outcome states but also brings much better sample efficiency and geometry-agnostic curriculum goal proposal capability compared to previous curriculum RL methods.
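A sketch of the bipartite-matching view of curriculum goal proposal from the entry above, assuming Euclidean distance as a stand-in for the learned temporal distance and a simple additive uncertainty bonus; both are illustrative choices, not the paper's exact formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def propose_curriculum(candidates, targets, uncertainty, alpha=1.0):
    """Match candidate goals to desired outcome states by a cost mixing a
    distance proxy with an exploration (uncertainty) bonus.

    candidates: (n, d) achieved states; targets: (m, d) outcome states;
    uncertainty: (n,) per-candidate epistemic bonus."""
    dist = np.linalg.norm(candidates[:, None] - targets[None], axis=-1)
    cost = dist - alpha * uncertainty[:, None]
    rows, cols = linear_sum_assignment(cost.T)   # one candidate per target
    return candidates[cols]

rng = np.random.default_rng(0)
cands, targs = rng.standard_normal((20, 2)), rng.standard_normal((5, 2))
goals = propose_curriculum(cands, targs, rng.random(20))
print(goals.shape)  # (5, 2): one curriculum goal per desired outcome
```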
We demonstrate that our algorithm significantly outperforms these prior methods in a variety of challenging navigation tasks and robotic manipulation tasks, both quantitatively and qualitatively.","Curriculum Learning, Outcome-directed RL, Goal-conditioned RL" E$^2$: Entropy Discrimination and Energy Optimization for Source-free Universal Domain Adaptation,https://openreview.net/forum?id=FMEXgK9-I8,https://openreview.net/pdf?id=FMEXgK9-I8,This paper presents a novel source-free universal domain adaptation method by combining two innovative components of confidence-guided Entropy discrimination and likelihood-induced Energy optimization.,"Universal domain adaptation (UniDA) aims to tackle the knowledge transfer problem in the presence of both distribution and category shifts. Most existing UniDA methods are developed based on the accessibility assumption of source-domain data during target model adaptation, which may result in privacy policy violation and source-data transfer inefficiency. To address this issue, we propose a novel source-free UniDA method by confidence-guided entropy discrimination and likelihood-induced energy optimization. The entropy-based separation criterion to determine known- and unknown-class target data may be too conservative for known-class prediction. Thus, we derive the confidence-guided entropy by scaling the normalized prediction score with the known-class confidence, such that many more known-class samples are correctly predicted. Without source-domain data for distribution alignment, we constrain the target-domain marginal distribution by maximizing the known-class likelihood and minimizing the unknown-class one. Since the marginal distribution is difficult to estimate but can be written as a function of free energy, the likelihood-induced loss is changed to an equivalent form based on energy optimization. Theoretically, the proposed method amounts to decreasing and increasing internal energy of known and unknown classes in physics, respectively. Extensive experiments on four publicly available datasets demonstrate the superiority of our method for source-free UniDA.","Domain Adaptation, Confidence-guided Entropy, Energy-based Model" Protective Label Enhancement for Label Privacy,https://openreview.net/forum?id=svP7EgyDcx,https://openreview.net/pdf?id=svP7EgyDcx,,"Over the past decade, much sensitive data has been gathered from individual devices for commercial value without effective safeguards, which can lead to serious privacy leakage. Here we consider label differential privacy (label DP), where only the labels are sensitive. The private labels generated by previous methods do not account for the label confidence corresponding to the features, and multiple sampling could be employed to identify the true labels. In this paper, a novel approach called Protective Label Enhancement (PLE) is proposed to mask the true label in the label distribution while ensuring that the protective label distribution remains useful for training an effective predictive model on the server. Specifically, when we generate the label distribution, the true label is mixed with several randomly chosen labels, and it is penalized whenever it sits at the top of the label distribution. Meanwhile, if the true label almost vanishes, it is compensated to maintain statistical effectiveness.
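An illustrative generator for a PLE-style protective label distribution as described above: the true label is mixed with random labels, penalized when it sits on top, and compensated when it almost vanishes. All constants and the exact mixing scheme here are assumptions; the paper's construction differs in detail.

```python
import numpy as np

def protective_distribution(true_label, n_classes, n_mix=3, top_penalty=0.5,
                            floor=0.05, rng=None):
    """Mask a true label inside a label distribution (PLE-flavored sketch)."""
    rng = rng or np.random.default_rng()
    dist = np.zeros(n_classes)
    dist[true_label] = rng.random()
    others = rng.choice(np.delete(np.arange(n_classes), true_label),
                        size=n_mix, replace=False)
    dist[others] = rng.random(n_mix)
    if dist.argmax() == true_label:        # punish a too-revealing top-1
        dist[true_label] *= top_penalty
    if dist[true_label] < floor:           # keep statistical effectiveness
        dist[true_label] = floor
    return dist / dist.sum()

print(protective_distribution(2, n_classes=10))
```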
Furthermore, we provide the corresponding theoretical guarantee that the predictive model is classifier-consistent and that learning with the protective label distribution is ERM learnable. Finally, experimental results clearly validate the effectiveness of the proposed approach for solving the label DP problem.", Synergistic Neuromorphic Federated Learning with ANN-SNN Conversion For Privacy Protection,https://openreview.net/forum?id=xQdweNAgel,https://openreview.net/pdf?id=xQdweNAgel,,"Federated Learning (FL) has been widely explored in response to growing public data privacy concerns, where only model parameters are communicated instead of private data. However, recent studies debunk the privacy protection of FL, showing that private data can be leaked from the communicated gradients or parameter updates. In this paper, we propose a framework called Synergistic Neuromorphic Federated Learning (SNFL) that enhances privacy during FL. Before uploading the updates of the client model, SNFL first converts clients' Artificial Neural Networks (ANNs) to Spiking Neural Networks (SNNs) via calibration algorithms. In this way, SNFL loses almost no accuracy while encrypting the client model's parameters, yielding a performant model with strong privacy. After aggregating the various SNN parameters, the server distributes them back to the clients to continue training under the ANN architecture, providing smooth convergence. The proposed framework is demonstrated to be private, introducing only lightweight overhead while yielding prominent performance boosts. Extensive experiments with different kinds of datasets have demonstrated the efficacy and practicability of our method. In most of our IID and non-extreme non-IID experimental scenarios, SNFL significantly enhances model performance; for instance, in the IID setting on Tiny-ImageNet, SNFL improves the accuracy of FedAvg by 13.79%. Also, the original image cannot be reconstructed after 280 iterations of attacks with the SNFL method, whereas it can be reconstructed after just 70 iterations with FedAvg. ", On Fairness Measurement for Generative Models,https://openreview.net/forum?id=VpwAo8rHDi,https://openreview.net/pdf?id=VpwAo8rHDi,,"Deep generative models have made significant progress in improving the diversity and quality of generated data. Recently, there has been increased interest in fair generative models. Fairness in generative models is important, as some bias in the sensitive attributes of the generated samples could have severe effects in applications under high-stakes settings (e.g., criminal justice, healthcare). In this work, we conduct, for the first time, an in-depth study on fairness measurement, a critical component to gauge the research progress of fair generative models. Our work makes two contributions. As our first contribution, we reveal that there exist considerable errors in the existing fairness measurement framework. We attribute this to the lack of consideration for errors in the sensitive attribute classifiers. Contrary to prior assumptions, even highly accurate attribute classifiers can result in large errors in fairness measurement, e.g., a ResNet-18 for Gender with $\sim$97% accuracy could still lead to 4.98% estimation error when measuring the fairness of a StyleGAN2 trained on the CelebA-HQ.
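The attribute-classifier error problem just described has a textbook antecedent: the Rogan-Gladen correction adjusts a measured proportion for known classifier sensitivity and specificity. CLEAM's statistical model, introduced next, is more involved; this is only the simplest error-aware estimator of that kind.

```python
import numpy as np

def corrected_proportion(p_measured, sensitivity, specificity):
    """Rogan-Gladen correction of a proportion measured with an imperfect
    binary attribute classifier: p = (p_hat + spec - 1) / (sens + spec - 1)."""
    p = (p_measured + specificity - 1.0) / (sensitivity + specificity - 1.0)
    return float(np.clip(p, 0.0, 1.0))

# Even a 97%-accurate classifier can noticeably skew a measured balance:
print(corrected_proportion(p_measured=0.55, sensitivity=0.97, specificity=0.97))
```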
As our second (major) contribution, we address this error in the existing fairness measurement framework by proposing a CLassifier Error-Aware Measurement (CLEAM). CLEAM applies a statistical model to take into account the error in the attribute classifiers, leading to significant improvement in the accuracy of fairness measurement. Our experimental results on evaluating fairness of state-of-the-art GANs (StyleGAN2 and StyleSwin) show CLEAM is able to significantly reduce fairness measurement errors, e.g., by 7.78% for StyleGAN2 (8.68%$\rightarrow$0.90%), and by 7.16% for StyleSwin (8.23%$\rightarrow$1.07%) when targeting the Gender attribute. Furthermore, our proposed CLEAM has minimal additional overhead when compared to the existing baseline. Code and instructions to reproduce the results are included in Supplementary.","Fairness, Generative models, GAN" Federated Semi-supervised Learning with Dual Regulator,https://openreview.net/forum?id=KnqaT58PV7,https://openreview.net/pdf?id=KnqaT58PV7,,"Federated learning emerges as a powerful method to learn from decentralized heterogeneous data while protecting data privacy. Federated semi-supervised learning (FSSL) is even more practical and challenging, where only a fraction of data can be labeled due to high annotation cost. Existing FSSL methods, however, assume independent and identically distributed (IID) labeled data across clients and consistent class distribution between labeled and unlabeled data within a client. In this work, we propose a novel FSSL framework with dual regulator, FedDure, to optimize and customize model training according to specific data distributions of clients. FedDure lifts the previous assumption with a coarse-grained regulator (C-reg) and a fine-grained regulator (F-reg): C-reg regularizes the updating of local model by tracking the learning effect on labeled data distribution; F-reg learns an adaptive weighting scheme tailored for unlabeled instances in each client. We further formulate the client model training as bi-level optimization that adaptively optimizes the model in the client with two regulators. Theoretically, we show the convergence guarantee of the dual regulator. Empirically, we demonstrate that FedDure is superior to the existing methods across a wide range of settings, notably by more than 12% on CIFAR-10 and CINIC-10 datasets.","federated learning, semi-supervised learning, dual regulator, class imbalance" Path Regularization: A Convexity and Sparsity Inducing Regularization for Parallel ReLU Networks,https://openreview.net/forum?id=cgUFZvSEXG,https://openreview.net/pdf?id=cgUFZvSEXG,We introduce a convex analytic framework to unveil a hidden convex optimization landscape for training path regularized deep neural networks.,"Understanding the fundamental principles behind the success of deep neural networks is one of the most important open questions in the current literature. To this end, we study the training problem of deep neural networks and introduce an analytic approach to unveil hidden convexity in the optimization landscape. We consider a deep parallel ReLU network architecture, which also includes standard deep networks and ResNets as its special cases. We then show that pathwise regularized training problems can be represented as an exact convex optimization problem. We further prove that the equivalent convex problem is regularized via a group sparsity inducing norm. Thus, a path regularized parallel ReLU network can be viewed as a parsimonious convex model in high dimensions.
More importantly, we show that the computational complexity required to globally optimize the equivalent convex problem is fully polynomial-time in feature dimension and number of samples. Therefore, we prove polynomial-time trainability of path regularized ReLU networks with global optimality guarantees. We also provide experiments corroborating our theory.","Convex optimization, deep learning theory, path norm, group sparsity, polynomial-time training, ReLU networks, parallel architectures, global optimality, computational complexity" Globally Optimal Training of Neural Networks with Threshold Activation Functions,https://openreview.net/forum?id=_9k5kTgyHT,https://openreview.net/pdf?id=_9k5kTgyHT,"We show that training problem of regularized deep neural networks with threshold activations can be equivalently formulated as a standard convex optimization problem, which parallels the LASSO method."," Threshold activation functions are highly preferable in neural networks due to their efficiency in hardware implementations. Moreover, their mode of operation is more interpretable and resembles that of biological neurons. However, traditional gradient based algorithms such as Gradient Descent cannot be used to train the parameters of neural networks with threshold activations since the activation function has zero gradient except at a single non-differentiable point. To this end, we study weight decay regularized training problems of deep neural networks with threshold activations. We first show that regularized deep threshold network training problems can be equivalently formulated as a standard convex optimization problem, which parallels the LASSO method, provided that the last hidden layer width exceeds a certain threshold. We also derive an alternative simplified convex optimization formulation when the set of hyperplane arrangements for the data matrix is complete, i.e., the dataset can be shattered at a certain layer of the network. We corroborate our theoretical results with various numerical experiments.","Convex optimization, Lasso, threshold activation, binary activation, quantization, neural networks" Robust Learning with Decoupled Meta Label Purifier,https://openreview.net/forum?id=4NWwhku4AEI,https://openreview.net/pdf?id=4NWwhku4AEI,A method that decouples the label purification process into label-free representation learning and a simple meta label purifier.,"Training deep neural networks (DNN) with noisy labels is challenging since DNN can easily memorize inaccurate labels, leading to poor generalization ability. Recently, the meta-learning based label correction strategy has been widely adopted to tackle this problem via identifying and correcting potential noisy labels with the help of a small set of clean validation data. Although training with purified labels can effectively improve performance, solving the meta-learning problem inevitably involves a nested loop of bi-level optimization between model weights and hyper-parameters (i.e., label distribution). As a compromise, previous methods resort to a coupled learning process with alternating updates. In this paper, we empirically find that such simultaneous optimization over both model weights and label distribution cannot achieve an optimal routine, consequently limiting the representation ability of the backbone and the accuracy of the corrected labels. From this observation, a novel multi-stage label purifier named DMLP is proposed.
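The group-sparsity-inducing norm mentioned in the path regularization entry above can be written down directly. The sketch below adds a per-neuron group norm to a two-layer ReLU network objective; the grouping and the penalty weight are illustrative choices, not the paper's exact convex program.

```python
import numpy as np

def group_lasso_penalty(W1, w2):
    """Group norm over hidden neurons: sum_j ||(W1[j, :], w2[j])||_2.
    Each group couples a neuron's input weights with its output weight."""
    groups = np.concatenate([W1, w2[:, None]], axis=1)  # one row per neuron
    return np.sum(np.linalg.norm(groups, axis=1))

rng = np.random.default_rng(0)
W1, w2 = rng.standard_normal((64, 10)), rng.standard_normal(64)
X, y = rng.standard_normal((128, 10)), rng.standard_normal(128)

# Regularized objective: mean squared error plus the group-sparsity norm.
pred = np.maximum(X @ W1.T, 0.0) @ w2
loss = np.mean((pred - y) ** 2) + 1e-3 * group_lasso_penalty(W1, w2)
print(loss)
```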
DMLP decouples the label correction process into label-free representation learning and a simple meta label purifier. In this way, DMLP can focus on extracting discriminative features and correcting labels in two distinct stages. DMLP is a plug-and-play label purifier: the purified labels can be directly reused in naive end-to-end network retraining or other robust learning methods, where state-of-the-art results are obtained on several synthetic and real-world noisy datasets, especially under high noise levels.","Learning with Noisy labels, Decoupled Optimization, Meta Learning" Molecule Generation For Target Protein Binding with Structural Motifs,https://openreview.net/forum?id=Rq13idF0F73,https://openreview.net/pdf?id=Rq13idF0F73,,"Designing ligand molecules that bind to specific protein binding sites is a fundamental problem in structure-based drug design. Although deep generative models and geometric deep learning have made great progress in drug design, existing works either sample in the 2D graph space or fail to generate valid molecules with realistic substructures. To tackle these problems, we propose a Fragment-based LigAnd Generation framework (FLAG), to generate 3D molecules with valid and realistic substructures fragment-by-fragment. In FLAG, a motif vocabulary is constructed by extracting common molecular fragments (i.e., motif) in the dataset. At each generation step, a 3D graph neural network is first employed to encode the intermediate context information. Then, our model selects the focal motif, predicts the next motif type, and attaches the new motif. The bond lengths/angles can be quickly and accurately determined by cheminformatics tools. Finally, the molecular geometry is further adjusted according to the predicted rotation angle and the structure refinement. Our model not only achieves competitive performances on conventional metrics such as binding affinity, QED, and SA, but also outperforms baselines by a large margin in generating molecules with realistic substructures.",Structure-based Drug Design Bag of Tricks for FGSM Adversarial Training ,https://openreview.net/forum?id=X2Dbqvfx-NI,https://openreview.net/pdf?id=X2Dbqvfx-NI,"We provide the first study, which thoroughly examines a collection of tricks, to overcome the catastrophic overfitting in FGSM-AT.","Adversarial training (AT) with samples generated by Fast Gradient Sign Method (FGSM), also known as FGSM-AT, is a computationally simple method to train robust networks. However, during its training procedure, an unstable mode of ``catastrophic overfitting'' has been identified in [Wong2020FastIB], where the robust accuracy abruptly drops to zero within a single training step. Existing methods use gradient regularizers or random initialization tricks to attenuate this issue, whereas they either incur high computational cost or lead to lower robust accuracy. In this work, we provide the first study, which thoroughly examines a collection of tricks from three perspectives: Data Initialization, Network Structure, and Optimization, to overcome the catastrophic overfitting in FGSM-AT. Surprisingly, we find that simple tricks, i.e., a) masking partial pixels (even without randomness), b) setting a large convolution stride and smooth activation functions, or c) regularizing the weights of the first convolutional layer, can effectively tackle the overfitting issue.
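A sketch of one trick from the Bag of Tricks entry above: masking partial pixels of the FGSM perturbation during adversarial training. The epsilon and keep probability are illustrative values, and the random mask is one of several masking variants the paper examines.

```python
import torch
import torch.nn.functional as F

def masked_fgsm_step(model, x, y, eps=8 / 255, keep_prob=0.7):
    """One FGSM adversarial-training step where only a random subset of
    pixels receives the sign-gradient perturbation."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    mask = (torch.rand_like(x) < keep_prob).float()   # drop part of the pixels
    x_adv = (x + eps * mask * grad.sign()).clamp(0, 1).detach()
    return F.cross_entropy(model(x_adv), y)           # train on this loss
```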
Extensive results on a range of network architectures validate the effectiveness of each proposed trick, and combinations of tricks are also investigated. For example, trained with PreActResNet-18 on CIFAR-10, our method attains 49.8% accuracy against the PGD-50 attacker and 46.4% accuracy against AutoAttack, demonstrating that pure FGSM-AT is capable of enabling robust learners. ",adversarial training Exploring interactions between modalities for deepfake detection,https://openreview.net/forum?id=BCfxM1tR8E,https://openreview.net/pdf?id=BCfxM1tR8E,,"As face forgery techniques have become more mature, the proliferation of deepfakes may threaten the security of human society. Although existing deepfake detection methods achieve good performance in in-dataset evaluation, their generalization ability still needs improvement, in which the representation of imperceptible artifacts plays a significant role. In this paper, we propose an Interactive Two-Stream Network (ITSNet) to explore the discriminant inconsistency representation from the perspective of cross-modality. Specifically, the patch-wise Decomposable Discrete Cosine Transform (DDCT) is adopted to extract fine-grained high-frequency clues, and information from different modalities is communicated via a designed interaction module. To perceive the temporal inconsistency, we first develop a Short-term Embedding Module (SEM) to refine the subtle local inconsistency representation between adjacent frames, and then design a Long-term Embedding Module (LEM) to further refine the erratic temporal inconsistency representation from a long-range perspective. Extensive experimental results conducted on three public datasets show that ITSNet outperforms the state-of-the-art methods in terms of both in-dataset and cross-dataset evaluations.","cross-modality representation learning, inconsistency representation, interaction" Towards Robustness Certification Against Universal Perturbations,https://openreview.net/forum?id=7GEvPKxjtt,https://openreview.net/pdf?id=7GEvPKxjtt,A robustness certification framework against universal perturbations (including both universal adversarial noise and backdoor attacks).,"In this paper, we investigate the problem of certifying neural network robustness against universal perturbations (UPs), which have been widely used in universal adversarial attacks and backdoor attacks. Existing robustness certification methods aim to provide robustness guarantees for each sample with respect to the worst-case perturbations given a neural network. However, those sample-wise bounds will be loose when considering the UP threat model as they overlook the important constraint that the perturbation should be shared across all samples. We propose a method based on a combination of linear relaxation-based perturbation analysis and Mixed Integer Linear Programming to establish the first robust certification method for UP. In addition, we develop a theoretical framework for computing error bounds on the entire population using the certification results from a randomly sampled batch.
Aside from an extensive evaluation of the proposed certification, we further show how the certification facilitates efficient comparison of robustness among different models or efficacy among different universal adversarial attack defenses and enables accurate detection of backdoor target classes.","Universal Perturbation, Adversarial Attack, Backdoor Attack, Certified Robustness, Poisoning Attack" Deep Generative Modeling on Limited Data with Regularization by Nontransferable Pre-trained Models,https://openreview.net/forum?id=M9u_ctqFUlg,https://openreview.net/pdf?id=M9u_ctqFUlg,,"Deep generative models (DGMs) are data-hungry because learning a complex model on limited data suffers from a large variance and easily overfits. Inspired by the classical perspective of the bias-variance tradeoff, we propose regularized deep generative model (Reg-DGM), which leverages a nontransferable pre-trained model to reduce the variance of generative modeling with limited data. Formally, Reg-DGM optimizes a weighted sum of a certain divergence and the expectation of an energy function, where the divergence is between the data and the model distributions, and the energy function is defined by the pre-trained model w.r.t. the model distribution. We analyze a simple yet representative Gaussian-fitting case to demonstrate how the weighting hyperparameter trades off the bias and the variance. Theoretically, we characterize the existence and the uniqueness of the global minimum of Reg-DGM in a non-parametric setting and prove its convergence with neural networks trained by gradient-based methods. Empirically, with various pre-trained feature extractors and a data-dependent energy function, Reg-DGM consistently improves the generation performance of strong DGMs with limited data and achieves competitive results to the state-of-the-art methods.", MAT: Mixed-Strategy Game of Adversarial Training in Fine-tuning,https://openreview.net/forum?id=t1N05TtC7CM,https://openreview.net/pdf?id=t1N05TtC7CM,We generalize fine-tuning adversarial training to a mixed-strategy game.,"Fine-tuning large-scale models from pre-trained checkpoints has been demonstrated effective for various natural language processing (NLP) tasks. Previous works reveal that leveraging adversarial training methods during the fine-tuning stage significantly enhances the generalization and robustness of the models. However, from the perspective of optimization, previous adversarial training methods suffer from converging onto local optima due to the non-convexity of the objective. In this work, we reformulate adversarial training through the lens of mixed strategies in game theory and incorporate the full strategy space to avoid being trapped in local stationarity. Methodologically, we derive the mixed-strategy Nash equilibrium for adversarial training using entropy mirror descent to establish a novel mixed-strategy adversarial training algorithm (MAT). Numerically, to verify the effectiveness of MAT, we conducted extensive benchmark experiments over large-scale pre-trained models such as BERT and RoBERTa.
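Returning to the Reg-DGM entry above, its weighted objective is easy to sketch: a base divergence plus the expected energy of model samples under a frozen pre-trained network. The energy form, the network, and the weight below are illustrative assumptions, not the paper's data-dependent energy function.

```python
import torch
import torch.nn as nn

# Frozen stand-in for a pre-trained feature extractor.
feature_net = nn.Sequential(nn.Flatten(), nn.Linear(784, 64)).eval()
for p in feature_net.parameters():
    p.requires_grad_(False)

def energy(x):
    # Example energy: squared feature norm (illustrative choice only).
    return feature_net(x).pow(2).sum(dim=1)

def reg_dgm_loss(base_divergence, x_model, weight=0.1):
    """Divergence term (e.g., a GAN loss, passed in as a scalar) plus the
    weighted expected energy over model samples x_model."""
    return base_divergence + weight * energy(x_model).mean()

x_fake = torch.randn(8, 1, 28, 28)  # stand-in generator samples
print(reg_dgm_loss(torch.tensor(1.23), x_fake))
```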
The experimental results show that MAT outperforms the previous state of the art on both the GLUE and ANLI benchmarks in terms of generalization and robustness.","natural language processing, adversarial training, mixed-strategy game, fine-tuning" Defense against Backdoor Attacks via Identifying and Purifying Bad Neurons,https://openreview.net/forum?id=NpZ7TIs6ws,https://openreview.net/pdf?id=NpZ7TIs6ws,We design a backdoor defense method that identifies and purifies the backdoored neurons of victim models with a novel yet effective metric called benign salience.,"Recent studies reveal the vulnerability of neural networks to backdoor attacks. By embedding backdoors into the hidden neurons with poisoned training data, the backdoor attacker can override normal predictions of the victim model with attacker-chosen ones whenever the backdoor pattern is present in a testing input. In this paper, to mitigate public concerns about the attack, we propose a novel backdoor defense that identifies and purifies the backdoored neurons of the victim neural network. Specifically, we first define a new metric, called benign salience. By combining the first-order gradient to retain the connections between neurons, benign salience can identify the backdoored neurons with high accuracy. Then, a new Adaptive Regularization (AR) mechanism is proposed to assist in purifying these identified bad neurons via fine-tuning. Due to its ability to adapt to different magnitudes of parameters, AR can provide faster and more stable convergence than common regularization mechanisms in neuron purifying. Finally, we test the defense effect of our method on ten different backdoor attacks with three benchmark datasets. Experimental results show that our method can decrease the attack success rate by more than 95% on average, which is the best among six state-of-the-art defense methods.","backdoor defense, security, neuron importance evaluation" Basic Binary Convolution Unit for Binarized Image Restoration Network,https://openreview.net/forum?id=h8T5dZWTZ-Z,https://openreview.net/pdf?id=h8T5dZWTZ-Z,"We reconsider the components in BNNs and design a strong, simple and efficient basic binary convolution unit.","Lighter and faster image restoration (IR) models are crucial for deployment on resource-limited devices. Binary neural networks (BNNs), one of the most promising model compression methods, can dramatically reduce the computations and parameters of full-precision convolutional neural networks (CNNs). However, BNNs and full-precision CNNs have different properties, and experience in designing CNNs can hardly be transferred to developing BNNs. In this study, we reconsider components in binary convolution, such as the residual connection, BatchNorm, activation function, and structure, for IR tasks. We conduct systematic analyses to explain each component's role in binary convolution and discuss the pitfalls. Specifically, we find that the residual connection can reduce the information loss caused by binarization; BatchNorm can bridge the value-range gap between the residual connection and binary convolution; and the position of the activation function dramatically affects the performance of BNNs. Based on our findings and analyses, we design a simple yet efficient basic binary convolution unit (BBCU). Furthermore, we divide IR networks into four parts and specially design variants of BBCU for each part to explore the benefit of binarizing these parts.
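The component analysis above (residual connections against binarization loss, BatchNorm bridging value ranges, activation placement) can be made concrete with a toy unit. The sketch below is our own illustration of such a design, assuming sign binarization with a straight-through estimator; the published BBCU may differ in detail.

```python
import torch
import torch.nn as nn

class SignSTE(torch.autograd.Function):
    """Binarize to {-1, +1} with a straight-through estimator backward."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)
    @staticmethod
    def backward(ctx, g):
        (x,) = ctx.saved_tensors
        return g * (x.abs() <= 1).float()  # clipped identity gradient

class BinaryConvUnit(nn.Module):
    """Sketch of a basic binary convolution unit: binarized activations and
    weights, BatchNorm after the conv to re-scale its output range toward the
    full-precision residual branch, and a residual connection to limit the
    information loss from binarization (an assumed design, not the exact BBCU)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.PReLU(channels)
    def forward(self, x):
        xb = SignSTE.apply(x)                    # binarize activations
        wb = SignSTE.apply(self.conv.weight)     # binarize weights
        out = nn.functional.conv2d(xb, wb, padding=1)
        return self.act(self.bn(out) + x)        # BN bridges the value-range gap
```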
We conduct experiments on different IR tasks, and our BBCU significantly outperforms other BNNs and lightweight models, which shows that BBCU can serve as a basic unit for binarized IR networks. All codes and models will be released.",Image Restoration; Low-Level Vision DSP: Dynamic Semantic Prototype for Generative Zero-Shot Learning,https://openreview.net/forum?id=3S62EPkO7k-,https://openreview.net/pdf?id=3S62EPkO7k-,Dynamic Semantic Prototype should be Considered in Generative Zero-Shot Learning,"Generative models (e.g., the generative adversarial network (GAN)) have advanced zero-shot learning (ZSL). Generative ZSL methods typically synthesize visual features of unseen classes from predefined class semantic prototypes to mitigate the lack of unseen samples. As these empirically designed prototypes cannot faithfully represent the actual semantic prototypes of visual features (i.e., visual prototypes), existing methods are limited in their ability to synthesize visual features that accurately represent real features and prototypes. We formulate this phenomenon as a visual-semantic domain shift problem. It prevents the generative models from further improving the ZSL performance. In this paper, we propose a dynamic semantic prototype learning (DSP) method to align the empirical and actual semantic prototypes for synthesizing accurate visual features. The alignment is conducted by jointly refining semantic prototypes and visual features so that the generator synthesizes visual features which are close to the real ones. We utilize a visual$\rightarrow$semantic mapping network (V2SM) to map both the synthesized and real features into the class semantic space. The V2SM helps the generator synthesize visual representations with rich semantics. The real/synthesized visual features supervise our visual-oriented semantic prototype evolving network (VOPE), in which the predefined class semantic prototypes are iteratively evolved into dynamic semantic prototypes. Such prototypes are then fed back to the generative network as conditional supervision. Finally, we enhance visual features by fusing the evolved semantic prototypes into their corresponding visual features. Our extensive experiments on three benchmark datasets show that DSP improves existing generative ZSL methods, \textit{e.g.}, with average improvements of the harmonic mean over four baselines (e.g., CLSWGAN, f-VAEGAN, TF-VAEGAN and FREE) of 8.5\%, 8.0\% and 9.7\% on CUB, SUN and AWA2, respectively.","Zero-Shot Learning, Generative Model, Knowledge Transfer" Analyzing the Latent Space of GAN through Local Dimension Estimation,https://openreview.net/forum?id=SlzEll3EsKv,https://openreview.net/pdf?id=SlzEll3EsKv,We analyze the latent space of GAN through local dimension estimation and propose a global disentanglement metric called Distortion.,"The impressive success of style-based GANs (StyleGANs) in high-fidelity image synthesis has motivated research to understand the semantic properties of their latent spaces. Recently, a close relationship was observed between the semantically disentangled local perturbations and the local PCA components in the learned latent space $\mathcal{W}$. However, understanding the number of disentangled perturbations remains challenging. Building upon this observation, we propose a local dimension estimation algorithm for an arbitrary intermediate layer in a pre-trained GAN model.
The estimated intrinsic dimension corresponds to the number of disentangled local perturbations. From this perspective, we analyze the intermediate layers of the mapping network in StyleGANs. Our analysis clarifies the success of the $\mathcal{W}$-space in StyleGAN and suggests a method for finding an alternative. Moreover, the intrinsic dimension estimation opens the possibility of unsupervised evaluation of global-basis-compatibility and disentanglement for a latent space. Our proposed metric, called Distortion, measures the inconsistency of the intrinsic tangent spaces on the learned latent space. The metric is purely geometric and does not require any additional attribute information. Nevertheless, the metric shows a high correlation with global-basis-compatibility and the supervised disentanglement score. Our findings pave the way towards an unsupervised selection of a globally disentangled latent space among the intermediate latent spaces in a GAN.","generative adversarial network, disentanglement, semantic factorization, dimension estimation, grassmannian" MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition,https://openreview.net/forum?id=q23TzayryBG,https://openreview.net/pdf?id=q23TzayryBG,,"Vision Transformer and its variants have demonstrated great potential in various computer vision tasks. However, conventional vision transformers often focus on global dependency at a coarse level and therefore struggle to learn global relationships and fine-grained representations at the token level. In this paper, we introduce Multi-scale Attention Fusion into transformer (MAFormer), which explores local aggregation and global feature extraction in a dual-stream framework for visual recognition. We develop a simple but effective module to explore the full potential of transformers for visual representation by learning fine-grained and coarse-grained features at a token level and dynamically fusing them. Our Multi-scale Attention Fusion (MAF) block consists of: i) a local window attention branch that learns short-range interactions within windows, aggregating fine-grained local features; ii) global feature extraction through a novel Global Learning with Down-sampling (GLD) operation to efficiently capture long-range context information within the whole image; iii) a fusion module that self-explores the integration of both features via attention. MAFormer achieves state-of-the-art performance on common vision tasks. In particular, MAFormer-L achieves 85.9$\%$ Top-1 accuracy on ImageNet, surpassing CSWin-B and LV-ViT-L by 1.7$\%$ and 0.6$\%$ respectively. On MSCOCO, MAFormer outperforms the prior art CSWin by 1.7$\%$ mAP on object detection and 1.4$\%$ on instance segmentation with similar-sized parameters, demonstrating the potential to be a general backbone network.", A Causal Approach to Detecting Multivariate Time-series Anomalies and Root Causes,https://openreview.net/forum?id=f25VGPzATcn,https://openreview.net/pdf?id=f25VGPzATcn,A causality-based approach for anomaly detection and root cause analysis,"Detecting anomalies and the corresponding root causes in multivariate time series plays an important role in monitoring the behaviors of various real-world systems, e.g., IT system operations or the manufacturing industry. Previous anomaly detection approaches model the joint distribution without considering the underlying mechanism of multivariate time series, making them computationally expensive and ill-suited to identifying root causes.
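As a concrete illustration of local dimension estimation in the GAN-latent-space row above: one common recipe, in the spirit of the local-PCA view, is to count the dominant singular values of the layer's Jacobian at a latent point. This is a generic sketch; the paper's actual estimator may differ.

```python
import torch

def local_intrinsic_dim(layer_fn, w: torch.Tensor, energy: float = 0.99) -> int:
    """Estimate the local dimension at latent point w by spectral analysis of
    the Jacobian of layer_fn (e.g., one mapping-network layer). Counts how
    many singular values are needed to capture `energy` of the spectrum.
    A generic recipe; the paper's estimator may use local PCA of samples."""
    J = torch.autograd.functional.jacobian(layer_fn, w)   # (out_dim, in_dim)
    s = torch.linalg.svdvals(J.reshape(-1, w.numel()))
    ratios = torch.cumsum(s**2, 0) / (s**2).sum()
    return int((ratios < energy).sum().item()) + 1

# toy usage: a random 512 -> 1024 "layer" standing in for a mapping layer
layer = torch.nn.Linear(512, 1024)
w = torch.randn(512)
print(local_intrinsic_dim(lambda z: layer(z), w))
```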
In this paper, we formulate the anomaly detection problem from a causal perspective and view anomalies as instances that do not follow the regular causal mechanism that generates the multivariate data. We then propose a causality-based framework for detecting anomalies and root causes. It first learns the causal structure from data and then infers whether an instance is an anomaly relative to the local causal mechanism, whose conditional distribution can be directly estimated from data. In light of the modularity property of causal systems (the causal processes that generate different variables are independent modules), the original problem is divided into a series of separate, simpler, and low-dimensional anomaly detection problems, so that where an anomaly happens (the root causes) can be directly identified. We evaluate our approach with both simulated and public datasets as well as a case study on real-world AIOps applications, showing its efficacy, robustness, and practical feasibility.","time series, anomaly detection, root cause analysis, causality" Quark: A Gradient-Free Quantum Learning Framework for Classification Tasks,https://openreview.net/forum?id=3wCqIZivcJx,https://openreview.net/pdf?id=3wCqIZivcJx,A new quantum learning framework for classification task,"As more practical and scalable quantum computers emerge, much attention has been focused on realizing quantum supremacy in machine learning. Existing quantum ML methods either (1) embed a classical model into a target Hamiltonian to enable quantum optimization or (2) represent a quantum model using variational quantum circuits and apply classical gradient-based optimization. The former method leverages the power of quantum optimization but only supports simple ML models, while the latter provides flexibility in model design but relies on gradient calculation, resulting in barren plateau (i.e., gradient vanishing) and frequent classical-quantum interactions. To address the limitations of existing quantum ML methods, we introduce Quark, a gradient-free quantum learning framework that optimizes quantum ML models using quantum optimization. Quark does not rely on gradient computation and therefore avoids barren plateau and frequent classical-quantum interactions. In addition, Quark can support more general ML models than prior quantum ML methods and achieves a dataset-size-independent optimization complexity. Theoretically, we prove that Quark can outperform classical gradient-based methods by reducing model query complexity for highly non-convex problems; empirically, evaluations on the Edge Detection and Tiny-MNIST tasks show that Quark can support complex ML models and significantly reduce the number of measurements needed for discovering near-optimal weights for these tasks.","Quantum Computing, Deep Learning, Quantum Machine Learning" Cross-modal Graph Contrastive Learning with Cellular Images,https://openreview.net/forum?id=V_V53-WG9m6,https://openreview.net/pdf?id=V_V53-WG9m6,We propose using high-content cell images to assist in learning molecular representations in a cross-modal framework.,"Constructing discriminative representations of molecules lies at the core of a number of domains such as drug discovery, material science, and chemistry. State-of-the-art methods employ graph neural networks (GNNs) and self-supervised learning (SSL) to learn the structural representations from unlabeled data, which can then be fine-tuned for downstream tasks.
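To illustrate the modularity idea in the causal anomaly-detection row above: with a causal graph in hand, each variable can be scored against its own conditional mechanism, so the root cause is simply the variable whose module is violated. A minimal linear-Gaussian sketch, assuming the graph is given rather than learned as in the paper:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_causal_modules(X: np.ndarray, parents: dict):
    """Fit one conditional model per variable given its causal parents.
    X: (n, d) training data (treated as i.i.d. for simplicity);
    parents[j]: list of parent column indices of variable j."""
    modules = {}
    for j, pa in parents.items():
        if pa:
            reg = LinearRegression().fit(X[:, pa], X[:, j])
            resid = X[:, j] - reg.predict(X[:, pa])
        else:
            reg, resid = None, X[:, j] - X[:, j].mean()
        modules[j] = (reg, X[:, j].mean(), resid.std() + 1e-8)
    return modules

def root_cause_scores(x: np.ndarray, parents: dict, modules: dict) -> dict:
    """|z|-score of each variable's residual under its own causal mechanism;
    the largest scores point at candidate root-cause variables."""
    scores = {}
    for j, pa in parents.items():
        reg, mu, sd = modules[j]
        pred = reg.predict(x[pa][None])[0] if reg is not None else mu
        scores[j] = abs(x[j] - pred) / sd
    return scores
```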
Albeit powerful, these methods, pre-trained solely on molecular structures, cannot generalize well to tasks involving intricate biological processes. To cope with this challenge, we propose using high-content cell microscopy images to assist in learning molecular representations. The fundamental rationale of our method is to leverage the correspondence between molecular topological structures and the perturbations they cause at the phenotypic level. By including cross-modal pre-training with different types of contrastive loss functions in a unified framework, our model can efficiently learn generic and informative representations from cellular images, which are complementary to molecular structures. Empirical experiments demonstrate that the model transfers non-trivially to a variety of downstream tasks and is often competitive with the existing SSL baselines, e.g., a 15.4\% absolute Hit@10 gain in the graph-image retrieval task and a 4.0\% absolute AUC improvement in clinical outcome predictions. Further zero-shot case studies show the potential of the approach to be applied to real-world drug discovery. ","Cellular Image, Drug discovery, Graph neural networks, Self-supervised learning, Cross-modal learning" A Closer Look at Self-supervised Lightweight Vision Transformers,https://openreview.net/forum?id=NHfSJAWhKTw,https://openreview.net/pdf?id=NHfSJAWhKTw,,"Self-supervised learning on large-scale Vision Transformers (ViTs) as a pre-training method has achieved promising downstream performance. Yet, how much these pre-training paradigms promote lightweight ViTs' performance is considerably less studied. In this work, we develop and benchmark self-supervised pre-training methods, e.g., contrastive-learning-based MoCo-v3 and masked-image-modeling-based MAE, on image classification tasks and some downstream dense prediction tasks. We surprisingly find that if proper pre-training is adopted, even vanilla lightweight ViTs show performance on ImageNet comparable to previous SOTA networks with delicate architecture design. We also point out some defects of such pre-training, e.g., failing to benefit from large-scale pre-training data and showing inferior performance on data-insufficient downstream tasks. Furthermore, we analyze and clearly show the effect of such pre-training by analyzing the properties of the layer representations and attention maps of the related models. Finally, based on the above analyses, a distillation strategy during pre-training is developed, which leads to further downstream performance improvements for MAE-based pre-training.","Self-supervised Learning, Vision Transformers, Lightweight Networks" MANDERA: Malicious Node Detection in Federated Learning via Ranking,https://openreview.net/forum?id=djpsp_fSf2G,https://openreview.net/pdf?id=djpsp_fSf2G,We propose a statistically sound way to detect the malicious nodes in Federated Learning.,"Byzantine attacks hinder the deployment of federated learning algorithms. Although we know that benign gradients and Byzantine-attacked gradients are distributed differently, detecting malicious gradients is challenging because (1) the gradient is high-dimensional and each dimension has its own distribution, and (2) the benign gradients and the attacked gradients are always mixed (two-sample test methods cannot apply directly).
To address the above, for the first time, we propose MANDERA, which is theoretically guaranteed to efficiently detect all malicious gradients under Byzantine attacks with no prior knowledge or history about the number of attacked nodes. Specifically, we transform the original gradient update matrix into a ranking matrix. By such an operation, the scales of different dimensions of the gradients in the ranking space become identical, and the high-dimensional benign gradients and the malicious gradients can be easily separated. The effectiveness of MANDERA is further confirmed by experimentation on four Byzantine attack implementations (Gaussian, Zero Gradient, Sign Flipping, Shifted Mean), in comparison with state-of-the-art defenses. The experiments cover both IID and Non-IID datasets.","Federated Learning, Byzantine attack, malicious node detection, ranking" MQSP: Micro-Query Sequence Parallelism for Linearly Scaling Long Sequence Transformer,https://openreview.net/forum?id=gfr5yILQc7_,https://openreview.net/pdf?id=gfr5yILQc7_,MQSP is a novel sequence parallelism that linearly scales long sequence Transformers through all-gathering Micro-Q.,"Long sequence modeling with Transformers is gaining prevalence in fields involving long texts and high-resolution images and videos, but it suffers from quadratic memory complexity. Existing work investigates low-complexity variants or parallel methods to handle it. The former attempts to approximate full attention and is limited by a single device's capacity. The latter struggles to manage the quadratic memory of attention maps, leading to insufficient sequence scalability. In this work, we propose a novel parallel method named $\textbf{M}$icro-$\textbf{Q}$uery $\textbf{S}$equence $\textbf{P}$arallelism. MQSP slices sequences across devices and projects local queries, keys, and values in self-attention. For communication and memory efficiency, MQSP all-gathers the queries while keys and values remain local to acquire the local attention map, on which a distributed softmax is conducted to amortize memory by column. Meanwhile, the queries are further partitioned as Micro-Q to divide the computation and recycle the attention map by row, jointly decomposing the quadratic memory to achieve linear scalability. The evaluation shows that MQSP scales up sequence length linearly, achieving 4.5$\times$ the sequence length of ColossalAI's sequence parallelism and 4.3$\times$ that of Megatron-LM3, enabling training BERT-large with a sequence length of 78848 on 32 A100 GPUs. MQSP can reduce memory occupation by up to 78.6$\%$ and achieve up to 3.3$\times$ throughput when training with a sequence length of 17408. The convergence-quality experiments show that MQSP handles long sequences with guaranteed convergence, bringing the potential for the Transformer to explore longer sequences.","Sequence parallelism, Long Sequence Transformer, Distributed training" HagSeg: Hardness-adaptive Guidance for Semi-supervised Semantic Segmentation,https://openreview.net/forum?id=AtyO3IZYVEy,https://openreview.net/pdf?id=AtyO3IZYVEy,An instance-specific and hardness-adaptive SSS framework,"Recently, semi-supervised semantic segmentation has achieved promising performance with a small fraction of labelled data. However, most existing studies treat all unlabeled data equally and barely consider the differences and training difficulties among unlabeled instances. Differentiating unlabeled instances can promote instance-specific supervision to adapt to the model's evolution dynamically.
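The ranking step in the MANDERA row above is easy to prototype: ranking each gradient dimension across nodes puts all dimensions on the identical scale {1, ..., n_nodes}, after which simple statistics of each node's rank row separate the two groups. A sketch using a 2-means split and the assumption that attackers are the minority; the paper's decision rule is statistically grounded and may differ.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.cluster import KMeans

def flag_malicious_nodes(grads: np.ndarray) -> np.ndarray:
    """grads: (n_nodes, n_params) matrix of local gradient updates.
    Returns a boolean mask over nodes, True = suspected malicious."""
    ranks = rankdata(grads, axis=0)                 # per-dimension ranks
    feats = np.stack([ranks.mean(axis=1),           # summary of each node's
                      ranks.std(axis=1)], axis=1)   # behaviour in rank space
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(feats)
    return labels == np.bincount(labels).argmin()   # minority cluster flagged
```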
In this paper, we emphasize the crucial role of instance differences and propose instance-specific and hardness-adaptive guidance for semi-supervised semantic segmentation, named HagSeg. Relying on the model's performance, HagSeg employs the class-weighted symmetric intersection-over-union to evaluate the hardness of each unlabeled instance and then supervises the training on unlabeled data in a hardness-adaptive manner. Specifically, HagSeg learns from unlabeled instances progressively by weighting their corresponding consistency losses based on the evaluated hardness. Meanwhile, HagSeg dynamically adjusts the augmentation for each instance such that the distortion degree of augmented instances is adapted to the model's generalization capability across the training course. Without introducing additional losses or training procedures, HagSeg obtains remarkable performance gains over current state-of-the-art approaches on segmentation benchmarks under different semi-supervised partition protocols.","Semi-supervised, Semantic Segmentation, Hardness-adaptive guidance" DSPNet: Towards Slimmable Pretrained Networks based on Discriminative Self-supervised Learning,https://openreview.net/forum?id=6Ysgo5RXUvn,https://openreview.net/pdf?id=6Ysgo5RXUvn,,"Self-supervised learning (SSL) has achieved promising downstream performance. However, when facing various resource budgets in real-world applications, it incurs a huge computational burden to pretrain multiple networks of various sizes one by one. In this paper, we propose Discriminative-SSL-based Slimmable Pretrained Networks (DSPNet), which can be trained once and then slimmed to multiple sub-networks of various sizes, each of which faithfully learns good representations and can serve as good initialization for downstream tasks with various resource budgets. Specifically, we extend the idea of slimmable networks to a discriminative SSL paradigm by integrating SSL and knowledge distillation gracefully. We show comparable or improved performance of DSPNet on ImageNet relative to networks individually pretrained one by one, under both the linear evaluation and semi-supervised evaluation protocols, while greatly reducing training cost. The pretrained models also generalize well on downstream detection and segmentation tasks. Code will be made public.","Self-supervised Learning, Dynamic Neural Networks, Knowledge Distillation" "Generative Multi-Flow Networks: Centralized, Independent and Conservation",https://openreview.net/forum?id=OTIhUlChVaT,https://openreview.net/pdf?id=OTIhUlChVaT,Generative Multi-Flow Networks,"Generative flow networks utilize the flow matching loss to learn a stochastic policy for generating objects from a sequence of actions, such that the probability of generating a pattern can be proportional to the corresponding given reward. However, existing works can only handle single-flow-model tasks and cannot directly generalize to multi-agent flow networks due to limitations such as flow estimation complexity and independent sampling. In this paper, we propose the framework of generative multi-flow networks (GMFlowNets), which can be applied to multiple agents to generate objects collaboratively through a series of joint actions. Then, the centralized flow network algorithm is proposed for centralized training of GMFlowNets, while the independent flow network algorithm is proposed to achieve decentralized execution of GMFlowNets.
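A minimal sketch of the hardness evaluation in the HagSeg row above: score each unlabeled instance by one minus a class-weighted IoU between its prediction and pseudo-label, then use that score to weight its consistency loss. The exact weighting scheme in the paper may differ.

```python
import torch

def hardness(pred: torch.Tensor, pseudo: torch.Tensor,
             class_w, n_cls: int) -> torch.Tensor:
    """Per-instance hardness as 1 - class-weighted IoU between the model
    prediction and its pseudo-label (a sketch of HagSeg's evaluator, not the
    paper's exact formula). pred, pseudo: (B, H, W) integer label maps;
    class_w: sequence of n_cls class weights."""
    scores = []
    for b in range(pred.shape[0]):
        iou_sum, w_sum = 0.0, 0.0
        for c in range(n_cls):
            p, q = pred[b] == c, pseudo[b] == c
            union = (p | q).sum().item()
            if union == 0:
                continue  # class absent in both maps
            iou = (p & q).sum().item() / union
            iou_sum += class_w[c] * iou
            w_sum += class_w[c]
        scores.append(1.0 - iou_sum / (w_sum + 1e-8))
    # weight each instance's consistency loss by its score in [0, 1]
    return torch.tensor(scores)
```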
Based on the independent global conservation condition, the flow conservation network algorithm is then proposed to realize the centralized-training-with-decentralized-execution paradigm. Theoretical analysis proves that using the multi-flow matching loss function can train a unique Markovian flow, and that the flow conservation network ensures that independent policies generate samples with probability proportional to the reward function. Experimental results demonstrate the performance superiority of the proposed algorithms compared to reinforcement learning and MCMC-based methods.","GFlowNets, Multi-Flow" A Laplace-inspired Distribution on SO(3) for Probabilistic Rotation Estimation,https://openreview.net/forum?id=Mvetq8DO05O,https://openreview.net/pdf?id=Mvetq8DO05O,,"Estimating the 3DoF rotation from a single RGB image is an important yet challenging problem. Probabilistic rotation regression has attracted increasing attention with the benefit of expressing uncertainty information along with the prediction. Though modeling noise using the Gaussian-resembling Bingham distribution and matrix Fisher distribution is natural, they are shown to be sensitive to outliers due to their quadratic penalty on deviations. In this paper, we draw inspiration from the multivariate Laplace distribution and propose a novel Rotation Laplace distribution on SO(3). The Rotation Laplace distribution is robust to the disturbance of outliers and assigns large gradients to the low-error region, resulting in better convergence. Our extensive experiments show that our proposed distribution achieves state-of-the-art performance for rotation regression tasks over both probabilistic and non-probabilistic baselines.", Why pseudo-label based algorithm is effective? --from the perspective of pseudo-labeled data,https://openreview.net/forum?id=0KfAZAyClG1,https://openreview.net/pdf?id=0KfAZAyClG1,Theoretical analysis of the superiority of pseudo-label based semi-supervised learning algorithm --from the perspective of pseudo-labeled data,"Recently, pseudo-label-based semi-supervised learning has achieved great success in many fields. The core idea of the pseudo-label-based semi-supervised learning algorithm is to use the model trained on the labeled data to generate pseudo labels on the unlabeled data, and then train a model to fit the previously generated pseudo labels. In this paper, we give a theoretical analysis of why pseudo-label-based semi-supervised learning is effective. We mainly compare the generalization error of the model trained under two settings: (1) there are $N$ labeled data; (2) there are $N$ unlabeled data and a suitable initial model. Our analysis shows that, first, as the amount of unlabeled data tends to infinity, the pseudo-label-based semi-supervised learning algorithm obtains a model with the same generalization error upper bound as a model obtained by normal training when the amount of labeled data tends to infinity. More importantly, we prove that when the amount of unlabeled data is large enough, the generalization error upper bound of the model obtained by the pseudo-label-based semi-supervised learning algorithm converges to the optimal upper bound at a linear convergence rate. We also give the lower bound on the sample complexity required to achieve the linear convergence rate.
Our analysis contributes to understanding the empirical successes of pseudo-label-based semi-supervised learning.","pseudo-label based algorithm, semi-supervised learning" Distributionally Robust Model-Based Offline Reinforcement Learning with Near-Optimal Sample Complexity,https://openreview.net/forum?id=5DKHY-Ag62E,https://openreview.net/pdf?id=5DKHY-Ag62E,This paper provides the first provably near-optimal robust offline RL algorithm that learns under model uncertainty and partial coverage.,"This paper concerns the central issues of model robustness and sample efficiency in offline reinforcement learning (RL), which aims to learn to perform decision making from historical data without active exploration. Due to uncertainties and variabilities of the environment, it is critical to learn a robust policy---with as few samples as possible---that performs well even when the deployed environment deviates from the nominal one used to collect the historical dataset. We consider a distributionally robust formulation of offline RL, focusing on tabular robust Markov decision processes with an uncertainty set specified by the Kullback-Leibler divergence, in both finite-horizon and infinite-horizon settings. To combat sample scarcity, a model-based algorithm that combines distributionally robust value iteration with the principle of pessimism in the face of uncertainty is proposed, by penalizing the robust value estimates with a carefully designed data-driven penalty term. Under a mild and tailored assumption on the historical dataset that measures distribution shift without requiring full coverage of the state-action space, we establish the finite-sample complexity of the proposed algorithm, and further show that it is almost unimprovable in light of a nearly matching information-theoretic lower bound up to a polynomial factor of the (effective) horizon length. To the best of our knowledge, this provides the first provably near-optimal robust offline RL algorithm that learns under model uncertainty and partial coverage. ","offline reinforcement learning, distributional robustness, model-based reinforcement learning" ContraGen: Effective Contrastive Learning For Causal Language Model,https://openreview.net/forum?id=nMwFhKoAo5v,https://openreview.net/pdf?id=nMwFhKoAo5v,An effective contrastive learning approach that enhances both the generation and discrimination capability of causal language models,"Despite exciting progress in large-scale language generation, the expressiveness of the learned representations is severely limited by the \textit{anisotropy} issue, where the hidden representations are distributed into a narrow cone in the vector space. To address this issue, we present ContraGen, a novel contrastive learning framework to improve the representations with better uniformity and discrimination at both the sequence and token levels. We assess ContraGen on a wide range of downstream tasks in natural and programming languages. We show that ContraGen can effectively enhance both the uniformity and discrimination of the representations and lead to the desired improvements on various language understanding tasks where discriminative representations are crucial for attaining good performance. Specifically, we attain a $45.9\%$ relative improvement on the Semantic Textual Similarity tasks and $33.5\%$ on Code-to-Code Search tasks.
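The distributionally robust value iteration in the offline-RL row above repeatedly needs the worst-case expected value over a KL ball around the nominal transition model. That inner problem has a well-known scalar dual, sketched below; the paper's algorithm adds a data-driven pessimism penalty on top of such backups.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def kl_robust_expectation(values: np.ndarray, probs: np.ndarray,
                          sigma: float) -> float:
    """Worst-case expectation of `values` over all distributions within KL
    radius `sigma` of the nominal `probs`, via the standard dual
        inf_P E_P[V] = sup_{b>0} -b*log E_P0[exp(-V/b)] - b*sigma.
    A textbook building block for robust Bellman backups, not the paper's
    full algorithm."""
    def neg_dual(b):
        b = max(b, 1e-8)
        z = -values / b
        m = z.max()  # log-sum-exp shift for numerical stability
        log_mgf = m + np.log(np.sum(probs * np.exp(z - m)))
        return -(-b * log_mgf - b * sigma)
    res = minimize_scalar(neg_dual, bounds=(1e-6, 1e3), method="bounded")
    return -res.fun

# toy: nominal next-state values and transition probabilities
V = np.array([1.0, 0.5, 0.0])
P = np.array([0.6, 0.3, 0.1])
print(kl_robust_expectation(V, P, sigma=0.1))  # <= nominal E[V] = 0.75
```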
Furthermore, by improving the expressiveness of the representations, ContraGen also boosts the source code generation capability, with a $9\%$ relative improvement in execution accuracy on the HumanEval benchmark.","Natural Language Processing, Contrastive Learning, Causal Language Model, Natural Language Generation and Discrimination, Code Generation and Discrimination" Fast 6D Object Pose Refinement via Implicit Surface Representation Driven Optimization,https://openreview.net/forum?id=IJwkJCFQ9EJ,https://openreview.net/pdf?id=IJwkJCFQ9EJ,"In this paper, we propose a simple yet efficient self-supervised point cloud alignment method via an implicit neural network, which can serve as an alternative to ICP to achieve fast and accurate pose refinement.","Pose refinement after the initial pose estimator has been demonstrated to be effective for 6D object pose estimation. The iterative closest point (ICP) is the most popular refinement strategy, which however suffers from slow convergence due to the nature of iterative nonlinear optimization. In this paper, we propose a simple yet efficient self-supervised point cloud alignment method via an implicit neural network, which can serve as an alternative to ICP to achieve fast and accurate pose refinement. Our key idea is to encode the surface of the target point cloud into a signed distance function (SDF); the optimal rigid transformation can then be derived by addressing a minimization problem over the SDF. The workflow of our method does not require any pose annotations. Experimental results show our method can achieve 6.4\%, 16.2\%, and 3.9\% performance improvements over the prior art OVE6D (w/o ICP) on the LINEMOD, Occluded LINEMOD and T-LESS datasets respectively, and is comparable with other SOTA methods, even the supervised ones. Compared with point-to-plane ICP, our method has a clear advantage in computation speed, owing to the highly parallel nature of GPU-accelerated deep learning. ","Signed Distance Field, 6D pose refinement, Implicit neural network, ICP" Benchmarking Encoder-Decoder Architectures for Biplanar X-ray to 3D Bone Shape Reconstruction,https://openreview.net/forum?id=RlCa0pFZsR9,https://openreview.net/pdf?id=RlCa0pFZsR9,,"Various deep learning pipelines have been proposed for 3D bone shape reconstruction from biplanar X-rays. Although these methods individually report excellent results, we do not know how these architecture pipelines compare against each other, since they are reported on different anatomies, datasets, and cohort distributions. We benchmark these disparate architectures on an equal footing on three different anatomies using public datasets. We describe various benchmarking tasks to simulate real-world clinical scenarios, including reconstruction of fractured bones, bones with implants, robustness to population shift, and estimation of clinical parameters. We provide an open-source implementation of SOTA architectures, dataset pipelines, and extraction of clinical parameters. Comparing the encoder-decoder architectures with baseline retrieval models, we find that the encoder-decoder methods are able to learn from data and are much better than retrieval baselines. However, the best methods differ little in performance, while domain shift plays an important role in deteriorating the performance of these methods.
","2D-3D Reconstruction, Object Reconstruction, Medical Applications, Encoder-Decoder Architectures" Time Series Anomaly Detection via Hypothesis Testing for Dynamical Systems,https://openreview.net/forum?id=G-dM79m_EXd,https://openreview.net/pdf?id=G-dM79m_EXd,We tackle the problem of anomaly detection in dynamical systems from the perspective of hypothesis testing and propose a new algorithm.,"Real world systems---such as robots, weather, energy systems and stock markets---are complicated and high-dimensional. Hence, without prior knowledge of the system dynamics, detecting or forecasting abnormal events from the sequential observations of the system is challenging. In this work, we address the problem caused by high-dimensionality via viewing time series anomaly detection as hypothesis testing on dynamical systems. This perspective can avoid the dimension of the problem from increasing linearly with time horizon, and naturally leads to a novel anomaly detection model, termed as DyAD (Dynamical system Anomaly Detection). Furthermore, as existing time-series anomaly detection algorithms are usually evaluated on relatively small datasets, we released a large-scale one on detecting battery failures in electric vehicles. We benchmarked several popular algorithms on both public datasets and our released new dataset. Our experiments demonstrated that our proposed model achieves state-of-the-art results.","anomaly detection, dynamical system, hypothesis testing" Exploring the Generalizability of CNNs via Activated Representational Substitution,https://openreview.net/forum?id=fhKzTDDStdZ,https://openreview.net/pdf?id=fhKzTDDStdZ,We propose an metric named Activation Representation Substitution (ARS) to explore the association between representations learned by convolution kernels and generalization.,"Convolutional neural networks (CNNs) have achieved remarkable success in various fields due to their excellent generalizability. To explore the relationship between CNN representations and generalization, we propose an Activation Representation Substitution (ARS) metric based on the disentangled visual representations of convolution kernels. Without additional data, we obtain the disentangled visual representation of a kernel in the convolutional layer by iterating over a random image, and feed it into the CNN. The output activations of the other kernels in that convolutional layer are then investigated. When all other output activations are small, the ARS of the convolution kernel from that representation is also small, indicating that the representation is important for CNN. Our experiments with ablation analysis confirm the importance of the low ARS convolution kernel on accuracy. With ARS, we also explain batch normalization and class selectivity. By comparing the model performances on the test set, we empirically find that when the convolutional layer contains a large number of low ARS convolution kernels, the model has good generalization. 
ARS is a metric that can be used to better understand model generalizability without using external data.","Deep Learning, Convolutional Neural Network, Generalizability" Schrödinger's FP: Training Neural Networks with Dynamic Floating-Point Containers,https://openreview.net/forum?id=Y-J3jGFcnr4,https://openreview.net/pdf?id=Y-J3jGFcnr4,"Reducing the training cost and time by learning and using, on-the-fly, shorter floating-point formats","We introduce a software-hardware co-design approach to reduce memory traffic and footprint during training with BFloat16 or FP32, in order to boost energy efficiency and execution time performance. Our methods dynamically adjust the size and format of the floating-point containers used to store activations and weights during training. The different value distributions lead us to different approaches for exponents and mantissas. Gecko exploits the favourable exponent distribution with a lossless delta encoding approach to reduce the total exponent footprint by up to 58% in comparison to the FP32 baseline. To contend with the noisy mantissa distributions, we present two lossy methods to eliminate as many least-significant bits as possible without affecting accuracy. Quantum Mantissa is a machine learning mantissa compression method that taps into the gradient descent algorithm to learn the minimal mantissa bitlengths at per-layer granularity, obtaining up to a 92% reduction in total mantissa footprint. Alternatively, BitChop observes changes in the loss function during training to adjust the mantissa bitlength network-wide, yielding an 81% reduction in footprint. Schrödinger's FP implements hardware encoders/decoders that, guided by Gecko/Quantum Mantissa or Gecko/BitChop, transparently encode/decode values when transferring to/from off-chip memory, boosting energy efficiency and reducing execution time.","Quantization, Hardware Acceleration, Deep Learning" Measuring and Narrowing the Compositionality Gap in Language Models,https://openreview.net/forum?id=PUwbwZJz9dO,https://openreview.net/pdf?id=PUwbwZJz9dO,"Language models can solve complex problems when you let them 'talk things through', our new 'self-ask' prompt improves their ability to do that, and lets us easily plug in a search engine for even better performance.","We investigate the ability of language models to perform compositional reasoning tasks where the overall solution depends on correctly composing the answers to sub-problems. We measure how often models can correctly answer all sub-problems but not generate the overall solution, a ratio we call the compositionality gap. We evaluate this ratio by asking multi-hop questions with answers that require composing multiple facts unlikely to have been observed together during pretraining. We show that in the GPT-3 family of models, as model size increases, single-hop question-answering performance improves faster than multi-hop performance does; therefore the compositionality gap does not decrease. This surprising result suggests that while more powerful models memorize and recall more factual knowledge, they show no corresponding improvement in their ability to perform this kind of compositional reasoning. We then demonstrate how elicitive prompting (such as chain of thought) narrows the compositionality gap by reasoning explicitly instead of implicitly. We present a new method, self-ask, that further improves on chain of thought.
In our method, the model explicitly asks itself (and then answers) follow-up questions before answering the initial question. We finally show that self-ask's structured prompting lets us easily plug in a search engine to answer the follow-up questions, which additionally improves accuracy.","language modeling, prompting, question answering, retrieval" FedFA: Federated Learning with Feature Alignment for Heterogeneous Data,https://openreview.net/forum?id=BSZN74xLdqn,https://openreview.net/pdf?id=BSZN74xLdqn,"A federated learning framework with feature alignment is proposed to tackle the data heterogeneity problem, including label and feature distribution skews across clients, from a novel perspective of shared feature space by feature anchors.","Federated learning allows multiple clients to collaboratively train a model without exchanging their data, thus preserving data privacy. Unfortunately, it suffers from significant performance degradation when client data are heterogeneous. Common solutions involve designing specific regularizers for local-model training or developing aggregation schemes for global-model aggregation. Nevertheless, we find that these methods fail to achieve the desired performance because they neglect the importance of feature mapping consistency across client models. We first observe and analyze that, with heterogeneous data, a vicious cycle exists between classifier divergence and feature mapping inconsistency across clients, thereby shifting the aggregated global model away from the expected optimum. We then propose a simple yet effective framework named Federated learning with Feature Alignment (FedFA) to tackle the data heterogeneity problem from the novel perspective of a shared feature space. A key insight of FedFA is introducing feature anchors to align the feature mappings and calibrate the classifier updates across clients during their local updates, such that client models are updated in a shared feature space. We prove that this modification brings a property of consistent classifier updates if features are class-discriminative. Extensive experiments show that FedFA significantly outperforms state-of-the-art federated learning algorithms on various image classification datasets under both label and feature distribution skews.","Federated learning, feature alignment, data heterogeneity, heterogeneous label distribution, heterogeneous feature distribution" HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer,https://openreview.net/forum?id=3F6I-0-57SC,https://openreview.net/pdf?id=3F6I-0-57SC,A novel hierarchical vision transformer that is stronger and faster when applied to masked image modeling,"There has been a debate on the choice of plain vs. hierarchical vision transformers, where researchers often believe that the former (e.g., ViT) has a simpler design but the latter (e.g., Swin) enjoys higher recognition accuracy. Recently, the emergence of masked image modeling (MIM), a self-supervised visual pre-training method, has raised a new challenge to vision transformers in terms of flexibility, i.e., part of the image patches or tokens are to be discarded, which seems to favor plain vision transformers.
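The self-ask scaffold described above is a plain text loop around the model. A minimal sketch follows, where `lm(prompt) -> str` and `search(query) -> str` are hypothetical callables standing in for a language model and a search engine (the "Follow up:" / "Intermediate answer:" / "So the final answer is:" markers mirror the paper's prompt format).

```python
def self_ask(question: str, lm, search, max_hops: int = 4) -> str:
    """Minimal self-ask loop: the LM decomposes the question into follow-ups;
    each follow-up is answered by a search engine and appended to the prompt.
    `lm` and `search` are hypothetical callables, not a real API."""
    prompt = (f"Question: {question}\n"
              "Are follow up questions needed here: Yes.\n")
    for _ in range(max_hops):
        out = lm(prompt)                       # model writes the next line(s)
        if "Follow up:" in out:
            follow_up = out.split("Follow up:")[1].splitlines()[0].strip()
            answer = search(follow_up)         # plug in a search engine here
            prompt += (f"Follow up: {follow_up}\n"
                       f"Intermediate answer: {answer}\n")
        elif "So the final answer is:" in out:
            return out.split("So the final answer is:")[1].strip()
        else:
            prompt += out
    return lm(prompt + "So the final answer is:")
```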
In this paper, we delve deep into the comparison between ViT and Swin, revealing that (i) the performance gain of Swin is mainly brought by a deepened backbone and relative positional encoding, (ii) the hierarchical design of Swin can be simplified into hierarchical patch embedding (proposed in this work), and (iii) other designs such as shifted-window attentions can be removed. By removing the unnecessary operations, we come up with a new architecture named HiViT (short for hierarchical ViT), which is simpler and more efficient than Swin yet further improves its performance on fully-supervised and self-supervised visual representation learning. In particular, after being pre-trained using masked autoencoder (MAE) on ImageNet-1K, HiViT-B reports an 84.6% accuracy on ImageNet-1K classification, a 53.3% box AP on COCO detection, and a 52.8% mIoU on ADE20K segmentation, significantly surpassing the baseline. Code is attached in the supplementary materials and will be released to the public.","Hierarchical vision transformers, self-supervised learning, masked image modeling" Style Spectroscope: Improve Interpretability and Controllability through Fourier Analysis,https://openreview.net/forum?id=KAB29urre4C,https://openreview.net/pdf?id=KAB29urre4C,,"Universal style transfer (UST) infuses styles from arbitrary reference images into content images. Existing methods, while enjoying many practical successes, are unable to explain experimental observations, including the different performances of UST algorithms in preserving the spatial structure of content images. In addition, methods are limited to cumbersome global controls on stylization, requiring additional spatial masks to achieve the desired stylization. In this work, we provide a systematic Fourier analysis of a general framework for UST. We present an equivalent form of the framework in the frequency domain. The form implies that existing algorithms treat all frequency components and pixels of feature maps equally, except for the zero-frequency component. We connect Fourier amplitude and phase with Gram matrices and a content reconstruction loss in style transfer, respectively. Based on such equivalence and connections, we interpret the different structure preservation behaviors between algorithms via Fourier phase. Given these interpretations, we propose two manipulations in practice for structure preservation and desired stylization. Both qualitative and quantitative experiments demonstrate the competitive performance of our method against state-of-the-art methods. We also conduct experiments to demonstrate (1) the abovementioned equivalence, (2) the interpretability based on Fourier amplitude and phase and (3) the controllability associated with frequency components.", Multimodal Federated Learning via Contrastive Representation Ensemble,https://openreview.net/forum?id=Hnk1WRMAYqg,https://openreview.net/pdf?id=Hnk1WRMAYqg,"CreamFL, a multimodal FL framework using contrastive representation-level ensemble to learn a larger server model from heterogeneous clients across multi-modalities.","With the increasing amount of multimedia data on modern mobile systems and IoT infrastructures, harnessing these rich multimodal data without breaching user privacy becomes a critical issue. Federated learning (FL) serves as a privacy-conscious alternative to centralized machine learning.
However, existing FL methods extended to multimodal data all rely on model aggregation at the single-modality level, which restricts the server and clients to identical model architectures for each modality. This limits the global model in terms of both model complexity and data capacity, not to mention task diversity. In this work, we propose \textit{Contrastive Representation Ensemble and Aggregation for Multimodal FL (CreamFL)}, a multimodal federated learning framework that enables training larger server models from clients with heterogeneous model architectures and data modalities, while only communicating knowledge on a public dataset. To achieve better multimodal representation fusion, we design a global-local cross-modal ensemble strategy to aggregate client representations. To mitigate local model drift caused by two unprecedented heterogeneous factors stemming from multimodal discrepancy (\textit{modality gap} and \textit{task gap}), we further propose inter-modal and intra-modal contrasts to regularize local training, which complement the information of the absent modality for uni-modal clients and regularize local clients toward the global consensus. Thorough evaluations and ablation studies on image-text retrieval and visual question answering tasks showcase the superiority of CreamFL over state-of-the-art FL methods and its practical value.","Federated Learning, Multi-modal Learning, Representation-level Ensemble Knowledge Transfer" Eva: Practical Second-order Optimization with Kronecker-vectorized Approximation,https://openreview.net/forum?id=_Mic8V96Voy,https://openreview.net/pdf?id=_Mic8V96Voy,We propose an efficient approximation algorithm to accelerate second-order optimization for deep learning models. ,"Second-order optimization algorithms exhibit excellent convergence properties for training deep learning models, but often incur significant computation and memory overheads. This can result in lower training efficiency than first-order counterparts such as stochastic gradient descent (SGD). In this work, we present a memory- and time-efficient second-order algorithm named Eva with two novel techniques: 1) we construct the second-order information with the Kronecker factorization of small stochastic vectors over a mini-batch of training data to reduce memory consumption, and 2) we derive an efficient update formula without explicitly computing the inverse of matrices, using the Sherman-Morrison formula. We further provide a theoretical interpretation of Eva from a trust-region optimization point of view to understand how it works. Extensive experimental results on different models and datasets show that Eva reduces end-to-end training time by up to $2.05\times$ and $2.42\times$ compared to first-order SGD and second-order algorithms (K-FAC and Shampoo), respectively. ","Deep learning, Second-order optimization, Approximation" Identifying Weight-Variant Latent Causal Models,https://openreview.net/forum?id=dCwBpTXbfIq,https://openreview.net/pdf?id=dCwBpTXbfIq,,"The task of causal representation learning aims to uncover latent higher-level causal representations that affect lower-level observations. Identifying true latent causal representations from observed data, while allowing instantaneous causal relations among latent variables, remains a challenge, however. To this end, we start from the analysis of three intrinsic properties in identifying the latent space from observations: transitivity, permutation indeterminacy, and scaling indeterminacy.
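The Sherman-Morrison trick named in the Eva row above means that a rank-1-plus-damping curvature matrix can be applied inversely to a gradient in O(d), with no explicit matrix inverse. A minimal sketch, with an arbitrary vector u standing in for Eva's Kronecker-factored construction:

```python
import numpy as np

def sm_preconditioned_grad(g: np.ndarray, u: np.ndarray, damping: float) -> np.ndarray:
    """Apply (lam*I + u u^T)^{-1} to gradient g in O(d) time via the
    Sherman-Morrison identity:
        (lam*I + u u^T)^{-1} g = (g - u * (u@g) / (lam + u@u)) / lam
    Here u is an arbitrary stand-in; in Eva it is built from Kronecker
    factors of small per-batch vectors."""
    lam = damping
    return (g - u * (u @ g) / (lam + u @ u)) / lam

# check against the explicit inverse on a toy problem
d = 5
g, u, lam = np.random.randn(d), np.random.randn(d), 0.1
fast = sm_preconditioned_grad(g, u, lam)
slow = np.linalg.solve(lam * np.eye(d) + np.outer(u, u), g)
assert np.allclose(fast, slow)  # identical up to floating point
```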
We find that transitivity plays a key role in impeding the identifiability of latent causal representations. To address the unidentifiability caused by transitivity, we introduce a novel identifiability condition where the underlying latent causal model satisfies a linear-Gaussian model, in which the causal coefficients and the distribution of Gaussian noise are modulated by an additional observed variable. Under some mild assumptions, we can show that the latent causal representations can be identified up to trivial permutation and scaling. Furthermore, based on this theoretical result, we propose a novel method, termed Structural caUsAl Variational autoEncoder, which directly learns latent causal representations and the causal relationships among them, together with the mapping from the latent causal variables to the observed ones. We show that the proposed method learns the true parameters asymptotically. Experimental results on synthetic and real data demonstrate the identifiability and consistency results and the efficacy of the proposed method in learning latent causal representations.", Beyond Single Path Integrated Gradients for Reliable Input Attribution via Randomized Path Sampling,https://openreview.net/forum?id=muHaELT29WK,https://openreview.net/pdf?id=muHaELT29WK,We propose a new attribution method that takes the expectation of path-integrated attributions, enabled by an efficient path sampling method.,"Input attribution is a widely used explanation method for deep neural networks, especially in visual tasks. Among various attribution methods, Integrated Gradients (IG) is frequently used because of its model-agnostic applicability and desirable axioms. However, previous work has shown that such methods often produce noisy and unreliable attributions during the integration of the gradients over the path defined in the input space. In this paper, we tackle this issue by estimating the distribution of the possible attributions according to the integration path selection. We show that such noisy attributions can be reduced by aggregating attributions from multiple paths instead of using a single path. Inspired by the Stick-Breaking Process, we suggest a random process to generate rich and diverse samples of the gradient-integration path. Using multiple input attributions obtained from randomized paths, we propose a novel attribution measure based on the distribution of attributions at each input feature. Qualitatively, the proposed method produces less noisy and more object-aligned attributions; quantitative evaluations further demonstrate its feasibility.","Explainable AI, Input Attribution" Sweet Gradient Matters: Designing Consistent and Efficient Estimator for Zero-Shot Neural Architecture Search,https://openreview.net/forum?id=AsSdrNJ-DZG,https://openreview.net/pdf?id=AsSdrNJ-DZG,"We observe Sweet Gradient and propose Sweetimator, a consistent and efficient performance estimator in Zero-Shot Neural Architecture Search.","Neural architecture search (NAS) is one of the core technologies of AutoML for designing high-performance networks. Recently, Zero-Shot NAS has gained growing interest due to its training-free property and super-fast search speed. However, existing Zero-Shot estimators commonly suffer from low consistency, which limits their reliability and applicability. In this paper, we observe that the Sweet Gradient of parameters, i.e., absolute gradient values within a certain interval, brings higher consistency with network performance than the overall number of parameters.
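A sketch of the multiple-path idea from the randomized-path IG row above: give every input feature its own random monotone interpolation schedule, accumulate the Riemann sum of gradients along each sampled path, and keep the whole set of attributions so their per-feature distribution can be inspected. The paper's stick-breaking sampler is more refined; here the increments are plain uniform draws.

```python
import torch

def randomized_path_ig(model, x, x0, n_paths=8, n_steps=32):
    """Integrated-gradients attributions averaged over randomly sampled
    monotone paths from baseline x0 to input x. Returns all per-path
    attributions so mean/variance per feature can be examined.
    An illustrative sketch, not the paper's exact sampling process."""
    attrs = []
    for _ in range(n_paths):
        # per-feature random increments -> monotone schedule rising to 1
        inc = torch.rand(n_steps, *x.shape)
        alpha = inc.cumsum(0) / inc.sum(0, keepdim=True)   # (T, *x.shape)
        total = torch.zeros_like(x)
        prev = torch.zeros_like(x)
        for t in range(n_steps):
            xt = (x0 + alpha[t] * (x - x0)).requires_grad_(True)
            y = model(xt).sum()            # or a target-class logit
            (grad,) = torch.autograd.grad(y, xt)
            total += grad * (alpha[t] - prev) * (x - x0)   # Riemann term
            prev = alpha[t]
        attrs.append(total)
    return torch.stack(attrs)  # (n_paths, *x.shape)
```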
We further demonstrate a positive correlation between the network depth and the parameter ratio of sweet gradients in each layer. Based on this analysis, we propose a training-free method to find the Sweet Gradient interval and obtain an estimator, named Sweetimator. Experiments show that Sweetimator has superior consistency over existing Zero-Shot estimators on four benchmarks with eight search spaces. Moreover, Sweetimator achieves state-of-the-art performance on the NAS-Bench-201 and DARTS search spaces.","Neural Architecture Search, Zero-Shot, Estimator, Sweet Gradient" Bridging attack and prompting: An Enhanced Visual Prompting at the pixel level,https://openreview.net/forum?id=RjsiAoZqN6,https://openreview.net/pdf?id=RjsiAoZqN6,"We design a novel and concise visual prompting method incorporating a simple and effective training strategy inspired by adversarial attacks, and outperform the traditional linear probe in many scenarios.","In this paper, we study the problem of visual prompting at the pixel level. Recent works demonstrate the flexibility and generalization of visual-only prompts. However, they still cannot achieve results superior to the linear probe in terms of accuracy and parameter efficiency. We believe that the full power of visual prompts remains to be harnessed through a novel perspective, which bridges adversarial attacks and visual prompts considering the high similarity in both formats and objective functions. Bringing in the “old ideas” of adversarial attacks to enhance visual prompts is promising, since there are extensive theoretical and empirical solutions to improve the performance of adversarial attacks. Therefore, we propose a novel and concise visual prompting method incorporating simple and effective training strategies inspired by ideas from adversarial attacks. Specifically, we introduce input diversity and gradient normalization into visual prompt learning to obtain better generalization ability. Moreover, to avoid disrupting the original image with perturbations while keeping the spatial size of the inputs unchanged, we separate the prompt and image by shrinking the image and then padding it with learnable visual prompts, which can further improve the performance significantly without increasing FLOPs. Extensive experiments are conducted on various large-scale pre-trained models across several downstream datasets under different scenarios. We show that with a CLIP-based model, our enhanced visual prompt can successfully outperform the linear probe by 1.9% across 12 datasets on average with a comparable number of parameters, and can even match the full fine-tuning paradigm in some settings while training only 0.04% of the parameters.","prompting, adversarial machine learning, CLIP" Neural Collaborative Filtering Bandits via Meta Learning,https://openreview.net/forum?id=15hYIH0TUi,https://openreview.net/pdf?id=15hYIH0TUi,,"Contextual multi-armed bandits provide powerful tools to solve the exploitation-exploration dilemma in decision making, with direct applications in personalized recommendation. In fact, collaborative effects among users carry significant potential to improve recommendation. In this paper, we introduce and study the problem by exploring `Neural Collaborative Filtering Bandits', where the rewards can be non-linear functions and groups are formed dynamically given different specific contents.
To solve this problem, we propose a meta-learning based bandit algorithm, Meta-Ban (\textbf{meta-ban}dits), where a meta-learner is designed to represent and rapidly adapt to dynamic groups, along with an informative UCB-based exploration strategy. Furthermore, we prove that Meta-Ban achieves a regret bound of $\mathcal{O}(\sqrt{nT\log T})$, which is sharper than those of state-of-the-art related works. Finally, we conduct extensive experiments showing that Meta-Ban outperforms six strong baselines.","Neural Contextual Bandit, Meta Learning" MABA-Net: Masked Additive Binary Activation Network,https://openreview.net/forum?id=LlWfawcSpf,https://openreview.net/pdf?id=LlWfawcSpf,,"Despite significant reductions in memory footprint and computational cost, binary neural networks suffer from noticeable accuracy degradation compared to their real-valued counterparts. A few works have attempted to narrow the accuracy gap by increasing the representation bit-width or the network width/depth, but they come at the expense of increased memory and/or compute. In this work, we find that the imbalanced ratio of activations to weights may be the main cause of degraded performance and increased memory overhead. We propose the Masked Additive Binary Activation Network (MABA-Net) to reduce approximation errors and the activation bit-width, with minimal increase in the activation size. MABA-Net balances the ratio of the activation size to the weight size, leading to significant memory savings on large CNNs. We demonstrate MABA-Net's superior performance on the ImageNet dataset under various network configurations. Experimental results show that MABA-Net achieves competitive accuracy without increasing computational cost, while reducing memory usage compared to the state-of-the-art. We will release the codes upon acceptance.","Binary Neural Networks, Quantization, Binarization" Cascaded Teaching Transformers with Data Reweighting for Long Sequence Time-series Forecasting,https://openreview.net/forum?id=vSMubaJA1j,https://openreview.net/pdf?id=vSMubaJA1j,Cascaded Teaching Transformers,"Transformer-based models have shown superior performance on the long-sequence time-series forecasting problem. The sparsity assumption on the self-attention dot-product reveals that not all inputs are equally significant for Transformers. Instead of implicitly utilizing weighted time-series, we build a new learning framework in which cascaded teaching Transformers reweight samples. We formulate the framework as a multi-level optimization and design three different dataset-weight generators. We perform extensive experiments on five datasets, which show that our proposed method significantly outperforms SOTA Transformers.","Teaching, Reweight, Time-series, Forecast" Decoupled and Patch-based Contrastive Learning for Long-tailed Visual Recognition,https://openreview.net/forum?id=5i-n9TYb-xa,https://openreview.net/pdf?id=5i-n9TYb-xa,,"The imbalance of the dataset leads to the trained model being biased towards head classes and under-representing the tail classes, making long-tailed recognition challenging. To address these issues, this paper proposes decoupled and patch-based contrastive learning. Given an anchor image, supervised contrastive learning pulls two kinds of positives together in the embedding space: the same image with different data augmentations and other images from the same class. The weights of the two kinds of positives can be influenced by the cardinality of different classes, leading to a biased feature space.
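A bare-bones version of the sample-reweighting idea in "Cascaded Teaching Transformers" above: a weight generator scores each series and the student minimizes the weighted loss. The shapes and the `weight_gen`/`student` interfaces are hypothetical, and the paper's multi-level optimization between teachers and student is omitted.

```python
import torch
import torch.nn.functional as F

def weighted_forecast_loss(student, weight_gen, x, y):
    # x: (B, T, D) input series, y: (B, H) forecast targets (assumed shapes).
    w = torch.softmax(weight_gen(x).squeeze(-1), dim=0)        # (B,) dataset weights
    per_sample = F.mse_loss(student(x), y, reduction="none").mean(dim=1)
    return (w * per_sample).sum()                              # reweighted objective
```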
The decoupled supervised contrastive loss decouples the two kinds of positives, removing the influence of the imbalanced dataset. To improve the discriminative ability of the learned model on the tail classes, patch-based self-distillation crops small patches from the global view of an image. These small patches can encode the shared visual patterns between different images, and thus can be used to transfer similarity relationship knowledge. Experiments on several long-tailed classification benchmarks demonstrate the superiority of our method. For instance, it achieves 57.7% top-1 accuracy on the ImageNet-LT dataset. Combined with the ensemble-based method, the performance can be further boosted to 59.7%. Our code will be released.","long-tailed, self distillation" Can CNNs Be More Robust Than Transformers?,https://openreview.net/forum?id=TKIFuQHHECj,https://openreview.net/pdf?id=TKIFuQHHECj,"we show CNNs can be as robust as, or even more robust than, Transformers","The recent success of Vision Transformers is shaking the decade-long dominance of Convolutional Neural Networks (CNNs) in image recognition. Specifically, in terms of robustness on out-of-distribution samples, recent research finds that Transformers are inherently more robust than CNNs, regardless of different training setups. Moreover, it is believed that such superiority of Transformers should largely be credited to their self-attention-like architectures per se. In this paper, we question that belief by closely examining the design of Transformers. Our findings lead to three highly effective architecture designs for boosting robustness, yet simple enough to be implemented in several lines of code, namely a) patchifying input images, b) enlarging kernel size, and c) reducing activation layers and normalization layers. Bringing these components together, we are able to build pure CNN architectures without any attention-like operations that are as robust as, or even more robust than, Transformers. We hope this work can help the community better understand the design of robust neural architectures.","CNNs, Transformers, Out-of-Distribution Robustness" motifNet: Functional motif interactions discovered in mRNA sequences with implicit neural representation learning,https://openreview.net/forum?id=kPHOmwdHHZX,https://openreview.net/pdf?id=kPHOmwdHHZX,We use position coordinates for mRNA sequence representation to extract important motif patterns and evaluate them on diverse datasets.,"Predicting the functional sequence patterns in mRNA, or motifs, is an important way to uncover the mechanisms of the cell life cycle in clinical research and drug discovery. Despite emerging studies trying to build neural networks for mRNA event prediction with high accuracy, most of these models are black boxes in which the relationship between internal network components and the mRNA events cannot be explained. Hence, we propose motifNet, a generalized framework for position-specific pattern extraction and visualization. MotifNet is a lightweight neural network using position coordinates to represent various motif interaction patterns in human mRNA sequences. By navigating sequence and positional inputs in the encoded latent space, we can interactively generate motif patterns or positional effect scores of mRNA activities conditioned on a defined input.
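To make the decoupling of the two kinds of positives in "Decoupled and Patch-based Contrastive Learning" above concrete, here is one plausible reading as code. It is a paraphrase of the abstract, not the authors' exact loss: the view positive and the class positives contribute through separate, per-anchor-normalized terms so class cardinality no longer skews their balance.

```python
import torch

def decoupled_supcon(z1, z2, labels, tau=0.1):
    # z1, z2: (N, d) L2-normalized embeddings of two augmented views.
    z = torch.cat([z1, z2], dim=0)                    # (2N, d)
    y = torch.cat([labels, labels], dim=0)
    sim = z @ z.t() / tau
    sim.fill_diagonal_(float("-inf"))                 # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    n = z1.size(0)
    idx = torch.arange(n)
    # Term 1: the other view of the same image (independent of class size).
    view_pos = torch.cat([log_prob[idx, idx + n], log_prob[idx + n, idx]])
    loss_view = -view_pos.mean()

    # Term 2: other same-class samples, averaged per anchor so that
    # large classes do not dominate the gradient.
    same = (y[:, None] == y[None, :]).float()
    same.fill_diagonal_(0)
    same[idx, idx + n] = 0
    same[idx + n, idx] = 0                            # drop the view positives
    cnt = same.sum(1).clamp(min=1)
    loss_class = -((log_prob * same).sum(1) / cnt).mean()
    return 0.5 * (loss_view + loss_class)
```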
Furthermore, we find that violations of motif patterns identified by motifNet in real human mRNA variants are significantly associated with disease-related cell dysfunction.","RNA motif, implicit neural representation, sequence generation" Decoupled Mixup for Data-efficient Learning,https://openreview.net/forum?id=uCNUPuhyuU,https://openreview.net/pdf?id=uCNUPuhyuU,This paper proposes a decoupled mixup (DM) loss that can adaptively mine discriminative features without losing smoothness and improve various mixup methods in a plug-and-play manner.,"Mixup is an efficient data augmentation approach that improves the generalization of neural networks by smoothing the decision boundary with mixed data. Recently, dynamic mixup methods have effectively improved previous static policies (e.g., linear interpolation) by maximizing salient regions or maintaining the target in mixed samples. The difference is that mixed samples generated by dynamic policies are more instance-discriminative than static ones, e.g., the foreground objects are decoupled from the background. However, optimizing mixup policies with dynamic methods in input space is computationally expensive compared to static ones. Hence, we transfer the decoupling mechanism of dynamic methods from the data level to the objective function level and propose the general decoupled mixup (DM) loss. The primary effect is that DM can adaptively focus on discriminative features without losing the original smoothness of the mixup while avoiding heavy computational overhead. As a result, DM enables static mixup methods to match or even exceed the performance of dynamic methods. This also raises an interesting objective-design problem for mixup training: we need to focus on both smoothing the decision boundaries and identifying discriminative features. Extensive experiments on supervised and semi-supervised learning benchmarks across seven classification datasets validate the effectiveness of DM when equipped with various mixup methods.","Data Augmentation, Mixup, Data-efficient Learning, Semi-supervised Learning" FAIRER: Fairness as Decision Rationale Alignment,https://openreview.net/forum?id=_g-D1zNps_,https://openreview.net/pdf?id=_g-D1zNps_,,"Deep neural networks (DNNs) have achieved remarkable accuracy, but they often suffer from fairness issues, as deep models typically show distinct accuracy differences among some specific subgroups (e.g., males and females). Existing research addresses this critical issue by employing fairness-aware loss functions to constrain the last-layer outputs and directly regularize DNNs. Although the fairness of DNNs is improved, it is unclear how the trained network makes a fair prediction, which limits future fairness improvements. In this paper, we investigate fairness from the perspective of decision rationale and define neuron parity scores to characterize the fair decision process of networks by analyzing neuron behaviors in various subgroups. Extensive empirical studies show that the unfairness issue could arise from the unaligned decision rationales of subgroups. Existing fairness regularization terms fail to achieve decision rationale alignment because they only constrain last-layer outputs while ignoring intermediate neuron alignment. To address the issue, we formulate fairness as a new task, i.e., decision rationale alignment, which requires DNNs’ neurons to have consistent responses across subgroups at both intermediate processes and the final prediction.
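For reference alongside "Decoupled Mixup" above, this is the standard static mixup step (linear interpolation with a Beta-sampled coefficient) that DM's objective builds on; DM replaces the coupled, interpolated cross-entropy below with its decoupled loss.

```python
import torch
import torch.nn.functional as F

def mixup_step(model, x, y, alpha=1.0):
    # Sample the mixing coefficient and a random pairing of the batch.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]          # static linear interpolation
    logits = model(x_mix)
    # Vanilla mixup couples both targets through one smoothed objective.
    return lam * F.cross_entropy(logits, y) + (1 - lam) * F.cross_entropy(logits, y[perm])
```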
To make this idea practical during optimization, we relax the naive objective function and propose gradient-guided parity alignment, which encourages gradient-weighted consistency of neurons across subgroups. Extensive experiments on a variety of datasets show that our method can improve fairness while maintaining high accuracy, outperforming other baselines by a large margin. We have released our codes at https://anonymous.4open.science/r/fairer_submission-F176/.","Fairness, deep neural network, decision rationale alignment, neuron parity score" Rethinking Data Augmentation for Improving Transferable Targeted Attacks,https://openreview.net/forum?id=go0P5gsBE2,https://openreview.net/pdf?id=go0P5gsBE2,,"Diverse input patterns induced by data augmentations prevent crafted adversarial perturbations from over-fitting to white-box models, hence improving the transferability of adversarial examples for non-targeted attacks. Nevertheless, current data augmentation methods usually perform unsatisfactorily for transferable targeted attacks. In this paper, we revisit the commonly used data augmentation method DI, which was originally proposed to improve non-targeted transferability, and discover that its unsatisfactory performance in targeted transferability is mainly caused by its unreasonably restricted diversity. We also show that directly increasing the diversity of input patterns offers better transferability. In addition, our analysis of attention heatmaps suggests that incorporating more diverse input patterns into optimizing perturbations enlarges the discriminative regions of the target class in the white-box model. Therefore, these generated perturbations can activate discriminative regions of other models with high probability. Motivated by this observation, we propose to optimize perturbations with a set of augmented images that have various discriminative regions of the target class in the white-box model. Specifically, we design a data augmentation method, which includes multiple image transformations that can significantly change discriminative regions of the target class, to improve transferable targeted attacks by a large margin. On the ImageNet-compatible dataset, our method achieves an average targeted attack success rate of 92.5\% in the ensemble transfer scenario, shedding light on transfer-based targeted attacks. ","Data augmentation, Transferable targeted attacks" A Deep Dive into the Stability-Plasticity Dilemma in Class-Incremental Learning,https://openreview.net/forum?id=Qx0vjIvlkev,https://openreview.net/pdf?id=Qx0vjIvlkev,,"A fundamental objective in class-incremental learning is to strike a balance between stability and plasticity, where models should be both stable enough to retain knowledge learnt from previously seen classes, and plastic enough to learn concepts from new classes. While previous works demonstrate strong performance on class-incremental benchmarks, it is not clear whether their success comes from the models being stable, plastic, or a mixture of both. In this paper we aim to shed light on how effectively recent class-incremental learning algorithms address the stability-plasticity trade-off. We establish analytical tools that help measure the stability and plasticity of feature representations, and employ such tools to investigate models trained with various class-incremental algorithms on large-scale class-incremental benchmarks.
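The general recipe motivated by "Rethinking Data Augmentation for Improving Transferable Targeted Attacks" above, optimizing a targeted perturbation against gradients averaged over diverse augmented copies, can be sketched as follows. The specific transforms and hyperparameters are placeholders, not the paper's augmentation set.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms as T

def targeted_attack(model, x, target, eps=16/255, steps=200, lr=2/255, n_aug=4):
    # Placeholder augmentations; the paper designs transforms that strongly
    # change the target class's discriminative regions.
    augment = T.Compose([T.RandomResizedCrop(x.shape[-1], scale=(0.5, 1.0)),
                         T.RandomHorizontalFlip()])
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        grad = torch.zeros_like(x)
        for _ in range(n_aug):                     # average gradients over copies
            loss = F.cross_entropy(model(augment(x + delta)), target)
            grad += torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta -= lr * grad.sign()              # descend: targeted objective
            delta.clamp_(-eps, eps)                # L-infinity budget
            delta.copy_((x + delta).clamp(0, 1) - x)  # keep valid pixel range
    return (x + delta).detach()
```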
Surprisingly, we find that the majority of class-incremental algorithms heavily favor stability over plasticity, to the extent that the feature extractor of a model trained on the initial set of classes is no less effective than that of the final incremental model. Our observations not only inspire two simple algorithms that highlight the importance of analyzing feature representations, but also suggest that class-incremental research, in general, should strive for better feature representation learning.","continual learning, class-incremental learning, analysis" KITE: A Kernel-based Improved Transferability Estimation Method,https://openreview.net/forum?id=rbHfqQ9T61E,https://openreview.net/pdf?id=rbHfqQ9T61E,"we propose a novel transferability estimation method, called KITE, for selecting the most effective pre-trained model for fine-tuning on a target dataset","Transferability estimation has emerged as an important problem in transfer learning. A transferability estimation method takes as inputs a set of pre-trained models and decides which pre-trained model can deliver the best transfer learning performance. Existing methods tackle this problem by analyzing the output of the pre-trained model or by comparing the pre-trained model with a probe model trained on the target dataset. However, neither is sufficient to provide reliable and efficient transferability estimations. In this paper, we present a novel perspective and introduce \textsc{Kite}, as a \underline{K}ernel-based \underline{I}mproved \underline{T}ransferability \underline{E}stimation method. \textsc{Kite} is based on the key observations that the separability of the pre-trained features and the similarity of the pre-trained features to random features are two important factors for estimating transferability. Inspired by kernel methods, \textsc{Kite} adopts \emph{centered kernel alignment} as an effective way to assess feature separability and feature similarity. \textsc{Kite} is easy to interpret, fast to compute, and robust to the target dataset size. We evaluate the performance of \textsc{Kite} on a recently introduced large-scale model selection benchmark. The benchmark contains 8 source datasets, 6 target datasets, and 4 architectures with a total of 32 pre-trained models. Extensive results show that \textsc{Kite} outperforms existing methods by a large margin for transferability estimation.","Transfer Learning, Transferability Estimation" Risk-Aware Reinforcement Learning with Coherent Risk Measures and Non-linear Function Approximation,https://openreview.net/forum?id=-RwZOVybbj,https://openreview.net/pdf?id=-RwZOVybbj,We propose a unified framework to analyze the regret of risk-aware RL policy that uses a coherent risk measure in conjunction with non-linear function approximation.,"We study the risk-aware reinforcement learning (RL) problem in the episodic finite-horizon Markov decision process with unknown transition and reward functions. In contrast to the risk-neutral RL problem, we consider minimizing the risk of having low rewards, which arises due to the intrinsic randomness of the MDPs and imperfect knowledge of the model. Our work provides a unified framework to analyze the regret of risk-aware RL policies with coherent risk measures in conjunction with non-linear function approximation, which gives the first sub-linear regret bounds in this setting.
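For concreteness, the centered kernel alignment that \textsc{Kite} above adopts has a standard linear form (Kornblith et al.); a minimal implementation, where comparing pre-trained features against random features is one assumed use:

```python
import torch

def linear_cka(X, Y):
    # X: (n, d1), Y: (n, d2) feature matrices for the same n inputs.
    X = X - X.mean(0, keepdim=True)            # center each feature dimension
    Y = Y - Y.mean(0, keepdim=True)
    hsic = (X.t() @ Y).norm() ** 2             # ||X^T Y||_F^2
    return hsic / ((X.t() @ X).norm() * (Y.t() @ Y).norm())

# Assumed usage: low alignment with random features and high class
# separability both suggest a more transferable pre-trained model.
```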
Finally, we validate our theoretical results via empirical experiments on synthetic and real-world data.","Risk-Aware Reinforcement Learning, Coherent Risk Measures, Non-linear Function Approximation" "A Minimalist Dataset for Systematic Generalization of Perception, Syntax, and Semantics",https://openreview.net/forum?id=kIPyTuEZuAK,https://openreview.net/pdf?id=kIPyTuEZuAK,"We take inspiration from arithmetic and present a new benchmark for studying systematic generalization of perception, syntax, and semantics.","Inspired by humans' remarkable ability to master arithmetic and generalize to unseen problems, we present a new dataset, HINT, to study machines' capability of learning generalizable concepts at three levels: perception, syntax, and semantics. In HINT, machines are tasked to learn how concepts are perceived from raw signals such as images (i.e., perception), how multiple concepts are structurally combined to form a valid expression (i.e., syntax), and how concepts are realized to afford various reasoning tasks (i.e., semantics), all in a weakly supervised manner. With a focus on systematic generalization, we carefully design a five-fold test set to evaluate both the interpolation and the extrapolation of learned concepts w.r.t. the three levels. We further design a few-shot learning split to test whether models could quickly learn new concepts and generalize them to more complex scenarios. To understand existing models' limitations, we conduct extensive experiments with various sequence-to-sequence models, including RNNs, Transformers, and GPT-3 (with chain-of-thought prompting). The results suggest that current models still struggle with extrapolation to long-range syntactic dependencies and semantics. Models show a significant gap toward human-level generalization when tested with new concepts in a few-shot setting. Moreover, we find that it is infeasible to solve HINT by simply scaling up the dataset and the model size; this strategy barely helps the extrapolation over syntax and semantics. Finally, in zero-shot GPT-3 experiments, chain-of-thought prompting shows impressive results and significantly boosts the test accuracy. We believe the proposed dataset together with the experimental findings will be of great interest to the community working on systematic generalization.","Systematic Generalization, Concept Learning" Bi-level Physics-Informed Neural Networks for PDE Constrained Optimization using Broyden's Hypergradients,https://openreview.net/forum?id=kkpL4zUXtiw,https://openreview.net/pdf?id=kkpL4zUXtiw,,"Deep learning based approaches like Physics-informed neural networks (PINNs) and DeepONets have shown promise on solving PDE constrained optimization (PDECO) problems. However, existing methods are insufficient to handle those PDE constraints that have a complicated or nonlinear dependency on optimization targets. In this paper, we present a novel bi-level optimization framework to resolve the challenge by decoupling the optimization of the targets and constraints. For the inner loop optimization, we adopt PINNs to solve the PDE constraints only. For the outer loop, we design a novel method using Broyden's method based on the Implicit Function Theorem (IFT), which is efficient and accurate for approximating hypergradients. We further present theoretical explanations and error analysis of the hypergradients computation.
Extensive experiments on multiple large-scale and nonlinear PDE constrained optimization problems demonstrate that our method achieves state-of-the-art results compared with strong baselines.","PINN, machine learning, bi-level optimization" Learning Continuous Grasping Function with a Dexterous Hand from Human Demonstrations,https://openreview.net/forum?id=020JErJMvVZ,https://openreview.net/pdf?id=020JErJMvVZ,We propose Continuous Grasping Function (CGF) to generate grasping motion for manipulation with a dexterous hand.,"We propose to learn to generate grasping motion for manipulation with a dexterous hand using implicit functions. With continuous time inputs, the model can generate a continuous and smooth grasping plan. We name the proposed model Continuous Grasping Function (CGF). CGF is learned via generative modeling with a Conditional Variational Autoencoder using 3D human demonstrations. We first convert large-scale human-object interaction trajectories to robot demonstrations via motion retargeting, and then use these demonstrations to train CGF. During inference, we perform sampling with CGF to generate different grasping plans in the simulator and select the successful ones to transfer to the real robot. By training on diverse human data, our CGF enables generalization to manipulating multiple objects. Compared to previous planning algorithms, CGF is more efficient and achieves a significant improvement in success rate when transferred to grasping with the real Allegro Hand. Our anonymous project page is available at https://continuous-grasping.github.io/.","Dexterous Grasping, Implicit Function, Generative Model, Sim2Real" Hazard Gradient Penalty for Survival Analysis,https://openreview.net/forum?id=xQCk26Pp00,https://openreview.net/pdf?id=xQCk26Pp00,,"Survival analysis appears in various fields such as medicine, economics, engineering, and business. Recent studies have shown that the Ordinary Differential Equation (ODE) modeling framework unifies many existing survival models while being flexible and widely applicable. However, naively applying the ODE framework to survival analysis problems may yield a density function that changes fiercely with respect to the covariates, which can worsen the model’s performance. Though we can apply L1 or L2 regularizers to the ODE model, their effect on the ODE modeling framework is barely known. In this paper, we propose the hazard gradient penalty (HGP) to enhance the performance of a survival analysis model. Our method imposes constraints on local data points by regularizing the gradient of the hazard function with respect to the data point. Our method applies to any survival analysis model, including the ODE modeling framework, and is easy to implement. We theoretically show that our method is related to minimizing the KL divergence between the density function at a data point and that of its neighboring points. Experimental results on three public benchmarks show that our approach outperforms other regularization methods. ","survival analysis, gradient penalty, KL divergence" Model-Agnostic Meta-Attack: Towards Reliable Evaluation of Adversarial Robustness,https://openreview.net/forum?id=4Ff0zhHYxwl,https://openreview.net/pdf?id=4Ff0zhHYxwl,,"The vulnerability of deep neural networks to adversarial examples has motivated an increasing number of defense strategies for promoting model robustness. However, the progress is usually hampered by insufficient robustness evaluations.
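The hazard gradient penalty described above reduces to a gradient-norm regularizer on the covariates; a minimal sketch, where `hazard_fn` is an assumed differentiable component of any survival model and the combination with the likelihood term is illustrative:

```python
import torch

def hazard_gradient_penalty(hazard_fn, x, t):
    # Penalize how sharply the hazard changes around each data point.
    x = x.detach().requires_grad_(True)
    h = hazard_fn(x, t)                              # hazard at covariates x, time t
    grad = torch.autograd.grad(h.sum(), x, create_graph=True)[0]
    return grad.pow(2).sum(dim=1).mean()             # squared-norm penalty

# Assumed usage: total = nll + lam * hazard_gradient_penalty(model.hazard, x, t)
```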
As the de facto standard to evaluate adversarial robustness, adversarial attacks typically solve an optimization problem of crafting adversarial examples with an iterative process. However, existing attacks are usually limited by hand-designed optimization algorithms, leading to less accurate robustness evaluations. In this paper, we propose a Model-Agnostic Meta-Attack (MAMA) approach to discover stronger attack algorithms automatically. Our method learns the optimizer in adversarial attacks parameterized by a recurrent neural network, which is trained over a class of data samples and defense models to produce effective update directions during adversarial example generation. Furthermore, we develop a model-agnostic training algorithm to improve the generalization ability of the learned optimizer when attacking unseen defenses. Our approach can be flexibly incorporated with various attacks and consistently improves their performance. Extensive experiments demonstrate the effectiveness and efficiency of the learned attacks by MAMA, e.g., MAMA achieves a 2x speedup over the state-of-the-art AutoAttack while obtaining lower robust test accuracy on all adopted defense models. Therefore, MAMA leads to a more reliable and efficient evaluation of adversarial robustness. ","Adversarial attacks, robust evaluation" Rethink Depth Separation with Intra-layer Links,https://openreview.net/forum?id=o3du8VqB4wL,https://openreview.net/pdf?id=o3du8VqB4wL,,"The depth separation theory is nowadays widely accepted as an effective explanation for the power of depth, which consists of two parts: i) there exists a function representable by a deep network; ii) such a function cannot be represented by a shallow network whose width is lower than a threshold. Here, we report that adding intra-layer links can greatly improve a network's representation capability, as shown through bound estimation, explicit construction, and functional space analysis. Then, we modify the depth separation theory by showing that a shallow network with intra-layer links does not need to go as wide as before to express some hard functions constructed by a deep network. Such functions include the renowned ""sawtooth"" functions. Our results supplement the existing depth separation theory by examining its limit in a broader domain. Also, our results suggest that once configured with an appropriate structure, a shallow and wide network may have expressive power on a par with a deep network. ", Reach the Remote Neighbors: Dual-Encoding Transformer for Graphs,https://openreview.net/forum?id=p-N-CoSyszH,https://openreview.net/pdf?id=p-N-CoSyszH,A Transformer model for diverse graph representation learning tasks,"Despite recent successes in natural language processing and computer vision, Transformers suffer from a scalability problem when dealing with graphs. Computing node-to-node attentions is infeasible on complicated graphs, e.g., knowledge graphs. One solution is to consider only the near neighbors, which, however, loses the key merit of Transformers: attending to elements at any distance. In this paper, we propose a new Transformer architecture, named dual-encoding Transformer (DET), which has a structural encoder to aggregate information from near neighbors and a semantic encoder to focus on useful semantically close neighbors. The two encoders can be combined to boost each other's performance.
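The "sawtooth" functions mentioned in "Rethink Depth Separation" above are the classic depth-separation hard instances: composing a ReLU-representable tent map k times produces exponentially many oscillations, cheap for deep networks but expensive for shallow ones. A small numerical illustration:

```python
import numpy as np

def tent(x):
    # On [0, 1], tent(x) = min(2x, 2-2x) = 2*relu(x) - 4*relu(x - 0.5),
    # i.e., representable by a tiny ReLU layer.
    return np.minimum(2 * x, 2 * (1 - x))

def sawtooth(x, k):
    # k-fold composition yields 2**(k-1) teeth on [0, 1].
    for _ in range(k):
        x = tent(x)
    return x

xs = np.linspace(0, 1, 1025)
turns = np.sum(np.diff(np.sign(np.diff(sawtooth(xs, 4)))) != 0)
print(f"k=4 composition has {turns} interior turning points")  # grows like 2**k
```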
Our experiments demonstrate that DET achieves superior performance compared to the respective state-of-the-art attention-based methods in dealing with molecules, networks, and knowledge graphs.","knowledge graph prediction, node classification, graph property prediction, MSA Transformer" The Geometry of Self-supervised Learning Models and its Impact on Transfer Learning,https://openreview.net/forum?id=qoSNQprgGDs,https://openreview.net/pdf?id=qoSNQprgGDs,,"Self-supervised learning~(SSL) has emerged as a desirable paradigm in computer vision due to the inability of supervised models to learn representations that can generalize in domains with limited labels. The recent popularity of SSL has led to the development of several models that make use of diverse training strategies, architectures, and data augmentation policies with no existing unified framework to study or assess their effectiveness in transfer learning. We propose a data-driven geometric strategy to analyze different SSL models using local neighborhoods in the feature space induced by each. Unlike existing approaches that consider mathematical approximations of the parameters, individual components, or optimization landscape, our work aims to explore the geometric properties of the representation manifolds learned by SSL models. Our proposed manifold graph metrics~(MGMs) provide insights into the geometric similarities and differences between available SSL models, their invariances with respect to specific augmentations, and their performances on transfer learning tasks. Our key findings are twofold: $(i)$ contrary to popular belief, the geometry of SSL models is not tied to their training paradigm (contrastive, non-contrastive, and cluster-based); $(ii)$ we can predict the transfer learning capability for a specific model based on the geometric properties of its semantic and augmentation manifolds.","Self-supervised learning, Transfer Learning, Graphs, Geometry, Embedding, Computer Vision" Only For You: Deep Neural Anti-Forwarding Watermark Preserves Image Privacy,https://openreview.net/forum?id=5udLUhg1E5,https://openreview.net/pdf?id=5udLUhg1E5,,"In recent decades, messaging apps (e.g., Facebook Messenger, WhatsApp, WeChat, Snapchat) have expanded exponentially, with a huge amount of private image sharing taking place daily. However, within these apps, the possible unauthorised or malicious image forwarding among users poses significant threats to personal image privacy. In specific situations, we hope to send private and confidential images (e.g., personal selfies) in an `only for you' manner. Given limited existing studies on this topic, for the first time, we propose the Deep Neural Anti-Forwarding Watermark (DeepRAFT) that enables media platforms to check and block any unauthorised forwarding of protected images through injecting non-fragile and invisible watermarks. To this end, we jointly train a DeepRAFT encoder and scanner, where the encoder embeds a confidentiality stamp into images as watermarks, and the scanner learns to detect them. To ensure that the technique is robust and resistant to tampering, we involve a series of data augmentations (mounted on a stochastic concatenation process) and adversarial defenses (i.e., adversarial training and randomized smoothing) towards both common image corruptions (e.g., rotation, cropping, color jitters, defocus blur, perspective warping, pixel noise, JPEG compression) and adversarial attacks (i.e., under both black and white box settings).
Experiments on the Mirflickr and MetFaces datasets demonstrate that DeepRAFT can efficiently and robustly embed and detect the anti-forwarding watermark in images. Moreover, the trained DeepRAFT encoder and scanner can be easily transferred in a zero-shot manner even with a significant domain shift. We release our code and models to inspire studies in this anti-forwarding area at \url{link.available.upon.acceptance.}", When Do Models Generalize? A Perspective From Data-Algorithm Compatibility,https://openreview.net/forum?id=7t3ggLCjl7G,https://openreview.net/pdf?id=7t3ggLCjl7G,"We propose data-algorithm-compatibility to characterize generalization, and study it in the overparameterized linear regression regime.","One of the major open problems in machine learning is to characterize generalization in the overparameterized regime, where most traditional generalization bounds become inconsistent (Nagarajan and Kolter, 2019). In many scenarios, their failure can be attributed to obscuring the crucial interplay between the training algorithm and the underlying data distribution. To address this issue, we propose a concept named compatibility, which quantitatively characterizes generalization in a manner that is both data-relevant and algorithm-relevant. By considering the entire training trajectory and focusing on early-stopping iterates, compatibility exploits both the data and the algorithm information and is therefore a more suitable notion for generalization. We validate this by theoretically studying compatibility under the setting of solving overparameterized linear regression with gradient descent. Specifically, we perform a data-dependent trajectory analysis and derive a sufficient condition for compatibility in such a setting. Our theoretical results demonstrate that in the sense of compatibility, generalization holds with significantly weaker restrictions on the problem instance than the previous last-iterate analysis.","generalization, data-algorithm compatibility, early stopping, overparameterized linear regression" On the Saturation Effect of Kernel Ridge Regression,https://openreview.net/forum?id=tFvr-kYWs_Y,https://openreview.net/pdf?id=tFvr-kYWs_Y,,"The saturation effect refers to the phenomenon that kernel ridge regression (KRR) fails to achieve the information-theoretic lower bound when the smoothness of the underlying ground-truth function exceeds a certain level. The saturation effect has been widely observed in practice, and a saturation lower bound for KRR has been conjectured for decades. In this paper, we provide a proof of this long-standing conjecture.","Kernel ridge regression, Saturation effect, Reproducing kernel Hilbert space, Learning theory" Adversarial perturbation based latent reconstruction for domain-agnostic self-supervised learning,https://openreview.net/forum?id=YxjfKeWjuQ9,https://openreview.net/pdf?id=YxjfKeWjuQ9,,"Most self-supervised learning (SSL) methods rely on domain-specific pretext tasks and data augmentations to learn high-quality representations from unlabeled data. Developing those pretext tasks and data augmentations requires expert domain knowledge. In addition, it is not clear why solving certain pretext tasks leads to useful representations. These two issues hinder the wider application of SSL to different domains. To overcome such limitations, we propose adversarial perturbation based latent reconstruction (APLR) for domain-agnostic self-supervised learning.
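For reference, the kernel ridge regression estimator whose saturation the paper above analyzes is $\hat f = K(K + n\lambda I)^{-1}y$ (conventions for scaling the regularizer by $n$ vary across texts); in code:

```python
import numpy as np

def krr_fit(K, y, lam):
    # K: (n, n) kernel Gram matrix, y: (n,) targets.
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), y)   # alpha coefficients

def krr_predict(K_test_train, alpha):
    # f_hat(x) = sum_i alpha_i k(x, x_i)
    return K_test_train @ alpha
```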
In APLR, a neural network is trained to generate adversarial noise to perturb unlabeled training samples so that domain-specific augmentations are not required. The pretext task in APLR is to reconstruct the latent representation of a clean sample from a perturbed sample. We show that representation learning via latent reconstruction is closely related to multi-dimensional Hirschfeld-Gebelein-Rényi (HGR) maximal correlation and has theoretical guarantees on the linear probe error. To demonstrate the effectiveness of APLR, the proposed method is applied to various domains such as tabular data, images, and audio. Empirical results indicate that APLR not only outperforms existing domain-agnostic SSL methods, but also closes the performance gap to domain-specific SSL methods. In many cases, APLR also outperforms training the full network in a supervised manner.","self-supervised learning, representation learning, domain-agnostic" Unsupervised Model Selection for Time Series Anomaly Detection,https://openreview.net/forum?id=gOZ_pKANaPW,https://openreview.net/pdf?id=gOZ_pKANaPW,"This paper answers the question-- Given an unlabeled dataset and a set of candidate time series anomaly detectors, how can we select the most accurate model?","Anomaly detection in time-series has a wide range of practical applications. While numerous anomaly detection methods have been proposed in the literature, a recent survey concluded that no single method is the most accurate across various datasets. To make matters worse, anomaly labels are scarce and rarely available in practice. The practical problem of selecting the most accurate model for a given dataset without labels has received little attention in the literature. This paper answers this question, i.e., given an unlabeled dataset and a set of candidate anomaly detectors, how can we select the most accurate model? To this end, we identify three classes of surrogate (unsupervised) metrics, namely, prediction error, model centrality, and performance on injected synthetic anomalies, and show that some metrics are highly correlated with standard supervised anomaly detection performance metrics such as the $F_1$ score, but to varying degrees. We formulate metric combination with multiple imperfect surrogate metrics as a robust rank aggregation problem. We then provide theoretical justification for the proposed approach. Large-scale experiments on multiple real-world datasets demonstrate that our proposed unsupervised approach is as effective as selecting the most accurate model based on partially labeled data.","Time Series, Anomaly Detection, Model Selection, Unsupervised Learning, Rank Aggregation" Constrained Hierarchical Deep Reinforcement Learning with Differentiable Formal Specifications,https://openreview.net/forum?id=LUOSN8opID1,https://openreview.net/pdf?id=LUOSN8opID1,This paper uses differentiable formal specifications to constrain the policy updates in hierarchical deep reinforcement learning. ,"Formal logic specifications are a useful tool to describe desired agent behavior and have been explored as a means to shape rewards in Deep Reinforcement Learning (DRL) systems over a variety of problems and domains. Prior work, however, has failed to consider the possibility of making these specifications differentiable, which would yield a more informative signal of the objective via the specification gradient.
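A toy version of the metric-combination step in "Unsupervised Model Selection" above: rank the candidate models under each surrogate metric and aggregate the ranks. This is a simple Borda-style average; the paper's robust aggregator is more sophisticated, and the scores below are made up for illustration.

```python
import numpy as np

def aggregate_ranks(scores):
    # scores: (n_models, n_metrics), higher = better under each surrogate.
    ranks = scores.argsort(axis=0).argsort(axis=0)   # per-metric rank positions
    return ranks.mean(axis=1)                        # average rank per model

scores = np.array([[0.9, 0.2, 0.8],    # model 0 under three surrogate metrics
                   [0.4, 0.9, 0.7],    # model 1
                   [0.1, 0.1, 0.2]])   # model 2
print("selected model index:", int(np.argmax(aggregate_ranks(scores))))
```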
This paper examines precisely such an approach by exploring a Lagrangian method to constrain policy updates using a differentiable style of temporal logic specifications that associates logic formulae with real-valued quantitative semantics. This constrained learning mechanism is then used in a hierarchical setting where a high-level specification-guided neural network path planner works with a low-level control policy to navigate through planned waypoints. The effectiveness of our approach is demonstrated over four robot dynamics with five different types of Linear Temporal Logic (LTL) specifications. Our demo videos are collected at https://sites.google.com/view/schrl.","Deep Reinforcement Learning, Differentiable Formal Specification Language, Robot Navigation, Robot Planning and Control" Topic Aware Transformer: Domain Shift for Unconditional Text Generation Model,https://openreview.net/forum?id=WOfOf53mVyo,https://openreview.net/pdf?id=WOfOf53mVyo,A domain adaptation framework for adapting PLMs to unconditional text generation tasks.,"Our goal is to adapt pre-trained language models (PLMs) to support unconditional text generation tasks. Because Transformer-based models are pre-trained on more massive and heterogeneous corpora than any specific target corpus, the gap between these corpora and the target corpus raises the question of whether these PLMs will actually benefit this task even after fine-tuning. As the domain adaptation of PLMs needs to bridge this gap, we propose a framework, Topic Aware Transformer (TAT), that adapts PLMs for target-aware text generation while alleviating catastrophic forgetting. The motivation of TAT is to distill target-specific knowledge as topics and steer PLMs toward these topics. This requirement and motivation lead us to introduce a topic steering layer (TSL) as an additional layer, and Topic Distribution Modeling (TDM) as a training task. Experiments show that these components resolve the gap caused by the domain shift and can tailor PLMs to generate text that better reflects a given small fine-tuning corpus.","Text generation, Domain adaptation, Domain shift, Transformers" PromptCast: A New Prompt-based Learning Paradigm for Time Series Forecasting,https://openreview.net/forum?id=bMUZXhuFEf,https://openreview.net/pdf?id=bMUZXhuFEf,,"This paper studies the time series forecasting problem from a whole new perspective. In the existing SOTA time-series representation learning methods, the forecasting models take a sequence of numerical values as input and yield numerical values as output. The existing SOTA models are largely based on the Transformer architecture, modified with multiple encoding mechanisms to incorporate the context and semantics around the historical data. In this paper, we approach representation learning of time-series from the paradigm of prompt-based natural language modeling. Inspired by the successes of pre-trained language foundation models, we ask whether these models can also be adapted to solve time-series forecasting. Thus, we propose a new forecasting paradigm: prompt-based time series forecasting (PromptCast). In this novel task, the numerical input and output are transformed into prompts. We frame the forecasting task in a sentence-to-sentence manner, which makes it possible to directly apply language models for forecasting purposes. To support and facilitate the research of this task, we also present a large-scale dataset (PISA) that includes three real-world forecasting scenarios.
We evaluate different SOTA numerical-based forecasting methods and language generation models such as BART. The benchmark results with single- and multi-step forecasting settings demonstrate that the proposed prompt-based time series forecasting with language generation models is a promising research direction. In addition, in comparison to conventional numerical-based forecasting, PromptCast shows a much better generalization ability under the zero-shot setting. We believe that the proposed PromptCast task as well as our PISA dataset could provide novel insights and further lead to new research directions in the domain of time-series representation learning and forecasting.", Protein Representation Learning by Geometric Structure Pretraining,https://openreview.net/forum?id=to3qCB3tOh9,https://openreview.net/pdf?id=to3qCB3tOh9,"In this work, we propose a versatile protein structure encoder GearNet, a superior protein structure pre-trainining algorithm Multiview Contrast and a suite of protein structure pre-training baselines.","Learning effective representations of proteins is critical in a variety of tasks in biology such as predicting protein function or structure. Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences and then finetune the models with some labeled data in downstream tasks. Despite the effectiveness of sequence-based approaches, the power of pretraining on known protein structures, which are available in smaller numbers only, has not been explored for protein property prediction, though protein structures are known to be determinants of protein function. In this paper, we propose to pretrain protein representations according to their 3D structures. We first present a simple yet effective encoder to learn the geometric features of a protein. We pretrain the protein structure encoder by leveraging multiview contrastive learning and compare against pretraining with various self-prediction tasks. Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods, while using much less data. All codes and models will be published upon acceptance.","Protein representation learning, self-supervised learning" Conditional Invariances for Conformer Invariant Protein Representations,https://openreview.net/forum?id=xOFD5BMwsB,https://openreview.net/pdf?id=xOFD5BMwsB,"We propose the conditional invariance (CI) framework, which captures input-dependent transformation invariances as an add-on to existing neural network methods. We augment existing protein GNNs with CI to learn conformer invariant representations.","Representation learning for proteins is an emerging area in geometric deep learning. Recent works have factored in both the relational (atomic bonds) and the geometric aspects (atomic positions) of the task, notably bringing together graph neural networks (GNNs) with neural networks for point clouds. The equivariances and invariances to geometric transformations (group actions such as rotations and translations) considered so far treat large molecules as rigid structures. However, in many important settings, proteins can co-exist as an ensemble of multiple stable conformations.
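A plausible template of the kind PromptCast above describes, turning a numerical series into a question sentence; the wording and the scenario are illustrative, not the PISA dataset's exact template.

```python
def to_prompt(city, values, horizon=1):
    # values: historical daily counts; the target is a sentence, not a number.
    history = ", ".join(str(v) for v in values)
    return (f"From day 1 to day {len(values)}, the visitor count in {city} "
            f"was {history}. What will the visitor count be on day "
            f"{len(values) + horizon}?")

print(to_prompt("Sydney", [120, 135, 128]))
# A matching training target sentence could be: "The visitor count will be 131."
```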
The conformations of a protein, however, cannot be described as input-independent transformations of the protein: two proteins may require different sets of transformations in order to describe their sets of viable conformations. To address this limitation, we introduce the concept of conditional transformations (CT). CT can capture protein structure, while respecting the restrictions posed by constraints on dihedral (torsion) angles and steric repulsions between atoms. We then introduce a Markov chain Monte Carlo framework to learn representations that are invariant to these conditional transformations. Our results show that endowing existing baseline models with these conditional transformations helps improve their performance without sacrificing computational efficiency.","Invariances, Conditional Invariances, input dependent invariances, proteins, protein representation learning, conformer invariant representations, graph neural networks, group invariant neural networks" Learning PDE Solution Operator for Continuous Modeling of Time-Series,https://openreview.net/forum?id=B-dM7df9Axo,https://openreview.net/pdf?id=B-dM7df9Axo, PDE-based approach for modeling time-series.,"Learning underlying dynamics from data is important and challenging in many real-world scenarios. Incorporating differential equations (DEs) into the design of continuous networks has drawn much attention recently, with Neural ODE the most prominent example. Most prior works make specific assumptions on the type of DEs or restrict them to first- or second-order DEs, making the model specialized for certain problems. Furthermore, due to the use of numerical integration, they suffer from high computational cost and numerical instability. Building upon the recent Fourier neural operator (FNO), this work proposes a partial differential equation (PDE) based framework that improves the dynamics modeling capability and circumvents the need for costly numerical integration. FNO is hard to apply directly to real applications because it is mainly confined to physical PDE problems. To fill this void, we propose a continuous-in-time FNO to deal with irregularly-sampled time series and provide a theoretical result demonstrating its universality. Moreover, we reveal an intrinsic property of PDEs that increases the stability of the model. Extensive numerical evidence shows that our method can represent a broader range of problems, including synthetic data, image classification, and irregular time series. Our framework opens up a new way for a continuous representation of neural networks that can be readily adopted for real-world applications.","Neural ODEs, Partial differential equations, Neural operators, Time-series" Quantum-Inspired Tensorized Embedding with Application to Node Representation Learning,https://openreview.net/forum?id=bA6h-N17X1E,https://openreview.net/pdf?id=bA6h-N17X1E,,"Node representation learning, a.k.a. network embedding (NE), is an essential technique for network analysis that represents nodes as vectors, which also serves downstream tasks or provides initial input for GNN models. Most existing NE algorithms require space linear in the product of the number of nodes and the embedding dimension to store embeddings. Such a conventional embedding paradigm has two defects: i) it poses a challenge to the deployment of NE algorithms for large-scale networks on devices with limited memory/storage space; ii) model expressiveness is constrained by the limited embedding dimension.
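For context on "Learning PDE Solution Operator" above, this is the generic 1D spectral convolution layer of the FNO family it builds on; it is the standard layer, not the paper's continuous-in-time extension, and the initialization scale is a common convention rather than anything prescribed by the paper.

```python
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    """Mix channels in the Fourier domain, keeping only the lowest `modes`
    frequencies (requires modes <= length // 2 + 1)."""

    def __init__(self, channels, modes):
        super().__init__()
        self.modes = modes
        scale = 1.0 / channels
        self.weight = nn.Parameter(
            scale * torch.randn(channels, channels, modes, dtype=torch.cfloat))

    def forward(self, x):                     # x: (batch, channels, length)
        x_ft = torch.fft.rfft(x)              # to the frequency domain
        out = torch.zeros_like(x_ft)
        out[..., :self.modes] = torch.einsum(
            "bim,iom->bom", x_ft[..., :self.modes], self.weight)
        return torch.fft.irfft(out, n=x.size(-1))  # back to physical space
```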
Inspired by the large Hilbert space of quantum systems, we propose a new NE algorithm, \emph{node2ket}, that imitates the behavior of quantum systems. Theoretically, we analyze how it unifies existing embedding methods, both conventional and tensorized, and explore the ultimate compressive power of the embedding model in terms of space complexity compared with conventional methods. Experiments are conducted on five public real-world networks where methods are evaluated through tasks of network reconstruction and link prediction. On BlogCatalog, our method outperforms all baselines with 1/32 of the training parameters and 1/16 of the running time on the same machine. On DBLP, the reconstruction precision of node2ket is 3 times higher than that of the best baseline, LouvainNE. Source code will be made publicly available. ","network embedding, node representation learning, quantum mechanics, tensorized embedding" Identifying Latent Causal Content for Multi-Source Domain Adaptation,https://openreview.net/forum?id=Mmgcp3MRp7q,https://openreview.net/pdf?id=Mmgcp3MRp7q,,"Multi-source domain adaptation (MSDA) learns to predict the labels in target domain data, under the setting that data from multiple source domains are labelled and data from the target domain are unlabelled. Most methods for this task focus on learning invariant representations across domains. However, their success relies heavily on the assumption that the label distribution remains consistent across domains, which may not hold in general real-world problems. In this paper, we propose a new and more flexible assumption, termed \textit{latent covariate shift}, where a latent content variable $\mathbf{z}_c$ and a latent style variable $\mathbf{z}_s$ are introduced in the generative process, with the marginal distribution of $\mathbf{z}_c$ changing across domains and the conditional distribution of the label given $\mathbf{z}_c$ remaining invariant across domains. We show that although (completely) identifying the proposed latent causal model is challenging, the latent content variable can be identified up to scaling by using its dependence on labels from source domains, together with the identifiability conditions of nonlinear ICA. This motivates us to propose a novel method for MSDA, which learns the invariant label distribution conditional on the latent content variable, instead of learning invariant representations. Empirical evaluation on simulation and real data demonstrates the effectiveness of the proposed method.", Robust Self-Supervised Image Denoising with Cyclic Shift and Noise-Intensity-Aware Uncertainty,https://openreview.net/forum?id=x-wqE_-uhAL,https://openreview.net/pdf?id=x-wqE_-uhAL,,"In self-supervised image denoising, it is challenging to construct paired noisy samples from a single noisy observation, and the quality of samples seriously influences the performance of the denoising model. Strategies for constructing pairs of samples for learning, such as blind-spot convolution and sub-sampling, are widely adopted in existing self-supervised denoising methods. However, these strategies suffer from the severe problems of information underutilization and pixel misalignment, which seriously hinder the further improvement of denoising performance. Furthermore, little attention has been paid to the ability of denoising models to deal with unknown noise, which is of great significance for enhancing their practicality.
To overcome these challenges, we propose a very simple and effective method, called Cyclic Shift, to construct paired noisy images for self-supervised training. This new strategy solves the problems of information underutilization and pixel misalignment without additional computation, and it can be easily embedded into existing denoising methods and significantly boost their performance. In addition, we introduce an uncertainty-aware loss during training to enable the denoising network to perceive the noise intensity and achieve robust denoising performance. We theoretically explain the effectiveness of Cyclic Shift and analyze the ability of the uncertainty loss to endow the network with noise-intensity perception. Extensive experimental results show that our approach achieves state-of-the-art self-supervised image denoising performance.", Trainable Weight Averaging: Efficient Training by Optimizing Historical Solutions,https://openreview.net/forum?id=8wbnpOJY-f,https://openreview.net/pdf?id=8wbnpOJY-f,We propose trainable weight averaging (TWA) to optimize historical solutions in DNNs' training to achieve efficiency and better performance.,"Stochastic gradient descent (SGD) and its variants are considered the de facto methods for training deep neural networks (DNNs). While recent improvements to SGD mainly focus on the descent algorithm itself, few works pay attention to utilizing the historical solutions---as an iterative method, SGD passes through substantial exploration before convergence. A recent interesting attempt is stochastic weight averaging (SWA), which significantly improves generalization by simply averaging the solutions at the tail stage of training. In this paper, we observe that the averaging coefficients can be determined in a trainable manner and propose Trainable Weight Averaging (TWA), a novel optimization method in the reduced subspace spanned by historical solutions. TWA has much greater flexibility and can be applied to the head stage of training to achieve training efficiency while preserving good generalization capability. Further, we propose a distributed training scheme to resolve the memory burden of large-scale training with efficient parallel computation. In extensive numerical experiments, (i) TWA achieves consistent improvements over SWA with less sensitivity to the learning rate; (ii) applying TWA in the head stage of training largely speeds up convergence, saving over $40\%$ of training time on CIFAR and $30\%$ on ImageNet with improved generalization compared with regular training.","efficient training, weight averaging, optimization" Revealing Single Frame Bias for Video-and-Language Learning,https://openreview.net/forum?id=UhEJz3wgLnG,https://openreview.net/pdf?id=UhEJz3wgLnG,,"Training an effective video-and-language model intuitively requires multiple frames as model inputs. However, it is unclear whether using multiple frames is beneficial to downstream tasks, and, if so, whether the performance gain is worth the drastically increased computation and memory costs resulting from using more frames. In this work, we explore single-frame models for video-and-language learning.
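The core of TWA above can be sketched in a few lines: learn softmax coefficients over fixed historical checkpoints by minimizing the loss of the averaged model. The `loss_of_flat_params` hook is an assumed helper that loads a flat parameter vector into the network and evaluates a batch; the subspace projection and the distributed scheme from the abstract are omitted.

```python
import torch

def twa_average(checkpoints, logits_coef):
    # checkpoints: list of fixed, flattened parameter vectors from training;
    # logits_coef: learnable tensor of shape (len(checkpoints),).
    w = torch.softmax(logits_coef, dim=0)
    return sum(wi * ci for wi, ci in zip(w, checkpoints))

def twa_step(checkpoints, logits_coef, loss_of_flat_params, opt):
    flat = twa_average(checkpoints, logits_coef)
    loss = loss_of_flat_params(flat)     # assumed: load params, run one batch
    opt.zero_grad()
    loss.backward()                      # gradients flow only into logits_coef
    opt.step()
    return loss.item()
```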
On a diverse set of video-and-language tasks (including text-to-video retrieval and video question answering), we show the surprising result that, with large-scale pre-training and a proper frame ensemble strategy at inference time, a single-frame trained model that does not consider temporal information can achieve better performance than existing methods that use multiple frames for training. This result reveals the existence of a strong ``static appearance bias'' in popular video-and-language datasets. Therefore, to allow for a more comprehensive evaluation of video-and-language models, we propose two new retrieval tasks based on existing fine-grained action recognition datasets that encourage temporal modeling. Full code and models will be made publicly available upon acceptance.", Deep Declarative Dynamic Time Warping for End-to-End Learning of Alignment Paths,https://openreview.net/forum?id=UClBPxIZqnY,https://openreview.net/pdf?id=UClBPxIZqnY,We introduce a novel differentiable dynamic time warping layer based on continuous time warps and implicit differentiation.,"This paper addresses end-to-end learnable models for time series data that include a temporal alignment step via dynamic time warping (DTW). Existing approaches to differentiable DTW either differentiate through a fixed warping path or apply a continuous relaxation to the min operator found in the recursive steps used to solve the DTW problem. We instead propose a DTW layer based around deep declarative networks. By formulating the DTW problem as a continuous, inequality-constrained optimisation problem, we can compute exact gradients for the solution of the optimal alignment (with respect to the underlying time series) using implicit differentiation. Our formulation yields a major improvement over existing approaches; our DTW layer outputs the entire warping path between two time series as opposed to only the DTW discrepancy value. This enables the specification of downstream loss functions on the alignment path itself, useful, for instance, to improve the precision of predicted alignments when ground-truth alignments are available. We evaluate our method on two such applications, namely the audio-to-score alignment task in music information retrieval and the visual place recognition task in robotics, demonstrating state-of-the-art results in both.","implicit differentiation, sequence matching, time series, visual localization, music" DEEAPR: Controllable Depth Enhancement via Adaptive Parametric Feature Rotation,https://openreview.net/forum?id=3KHzMQUOH4x,https://openreview.net/pdf?id=3KHzMQUOH4x,,"Understanding the depth of an image provides viewers with a better interpretation of its 3D structure. Photographers utilize numerous factors that can affect depth perception to aesthetically improve a scene. Unfortunately, controlling depth perception after the image has been captured is a difficult process as it requires accurate and explicit depth information. Also, defining a quantitative metric of a subjective quality (i.e., depth perception) is difficult, which makes supervised learning a great challenge. To this end, we propose DEpth Enhancement via Adaptive Parametric feature Rotation (DEEAPR), which modulates the perceptual depth of an input scene using a single control parameter without the need for explicit depth information. We first embed content-independent depth perception of a scene by visual representation learning.
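For reference alongside "Deep Declarative Dynamic Time Warping" above, this is the classic DTW dynamic program that the declarative layer recasts as a continuous, inequality-constrained optimisation; the relaxation-based alternatives mentioned in the abstract soften the `min` in this recursion.

```python
import numpy as np

def dtw(a, b):
    # a, b: 1D sequences; returns the DTW discrepancy (not the full path).
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Each cell extends the cheapest of the three allowed moves.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw([0.0, 1.0, 2.0], [0.0, 1.0, 1.0, 2.0]))  # 0.0: perfect alignment
```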
Then, we train the controllable depth enhancer network with a novel modulator, the parametric feature rotation block (PFRB), that allows for continuous modulation of a representative feature. We demonstrate the effectiveness of our proposed approach by verifying each component through an ablation study and comparison to other controllable methods.", Deep Active Anomaly Detection With Diverse Queries,https://openreview.net/forum?id=DmYnLaFGMoc,https://openreview.net/pdf?id=DmYnLaFGMoc,A new active learning approach for deep anomaly detection that leads to systematic improvements over current approaches.,"Selecting informative data points for expert feedback can significantly improve the performance of anomaly detection in various contexts, such as medical diagnostics or fraud detection. In this paper, we determine a set of conditions under which the ranking of anomaly scores generalizes from labeled queries to unlabeled data. Inspired by these conditions, we propose a new querying strategy for active anomaly detection that leads to systematic improvements over current approaches for this problem. It selects a diverse set of data points for labeling, achieving high data coverage with a limited budget. These labeled data points provide weak supervision to the unsupervised anomaly detection problem. However, correctly identifying anomalies requires an estimate of the fraction of anomalies in the data. We show how this anomaly rate can be estimated from the query set by importance-weighting, removing the associated bias due to the non-uniform sampling procedure. Extensive experiments on image, tabular, and video data sets show that our approach results in state-of-the-art active anomaly detection performance.","deep anomaly detection, active learning, diversified sampling" MODULAR FEDERATED CONTRASTIVE LEARNING WITH PEER NORMALIZATION,https://openreview.net/forum?id=zdfUK8wPJl,https://openreview.net/pdf?id=zdfUK8wPJl,,"Despite recent progress in federated learning (FL), the fundamental challenge of training a global model across multiple clients having heterogeneous and class-imbalanced (CIB) data has not been fully resolved. Furthermore, most of the existing works for FL with heterogeneous data assume that the clients have fully labeled data, which might not be practical in real-world scenarios due to the challenges of labeling, especially at the clients. In this paper, we provide a solution for the realistic FL setting in which the clients have unlabeled, heterogeneous, and CIB data. To address the issue of biased gradients in training on heterogeneous and CIB data, we develop a new FL framework, called Modular Federated Contrastive Learning (MFCL). Instead of training a whole deep network across the clients in a federated manner, we propose to train two separate and different network modules at the clients and the server. One is a sensor module that is trained across the clients in a federated manner to extract the data representations from the clients’ unlabeled data, which are sent to the server. The other is a discriminator module at the server, which is trained with contrastive loss on the data representations received from the clients. We also propose a new normalization technique, Peer Normalization (PN), which is customized for contrastive FL to reduce the biases of the gradients resulting from training on the heterogeneous and CIB data across the clients. 
Our experiments show that the proposed MFCL with PN provides high and stable accuracy, achieving state-of-the-art performance when the clients have (severely) heterogeneous and CIB data.","federated learning, contrastive learning, normalization" Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning,https://openreview.net/forum?id=3itjR9QxFw,https://openreview.net/pdf?id=3itjR9QxFw,Generating discrete/categorical data with (continuous) diffusion models; also presents a technique that improves diffusion models in general.,"We present Bit Diffusion: a simple and generic approach for generating discrete data with continuous diffusion models. The main idea behind our approach is to first represent the discrete data as binary bits, and then train a continuous diffusion model to model these bits as real numbers which we call analog bits. To generate samples, the model first generates the analog bits, which are then thresholded to obtain the bits that represent the discrete variables. We further propose two simple techniques, namely Self-Conditioning and Asymmetric Time Intervals, which lead to a significant improvement in sample quality. Despite its simplicity, the proposed approach can achieve strong performance in both discrete image generation and image captioning tasks. For discrete image generation, we significantly improve the previous state of the art on both CIFAR-10 (which has 3K discrete 8-bit tokens) and ImageNet-64x64 (which has 12K discrete 8-bit tokens), outperforming the best autoregressive model in both sample quality (measured by FID) and efficiency. For image captioning on the MS-COCO dataset, our approach achieves competitive results compared to autoregressive models.","Diffusion Models, Discrete Data" NetBooster: Empowering Tiny Deep Learning By Standing on the Shoulders of Deep Giants,https://openreview.net/forum?id=JQK0BsKpE8,https://openreview.net/pdf?id=JQK0BsKpE8,We propose an expansion-then-contraction training strategy on both width and depth dimensions to fully unleash tiny neural networks' potential on large-scale datasets and downstream tasks.,"Tiny deep learning has attracted rapidly growing interest driven by the substantial demand for deep learning solutions in numerous Internet-of-Things (IoT) applications. Nevertheless, due to the under-fitting issue, it is still a challenge to unleash tiny deep learning’s full potential on large-scale datasets. Consequently, tiny neural networks’ (TNNs’) downstream task performance is limited by the inferior representations learned during pretraining. To this end, we propose a framework dubbed NetBooster which empowers tiny deep learning from a novel perspective by augmenting the architecture of TNNs via an expansion-then-contraction strategy. Specifically, during training, our proposed NetBooster first expands each/some layer(s) of a given TNN into multi-layer blocks, favoring the learning of more complex features to generate an expanded counterpart model (i.e., deep giant), and then contracts the expanded layers by gradually removing the non-linear layers from the expanded ones to recover efficiency. NetBooster’s expansion-then-contraction training empowers its trained TNNs to benefit from the superior performance of their expanded counterparts while preserving the TNNs’ original complexity and thus inference efficiency. 
Extensive experiments and ablation studies on two tasks, seven datasets, and six networks validate that NetBooster consistently leads to a nontrivial accuracy boost (e.g., 1.3% ∼ 2.5%) on top of state-of-the-art TNNs on ImageNet and as much as 4.7% higher accuracy on various downstream datasets, while maintaining their inference complexity/efficiency.","Network Training, Transfer Learning" Understanding Edge-of-Stability Training Dynamics with a Minimalist Example,https://openreview.net/forum?id=p7EagBsMAEO,https://openreview.net/pdf?id=p7EagBsMAEO,,"Recently, researchers observed that gradient descent for deep neural networks operates in an ``edge-of-stability'' (EoS) regime: the sharpness (maximum eigenvalue of the Hessian) is often larger than the stability threshold $2/\eta$ (where $\eta$ is the step size). Despite this, the loss oscillates and converges in the long run, and the sharpness at the end is just slightly below $2/\eta$. While many other well-understood nonconvex objectives such as matrix factorization or two-layer networks can also converge despite large sharpness, there is often a larger gap between the sharpness of the endpoint and $2/\eta$. In this paper, we study the EoS phenomenon by constructing a simple function that has the same behavior. We give rigorous analysis for its training dynamics in a large local region and explain why the final converging point has sharpness close to $2/\eta$. Globally we observe that the training dynamics for our example exhibit an interesting bifurcating behavior, which was also observed in the training of neural nets.","edge of stability, nonconvex optimization, gradient descent, training dynamics, scalar network" Learning Proximal Operators to Discover Multiple Optima,https://openreview.net/forum?id=PzBGIu-llo7,https://openreview.net/pdf?id=PzBGIu-llo7,,"Finding multiple solutions of non-convex optimization problems is a ubiquitous yet challenging task. Most past algorithms either apply single-solution optimization methods from multiple random initial guesses or search in the vicinity of found solutions using ad hoc heuristics. We present an end-to-end method to learn the proximal operator of a family of training problems so that multiple local minima can be quickly obtained from initial guesses by iterating the learned operator, emulating the proximal-point algorithm that has fast convergence. The learned proximal operator can be further generalized to recover multiple optima for unseen problems at test time, enabling applications such as object detection. The key ingredient in our formulation is a proximal regularization term, which elevates the convexity of our training loss: by applying recent theoretical results, we show that for weakly-convex objectives with Lipschitz gradients, training of the proximal operator converges globally with a practical degree of over-parameterization. We further present an exhaustive benchmark for multi-solution optimization to demonstrate the effectiveness of our method.", Guiding continuous operator learning through Physics-based boundary constraints,https://openreview.net/forum?id=gfWNItGOES6,https://openreview.net/pdf?id=gfWNItGOES6,We propose novel kernel correction mechanisms for neural operators to satisfy physical boundary constraints which are effective in improving the overall performance.,"Boundary conditions (BCs) are an important group of physics-enforced constraints that solutions of Partial Differential Equations (PDEs) must satisfy at specific spatial locations. 
These constraints carry important physical meaning, and guarantee the existence and uniqueness of the PDE solution. Current neural-network-based approaches that aim to solve PDEs rely only on training data to help the model learn BCs implicitly; however, there is no guarantee that these models satisfy BCs during evaluation. In this work, we propose Boundary enforcing Operator Network (BOON) that enables the BC satisfaction of neural operators by making structural changes to the operator kernel. We provide our refinement procedure, and demonstrate that the solutions obtained by BOON satisfy physics-based BCs such as Dirichlet, Neumann, and periodic conditions. Numerical experiments based on multiple PDEs with a wide variety of applications indicate that the proposed approach ensures satisfaction of BCs, and leads to more accurate solutions over the whole domain. The proposed method exhibits a 2X-20X improvement in accuracy (0.000084 relative $L^2$ error for Burgers' equation).","partial differential equations, operator learning, physics-constraints, boundary conditions, kernel correction" AdaWAC: Adaptively Weighted Augmentation Consistency Regularization for Volumetric Medical Image Segmentation,https://openreview.net/forum?id=eqrJZ-Davr2,https://openreview.net/pdf?id=eqrJZ-Davr2,,"Sample reweighting is an effective strategy for learning from training data coming from a mixture of different subpopulations. However, existing reweighting algorithms do not fully take advantage of the particular type of data distribution encountered in volumetric medical image segmentation, where the training data images are uniformly distributed but their associated data labels fall into two subpopulations---""label-sparse"" and ""label-dense""---depending on whether the data image occurs near the beginning/end of the volumetric scan or the middle. For this setting, we propose AdaWAC as an adaptive weighting algorithm that assigns label-dense samples to supervised cross-entropy loss and label-sparse samples to unsupervised consistency regularization. We provide a convergence guarantee for AdaWAC by appealing to the theory of online mirror descent on saddle point problems. Moreover, we empirically demonstrate that AdaWAC not only enhances segmentation performance and sample efficiency but also improves robustness to the subpopulation shift in labels.","Medical image segmentation, Adaptive weighting, Consistency regularization, Subpopulation shift" Limitations of the NTK for Understanding Generalization in Deep Learning,https://openreview.net/forum?id=KUP3ic8jdGo,https://openreview.net/pdf?id=KUP3ic8jdGo,"1. Neural networks have significantly better scaling than neural tangent kernels. 2. The empirical NTK continues to evolve throughout the training, in contrast with prior work which suggests that it stabilizes after a few epochs of training.","The “Neural Tangent Kernel” (NTK) (Jacot et al., 2018) and its empirical variants have been proposed as proxies to capture certain behaviors of real neural networks. In this work, we study NTKs through the lens of scaling laws, and demonstrate that they fall short of explaining important aspects of neural network generalization. In particular, we demonstrate realistic settings where finite-width neural networks have significantly better data scaling exponents as compared to their corresponding empirical and infinite NTKs at initialization. 
This reveals a more fundamental difference between real networks and NTKs, beyond just a few percentage points of test accuracy. Further, we show that even if the empirical NTK is allowed to be pre-trained on a constant number of samples, the kernel scaling does not catch up to the neural network scaling. Finally, we show that the empirical NTK continues to evolve throughout most of the training, in contrast with prior work which suggests that it stabilizes after a few epochs of training. Altogether, our work establishes concrete limitations of the NTK approach in understanding generalization of real networks on natural datasets.","scaling laws, ntk, time dynamics" Federated Learning of Large Models at the Edge via Principal Sub-Model Training,https://openreview.net/forum?id=8Vxuz_PJNus,https://openreview.net/pdf?id=8Vxuz_PJNus,We provide a sub-model training method that enables resource-constrained clients to train large models in FL.,"Limited compute, memory, and communication capabilities of edge users create a significant bottleneck for federated learning (FL) of large models. Current literature typically tackles the challenge with a heterogeneous client setting or allows training to be offloaded to the server. However, the former requires a fraction of clients to train near-full models, which may not be achievable at the edge; while the latter can compromise privacy by sharing intermediate representations or labels. In this work, we consider a realistic, but much less explored, cross-device FL setting in which no client has the capacity to train a full large model nor is willing to share any intermediate representations with the server. To this end, we present the Principal Sub-Model (PriSM) training methodology, which leverages models’ low-rank structure and kernel orthogonality to train sub-models in the orthogonal kernel space. More specifically, by applying singular value decomposition to original kernels in the server model, PriSM first obtains a set of principal orthogonal kernels with importance weighted by their singular values. Thereafter, PriSM utilizes a novel sampling strategy that selects different subsets of the principal kernels independently to create sub-models for clients with reduced computation and communication requirements. Importantly, a kernel with a large singular value is assigned a high sampling probability. Thus, each sub-model is a low-rank approximation of the full large model, and all clients together achieve nearly full coverage of the principal kernels. To further improve memory efficiency, PriSM exploits low-rank structure in intermediate representations and allows each sub-model to learn only a subset of them while still preserving training performance. Our extensive evaluations on multiple datasets in various resource-constrained settings demonstrate that PriSM can yield a performance improvement of up to $10\%$ compared to existing alternatives, when training sub-models with only $20\%$ principal kernels ($\sim 5\%$ of the full server model).","Federated Learning, Resource-Constrained Clients, Sub-Model Training" Low-Entropy Features Hurt Out-of-Distribution Performance,https://openreview.net/forum?id=6j3bPQtQS-3,https://openreview.net/pdf?id=6j3bPQtQS-3,We hypothesize that low-entropy features tend to be more domain-specific. 
This paper studies how the entropy of the intermediate representation affects the model's robustness against out-of-distribution (OOD) data.,"We study the relationship between the entropy of intermediate representations and a model's robustness to distributional shift. We train two feed-forward networks end-to-end separated by a discrete $n$-bit channel on an unsupervised contrastive learning task. Different \textit{masking strategies} are implemented that remove a proportion $p_{\text{mask}}$ of low-entropy bits, high-entropy bits, or random bits, and the effects on performance are compared to the baseline accuracy with no mask. When testing in-distribution (InD), we find that the removal of bits via any strategy leads to an \textit{increase} in performance when masking out a relatively low $p_{\text{mask}}$. We hypothesize that the entropy of a bit serves as a guide to its usefulness out-of-distribution (OOD). Through experiments on three OOD datasets we demonstrate that the removal of low-entropy bits can notably benefit OOD performance. Conversely, we show that top-entropy masking disproportionately harms performance both InD and OOD.","Generalization, Out-of-Distribution, Entropy-based Methods, Unsupervised Contrastive Learning, Latent Representations" Implicit Offline Reinforcement Learning via Supervised Learning,https://openreview.net/forum?id=egaddkwMOd3,https://openreview.net/pdf?id=egaddkwMOd3,This work bridges an essential gap between implicit models and explicit RL via Supervised Learning methods.,"Offline Reinforcement Learning (RL) via Supervised Learning is a simple and effective way to learn robotic skills from a dataset of varied behaviors. It is as simple as supervised learning and Behavior Cloning (BC) but takes advantage of the return information. On BC tasks, implicit models have been shown to match or outperform explicit ones. Despite the benefits of using implicit models to learn robotic skills via BC, Offline RL via Supervised Learning algorithms have been limited to explicit models. We show how implicit models leverage return information and match or outperform explicit algorithms to acquire robotic skills from fixed datasets. Furthermore, we show how closely related our implicit methods are to other popular RL via Supervised Learning algorithms.","Offline Reinforcement Learning, Energy Based Model, Offline Reinforcement Learning via Supervised Learning" "A Unimodal, Uncertainty-Aware Deep Learning Approach for Ordinal Regression",https://openreview.net/forum?id=yyTNV4CcIR9,https://openreview.net/pdf?id=yyTNV4CcIR9,,"Ordinal regression is an important area in machine learning, and many algorithms have been proposed to approach it. In this work, we propose an ordinal regression prediction algorithm, based on deep learning machinery and inspired by the well-known Proportional Odds model. Our proposed approach has three key components: first, it is designed to guarantee unimodal output probabilities, which is a desired element in many real-world applications. Second, we argue that the standard maximum likelihood is sub-optimal for ordinal regression problems and train our model using an optimal transport loss, as it naturally captures the order of the classes. Third, we design a novel regularizer aiming to make the model uncertainty-aware, in the sense of making the model more confident about correct predictions, compared to wrong predictions. In addition, we propose a novel uncertainty-awareness evaluation measure. 
Experimental results on eight real-world datasets demonstrate that our proposed approach consistently performs on par with and often better than several recently proposed deep learning approaches for ordinal regression, in terms of both accuracy and uncertainty-awareness, while having a guarantee on the output unimodality.","unimodality, ordinal regression, uncertainty, deep learning" Augmentation Backdoors,https://openreview.net/forum?id=-CIOGGhkEfy,https://openreview.net/pdf?id=-CIOGGhkEfy,We present three backdoor attacks that can be covertly inserted into data augmentation functions.,"Data augmentation is used extensively to improve model generalisation. However, reliance on external libraries to implement augmentation methods introduces a vulnerability into the machine learning pipeline. It is well known that backdoors can be inserted into machine learning models by serving a modified dataset to train on. Augmentation therefore presents a perfect opportunity to perform this modification without requiring an initially backdoored dataset. In this paper we present three backdoor attacks that can be covertly inserted into data augmentation. Our attacks each insert a backdoor using a different type of computer vision augmentation transform, covering simple image transforms, GAN-based augmentation, and composition-based augmentation. By inserting the backdoor using these augmentation transforms, we make our backdoors difficult to detect, while still supporting arbitrary backdoor functionality. We evaluate our attacks on a range of computer vision benchmarks and demonstrate that an attacker is able to introduce backdoors through just a malicious augmentation routine.","training time attacks, backdoors, augmentation" Neural Radiance Field Codebooks,https://openreview.net/forum?id=mX56bKDybu5,https://openreview.net/pdf?id=mX56bKDybu5,"Learning geometrically-aware, object-centric representations through elastic bottlenecks and a differentiable renderer for downstream tasks.","Compositional representations of the world are a promising step towards enabling high-level scene understanding and efficient transfer to downstream tasks. Learning such representations for complex scenes and tasks remains an open challenge. Towards this goal, we introduce Neural Radiance Field Codebooks (NRC), a scalable method for learning object-centric representations through novel view reconstruction. NRC learns to reconstruct scenes from novel views using a dictionary of object codes which are decoded through a volumetric renderer. This enables the discovery of reoccurring visual and geometric patterns across scenes which are transferable to downstream tasks. We show that NRC representations transfer well to object navigation in THOR, outperforming 2D and 3D representation learning methods by 3.1\% in success rate. We demonstrate that our approach is able to perform unsupervised segmentation for more complex synthetic (THOR) and real scenes (NYU Depth) better than prior methods (0.101 ARI). 
Finally, we show that NRC improves on the task of depth ordering by 5.5% in accuracy in THOR.","Object-Centric Representation Learning, Representation Learning, Neural Radiance Fields" Scalable Estimation of Nonparametric Markov Networks with Mixed-Type Data,https://openreview.net/forum?id=qBvBycTqVJ,https://openreview.net/pdf?id=qBvBycTqVJ,"We investigate scalable estimation of nonparametric Markov networks with general distributions for all data types (i.e., continuous, discrete, and mixed-type).","A Markov network characterizes the conditional independence structure, or Markov property, among a set of random variables. Existing work focuses on specific families of distributions (e.g., exponential families) and/or certain structures of the graph, and most of it can only handle variables of a single data type (continuous or discrete). In this work, we generalize the characterization of the conditional independence structure to handle general distributions for all data types (i.e., continuous, discrete, and mixed-type) with general functional relations among variables, thus giving rise to a Markov network structure learning algorithm in one of the most general settings. To deal with the computational challenge of the problem, especially for large graphs, we unify all cases under the same umbrella of a regularized score matching framework. We validate the theoretical results experimentally and demonstrate the scalability of the approach--it produces the estimated Markov network over up to 5000 nodes within one hour on CPUs. We further discuss the implication of the proposed approach in causal discovery.","Structure learning, Markov networks, graphical models, score matching, model selection" Determinant regularization for Deep Metric Learning,https://openreview.net/forum?id=NUBuJsAq1U,https://openreview.net/pdf?id=NUBuJsAq1U,,"Distance Metric Learning (DML) aims to learn the distance metric that better reflects the semantic similarities in the data. Current \textit{pair-based} and \textit{proxy-based} methods on DML focus on reducing the distance between similar samples while expanding the distance of dissimilar ones. However, we reveal that shrinking the distance between similar samples may distort the feature space, increasing the distance between points of the same class region and, therefore, harming the generalization of the model. Traditional regularization terms (such as the $L_2$-norm on weights) cannot be adopted to solve this issue as they are based on linear projection. To alleviate this issue, we adopt the structure of a normalizing flow as the deep metric layer and calculate the determinant of the Jacobian matrix as the regularization term. Finally, we conduct experiments on several \textit{pair-based} and \textit{proxy-based} algorithms that demonstrate the benefits of our method. ","Deep Metric Learning, Generalization, Jacobi Matrix" Data-Efficient and Interpretable Tabular Anomaly Detection,https://openreview.net/forum?id=ybFkELZjuc,https://openreview.net/pdf?id=ybFkELZjuc,We developed an interpretable tabular anomaly detection method that allows incorporation of labeled data.,"Anomaly detection (AD) plays an important role in numerous applications. In this paper, we focus on two understudied aspects of AD that are critical for integration into real-world applications. First, most AD methods cannot incorporate labeled data that are often available in practice in small quantities and can be crucial to achieve high accuracy. 
Second, most AD methods are not interpretable, a bottleneck that prevents stakeholders from understanding the reason behind the anomalies. In this paper, we propose a novel AD framework, DIAD, that adapts a white-box model class, Generalized Additive Models, to detect anomalies using a partial identification objective which naturally handles noisy or heterogeneous features. DIAD can incorporate a small amount of labeled data to further boost AD performance in semi-supervised settings. We demonstrate the superiority of DIAD compared to previous work in both unsupervised and semi-supervised settings on multiple datasets. We also present the explainability capabilities of DIAD, illustrating its rationale for predicting certain samples as anomalies.","Anomaly Detection, Interpretability" Extracting Expert's Goals by What-if Interpretable Modeling,https://openreview.net/forum?id=e25n9Z29PeC,https://openreview.net/pdf?id=e25n9Z29PeC,We recover clinicians' goals of treatments by integrating counterfactual reasoning into batch inverse reinforcement learning and interpretable GAM modeling,"Although reinforcement learning (RL) has tremendous success in many fields, applying RL to real-world settings such as healthcare is challenging when the reward is hard to specify and no exploration is allowed. In this work, we focus on batch inverse RL (IRL) to recover clinicians' rewards from their past demonstrations of treating patients. We explain their treatments based on the what-if future outcomes: ""what future would have happened if a different treatment had been taken?"", and provide interpretability with generalized additive models (GAMs) - a class of accurate, interpretable models. In both simulation and a real-world hospital dataset, our model outperforms baselines and provides explanations consistent with clinical guidelines, while the commonly-used linear model often contradicts them. We also uncover the unreliability of offline metrics, such as matched action accuracy, that are often used in the literature to compare IRL methods.","counterfactuals, explaining decision-making, preference learning, inverse reinforcement learning, healthcare" FiT: Parameter Efficient Few-shot Transfer Learning for Personalized and Federated Image Classification,https://openreview.net/forum?id=9aokcgBVIj1,https://openreview.net/pdf?id=9aokcgBVIj1,"We propose FiT, a parameter-efficient few-shot image classification system that uses a Naive Bayes head, FiLM layers that modulate a pretrained backbone, and an episodic fine-tuning protocol that achieves SOTA on the VTAB-1k benchmark.","Modern deep learning systems are increasingly deployed in situations such as personalization and federated learning where it is necessary to support i) learning on small amounts of data, and ii) communication-efficient distributed training protocols. In this work, we develop FiLM Transfer (FiT), which fulfills these requirements in the image classification setting by combining ideas from transfer learning (fixed pretrained backbones and fine-tuned FiLM adapter layers) and meta-learning (automatically configured Naive Bayes classifiers and episodic training) to yield parameter-efficient models with superior classification accuracy at low-shot. The resulting parameter efficiency is key for enabling few-shot learning, inexpensive model updates for personalization, and communication-efficient federated learning. 
We experiment with FiT on a wide range of downstream datasets and show that it achieves better classification accuracy than the leading Big Transfer (BiT) algorithm at low-shot and achieves state-of-the-art accuracy on the challenging VTAB-1k benchmark, with fewer than 1% of the updateable parameters. Finally, we demonstrate the parameter efficiency and superior accuracy of FiT in distributed low-shot applications including model personalization and federated learning where model update size is an important performance metric.","few-shot learning, transfer learning, federated learning" A Critical Analysis of Out-of-Distribution Detection for Document Understanding,https://openreview.net/forum?id=IHGnybgLo1Z,https://openreview.net/pdf?id=IHGnybgLo1Z,This work investigates the OOD robustness of pretrained models and presents a benchmark for various document understanding tasks.,"Large-scale pretraining is widely used in recent document understanding models. During deployment, one may expect that large-scale pretrained models should trigger a conservative fallback policy when encountering out-of-distribution (OOD) samples, which suggests the importance of OOD detection. However, most existing OOD detection methods focus on single-modal inputs such as images or texts. While documents are multi-modal in nature, it remains underexplored whether and how multi-modal information in documents can be exploited for OOD detection. In this work, we first provide a systematic and in-depth analysis of OOD detection for document understanding models. We study the effects of model modality, pretraining, and finetuning across various types of OOD inputs. In particular, we find that spatial information is critical for document OOD detection. To better exploit spatial information, we propose a simple yet effective spatial-aware adapter, which serves as an add-on module to adapt transformer-based language models to the document domain. Extensive experiments show that our method consistently improves ID accuracy and OOD detection performance compared to baselines. We hope our findings can help inspire future work on understanding OOD robustness for documents.","Document Understanding, Pretraining, Out-of-Distribution, Document intelligence, Robustness" Learnable Visual Words for Interpreting Image Recognition Models,https://openreview.net/forum?id=lb8wXVGWn0E,https://openreview.net/pdf?id=lb8wXVGWn0E,,"To interpret deep models' predictions, attention-based visual cues are widely used in addressing *why* deep models make such predictions. Beyond that, the research community has become more interested in reasoning about *how* deep models make predictions, where some prototype-based methods employ interpretable representations with their corresponding visual cues to reveal the black-box mechanism of deep model behaviors. However, these pioneering attempts either learn category-specific prototypes, which deteriorates their generalization ability, or demonstrate only a few illustrative examples without a quantitative evaluation of visual-based interpretability, narrowing their practical usage. In this paper, we revisit the concept of visual words and propose the Learnable Visual Words (LVW) to interpret the model prediction behaviors with two novel modules: semantic visual words learning and dual fidelity preservation. The semantic visual words learning relaxes the category-specific constraint, enabling generic visual words shared across multiple categories. 
Beyond employing the visual words for prediction to align them with the base model, our dual fidelity preservation also includes an attention-guided semantic alignment that encourages the learned visual words to focus on the same conceptual regions for prediction. Experiments on six visual benchmarks demonstrate the superior effectiveness of our proposed LVW in both accuracy and interpretation fidelity over the state-of-the-art methods. Moreover, we conduct various in-depth analyses to further explore the learned visual words and the generalizability of our method for unseen categories.", Compact Bilinear Pooling via General Bilinear Projection,https://openreview.net/forum?id=0PH-P_FIqGD,https://openreview.net/pdf?id=0PH-P_FIqGD,"We propose a general bilinear projection based on complete matrix bases, and then design a compact bilinear pooling algorithm using the proposed general bilinear projection.","Most factorized bilinear pooling (FBiP) employs Hadamard product-based bilinear projection to learn appropriate projecting directions to reduce the dimension of bilinear features. However, in this paper, we reveal that the Hadamard product-based bilinear projection makes FBiP miss a lot of possible projecting directions, which significantly harms the performance of the output compact bilinear features, including their compactness and effectiveness. To address this issue, we propose a general matrix-based bilinear projection based on the rank-$k$ matrix base decomposition, where the Hadamard-based bilinear projection is a special case of our proposed one. Using the proposed bilinear projection, we design a novel low-rank factorized bilinear pooling (named RK-FBP), which does not miss any projecting directions. Thus, our RK-FBP can generate better compact bilinear features. To leverage high-order information in local features, we nest several RK-FBP modules together to formulate a multi-linear pooling that outputs compact multi-linear features. Finally, we conduct experiments on several fine-grained image tasks to evaluate our models. The experiments show that our models achieve new state-of-the-art classification accuracy with the lowest feature dimension.","Bilinear Pooling, Bilinear Projection, fine-grained recognition" AANG : Automating Auxiliary Learning,https://openreview.net/forum?id=vtVDI3w_BLL,https://openreview.net/pdf?id=vtVDI3w_BLL,"We automatically generate a suite of auxiliary objectives and give a theoretically informed, efficient algorithm for searching the space of generated objectives to find those most useful to a specified end-task.","Auxiliary objectives, supplementary learning signals that are introduced to aid learning on data-starved or highly complex end-tasks, are commonplace in machine learning. Whilst much work has been done to formulate useful auxiliary objectives, their construction is still an art which proceeds by slow and tedious hand-design. Intuition for how and when these objectives improve end-task performance has also had limited theoretical backing. In this work, we present an approach for automatically generating a suite of auxiliary objectives. We achieve this by deconstructing existing objectives within a novel unified taxonomy, identifying connections between them, and generating new ones based on the uncovered structure. Next, we theoretically formalize widely-held intuitions about how auxiliary learning improves generalization on the end-task. 
This leads us to a principled and efficient algorithm for searching the space of generated objectives to find those most useful to a specified end-task. With natural language processing (NLP) as our domain of study, we demonstrate that our automated auxiliary learning pipeline leads to strong improvements over competitive baselines across continued training experiments on a pre-trained model on 5 NLP end-tasks.","auxiliary learning, automl, natural language processing, meta-learning, algorithmic stability, multitask learning" Discrete Contrastive Diffusion for Cross-Modal Music and Image Generation,https://openreview.net/forum?id=1-MBdJssZ-S,https://openreview.net/pdf?id=1-MBdJssZ-S,"We present a conditional contrastive diffusion approach for better input-output correspondence via maximized mutual information, applicable to music and image generation.","Diffusion probabilistic models (DPMs) have become a popular approach to conditional generation, due to their promising results and support for cross-modal synthesis. A key desideratum in conditional synthesis is to achieve high correspondence between the conditioning input and generated output. Most existing methods learn such relationships implicitly, by incorporating the prior into the variational lower bound. In this work, we take a different route---we explicitly enhance input-output connections by maximizing their mutual information. To this end, we introduce a Conditional Discrete Contrastive Diffusion (CDCD) loss and design two contrastive diffusion mechanisms to effectively incorporate it into the denoising process, combining diffusion training and contrastive learning for the first time by connecting the CDCD loss with the conventional variational objectives. We demonstrate the efficacy of our approach in evaluations with diverse multimodal conditional synthesis tasks: dance-to-music generation, text-to-image synthesis, as well as class-conditioned image synthesis. On each, we enhance the input-output correspondence and achieve higher or competitive general synthesis quality. Furthermore, the proposed approach improves the convergence of diffusion models, reducing the number of required diffusion steps by more than 35% on two benchmarks, significantly increasing the inference speed.","Contrastive Diffusion, Conditioned Generations, Music Generation, Image Synthesis" Diffusion Probabilistic Modeling of Protein Backbones in 3D for the motif-scaffolding problem,https://openreview.net/forum?id=6TxBxqNME1Y,https://openreview.net/pdf?id=6TxBxqNME1Y,We have created the first generative modeling approach to motif-scaffolding by developing a diffusion probabilistic model of protein backbones and a procedure for generating scaffolds conditional on a motif.,"Construction of a scaffold structure that supports a desired motif, conferring protein function, shows promise for the design of vaccines and enzymes. But a general solution to this motif-scaffolding problem remains open. Current machine-learning techniques for scaffold design are either limited to unrealistically small scaffolds (up to length 20) or struggle to produce multiple diverse scaffolds. We propose to learn a distribution over diverse and longer protein backbone structures via an E(3)-equivariant graph neural network. We develop SMCDiff to efficiently sample scaffolds from this distribution conditioned on a given motif; our algorithm is the first to theoretically guarantee conditional samples from a diffusion model in the large-compute limit. 
We evaluate our designed backbones by how well they align with AlphaFold2-predicted structures. We show that our method can (1) sample scaffolds up to 80 residues and (2) achieve structurally diverse scaffolds for a fixed motif.","Diffusion Models, Sequential Monte Carlo, Protein Design, Geometric Deep Learning" NeRF-SOS: Any-View Self-supervised Object Segmentation on Complex Scenes,https://openreview.net/forum?id=kfOtMqYJlUU,https://openreview.net/pdf?id=kfOtMqYJlUU,"We propose a novel collaborative contrastive loss for NeRF to segment objects in complex real-world scenes, without any annotation.","Neural volumetric representations have shown that Multi-layer Perceptrons (MLPs) can be optimized with multi-view calibrated images to represent scene geometry and appearance without explicit 3D supervision. Object segmentation can enrich many downstream applications based on the learned radiance field. However, introducing hand-crafted segmentation to define regions of interest in a complex real-world scene is non-trivial and expensive as it requires per-view annotation. This paper explores self-supervised learning for object segmentation using NeRF on complex real-world scenes. Our framework, called NeRF with Self-supervised Object Segmentation (NeRF-SOS), couples object segmentation and the neural radiance field to segment objects in any view within a scene. By proposing a novel collaborative contrastive loss at both the appearance and geometry levels, NeRF-SOS encourages NeRF models to distill compact geometry-aware segmentation clusters from their density fields and the self-supervised pre-trained 2D visual features. The self-supervised object segmentation framework can be applied to various NeRF models, leading to both photo-realistic rendering results and convincing segmentation maps for indoor and outdoor scenarios. Extensive results on the LLFF, BlendedMVS, CO3Dv2, and Tanks & Temples datasets validate the effectiveness of NeRF-SOS. It consistently surpasses other 2D-based self-supervised baselines and predicts finer object masks than existing supervised counterparts.","neural radiance field, self-supervised learning, object segmentation" RbX: Region-based explanations of prediction models,https://openreview.net/forum?id=vaf8KQ8bhS,https://openreview.net/pdf?id=vaf8KQ8bhS,Region-based explanations for local prediction importance from a black-box model,"We introduce region-based explanations (RbX), a novel, model-agnostic method to generate local explanations of scalar outputs from a black-box prediction model using only query access. RbX is based on a greedy algorithm for building a convex polytope that approximates a region of feature space where model predictions are close to the prediction at some target point. This region is fully specified by the user on the scale of the predictions, rather than on the scale of the features. The geometry of this polytope - specifically the change in each coordinate necessary to escape the polytope - quantifies the local sensitivity of the predictions to each of the features. These “escape distances” can then be standardized to rank the features by local importance. RbX is guaranteed to satisfy a “sparsity” axiom, which requires that features which do not enter into the prediction model are assigned zero importance. 
At the same time, real data examples and synthetic experiments show that RbX detects all locally relevant features more readily than existing methods.", Rethinking Graph Lottery Tickets: Graph Sparsity Matters,https://openreview.net/forum?id=fjh7UGQgOB,https://openreview.net/pdf?id=fjh7UGQgOB,,"The Lottery Ticket Hypothesis (LTH) claims the existence of a winning ticket (i.e., a properly pruned sub-network together with original weight initialization) that can achieve performance competitive with the original dense network. A recent work, called UGS, extended LTH to prune graph neural networks (GNNs) for effectively accelerating GNN inference. UGS simultaneously prunes the graph adjacency matrix and the model weights using the same masking mechanism, but since the roles of the graph adjacency matrix and the weight matrices are very different, we find that their sparsifications lead to different performance characteristics. Specifically, we find that the performance of a sparsified GNN degrades significantly when the graph sparsity goes beyond a certain extent. Therefore, we propose two techniques to improve GNN performance when the graph sparsity is high. First, UGS prunes the adjacency matrix using a loss formulation which, however, does not properly involve all elements of the adjacency matrix; in contrast, we add a new auxiliary loss head to better guide the edge pruning by involving the entire adjacency matrix. Second, by regarding unfavorable graph sparsification as adversarial data perturbations, we formulate the pruning process as a min-max optimization problem to gain robustness for lottery tickets when the graph sparsity is high. We further investigate the question: Can the ``retrainable'' winning ticket of a GNN also be effective for graph transfer learning? We call it the transferable graph lottery ticket (GLT) hypothesis. Extensive experiments demonstrate the superiority of our proposed sparsification method over UGS and empirically verify our transferable GLT hypothesis.",GLT The Impact of Approximation Errors on Warm-Start Reinforcement Learning: A Finite-time Analysis,https://openreview.net/forum?id=MuWgF-FVzON,https://openreview.net/pdf?id=MuWgF-FVzON,,"Warm-Start reinforcement learning (RL), aided by a prior policy obtained from offline training, is emerging as a promising RL approach for practical applications. Recent empirical studies have demonstrated that the performance of Warm-Start RL can improve \textit{quickly} in some cases but become \textit{stagnant} in others, calling for a fundamental understanding, especially when function approximation is used. To fill this void, we take a finite-time analysis approach to quantify the impact of approximation errors on the learning performance of Warm-Start RL. Specifically, we consider the widely used Actor-Critic (A-C) method with a prior policy. We first quantify the approximation errors in the Actor update and the Critic update, respectively. Next, we cast the Warm-Start A-C algorithm as Newton's method with perturbation, and study the impact of the approximation errors on the finite-time learning performance with inaccurate Actor/Critic updates. Under some general technical conditions, we obtain lower bounds on the sub-optimality gap of the Warm-Start A-C algorithm to quantify the impact of the bias and error propagation. 
We also derive upper bounds, which provide insights into achieving the desired finite-time learning performance of the Warm-Start A-C algorithm.","Reinforcement Learning, Finite-time Analysis, Approximation Error, Warm Start" NeRN: Learning Neural Representations for Neural Networks,https://openreview.net/forum?id=9gfir3fSy3J,https://openreview.net/pdf?id=9gfir3fSy3J,"In this paper we present NeRN: a neural representation for the weights of a pretrained neural network, which is obtained by applying smoothness over the reconstructed weights and various knowledge distillation techniques","Neural Representations have recently been shown to effectively reconstruct a wide range of signals, from 3D meshes and shapes to images and videos. We show that, when adapted correctly, neural representations can be used to directly represent the weights of a pre-trained convolutional neural network, resulting in a Neural Representation for Neural Networks (NeRN). Inspired by coordinate inputs of previous neural representation methods, we assign a coordinate to each convolutional kernel in our network based on its position in the architecture, and optimize a predictor network to map coordinates to their corresponding weights. Similarly to the spatial smoothness of visual scenes, we show that incorporating a smoothness constraint over the original network's weights aids NeRN towards a better reconstruction. In addition, since slight perturbations in pre-trained model weights can result in a considerable accuracy loss, we employ techniques from the field of knowledge distillation to stabilize the learning process. We demonstrate the effectiveness of NeRN in reconstructing widely used architectures on CIFAR-10, CIFAR-100, and ImageNet. Finally, we present two applications using NeRN, demonstrating the capabilities of the learned representations.","Convolutional Neural Networks, Neural Representations, Implicit Representations" Private Federated Learning Without a Trusted Server: Optimal Algorithms for Convex Losses,https://openreview.net/forum?id=TVY6GoURrw,https://openreview.net/pdf?id=TVY6GoURrw,Optimal algorithms for differentially private convex/strongly convex federated learning with data from people who do not trust the server or other silos/clients. ,"This paper studies federated learning (FL)—especially cross-silo FL—with data from people who do not trust the server or other silos. In this setting, each silo (e.g. hospital) has data from different people (e.g. patients) and must maintain the privacy of each person’s data (e.g. medical record), even if the server or other silos act as adversarial eavesdroppers. This requirement motivates the study of Inter-Silo Record-Level Differential Privacy (ISRL-DP), which requires silo $i$’s communications to satisfy record/item-level differential privacy (DP). ISRL-DP ensures that the data of each person (e.g. patient) in silo $i$ (e.g. hospital $i$) cannot be leaked. ISRL-DP is different from well-studied privacy notions. Central and user-level DP assume that people trust the server/other silos. On the other end of the spectrum, local DP assumes that people do not trust anyone at all (even their own silo). Sitting between central and local DP, ISRL-DP makes the realistic assumption (in cross-silo FL) that people trust their own silo, but not the server or other silos. In this work, we provide tight (up to logarithms) upper and lower bounds for ISRL-DP FL with convex/strongly convex loss functions and homogeneous (i.i.d.) silo data. 
Remarkably, we show that similar bounds are attainable for smooth losses with arbitrary heterogeneous silo data distributions, via an accelerated ISRL-DP algorithm. We also provide tight upper and lower bounds for ISRL-DP federated empirical risk minimization, and use acceleration to attain the optimal bounds in fewer rounds of communication than the state-of-the-art. Finally, with a secure “shuffler” to anonymize silo messages (but without a trusted server), our algorithm attains the optimal central DP rates under more practical trust assumptions. Numerical experiments show favorable privacy-accuracy tradeoffs for our algorithm in classification and regression tasks.","differential privacy, federated learning, distributed optimization, private optimization, stochastic convex optimization, cross-silo federated learning" 3D-Aware Video Generation,https://openreview.net/forum?id=N7ts-GTfuy,https://openreview.net/pdf?id=N7ts-GTfuy,3D-Aware Video Generation,"Generative models have emerged as an essential building block for many image synthesis and editing tasks. Recent advances in this field have also enabled the generation of high-quality 3D or video content that exhibits either multi-view or temporal consistency. With our work, we explore 4D generative adversarial networks (GANs) that learn unconditional generation of 3D-aware videos. By combining neural implicit representations with a time-aware discriminator, we develop a GAN framework that synthesizes 3D video supervised only with monocular videos. We show that our method learns a rich embedding of decomposable 3D structures and motions that enables new visual effects of spatio-temporal renderings while producing imagery with quality comparable to that of existing 3D or video GANs.","video generation, 3D, generative model, 3D-aware image synthesis" Joint rotational invariance and adversarial training of a dual-stream Transformer yields state of the art Brain-Score for Area V4,https://openreview.net/forum?id=02Bt_4tx6r,https://openreview.net/pdf?id=02Bt_4tx6r,We provide evidence that a specific Vision Transformer under a joint rotationally-invariant and adversarial optimization procedure can reach state of the art Brain-Score for Area V4,"Modern high-scoring models of vision in the Brain-Score competition do not stem from Vision Transformers. However, in this paper, we provide evidence against the unexpected trend of Vision Transformers (ViT) not being perceptually aligned with human visual representations by showing how a dual-stream Transformer, a CrossViT $\textit{a la}$ Chen et al. (2021), under a joint rotationally-invariant and adversarial optimization procedure yields 2nd place in the aggregate Brain-Score 2022 competition (Schrimpf et al., 2020b) averaged across all visual categories, and at the time of the competition held 1st place for the highest explainable variance of area V4. In addition, our current Transformer-based model also achieves greater explainable variance for areas V4, IT, and Behaviour than a biologically-inspired CNN (ResNet50) that integrates a frontal V1-like computation module (Dapello et al., 2020). To assess the contribution of the optimization scheme with respect to the CrossViT architecture, we perform several additional experiments on differently optimized CrossViTs regarding adversarial robustness, common corruption benchmarks, mid-ventral stimuli interpretation, and feature inversion. 
Against our initial expectations, our family of results provides tentative support for an $\textit{``All roads lead to Rome''}$ argument enforced via a joint optimization rule even for non-biologically-motivated models of vision such as Vision Transformers.","Vision Transformer, Brain-Score competition, adversarial training, rotation invariance." AutoSparse: Towards Automated Sparse Training,https://openreview.net/forum?id=zyfEWkV6it,https://openreview.net/pdf?id=zyfEWkV6it,,"Sparse training is emerging as a promising avenue for reducing the computational cost of training neural networks. Several recent studies have proposed pruning methods using learnable thresholds to efficiently explore the non-uniform distribution of sparsity inherent within the models. In this paper, we propose Gradient Annealing (GA), a gradient-driven approach in which gradients of pruned-out weights are scaled down in a non-linear manner. GA eliminates the need for additional sparsity-inducing regularization by providing an elegant trade-off between sparsity and accuracy. We integrate GA with the latest learnable-threshold-based pruning methods to create an automated sparse training algorithm called AutoSparse. Our algorithm achieves state-of-the-art accuracy with 80% sparsity for ResNet50 and 75% sparsity for MobileNetV1 on ImageNet-1K. AutoSparse also results in a 7× reduction in inference FLOPS and a > 2× reduction in training FLOPS for ResNet50 on ImageNet at 80% sparsity. Finally, GA generalizes well to fixed-budget (Top-K, 80%) sparse training methods, improving the accuracy of ResNet50 on ImageNet-1K to outperform TopKAST+PP by 0.3%.","sparsity, sparse training, deep learning" Improving Molecular Pretraining with Complementary Featurizations,https://openreview.net/forum?id=-1k-zfgHFWQ,https://openreview.net/pdf?id=-1k-zfgHFWQ,,"Molecular pretraining, which learns molecular representations over massive unlabeled data, has become a prominent paradigm to solve a variety of tasks in computational chemistry and drug discovery. Recently, substantial progress has been made in molecular pretraining with different molecular featurizations, including 1D SMILES strings, 2D graphs, and 3D geometries. However, the role of molecular featurizations with their corresponding neural architectures in molecular pretraining remains largely unexamined. In this paper, through two case studies—chirality classification and aromatic ring counting—we first demonstrate that different featurization techniques convey chemical information differently. In light of this observation, we propose a simple and effective MOlecular pretraining framework with COmplementary featurizations (MOCO). MOCO comprehensively leverages multiple featurizations that complement each other and outperforms existing state-of-the-art models that rely solely on one or two featurizations on a wide range of molecular property prediction tasks.","molecular pretraining, featurizations, contrastive learning" Learning to Communicate using Contrastive Learning ,https://openreview.net/forum?id=jyHAGzMu-1Q,https://openreview.net/pdf?id=jyHAGzMu-1Q,"A novel approach and perspective to decentralized communication learning in MARL based on contrastive learning with a suite of evaluation methods (e.g. protocol symmetry, representation probing and zero-shot communication) to analyze protocols.","Communication is a powerful tool for coordination in multi-agent RL. Inducing an effective, common language has been a difficult challenge, particularly in the decentralized setting. 
In this work, we introduce an alternative perspective where communicative messages sent between agents are considered as different, incomplete views of the environment state. Based on this perspective, we propose to learn to communicate using contrastive learning by maximizing the mutual information between messages of a given trajectory. In communication-essential environments, our method outperforms previous work in both performance and learning speed. Using qualitative metrics and representation probing, we show that our method induces more symmetric communication and captures task-relevant information from the environment. Finally, we demonstrate promising results on zero-shot communication, a first for MARL. Overall, we show the power of contrastive learning, and self-supervised learning in general, as a method for learning to communicate.","Reinforcement Learning, Multi-Agent Reinforcement Learning, Multi-Agent Communication" Cheap Talk Discovery and Utilization in Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=cddbeL1HWaD,https://openreview.net/pdf?id=cddbeL1HWaD,A novel problem formulation and methodology in MARL on learning where to communicate and how best to communicate.,"By enabling agents to communicate, recent cooperative multi-agent reinforcement learning (MARL) methods have demonstrated better task performance and more coordinated behavior. Most existing approaches facilitate inter-agent communication by allowing agents to send messages to each other through free communication channels, i.e., \emph{cheap talk channels}. Current methods require these channels to be constantly accessible and known to the agents a priori. In this work, we lift these requirements such that the agents must discover the cheap talk channels and learn how to use them. Hence, the problem has two main parts: \emph{cheap talk discovery} (CTD) and \emph{cheap talk utilization} (CTU). We introduce a novel conceptual framework for both parts and develop a new algorithm based on mutual information maximization that outperforms existing algorithms in CTD/CTU settings. We also release a novel benchmark suite to stimulate future research in CTD/CTU.","Reinforcement Learning, Multi-Agent Reinforcement Learning" Motif-induced Graph Normalization,https://openreview.net/forum?id=RPyemmvfqNF,https://openreview.net/pdf?id=RPyemmvfqNF,,"Graph Neural Networks (GNNs) have emerged as a powerful category of learning architecture for handling graph-structured data in the non-Euclidean domain. Despite their success, existing GNNs typically suffer from insufficient expressive power, upper-bounded by the Weisfeiler-Lehman (WL) test, and are meanwhile prone to over-smoothing as the number of layers increases. In this paper, we strive to strengthen the discriminative capabilities of GNNs by devising a dedicated plug-and-play normalization scheme, termed Motif-induced Normalization (MotifNorm), that explicitly considers the intra-connection information within each node-induced subgraph. To this end, we embed the motif-induced structural weights at the beginning and the end of the standard BatchNorm, as well as incorporate the graph instance-specific statistics for improved discriminative capabilities. In the meantime, we provide theoretical analysis to show that, with the proposed MotifNorm, arbitrary GNNs are more expressive than the 1-WL test in distinguishing k-regular graphs.
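The contrastive objective in the communication abstract above (maximizing mutual information between messages of the same trajectory) is typically realized with an InfoNCE-style loss; a minimal sketch, with the function name and temperature our own assumptions:

```python
import torch
import torch.nn.functional as F

def message_infonce(messages_a, messages_b, temperature=0.1):
    """InfoNCE between two batches of messages.

    messages_a, messages_b: (batch, dim) tensors; row i of each is
    assumed to come from the same trajectory (positive pair), while
    all other rows serve as negatives."""
    a = F.normalize(messages_a, dim=-1)
    b = F.normalize(messages_b, dim=-1)
    logits = a @ b.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)     # diagonal entries are positives
```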
Furthermore, the proposed MotifNorm scheme is also shown to alleviate the over-smoothing phenomenon. Experimental results on ten popular benchmarks across graph-, node-, and link-level property prediction tasks demonstrate the effectiveness of the proposed method. Our code is made available in the supplementary material.", Stochastic Gradient Methods with Preconditioned Updates,https://openreview.net/forum?id=ZbzcLy5I4rz,https://openreview.net/pdf?id=ZbzcLy5I4rz,,"This work considers non-convex finite sum minimization. There are a number of algorithms for such problems, but existing methods often work poorly when the problem is badly scaled and/or ill-conditioned, and a primary goal of this work is to introduce methods that alleviate this issue. Thus, here we include a preconditioner that is based upon Hutchinson's approach to approximating the diagonal of the Hessian, and couple it with several gradient based methods to give new `scaled' algorithms: Scaled SARAH and Scaled L-SVRG. Theoretical complexity guarantees under smoothness assumptions are presented, and we prove linear convergence when both smoothness and the PL-condition are assumed. Because our adaptively scaled methods use approximate partial second order curvature information, they are better able to mitigate the impact of badly scaled problems, and this improved practical performance is demonstrated in the numerical experiments that are also presented in this work.","optimization, non-convex optimization, stochastic optimization, scaled methods, variance reduction" Reversible Column Networks,https://openreview.net/forum?id=Oc2vlWU0jFY,https://openreview.net/pdf?id=Oc2vlWU0jFY,,"We propose a new neural network design paradigm Reversible Column Networks (RevCols). The main body of RevCols is composed of multiple copies of subnetworks, named columns, between which multi-level reversible connections are employed. Such an architectural scheme gives RevCols very different behavior from conventional networks: during forward propagation, features in RevCols are learned to be gradually disentangled when passing through each column, and their total information is maintained rather than compressed or discarded as in other networks. Our experiments suggest that CNN-style RevCols can achieve very competitive performances on multiple computer vision tasks such as image classification, object detection and semantic segmentation, especially with a large parameter budget and a large dataset. For example, after ImageNet-22K pre-training, RevCol-XL obtains 88.2% ImageNet-1K accuracy. Given more pre-training data, our largest model RevCol-H reaches 90.0% on ImageNet-1K, 61.2% APbox and 53.6% APmask on COCO detection test-dev set, 57.1% mIoU on ADE20k segmentation. To our knowledge, it is the best COCO detection result among pure CNN models without extra detection data. Moreover, as a general macro-architecture design, RevCols can also be introduced into transformers or other neural networks, which is demonstrated to improve performance in both computer vision and NLP tasks. ", Flexible Relation Preserving for Adversarial Training,https://openreview.net/forum?id=LlOOSDGLD24,https://openreview.net/pdf?id=LlOOSDGLD24,,"In this study, we revisit the representation learning problem for adversarial training from the perspective of relation preservation. Typical adversarial training methods tend to pull clean and adversarial samples closer to improve robustness.
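Hutchinson's diagonal-Hessian approximation mentioned in the preconditioning abstract above can be sketched with two rounds of automatic differentiation; a minimal illustration, not the paper's implementation:

```python
import torch

def hutchinson_diag(loss, params, n_samples=1):
    """Estimate diag(Hessian) via Hutchinson's trick: E[z * (H z)]
    with z a Rademacher vector, using Hessian-vector products."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    diag = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        # Rademacher probes: entries are +1 or -1 with equal probability
        zs = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]
        hvps = torch.autograd.grad(grads, params, grad_outputs=zs,
                                   retain_graph=True)
        for d, z, h in zip(diag, zs, hvps):
            d += z * h / n_samples
    return diag  # precondition updates, e.g. p -= lr * g / (d.abs() + eps)
```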
However, our experimental analysis reveals that such an operation leads to cluttered feature representations, thus decreasing the accuracy for both clean and adversarial samples. To alleviate the problem, we build a robust discriminative feature space for both clean and adversarial samples by taking into account a relational prior which preserves the relationship between features of clean samples. A flexible relationship preserving adversarial training (FRPAT) strategy is proposed to transfer the well-generalized relational structure of the standard training model into the adversarial training model. Moreover, it acts as an extra regularization term mathematically, making it easy to combine with various popular adversarial training algorithms in a plug-and-play way to achieve the best of both worlds. Extensive experiments on CIFAR10 and CIFAR100 demonstrate the superiority of our algorithm. Without additional data, it improves clean generalizability by up to $\textbf{8.78\%}$ and robust generalizability by up to $\textbf{3.04\%}$ on these datasets.","adversarial training, adversarial robustness, relationship knowledge distillation" Formal Mathematics Statement Curriculum Learning,https://openreview.net/forum?id=-P7G-8dmSh4,https://openreview.net/pdf?id=-P7G-8dmSh4,,"We explore the use of expert iteration in the context of language modeling applied to formal mathematics. We show that at the same compute budget, expert iteration, by which we mean proof search interleaved with learning, dramatically outperforms proof search only. We also observe that when applied to a collection of formal statements of sufficiently varied difficulty, expert iteration is capable of finding and solving a curriculum of increasingly difficult problems, without the need for associated ground-truth proofs. Finally, by applying this expert iteration to a manually curated set of problem statements, we surpass the previous state of the art on the miniF2F benchmark, automatically solving multiple challenging problems drawn from high school olympiads.","neural theorem proving, formal mathematics, language modeling, expert iteration" A Unified Causal View of Domain Invariant Representation Learning,https://openreview.net/forum?id=dr56zCCLtqY,https://openreview.net/pdf?id=dr56zCCLtqY,,"Machine learning methods can be unreliable when deployed in domains that differ from the domains on which they were trained. One intuitive approach for addressing this is to learn representations of data that are domain-invariant in the sense that they preserve data structure that is stable across domains, but throw out spuriously-varying parts. There are many approaches aimed at this kind of representation-learning, including methods based on data augmentation, distributional invariances, and risk invariance. Unfortunately, it is often unclear when a given method actually learns domain-invariant structure, and whether learning domain-invariant structure actually yields robust models. The key issue is that, in general, it's unclear how to formalize ``domain-invariant structure''. The purpose of this paper is to study these questions in the context of a particular natural domain shift notion that admits a natural formal notion of domain invariance. This notion is a formalization of the idea that causal relationships are invariant, but non-causal relationships (e.g., due to confounding) may vary.
We find that whether a given method learns domain-invariant structure, and whether this leads to robust prediction, both depend critically on the true underlying causal structure of the data.", PIPS: Path Integral Stochastic Optimal Control for Path Sampling in Molecular Dynamics,https://openreview.net/forum?id=TnIZfXSFJAh,https://openreview.net/pdf?id=TnIZfXSFJAh,,"We consider the problem of Sampling Transition Paths: Given two metastable conformational states of a molecular system, e.g., a folded and unfolded protein, we aim to sample the most likely transition path between the two states. Sampling such a transition path is computationally expensive due to the existence of high free energy barriers between the two states. To circumvent this, previous work has focused on simplifying the trajectories to occur along specific molecular descriptors called Collective Variables (CVs). However, finding CVs is non-trivial and requires chemical intuition. For larger molecules, where intuition is not sufficient, using these CV-based methods biases the transition along possibly irrelevant dimensions. In this work, we propose a method for sampling transition paths that considers the entire geometry of the molecules. We achieve this by relating the problem to recent works on the Schr\""odinger bridge problem and stochastic optimal control. Using this relation, we construct a path integral method that incorporates important characteristics of molecular systems such as second-order dynamics and invariance to rotations and translations. We demonstrate our method on commonly studied protein structures like Alanine Dipeptide, and also consider larger proteins such as Polyproline and Chignolin.", ID and OOD Performance Are Sometimes Inversely Correlated on Real-world Datasets,https://openreview.net/forum?id=qkdzAuh_gy,https://openreview.net/pdf?id=qkdzAuh_gy,Empirical+theoretical examination of an example of inverse correlation between ID/OOD accuracy across multiple neural networks (Camelyon17 dataset).,"Several studies have empirically compared in-distribution (ID) and out-of-distribution (OOD) performance of various models. They report frequent positive correlations on benchmarks in computer vision and NLP. Surprisingly, they never observe inverse correlations suggesting necessary trade-offs. This matters to determine whether ID performance can serve as a proxy for OOD generalization. This paper shows that inverse correlations between ID and OOD performance do happen in real-world benchmarks. They could be missed in past studies because of a biased selection of models. We show an example on the WILDS-Camelyon17 dataset, using models from multiple training epochs and random seeds. Our observations are particularly striking with models trained with a regularizer that diversifies the solutions to the ERM objective. We nuance recommendations and conclusions made in past studies. (1) High OOD performance may sometimes require trading off ID performance. (2) Focusing on ID performance alone may not lead to optimal OOD performance: it can lead to diminishing and eventually negative returns in OOD performance.
(3) Our example reminds us that empirical studies only chart regimes achievable with existing methods: care is warranted in deriving prescriptive recommendations.","OOD generalization, underspecification" Continual Learning with Group-wise Neuron Normalization,https://openreview.net/forum?id=uwmlZn6n-gV,https://openreview.net/pdf?id=uwmlZn6n-gV,,"Continual learning focuses on methods that accommodate the change in distribution and allow model adaptation and evolution while receiving data continuously. Importance- and regularization-based weight update methods that rely on heuristics might not be effective. Recently, enhanced experience replay-based methods showed promising results but might add to the computational cost. In this paper, we propose simple parameter-free normalization over groups of distinct neurons at the penultimate layer of the used neural network and a straightforward experience replay algorithm. We argue that such normalization enables the network to balance its capacity for each task, reducing the chances of damaging interference between tasks and mitigating forgetting. Our evaluation shows that normalization over groups of neurons drastically impacts performance. We demonstrate improved retained accuracy and backward transfer with respect to related state-of-the-art methods while remaining computationally efficient.","continual learning, group-wise neuron normalization, experience replay, subset of network weights competition" Zemi: Learning Zero-Shot Semi-Parametric Language Models from Multiple Tasks,https://openreview.net/forum?id=xQQQWB3VcpE,https://openreview.net/pdf?id=xQQQWB3VcpE,We introduce the first semi-parametric language model that demonstrates strong zero-shot performance on a wide range of unseen downstream tasks,"Although large language models have achieved impressive zero-shot ability, the huge model size generally incurs high cost. Recently, semi-parametric language models, which augment a smaller language model with an external retriever, have demonstrated promising language modeling capabilities. However, it remains unclear whether such semi-parametric language models can perform as competitively as their fully-parametric counterparts on zero-shot generalization to downstream tasks. In this work, we introduce $\text{Zemi}$, a zero-shot semi-parametric language model. To the best of our knowledge, this is the first semi-parametric language model that can demonstrate strong zero-shot performance on a wide range of held-out unseen tasks. We train $\text{Zemi}$ with a novel semi-parametric multitask prompted training paradigm, which shows significant improvement compared with the parametric multitask training as proposed by T0. Specifically, we augment the multitask training and zero-shot evaluation with retrieval from a large-scale task-agnostic unlabeled corpus. In order to incorporate multiple potentially noisy retrieved augmentations, we further propose a novel $\text{augmentation fusion}$ module leveraging perceiver resampler and gated cross-attention.
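The gated cross-attention fusion mentioned at the end of the Zemi abstract above is reminiscent of Flamingo-style gating; a hypothetical sketch under that assumption, with the module layout and zero-initialized gate our own choices:

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Fuse retrieved-augmentation tokens into LM hidden states via
    cross-attention scaled by a learnable tanh gate (zero-initialized,
    so fusion starts as a no-op and is learned gradually)."""

    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden, retrieved):
        # hidden: (B, L, D) LM states; retrieved: (B, R, D) augmentation tokens
        attended, _ = self.attn(hidden, retrieved, retrieved)
        return hidden + torch.tanh(self.gate) * attended
```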
Notably, our proposed $\text{Zemi}_\text{LARGE}$ outperforms T0-3B by 16% on all seven evaluation tasks while being 3.9x smaller in model size.","zero-shot, semi-parametric language model, multitask training" Visual Transformation Telling,https://openreview.net/forum?id=NqaGPQXblk,https://openreview.net/pdf?id=NqaGPQXblk,Visual Transformation Telling: a new task that requires reasoning about and describing transformations from a series of images.,"In this paper, we propose a new visual reasoning task, called Visual Transformation Telling (VTT). Given a series of states (i.e.~images), a machine is required to describe what happened (i.e.~transformation) between every two adjacent states. Different from most existing visual reasoning tasks, which focus on state reasoning, VTT concentrates on transformation reasoning. Moreover, describing the transformation in the form of language is more natural and closer to real applications than the property-change formulation in the previous TVR task. We collect 13,547 samples from two instructional video datasets, i.e.~CrossTask and COIN, and extract desired states and transformation descriptions to form a suitable VTT benchmark dataset. After that, we introduce an end-to-end learning model for VTT, named TTNet. TTNet consists of three components to mimic the human cognitive process of reasoning about transformations. First, an image encoder, e.g. CLIP, reads content from each image, then a context encoder links the image content together, and at last, a transformation decoder autoregressively generates transformation descriptions between every two adjacent images. This basic version of TTNet struggles to meet the cognitive challenge of VTT, that is, identifying abstract transformations from images with small visual differences, and the descriptive challenge, which requires describing the transformations consistently. In response to these difficulties, we propose three strategies to improve TTNet. Specifically, TTNet leverages difference features to emphasize small visual gaps, a masked transformation model to stress context by forcing attention to neighboring transformations, and auxiliary category and topic classification tasks to make transformations consistent by sharing underlying semantics among representations. We adapt some typical methods from visual storytelling and dense video captioning tasks, considering their similarity with VTT. Our experimental results show that TTNet achieves better performance on transformation reasoning. In addition, our empirical analysis demonstrates the soundness of each module in TTNet, and provides some insight into transformation reasoning.","visual reasoning, transformation, captioning" PRUDEX-Compass: Towards Systematic Evaluation of Reinforcement Learning in Financial Markets,https://openreview.net/forum?id=FAXVNe1GxX,https://openreview.net/pdf?id=FAXVNe1GxX,,"The financial markets, which involve more than $90 trillion in market capitalization, attract the attention of innumerable investors around the world. Recently, reinforcement learning in financial markets (FinRL) has emerged as a promising direction to train agents for making profitable investment decisions. However, the evaluation of most FinRL methods only focuses on profit-related measures, which is far from satisfactory for practitioners seeking to deploy these methods in real-world financial markets.
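One of the TTNet strategies described in the Visual Transformation Telling abstract above — difference features for adjacent states — can be sketched simply. This is an illustrative guess at the feature pairing, not the paper's code:

```python
import torch

def transformation_inputs(frame_feats):
    """Pair adjacent image features with their difference, emphasizing
    the small visual gaps between consecutive states.

    frame_feats: (T, d) tensor of per-image encoder features.
    Returns (T-1, 3d): [prev, next, next - prev] per adjacent pair."""
    prev, nxt = frame_feats[:-1], frame_feats[1:]
    return torch.cat([prev, nxt, nxt - prev], dim=-1)
```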
Therefore, we introduce PRUDEX-Compass, which has 6 axes, i.e., Profitability, Risk-control, Universality, Diversity, rEliability, and eXplainability, with a total of 17 measures for a systematic evaluation. Specifically, i) we propose AlphaMix+ as a strong FinRL baseline, which leverages mixture-of-experts (MoE) and risk-sensitive approaches to make diversified risk-aware investment decisions, ii) we evaluate 8 FinRL methods in 4 long-term real-world datasets of influential financial markets to demonstrate the usage of our PRUDEX-Compass, iii) PRUDEX-Compass, together with 4 real-world datasets, a standard implementation of 8 FinRL methods, and a portfolio management RL environment, is released as a public resource to facilitate the design and comparison of new FinRL methods. We hope that PRUDEX-Compass can shed light on future FinRL research and prevent untrustworthy results from stalling FinRL's progress toward successful industry deployment.","Evaluation, Reinforcement Learning, Finance, Benchmarking" Joint Spatiotemporal Attention for Mortality Prediction of Patients with Long COVID,https://openreview.net/forum?id=iMevKmhiUZ,https://openreview.net/pdf?id=iMevKmhiUZ,We proposed a joint spatiotemporal attention mechanism for deep learning-based mortality prediction of long covid.,"Long COVID is a general term for Post-Acute Sequelae of COVID-19. Patients with Long COVID can endure long-lasting symptoms including fatigue, headache, dyspnea and anosmia, facing increased risk of death. Identifying the cohorts with severe long-term complications in COVID-19 could benefit treatment planning and resource arrangement. However, due to the heterogeneous phenotypes and various duration of symptoms presented in patients with Long COVID, it is difficult to predict their outcomes from their longitudinal data. In this study, we proposed a spatiotemporal attention mechanism to weigh feature importance jointly from the temporal dimension and feature space of longitudinal medical data. Considering that medical examinations can have interchangeable orders in adjacent time points, we restricted the learning of short-term dependency with a Local-LSTM and the learning of long-term dependency with the joint spatiotemporal attention. We also compared the proposed method with several state-of-the-art methods and a method in clinical practice. The methods are evaluated on a hard-to-acquire clinical dataset of patients with Long COVID. Experimental results show the Local-LSTM with joint spatiotemporal attention achieved superior performance in mortality prediction compared to related methods. By analyzing the critical time points identified by the joint spatiotemporal attention, we identified time-specific prognostic biomarkers for life-threatening Long COVID. The proposed method provides a clinical tool for the severity assessment of Long COVID.","longitudinal data, mortality prediction, COVID-19, attention" Predicting Antimicrobial MICs for Nontyphoidal Salmonella Using Multitask Representations Learning ,https://openreview.net/forum?id=WgGeNLtoDR,https://openreview.net/pdf?id=WgGeNLtoDR,,"Antimicrobial resistance (AMR) pathogens have become an increasingly serious worldwide issue, posing a significant threat to global public health. To obtain an optimized therapeutic effect, antibiotic sensitivity is usually evaluated in a clinical setting, whereas traditional culture-dependent antimicrobial sensitivity tests are labor-intensive and relatively slow.
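A joint spatiotemporal attention of the kind described in the Long COVID abstract above might weight a longitudinal tensor along both the time and feature axes; a speculative sketch with our own module and score names, not the paper's architecture:

```python
import torch
import torch.nn as nn

class JointSpatiotemporalAttention(nn.Module):
    """Jointly reweight longitudinal inputs over time steps and features."""

    def __init__(self, n_feats, n_steps):
        super().__init__()
        self.time_score = nn.Linear(n_feats, 1)   # scores each time step
        self.feat_score = nn.Linear(n_steps, 1)   # scores each feature

    def forward(self, x):                          # x: (B, T, F)
        t_w = torch.softmax(self.time_score(x), dim=1)                   # (B, T, 1)
        f_w = torch.softmax(self.feat_score(x.transpose(1, 2)), dim=1)   # (B, F, 1)
        return x * t_w * f_w.transpose(1, 2)       # joint (B, T, F) reweighting
```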
Rapid methods can greatly optimize antimicrobial therapeutic strategies and improve patient outcomes by reducing the time it takes to test antibiotic sensitivity. The booming development of sequencing technology and machine learning techniques provides promising alternative approaches for antimicrobial resistance prediction based on sequencing. In this study, we used a lightweight Multitask Learning Transformer to predict the MIC of 14 antibiotics for Salmonella strains based on genomic information, including point mutations, pan-genome structure, and the profile of antibiotic resistance genes from 5,278 publicly available whole genomes of nontyphoidal Salmonella. We obtained better prediction results (improvements of more than 10% in raw accuracy and 3% in accuracy within a ±1 two-fold dilution step) and better interpretability than other ML models. Besides the potential clinical application, our models shed light on the mechanistic understanding of key genetic regions influencing AMR.", Bootstrap Motion Forecasting With Self-Consistent Constraints,https://openreview.net/forum?id=7KSeWGIOYM,https://openreview.net/pdf?id=7KSeWGIOYM,"We introduce self-consistent constraints to improve the performance of motion forecasting in autonomous driving, which can be easily incorporated into other motion forecasting approaches.","We present a novel framework to bootstrap Motion forecasting with self-consistent constraints (MISC). The motion forecasting task aims at predicting future trajectories of vehicles by incorporating spatial and temporal information from the past. A key design of MISC is the proposed Dual Consistency Constraints that regularize the predicted trajectories under spatial and temporal perturbation during training. Also, to model the multi-modality in motion forecasting, we design a novel self-ensembling scheme to obtain accurate teacher targets to enforce the self-constraints with multi-modality supervision. With explicit constraints from multiple teacher targets, we observe a clear improvement in the prediction performance. Extensive experiments on the Argoverse motion forecasting benchmark show that MISC significantly outperforms the state-of-the-art methods. As the proposed strategies are general and can be easily incorporated into other motion forecasting approaches, we also demonstrate that our proposed scheme consistently improves the prediction performance of several existing methods.","Motion Forecasting, Autonomous Driving, Trajectory prediction" DIFFUSED INSTANCE CONDITIONED GAN,https://openreview.net/forum?id=Rc0Xpxxfx5,https://openreview.net/pdf?id=Rc0Xpxxfx5,Improving image quality and mode coverage of GAN using diffusion based Gaussian mixture in feature space as partition guidance. ,"Recently, numerous data partitioning methods for generative adversarial networks have been developed for better coverage of complex distributions. Most of these approaches aim to build fine-grained overlapping clusters in the data manifold and condition both the generator and the discriminator on a compressed representation of the cluster. Although a larger condition can be more informative, existing algorithms only utilize a low-dimensional vector as the condition due to their dependency on clustering algorithms and unsupervised/self-supervised learning methods. In this work, we take a step towards a richer cluster representation by utilizing a diffusion-based Gaussian mixture.
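The Dual Consistency Constraints in the MISC abstract above suggest a loss of roughly the following shape; everything here (the perturbation and inversion callables, the MSE choice) is a placeholder of ours, not the paper's formulation:

```python
import torch.nn.functional as F

def dual_consistency_loss(model, scene, spatial_aug, spatial_inv,
                          temporal_aug, temporal_inv):
    """Hypothetical sketch: trajectory predictions should agree with
    predictions made under spatial / temporal perturbations, after
    the perturbation is undone."""
    base = model(scene)
    spatial = spatial_inv(model(spatial_aug(scene)))
    temporal = temporal_inv(model(temporal_aug(scene)))
    return F.mse_loss(spatial, base) + F.mse_loss(temporal, base)
```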
Our analysis shows that we can derive a continuous representation of a cluster with a Gaussian mixture when the noise scale is given. Moreover, unlike other counterparts, we do not need excessive computation for acquiring the clustered representation. Experiments on multiple datasets show that our model produces better results compared to recent GAN models.","generative adversarial networks, GAN, conditional GAN, image generation" Sparse Hyperbolic Representation Learning,https://openreview.net/forum?id=wOVGs7LJVs3,https://openreview.net/pdf?id=wOVGs7LJVs3,The paper proposes sparse regularization for hyperbolic-space-based machine learning.,"Reducing the space complexity of representations while minimizing the loss of information makes data science procedures computationally efficient. For entities with a tree structure, hyperbolic-space-based representation learning (HSBRL) has successfully reduced the space complexity of representations by using low-dimensional space. Nevertheless, it has not minimized the space complexity of each representation since it has used the same dimension for all representations and has not selected the best dimension for each representation. Hence, this paper aims to minimize representations' space complexity for HSBRL. For minimizing each representation's space complexity, sparse learning has been effective in the context of linear-space-based machine learning; however, no sparse learning method has been proposed for HSBRL. It is non-trivial to propose sparse learning for HSBRL because (i) sparse learning methods designed for linear space cause non-uniform sparseness in hyperbolic space, and (ii) existing Riemannian gradient descent methods fail to obtain sparse representations owing to an oscillation problem. This paper, for the first time, establishes a sparse learning scheme for hyperbolic space, overcoming the above issues with our novel sparse regularization term and optimization algorithm. Our regularization term achieves uniform sparseness since it is defined based on geometric distance from subspaces inducing sparsity. Our optimization algorithm successfully obtains sparse representations, avoiding the oscillation problem by realizing the shrinkage-thresholding idea in a general Riemannian manifold. Numerical experiments demonstrate that our scheme can obtain sparse representations with smaller information loss than traditional methods, successfully avoiding the oscillation problem. ","hyperbolic space, Riemannian manifold, sparse learning, sparse regularization, iterative shrinkage-thresholding algorithm, Riemannian gradient descent" Fair Multi-exit Framework for Facial Attribute Classification,https://openreview.net/forum?id=nY5e2_e7WpY,https://openreview.net/pdf?id=nY5e2_e7WpY,Multi-exit deep neural network improves the model fairness on facial attribute classification.,"Fairness has become increasingly pivotal in facial recognition. Without bias mitigation, deploying unfair AI would harm the interests of underprivileged populations. In this paper, we observe that despite the higher accuracy that deeper neural networks can generally offer, fairness conditions deteriorate as we extract features from deeper layers. This phenomenon motivates us to extend the concept of the multi-exit framework. Unlike existing works mainly focusing on accuracy, our multi-exit framework is fairness-oriented, where the internal classifiers are trained to be more accurate and fairer. During inference, any instance with high confidence from an internal classifier is allowed to exit early.
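The confidence-based early exit just described is straightforward to sketch; a minimal illustration assuming batch size 1 and a softmax-confidence rule of our own choosing:

```python
import torch

@torch.no_grad()
def early_exit_predict(blocks, heads, x, threshold=0.9):
    """Run backbone blocks sequentially; return the first internal
    classifier's prediction whose softmax confidence clears `threshold`."""
    h = x
    for block, head in zip(blocks, heads):
        h = block(h)
        probs = torch.softmax(head(h), dim=-1)
        conf, pred = probs.max(dim=-1)
        if conf.item() >= threshold:   # batch size 1 assumed for clarity
            return pred
    return pred  # fall back to the final classifier's prediction
```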
Moreover, our framework can be applied to most existing fairness-aware frameworks. Experimental results show that the proposed framework can largely improve fairness over the state of the art on the CelebA and UTK Face datasets. ","Fairness, Multi-Exit Deep Neural Network" Learning to Split for Automatic Bias Detection,https://openreview.net/forum?id=IA96Pn7A08h,https://openreview.net/pdf?id=IA96Pn7A08h,"We propose ls, an algorithm that learns to split the given dataset so that predictors trained on the training split cannot generalize to the testing split.","Classifiers are biased when trained on biased datasets. As a remedy, we propose Learning to Split (ls), an algorithm for automatic bias detection. Given a dataset with input-label pairs, ls learns to split this dataset so that predictors trained on the training split cannot generalize to the testing split. This performance gap suggests that the testing split is under-represented in the dataset, which is a signal of potential bias. Identifying non-generalizable splits is challenging since we have no annotations about the bias. In this work, we show that the prediction correctness of each example in the testing split can be used as a source of weak supervision: generalization performance will drop if we move examples that are predicted correctly away from the testing split, leaving only those that are mispredicted. ls is task-agnostic and can be applied to any supervised learning problem, ranging from natural language understanding and image classification to molecular property prediction. Empirical results show that ls is able to generate astonishingly challenging splits that correlate with human-identified biases. Moreover, we demonstrate that combining robust learning algorithms (such as group DRO) with splits identified by ls enables automatic de-biasing. Compared to the previous state of the art, we substantially improve the worst-group performance (23.4% on average) when the source of biases is unknown during training and validation. Our code is included in the supplemental materials and will be publicly available.","bias, robustness, spurious correlation" Union Subgraph Neural Networks,https://openreview.net/forum?id=KICUSNslb7Q,https://openreview.net/pdf?id=KICUSNslb7Q,We propose a Union Subgraph Network that introduces local structural information by a shortest-path-based descriptor.,"Graph Neural Networks (GNNs) are widely used for graph representation learning in many application domains. The expressiveness of GNNs is upper-bounded by the 1-dimensional Weisfeiler-Lehman (1-WL) test as they operate on rooted subtrees in message passing. In this paper, we empower GNNs by injecting neighbor-connectivity information extracted from a new type of substructure. We first investigate different kinds of connectivities existing in a local neighborhood and identify a substructure called union subgraph, which is able to capture the complete picture of the neighborhood. We then design a shortest-path-based substructure descriptor that possesses three nice properties and can effectively encode the high-order connectivities in union subgraphs. By infusing the encoded neighbor connectivities, we propose a novel model, namely Union Subgraph Neural Network (UnionSNN), which is proven to be strictly more powerful than 1-WL in distinguishing non-isomorphic graphs.
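One round of the ls procedure described above might look roughly as follows; the splitter object and its methods are entirely hypothetical stand-ins for the paper's learned splitter:

```python
def ls_round(splitter, dataset, fit_predictor):
    """One hypothetical round of ls: train on the current train split,
    then use prediction correctness on the test split as weak
    supervision -- correctly predicted examples are moved toward the
    train split so that only hard (mispredicted) examples remain."""
    train_split, test_split = splitter.split(dataset)
    predictor = fit_predictor(train_split)
    correct = [ex for ex in test_split if predictor(ex.x) == ex.y]
    splitter.update(move_to_train=correct)
    return splitter
```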
Our extensive experiments on both graph-level and node-level classification tasks demonstrate that UnionSNN outperforms state-of-the-art baseline models, with competitive computational efficiency. ","Graph Neural Network, Representation Learning" Modeling Multimodal Aleatoric Uncertainty in Segmentation with Mixture of Stochastic Experts,https://openreview.net/forum?id=KE_wJD2RK4,https://openreview.net/pdf?id=KE_wJD2RK4,"We propose a novel mixture of stochastic experts (MoSE) model training with a Wasserstein-like loss, which produces an efficient two-level representation for the multi-modal aleatoric uncertainty in semantic segmentation.","Equipping predicted segmentation with calibrated uncertainty is essential for safety-critical applications. In this work, we focus on capturing the data-inherent uncertainty (aka aleatoric uncertainty) in segmentation, typically when ambiguities exist in input images. Due to the high-dimensional output space and potential multiple modes in segmenting ambiguous images, it remains challenging to predict well-calibrated uncertainty for segmentation. To tackle this problem, we propose a novel mixture of stochastic experts (MoSE) model, where each expert network estimates a distinct mode of the aleatoric uncertainty and a gating network predicts the probabilities of an input image being segmented in those modes. This yields an efficient two-level uncertainty representation. To learn the model, we develop a Wasserstein-like loss that directly minimizes the distribution distance between the MoSE and ground truth annotations. The loss can easily integrate traditional segmentation quality measures and be efficiently optimized via constraint relaxation. We validate our method on the LIDC-IDRI dataset and a modified multimodal Cityscapes dataset. Results demonstrate that our method achieves state-of-the-art or competitive performance on all metrics.","Semantic Segmentation, Aleatoric Uncertainty, Stochastic Segmentation, Multiple Annotations" Frame Adaptive Network,https://openreview.net/forum?id=jKgataakTy,https://openreview.net/pdf?id=jKgataakTy,We propose a framework to train video recognition methods which can be evaluated at multiple frames and exhibit better performance compared to individual ones.,"Existing video recognition algorithms always conduct different training pipelines for inputs with different frame numbers, which requires repetitive training operations and multiplies storage costs. If we evaluate the model using frame numbers not used in training, our observation, named Temporal Deviation, shows that performance drops significantly (see Fig. 1). Thus, the common training protocol for video-related tasks is relatively rigid for flexible inference using various testing frames, especially for some edge devices with limited available frames or computational resources. In this study, we propose Frame Adaptive Network (FAN) to conduct one-shot training while enabling the model to be evaluated with different frame numbers. Concretely, FAN integrates several sets of training sequences, involves Specialized Normalization and Weight Alteration to efficiently expand the original network, and leverages Mutual Distillation for optimization.
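A plausible reading of FAN's Specialized Normalization above is one normalization branch per supported frame count; a speculative sketch, with the frame options and module layout assumed by us:

```python
import torch.nn as nn

class SpecializedNorm(nn.Module):
    """One BatchNorm per supported frame count, selected at run time --
    a guess at the 'Specialized Normalization' idea, not FAN's code."""

    def __init__(self, channels, frame_options=(4, 8, 16)):
        super().__init__()
        self.norms = nn.ModuleDict(
            {str(f): nn.BatchNorm3d(channels) for f in frame_options})

    def forward(self, x):                 # x: (B, C, T, H, W)
        return self.norms[str(x.shape[2])](x)   # pick the branch matching T
```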
Comprehensive empirical validations using various architectures and popular benchmarks solidly demonstrate the effectiveness and generalization of FAN (e.g., 3.50/5.76/2.38$\%$ performance gains at frames 4/8/16 on the Something-Something V1 dataset over the competing method Uniformer), which also demonstrates the model's practical potential.","Video Recognition, Temporal Deviation" On the Robustness of Safe Reinforcement Learning under Observational Perturbations,https://openreview.net/forum?id=jbIYfq4Tr-,https://openreview.net/pdf?id=jbIYfq4Tr-,"We study the robustness of safe RL under observational perturbations, and propose two effective adversaries and a defense algorithm to increase the agent's safety under attacks.","Safe reinforcement learning (RL) trains a policy to maximize the task reward while satisfying safety constraints. While prior works focus on performance optimality, we find that the optimal solutions of many safe RL problems are not robust and safe against carefully designed observational perturbations. We formally analyze the unique properties of designing effective state adversarial attackers in the safe RL setting. We show that baseline adversarial attack techniques for standard RL tasks are not always effective for safe RL and propose two new approaches - one maximizes the cost and the other maximizes the reward. One interesting and counter-intuitive finding is that the maximum reward attack is strong, as it can both induce unsafe behaviors and make the attack stealthy by maintaining the reward. We further propose a more effective adversarial training framework for safe RL and evaluate it via comprehensive experiments (video demos are available at: https://sites.google.com/view/robustsaferl/home). This paper provides pioneering work investigating the safety and robustness of RL under observational attacks for future safe RL studies.","Safe reinforcement learning, deep reinforcement learning, state robust reinforcement learning" Rethinking Saliency in Data-free Class Incremental Learning,https://openreview.net/forum?id=gc0HvlDPyA,https://openreview.net/pdf?id=gc0HvlDPyA,,"Data-Free Class Incremental Learning (DFCIL) aims to sequentially learn tasks with access only to data from the current one. DFCIL is of interest because it mitigates concerns about privacy and long-term storage of data, while at the same time alleviating the problem of catastrophic forgetting in incremental learning. In this work, we rethink saliency in DFCIL and propose a new framework, which we call RObust Saliency Supervision (ROSS), for mitigating the negative effect of saliency drift. Firstly, we use a teacher-student architecture leveraging low-level tasks to supervise the model with global saliency. We also apply boundary-guided saliency to protect it from drifting across object boundaries at intermediate layers. Finally, we introduce a module for injecting and recovering saliency noise to increase the robustness of saliency preservation. Our experiments demonstrate that our method can achieve state-of-the-art results on the CIFAR-100, Tiny-ImageNet and ImageNet-Subset DFCIL benchmarks. Code will be made publicly available.", Rethinking the Training Shot Number in Robust Model-Agnostic Meta-Learning,https://openreview.net/forum?id=nlVOZyTZna,https://openreview.net/pdf?id=nlVOZyTZna,,"Model-agnostic meta-learning (MAML) has been successfully applied to few-shot learning, but is not naturally robust to adversarial attacks.
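The "maximum reward" attacker described in the safe RL abstract above can be approximated by bounded gradient ascent on the observation against a value estimate; an illustrative sketch where value_fn is a placeholder for the agent's critic, and all step sizes are our assumptions:

```python
import torch

def max_reward_obs_attack(value_fn, obs, eps=0.05, alpha=0.01, steps=10):
    """Gradient ascent on a bounded observation perturbation so that
    the agent's value estimate stays high (the stealthy attacker:
    reward is maintained while, per the paper, safety can degrade)."""
    delta = torch.zeros_like(obs, requires_grad=True)
    for _ in range(steps):
        value_fn(obs + delta).sum().backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)      # stay inside the L-inf ball
            delta.grad.zero_()
    return (obs + delta).detach()
```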
Previous methods attempted to impose robustness-promoting regularization on MAML's bi-level training procedure to achieve an adversarially robust model. They follow the typical MAML practice where the training shot number is kept the same as the test shot number to guarantee optimal novel-task adaptation. However, as we observe, introducing robustness-promoting regularization into MAML reduces the intrinsic dimension of features, which actually results in a mismatch between meta-training and meta-testing in terms of affordable intrinsic dimension. Consequently, previous robust MAML methods sacrifice considerable clean accuracy. In this paper, based on our observations, we propose a simple strategy to mitigate the intrinsic dimension mismatch caused by robustness-promoting regularization, i.e., increasing the number of training shots. Though simple, our method remarkably improves the clean accuracy of MAML without much loss of robustness. Extensive experiments demonstrate that our method outperforms prior art in achieving a better trade-off between accuracy and robustness. Besides, we observe our method is less sensitive to the number of fine-tuning steps during meta-training, which allows for a reduced number of fine-tuning steps to improve training efficiency. ", Behind the Scenes of Gradient Descent: A Trajectory Analysis via Basis Function Decomposition,https://openreview.net/forum?id=TPiwkItUSu,https://openreview.net/pdf?id=TPiwkItUSu,,"This work analyzes the solution trajectory of gradient-based algorithms via a novel basis function decomposition. We show that, although solution trajectories of gradient-based algorithms may vary depending on the learning task, they behave almost monotonically when projected onto an appropriate orthonormal function basis. Such projection gives rise to a basis function decomposition of the solution trajectory. Theoretically, we use our proposed basis function decomposition to establish the convergence of gradient descent (GD) on several representative learning tasks. In particular, we improve the convergence of GD on symmetric matrix factorization and provide a completely new convergence result for the orthogonal symmetric tensor decomposition. Empirically, we illustrate the promise of our proposed framework on realistic deep neural networks (DNNs) across different architectures, gradient-based solvers, and datasets. Our key finding is that gradient-based algorithms monotonically learn the coefficients of a particular orthonormal function basis of DNNs defined as the eigenvectors of the conjugate kernel after training.","nonconvex optimization, trajectory analysis, neural network optimization" What Is Missing in IRM Training and Evaluation? Challenges and Solutions,https://openreview.net/forum?id=MjsDeTcDEy,https://openreview.net/pdf?id=MjsDeTcDEy,,"Invariant risk minimization (IRM) has received increasing attention as a way to acquire environment-agnostic data representations and predictions, and also as a principled solution for preventing spurious correlations from being learned and for improving models' out-of-distribution generalization. Yet, recent works have found that the optimality of the originally-proposed IRM optimization (IRMV1) may be compromised in practice or could be impossible to achieve in some scenarios. Therefore, a series of advanced IRM algorithms have been developed that show practical improvement over IRMV1.
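Computationally, the basis function decomposition in the trajectory-analysis abstract above reduces to projecting the parameter trajectory onto an orthonormal basis; a minimal sketch under that reading, with names our own:

```python
import numpy as np

def trajectory_coefficients(trajectory, basis):
    """Project each iterate onto an orthonormal basis.

    trajectory: (T, d) array of flattened parameters over training.
    basis:      (d, k) matrix with orthonormal columns (e.g.,
                eigenvectors of the conjugate kernel).
    Returns a (T, k) array of coefficients; per the abstract, each
    column is expected to evolve almost monotonically."""
    return trajectory @ basis
```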
In this work, we revisit these recent IRM advancements and identify and resolve three practical limitations in IRM training and evaluation. First, we find that the effect of batch size during training has been chronically overlooked in previous studies, leaving room for further improvement. We propose small-batch training and highlight the improvements over a set of large-batch optimization techniques. Second, we find that improper selection of evaluation environments could give a false sense of invariance for IRM. To alleviate this effect, we leverage diversified test-time environments to precisely characterize the invariance of IRM when applied in practice. Third, we revisit Ahuja et al. (2020)’s proposal to convert IRM into an ensemble game and identify a limitation when a single invariant predictor is desired instead of an ensemble of individual predictors. We propose a new IRM variant to address this limitation based on a novel viewpoint of ensemble IRM games as consensus-constrained bi-level optimization. Lastly, we conduct extensive experiments (covering 7 existing IRM variants and 7 datasets) to justify the practical significance of revisiting IRM training and evaluation in a principled manner.","invariant risk minimization, bi-level optimization" Neural Decoding of Visual Imagery via Hierarchical Variational Autoencoders,https://openreview.net/forum?id=TM9jOSaIzN,https://openreview.net/pdf?id=TM9jOSaIzN,We propose a novel architecture for decoding visual imagery from fMRI recordings using Hierarchical VAEs.,"Reconstructing natural images from fMRI recordings is a challenging task of great importance in neuroscience. The current architectures are bottlenecked because they fail to effectively capture the hierarchical processing of visual stimuli that takes place in the human brain. Motivated by that fact, we introduce a novel neural network architecture for the problem of neural decoding. Our architecture uses Hierarchical Variational Autoencoders (HVAEs) to learn meaningful representations of natural images and leverages their latent space hierarchy to learn voxel-to-image mappings. By mapping the early stages of the visual pathway to the first set of latent variables and the higher visual cortex areas to the deeper layers in the latent hierarchy, we are able to construct a latent variable neural decoding model that replicates the hierarchical visual information processing. Our model achieves better reconstructions compared to the state of the art and our ablation study indicates that the hierarchical structure of the latent space is responsible for that performance. ","neural decoding, hierarchical variational autoencoders, neuroscience" Cooperative Adversarial Learning via Closed-Loop Transcription,https://openreview.net/forum?id=pgC3fd-zjw2,https://openreview.net/pdf?id=pgC3fd-zjw2,"This paper proposes a generative model that implements cooperative adversarial learning, which is robust to net architectures and performs well, and disentangled visual attributes are well modeled in independent principal components.","This paper proposes a generative model that implements cooperative adversarial learning via closed-loop transcription. In the generative model training, the encoder and decoder are trained simultaneously, and not only the adversarial process but also a cooperative process is included. 
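For reference, the IRMV1 penalty that these advancements build on is commonly written as the squared gradient of the per-environment risk with respect to a dummy classifier scale; a standard sketch, not this paper's proposed variant:

```python
import torch
import torch.nn.functional as F

def irmv1_penalty(logits, labels):
    """IRMV1-style penalty: squared gradient of the environment risk
    w.r.t. a dummy multiplicative scale on the logits."""
    scale = torch.ones(1, device=logits.device, requires_grad=True)
    loss = F.cross_entropy(logits * scale, labels)
    (grad,) = torch.autograd.grad(loss, [scale], create_graph=True)
    return (grad ** 2).sum()
```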
In the adversarial process, the encoder acts as a critic to maximize the distance between the original and transcribed images, where the distance is measured by rate reduction in the feature space; in the cooperative process, the encoder and the decoder cooperatively minimize the distance to improve the transcription quality. Cooperative adversarial learning possesses the concepts and properties of Auto-Encoding and GAN, and it is unique in that the encoder actively controls the training process as it is trained in both learning processes in two different roles. Experiments demonstrate that without regularization techniques, our generative model is robust to network architectures and easy to train; sample-wise reconstruction performs well in terms of sample features; and disentangled visual attributes are well modeled in independent principal components.","generative models, rate reduction, closed-loop transcription" Multi-task Self-supervised Graph Neural Networks Enable Stronger Task Generalization,https://openreview.net/forum?id=1tHAZRqftM,https://openreview.net/pdf?id=1tHAZRqftM,"We present ParetoGNN, a novel multi-task self-supervised learning framework for graph neural networks, that enhances the task generalization across various downstream tasks and datasets.","Self-supervised learning (SSL) for graph neural networks (GNNs) has attracted increasing attention from the graph machine learning community in recent years, owing to its capability to learn performant node embeddings without costly label information. One weakness of conventional SSL frameworks for GNNs is that they learn through a single philosophy, such as mutual information maximization or generative reconstruction. When applied to various downstream tasks, these frameworks rarely perform equally well for every task, because one philosophy may not span the extensive knowledge required for all tasks. In light of this, we introduce ParetoGNN, a multi-task SSL framework for node representation learning over graphs. Specifically, ParetoGNN is self-supervised by manifold pretext tasks observing multiple philosophies. To reconcile different philosophies, we explore a multiple-gradient descent algorithm, such that ParetoGNN actively learns from every pretext task while minimizing potential conflicts. We conduct comprehensive experiments over four downstream tasks (i.e., node classification, node clustering, link prediction, and partition prediction), and our proposal achieves the best overall performance across tasks on 11 widely adopted benchmark datasets. Besides, we observe that learning from multiple philosophies enhances not only the task generalization but also the single task performance, demonstrating that ParetoGNN achieves better task generalization via the disjoint yet complementary knowledge learned from different philosophies. ","Graph Neural Network, Self-supervised Learning" Analyzing Tree Architectures in Ensembles via Neural Tangent Kernel,https://openreview.net/forum?id=V_06QV-kZX,https://openreview.net/pdf?id=V_06QV-kZX,We formulate and analyze the Neural Tangent Kernel (NTK) induced by soft tree ensembles for arbitrary tree architectures,"A soft tree is an actively studied variant of a decision tree that updates splitting rules using the gradient method. Although it can take various tree architectures, their impact is not theoretically well known. In this paper, we formulate and analyze the Neural Tangent Kernel (NTK) induced by soft tree ensembles for arbitrary tree architectures.
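The rate-reduction distance mentioned in the closed-loop transcription abstract above builds on coding-rate quantities of roughly the following form (after Ma et al.'s rate-reduction framework); a sketch, with the epsilon value our assumption:

```python
import torch

def coding_rate(Z, eps=0.5):
    """Coding rate R(Z) = 1/2 logdet(I + d/(n eps^2) Z^T Z)
    for a feature matrix Z of shape (n, d)."""
    n, d = Z.shape
    I = torch.eye(d, device=Z.device, dtype=Z.dtype)
    return 0.5 * torch.logdet(I + (d / (n * eps ** 2)) * Z.t() @ Z)
```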
This kernel leads to the remarkable finding that only the number of leaves at each depth is relevant for the tree architecture in ensemble learning with infinitely many trees. In other words, if the number of leaves at each depth is fixed, the training behavior in function space and the generalization performance are exactly the same across different tree architectures, even if they are not isomorphic. We also show that the NTK of asymmetric trees like decision lists does not degenerate when they get infinitely deep. This is in contrast to the perfect binary trees, whose NTK is known to degenerate and leads to worse generalization performance for deeper trees.","Neural Tangent Kernel, Tree Ensemble, Soft Tree" Learn Appropriate Precise Distributions for Binary Neural Networks,https://openreview.net/forum?id=4CVu_buZwt,https://openreview.net/pdf?id=4CVu_buZwt,,"Binary Neural Networks (BNNs) have shown great promise for real-world embedded devices. However, BNNs often suffer from unsatisfactory accuracy on large datasets such as ImageNet, which could hinder their widespread application in practice. Nevertheless, enhancing a BNN's performance is extremely challenging owing to its limited capacity. Several distillation approaches, in which the knowledge of a real-valued teacher model is distilled to a binary student network, have been proposed to boost BNN accuracy. However, directly employing previous distillation solutions yields inferior results due to an unsuitable match between the representational capacity of the adopted real-valued teacher model and the target binary student network. In this work, we reexamine the design of the knowledge distillation framework specifically for BNNs and test the limits of what a pure BNN can achieve. We first define a group consisting of multiple real-valued networks with particular properties, and then introduce a distribution-specific loss to force the binary network to mimic the distribution of one real-valued network fetched from this group in a certain order. In addition, we propose a distance-aware combinational model to provide the binary network with more comprehensive guidance, and present suitable training strategies. The BNN in this knowledge distillation framework is thus able to learn appropriate precise distributions, and is dubbed APD-BNN. As a result, APD-BNN can reach its performance limit while incurring no additional computational cost. Compared with the state-of-the-art BNNs, APD-BNN can obtain up to 1.4$\%$ higher accuracy on the ImageNet dataset while using the same architecture. Specifically, APD-BNN is capable of gaining 72.0$\%$ top-1 accuracy on ImageNet with only 87M OPs. Thus, it achieves the same accuracy as the official real-valued MobileNetV2 with 71$\%$ fewer OPs, demonstrating the huge potential of applying BNNs in practice. Our code and models will be available.", Correcting Data Distribution Mismatch in Offline Meta-Reinforcement Learning with Few-Shot Online Adaptation,https://openreview.net/forum?id=Dk7tsv9fkF,https://openreview.net/pdf?id=Dk7tsv9fkF,"This paper formalizes the data distribution mismatch between offline meta-training and online adaptation, and proposes a novel data correction algorithm for effective online adaptation.","Offline meta-reinforcement learning (offline meta-RL) extracts knowledge from a given dataset of multiple tasks and achieves fast adaptation to new tasks.
Recent offline meta-RL methods typically use task-dependent behavior policies (e.g., training RL agents on each individual task) to collect a multi-task dataset and learn an offline meta-policy. However, these methods always require extra information for fast adaptation, such as offline context for testing tasks or oracle reward functions. Offline meta-RL with few-shot online adaptation remains an open problem. In this paper, we first formally characterize a unique challenge under this setting: data distribution mismatch between offline training and online adaptation. This distribution mismatch may lead to unreliable offline policy evaluation, and the regular adaptation methods of online meta-RL will suffer. To address this challenge, we introduce a novel mechanism of data distribution correction, which ensures the consistency between offline and online evaluation by filtering out out-of-distribution episodes in online adaptation. As few-shot out-of-distribution episodes usually have lower returns, we propose a Greedy Context-based data distribution Correction approach, called GCC, which greedily infers how to solve new tasks. GCC diversely samples “task hypotheses” from the current posterior belief and selects a greedy hypothesis with the highest return to update the task belief. Our method is the first to provide effective online adaptation without additional information, and can be combined with off-the-shelf context-based offline meta-training algorithms. Empirical experiments show that GCC achieves state-of-the-art performance on the Meta-World ML1 benchmark compared to baselines with/without offline adaptation. ","offline meta reinforcement learning, offline reinforcement learning, meta-reinforcement learning, few-shot online adaptation, data distribution mismatch correction" Sharper Rates and Flexible Framework for Nonconvex SGD with Client and Data Sampling,https://openreview.net/forum?id=En7lGmzT_x,https://openreview.net/pdf?id=En7lGmzT_x,,"We revisit the classical problem of finding an approximately stationary point of the average of $n$ smooth and possibly nonconvex functions. The optimal complexity of stochastic first-order methods in terms of the number of gradient evaluations of individual functions is $\mathcal{O}\left(n + n^{1/2}\varepsilon^{-1}\right)$, attained by the optimal SGD methods SPIDER (Cong Fang et al., 2018) and PAGE (Zhize Li et al., 2020), for example, where $\varepsilon$ is the error tolerance. However, i) the big-$\mathcal{O}$ notation hides crucial dependencies on the smoothness constants associated with the functions, and ii) the rates and theory in these methods assume simplistic sampling mechanisms that do not offer any flexibility. In this work we remedy the situation. First, we generalize the PAGE algorithm so that it can provably work with virtually any (unbiased) sampling mechanism. This is particularly useful in federated learning, as it allows us to construct and better understand the impact of various combinations of client and data sampling strategies. Second, our analysis is sharper as we make explicit use of certain novel inequalities that capture the intricate interplay between the smoothness constants and the sampling procedure. Indeed, our analysis is better even for the simple sampling procedure analyzed in the PAGE paper. However, this already improved bound can be further sharpened by a different sampling scheme which we propose.
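For context, the vanilla PAGE estimator that the abstract above generalizes keeps a running gradient estimate and occasionally refreshes it with a full gradient; a minimal sketch, with callables standing in for the objective's gradient oracles:

```python
import numpy as np

def page_estimator(g_prev, x, x_prev, grad_on, full_grad, batch, p):
    """Vanilla PAGE gradient estimator: with probability p compute the
    full gradient; otherwise reuse the previous estimator, corrected by
    a minibatch gradient difference evaluated on the *same* batch.

    grad_on(batch, x): minibatch gradient; full_grad(x): full gradient."""
    if np.random.rand() < p:
        return full_grad(x)
    return g_prev + grad_on(batch, x) - grad_on(batch, x_prev)
```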
In summary, we provide the most general and most accurate analysis of optimal SGD in the smooth nonconvex regime. Finally, our theoretical findings are supported by carefully designed experiments.","nonconvex optimization, empirical risk minimization, SGD, variance reduction, data sampling, client sampling, optimal methods, biased gradient estimator, federated learning" "Universal embodied intelligence: learning from crowd, recognizing the world, and reinforced with experience",https://openreview.net/forum?id=3e5nHhhRK93,https://openreview.net/pdf?id=3e5nHhhRK93,,"Interactive artificial intelligence in the motion control field is an interesting topic, especially when universal knowledge that adapts to multiple tasks and universal environments is desired. Although there are increasing efforts to study reinforcement learning (RL) with the assistance of transformers, these approaches may be subject to the limitations of the offline training pipeline, in which exploration and generalization ability are prohibited. Motivated by cognitive and behavioral psychology, such an agent should be able to learn from others, recognize the world, and practice based on its own experience. In this study, we propose the Online Decision MetaMorphFormer (ODM) framework, which attempts to achieve the above learning modes with a unified model architecture that both highlights the agent's own body perception and produces action and observation predictions. ODM can be applied to any agent with a multi-joint body, located in different environments, and trained on different types of tasks. A large-scale pretraining dataset is used to warm up ODM, while the target environment continues to reinforce the universal policy. Substantial interactive experiments, as well as few-shot and zero-shot tests in unseen environments and on never-experienced tasks, verify ODM's performance and generalization ability. Our study sheds some light on research into general artificial intelligence in the embodied and cognitive fields. ","reinforcement learning, transformer, morphology, pretrain, finetune, generalization" Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders,https://openreview.net/forum?id=7sn6Vxp92xV,https://openreview.net/pdf?id=7sn6Vxp92xV,We conduct analysis of the dynamics of the self-distillation scheme in masked auto-encoder.,"Masked image modeling (MIM) has become a popular strategy for self-supervised learning (SSL) of visual representations with Vision Transformers. A representative MIM model, the masked auto-encoder (MAE), randomly masks a subset of image patches and reconstructs the masked patches given the unmasked patches. Concurrently, many recent works in self-supervised learning utilize the student/teacher paradigm which provides the student with an additional target based on the output of a teacher composed of an exponential moving average (EMA) of previous students. Although common, relatively little is known about the dynamics of the interaction between the student and teacher. Through analysis of a simple linear model, we find that the teacher conditionally removes previous gradient directions based on feature similarities, which effectively acts as a conditional momentum regularizer. From this analysis, we present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE), by adding an EMA teacher to MAE. 
We find that RC-MAE converges faster and requires less memory than state-of-the-art self-distillation methods during pre-training, which may provide a way to enhance the practicality of prohibitively expensive self-supervised learning of Vision Transformer models. Additionally, we show that RC-MAE achieves greater robustness and better performance than MAE on downstream tasks such as ImageNet-1K classification, object detection, and instance segmentation.","self-supervised learning, masked auto-encoder" Multifactor Sequential Disentanglement via Structured Koopman Autoencoders,https://openreview.net/forum?id=6fuPIe9tbnC,https://openreview.net/pdf?id=6fuPIe9tbnC,A new method for learning multifactor disentangled representations of sequential data,"Disentangling complex data into its latent factors of variation is a fundamental task in representation learning. Existing work on sequential disentanglement mostly provides two-factor representations, i.e., it separates the data into time-varying and time-invariant factors. In contrast, we consider multifactor disentanglement in which multiple (more than two) semantic disentangled components are generated. Key to our approach is a strong inductive bias where we assume that the underlying dynamics can be represented linearly in the latent space. Under this assumption, it becomes natural to exploit the recently introduced Koopman autoencoder models. However, disentangled representations are not guaranteed in Koopman approaches, and thus we propose a novel spectral loss term which leads to structured Koopman matrices and disentanglement. Overall, we propose a simple, easy-to-code new deep model that is fully unsupervised and supports multifactor disentanglement. We showcase new disentangling abilities such as swapping of individual static factors between characters, and an incremental swap of disentangled factors from the source to the target. Moreover, we evaluate our method extensively on two-factor standard benchmark tasks where we significantly improve over competing unsupervised approaches, and we perform competitively in comparison to weakly- and self-supervised state-of-the-art approaches.","Koopman methods, Sequential Disentanglement" Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks,https://openreview.net/forum?id=BrJATVZDWEH,https://openreview.net/pdf?id=BrJATVZDWEH,,"The field of Natural Language Processing (NLP) has experienced a dramatic leap in capabilities with the recent introduction of huge Language Models (LMs). Despite this success, natural language problems that involve several compounded steps are still practically unlearnable, even by the largest LMs. This is consistent with experimental failures of end-to-end learning of composite problems that have been demonstrated in a variety of domains. An effective mitigation is to introduce intermediate supervision for solving sub-tasks of the compounded problem. Recently, several works have demonstrated high gains by taking a straightforward approach to incorporating intermediate supervision in compounded natural language problems: the sequence-to-sequence LM is fed with an augmented input, in which the decomposed tasks' labels are simply concatenated to the original input. In this paper, we prove a positive learning result that motivates these recent efforts. We show that when concatenating intermediate supervision to the input and training a sequence-to-sequence model on this modified input, unlearnable composite problems can become learnable. 
We show that this is true for any family of tasks which, on the one hand, are unlearnable and, on the other hand, can be decomposed into a polynomial number of simple sub-tasks, each of which depends only on $O(1)$ previous sub-task results. Beyond motivating contemporary empirical efforts for incorporating intermediate supervision in sequence-to-sequence language models, our positive theoretical result is the first of its kind in the landscape of results on the benefits of intermediate supervision for neural-network learning: until now, all theoretical results on the subject have been negative, i.e., they show cases where learning is impossible without intermediate supervision, while our result is positive, showing that learning is facilitated in the presence of intermediate supervision.","Language Models, Generalization, End-to-End, Composition" T2D: Spatiotemporal Feature Learning Based on Triple 2D Decomposition,https://openreview.net/forum?id=HUsh1c7p0gc,https://openreview.net/pdf?id=HUsh1c7p0gc,A new spatiotemporal feature learning method based on triple 2D decomposition.,"In this paper, we propose triple 2D decomposition (T2D) of a 3D vision Transformer (ViT) for efficient spatiotemporal feature learning. The idea is to divide the input 3D video data into three 2D data planes and use three 2D filters, implemented by 2D ViT, to extract spatial and motion features. Such a design not only effectively reduces the computational complexity of a 3D ViT, but also guides the network to focus on learning correlations among more relevant tokens. Compared with other decomposition methods, the proposed T2D is shown to be more powerful at a similar computational complexity. The CLIP-initialized T2D-B model achieves state-of-the-art top-1 accuracy of 85.0% and 70.5% on the Kinetics-400 and Something-Something-v2 datasets, respectively. It also outperforms other methods by a large margin on the FineGym (+17.9%) and Diving-48 (+1.3%) datasets. Under the zero-shot setting, the T2D model obtains a 2.5% top-1 accuracy gain over X-CLIP on the HMDB-51 dataset. In addition, T2D is a general decomposition method that can be plugged into any ViT structure of any model size. We demonstrate this by building a tiny-size T2D model based on a hierarchical ViT structure named DaViT. The resulting DaViT-T2D-T model achieves 82.0\% and 71.3\% top-1 accuracy with only 91 GFLOPs on the Kinetics-400 and Something-Something-v2 datasets, respectively. Source code will be made publicly available. ","spatiotemporal feature learning, video recognition, action recognition, video Transformer" Online Placebos for Class-incremental Learning,https://openreview.net/forum?id=D7shOsFXMv,https://openreview.net/pdf?id=D7shOsFXMv,We design an online learning algorithm to quickly evaluate and select unlabeled data to improve the KD loss in class-incremental learning. ,"Not forgetting old class knowledge is a key challenge for class-incremental learning (CIL) when the model continuously adapts to new coming classes. A common technique to address this is knowledge distillation (KD), which penalizes prediction inconsistencies between old and new models. Such predictions are made almost entirely with new class data, as old class data is extremely scarce due to the strict memory limitation in CIL. In this paper, we take a deep dive into KD losses and find that “using new class data for KD” not only hinders the model adaptation (for learning new classes) but also results in low efficiency for preserving old class knowledge. 
We address this by “using the placebos of old classes for KD”, where the placebos are chosen from a free image stream, such as Google Images, in an automatic and economical fashion. To this end, we train an online placebo selection policy to quickly evaluate the quality of streaming images (good or bad placebos) and use only good ones for one-time feed-forward computation of KD. We formulate the policy training process as an online Markov Decision Process (MDP), and introduce an online learning algorithm to solve this MDP problem without incurring much computational cost. In experiments, we show that our method 1) is surprisingly effective even when there is no class overlap between placebos and original old class data, 2) does not require any additional supervision or memory budget, and 3) significantly outperforms a number of top-performing CIL methods, in particular when using lower memory budgets for old class exemplars, e.g., five exemplars per class. The code is available in the supplementary. ","incremental learning, continual learning, class-incremental learning" Evaluating Long-Term Memory in 3D Mazes,https://openreview.net/forum?id=yHLvIlE9RGN,https://openreview.net/pdf?id=yHLvIlE9RGN,We introduce a benchmark environment and dataset for evaluating the memory abilities of RL agents and their representations.,"Intelligent agents need to remember salient information to reason in partially-observed environments. For example, agents with a first-person view should remember the positions of relevant objects even if they go out of view. Similarly, to effectively navigate through rooms, agents need to remember the floor plan of how rooms are connected. However, most benchmark tasks in reinforcement learning do not test long-term memory in agents, slowing down progress in this important research direction. In this paper, we introduce the Memory Maze, a 3D domain of randomized mazes specifically designed for evaluating long-term memory in agents. Unlike existing benchmarks, Memory Maze measures long-term memory separately from confounding agent abilities and requires the agent to localize itself by integrating information over time. With Memory Maze, we propose an online reinforcement learning benchmark, a diverse offline dataset, and an offline probing evaluation. Recording a human player establishes a strong baseline and verifies the need to build up and retain memories, which is reflected in their gradually increasing rewards within each episode. We find that current algorithms benefit from training with truncated backpropagation through time and succeed on small mazes, but fall short of human performance on the large mazes, leaving room for future algorithmic designs to be evaluated on the Memory Maze.","Reinforcement Learning, Memory, Benchmark, Dataset, Representation Learning" Packed Ensembles for efficient uncertainty estimation,https://openreview.net/forum?id=XXTyv1zD9zD,https://openreview.net/pdf?id=XXTyv1zD9zD,Packed-Ensembles leverage the width of DNNs and grouped convolutions to train subnetworks in parallel and form an efficient ensemble.,"Deep Ensembles (DE) are a prominent approach achieving excellent performance on key metrics such as accuracy, calibration, uncertainty estimation, and out-of-distribution detection. However, hardware limitations of real-world systems constrain them to smaller ensembles and lower-capacity networks, significantly deteriorating their performance and properties. 
We introduce Packed-Ensembles (PE), a strategy to design and train lightweight structured ensembles by carefully modulating the dimension of their encoding space. We leverage grouped convolutions to parallelize the ensemble into a single common backbone and forward pass to improve training and inference speeds. PE is designed to work under the memory budget of a single standard neural network. Through extensive studies, we show that PE faithfully preserves the properties of DE, e.g., diversity, and matches their performance in terms of accuracy, calibration, out-of-distribution detection, and robustness to distribution shift.","Efficient Ensembling, Uncertainty Quantification, OOD Detection" Proactive Multi-Camera Collaboration for 3D Human Pose Estimation,https://openreview.net/forum?id=CPIy9TWFYBG,https://openreview.net/pdf?id=CPIy9TWFYBG,We propose a novel MARL framework to solve proactive multi-camera collaborations for 3D HPE in human crowds,"For human motion capture (MoCap), particularly outdoors, fixed-viewpoint multi-camera solutions are susceptible to dynamic occlusions and constrained in capture space. An active camera approach, in contrast, aims to proactively control the camera poses to find optimal viewpoints for 3D reconstruction. This work introduces a multi-agent reinforcement learning (MARL) scheme for proactive Multi-Camera Collaboration for 3D Human Pose Estimation (MCC-HPE) in dynamic human crowds. At its core is a novel Collaborative Triangulation Contribution Reward (CTCR) that incentivizes agents according to their weighted average marginal contribution to the 3D reconstruction. CTCR improves convergence and alleviates the multi-agent credit assignment issue that results from using 3D reconstruction accuracy as the shared reward. To better capture environment dynamics and to encourage anticipatory behaviors for occlusion avoidance, we jointly train our model with multiple world dynamics learning tasks. We evaluate our proposed method in four photo-realistic UE4 environments to ensure validity and generalizability. The empirical results show that our methods steadily outperform the fixed and active baselines in different scenarios with various numbers of cameras and humans.","Multi-Cameras Collaboration, Multi-Agent Credit Assignment, Active Vision, Human Pose Estimation" OpenFE: Automated Feature Generation beyond Expert-level Performance,https://openreview.net/forum?id=CnG8rd1hHeT,https://openreview.net/pdf?id=CnG8rd1hHeT,OpenFE: automated feature generation beyond expert-level performance,"The goal of automated feature generation is to liberate machine learning experts from the laborious task of manual feature generation, which is crucial for improving the learning performance of tabular data. The major challenge in automated feature generation is to efficiently and accurately identify useful features from a vast pool of candidate features. In this paper, we present OpenFE, an automated feature generation tool that provides competitive results against machine learning experts. OpenFE achieves efficiency and accuracy with two components: 1) a novel feature boosting method for accurately estimating the incremental performance of candidate features, and 2) a feature-scoring framework for retrieving effective features from a large number of candidates through successive featurewise halving and feature importance attribution. Extensive experiments on seven benchmark datasets show that OpenFE outperforms existing baseline methods. 
We further evaluate OpenFE in two famous Kaggle competitions with thousands of data science teams participating. In one of the competitions, features generated by OpenFE with a simple baseline model can beat 99.3% of data science teams, demonstrating for the first time that automated feature generation can outperform human experts. In addition to the empirical results, we provide a theoretical perspective to show that feature generation is provably beneficial in a simple yet representative setting. Codes and datasets are available in the supplementary materials.","tabular data, feature generation" "FDNet: Focal Decomposed Network for Efficient, Robust and Practical time series forecasting",https://openreview.net/forum?id=WXjBX7uz7lO,https://openreview.net/pdf?id=WXjBX7uz7lO,"A Focal Decomposed Network for efficient, robust and practical time series forecasting","This paper presents FDNet: a Focal Decomposed Network for efficient, robust and practical time series forecasting. We break away from conventional deep time series forecasting formulas which obtain prediction results from universal feature maps of input sequences. In contrast, FDNet neglects universal correlations among input elements and only extracts fine-grained local features from the input sequence. We show that: (1) Deep time series forecasting with only fine-grained local feature maps of the input sequence is feasible and competitive on a theoretical basis. (2) By abandoning global coarse-grained feature maps, FDNet overcomes the distribution shift problem caused by changing local dynamics of time series, which is common in real-world applications. (3) FDNet does not depend on any assumptions or prior knowledge of time series except basic auto-regression, which makes it general and practical. Moreover, we propose a focal input sequence decomposition method which decomposes the input sequence in a focal manner for efficient and robust forecasting when facing the LSTI problem. FDNet achieves promising forecasting performance on five benchmark datasets and reduces prediction MSE by 38.4% on average compared with seven other SOTA forecasting baselines.","Time series Forecasting, Deep Learning, Neural Networks" Physics-empowered Molecular Representation Learning,https://openreview.net/forum?id=424tG_RaE-,https://openreview.net/pdf?id=424tG_RaE-,We propose a Transformer-based molecular energy prediction model equipped with physical insights and self-supervised masked atomic modeling.,"Estimating the energetic properties of molecular systems is a critical task in material design. Given the trade-off between accuracy and computational cost, various methods have been used to predict the energy of materials, including recent neural-net-based models. However, most existing neural-net models are context-free (physics-ignoring) black-box models, limiting their application to predicting energy only within the distribution of the training set and thus preventing them from being applied in the real practice of molecular design. Inspired by the physical mechanism of the interatomic potential, we propose a physics-driven energy prediction model using a Transformer. 
Our model is trained not only on the energy regression in the training set, but also with conditions inspired by physical insights and self-supervision based on Masked Atomic Modeling. This makes it adaptable to the optimization of molecular structures beyond the range observed during training, taking a step towards realizable molecular structure optimization.","Physics, Transformer, Molecular representation learning, ML potential" On the Difficulties of Video Summarization: Structure and Subjectivity,https://openreview.net/forum?id=Z3825mh8yk9,https://openreview.net/pdf?id=Z3825mh8yk9,We tackle subjectivity of video summarization by semantic boundary-aware hierarchical video modeling.,"Video summarization, aiming at selecting a representative set of frames from a video within a limited budget, is a challenging problem in computer vision. First, to summarize a video with complex contents, understanding the storytelling structure is essential, but this fundamental step is still largely under-explored. Also, summarization is subjective in nature, since each annotator may have different views on what the most important part is within a video. To tackle these difficulties, we propose the Hierarchical model for video Summarization (HiSum), which discovers the semantic hierarchy of a video via event boundary detection and takes advantage of it for important frame selection. From extensive experiments on two standard benchmarks and three other new datasets specially designed to account for subjectivity, we demonstrate that our model achieves state-of-the-art performance.","hierarchical, video summarization" CCMLN: Combinatorial Correction for Multi-Label Classification with Noisy Labels,https://openreview.net/forum?id=C-7D-u3q62f,https://openreview.net/pdf?id=C-7D-u3q62f,,"Multi-label classification aims to learn classification models from instances associated with multiple labels. It is pivotal to learn and utilize the label dependence among multiple labels in multi-label classification. As a result of today’s big and complex data, noisy labels are inevitable, making it pressing to target multi-label classification with noisy labels. Although the importance of label dependence has been shown in multi-label classification with clean labels, it is challenging and unresolved to bring label dependence to the problem of multi-label classification with noisy labels. The issues are that we do not understand why the label dependence is helpful in the problem, and how to learn and utilize the label dependence using only training data with noisy multiple labels. In this paper, we bring label dependence to tackle the problem of multi-label classification with noisy labels. Specifically, we first provide a high-level understanding of why label dependence helps distinguish the examples with clean/noisy multiple labels. Benefiting from the memorization effect in handling noisy labels, a novel algorithm is then proposed to learn the label dependence by employing only training data with noisy multiple labels, and to utilize the learned dependence to help correct noisy multiple labels to clean ones. We prove that the use of label dependence could bring a higher success rate for recovering correct multiple labels. Empirical evaluations justify our claims and demonstrate the superiority of our algorithm. 
", Revisiting Domain Randomization Via Relaxed State-Adversarial Policy Optimization,https://openreview.net/forum?id=2aRlyrY-LsJ,https://openreview.net/pdf?id=2aRlyrY-LsJ,,"Domain randomization (DR) is widely used in reinforcement learning (RL) to bridge the gap between simulation and reality through maximizing its average returns under the perturbation of environmental parameters. Although effective, the methods have two limitations: (1) Even the most complex simulators cannot capture all details in reality due to finite domain parameters and simplified physical models. (2) Previous methods often assume that the distribution of domain parameters is a specific family of probability functions, such as a normal or a uniform distribution, which may not be correct. To enable robust RL via DR without the aforementioned limitations, we rethink DR from the perspective of adversarial state perturbation, without the need for re-configuring the simulator or relying on prior knowledge about the environment. We point out that perturbing agents to the worst states during training is naive and could make the agents over-conservative. Hence, we present a Relaxed State-Adversarial Algorithm to tackle the over-conservatism issue by simultaneously maximizing the average-case and worst-case performance of policies. We compared our method to the state-of-the-art methods for evaluation. Experimental results and theoretical proofs verified the effectiveness of our method.", Consistent Targets Provide Better Supervision in Semi-supervised Object Detection,https://openreview.net/forum?id=HeEqRvCtN2-,https://openreview.net/pdf?id=HeEqRvCtN2-,,"In this study, we dive deep into the inconsistency of pseudo targets in semi-supervised object detection (SSOD). Our core observation is that the oscillating pseudo targets undermine the training of an accurate semi-supervised detector. It not only inject noise into student training but also lead to severe overfitting on the classification task. Therefore, we propose a systematic solution, termed Consistent-Teacher, to reduce the inconsistency. First, adaptive anchor assignment~(ASA) substitutes the static IoU-based strategy, which enables the student network to be resistant to noisy pseudo bounding boxes; Then we calibrate the subtask predictions by designing a 3D feature alignment module~(FAM-3D). It allows each classification feature to adaptively query the optimal feature vector for the regression task at arbitrary scales and locations. Lastly, a Gaussian Mixture Model (GMM) dynamically revises the score threshold of the pseudo-bboxes, which stabilizes the number of ground-truths at an early stage and remedies the unreliable supervision signal during training. Consistent-Teacher provides strong results on a large range of SSOD evaluations. It achieves 40.0 mAP with ResNet-50 backbone given only 10\% of annotated MS-COCO data, which surpasses previous baselines using pseudo labels by around 3 mAP. When trained on fully annotated MS-COCO with additional unlabeled data, the performance further increases to 47.2 mAP. Our code will be open-sourced soon.","Semi-supervised Learning, Object Detection" Become a Proficient Player with Limited Data through Watching Pure Videos,https://openreview.net/forum?id=Sy-o2N0hF4f,https://openreview.net/pdf?id=Sy-o2N0hF4f,,"Recently, RL has shown its strong ability for visually complex tasks. However, it suffers from the low sample efficiency and poor generalization ability, which prevent RL from being useful in real-world scenarios. 
Inspired by the huge success of unsupervised pre-training methods in the language and vision domains, we propose to improve sample efficiency via a novel pre-training method for model-based RL. Instead of using pre-recorded agent trajectories that come with their own actions, we consider the setting where the pre-training data are action-free videos, which are more common and readily available in the real world. We introduce a two-phase training pipeline as follows: for the pre-training phase, we implicitly extract the hidden action embedding from videos and pre-train the visual representation and the environment dynamics network through a novel cycle consistency objective based on vector quantization; for downstream tasks, we finetune with a small amount of task data based on the learned models. Our framework can significantly improve sample efficiency on Atari Games with data of only one hour of game playing. We achieve 118.4\% mean human performance and 36.0\% median performance with only 50k environment steps, which is 85.6\% and 65.1\% better than the from-scratch EfficientZero model. We believe such a pre-training approach can provide an option for solving real-world RL problems.","Pre-training, Fine-tune, MCTS, Reinforcement learning, Vector Quantization" Evaluation of Attribution Explanations without Ground Truth,https://openreview.net/forum?id=WevBjPK4V3j,https://openreview.net/pdf?id=WevBjPK4V3j,This paper proposes a metric to evaluate the objectiveness of explanation methods without a need for the ground-truth explanations.,"This paper proposes a metric to evaluate the objectiveness of explanation methods of neural networks, i.e., the accuracy of the estimated importance/attribution/saliency values of input variables. This is crucial for the development of explainable AI, but it also presents significant challenges. The core challenge is that people usually cannot obtain the ground-truth value of the attribution of each input variable. Thus, we design a metric to evaluate the objectiveness of the attribution map without ground truth. Our metric is used to evaluate eight benchmark methods of attribution explanations, which provides new insights into attribution methods. We will release the code when the paper is accepted.","Interpretable machine learning, Explainable AI" Human MotionFormer: Transferring Human Motions with Vision Transformers,https://openreview.net/forum?id=lQVpasnQS62,https://openreview.net/pdf?id=lQVpasnQS62,,"We transfer motions from a target dynamic person to a source static one. An accurate matching between the source and the target human subjects improves the transferred motion quality. This matching paradigm should effectively capture both large and subtle motion changes between the target person and the source person, which is challenging for existing CNN-based methods. In this paper, we propose Human MotionFormer, a hierarchical ViT framework for motion transfer between two human subjects. Our MotionFormer leverages local and global perceptions (via convolutions and cross-attentions) to capture varying human motions. Specifically, our MotionFormer consists of two ViT encoders to extract input features (i.e., a target pose image and a source human image), and a ViT decoder with several blocks for feature matching and motion transfer. The feature matching process is conducted in both the motion warping and generation branches within each decoder block, where the target pose feature is set as the Query and the source person feature is set as the Key and Value. 
Then, the cross-attention maps computed based on the Query, Key and Value are utilized for feature matching. Furthermore, we introduce a convolution layer to improve the local perception after the global cross-attention computations. During model training, we propose a mutual learning loss to enable co-supervision between the motion warping and generation branches for consistent motion representations, which benefits the transferred human motions. In sum, our MotionFormer leverages local and global perceptions and introduces a mutual learning loss to improve motion transfer results. These designs empower our method to utilize only one source image for motion transfer and to get rid of model finetuning. The experimental results qualitatively and quantitatively show that our Human MotionFormer sets a new state-of-the-art performance.", Entity Divider with Language Grounding in Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=icmTV7mhxuQ,https://openreview.net/pdf?id=icmTV7mhxuQ,,"We investigate the use of natural language to drive the generalization of policies in multi-agent settings. Unlike single-agent settings, the generalization of policies should also consider the influence of other agents. Besides, with the increasing number of entities in multi-agent settings, more agent-entity interactions are needed for language grounding, and the enormous search space could impede the learning process. Moreover, given a simple general instruction, e.g., beating all enemies, agents are required to decompose it into multiple subgoals and figure out the right one to focus on. Inspired by previous work, we try to address these issues at the entity level and propose a novel framework for language grounding in multi-agent reinforcement learning, entity divider (EnDi). EnDi enables agents to independently learn subgoal division at the entity level and act in the environment based on the associated entities. The subgoal division is regularized by opponent modeling to avoid subgoal conflicts and promote coordinated strategies. Empirically, EnDi demonstrates strong generalization ability to unseen games with new dynamics and shows its superiority over existing methods. ","language-based reinforcement learning, multi-agent reinforcement learning" Multi-Agent Sequential Decision-Making via Communication,https://openreview.net/forum?id=0xHVGIiYK2n,https://openreview.net/pdf?id=0xHVGIiYK2n,A novel communication scheme for multi-agent cooperation," Communication helps agents obtain information about others so that better coordinated behavior can be learned. Some existing work communicates predicted future trajectories with others, hoping to get clues about what others would do for better coordination. However, circular dependencies can sometimes occur when agents are treated synchronously, making it hard to coordinate decision-making. In this paper, we propose a novel communication scheme, Sequential Communication (SeqComm). SeqComm treats agents asynchronously (the upper-level agents make decisions before the lower-level ones) and has two communication phases. In the negotiation phase, agents determine the priority of decision-making by communicating hidden states of observations and comparing the value of intention, which is obtained by modeling the environment dynamics. In the launching phase, the upper-level agents take the lead in making decisions and communicate their actions with the lower-level agents. 
Theoretically, we prove that the policies learned by SeqComm are guaranteed to improve monotonically and converge. Empirically, we show that SeqComm outperforms existing methods in various multi-agent cooperative tasks. ","multi-agent communication, multi-agent reinforcement learning" LAMDA: Latent mapping for domain adaption of image generators,https://openreview.net/forum?id=v8Xz3gNgEo4,https://openreview.net/pdf?id=v8Xz3gNgEo4,We adapt GANs to new domains without training on new images. This is done by only learning how to translate one latent space to another.,"Our paper tackles the problem of adapting image generators to new keyword-defined domains without training on any new images. We combine the power of CLIP models for image-text similarity with the disentangled representation of images found in the latent spaces of generative adversarial networks (GANs). We present the latent mapper (LAMDA), which maps directions in CLIP text space to directions in the GAN latent space. Using a latent mapper enables training on a large number of keywords simultaneously, which was not previously possible, and allows benefiting from the inter-relation of different keywords. It also leads to higher image quality while requiring only a fraction of the training time and parameters of state-of-the-art methods. Our method allows the network to learn the relationship between words and latent spaces. As a result, it allows the composition of words to generate semantically meaningful images. Additionally, it allows adaptation to some unseen words after training. We perform extensive analysis to validate the advantages of our method quantitatively and qualitatively.","domain adaptation, gan, image synthesis, image generative, generative models" Hierarchies of Reward Machines,https://openreview.net/forum?id=wV09GfqYC-n,https://openreview.net/pdf?id=wV09GfqYC-n,,"Reward machines (RMs) are a recent formalism for representing the reward function of a reinforcement learning task through a finite-state machine whose edges encode landmarks of the task using high-level events. The structure of RMs enables the decomposition of a task into simpler and independently solvable subtasks that help tackle long-horizon and/or sparse reward tasks. We propose a formalism for further abstracting the subtask structure by endowing an RM with the ability to call other RMs, thus composing a hierarchy of RMs (HRM). We exploit HRMs by treating each call to an RM as an independently solvable subtask using the options framework, and describe a curriculum-based method to learn HRMs from traces observed by the agent. Our experiments reveal that exploiting a handcrafted HRM leads to faster convergence than with a flat HRM, and that learning an HRM remains feasible in cases where its equivalent flat representation is not.","Hierarchical Reinforcement Learning, Reward Machines" EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion,https://openreview.net/forum?id=__czv_gqDQt,https://openreview.net/pdf?id=__czv_gqDQt,,"The text-to-speech (TTS) field has recently been dominated by one-stage text-to-waveform models, in which speech quality is significantly improved compared to two-stage models. However, the best-performing open-sourced one-stage model, VITS, is not fully differentiable and suffers from relatively high computational costs. To address these issues, we propose EfficientTTS 2 (EFTS2), a fully differentiable end-to-end TTS framework that is highly efficient. 
Our method adopts an adversarial training process, with a differentiable aligner and a hierarchical-VAE-based waveform generator. The differentiable aligner is built upon EfficientTTS. A hybrid attention mechanism and a variational alignment predictor are incorporated into our network to improve the expressiveness of the aligner. The use of the hierarchical-VAE-based waveform generator not only alleviates the one-to-many mapping problem in waveform generation but also allows the model to learn hierarchical and explainable latent variables that control different aspects of the generated speech. We also extend EFTS2 to the voice conversion (VC) task and propose EFTS2-VC, an end-to-end VC model that allows efficient and high-quality conversion. Experimental results suggest that the two proposed models match their strong counterparts in speech quality with a faster inference speed and smaller model size. ","Text-to-Speech, Voice Conversion, End-to-End" LatentAugment: Dynamically Optimized Latent Probabilities of Data Augmentation,https://openreview.net/forum?id=ooqH4D9Xys,https://openreview.net/pdf?id=ooqH4D9Xys,,"Although data augmentation is a powerful technique for improving the performance of image classification tasks, it is difficult to identify the best augmentation policy. The optimal augmentation policy, which is a latent variable, cannot be directly observed. To address this problem, this study proposes \textit{LatentAugment}, which estimates the latent probability of the optimal augmentation. The proposed method is appealing in that it can dynamically optimize the augmentation strategies for each input and model parameter across learning iterations. Theoretical analysis shows that LatentAugment is a general model that includes other augmentation methods as special cases, and it is simple and computationally efficient in comparison with existing augmentation methods. Experimental results show that the proposed LatentAugment achieves higher test accuracy than previous augmentation methods on the CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets. ","data augmentation, image classification, EM algorithm" Cali-NCE: Boosting Cross-modal Video Representation Learning with Calibrated Alignment,https://openreview.net/forum?id=-Ozk9LVtqbV,https://openreview.net/pdf?id=-Ozk9LVtqbV,,"With large-scale video-text datasets being collected, learning general visual-textual representations has gained increasing attention. While recent methods are designed with the assumption that the alt-text description naturally conveys the semantics of the video (i.e., the two are well aligned), this assumption is unlikely to be satisfied for Internet data, which potentially harms the quality of the learned visual-textual representation. To address this challenge, we first revisit three mainstream approaches: correspondence modeling, contrastive learning and predictive coding, demonstrating that a simple co-training strategy with these methods leads to a clear improvement in performance. To further explore the complementary nature of different training strategies, we propose a simple yet effective joint training framework that factorizes the total objective into conditional ones, termed Cali-NCE. Specifically, the correspondence between video and text descriptions is first estimated with a correspondence score, which is later used to calibrate the sample weightings during contrastive training. 
Through extensive experiments, we show that the proposed approach achieves state-of-the-art performance on multiple downstream tasks: text-to-video retrieval, video action recognition, and video retrieval. Code and models will be made publicly available. ",visual-textual representation learning Novel Class Discovery under Unreliable Sampling,https://openreview.net/forum?id=uJzSlJruEjk,https://openreview.net/pdf?id=uJzSlJruEjk,,"When sampling data of specific classes (i.e., known classes) for a scientific task, collectors may encounter unknown classes (i.e., novel classes). Since these novel classes might be valuable for future research, collectors will also sample them and assign them to several clusters with the help of known-class data. This assigning process is also known as novel class discovery (NCD). However, sampling errors are common in practice and may make the NCD process unreliable. To tackle this problem, this paper introduces a new and more realistic setting, where collectors may misidentify known classes and even confuse known classes with novel classes - we name it NCD under unreliable sampling (NUSA). We find that NUSA empirically degrades existing NCD methods if no care is taken of the sampling errors. To handle NUSA, we propose an effective solution, named hidden-prototype-based discovery network (HPDN). HPDN first trains a deep network to fully fit the wrongly sampled data, then feeds the relatively clean hidden representations yielded by this network into a novel mini-batch K-means algorithm, which further prevents them from overfitting to residual errors by detaching noisy supervision in a timely manner. Experiments demonstrate that, under NUSA, HPDN significantly outperforms competitive baselines (e.g., 6% more than the best baseline on CIFAR-10) and remains robust even when encountering serious sampling errors.", NEW TRAINING FRAMEWORK FOR SPEECH ENHANCEMENT USING REAL NOISY SPEECH,https://openreview.net/forum?id=_j4ZUpoNO1e,https://openreview.net/pdf?id=_j4ZUpoNO1e,"In this study, we propose a novel SE training method that can train on real noisy speech instead of synthetic training data (such as clean speech + noise in conventional supervised training or noisy speech + noise in MixIT)","Recently, deep learning-based speech enhancement (SE) models have achieved significant improvements. However, this success is mainly based on using synthetic training data created by mixing clean speech with noise. On the other hand, in spite of its large quantity, real noisy speech is hard to apply to SE model training because of the lack of a clean reference. In this paper, we propose a novel method to utilize real noisy speech for SE model training based on a non-intrusive speech quality prediction model. The SE model is trained under the guidance of the quality prediction model. We also find that a speech quality predictor with better accuracy may not necessarily be an appropriate teacher to guide the SE model. In addition, we show that if the quality prediction model is adversarially robust, then the prediction model itself can also serve as an SE model by modifying the input noisy speech through gradient backpropagation. Objective experimental results show that, under the same SE model structure, the proposed training method trained on a large amount of real noisy speech can outperform the conventional supervised model trained on synthetic noisy speech. 
Lastly, the two training methods can be combined to utilize the benefits of both synthetic noisy speech (easy to learn) and real noisy speech (large amount), forming a semi-supervised learning scheme that further boosts performance both objectively and subjectively. The code will be released after publication.","Speech enhancement, Quality prediction, Semi-supervised learning, Adversarially robust" PA-LoFTR: Local Feature Matching with 3D Position-Aware Transformer,https://openreview.net/forum?id=U8MtHLRK06q,https://openreview.net/pdf?id=U8MtHLRK06q,A Transformer-based method that learns 3D position information to solve the image matching problem.,"We propose a novel image feature matching method that utilizes 3D position information to augment feature representation with a deep neural network. The proposed method introduces 3D position embeddings into a state-of-the-art feature matcher, LoFTR, and achieves more promising performance. Following the coarse-to-fine matching pipeline of LoFTR, we construct a Transformer-based neural network that generates dense pixel-wise matches. Instead of using 2D position embeddings for the Transformer, the proposed method generates 3D position embeddings that can precisely capture the positional correspondence of matches between images. Importantly, in order to guide the neural network to learn 3D spatial information, we augment features with depth information generated by a depth predictor. In this way, our method, PA-LoFTR, can generate 3D position-aware local feature descriptors with a Transformer. We experiment on indoor datasets, and the results show that PA-LoFTR improves the performance of feature matching compared to state-of-the-art methods.","deep learning, transformer, image matching, pose estimation, position embedding, 3d representation" Policy Contrastive Imitation Learning,https://openreview.net/forum?id=PZZUcxazxSw,https://openreview.net/pdf?id=PZZUcxazxSw,,"Adversarial imitation learning (AIL) is a popular method that has recently achieved much success. However, the performance of AIL is still unsatisfactory on more challenging tasks. We find that one of the major reasons is the low quality of the AIL discriminator's representation. Since the AIL discriminator is trained via binary classification that does not necessarily discriminate the policy from the expert in a meaningful way, the resulting reward might not be meaningful either. We propose a new method called Policy Contrastive Imitation Learning (PCIL) to resolve this issue. PCIL learns a contrastive representation space by anchoring on different policies and uses a smooth cosine-similarity-based reward to encourage imitation learning. Our proposed representation learning objective can be viewed as a stronger version of the AIL objective and provides a more meaningful comparison between the agent and the policy. From a theoretical perspective, we show the validity of our method using the apprenticeship learning framework. Furthermore, our empirical evaluation on the DeepMind Control suite demonstrates that PCIL can achieve state-of-the-art performance. 
Finally, qualitative results suggest that PCIL builds a smoother and more meaningful representation space for imitation learning.","adversarial imitation learning, contrastive learning" Backstepping Temporal Difference Learning,https://openreview.net/forum?id=YPChvOgRXRA,https://openreview.net/pdf?id=YPChvOgRXRA,This paper develops a new unifying view to design off-policy temporal difference learning algorithms.," Off-policy learning ability is an important feature of reinforcement learning (RL) for practical applications. However, even one of the most elementary RL algorithms, temporal-difference (TD) learning, is known to suffer from divergence issues when the off-policy scheme is used together with linear function approximation. To overcome this divergent behavior, several off-policy TD learning algorithms have been developed to date. In this work, we provide a unified view of such algorithms from a purely control-theoretic perspective. Our method relies on the backstepping technique, which is widely used in nonlinear control theory.","reinforcement learning, temporal difference learning, policy evaluation" Reconciling Adversarial Robustness with Accuracy via Randomized Weights,https://openreview.net/forum?id=3lH6Pc0Qeg2,https://openreview.net/pdf?id=3lH6Pc0Qeg2,"We study the trade-off between clean accuracy and robustness through randomized weights, and design a novel adversarial training method based on the Taylor series of randomized weights to improve both clean accuracy and robustness.","Recent years have seen a rapid growth of research on building more robust deep neural networks against adversarial examples. Among them, adversarial training has been shown to be one of the most effective approaches. To balance the robustness on adversarial examples and the accuracy on clean examples, a series of works design enhanced adversarial training methods to strike a trade-off between them with \emph{deterministic} model parameters (i.e., weights). Noting that clean and adversarial examples are highly entangled with the network weights, we propose to study such a trade-off from another perspective, by \emph{treating weights as random variables} in order to harvest the insights yielded from statistical learning theory. Inspired by recent advances in information-theoretic generalization error bounds, we find that adversarial training over the randomized weight space can potentially narrow the generalization bound of both clean and adversarial data, and improve both adversarial robustness and clean accuracy simultaneously. Building upon such insights, we propose a novel adversarial training method via Taylor expansion in the hypothesis space of the randomized weights. With PGD, CW, and Auto Attacks, an extensive set of experiments demonstrates that our method further enhances adversarial training, boosting both robustness and clean accuracy.","adversarial robustness, adversarial training, randomized weights" Hidden Markov Transformer for Simultaneous Machine Translation,https://openreview.net/forum?id=9y0HFvaAYD6,https://openreview.net/pdf?id=9y0HFvaAYD6,,"Simultaneous machine translation (SiMT) outputs the target sequence while receiving the source sequence, and hence learning when to start translating each target token is the core challenge for SiMT. However, it is non-trivial to learn the optimal moment among many possible moments of starting translating, as the moments of starting translating are always hidden inside the model and we can only supervise the SiMT model with the observed target sequence. 
In this paper, we propose the Hidden Markov Transformer (HMT), which treats the moments of starting translating as hidden events and the target sequence as the corresponding observed events, thereby organizing them as a hidden Markov model. HMT explicitly models multiple moments of starting translating, used as the candidate hidden events, and then selects one to generate the target token. During training, by maximizing the marginal likelihood of the target sequence over multiple moments of starting translating, HMT learns to start translating at the moments when target tokens can be generated more accurately. Experiments on multiple SiMT benchmarks show that HMT outperforms strong baselines and achieves state-of-the-art performance.","Simultaneous machine translation, Machine translation, Natural language processing, Transformer" Rank Preserving Framework for Asymmetric Image Retrieval ,https://openreview.net/forum?id=dYHYXZ3uGdQ,https://openreview.net/pdf?id=dYHYXZ3uGdQ,We propose a rank preserving framework to achieve the consistency of the ranking lists returned by asymmetric and symmetric retrieval. ,"Asymmetric image retrieval aims to deploy compatible models on platforms of different resources to achieve a balance between computational efficiency and retrieval accuracy. The most critical issue is how to align the output features of different models. Despite the great progress, existing approaches apply strong constraints so that features or neighbor structures are strictly aligned across different models. However, such a one-to-one constraint is too strict to be preserved well for query models with low capacity. Considering that the primary concern of the users is the rank of the returned images, we propose a generic rank preserving framework, which simultaneously achieves feature compatibility and order consistency between query and gallery models. Specifically, we propose two alternatives to instantiate the framework. One realizes straightforward rank order preservation by directly preserving the consistency of the sorting results. To make the sorting process differentiable, the Heaviside step function in sorting is approximated by the sigmoid function. The other aims to preserve a learnable monotonic mapping relationship between the returned similarity scores of the query and gallery models. The mapped similarity scores of the gallery model are considered as pseudo-supervision to guide the query model training. Extensive experiments on various large-scale datasets demonstrate the superiority of our two proposed methods.",Asymmetric image retrieval MINI: Mining Implicit Novel Instances for Few-Shot Object Detection,https://openreview.net/forum?id=pXU-5s9yUi1,https://openreview.net/pdf?id=pXU-5s9yUi1,Mining Implicit Novel Instances for Few-Shot Object Detection,"Few-Shot Object Detection (FSOD) aims to detect novel concepts given abundant base data and limited novel data. Recent advances propose an offline mining mechanism to discover implicit novel instances, which exist in the base dataset, as auxiliary training samples to retrain a more powerful model. Nonetheless, the offline mined novel instances remain unchanged during retraining, thus hindering further improvements. A straightforward alternative adopts an online mining mechanism that employs an online teacher to mine implicit novel instances on the fly. However, the online teacher relies on a good initialization, which is non-trivial in FSOD scenarios. 
To overcome these obstacles, we present Mining Implicit Novel Instances (MINI), a framework that unifies the offline mining mechanism and the online mining mechanism with an adaptive mingling design. In offline mining, MINI leverages an offline discriminator to collaboratively mine implicit novel instances with a trained FSOD model. In online mining, MINI takes a teacher-student framework to simultaneously update the FSOD network and the mined implicit novel instances on the fly. In adaptive mingling, the offline and online mined implicit novel instances are adaptively combined, where the offline mined novel instances warm up the early training and the online mined novel instances gradually replace the offline mined instances to further improve the performance. Extensive experiments on the PASCAL VOC and MS-COCO datasets show MINI achieves new state-of-the-art performance on any shot and split of FSOD tasks. All code will be made available. ","Object Detection, Few-Shot Object Detection" D3C2-Net: Dual-Domain Deep Convolutional Coding Network for Compressive Sensing,https://openreview.net/forum?id=7EUu177KXY,https://openreview.net/pdf?id=7EUu177KXY,"We propose a novel D3C2-Net for compressive sensing based on our newly proposed generalized dual-domain optimization framework, achieving higher performance than other state-of-the-art methods.","By mapping optimization algorithms into neural networks, deep unfolding networks (DUNs) have achieved impressive success in compressive sensing (CS). From the perspective of optimization, DUNs inherit a well-defined and interpretable structure from iterative steps. However, from the viewpoint of neural network design, most existing DUNs are inherently established based on traditional image-domain unfolding, which takes single-channel images as the inputs and outputs between adjacent stages, resulting in insufficient information transmission capability and the inevitable loss of image details. In this paper, to break the above bottleneck, we propose a generalized dual-domain optimization framework, which is general for inverse imaging problems and integrates the merits of both (1) image-domain and (2) convolutional-coding-domain priors to constrain the feasible region of the solution space. By unfolding the proposed optimization framework into deep neural networks, we further design a novel Dual-Domain Deep Convolutional Coding Network ($\mathrm{D^3C^2}$-Net) for CS imaging with the ability to transmit high-capacity features through all the unfolded stages. Experiments on multiple natural and MR image datasets demonstrate that our $\mathrm{D^3C^2}$-Net achieves higher performance and better accuracy-complexity trade-offs than other state-of-the-art methods.","image reconstruction, compressive sensing (CS), convolutional coding, dual-domain optimization, deep unfolding networks" Single-level Adversarial Data Synthesis based on Neural Tangent Kernels,https://openreview.net/forum?id=_d2f3hRn0hT,https://openreview.net/pdf?id=_d2f3hRn0hT,This paper formulates adversarial data synthesis as a single-level optimization problem that is much easier to train than existing GANs.,"Generative adversarial networks (GANs) have achieved impressive performance in data synthesis and have driven the development of many applications. However, GANs are known to be hard to train due to their bilevel objective, which leads to problems of convergence, mode collapse, and gradient vanishing. 
In this paper, we propose a new generative model called the generative adversarial NTK (GA-NTK) that has a single-level objective. The GA-NTK keeps the spirit of adversarial learning (which helps generate plausible data) while avoiding the training difficulties of GANs. This is done by modeling the discriminator as a Gaussian process with a neural tangent kernel (NTK-GP) whose training dynamics can be completely described by a closed-form formula. We analyze the convergence behavior of GA-NTK trained by gradient descent and give some sufficient conditions for convergence. We also conduct extensive experiments to study the advantages and limitations of GA-NTK and propose some techniques that make GA-NTK more practical.","Adversarial, Data Synthesis, Neural Tangent Kernels" Learning to Count Everything: Transformer-based Trackers are Strong Baselines for Class Agnostic Counting,https://openreview.net/forum?id=o_Qrw9f512w,https://openreview.net/pdf?id=o_Qrw9f512w,,"Class agnostic counting (CAC) is a vision task which can be used to count the total number of occurrences of any given reference object in the query image. The task is usually formulated as a density map estimation problem through similarity computation between a few image samples of the reference object and the query image. In this paper, we show that the popular and effective similarity computation operation, bilinear similarity, actually shares a high resemblance with the self-attention and cross-attention operations widely used in the transformer architecture. Inspired by this observation, and since the formulation of the visual object tracking task is similar to CAC, we show that the advanced attention modules of transformer-based trackers are actually powerful matching tools for the CAC task. These modules allow the model to learn more distinctive features that capture the shared patterns between the query and reference images. In addition, we propose a transformer-based class agnostic counting framework by adapting transformer-based trackers for CAC. We demonstrate the effectiveness of the proposed framework with two state-of-the-art transformer-based trackers, MixFormer and TransT, through extensive experiments and ablation studies. The proposed methods outperform other state-of-the-art methods on the challenging FSC-147 and CARPK datasets and achieve new state-of-the-art performance. The code will be publicly available upon acceptance. ","class agnostic counting, transformer, tracking" "Unified Algorithms for RL with Decision-Estimation Coefficients: No-Regret, PAC, and Reward-Free Learning",https://openreview.net/forum?id=7lvuPvDNhI4,https://openreview.net/pdf?id=7lvuPvDNhI4,"We design new unified algorithms for no-regret, PAC, and reward-free reinforcement learning with general model classes, building on the Decision-Estimation Coefficient and a strong model estimation procedure.","Finding unified complexity measures and algorithms for sample-efficient learning is a central topic of research in reinforcement learning (RL). The Decision-Estimation Coefficient (DEC) was recently proposed by Foster et al. (2021) as a necessary and sufficient complexity measure for sample-efficient no-regret RL. This paper makes progress towards a unified theory for RL with the DEC framework. First, we propose two new DEC-type complexity measures: Explorative DEC (EDEC), and Reward-Free DEC (RFDEC). 
We show that they are necessary and sufficient for sample-efficient PAC learning and reward-free learning, thereby extending the original DEC which only captures no-regret learning. Next, we design new unified sample-efficient algorithms for all three learning goals. Our algorithms instantiate variants of the Estimation-To-Decisions (E2D) meta-algorithm with a strong and general model estimation subroutine. Even in the no-regret setting, our algorithm \textsc{E2D-TA} improves upon the algorithms of Foster et al. (2021), which require either bounding a variant of the DEC which may be prohibitively large, or designing problem-specific estimation subroutines. As applications, we recover existing and obtain new sample-efficient learning results for a wide range of tractable RL problems using essentially a single algorithm. Finally, as a connection, we re-analyze two existing optimistic model-based algorithms based on Posterior Sampling or Maximum Likelihood Estimation, showing that they enjoy similar regret bounds as \textsc{E2D-TA} under similar structural conditions as the DEC.","reinforcement learning theory, decision-estimation coefficient, function approximation" Strength-Adaptive Adversarial Training,https://openreview.net/forum?id=Bvaekygzl2m,https://openreview.net/pdf?id=Bvaekygzl2m,,"Adversarial training (AT) has been proven to reliably improve a network's robustness against adversarial data. However, current AT with a pre-specified perturbation budget has limitations in learning a robust network. Firstly, applying a pre-specified perturbation budget on networks of various model capacities will yield divergent degrees of robustness disparity between natural and robust accuracies, which deviates from a robust network's desideratum. Secondly, the attack strength of adversarial training data constrained by the pre-specified perturbation budget fails to increase as network robustness grows, which leads to robust overfitting and further degrades the adversarial robustness. To overcome these limitations, we propose Strength-Adaptive Adversarial Training (SAAT). Specifically, the adversary employs an adversarial loss constraint to generate adversarial training data. Under this constraint, the perturbation budget will be adaptively adjusted according to the training state of adversarial data, which can effectively avoid robust overfitting. Besides, SAAT explicitly constrains the attack strength of training data through the adversarial loss, which manipulates model capacity scheduling during training, and thereby can flexibly control the degree of robustness disparity and adjust the tradeoff between natural accuracy and robustness. Extensive experiments show that our proposal boosts the robustness of adversarial training. ", Teach me how to Interpolate a Myriad of Embeddings,https://openreview.net/forum?id=iF0B-U0J5fG,https://openreview.net/pdf?id=iF0B-U0J5fG,"We introduce MultiMix as a way to go beyond ERM and interpolate as many examples as the mini-batch over their entire convex hull in the embedding space, thereby improving representation learning","Mixup refers to interpolation-based data augmentation, originally motivated as a way to go beyond empirical risk minimization (ERM). Yet, its extensions focus on the definition of interpolation and the space where it takes place, while the augmentation itself is less studied: For a mini-batch of size $m$, most methods interpolate between $m$ pairs with a single scalar interpolation factor $\lambda$. 
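For contrast with what follows, here is a minimal sketch of the standard pairwise mixup baseline just described, where one scalar $\lambda$ serves the whole mini-batch; the Beta concentration `alpha` is an assumed, conventional choice.

```python
import numpy as np

def mixup_batch(x: np.ndarray, y: np.ndarray, alpha: float = 0.2):
    """Standard pairwise mixup: m pairs, one scalar lambda per mini-batch."""
    lam = np.random.beta(alpha, alpha)     # single scalar interpolation factor
    perm = np.random.permutation(len(x))   # pair each example with a partner
    x_mix = lam * x + (1 - lam) * x[perm]  # convex combination of inputs
    y_mix = lam * y + (1 - lam) * y[perm]  # linear target interpolation
    return x_mix, y_mix
```

MultiMix generalizes this setup by drawing an interpolation vector per tuple and mixing embeddings at the last layer rather than inputs.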
In this work, we make progress in this direction by introducing MultiMix, which interpolates an arbitrary number $n$ of tuples, each of length $m$, with one vector $\lambda$ per tuple. On sequence data, we further extend to dense interpolation and loss computation over all spatial positions. Overall, we increase the number of tuples per mini-batch by orders of magnitude at little additional cost. This is possible by interpolating at the very last layer before the classifier. Finally, to address inconsistencies due to linear target interpolation, we introduce a self-distillation approach to generate and interpolate synthetic targets. We empirically show that our contributions result in significant improvement over state-of-the-art mixup methods on four benchmarks. By analyzing the embedding space, we observe that the classes are more tightly clustered and uniformly spread over the embedding space, thereby explaining the improved behavior.", Exploring Parameter-Efficient Fine-tuning for Improving Communication Efficiency in Federated Learning,https://openreview.net/forum?id=EmH1WE1fRbt,https://openreview.net/pdf?id=EmH1WE1fRbt,We explore the viability of a parameter-efficient fine-tuning framework in federated learning to leverage strong pre-trained models and significantly reduce communication costs.,"Federated learning (FL) has emerged as a promising paradigm for enabling the collaborative training of models without centralized access to the raw data on local devices. In the typical FL paradigm (e.g., FedAvg), model weights are sent to and from the server each round to participating clients. However, this can quickly put a massive communication burden on the system, especially if more capable models beyond very small MLPs are employed. Recently, the use of pre-trained models has been shown to be effective in federated learning optimization and improving convergence. This opens the door for new research questions. Can we adjust the weight-sharing paradigm in federated learning, leveraging strong and readily available pre-trained models, to significantly reduce the communication burden while simultaneously achieving excellent performance? To this end, we investigate the use of parameter-efficient fine-tuning in federated learning. Specifically, we systematically evaluate the performance of several parameter-efficient fine-tuning methods across a variety of client stability, data distribution, and differential privacy settings. By only locally tuning and globally sharing a small portion of the model weights, significant reductions in the total communication overhead can be achieved while maintaining competitive performance in a wide range of federated learning scenarios, providing insight into a new paradigm for practical and effective federated systems.","federated learning, computer vision, vision transformer, fine-tuning" Mega: Moving Average Equipped Gated Attention,https://openreview.net/forum?id=qNLe3iq2El,https://openreview.net/pdf?id=qNLe3iq2El,Moving Average Equipped Gated Attention,"The design choices in the Transformer attention mechanism, including weak inductive bias and quadratic computational complexity, have limited its application for modeling long sequences. In this paper, we introduce Mega, a simple, theoretically grounded, single-head gated attention mechanism equipped with (exponential) moving average to incorporate the inductive bias of position-aware local dependencies into the position-agnostic attention mechanism. 
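As a rough illustration of the moving-average ingredient just mentioned, the sketch below applies a per-channel exponential moving average along the sequence; Mega's actual component is a learnable multi-dimensional damped EMA, so treat `alpha` and the simple recurrence here as simplifying assumptions.

```python
import torch

def ema_smooth(x: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Per-channel EMA over time: h_t = alpha * x_t + (1 - alpha) * h_{t-1}.

    x: (seq_len, dim). The output carries position-aware local context that
    can be fed into an otherwise position-agnostic attention mechanism.
    """
    h = torch.zeros(x.shape[-1])
    out = []
    for t in range(x.shape[0]):
        h = alpha * x[t] + (1 - alpha) * h
        out.append(h)
    return torch.stack(out)

smoothed = ema_smooth(torch.randn(16, 8))  # e.g., build queries/keys from this
```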
We further propose a variant of Mega that offers linear time and space complexity yet yields only minimal quality loss, by efficiently splitting the whole sequence into multiple fixed-length chunks. Extensive experiments on a wide range of sequence modeling benchmarks, including the Long Range Arena, neural machine translation, auto-regressive language modeling, and image and speech classification, show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models. ","Neural Architecture, Attention, Exponential Moving Average" Going Deeper with Spiking Neurons: Towards Binary Outputs of Deep Logic Spiking Neural Network,https://openreview.net/forum?id=gR5yMO1pRRc,https://openreview.net/pdf?id=gR5yMO1pRRc,,"For simulating the spikes of biological neurons, a natural fit is the spiking neural network (SNN), which produces binary outputs from spiking neurons. SNNs have received increasing attention for their high biological plausibility and efficient inference on neuromorphic chips. However, it is still a challenge to train SNNs with more than 50 layers due to the gradient vanishing problem caused by the spiking neuron layers, which greatly prevents SNNs from going deeper to obtain higher performance. In this paper, we first investigate the variants of spiking residual blocks and find that deep SNNs with binary outputs cannot be constructed by simply replacing the activation function in the existing ANN residual structure with the spiking neuron layer. We thus propose a logic spiking network (LSN) to benefit deep SNN training. Our LSN contains two functionally distinct branches, a structure inspired by excitatory and inhibitory pathways, followed by a logical operation for binary spike outputs. We demonstrate both theoretically and experimentally that LSN can implement identity mapping as well as overcome the vanishing gradient problem. Our LSN can be expanded to more than 100 layers with binary outputs and performs favorably against existing spiking ResNet and its variants. Our proposed LSN achieved 94.68% accuracy on CIFAR10, 71.86% accuracy on ImageNet, and 75.1% accuracy on CIFAR10-DVS.","Spiking Neural Network, Deep Network, Binary output, Brain-Like" IEDR: A Context-aware Intrinsic and Extrinsic Disentangled Recommender System,https://openreview.net/forum?id=S2N25rUM55l,https://openreview.net/pdf?id=S2N25rUM55l,We propose a recommender system that captures intrinsic and extrinsic factors from various contexts to enhance the recommendation quality.,"Intrinsic and extrinsic factors jointly affect users' decisions in item selection (e.g., click, purchase). Intrinsic factors reveal users' real interests and are invariant in different contexts (e.g., time, weather), whereas extrinsic factors can change w.r.t. different contexts. Analyzing these two factors is an essential yet challenging task in recommender systems. However, in existing studies, factor analysis is either largely neglected, or designed for a specific context (e.g., the time context in sequential recommendation), which limits the applicability of such models. In this paper, we propose a generic model, IEDR, to learn intrinsic and extrinsic factors from various contexts for recommendation. IEDR contains two key components: a contrastive learning component, and a disentangling component. The two components collaboratively enable our model to learn context-invariant intrinsic factors and context-based extrinsic factors from all available contexts. 
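The contrastive component described above could, for instance, be instantiated with a generic InfoNCE objective over two context views of the same interaction; the sketch below is illustrative and does not reproduce IEDR's exact loss.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, temp: float = 0.1):
    """Generic InfoNCE: rows of `anchor` and `positive` are embeddings of the
    same user-item pair under two different contexts; agreement across contexts
    promotes context-invariant (intrinsic) factors, with in-batch negatives."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temp              # (batch, batch) cosine similarities
    labels = torch.arange(len(a))          # diagonal entries are the positives
    return F.cross_entropy(logits, labels)
```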
Experimental results on real-world datasets demonstrate the effectiveness of our model in factor learning and deliver a significant improvement in recommendation accuracy over state-of-the-art methods.","Recommender Systems, Intrinsic and Extrinsic Factors, Contrastive Learning, Disentangled Representation, Mutual Information" Correcting Three Existing Beliefs on Mutual Information in Contrastive Learning,https://openreview.net/forum?id=6FAWzRMRk7A,https://openreview.net/pdf?id=6FAWzRMRk7A,,"Contrastive learning has played a pivotal role in the recent success of unsupervised representation learning. It has been commonly explained with instance discrimination and a mutual information loss, and some of the fundamental explanations are based on mutual information analysis. In this work, we develop new methods that enable rigorous analysis of mutual information in contrastive learning. Using the methods, we investigate three existing beliefs and show that they are incorrect. Based on the investigation results, we address two issues in the discussion section. In particular, we question whether contrastive learning is indeed an unsupervised representation learning method, because the current framework of contrastive learning relies on validation performance for tuning the augmentation design.", Deep Deformation Based on Feature-Constraint for 3D Human Mesh Correspondence,https://openreview.net/forum?id=cjavWixtG9f,https://openreview.net/pdf?id=cjavWixtG9f,,"In this study, we address the challenges in mesh correspondence for various types of complete or single-view human body data. The parametric human model has been widely used in various human-related applications and in 3D human mesh correspondence because it provides sufficient scope to modify the resulting model. In contrast to prior methods that optimize both the correspondences and human model parameters (pose and shape), some of the recent methods directly deform each vertex of a parametric template by processing the point clouds that represent the input shapes. This allows the models to have more accurate representations of the details while maintaining the correspondence. However, we identified two limitations in these methods. First, it is difficult for the transformed template to completely restore the input shapes using only a pointwise reconstruction loss. Second, they cannot deform the template to a single-view human body from the depth camera observations or infer the correspondences between various forms of input human bodies. In representation learning, one of the main challenges is to design appropriate loss functions for supervising features with different abilities. To address this, we introduce the feature constraint deformation network (FCD-Net), an end-to-end deep learning approach that identifies 3D human mesh correspondences by learning various shape transformations from a predetermined template. The FCD-Net is implemented with an encoder–decoder architecture. A global feature is encoded from the input shape, and a decoder deforms the template based on this encoded global feature. We simultaneously input the complete shape and single-view shape into the encoder and closely constrain the features to enable the encoder to learn more robust features. Meanwhile, the decoder generates a more faithful, fully transformed template by using the complete shape as the ground truth, even if the input is single-view human body data. 
We conduct extensive experiments to validate the effectiveness of the proposed FCD-Net on four types of single-view human body data, from both qualitative and quantitative perspectives. We also demonstrate that our approach improves the state-of-the-art results on the difficult ""FAUST-inter"" and ""SHREC'19"" challenges, with average correspondence errors of 2.54 cm and 6.62 cm, respectively. In addition, the proposed FCD-Net performs well on real and unclean point clouds from a depth camera.","shape correspondence, deep learning, shape deformation" Explaining Representation Bottlenecks of Convolutional Decoder Networks,https://openreview.net/forum?id=0cm8HroIxJV,https://openreview.net/pdf?id=0cm8HroIxJV,"In this paper, we prove representation bottlenecks of a cascaded convolutional decoder network, considering the capacity of representing different frequency components of an input sample.","In this paper, we prove representation bottlenecks of a cascaded convolutional decoder network, considering the capacity of representing different frequency components of an input sample. We conduct the discrete Fourier transform on each channel of the feature map in an intermediate layer of the decoder network. Then, we introduce the rule of the forward propagation of such intermediate-layer spectrum maps, which is equivalent to the forward propagation of feature maps through a convolutional layer. Based on this, we find that each frequency component in the spectrum map is forward propagated independently of other frequency components. Furthermore, we prove two bottlenecks in representing feature spectrums. First, we prove that the convolution operation, the zero-padding operation, and a set of other settings all make a convolutional decoder network more likely to weaken high-frequency components. Second, we prove that the upsampling operation generates a feature spectrum in which strong signals repetitively appear at certain frequencies. We will release all code when this paper is accepted.","Fourier transform, Deep Learning Theory, Representation Learning" Batch Normalization Is Blind to the First and Second Derivatives of the Loss w.r.t. Features,https://openreview.net/forum?id=lMPJP3nRGtJ,https://openreview.net/pdf?id=lMPJP3nRGtJ,"When we do the Taylor series expansion of the loss function w.r.t. the output of the BN operation, we prove that the BN operation will block the back-propagation of the first and second derivatives of the loss function.","We prove that when we do the Taylor series expansion of the loss function, the BN operation will block the influence of the first-order term and most of the influence of the second-order term of the loss. This is a potential defect of the BN operation. We also find that such a problem is caused by the standardization phase of the BN operation. We believe that proving the blindness of a deep model is of significant value for avoiding systemic collapses of a deep model, although such blindness does not always cause significant damage in all applications. 
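A quick autograd check consistent with the first-order claim above: because standardization annihilates constant (mean-shift) directions, gradients flowing back through it sum to roughly zero over the batch in every channel. The tensors and loss below are toy assumptions, not the paper's experiments.

```python
import torch

torch.manual_seed(0)
x = torch.randn(32, 8, requires_grad=True)  # batch of 32 samples, 8 channels
mu = x.mean(dim=0, keepdim=True)
sigma = x.std(dim=0, unbiased=False, keepdim=True)
y = (x - mu) / (sigma + 1e-5)               # the standardization phase of BN
loss = (y * torch.randn(32, 8)).sum()       # arbitrary downstream loss
loss.backward()
print(x.grad.sum(dim=0))                    # ~0 per channel: the constant
                                            # (first-order) direction is blocked
```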
Experiments show that the BN operation significantly affects feature representations in specific tasks.","Batch Normalization, Deep Learning Theory, Neural Networks" Dual Ensembled Multiagent Q-Learning with Hypernet Regularizer,https://openreview.net/forum?id=F5LPNbgpuo0,https://openreview.net/pdf?id=F5LPNbgpuo0,,"Overestimation in temporal-difference single-agent reinforcement learning has been widely studied, where the variance in value estimation causes overestimation of the maximal target value due to Jensen's inequality. In contrast, overestimation in multiagent settings has received little attention, though it can be even more severe. One line of pioneering work extends ensemble methods from single-agent deep reinforcement learning to address the multiagent overestimation by discarding the large target values among the ensemble. However, its ability is limited by the ensemble diversity. Another line of work softens the maximum operator in the Bellman equation to avoid large target values, but also leads to sub-optimal value functions. Unlike previous works, in this paper, we address the multiagent overestimation by analyzing its underlying causes across the estimation-optimization iterations. We show that the overestimation in multiagent value-mixing Q-learning not only comes from the overestimation of target Q-values but also accumulates in the online Q-network's optimization step. Therefore, first, we integrate the random ensemble and in-target minimization into the estimation of target Q-values to derive a lower update target. Second, we propose a novel hypernet regularizer on the learnable terms of the online global Q-network to further reduce overestimation. Experiments on various kinds of tasks demonstrate that the proposed method consistently addresses the overestimation problem while previous works fail.","multiagent system, deep reinforcement learning, overestimation" Exploring Chemical Space with Score-based Out-of-distribution Generation,https://openreview.net/forum?id=45TeQUJw9tn,https://openreview.net/pdf?id=45TeQUJw9tn,We propose a score-based molecular generative framework that aims to generate out-of-distribution molecules beyond the known molecular space and find novel chemical optima of desired properties.,"A well-known limitation of existing works on molecule generation is that the generated molecules highly resemble those in the training set. To generate truly novel molecules with completely different structures that may have even better properties than known molecules for de novo drug discovery, more powerful exploration in the chemical space is necessary. To this end, we propose Molecular Out-Of-distribution Diffusion (MOOD), a novel score-based diffusion scheme that incorporates out-of-distribution (OOD) control in the generative stochastic differential equation (SDE) with simple control of a hyperparameter, and thus requires no additional computational cost, unlike existing methods (e.g., RL-based ones). However, some novel molecules may be chemically implausible, or may not meet the basic requirements of real-world drugs. Thus, MOOD performs conditional generation by utilizing the gradients from a property prediction network that guides the reverse-time diffusion process to high-scoring regions according to multiple target properties such as protein-ligand interactions, drug-likeness, and synthesizability. This allows MOOD to search for novel and meaningful molecules rather than generating unseen yet trivial ones. 
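For intuition, here is one plausible way to write the property-guided reverse-time update described above as a single Euler-Maruyama step; `score_net`, `prop_grad`, the zero forward drift, and the guidance scale `s` are all placeholder assumptions, not MOOD's exact scheme.

```python
import torch

def guided_reverse_step(x, t, score_net, prop_grad, dt=-1e-3, g=1.0, s=1.0):
    """One Euler-Maruyama step of a guided reverse-time SDE (sketch).

    score_net(x, t) estimates the data score; prop_grad(x, t) is the gradient
    of a property predictor's log-likelihood, steering samples toward
    high-scoring regions. Integrated backward in time (dt < 0)."""
    drift = -(g ** 2) * (score_net(x, t) + s * prop_grad(x, t))
    noise = g * torch.randn_like(x) * abs(dt) ** 0.5  # Brownian increment
    return x + drift * dt + noise
```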
We experimentally validate that MOOD is able to explore the chemical space beyond the training distribution, generating molecules that outscore ones found with existing methods, and even the top 0.01% of the original training pool.","molecule generation, score-based generative modeling" Divide and conquer policy for efficient GAN training,https://openreview.net/forum?id=4DL3cyuVHrV,https://openreview.net/pdf?id=4DL3cyuVHrV,,"Recent advances in Generative Adversarial Networks (GANs) have achieved impressive results for the purpose of generating high quality synthetic imagery. While capable of synthesizing high-fidelity images, these models often generate unsatisfactory images which fall outside of the data manifold. A considerable research effort has investigated the data manifold, either by simply discarding images having a lower probability according to the discriminator output, or by filtering real images which are within the sparse regions of the data manifold. While effective, these methods fail to get access to either the fake distribution or the real distribution. In this paper, we propose a divide and conquer policy for GAN training. We first introduce a new local data-manifold detector (LDMD), which estimates whether the generated images are inside or outside of the data manifold. With the proposed LDMD, we further introduce a noise replay mode if it is outside the manifold, and a fake sample reuse mode if it is inside the manifold. Extensive experimental results on a number of GAN variants (e.g., SAGAN, SNGAN, BigGAN, and StyleGAN) demonstrate qualitatively and quantitatively that our method improves the GAN’s performance, resulting in more realistic images than previous methods as confirmed by a significant drop in the FID.","GANs, image generation" Node Number Awareness Representation for Graph Similarity Learning,https://openreview.net/forum?id=4mFTFqOovux,https://openreview.net/pdf?id=4mFTFqOovux,,"This work aims to address two important issues in graph similarity computation: the first one is the Node Number Awareness Issue (N$^2$AI), and the second one is how to accelerate the inference speed of graph similarity computation in downstream tasks. We found that existing Graph Neural Network based graph similarity models have a large error in predicting the similarity scores of two graphs with a similar number of nodes. Our analysis shows that this is because of the global pooling function in graph neural networks that maps graphs with a similar number of nodes to similar embedding distributions, reducing the separability of their embeddings, which we refer to as the N$^2$AI. Our motivation is to enhance the difference between the two embeddings to improve their separability, so we leverage our proposed Different Attention (DiffAtt) to construct the Node Number Awareness Graph Similarity Model (N$^2$AGim). In addition, we propose Graph Similarity Learning with Landmarks (GSL$^2$) to accelerate similarity computation. GSL$^2$ uses the trained N$^2$AGim to generate the individual embedding for each graph without any additional learning, and this individual embedding can effectively help GSL$^2$ to improve its inference speed. Experiments demonstrate that our N$^2$AGim outperforms the second best approach on Mean Square Error by 24.3\%(1.170 vs 1.546), 43.1\%(0.066 vs 0.116), and 44.3\%(0.308 vs 0.553), on the AIDS700nef, LINUX, and IMDBMulti datasets, respectively. Our GSL$^2$ is at most 47.7 and 1.36 times faster than N$^2$AGim and the second fastest model, respectively. 
Our code is publicly available at https://github.com/iclr231312/N2AGim. ","graph representation learning, graph similarity learning, graph matching" Evaluating Fairness Without Sensitive Attributes: A Framework Using Only Auxiliary Models,https://openreview.net/forum?id=OKfmDPNPwYF,https://openreview.net/pdf?id=OKfmDPNPwYF,"To evaluate fairness without access to any sensitive attribute, we propose a general framework with only off-the-shelf auxiliary models.","Although the volume of literature and public attention on machine learning fairness has been growing significantly in recent years, in practice some tasks as basic as measuring fairness, which is the first step in studying and promoting fairness, can be challenging. This is because the sensitive attributes are often unavailable in a machine learning system due to privacy regulations. The straightforward solution is to use auxiliary models to predict the missing sensitive attributes. However, our theoretical analyses show that the estimation error of the directly measured fairness metrics is proportional to the error rates of auxiliary models' predictions. Existing works that attempt to reduce the estimation error often require strong assumptions, e.g., access to the ground-truth sensitive attributes in a subset of samples, i.i.d. auxiliary-model training data and target data, or some form of conditional independence. In this paper, we drop those assumptions and propose a framework that uses only off-the-shelf auxiliary models. The main challenge is how to reduce the negative impact of imperfectly predicted sensitive attributes on the fairness metrics without knowing the ground-truth sensitive attribute values. Inspired by the noisy label learning literature, we first derive a closed-form relationship between the directly measured fairness metrics and their corresponding ground-truth metrics. We then estimate some key statistics (most importantly, the transition matrix from the noisy label literature), which we use, together with the derived relationship, to calibrate the fairness metrics. Our framework can be applied to all popular group fairness definitions as well as multi-class classifiers and multi-category sensitive attributes. In addition, we theoretically prove the upper bound of the estimation error in our calibrated metrics and show our method can substantially decrease the estimation error, especially when auxiliary models are inaccurate or the target model is highly biased. Experiments on COMPAS and CelebA validate our theoretical analyses and show our method can measure fairness significantly more accurately than baselines under favorable circumstances.","Fairness evaluation, noise transition matrix, sensitive attributes" Dataset Condensation with Latent Space Knowledge Factorization and Sharing,https://openreview.net/forum?id=ab2mCzEPwqK,https://openreview.net/pdf?id=ab2mCzEPwqK,We condense datasets by learning a set of learnable codes defined in a compact latent space followed by a set of tiny decoders which map them differently to the original input space.,"In this paper, we introduce a novel approach for systematically solving the dataset condensation problem in an efficient manner by exploiting the regularity in a given dataset. Instead of condensing the dataset directly in the original input space, we assume a generative process of the dataset with a set of learnable codes defined in a compact latent space followed by a set of tiny decoders which map them differently to the original input space. 
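A minimal sketch of this factorized generative process, with assumed toy sizes: every learnable code can be paired with every tiny decoder, which is what multiplies the number of synthetic examples (the combinatorial pairing is spelled out in the next sentence of the abstract).

```python
import torch
import torch.nn as nn

n_codes, latent_dim, n_dec, img_dim = 10, 64, 5, 3 * 32 * 32  # assumed sizes

codes = nn.Parameter(torch.randn(n_codes, latent_dim))        # learnable latent codes
decoders = nn.ModuleList(
    [nn.Linear(latent_dim, img_dim) for _ in range(n_dec)]    # tiny decoders ("styles")
)

# Pairing every code with every decoder yields n_codes * n_dec synthetic
# examples while storing only the codes and the small decoders.
synthetic = torch.stack([dec(codes) for dec in decoders])     # (n_dec, n_codes, img_dim)
print(synthetic.reshape(-1, img_dim).shape)                   # 50 synthetic examples
```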
By combining different codes and decoders interchangeably, we can dramatically increase the number of synthetic examples with essentially the same parameter count, because the latent space is much lower dimensional and we can use as many decoders as necessary to capture the different styles represented in the dataset at negligible cost. Such knowledge factorization allows efficient sharing of information between synthetic examples in a systematic way, providing a far better trade-off between compression ratio and quality of the generated examples. We experimentally show that our method achieves new state-of-the-art records by significant margins on various benchmark datasets, such as SVHN, CIFAR10, CIFAR100, and TinyImageNet.","Dataset condensation, Generative models" Optimal Neural Network Approximation of Wasserstein Gradient Direction via Convex Optimization,https://openreview.net/forum?id=4NLyCJQR3ZR,https://openreview.net/pdf?id=4NLyCJQR3ZR,Wasserstein gradient descent meets neural networks and convex optimization,"The computation of the Wasserstein gradient direction is essential for posterior sampling problems and scientific computing. The approximation of the Wasserstein gradient with finite samples requires solving a variational problem. We study the variational problem in the family of two-layer networks with squared-ReLU activations, towards which we derive a semi-definite programming (SDP) relaxation. This SDP can be viewed as an approximation of the Wasserstein gradient in a broader function family including two-layer networks. By solving the convex SDP, we obtain the optimal approximation of the Wasserstein gradient direction in this class of functions. Numerical experiments including PDE-constrained Bayesian inference and parameter estimation in COVID-19 modeling demonstrate the effectiveness of the proposed method.","Bayesian inference, convex optimization, semi-definite programming" Parallel Deep Neural Networks Have Zero Duality Gap,https://openreview.net/forum?id=6zrOr_Rdhjs,https://openreview.net/pdf?id=6zrOr_Rdhjs,,"Training deep neural networks is a challenging non-convex optimization problem. Recent work has proven that strong duality holds (which means zero duality gap) for regularized finite-width two-layer ReLU networks and consequently provided an equivalent convex program for training. However, extensions of this result to deeper networks remain an open problem. In this paper, we prove that the duality gap for deeper linear networks with vector outputs is non-zero. In contrast, we show that a zero duality gap can be obtained by stacking standard deep networks in parallel, which we call a parallel architecture. Therefore, we prove strong duality and the existence of equivalent convex programs that enable convex and globally optimal training of deep networks. As a by-product of our analysis, we demonstrate that the weight decay regularization on the network parameters explicitly encourages low-rank solutions via closed-form expressions. 
In addition, we show that strong duality holds for three-layer standard ReLU networks given rank-1 data matrices.","Deep neural networks, Convex duality, Convex optimization" Causal RL Agents for Out-of-distribution Generalization,https://openreview.net/forum?id=GQVfDsoFSBg,https://openreview.net/pdf?id=GQVfDsoFSBg,This paper proposes a novel technique GCRL to learn an OOD generalization policy by establishing the dependence of actions on a disentangled representation that captures the information about causal factors. ,"Out-of-distribution (OOD) generalization is critical for applying reinforcement learning algorithms to real-world applications. To address the OOD problem, recent works focus on learning an OOD adaptation policy by capturing the causal factors affecting the environmental dynamics. However, these works recover the causal factors in only an entangled or binary form, resulting in limited policy generalization that requires extra data from the testing environments. To break this limitation, we propose Generalizable Causal Reinforcement Learning (GCRL) to learn a disentangled representation of causal factors, on the basis of which we learn a policy that achieves OOD generalization without extra training. For capturing the causal factors, GCRL deploys a variant of the $\beta$-VAE structure with a two-stage constraint to ensure that all factors can be disentangled. Then, to achieve OOD generalization through causal factors, we adopt an additional network to establish the dependence of actions on the learned representation. Theoretically, we prove that while the optimal policy can be found in training environments, the established dependence can recover the causal relationship between causal factors and actions. Experimental results show that GCRL achieves OOD generalization on eight benchmarks from Causal World and Mujoco. Moreover, the policy learned by our model is more explainable and can be controlled to generate semantic actions by intervening in the representation of causal factors.","Reinforcement Learning, Out-of-distribution Generalization, Disentangled Representation" Multi-domain image generation and translation with identifiability guarantees,https://openreview.net/forum?id=U2g8OGONA_V,https://openreview.net/pdf?id=U2g8OGONA_V,"We propose a way to learn the pairing information from unpaired data with theoretical guarantees, with direct applications in learning tasks such as image-to-image translation","Multi-domain image generation and unpaired image-to-image translation are two important and related computer vision problems. The common technique for the two tasks is the learning of a joint distribution from multiple marginal distributions. However, it is well known that there can be infinitely many joint distributions that can derive the same marginals. Hence, it is necessary to formulate suitable constraints to address this highly ill-posed problem. Inspired by the recent advances in nonlinear Independent Component Analysis (ICA) theory, we propose a new method to learn the joint distribution from the marginals by enforcing a specific type of minimal change across domains. We provide the formulations of minimal changes and some other assumptions, under which the true joint distribution across domains is identifiable. We also provide a practical implementation of multi-domain image generation and a technique to improve unpaired image-to-image translation. 
We apply our method to five multi-domain image generation and six image-to-image translation tasks. The superior performance of our model supports our theory and demonstrates the effectiveness of our method. ","multi-domain image generation, image translation, identifiability, Nonlinear ICA" Interventional Rationalization,https://openreview.net/forum?id=KoEa6h1o6D1,https://openreview.net/pdf?id=KoEa6h1o6D1,We propose a causal intervention method to remove spurious correlations in selective rationalization.,"Selective rationalizations improve the explainability of neural networks by selecting a subsequence of the input (i.e., rationales) to explain the prediction results. Although existing methods have achieved promising results, they still suffer from exploiting spurious correlations in the data (a.k.a. shortcuts) to compose rationales and make predictions. Inspired by causal theory, in this paper, we develop an interventional rationalization (Inter-RAT) to discover the causal rationales. Specifically, we first analyse the causalities among the input, rationales and results with a structural causal model. Then, we discover spurious correlations between the input and rationales, and between rationales and results, respectively, by identifying the confounder in the causalities. Next, based on the backdoor adjustment, we propose a causal intervention method to remove the spurious correlations between the input and rationales. Further, we discuss reasons why spurious correlations between the selected rationales and results exist by analysing the limitations of the sparsity constraint in the rationalization, and employ the causal intervention method to remove these correlations. Extensive experimental results on three real-world datasets clearly validate the effectiveness of our proposed method. ","rationalization, causal intervention" Information-Theoretic Analysis of Unsupervised Domain Adaptation,https://openreview.net/forum?id=c5tbxWXU9-y,https://openreview.net/pdf?id=c5tbxWXU9-y,We derived new information-theoretic generalization bounds for the unsupervised domain adaptation problem.,"This paper uses information-theoretic tools to analyze the generalization error in unsupervised domain adaptation (UDA). We present novel upper bounds for two notions of generalization error. The first notion measures the gap between the population risk in the target domain and that in the source domain, and the second measures the gap between the population risk in the target domain and the empirical risk in the source domain. While our bounds for the first kind of error are in line with the traditional analysis and give similar insights, our bounds on the second kind of error are algorithm-dependent, which also provide insights into algorithm design. Specifically, we present two simple techniques for improving generalization in UDA and validate them experimentally.","unsupervised domain adaptation, generalization, information theory, regularization" Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes,https://openreview.net/forum?id=PbkBDQ5_UbV,https://openreview.net/pdf?id=PbkBDQ5_UbV,,"We study offline reinforcement learning (RL) in partially observable Markov decision processes. In particular, we aim to learn an optimal policy from a dataset collected by a behavior policy which possibly depends on the latent state. 
Such a dataset is confounded in the sense that the latent state simultaneously affects the action and the observation, which is prohibitive for existing offline RL algorithms. To this end, we propose the \underline{P}roxy variable \underline{P}essimistic \underline{P}olicy \underline{O}ptimization (\texttt{P3O}) algorithm, which addresses the confounding bias and the distributional shift between the optimal and behavior policies in the context of general function approximation. At the core of \texttt{P3O} is a coupled sequence of pessimistic confidence regions constructed via proximal causal inference, which is formulated as minimax estimation. Under a partial coverage assumption on the confounded dataset, we prove that \texttt{P3O} achieves a $n^{-1/2}$-suboptimality, where $n$ is the number of trajectories in the dataset. To the best of our knowledge, \texttt{P3O} is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.", DELVING INTO THE HIERARCHICAL STRUCTURE FOR EFFICIENT LARGE-SCALE BI-LEVEL LEARNING,https://openreview.net/forum?id=NxpyLebsLAR,https://openreview.net/pdf?id=NxpyLebsLAR,,"Recent years have witnessed growing interest and emerging successes of bi-level learning in a wide range of applications, such as meta learning and hyper-parameter optimization. However, current bi-level learning approaches suffer from high memory and computation costs, especially in large-scale deep learning scenarios, due to the hierarchical optimization therein. {\textit {It is therefore interesting to know whether the hierarchical structure can be untied for efficient learning}.} To answer this question, we introduce NSGame, which transforms the hierarchical bi-level learning problem into a parallel Nash game while retaining a taste of hierarchy through a very small-scale Stackelberg game. We prove that the strong differential Stackelberg equilibrium (SDSE) of the bi-level learning problem corresponds to a local Nash equilibrium of NSGame. To obtain such an SDSE from NSGame, we introduce a two-time scale stochastic gradient descent (TTS-SGD) method, and provide a theoretical guarantee that the local Nash equilibrium obtained by the TTS-SGD method is an SDSE of the bi-level learning problem. We compare NSGame with representative bi-level learning models, such as MWN and MLC; experimental results on class imbalance learning and noisy label learning have verified that the proposed NSGame achieves comparable or even better results than the corresponding meta learning models, while NSGame is computationally more efficient.","Bi-level optimization, Meta learning, Nash game" Can GNNs Learn Heuristic Information for Link Prediction?,https://openreview.net/forum?id=_lnFErG3F1z,https://openreview.net/pdf?id=_lnFErG3F1z,We study existing state-of-the-art GNN-based link prediction methods and show that these methods can hardly learn heuristic information. Our experiments also support our analysis.,"Graph Neural Networks (GNNs) have shown superior performance in Link Prediction (LP). In particular, SEAL and its successors address the LP problem by classifying the subgraphs extracted specifically for candidate links, gaining state-of-the-art results. Nevertheless, we question whether these methods can effectively learn the information equivalent to link heuristics such as Common Neighbors, Katz index, etc. (we refer to such information as heuristic information in this work). We show that link heuristics and GNNs capture different information. 
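To make the contrast concrete, the pair-specific heuristics named above can be computed directly from the adjacency matrix, with no node-wise aggregation involved; a small illustrative sketch, where the truncation depth and decay for Katz are assumed:

```python
import numpy as np

def common_neighbors(adj: np.ndarray, u: int, v: int) -> int:
    """Count nodes adjacent to both endpoints of the candidate link (u, v)."""
    return int(np.sum(adj[u] * adj[v]))

def katz_index(adj: np.ndarray, u: int, v: int, beta: float = 0.05, k: int = 4) -> float:
    """Truncated Katz index: beta^l-weighted count of u-v paths of length l <= k."""
    score, walk = 0.0, np.eye(len(adj))
    for length in range(1, k + 1):
        walk = walk @ adj                  # walk[u, v] = #paths of this length
        score += beta ** length * walk[u, v]
    return score

adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 1],
                [1, 1, 0, 0],
                [0, 1, 0, 0]])
print(common_neighbors(adj, 0, 3), katz_index(adj, 0, 3))
```

A ComHG-style model could then embed such pair-specific scores and combine them with GNN node representations.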
Link heuristics usually collect pair-specific information by counting the involved neighbors or paths between two nodes in a candidate link, while GNNs learn node-wise representations through a neighborhood aggregation algorithm in which the two nodes in the candidate link do not pay special attention to each other. Our further analysis shows that SEAL-type methods only use a GNN to model the pair-specific subgraphs and also cannot effectively capture heuristic information. To verify our analysis, a straightforward way is to compare the LP performance between existing methods and a model that learns heuristic information independently of the GNN learning. To this end, we present a simple yet lightweight framework, ComHG, which directly Combines the embeddings of link Heuristics and the representations produced by a GNN. Experiments on OGB LP benchmarks show that ComHG outperforms all top competitors by a large margin, empirically confirming our propositions. Our experimental study also indicates that the contributions of link heuristics and the GNN to LP are sensitive to the graph degree, where the former is powerful on sparse graphs while the latter becomes dominant on dense graphs.","link prediction, graph neural networks, heuristics" Understanding Zero-shot Adversarial Robustness for Large-Scale Models,https://openreview.net/forum?id=P4bXCawRi5J,https://openreview.net/pdf?id=P4bXCawRi5J,,"Pretrained large-scale vision-language models like CLIP have exhibited strong generalization over unseen tasks. Yet imperceptible adversarial perturbations can significantly reduce CLIP's performance on new tasks. In this work, we identify and explore the problem of adapting large-scale models for zero-shot adversarial robustness. We first identify two key factors during model adaptation--training losses and adaptation methods--that affect the model's zero-shot adversarial robustness. We then propose a text-guided contrastive adversarial training loss, which aligns the text embeddings and the adversarial visual features with contrastive learning on a small set of training data. We apply this training loss to two adaptation methods, model finetuning and visual prompt tuning. We find that visual prompt tuning is more effective in the absence of texts, while finetuning wins when text guidance is present. Overall, our approach significantly improves the zero-shot adversarial robustness over CLIP, seeing an average improvement of 31 points over ImageNet and 15 zero-shot datasets. We hope this work can shed light on understanding the zero-shot adversarial robustness of large-scale models. ","Adversarial Robustness, Zero-Shot Recognition" HOYER REGULARIZER IS ALL YOU NEED FOR EXTREMELY SPARSE SPIKING NEURAL NETWORKS,https://openreview.net/forum?id=0L8tuglXJaW,https://openreview.net/pdf?id=0L8tuglXJaW,,"Spiking Neural Networks (SNNs) have emerged as an attractive spatio-temporal computing paradigm for a wide range of low-power vision tasks. However, state-of-the-art (SOTA) SNN models either incur multiple time steps, which hinders their deployment in real-time use cases, or increase the training complexity significantly. To mitigate this concern, we present a training framework (from scratch) for one-time-step SNNs that uses a novel variant of the recently proposed Hoyer regularizer. We estimate the threshold of each SNN layer as the Hoyer extremum of a clipped version of its activation map, where the clipping threshold is trained using gradient descent with our Hoyer regularizer. 
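A hedged sketch of the quantities involved: the Hoyer regularizer is commonly written as the squared ratio of the L1 to L2 norms, and one plausible reading of the "Hoyer extremum" threshold is the ratio $\|z\|_2^2/\|z\|_1$ of the clipped activations (which always lies between their mean and their max). Both forms below are assumptions, not the authors' exact code.

```python
import torch

def hoyer_reg(w: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Common Hoyer regularizer: squared L1 norm over squared L2 norm."""
    return w.abs().sum() ** 2 / (w.pow(2).sum() + eps)

def hoyer_extremum(act: torch.Tensor, clip: float, eps: float = 1e-8) -> torch.Tensor:
    """Assumed layer threshold: ||z||_2^2 / ||z||_1 of the clipped activations."""
    z = act.clamp(0.0, clip)
    return z.pow(2).sum() / (z.abs().sum() + eps)

act = torch.relu(torch.randn(128, 64))            # toy activation map
thr = hoyer_extremum(act, clip=float(act.quantile(0.9)))
spikes = (act >= thr).float()                     # one-time-step binary outputs
```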
This approach not only downscales the value of the trainable threshold, thereby emitting a large number of spikes for weight updates within a limited number of iterations (due to the single time step), but also shifts the pre-activation values away from the threshold, thereby mitigating the effect of noise that can degrade the SNN accuracy. Our approach outperforms existing spiking, binary, and adder neural networks in terms of the accuracy-FLOPs trade-off for complex image recognition tasks. Downstream experiments on object detection also demonstrate the efficacy of our approach.", Controllable Evaluation and Generation of Physical Adversarial Patch on Face Recognition,https://openreview.net/forum?id=I_HxBH2SeW,https://openreview.net/pdf?id=I_HxBH2SeW,,"Recent studies have revealed the vulnerability of face recognition models against physical adversarial patches, which raises security concerns about deployed face recognition systems. However, it is still challenging to ensure reproducibility for most attack algorithms under complex physical conditions, which results in the lack of a systematic evaluation of existing methods. It is therefore imperative to develop a framework that can readily and fairly evaluate the vulnerability of face recognition in the physical world. To this end, we propose to simulate the complex transformations of faces in the physical world via 3D face modeling, which serves as a digital counterpart of physical faces. The generic framework allows us to control different face variations and physical conditions to conduct reproducible evaluations conveniently. With this digital simulator, we further propose a Face3DAdv method considering the 3D face transformations and realistic physical variations. Extensive experiments validate that Face3DAdv can significantly improve the effectiveness of diverse physically realizable adversarial patches in both simulated and physical environments, against various white-box and black-box face recognition models.","Physical adversarial attacks, face recognition, robustness evaluation" Why Adversarial Training of ReLU Networks Is Difficult?,https://openreview.net/forum?id=Vsh8gspKmuu,https://openreview.net/pdf?id=Vsh8gspKmuu,"This paper theoretically analyzes the dynamics of adversarial perturbations, and further theoretically explains the difficulty of adversarial training.","This paper mathematically derives an analytic solution of the adversarial perturbation on a ReLU network, and theoretically explains the difficulty of adversarial training. Specifically, we formulate the dynamics of the adversarial perturbation generated by the multi-step attack, which shows that the adversarial perturbation tends to strengthen eigenvectors corresponding to a few top-ranked eigenvalues of the Hessian matrix of the loss w.r.t. the input. We also prove that adversarial training tends to strengthen the influence of unconfident input samples with large gradient norms in an exponential manner. Besides, we find that adversarial training strengthens the influence of the Hessian matrix of the loss w.r.t. network parameters, which makes adversarial training more likely to oscillate along the directions of a few samples, and boosts the difficulty of adversarial training. 
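The multi-step attack whose dynamics are formulated above is commonly instantiated as projected gradient descent; here is a minimal $\ell_\infty$ sketch with assumed budget and step size.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, step=2 / 255, iters=10):
    """Multi-step l_inf attack (PGD-style sketch). Repeated gradient-sign steps
    are what concentrate the perturbation along the top eigenvectors of the
    input Hessian in the analysis above."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(iters):
        loss = F.cross_entropy(model(x + delta), y)
        (grad,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += step * grad.sign()   # ascend the loss
            delta.clamp_(-eps, eps)       # project back into the budget
    return (x + delta).detach()
```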
Crucially, our proofs provide a unified explanation for previous findings in understanding adversarial training.","Adversarial attack, adversarial training" Rademacher Complexity Over $\mathcal{H} \Delta \mathcal{H}$ Class for Adversarially Robust Domain Adaptation,https://openreview.net/forum?id=_yoBvxHPT_Y,https://openreview.net/pdf?id=_yoBvxHPT_Y,This paper studies a variant of Rademacher complexity to analyze adversarially robust domain adaptation.,"In domain adaptation, a model is trained on a dataset generated from a source domain and its generalization is evaluated on a possibly different target domain. Understanding the generalization capability of the learned model is a longstanding question. Recent studies demonstrated that adversarially robust learning under $\ell_\infty$ attack is even harder to generalize to different domains. To thoroughly study the fundamental difficulty behind adversarially robust domain adaptation, we propose to analyze a key complexity measure that controls the cross-domain generalization: the adversarial Rademacher complexity over the $\mathcal{H} \Delta \mathcal{H}$ class. For linear models, we show that the adversarial Rademacher complexity over the $\mathcal{H} \Delta \mathcal{H}$ class is always greater than the non-adversarial one, which reveals the intrinsic hardness of adversarially robust domain adaptation. We also establish upper bounds on this complexity measure, and extend them to the ReLU neural network class as well. Finally, by properly extending our generalization bound for adversarially robust domain adaptation, we explain \emph{why adversarial training can help transfer model performance to different domains}. We believe our results initiate the study of the generalization theory of adversarially robust domain adaptation, and could shed light on distributed adversarially robust learning from heterogeneous sources -- a scenario typically encountered in federated learning applications.","domain adaptation, learning theory, adversarial learning" Continual evaluation for lifelong learning: Identifying the stability gap,https://openreview.net/forum?id=Zy350cRstc6,https://openreview.net/pdf?id=Zy350cRstc6,"Proposing an iteration-based continual evaluation framework for CL, we discover, quantify, and analyse the ""stability gap"", a phenomenon where upon learning new tasks, past tasks exhibit substantial but transient performance loss for SOTA CL methods.","Time-dependent data generating distributions have proven to be difficult for gradient-based training of neural networks, as the greedy updates result in catastrophic forgetting of previously learned knowledge. Despite the progress in the field of continual learning to overcome this forgetting, we show that common state-of-the-art methods still suffer from substantial forgetting upon starting to learn new tasks, except that this forgetting is temporary and followed by a phase of performance recovery. We refer to this intriguing but potentially problematic phenomenon of transient forgetting as the stability gap. The stability gap has likely remained under the radar due to the standard practice of evaluating continual learning models only after each task. Instead, we establish a framework for continual evaluation that uses per-iteration evaluation, and define a new set of metrics that enables quantifying the stability gap. 
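In pseudocode terms, the per-iteration protocol just described amounts to probing all seen tasks after every update rather than after every task; the helper names below are placeholders, and the worst-accuracy summary mirrors a min-ACC-style metric.

```python
def continual_eval(model, train_stream, eval_sets, train_step, evaluate):
    """Per-iteration continual evaluation (sketch with placeholder helpers)."""
    history = {name: [] for name in eval_sets}
    for batch in train_stream:                  # one gradient update per batch
        train_step(model, batch)
        for name, data in eval_sets.items():    # probe every seen task
            history[name].append(evaluate(model, data))
    # Transient drops right after a task switch -- the stability gap -- show up
    # here but are invisible to end-of-task evaluation.
    worst = {name: min(accs) for name, accs in history.items() if accs}
    return history, worst
```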
Empirically, we show that experience replay, constraint-based replay, and knowledge-distillation methods are all prone to the stability gap, and that the stability gap can be observed in both class- and domain-incremental learning benchmarks. Additionally, a controlled experiment shows that the stability gap increases when tasks are more dissimilar. Finally, by disentangling gradients into plasticity and stability components, we provide a conceptual explanation for the stability gap.","Continual learning, lifelong learning, incremental learning, evaluation metrics" On the Universal Approximation Property of Deep Fully Convolutional Neural Networks,https://openreview.net/forum?id=1tXzHPdOJGZ,https://openreview.net/pdf?id=1tXzHPdOJGZ,,"We study the approximation of shift-invariant or equivariant functions by deep fully convolutional networks from the dynamical systems perspective. We prove that deep residual fully convolutional networks and their continuous-layer counterpart can achieve universal approximation of these symmetric functions at constant channel width. Moreover, we show that the same can be achieved by non-residual variants with at least 2 channels in each layer and convolutional kernel size of at least 2. In addition, we show that these requirements are necessary, in the sense that networks with fewer channels or smaller kernels fail to be universal approximators.", Can We Faithfully Represent Absence States to Compute Shapley Values on a DNN?,https://openreview.net/forum?id=YV8tP7bW6Kt,https://openreview.net/pdf?id=YV8tP7bW6Kt,"We propose a method to examine and learn baseline values for Shapley values, which ensures that the absent variables do not introduce information to the model.","Although many methods have been proposed to estimate attributions of input variables, there still exists a significant theoretical flaw in the masking-based attribution methods, i.e., it is hard to examine whether the masking method faithfully represents the absence of input variables. Specifically, for masking-based attributions, setting an input variable to the baseline value is a typical way of representing the absence of the variable. However, there are no studies investigating how to represent the absence of input variables and verify the faithfulness of baseline values. Therefore, we revisit the feature representation of a deep model in terms of causality, and propose to use causal patterns to examine whether the masking method faithfully removes information encoded in the input variable. More crucially, it is proven that the causality can be explained as the elementary rationale of the Shapley value. Furthermore, we define the optimal baseline value from the perspective of causality, and we propose a method to learn the optimal baseline value. Experimental results have demonstrated the effectiveness of our method.","explainable AI, attribution methods, deep neural networks" FedGSNR: Accelerating Federated Learning on Non-IID Data via Maximum Gradient Signal to Noise Ratio,https://openreview.net/forum?id=RusKt9aoTON,https://openreview.net/pdf?id=RusKt9aoTON,This paper interprets federated learning algorithms with Gradient Signal to Noise Ratio and proposes the corresponding method to accelerate model convergence with optimal local updates in non-iid scenarios.,"Federated learning (FL) allows participants to jointly train a model without direct data sharing. 
In such a process, participants rather than the central server perform local updates of stochastic gradient descent (SGD) and the central server aggregates the gradients from the participants to update the global model. However, the non-iid training data in participants significantly impact global model convergence. Most existing studies addressed this issue by utilizing variance reduction or regularization. However, these studies focus on specific datasets and lack theoretical guarantees for efficient model training. In this paper, we provide a novel perspective on the non-iid issue by optimizing Gradient Signal to Noise Ratio (GSNR) during model training. In each participant, we decompose local gradients calculated on the non-iid training data into the signal and noise components and then speed up the model convergence by maximizing GSNR. We prove that GSNR can be maximized by using the optimal number of local updates. Subsequently, we develop FedGSNR to compute the optimal number of local updates for each participant, which can be applied to existing gradient calculation algorithms to accelerate the global model convergence. Moreover, according to the positive correlation between GSNR and the quality of shared information, FedGSNR allows the server to accurately evaluate contributions of different participants (i.e., the quality of local datasets) by utilizing GSNR. Extensive experimental evaluations demonstrate that FedGSNR achieves on average a 1.69× speedup with comparable accuracy.","Federated learning, Gradient Signal to Noise Ratio, Optimal Local Updates, Non-IID Data" Dataless Knowledge Fusion by Merging Weights of Language Models,https://openreview.net/forum?id=FCnohuR6AnM,https://openreview.net/pdf?id=FCnohuR6AnM,We study the problem of merging individual models built on different training data sets and propose a novel merging algorithm.,"Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models. Oftentimes fine-tuned models are readily available but their training data is not, due to data privacy or intellectual property concerns. This creates a barrier to fusing knowledge across individual models to yield a better single model. In this paper, we study the problem of merging individual models built on different training data sets to obtain a single model that performs well across all data set domains and can generalize to out-of-domain data. We propose a dataless knowledge fusion method that merges models in their parameter space, guided by weights that minimize prediction differences between the merged model and the individual models. Over a battery of evaluation settings, we show that the proposed method significantly outperforms baselines such as Fisher-weighted averaging or model ensembling. Further, we find that our method is a promising alternative to multi-task learning that can preserve or sometimes improve over the individual models without access to the training data. Finally, model merging is more efficient than training a multi-task model, thus making it applicable to a wider set of scenarios.","model merging, weight merging" Domain-Indexing Variational Bayes for Domain Adaptation,https://openreview.net/forum?id=pxStyaf2oJ5,https://openreview.net/pdf?id=pxStyaf2oJ5,,"Previous studies have shown that leveraging ""domain index"" can significantly boost domain adaptation performance (Wang et al., 2020; Xu et al., 2022). However, such domain indices are not always available.
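Editor's aside: the core quantity in the FedGSNR entry above is easy to state in code. Per-parameter GSNR is the squared mean of the per-sample gradients divided by their variance; a minimal numpy sketch follows, assuming per-sample gradients are already collected (the paper's optimal-local-update rule is not reproduced here).

```python
import numpy as np

def gsnr(per_sample_grads: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Gradient Signal to Noise Ratio, computed per parameter.

    per_sample_grads has shape (num_samples, num_params): one gradient
    per training example. The signal is the squared mean gradient; the
    noise is the per-sample variance around that mean.
    """
    signal = per_sample_grads.mean(axis=0) ** 2
    noise = per_sample_grads.var(axis=0)
    return signal / (noise + eps)

# 32 per-sample gradients over 10 parameters; a shared offset acts as signal.
grads = np.random.randn(32, 10) + 0.5
print(gsnr(grads).round(2))
```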
To address this challenge, we first provide a formal definition of the domain index from a probabilistic perspective, and then propose an adversarial variational Bayesian framework that infers domain indices from multi-domain data, thereby providing additional insight into domain relations and improving domain adaptation performance. Our theoretical analysis shows that our adversarial variational Bayesian framework finds the optimal domain index at equilibrium. Empirical results on both synthetic and real data verify that our model can produce interpretable domain indices which enable us to achieve superior performance compared to state-of-the-art domain adaptation methods.", Improving the Transferability of Adversarial Attacks through Experienced Precise Nesterov Momentum,https://openreview.net/forum?id=LV8OmADmoOe,https://openreview.net/pdf?id=LV8OmADmoOe,"Our proposed EPN is more effective than traditional momentum in improving transferability, and extensive experiments show that EPN-based attacks are more transferable than SOTA.","Deep Neural Networks are vulnerable to adversarial attacks, which makes adversarial attacks a means of evaluating the robustness of DNNs. However, adversarial attacks have high white-box attack success rates but poor transferability, making black-box attacks impracticable in the real world. Momentum-based attacks were proposed to accelerate optimization to improve transferability. Nevertheless, conventional momentum-based attacks accelerate optimization inefficiently during early iterations since the initial value of momentum is zero, which leads to unsatisfactory transferability. Therefore, we propose Experienced Momentum (EM), which is a pre-trained momentum. Initializing the momentum to EM can help accelerate optimization during the early iterations. Moreover, the pre-update of conventional Nesterov momentum-based attacks is rough, prompting us to propose Precise Nesterov momentum (PN). PN refines the pre-update by considering the gradient of the current data point. Finally, we integrate EM with PN as Experienced Precise Nesterov momentum (EPN) to further improve transferability. Extensive experiments against normally trained and defense models demonstrate that our EPN is more effective than conventional momentum in the improvement of transferability. Specifically, the attack success rates of our EPN-based attacks are $\sim$11.9% and $\sim$13.1% higher than conventional momentum-based attacks on average against normally trained and defense models, respectively.","adversarial attacks, transferability, black-box, momentum" TaylorNet: A Taylor-Driven Generic Neural Architecture,https://openreview.net/forum?id=tDNGHd0QmzO,https://openreview.net/pdf?id=tDNGHd0QmzO,"We propose a generic neural architecture, called TaylorNet, that can introduce inductive bias to DNNs with Taylor series expansion","Physics-informed machine learning (PIML) aims to incorporate physics knowledge into deep neural networks (DNNs) to improve model generalization. However, existing methods in PIML are either designed for specific problems or yield results that are hard to interpret due to black-box DNNs. In this work, we propose Taylor Neural Network (TaylorNet), a generic neural architecture that parameterizes Taylor polynomials using DNNs without using non-linear activation functions. The key challenges of developing TaylorNet lie in: (i) mitigating the curse of dimensionality caused by higher-order terms, and (ii) improving the stability of model training.
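Editor's aside: the EPN entry above builds on the standard momentum-attack template. Below is a loose MI-FGSM-style PyTorch sketch showing where the "experienced" (pre-trained) momentum would enter, via the `init_momentum` argument, a stand-in for the paper's EM buffer; the precise Nesterov pre-update (PN) is not reproduced.

```python
import torch

def momentum_attack(model, x, y, eps=8/255, alpha=2/255, steps=10,
                    mu=1.0, init_momentum=None):
    """MI-FGSM-style L_inf attack with a warm-startable momentum buffer.

    Conventional momentum attacks start from g = 0; the EM idea is to
    start from a pre-trained momentum instead (caller-supplied here).
    """
    x_adv = x.clone().detach()
    g = torch.zeros_like(x) if init_momentum is None else init_momentum.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = torch.nn.functional.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        g = mu * g + grad / grad.abs().mean()            # normalized accumulation
        x_adv = x_adv.detach() + alpha * g.sign()
        x_adv = x.clone() + (x_adv - x).clamp(-eps, eps)  # project to L_inf ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()
```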
To overcome these challenges, we first adopt Tucker decomposition to decompose the higher-order derivatives in the Taylor expansion parameterized by DNNs into low-rank tensors. Then we propose a novel reducible TaylorNet to further reduce the computational complexity by removing more redundant parameters in the hidden layers. In order to improve training accuracy and stability, we develop a new Taylor initialization method. Finally, the proposed models are evaluated on a broad spectrum of applications, including image classification, natural language processing (NLP), and dynamical systems. The results demonstrate that our proposed Taylor-Mixer, which replaces the MLP and activation layers in the MLP-Mixer with Taylor layers, can achieve comparable accuracy on image classification, and similarly on sentiment analysis in NLP, while significantly reducing the number of model parameters. More importantly, our method can interpret some dynamical systems with Taylor polynomials. Meanwhile, the results demonstrate that our Taylor initialization can significantly improve classification accuracy compared to Xavier and Kaiming initialization.","Taylor Neural Networks, Image Classification, Physics Guided Machine Learning, Dynamical Systems" Semi-supervised learning of partial differential operators and dynamical flows,https://openreview.net/forum?id=-i73LPWa3bD,https://openreview.net/pdf?id=-i73LPWa3bD,,"The evolution of dynamical systems is generically governed by nonlinear partial differential equations (PDEs), whose solution, in a simulation framework, requires vast amounts of computational resources. In this work, we present a novel method that combines a hyper-network solver with a Fourier Neural Operator architecture. Our method treats time and space separately, and as a result it successfully propagates initial conditions in continuous time steps by employing the general composition properties of the partial differential operators. Following previous works, supervision is provided at a specific time point. We test our method on various time evolution PDEs, including nonlinear fluid flows in one, two, or three spatial dimensions. The results show that the new method improves the learning accuracy at the time of the supervision point, and can interpolate the solutions to any intermediate time.", View Synthesis with Sculpted Neural Points,https://openreview.net/forum?id=0ypGZvm0er0,https://openreview.net/pdf?id=0ypGZvm0er0,,"We address the task of view synthesis, generating novel views of a scene given a set of images as input. In many recent works such as NeRF, the scene geometry is parameterized using neural implicit representations (MLPs). Implicit neural representations have achieved impressive visual quality but have drawbacks in computational efficiency. In this work, we propose a new approach that performs view synthesis using point clouds. It is the first point-based method that achieves better visual quality than NeRF while being 100x faster in rendering speed. Our approach builds on existing works on differentiable point-based rendering but introduces a novel technique we call ""Sculpted Neural Points (SNP)"", which significantly improves the robustness to errors and holes in the reconstructed point cloud. We further propose to use view-dependent point features based on spherical harmonics to capture non-Lambertian surfaces, and new designs in the point-based rendering pipeline that further boost the performance.
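Editor's aside: to make the TaylorNet idea above concrete, here is a polynomial layer with no nonlinear activation, in which a low-rank bilinear term stands in for a (Tucker-decomposed) second-order Taylor term. Everything here, including `TaylorLayer2` and the rank-`r` factorization, is an illustrative assumption rather than the paper's actual parametrization.

```python
import torch
import torch.nn as nn

class TaylorLayer2(nn.Module):
    """Second-order polynomial layer with a low-rank quadratic term.

    y = b + W x + U ((V1 x) * (V2 x)): the elementwise product of two
    linear maps plays the role of a rank-restricted second-order term,
    so no activation function is needed anywhere.
    """
    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)           # zeroth + first order
        self.v1 = nn.Linear(d_in, rank, bias=False)  # low-rank factors
        self.v2 = nn.Linear(d_in, rank, bias=False)
        self.u = nn.Linear(rank, d_out, bias=False)

    def forward(self, x):
        return self.lin(x) + self.u(self.v1(x) * self.v2(x))

layer = TaylorLayer2(16, 4)
print(layer(torch.randn(2, 16)).shape)  # torch.Size([2, 4])
```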
Finally, we show that our system supports fine-grained scene editing in a user-friendly way.", Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval,https://openreview.net/forum?id=PQOlkgsBsik,https://openreview.net/pdf?id=PQOlkgsBsik,"This paper presents Vision-Language Universal Search (VL-UnivSearch), which builds a unified model for multi-modal retrieval, learns universal representations for images and texts, and achieves the state-of-the-art. ","This paper presents Universal Vision-Language Dense Retrieval (UniVL-DR), which builds a unified model for multi-modal retrieval. UniVL-DR encodes queries and multi-modality resources in an embedding space for searching candidates from different modalities. To learn a unified embedding space for multi-modal retrieval, UniVL-DR proposes two techniques: 1) a universal embedding optimization strategy, which contrastively optimizes the embedding space using modality-balanced hard negatives; 2) an image verbalization method, which bridges the modality gap between images and texts in the raw data space. UniVL-DR achieves the state-of-the-art on the multi-modal open-domain question answering benchmark, WebQA, and outperforms all retrieval models on the two subtasks, text-text retrieval and text-image retrieval. It demonstrates that universal multi-modal search is feasible, can replace the divide-and-conquer pipeline with a unified model, and also benefits single/cross-modality tasks. All source codes will be released via GitHub.","Multi-Modal Retrieval, Dense Retrieval, Universal Embedding Space, Modality-Balanced Hard Negative Training, Image Verbalization" Continual Pre-trainer is an Incremental Model Generalizer,https://openreview.net/forum?id=eaEjWtX3xkY,https://openreview.net/pdf?id=eaEjWtX3xkY,"In this paper, we tackle a novel problem of Continual Pre-training, which aims to increment the generalization of model representations, encouraging positive transfer for future problems.","With the necessity of lifelong-learnable machines for continuously changing real-world problems, there has been rapid progress in continual learning in recent years. However, most recent works on continual learning focus on alleviating catastrophic forgetting of a model trained over a sequence of vision tasks, considering only the performance on the tasks themselves rather than the representation transferability. In this paper, we tackle a novel problem of Continual Pre-training, which aims to increment the generalization of model representations, encouraging positive transfer for future problems. An initial empirical study yields a rather surprising finding: the transfer quality of the pre-trained model representation, with both supervised and unsupervised task sequences, shows no noticeable degradation even with full fine-tuning. Furthermore, we propose a simple yet efficient Continual Pre-training method with GLobal Attention Discretization (GLAD) which introduces a new constraint to increment the global transferability of the backbone while projecting model weights to adapt to target problems via additional weight vectors.
Our continual pre-training method breaks the barrier between the pre-training and fine-tuning steps and leads to an integrated design that combines continual representation learning with continual learning of the task-specific learners.","Masked Image Modeling, Representation Learning, Continual Learning, Unsupervised Learning, Pretraining" DFlow: Learning to Synthesize Better Optical Flow Datasets via a Differentiable Pipeline,https://openreview.net/forum?id=5O2uzDusEN5,https://openreview.net/pdf?id=5O2uzDusEN5,Differentiable and efficient optical flow data generation pipeline,"Comprehensive studies of synthetic optical flow datasets have attempted to reveal what properties lead to accuracy improvement. However, manually identifying and verifying all such necessary properties are intractable mainly due to the requirement of large-scale trial-and-error experiments with iteratively generating whole synthetic datasets. To tackle this challenge, we propose a differentiable optical flow data generation pipeline and a loss function to drive the pipeline, called DFlow. These enable automatic and efficient synthesis of a dataset effective for a target domain, given a snippet of target data. This distinctiveness is achieved by proposing an efficient data comparison method, where we approximately encode reference sets of data into neural networks and compare the proxy networks instead of explicitly comparing datasets in a sample-wise way. Our experiments show the competitive performance of our DFlow against prior art in pre-training. Moreover, the RAFT model pre-trained with DFlow achieves state-of-the-art performance on the Sintel public benchmark in fine-tuning.","Synthetic data, Optical flow" FS-DETR: Few-Shot DEtection TRansformer with prompting and without re-training,https://openreview.net/forum?id=0N66Gl63vq,https://openreview.net/pdf?id=0N66Gl63vq,,"This paper is on Few-Shot Object Detection (FSOD), where given a few templates (examples) depicting a novel class (not seen during training), the goal is to detect all of its occurrences within a set of images. From a practical perspective, an FSOD system must fulfil the following desiderata: (a) it must be used as is, without requiring any fine-tuning at test time, (b) it must be able to process an arbitrary number of novel objects concurrently while supporting an arbitrary number of examples from each class and (c) it must achieve accuracy comparable to a closed system. While there are (relatively) few systems that support (a), to our knowledge, there is no system supporting (b) and (c). In this work, we make the following contributions: We introduce, for the first time, a simple, yet powerful, few-shot detection transformer (FS-DETR) that can address both desiderata (a) and (b). Our system builds upon the DETR framework, extending it based on two key ideas: (1) feed the provided visual templates of the novel classes as visual prompts during test time, and (2) ``stamp'' these prompts with pseudo-class embeddings, which are then predicted at the output of the decoder. Importantly, we show that our system is not only more flexible than existing methods, but also, taking a step towards satisfying desideratum (c), it is more accurate, matching and outperforming the current state-of-the-art on the most well-established benchmarks (PASCAL VOC & MSCOCO) for FSOD.
Code will be made available.", One-Pixel Shortcut: On the Learning Preference of Deep Neural Networks,https://openreview.net/forum?id=p7G8t5FVn2h,https://openreview.net/pdf?id=p7G8t5FVn2h,"We propose a model-free method to craft unlearnable examples by perturbing only one pixel, and construct a benchmark containing images that are unlearnable by various existing methods to avoid shortcut learning.","Unlearnable examples (ULEs) aim to protect data from unauthorized usage for training DNNs. Existing work adds $\ell_\infty$-bounded perturbations to the original sample so that the trained model generalizes poorly. Such perturbations, however, are easy to eliminate by, e.g., adversarial training and data augmentations. In this paper, we resolve this problem from a novel perspective by perturbing only one pixel in each image. Interestingly, such a small modification could effectively degrade model accuracy to almost that of an untrained counterpart. Moreover, our produced \emph{One-Pixel Shortcut (OPS)} cannot be erased by adversarial training and strong augmentations. To generate OPS, we perturb all images of a class at the same pixel, choosing the position and value (0 or 1) that deviates most strongly and stably from all original images. Since this calculation is based only on images, OPS needs significantly less computation than the previous methods based on model training. By OPS, we introduce an unlearnable dataset called CIFAR-10-S, which is indistinguishable from CIFAR-10 by humans but reduces the trained model to extremely low accuracy. Even under adversarial training, a ResNet-18 trained on CIFAR-10-S has only 10.61% accuracy, compared to 83.02% by the existing error-minimizing method.","unlearnable examples, shortcut learning, one-pixel feature, deep neural network" An Improved Baseline for Masked Contrastive Learning,https://openreview.net/forum?id=zWnq5AFNhFH,https://openreview.net/pdf?id=zWnq5AFNhFH,"We develop an improved contrastive baseline for vision transformer, which rivals the fine-tuning performance of masked image prediction.","Contrastive learning has significantly advanced self-supervised visual representation learning, making linear probe accuracy close to its supervised counterpart on ImageNet. However, vision transformers pre-trained with contrastive learning typically underperform those pre-trained with masked image prediction, when evaluated on fine-tuning benchmarks, e.g., image classification, object detection, and segmentation. In this paper, we improve the fine-tuning transfer performance of prior state-of-the-art contrastive approaches, e.g., MoCo-v3 and BYOL, from the following empirical perspectives: (i) applying masking strategies to input views; (ii) studying and comparing the effectiveness of Batch Normalization and Layer Normalization in projection and prediction heads; (iii) investigating the effectiveness of data augmentation and finding that lighter augmentation during pre-training improves fine-tuning performance. As a result, we come up with a better baseline for contrastive transformers that outperforms baseline MoCo-v3 by $0.6\%$ on ImageNet fine-tuning, and $2.1$ mAP on MS COCO detection and segmentation benchmark for ViT-B, rivaling that of masked image prediction. Furthermore, our approach is significantly more efficient than MoCo-v3 due to the use of masking.
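Editor's aside: the OPS selection rule above ("deviates most strongly and stably") admits a simple model-free reading: score each (pixel, value) pair by its mean deviation minus its deviation spread across a class's images. A hedged numpy sketch follows; the scoring function is our guess at the criterion, not the authors' exact formula.

```python
import numpy as np

def one_pixel_shortcut(images: np.ndarray) -> np.ndarray:
    """Stamp one (pixel, value) shortcut on all images of a class.

    images: (n, H, W) array in [0, 1] for a single class. The pixel is
    chosen so that setting it to 0 or 1 deviates strongly (large mean
    shift) and stably (small spread) from the originals; no model needed.
    """
    best, best_score = None, -np.inf
    for target in (0.0, 1.0):
        dev = np.abs(target - images)               # per-image deviation
        score = dev.mean(axis=0) - dev.std(axis=0)  # strong *and* stable
        idx = np.unravel_index(score.argmax(), score.shape)
        if score[idx] > best_score:
            best, best_score = (idx, target), score[idx]
    (i, j), target = best
    out = images.copy()
    out[:, i, j] = target                           # the one-pixel shortcut
    return out

imgs = np.random.rand(100, 32, 32) * 0.5            # darkish class -> value 1 wins
print(one_pixel_shortcut(imgs).shape)
```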
These results suggest that, contrary to recent trends, contrastive learning remains competitive with masked image prediction on standard vision tasks.","contrastive learning, self-supervised learning, vision transformer" Make Memory Buffer Stronger in Continual Learning: A Continuous Neural Transformation Approach,https://openreview.net/forum?id=CniFDGvqbUZ,https://openreview.net/pdf?id=CniFDGvqbUZ,,"Continual learning (CL) focuses on learning from non-stationary data distributions without forgetting previous knowledge. However, the most widely used memory-replay approach often suffers from memory overfitting. To mitigate memory overfitting, we propose a continuous and reversible memory transformation method so that the memory data is hard to overfit, thus improving generalization. The transformation is achieved by optimizing a bi-level optimization objective that jointly learns the CL model and memory transformer. Specifically, we propose a deterministic continuous memory transformer (DCMT) modeled by an ordinary differential equation, allowing for infinite memory transformations and generating diverse and hard memory data. Furthermore, we inject uncertainty into the transformation function and propose a stochastic continuous memory transformer (SCMT) modeled by a stochastic differential equation, which substantially enhances the diversity of the transformed memory buffer. The proposed neural transformation approaches have significant advantages over existing ones: (1) we can obtain infinitely many transformed samples, thus significantly increasing the memory buffer diversity; (2) the proposed continuous transformations are reversible, i.e., the original raw memory data could be restored from the transformed memory data without the need to make a replica of the memory data. Extensive experiments on both task-aware and task-free CL show significant improvement with our approach compared to strong baselines. ",Continual Learning Sparse Random Networks for Communication-Efficient Federated Learning,https://openreview.net/forum?id=k1FHgri5y3-,https://openreview.net/pdf?id=k1FHgri5y3-,"We propose an FL framework, where clients find a sparse random network using a stochastic strategy; and provide (1) lower communication cost, (2) higher accuracy, (3) faster convergence, and (4) at the end of the training, a compressed final model.","One main challenge in federated learning is the large communication cost of exchanging weight updates from clients to the server at each round. While prior work has made great progress in compressing the weight updates through gradient compression methods, we propose a radically different approach that does not update the weights at all. Instead, our method freezes the weights at their initial \emph{random} values and learns how to sparsify the random network for the best performance. To this end, the clients collaborate in training a \emph{stochastic} binary mask to find the optimal sparse random network within the original one. At the end of the training, the final model is a sparse network with random weights -- or a subnetwork inside the dense random network. We show improvements in accuracy, communication (less than $1$ bit per parameter (bpp)), convergence speed, and final model size (less than $1$ bpp) over relevant baselines on MNIST, EMNIST, CIFAR-10, and CIFAR-100 datasets, in the low bitrate regime.","communication-efficient federated learning, sparse networks with random weights, compression, sparsity."
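Editor's aside: the sparse-random-network recipe above is straightforward to sketch: freeze weights at their random initialization and train only real-valued scores from which a stochastic binary mask is sampled, using a straight-through gradient. A minimal PyTorch layer illustrating the idea (not the authors' exact parametrization):

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Linear layer with frozen random weights and a trainable mask.

    Only the scores are learned; the binary mask sampled from them is
    what clients would exchange (~1 bit per parameter).
    """
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        w = torch.randn(d_out, d_in) / d_in ** 0.5
        self.weight = nn.Parameter(w, requires_grad=False)    # frozen at init
        self.scores = nn.Parameter(torch.zeros(d_out, d_in))  # trainable

    def forward(self, x):
        p = torch.sigmoid(self.scores)        # keep-probability per weight
        m = torch.bernoulli(p.detach())       # stochastic binary mask
        mask = m + p - p.detach()             # straight-through estimator
        return x @ (self.weight * mask).t()

layer = MaskedLinear(8, 3)
layer(torch.randn(4, 8)).sum().backward()
print(layer.scores.grad is not None, layer.weight.grad is None)  # True True
```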
WaveMix-Lite: A Resource-efficient Neural Network for Image Analysis,https://openreview.net/forum?id=y_icnxeeUcl,https://openreview.net/pdf?id=y_icnxeeUcl,WaveMix-Lite uses the 2D discrete wavelet transform for resource-efficient token-mixing and performs better than CNNs and transformers in image classification and segmentation tasks while requiring less GPU RAM and fewer parameters.,"Gains in the ability to generalize on image analysis tasks for neural networks have come at the cost of an increased number of parameters and layers, dataset sizes, training and test computations, and GPU RAM. We introduce a new architecture -- WaveMix-Lite -- that can generalize on par with contemporary transformers and convolutional neural networks (CNNs) while needing fewer resources. WaveMix-Lite uses the 2D discrete wavelet transform to efficiently mix spatial information from pixels. WaveMix-Lite seems to be a versatile and scalable architectural framework that can be used for multiple vision tasks, such as image classification and semantic segmentation, without requiring significant architectural changes, unlike transformers and CNNs. It is able to meet or exceed several accuracy benchmarks while training on a single GPU. For instance, it achieves state-of-the-art accuracy on five EMNIST datasets, outperforms CNNs and transformers in ImageNet-1K and Places-365, and achieves an mIoU of 77\% on the Cityscapes validation set, while using less than one-fifth the number of parameters and half the GPU RAM of comparable CNNs or transformers. Our experiments show that while the convolutional elements of neural architectures exploit the shift-invariance property of images, new types of layers (e.g., wavelet transform) can exploit additional properties of images, such as scale-invariance and finite spatial extents of objects.","image classification, segmentation, resource-efficient, token-mixer, wavelet, sota" On the Impact of Adversarially Robust Models on Algorithmic Recourse,https://openreview.net/forum?id=BGId14emsBj,https://openreview.net/pdf?id=BGId14emsBj,,"The widespread deployment of machine learning models in various high-stakes settings has underscored the need for ensuring that individuals who are adversely impacted by model predictions are provided with a means for recourse. To this end, several algorithms have been proposed in recent literature to generate recourses. Recent research has also demonstrated that the recourses generated by these algorithms often correspond to adversarial examples. This key finding emphasizes the need for a deeper understanding of the impact of adversarially robust models (which are designed to guard against adversarial examples) on algorithmic recourse. In this work, we make one of the first attempts at studying the impact of adversarially robust models on algorithmic recourse. We theoretically and empirically analyze the cost (ease of implementation) and validity (probability of obtaining a positive model prediction) of the recourses output by state-of-the-art algorithms when the underlying models are adversarially robust. More specifically, we construct theoretical bounds on the differences between the cost and the validity of the recourses generated by various state-of-the-art algorithms when the underlying models are adversarially robust vs. non-robust.
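Editor's aside: WaveMix-Lite's token-mixing step above can be illustrated with an off-the-shelf wavelet library: one level of 2D Haar DWT splits an image into four half-resolution subbands, which are stacked and passed to ordinary learned layers. A sketch assuming PyWavelets (`pywt`) is installed; the paper's full block structure is omitted.

```python
import numpy as np
import pywt

def wavemix_token_mix(img: np.ndarray) -> np.ndarray:
    """One level of 2D-DWT 'token mixing' on a single-channel image.

    The Haar DWT yields an approximation subband and three detail
    subbands at half resolution -- a lossless, parameter-free way to mix
    spatial information before any learned (MLP/conv) layers.
    """
    cA, (cH, cV, cD) = pywt.dwt2(img, "haar")
    return np.stack([cA, cH, cV, cD], axis=0)  # shape (4, H/2, W/2)

img = np.random.rand(32, 32)
print(wavemix_token_mix(img).shape)  # (4, 16, 16)
```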
We also carry out extensive empirical analysis with multiple real-world datasets to not only validate our theoretical results, but also analyze the impact of varying degrees of model robustness on the cost and validity of the resulting recourses. Our theoretical and empirical analyses demonstrate that adversarially robust models significantly increase the cost and reduce the validity of the resulting recourses, thereby shedding light on the inherent trade-offs between achieving adversarial robustness in predictive models and providing easy-to-implement and reliable algorithmic recourse.","Algorithmic Recourse, Adversarial Robustness, Machine Learning" "Learning to acquire novel cognitive tasks with evolution, plasticity and meta-meta-learning",https://openreview.net/forum?id=WoV6DA5P9OL,https://openreview.net/pdf?id=WoV6DA5P9OL,"We evolve plastic networks that can automatically acquire novel, cognitive (memory-dependent) tasks (never seen during evolution) from stimuli and rewards alone, much like animals do.","A hallmark of intelligence is the ability to autonomously learn new flexible, cognitive behaviors - that is, behaviors where the appropriate action depends not just on immediate stimuli (as in simple reflexive stimulus-response associations), but on memorized contextual information. Such cognitive, memory-dependent behaviors are by definition meta-learning tasks. In typical meta-learning experiments, agents are trained with an external, human-designed algorithm to learn a given cognitive task. By contrast, animals are able to pick up new cognitive tasks automatically, from stimuli and rewards alone: evolution has designed animal brains as self-contained reinforcement (meta-)learning systems, capable not just of performing specific cognitive tasks, but of acquiring novel cognitive tasks, including tasks never seen during evolution. Can we harness this process to generate artificial agents with such abilities? Here we evolve neural networks, endowed with plastic connections and neuromodulation, over a sizable set of simple meta-learning tasks based on a framework from computational neuroscience. The resulting evolved networks can automatically modify their own connectivity to acquire a novel simple cognitive task, never seen during evolution, through the spontaneous operation of their evolved neural organization and plasticity system. We suggest that attending to the multiplicity of loops involved in natural learning may provide useful insight into the emergence of intelligent behavior.","Evolution, Meta-learning, Neuromodulation, Plasticity" Breaking Beyond COCO Object Detection,https://openreview.net/forum?id=hj7uBF92qvm,https://openreview.net/pdf?id=hj7uBF92qvm,"An analysis of the state of the art in object detection, the empirical upper bound, and errors in models and datasets","The COCO dataset has become the de facto standard for training and evaluating object detectors. According to the recent benchmarks, however, performance on this dataset is still far from perfect, which raises the following questions: a) how far can we improve the accuracy on this dataset using deep learning, b) what is holding us back from making progress in object detection, and c) what are the limitations of the COCO dataset and how can they be mitigated. To answer these questions, first, we propose a systematic approach to determine the empirical upper bound in AP over COCOval2017, and show that this upper bound is significantly higher than the state-of-the-art mAP (78.2% vs. 58.8%).
Second, we introduce two complementary datasets to COCO: i) COCO_OI, composed of images from COCO and OpenImages (from 80 classes in common) with 1,418,978 training bounding boxes over 380,111 images, and 41,893 validation bounding boxes over 18,299 images, and ii) ObjectNet_D, containing objects in daily life situations (derived from ObjectNet, originally created for object recognition; 29 categories in common with COCO). We evaluate models on these datasets and pinpoint the annotation errors on the COCO validation set. Third, we characterize the sources of errors in modern object detectors using a recently proposed error analysis tool (TIDE) and find that models behave differently on these datasets compared to COCO. For instance, missing objects are more frequent in the new datasets. We also find that models lack out-of-distribution generalization. Code and data will be shared.","object detection, deep learning, performance analysis" BinaryVQA: A Versatile Dataset to Push the Limits of VQA Models,https://openreview.net/forum?id=i2JgYVPce1i,https://openreview.net/pdf?id=i2JgYVPce1i,We introduce a new test set for free-form visual question answering (VQA) called BinaryVQA to push the limits of VQA models. ,"We introduce a new test set for visual question answering (VQA) called BinaryVQA to push the limits of VQA models. Our dataset includes 7,800 questions across 1,024 images and covers a wide variety of objects, topics, and concepts. For easy model evaluation, we only consider binary questions. Questions and answers are formulated and verified carefully and manually. Around 63% of the questions have positive answers. The median number of questions per image and question length are 7 and 5, respectively. The state-of-the-art OFA model achieves 75% accuracy on the BinaryVQA dataset, which is significantly lower than its performance on the VQA v2 test-dev dataset (94.7%). We also analyze the model behavior along several dimensions including a) performance over different categories such as text, counting and gaze direction, b) model interpretability, c) the effect of question length on accuracy, d) bias of models towards positive answers and introduction of a new score called the “ShuffleAcc”, and e) sensitivity to spelling and grammar errors. Our investigation demonstrates the difficulty of our dataset and shows that it can challenge VQA models for years to come. Data and code are available at [Masked].","Visual question answering, dataset benchmarks, datasets" NormSoftmax: Normalize the Input of Softmax to Accelerate and Stabilize Training,https://openreview.net/forum?id=4g7nCbpjNwd,https://openreview.net/pdf?id=4g7nCbpjNwd,,"Softmax is a basic function that normalizes a vector to a probability distribution and is widely used in machine learning, most notably in the cross-entropy loss function and dot-product attention operations. However, optimization of softmax-based models is sensitive to changes in input statistics. We observe that the input of softmax changes significantly during the initial training stage, causing slow and unstable convergence when training the model from scratch. To remedy the optimization difficulty of softmax, we propose a simple yet effective substitution, named NormSoftmax, where the input vector is first normalized to unit variance and then fed to the standard softmax function.
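Editor's aside: the NormSoftmax description above translates almost verbatim into code: standardize the softmax input to unit variance, then apply the usual softmax. A minimal sketch; the paper may add details (e.g. learnable scales) not specified in the abstract.

```python
import torch

def norm_softmax(x: torch.Tensor, dim: int = -1, eps: float = 1e-5):
    """Normalize the softmax input to unit variance, then apply softmax."""
    x = x / (x.std(dim=dim, keepdim=True) + eps)
    return torch.softmax(x, dim=dim)

logits = 50 * torch.randn(2, 8)          # badly scaled logits
print(norm_softmax(logits).sum(dim=-1))  # each row still sums to 1
```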
Similar to other existing normalization layers in machine learning models, NormSoftmax can stabilize and accelerate the training process, and also increase the robustness of the training procedure to hyperparameter choices. Experiments on Transformer-based models and convolutional neural networks validate that our proposed NormSoftmax is an effective plug-and-play module to stabilize and speed up the optimization of neural networks with cross-entropy loss or dot-product attention operations.", "Diverse, Difficult, and Odd Instances (D2O): A New Test Set for Object Classification",https://openreview.net/forum?id=slqzKkFFlp3,https://openreview.net/pdf?id=slqzKkFFlp3,We propose a new test set for object recognition and test a variety of object recognition and tagging models on it. We show that models fail drastically on our test set.,"Test sets are an integral part of evaluating models and gauging progress in object recognition, and more broadly in computer vision and AI. Existing test sets for object recognition, however, suffer from shortcomings such as bias towards ImageNet characteristics and idiosyncrasies (e.g. ImageNet-V2), being limited to certain types of stimuli (e.g. indoor scenes in ObjectNet), and underestimating the model performance (e.g. ImageNet-A). To mitigate these problems, here we introduce a new test set, called D2O, which is sufficiently different from existing test sets. Images are diverse, unmodified, and representative of real-world scenarios and cause state-of-the-art models to misclassify them with high confidence. To emphasize generalization, our dataset by design does not come paired with a training set. It contains 8,060 images spread across 36 categories, out of which 29 appear in ImageNet. The best Top-1 accuracy on our dataset is around 60%, which is much lower than the 91% best Top-1 accuracy on ImageNet. We find that popular vision APIs perform very poorly in detecting objects over D2O categories such as “faces”, “cars”, and “cats”. Our dataset also comes with a “miscellaneous” category, over which we test the image tagging algorithms. Overall, our investigations demonstrate that the D2O test set has the right level of difficulty and is predictive of the average-case performance of models. It can challenge object recognition models for years to come and can spur more research in this fundamental area. Data and code are publicly available at [Masked].","object recognition, deep learning, model evaluation, tagging, generalization, out of distribution generalization" Differentially Private Dataset Condensation,https://openreview.net/forum?id=H8XpqEkbua_,https://openreview.net/pdf?id=H8XpqEkbua_,,"Recent work in ICML'22 builds a theoretical connection between dataset condensation (DC) and differential privacy (DP) and claims that DC can provide privacy protection for free. However, the connection is problematic because of two controversial assumptions. In this paper, we revisit the ICML'22 work and elucidate the issues in the two controversial assumptions. To correctly connect DC and DP, we propose two differentially private dataset condensation (DPDC) algorithms---LDPDC and NDPDC. Through extensive evaluations on multiple datasets, we demonstrate that LDPDC has comparable performance to recent DP generative methods despite its simplicity. NDPDC provides acceptable DP guarantees with a mild utility loss, compared to the state-of-the-art DC method.
Additionally, NDPDC allows a flexible trade-off between the synthetic data utility and the DP budget.", Variation-based Cause Effect Identification,https://openreview.net/forum?id=VJeUPUge4DL,https://openreview.net/pdf?id=VJeUPUge4DL,A framework for causal discovery in bivariate systems based on a realization of the independence-of-causal-mechanisms postulate using convex optimization,"Mining genuine mechanisms underlying the complex data generation process in real-world systems is a fundamental step in promoting interpretability of (and thus trust in) data-driven models. Therefore, we propose a variation-based cause effect \underline{i}dentification (VCEI) framework for causal discovery in bivariate systems from a single observational setting. Our framework relies on the principle of independence of cause and mechanism (ICM) under the assumption of an existing acyclic causal link, and offers a practical realization of this principle. Specifically, we artificially construct two settings in which the marginal distributions of one covariate, claimed to be the cause, are guaranteed to have non-negligible variations. This is achieved by re-weighting samples of the marginal so that the resultant distribution is notably distinct from this marginal according to some discrepancy measure. In the causal direction, such variations are expected to have no impact on the effect generation mechanism. Therefore, quantifying the impact of these variations on the conditionals reveals the genuine causal direction. Moreover, we formulate our approach in terms of the kernel-based maximum mean discrepancy, lifting all constraints on the data types of cause and effect covariates, and rendering such artificial interventions a convex optimization problem. We provide a series of experiments on real and synthetic data showing that VCEI is, in principle, competitive to other cause effect identification frameworks.","Causality, Causal Inference, Causal Discovery, Cause Effect Identification, Convex Optimization, Semi-definite Relaxation" A General Framework For Proving The Equivariant Strong Lottery Ticket Hypothesis,https://openreview.net/forum?id=vVJZtlZB9D,https://openreview.net/pdf?id=vVJZtlZB9D,"We extend the strong lottery ticket hypothesis to Equivariant Networks and show optimal pruning strategies in theory and practice for Steerable CNNs, Higher Order GNNs, and Message Passing GNNs.","The Strong Lottery Ticket Hypothesis (SLTH) stipulates the existence of a subnetwork within a sufficiently overparameterized (dense) neural network that---when initialized randomly and without any training---achieves the accuracy of a fully trained target network. Recent works by Da Cunha et al. (2022) and Burkholz (2022) demonstrate that the SLTH can be extended to translation equivariant networks---i.e. CNNs---with the same level of overparametrization as needed for the SLTs in dense networks. However, modern neural networks are capable of incorporating more than just translation symmetry, and developing architectures equivariant to general symmetries such as rotation and permutation has been a powerful design principle. In this paper, we generalize the SLTH to functions that preserve the action of the group $G$---i.e., $G$-equivariant networks---and prove, with high probability, that one can approximate any $G$-equivariant network of fixed width and depth by pruning a randomly initialized overparametrized $G$-equivariant network to a $G$-equivariant subnetwork.
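Editor's aside: the discrepancy measure named in the VCEI abstract above, kernel maximum mean discrepancy, is standard and easy to sketch; the convex re-weighting of the cause's marginal is the paper's contribution and is not reproduced here. A numpy sketch with an RBF kernel (biased estimator):

```python
import numpy as np

def rbf_mmd2(x: np.ndarray, y: np.ndarray, sigma: float = 1.0) -> float:
    """Squared kernel MMD between samples x (n, d) and y (m, d)."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

x = np.random.randn(100, 1)
y = np.random.randn(100, 1) + 1.0   # shifted sample -> clearly nonzero MMD
print(round(rbf_mmd2(x, y), 3))
```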
We further prove that our prescribed overparametrization scheme is optimal and provide a lower bound on the number of effective parameters as a function of the error tolerance. We develop our theory for a large range of groups, including subgroups of the Euclidean group $\text{E}(2)$ and of the symmetric group $G \leq \mathcal{S}_n$---allowing us to find SLTs for MLPs, CNNs, $\text{E}(2)$-steerable CNNs, and permutation equivariant networks as specific instantiations of our unified framework. Empirically, we verify our theory by pruning overparametrized $\text{E}(2)$-steerable CNNs, $k$-order GNNs, and message passing GNNs to match the performance of trained target networks.","Equivariant Networks, Strong Lottery Ticket, Weight Pruning" TimelyFL: Heterogeneity-aware Asynchronous Federated Learning with Adaptive Partial Training ,https://openreview.net/forum?id=6tPGEjCN4iI,https://openreview.net/pdf?id=6tPGEjCN4iI,An inclusive asynchronous federated learning framework with adaptive partial training.,"In cross-device Federated Learning (FL) environments, scaling synchronous FL methods is challenging as stragglers hinder the training process. Moreover, the availability of each client to join the training is highly variable over time due to system heterogeneities and intermittent connectivity. Recent asynchronous FL methods (e.g., FedBuff) have been proposed to overcome these issues by allowing slower users to continue their work on local training based on stale models and to contribute to aggregation when ready. However, we show empirically that this method can lead to a substantial drop in training accuracy as well as a slower convergence rate. The primary reason is that fast devices contribute to many more rounds of aggregation while others join more intermittently or not at all, and with stale model updates. To overcome this barrier, we propose TimelyFL, a heterogeneity-aware asynchronous FL framework with adaptive partial training. During the training, TimelyFL adjusts the local training workload based on the real-time resource capabilities of each client, aiming to allow more available clients to join in the global update without staleness. We demonstrate the performance benefits of TimelyFL by conducting extensive experiments on various datasets (e.g., CIFAR-10, Google Speech, and Reddit) and models (e.g., ResNet20, VGG11, and ALBERT). We also validate the feasibility of TimelyFL by deploying it on an Android-based mobile device testbed. In comparison with the state-of-the-art (i.e., FedBuff), our evaluations reveal that TimelyFL improves the participation rate by 21.13%, achieves a 1.28x - 2.89x faster convergence rate, and provides a 6.25% improvement in test accuracy.","Submodel Training, Federated Learning" Robust Fair Clustering: A Novel Fairness Attack and Defense Framework,https://openreview.net/forum?id=4LMIZY7gt7h,https://openreview.net/pdf?id=4LMIZY7gt7h,"We propose a highly effective & novel fairness attack against state-of-the-art fair clustering models, & for self-completeness, we propose a defense framework based on consensus clustering & graph representation learning that is robust to our attack.","Clustering algorithms are widely used in many societal resource allocation applications, such as loan approvals and candidate recruitment, and hence biased or unfair model outputs can adversely impact individuals who rely on these applications. To this end, many $\textit{fair}$ clustering approaches have been recently proposed to counteract this issue.
Due to the potential for significant harm, it is essential to ensure that fair clustering algorithms provide consistently fair outputs even under adversarial influence. However, fair clustering algorithms have not been studied from an adversarial attack perspective. In contrast to previous research, we seek to bridge this gap and conduct a robustness analysis of fair clustering by proposing a novel $\textit{black-box fairness attack}$. Through comprehensive experiments, we find that state-of-the-art models are highly susceptible to our attack as it can reduce their fairness performance significantly. Finally, we propose Consensus Fair Clustering (CFC), the first $\textit{robust fair clustering}$ approach that transforms consensus clustering into a fair graph partitioning problem, and iteratively learns to generate fair cluster outputs. Experimentally, we observe that CFC is highly robust to the proposed attack and is thus a truly robust fair clustering alternative.","Data Clustering, Fairness Attack, Fairness Defense, Consensus Clustering" Learning to Jointly Share and Prune Weights for Grounding Based Vision and Language Models,https://openreview.net/forum?id=UMERaIHMwB3,https://openreview.net/pdf?id=UMERaIHMwB3,,"Transformers have seen growing interest in processing different modalities, including language and image data. As a result, we can process vision and language data using transformers that are architecturally similar. Leveraging this feature of transformers, we propose weight sharing across two transformer backbones and within the same backbone, together with pruning across two backbones, in a unified framework. More specifically, we investigate weight sharing and pruning for two components of the transformers: (1) Multi-Head Attention (MSA) and (2) Feed-Forward Network (FFN) layers. To jointly perform weight sharing and pruning, we propose to use a regularization term to align model weights and the desired structure during the multimodal pre-training step. The structure vectors of sharing and pruning are generated by using a hypernetwork, which can capture complex interactions between pruning and sharing across layers and modalities. We train the hypernetwork and model weights iteratively so that the learned structure evolves along with model weights. After minimizing the proposed objective in the pre-training step, we perform weight sharing and pruning and fine-tune the compressed model on downstream tasks. Finally, we perform experiments on vision and language tasks, including Referring Expression Comprehension (REC), Visual Question Answering (VQA), and Object Detection using the state-of-the-art grounding based models: MDETR and GLIP. Our experiments show that we can compress these models by $35-40\%$ by sharing and pruning MSA and FFN weights with almost no loss in accuracy.", Coupling Semi-supervised Learning with Reinforcement Learning for Better Decision Making -- An application to Cryo-EM Data Collection,https://openreview.net/forum?id=OhdF1l90VoC,https://openreview.net/pdf?id=OhdF1l90VoC,We proposed an iterative semi-supervised learning framework for dual-learning of RL and the perception model with applications to Cryo-EM.,"We consider a semi-supervised Reinforcement Learning (RL) approach that takes inputs from a perception model. Performance of such an approach can be significantly limited by the quality of the perception model in the low labeled data regime.
We propose a novel iterative framework that simultaneously couples and improves the training of both RL and the perception model. The perception model takes pseudo labels generated from the trajectories of a trained RL agent, on the premise that the decision model can correct errors made by the perception model. We apply the framework to cryo-electron microscopy (cryo-EM) data collection, whose goal is to find as many high-quality micrographs as possible by navigating at different magnification levels. Our proposed method significantly outperforms various baseline methods in terms of both RL rewards and the accuracy of the perception model. We further provide some theoretical insights into the benefits of coupling the decision model and the perception model by showing that RL-generated pseudo labels are biased towards localization, which aligns with the underlying data-generating mechanism. Our iterative framework that couples both sides of the semi-supervised RL can be applied to a wide range of sequential decision-making tasks when the labeled data is limited.","Reinforcement Learning, Semi-supervised Learning, Cryo-EM" Spatial Attention Kinetic Networks with E(n)-Equivariance,https://openreview.net/forum?id=3DIpIf3wQMC,https://openreview.net/pdf?id=3DIpIf3wQMC,"An equivariant functional form, termed spatial attention, uses neurally parametrized linear combinations of edge vectors to equivariantly yet universally describe node environments","Neural networks that are equivariant to rotations, translations, reflections, and permutations on $n$-dimensional geometric space have shown promise in physical modeling, for tasks ranging from accurately but inexpensively modeling complex potential energy surfaces, to guiding the sampling of complex dynamical systems, to forecasting their time evolution. Current state-of-the-art methods employ spherical harmonics to encode higher-order interactions among particles, which are computationally expensive. In this paper, we propose a simple alternative functional form that uses neurally parametrized linear combinations of edge vectors to achieve equivariance while still universally approximating node environments. Incorporating this insight, we design \emph{spatial attention kinetic networks} with E(n)-equivariance, or SAKE, which are competitive in many-body system modeling tasks while being significantly faster.", Training Recipe for N:M Structured Sparsity with Decaying Pruning Mask,https://openreview.net/forum?id=bMXueK316u,https://openreview.net/pdf?id=bMXueK316u,,"Sparsity has become one of the promising methods to compress and accelerate Deep Neural Networks (DNNs). Among different categories of sparsity, structured sparsity has gained more attention due to its efficient execution on modern accelerators. Particularly, N:M sparsity is attractive because there are already hardware accelerator architectures that can leverage certain forms of N:M structured sparsity to yield higher compute-efficiency. In this work, we focus on N:M sparsity and extensively study and evaluate various training recipes for N:M sparsity in terms of the trade-off between model accuracy and compute cost (FLOPs). Building upon this study, we propose two new decay-based pruning methods, namely “pruning mask decay” and “sparse structure decay”. Our evaluations indicate that these proposed methods consistently deliver state-of-the-art (SOTA) model accuracy, comparable to unstructured sparsity, on a Transformer-based model for a translation task.
The increase in the accuracy of the sparse model using the new training recipes comes at the cost of a marginal increase in the total training compute (FLOPs).","sparsity, structured sparsity, pruning, dnn, transformer" Light-weight probing of unsupervised representations for Reinforcement Learning,https://openreview.net/forum?id=dfPuLye6RvY,https://openreview.net/pdf?id=dfPuLye6RvY,"Our paper proposes linear reward probing as an efficient method to evaluate the quality of pretrained representations in the RL setting, and demonstrates its positive correlation with downstream RL performance.","Unsupervised visual representation learning offers the opportunity to leverage large corpora of unlabeled trajectories to form useful visual representations, which can benefit the training of reinforcement learning (RL) algorithms. However, evaluating the fitness of such representations requires training RL algorithms, which is both computationally intensive and has high-variance outcomes. To alleviate this issue, we design an evaluation protocol for unsupervised RL representations with lower variance and up to 600x lower computational cost. Inspired by the vision community, we propose two linear probing tasks: predicting the reward observed in a given state, and predicting the action of an expert in a given state. These two tasks are generally applicable to many RL domains, and we show through rigorous experimentation that they correlate strongly with the actual downstream control performance on the Atari100k Benchmark. This provides a better method for exploring the space of pretraining algorithms without the need to run RL evaluations for every setting. Leveraging this framework, we further improve existing self-supervised learning (SSL) recipes for RL, highlighting the importance of the forward model, the size of the visual backbone, and the precise formulation of the unsupervised objective.","machine learning, unsupervised learning, reinforcement learning, computer vision" Understanding Rare Spurious Correlations in Neural Networks,https://openreview.net/forum?id=lrzX-rNuRvw,https://openreview.net/pdf?id=lrzX-rNuRvw,Neural networks can learn spurious correlations caused by a small number of training examples. We empirically and theoretically study this phenomenon.,"Neural networks are known to use spurious correlations such as background information for classification. While prior work has looked at spurious correlations that are widespread in the training data, in this work, we investigate how sensitive neural networks are to $rare$ spurious correlations, which may be harder to detect and correct, and may lead to privacy leaks. We introduce spurious patterns correlated with a fixed class to a few training examples and find that it takes only a handful of such examples for the network to learn the correlation. Furthermore, these rare spurious correlations also impact accuracy and privacy. We empirically and theoretically analyze different factors involved in rare spurious correlations and propose mitigation methods accordingly. Specifically, we observe that $\ell_2$ regularization and adding Gaussian noise to inputs can reduce the undesirable effects.","spurious correlation, trustworthy machine learning" Graph Domain Adaptation via Theory-Grounded Spectral Regularization,https://openreview.net/forum?id=OysfLgrk8mk,https://openreview.net/pdf?id=OysfLgrk8mk,,"Transfer learning on graphs drawn from varied distributions (domains) is in great demand across many applications.
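Editor's aside: for the N:M sparsity entry above, the mask itself is simple: keep the n largest-magnitude weights in every group of m consecutive weights. The sketch below shows the common 2:4 pattern plus one plausible reading of "pruning mask decay" (gradually annealing pruned weights to zero); the authors' actual schedules may differ.

```python
import torch

def nm_mask(w: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Binary N:M mask: keep the n largest-magnitude weights per group of m."""
    flat = w.reshape(-1, m)
    idx = flat.abs().topk(n, dim=1).indices
    mask = torch.zeros_like(flat)
    mask.scatter_(1, idx, 1.0)
    return mask.reshape(w.shape)

def decayed_prune(w, step, total_steps, n=2, m=4):
    """Instead of zeroing pruned weights at once, scale them by a factor
    that decays from 1 to 0 over training (a hypothetical schedule)."""
    mask = nm_mask(w, n, m)
    keep_frac = max(0.0, 1.0 - step / total_steps)
    return w * (mask + (1 - mask) * keep_frac)

w = torch.randn(4, 8)
print(nm_mask(w).reshape(-1, 4).sum(dim=1))  # every group keeps exactly 2
```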
Emerging methods attempt to learn domain-invariant representations using graph neural networks (GNNs), yet the empirical performance varies and the theoretical foundation is limited. This paper targets the design of theory-grounded algorithms for graph domain adaptation (GDA). (i) As the first attempt, we derive a model-based GDA bound closely related to two GNN spectral properties: spectral smoothness (SS) and maximum frequency response (MFR). This is achieved by cross-pollinating between optimal transport (OT) based DA and graph filter theories. (ii) Inspired by the theoretical results, we propose algorithms regularizing spectral properties of SS and MFR to improve GNN transferability. We further extend the GDA theory into the more challenging conditional-shift scenario, where spectral regularization still applies. (iii) More importantly, our analyses of the theory reveal which regularization would improve performance in which transfer learning scenario, (iv) with numerical agreement from extensive real-world experiments: SS and MFR regularizations bring more benefits to the scenarios of node transfer and link transfer, respectively. In a nutshell, our study paves the way toward explicitly constructing and training GNNs that can capture more transferable representations across graph domains. Codes will be fully released upon acceptance.", Effective dimension of machine learning models,https://openreview.net/forum?id=TgcG85ZvBuu,https://openreview.net/pdf?id=TgcG85ZvBuu,"We introduce a capacity measure called the local effective dimension, which we show has desirable properties and the ability to bound generalization error.","Making statements about the performance of trained models on tasks involving new data is one of the primary goals of machine learning, i.e., to understand the generalization power of a model. Various capacity measures try to capture this ability, but usually fall short in explaining important characteristics of models that we observe in practice. In this study, we propose the local effective dimension as a capacity measure which seems to correlate well with generalization error on standard data sets. Importantly, we prove that the local effective dimension bounds the generalization error and discuss the aptness of this capacity measure for machine learning models.","Generalization, capacity, effective dimension" CLARE: Conservative Model-Based Reward Learning for Offline Inverse Reinforcement Learning,https://openreview.net/forum?id=5aT4ganOd98,https://openreview.net/pdf?id=5aT4ganOd98,This paper introduces a principled algorithm to approach the reward extrapolation error in offline inverse reinforcement learning.,"This work aims to tackle a major challenge in offline Inverse Reinforcement Learning (IRL), namely the reward extrapolation error, where the learned reward function may fail to explain the task correctly and misguide the agent in unseen environments due to the intrinsic covariate shift. Leveraging both expert data and lower-quality diverse data, we devise a principled algorithm (namely CLARE) that solves offline IRL efficiently via integrating ""conservatism"" into a learned reward function and utilizing an estimated dynamics model.
Our theoretical analysis provides an upper bound on the return gap between the learned policy and the expert policy, based on which we characterize the impact of covariate shift by examining the subtle two-tier tradeoffs between exploitation (on both expert and diverse data) and exploration (on the estimated dynamics model). We show that CLARE can provably alleviate the reward extrapolation error by striking the right exploitation-exploration balance therein. Extensive experiments corroborate the significant performance gains of CLARE over existing state-of-the-art algorithms on MuJoCo continuous control tasks (especially with a small offline dataset), and the learned reward is highly instructive for further learning. ","offline inverse reinforcement learning, inverse reinforcement learning, offline reinforcement learning" Data-Free One-Shot Federated Learning Under Very High Statistical Heterogeneity,https://openreview.net/forum?id=_hb4vM3jspB,https://openreview.net/pdf?id=_hb4vM3jspB,We vastly improve on one-shot federated learning performance under very high statistical heterogeneity by reframing the local learning task with a conditional variational autoencoder.,"Federated learning (FL) is an emerging distributed learning framework that collaboratively trains a shared model without transferring the local clients' data to a centralized server. Motivated by concerns stemming from extended communication and potential attacks, one-shot FL limits communication to a single round while attempting to retain performance. However, one-shot FL methods often degrade under high statistical heterogeneity, fail to promote pipeline security, or require an auxiliary public dataset. To address these limitations, we propose two novel data-free one-shot FL methods: FedCVAE-Ens and its extension FedCVAE-KD. Both approaches reframe the local learning task using a conditional variational autoencoder (CVAE) to address high statistical heterogeneity. Furthermore, FedCVAE-KD leverages knowledge distillation to compress the ensemble of client decoders into a single decoder. We propose a method that shifts the center of the CVAE prior distribution and experimentally demonstrate that this promotes security, and show how either method can incorporate heterogeneous local models. We confirm the efficacy of the proposed methods over baselines under high statistical heterogeneity using multiple benchmark datasets. In particular, at the highest levels of statistical heterogeneity, both FedCVAE-Ens and FedCVAE-KD typically more than double the accuracy of the baselines.","One-Shot Federated Learning, Statistical Heterogeneity, Model Heterogeneity, Variational Autoencoder" Personalized Subgraph Federated Learning,https://openreview.net/forum?id=ToYi8C6fetv,https://openreview.net/pdf?id=ToYi8C6fetv,"A novel personalized subgraph federated learning framework aiming at the joint improvement of interrelated local models trained on interconnected local subgraphs, for instance, subgraphs belonging to the same community.","In real-world scenarios, subgraphs of a larger global graph may be distributed across multiple devices or institutions, and only locally accessible due to privacy restrictions, although there may be links between them. Recently proposed subgraph Federated Learning (FL) methods deal with those missing links across private local subgraphs while distributively training Graph Neural Networks (GNNs) on them. 
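A minimal sketch of the similarity-weighted aggregation this abstract describes next: each client model is probed with the same shared input to produce a "functional embedding", and server-side averaging for each client is weighted by pairwise embedding similarity. Details (random *graph* probes, sparse masks, temperature) follow the paper; the code below is a generic stand-in.

```python
# Hedged sketch of functional-embedding-based personalized averaging.
import torch
import torch.nn.functional as F

def functional_embedding(model, probe):
    with torch.no_grad():
        return model(probe).flatten()

def personalized_average(models, probe, temp=0.1):
    embs = torch.stack([functional_embedding(m, probe) for m in models])
    sims = F.softmax(
        F.cosine_similarity(embs[:, None], embs[None, :], dim=-1) / temp, dim=-1)
    states = [m.state_dict() for m in models]
    # One personalized parameter set per client, weighted by similarity.
    return [{k: sum(sims[i, j] * states[j][k] for j in range(len(models)))
             for k in states[i]} for i in range(len(models))]

models = [torch.nn.Linear(4, 2) for _ in range(3)]
params_per_client = personalized_average(models, torch.randn(5, 4))
```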
However, they overlook the inevitable heterogeneity among subgraphs, which arises when subgraphs comprise different communities of a global graph, and consequently collapse incompatible knowledge from local GNN models trained on heterogeneous graph distributions. To overcome such a limitation, we introduce a new subgraph FL problem, personalized subgraph FL, which focuses on the joint improvement of the interrelated local GNN models rather than learning a single global GNN model, and propose a novel framework, FEDerated Personalized sUBgraph learning (FED-PUB), to tackle it. A crucial challenge in personalized subgraph FL is that the server does not know which subgraph each client has. FED-PUB thus utilizes functional embeddings of the local GNNs using random graphs as inputs to compute similarities between them, and uses them to perform weighted averaging for server-side aggregation. Further, it learns a personalized sparse mask at each client to select and update only the subgraph-relevant subset of the aggregated parameters. We validate FED-PUB for its subgraph FL performance on six datasets, considering both non-overlapping and overlapping subgraphs, on which our method largely outperforms relevant baselines.","Graph Representation Learning, Graph Neural Networks, Federated Learning, Subgraph Federated Learning" Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization,https://openreview.net/forum?id=4XE614GBuGR,https://openreview.net/pdf?id=4XE614GBuGR,"We show that features learned via ERM may be ""good enough"" for generalization, and that the main difficulty is robust classification. We give a new model of distribution shift and an algorithm which is minimax-optimal and meets/exceeds SOTA on several benchmarks.","A common explanation for the failure of deep networks to generalize out-of-distribution is that they fail to recover the ""correct"" features. We challenge this notion with a simple experiment which suggests that ERM already learns sufficient features and that the current bottleneck is not feature learning, but robust regression. We therefore argue that devising simpler methods for learning predictors on existing features is a promising direction for future research. Towards this end, we introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift. Rather than learning one function, DARE performs a domain-specific adjustment to unify the domains in a canonical latent space and learns to predict in this space. Under a natural model, we prove that the DARE solution is the minimax-optimal predictor for a constrained set of test distributions. Further, we provide the first finite-environment convergence guarantee to the minimax risk, improving over existing analyses which only yield minimax predictors after an environment threshold. 
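One simple reading of the "domain-specific adjustment" idea, sketched under strong assumptions: whiten each training domain's features with its own covariance so that domains share a canonical space, then fit a single linear predictor there. DARE's precise adjustment and its minimax guarantees are in the paper; this is only the intuition.

```python
# Hedged sketch: per-domain whitening before fitting one shared predictor.
import numpy as np

def whiten(X, eps=1e-3):
    mu = X.mean(0)
    cov = np.cov(X - mu, rowvar=False) + eps * np.eye(X.shape[1])
    evals, evecs = np.linalg.eigh(cov)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T   # inverse square root
    return (X - mu) @ W

domains = [np.random.randn(100, 5) * s for s in (1.0, 3.0)]  # toy features
adjusted = np.vstack([whiten(X) for X in domains])
# ...fit one linear predictor (e.g., logistic regression) on `adjusted`...
```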
Evaluated on finetuned features, we find that DARE compares favorably to prior methods, consistently achieving equal or better performance.","domain generalization, domain generalization theory, out-of-distribution generalization, representation learning" Initial Value Problem Enhanced Sampling for Closed-Loop Optimal Control Design with Deep Neural Networks,https://openreview.net/forum?id=oXM5kdnAUNq,https://openreview.net/pdf?id=oXM5kdnAUNq,A new adaptive sampling method to improve the performance of the closed-loop controller learned by neural networks,"Closed-loop optimal control design for high-dimensional nonlinear systems has been a long-standing challenge. Traditional methods, such as solving the associated Hamilton-Jacobi-Bellman equation, suffer from the curse of dimensionality. Recent literature proposed a new promising approach based on supervised learning, by leveraging powerful open-loop optimal control solvers to generate training data and neural networks as efficient high-dimensional function approximators to fit the closed-loop optimal control. This approach successfully handles certain high-dimensional optimal control problems but still performs poorly on more challenging problems. One of the crucial reasons for the failure is the so-called distribution mismatch phenomenon brought by the controlled dynamics. In this paper, we investigate this phenomenon and propose the initial value problem enhanced sampling method to mitigate this problem. We theoretically prove that this sampling strategy improves over the vanilla strategy on the classical linear-quadratic regulator by a factor proportional to the total time duration. We further numerically demonstrate that the proposed sampling strategy significantly improves the performance on tested control problems, including the optimal landing problem of a quadrotor and the optimal reaching problem of a 7 DoF manipulator.","Optimal Control, Deep Learning, Adaptive Sampling, Distribution Mismatch" Human Pose Estimation in the Dark,https://openreview.net/forum?id=MEdZ-7BOsKM,https://openreview.net/pdf?id=MEdZ-7BOsKM,"We tackle, for the first time, human pose estimation under extremely low-light conditions, and introduce a new training strategy and new datasets for this challenging task.","We study human pose estimation in extremely low-light images. This task is challenging due to the difficulty of collecting real low-light images with accurate labels, and severely corrupted inputs that degrade prediction quality significantly. To address the first issue, we develop a dedicated camera system and build a new dataset of real low-light images with accurate pose labels. Thanks to our camera system, each low-light image in our dataset is coupled with a near-perfectly aligned well-lit image, which enables accurate pose labeling and is used as privileged information during training. We also propose a new model that fully exploits the privileged information to learn representations insensitive to lighting conditions. 
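A hedged sketch of one standard way to exploit such privileged well-lit images: train the low-light branch with the usual pose loss plus a feature-alignment loss against features extracted from the aligned well-lit image. The module names and loss weight below are placeholders, not the paper's architecture.

```python
# Illustrative learning-using-privileged-information step (assumed design).
import torch.nn.functional as F
import torch

def privileged_step(backbone, pose_head, teacher, low_light, well_lit, target_heatmaps):
    f_dark = backbone(low_light)
    with torch.no_grad():
        f_lit = teacher(well_lit)              # privileged, train-time only
    pose_loss = F.mse_loss(pose_head(f_dark), target_heatmaps)
    align_loss = F.mse_loss(f_dark, f_lit)     # push features to be lighting-insensitive
    return pose_loss + 0.1 * align_loss
```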
Our model demonstrates outstanding performance on real extremely low-light images, and extensive analyses validate that both our dataset and model contribute to the success.","Low-light image understanding, Robustness, Learning using privileged information, Human pose estimation" Tackling the Retrieval Trilemma with Cross-Modal Indexing,https://openreview.net/forum?id=o5mLawmN232,https://openreview.net/pdf?id=o5mLawmN232,"We propose a novel paradigm Cross-Modal Indexing that directly maps the query into identifiers of relevant candidates to achieve high accuracy, fast speed, and low storage simultaneously.","Current cross-modal retrieval methods still struggle with the retrieval trilemma to simultaneously satisfy three key requirements, including high accuracy, fast speed, and low storage. For example, the cross-modal embedding methods usually suffer from either slow query speed caused by the time-consuming modality interaction or the tremendous memory cost of dense vector storage, while the cross-modal hashing methods are typically unsatisfactory in accuracy due to the lossy discrete quantization used for vector compression. In this paper, we tackle the retrieval trilemma with a new paradigm named Cross-Modal Indexing (CMI) that directly maps queries into identifiers of the final retrieved candidates. Specifically, we first pre-define sequential identifiers (SIDs) for all candidates in a hierarchical tree that maintains the data's semantic structure. Then we train an encoder-decoder network that maps queries into SIDs under the supervision of the constructed SIDs. Finally, we directly sample SIDs of relevant candidates for queries with O(1) time complexity. By evading the unfavorable modality interaction, dense vector storage, and vector compression, the proposed CMI reaches a satisfactory balance in the retrieval trilemma. For example, experiments demonstrate that CMI achieves comparable accuracy with about 1000x storage reduction and 120x speedup compared to the state-of-the-art methods on several popular image-text retrieval benchmarks.","cross-modal retrieval, retrieval trilemma, cross-modal indexing" C3PO: Learning to Achieve Arbitrary Goals via Massively Entropic Pretraining,https://openreview.net/forum?id=kx8x43_1ftI,https://openreview.net/pdf?id=kx8x43_1ftI,Exploration approximating a uniform sampling over possible states to train a policy that can achieve any pose and position.,"Given a particular embodiment, we propose a novel method (C3PO) that learns policies able to achieve any arbitrary position and pose. Such a policy would allow for easier control, and would be re-usable as a key building block for downstream tasks. The method is two-fold: First, we introduce a novel exploration algorithm that optimizes for uniform coverage and discovers a set of achievable states, and we investigate its ability to attain both high coverage and hard-to-discover states; Second, we leverage this set of achievable states as training data for a universal goal-achievement policy, a goal-based SAC variant. We demonstrate the trained policy's performance in achieving a large number of novel states. 
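To make the uniform-coverage idea concrete: a common particle estimator of state entropy rewards a state by its distance to its k-th nearest previously achieved state, which drives visitation toward under-covered regions. This is a generic sketch of that family of bonuses, not necessarily C3PO's exact exploration objective.

```python
# Hedged sketch: k-NN novelty bonus approximating entropic coverage.
import numpy as np

def knn_novelty(state, achieved, k=5):
    if len(achieved) < k:
        return 1.0
    d = np.linalg.norm(np.asarray(achieved) - state, axis=1)
    return float(np.sort(d)[k - 1])     # distance to k-th nearest neighbour

achieved, bonus_log = [], []
for _ in range(200):
    s = np.random.randn(3)              # stand-in for an environment step
    bonus_log.append(knn_novelty(s, achieved))
    achieved.append(s)                  # buffer later supplies training goals
```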
Finally, we showcase the influence of massive unsupervised training of a goal-achievement policy with state-of-the-art pose-based control of the Hopper, Walker, Halfcheetah, Humanoid and Ant embodiments.","Reinforcement Learning, Exploration, Goal-conditioned Policy, Continuous Control" ProtoVAE: Using Prototypical Networks for Unsupervised Disentanglement,https://openreview.net/forum?id=4IaQ99pSbg5,https://openreview.net/pdf?id=4IaQ99pSbg5,Unsupervised Disentangled representation learning using Isometric inductive biases,"Generative modeling and self-supervised learning have in recent years made great strides towards learning from data in a completely \emph{unsupervised} way. There is still, however, an open area of investigation into guiding the neural network to learn useful or good representations. The problem of unsupervised \textit{Disentanglement} is of particular importance as it offers to learn interpretable representations, with disjoint subsets of the representation encoding different, meaningful factors of variation. Recent work has theoretically grounded the factors of variation, via the lens of group theory, as disentangled actions of the symmetry subgroups which transform only the corresponding subspaces of the disentangled representation. We use this mathematical formalism instead to impose constraints on the representations learned by an unsupervised generative neural network, such that transformations of the representation correspond to the actions of a unique symmetry subgroup. To this end, we introduce a novel model, ProtoVAE, that leverages a deep metric learning Prototypical network trained via self-supervision to constrain the latent space of a Variational Autoencoder to decompose into independent subspaces. Further, we actively change or \textit{intervene} in the latent space during training to enforce each dimension of the representation to uniquely and consistently transform the data corresponding to some symmetry subgroup. We demonstrate and evaluate our proposed model on the benchmark DSprites and 3DShapes datasets and compare with other state-of-the-art disentanglement methods via qualitative traversals in the latent space, as well as quantitative disentanglement metrics. We further qualitatively demonstrate the effectiveness of our model on the real-world dataset CelebA, where it consistently encodes the different factors.","Unsupervised Learning, Disentangled Representations" Neural Diffusion Processes,https://openreview.net/forum?id=09I1M8YRJBR,https://openreview.net/pdf?id=09I1M8YRJBR,Diffusion models for stochastic processes,"Gaussian processes provide an elegant framework for specifying prior and posterior distributions over functions. They are, however, also computationally expensive, and limited by the expressivity of their covariance function. We propose Neural Diffusion Processes (NDPs), a novel approach based upon diffusion models that learns to sample from distributions over functions. Using a novel attention block we are able to incorporate properties of stochastic processes, such as exchangeability, directly into the NDP's architecture. We empirically show that NDPs are able to capture functional distributions that are close to the true Bayesian posterior. 
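The backbone of this idea, stripped of the paper's exchangeability-preserving attention block, is a standard denoising-diffusion training step applied to function values y at inputs x. `eps_model` is a hypothetical noise-prediction network conditioned on the inputs.

```python
# Generic DDPM-style training step over function evaluations (assumed shapes:
# x is (batch, n_points, dx), y is (batch, n_points, dy)).
import torch

def ndp_train_step(eps_model, x, y, alphas_bar):
    t = torch.randint(0, len(alphas_bar), (y.shape[0],))
    a = alphas_bar[t].view(-1, 1, 1)
    eps = torch.randn_like(y)
    y_noisy = a.sqrt() * y + (1 - a).sqrt() * eps   # forward noising process
    pred = eps_model(x, y_noisy, t)                 # predict the added noise
    return torch.mean((pred - eps) ** 2)
```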
This enables a variety of downstream tasks, including hyperparameter marginalisation, non-Gaussian posteriors and global optimisation.","diffusion models, gaussian processes, neural processes, stochastic processes" Global Context Vision Transformers,https://openreview.net/forum?id=HZJje06x6IO,https://openreview.net/pdf?id=HZJje06x6IO,We introduce general computer vision backbone to effectively learn both short and long-range spatial information.,"We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision tasks. At the core of the novel model are global context self-attention modules, used jointly with standard local self-attention, to effectively yet efficiently model both long and short-range spatial interactions, as an alternative to complex operations such as attention masks or local window shifting. While the local self-attention modules are responsible for modeling short-range information, the global query tokens are shared across all global self-attention modules to interact with local keys and values. In addition, we address the lack of inductive bias in ViTs and improve the modeling of inter-channel dependencies by proposing a novel downsampler which leverages a parameter-efficient fused inverted residual block. The proposed GC ViT achieves new state-of-the-art performance across image classification, object detection and semantic segmentation tasks. On the ImageNet-1K dataset for classification, the tiny, small and base variants of GC ViT with 28M, 51M and 90M parameters achieve 83.4%, 83.9% and 84.4% Top-1 accuracy, respectively, surpassing comparably-sized prior art such as CNN-based ConvNeXt and ViT-based Swin Transformer. Pre-trained GC ViT backbones in downstream tasks of object detection, instance segmentation, and semantic segmentation on MS COCO and ADE20K datasets outperform prior work consistently, sometimes by large margins.","Vision Transformers, Classification, Detection, Instance Segmentation, Semantic Segmentation" Adversarial Learned Fair Representations using Dampening and Stacking,https://openreview.net/forum?id=-z911HH4RFv,https://openreview.net/pdf?id=-z911HH4RFv,,"As more decisions in our daily life become automated, the need to have machine learning algorithms that make fair decisions increases. In fair representation learning we are tasked with finding a suitable representation of the data in which a sensitive variable is censored. Recent work aims to learn fair representations through adversarial learning. This paper builds upon this work by introducing a novel algorithm which uses dampening and stacking to learn adversarial fair representations. Results show that our algorithm improves upon earlier work in both censoring and reconstruction.","Machine Learning, Deep Learning, Fairness, Adversarial Learning, Fair Representation Learning" "Watch What You Pretrain For: Targeted, Transferable Adversarial Examples on Self-Supervised Speech Recognition models",https://openreview.net/forum?id=MR8pqi9R7xP,https://openreview.net/pdf?id=MR8pqi9R7xP,We show that recent Self-supervised ASR models are uniquely vulnerable to adversarial attacks requiring no model access,"A targeted adversarial attack produces audio samples that can force an Automatic Speech Recognition (ASR) system to output attacker-chosen text. To exploit ASR models in real-world, black-box settings, an adversary can leverage the transferability property, i.e. 
that an adversarial sample produced for a proxy ASR can also fool a different remote ASR. However, recent work has shown that transferability against large ASR models is very difficult. In this work, we show that modern ASR architectures, specifically ones based on Self-Supervised Learning, are in fact vulnerable to transferable attacks. We successfully demonstrate this phenomenon by evaluating state-of-the-art self-supervised ASR models like Wav2Vec2, HuBERT, Data2Vec and WavLM. We show that with low-level additive noise achieving a 30 dB Signal-to-Noise Ratio, we can achieve targeted transferability with up to 80\% accuracy. Next, we 1) use an ablation study to show that Self-Supervised learning is the main cause of that phenomenon, and 2) provide an explanation for this phenomenon. Through this, we show that modern ASR architectures are uniquely vulnerable to adversarial security threats.","Speech recognition, adversarial attacks, self-supervised learning" Imposing conservation properties in deep dynamics modeling via contrastive learning,https://openreview.net/forum?id=gVSJ83n47IT,https://openreview.net/pdf?id=gVSJ83n47IT,We learn a dynamical system's conservation property through contrastive learning and impose it during simulation to improve prediction robustness.,"Deep neural networks (DNNs) have shown great capacity for modeling dynamical systems, but these DNN-based dynamical models usually do not obey conservation laws. To endow the learned DNN dynamical models with key physical properties such as conservation laws, this paper proposes a two-step approach to embed the invariant priors into the simulations. We first establish a contrastive learning framework to capture the system invariants along the trajectory observations. During the dynamics modeling, we design a projection layer of DNNs to preserve the system invariance. Through experiments, we show our method consistently outperforms the baseline in both coordinate error and conservation metrics and can be further extended to complex and large dynamics by leveraging an autoencoder. Notably, a byproduct of our framework is the automated conservation law discovery for dynamical systems with a single conservation property. ","dynamical system modeling, contrastive learning, learning conservation property" Language Models Can See: Plugging Visual Controls in Text Generation,https://openreview.net/forum?id=UfMrMbSTmy,https://openreview.net/pdf?id=UfMrMbSTmy,"We present a novel plug-and-play decoding scheme, MAGIC Search, that enables a pre-trained language model to tackle multimodal generation tasks in a zero-shot manner.","Generative language models (LMs) such as GPT-2/3 can be prompted to generate text with remarkable quality. While they are designed for text-prompted generation, it remains an open question how the generation process could be guided by modalities beyond text such as images. In this work, we propose a training-free framework, called MAGIC (iMAge-Guided text generatIon with CLIP), for plugging in visual controls in the generation process and enabling LMs to perform multimodal tasks (e.g., image captioning) in a zero-shot manner. MAGIC is a simple yet efficient plug-and-play framework, which directly combines an off-the-shelf LM (i.e., GPT-2) and an image-text matching model (i.e., CLIP) for image-grounded text generation. 
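To make the CLIP-guided decoding concrete before it is described: each candidate next token can be scored by its LM log-probability plus an image-text similarity term. The interface below is a simplified stand-in, not the MAGIC implementation (which, per the abstract, adds a coherence-regularized score over top candidates).

```python
# Hedged sketch of combining an LM score with a CLIP-style visual score.
import torch

def magic_like_step(lm_logprobs, clip_scores, alpha=0.1):
    # lm_logprobs: (V,) log p(token | context); clip_scores: (V,) similarity
    # of the image with the context extended by each candidate token.
    return torch.argmax(lm_logprobs + alpha * clip_scores)

next_tok = magic_like_step(torch.log_softmax(torch.randn(50), -1),
                           torch.randn(50))
```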
During decoding, MAGIC influences the generation of the LM by introducing a CLIP-induced score, called magic score, which regularizes the generated result to be semantically related to a given image while being coherent with the previously generated context. Notably, the proposed decoding scheme does not involve any gradient update operation and is therefore computationally efficient. On the challenging task of zero-shot image captioning, MAGIC outperforms the state-of-the-art method by notable margins with a nearly 27 times decoding speedup. MAGIC is a flexible framework and is theoretically compatible with any text generation task that incorporates image grounding. In the experiments, we showcase that it is also capable of performing visually grounded story generation given both an image and a text prompt.","CLIP, GPT-2, Plug-and-Play, Zero-Shot, Image Captioning, Visually Grounded Story Generation" GReTo: Remedying dynamic graph topology-task discordance via target homophily,https://openreview.net/forum?id=8duT3mi_5n,https://openreview.net/pdf?id=8duT3mi_5n,"This paper revisits node-wise relation modeling to facilitate regression on dynamic graphs, from a new perspective of target homophily. ","Dynamic graphs are ubiquitous across disciplines where observations usually change over time. Regressions on dynamic graphs often contribute to diverse critical tasks, such as climate early-warning and traffic controlling. Existing homophily Graph Neural Networks (GNNs) adopt physical connections or feature similarity as the adjacency matrix to perform node-level aggregations. However, on dynamic graphs with diverse node-wise relations, exploiting a pre-defined fixed topology for message passing inevitably leads to the aggregation of target-deviated neighbors. We designate such a phenomenon as the topology-task discordance, which naturally challenges the homophily assumption. In this work, we revisit node-wise relationships and explore novel homophily measurements on dynamic graphs with both signs and distances, capturing multiple node-level spatial relations and temporal evolutions. We discover that advancing homophily aggregations to signed target-oriented message passing can effectively resolve the abovementioned discordance and promote the aggregation capacity. Therefore, a novel GReTo is proposed, which performs signed message passing in the immediate neighborhood, and exploits both local environments and target awareness to realize high-order message propagation. Empirically, our solution achieves significant improvements against best baselines, notably improving by 24.79% on KnowAir and 3.60% on Metr-LA.","Dynamic graph, graph homophily theory, Graph Neural Network, topology-task discordance" Towards predicting dynamic stability of power grids with Graph Neural Networks,https://openreview.net/forum?id=v-3dUexkNn,https://openreview.net/pdf?id=v-3dUexkNn,Predicting the dynamic stability of future power grids with large shares of renewable energies to mitigate climate change by using Graph Neural Networks.,"To mitigate climate change, the share of renewable energies in power production needs to be increased. Renewables introduce new challenges to power grids regarding the dynamic stability due to decentralization, reduced inertia and volatility in production. However, dynamic stability simulations are intractable and exceedingly expensive for large grids. Graph Neural Networks (GNNs) are a promising method to reduce the computational effort of analyzing dynamic stability of power grids. 
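A minimal sketch of the setup this abstract continues below: a small message-passing network that regresses a per-node stability score from topology alone (here, node degree is the only feature). The paper's models are far larger; this only illustrates the input/output structure.

```python
# Hedged sketch: two rounds of neighbour aggregation via dense adjacency.
import torch

class TinyGNN(torch.nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.lin1 = torch.nn.Linear(1, hidden)
        self.lin2 = torch.nn.Linear(hidden, 1)

    def forward(self, A, x):
        h = torch.relu(self.lin1(A @ x))        # aggregate neighbours, embed
        return self.lin2(A @ h).squeeze(-1)     # per-node stability estimate

A = (torch.rand(10, 10) < 0.3).float()
A = torch.triu(A, 1); A = A + A.T               # random undirected topology
x = A.sum(1, keepdim=True)                      # degree as the only feature
scores = TinyGNN()(A, x)
```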
We provide new datasets of dynamic stability of synthetic power grids and find that GNNs are surprisingly effective at predicting highly non-linear targets from topological information only. We show that large GNNs outperform GNNs from previous work as well as handcrafted graph features and semi-analytic approximations. Further, we demonstrate that GNNs can accurately identify \emph{trouble maker}-nodes in the power grids. Lastly, we show that GNNs trained on small grids can perform accurately on a large synthetic Texan power grid model, which illustrates the potential of our approach.","Power grids, dynamic stability, Graph Neural Networks" Pareto-Optimal Diagnostic Policy Learning in Clinical Applications via Semi-Model-Based Deep Reinforcement Learning,https://openreview.net/forum?id=0WVNuEnqVu,https://openreview.net/pdf?id=0WVNuEnqVu,Our proposed RL-based approach is able to reduce testing cost by up to 85% while achieving state-of-the-art diagnosis accuracy in three real-world medical diagnostics tasks.,"Dynamic diagnosis is desirable when medical tests are costly or time-consuming. In this work, we use reinforcement learning (RL) to find a dynamic policy that selects lab test panels sequentially based on previous observations, ensuring accurate testing at a low cost. Clinical diagnostic data are often highly imbalanced; therefore, we aim to maximize the $F_1$ score instead of the error rate. However, the $F_1$ score cannot be written as a cumulative sum of rewards, which invalidates standard RL methods. To remedy this issue, we develop a reward shaping approach, leveraging properties of the $F_1$ score and duality of policy optimization, to provably find the set of all Pareto-optimal policies for budget-constrained $F_1$ score maximization. To handle the combinatorially complex state space, we propose a Semi-Model-based Deep Diagnosis Policy Optimization (SM-DDPO) framework that is compatible with end-to-end training and online learning. SM-DDPO is tested on diverse clinical tasks: ferritin abnormality detection, sepsis mortality prediction, and acute kidney injury diagnosis. Experiments with real-world data validate that SM-DDPO trains efficiently and identifies all Pareto-front solutions. Across all tasks, SM-DDPO is able to achieve state-of-the-art diagnosis accuracy (in some cases higher than conventional methods) with up to $85\%$ reduction in testing cost.","medical diagnostics, Pareto front, reinforcement learning, non-Markovian reward, semi-model-based policy optimization" Closing the gap: Exact maximum likelihood training of generative autoencoders using invertible layers,https://openreview.net/forum?id=g8wBdhnstYz,https://openreview.net/pdf?id=g8wBdhnstYz,,"In this work, we provide an exact likelihood alternative to the variational training of generative autoencoders. This is achieved while leaving complete freedom in the choice of encoder, decoder and prior architectures, making our approach a drop-in replacement for the training of existing VAEs and VAE-style models. We refer to the resulting models as AutoEncoders within Flows (AEF), since the encoder, decoder and prior are defined as individual layers of an overall invertible architecture. This results in a deterministic encoding of the data, as opposed to the stochastic encoding of VAEs. We show that the approach often results in strikingly higher performance than architecturally identical VAEs in terms of log-likelihood and sample quality, especially for low-dimensional latent spaces. 
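The identity that makes exact-likelihood training of such invertible architectures possible is the change of variables: with an invertible map f and base density p_z, log p(x) = log p_z(f(x)) + log |det df/dx|, which replaces the ELBO. A toy one-dimensional affine "layer" illustrates it; AEF composes encoder, decoder and prior as such layers.

```python
# Worked toy example of the change-of-variables log-likelihood.
import torch

def exact_loglik(x, scale, shift):
    z = scale * x + shift                        # invertible affine layer
    log_det = torch.log(scale.abs()).expand_as(x)
    base = torch.distributions.Normal(0.0, 1.0)
    return base.log_prob(z) + log_det            # exact log p(x), per sample

ll = exact_loglik(torch.randn(16), torch.tensor(2.0), torch.tensor(0.1))
```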
Importantly, we show that AEF samples are substantially sharper than VAE samples. ", ACAT: Adversarial Counterfactual Attention for Classification and Detection in Medical Imaging,https://openreview.net/forum?id=5DkfiQPy9A,https://openreview.net/pdf?id=5DkfiQPy9A,"We propose a method to generate counterfactual images, which are adversarially obtained, and we derive saliency maps from them. These are employed in a framework that refines a classifier pipeline and helps learning better local features.","In some medical imaging tasks and other settings where only small parts of the image are informative for the classification task, traditional CNNs can sometimes struggle to generalise. Manually annotated Regions of Interest (ROI) are sometimes used to isolate the most informative parts of the image. However, these are expensive to collect and may vary significantly across annotators. To overcome these issues, we propose a method to generate saliency maps, obtained from adversarially generated counterfactual images. With this method, we are able to isolate the area of interest in brain and lung CT scans without using any manual annotations. Our saliency maps, in the task of localising the lesion location out of 6 possible regions, obtain a score of $65.05 \%$ on brain CT scans, improving on the score of $61.29 \%$ obtained with the best competing method. We also employ the saliency maps in a framework that refines a classifier pipeline. In particular, the saliency maps are used to obtain soft spatial attention masks that modulate the image features at different scales. We refer to our method as \emph{Adversarial Counterfactual Attention} (ACAT). ACAT increases the baseline classification accuracy of lesions in brain CT scans from $71.39 \%$ to $72.55 \%$ and of COVID-19 related findings in lung CT scans from $67.71 \%$ to $70.84 \%$ and exceeds the performance of competing methods.","Medical imaging, counterfactual examples, adversarial attacks, attention, saliency maps" Dynamics-inspired Neuromorphic Representation Learning,https://openreview.net/forum?id=D6hDzJMbRt4,https://openreview.net/pdf?id=D6hDzJMbRt4,We build a dynamics-inspired neural mechanism that outperforms the weight-based one on classification tasks.,"This paper investigates the dynamics-inspired neuromorphic architecture for neural representation and learning following Hamilton's principle. The proposed approach converts a weight-based neural structure to its dynamics-based form consisting of finite sub-models, whose mutual relations, measured by computing path integrals amongst their dynamic states, are equivalent to the typical neural weights. The feedback signals, interpreted as stress forces amongst sub-models, push them to move based on the entropy-reduction process derived from the Euler-Lagrange equations. We first train a dynamics-based neural model from scratch and observe that this model outperforms its corresponding feedforward neural networks on the MNIST dataset. Then we convert several pre-trained neural structures (e.g., DenseNet, ResNet, Transformers, etc.) into dynamics-based forms, followed by fine-tuning via entropy-reduction to obtain the stabilized dynamic states. We observe consistent improvements of these transformed models on the ImageNet dataset in terms of computational complexity, the number of trainable units, testing accuracy, and robustness. 
Moreover, we demonstrate the correlation between the performance of a neural system and its structural entropy.","dynamics-based, neuromorphic representation, neural network, Hamilton's principle" Abstract Visual Reasoning by Self-supervised Contrastive Learning,https://openreview.net/forum?id=AvSIqjCWVId,https://openreview.net/pdf?id=AvSIqjCWVId,Demonstration of an unsupervised model to solve analogy reasoning in Raven's Progressive Matrices task and its variant. ,"Neuro-symbolic models of artificial intelligence (AI) have been recently developed to perform tasks involving abstract visual reasoning, which is a hallmark of human intelligence but remains challenging for deep neural network methods. However, most of the current neuro-symbolic models also rely on supervised learning and auxiliary annotations, unlike human cognitive processes, which depend largely on the general cognitive abilities of entity and rule recognition rather than on learning how to solve specific tasks from examples. In this work, we propose a neuro-symbolic model by self-supervised contrastive learning (NS-SSCL) with unique and invariant representations of entities and rules in the perception and reasoning modules, respectively, to solve Raven's Progressive Matrices (RPMs) and its variant, a typical type of visual reasoning task used to test human intelligence. The perception module parses each object into invariant representations of attributes. The reasoning module grounds the representations of object attributes to form the latent rule representations, also through SSCL. Further, the relationships between the neural representations of object attributes and symbols used for rule reasoning are coherently mapped. Finally, the scene generation engine aggregates all attribute and rule representation distributions to produce a probabilistic representation of the target. NS-SSCL obtains state-of-the-art performance among unsupervised models on the RAVEN and V-PROM benchmarks, even surpassing most supervised models. The success of the proposed model suggests that constructing human-like general cognitive abilities may enable AI algorithms to solve complex tasks involving higher-level cognition, such as abstract reasoning, in a human-like manner.", POPGym: Benchmarking Partially Observable Reinforcement Learning,https://openreview.net/forum?id=chDrutUTs0K,https://openreview.net/pdf?id=chDrutUTs0K,"We propose POPGym, an RL library containing 14 partially observable gym environments and 13 different memory architectures","Real-world applications of Reinforcement Learning (RL) are often partially observable, thus requiring memory. Despite this, partial observability is still largely ignored by contemporary RL benchmarks and libraries. We introduce Partially Observable Process Gym (POPGym), a two-part library containing (1) a diverse collection of 14 partially observable environments, each with multiple difficulties and (2) implementations of 13 memory model baselines -- the most in a single RL library. Existing partially observable benchmarks tend to fixate on 3D visual navigation, which is computationally expensive and only one type of many possible POMDPs. In contrast, POPGym environments are diverse, produce smaller observations, use less memory, and often converge within two hours of training on a consumer-grade GPU. 
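For intuition on the kind of compact POMDPs such a benchmark favors, here is a toy memory task (not one of POPGym's 14 environments): the agent must repeat the first observation after k steps, which is unsolvable without memory. The interface is a simplified gym-style sketch.

```python
# Hypothetical diagnostic POMDP: reward only for recalling the first obs.
import numpy as np

class RepeatFirst:
    def __init__(self, k=4):
        self.k = k

    def reset(self):
        self.t, self.first = 0, np.random.randint(2)
        return self.first                      # only informative observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.k
        reward = float(done and action == self.first)
        return np.random.randint(2), reward, done, {}   # later obs are noise

env = RepeatFirst()
obs = env.reset()
```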
We implement our high-level memory API and memory baselines on top of the popular RLlib framework, providing plug-and-play compatibility with various training algorithms, exploration strategies, and distributed training paradigms. Using POPGym, we execute the largest comparison across RL memory models to date. POPGym is available at https://anonymous.4open.science/r/popgym-51D8/README.md.","partially observable, POMDP, reinforcement learning, memory" ETAD: A Sampling-Based Approach for Efficient Temporal Action Detection,https://openreview.net/forum?id=ap9iq9kaU8j,https://openreview.net/pdf?id=ap9iq9kaU8j,"We propose to alleviate the efficiency issue in TAD via a sampling mechanism, studying two questions in detail: where to sample and how to sample in TAD.","Temporal action detection (TAD) often suffers from a huge demand for computing resources due to long video durations. As a consequence, given limited resources, most action detectors can only operate on pre-extracted features rather than original video frames, resulting in sub-optimal solutions. In this work, we propose an efficient temporal action detector (ETAD) that can train directly from video frames, by introducing a novel sampling mechanism. First, for where to sample in TAD, we propose snippet-level sampling and proposal-level sampling, based on the observation that performance saturates at a small number of snippets/proposals. Such samplings essentially leverage the redundancy in the current detection framework, and thus can substantially reduce the computation cost and enable end-to-end training for long untrimmed videos without harming the performance. Second, for how to sample in TAD, we comprehensively study various sampling approaches, and point out that random sampling and DPP sampling work best empirically. Our sampling-based ETAD achieves state-of-the-art performance on TAD benchmarks with remarkable efficiency. With end-to-end training, ETAD can reach 38.25% average mAP on ActivityNet-1.3. With pre-extracted features, ETAD only needs 6 minutes of training time and 1.23 GB memory, still reaching an average mAP of 37.78%. Code will be available.","Temporal Action Detection, Untrimmed Video Understanding, Efficient Detection" HierBatching: Locality-Aware Out-of-Core Training of Graph Neural Networks,https://openreview.net/forum?id=WWD_2DKUqdJ,https://openreview.net/pdf?id=WWD_2DKUqdJ,A locality-aware out-of-core training approach for Graph Neural Networks that is an order of magnitude faster without compromising accuracy,"As graph neural networks (GNNs) become increasingly popular for analyzing data organized as massive graphs, how these models can be efficiently trained under economic computing resources becomes a critical subject that influences the widespread adoption of GNNs in practice. We consider the use of a single commodity machine constrained by limited memory but attached to ample external storage. In such an under-explored scenario, not only does the feature data often exceed the memory capacity, but the graph structure may not fit in memory either. Then, with data stored on disk, gathering features and constructing neighborhood subgraphs in usual mini-batch training incurs inefficient random access and expensive data movement. To overcome this bottleneck, we propose a locality-aware training scheme, coined HierBatching, to significantly increase sequential disk access, while maintaining the random nature of stochastic training and its quality. 
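A hedged sketch of the hierarchical batching pattern the abstract describes next: partitions live contiguously on disk, a random subset is cached in RAM, and many mini-batches are drawn from the cache before it is swapped, trading pure randomness for locality. `load_partition` stands in for a sequential disk read such as `np.load`.

```python
# Illustrative two-level (disk -> RAM cache -> batch) sampler, assumed sizes.
import numpy as np

def load_partition(pid, part_size=1000, dim=16):
    rng = np.random.default_rng(int(pid))      # stand-in for np.load(f"part_{pid}.npy")
    return rng.standard_normal((part_size, dim)).astype(np.float32)

def hier_batches(n_parts=64, cache_parts=4, reuse=8, batch=256):
    while True:
        pids = np.random.choice(n_parts, cache_parts, replace=False)
        cache = np.concatenate([load_partition(p) for p in pids])  # RAM cache
        for _ in range(reuse):                 # temporal locality: reuse cache
            yield cache[np.random.randint(len(cache), size=batch)]

gen = hier_batches()
xb = next(gen)
```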
HierBatching exploits the memory hierarchy of a modern GPU machine and constructs batches in an analogously hierarchical manner. Therein, graph nodes are organized in many partitions, each of which is laid out contiguously on disk for maximal spatial locality, while the main memory stores random partitions and is treated as a cache of the disk. Its content is reused multiple times for improving temporal locality. We conduct comprehensive experiments, including locality ablation, to demonstrate that HierBatching is economical, fast, and accurate.","Graph Neural Network, Out-of-Core Training, Spatial Locality, Temporal Locality, Hierarchical Batching" Everybody Needs Good Neighbours: An Unsupervised Locality-based Method for Bias Mitigation,https://openreview.net/forum?id=pOnhudsvzR,https://openreview.net/pdf?id=pOnhudsvzR,,"Learning models from human behavioural data often leads to outputs that are biased with respect to user demographics, such as gender or race. This effect can be controlled by explicit mitigation methods, but this typically presupposes access to demographically-labelled training data. Such data is often not available, motivating the need for unsupervised debiasing methods. To this end, we propose a new meta-algorithm for debiasing representation learning models, which combines the notions of data locality and accuracy of model fit, such that a supervised debiasing method can optimise fairness between neighbourhoods of poorly vs. well modelled instances as identified by our method. Results over five datasets, spanning natural language processing and structured data classification tasks, show that our technique recovers proxy labels that correlate with unknown demographic data, and that our method outperforms all unsupervised baselines, while also achieving competitive performance with state-of-the-art supervised methods which are given access to demographic labels.", Continual Learning In Low-coherence Subspace: A Strategy To Mitigate Learning Capacity Degradation,https://openreview.net/forum?id=jU_Q6UWBa2,https://openreview.net/pdf?id=jU_Q6UWBa2,"This paper contributes a novel method in continual learning, called Low-coherence Subspaces Projection (LcSP), which solves both the catastrophic forgetting problem and the learning capacity degradation problem.","Methods using gradient orthogonal projection, an efficient strategy in continual learning, have achieved promising success in mitigating catastrophic forgetting. However, these methods often suffer from the learning capacity degradation problem as the number of tasks increases. To address this problem, we propose to learn new tasks in low-coherence subspaces rather than orthogonal subspaces. Specifically, we construct a unified cost function involving regular DNN parameters and gradient projections on the Oblique manifold. We finally develop a gradient descent algorithm on a smooth manifold to jointly minimize the cost function and both the inter-task and intra-task coherence. 
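For context on what LcSP relaxes, here is the orthogonal-projection baseline it builds on: remove from the new task's gradient the component lying in previous tasks' gradient subspace. Learning in *low-coherence* subspaces instead keeps the new basis only weakly correlated with the old one; the sketch below shows only the standard orthogonal variant.

```python
# Classic gradient projection for continual learning (the baseline LcSP relaxes).
import torch

def project_out(grad, B):
    # B: (d, k) orthonormal basis of previous tasks' gradient subspace.
    return grad - B @ (B.t() @ grad)

d, k = 128, 10
B, _ = torch.linalg.qr(torch.randn(d, k))   # toy orthonormal basis
g_new = project_out(torch.randn(d), B)      # update that spares old tasks
```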
Numerical experimental results show that the proposed method has prominent advantages in maintaining the learning capacity as the number of tasks increases, especially for a large number of tasks, compared with baselines.","oblique manifold, continual learning, learning capacity degradation, catastrophic forgetting, low-coherence, orthogonal projection" Particle-based Variational Inference with Preconditioned Functional Gradient Flow,https://openreview.net/forum?id=6OphWWAE3cS,https://openreview.net/pdf?id=6OphWWAE3cS,,"Particle-based variational inference (VI) minimizes the KL divergence between model samples and the target posterior with gradient flow estimates. With the popularity of Stein variational gradient descent (SVGD), the focus of particle-based VI algorithms has been on the properties of functions in Reproducing Kernel Hilbert Space (RKHS) to approximate the gradient flow. However, the requirement of RKHS restricts the function class and algorithmic flexibility. This paper remedies the problem by proposing a general framework to obtain tractable functional gradient flow estimates. The functional gradient flow in our framework can be defined by a general functional regularization term that includes the RKHS norm as a special case. We use our framework to propose a new particle-based VI algorithm: preconditioned functional gradient flow (PFG). Compared with SVGD, the proposed method has several advantages: a larger function class; greater scalability in large particle-size scenarios; better adaptation to ill-conditioned distributions; provable continuous-time convergence in KL divergence. Non-linear function classes such as neural networks can be incorporated to estimate the gradient flow. Both theory and experiments have shown the effectiveness of our framework.","Posterior Sampling, Particle-based VI" An Efficient Mean-field Approach to High-Order Markov Logic,https://openreview.net/forum?id=7UrHaeZ5Ie7,https://openreview.net/pdf?id=7UrHaeZ5Ie7,This paper proposes a method to perform mean-field iteration of MLNs efficiently via a novel neural network.,"Markov logic networks (MLNs) are powerful models for symbolic reasoning, which combine probabilistic modeling with relational logic. Inference algorithms for MLNs often perform at the level of propositional logic or require building a first-order probabilistic graph, and the computational efficiency remains a challenge. The mean-field algorithm generalizes message passing for approximate inference in many intractable probabilistic graphical models, but in MLNs it still suffers from the high-order dependencies among the massive groundings, resulting in time complexity exponential in both the length and the arity of logic rules. We propose a novel method, LogicMP, to simplify logic message passing. In most practical cases, it can significantly reduce the complexity to polynomial for formulae in conjunctive normal form (CNF). We exploit the property of CNF logic rules to sidestep the expectation computation of high-order dependency, and then formulate the logic message passing by Einstein summation to facilitate parallel computation, which can be optimized by sequentially contracting the rule arguments. With LogicMP, we achieve clear improvements on several reasoning benchmark datasets in both performance and efficiency over competitor methods. Specifically, the AUC-PR of the UW-CSE and Cora datasets is improved by more than 11\% in absolute terms, and the speed is about ten times faster. 
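A toy illustration of the Einstein-summation formulation described above: for a rule A(x) AND B(x,y) => C(y), the mean-field evidence sent to C(y) aggregates the marginals of the body atoms over the shared argument x in a single einsum, avoiding explicit enumeration of groundings. LogicMP's general construction covers arbitrary CNF rules; this shows only the contraction pattern.

```python
# One einsum replaces looping over all n*n groundings of the rule body.
import numpy as np

n = 100
qA = np.random.rand(n)                     # q(A(x) = 1)
qB = np.random.rand(n, n)                  # q(B(x, y) = 1)
msg_to_C = np.einsum('x,xy->y', qA, qB)    # aggregated evidence for each C(y)
```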
","Logic Rules, Mean-field Algorithm, Markov Logic Network, Symbolic Reasoning" A theory of representation learning in neural networks gives a deep generalisation of kernel methods,https://openreview.net/forum?id=vCbnQZ6lXw3,https://openreview.net/pdf?id=vCbnQZ6lXw3,,"The successes of modern deep neural networks (DNNs) are founded on their ability to transform inputs across multiple layers to build good high-level representations. It is therefore critical to understand this process of representation learning. However, standard theoretical approaches (formally NNGPs) involving infinite width limits eliminate representation learning. We therefore develop a new infinite width limit, the representation learning limit, that exhibits representation learning mirroring that in finite-width networks, yet at the same time, remains extremely theoretically tractable. In particular, we derive an elegant objective that describes how each network layer learns representations that interpolate between input and output, and we confirm that the modes of the objective match the behaviour of finite but wide networks. Moreover, we use this limit and objective to develop a flexible, deep generalisation of kernel methods, that we call deep kernel machines (DKMs). We show that DKMs can be scaled to large datasets using inducing point methods from the Gaussian process literature, and we show that DKMs exhibit superior performance to other kernel-based approaches.","Gaussian process, infinite-width neural networks" A spatiotemporal graph neural network with multi granularity for air quality prediction,https://openreview.net/forum?id=_fouOVXUV7O,https://openreview.net/pdf?id=_fouOVXUV7O,,"Air quality prediction is a complex system engineering. How to fully consider the impact of meteorological, spatial and temporal factors on air quality is the core problem. To address this central conundrum, in an elaborate encoder-decoder architecture, we propose a new air quality prediction method based on multi-granularity spatiotemporal graph network. At the encoder, firstly, we use multi granularity graph and the well-known HYSPLIT model to build spatial relationship and dynamic edge relationship between nodes, respectively, while meteorological, temporal and topographic characteristics are used to build node features and LSTM (Long Short Term Memory) is used to learn the time-series relationship of pollutant concentration. At the decoder, secondly, we use the attention mechanism LSTM for decoding and forecasting of pollutant concentration. The proposed model is capable of tracking different influences on prediction resulting from the changes of air quality. On a project-based dataset, we validate the effectiveness of the proposed model and examine its abilities of capturing both fine-grained and long-term influences in pollutant process. 
We also compare the proposed model with the state-of-the-art air quality forecasting methods on the Yangtze River Delta city group dataset; the experimental results show the appealing performance of our model over competitive baselines.","Air quality prediction, graph neural network, long short term memory" Highway Reinforcement Learning,https://openreview.net/forum?id=NFcRC4aYSWf,https://openreview.net/pdf?id=NFcRC4aYSWf,a novel adaptive multi-step Bellman Optimality Equation for efficient credit assignment that converges to the optimal value function with a better contraction rate under mild assumptions,"Traditional Dynamic Programming (DP) approaches suffer from slow backward credit-assignment (CA): only a one-step search is performed at each update. A popular solution for multi-step CA is to use multi-step Bellman operators. Unfortunately, in control settings, existing methods typically suffer from the large variance of multi-step off-policy corrections or are biased, preventing convergence. To overcome these problems, we introduce a novel multi-step Bellman optimality equation with adaptive lookahead steps. We first derive a new multi-step Value Iteration (VI) method that converges to the optimal Value Function (VF) with an exponential contraction rate but linear computational complexity. Given some trial, our so-called Highway RL performs rapid CA by picking a policy and a possible lookahead (up to the trial end) that maximize the near-term reward during lookahead plus a DP-based estimate of the cumulative reward for the remaining part of the trial. Highway RL does not require off-policy corrections. Under mild assumptions, it achieves better convergence rates than the traditional one-step Bellman Optimality Operator. We then derive Highway Q-Learning, a convergent multi-step off-policy variant of Q-learning. We show that our Highway algorithms significantly outperform DP approaches on toy tasks. Finally, we propose a deep function approximation variant called Highway DQN. We evaluate it on visual MinAtar Games, outperforming similar multi-step methods.","reinforcement learning, off-policy learning, credit assignment, Bellman Equation" Learning Locality and Isotropy in Dialogue Modeling,https://openreview.net/forum?id=dPs6BGO2QT0,https://openreview.net/pdf?id=dPs6BGO2QT0,We present a simple dialogue representation calibration method to learn isotropic and conversational features during the dialogue modeling stage.,"Existing dialogue modeling methods have achieved promising performance on various dialogue tasks with the aid of Transformers and large-scale pre-trained language models. However, some recent studies revealed that the context representations produced by these methods suffer from the problem of anisotropy. In this paper, we find that the generated representations are also not conversational, losing the conversation structure information during the context modeling stage. To this end, we identify two properties in dialogue modeling, i.e., locality and isotropy, and present a simple method for dialogue representation calibration, namely SimDRC, to build isotropic and conversational feature spaces. Experimental results show that our approach significantly outperforms current state-of-the-art models on three open-domain dialogue tasks with eight benchmarks across both automatic and human evaluation metrics. 
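A hedged sketch of the two properties just named: "locality" pulls representations of adjacent utterances together, while an isotropy-style term pushes the average pairwise similarity of all utterances down. SimDRC's exact losses and weighting are defined in the paper; this only conveys the two opposing pressures.

```python
# Illustrative locality + isotropy calibration loss (assumed formulation).
import torch
import torch.nn.functional as F

def simdrc_like_loss(H, lam=0.5):
    # H: (n_utterances, d) utterance representations from a dialogue encoder.
    H = F.normalize(H, dim=-1)
    locality = -(H[:-1] * H[1:]).sum(-1).mean()   # neighbours: similar
    sim = H @ H.t()
    off_diag = sim - torch.diag(torch.diag(sim))
    isotropy = off_diag.mean()                    # overall: dissimilar
    return locality + lam * isotropy

loss = simdrc_like_loss(torch.randn(12, 64))
```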
More in-depth analyses further confirm the effectiveness of our proposed approach.","dialogue system, representation learning, feature space calibration" Dynamic Historical Adaptation for Continual Image-Text Modeling,https://openreview.net/forum?id=ouUnWeADZKq,https://openreview.net/pdf?id=ouUnWeADZKq,We propose a novel direct parameter transfer method for continual image-text modeling.,"In realistic application scenarios, existing methods for image-text modeling have limitations in dealing with data streams: training on all data requires too many computation/storage resources, and full access to previous data may even be unavailable. In this work, we thus propose a new continual image-text modeling (CITM) setting that requires a model to be trained sequentially on a number of diverse image-text datasets. Although recent continual learning methods can be directly applied to the CITM setting, most of them only consider reusing part of previous data or aligning the output distributions of previous and new models, which is a partial or indirect way to acquire the old knowledge. In contrast, we propose a novel dynamic historical adaptation (DHA) method which can holistically and directly review the old knowledge from a historical model. Concretely, the historical model transfers all of its parameters to the main/current model to utilize the holistic old knowledge. In turn, the main model dynamically transfers its parameters to the historical model every five training steps to ensure that the knowledge gap between them is not too large. Extensive experiments show that our DHA outperforms other representative/latest continual learning methods under the CITM setting.","Image-text modeling, continual learning, contrastive learning, cross-modal retrieval" AutoGT: Automated Graph Transformer Architecture Search,https://openreview.net/forum?id=GcM7qfl5zY,https://openreview.net/pdf?id=GcM7qfl5zY,,"Although Transformer architectures have been successfully applied to graph data with the advent of Graph Transformer, the current design of Graph Transformers still heavily relies on human labor and expert knowledge to decide proper neural architectures and suitable graph encoding strategies at each Transformer layer. In the literature, there have been some works on the automated design of Transformers focusing on non-graph data such as texts and images, without considering graph encoding strategies; these fail to handle non-Euclidean graph data. In this paper, we study the problem of automated graph Transformer design for the first time. However, solving these problems poses the following challenges: i) how to design a unified search space for graph Transformers, and ii) how to deal with the coupling relations between Transformer architectures and the graph encodings of each Transformer layer. To address these challenges, we propose Automated Graph Transformer (AutoGT), a neural architecture search framework that can automatically discover the optimal graph Transformer architectures by joint optimization of Transformer architecture and graph encoding strategies. Specifically, we first propose a unified graph Transformer formulation that can represent most state-of-the-art graph Transformer architectures. Based upon the unified formulation, we further design the graph Transformer search space that includes both candidate architectures and various graph encodings. 
To handle the coupling relations, we propose a novel encoding-aware performance estimation strategy by gradually training and splitting the supernets according to the correlations between graph encodings and architectures. The proposed strategy can provide a more consistent and fine-grained performance prediction when evaluating the jointly optimized graph encodings and architectures. Extensive experiments and ablation studies show that our proposed AutoGT achieves consistent improvements over state-of-the-art hand-crafted baselines on all datasets, demonstrating its effectiveness and wide applicability. ", Rememory-Based SimSiam for Unsupervised Continual Learning,https://openreview.net/forum?id=hlCBgdwvBx,https://openreview.net/pdf?id=hlCBgdwvBx,We propose a novel rememory-based SimSiam method for unsupervised continual learning.,"Unsupervised continual learning (UCL) has started to draw attention from the continual learning community, motivated by the practical need for representation learning with unlabeled data on sequential tasks. However, most recent UCL methods focus on mitigating the catastrophic forgetting problem with a replay buffer to store previous data (i.e., a rehearsal-based strategy), which requires much extra storage and thus limits their practical applications. To overcome this drawback, based on contrastive learning via SimSiam, we propose a novel rememory-based SimSiam (RM-SimSiam) method to reduce the dependency on the replay buffer under the UCL setting. The core idea of our RM-SimSiam is to store and remember the old knowledge with a data-free historical module instead of a replay buffer. Specifically, this historical module is designed to store the historical average model of all previous models (i.e., the memory process) and then transfer the knowledge of the historical average model to the new model (i.e., the rememory process). To further improve the rememory ability of our RM-SimSiam, we devise an enhanced SimSiam-based contrastive loss by aligning the representations outputted by the historical and new models. Extensive experiments on three benchmarks demonstrate the effectiveness of our RM-SimSiam under the UCL setting. ","Continual learning, unsupervised representation learning, contrastive learning, rememory process" OPERA: Omni-Supervised Representation Learning with Hierarchical Supervisions,https://openreview.net/forum?id=BckALoxD8ow,https://openreview.net/pdf?id=BckALoxD8ow,We propose an omni-supervised representation learning with hierarchical supervisions method for better transferability.,"The pretrain-finetune paradigm in modern computer vision facilitates the success of self-supervised learning, which tends to achieve better transferability than supervised learning. However, with the availability of massive labeled data, a natural question emerges: how to train a better model with both self and full supervision signals? In this paper, we propose omni-supervised representation learning with hierarchical supervisions (OPERA) as a solution. We provide a unified perspective of supervisions from labeled and unlabeled data and propose a unified framework of fully supervised and self-supervised learning. We extract a set of hierarchical proxy representations for each image and impose self and full supervisions on the corresponding proxy representations. 
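A hedged sketch of the idea just stated: separate proxy heads on one backbone, with an InfoNCE-style self-supervised loss on one proxy and a standard classification loss on the other. The heads and weighting below are placeholders, not OPERA's hierarchy.

```python
# Illustrative joint self + full supervision on proxy representations.
import torch
import torch.nn.functional as F

def opera_like_loss(backbone, ssl_head, cls_head, view1, view2, labels, tau=0.2):
    z1 = F.normalize(ssl_head(backbone(view1)), dim=-1)
    z2 = F.normalize(ssl_head(backbone(view2)), dim=-1)
    logits = z1 @ z2.t() / tau                           # self-supervised proxy
    ssl_loss = F.cross_entropy(logits, torch.arange(len(z1)))
    sup_loss = F.cross_entropy(cls_head(backbone(view1)), labels)  # supervised proxy
    return ssl_loss + sup_loss
```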
Extensive experiments on both convolutional neural networks and vision transformers demonstrate the superiority of OPERA in image classification, segmentation, and object detection.","Representation learning, omni-supervised learning." i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable?,https://openreview.net/forum?id=uAmdu-GTG-_,https://openreview.net/pdf?id=uAmdu-GTG-_,,"Masked image modeling (MIM) has been recognized as a strong and popular self-supervised pre-training approach in the vision domain. However, the interpretability of the mechanism and properties of the representations learned by such a scheme has so far not been well explored. In this work, through comprehensive experiments and empirical studies on Masked Autoencoders (MAE), we address two critical questions to explore the behaviors of the learned representations: ${\bf (i)}$ Are the latent representations in Masked Autoencoders linearly separable if the input is a mixture of two images instead of a single one? This can provide concrete evidence to explain why MAE-learned representations have superior performance on downstream tasks, as impressively demonstrated in much of the literature. ${\bf (ii)}$ What is the degree of semantics encoded in the latent feature space by Masked Autoencoders? To explore these two problems, we propose a simple yet effective Interpretable MAE (${\bf i-MAE})$ framework with a two-way image reconstruction and a latent feature reconstruction with distillation loss, to help us understand the behaviors inside the MAE structure. Extensive experiments are conducted on CIFAR-10/100, Tiny-ImageNet and ImageNet-1K datasets to verify the observations we discovered. Furthermore, in addition to qualitatively analyzing the characteristics of the latent representations, we also examine the existence of linear separability and the degree of semantics in the latent space by proposing two novel metrics. The surprising and consistent results between the qualitative and quantitative experiments demonstrate that i-MAE is a superior framework design for interpretability research of MAE frameworks, as well as achieving better representational ability.","Interpretability, Masked Autoencoders, Self-supervised Learning" Logit Margin Matters: Improving Transferable Targeted Adversarial Attack by Logit Calibration,https://openreview.net/forum?id=8OFAtZzIf7T,https://openreview.net/pdf?id=8OFAtZzIf7T,,"Previous works have extensively studied the transferability of adversarial samples in untargeted black-box scenarios. However, it remains challenging to craft targeted adversarial examples with higher transferability than non-targeted ones. Recent studies reveal that the traditional Cross-Entropy (CE) loss function is insufficient to learn transferable targeted perturbations due to the issue of vanishing gradients. In this work, we provide a comprehensive investigation of the CE loss function and find that the logit margin between the targeted and untargeted classes quickly saturates under CE, which largely limits the transferability. Therefore, in this paper, we pursue the goal of enlarging logit margins and propose two simple and effective logit calibration methods, which are achieved by downscaling the logits with a temperature factor and an adaptive margin, respectively. Both of them can effectively encourage the optimization to produce larger logit margins and lead to higher transferability. 
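A hedged sketch of the temperature-based calibration described above: dividing the logits by a factor T > 1 before cross-entropy keeps the softmax away from saturation, so the targeted gradient does not vanish during the attack. The PGD loop, the step sizes, and the toy model are illustrative assumptions, not the paper's exact setup.

```python
# Temperature-downscaled cross-entropy inside a targeted PGD attack.
import torch
import torch.nn.functional as F

def calibrated_targeted_loss(logits, target, T=5.0):
    # Larger T shrinks the logits, delaying softmax saturation.
    return F.cross_entropy(logits / T, target)

def pgd_targeted(model, x, target, eps=8/255, alpha=2/255, steps=10, T=5.0):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = calibrated_targeted_loss(model(x_adv), target, T)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Descend the targeted loss, then project back into the eps-ball.
        x_adv = x_adv.detach() - alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)
        x_adv = x_adv.clamp(0, 1)
    return x_adv

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(4, 3, 32, 32)
target = torch.randint(0, 10, (4,))
adv = pgd_targeted(model, x, target)
print(adv.shape)
```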
Besides, we show that minimizing the cosine distance between the adversarial examples and the classifier of the target class can further improve the transferability, which benefits from downscaling the logits via L2-normalization. Experiments conducted on the ImageNet dataset validate the effectiveness of the proposed methods, which outperform the state-of-the-art methods in black-box targeted attacks. The source code is available at \href{https://anonymous.4open.science/r/Target-Attack-72EB/README.md}{Link}.", GSCA: Global Spatial Correlation Attention,https://openreview.net/forum?id=WSIHedvwmru,https://openreview.net/pdf?id=WSIHedvwmru,A novel parameter-free self-attention with linear complexity is proposed to enhance convolution.,"Convolution and self-attention are two powerful, complementary techniques in vision tasks. The ability of self-attention to capture long-range dependencies compensates for convolution's limited ability to capture global feature information. However, the quadratic computational complexity of self-attention impedes their direct combination. This paper proposes global spatial correlation attention (GSCA), a self-attention approximation with linear computational complexity and no additional parameters. The aim is to adjust the attention distribution in the global space by utilizing the statistical relationships of the input feature maps themselves. We compress the key matrix into a vector, evaluate the pairwise affinity of each pixel with the key vector in terms of the cross-correlation coefficient, and apply the attention weights to the inputs using the Hadamard product. A multi-head attention form is further built to enhance the module's ability to capture the feature subspace. Based on the above lightweight operations, the proposed method can simply and effectively improve the aggregation capability of convolution for global information. We extensively evaluate our GSCA module on image classification, object detection, and instance segmentation tasks. The parameter-free GSCA is lighter than state-of-the-art modules while achieving very competitive performance. When combined with channel attention modules, it further outperforms the original methods. The experiments also demonstrate the generalizability and robustness of GSCA. The source code is available at GSCA.","self attention, attention mechanism, cross-correlation" Cross Modal Domain Generalization for Query-based Video Segmentation,https://openreview.net/forum?id=XHgHn5gYIVP,https://openreview.net/pdf?id=XHgHn5gYIVP,," Domain generalization (DG) aims to increase a model's generalization ability against the performance degradation incurred when transferring to target domains, and has been successfully applied in various visual and natural language tasks. However, DG on multi-modal tasks is still an untouched field. Compared with traditional single-modal DG, the biggest challenge of multi-modal DG is that each modality has to cope with its own domain shift. Directly applying previous methods will make the generalization direction of the model in each modality inconsistent, resulting in negative effects when the model is migrated to the target domains. Thus, in this paper, we explore the scenario of query-based video segmentation to study how to better advance the generalization ability of the model in the multi-modal situation. 
Considering that information from different modalities often shows consistency, we propose query-guided feature augmentation (QFA) and attention map adaptive instance normalization (AM-AdaIN) modules. Compared with traditional DG models, our method can combine the visual and textual modalities to guide each other for data augmentation and learn a domain-agnostic cross-modal relationship, which is more suitable for multi-modal transfer tasks. Extensive experiments on three query-based video segmentation generalization tasks demonstrate the effectiveness of our method. ","domain generalization, multi-modal, video segmentation" Accumulative Poisoning Defense with Memorization Discrepancy,https://openreview.net/forum?id=lhJtB_F1Ga1,https://openreview.net/pdf?id=lhJtB_F1Ga1,,"Adversarial poisoning attacks pose huge threats to various machine learning applications. In particular, the recent accumulative poisoning attacks show that it is possible to inflict irreparable harm on models via a sequence of imperceptible attacks followed by a trigger sample. Due to the limited data-level information in real-time data streaming, current defensive methods are indiscriminate in handling poisoned and clean samples. In this paper, we adopt the perspective of model dynamics and propose a novel information measure, namely, Memorization Discrepancy, to explore defense via model-level information. By implicitly transferring changes in the data manipulation to changes in the model outputs, the Memorization Discrepancy constructed from the victim and historical models can detect imperceptible poisoned samples based on their values being distinct from those of clean samples. We thoroughly analyze its properties and accordingly propose a Discrepancy-aware Sample Correction (DSC) to defend against accumulative poisoning attacks. Extensive experiments comprehensively characterize our proposed Memorization Discrepancy and verify the effectiveness of our DSC.", Combating Exacerbated Heterogeneity for Robust Decentralized Models,https://openreview.net/forum?id=eKllxpLOOm,https://openreview.net/pdf?id=eKllxpLOOm,,"The emerging privacy and security issues in real-world applications motivate us to pursue adversarially robust federated models. However, the straightforward combination of adversarial training and federated learning in one framework usually induces undesired robustness deterioration. We discover that the cause of this phenomenon is that the generated adversarial data can exacerbate the data heterogeneity among local clients, making the wrapped federated learning perform poorly. To deal with this problem, we propose a novel framework termed Slack Federated Adversarial Training (SFAT), which assigns client-wise slack during aggregation to combat the intensified heterogeneity. Theoretically, we analyze the convergence of the proposed method to properly relax the objective when combining federated learning and adversarial training. 
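To make the client-wise slack idea concrete, here is a minimal, hypothetical sketch of slack-style federated aggregation. The specific weighting rule (softmax over negative local adversarial losses, so lower-loss clients are upweighted) is an assumption chosen only to illustrate non-uniform aggregation; SFAT's actual rule may differ.

```python
# Slack-style aggregation: weight each client's parameters non-uniformly
# according to its local adversarial training loss, then average.
import copy
import torch

def slack_aggregate(client_states, client_adv_losses, beta=1.0):
    losses = torch.tensor(client_adv_losses, dtype=torch.float32)
    # Softmax over negative losses: low-loss clients receive more weight
    # (an illustrative assumption, not the paper's exact slack rule).
    weights = torch.softmax(-beta * losses, dim=0)
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(w * s[key] for w, s in zip(weights, client_states))
    return global_state

clients = [torch.nn.Linear(4, 2).state_dict() for _ in range(3)]
agg = slack_aggregate(clients, [0.9, 0.5, 1.2])
print(sorted(agg.keys()))
```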
Experimentally, we verify the rationality and effectiveness of SFAT on various benchmark and real-world datasets with different adversarial training and federated optimization methods.", Shot Retrieval and Assembly with Text Script for Video Montage Generation,https://openreview.net/forum?id=3owqfawaLv,https://openreview.net/pdf?id=3owqfawaLv,We propose a novel transformer-based model for video montage generation by retrieving and assembling shots with arbitrary text scripts.,"With the development of video sharing websites, numerous users desire to create their own attractive video montages. However, it is difficult for inexperienced users to create a well-edited video montage due to the lack of professional expertise. In the meantime, it is time-consuming even for experts to create video montages of high quality, which requires effectively selecting shots from abundant candidates and assembling them together. Instead of manual creation, a number of automatic methods have been proposed for video montage generation. However, these methods typically take a single sentence as input for text-to-shot retrieval, and ignore the semantic cross-sentence coherence given a complicated text script of multiple sentences. To overcome this drawback, we propose a novel model for video montage generation by retrieving and assembling shots with arbitrary text scripts. To this end, a sequence consistency transformer is devised for cross-sentence coherence modeling. More importantly, with this transformer, two novel sequence-level tasks are defined for sentence-shot alignment at the sequence level: the Cross-Modal Sequence Matching (CMSM) task and the Chaotic Sequence Recovering (CSR) task. To facilitate research on video montage generation, we construct a new, highly varied dataset which collects thousands of video-script pairs in the documentary domain. Extensive experiments on the constructed dataset demonstrate the superior performance of the proposed model. The dataset and generated video demos are available at https://github.com/RATVDemo/RATV","Video montage generation, text-to-shot retrieval, transformer, dataset construction" Pruning with Output Error Minimization for Producing Efficient Neural Networks,https://openreview.net/forum?id=YlMvAomKXO,https://openreview.net/pdf?id=YlMvAomKXO,"We present a pruning method that conducts pruning and then performs ""reconstruction"" to minimize the output error of the activation function, while previous methods minimize the error of the value before passing through the activation function.","DNNs are known to have excessive parameters and thus are computationally expensive, which poses a challenge for implementation in various applications. Structured pruning is a technique for compressing a trained DNN model by removing redundant neurons (or channels). How well a pruned model maintains its accuracy depends on two factors. The first is compression ratio optimization, in other words, how many neurons are removed in each layer. The other is layer-wise optimization, in other words, which neurons are preserved in each layer. In this paper, we propose Pruning with Output Error Minimization (POEM), a layer-wise pruning method that conducts pruning and then performs reconstruction to compensate for the error caused by pruning. The strength of POEM lies in its reconstruction using the weighted least squares method so as to minimize the output error of the activation function, while previous methods minimize the error of the value before applying the activation function. 
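A hedged sketch of the reconstruction step just described, for a single ReLU layer: after input neurons are removed, the surviving weights are re-fit by weighted least squares so the post-activation outputs stay close to the originals. The specific weighting (emphasizing samples where the unit is active) is an illustrative assumption, not the paper's exact formulation.

```python
# POEM-style weighted least squares reconstruction after structured pruning.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 64))           # calibration activations (n, d_in)
W = rng.normal(size=(64, 32))            # original weights (d_in, d_out)
Y = np.maximum(X @ W, 0.0)               # original post-activation outputs

keep = np.sort(rng.choice(64, size=48, replace=False))  # surviving inputs
Xk = X[:, keep]

W_new = np.zeros((48, 32))
for j in range(32):
    pre = X @ W[:, j]
    w = np.where(pre > 0, 1.0, 0.1)      # weight active samples more heavily
    # Weighted least squares: minimize || sqrt(w) * (Xk b - y_j) ||^2
    A = Xk * np.sqrt(w)[:, None]
    b = Y[:, j] * np.sqrt(w)
    W_new[:, j], *_ = np.linalg.lstsq(A, b, rcond=None)

err = np.abs(np.maximum(Xk @ W_new, 0.0) - Y).mean()
print(f"mean post-activation reconstruction error: {err:.4f}")
```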
Experiments with well-known DNN models and a large-scale image recognition dataset show that POEM is better than previous methods at maintaining the accuracy of those models.","pruning, weighted least squares method, convolutional neural networks, compression" Orientation-Aware Graph Neural Networks for Protein Structure Representation Learning,https://openreview.net/forum?id=WcTLZrpzfe,https://openreview.net/pdf?id=WcTLZrpzfe,We design a new type of geometric neural networks for learning protein representations.,"By folding into particular 3D structures, proteins play a key role in living beings. To learn meaningful representations from a protein structure for downstream tasks, not only the global backbone topology but also the local fine-grained orientational relations between amino acids should be considered. In this work, we propose the Orientation-Aware Graph Neural Networks (OAGNNs) to better sense the geometric characteristics in protein structure (e.g., inner-residue torsion angles, inter-residue orientations). Extending a single weight from a scalar to a 3D vector, we construct a rich set of geometrically meaningful operations to process both the classical and SO(3) representations of a given structure. To plug our designed perceptron unit into existing Graph Neural Networks, we further introduce an equivariant message passing paradigm, showing superior versatility in maintaining SO(3)-equivariance at the global scale. Experiments have shown that our OAGNNs have a remarkable ability to sense geometric orientational features compared to classical networks. OAGNNs have also achieved state-of-the-art performance on various computational biology applications related to protein 3D structures. ","geometric learning, representation learning, structural biology" Adaptive Update Direction Rectification for Unsupervised Continual Learning,https://openreview.net/forum?id=R498E9vaqZ,https://openreview.net/pdf?id=R498E9vaqZ,We propose an Actor-Critic framework with adaptive update direction rectification for unsupervised continual learning.,"Recent works on continual learning have shown that unsupervised continual learning (UCL) methods rival or even beat supervised continual learning methods. However, most UCL methods typically adopt fixed learning strategies with pre-defined objectives and ignore the influence of the constant shift of data distributions on the subsequent training process. This non-adaptive paradigm tends to achieve sub-optimal performance, since the optimal update direction (ensuring the trade-off between old and new tasks) keeps changing during training over sequential tasks. In this work, we thus propose a novel UCL framework termed AUDR that adaptively rectifies the update direction with a policy network (i.e., the Actor) at each training step based on the reward predicted by a value network (i.e., the Critic). Concretely, different from existing Actor-Critic based reinforcement learning works, there are three vital designs that make our AUDR applicable to the UCL setting: (1) a reward function to measure the score/value of the currently selected action, which provides the ground-truth reward to guide the Critic's predictions; (2) an action space for the Actor to select actions (i.e., update directions) according to the reward predicted by the Critic; (3) a multinomial sampling strategy with a lower bound on the sampling probability of each action, which is designed to increase the variance of the Actor's selected actions for more diversified exploration. 
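Design (3) above admits a very small sketch: flooring every action's probability at a minimum value and renormalizing keeps the Actor exploring even when it is very confident. The floor value `p_min` is an illustrative assumption.

```python
# Lower-bounded multinomial sampling over a discrete action space.
import torch

def sample_action(action_logits, p_min=0.05):
    probs = torch.softmax(action_logits, dim=-1)
    probs = probs.clamp(min=p_min)               # enforce the lower bound
    probs = probs / probs.sum(dim=-1, keepdim=True)
    return torch.multinomial(probs, num_samples=1)

logits = torch.tensor([[4.0, 0.5, -1.0]])        # Actor strongly prefers action 0
print(sample_action(logits))                     # actions 1 and 2 keep a floor probability
```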
Extensive experiments show that our AUDR achieves state-of-the-art results under both the in-dataset and cross-dataset UCL settings. Importantly, our AUDR also shows superior performance when combined with other UCL methods, which suggests that our AUDR is highly extensible and versatile.","Continual learning, unsupervised learning, representation learning" Language Model Pre-training with Linguistically Motivated Curriculum Learning,https://openreview.net/forum?id=y7CNId2RnV,https://openreview.net/pdf?id=y7CNId2RnV,We propose a language model pre-training method based on linguistically motivated curriculum learning.,"Pre-training serves as a foundation of recent NLP models, where a language modeling task is performed over large corpora. It has been shown that data affects the quality of pre-training, and curricula have been investigated with respect to sequence length. We consider a linguistic perspective in the curriculum, where frequent words are learned first and rare words last. This is achieved by replacing syntactic constituents that contain rare words with their constituent labels. By such syntactic substitutions, a curriculum can be constructed by gradually introducing words with decreasing frequency levels. Without modifying model architectures or introducing external computational overhead, our data-centric method yields better performance than vanilla BERT on various downstream benchmarks.","language model pre-training, curriculum learning, data-centric method" Towards Generalized Combinatorial Solvers via Reward Adjustment Policy Optimization,https://openreview.net/forum?id=KjzZrBsORz,https://openreview.net/pdf?id=KjzZrBsORz,Towards Generalized Combinatorial Solvers via Reward Adjustment Policy Optimization,"Recent reinforcement learning approaches have achieved impressive success in solving combinatorial optimization (CO) problems. However, most existing works focus on evaluating their solvers under a prevalent fixed-size protocol, ignoring generalization to different-size instances. When the solver is confronted with instances of a size it has not been trained on, the performance drops dramatically. In practice, approaches that lack size-insensitive generalization capacity are unacceptable, since an additional training period must be repeated for each new instance size. We observe that the main obstacle preventing us from training a generalized combinatorial solver is oscillating reward signals. Reward oscillation has two aspects: 1) the conventional reward fails to depict the actual performance of solvers across different instance sizes; 2) the inherent difficulties varying across different sizes worsen training stability. Thus, we present Reward Adjustment Policy Optimization (RAPO), an end-to-end approach to building combinatorial solvers for a wide range of CO problems. RAPO contains a reward adjustment method across instances with variable sizes to address the first aspect of reward oscillation, along with a promising curriculum strategy to alleviate the other. We conduct experiments on three popular CO problems, namely, the traveling salesman problem (TSP), the capacitated vehicle routing problem (CVRP), and the 0-1 knapsack problem (KP). 
RAPO exhibits significant improvement in generalization to instances with variable sizes consistently on all benchmarks. Remarkably, RAPO even outperforms its fixed-size counterparts on the size it was trained on by a clear margin. ","combinatorial optimization, reinforcement learning, traveling salesman problem, vehicle routing problem" Offline Reinforcement Learning with Closed-Form Policy Improvement Operators,https://openreview.net/forum?id=1usJZBGNrZ,https://openreview.net/pdf?id=1usJZBGNrZ,We propose a closed-form policy improvement operator and model the behavior policies as a Gaussian Mixture.,"Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning. By exploiting historical transitions, a policy is trained to maximize a learned value function while constrained by the behavior policy to avoid a significant distributional shift. In this paper, we propose our closed-form policy improvement operators. We make the novel observation that the behavior constraint naturally motivates the use of a first-order Taylor approximation, leading to a linear approximation of the policy objective. Additionally, as practical datasets are usually collected by heterogeneous policies, we model the behavior policies as a Gaussian Mixture and overcome the induced optimization difficulties by leveraging the LogSumExp lower bound and Jensen's Inequality, giving rise to a closed-form policy improvement operator. 
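To illustrate why linearizing the objective yields a closed form, here is a hedged one-step sketch: expanding Q to first order around the behavior action makes the constrained problem linear in the action, so the improved action is the behavior action shifted along the Q gradient, with a step size standing in for the trust region. The normalization and `trust` constant are assumptions, not the paper's exact operator.

```python
# Closed-form, first-order policy improvement around a behavior action.
import torch

def closed_form_improve(q_net, state, behavior_action, trust=0.1):
    a = behavior_action.clone().requires_grad_(True)
    q = q_net(torch.cat([state, a], dim=-1)).sum()
    grad = torch.autograd.grad(q, a)[0]
    # Move along grad_a Q, scaled to stay close to the behavior policy.
    return behavior_action + trust * grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)

q_net = torch.nn.Sequential(torch.nn.Linear(6, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
s, a = torch.randn(4, 4), torch.randn(4, 2)
print(closed_form_improve(q_net, s, a).shape)
```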
We instantiate an offline RL algorithm with our novel policy improvement operator and empirically demonstrate its effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.","Offline Reinforcement Learning algorithms, Deep Reinforcement Learning" Sensitivity-aware Visual Parameter-efficient Tuning,https://openreview.net/forum?id=9GOjmbRQ2o,https://openreview.net/pdf?id=9GOjmbRQ2o,We propose a visual parameter-efficient tuning approach to identify and tune the parameters at task-specific important positions while being inference-efficient.,"Visual Parameter-efficient Tuning (VPT) has become a powerful alternative to full fine-tuning, updating only a small number of parameters while freezing the remaining vast majority of parameters to significantly reduce the storage costs of adapting pre-trained vision models to downstream tasks. Although the storage burden is largely alleviated, VPT approaches still face many challenges, e.g., lower inference speed and the lack of effective configurations of trainable parameters tailored to each task. In this paper, we present a simple yet effective approach termed Sensitivity-aware visual Parameter-efficient Tuning (SPT) to tackle these challenges. Given a desired tunable parameter budget, SPT quickly identifies the parameters important to the given task in a data-dependent way before fine-tuning, without a complex selection schedule. To increase the representational capacity at negligible cost within the same parameter budget, we employ low-rank reparameterization to achieve a better trade-off between parameter efficiency and accuracy. Through extensive experiments on a wide range of downstream recognition tasks, our SPT achieves better overall transfer performance than full fine-tuning and the other VPT approaches, with no additional computational or memory overhead during inference. For instance, SPT uses 99.35% fewer trainable parameters than full fine-tuning while achieving a 7.3% higher average top-1 accuracy on the VTAB-1k benchmark with the supervised pre-trained ViT-B backbone. Notably, SPT is also the first work to bridge the gap between full fine-tuning and VPT approaches for backbones under the self-supervised pre-training strategies MAE and MoCo v3 on the challenging VTAB-1k benchmark.","Visual Parameter-efficient Tuning, Fine-tuning, Visual Task Adaptation" Towards Robust Object Detection Invariant to Real-World Domain Shifts,https://openreview.net/forum?id=vqSyt8D3ny,https://openreview.net/pdf?id=vqSyt8D3ny,We perturb feature channel statistics to generalize object detectors under real-world domain shifts.,"Safety-critical applications such as autonomous driving require robust object detection invariant to real-world domain shifts. Such shifts can be regarded as different domain styles, which can vary substantially due to environment changes and sensor noise, but deep models only know the training domain style. This domain style gap impedes object detection generalization on diverse real-world domains. Existing classification domain generalization (DG) methods cannot effectively solve the robust object detection problem, because they either rely on multiple source domains with large style variance or destroy the content structures of the original images. In this paper, we analyze and investigate effective solutions to overcome domain style overfitting for robust object detection without the above shortcomings. 
Our method, dubbed Normalization Perturbation (NP), perturbs the channel statistics of source domain low-level features to synthesize various latent styles, so that the trained deep model can perceive diverse potential domains and generalize well even without observations of target domain data in training. This approach is motivated by the observation that the feature channel statistics of target domain images deviate around the source domain statistics. We further explore style-sensitive channels for effective style synthesis. Normalization Perturbation relies only on a single source domain and is surprisingly simple and effective, contributing a practical solution by effectively adapting or generalizing classification DG methods to robust object detection. Extensive experiments demonstrate the effectiveness of our method for generalizing object detectors under real-world domain shifts.","robust object detection, autonomous driving" Light Sampling Field and BRDF Representation for Physically-based Neural Rendering,https://openreview.net/forum?id=yYEb8v65X8,https://openreview.net/pdf?id=yYEb8v65X8,,"Physically-based rendering (PBR) is key for immersive rendering effects used widely in the industry to showcase detailed realistic scenes from computer graphics assets. A well-known caveat is that producing such effects is computationally heavy and relies on complex capture devices. Inspired by the success in quality and efficiency of recent volumetric neural rendering, we want to develop a physically-based neural shader to eliminate device dependency and significantly boost performance. However, no existing lighting and material models in current neural rendering approaches can accurately represent the comprehensive lighting models and BRDF properties required by the PBR process. Thus, this paper proposes a novel lighting representation that models direct and indirect light locally through a light sampling strategy in a learned light sampling field. We also propose BRDF models to separately represent surface/subsurface scattering details to enable complex objects such as translucent materials (e.g., skin, jade). We then instantiate our proposed representations with an end-to-end physically-based neural face skin shader, which takes a standard face asset (i.e., geometry, albedo map, and normal map) and an HDRI for illumination as inputs and generates a photo-realistic rendering as output. Extensive experiments showcase the quality and efficiency of our PBR face skin shader, indicating the effectiveness of our proposed lighting and material representations.",Neural Rendering Margin-based Neural Network Watermarking,https://openreview.net/forum?id=pIPvjnW1B9A,https://openreview.net/pdf?id=pIPvjnW1B9A,We propose margin-based watermarking for deep neural networks.,"As Machine Learning as a Service (MLaaS) platforms become prevalent, deep neural network (DNN) watermarking, which enables one to verify the ownership of a target DNN model in a black-box scenario, has gained increasing attention. Unfortunately, previous watermarking methods are vulnerable to functionality stealing attacks, allowing an adversary to falsely claim ownership of a DNN model stolen from its original owner. In this work, we propose a novel margin-based DNN watermarking approach that is robust to functionality stealing attacks based on model extraction and distillation. 
Specifically, during training, our method maximizes the margins of watermarked samples by using projected gradient ascent on them, so that their predicted labels cannot change without compromising the accuracy of the model that the attacker tries to steal. We validate our method on multiple benchmarks and show that our watermarking method successfully defends against model extraction attacks, outperforming recent baselines.","neural network watermarking, machine learning, deep learning, ownership verification" Your Denoising Implicit Model is a Sub-optimal Ensemble of Denoising Predictions,https://openreview.net/forum?id=NrJ-x9KbdZ,https://openreview.net/pdf?id=NrJ-x9KbdZ,,"Denoising diffusion models construct a Markov denoising process to learn the transport from the Gaussian noise distribution to the data distribution, but require thousands of denoising steps to achieve SOTA generative performance. Denoising diffusion implicit models (DDIMs) introduce a non-Markovian process to largely reduce the required steps, but their performance degenerates as the number of sampling steps is further reduced. In this work, we show that DDIMs belong to our $\textit{ensemble denoising implicit models}$, which heavily rely on the convex ensemble of obtained denoising predictions. We propose the improved DDIM (iDDIM) to demonstrate that DDIMs adopt sub-optimal ensemble coefficients. The iDDIM can largely improve on DDIMs, but still deteriorates in the case of a few sampling steps. Thus we further propose the $\textit{generalized denoising implicit model}$ (GDIM), which replaces the ensemble prediction with a probabilistic inference conditioned on the obtained states. Then a specific instance, $t$-GDIM, that only depends on the latest state is parameterized by a conditional energy-based model (EBM) and a variational sampler. The models are jointly trained with variational maximum likelihood. Extensive experiments show that $t$-GDIM can reduce the number of sampling steps to only 4 while maintaining generative quality comparable to other generative models.", DREAM: Domain-free Reverse Engineering Attributes of Black-box Model,https://openreview.net/forum?id=MLStcoDEhqi,https://openreview.net/pdf?id=MLStcoDEhqi,,"Deep learning models are usually black boxes when deployed on machine learning platforms. Prior works have shown that the attributes (e.g., the number of convolutional layers) of a target black-box neural network can be exposed through a sequence of queries. A crucial limitation is that these works assume the dataset used for training the target model is known beforehand, and leverage this dataset for the model attribute attack. However, it is difficult to access the training dataset of the target black-box model in reality. Therefore, whether the attributes of a target black-box model can still be revealed in this case is doubtful. In this paper, we investigate a new problem of Domain-free Reverse Engineering the Attributes of a black-box target Model, called DREAM, without requiring the model's training dataset to be available, and put forward a general and principled framework by casting this problem as an out-of-distribution (OOD) generalization problem. At the heart of our framework, we devise a multi-discriminator generative adversarial network (MDGAN) to learn domain-invariant features. Based on these features, we can learn a domain-free model to inversely infer the attributes of a target black-box model with unknown training data. 
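A hedged sketch of multi-discriminator adversarial feature learning in the spirit of the MDGAN just described: one feature extractor is trained against per-domain discriminators so that its features cannot be attributed to any source domain. The architectures, losses, and the "push toward 0.5" objective are assumptions for illustration only.

```python
# Adversarial training of a feature extractor against per-domain discriminators.
import torch
import torch.nn as nn

extractor = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
discriminators = nn.ModuleList(nn.Linear(16, 1) for _ in range(3))  # one per domain
bce = nn.BCEWithLogitsLoss()
opt_f = torch.optim.Adam(extractor.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminators.parameters(), lr=1e-3)

domains = [torch.randn(16, 32) + d for d in range(3)]  # toy shifted domains
for step in range(5):
    feats = [extractor(x) for x in domains]
    # Discriminator step: each D_k learns to tell its own domain apart.
    d_loss = sum(
        bce(discriminators[k](f.detach()),
            torch.full((f.size(0), 1), float(k == j)))
        for k in range(3) for j, f in enumerate(feats))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Extractor step: fool every discriminator toward 0.5 (domain-invariance).
    feats = [extractor(x) for x in domains]
    g_loss = sum(
        bce(discriminators[k](f), torch.full((f.size(0), 1), 0.5))
        for k in range(3) for f in feats)
    opt_f.zero_grad(); g_loss.backward(); opt_f.step()
print("done")
```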
This makes our method one of a kind: it can gracefully apply to an arbitrary domain for model attribute reverse engineering with good generalization ability. Extensive experimental studies are conducted, and the results validate the superiority of our proposed method over the baselines.","Model Attribute inference, domain-free method" Structural Generalization of Visual Imitation Learning with Position-Invariant Regularization,https://openreview.net/forum?id=eMuXAIEYXh9,https://openreview.net/pdf?id=eMuXAIEYXh9,,"How visual imitation learning models can generalize to novel unseen visual observations is a highly challenging problem. Such generalization ability is crucial for their real-world applications. Since this generalization problem has many different aspects, we focus on one case called structural generalization, which refers to generalization to unseen task setups, such as a novel setup of object locations in the robotic manipulation problem. In this case, previous works observe that visual imitation learning models will overfit to absolute information (e.g., coordinates) rather than the relational information between objects, which is more important for decision making. As a result, the models will perform poorly in novel scenarios. Nevertheless, so far, it remains unclear how we can solve this problem effectively. Our insight into this problem is to explicitly remove the absolute information from the features learned by imitation learning models so that the models can use robust, relational information to make decisions. To this end, we propose a novel, position-invariant regularizer for generalization. The proposed regularizer penalizes the imitation learning model when its features contain absolute, positional information of objects. We carry out experiments on the MAGICAL and ProcGen benchmarks, as well as a real-world robot manipulation problem. We find that our regularizer can effectively boost the structural generalization performance of imitation learning models. Through both qualitative and quantitative analysis, we verify that our method does learn robust relational representations. ", Dealing with missing data using attention and latent space regularization,https://openreview.net/forum?id=jny79Mfgkno,https://openreview.net/pdf?id=jny79Mfgkno,A novel framework for dealing with missing data without imputation by regularizing latent space representations.,"Most practical data science problems encounter missing data. A wide variety of solutions exist, each with strengths and weaknesses that depend upon the missingness-generating process. Here we develop a theoretical framework for training and inference using only observed variables, enabling modeling of incomplete datasets without imputation. Using an information- and measure-theoretic argument, we construct models with latent space representations that regularize against the potential bias introduced by missing data. The theoretical properties of this approach are demonstrated empirically using a synthetic dataset. The performance of this approach is tested on 11 benchmarking datasets with missingness and 18 datasets corrupted across three missingness patterns, with comparison against a state-of-the-art model and industry-standard imputation. We show that our proposed method outperforms common imputation methods and the current state-of-the-art with statistical significance. 
","missing data, missingness, latent space regularization, measure theory" Revisiting Global Pooling through the Lens of Optimal Transport,https://openreview.net/forum?id=FAHVsSfhWs,https://openreview.net/pdf?id=FAHVsSfhWs,"We develop a novel and solid global pooling framework through the lens of optimal transport, which covers many existing pooling methods and performs well on various learning problems.","Global pooling is one of the most significant operations in many machine learning models and tasks, whose implementation, however, is often empirical in practice. In this study, we develop a novel and solid global pooling framework through the lens of optimal transport. We demonstrate that most existing global pooling methods are equivalent to solving some specializations of an unbalanced optimal transport (UOT) problem. Making the parameters of the UOT problem learnable, we unify various global pooling methods in the same framework, and accordingly, propose a generalized global pooling layer called UOT-Pooling (UOTP) for neural networks. Besides implementing the UOTP layer based on the classic Sinkhorn-scaling algorithm, we design new model architectures based on the Bregman ADMM algorithm, which has comparable complexity but better numerical stability. We test our UOTP layers in several application scenarios, including multi-instance learning, graph classification, and image classification. In these applications, our UOTP layers can either imitate conventional global pooling layers or learn new pooling mechanisms to perform better. ","Global pooling, regularized optimal transport, Bregman ADMM, multi-instance learning, graph embedding" On the Importance of Pretrained Knowledge Distillation for 3D Object Detection,https://openreview.net/forum?id=T1Qx6EC08o,https://openreview.net/pdf?id=T1Qx6EC08o,"We propose PreDistill, a pretrained distillation paradigm for knowledge transfer and demonstrate that PreDistill serves as a plug-and-play module to various state-of-the-art detectors.","Multi-camera 3D object detection for autonomous driving is quite challenging and has drawn great attention from both academia and industry. The core issue of the vision-only methods is that it is difficult to mine accurate geometry-aware features from images. To improve the performance of vision-only approaches, one promising ingredient in the recipe lies in how to use visual features to simulate the geometry information of LiDAR, since point cloud data inherently carries 3D spatial information. In this paper, we resort to knowledge distillation to leverage useful representations from the LiADR-based expert to enhance feature learning in the camera-based pipeline. It is observed that the joint optimization of expert-apprentice distillation as well as the target task might be difficult to learn in the conventional distillation paradigm. Inspired by the great blossom and impressive results of foundation models in general vision, we propose a pretrained distillation paradigm, termed as PreDistill, to decouple the training procedure into two stages. The apprentice network first emphasizes the knowledge transfer from the expert; then it performs finetuning on the downstream target task. Such a strategy would facilitate the optimal representation learning with targeted goals and ease the joint feature learning as resided in conventional single-stage counterpart. PreDistill serves as a convenient plug-and-play that is flexible to extend to multiple state-of-the-art detectors. 
Without bells and whistles, building on top of the most recent approaches, e.g., BEVFusion-C, BEVFormer, and BEVDepth, we achieve consistent gains of 7.6%, 1.0%, and 0.6% in terms of the NDS metric on the nuScenes benchmark. Code and model checkpoints will be made available.","knowledge distillation, object detection" Bidirectional Propagation for Cross-Modal 3D Object Detection,https://openreview.net/forum?id=gYs_cRuK7V,https://openreview.net/pdf?id=gYs_cRuK7V,We innovatively propose bidirectional feature propagation to address cross-modal 3D object detection. Such a new perspective will inspire the research on multi-modal learning for scene understanding and analysis.,"Recent works have revealed the superiority of feature-level fusion for cross-modal 3D object detection, where fine-grained feature propagation from 2D image pixels to 3D LiDAR points has been widely adopted for performance improvement. Still, the potential of heterogeneous feature propagation between 2D and 3D domains has not been fully explored. In this paper, in contrast to existing pixel-to-point feature propagation, we investigate the opposite point-to-pixel direction, allowing point-wise features to flow inversely into the 2D image branch. Thus, when jointly optimizing the 2D and 3D streams, the gradients back-propagated from the 2D image branch can boost the representation ability of the 3D backbone network working on LiDAR point clouds. Then, combining pixel-to-point and point-to-pixel information flow mechanisms, we further construct an interactive bidirectional feature propagation framework, dubbed BiProDet. In addition to the architectural design, we also propose normalized local coordinates map estimation, a new 2D auxiliary task for the training of the 2D image branch, which facilitates learning local spatial-aware features from the image modality and implicitly enhances the overall 3D detection performance. Extensive experiments and ablation studies validate the effectiveness of our method. Notably, we rank 1st on the highly competitive KITTI benchmark on the cyclist class at the time of submission. We also uploaded the source code in the supplementary material, which will be publicly available.","Cross-modal, 3D Object Detection, 3D Point Cloud, Deep Learning" Policy Pre-training for Autonomous Driving via Self-supervised Geometric Modeling,https://openreview.net/forum?id=X5SUR7g2vVw,https://openreview.net/pdf?id=X5SUR7g2vVw,"We introduce a visuomotor driving policy pre-training paradigm, which leverages self-supervised geometric modeling to learn driving policy representation and achieves superior performance on various downstream driving tasks.","Witnessing the impressive achievements of pre-training techniques on large-scale data in the fields of computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit to mitigate the sample inefficiency problem for visuomotor driving. Given the highly dynamic and variant nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains massive irrelevant information for decision making, rendering predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for policy pre-training in visuomotor driving. 
We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. The proposed PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns a driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on the current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving-policy-related representations and is thereby competent for multiple visuomotor driving tasks. As a by-product, the pre-trained geometric modeling networks could bring further improvements to the depth and odometry estimation tasks. Extensive experiments covering a wide span of challenging scenarios have demonstrated the superiority of our proposed approach, where improvements range from 2% to even over 100% with very limited data. Code and models will be made public.","Policy Pre-training, End-to-end Autonomous Driving, Self-supervised Learning" Towards Expressive Graph Representations for Graph Neural Networks,https://openreview.net/forum?id=aCGIa6GfbmN,https://openreview.net/pdf?id=aCGIa6GfbmN,"graph representation, graph neural network, set representation, expressive power","Graph Neural Networks (GNNs) have shown powerful capabilities for graph representation learning in various application areas. However, most existing GNN variants learn the graph representations in a non-injective or non-continuous fashion, both of which reduce the model's expressive power. In this paper, we present a theoretical framework to improve the expressive power of GNNs by taking both injectivity and continuity into account. Accordingly, we develop the \textit{Injective Continuous Graph Neural Network} (ICGNN), which learns the graph and node representations in an injective and continuous fashion, so that it can map similar nodes or graphs to similar embeddings, and non-equivalent nodes or non-isomorphic graphs to different embeddings. We validate the proposed ICGNN model on graph classification and node classification over multiple benchmark datasets including both simple graphs and attributed graphs. The experimental results demonstrate that our model achieves state-of-the-art performance on most of the benchmarks.", EurNet: Efficient Multi-Range Relational Modeling of Spatial Multi-Relational Data,https://openreview.net/forum?id=rMbrVNxYuqZ,https://openreview.net/pdf?id=rMbrVNxYuqZ,This paper proposes the EurNet for efficiently modeling spatial multi-relational data like images and protein structures.,"Modeling spatial relationships in data remains critical across many different tasks, such as image classification, semantic segmentation and protein structure understanding. Previous works often use a unified solution like relative positional encoding. However, there exist different kinds of spatial relations, including short-range, medium-range and long-range relations, and modeling them separately can better capture the focus of different tasks on the multi-range relations (e.g., short-range relations can be important in instance segmentation, while long-range relations should be upweighted for semantic segmentation). In this work, we introduce the EurNet for Efficient multi-range relational modeling. 
EurNet constructs a multi-relational graph, where each type of edge corresponds to short-, medium- or long-range spatial interactions. On the constructed graph, EurNet adopts a novel modeling layer, called gated relational message passing (GRMP), to propagate multi-relational information across the data. GRMP captures multiple relations within the data with little extra computational cost. We study EurNets in two important domains: image and protein structure modeling. Extensive experiments on ImageNet classification, COCO object detection and ADE20K semantic segmentation verify the gains of EurNet over the previous SoTA FocalNet. On the EC and GO protein function prediction benchmarks, EurNet consistently surpasses the previous SoTA GearNet. Our results demonstrate the strength of EurNets in modeling spatial multi-relational data from various domains. ","Multi-Relational Modeling, Image Modeling, Protein Structure Modeling" TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis,https://openreview.net/forum?id=ju_Uqw384Oq,https://openreview.net/pdf?id=ju_Uqw384Oq,"Based on the multi-periodicity, we analyze the intraperiod- and interperiod-variations in 2D space and propose the TimesNet as a task-general model, which achieves consistent state-of-the-art in five mainstream time series analysis tasks.","Time series analysis is of immense importance in extensive applications, such as weather forecasting, anomaly detection, and action recognition. This paper focuses on temporal variation modeling, which is the key problem common to extensive analysis tasks. Previous methods attempt to accomplish this directly from the 1D time series, which is extremely challenging due to the intricate temporal patterns. Based on the observation of multi-periodicity in time series, we disentangle the complex temporal variations into multiple intraperiod- and interperiod-variations. To tackle the limitations of 1D time series in representation capability, we extend the analysis of temporal variations into 2D space by transforming the 1D time series into a set of 2D tensors based on multiple periods. This transformation can embed the intraperiod- and interperiod-variations into the columns and rows of the 2D tensors respectively, making the 2D-variations easy to model with 2D kernels. Technically, we propose the TimesNet with TimesBlock as a task-general backbone for time series analysis. TimesBlock can discover the multi-periodicity adaptively and extract the complex temporal variations from the transformed 2D tensors with a parameter-efficient inception block. Our proposed TimesNet achieves consistent state-of-the-art performance in five mainstream time series analysis tasks, including short- and long-term forecasting, imputation, classification, and anomaly detection.","Time Series Analysis, Deep Learning" Learning without Prejudices: Continual Unbiased Learning via Benign and Malignant Forgetting,https://openreview.net/forum?id=gfPUokHsW-,https://openreview.net/pdf?id=gfPUokHsW-,"We propose a novel method, coined Learning without Prejudices, that encourages benign forgetting and regularizes malignant forgetting for continual unbiased learning. ","Although machine learning algorithms have achieved state-of-the-art status in image classification, recent studies have substantiated that models learning several tasks in sequence, termed continual learning (CL), often suffer from abrupt degradation of performance on previous tasks. 
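A hedged sketch of the period-based 1D-to-2D transformation described in the TimesNet abstract above: a dominant period is read off the FFT amplitude spectrum, and the series is folded into a (cycles x period) tensor so intraperiod variation lies along one axis and interperiod variation along the other. The top-1 period choice (rather than top-k) is a simplification for illustration.

```python
# Period discovery via FFT amplitudes, then folding 1D into 2D.
import torch

def fold_by_top_period(x):
    # x: (T,) univariate series
    T = x.shape[0]
    amp = torch.fft.rfft(x).abs()
    amp[0] = 0                                    # ignore the DC component
    freq = int(amp.argmax())                      # dominant frequency bin
    period = max(1, T // max(freq, 1))
    cycles = T // period
    return x[: cycles * period].reshape(cycles, period)  # rows: interperiod

t = torch.arange(128, dtype=torch.float32)
x = torch.sin(2 * torch.pi * t / 16) + 0.1 * torch.randn(128)
folded = fold_by_top_period(x)
print(folded.shape)   # expected (8, 16): 8 cycles of period 16
```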
A large body of CL frameworks has been devoted to alleviating this issue. However, we observe that forgetting phenomena in CL are not always unfavorable, especially when there is bias (spurious correlation) in the training data. We term this type of forgetting benign forgetting, and categorize detrimental forgetting as malignant forgetting. Based on this finding, our objective in this study is twofold: (a) to discourage malignant forgetting by generating previous representations, and (b) to encourage benign forgetting by employing contrastive learning in conjunction with feature-level augmentation. Extensive evaluations on biased experimental setups demonstrate that our proposed method, Learning without Prejudices, is effective for continual unbiased learning.","representation learning, continual learning, unbiased learning" FINDE: Neural Differential Equations for Finding and Preserving Invariant Quantities,https://openreview.net/forum?id=tLScKVhcCR,https://openreview.net/pdf?id=tLScKVhcCR,"Real-world dynamical systems have invariant quantities such as energy, momenta, and mass. Even without prior knowledge, the proposed neural network finds and preserves such quantities from data by leveraging projection and discrete gradient methods.","Many real-world dynamical systems are associated with first integrals (a.k.a. invariant quantities), which are quantities that remain unchanged over time. The discovery and understanding of first integrals are fundamental and important topics both in the natural sciences and in industrial applications. First integrals arise from the conservation laws of system energy, momentum, and mass, and from constraints on states; these are typically related to specific geometric structures of the governing equations. Existing neural networks designed to ensure such first integrals have shown excellent accuracy in modeling from data. However, these models incorporate the underlying structures, and in most situations where neural networks learn unknown systems, these structures are also unknown. This limitation needs to be overcome for scientific discovery and modeling of unknown systems. To this end, we propose the first integral-preserving neural differential equation (FINDE). By leveraging the projection method and the discrete gradient method, FINDE finds and preserves first integrals from data, even in the absence of prior knowledge about the underlying structures. Experimental results demonstrate that FINDE can predict future states of target systems over much longer horizons and find various quantities consistent with well-known first integrals in a unified manner.","neural ordinary differential equations, first integral, conservation law" Controllable Adaptive Learning,https://openreview.net/forum?id=T7mOB22uL_,https://openreview.net/pdf?id=T7mOB22uL_,,"As deep learning has enabled unprecedented applications in versatile vision cognition tasks, researchers have pursued higher-performance and more generalized algorithms, which come with expensive training and deployment when applied in complex scenarios across domains. However, we argue that generalization and high performance are not always the ultimate goals in real life, given various applications and regulatory requirements. In this work, for the first time to our knowledge, we propose a Controllable Adaptive Learning (CAL) paradigm that allows the model to perform well on some data domains while deliberately performing poorly on others by control. We define the problem as a Controlled Multi-target Unsupervised Domain Adaptation (CMUDA) task. 
Without the need to access labels in the target domain, we make the model perform poorly on certain target domains through a novel distribution-difference loss function design. We then introduce two easy-to-use control methods, namely an implicit representation controller and an explicit text-prompt controller, to regain access to the high-performance result with little effort, without the need to retrain the entire network. Extensive experiments demonstrate the effectiveness of our approach. We believe that our CAL paradigm will lead to an emerging trend for future research. Our code is at *URL*.",Controllable Adaptive Learning Approximate Vanishing Ideal Computations at Scale,https://openreview.net/forum?id=3ZPESALKXO,https://openreview.net/pdf?id=3ZPESALKXO,We study approximate vanishing ideal algorithms at scale.,"The vanishing ideal of a set of points $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_m\}\subseteq \mathbb{R}^n$ is the set of polynomials that evaluate to $0$ over all points $\mathbf{x} \in X$ and admits an efficient representation by a finite subset of generators. In practice, to accommodate noise in the data, algorithms that construct generators of the approximate vanishing ideal are widely studied but their computational complexities remain high. In this paper, we scale up the Oracle Approximate Vanishing Ideal algorithm (OAVI), the only generator-constructing algorithm with known learning guarantees. We prove that the computational complexity of OAVI is not superlinear, as previously claimed, but linear in the number of samples $m$. In addition, we propose two modifications that accelerate OAVI's training time: Our analysis reveals that replacing the Pairwise Conditional Gradients algorithm, one of the solvers used in OAVI, with the faster Blended Pairwise Conditional Gradients algorithm leads to an exponential speed-up in the number of features $n$. Finally, using a new Inverse Hessian Boosting approach, intermediate convex optimization problems can be solved almost instantly, improving OAVI's training time by multiple orders of magnitude in a variety of numerical experiments.","approximate vanishing ideal, convex optimization, conditional gradients algorithms, Hessian matrix" How you start matters for generalization,https://openreview.net/forum?id=1Imd7_uamo,https://openreview.net/pdf?id=1Imd7_uamo,We promote a shift of focus towards initialization rather than neural architecture or (stochastic) gradient descent to explain this implicit regularization,"Characterizing the remarkable generalization properties of over-parameterized neural networks remains an open problem. A growing body of recent literature shows that the bias of stochastic gradient descent (SGD) and architecture choice implicitly leads to better generalization. In this paper, we show on the contrary that, independently of architecture, SGD can itself be the cause of poor generalization if one does not ensure good initialization. Specifically, we prove that any differentiably parameterized model, trained under gradient flow, obeys a weak spectral bias law which states that sufficiently high frequencies train arbitrarily slowly. This implies that very high frequencies present at initialization will remain after training, and hamper generalization. Further, we empirically test the developed theoretical insights using practical, deep networks. 
Finally, we contrast our framework with that supplied by the \emph{flat-minima} conjecture and show that Fourier analysis provides a more reliable lens for understanding the generalization of neural networks.","spectral bias, generalization" Understanding Incremental Learning of Gradient Descent: A Fine-grained analysis of Matrix Sensing,https://openreview.net/forum?id=5Jq1ASp33L,https://openreview.net/pdf?id=5Jq1ASp33L,,"The implicit bias of optimization algorithms such as gradient descent (GD) is believed to play an important role in generalization of modern machine learning methods such as deep learning. This paper provides a fine-grained analysis of the dynamics of GD for the matrix sensing problem, whose goal is to recover a low-rank ground-truth matrix from near-isotropic linear measurements. With small initialization, we show that GD behaves similarly to the greedy low-rank learning heuristics~\citep{li2020towards} and follows an incremental learning procedure~\citep{gissin2019implicit}. That is, GD sequentially learns solutions with increasing ranks until it recovers the ground-truth matrix. Compared to existing works which only analyze the first learning phase for rank-1 solutions, our result is stronger because it characterizes the whole learning process. Moreover, our analysis of the incremental learning procedure applies to the under-parameterized regime as well. As a key ingredient of our analysis, we observe that GD always follows an approximately low-rank trajectory, and we develop novel landscape properties for matrix sensing with low-rank parameterization. Finally, we conduct numerical experiments which confirm our theoretical findings.","deep learning theory, incremental learning, non-convex optimization" Selective Annotation Makes Language Models Better Few-Shot Learners,https://openreview.net/forum?id=qY1hlv7gwg,https://openreview.net/pdf?id=qY1hlv7gwg,"We propose a select-then-annotate framework to make large language models better few-shot learners. Our method, vote-k, greatly improves the task performance over classification, commonsense reasoning, dialogue, and text/code generation.","Many recent approaches to natural language tasks are built on the remarkable abilities of large language models. Large language models can perform in-context learning, where they learn a new task from a few task demonstrations, without any parameter updates. This work examines the implications of in-context learning for the creation of datasets for new natural language tasks. Departing from recent in-context learning methods, we formulate an annotation-efficient, two-step framework: selective annotation that chooses a pool of examples to annotate from unlabeled data in advance, followed by prompt retrieval that retrieves task examples from the annotated pool at test time. Based on this framework, we propose an unsupervised, graph-based selective annotation method, vote-k, to select diverse, representative examples to annotate. Extensive experiments on 10 datasets (covering classification, commonsense reasoning, dialogue, and text/code generation) demonstrate that our selective annotation method improves the task performance by a large margin. On average, vote-k achieves a 12.9%/11.4% relative gain under an annotation budget of 18/100, as compared to randomly selecting examples to annotate. Compared to state-of-the-art supervised finetuning approaches, it yields similar performance with 10-100x less annotation cost across 10 tasks.
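As an illustrative aside, the following sketch shows the general shape of graph-based diverse selection: build a kNN graph over pool embeddings, then greedily pick examples that cover the most not-yet-covered neighbours. The scoring rule here is a simplification for exposition, not the authors' exact vote-k procedure.

```python
# Schematic graph-based diverse selection (simplified; not the exact vote-k rule).
import numpy as np

def diverse_select(embeddings: np.ndarray, budget: int, k: int = 10):
    n = len(embeddings)
    sims = embeddings @ embeddings.T
    knn = np.argsort(-sims, axis=1)[:, 1:k + 1]        # kNN graph over the pool
    covered = np.zeros(n, dtype=bool)
    selected = []
    for _ in range(budget):
        # score = how many not-yet-covered neighbours a candidate would add
        scores = np.array([np.sum(~covered[knn[i]]) if i not in selected else -1
                           for i in range(n)])
        pick = int(np.argmax(scores))
        selected.append(pick)
        covered[knn[pick]] = True
        covered[pick] = True
    return selected

pool = np.random.randn(500, 64)
pool /= np.linalg.norm(pool, axis=1, keepdims=True)
print(diverse_select(pool, budget=18))                 # an 18-example annotation budget
```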
We further analyze the effectiveness of our framework in various scenarios: language models with varying sizes, alternative selective annotation methods, and cases where there is a test data domain shift. We hope that our studies will serve as a basis for data annotations as large language models are increasingly applied to new tasks.","few-shot learning, language models, in-context learning, active learning" Switch-NeRF: Learning Scene Decomposition with Mixture of Experts for Large-scale Neural Radiance Fields,https://openreview.net/forum?id=PQ2zoIZqvm,https://openreview.net/pdf?id=PQ2zoIZqvm, We propose an applicable end-to-end sparse NeRF network with learning-based decomposition for large-scale scenes.,"Neural Radiance Fields (NeRF) have recently been applied to reconstruct building-scale and even city-scale scenes. To model a large-scale scene efficiently, a dominant strategy is to employ a divide-and-conquer paradigm via performing scene decomposition, which decomposes a complex scene into parts that are further processed by different sub-networks. Existing large-scale NeRFs mainly use heuristic hand-crafted scene decomposition, with regular 3D-distance-based or physical-street-block-based schemes. Although achieving promising results, the hand-crafted schemes severely limit the capabilities of NeRF in large-scale scene modeling. First, it is extremely challenging to manually design a universal scene decomposition rule for different complex scenes, leading to adaptation issues when applying the model to different scenarios. Second, the decomposition procedure is not learnable, hindering the network from jointly optimizing the scene decomposition and the radiance fields in an end-to-end manner, to better model the scene with different sub-networks. Third, the different sub-networks are typically optimized independently, and thus the inconsistency among them cannot be effectively handled during the optimization. To tackle these issues, in this paper, we propose Switch-NeRF, a novel end-to-end large-scale NeRF with learning-based scene decomposition. We design a gating network to dispatch 3D points to different NeRF sub-networks. The gating network can be optimized together with the NeRF sub-networks for different scene partitions, through a design based on the Sparsely Gated Mixture of Experts (MoE). The outputs from different sub-networks can also be fused in a learnable way in the unified framework to effectively guarantee the consistency of the whole scene. Furthermore, the proposed MoE-based Switch-NeRF model is carefully implemented and optimized to achieve both high-fidelity scene reconstruction and efficient computation. Our method establishes clear state-of-the-art performance on several large-scale datasets.
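As an illustrative aside, the gating idea can be sketched as a top-1 mixture-of-experts dispatch over 3D sample points. The module below is a toy stand-in with hypothetical widths and output conventions, not Switch-NeRF's implementation.

```python
# Toy top-1 gated dispatch of 3D points to expert sub-networks.
import torch
import torch.nn as nn

class GatedExperts(nn.Module):
    def __init__(self, n_experts=4, width=64):
        super().__init__()
        self.gate = nn.Linear(3, n_experts)            # gating network over 3D points
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(3, width), nn.ReLU(), nn.Linear(width, 4))
            for _ in range(n_experts)])                # each expert predicts RGB + density

    def forward(self, pts):                            # pts: (N, 3) sampled points
        weight, idx = self.gate(pts).softmax(-1).max(-1)   # top-1 expert per point
        out = torch.zeros(pts.shape[0], 4)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():                             # dispatch this expert's points
                out[mask] = weight[mask].unsqueeze(1) * expert(pts[mask])
        return out                                     # gate weight keeps gating trainable

print(GatedExperts()(torch.randn(1024, 3)).shape)      # torch.Size([1024, 4])
```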
To the best of our knowledge, we are the first to propose an applicable end-to-end sparse NeRF network with learning-based decomposition for large-scale scenes.","Neural Radiance Fields, Mixture of Experts, Large-scale scene, Novel view synthesis, Sparse network" "Efficient, Stable, and Analytic Differentiation of the Sinkhorn Loss",https://openreview.net/forum?id=uATOkwOZaI,https://openreview.net/pdf?id=uATOkwOZaI,"We have derived an analytic solution coupled with a stable and efficient algorithm for the differentiation of the Sinkhorn loss, as an approximation to the Wasserstein distance for optimal transport problems.","Optimal transport and the Wasserstein distance have become indispensable building blocks of modern deep generative models, but their computational costs greatly prohibit their applications in statistical machine learning models. Recently, the Sinkhorn loss, as an approximation to the Wasserstein distance, has gained massive popularity, and much work has been done on its theoretical properties. To embed the Sinkhorn loss into gradient-based learning frameworks, efficient algorithms for both the forward and backward passes of the Sinkhorn loss are required. In this article, we first demonstrate issues of the widely-used Sinkhorn's algorithm, and show that the L-BFGS algorithm is a potentially better candidate for the forward pass. Then we derive an analytic form of the derivative of the Sinkhorn loss with respect to the input cost matrix, which results in a very efficient backward algorithm. We rigorously analyze the convergence and stability properties of the advocated algorithms, and use various numerical experiments to validate the superior performance of the proposed methods.","Optimal transport, Wasserstein distance, Sinkhorn loss, differentiable optimization, deep generative model" A Holistic View of Noise Transition Matrix in Deep Learning and Beyond,https://openreview.net/forum?id=aFzaXRImWE,https://openreview.net/pdf?id=aFzaXRImWE,,"In this paper, we explore learning statistically consistent classifiers under label noise by estimating the noise transition matrix T. We first provide a holistic view of existing T-estimation methods including those with or without anchor point assumptions. We unify them into the Minimum Geometric Envelope Operator (MGEO) framework, which tries to find the smallest T (in terms of a certain metric) that elicits a convex hull to enclose the posteriors of all the training data. Although MGEO methods show appealing theoretical properties and empirical results, we find them prone to failing when the noisy posterior estimation is imperfect, which is inevitable in practice. Specifically, we show that MGEO methods are inconsistent even with infinite samples if the noisy posterior is not estimated accurately. In view of this, we make the first effort to address this issue by proposing a novel T-estimation framework via the lens of bilevel optimization, and term it RObust Bilevel OpTimization (ROBOT). ROBOT paves a new road beyond the MGEO framework, and enjoys strong theoretical properties: identifiability, consistency, and finite-sample generalization guarantees. Notably, ROBOT neither requires perfect posterior estimation nor assumes the existence of anchor points. We further theoretically demonstrate that ROBOT is more robust in cases where MGEO methods fail.
Experimentally, our framework also shows superior performance across multiple benchmarks.", Active Learning in Bayesian Neural Networks with Balanced Entropy Learning Principle,https://openreview.net/forum?id=ZTMuZ68B1g,https://openreview.net/pdf?id=ZTMuZ68B1g,We propose a new bayesian active learning principle.,"Acquiring labeled data is challenging in many machine learning applications with limited budgets. Active learning gives a procedure to select the most informative data points and improve data efficiency by reducing the cost of labeling. The info-max learning principle of maximizing mutual information, as in BALD, has been successful and widely adopted in various active learning applications. However, this pool-based objective inherently introduces redundant selection and further requires a high computational cost for batch selection. In this paper, we design and propose a new uncertainty measure, Balanced Entropy Acquisition (BalEntAcq), which captures the information balance between the uncertainty of underlying softmax probability and the label variable. To do this, we approximate each marginal distribution by a Beta distribution. The Beta approximation enables us to formulate BalEntAcq as a ratio between an augmented entropy and the marginalized joint entropy. The closed-form expression of BalEntAcq facilitates parallelization by estimating two parameters in each marginal Beta distribution. BalEntAcq is a purely standalone measure without requiring any relational computations with other data points. Nevertheless, BalEntAcq captures a well-diversified selection near the decision boundary with a margin, unlike other existing uncertainty measures such as BALD, Entropy, or Mean Standard Deviation (MeanSD). Finally, we demonstrate that our balanced entropy learning principle with BalEntAcq consistently outperforms well-known linearly scalable active learning methods, including a recently proposed PowerBALD, a simple but diversified version of BALD, by showing experimental results obtained from MNIST, CIFAR-100, SVHN, and TinyImageNet datasets.","bayesian neural network, bayesian active learning, balanced entropy learning, uncertainty quantification" Near-Optimal Adversarial Reinforcement Learning with Switching Costs,https://openreview.net/forum?id=i9ogGQHYbkY,https://openreview.net/pdf?id=i9ogGQHYbkY,"This paper provides the first algorithms with near-optimal regrets for adversarial reinforcement learning with switching costs, and a matching lower bound on the regret.","Switching costs, which capture the costs for changing policies, are regarded as a critical metric in reinforcement learning (RL), in addition to the standard metric of losses (or rewards). However, existing studies on switching costs have mainly focused on static RL, where the loss distribution is assumed to be fixed during the learning process, and thus practical scenarios where the loss distribution could be non-stationary or even adversarial are not considered. While adversarial RL better models this type of practical scenarios, an open problem remains: how to develop a provably efficient algorithm for adversarial RL with switching costs? This paper makes the first effort towards solving this problem. First, we provide a regret lower bound showing that the regret of any algorithm must be larger than $\tilde{\Omega}( ( H S A )^{1/3} T^{2/3} )$, where $T$, $S$, $A$ and $H$ are the number of episodes, states, actions and layers in each episode, respectively.
Our lower bound indicates that, due to the fundamental challenge of switching costs in adversarial RL, the best achieved regret (whose dependency on $T$ is $\tilde{O}(\sqrt{T})$) in static RL with switching costs (as well as adversarial RL without switching costs) is no longer achievable. Moreover, we propose two novel switching-reduced algorithms with regrets that match our lower bound when the transition function is known, and match our lower bound within a small factor of $\tilde{O}( H^{1/3} )$ when the transition function is unknown. Our regret analysis demonstrates their near-optimal performance.","adversarial reinforcement learning, switching costs, regret analysis, lower bound" Bias Mitigation Framework for Intersectional Subgroups in Neural Networks,https://openreview.net/forum?id=8YnDrbx8bnh,https://openreview.net/pdf?id=8YnDrbx8bnh,This paper proposes a bias mitigation approach for intersectional subgroups.,"We propose a fairness-aware learning framework that mitigates intersectional subgroup bias associated with protected attributes. Prior research has primarily focused on mitigating one kind of bias by incorporating complex fairness-driven constraints into optimization objectives or designing additional layers that focus on specific protected attributes. We introduce a simple and generic bias mitigation framework that prevents models from learning relationships between protected attributes and the output variable by reducing mutual information. We demonstrate that our approach is effective in reducing bias with little or no drop in accuracy. We also show that our approach mitigates intersectional bias even when other attributes in the dataset are correlated with protected attributes. Finally, we validate our approach by studying feature interactions between protected and non-protected attributes. We demonstrate that these interactions are significantly reduced when applying our bias mitigation. ","Fairness, Feature Interactions, Bias Mitigation" NORM: Knowledge Distillation via N-to-One Representation Matching,https://openreview.net/forum?id=CRNwGauQpb6,https://openreview.net/pdf?id=CRNwGauQpb6,This paper presents a new knowledge distillation method via n-to-one representation matching,"Existing feature distillation methods commonly adopt the One-to-one Representation Matching between each pre-selected teacher-student layer pair. In this paper, we present $N$-to-$O$ne $R$epresentation $M$atching (NORM), a new two-stage knowledge distillation method, which relies on a linear Feature Transform (FT) module. To preserve the intact information learnt by the teacher network, our FT module, consisting of two linear layers, is merely inserted after the last convolutional layer of the student network during training. The first linear layer projects the student representation to a feature space having N times more feature channels than the teacher representation from the last convolutional layer, and the second linear layer contracts the expanded output back to the original feature space. By splitting the expanded student representation into N non-overlapping segments having the same number of feature channels as the teacher's, they can be forced to approximate the intact teacher representation simultaneously, formulating a novel many-to-one representation matching mechanism conditioned on a single teacher-student layer pair.
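As an illustrative aside, the N-to-one matching mechanism can be sketched in a few lines: expand the student feature N-fold, split it into N segments, and regress every segment onto the single teacher feature. The channel counts, and the use of 1x1 convolutions as the two linear layers, are assumptions made for the sketch.

```python
# Sketch of N-to-one representation matching (dimensions illustrative).
import torch
import torch.nn as nn

N, c_student, c_teacher = 4, 256, 512
expand = nn.Conv2d(c_student, N * c_teacher, kernel_size=1)    # first linear layer
contract = nn.Conv2d(N * c_teacher, c_student, kernel_size=1)  # second linear layer

f_s = torch.randn(8, c_student, 7, 7)                  # student's last conv output
f_t = torch.randn(8, c_teacher, 7, 7)                  # teacher's last conv output

segments = expand(f_s).chunk(N, dim=1)                 # N segments of c_teacher channels
distill_loss = sum(((seg - f_t) ** 2).mean() for seg in segments) / N
f_out = contract(expand(f_s))                          # passed on to the student head
print(distill_loss.item(), f_out.shape)
```

Because both layers are linear, folding them into the following fully connected layer after training is a matrix product, which is why no extra inference cost remains.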
After training, such an FT module will be merged into the subsequent fully connected layer thanks to its linear property, introducing no extra parameters or architectural modifications to the student network at inference. Extensive experiments on CIFAR100 and ImageNet with various teacher-student network pairs show the competitive performance of NORM. Code will be released.","Knowledge distillation, model compression, image classification" Downstream Datasets Make Surprisingly Good Pretraining Corpora,https://openreview.net/forum?id=gMOhS9EvJDX,https://openreview.net/pdf?id=gMOhS9EvJDX,,"For most natural language processing tasks, the dominant practice is to finetune large pretrained transformer models (e.g., BERT) using smaller downstream datasets. Despite the success of this approach, it remains unclear to what extent these gains are attributable to the massive background corpora employed for pretraining versus to the pretraining objectives themselves. This paper introduces a large-scale study of self-pretraining, where the same (downstream) training data is used for both pretraining and finetuning. In experiments addressing both ELECTRA and RoBERTa models and 10 distinct downstream datasets, we observe that self-pretraining rivals standard pretraining on the BookWiki corpus (despite using around $10\times$--$500\times$ less data), outperforming the latter on $7$ and $5$ datasets, respectively. Surprisingly, these task-specific pretrained models often perform well on other tasks, including the GLUE benchmark. Our results suggest that in many scenarios, performance gains attributable to pretraining are driven primarily by the pretraining objective itself and are not always attributable to the incorporation of massive datasets. These findings are especially relevant in light of concerns about intellectual property and offensive content in web-scale pretraining data.", Revisiting Embeddings for Graph Neural Networks,https://openreview.net/forum?id=xYA5j3IH19I,https://openreview.net/pdf?id=xYA5j3IH19I,We question current graph neural network embedding quality and present new GNN techniques to use large models (pre-trained or trained from scratch) to work directly on graph-connected data,"Current graph representation learning techniques use Graph Neural Networks (GNNs) to extract features from dataset embeddings. In this work, we examine the quality of these embeddings and assess how changing them can affect the accuracy of GNNs. We explore different embedding extraction techniques for both images and texts, and find that the choice of embedding biases the performance of different GNN architectures, and thus influences the selection of GNNs, regardless of the underlying dataset. In addition, we only see an improvement in accuracy from some GNN models compared to the accuracy of models trained from scratch or fine-tuned on the underlying data without utilising the graph connections. As an alternative, we propose Graph-connected Network (GraNet) layers to better leverage existing unconnected models within a GNN. Existing language and vision models are thus improved by allowing neighbourhood aggregation.
This gives a chance for the model to use pre-trained weights, if possible, and we demonstrate that this approach improves the accuracy compared to traditional GNNs: on Flickr v2, GraNet beats GAT2 and GraphSAGE by 7.7% and 1.7% respectively.","Graph Neural Networks, Embeddings, Graph Attention, Large Pretrained Models, Transfer Learning" NOAH: A New Head Structure To Improve Deep Neural Networks For Image Classification,https://openreview.net/forum?id=eZ5KJo5GwSI,https://openreview.net/pdf?id=eZ5KJo5GwSI,This paper presents a simple and universal head structure to improve the representation learning of deep neural networks for image classification.,"A modern deep neural network (DNN) for image classification typically consists of two parts: a backbone for feature extraction, and a head for feature encoding and class prediction. We notice that the head structures of prevailing DNNs share a similar processing pipeline, exploiting global feature dependencies while disregarding local ones. Instead, this paper presents $N$on-gl$O$bal $A$ttentive $H$ead (NOAH), a simple and universal head structure, to improve the learning capacity of DNNs. NOAH relies on a novel form of attention dubbed pairwise object category attention, which models dense local-to-global feature dependencies via a concise association of feature split, interaction and aggregation operations. As a drop-in design, NOAH can replace existing heads of many DNNs, and meanwhile, maintains almost the same model size and similar model efficiency. We validate the efficacy of NOAH mainly on the large-scale ImageNet dataset with various DNN architectures that span convolutional neural networks, vision transformers and multi-layer perceptrons when training from scratch. Without bells and whistles, experiments show that: (a) NOAH can significantly boost the performance of lightweight DNNs, e.g., bringing $3.14\%$|$5.30\%$|$1.90\%$ top-1 accuracy improvement for MobileNetV2 ($0.5\times$)|Deit-Tiny ($0.5\times$)|gMLP-Tiny ($0.5\times$); (b) NOAH can generalize well on relatively large DNNs, e.g., bringing $1.02\%$|$0.78\%$|$0.91\%$ top-1 accuracy improvement for ResNet50|Deit-Small|MLP-Mixer-Small; (c) NOAH can still bring acceptable performance gains to large DNNs (having over 50 million parameters), e.g., $0.41\%$|$0.37\%$|$0.35\%$ top-1 accuracy improvement for ResNet152|Deit-Base|MLP-Mixer-Base. Besides, NOAH also retains its effectiveness in the aggressive training regime (e.g., a ResNet50 model with NOAH reaches $79.32\%$ top-1 accuracy, yielding $0.88\%$ gain) and other image classification tasks. Code is provided for results reproduction.","Deep neural network, convolutional neural network, vision transformer, multi-layer perceptron, image classification" Empirical analysis of representation learning and exploration in neural kernel bandits,https://openreview.net/forum?id=e9rdb24Yzqx,https://openreview.net/pdf?id=e9rdb24Yzqx,Neural kernel bandits achieve better performance than neural-linear on complex UCI datasets. Impact of NK distributions on exploration varies with task complexity and need to explore.,"Neural bandits have been shown to provide an efficient solution to practical sequential decision tasks that have nonlinear reward functions. The main contributor to that success is approximate Bayesian inference, which enables neural network (NN) training with uncertainty estimates. However, Bayesian NNs often suffer from a prohibitive computational overhead or operate on a subset of parameters.
Alternatively, certain classes of infinite neural networks were shown to directly correspond to Gaussian processes (GP) with neural kernels (NK). NK-GPs provide accurate uncertainty estimates and can be trained faster than most Bayesian NNs. We propose to guide common bandit policies with NK distributions and show that NK bandits achieve state-of-the-art performance on nonlinear structured data. Moreover, we propose a framework for measuring independently the ability of a bandit algorithm to learn representations and explore, and use it to analyze the impact of NK distributions w.r.t. those two aspects. We consider policies based on a GP and a Student's t-process (TP). Furthermore, we study practical considerations, such as training frequency and model partitioning. We believe our work will help better understand the impact of utilizing NKs in applied settings.","neural bandits, contextual bandits, gaussian process, neural tangent kernel, neural kernel" S^2-Transformer for Mask-Aware Hyperspectral Image Reconstruction,https://openreview.net/forum?id=WgG3bpUiSqE,https://openreview.net/pdf?id=WgG3bpUiSqE,,"The technology of hyperspectral imaging (HSI) records the visual information upon long-range-distributed spectral wavelengths. A representative hyperspectral image acquisition procedure conducts a 3D-to-2D encoding by the coded aperture snapshot spectral imager (CASSI), and requires a software decoder for the 3D signal reconstruction. By observing this physical encoding procedure, two major challenges may stand in the way of a high-fidelity reconstruction: (i) To obtain 2D measurements, CASSI dislocates multiple channels by disperser-tilting and squeezes them onto the same spatial region, yielding an entangled data loss. (ii) The physical coded aperture (mask) will lead to a masked data loss by selectively blocking the pixel-wise light exposure. To tackle these challenges, we propose a spatial-spectral (S2-) transformer architecture with a mask-aware learning strategy. Firstly, we simultaneously leverage spatial and spectral attention modeling to disentangle the blended information in the 2D measurement along both dimensions. A series of Transformer structures across spatial & spectral cues is systematically designed, considering the information inter-dependency between the two-fold cues. Secondly, the masked pixels will induce higher prediction difficulty and should be treated differently from unmasked ones. Thereby, we adaptively prioritize the loss penalty attributed to the mask structure by inferring the difficulty-level upon the mask-aware prediction. Our proposed method not only sets a new state-of-the-art quantitatively, but also yields a better perceptual quality on structured areas. Code and pre-trained models are available at https://anonymous.4open.science/r/S2-transformer-HSI-FEBF.", Exploiting Spatial Separability for Deep Learning Multichannel Speech Enhancement with an Align-and-Filter Network,https://openreview.net/forum?id=DQou0RiwkR0,https://openreview.net/pdf?id=DQou0RiwkR0,This paper presents an Align-and-Filter network to study spatial separability of sound sources for deep learning multichannel speech enhancement by incorporating relative transfer functions for signal alignment with sequential masking network design.,"Multichannel speech enhancement (SE) systems separate the target speech from background noise by performing spatial and spectral filtering.
The development of multichannel SE has a long history in the signal processing field, where one crucial step is to exploit spatial separability of sound sources by aligning the microphone signals in response to the target speech source prior to further filtering processes. However, most existing deep learning based multichannel SE works have yet to effectively incorporate or emphasize this spatial alignment aspect in the network design – we postulate that this is owing to the lack of suitable datasets with sufficient spatial diversity of the speech sources. In this paper, we highlight this important but often overlooked step in deep learning based multichannel SE, i.e., signal alignment, by introducing an Align-and-Filter network (AFnet) featuring a two-stage sequential masking design. The AFnet estimates two sets of masks, the alignment masks and filtering masks, and multiplies the estimated masks with the respective input signals to each stage sequentially, while leveraging the relative transfer functions (RTFs) for guiding the model to align signals with various speech source locations during training. For exploration purposes, we argue that the popular CHiME-3 multichannel dataset has its own limitation in representing spatially diverse speech data, as the speakers were mostly located at the front side, and thereby adopt simulated and real-world measured room impulse responses to generate multichannel recordings where the target speech sources might come from arbitrary directions. Our findings suggest that for spatially diverse speaker scenarios, careful consideration of exploiting spatial characteristics is of great importance for deep learning based multichannel SE, especially when the number of microphones increases. We show that utilizing the RTFs for signal alignment purposes in the two-stage, sequential masking framework consistently improves the capability of the network to separate the target speech from the noise signals, supporting that spatial separability is being effectively exploited by the proposed model. Our studies advocate for the advantages and significance of considering the signal alignment aspect, wisdom from conventional signal processing, for developing future deep learning based multichannel SE algorithms to improve enhancement outcomes in positionally diverse target speech scenarios.","Multichannel speech enhancement, microphone array beamforming, spatial filtering, signal alignment, relative transfer functions" A deep top-down approach to hierarchically coherent probabilistic forecasting ,https://openreview.net/forum?id=dMSxTUlQgrZ,https://openreview.net/pdf?id=dMSxTUlQgrZ,Deep top-down proportions model for coherent probabilistic hierarchical forecasting,"Probabilistic, hierarchically coherent forecasting is a key problem in many practical forecasting applications -- the goal is to obtain coherent probabilistic predictions for a large number of time series arranged in a pre-specified tree hierarchy. In this paper, we present a probabilistic top-down approach to hierarchical forecasting that uses a novel attention-based RNN model to learn the distribution of the proportions according to which each parent prediction is split among its children nodes at any point in time. These probabilistic proportions are then coupled with an independent univariate probabilistic forecasting model for the root time series. The resulting forecasts are naturally coherent, and provide probabilistic predictions over all time series in the hierarchy.
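As an illustrative aside, coherence-by-construction in a top-down model follows directly from splitting root samples by softmax proportions. A minimal sketch with a hypothetical two-child hierarchy:

```python
# Assemble coherent top-down forecasts: sample the root, then split each
# sample by predicted proportions (softmax keeps them summing to one).
import numpy as np

rng = np.random.default_rng(0)
root_samples = rng.normal(100.0, 5.0, size=1000)       # probabilistic root forecast
logits = np.array([0.4, -0.1])                         # proportion model output
p = np.exp(logits) / np.exp(logits).sum()              # children proportions

child_samples = root_samples[:, None] * p[None, :]     # (1000, 2), coherent by design
assert np.allclose(child_samples.sum(1), root_samples) # children always add to root
print(p, child_samples.mean(0))
```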
We experiment on several public datasets and demonstrate significant improvements of up to 27% on most datasets compared to state-of-the-art probabilistic hierarchical models. Finally, we also provide theoretical justification for the superiority of our top-down approach compared to traditional bottom-up modeling.","Hierarchical Forecasting, Time-Series" CroMA: Cross-Modality Adaptation for Monocular BEV Perception,https://openreview.net/forum?id=V8xIHUK3c5Sr,https://openreview.net/pdf?id=V8xIHUK3c5Sr,We propose a Cross-Modality Adaptation (CroMA) framework to learn a robust monocular BEV perception model under sensor shift and domain gaps.,"Incorporating multiple sensor modalities, and closing the domain gaps between training and deployment are two challenging yet critical topics for self-driving. Existing adaptation works only focus on visual-level domain gaps, overlooking the sensor-type gaps which exist in reality. A model trained with a collection of sensor modalities may need to run on another setting with fewer types of sensors available. In this work, we propose a Cross-Modality Adaptation (CroMA) framework to facilitate the learning of a more robust monocular BEV perception model, which transfers point cloud knowledge from the LiDAR sensor during the training phase to the camera-only testing scenario. The absence of LiDAR during testing precludes its use as a model input. Hence, our key idea lies in the design of a LiDAR-teacher and Camera-student knowledge distillation model, as well as a multi-level adversarial learning mechanism, which adapt and align the features learned from different sensors and domains. This work results in the first open analysis of cross-domain perception and cross-sensor adaptation models for monocular 3D tasks in the wild. We benchmark our approach on large-scale datasets under various domain shifts and show state-of-the-art results against various baselines.","Multi-Modality, Cross-Modality, Domain Adaptation, Autonomous Driving" CausalAgents: A Robustness Benchmark for Motion Forecasting Using Causal Relationships,https://openreview.net/forum?id=9WdB5yVICCA,https://openreview.net/pdf?id=9WdB5yVICCA,We construct a benchmark to measure the robustness of motion forecasting models for autonomous driving; we find models are sensitive to deleting irrelevant agents from the scene.,"As machine learning models become increasingly prevalent in motion forecasting systems for autonomous vehicles (AVs), it is critical that we ensure that model predictions are safe and reliable. However, exhaustively collecting and labeling the data necessary to fully test the long tail of rare and challenging scenarios is difficult and expensive. In this work, we construct a new benchmark for evaluating and improving model robustness by applying perturbations to existing data. Specifically, we conduct an extensive labeling effort to identify causal agents, or agents whose presence influences human driver behavior in any way, in the Waymo Open Motion Dataset (WOMD), and we use these labels to perturb the data by deleting non-causal agents from the scene. We then evaluate a diverse set of state-of-the-art deep-learning model architectures on our proposed benchmark and find that all models exhibit large shifts under perturbation. Under non-causal perturbations, we observe a 25-38% relative change in minADE as compared to the original.
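As an illustrative aside, the deletion perturbation itself is simple once the causal labels exist; the sketch below uses hypothetical field names rather than the WOMD schema.

```python
# Deletion-style robustness perturbation: drop the non-causal agents and
# re-run the forecaster on the remaining scene.
from dataclasses import dataclass

@dataclass
class Agent:
    track: list          # past (x, y) positions
    is_causal: bool      # human label: does this agent influence the AV driver?

def perturb_scene(agents: list[Agent]) -> list[Agent]:
    return [a for a in agents if a.is_causal]

scene = [Agent([(0, 0), (1, 0)], True), Agent([(5, 5), (5, 6)], False)]
print(len(perturb_scene(scene)))   # 1: the non-causal agent was removed;
                                   # a robust model's minADE should barely change
```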
We then investigate techniques to improve model robustness, including increasing the training dataset size and using targeted data augmentations that drop agents throughout training. We provide the causal agent labels as an additional attribute to WOMD and release the robustness benchmarks to aid the community in building more reliable and safe deep-learning models for motion forecasting. ","robustness, motion forecasting, self-driving cars" Adaptive Client Sampling in Federated Learning via Online Learning with Bandit Feedback,https://openreview.net/forum?id=f3dqV4KLZV1,https://openreview.net/pdf?id=f3dqV4KLZV1,An adaptive client sampling approach in federated learning,"Due to the high cost of communication, federated learning (FL) systems need to sample a subset of clients that are involved in each round of training. As a result, client sampling plays an important role in FL systems as it affects the convergence rate of optimization algorithms used to train machine learning models. Despite its importance, there is limited work on how to sample clients effectively. In this paper, we cast client sampling as an online learning task with bandit feedback, which we solve with an online stochastic mirror descent (OSMD) algorithm designed to minimize the sampling variance. We then theoretically show how our sampling method can improve the convergence speed of optimization algorithms. To handle the tuning parameters in OSMD that depend on the unknown problem parameters, we use the online ensemble method and the doubling trick. We prove a dynamic regret bound relative to any sampling sequence. The regret bound depends on the total variation of the comparator sequence, which naturally captures the intrinsic difficulty of the problem. To the best of our knowledge, these theoretical contributions are new and the proof technique is of independent interest. Through both synthetic and real data experiments, we illustrate the advantages of the proposed client sampling algorithm over the widely used uniform sampling and existing online learning based sampling strategies. The proposed adaptive sampling procedure is applicable beyond the FL problem studied here and can be used to improve the performance of stochastic optimization procedures such as stochastic gradient descent and stochastic coordinate descent.","Federated Learning, Client Sampling, Optimization" Dynamical Isometry for Residual Networks,https://openreview.net/forum?id=n-5qp16As_C,https://openreview.net/pdf?id=n-5qp16As_C,We derive an initialization scheme for ResNets that induces perfect dynamical isometry at initialization.,"The training success, training speed and generalization ability of neural networks rely crucially on the choice of random parameter initialization. It has been shown for multiple architectures that initial dynamical isometry is particularly advantageous. Known initialization schemes for residual blocks, however, miss this property and suffer from degrading separability of different inputs with increasing depth, instability without Batch Normalization, or a lack of feature diversity. We propose a random initialization scheme, Risotto, that achieves perfect dynamical isometry for residual networks with ReLU activation functions even for finite depth and width. It balances the contributions of the residual and skip branches, unlike other schemes, which initially bias towards the skip connections.
In experiments, we demonstrate that in most cases our approach outperforms initialization schemes proposed to make Batch Normalization obsolete, including Fixup and SkipInit, and facilitates stable training. Also in combination with Batch Normalization, we find that Risotto often achieves the overall best result. ","deep learning, parameter initialization, dynamical isometry, ResNets" How Erdös and Rényi Win the Lottery,https://openreview.net/forum?id=jgsGOtbktux,https://openreview.net/pdf?id=jgsGOtbktux,We prove that random networks contain lottery tickets with high probability.,"Random masks define surprisingly effective sparse neural network models, as has been shown empirically. The resulting Erd\"os-R\'enyi (ER) random graphs can often compete with dense architectures and state-of-the-art lottery ticket pruning algorithms struggle to outperform them, even though the random baselines do not rely on computationally expensive pruning-training iterations but can be drawn initially without significant computational overhead. We offer a theoretical explanation of how such ER masks can approximate arbitrary target networks if they are wider by a logarithmic factor in the inverse sparsity $1 / \log(1/\text{sparsity})$. While we are the first to show theoretically and experimentally that random ER source networks contain strong lottery tickets, we also prove the existence of weak lottery tickets that require a lower degree of overparametrization than strong lottery tickets. These unusual results are based on the observation that ER masks are well trainable in practice, which we verify in experiments with varied choices of random masks. Some of these data-free choices outperform previously proposed random approaches on standard image classification benchmark datasets. ","deep learning, lottery tickets, pruning at initialization, random masks, theory" PVT++: A Simple End-to-End Latency-Aware Visual Tracking Framework,https://openreview.net/forum?id=adT0c0pxbfZ,https://openreview.net/pdf?id=adT0c0pxbfZ,This work proposes a simple framework for end-to-end latency aware visual tracking.,"Visual object tracking is an essential capability of intelligent robots. Most existing approaches have ignored the online latency that can cause severe performance degradation during real-world processing. Especially for unmanned aerial vehicles, where robust tracking is more challenging and onboard computation is limited, the latency issue could be fatal. In this work, we present a simple framework for end-to-end latency-aware tracking, i.e., end-to-end predictive visual tracking (PVT++). PVT++ is capable of turning most leading-edge trackers into predictive trackers by appending an online predictor. Unlike existing solutions that use model-based approaches, our framework is learnable, such that it can take not only motion information as input but also visual cues, or a combination of both. Moreover, since PVT++ is end-to-end optimizable, it can further boost the latency-aware tracking performance. Additionally, this work presents an extended latency-aware evaluation benchmark for assessing an \textit{any-speed} tracker in the online setting. Empirical results on a robotic platform from an aerial perspective show that the motion-based PVT++ can obtain on par or better performance than existing approaches.
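As an illustrative aside, the core of latency-aware prediction is to output the box for the time at which processing will finish rather than the time at which the frame arrived. The constant-velocity sketch below is a stand-in for PVT++'s learned motion/visual predictor.

```python
# Latency compensation: extrapolate the tracker's last boxes to the
# timestamp at which the current frame's processing will complete.
import numpy as np

def predict_box(boxes: np.ndarray, times: np.ndarray, latency: float) -> np.ndarray:
    # boxes: (T, 4) past [cx, cy, w, h]; times: (T,) their timestamps in seconds
    v = (boxes[-1] - boxes[-2]) / (times[-1] - times[-2])   # velocity estimate
    return boxes[-1] + v * latency   # box where the target will be when we finish

boxes = np.array([[10, 10, 5, 5], [12, 11, 5, 5]], dtype=float)
print(predict_box(boxes, np.array([0.0, 0.04]), latency=0.08))
```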
Further incorporating visual information and joint training techniques, PVT++ can achieve significant performance gains on various trackers and exhibits better robustness than prior model-based solutions, essentially removing the degradation brought by onboard latency.","Latency-aware perception, aerial tracking, visual tracking benchmark" GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation,https://openreview.net/forum?id=IowKt5rYWsK,https://openreview.net/pdf?id=IowKt5rYWsK,A high-resolution vision transformer architecture based on a new efficient global information exchange mechanism for general visual recognition.,"We present the Group Propagation Vision Transformer (GPViT): a novel non-hierarchical (i.e. non-pyramidal) transformer model designed for general visual recognition with high-resolution features. High-resolution features (or tokens) are a natural fit for tasks that involve perceiving fine-grained details such as detection and segmentation, but exchanging global information between these features is expensive in memory and computation because of the way self-attention scales. We provide a highly efficient alternative Group Propagation Block (GP Block) to exchange global information. In each GP Block, features are first grouped together by a fixed number of learnable group tokens; we then perform Group Propagation where global information is exchanged between the grouped features; finally, global information in the updated grouped features is returned back to the image features through a transformer decoder. We evaluate GPViT on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation. We show state-of-the-art performance across all tasks and significant gains over previous approaches on tasks requiring high-resolution outputs, for instance, our GPViT outperforms Swin Transformer-B by 2.0 mIoU on ADE20K semantic segmentation with only half as many parameters.","Visual Recognition, Vision transformer architecture" Variational Imbalanced Regression,https://openreview.net/forum?id=-ltZ1uw8ZE7,https://openreview.net/pdf?id=-ltZ1uw8ZE7,"We propose a probabilistic deep learning model, dubbed variational imbalanced regression (VIR), which not only performs well in imbalanced regression but naturally produces reasonable uncertainty estimation as a byproduct.","Existing regression models tend to fall short in both accuracy and uncertainty estimation when the label distribution is imbalanced. In this paper, we propose a probabilistic deep learning model, dubbed variational imbalanced regression (VIR), which not only performs well in imbalanced regression but naturally produces reasonable uncertainty estimation as a byproduct. Different from typical variational autoencoders assuming I.I.D. representation (a data point's representation is not directly affected by other data points), our VIR borrows data with similar regression labels to compute the latent representation's variational distribution; furthermore, different from deterministic regression models producing point estimates, VIR predicts the entire normal-inverse-gamma distributions and modulates the associated conjugate distributions to impose probabilistic reweighting on the imbalanced data, thereby providing better uncertainty estimation. Experiments on several real-world datasets show that our VIR can outperform state-of-the-art imbalanced regression models in terms of both accuracy and uncertainty estimation.
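As an illustrative aside, a normal-inverse-gamma output head of the kind VIR builds on can be sketched as follows; the parameterization mirrors standard evidential regression, and the layer sizes are arbitrary assumptions.

```python
# Normal-inverse-gamma head: predicts (gamma, nu, alpha, beta) and derives
# aleatoric and epistemic uncertainty from the conjugate distribution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NIGHead(nn.Module):
    def __init__(self, d_in):
        super().__init__()
        self.out = nn.Linear(d_in, 4)                   # gamma, nu, alpha, beta

    def forward(self, h):
        gamma, nu, alpha, beta = self.out(h).unbind(-1)
        nu, beta = F.softplus(nu), F.softplus(beta)
        alpha = F.softplus(alpha) + 1.0                 # alpha > 1 for finite variance
        aleatoric = beta / (alpha - 1.0)                # E[sigma^2]
        epistemic = beta / (nu * (alpha - 1.0))         # Var[mu]
        return gamma, aleatoric, epistemic

mu, alea, epis = NIGHead(32)(torch.randn(16, 32))
print(mu.shape, alea.mean().item(), epis.mean().item())
```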
","probabilistic methods, variational inference, imbalanced regression, uncertainty estimation" EMO: Episodic Memory Optimization for Few-Shot Meta-Learning,https://openreview.net/forum?id=ZXu1S-wdy6d,https://openreview.net/pdf?id=ZXu1S-wdy6d,"We propose an episodic memory optimization for meta-learning, which we call EMO, that retains the gradient history of past experienced tasks in external memory. ","For few-shot meta-learning, gradient descent optimization is challenging due to the limited number of training samples per task. Inspired by the human ability to recall past learning experiences from the brain’s memory, we propose an episodic memory optimization for meta-learning, which we call EMO, that retains the gradient history of past experienced tasks in external memory. It enables few-shot learning in a memory-augmented way by leveraging the meta-learning setting and learns to retain and recall the learning process of past training tasks for gradient descent optimization. By doing so, EMO nudges the parameter updates in the right direction, even when the gradients provided by a limited number of examples are uninformative. Additionally, we prove theoretically that our algorithm converges for smooth, strongly convex objectives. EMO is generic, flexible, and model agnostic, making it a simple plug-and-play optimizer seamlessly embedded into existing optimization-based meta-learning approaches. Empirically, EMO scales well with most of the few-shot classification benchmarks, and our experiments show that the optimization-based meta-learning method enjoys accelerated convergence and improved performance with EMO.","Episodic memory, Meta-learning, Few-shot learning, Optimization" Critic Sequential Monte Carlo,https://openreview.net/forum?id=ObtGcyKmwna,https://openreview.net/pdf?id=ObtGcyKmwna,We present a novel method called CriticSMC capable of being deployed in model-predictive planning and model-free online control cases within environments with hard constraints taking advantage of informative prior policies.,"We introduce CriticSMC, a new algorithm for planning as inference built from a composition of sequential Monte Carlo with learned Soft-Q function heuristic factors. These heuristic factors, obtained from parametric approximations of the marginal likelihood ahead, more effectively guide SMC towards the desired target distribution, which is particularly helpful for planning in environments with hard constraints placed sparsely in time. Compared with previous work, we modify the placement of such heuristic factors, which allows us to cheaply propose and evaluate large numbers of putative action particles, greatly increasing inference and planning efficiency. CriticSMC is compatible with informative priors, whose density function need not be known, and can be used as a model-free control algorithm. 
Our experiments on collision avoidance in a high-dimensional simulated driving task show that CriticSMC significantly reduces collision rates at a low computational cost while maintaining realism and diversity of driving behaviors across vehicles and environment scenarios.","sequential monte carlo, reinforcement learning as inference, soft Q-learning, heuristic factors, driving behavior models" Radial Spike and Slab Bayesian Neural Networks for Sparse Data in Ransomware Attacks,https://openreview.net/forum?id=SNZxVIFZBIq,https://openreview.net/pdf?id=SNZxVIFZBIq,,"Ransomware attacks are increasing at an alarming rate, leading to large financial losses, unrecoverable encrypted data, data leakage, and privacy concerns. The prompt detection of ransomware attacks is required to minimize further damage, particularly during the encryption stage. However, the frequency and structure of the observed ransomware attack data make this task difficult to accomplish in practice. The data corresponding to ransomware attacks represents temporal, high-dimensional sparse signals, with limited records and very imbalanced classes. While traditional deep learning models have been able to achieve state-of-the-art results in a wide variety of domains, Bayesian Neural Networks, which are a class of probabilistic models, are better suited to the issues of the ransomware data. These models combine ideas from Bayesian statistics with the rich expressive power of neural networks. In this paper, we propose the Radial Spike and Slab Bayesian Neural Network, which is a new type of Bayesian Neural Network that includes a new form of the approximate posterior distribution. The model scales well to large architectures and recovers the sparse structure of target functions. We provide a theoretical justification for using this type of distribution, as well as a computationally efficient method to perform variational inference. We demonstrate the performance of our model on a real dataset of ransomware attacks and show improvement over a large number of baselines, including state-of-the-art models such as Neural ODEs (ordinary differential equations). In addition, we propose to represent low-level events as MITRE ATT&CK tactics, techniques, and procedures (TTPs), which allows the model to better generalize to unseen ransomware attacks.", Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning?,https://openreview.net/forum?id=8Oun8ZUVe8N,https://openreview.net/pdf?id=8Oun8ZUVe8N,This paper shows that pretrained 2D image Transformers can help self-supervised 3D representation learning by training autoencoders as cross-modal teachers.,"The success of deep learning heavily relies on large-scale data with comprehensive labels, which is more expensive and time-consuming to fetch in 3D compared to 2D images or natural languages. This motivates utilizing models pretrained on data from modalities other than 3D as teachers for cross-modal knowledge transfer. In this paper, we show that pretrained 2D image Transformers can help self-supervised 3D representation learning through training Autoencoders as Cross-Modal Teachers (ACT). Our proposed ACT involves two stages. In the first stage, the pretrained 2D Transformers are transferred as cross-modal 3D teachers using discrete variational autoencoding self-supervision. The 2D Transformers are frozen with prompt tuning for better knowledge inheritance.
In the second stage, the latent feature encoded by the 3D autoencoder teachers is used as the target of masked point modeling, wherein the dark knowledge is distilled to the 3D Transformer students as foundational geometry understanding. Our ACT pretrained 3D learner achieves state-of-the-art generalization capacity across various downstream benchmarks, e.g., 88.21% overall accuracy on the ScanObjectNN dataset. Codes have been included in the supplemental material and will be released on GitHub.","Representation Learning, Cross-Modal Learning, 3D Point Clouds" Explainability of deep reinforcement learning algorithms in robotic domains by using Layer-wise Relevance Propagation,https://openreview.net/forum?id=l2cryUoxvaX,https://openreview.net/pdf?id=l2cryUoxvaX,Explaining the policy learned by a deep reinforcement learning algorithm with graph networks as function approximators in robotic environments by using layer-wise relevance propagation technique.,"A key component to the recent success of reinforcement learning is the introduction of neural networks for representation learning. Doing so allows for solving challenging problems in several domains, one of which is robotics. However, a major criticism of deep reinforcement learning (DRL) algorithms is their lack of explainability and interpretability. This problem is even exacerbated in robotics, as robots oftentimes share space with humans, making it imperative to be able to reason about their behaviour. In this paper, we propose to analyze the learned representation in a robotic setting by utilizing graph neural networks. Using graph neural networks and Layer-wise Relevance Propagation (LRP), we represent the observations as entity-relationship graphs, allowing us to interpret the learned policy. We evaluate our approach in two environments in MuJoCo. These two environments were carefully designed to effectively measure the value of knowledge gained by our approach to analyzing learned representations. This approach allows us to analyze not only how different parts of the observation space contribute to the decision-making process but also to differentiate between policies and their differences in performance. This difference in performance also allows for reasoning about the agent's recovery from faults. These insights are key contributions to explainable deep reinforcement learning in robotic settings.","Explainability, Deep Reinforcement Learning, Graph Network, Layer-wise Relevance Propagation, Robotic" High Dimensional Bayesian Optimization with Reinforced Transformer Deep Kernels,https://openreview.net/forum?id=bl5pGwUQsZq,https://openreview.net/pdf?id=bl5pGwUQsZq,Transformer Deep Kernels combined with general combination gaussian process kernels help optimize high dimensional functions when using reinforcement learning acquisitions for exploration.,"Bayesian Optimization (BO) has proved to be an invaluable technique for efficient, high-dimensional optimization. The use of Gaussian process (GP) surrogates and dynamic acquisition functions has allowed BO to shine in challenging high-dimensional optimization due to its sample efficiency and uncertainty modeling. Reinforcement Learning has been introduced to improve optimization performance on both single-function optimization and \textit{few-shot} multi-objective optimization. However, until now, even few-shot techniques treat each objective as an independent optimization task, failing to exploit the similarities shared between objectives.
We combine recent developments in Deep Kernel Learning (DKL) and attention-based Transformer models to improve the modeling power of GP surrogates with meta-learning. We propose a method for improving meta-learning BO surrogates by incorporating attention mechanisms into DKL, empowering the surrogates to adapt to contextual information gathered during the BO process. This Transformer Deep Kernel is combined with Reinforcement Learning techniques to aid in exploration, yielding state-of-the-art results on a variety of high-dimensional optimization problems.","Bayesian Optimization, Reinforcement Learning, Deep Kernel Learning" Learning to Take a Break: Sustainable Optimization of Long-Term User Engagement,https://openreview.net/forum?id=fwP9Bc4E71,https://openreview.net/pdf?id=fwP9Bc4E71,We use Lotka-Volterra dynamics to learn optimal `take-a-break' schedules that promote sustainable media habits.,"Optimizing user engagement is a key goal for modern recommendation systems, but blindly pushing users towards increased consumption risks burn-out, churn, or even addictive habits. To promote digital well-being, most platforms now offer a service that periodically prompts users to take a break. These, however, must be set up manually, and so may be suboptimal for both users and the system. In this paper, we propose a framework for optimizing long-term engagement by learning individualized breaking policies. Using Lotka-Volterra dynamics, we model users as acting based on two balancing latent states: drive and interest---which must be conserved. We then give an efficient learning algorithm, provide theoretical guarantees, and empirically evaluate its performance on semi-synthetic data.","Lotka-Volterra dynamics, breaking policies, digital well-being, feed-based recommendation" "Laziness, Barren Plateau, and Noises in Machine Learning",https://openreview.net/forum?id=SDHSQuBpf2,https://openreview.net/pdf?id=SDHSQuBpf2,Variational quantum algorithms are lazy and noise-resilient in the overparametrization regime.,"We define \emph{laziness} to describe a large suppression of variational parameter updates for neural networks, classical or quantum. In the quantum case, the suppression is exponential in the number of qubits for randomized variational quantum circuits. We discuss the difference between laziness and the \emph{barren plateau} in quantum machine learning, introduced by quantum physicists in \cite{mcclean2018barren} to describe the flatness of the loss function landscape during gradient descent. We provide a novel theoretical understanding of these two phenomena in light of the theory of neural tangent kernels. For noiseless quantum circuits, without the measurement noise, the loss function landscape is complicated in the overparametrized regime with a large number of trainable variational angles. Instead, around a random starting point in optimization, there are large numbers of local minima that are good enough and could minimize the mean square loss function, where we still have quantum laziness, but we do not have barren plateaus. However, the complicated landscape is not visible given a limited number of iterations and the low precision of quantum control and quantum sensing. Moreover, we look at the effect of noise during optimization by assuming intuitive noise models, and show that variational quantum algorithms are noise-resilient in the overparametrization regime.
Our work precisely reformulates the quantum barren plateau statement as a statement about precision, justifies it in certain noise models, injects new hope for near-term variational quantum algorithms, and provides theoretical connections to classical machine learning. Our paper provides conceptual perspectives on quantum barren plateaus, together with discussions of gradient descent dynamics.","theoretical issues in deep learning, learning representations of outputs or states" HyperQuery: A Framework for Higher Order Link Prediction,https://openreview.net/forum?id=wYYCBNLEmBv,https://openreview.net/pdf?id=wYYCBNLEmBv,A new state-of-the-art hyperedge prediction framework for knowledge hypergraphs as well as regular hypergraphs,"Groups with complex set intersection relations are a natural way to model a wide array of data, from the formation of social groups to the complex protein interactions which form the basis of biological life. While graphs are a natural way to represent complex networks and are well studied, typical approaches to modeling group membership using graphs are lossy. Hypergraphs are a more natural way to represent such ``higher order'' relationships, but efforts to apply machine learning techniques to hypergraph structured datasets have been limited thus far. In this paper, we address the problem of link prediction in knowledge hypergraphs as well as regular hypergraphs and develop a novel, simple, and effective optimization architecture to solve this task. Additionally, we study how integrating data from node-level labels can improve the results of our system. Our self-supervised approach achieves significant improvement over state-of-the-art results on several hyperedge prediction and knowledge hypergraph completion benchmarks.","link prediction, Hyperedge prediction, Hypergraph learning, message passing, hypergraphs" Generative Model Based Noise Robust Training for Unsupervised Domain Adaptation,https://openreview.net/forum?id=yhLVkvwUGtH,https://openreview.net/pdf?id=yhLVkvwUGtH,,"Target domain pseudo-labeling has shown effectiveness in unsupervised domain adaptation (UDA). However, pseudo-labels of unlabeled target domain data are inevitably noisy due to the distribution shift between source and target domains. In this paper, we propose a generative model-based noise-robust training method (GeMo-NoRT), which addresses domain shift elimination and label noise robustness simultaneously. GeMo-NoRT incorporates a distribution-based class-wise feature augmentation (D-CFA) and a generative-discriminative classifier consistency (GDC), both based on the class-wise target distributions modeled by generative models. D-CFA minimizes the domain gap by augmenting the source data with distribution-sampled target features, and trains a noise-robust discriminative classifier by using target domain knowledge from the generative models. GDC regards all the class-wise generative models as a generative classifier and enforces a consistency regularization between the generative and discriminative classifiers. It exploits an ensemble of target knowledge from all the generative models to train a noise-robust discriminative classifier, and is eventually linked theoretically to the Ben-David domain adaptation theorem for reducing the domain gap. 
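For the HyperQuery paper above, a generic hyperedge-scoring sketch (permutation-invariant pooling followed by an MLP scorer) illustrates the basic shape of hyperedge prediction; it is not HyperQuery's actual architecture, and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy node embeddings; in practice these would be learned.
emb = {n: rng.normal(size=8) for n in "abcdef"}
W1, W2 = rng.normal(size=(8, 16)), rng.normal(size=16)

def score_hyperedge(nodes):
    """Score a candidate hyperedge by pooling member embeddings through an MLP.

    Mean-pooling keeps the score invariant to node order and edge size.
    """
    h = np.mean([emb[n] for n in nodes], axis=0)  # permutation-invariant pool
    return float(np.maximum(h @ W1, 0.0) @ W2)    # ReLU MLP scorer

print(score_hyperedge(["a", "b", "c"]))  # higher = more likely a true edge
print(score_hyperedge(["a", "e", "f"]))
```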
Extensive experiments on Office-Home, PACS, and Digit-Five show that our GeMo-NoRT achieves state of the art under single-source and multi-source UDA settings.","Unsupervised Domain Adaptation, Generative Models, Feature Augmentation, Generative and Discriminative Consistency" Deep Learning meets Nonparametric Regression: Are Weight-Decayed DNNs Locally Adaptive?,https://openreview.net/forum?id=0Q9H_Pgx132,https://openreview.net/pdf?id=0Q9H_Pgx132,Parallel NN with only weight decay achieves an estimation error close to the minimax rates for both the Besov and BV classes.,"We study the theory of neural networks (NNs) through the lens of classical nonparametric regression problems, with a focus on NNs' ability to adaptively estimate functions with heterogeneous smoothness — a property of functions in Besov or Bounded Variation (BV) classes. Existing work on this problem requires tuning the NN architecture based on the function spaces and sample sizes. We consider a “Parallel NN” variant of deep ReLU networks and show that the standard weight decay is equivalent to promoting the ℓp-sparsity (0 < p < 1) of the coefficient vector of an end-to-end learned function basis, i.e., a dictionary. Using this equivalence, we further establish that by tuning only the weight decay, such a Parallel NN achieves an estimation error arbitrarily close to the minimax rates for both the Besov and BV classes. Notably, it gets exponentially closer to minimax optimal as the NN gets deeper. Our research sheds new light on why depth matters and how NNs are more powerful than kernel methods.","Neural network, nonparametric regression, minimax optimal" Sparse Token Transformer with Attention Back Tracking,https://openreview.net/forum?id=VV0hSE8AxCw,https://openreview.net/pdf?id=VV0hSE8AxCw,"We propose an attention back-tracking method that tracks the importance of each attention in a Transformer architecture from the outputs to the inputs, to preserve the tokens that have large impact on the final predictions.","Despite the success of Transformers in various applications from text, vision, and speech domains, they are yet to become standard architectures for mobile and edge device applications due to their heavy memory and computational requirements. While there exist many different approaches to reduce the complexity of Transformers, such as the pruning of the weights/attentions/tokens, quantization, and distillation, we focus on token pruning, which reduces not only the complexity of the attention operations, but also that of the linear layers, which have non-negligible computational costs. However, previous token pruning approaches often remove tokens during the feed-forward stage without consideration of their impact on later layers' attentions, which has a potential risk of dropping out tokens important for the given task. To tackle this issue, we propose an attention back-tracking method that tracks the importance of each attention in a Transformer architecture from the outputs to the inputs, to preserve the tokens that have a large impact on the final predictions. We experimentally validate the effectiveness of the method on both NLP and CV benchmarks, using Transformer architectures for both domains, and the results show that the proposed attention back-tracking allows the model to better retain the full model's performance even at high sparsity rates, significantly outperforming all baselines. 
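A minimal sketch of the distribution-based class-wise feature augmentation (D-CFA) idea from the GeMo-NoRT paper above, assuming per-class Gaussian generative models over target features; the mixing rule and coefficient are our illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_class_gaussians(feats, labels):
    """Fit a per-class Gaussian over (pseudo-labeled) target features."""
    stats = {}
    for c in np.unique(labels):
        f = feats[labels == c]
        stats[c] = (f.mean(0), np.cov(f, rowvar=False) + 1e-4 * np.eye(f.shape[1]))
    return stats

def augment_source(src_feats, src_labels, stats, lam=0.5):
    """Mix source features toward distribution-sampled target features."""
    out = src_feats.copy()
    for k, (x, y) in enumerate(zip(src_feats, src_labels)):
        mu, cov = stats[y]
        t = rng.multivariate_normal(mu, cov)   # sampled target-domain feature
        out[k] = (1 - lam) * x + lam * t       # illustrative mixing rule
    return out

tgt = rng.normal(size=(100, 4)); tgt_y = rng.integers(0, 3, 100)
src = rng.normal(size=(10, 4)); src_y = rng.integers(0, 3, 10)
aug = augment_source(src, src_y, fit_class_gaussians(tgt, tgt_y))
print(aug.shape)  # (10, 4): source batch pulled toward target class statistics
```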
Qualitative analysis of the examples further shows that our method does preserve semantically meaningful tokens.","Token Pruning, Sparse Token, Attention Back-tracking, BERT, Vision Transformer, DynamicViT" A Deep Conjugate Direction Method for Iteratively Solving Linear Systems,https://openreview.net/forum?id=ldRb12nMfLQ,https://openreview.net/pdf?id=ldRb12nMfLQ,"We present a CNN-based algorithm for solving linear systems with millions of degrees of freedom in a way that rapidly achieves convergence to a specified tolerance, a significant improvement over learning methods that converge slowly or not at all.","We present a novel deep learning approach to approximate the solution of large, sparse, symmetric, positive-definite linear systems of equations. These systems arise from many problems in applied science, e.g., in numerical methods for partial differential equations. Algorithms for approximating the solution to these systems are often the bottleneck in problems that require their solution, particularly for modern applications that require many millions of unknowns. Indeed, numerical linear algebra techniques have been investigated for many decades to alleviate this computational burden. Recently, data-driven techniques have also shown promise for these problems. Motivated by the conjugate gradients algorithm that iteratively selects search directions for minimizing the matrix norm of the approximation error, we design an approach that utilizes a deep neural network to accelerate convergence via data-driven improvement of the search directions. Our method leverages a carefully chosen convolutional network to approximate the action of the inverse of the linear operator up to an arbitrary constant. We train the network using unsupervised learning with a loss function equal to the $L^2$ difference between an input and the system matrix times the network evaluation, where the unspecified constant in the approximate inverse is accounted for. We demonstrate the efficacy of our approach on spatially discretized Poisson equations with millions of degrees of freedom arising in computational fluid dynamics applications. Unlike state-of-the-art learning approaches, our algorithm is capable of reducing the linear system residual to a given tolerance in a small number of iterations, independent of the problem size. Moreover, our method generalizes effectively to various systems beyond those encountered during training.","Computational Linear Algebra, Convolutional Neural Network, Conjugate Gradients, Partial Differential Equations, Fluid Simulation" MixMask: Revisiting Masked Siamese Self-supervised Learning in Asymmetric Distance,https://openreview.net/forum?id=L5yBcwO1yKH,https://openreview.net/pdf?id=L5yBcwO1yKH,A New Masking Strategy for Masked Siamese Self-supervised Learning,"Recent advances in self-supervised learning integrate masked modeling and Siamese Networks into a single framework to fully reap the advantages of both techniques. However, these approaches simply inherit the default loss design from previous Siamese networks and ignore the distance change after employing the masking operation in the frameworks. In this paper, we propose a filling-based masking strategy called MixMask to prevent the information loss that the vanilla masking method incurs due to the randomly erased areas in an image. We further introduce a dynamic loss function design with soft distance to adapt the integrated architecture and avoid mismatches between the transformed input and the objective in Masked Siamese ConvNets (MSCN). 
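For the token-pruning paper above, a simplified single-head rendering of attention back-tracking: output-token importance is propagated backwards through the attention maps to rank input tokens. This is an illustration of the idea, not the paper's exact algorithm:

```python
import numpy as np

def backtrack_token_importance(attn_maps, out_importance):
    """Trace output-token importance back to input tokens through attention.

    attn_maps: list of (tokens, tokens) row-stochastic attention matrices,
    ordered from first to last layer.
    """
    imp = out_importance
    for A in reversed(attn_maps):
        imp = A.T @ imp        # tokens that were attended to inherit importance
        imp = imp / imp.sum()  # keep a normalized distribution
    return imp

rng = np.random.default_rng(0)
maps = []
for _ in range(4):
    A = rng.random((6, 6))
    maps.append(A / A.sum(axis=1, keepdims=True))  # row-stochastic attention
cls_focus = np.eye(6)[0]                 # importance starts at the [CLS] slot
keep = backtrack_token_importance(maps, cls_focus)
print(np.argsort(keep)[::-1])            # tokens ranked by retained importance
```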
The dynamic loss distance is calculated according to the mix-masking scheme. Extensive experiments are conducted on various datasets: CIFAR-100, Tiny-ImageNet, and ImageNet-1K. The results demonstrate that the proposed framework achieves better accuracy on linear evaluation and semi-supervised learning, outperforming the state-of-the-art MSCN by a significant margin. We also show its superiority on the downstream tasks of object detection and segmentation. Our source code will be publicly available.","MixMask, Masked Siamese Networks, Self-supervised Learning" Smart Multi-tenant Federated Learning,https://openreview.net/forum?id=HkQ7Ompkpqe,https://openreview.net/pdf?id=HkQ7Ompkpqe,"We propose a smart multi-tenant federated learning system, MuFL, to efficiently coordinate and execute simultaneous training activities under resource constraints by considering both synergies and differences among training activities.","Federated learning (FL) is an emerging distributed machine learning method that empowers in-situ model training on decentralized edge devices. However, multiple simultaneous training activities could overload resource-constrained devices. In this work, we propose a smart multi-tenant FL system, MuFL, to effectively coordinate and execute simultaneous training activities. We first formalize the problem of multi-tenant FL, define multi-tenant FL scenarios, and introduce a vanilla multi-tenant FL system that trains activities sequentially to form baselines. Then, we propose two approaches to optimize multi-tenant FL: 1) activity consolidation merges training activities into one activity with a multi-task architecture; 2) after training it for several rounds, activity splitting divides it into groups by employing affinities among activities such that activities within a group have better synergy. Extensive experiments demonstrate that MuFL outperforms other methods while consuming 40% less energy. We hope this work will inspire the community to further study and optimize multi-tenant FL.","federated learning, multi-tenant federated learning" Robust Active Distillation,https://openreview.net/forum?id=ALDM5SN2r7M,https://openreview.net/pdf?id=ALDM5SN2r7M,A new way of actively soft-labeling points in semi-supervised knowledge distillation to teach the student model in an efficient and robust way,"Distilling knowledge from a large teacher model to a lightweight one is a widely successful approach for generating compact, powerful models in the semi-supervised learning setting where a limited amount of labeled data is available. In large-scale applications, however, the teacher tends to provide a large number of incorrect soft-labels that impair student performance. The sheer size of the teacher additionally constrains the number of soft-labels that can be queried due to prohibitive computational and/or financial costs. The difficulty in achieving simultaneous \emph{efficiency} (i.e., minimizing soft-label queries) and \emph{robustness} (i.e., avoiding student inaccuracies due to incorrect labels) hurts the widespread application of knowledge distillation to many modern tasks. In this paper, we present a parameter-free approach with provable guarantees to query the soft-labels of points that are simultaneously informative and correctly labeled by the teacher. At the core of our work lies a game-theoretic formulation that explicitly considers the inherent trade-off between the informativeness and correctness of input instances. 
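A toy sketch of the filling-based masking idea behind MixMask above: masked patches of one image are filled with patches from another image rather than erased. The patch size, masking ratio, and pairing rule are our illustrative choices:

```python
import numpy as np

def mix_mask(img_a, img_b, patch=4, ratio=0.5, rng=None):
    """Fill the masked patches of img_a with the same patches from img_b.

    Images are (C, H, W) arrays with H and W divisible by `patch`.
    """
    rng = rng or np.random.default_rng()
    c, h, w = img_a.shape
    drop = rng.random((h // patch, w // patch)) < ratio   # patches to replace
    mask = np.kron(drop, np.ones((patch, patch), dtype=bool))
    out = img_a.copy()
    out[:, mask] = img_b[:, mask]                         # fill, don't erase
    return out, mask

a = np.ones((3, 32, 32)); b = np.zeros((3, 32, 32))
mixed, m = mix_mask(a, b, rng=np.random.default_rng(0))
print(m.mean())   # fraction of pixels taken from the second image
```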
We establish bounds on the expected performance of our approach that hold even in worst-case distillation instances. We present empirical evaluations on popular benchmarks that demonstrate the improved distillation performance enabled by our work relative to that of state-of-the-art active learning and active distillation methods.","knowledge distillation, active learning, semi-supervised learning, model compression" Controllable Image Generation via Collage Representations,https://openreview.net/forum?id=1CHhsUY32a,https://openreview.net/pdf?id=1CHhsUY32a,"We present Mixing and Matching scenes (M&Ms), a novel approach to controllable image generation by conditioning on representations of image collages containing objects and backgrounds from several reference images, resulting in high quality generations.","Recent advances in conditional generative image models have enabled impressive results. On the one hand, text-based conditional models have achieved remarkable generation quality, by leveraging large-scale datasets of image-text pairs. To enable fine-grained controllability, however, text-based models require long prompts, whose details may be ignored by the model. On the other hand, layout-based conditional models have also witnessed significant advances. These models rely on bounding boxes or segmentation maps for precise spatial conditioning in combination with coarse semantic labels. The semantic labels, however, cannot be used to express detailed appearance characteristics. In this paper, we approach fine-grained scene controllability through image collages, which allow a rich visual description of the desired scene as well as the appearance and location of the objects therein, without the need for class or attribute labels. We introduce ""mixing and matching scenes"" (M&Ms), an approach that consists of an adversarially trained generative image model which is conditioned on appearance features and spatial positions of the different elements in a collage, and integrates these into a coherent image. We train our model on the OpenImages (OI) dataset and evaluate it on collages derived from the OI and MS-COCO datasets. Our experiments on the OI dataset show that M&Ms outperforms baselines in terms of fine-grained scene controllability while being very competitive in terms of image quality and sample diversity. On the MS-COCO dataset, we highlight the generalization ability of our model by outperforming DALL-E in terms of the zero-shot FID metric, despite using two orders of magnitude fewer parameters and less data. Collage-based generative models have the potential to advance content creation in an efficient and effective way, as they are intuitive to use and yield high quality generations.","image generation, controllable image generation, conditional image generation, instance-conditioned generation, image collage, out-of-distribution generation, unseen layout generation, scene generation" Robust Multi-Agent Reinforcement Learning with State Uncertainties,https://openreview.net/forum?id=Rl4ihTreFnV,https://openreview.net/pdf?id=Rl4ihTreFnV,fundamental research about robust multi-agent reinforcement learning with state uncertainty ,"In real-world multi-agent reinforcement learning (MARL) applications, agents may not have perfect state information (e.g., due to inaccurate measurement or malicious attacks), which challenges the robustness of agents' policies. 
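For the Robust Active Distillation paper above, a hedged sketch of one way to trade informativeness against label correctness when selecting soft-label queries. The scoring rule below (student entropy weighted by teacher confidence) is an illustrative stand-in for the paper's game-theoretic formulation:

```python
import numpy as np

def select_queries(student_probs, teacher_probs, budget):
    """Pick points that are informative for the student yet likely to be
    labeled correctly by the teacher.

    student_probs, teacher_probs: (n, classes) predictive distributions.
    """
    eps = 1e-12
    informativeness = -(student_probs * np.log(student_probs + eps)).sum(1)
    correctness = teacher_probs.max(1)          # proxy for label reliability
    score = informativeness * correctness
    return np.argsort(score)[::-1][:budget]     # indices to query soft-labels

rng = np.random.default_rng(0)
s = rng.dirichlet(np.ones(10), size=100)        # uncertain student
t = rng.dirichlet(np.ones(10) * 5, size=100)    # smoother teacher
print(select_queries(s, t, budget=8))
```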
Though robustness is becoming increasingly important for MARL deployment, little prior work has studied state uncertainties in MARL, in either problem formulation or algorithm design. Motivated by this robustness issue, we study the problem of MARL with state uncertainty in this work. We provide a first attempt at a theoretical and empirical analysis of this challenging problem. We first model the problem as a Markov Game with state perturbation adversaries (MG-SPA), and introduce Robust Equilibrium as the solution concept. We conduct fundamental analysis regarding MG-SPA and give conditions under which such an equilibrium exists. Then we propose a robust multi-agent Q-learning (RMAQ) algorithm to find such an equilibrium, with convergence guarantees. To handle high-dimensional state-action spaces, we design a robust multi-agent actor-critic (RMAAC) algorithm based on an analytical expression of the policy gradient derived in the paper. Our experiments show that the proposed RMAQ algorithm converges to the optimal value function; our RMAAC algorithm outperforms several MARL methods that do not consider state uncertainty in several multi-agent environments.","multi-agent reinforcement learning, robust reinforcement learning" Tiny Adapters for Vision Transformers,https://openreview.net/forum?id=V0Vo9eW2nzL,https://openreview.net/pdf?id=V0Vo9eW2nzL,Tiny Adapters for Vision Transformers,"Vision Transformers (ViTs) have become one of the dominant architectures in computer vision, and pretrained ViT models are commonly adapted to new tasks via fine-tuning of their parameters. Recent work in NLP has proposed a variety of parameter-efficient transfer learning methods, such as adapters, to avoid the prohibitive storage cost of fine-tuning. In this work, we start from the observation that adapters perform poorly when their dimension is small, and we propose a training algorithm that addresses this issue. We start from large adapters which can be trained easily and iteratively reduce the size of every adapter. We introduce a scoring function that can compare neuron importance across layers and consequently allows automatic estimation of the hidden dimension of every adapter. Our method outperforms existing approaches in terms of the trade-off between accuracy and trained parameters across domain adaptation benchmarks. We will release our code publicly upon acceptance.","Vision Transformers, Parameter Efficient Training, Adapters" Accelerating Inverse Reinforcement Learning with Expert Bootstrapping,https://openreview.net/forum?id=b_cUyW2CJO1,https://openreview.net/pdf?id=b_cUyW2CJO1,"This paper presents two simple, general-purpose recipes for accelerating inverse reinforcement learning through better utilization of expert information.","Existing inverse reinforcement learning methods (e.g. MaxEntIRL, $f$-IRL) search over candidate reward functions and solve a reinforcement learning problem in the inner loop. This creates a rather strange inversion where a harder problem, reinforcement learning, is in the inner loop of a presumably easier problem, imitation learning. In this work, we show that better utilization of expert demonstrations can reduce the need for hard exploration in the inner RL loop, hence accelerating learning. Specifically, we propose two simple recipes: (1) placing expert transitions into the replay buffer of the inner RL algorithm (e.g. 
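For the Tiny Adapters paper above, a standard residual bottleneck adapter in PyTorch; the paper's contribution, iteratively shrinking the hidden dimension via neuron-importance scores, is not shown here:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter inserted after a frozen transformer sub-layer.

    Only the two small linear layers are trained; zero-initializing the
    up-projection makes the adapter start as an identity mapping.
    """
    def __init__(self, dim, hidden):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # residual bottleneck

x = torch.randn(2, 197, 768)          # (batch, tokens, ViT width)
print(Adapter(768, 16)(x).shape)      # torch.Size([2, 197, 768])
```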
Soft Actor-Critic), which directly informs the learner about high reward states instead of forcing the learner to discover them through extensive exploration, and (2) using expert actions in Q value bootstrapping in order to improve the target Q value estimates and more accurately describe high value expert states. Our methods show significant gains over a MaxEntIRL baseline on the benchmark MuJoCo suite of tasks, speeding up recovery to 70\% of deterministic expert performance by 2.13x on HalfCheetah-v2, 2.6x on Ant-v2, 18x on Hopper-v2, and 3.36x on Walker2d-v2.","inverse reinforcement learning, imitation learning, reinforcement learning" Kernel Neural Optimal Transport,https://openreview.net/forum?id=Zuc_MHtUma4,https://openreview.net/pdf?id=Zuc_MHtUma4,,"We study the Neural Optimal Transport (NOT) algorithm which uses the general optimal transport formulation and learns stochastic transport plans. We show that NOT with the weak quadratic cost is doomed to learn fake plans which are not optimal. To resolve this issue, we introduce kernel weak quadratic costs. We show that they provide improved theoretical guarantees and practical performance. We test NOT with kernel costs on the unpaired image-to-image translation task.","optimal transport, neural networks, kernels" Neural Optimal Transport,https://openreview.net/forum?id=d8CBRlWNkqH,https://openreview.net/pdf?id=d8CBRlWNkqH,We present a novel neural-networks-based algorithm to compute optimal transport maps and plans for strong and weak transport costs.,"We present a novel neural-networks-based algorithm to compute optimal transport maps and plans for strong and weak transport costs. To justify the usage of neural networks, we prove that they are universal approximators of transport plans between probability distributions. We evaluate the performance of our optimal transport algorithm on toy examples and on the unpaired image-to-image style translation task.","weak optimal transport, neural networks" SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation,https://openreview.net/forum?id=-qg8MQNrxZw,https://openreview.net/pdf?id=-qg8MQNrxZw,,"Since the introduction of Vision Transformers, the landscape of many computer vision tasks (e.g., semantic segmentation), which has been overwhelmingly dominated by CNNs, has recently been significantly revolutionized. However, the computational cost and memory requirements render these methods unsuitable for mobile devices, especially for the high-resolution per-pixel semantic segmentation task. In this paper, we introduce a new method, the squeeze-enhanced Axial Transformer (SeaFormer), for mobile semantic segmentation. Specifically, we design a generic attention block characterized by the formulation of squeeze Axial and spatial enhancement. It can be further used to create a family of backbone architectures with superior cost-effectiveness. Coupled with a light segmentation head, we demonstrate state-of-the-art results on the ADE20K, Pascal Context and COCO-stuff datasets. Critically, we beat both the mobile-friendly rivals and Transformer-based counterparts with better performance and lower latency, without bells and whistles. 
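A schematic sketch of recipe (2) from the expert-bootstrapping paper above: TD targets that also bootstrap through known expert actions. The notation is simplified, and the toy Q function and policy are our own:

```python
def td_targets(r, s_next, q_fn, policy, expert_action=None, gamma=0.99):
    """TD(0) target with optional expert-action bootstrapping.

    Standard target: r + gamma * Q(s', pi(s')). When an expert action a_E is
    known for s', bootstrap with max(Q(s', pi(s')), Q(s', a_E)) so that high
    value expert states are described more accurately.
    """
    q = q_fn(s_next, policy(s_next))
    if expert_action is not None:
        q = max(q, q_fn(s_next, expert_action))
    return r + gamma * q

# Toy quadratic Q and linear policy on a 1-d task (purely illustrative).
q_fn = lambda s, a: -(s - a) ** 2
policy = lambda s: 0.3 * s
print(td_targets(1.0, 0.5, q_fn, policy))                     # plain target
print(td_targets(1.0, 0.5, q_fn, policy, expert_action=0.5))  # expert-boosted
```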
Beyond semantic segmentation, we further apply the proposed SeaFormer architecture to the image classification problem, demonstrating its potential to serve as a versatile mobile-friendly backbone.", Joint Edge-Model Sparse Learning is Provably Efficient for Graph Neural Networks,https://openreview.net/forum?id=4UldFtZ_CVF,https://openreview.net/pdf?id=4UldFtZ_CVF,"Encouraged by the empirical success of sparse learners in accelerating GNN training, this paper characterizes the impact of graph sampling and neuron pruning on the sample complexity and convergence rate for a desirable test accuracy quantitatively.","Due to the significant computational challenge of training large-scale graph neural networks (GNNs), various sparse learning techniques have been exploited to reduce memory and storage costs. Examples include graph sparsification, which samples a subgraph to reduce the amount of data aggregation, and model sparsification, which prunes the neural network to reduce the number of trainable weights. Despite the empirical successes in reducing the training cost while maintaining the test accuracy, the theoretical generalization analysis of sparse learning for GNNs remains elusive. To the best of our knowledge, this paper provides the first theoretical characterization of joint edge-model sparse learning from the perspective of sample complexity and convergence rate in achieving zero generalization error. It proves analytically that both sampling important nodes and pruning the lowest-magnitude neurons can reduce the sample complexity and improve convergence without compromising the test accuracy. Although the analysis is centered on two-layer GNNs with structural constraints on data, the insights are applicable to more general setups and justified by both synthetic and practical citation datasets.","Learning theory, Graph neural networks, Generalization analysis, Graph sparsification" Harnessing spectral representations for subgraph alignment,https://openreview.net/forum?id=EONdIvi64h-,https://openreview.net/pdf?id=EONdIvi64h-,,"With the rise and advent of graph learning techniques, graph data has become ubiquitous in the machine learning field. However, while several efforts have been devoted to the design of new convolutional architectures, pooling or positional encoding schemes, relatively little focus has been placed on modeling pairwise problems such as signal transfer, graph isomorphism and subgraph correspondence tasks. With this paper, we anticipate the need for a convenient framework to deal with problems that revolve around the notion of a map among graphs, and focus in particular on the challenging subgraph alignment scenario. We claim that, first and foremost, the representation of a map plays a central role in how these problems should be modeled -- be it a map inference problem or a simpler signal transport task. Taking a hint from recent work in geometry processing, we propose the adoption of a spectral representation for maps that is compact, easy to compute, permutation-equivariant, easy to plug into learning pipelines, and especially effective for a wide range of situations, most notably when dealing with subgraph alignment problems. 
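For the joint edge-model sparse learning paper above, a toy sketch of its two ingredients: graph sparsification by node importance and magnitude-based weight pruning. Degree is an illustrative stand-in for the paper's node-importance measure:

```python
import numpy as np

rng = np.random.default_rng(0)

def sparsify_graph(adj, keep_frac=0.5):
    """Keep only edges among the highest-degree nodes (illustrative sampling)."""
    deg = adj.sum(1)
    kept = np.argsort(deg)[::-1][: int(len(deg) * keep_frac)]
    mask = np.zeros_like(adj)
    mask[np.ix_(kept, kept)] = 1
    return adj * mask

def prune_weights(W, keep_frac=0.5):
    """Magnitude pruning: zero out the smallest-magnitude weights."""
    thresh = np.quantile(np.abs(W), 1 - keep_frac)
    return W * (np.abs(W) >= thresh)

adj = (rng.random((8, 8)) < 0.4).astype(float)
W = rng.normal(size=(8, 8))
print(sparsify_graph(adj).sum(), np.count_nonzero(prune_weights(W)))
```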
We further report for the first time a surprising phenomenon: the partiality arising in subgraph alignment is manifested in the structure of the map coefficients, even in the absence of exact isomorphism, and is consistently observed across different families of graphs.","Graph alignment, Spectral theory" MotifExplainer: a Motif-based Graph Neural Network Explainer,https://openreview.net/forum?id=0YXmOFLb1wQ,https://openreview.net/pdf?id=0YXmOFLb1wQ,"We propose a motif-based explainer that can provide better human-understandable explanations than methods based on nodes, edges, and regular subgraphs.","We consider the explanation problem of Graph Neural Networks (GNNs). Most existing GNN explanation methods identify the most important edges or nodes but fail to consider substructures, which are more important for graph data. One method that considers subgraphs tries to search all possible subgraphs and identify the most significant ones. However, the subgraphs identified may not be recurrent or statistically important for interpretation. This work proposes a novel method, named MotifExplainer, to explain GNNs by identifying important motifs, which are recurrent and statistically significant patterns in graphs. Our proposed motif-based methods can provide better human-understandable explanations than methods based on nodes, edges, and regular subgraphs. Given an instance graph and a pre-trained GNN model, our method first extracts motifs in the graph using domain-specific motif extraction rules. Then, a motif embedding is encoded by feeding motifs into the pre-trained GNN. Finally, we employ an attention-based method to identify the most influential motifs as explanations for the prediction results. The empirical studies on both synthetic and real-world datasets demonstrate the effectiveness of our method.","Graph Neural Networks, Explainer, Motif" Receding Neuron Importances for Structured Pruning,https://openreview.net/forum?id=gQsRPozZYIQ,https://openreview.net/pdf?id=gQsRPozZYIQ,,"Structured pruning efficiently compresses networks by identifying and removing unimportant neurons. While this can be elegantly achieved by applying sparsity-inducing regularisation on BatchNorm parameters, an L1 penalty would shrink all scaling factors rather than just those of superfluous neurons. To tackle this issue, we introduce a simple BatchNorm variation with bounded scaling parameters, based on which we design a novel regularisation term that suppresses only neurons with low importance. Under our method, the weights of unnecessary neurons effectively recede, producing a polarised bimodal distribution of importances. We show that neural networks trained this way can be pruned to a larger extent and with less deterioration. We one-shot prune VGG and ResNet architectures at different ratios on the CIFAR and ImageNet datasets. 
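The functional-map-style construction hinted at in the subgraph-alignment paper above can be sketched as follows: a node correspondence is encoded as a small matrix of spectral coefficients. This is the textbook construction from geometry processing, not the paper's full pipeline:

```python
import numpy as np

def spectral_map(L_src, L_tgt, P, k=5):
    """Encode a node-to-node correspondence P as a compact spectral map C.

    L_src, L_tgt: graph Laplacians; P: (n_tgt, n_src) permutation-like matrix.
    C expresses how source Laplacian eigenvectors map onto target ones.
    """
    _, phi_s = np.linalg.eigh(L_src)
    _, phi_t = np.linalg.eigh(L_tgt)
    phi_s, phi_t = phi_s[:, :k], phi_t[:, :k]    # low-frequency bases
    return np.linalg.pinv(phi_t) @ P @ phi_s     # (k, k) map coefficients

n = 8
rng = np.random.default_rng(0)
A = np.triu((rng.random((n, n)) < 0.4).astype(float), 1)
A = A + A.T
L = np.diag(A.sum(1)) - A
perm = np.eye(n)[rng.permutation(n)]
C = spectral_map(L, perm @ L @ perm.T, perm)     # exact isomorphism case
print(C.round(2))   # highly structured (near block form, up to sign/ordering)
```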
In the case of VGG-style networks, our method significantly outperforms existing approaches, particularly under severe pruning.","structured pruning, regularization, sparsity, batchnorm, neuron importance" Learning Sparse and Low-Rank Priors for Image Recovery via Iterative Reweighted Least Squares Minimization,https://openreview.net/forum?id=TXPN6MtdSE4,https://openreview.net/pdf?id=TXPN6MtdSE4,,"In this work we introduce a novel optimization algorithm for image recovery under learned sparse and low-rank constraints, which are parameterized with weighted extensions of the $\ell_p^p$-vector and $\mathcal{S}_p^p$ Schatten-matrix quasi-norms for $0\!<\!p\le 1$ […] where $L>0$ is the Lipschitz constant of the gradient of an objective function. Note that the existing techniques in the literature can only handle the case when the stepsize is strictly less than $1/L$. By exploiting this, we introduce and study HT-stable and HT-unstable stationary points and show that no matter how close an initialization is to an HT-unstable stationary point (a saddle point in the sparse sense), the IHT sequence leaves it. Finally, we show that no matter what sparse initial point is selected, the IHT sequence converges if the function values at HT-stable stationary points are distinct, where the last condition is a new assumption that has not appeared in the literature. We provide a video of 4000 independent runs where the IHT algorithm is initialized very close to an HT-unstable stationary point and show that the sequences escape it.","Sparse optimization, Hard thresholding, Iterative hard thresholding, HT-stationary point, HT-stable point, HT-unstable point" Fair Attribute Completion on Graph with Missing Attributes,https://openreview.net/forum?id=9vcXCMp9VEp,https://openreview.net/pdf?id=9vcXCMp9VEp,,"Tackling unfairness in graph learning models is a challenging task, as the unfairness issues on graphs involve both attributes and topological structures. Existing work on fair graph learning simply assumes that attributes of all nodes are available for model training and then makes fair predictions. In practice, however, the attributes of some nodes might not be accessible due to missing data or privacy concerns, which makes fair graph learning even more challenging. In this paper, we propose FairAC, a fair attribute completion method, to complement missing information and learn fair node embeddings for graphs with missing attributes. FairAC adopts an attention mechanism to deal with the attribute missing problem and meanwhile, it mitigates two types of unfairness, i.e., feature unfairness from attributes and topological unfairness due to attribute completion. FairAC can be applied to any graph and generate fair embeddings and thus can be applied to most downstream tasks to improve their fairness performance. To the best of our knowledge, FairAC is the first method that jointly addresses the graph attribute completion and graph unfairness problems. Experimental results on benchmark datasets show that our method achieves better fairness performance with less sacrifice in accuracy, compared with the state-of-the-art methods of fair graph learning.", Comfort Zone: A Vicinal Distribution for Regression Problems,https://openreview.net/forum?id=zU8O5w0Wnh1,https://openreview.net/pdf?id=zU8O5w0Wnh1,,"Domain-dependent data augmentation methods generate artificial samples using transformations suited for the underlying data domain, for example rotations on images and time warping on time series data. However, domain-independent approaches, e.g. 
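A minimal runnable version of the IHT iteration discussed above, with the stepsize expressed relative to the gradient Lipschitz constant L (all other choices are ours):

```python
import numpy as np

def iht(A, y, s, step, iters=500):
    """Iterative hard thresholding for min ||Ax - y||^2 s.t. ||x||_0 <= s."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = A.T @ (A @ x - y)              # gradient of the quadratic loss
        z = x - step * g                   # gradient step
        keep = np.argsort(np.abs(z))[-s:]  # hard thresholding: keep top-s
        x = np.zeros_like(x)
        x[keep] = z[keep]
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(40, 100)) / np.sqrt(40)
x_true = np.zeros(100); x_true[[3, 17, 62]] = [1.0, -2.0, 1.5]
y = A @ x_true
L = np.linalg.norm(A, 2) ** 2              # gradient Lipschitz constant
print(np.flatnonzero(iht(A, y, s=3, step=1.0 / L)))  # expected: [3 17 62]
```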
mixup, are applicable to various data modalities, and as such they are general and versatile. While mixup-based techniques are used extensively in classification problems, their effect on regression tasks is somewhat less explored. To bridge this gap, we study the problem of domain-independent augmentation for regression, and we introduce comfort-zone: a new data-driven, domain-independent data augmentation method. Essentially, our approach samples new examples from the tangent planes of the training distribution. Augmenting data in this way aligns with the network's tendency to capture the dominant features of its input signals. Evaluating comfort-zone on regression and time series forecasting benchmarks, we show that it improves the generalization of several neural architectures. We also find that mixup and noise injection are less effective in comparison to comfort-zone.","Deep Learning Regularization, Data Augmentation, Regression Learning" Implicit Neural Spatial Representations for Time-dependent PDEs,https://openreview.net/forum?id=4Vwx-VwS5b3,https://openreview.net/pdf?id=4Vwx-VwS5b3,"We replace traditional PDE solvers' spatial representations (e.g., grid, mesh, and point cloud) with a neural spatial representation.","Numerically solving partial differential equations (PDEs) often entails spatial and temporal discretizations. Traditional methods (e.g., finite difference, finite element, smoothed-particle hydrodynamics) frequently adopt explicit spatial discretizations, such as grids, meshes, and point clouds, where each degree-of-freedom corresponds to a location in space. While these explicit spatial correspondences are intuitive to model and understand, these representations are not necessarily optimal for accuracy, memory usage, or adaptivity. In this work, we explore implicit neural representation as an alternative spatial discretization, where spatial information is implicitly stored in the neural network weights. With implicit neural spatial representation, PDE-constrained time-stepping translates into updating neural network weights, which naturally integrates with commonly adopted optimization time integrators. We validate our approach on a variety of classic PDEs with examples involving large elastic deformations, turbulent fluids, and multiscale phenomena. While slower to compute than traditional representations, our approach exhibits higher accuracy, lower memory consumption, and dynamically adaptive allocation of degrees of freedom without complex remeshing.","PDE, implicit neural representation, neural field, numerical methods" Deep Ranking Ensembles for Hyperparameter Optimization,https://openreview.net/forum?id=_ruvo2KCL2x,https://openreview.net/pdf?id=_ruvo2KCL2x,Meta-learn Deep Ensembles using Ranking Losses to improve the performance on Hyperparameter Optimization,"Automatically optimizing the hyperparameters of Machine Learning algorithms is one of the primary open questions in AI. Existing work in Hyperparameter Optimization (HPO) trains surrogate models for approximating the response surface of hyperparameters as a regression task. In contrast, we hypothesize that the optimal strategy for training surrogates is to preserve the ranks of the performances of hyperparameter configurations as a Learning to Rank problem. As a result, we present a novel method that meta-learns neural network surrogates optimized for ranking the configurations' performances while modeling their uncertainty via ensembling. 
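For the comfort-zone paper above, one illustrative reading of "sampling from the tangent planes of the training distribution": local PCA over nearest neighbours, followed by jitter within the leading directions. This is our interpretation for illustration, not the paper's exact procedure:

```python
import numpy as np

def tangent_augment(X, n_new, k=10, scale=0.1, rng=None):
    """Sample new points from local tangent planes of the data manifold."""
    rng = rng or np.random.default_rng()
    out = []
    for _ in range(n_new):
        x = X[rng.integers(len(X))]                  # random anchor point
        d = np.linalg.norm(X - x, axis=1)
        nbrs = X[np.argsort(d)[1 : k + 1]]           # k nearest neighbours
        centered = nbrs - nbrs.mean(0)
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)
        basis = Vt[:2]                               # local 2-d tangent basis
        out.append(x + scale * rng.normal(size=2) @ basis)
    return np.array(out)

X = np.random.default_rng(0).normal(size=(200, 5))
print(tangent_augment(X, n_new=4, rng=np.random.default_rng(1)).shape)
```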
In a large-scale experimental protocol comprising 12 baselines, 16 HPO search spaces and 86 datasets/tasks, we demonstrate that our method achieves new state-of-the-art results in HPO.","Hyperparameter Optimization, Meta-learning, Deep Ensembles, Ranking Losses" Multi-skill Mobile Manipulation for Object Rearrangement,https://openreview.net/forum?id=Z3IClM_bzvP,https://openreview.net/pdf?id=Z3IClM_bzvP,,"We study a modular approach to tackle long-horizon mobile manipulation tasks for object rearrangement, which decomposes a full task into a sequence of subtasks. To tackle the entire task, prior work chains multiple stationary manipulation skills with a point-goal navigation skill, which are learned individually on subtasks. Although more effective than monolithic end-to-end RL policies, this framework suffers from compounding errors in skill chaining, e.g., navigating to a bad location where a stationary manipulation skill cannot reach its target to manipulate. To this end, we propose that the manipulation skills should include mobility, giving them the flexibility to interact with the target object from multiple locations, and that the navigation skill should have multiple end points that lead to successful manipulation. We operationalize these ideas by implementing mobile manipulation skills rather than stationary ones and by training the navigation skill with a region goal instead of a point goal. We evaluate our multi-skill mobile manipulation method M3 on 3 challenging long-horizon mobile manipulation tasks in the Home Assistant Benchmark (HAB), and show superior performance compared to the baselines.","mobile manipulation, reinforcement learning" Robustness to corruption in pre-trained Bayesian neural networks,https://openreview.net/forum?id=kUI41mY8bHl,https://openreview.net/pdf?id=kUI41mY8bHl,,"We develop ShiftMatch, a new training-data-dependent likelihood for robustness to corruption in Bayesian neural networks (BNNs). ShiftMatch is inspired by the training-data-dependent “EmpCov” priors from Izmailov et al. (2021a), and efficiently matches test-time spatial correlations to those at training time. Critically, ShiftMatch is designed to leave the neural network’s training time likelihood unchanged, allowing it to use publicly available samples from pre-trained BNNs. Using pre-trained HMC samples, ShiftMatch gives strong performance improvements on CIFAR-10-C, outperforms EmpCov priors (though ShiftMatch uses extra information from a minibatch of corrupted test points), and is perhaps the first Bayesian method capable of convincingly outperforming plain deep ensembles.", Accelerating Federated Learning Convergence via Opportunistic Mobile Relaying,https://openreview.net/forum?id=W98qRtArTy3,https://openreview.net/pdf?id=W98qRtArTy3,,"This paper studies asynchronous Federated Learning (FL) subject to clients' individual arbitrary communication patterns with the parameter server. We propose FedMobile, a new asynchronous FL algorithm that exploits the mobility attribute of the mobile FL system to improve the learning performance. The key idea is to leverage the random client-client communication in a mobile network to create additional indirect communication opportunities with the server via upload and download relaying. We prove that FedMobile achieves a convergence rate $O(\frac{1}{\sqrt{NT}})$, where $N$ is the number of clients and $T$ is the number of communication slots, and show that the optimal design involves an interesting trade-off on the best timing of relaying. 
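For the Deep Ranking Ensembles paper above, a minimal pairwise ranking loss of the kind used to train rank-preserving HPO surrogates; the ensembling and meta-learning components are omitted, and the logistic pairwise form is our choice:

```python
import numpy as np

def pairwise_ranking_loss(scores, losses):
    """Logistic pairwise ranking loss for an HPO surrogate.

    scores: surrogate outputs for hyperparameter configs; losses: their true
    validation losses. Better configs (lower loss) should get higher scores.
    """
    total, count = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if losses[i] < losses[j]:                 # i should outrank j
                total += np.log1p(np.exp(scores[j] - scores[i]))
                count += 1
    return total / max(count, 1)

rng = np.random.default_rng(0)
true_losses = rng.random(6)
good = -true_losses + 0.01 * rng.normal(size=6)   # scores aligned with ranks
bad = rng.normal(size=6)                          # random scores
print(pairwise_ranking_loss(good, true_losses),
      pairwise_ranking_loss(bad, true_losses))    # good << bad
```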
Our analysis suggests that with an increased rate of client-client communication opportunities, asynchronous FL converges faster using FedMobile. Experimental results on a synthetic dataset and two real-world datasets verify our theoretical findings. ","Asynchronous Federated Learning, Convergence Analysis" What Knowledge gets Distilled in Knowledge Distillation? ,https://openreview.net/forum?id=z0uB4Zfb-Ex,https://openreview.net/pdf?id=z0uB4Zfb-Ex,This work studies different ways in which knowledge can get transferred from a teacher to a student.,"Knowledge distillation aims to transfer useful information from a teacher network to a student network, with the primary goal of improving the student's performance for the task at hand. Over the years, there has been a deluge of novel techniques and use cases of knowledge distillation. Yet, despite the various improvements, there seems to be a glaring gap in the community's fundamental understanding of the process. Specifically, what is the knowledge that gets distilled in knowledge distillation? In other words, in what ways does the student become similar to the teacher? Does it start to localize objects in the same way? Does it get fooled by the same adversarial samples? Do its data invariance properties become similar? Our work presents a comprehensive study to try to answer these questions and more. Our results, using image classification as a case study and three state-of-the-art knowledge distillation techniques, show that knowledge distillation methods can indeed indirectly distill other kinds of properties beyond improving task performance. By exploring these questions, we hope our work provides a clearer picture of what happens during knowledge distillation.",knowledge distillation Single-shot General Hyper-parameter Optimization for Federated Learning,https://openreview.net/forum?id=3RhuF8foyPW,https://openreview.net/pdf?id=3RhuF8foyPW,We propose a single-shot hyperparameter optimization scheme for Federated Learning systems with theoretical performance guarantees and strong empirical performance against baselines.,"We address the problem of hyper-parameter optimization (HPO) for federated learning (FL-HPO). We introduce Federated Loss SuRface Aggregation (FLoRA), a general FL-HPO solution framework that can address use cases with tabular data and any Machine Learning (ML) model, including gradient boosting training algorithms, SVMs, and neural networks, among others, thereby further expanding the scope of FL-HPO. FLoRA enables single-shot FL-HPO: identifying a single set of good hyper-parameters that are subsequently used in a single FL training. Thus, it enables FL-HPO solutions with minimal additional communication overhead compared to FL training without HPO. Utilizing standard smoothness assumptions, we theoretically characterize the optimality gap of FLoRA for convex and non-convex loss functions, which explicitly accounts for the heterogeneous nature of the parties' local data distributions, a dominant characteristic of FL systems. 
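A toy rendering of the single-shot loss-surface-aggregation idea in FLoRA above; the per-party surrogate (a nearest-neighbour lookup over one hyperparameter) and aggregation by plain averaging are our simplifications:

```python
import numpy as np

rng = np.random.default_rng(0)

def single_shot_hpo(party_evals, candidates):
    """Pick one hyperparameter setting from parties' (hyperparam, loss) pairs.

    Each party reports local evaluations once; an aggregate surrogate is
    built and its minimizer is used for a single FL training run.
    """
    def surrogate(evals, x):
        pts = np.array([p for p, _ in evals])
        ls = np.array([l for _, l in evals])
        return ls[np.argmin(np.abs(pts - x))]   # 1-d nearest-neighbour lookup
    agg = [np.mean([surrogate(e, x) for e in party_evals]) for x in candidates]
    return candidates[int(np.argmin(agg))]

# Three parties with shifted quadratic local losses over one hyperparameter.
grid = np.linspace(0, 1, 21)
parties = [[(x, (x - c) ** 2 + 0.01 * rng.normal()) for x in rng.random(8)]
           for c in (0.4, 0.5, 0.6)]
print(single_shot_hpo(parties, grid))   # ~0.5: good for all parties on average
```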
Our empirical evaluation of FLoRA for multiple FL algorithms on seven OpenML datasets demonstrates significant model accuracy improvements over the baselines, and robustness to an increasing number of parties involved in FL-HPO training.","Federated Learning, Hyperparameter Optimization, Optimality Gap Analysis" Spatially constrained Adversarial Attack Detection and Localization in the Representation Space of Optical Flow Networks,https://openreview.net/forum?id=WoaNX-8VDrK,https://openreview.net/pdf?id=WoaNX-8VDrK,,"Optical flow estimation has shown significant improvements with advances in deep neural networks. However, these flow networks have recently been shown to be vulnerable to patch-based adversarial attacks, which pose security risks in real-world applications, such as self-driving cars and robotics. We propose SADL, a Spatially constrained adversarial Attack Detection and Localization framework, which does not require dedicated training. The detection of an attacked input sequence is performed via iterative optimization on the activations from the inner layers of flow networks, without any prior knowledge of the attacks. The novel spatially constrained optimization ensures that the detected anomalous subset of features comes from a local region. To this end, SADL provides a subset of nodes within a spatial neighborhood that contribute more to the detection, which will be utilized to localize the attack in the input sequence. The proposed SADL is validated across multiple datasets (i.e., MPI-Sintel and KITTI) and flow networks (i.e., FlowNetC, FlowNet2, PWCNet, and RAFT). With patch attacks occupying $4.8\%$ of the input image resolution on RAFT, our method successfully detects and localizes them with an average precision of $0.946$ and $0.951$ for the KITTI-2015 and MPI-Sintel datasets, respectively. The results show that SADL consistently achieves higher detection rates than existing methods and provides new localization capabilities.","optical flow, adversarial attack detection, adversarial attack localization" Weakly-supervised HOI Detection via Prior-guided Bi-level Representation Learning,https://openreview.net/forum?id=resApVNcqSB,https://openreview.net/pdf?id=resApVNcqSB,,"Human object interaction (HOI) detection plays a crucial role in human-centric scene understanding and serves as a fundamental building block for many vision tasks. One generalizable and scalable strategy for HOI detection is to use weak supervision, learning from image-level annotations only. This is inherently challenging due to ambiguous human-object associations, the large search space of detecting HOIs, and the highly noisy training signal. A promising strategy to address those challenges is to exploit knowledge from large-scale pretrained models (e.g., CLIP), but a direct knowledge distillation strategy does not perform well in the weakly-supervised setting. In contrast, we develop a CLIP-guided HOI representation capable of incorporating the prior knowledge at both the image level and the HOI instance level, and adopt a self-taught mechanism to prune incorrect human-object associations. 
Experimental results on HICO-DET and V-COCO show that our method outperforms previous works by a sizable margin, showing the efficacy of our HOI representation.","HOI Detection, Weakly-supervised Learning, CLIP-guided Representation Learning" Meta-learning Adaptive Deep Kernel Gaussian Processes for Molecular Property Prediction,https://openreview.net/forum?id=KXRSh0sdVTP,https://openreview.net/pdf?id=KXRSh0sdVTP,"This paper proposes a meta-learning approach for fitting deep kernel GPs via implicit differentiation, which outperforms previous SOTA methods on a variety of real-world chemical tasks.","We propose Adaptive Deep Kernel Fitting with Implicit Function Theorem (ADKF-IFT), a novel framework for learning deep kernel Gaussian processes (GPs) by interpolating between meta-learning and conventional deep kernel learning. Our approach employs a bilevel optimization objective where we meta-learn generally useful feature representations across tasks, in the sense that task-specific GP models estimated on top of such features achieve the lowest possible predictive loss on average. We solve the resulting nested optimization problem using the implicit function theorem (IFT). We show that our ADKF-IFT framework contains previously proposed Deep Kernel Learning (DKL) and Deep Kernel Transfer (DKT) as special cases. Although ADKF-IFT is a completely general method, we argue that it is especially well-suited for drug discovery problems and demonstrate that it significantly outperforms previous state-of-the-art methods on a variety of real-world few-shot molecular property prediction tasks and out-of-domain molecular property prediction and optimization tasks.","meta-learning, few-shot learning, Gaussian processes, deep kernel learning, bilevel optimization, chemistry, molecules, drug discovery" ERL-Re$^2$: Efficient Evolutionary Reinforcement Learning with Shared State Representation and Individual Policy Representation ,https://openreview.net/forum?id=FYZCHEtt6H0,https://openreview.net/pdf?id=FYZCHEtt6H0,A novel and effective framework to fuse Reinforcement Learning and Evolutionary Algorithm for policy optimization.,"Deep Reinforcement Learning (Deep RL) and Evolutionary Algorithm (EA) are two major paradigms of policy optimization with distinct learning principles, i.e., gradient-based vs. gradient-free. An appealing research direction is integrating Deep RL and EA to devise new methods by fusing their complementary advantages. However, existing works on combining Deep RL and EA have two common drawbacks: 1) the RL agent and EA agents learn their policies individually, neglecting efficient sharing of useful common knowledge; 2) parameter-level policy optimization guarantees no semantic level of behavior evolution for the EA side. In this paper, we propose Evolutionary Reinforcement Learning with Two-scale State Representation and Policy Representation (ERL-Re$^2$), a novel solution to the aforementioned two drawbacks. The key idea of ERL-Re$^2$ is two-scale representation: all EA and RL policies share the same nonlinear state representation while maintaining individual linear policy representations. The state representation conveys expressive common features of the environment learned by all the agents collectively; the linear policy representation provides a favorable space for efficient policy optimization, where novel behavior-level crossover and mutation operations can be performed. 
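For the ADKF-IFT paper above, the GP-on-learned-features predictive equations it builds on. Only the prediction step is shown; the bilevel, implicit-differentiation training loop is omitted, and the stand-in feature network is ours:

```python
import numpy as np

def deep_kernel_gp_predict(f, X_s, y_s, X_q, noise=0.1, ell=1.0):
    """GP posterior mean with an RBF kernel over learned features f(x).

    f: feature extractor (meta-learned across tasks in ADKF-IFT);
    (X_s, y_s): task support set; X_q: query points.
    """
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / ell**2)
    Fs, Fq = f(X_s), f(X_q)
    K = k(Fs, Fs) + noise**2 * np.eye(len(Fs))
    alpha = np.linalg.solve(K, y_s)
    return k(Fq, Fs) @ alpha               # posterior mean at query points

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
f = lambda X: np.tanh(X @ W)               # stand-in feature net
X_s, X_q = rng.normal(size=(20, 3)), rng.normal(size=(5, 3))
y_s = np.sin(X_s[:, 0])
print(deep_kernel_gp_predict(f, X_s, y_s, X_q))
```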
Moreover, the linear policy representation allows convenient generalization of policy fitness with the help of the Policy-extended Value Function Approximator (PeVFA), further improving the sample efficiency of fitness estimation. Experiments on a range of continuous control tasks show that ERL-Re$^2$ consistently outperforms strong baselines and achieves significant improvement over both its Deep RL and EA components.","Reinforcement Learning, Evolutionary Algorithm, Representation" $\mathrm{R}^2$-VOS: Robust Referring Video Object Segmentation via Relational Cycle Consistency,https://openreview.net/forum?id=h-oMvNaV_Nr,https://openreview.net/pdf?id=h-oMvNaV_Nr,,"Referring video object segmentation (R-VOS) aims to segment the object masks in a video given a referring linguistic expression to the object. R-VOS introduces human language in the traditional VOS loop to extend flexibility, while all current studies are based on a strict assumption: the object depicted by the expression must exist in the video, namely, the expression and video must have an object-level semantic consensus. This is often violated in real-world applications where an expression can be queried against false videos, and existing methods always fail because they rely on the assumption. In this work, we emphasize that studying semantic consensus is necessary to improve the robustness of R-VOS. Accordingly, we pose an extended task from R-VOS without the semantic consensus assumption, named Robust R-VOS ($\mathrm{R}^2$-VOS). The new task essentially corresponds to the joint modeling of the primary R-VOS problem and its dual (text reconstruction). We embrace the observation that the textual embedding spaces have relational structure consistency in the text-video-text transformation cycle that links the primary and dual problems. We leverage the cycle consistency to consolidate and discriminate the semantic consensus, thus advancing the primary task. We then propose an early grounding module to enable the parallel optimization of the primary and dual problems. To measure the robustness of R-VOS models against unpaired videos and expressions, we construct a new evaluation dataset, $\mathrm{R}^2$-Youtube-VOS. Extensive experiments demonstrate that our method not only identifies negative text-video pairs but also improves the segmentation accuracy for positive pairs with superior disambiguating ability. Our model achieves state-of-the-art performance on the Ref-DAVIS17, Ref-Youtube-VOS, and $\mathrm{R}^2$-Youtube-VOS datasets.",Referring Video Object Segmentation Least-to-Most Prompting Enables Complex Reasoning in Large Language Models,https://openreview.net/forum?id=WZH7099tgfM,https://openreview.net/pdf?id=WZH7099tgfM,"We propose a novel prompting strategy, least-to-most prompting, that enables large language models to achieve easy-to-hard generalization","Although chain-of-thought prompting has shown impressive results on many natural language reasoning tasks, it often performs poorly on tasks that require solving problems harder than the demonstration examples. To tackle such easy-to-hard generalization issues, we propose a novel prompting strategy, least-to-most prompting. It is implemented through two-stage prompting: reducing a complex problem into a list of subproblems, and then sequentially solving these subproblems, whereby solving a given subproblem is facilitated by the answers to previously solved subproblems. 
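A sketch of the two-scale representation in ERL-Re$^2$ above: a shared nonlinear state encoding with individual linear policy heads, plus one illustrative behavior-level crossover operator (the paper's exact operators differ, and all names below are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared nonlinear state representation phi(s); individual linear heads W_i.
Phi = rng.normal(size=(8, 16))                 # stand-in shared encoder
phi = lambda s: np.tanh(s @ Phi)               # all agents use the same phi
agents = [rng.normal(size=(16, 4)) for _ in range(4)]   # linear policy heads

def act(W, s):
    return phi(s) @ W                          # action = W^T phi(s)

def behavior_crossover(W_a, W_b, rng):
    """Swap whole action dimensions between two linear policies.

    Since policies are linear in a shared feature space, exchanging columns
    exchanges behavior along entire action dimensions, rather than blindly
    mixing raw network parameters.
    """
    child = W_a.copy()
    cols = rng.random(W_a.shape[1]) < 0.5
    child[:, cols] = W_b[:, cols]
    return child

s = rng.normal(size=8)
child = behavior_crossover(agents[0], agents[1], rng)
print(act(child, s))
```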
Experiments on symbolic manipulation, compositional generalization and math reasoning show that least-to-most prompting can generalize to examples that are harder than those seen in the prompt, and outperform chain-of-thought prompting by a large margin. A notable result is that the GPT-3 code-davinci-002 model with least-to-most prompting solves the SCAN benchmark regardless of splits (such as the length split) with an accuracy of 99.7% using 14 examples, versus an accuracy of 16.2% by chain-of-thought prompting, while neural-symbolic models in the literature specialized for solving SCAN are trained with the full training set of more than 15,000 examples.","large language models, natural language processing, prompting, reasoning, compositional generalization" STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition ,https://openreview.net/forum?id=shuT5_7eeQ_,https://openreview.net/pdf?id=shuT5_7eeQ_,We propose the first mesh-based action recognition method which achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models.,"We study the problem of human action recognition using motion capture (MoCap) sequences. Existing methods for MoCap-based action recognition take skeletons as input, which requires an extra manual mapping step and loses body shape information. Therefore, we propose a novel method that directly models raw mesh sequences, which can benefit from the body prior and surface motion. We propose a new hierarchical transformer with intra- and inter-frame attention to learn effective spatial-temporal representations. Moreover, our model defines two self-supervised learning tasks, namely masked vertex modeling and future frame prediction, to further learn global context for appearance and motion. Our model achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models. We will release our code and models.","mesh-based action recognition, motion capture, transformer" Deep Ensembles for Graphs with Higher-order Dependencies,https://openreview.net/forum?id=hZftxQGJ4Re,https://openreview.net/pdf?id=hZftxQGJ4Re,We propose an ensemble of GNNs that exploits variance in the neighborhood subspaces of nodes in graphs with higher-order dependencies and consistently outperforms baselines on semisupervised and supervised learning tasks.,"Graph neural networks (GNNs) continue to achieve state-of-the-art performance on many graph learning tasks, but rely on the assumption that a given graph is a sufficient approximation of the true neighborhood structure. In the presence of higher-order sequential dependencies, we show that the tendency of traditional graph representations to underfit each node's neighborhood causes existing GNNs to generalize poorly. To address this, we propose a novel Deep Graph Ensemble (DGE), which captures neighborhood variance by training an ensemble of GNNs on different neighborhood subspaces of the same node within a higher-order network structure. We show that DGE consistently outperforms existing GNNs on semisupervised and supervised tasks on six real-world data sets with known higher-order dependencies, even under a similar parameter budget. 
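A schematic of the two-stage least-to-most procedure described above; `llm` is a hypothetical text-completion callable (prompt in, string out), and the prompt templates are ours, not the paper's:

```python
def least_to_most(problem, llm):
    """Two-stage least-to-most prompting.

    Stage 1 decomposes the problem; stage 2 solves subproblems in order,
    feeding each answer into the context for the next subproblem.
    """
    # Stage 1: reduce the problem to a list of simpler subproblems.
    plan = llm("Break this problem into simpler subproblems, one per line:\n"
               + problem)
    subproblems = [s for s in plan.splitlines() if s.strip()]

    # Stage 2: solve sequentially; earlier answers facilitate later ones.
    context = ""
    for sub in subproblems:
        answer = llm(f"{context}Q: {sub}\nA:")
        context += f"Q: {sub}\nA: {answer}\n"
    return llm(f"{context}Q: {problem}\nA:")   # final answer uses all steps

# Usage: wrap any completion API as `llm = lambda prompt: ...` and call
# least_to_most("...", llm).
```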
We demonstrate that learning diverse and accurate base classifiers is central to DGE's success, and discuss the implications of these findings for future work on GNNs.","graph neural networks, higher order networks, deep ensembles, representation learning, semisupervised learning" Simplicial Embeddings in Self-Supervised Learning and Downstream Classification,https://openreview.net/forum?id=RWtGreRpovS,https://openreview.net/pdf?id=RWtGreRpovS,"We use softmax to embed representations in a collection of simplices in SSL models, which offers improved generalization properties for downstream classification.","Simplicial Embeddings (SEM) are representations learned through self-supervised learning (SSL), wherein a representation is projected into $L$ simplices of $V$ dimensions each using a \texttt{softmax} operation. This procedure conditions the representation onto a constrained space during pretraining and imparts an inductive bias for group sparsity. For downstream classification, we formally prove that the SEM representation leads to better generalization than an unnormalized representation. Furthermore, we empirically demonstrate that SSL methods trained with SEMs have improved generalization on natural image datasets such as CIFAR-100 and ImageNet. Finally, when used in a downstream classification task, we show that SEM features exhibit emergent semantic coherence where small groups of learned features are distinctly predictive of semantically-relevant classes.","Self-Supervised learning, Representation learning, Pre-training" Lossless Filter Pruning via Adaptive Clustering for Convolutional Neural Networks,https://openreview.net/forum?id=Pi5LI8sJYYz,https://openreview.net/pdf?id=Pi5LI8sJYYz,We propose a clustering-based filter pruning method which uses equivalence to remove redundancy. Our solution can omit fine-tuning and achieve the best trade-off between performance and complexity compared with other algorithms.,"The filter pruning method introduces structural sparsity by removing selected filters and is thus particularly effective for reducing complexity. However, previous works face two common limitations. 1) The pruned filters are prevented from contributing to the final outputs, resulting in performance degradation, especially at large pruning rates. 2) To recover accuracy, the time-consuming fine-tuning step is required. The cost in time and the need for training data make it difficult to deploy in real-world scenarios. To address the aforementioned limitations, we propose a novel filter pruning method called Cluster Pruning (CP). Our CP reconstructs the redundant filters from the perspective of similarity and removes them equivalently using the proposed channel addition operation in a lossless manner. Pruning in such a way allows CP to preserve as many learned features as possible while getting rid of the need for fine-tuning. Specifically, each filter is first distinguished by clustering and then reconstructed as the centroid to which it belongs. Filters are then updated to eliminate the effect of mistaken selection. After convergence, CP can equivalently remove identical filters through the proposed channel addition operation. The strategies for adjusting the pruning rate and the adaptive coefficient for clustering make our CP even smoother and more efficient. 
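The SEM projection in the Simplicial Embeddings paper above is simple enough to state exactly: group the representation into $L$ blocks of $V$ dimensions and softmax each block (the temperature is our addition):

```python
import numpy as np

def simplicial_embedding(z, L, V, tau=1.0):
    """Project a representation into L simplices of V dimensions each.

    z: (batch, L*V) pre-embedding; each V-sized group is softmaxed onto the
    probability simplex, giving a group-sparse representation.
    """
    z = z.reshape(z.shape[0], L, V) / tau
    e = np.exp(z - z.max(axis=-1, keepdims=True))   # stable softmax per group
    return (e / e.sum(axis=-1, keepdims=True)).reshape(z.shape[0], L * V)

rng = np.random.default_rng(0)
sem = simplicial_embedding(rng.normal(size=(2, 32)), L=8, V=4)
print(sem.shape, sem.reshape(2, 8, 4).sum(-1))      # each simplex sums to 1
```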
Extensive experiments on CIFAR-10 and ImageNet datasets show that our method achieves the best trade-off between performance and complexity compared with other state-of-the-art algorithms.", ViT-Adapter: Exploring Plain Vision Transformer for Accurate Dense Predictions,https://openreview.net/forum?id=plKu2GByCNW,https://openreview.net/pdf?id=plKu2GByCNW,This work investigates a simple yet powerful dense prediction task adapter for Vision Transformer (ViT).,"This work investigates a simple yet powerful dense prediction task adapter for Vision Transformer (ViT). Unlike recently advanced variants that incorporate vision-specific inductive biases into their architectures, the plain ViT suffers inferior performance on dense predictions due to weak prior assumptions. To address this issue, we propose the ViT-Adapter, which allows plain ViT to achieve comparable performance to vision-specific transformers. Specifically, the backbone in our framework is a plain ViT that can learn powerful representations from large-scale multi-modal data. When transferring to downstream tasks, a pre-training-free adapter is used to introduce the image-related inductive biases into the model, making it suitable for these tasks. We verify ViT-Adapter on multiple dense prediction tasks, including object detection, instance segmentation, and semantic segmentation. Notably, without using extra detection data, our ViT-Adapter-L yields state-of-the-art 60.9 box AP and 53.0 mask AP on COCO test-dev. We hope that the ViT-Adapter could serve as an alternative for vision-specific transformers and facilitate future research. The code and models will be released.","Plain Vision Transformer, Adapter, Dense Prediction" Towards Understanding Why Mask Reconstruction Pretraining Helps in Downstream Tasks,https://openreview.net/forum?id=PaEUQiY40Dk,https://openreview.net/pdf?id=PaEUQiY40Dk,,"For unsupervised pretraining, mask-reconstruction pretraining (MRP) approaches, e.g., MAE and data2vec, randomly mask input patches and then reconstruct the pixels or semantic features of these masked patches via an auto-encoder. Then, for a downstream task, supervised fine-tuning of the pretrained encoder remarkably surpasses conventional ""supervised learning"" (SL) trained from scratch. However, it is still unclear 1) how MRP performs semantic (feature) learning in the pretraining phase and 2) why it helps in downstream tasks. To solve these problems, we first theoretically show that, on an auto-encoder with a two/one-layered convolutional encoder/decoder, MRP can capture all discriminative semantics of each potential semantic class in the pretraining dataset. Then, considering that the pretraining dataset is of huge size and high diversity and thus covers most semantics in downstream datasets, in the fine-tuning phase the pretrained encoder can capture as many semantics as possible in downstream datasets, and would not lose these semantics, with theoretical guarantees. In contrast, SL only randomly captures some semantics due to the lottery ticket hypothesis. So MRP provably achieves better performance than SL on classification tasks. Experimental results support our data assumptions and our theoretical implications. 
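The clustering-and-centroid-reconstruction step of CP can be sketched as below. This toy version clusters flattened filters with scikit-learn's k-means and snaps each filter to its centroid; the channel addition operation and the adaptive schedules are omitted, so this is an illustration of the idea, not the full method.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_filters(weight: np.ndarray, n_clusters: int):
    """Cluster conv filters (out_ch, in_ch, k, k) and reconstruct each as its
    centroid. Once filters in a cluster are identical, they can be merged
    losslessly by summing the matching input channels of the next layer."""
    out_ch = weight.shape[0]
    flat = weight.reshape(out_ch, -1)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(flat)
    centroids = km.cluster_centers_[km.labels_]    # one centroid per filter
    return centroids.reshape(weight.shape), km.labels_

w = np.random.randn(64, 3, 3, 3)                   # a toy conv layer
w_clustered, labels = cluster_filters(w, n_clusters=32)
```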
", Similarity and Generalization: from Noise to Corruption,https://openreview.net/forum?id=uzbd9jGCnN4,https://openreview.net/pdf?id=uzbd9jGCnN4,We investigate for the first time double descent and online/offline training in the context of similarity learning and find that the resulting learning model is heavily affected both by the topology of the dataset and noise.,"Contrastive learning aims to extract distinctive features from data by finding an embedding representation where similar samples are close to each other, and different ones are far apart. We study how NNs generalize the concept of similarity in the presence of noise, investigating two phenomena: Double Descent (DD) behavior and online/offline correspondence. While DD examines how the network adjusts to the dataset during a long training time or by increasing the number of parameters, online/offline correspondence compares the network performances varying the quality (diversity) of the dataset. We focus on the simplest contrastive learning representative: Siamese Neural Networks (SNNs). We point out that SNNs can be affected by two distinct sources of noise: Pair Label Noise (PLN) and Single Label Noise (SLN). The effect of SLN is asymmetric, but it preserves similarity relations, while PLN is symmetric but breaks transitivity. We find that DD also appears in SNNs and is exacerbated by noise. We show that the dataset topology crucially affects generalization. While sparse datasets show the same performances under SLN and PLN for an equal amount of noise, SLN outperforms PLN in the overparametrized region in dense datasets. Indeed, in this regime, PLN similarity violation becomes macroscopical, corrupting the dataset to the point where complete overfitting cannot be achieved. We call this phenomenon Density-Induced Break of Similarity (DIBS). Probing the equivalence between online optimization and offline generalization in SNNs, we find that their correspondence breaks down in the presence of label noise for all the scenarios considered.","double descent, online/offline training, generalization, similarity learning, noise" On Trace of PGD-Like Adversarial Attacks,https://openreview.net/forum?id=sF2Ut0fflgx,https://openreview.net/pdf?id=sF2Ut0fflgx,We present ARC features where SAE is a unique trace left by PGD-like attacks.,"Adversarial attacks pose safety and security concerns to deep learning applications, but their characteristics are under-explored. Yet largely imperceptible, a strong trace could have been left by PGD-like attacks in an adversarial example. Recall that PGD-like attacks trigger the ``local linearity'' of a network, which implies different extents of linearity for benign or adversarial examples. Inspired by this, we construct an Adversarial Response Characteristics (ARC) feature to reflect the model's gradient consistency around the input to indicate the extent of linearity. Under certain conditions, it qualitatively shows a gradually varying pattern from benign example to adversarial example, as the latter leads to Sequel Attack Effect (SAE). To quantitatively evaluate the effectiveness of ARC, we conduct experiments on CIFAR-10 and ImageNet for attack detection and attack type recognition in a challenging setting. The results suggest that SAE is an effective and unique trace of PGD-like attacks reflected through the ARC feature. The ARC feature is intuitive, light-weighted, non-intrusive, and data-undemanding. 
","adversarial attack characterization, local linearity, adversarial response characteristics, sequel attack effect" MEGAN: Multi Explanation Graph Attention Network,https://openreview.net/forum?id=H6LVUiHzYDE,https://openreview.net/pdf?id=H6LVUiHzYDE,"Novel, self-explaining graph attention network features multiple explanation channels independent of task specifications to improve interpretability of graph regression and classification problems","Explainable artificial intelligence (XAI) methods are expected to improve trust during human-AI interactions, provide tools for model analysis and extend human understanding of complex problems. Attention-based models are an important subclass of XAI methods, partly due to their full differentiability and the potential to improve explanations by means of explanation-supervised training. We propose the novel multi-explanation graph attention network (MEGAN). Our graph regression and classification model features multiple explanation channels, which can be chosen independently of the task specifications. We first validate our model on a synthetic graph regression dataset, where our model produces single-channel explanations with quality similar to GNNExplainer. Furthermore, we demonstrate the advantages of multi-channel explanations on one synthetic and two real-world datasets: The prediction of water solubility of molecular graphs and sentiment classification of movie reviews. We find that our model produces explanations consistent with human intuition, opening the way to learning from our model in less well-understood tasks.","explainable artificial intelligence, interpretable machine learning, graph neural networks, attention network, graph regression, graph classification" Learning Control Lyapunov Functions For High-dimensional Unknown Systems using Guided Iterative State Space Exploration,https://openreview.net/forum?id=YHxp8eRry6F,https://openreview.net/pdf?id=YHxp8eRry6F,"We develop a novel algorithm that learns a stable controller for high-dimensional unknown systems. We provide theoretical guarantees for the convergence, and empirical results that it outperforms other baselines in a suite of environments.","Designing stable controllers in complex, high-dimensional systems with unknown dynamics is a critical problem when we deploy robots in the real world. Prior works use learning-based control Lyapunov functions (CLFs) or adaptive control to derive such controllers, but they suffer from two significant challenges: scalability and model transparency. This paper proposes a general framework to jointly learn the local dynamics, a stable controller, and the corresponding CLF in high-dimensional unknown systems. Our approach, GIE-CLF, does not need any knowledge of the environment, such as the dynamics, reward functions, etc, and can scale up to high dimensional systems using only local knowledge of the dynamics inside a trusted tunnel instead of global knowledge required by other methods. We provide theoretical guarantees for our framework and demonstrate it on highly complex systems including a high-fidelity F-16 jet aircraft model that has a 16-dimensional state space and a 4-dimensional input space. Experimental results show that GIE-CLF significantly outperforms prior works in reinforcement learning and imitation learning. 
We also show that our algorithm can be extended to learn other control certificate functions for unknown systems.","Neural Certificates, Lyapunov Functions, Robotics" Practical Approaches for Fair Learning with Multitype and Multivariate Sensitive Attributes,https://openreview.net/forum?id=xWutyHiLtwP,https://openreview.net/pdf?id=xWutyHiLtwP,,"It is important to guarantee that machine learning algorithms deployed in the real world do not result in unfairness or unintended social consequences. Fair ML has largely focused on the protection of single attributes in the simpler setting where both attributes and target outcomes are binary. However, practical applications in many real-world problems entail the simultaneous protection of multiple sensitive attributes, which are often not simply binary, but continuous or categorical. To address this more challenging task, we introduce FairCOCCO, a fairness measure built on cross-covariance operators on reproducing kernel Hilbert spaces. This leads to two practical tools: first, the FairCOCCO Score, a normalized metric that can quantify fairness in settings with single or multiple sensitive attributes of arbitrary type; and second, a subsequent regularization term that can be incorporated into arbitrary learning objectives to obtain fair predictors. These contributions address crucial gaps in the algorithmic fairness literature, and we empirically demonstrate consistent improvements against state-of-the-art techniques in balancing predictive power and fairness on real-world datasets.","Fairness, Metrics, Regularization" Self-Supervised Category-Level Articulated Object Pose Estimation with Part-Level SE(3) Equivariance,https://openreview.net/forum?id=20GtJ6hIaPA,https://openreview.net/pdf?id=20GtJ6hIaPA,,"Category-level articulated object pose estimation aims to estimate a hierarchy of articulation-aware object poses of an unseen articulated object from a known category. To reduce the heavy annotations needed for supervised learning methods, we present a novel self-supervised strategy that solves this problem without any human labels. Our key idea is to factorize canonical shapes and articulated object poses from input articulated shapes through part-level equivariant shape analysis. Specifically, we first introduce the concept of part-level SE(3) equivariance and devise a network to learn features with such a property. Then, through a carefully designed fine-grained pose-shape disentanglement strategy, we expect that canonical spaces supporting pose estimation can be induced automatically. Thus, we could further predict articulated object poses as per-part rigid transformations describing how parts transform from their canonical part spaces to the camera space. Extensive experiments demonstrate the effectiveness of our method on both complete and partial point clouds from synthetic and real articulated object datasets.", Universal Mini-Batch Consistency for Set Encoding Functions,https://openreview.net/forum?id=FWl6TFsE7Cp,https://openreview.net/pdf?id=FWl6TFsE7Cp,We propose a method to make arbitrary set functions produce consistent outputs given mini-batches of a set.,"Previous works have established solid foundations for neural set functions, complete with architectures which preserve the necessary properties for operating on sets, such as invariance to permutations of the set elements. 
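To make the cross-covariance idea concrete, the snippet below computes a normalized HSIC-style kernel dependence between predictions and (multivariate, arbitrarily encoded) sensitive attributes. It is a generic sketch in the spirit of FairCOCCO, not the paper's exact cross-covariance-operator estimator.

```python
import numpy as np

def rbf_gram(x: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_dependence(preds: np.ndarray, sensitive: np.ndarray) -> float:
    """Normalized HSIC between (n, d1) predictions and (n, d2) sensitive
    attributes; 0 means independence, values near 1 mean strong dependence."""
    n = preds.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    K = H @ rbf_gram(preds) @ H
    L = H @ rbf_gram(sensitive) @ H
    hsic = np.trace(K @ L)
    return hsic / (np.sqrt(np.trace(K @ K) * np.trace(L @ L)) + 1e-12)
```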
Subsequent work has highlighted the utility of Mini-Batch Consistency (MBC), the ability to sequentially process any permutation of a set partition scheme (e.g., streaming chunks of data) while maintaining consistency guarantees on the output, although there are limited options for MBC architectures. We propose a framework which can convert an arbitrary non-MBC model to one which satisfies MBC. In doing so, we allow all set functions to universally be considered in an MBC setting (UMBC). Additionally, we explore a Monte Carlo dropout strategy made possible by our framework, which allows performing Monte Carlo dropout on streaming sets without ever seeing the entire set at once. We validate UMBC with theoretical proofs and unit tests, and provide qualitative/quantitative experiments on Gaussian data, clean and corrupted point cloud classification, and amortized clustering on ImageNet. Furthermore, we investigate the probabilistic calibration of set functions under test-time distributional shifts. Our results demonstrate the utility of universal mini-batch consistency, and we further discover that our dropout strategy improves uncertainty calibration.",set Learn the Time to Learn: Replay Scheduling in Continual Learning,https://openreview.net/forum?id=kyJ5Mrh5Cz9,https://openreview.net/pdf?id=kyJ5Mrh5Cz9,We demonstrate that scheduling which tasks to replay at different times is important in continual learning scenarios. ,"Replay methods are known to be successful at mitigating catastrophic forgetting in continual learning scenarios despite having limited access to historical data. However, storing historical data is cheap in many real-world settings, yet replaying all historical data is often prohibited due to processing time constraints. In such settings, we propose that continual learning systems should learn the time to learn and schedule which tasks to replay at different time steps. We first demonstrate the benefits of our proposal by using Monte Carlo tree search to find a proper replay schedule, and show that the found replay schedules can outperform fixed scheduling policies when combined with multiple replay methods in various continual learning settings. Additionally, we propose a framework for learning replay scheduling policies with reinforcement learning. We show that the learned policies can generalize better in new continual learning scenarios compared to equally replaying all seen tasks, without added computational cost. Our study reveals the importance of learning the time to learn in continual learning, which brings current research closer to real-world needs.","Continual Learning, Replay Methods, Reinforcement Learning" Divide to Adapt: Mitigating Confirmation Bias for Domain Adaptation of Black-Box Predictors,https://openreview.net/forum?id=hVrXUps3LFA,https://openreview.net/pdf?id=hVrXUps3LFA,A black-box model adaptation approach that purifies the pseudo labels for knowledge distillation.,"Domain Adaptation of Black-box Predictors (DABP) aims to learn a model on an unlabeled target domain supervised by a black-box predictor trained on a source domain. It does not require access to either the source-domain data or the predictor parameters, thus addressing the data privacy and portability issues of standard domain adaptation methods. 
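Mini-Batch Consistency itself is easy to see with a pooling that is trivially MBC: streaming a set through sum pooling under any partition gives exactly the full-set output, which is the guarantee UMBC extends to arbitrary set functions. A small demonstration:

```python
import numpy as np

def streamed_sum_pool(chunks):
    """Sum pooling is Mini-Batch Consistent: any partition/order of the set
    yields the same aggregate as processing the whole set at once."""
    state = 0.0
    for chunk in chunks:                  # stream the set chunk by chunk
        state = state + chunk.sum(axis=0)
    return state

s = np.random.randn(100, 8)                # a set of 100 elements
full = s.sum(axis=0)
streamed = streamed_sum_pool(np.array_split(s, 7))
assert np.allclose(full, streamed)         # identical under any partition
```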
Existing DABP approaches mostly rely on knowledge distillation (KD) from the black-box predictor, i.e., training the model with its noisy target-domain predictions, which inevitably introduces confirmation bias accumulated from the prediction noise and degrades performance. To mitigate such bias, we propose a new strategy, \textit{divide-to-adapt}, that purifies cross-domain knowledge distillation by proper domain division. This is inspired by an observation we make for the first time in domain adaptation: the target domain usually contains easy-to-adapt and hard-to-adapt samples that have different levels of domain discrepancy w.r.t. the source domain, and deep models tend to fit easy-to-adapt samples first. Leveraging easy-to-adapt samples with less noise can help KD alleviate the negative effect of prediction noise from black-box predictors. In this sense, the target domain can be divided into an easy-to-adapt subdomain with less noise and a hard-to-adapt subdomain at the early stage of training. Then the adaptation is achieved by semi-supervised learning. We further reduce the distribution discrepancy between subdomains and develop a weak-strong augmentation strategy to filter the predictor errors progressively. As such, our method is a simple yet effective solution to reduce error accumulation in cross-domain knowledge distillation for DABP. Moreover, we prove that the target error of DABP is bounded by the noise ratio of the two subdomains, i.e., the confirmation bias, which provides theoretical justification for our method. Extensive experiments demonstrate that our method achieves state-of-the-art performance on all DABP benchmarks, outperforming the existing best approach by 7.0\% on VisDA-17, and is even comparable with standard domain adaptation methods that use the source-domain data.","model adaptation, black-box predictors, transfer learning" Thalamus: a brain-inspired algorithm for biologically-plausible continual learning and disentangled representations,https://openreview.net/forum?id=6orC5MvgPBK,https://openreview.net/pdf?id=6orC5MvgPBK,A brain-inspired algorithm that alternates optimizing in weight space with optimizing the latent embedding space in the same neural network leading to open-ended discovery of tasks and disentangled learning.,"Animals thrive in a constantly changing environment and leverage the temporal structure to learn well-factorized causal representations. In contrast, traditional neural networks suffer from forgetting in changing environments and many methods have been proposed to limit forgetting with different trade-offs. Inspired by the brain thalamocortical circuit, we introduce a simple algorithm that uses optimization at inference time to generate internal representations of the current task dynamically. The algorithm alternates between updating the model weights and a latent task embedding, allowing the agent to parse the stream of temporal experience into discrete events and organize learning about them. On a continual learning benchmark, it achieves a competitive final average accuracy by mitigating forgetting, but importantly, the interaction between the weight dynamics and the latent dynamics organizes knowledge into flexible structures with a cognitive interface to control them. Tasks later in the sequence can be solved through knowledge transfer as they become reachable within the well-factorized latent space. 
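The domain division step can be approximated with a simple small-loss criterion, exploiting the observation that deep models fit easy-to-adapt samples first; the percentile threshold below is an illustrative choice, not necessarily the paper's division rule.

```python
import numpy as np

def divide_target_domain(per_sample_kd_loss: np.ndarray, quantile: float = 0.5):
    """Split target samples into easy-to-adapt (low KD loss, cleaner pseudo-labels)
    and hard-to-adapt subdomains for subsequent semi-supervised training."""
    thresh = np.quantile(per_sample_kd_loss, quantile)
    easy = per_sample_kd_loss <= thresh
    return easy, ~easy                     # boolean masks over the target set

losses = np.random.rand(1000)              # per-sample distillation losses
easy_mask, hard_mask = divide_target_domain(losses)
```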
The algorithm meets many of the desiderata of an ideal continually learning agent in open-ended environments, and its simplicity suggests fundamental computations in circuits with abundant feedback control loops, such as the thalamocortical circuits in the brain.","brain-inspired learning, neuroscience, recurrent neural networks, context inference, bayesian brain" A Generalized EigenGame With Extensions to Deep Multiview Representation Learning,https://openreview.net/forum?id=tVHCysldFe0,https://openreview.net/pdf?id=tVHCysldFe0,A new approach to solving Generalized Eigenvalue Problems in the stochastic setting extended to Deep Canonical Correlation Analysis with state-of-the-art results for stochastic minibatches,"Generalized Eigenvalue Problems (GEPs) encompass a range of interesting scientific computing problems. Canonical Correlation Analysis (CCA) and Partial Least Squares (PLS) are two such examples of GEPs which are often used to learn representations of multiview data. Development of efficient stochastic approaches to these problems would allow them to scale to large datasets. Furthermore, existing deep learning based extensions of CCA require large minibatch sizes in the stochastic setting to achieve good performance. Inspired by recent formulations of Principal Components Analysis and GEPs as games with differentiable utilities, we develop an alternative game theoretic approach to solving GEPs in which all constraints are softly enforced by Lagrange multipliers. We show that our approach shares much of the theoretical grounding of the previous game theoretic approaches but has fewer hyperparameters, is faster to converge, and permits extension to general function approximators like neural networks for certain GEPs including CCA. We demonstrate the effectiveness of our method for solving GEPs using canonical multiview datasets and demonstrate state-of-the-art performance for the Deep CCA problem for multiview representation learning.","Optimisation, Generalized Eigenvalue Problem, Deep CCA, CCA, PLS" Deep Variational Implicit Processes,https://openreview.net/forum?id=8aeSJNbmbQq,https://openreview.net/pdf?id=8aeSJNbmbQq," We propose here a multi-layer generalization of IPs called the Deep Variational Implicit process, similar to that of deep GPs over GPs.","Implicit processes (IPs) are a generalization of Gaussian processes (GPs). IPs may lack a closed-form expression but are easy to sample from. Examples include, among others, Bayesian neural networks or neural samplers. IPs can be used as priors over functions, resulting in flexible models with well-calibrated prediction uncertainty estimates. Methods based on IPs usually carry out function-space approximate inference, which overcomes some of the difficulties of parameter-space approximate inference. Nevertheless, the approximations employed often limit the expressiveness of the final model, resulting, e.g., in a Gaussian predictive distribution, which can be restrictive. We propose here a multi-layer generalization of IPs called the Deep Variational Implicit process (DVIP). This generalization is similar to that of deep GPs over GPs, but it is more flexible due to the use of IPs as the prior distribution over the latent functions. We describe a scalable variational inference algorithm for training DVIP and show that it outperforms previous IP-based methods and also deep GPs. We support these claims via extensive regression and classification experiments. 
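The alternation the Thalamus abstract describes (weight updates interleaved with inference-time optimization of a latent task embedding) can be sketched as one combined step; the interface `model(x, z)` and the step sizes are illustrative assumptions.

```python
import torch

def thalamus_style_step(model, z, batch, loss_fn, weight_opt, z_lr=0.1):
    """One alternation: (1) update the latent task embedding z with weights
    frozen, i.e. inference-time task parsing; (2) update the weights under the
    inferred embedding."""
    x, y = batch
    z = z.detach().requires_grad_(True)
    grad_z = torch.autograd.grad(loss_fn(model(x, z), y), z)[0]
    z = (z - z_lr * grad_z).detach()       # latent step, weights untouched
    weight_opt.zero_grad()
    loss_fn(model(x, z), y).backward()     # weight step, embedding fixed
    weight_opt.step()
    return z
```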
We also evaluate DVIP on large datasets with up to several million data instances to illustrate its good scalability and performance.","Gaussian process, implicit process, variational implicit process, Bayesian inference, function-space inference, implicit process concatenation" Denoising Masked Autoencoders are Certifiable Robust Vision Learners,https://openreview.net/forum?id=zDjtZZBZtqK,https://openreview.net/pdf?id=zDjtZZBZtqK,"In this paper, we propose a new self-supervised method for learning certified robust classifiers of images.","In this paper, we propose a new self-supervised method, which is called denoising masked autoencoders (DMAE), for learning certified robust classifiers of images. In DMAE, we corrupt each image by adding Gaussian noise to each pixel value and randomly masking several patches. A Transformer-based encoder-decoder model is then trained to reconstruct the original image from the corrupted one. In this learning paradigm, the encoder will learn to capture relevant semantics for the downstream tasks, which is also robust to Gaussian additive noise. We show that the pre-trained encoder can naturally be used as the base classifier in Gaussian smoothed models, where we can analytically compute the certified radius for any data point. Although the proposed method is simple, it yields significant performance improvement in downstream classification tasks. We show that the DMAE ViT-Base model, which just uses 1/10 parameters of the model developed in recent work (Carlini et al., 2022), achieves competitive or better certified accuracy in various settings. The DMAE ViT-Large model significantly surpasses all previous results, establishing a new state-of-the-art on the ImageNet dataset. We further demonstrate that the pre-trained model has good transferability to the CIFAR-10 dataset, suggesting its wide adaptability. All model checkpoints and code will be released soon.","self-supervised, certified robustness, randomized smoothing" Estimating individual treatment effects under unobserved confounding using binary instruments,https://openreview.net/forum?id=ULsuEVQbV-9,https://openreview.net/pdf?id=ULsuEVQbV-9,We propose a multiply robust machine learning framework for individual treatment effect estimation using binary instrumental variables.,"Estimating individual treatment effects (ITEs) from observational data is relevant in many fields such as personalized medicine. However, in practice, the treatment assignment is usually confounded by unobserved variables and thus introduces bias. A remedy to remove the bias is the use of instrumental variables (IVs). Such settings are widespread in medicine (e.g., trials where compliance is used as a binary IV). In this paper, we propose a novel, multiply robust machine learning framework, called MRIV, for estimating ITEs using binary IVs, thus yielding an unbiased ITE estimator. Different from previous work for binary IVs, our framework estimates the ITE directly via a pseudo outcome regression. (1)~We provide a theoretical analysis where we show that our framework yields multiply robust convergence rates: our ITE estimator achieves fast convergence even if several nuisance estimators converge slowly. (2)~We further show that our framework asymptotically outperforms state-of-the-art plug-in IV methods for ITE estimation. (3)~We build upon our theoretical results and propose a tailored deep neural network architecture called MRIV-Net for ITE estimation using binary IVs. 
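The DMAE corruption step (Gaussian noise on every pixel plus random patch masking) is simple to sketch; masking by zeroing whole patches is a simplification of how a ViT drops masked tokens.

```python
import numpy as np

def dmae_corrupt(img, sigma=0.25, mask_ratio=0.75, patch=16, rng=np.random):
    """Corrupt a (C, H, W) image: add pixel-wise Gaussian noise, then zero out a
    random subset of patches; the autoencoder reconstructs the clean image."""
    noisy = img + sigma * rng.randn(*img.shape)
    h, w = img.shape[-2] // patch, img.shape[-1] // patch
    keep = rng.rand(h, w) > mask_ratio                  # True = patch kept
    mask = np.kron(keep, np.ones((patch, patch)))       # upsample to pixel grid
    return noisy * mask, keep

x = np.random.rand(3, 224, 224)
corrupted, kept_patches = dmae_corrupt(x)
```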
Across various computational experiments, we demonstrate empirically that our MRIV-Net achieves state-of-the-art performance. To the best of our knowledge, our MRIV is the first multiply robust machine learning framework tailored to estimating ITEs in the binary IV setting. ","Causal machine learning, treatment effect estimation, instrumental variables" Approximate Bayesian Inference with Stein Functional Variational Gradient Descent,https://openreview.net/forum?id=a2-aoqmeYM4,https://openreview.net/pdf?id=a2-aoqmeYM4,,"We propose a general-purpose variational algorithm that forms a natural analogue of Stein variational gradient descent (SVGD) in function space. While SVGD successively updates a set of particles to match a target density, the method introduced here of Stein functional variational gradient descent (SFVGD) updates a set of particle functions to match a target stochastic process (SP). The update step is found by minimizing the functional derivative of the Kullback-Leibler divergence between SPs. SFVGD can be used either to train Bayesian neural networks (BNNs) or for ensemble gradient boosting. We show the efficacy of training BNNs with SFVGD on various real-world datasets.", Soundness and Completeness: An Algorithmic Perspective on Evaluation of Feature Attribution,https://openreview.net/forum?id=zWwrB9wenY1U,https://openreview.net/pdf?id=zWwrB9wenY1U,We propose a novel method to evaluate \emph{soundness} and \emph{completeness} of feature attribution methods.,"Feature attribution is a fundamental approach to explaining neural networks by quantifying the importance of input features for a model's prediction. Although a variety of feature attribution methods have been proposed, there is little consensus on the assessment of attribution methods. In this study, we empirically show the limitations of \emph{order-based} and \emph{model-retraining} metrics. To overcome the limitations and enable evaluation with higher granularity, we propose a novel method to evaluate the \emph{completeness} and \emph{soundness} of feature attribution methods. Our proposed evaluation metrics are mathematically grounded in algorithm theory and require no knowledge of ""ground truth"" informative features. We validate our proposed metrics by conducting experiments on synthetic and real-world datasets. Lastly, we use the proposed metrics to benchmark a wide range of feature attribution methods. Our evaluation results provide an innovative perspective on comparing feature attribution methods. Code is in the supplementary material. ","explainable AI, explainability, feature attribution" SCoMoE: Efficient Mixtures of Experts with Structured Communication,https://openreview.net/forum?id=s-c96mSU0u5,https://openreview.net/pdf?id=s-c96mSU0u5,," Mixture-of-Experts (MoE) models are promising architectures for massive multilingual neural machine translation and large language models due to the advantage of sublinear scaling. However, the training of large MoE models is usually bottlenecked by the all-to-all communication (Lepikhin et al., 2020). To reduce the communication cost, we propose SCoMoE, an MoE architecture with structured all-to-all communication, inspired by the hierarchical architecture of the communication topology. SCoMoE encourages data to be communicated across devices through fast intra-accelerator/node communication channels, reducing the traffic through the slow inter-node communication channel. 
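For orientation, the finite-dimensional SVGD update that SFVGD lifts to function space is reproduced below (Liu and Wang, 2016); this is the standard particle update, not the functional variant.

```python
import numpy as np

def svgd_step(particles, grad_log_p, step=0.1, h=1.0):
    """One SVGD update with an RBF kernel:
    phi(x_i) = (1/n) sum_j [ k(x_j, x_i) grad log p(x_j) + grad_{x_j} k(x_j, x_i) ]."""
    n = particles.shape[0]
    diff = particles[:, None, :] - particles[None, :, :]   # x_i - x_j
    k = np.exp(-(diff ** 2).sum(-1) / (2 * h))              # kernel matrix
    repulsion = (diff * k[..., None]).sum(axis=1) / h       # sum_j grad_{x_j} k
    phi = (k @ grad_log_p(particles) + repulsion) / n
    return particles + step * phi
```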
We slice the data on the sequence dimension (SCoMoE-Seq) into three communication groups and project the data on the feature dimension (SCoMoE-Feat) into low-dimensional representations. To compensate for the potential performance drop caused by the routing locality in SCoMoE, we further propose a token clustering approach to aggregating related tokens from different devices before the MoE layers. The sigmoid gating in the balanced router used in the token clustering is substituted with softmax gating with differential sorting. Experiments on massive bilingual and multilingual machine translation demonstrate that SCoMoE achieves a speedup of 1.44x over GShard with comparable performance, and substantially outperforms GShard (by 2.8 BLEU) on OPUS-100 with a speedup of 1.25x.", Locally Invariant Explanations: Towards Stable and Unidirectional Explanations through Local Invariant Learning,https://openreview.net/forum?id=SoAnNZ7Z3xw,https://openreview.net/pdf?id=SoAnNZ7Z3xw,A local explanation method that is stable and unidirectional,"The locally interpretable model-agnostic explanations (LIME) method is one of the most popular methods used to explain black-box models at a per-example level. Although many variants have been proposed, few provide a simple way to produce high-fidelity explanations that are also stable and intuitive. In this work, we provide a novel perspective by proposing a model agnostic local explanation method inspired by the invariant risk minimization (IRM) principle -- originally proposed for (global) out-of-distribution generalization -- to provide such high-fidelity explanations that are also stable and unidirectional across nearby examples. Our method is based on a game theoretic formulation where we theoretically show that our approach has a strong tendency to eliminate features where the gradient of the black-box function abruptly changes sign in the locality of the example we want to explain, while in other cases it is more careful and will choose a more conservative (feature) attribution, a behavior which can be highly desirable for recourse. Empirically, we show on tabular, image and text data that the quality of our explanations with neighborhoods formed using random perturbations is much better than that of LIME, and in some cases even comparable to other methods that use realistic neighbors sampled from the data manifold. This is desirable given that learning a manifold to either create realistic neighbors or to project explanations is typically expensive or may even be impossible. Moreover, our algorithm is simple and efficient to train, and can ascertain stable input features for local decisions of a black-box without access to side information such as a (partial) causal graph as has been seen in some recent works.",explainable AI Uncertainty-Aware Self-Supervised Learning with Independent Sub-networks,https://openreview.net/forum?id=CLmXXljIf__,https://openreview.net/pdf?id=CLmXXljIf__,We introduce an uncertainty-aware training regime for self-supervised models with an ensemble of independent sub-networks and a novel loss function for encouraging diversity.,"Self-supervised learning methods are state-of-the-art across a wide range of tasks in computer vision, natural language processing, and multimodal analysis. Estimating the epistemic -- or model -- uncertainty of self-supervised model predictions is critical for building trustworthy machine learning systems in crucial applications, such as medical diagnosis and autonomous driving. 
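For context, the vanilla top-2 routing that triggers the all-to-all exchange which SCoMoE restructures looks roughly as follows; this is a generic MoE gating sketch, not SCoMoE's structured variant.

```python
import numpy as np

def top2_dispatch(tokens: np.ndarray, gate_w: np.ndarray):
    """Generic top-2 MoE gating: softmax over expert logits, keep the two best
    experts per token, and renormalize their weights; each token is then sent
    to its experts' devices, which is the all-to-all step."""
    logits = tokens @ gate_w                           # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    top2 = np.argsort(-probs, axis=-1)[:, :2]          # expert ids per token
    weights = np.take_along_axis(probs, top2, axis=-1)
    weights /= weights.sum(-1, keepdims=True)
    return top2, weights

experts, weights = top2_dispatch(np.random.randn(16, 64), np.random.randn(64, 8))
```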
A common approach to estimating model uncertainty is to train a \emph{model ensemble}. However, deep ensembles induce high computational costs and memory demands. This is particularly challenging in self-supervised deep learning, where even a single network is computationally demanding. Moreover, most existing model uncertainty techniques are built for supervised deep learning. Motivated by this, we propose a novel approach to making self-supervised learning probabilistic. We introduce an uncertainty-aware training regime for self-supervised models with an ensemble of independent sub-networks and a novel loss function for encouraging diversity. Our method builds a sub-model ensemble with high diversity -- and consequently, well-calibrated estimates of model uncertainty -- at low computational overhead over a single model, while performing on par with deep self-supervised ensembles. Extensive experiments across different tasks, such as in-distribution generalization, out-of-distribution detection, dataset corruption, and semi-supervised settings, demonstrate that our approach increases prediction reliability. We show that our method achieves both excellent accuracy and calibration, improving over existing ensemble methods in a wide range of self-supervised architectures for computer vision, natural language processing, and genomics data. ","uncertainty-awareness, calibration, self-supervised pretraining, independent sub-networks, efficient ensemble" Prompt Learning with Optimal Transport for Vision-Language Models,https://openreview.net/forum?id=zqwryBoXYnh,https://openreview.net/pdf?id=zqwryBoXYnh,,"With the increasing attention to large vision-language models such as CLIP, there has been a significant amount of effort dedicated to building efficient prompts. Unlike conventional methods that learn only a single prompt, we propose to learn multiple comprehensive prompts to describe diverse characteristics of categories such as intrinsic attributes or extrinsic contexts. However, directly matching each prompt to the same visual feature is problematic, as it pushes the prompts to converge to one point. To solve this problem, we propose to apply optimal transport to match the vision and text modalities. Specifically, we first model images and the categories with visual and textual feature sets. Then, we apply a two-stage optimization strategy to learn the prompts. In the inner loop, we optimize the optimal transport distance to align visual features and prompts by the Sinkhorn algorithm, while in the outer loop, we learn the prompts by this distance from the supervised data. Extensive experiments are conducted on the few-shot recognition task, and the improvements demonstrate the superiority of our method.", An Additive Instance-Wise Approach to Multi-class Model Interpretation,https://openreview.net/forum?id=5OygDd-4Eeh,https://openreview.net/pdf?id=5OygDd-4Eeh,,"Interpretable machine learning offers insights into what factors drive a certain prediction of a black-box system. A large number of interpretation methods focus on selecting explanatory input features, which follow either additive or instance-wise directions. Additive methods exploit local neighbourhoods to learn instance-specific explainers sequentially. The process is thus inefficient and susceptible to poorly-conditioned samples. Meanwhile, instance-wise methods directly optimize local feature distributions in a global training framework, and are thus capable of leveraging global information from other inputs. 
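The inner loop mentioned in the optimal-transport prompt-learning abstract above is the standard Sinkhorn iteration for entropic optimal transport, sketched here for a cost matrix between a set of visual features and a set of prompts; dimensions and hyperparameters are illustrative.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, iters=200):
    """Entropic-OT Sinkhorn: alternate scaling updates until the transport plan
    approximately satisfies both marginals a and b."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]     # the transport plan

m, n = 49, 4                               # e.g., visual patches vs. prompts
plan = sinkhorn(np.random.rand(m, n), np.full(m, 1 / m), np.full(n, 1 / n))
assert np.allclose(plan.sum(1), 1 / m, atol=1e-3)
assert np.allclose(plan.sum(0), 1 / n, atol=1e-3)
```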
However, they can only interpret single-class predictions and suffer from inconsistency across different settings, due to a strict reliance on a pre-defined number of selected features. This work exploits the strengths of both methods and proposes a framework for learning local explanations simultaneously for multiple target classes. Our model explainer significantly outperforms additive and instance-wise counterparts in faithfulness, with more compact and comprehensible explanations. We also demonstrate the capacity to select stable and important features through extensive experiments on various data sets and black-box model architectures.", Knowledge-Consistent Dialogue Generation with Language Models and Knowledge Graphs,https://openreview.net/forum?id=WhWlYzUTJfP,https://openreview.net/pdf?id=WhWlYzUTJfP,"Knowledge-Consistent Dialogue Generation with Context-Relevant Subgraph Retrieval, Invariant Graph Encoding, and Graph-Text Contrastive Learning","Pre-trained language models have achieved impressive performances on dialogue generation tasks. However, when generating responses for a conversation that requires factual knowledge, they are far from perfect, due to the absence of mechanisms to retrieve, encode, and reflect the knowledge in the generated responses. Some knowledge-grounded dialogue generation methods tackle this problem by leveraging the structured knowledge from Knowledge Graphs (KGs). However, existing methods do not guarantee that the model utilizes a relevant piece of knowledge from the KG before generating knowledge-consistent dialogues. To overcome this limitation, we propose SUbgraph Retrieval-augmented GEneration (SURGE), a framework for generating context-relevant and knowledge-consistent dialogues with a KG. Specifically, our method first retrieves the relevant subgraph from the KG, and then enforces consistency across facts by perturbing their word embeddings conditioned on the retrieved subgraph. Then, it learns a latent representation space using contrastive learning, which ensures that the generated texts have high similarity to the retrieved subgraphs. We validate the performance of our SURGE framework on the OpendialKG and KOMODIS datasets and show that our method generates high-quality dialogues that faithfully reflect the knowledge from the KG. ","knowledge-grounded dialogue generation, knowledge graph" Offline Model-Based Reinforcement Learning with Causal Structure,https://openreview.net/forum?id=aIpq2eA4vDR,https://openreview.net/pdf?id=aIpq2eA4vDR,,"Model-based methods have recently been shown to be promising for offline reinforcement learning (RL), which aims at learning good policies from historical data without interacting with the environment. Previous model-based offline RL methods employ a straightforward prediction method that maps the states and actions directly to the next-step states. However, such a prediction method tends to capture spurious relations caused by the sampling policy preference behind the offline data. It is sensible that the environment model should focus on causal influences, which can facilitate learning an effective policy that can generalize well to unseen states. In this paper, we first provide theoretical results showing that causal environment models can outperform plain environment models in offline RL by incorporating the causal structure into the generalization error bound. 
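The graph-text contrastive objective mentioned in the SURGE abstract above is, in its generic form, a symmetric InfoNCE between matched text and subgraph embeddings; the sketch below shows that generic form, not SURGE's exact loss.

```python
import torch
import torch.nn.functional as F

def graph_text_contrastive(text_emb, graph_emb, tau=0.07):
    """Symmetric InfoNCE: the i-th generated text should be most similar to the
    i-th retrieved subgraph within the batch, and vice versa."""
    t = F.normalize(text_emb, dim=-1)
    g = F.normalize(graph_emb, dim=-1)
    logits = t @ g.T / tau                             # (batch, batch)
    labels = torch.arange(t.size(0), device=t.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```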
We also propose a practical algorithm, o\textbf{F}fline m\textbf{O}del-based reinforcement learning with \textbf{C}a\textbf{U}sal \textbf{S}tructure (FOCUS), to illustrate the feasibility of learning and leveraging causal structure in offline RL. Experimental results on two benchmarks show that FOCUS reconstructs the underlying causal structure accurately and robustly, and, as a result, outperforms both model-based offline RL algorithms and causal model-based offline RL algorithms.", Towards Semi-Supervised Learning with Non-Random Missing Labels,https://openreview.net/forum?id=aibmXGQJPs0,https://openreview.net/pdf?id=aibmXGQJPs0,A simple but effective approach yielding tangible improvement in the performance of semi-supervised learning with non-random missing labels.,"Semi-supervised learning (SSL) tackles the label missing problem by enabling the effective usage of unlabeled data. While existing SSL methods focus on the traditional setting, a practical and challenging scenario called label Missing Not At Random (MNAR) is usually ignored. In MNAR, the labeled and unlabeled data fall into different class distributions, resulting in biased label imputation, which deteriorates the performance of SSL models. In this work, class transition tracking based Pseudo-Rectifying Guidance (PRG) is devised for MNAR. We explore the class-level guidance information obtained by the Markov random walk, which is modeled on a dynamically created graph built over the class tracking matrix. PRG unifies the history information of each class transition caused by the pseudo-rectifying procedure to activate the model's enthusiasm for neglected classes, so that the quality of pseudo-labels on both popular and rare classes in MNAR can be improved. We show the superior performance of PRG across a variety of MNAR scenarios and the conventional SSL setting, outperforming the latest SSL solutions by a large margin. Checkpoints and evaluation code are available at the anonymous link https://anonymous.4open.science/r/PRG4SSL-MNAR-8DE2 while the source code will be available upon paper acceptance.","semi-supervised learning, label missing not at random, pseudo-rectifying guidance" Improving Generalization with Domain Convex Game,https://openreview.net/forum?id=eJtlrcnRtAs,https://openreview.net/pdf?id=eJtlrcnRtAs,,"Domain generalization (DG) tends to alleviate the poor generalization capability of deep neural networks by learning a model with multiple source domains. A classical solution to DG is domain augmentation, the common belief of which is that diversifying source domains will be conducive to out-of-distribution generalization. However, these claims are understood intuitively, rather than mathematically, and the relation between the diversity of source domains and model generalization still remains unclear. We thus explored this relation and found that model generalization does not strictly improve as domain diversity increases, limiting the effectiveness of domain augmentation. In view of this observation, we propose a new perspective on DG that recasts it as a convex game between domains. We formulate a regularization term based on the supermodularity property of convex games which rigorously demonstrates that the growth of domain diversity will enhance model generalization monotonically. This enables the model to best utilize the rich information within input data so that each diversified domain contributes to model generalization. 
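The class tracking matrix underlying PRG's random walk can be sketched as a row-normalized count of pseudo-label transitions between consecutive epochs; how PRG then uses these probabilities to rectify pseudo-labels is beyond this snippet.

```python
import numpy as np

def class_transition_probs(prev_pseudo, new_pseudo, n_classes):
    """Count pseudo-label transitions (prev -> new) and row-normalize into the
    Markov transition matrix on which a random walk can be modeled."""
    counts = np.zeros((n_classes, n_classes))
    np.add.at(counts, (prev_pseudo, new_pseudo), 1.0)
    row = counts.sum(axis=1, keepdims=True)
    uniform = np.full_like(counts, 1.0 / n_classes)     # fallback for empty rows
    return np.divide(counts, row, out=uniform, where=row > 0)

prev = np.random.randint(0, 10, size=5000)
new = np.random.randint(0, 10, size=5000)
P = class_transition_probs(prev, new, n_classes=10)     # rows sum to one
```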
Furthermore, we construct a sample filter to eliminate bad samples that contain unprofitable or even harmful information for generalization, such as noisy or redundant samples. Our framework presents a new avenue for the formal analysis of DG, the rationality and effectiveness of which have been demonstrated on extensive benchmark datasets.","transfer learning, domain generalization, convex game" Elastic Mean-Teacher Distillation Mitigates the Continual Learning Stability Gap,https://openreview.net/forum?id=f77FhfGzQWN,https://openreview.net/pdf?id=f77FhfGzQWN,,"Nowadays, neural networks are being used to solve a variety of tasks. They are very effective when trained on large datasets. However, in continual learning, they are trained on a non-stationary stream of data, which often results in forgetting of previous knowledge. In the literature, continual learning models are exposed to a sequence of tasks, and must learn each task one by one. They are then evaluated at the end of each learning session. This makes it possible to measure the average accuracy over all tasks encountered so far. Recently, De Lange et al. (2022) showed that continual learning methods suffer from the Stability Gap, encountered when evaluating the model continually. Even when the performance at the end of training is high, the worst-case performance is low, which could be a problem in applications where the learner needs to perform well on all tasks at all times while learning a new task. In this paper, we propose to apply a refined variant of knowledge distillation, adapted to the class-incremental learning setting, and used in combination with replay, to improve the stability of continual learning algorithms. We also propose to use a distillation method derived from the mean-teacher training paradigm introduced in semi-supervised learning. We demonstrate empirically that the use of this method enhances stability in the more challenging setting of online continual learning.","continual learning, continual evaluation" Neural Prompt Search,https://openreview.net/forum?id=58QUPAU0RJs,https://openreview.net/pdf?id=58QUPAU0RJs,"We propose to search, instead of hand-engineering, prompt modules for parameter-efficient transfer learning.","The size of vision models has grown exponentially over the last few years, especially after the emergence of Vision Transformer. This has motivated the development of parameter-efficient tuning methods, such as learning adapter layers or visual prompt tokens, which allow a tiny portion of model parameters to be trained, whereas the vast majority obtained from pre-training are frozen. However, designing a proper tuning method is non-trivial: one might need to try out a lengthy list of design choices, not to mention that each downstream dataset often requires custom designs. In this paper, we view the existing parameter-efficient tuning methods as ""prompt modules"" and propose Neural prOmpt seArcH (NOAH), a novel approach that learns, for large vision models, the optimal design of prompt modules through a neural architecture search algorithm, specifically for each downstream dataset. By conducting extensive experiments on over 20 vision datasets, we demonstrate that NOAH (i) is superior to individual prompt modules, (ii) has a good few-shot learning ability, and (iii) is domain-generalizable. 
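The mean-teacher ingredient referenced above is standard: the teacher is an exponential moving average of the student, and its soft predictions serve as distillation targets. A minimal sketch:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Mean-teacher update: teacher weights track an EMA of the student's."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Temperature-softened KL distillation against the teacher's soft targets."""
    p_t = torch.softmax(teacher_logits / T, dim=-1)
    log_p_s = torch.log_softmax(student_logits / T, dim=-1)
    return -(p_t * log_p_s).sum(dim=-1).mean() * T * T
```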
The code and models will be released to facilitate future research.","transfer learning, computer vision, parameter-efficient tuning, prompt learning, neural architecture search" It Takes Two: Masked Appearance-Motion Modeling for Self-Supervised Video Transformer Pre-Training,https://openreview.net/forum?id=0CbYJNJtM-X,https://openreview.net/pdf?id=0CbYJNJtM-X,,"Self-supervised video transformer pre-training has recently benefited from the mask-and-predict pipeline. These methods have demonstrated outstanding effectiveness on downstream video tasks and superior data efficiency on small datasets. However, temporal relations are not fully exploited by these methods. In this work, we explicitly investigate motion cues in videos as an extra prediction target and propose our Masked Appearance-Motion Modeling (MAM²) framework. Specifically, we design an encoder-regressor-decoder pipeline for this task. The regressor separates feature encoding and pretext-task completion, such that the feature extraction process is completed adequately by the encoder. In order to guide the encoder to fully excavate spatial-temporal features, two separate decoders are used for the two pretext tasks of disentangled appearance and motion prediction. We explore various motion prediction targets and find that RGB-difference is simple yet effective. As for appearance prediction, VQGAN codes are leveraged as the prediction target. With our pre-training pipeline, convergence can be remarkably sped up, e.g., we require 2x fewer epochs than the state-of-the-art VideoMAE (400 vs. 800) to achieve competitive performance. Extensive experimental results demonstrate that our method learns generalized video representations. Notably, our MAM² with ViT-B achieves 82.3% on Kinetics-400, 71.3% on Something-Something V2, 91.5% on UCF101, and 62.5% on HMDB51. ","Video Understanding, Masked Visual Modeling, Self-supervised Learning" DASHA: Distributed Nonconvex Optimization with Communication Compression and Optimal Oracle Complexity,https://openreview.net/forum?id=VA1YpcNr7ul,https://openreview.net/pdf?id=VA1YpcNr7ul,We provide a new method that improves the state-of-the-art theoretical complexity of distributed optimization methods with compressed communication in the nonconvex regime.,"We develop and analyze DASHA: a new family of methods for nonconvex distributed optimization problems. When the local functions at the nodes have a finite-sum or an expectation form, our new methods, DASHA-PAGE, DASHA-MVR and DASHA-SYNC-MVR, improve the theoretical oracle and communication complexity of the previous state-of-the-art method MARINA by Gorbunov et al. (2020). In particular, to achieve an $\varepsilon$-stationary point, and considering the random sparsifier Rand$K$ as an example, our methods compute the optimal number of gradients $\mathcal{O}\left(\frac{\sqrt{m}}{\varepsilon\sqrt{n}}\right)$ and $\mathcal{O}\left(\frac{\sigma}{\varepsilon^{3/2}n}\right)$ in finite-sum and expectation form cases, respectively, while maintaining the SOTA communication complexity $\mathcal{O}\left(\frac{d}{\varepsilon \sqrt{n}}\right)$. Furthermore, unlike MARINA, the new methods DASHA, DASHA-PAGE and DASHA-MVR send compressed vectors only, which makes them more practical for federated learning. We extend our results to the case when the functions satisfy the Polyak-Lojasiewicz condition. 
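The RGB-difference target found effective above is just frame differencing, a cheap proxy for motion:

```python
import numpy as np

def rgb_difference_target(clip: np.ndarray) -> np.ndarray:
    """Motion prediction target: per-pixel RGB difference between consecutive
    frames of a (T, C, H, W) clip, yielding a (T-1, C, H, W) tensor."""
    return clip[1:] - clip[:-1]

clip = np.random.rand(16, 3, 224, 224)
motion = rgb_difference_target(clip)
```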
Finally, our theory is corroborated in practice: we see a significant improvement in experiments with nonconvex classification and training of deep learning models.","Nonconvex Optimization, Variance Reduction, Compressed Communication, Distributed Optimization" LDMIC: Learning-based Distributed Multi-view Image Coding,https://openreview.net/forum?id=ILQVw4cA5F9,https://openreview.net/pdf?id=ILQVw4cA5F9,"We design a multi-view image compression framework based on the symmetric distributed source coding paradigm, which achieves higher compression performance than previous multi-view image compression methods.","Multi-view image compression plays a critical role in 3D-related applications. Existing methods adopt a predictive coding architecture, which requires joint encoding to compress the corresponding disparity as well as residual information. This demands collaboration among cameras and enforces the epipolar geometric constraint between different views, which makes it challenging to deploy these methods in distributed camera systems with randomly overlapping fields of view. Meanwhile, distributed source coding theory indicates that efficient data compression of correlated sources can be achieved by independent encoding and joint decoding, which motivates us to design a learning-based distributed multi-view image coding (LDMIC) framework. With independent encoders, LDMIC introduces a simple yet effective joint context transfer module based on the cross-attention mechanism at the decoder to effectively capture the global inter-view correlations, which is insensitive to epipolar geometry relations between images. Experimental results show that LDMIC significantly outperforms both traditional and learning-based MIC methods while enjoying fast encoding speed.","Deep multi-view image compression, distributed source coding, cross-attention mechanism" Improving Differentially-Private Deep Learning with Gradients Index Pruning,https://openreview.net/forum?id=dod5argWzR1,https://openreview.net/pdf?id=dod5argWzR1,,"Differential Privacy (DP) provides a formal privacy guarantee preventing adversaries from inferring an individual record from populations. Differentially Private Stochastic Gradient Descent (DPSGD), the widely used method to train a model satisfying DP, inserts randomized noise into the gradients in each iteration but leads to significant accuracy decline, particularly on large and deep models. Facing the curse of dimensionality in differentially-private deep learning, we propose a Gradient Index Pruning (GIP) mechanism, which prunes gradients by a novel index perturbation scheme, to preserve important components of the gradients while reducing their sizes. Our mechanism does not alter the model, but merely adds a noisy top-$k$ pruning step before the conventional gradient noise insertion in DPSGD. It is proven that GIP meets DP, yet improves accuracy over DPSGD. We also present theoretical analysis to show GIP indeed introduces less perturbation to the training. 
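The Rand$K$ sparsifier named in the DASHA abstract admits a compact sketch: keep $K$ random coordinates and rescale by $d/K$ so the compressor is unbiased.

```python
import numpy as np

def rand_k(grad: np.ndarray, k: int, rng=np.random) -> np.ndarray:
    """Unbiased RandK compressor: transmit k random coordinates scaled by d/k,
    so the compressed vector equals the original gradient in expectation."""
    d = grad.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(grad)
    out.flat[idx] = grad.flat[idx] * (d / k)
    return out

g = np.random.randn(10_000)
compressed = rand_k(g, k=100)              # 1% of coordinates sent
```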
Experiments on a variety of models and datasets have demonstrated that GIP exceeds state-of-the-art differentially-private deep learning methods by a $1-3\%$ accuracy boost.","Differential Privacy, Stochastic Gradient Descent" Additive Poisson Process: Learning Intensity of Higher-Order Interaction in Poisson Processes,https://openreview.net/forum?id=H9LxwdiXlh,https://openreview.net/pdf?id=H9LxwdiXlh,An efficient technique that uses a log-linear model on a partial order structure to approximate high-dimensional intensity functions in a Poisson process.,"We present the Additive Poisson Process (APP), a novel framework that can model the higher-order interaction effects of the intensity functions in Poisson processes using projections into lower-dimensional space. Our model combines the techniques in information geometry to model higher-order interactions on a statistical manifold and in generalized additive models to use lower-dimensional projections to overcome the effects of the curse of dimensionality. Our approach solves a convex optimization problem by minimizing the KL divergence from a sample distribution in lower-dimensional projections to the distribution modeled by an intensity function in the Poisson process. Our empirical results show that our model is able to use samples observed in the lower dimensional space to estimate the higher-order intensity function with extremely sparse observations.","Poisson Process, Log-Linear Model, Energy-Based Model, Generalized Additive Models, Information Geometry" Sound Randomized Smoothing in Floating-Point Arithmetic,https://openreview.net/forum?id=HaHCoGcpV9,https://openreview.net/pdf?id=HaHCoGcpV9,We construct classifiers producing wrong randomized smoothing certificates on images and propose a method to overcome this at a negligible cost.,"Randomized smoothing is sound when using infinite precision. However, we show that randomized smoothing is no longer sound for limited floating-point precision. We present a simple example where randomized smoothing certifies a radius of $1.26$ around a point, even though there is an adversarial example at distance $0.8$, and show how this can be abused to give false certificates for CIFAR10. We discuss the implicit assumptions of randomized smoothing and show that they do not apply to generic image classification models whose smoothed versions are commonly certified. In order to overcome this problem, we propose a sound approach to randomized smoothing when using floating-point precision with essentially equal speed for quantized input. It yields sound certificates for image classifiers which, for the ones tested so far, are very similar to those of the unsound practice of randomized smoothing. Our only assumption is that we have access to a fair coin.","Randomized smoothing, floating-point arithmetic, adversarial robustness, formal methods" Shuffle Gaussian Mechanism for Differential Privacy,https://openreview.net/forum?id=NIzeVwedJzB,https://openreview.net/pdf?id=NIzeVwedJzB,We give a first non-trivial study of the Gaussian mechanism in the shuffle model using R{\'e}nyi differential privacy (RDP).,"We study the Gaussian mechanism in the shuffle model of differential privacy (DP). 
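For context, the certificate whose floating-point soundness is at issue is the standard Gaussian-smoothing radius of Cohen et al. (2019): with a lower bound $p_A > 1/2$ on the top class probability under noise, the smoothed classifier is robust within $L_2$ radius $\sigma\,\Phi^{-1}(p_A)$. A direct evaluation (the kind the paper shows can be unsound in floating point):

```python
from scipy.stats import norm

def certified_radius(p_a_lower: float, sigma: float) -> float:
    """Cohen et al. L2 certificate; requires p_a_lower > 0.5. Note the paper
    shows naive floating-point evaluation of this pipeline can be unsound."""
    assert p_a_lower > 0.5
    return sigma * norm.ppf(p_a_lower)

print(certified_radius(0.99, sigma=0.5))   # about 1.16
```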
We present the \textit{first} non-trivial privacy guarantee of the mechanism by showing that its R{\'e}nyi differential privacy (RDP) is of the form: $$ \epsilon(\lambda) = \frac{1}{\lambda-1}\log\left(\frac{e^{-\lambda/2\sigma^2}}{n^\lambda}\sum_{\substack{k_1+\dotsc+k_n=\lambda;\\k_1,\dotsc,k_n\geq 0}}\binom{\lambda}{k_1,\dotsc,k_n}e^{\sum_{i=1}^nk_i^2/2\sigma^2}\right) $$ We further prove that the RDP is strictly upper-bounded by the Gaussian RDP without shuffling. The shuffle Gaussian RDP is advantageous in composing multiple DP mechanisms, where we demonstrate its improvement over the state-of-the-art approximate DP composition theorems in privacy guarantees of the shuffle model. Our formalism also has immediate application in several problems studied in the literature, including learning with stochastic gradient descent and distributed/federated learning, of which an empirical study is presented to demonstrate the efficacy of learning privately while employing the shuffle Gaussian mechanism.","differential privacy, shuffle model, dp-sgd, federated learning" Assessing Model Out-of-distribution Generalization with Softmax Prediction Probability Baselines and A Correlation Method,https://openreview.net/forum?id=1maXoEyeqx,https://openreview.net/pdf?id=1maXoEyeqx,,"This paper studies the use of Softmax prediction to assess model generalization under distribution shift. Specifically, given an out-of-distribution (OOD) test set and a pool of classifiers, we aim to develop a Softmax prediction-based measure which has a monotonic relationship with OOD generalization performance. We first show existing uncertainty measures (e.g., entropy and maximum Softmax prediction) are fairly useful for predicting generalization in some OOD scenarios. We then propose a new measure, Softmax Correlation (SoftmaxCorr). To obtain the SoftmaxCorr score for a classifier, we compute the class-class correlation matrix from all the Softmax vectors in a test set, and then compute its cosine similarity with the identity matrix. We show that the class-class correlation matrix reveals significant knowledge about the confusion matrix: its high similarity with the identity matrix means predictions have low confusion (uncertainty) and evenly cover all classes, and vice versa. Across three setups including ImageNet, CIFAR-10, and WILDS, we show that SoftmaxCorr is well predictive of model accuracy on both in-distribution and OOD datasets.", In-the-wild Pretrained Models Are Good Feature Extractors for Video Quality Assessment,https://openreview.net/forum?id=xf7_gvYW1Mx,https://openreview.net/pdf?id=xf7_gvYW1Mx,In-the-wild pretrained models can be used as feature extractors to represent the perceptual quality of videos directly.,"Video quality assessment (VQA) is a challenging problem since the perceptual quality of a video can be affected by many factors, \eg content attractiveness, distortion type and level, and motion pattern and level. Further, the huge expense of annotating limits the scale of VQA datasets, which becomes the main obstacle for deep learning-based VQA methods. In this paper, we propose a VQA method leveraging PreTrained Models, named PTM-VQA, to transfer knowledge from models pretrained on various pre-tasks to benefit VQA from different aspects. Specifically, features of input videos are extracted by different pretrained models with frozen weights, transformed to the same dimension, and integrated to generate the final representation.
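The shuffle Gaussian RDP expression quoted above can be evaluated directly for small $n$ and $\lambda$. A minimal sketch follows; it is exponential in $n$ and $\lambda$ and is meant only as a numerical sanity check. At $n = 1$ it must reduce to the standard Gaussian RDP $\lambda/(2\sigma^2)$, which the assertion verifies.

```python
import math
from itertools import combinations

def compositions(lam: int, n: int):
    """All (k_1,...,k_n) with k_i >= 0 and k_1 + ... + k_n = lam (stars and bars)."""
    for bars in combinations(range(lam + n - 1), n - 1):
        prev, parts = -1, []
        for b in bars:
            parts.append(b - prev - 1)
            prev = b
        parts.append(lam + n - 1 - prev - 1)
        yield parts

def shuffle_gaussian_rdp(lam: int, n: int, sigma: float) -> float:
    """Direct evaluation of the epsilon(lambda) expression in the abstract above."""
    s2 = 2.0 * sigma ** 2
    logs = []
    for ks in compositions(lam, n):
        log_multinom = math.lgamma(lam + 1) - sum(math.lgamma(k + 1) for k in ks)
        logs.append(log_multinom + sum(k * k for k in ks) / s2)
    m = max(logs)  # log-sum-exp for numerical stability
    log_sum = m + math.log(sum(math.exp(t - m) for t in logs))
    return (-lam / s2 - lam * math.log(n) + log_sum) / (lam - 1)

# Sanity check: with n = 1 the bound reduces to the Gaussian RDP lam / (2 sigma^2).
assert abs(shuffle_gaussian_rdp(4, 1, 1.0) - 4 / 2.0) < 1e-9
```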
Since these models possess various fields of knowledge and are often trained with labels irrelevant to quality, we propose an Intra-Consistency and Inter-Divisibility (ICID) loss, which imposes constraints on features extracted by multiple pretrained models from different samples. The intra-consistency constraint is model-wise and requires features extracted by different pretrained models to be in the same unified quality-aware latent space, while the sample-wise inter-divisibility introduces pseudo clusters based on the annotation of samples and tries to separate features of samples from different clusters. Further, confronted with a constantly growing number of pretrained models, it is crucial to determine which ones to use and how to use them. To tackle the problem, we propose an efficient scheme to choose suitable candidates: models that possess better clustering performance on a VQA dataset are chosen to be our candidate backbones. Extensive experiments demonstrate the effectiveness of the proposed method. ","video quality assessment, pretrained models, metric learning" Collaborative Pure Exploration in Kernel Bandit,https://openreview.net/forum?id=hLbeJ6jObDD,https://openreview.net/pdf?id=hLbeJ6jObDD,,"In this paper, we propose a novel Collaborative Pure Exploration in Kernel Bandit model (CoPE-KB), where multiple agents collaborate to complete different but related tasks with limited communication. Our model generalizes the prior CoPE formulation, with its single-task and classic MAB setting, to allow multiple tasks and general reward structures. We propose a novel communication scheme with an efficient kernelized estimator, and design optimal algorithms CoKernelFC and CoKernelFB for CoPE-KB with fixed-confidence and fixed-budget objectives, respectively. Nearly matching upper and lower bounds in both sampling and communication complexity are established to demonstrate the optimality of our algorithms. Our theoretical results explicitly quantify how task similarities influence learning speedup, and only depend on the effective dimension of the feature space. Our novel techniques, including an efficient kernelized estimator and a linear structured instance transformation, which overcome the communication difficulty in high-dimensional feature space and derive communication round lower bounds, can be of independent interest. ","Collaborative Pure Exploration (CoPE), kernel bandit, multi-agent bandit, multi-task learning, communication round" Provably Efficient Risk-Sensitive Reinforcement Learning: Iterated CVaR and Worst Path,https://openreview.net/forum?id=Yn0xg-kHNW-,https://openreview.net/pdf?id=Yn0xg-kHNW-,,"In this paper, we study a novel episodic risk-sensitive Reinforcement Learning (RL) problem, named Iterated CVaR RL, which aims to maximize the tail of the reward-to-go at each step, and focuses on tightly controlling the risk of getting into catastrophic situations at each stage. This formulation is applicable to real-world tasks that demand strong risk avoidance throughout the decision process, such as autonomous driving, clinical treatment planning and robotics. We investigate two performance metrics under Iterated CVaR RL, i.e., Regret Minimization and Best Policy Identification. For both metrics, we design efficient algorithms ICVaR-RM and ICVaR-BPI, respectively, and provide nearly matching upper and lower bounds with respect to the number of episodes $K$.
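For readers less familiar with the objective in the Iterated CVaR RL entry above: CVaR at level $\alpha$ is the expected value over the worst $\alpha$-probability tail of a distribution, and Iterated CVaR applies this operator to the reward-to-go at every step rather than once to the total return. A small worked example on a discrete distribution:

```python
import numpy as np

def cvar(rewards: np.ndarray, probs: np.ndarray, alpha: float) -> float:
    """CVaR_alpha of a discrete distribution: the expected value over the
    worst alpha-probability tail."""
    order = np.argsort(rewards)                    # worst outcomes first
    r, p = rewards[order], probs[order]
    capped = np.minimum(np.cumsum(p), alpha)       # probability mass kept in the tail
    w = np.diff(np.concatenate(([0.0], capped)))   # per-outcome tail weights
    return float(w @ r / alpha)

# Two outcomes: reward 0 w.p. 0.05, reward 1 w.p. 0.95. The worst 10% tail
# contains all of the bad outcome plus 5% of the good one, so CVaR_0.1 = 0.5.
print(cvar(np.array([0.0, 1.0]), np.array([0.05, 0.95]), alpha=0.10))  # -> 0.5
```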
We also investigate an interesting limiting case of Iterated CVaR RL, called Worst Path RL, where the objective becomes to maximize the minimum possible cumulative reward. For Worst Path RL, we propose an efficient algorithm with constant upper and lower bounds. Finally, the techniques we develop for bounding the change of CVaR due to the value function shift and decomposing the regret via a distorted visitation distribution are novel, and can find applications in other risk-sensitive online learning problems. ","Risk-sensitive Reinforcement Learning (RL), Iterated CVaR RL, Worst Path RL, RL theory" "FedREP: A Byzantine-Robust, Communication-Efficient and Privacy-Preserving Framework for Federated Learning",https://openreview.net/forum?id=j2SvoOSjxH8,https://openreview.net/pdf?id=j2SvoOSjxH8,,"Federated learning (FL) has recently become a hot research topic, in which Byzantine robustness, communication efficiency and privacy preservation are three important aspects. However, the tension among these three aspects makes it hard to simultaneously take all of them into account. In view of this challenge, we theoretically analyze the conditions that a communication compression method should satisfy to be compatible with existing Byzantine-robust methods and privacy-preserving methods. Motivated by the analysis results, we propose a novel communication compression method called consensus sparsification (ConSpar). To the best of our knowledge, ConSpar is the first communication compression method that is designed to be compatible with both Byzantine-robust methods and privacy-preserving methods. Based on ConSpar, we further propose a novel FL framework called FedREP, which is Byzantine-robust, communication-efficient and privacy-preserving. We theoretically prove the Byzantine robustness and the convergence of FedREP. Empirical results show that FedREP can significantly outperform communication-efficient privacy-preserving baselines. Furthermore, compared with Byzantine-robust communication-efficient baselines, FedREP can achieve comparable accuracy with an extra advantage of privacy preservation.","federated learning, Byzantine robustness, communication efficiency, privacy preservation" Few-Shot Transferable Robust Representation Learning via Bilevel Attacks,https://openreview.net/forum?id=LLy2vm_p35C,https://openreview.net/pdf?id=LLy2vm_p35C,,"Existing adversarial learning methods for enhancing the robustness of deep neural networks assume the availability of a large amount of data from which we can generate adversarial examples. However, in an adversarial meta-learning setting, the model needs to train with only a few adversarial examples to learn a robust model for unseen tasks, which is a very difficult goal to achieve. Further, learning transferable robust representations for unseen domains is a difficult problem even with a large amount of data. To tackle such a challenge, we propose a novel adversarial self-supervised meta-learning framework with bilevel attacks which aims to learn robust representations that can generalize across tasks and domains. Specifically, in the inner loop, we update the parameters of the given encoder by taking inner gradient steps using two different sets of augmented samples, and generate adversarial examples for each view by maximizing the instance classification loss. Then, in the outer loop, we meta-learn the encoder parameter to maximize the agreement between the two adversarial examples, which enables it to learn robust representations. 
We experimentally validate the effectiveness of our approach on unseen domain adaptation tasks, on which it achieves impressive performance. Specifically, our method significantly outperforms the state-of-the-art meta-adversarial learning methods on few-shot learning tasks, as well as self-supervised learning baselines in standard learning settings with large-scale datasets.","robust meta-learning, unseen domain, self-supervised learning, robustness" Targeted Adversarial Self-Supervised Learning,https://openreview.net/forum?id=wwRjJScpsOO,https://openreview.net/pdf?id=wwRjJScpsOO,We propose a novel targeted adversarial training method for the self-supervised learning frameworks.,"Recently, unsupervised adversarial training (AT) has been extensively studied to attain robustness with the models trained upon unlabeled data. To this end, previous studies have applied existing supervised adversarial training techniques to self-supervised learning (SSL) frameworks. However, all have resorted to untargeted adversarial learning as obtaining targeted adversarial examples is unclear in the SSL setting, which lacks label information. In this paper, we propose a novel targeted adversarial training method for the SSL frameworks. Specifically, we propose a target selection algorithm for the adversarial SSL frameworks; it is designed to select the most confusing sample for each given instance based on similarity and entropy, and perturb the given instance toward the selected target sample. Our method significantly enhances the robustness of an SSL model without requiring large batches of images or additional models, unlike existing works aimed at achieving the same goal. Moreover, our method is readily applicable to general SSL frameworks that only use positive pairs. We validate our method on benchmark datasets, on which it obtains superior robust accuracies, outperforming existing unsupervised adversarial training methods. ","adversarial self-supervised learning, self-supervised learning, targeted attack, robustness" Accurate and Efficient Soma Reconstruction in a Full Adult Fly Brain,https://openreview.net/forum?id=M1BrqvlID5J,https://openreview.net/pdf?id=M1BrqvlID5J,,"Neuron reconstruction in a full adult fly brain from high-resolution electron microscopy (EM) data is regarded as a cornerstone for neuroscientists to explore how neurons inspire intelligence. As the central part of neurons, somas in the full brain indicate the origin of neurogenesis and neural functions. However, due to the absence of EM datasets specifically annotated for somas, existing deep learning-based neuron reconstruction methods cannot directly provide accurate soma distribution and morphology. Moreover, full brain neuron reconstruction remains extremely time-consuming due to the unprecedentedly large size of EM data. In this paper, we develop an efficient soma reconstruction method for obtaining accurate soma distribution and morphology information in a full adult fly brain. To this end, we first construct a high-resolution EM dataset with fine-grained 3D manual annotations on somas. Relying on this dataset, we propose an efficient, two-stage deep learning algorithm for predicting accurate locations and boundaries of 3D soma instances. Further, we deploy a parallelized, high-throughput data processing pipeline for executing the above algorithm on the full brain.
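A hedged sketch of the kind of target selection described in the Targeted Adversarial Self-Supervised Learning entry above: for each instance, pick the batch sample that is both close in feature space and uncertain. The additive similarity-plus-entropy score and the use of a per-sample probability vector for the entropy term are assumptions for illustration, not the paper's exact criterion.

```python
import torch
import torch.nn.functional as F

def select_confusing_targets(feats: torch.Tensor, probs: torch.Tensor) -> torch.Tensor:
    """feats: (B, D) embeddings; probs: (B, C) per-sample distribution used
    for the entropy term (an illustrative assumption). Returns, for each
    instance, the index of its selected target within the batch."""
    z = F.normalize(feats, dim=1)
    sim = z @ z.T                                             # (B, B) cosine similarities
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(1)  # (B,) uncertainty
    score = sim + entropy.unsqueeze(0)                        # favor similar + uncertain
    score.fill_diagonal_(float("-inf"))                       # never select the instance itself
    return score.argmax(dim=1)
```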
Finally, we provide quantitative and qualitative results to validate the superiority of the proposed method, as well as comprehensive statistics of the reconstructed somas in the full adult fly brain from the biological perspective.", NIERT: Accurate Numerical Interpolation through Unifying Scattered Data Representations using Transformer Encoder,https://openreview.net/forum?id=pm7O7gJObtk,https://openreview.net/pdf?id=pm7O7gJObtk,"We present an accurate data-driven approach to numerical interpolation for scattered data using a Transformer encoder, enhanced with a pre-training technique.","Numerical interpolation for scattered data, i.e., estimating values for target points based on those of some observed points, is widely used in computational science and engineering. The existing approaches either require explicitly pre-defined basis functions, which makes them inflexible and limits their performance in practical scenarios, or train neural networks as interpolators, which still have limited interpolation accuracy as they treat observed and target points separately and cannot effectively exploit the correlations among data points. Here, we present a learning-based approach to numerical interpolation for scattered data using encoder representation of Transformers (called NIERT). Unlike the recent learning-based approaches, NIERT treats observed and target points in a unified fashion through embedding them into the same representation space, thus gaining the advantage of effectively exploiting the correlations among them. The specially-designed partial self-attention mechanism used by NIERT prevents unwanted interference from target points on observed points. We further show that the partial self-attention is essentially a learnable interpolation module combining multiple neural basis functions, which provides interpretability of NIERT. Through pre-training on large-scale synthetic datasets, NIERT achieves considerable improvement in interpolation accuracy for practical tasks. On both synthetic and real-world datasets, NIERT outperforms the existing approaches, e.g., on the TFRD-ADlet dataset for temperature field reconstruction, NIERT achieves an MAE of $1.897\times 10^{-3}$, substantially better than the state-of-the-art approach (MAE: $27.074\times 10^{-3}$). The source code of NIERT is available at https://anonymous.4open.science/r/NIERT-2BCF.","numerical interpolation, transformer encoder, mask mechanism, pre-training model" Triplet Similarity Learning on Concordance Constraint,https://openreview.net/forum?id=N8N2VMkWdVf,https://openreview.net/pdf?id=N8N2VMkWdVf,A simple and elegant loss function is proposed to exploit the concordance constraint of triplet similarity for deep metric learning.,"Triplet-based loss functions have been the paradigm of choice for robust deep metric learning (DML). However, conventional triplet-based losses require carefully tuning a decision boundary, i.e., the violation margin. When performing online triplet mining on each mini-batch, choosing a good global and constant prior value for the violation margin is challenging and irrational. To circumvent this issue, we propose a novel yet efficient concordance-induced triplet (CIT) loss as an objective function to train DML models. We formulate the similarity of triplet samples as a concordance constraint problem, then directly optimize concordance during DML model learning.
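The partial self-attention in the NIERT entry above can be pictured as an attention mask in which observed points never attend to target points. A minimal sketch, under the assumption that each target point may attend to the observed points and to itself:

```python
import torch

def partial_attention_mask(n_obs: int, n_tgt: int) -> torch.Tensor:
    """Boolean attention mask (True = blocked), with observed tokens first.
    Observed points attend only to observed points, so target points cannot
    interfere with them; details beyond this split are assumptions."""
    n = n_obs + n_tgt
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_obs, n_obs:] = True          # observed -> target: blocked
    mask[n_obs:, n_obs:] = True          # target -> other targets: blocked ...
    tgt = torch.arange(n_obs, n)
    mask[tgt, tgt] = False               # ... except each target to itself
    return mask

# Usage with a stock attention layer, e.g.:
# attn = torch.nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
# out, _ = attn(x, x, x, attn_mask=partial_attention_mask(n_obs, n_tgt))
```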
Triplet concordance refers to the predicted ordering of intra-class and inter-class similarities being correct, which is invariant to any monotone transformation of the decision boundary of triplet samples. Hence, our CIT loss is free from the plague of adopting the violation margin as a prior constraint. In addition, due to the high training complexity of triplet-based losses, we introduce a partial likelihood term for CIT loss to impose additional penalties on hard triplet samples, thus enforcing fast convergence. We extensively experiment on a variety of DML tasks to demonstrate the elegance and simplicity of our CIT loss against its counterparts. In particular, on face recognition, person re-identification, as well as image retrieval datasets, our method achieves performance comparable with the state of the art without laborious hyper-parameter tuning.","Metric Learning, Triplet Loss, Concordance, Hard samples" Temporal Label Smoothing for Early Prediction of Adverse Events,https://openreview.net/forum?id=miyZxvBxdoP,https://openreview.net/pdf?id=miyZxvBxdoP,Modulating label smoothing strength over time to reflect signal noise patterns and clinical priorities significantly improves deep learning model performance in the prediction of adverse medical events.,"Models that can predict adverse events ahead of time with low false-alarm rates are critical to the acceptance of decision support systems in the medical community. This challenging machine learning task is still typically treated as a simple binary classification, with few bespoke methods proposed to leverage temporal dependency across samples. We propose Temporal Label Smoothing (TLS), a novel learning strategy that modulates smoothing strength as a function of proximity to the event of interest. This regularization technique reduces model confidence at the class boundary, where the signal is often noisy or uninformative, thus allowing training to focus on clinically informative data points away from this boundary region. From a theoretical perspective, we also show that our method can be framed as an extension of multi-horizon prediction, a learning heuristic proposed in other early prediction work. TLS empirically matches or outperforms all competitor methods across all evaluation measures on various early prediction benchmark tasks. In particular, our approach significantly improves performance on clinically-relevant metrics such as event recall under low false-alarm rates.","Healthcare, Time-Series, Label Smoothing, Deep Learning, Application" Test-Time Robust Personalization for Federated Learning,https://openreview.net/forum?id=3aBuJEza5sq,https://openreview.net/pdf?id=3aBuJEza5sq,We identify the pitfalls of existing personalized federated learning methods during deployment and propose a novel test-time personalization solution.,"Federated Learning (FL) is a machine learning paradigm where many clients collaboratively learn a shared global model with decentralized training data. Personalization on FL models additionally adapts the global model to different clients, achieving promising results on consistent local training & test distributions. However, for real-world personalized FL applications, it is crucial to go one step further: robustifying FL models under evolving local test sets during deployment, where various types of distribution shifts can arise.
In this work, we identify the pitfalls of existing works under test-time distribution shifts and propose Federated Test-time Head Ensemble plus tuning (FedTHE+), which personalizes FL models with robustness to various test-time distribution shifts. We illustrate the advancement of FedTHE+ (and its computationally efficient variant FedTHE) over strong competitors, for training various neural architectures (CNN, ResNet, and Transformer) on CIFAR10 and ImageNet and evaluating on diverse test distributions. Along with this, we build a benchmark for assessing performance and robustness of personalized FL methods during deployment.","Federated Learning, Personalized Federated Learning, Test-time Robustness" On-Device Domain Generalization,https://openreview.net/forum?id=ddcqRzq6g2n,https://openreview.net/pdf?id=ddcqRzq6g2n,A systematic study on how to improve domain generalization for tiny neural networks,"We present a systematic study of domain generalization (DG) for tiny neural networks, a problem that is critical to on-device machine learning applications but has been overlooked in the literature where research has been focused on large models only. Tiny neural networks have much fewer parameters and lower complexity, and thus should not be trained the same way as their large counterparts for DG applications. We find that knowledge distillation is a strong candidate for solving the problem: it outperforms, by a large margin, state-of-the-art DG methods that were developed using large models. Moreover, we observe that the teacher-student performance gap on test data with domain shift is bigger than that on in-distribution data. To improve DG for tiny neural networks without increasing the deployment cost, we propose a simple idea called out-of-distribution knowledge distillation (OKD), which aims to teach the student how the teacher handles (synthetic) out-of-distribution data and proves to be a promising framework for tackling the problem. We also contribute a scalable method for creating DG datasets, called DOmain Shift in COntext (DOSCO), which can be applied to broad data at scale without much human effort. Code and models will be released.","Domain Generalization, Mobile Applications" What's Wrong with the Robustness of Object Detectors?,https://openreview.net/forum?id=rDArMCIvldMR,https://openreview.net/pdf?id=rDArMCIvldMR,,"Despite tremendous successes achieved, object detection models remain vulnerable to adversarial attacks. Even with imperceptible adversarial perturbations in images, they are prone to yielding erroneous detection predictions, posing a threat to various realistic applications, e.g., medical diagnosis and automatic driving. Although some existing methods can improve the adversarial robustness of detectors, they still suffer from the detection robustness bottleneck: the significant performance degradation on clean images and the limited robustness on adversarial images. In this paper, we conduct a comprehensive empirical investigation into what is wrong with the robustness of object detectors across four seminal architectures, i.e., two-stage, one-stage, anchor-free, and Transformer-based detectors, inspiring more research interest on this task. We also devise a Detection Confusion Matrix (DCM) and Classification-Ablative Validation (ClsAVal) for further detection robustness analyses. We explore underlying factors that account for the robustness bottleneck.
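A minimal sketch of the out-of-distribution knowledge distillation (OKD) idea from the On-Device Domain Generalization entry above: the student matches the teacher's soft predictions on synthetic OOD views of the data. The choice of `ood_augment` (e.g., a heavy style or context perturbation) and the temperature `T` are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn.functional as F

def okd_loss(student, teacher, x, ood_augment, T: float = 4.0) -> torch.Tensor:
    """Distill the teacher's behavior on synthetic OOD views into the student.
    `ood_augment` and the temperature T are assumptions for illustration."""
    x_ood = ood_augment(x)                 # synthetic out-of-distribution view
    with torch.no_grad():
        t_logits = teacher(x_ood)         # teacher is frozen during distillation
    s_logits = student(x_ood)
    return F.kl_div(F.log_softmax(s_logits / T, dim=1),
                    F.softmax(t_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
```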
It is empirically demonstrated that robust detectors have reliable localization robustness and poor classification robustness. The classification module easily misclassifies foreground objects as background. Furthermore, the robust Deformable-DETR suffers from poor classification and localization robustness. Our source codes, trained models, and detailed experiment results will be publicly available.","Object Detection, Adversarial Robustness" LAVA: Data Valuation without Pre-Specified Learning Algorithms,https://openreview.net/forum?id=JJuP86nBl4q,https://openreview.net/pdf?id=JJuP86nBl4q,"We propose LAVA: a novel model-agnostic approach to data valuation using a non-conventional, class-wise Wasserstein discrepancy.","Traditionally, data valuation is posed as a problem of equitably splitting the validation performance of a learning algorithm among the training data. As a result, the calculated data values depend on many design choices of the underlying learning algorithm. However, this dependence is undesirable for many use cases of data valuation, such as setting priorities over different data sources in a data acquisition process and informing pricing mechanisms in a data marketplace. In these scenarios, data needs to be valued before the actual analysis and the choice of the learning algorithm is still undetermined at that point. Another side-effect of the dependence is that to assess the value of individual points, one needs to re-run the learning algorithm with and without a point, which incurs a large computation burden. This work leapfrogs over the current limits of data valuation methods by introducing a new framework that can value training data in a way that is oblivious to the downstream learning algorithm. Our main results are as follows. $\textbf{(1)}$ We develop a proxy for the validation performance associated with a training set based on a non-conventional $\textit{class-wise}$ $\textit{Wasserstein distance}$ between the training and the validation set. We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions. $\textbf{(2)}$ We develop a novel method to value individual data based on the sensitivity analysis of the $\textit{class-wise}$ Wasserstein distance. Importantly, these values can be directly obtained $\textit{for free}$ from the output of off-the-shelf optimization solvers once the Wasserstein distance is computed. $\textbf{(3) }$We evaluate our new data valuation framework over various use cases related to detecting low-quality data and show that, surprisingly, the learning-agnostic feature of our framework enables a $\textit{significant improvement}$ over the state-of-the-art performance while being $\textit{orders of magnitude faster.}$ ","data valuation, optimal transport, model agnostic, data-driven" FONDUE: an Algorithm to Find the Optimal Dimensionality of the Latent Representations of Variational Autoencoders,https://openreview.net/forum?id=x9S5kdaQkkY,https://openreview.net/pdf?id=x9S5kdaQkkY,A principled method using intrinsic dimension estimation to find the optimal number of latent dimensions for variational autoencoders.,"When training a variational autoencoder (VAE) on a given dataset, determining the optimal number of latent variables is mostly done by grid search: a costly process in terms of computational time and carbon footprint. In this paper, we explore the intrinsic dimension estimation (IDE) of the data and latent representations learned by VAEs.
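To convey the class-wise Wasserstein idea behind the LAVA entry above, here is a deliberately simplified proxy that conditions on the class and averages 1D Wasserstein distances per feature coordinate. The paper uses a proper high-dimensional optimal-transport formulation, so this sketch should be read only as intuition for the "condition on the class, then compare distributions" step.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def classwise_wasserstein_proxy(Xtr, ytr, Xva, yva):
    """Crude 1D proxy: average, over classes and feature coordinates, of the
    Wasserstein distance between train and validation feature marginals.
    Assumes every class appears in both splits."""
    dists = []
    for c in np.unique(ytr):
        A, B = Xtr[ytr == c], Xva[yva == c]
        dists.append(np.mean([wasserstein_distance(A[:, j], B[:, j])
                              for j in range(Xtr.shape[1])]))
    return float(np.mean(dists))
```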
We show that the discrepancies between the IDE of the mean and sampled representations of a VAE after only a few steps of training reveal the presence of passive variables in the latent space, which, in well-behaved VAEs, indicates a superfluous number of dimensions. Using this property, we propose FONDUE: an algorithm which quickly finds the number of latent dimensions after which the mean and sampled representations start to diverge (i.e., when passive variables are introduced), providing a principled method for selecting the number of latent dimensions for VAEs and autoencoders.","variational autoencoders, VAEs, representation learning, intrinsic dimension estimation, IDE, polarised regime" Distortion-Aware Network Pruning and Feature Reuse for Real-time Video Segmentation,https://openreview.net/forum?id=DCgjv41MD2M,https://openreview.net/pdf?id=DCgjv41MD2M,,"Real-time video segmentation is a crucial task for many real-world applications such as autonomous driving and robot control. Since state-of-the-art semantic segmentation models are often too heavy for real-time applications despite their impressive performance, researchers have proposed lightweight architectures with speed-accuracy trade-offs, achieving real-time speed at the expense of reduced accuracy. In this paper, we propose a novel framework to speed up any architecture with skip-connections for real-time vision tasks by exploiting the temporal locality in videos. Specifically, at the arrival of each frame, we transform the features from the previous frame to reuse them at specific spatial bins. We then perform partial computation of the backbone network on the regions of the current frame that captures temporal differences between the current and previous frame. This is done by dynamically dropping out residual blocks using a gating mechanism which decides which blocks to drop based on inter-frame distortion. We validate our Spatial-Temporal Mask Generator (STMG) on video semantic segmentation benchmarks with multiple backbone networks, and show that our method largely speeds up inference with minimal loss of accuracy.", An Encryption Framework for Pre-Trained Neural Networks,https://openreview.net/forum?id=w4eFMKkF_a_,https://openreview.net/pdf?id=w4eFMKkF_a_,,"Having consumed huge amounts of training data and computational resources, large-scale pre-trained models are often considered key assets of AI service providers. This raises an important problem: how to prevent these models from being maliciously copied when they are running on a customer's computing device? We answer this question by adding a set of confusion neurons into the pre-trained model, where the position of these neurons is encoded into a few integers that are easy to encrypt. We find that most often, a small portion of confusion neurons are able to effectively contaminate the pre-trained model. Thereafter, we extend our study to a bigger picture in which customers may develop algorithms to eliminate the effect of confusion neurons and recover the original network, and we show that our simple approach is somewhat capable of defending itself against the fine-tuning attack.", How do Variational Autoencoders Learn? Insights from Representational Similarity,https://openreview.net/forum?id=s_2Rye-RctO,https://openreview.net/pdf?id=s_2Rye-RctO,How VAEs' representations converge during learning,"The ability of Variational Autoencoders (VAEs) to learn disentangled representations has made them popular for practical applications.
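The IDE comparison that FONDUE (above) relies on can be sketched with a TwoNN-style estimator, one common intrinsic-dimension estimator; the paper does not mandate this particular choice, so treat it as an assumption.

```python
import numpy as np

def twonn_id(X: np.ndarray) -> float:
    """TwoNN-style intrinsic dimension estimate: Pareto MLE on the ratio of
    second- to first-nearest-neighbor distances (assumes no duplicate rows)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    r = np.sort(D, axis=1)[:, :2]          # r1 (nearest), r2 (second nearest)
    mu = r[:, 1] / r[:, 0]
    return len(mu) / np.log(mu).sum()

# FONDUE's signal: a growing gap between the IDE of the mean latents and of
# the sampled latents indicates passive dimensions, e.g.:
# gap = abs(twonn_id(z_mean) - twonn_id(z_sampled))
```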
However, their behaviour is not yet fully understood. For example, the questions of when they can provide disentangled representations, or when they suffer from posterior collapse, are still areas of active research. Despite this, there are no layerwise comparisons of the representations learned by VAEs, which would further our understanding of these models. In this paper, we thus look into the internal behaviour of VAEs using representational similarity techniques. Specifically, using the CKA and Procrustes similarities, we found that the encoders' representations are learned long before the decoders', and this behaviour is independent of hyperparameters, learning objectives, and datasets. Moreover, the encoders' representations in all but the mean and variance layers are similar across hyperparameters and learning objectives.","variational autoencoders, VAEs, CKA, Procrustes, representation learning, representational similarity, learning dynamics" Meta-prediction Model for Distillation-Aware NAS on Unseen Datasets,https://openreview.net/forum?id=SEh5SfEQtqB,https://openreview.net/pdf?id=SEh5SfEQtqB,"We propose a one-shot meta accuracy prediction model which can predict a given architecture's final performance on a dataset when performing KD with a given teacher, without having to actually train it on the target task. ","Distillation-aware Network Architecture Search (DaNAS) aims to search for an optimal student architecture that can obtain the best performance and/or efficiency when distilling the knowledge from a given teacher model. Previous DaNAS methods have mostly tackled the search for the network architecture for fixed source/target tasks and a fixed teacher; the resulting architectures do not generalize well to a new task, so a costly search must be performed for any new combination of domains and teachers. For standard NAS tasks without KD, meta-learning-based computationally efficient NAS methods have been proposed, which learn the generalized search process over multiple tasks and transfer the knowledge obtained over those tasks to a new task. However, since they assume learning from scratch without KD from a teacher, they might not be ideal for DaNAS scenarios, which could significantly affect the final accuracies of the architectures obtained from the search. To eliminate the excessive computational cost of DaNAS methods and the sub-optimality of rapid NAS methods, we propose a distillation-aware meta accuracy prediction model which can predict a given architecture's final performance on a dataset when performing KD with a given teacher, without having to actually train it on the target task. The experimental results demonstrate that our proposed meta-prediction model successfully generalizes to multiple unseen datasets for DaNAS tasks, largely outperforming existing meta-NAS methods and rapid NAS baselines. ","Neural Architecture Search, Meta Learning" Manifold Characteristics That Predict Downstream Task Performance,https://openreview.net/forum?id=rjYUBo_uWEs,https://openreview.net/pdf?id=rjYUBo_uWEs,"We introduce the Representation manifold quality metric (RMQM), which measures the structure of the learned representation manifold, where we then show that RMQM correlates positively to the generalisation of neural networks to downstream tasks. ","Pretraining methods are typically compared by evaluating the accuracy of linear classifiers, transfer learning performance, or visually inspecting the representation manifold's (RM) lower-dimensional projections.
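Linear CKA, one of the two similarity indices used in the VAE representational-similarity entry above, takes only a few lines of numpy (rows are the same n examples in both representations):

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two representations of the same n examples
    (X: n x d1, Y: n x d2). Columns are centered first."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)
```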
We show that the differences between methods can be understood more clearly by investigating the RM directly, which allows for a more detailed comparison. To this end, we propose a framework and new metric to measure and compare different RMs. We also investigate and report on the RM characteristics for various pretraining methods. These characteristics are measured by applying sequentially larger local alterations to the input data, using white noise injections and Projected Gradient Descent (PGD) adversarial attacks, and then tracking each datapoint. We calculate the total distance moved for each datapoint and the relative change in distance between successive alterations. We show that self-supervised methods learn an RM where alterations lead to large but constant-size changes, indicating a smoother RM than fully supervised methods. We then combine these measurements into one metric, the Representation Manifold Quality Metric (RMQM), where larger values indicate larger and less variable step sizes, and show that RMQM correlates positively with performance on downstream tasks.","self-supervised learning, deep learning, representation learning" Context Autoencoder for Self-Supervised Representation Learning,https://openreview.net/forum?id=Gb2Rndy5595,https://openreview.net/pdf?id=Gb2Rndy5595,,"We present a novel masked image modeling (MIM) approach, context autoencoder (CAE), for self-supervised representation pretraining. The goal is to pretrain an encoder by solving the pretext task: estimate the masked patches from the visible patches in an image. Our approach first feeds the visible patches into the encoder, extracting the representations. Then, we make predictions from visible patches to masked patches in the encoded representation space. We introduce an alignment constraint, encouraging that the representations for masked patches, predicted from the encoded representations of visible patches, are aligned with the masked patch representations computed from the encoder. In other words, the predicted representations are expected to lie in the encoded representation space, which empirically benefits representation learning. Last, the predicted masked patch representations are mapped to the targets of the pretext task through a decoder. One additional characteristic is that our approach encourages the separation of the representation learning part (encoder) and the pretext task completion part that will be replaced by the downstream task part. In contrast, previous MIM methods (e.g., BEiT and MAE) couple the two parts, potentially limiting the representation learning quality. We demonstrate the effectiveness of our CAE through superior transfer performance in downstream tasks: semantic segmentation, and object detection and instance segmentation.","Self-Supervised Representation Learning, Masked Image Modeling, Context Autoencoder" Learning to Linearize Deep Neural Networks for Secure and Efficient Private Inference,https://openreview.net/forum?id=BGF9IeDfmlH,https://openreview.net/pdf?id=BGF9IeDfmlH,"We present an automated linearization method to train a DNN with a limited ReLU budget for inference, yielding models that perform significantly better than the existing private inference SOTA in terms of both potentially improved latency and accuracy.","The large number of ReLU non-linearity operations in existing deep neural networks makes them ill-suited for latency-efficient private inference (PI).
Existing techniques to reduce ReLU operations often involve manual effort and sacrifice significant accuracy. In this paper, we first present a novel measure of non-linearity layers’ ReLU sensitivity, removing the time-consuming manual effort of identifying them. Based on this sensitivity, we then present SENet, a three-stage training method that, for a given ReLU budget, automatically assigns per-layer ReLU counts, decides the ReLU locations for each layer’s activation map, and trains a model with significantly fewer ReLUs to potentially yield latency- and communication-efficient PI. Experimental evaluations with multiple models on various datasets show SENet’s superior performance both in terms of reduced ReLUs and improved classification accuracy compared to existing alternatives. In particular, SENet can yield models that require up to ∼2× fewer ReLUs while yielding similar accuracy. For a similar ReLU budget, SENet can yield models with ∼2.32% improved classification accuracy, evaluated on CIFAR-100.","Efficient private inference, cryptographic inference, machine learning as a service, efficient cryptographic inference, automated ReLU reduction" Mitigating Forgetting in Online Continual Learning via Contrasting Semantically Distinct Augmentations,https://openreview.net/forum?id=78IUEPOGjG6,https://openreview.net/pdf?id=78IUEPOGjG6,Leverage strong data augmentation to mitigate catastrophic forgetting,"Online continual learning (OCL) aims to enable model learning from a non-stationary data stream to continuously acquire new knowledge as well as retain the learnt one, under the constraints of having limited system size and computational cost, in which the main challenge comes from the ""catastrophic forgetting"" issue -- the inability to well remember the learnt knowledge while learning the new ones. With the specific focus on the class-incremental OCL scenario, i.e. OCL for classification, the recent advance incorporates the contrastive learning technique for learning more generalised feature representation to achieve the state-of-the-art performance but is still unable to fully resolve the catastrophic forgetting. In this paper, we follow the strategy of adopting contrastive learning but further introduce the semantically distinct augmentation technique, which leverages strong augmentation to generate more data samples, and we show that treating these samples as semantically different from their original classes (thus being related to out-of-distribution samples) in the contrastive learning mechanism helps alleviate forgetting and facilitates model stability. Moreover, in addition to contrastive learning, the typical classification mechanism and objective (i.e. softmax classifier and cross-entropy loss) are included in our model design for faster convergence and utilising the label information, but particularly equipped with a sampling strategy to tackle the tendency of favouring the new classes (i.e. model bias towards the recently learnt classes).
Extensive experiments on the CIFAR-10, CIFAR-100, and Mini-ImageNet datasets show that our proposed method achieves superior performance against various baselines.","continual learning, representation learning, memory replay" Wasserstein Fair Autoencoders,https://openreview.net/forum?id=PXibCVxXdT,https://openreview.net/pdf?id=PXibCVxXdT,We present a framework based on Wasserstein autoencoders that can reinforce some theoretical weak links in the variational approaches on fair or disentangled representation.,"Autoencoders, or nonlinear factor models parameterized by neural networks, have become an indispensable tool for generative modeling and representation learning in high dimensions. Imposing structural constraints such as conditional independence on the latent variables (representation, or factors) in order to capture invariance or fairness with autoencoders has been attempted through adding ad hoc penalties to the loss function mostly in the variational autoencoder (VAE) context, often based on heuristic arguments. In this paper, we demonstrate that Wasserstein autoencoders (WAEs) are highly flexible in embracing structural constraints. Well-known extensions of VAEs for this purpose are gracefully handled within the framework of the seminal result by Tolstikhin et al. (2018). In particular, given a conditional independence structure of the generative model (decoder), corresponding encoder structure and penalties are induced from the functional constraints that define the WAE. This property of WAEs opens up a principled way of penalizing autoencoders to impose structural constraints. Utilizing this generative model structure, we present results on fair representation and conditional generation tasks, and compare them with other preceding methods.","conditional generation, fair representation, disentanglement, wasserstein autoencoder" Results for Perfect Classification for Graph Attention on the Contextual Stochastic Block Model,https://openreview.net/forum?id=470wZ5Qk4ur,https://openreview.net/pdf?id=470wZ5Qk4ur,,"We study the ability of one-layer Graph Attention Networks (GAT) to achieve perfect node classification for a simple synthetic data model called the contextual stochastic block model (CSBM). We determine a \textit{positive} CSBM parameter regime such that GAT achieves perfect classification and a \textit{negative} CSBM parameter regime such that GAT fails to achieve perfect classification. For the positive result we use a generalized attention mechanism of the original~\citep{Velickovic2018GraphAN}. For the negative result we consider a fixed attention mechanism which is determined using the labels of the nodes. We pose two questions. \textit{Is the condition of GAT for achieving perfect classification better than that of a simple community detection method, i.e., thresholding the second principal eigenvector of the adjacency matrix~\citep{Abbe2018}?} The answer to this question is negative, and it depends on the parameter regime of the CSBM distribution. This happens because the graph information is coupled with the feature information using the operation of matrix multiplication. However, such a matrix multiplication operation can be detrimental to perfect node classification.
The second question is, \textit{is the condition of GAT for achieving perfect classification better than that of simple graph convolution (GCN)~\citep{kipf:gcn}?} We show that GAT is better than GCN if the attention mechanism of GAT is a Lipschitz function, while it is not better if it is not a Lipschitz function.", Denoising Diffusion Error Correction Codes,https://openreview.net/forum?id=rLwC0_MG-4w,https://openreview.net/pdf?id=rLwC0_MG-4w,We propose a novel SOTA Neural error correction decoder based on a new diffusion model.,"Error correction code (ECC) is an integral part of the physical communication layer, ensuring reliable data transfer over noisy channels. Recently, neural decoders have demonstrated their advantage over classical decoding techniques. However, recent state-of-the-art neural decoders suffer from high complexity and lack the important iterative scheme characteristic of many legacy decoders. In this work, we propose to employ denoising diffusion models for the soft decoding of linear codes at arbitrary block lengths. Our framework models the forward channel corruption as a series of diffusion steps that can be reversed iteratively. Three contributions are made: (i) a diffusion process suitable for the decoding setting is introduced, (ii) the neural diffusion decoder is conditioned on the number of parity errors, which indicates the level of corruption at a given step, (iii) a line search procedure based on the code's syndrome obtains the optimal reverse diffusion step size. The proposed approach demonstrates the power of diffusion models for ECC and is able to achieve state of the art accuracy, outperforming the other neural decoders by sizable margins, even for a single reverse diffusion step. ","ECC, Deep Learning, Diffusion Models" Low-Rank Winograd Transformation for 3D Convolutional Neural Networks,https://openreview.net/forum?id=zOLLCOgUGIH,https://openreview.net/pdf?id=zOLLCOgUGIH,,"This paper focuses on Winograd transformation in 3D convolutional neural networks (CNNs) that are more over-parameterized compared with the common 2-D version. The over-increasing Winograd parameters not only exacerbate training complexity but also barricade the practical speedups due simply to the volume of element-wise products in the Winograd domain. We attempt to reduce trainable parameters by introducing a low-rank Winograd transformation, a novel training paradigm that decouples the original large tensor into two less storage-required trainable tensors, leading to a significant complexity reduction. Built upon our low-rank Winograd transformation, we take one step ahead by proposing a low-rank oriented sparse granularity that measures column-wise parameter importance. By simply involving the non-zero columns in the element-wise product, our sparse granularity is empowered with the ability to produce a very regular sparse pattern to acquire effectual Winograd speedups. To better understand the efficacy of our method, we perform extensive experiments upon 3D CNNs. Results manifest that our low-rank Winograd transformation well outperforms the vanilla Winograd transformation. 
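For concreteness, data from the two-class CSBM studied in the GAT entry above can be sampled as follows. Parameter names follow common CSBM conventions (intra-class edge probability `p`, inter-class `q`, feature signal strength `mu`, noise scale `sigma`) rather than the paper's exact notation.

```python
import numpy as np

def sample_csbm(n: int, d: int, p: float, q: float, mu: float, sigma: float, seed: int = 0):
    """Two-class contextual stochastic block model: intra-class edges with
    probability p, inter-class with q; features are +/- mu along a shared
    random unit direction plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n) * 2 - 1                 # labels in {-1, +1}
    P = np.where(np.equal.outer(y, y), p, q)               # edge probabilities
    A = np.triu(rng.random((n, n)) < P, 1).astype(float)
    A = A + A.T                                            # symmetric, no self-loops
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    X = (mu * y)[:, None] * u[None, :] + sigma * rng.normal(size=(n, d))
    return A, X, y
```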
We also show that our proposed low-rank oriented sparse granularity permits practical Winograd acceleration compared with the vanilla counterpart.","3D CNN, Network Pruning, Winograd Algorithm" Progressive Purification for Instance-Dependent Partial Label Learning,https://openreview.net/forum?id=PHcLZ8Yh6h4,https://openreview.net/pdf?id=PHcLZ8Yh6h4,,"Partial-label learning (PLL) aims to train multi-class classifiers from instances with partial labels (PLs)---a PL for an instance is a set of candidate labels where a fixed but unknown candidate is the true label. In the last few years, the instance-independent generation process of PLs has been extensively studied, on the basis of which many practical and theoretical advances have been made in PLL, while relatively less attention has been paid to the practical setting of instance-dependent PLs, namely, the PL depends not only on the true label but also on the instance itself. In this paper, we propose a theoretically grounded and practically effective approach called progressive purification (POP) for instance-dependent PLL: in each epoch, POP updates the learning model while purifying each PL by progressively moving out false candidate labels for the next epoch of the model training. Theoretically, we prove that POP enlarges the region where the model is reliable by a promising rate, and eventually approximates the Bayes optimal classifier with mild assumptions; technically, POP is flexible with arbitrary losses and compatible with deep networks, so the previous advanced PLL losses can be embedded in it and the performance is often significantly improved.",Partial label learning Meta Knowledge Condensation for Federated Learning,https://openreview.net/forum?id=TDf-XFAwc79,https://openreview.net/pdf?id=TDf-XFAwc79,,"Existing federated learning paradigms usually extensively exchange distributed models, rather than original data, at a central solver to achieve a more powerful model. However, this would incur a severe communication burden between a server and multiple clients, especially when data distributions are heterogeneous. As a result, current federated learning methods often require plenty of communication rounds in training. Unlike existing paradigms, we introduce an alternative perspective to significantly decrease the federated learning communication cost without leaking original data. In this work, we first present a meta knowledge representation method that extracts meta knowledge from distributed clients. The extracted meta knowledge encodes essential information that can be used to improve the current model. As the training progresses, the contributions of the same training samples to a federated model should also vary. Thus, we introduce a dynamic weight assignment mechanism that enables informative samples to contribute adaptively to the current model update. Then, informative meta knowledge from all active clients is sent to the server for model update. Training the model on the combined meta knowledge, which can be regarded as a condensed form of the original data, can significantly mitigate the heterogeneity issues. Moreover, to further ameliorate data heterogeneity, we also exchange meta knowledge among clients as conditional initialisation for meta knowledge extraction. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method.
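One purification step in the spirit of POP (the Progressive Purification entry above) can be sketched as dropping candidate labels whose predicted probability falls well below the best remaining candidate's. The relative-threshold rule and its schedule `tau` are assumptions for illustration, not the paper's exact moving-out criterion.

```python
import numpy as np

def purify_candidates(probs: np.ndarray, candidates: np.ndarray, tau: float) -> np.ndarray:
    """One illustrative purification step.
    probs: (N, C) model softmax outputs; candidates: (N, C) boolean mask of
    current candidate labels; tau in (0, 1) controls how aggressively labels
    are moved out. Returns the purified candidate mask for the next epoch."""
    masked = np.where(candidates, probs, -np.inf)
    best = masked.max(axis=1, keepdims=True)
    keep = candidates & (masked >= tau * best)   # drop low-confidence candidates
    keep |= masked == best                       # never empty a candidate set
    return keep
```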
Remarkably, our method outperforms the state-of-the-art by a large margin (from $74.07\%$ to $92.95\%$) on MNIST with a restricted communication budget (\textit{i.e.}, 10 rounds).", Improved Fully Quantized Training via Rectifying Batch Normalization,https://openreview.net/forum?id=87n67AtiHo,https://openreview.net/pdf?id=87n67AtiHo,,"Quantization-aware Training (QAT) is able to reduce the training cost by quantizing neural network weights and activations in the forward pass and improve the speed at the inference stage. QAT can be extended to Fully-Quantized Training (FQT), which further accelerates the training by quantizing gradients in the backward pass as back-propagation typically occupies half of the training time. Unfortunately, gradient quantization is challenging as Stochastic Gradient Descent (SGD) based training is sensitive to the precision of the gradient signal. Particularly, the noise introduced by gradient quantization accumulates during backward pass, which causes the exploding gradient problem and results in unstable training and significant accuracy drop. Though Batch Normalization (BatchNorm) is a de-facto resort to stabilize training in regular full-precision scenario, we observe that it fails to prevent the gradient explosion when gradient quantizers are injected in the backward pass. Surprisingly, our theory shows that BatchNorm could amplify the noise accumulation, which in turn hastens the explosion of gradients. A BatchNorm rectification method is derived from our theory to suppress the amplification effect and bridge the performance gap between full-precision training and FQT. Adding this simple rectification loss to baselines generates better results than most prior FQT algorithms on various neural network architectures and datasets, regardless of the gradient bit-widths used (8,4, and 2 bits).","Model Compression, Gradient Quantization, Convolution Neural Networks, Batch Normalization" Video-based 3D Object Detection with Learnable Object-Centric Global Optimization,https://openreview.net/forum?id=G7_LoXdE2Oe,https://openreview.net/pdf?id=G7_LoXdE2Oe,,"We study utilizing long-term temporal visual correspondence-based optimization for video-based 3D object detection in this work. Visual correspondence refers to one-to-one mappings for pixels across multiple images. Correspondence-based optimization is the cornerstone for 3D scene reconstruction but is less studied in 3D object detection, for that moving objects violate multi-view geometry constraints and are treated as outliers during scene reconstruction. We resolve this issue by treating objects as first-class citizens during correspondence-based optimization. In this work, we propose BA-Det, an end-to-end optimizable object detector with object-centric temporal correspondence learning and object-centric featuremetric bundle adjustment. Empirically, we verify the effectiveness and efficiency of BA-Det for multiple baseline 3D detectors under various setups. Our BA-Det achieves SOTA performance on the large-scale Waymo Open Dataset (WOD) with only marginal computation cost. Codes will be released soon.", Edge Wasserstein Distance Loss for Oriented Object Detection,https://openreview.net/forum?id=NeH20Y8mDvp,https://openreview.net/pdf?id=NeH20Y8mDvp,This paper proposes a novel orinted object regression loss,"Regression loss design is an essential topic for oriented object detection. 
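As background for the fully quantized training (FQT) entry above, the standard unbiased gradient quantizer uses stochastic rounding; a per-tensor max-abs scale is one common convention (the paper's analysis concerns what happens once such quantizers are injected into the backward pass).

```python
import torch

def quantize_grad_stochastic(g: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Unbiased per-tensor gradient quantizer with stochastic rounding:
    E[quantize(g)] = g, since each value rounds up with probability equal
    to its fractional part."""
    levels = 2 ** (bits - 1) - 1
    scale = g.abs().max().clamp_min(1e-12) / levels
    x = g / scale
    low = x.floor()
    x_q = low + (torch.rand_like(x) < (x - low)).float()  # round up w.p. frac(x)
    return x_q * scale
```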
Due to the periodicity of the angle and the ambiguity of the width and height definition, traditional L1-distance losses and their variants suffer from the metric discontinuity and the square-like problem. As a solution, distribution-based methods show a significant advantage by representing oriented boxes as distributions. Differing from methods that exploit the Gaussian distribution to obtain an analytical form of the distance measure, we propose a novel oriented regression loss, the Edge Wasserstein Distance (EWD) loss, to alleviate the square-like problem. Specifically, for the oriented box representation, we choose a specially-designed distribution whose probability density function is only nonzero over the edges. On this basis, we develop the Wasserstein distance as the measure. Besides, based on the edge representation of oriented boxes, the EWD loss can be generalized to quadrilateral and polygonal regression scenarios. Experiments on multiple popular datasets and different detectors show the effectiveness of the proposed method.","oriented object detection, regression loss design" StyleGenes: Discrete and Efficient Latent Distributions for GANs,https://openreview.net/forum?id=eyyS-zovT9m,https://openreview.net/pdf?id=eyyS-zovT9m,,"We propose a discrete and efficient latent distribution for Generative Adversarial Networks (GANs). Instead of drawing latent vectors from a continuous prior, we sample from a finite set of learnable latents. However, a direct parametrization of such a distribution leads to an intractable linear increase in memory in order to ensure sufficient sample diversity. We address this key issue by taking inspiration from the encoding of information in biological organisms. Instead of learning a separate latent vector for each sample, we split the latent space into a set of genes. For each gene, we train a small bank of gene variants. Thus, by independently sampling a variant for each gene and combining them into the final latent vector, our approach can represent a vast number of unique latent samples from a compact set of learnable parameters. Interestingly, our gene-inspired latent encoding allows for new and intuitive approaches to manipulation, latent-space exploration, and classifier-based conditional sampling, while preserving state-of-the-art photo-realism.","generative adversarial networks, discrete sampling, unconditional training, conditional generation" Scratching Visual Transformer's Back with Uniform Attention,https://openreview.net/forum?id=rrjOLTU1jkw,https://openreview.net/pdf?id=rrjOLTU1jkw,Vision Transformers may need yet more dense interactions. We tried to supply them with a simple trick. We get improvements.,"The favorable performance of Vision Transformers (ViTs) is often attributed to the multi-head self-attention ($\mathtt{MSA}$). The $\mathtt{MSA}$ enables global interactions at each layer of a ViT model, which is a contrasting feature against Convolutional Neural Networks (CNNs) that gradually increase the range of interaction across multiple layers. We study the role of the density of the attention. Our preliminary analyses suggest that the spatial interactions of attention maps are close to dense interactions rather than sparse ones. This is a curious phenomenon, as dense attention maps are harder for the model to learn due to steeper softmax gradients around them. We interpret this as a strong preference for ViT models to include dense interaction. We thus manually insert the uniform attention into each layer of ViT models to supply the much-needed dense interactions. We call this method Context Broadcasting, $\mathtt{CB}$.
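The gene-bank sampling in the StyleGenes entry above can be sketched as follows: with G genes and V variants each, V·G small embeddings index V^G distinct latent vectors. Sizes here are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class GeneLatent(nn.Module):
    """Sketch of a gene-style latent: the latent vector is a concatenation of
    independently sampled 'gene variants' drawn from learnable banks."""
    def __init__(self, n_genes: int = 16, n_variants: int = 32, gene_dim: int = 8):
        super().__init__()
        # One bank of variant embeddings per gene: (G, V, D) parameters in total.
        self.banks = nn.Parameter(torch.randn(n_genes, n_variants, gene_dim))

    def forward(self, batch: int) -> torch.Tensor:
        G, V, D = self.banks.shape
        idx = torch.randint(V, (batch, G))            # one variant index per gene
        z = self.banks[torch.arange(G), idx]          # gather -> (batch, G, D)
        return z.reshape(batch, G * D)                # final latent vector

# Usage: z = GeneLatent()(batch=4)  # four latents out of 32**16 possibilities
```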
We observe that the inclusion of $\mathtt{CB}$ reduces the degree of density in the original attention maps and increases both the capacity and generalizability of the ViT models. $\mathtt{CB}$ incurs negligible costs: 1 line in your model code, no additional parameters, and minimal extra operations. ","Vision Transformer, Self-attention, Attention, Dense Interactions, Image Classification" Exploring Active 3D Object Detection from a Generalization Perspective,https://openreview.net/forum?id=2RwXVje1rAh,https://openreview.net/pdf?id=2RwXVje1rAh,,"To alleviate the high annotation cost in LiDAR-based 3D object detection, active learning is a promising solution that learns to select only a small portion of unlabeled data to annotate, without compromising model performance. Our empirical study, however, suggests that mainstream uncertainty-based and diversity-based active learning policies are not effective when applied in the 3D detection task, as they fail to balance the trade-off between point cloud informativeness and box-level annotation costs. To overcome this limitation, we jointly investigate three novel criteria in our framework CRB for point cloud acquisition - label conciseness, feature representativeness and geometric balance, which hierarchically filters out the point clouds of redundant 3D bounding box labels, latent features and geometric characteristics (e.g., point cloud density) from the unlabeled sample pool and greedily selects informative ones with fewer objects to annotate. Our theoretical analysis demonstrates that the proposed criteria align the marginal distributions of the selected subset and the prior distributions of the unseen test set, and minimize the upper bound of the generalization error. To validate the effectiveness and applicability of CRB, we conduct extensive experiments on the two benchmark 3D object detection datasets of KITTI and Waymo and examine both one-stage (i.e., SECOND) and two-stage (i.e., PV-RCNN) 3D detectors. Experiments show that the proposed approach outperforms existing active learning strategies and achieves fully supervised performance requiring $1\%$ and $8\%$ annotations of bounding boxes and point clouds, respectively. ","Active Learning, 3D Object Detection, Lidar Point Clouds" A Unified Pretraining Framework for Human Motion Analysis,https://openreview.net/forum?id=dBbdV4CTEQf,https://openreview.net/pdf?id=dBbdV4CTEQf,,"We present a unified pretraining framework to tackle different sub-tasks of human motion analysis, including 3D pose estimation, skeleton-based action recognition, and mesh recovery. The proposed framework is capable of utilizing all kinds of human motion data resources, including motion capture data and in-the-wild videos. During pretraining, the pretext task requires the motion encoder to recover the underlying 3D motion from noisy partial 2D observations. The pretrained motion representation thus acquires geometric, kinematic, and physical knowledge about human motion and therefore can be easily transferred to multiple downstream tasks. We implement the motion encoder with a novel Dual-stream Spatio-temporal Transformer (DSTformer) neural network. It captures long-range spatio-temporal relationships among the skeletal joints comprehensively and adaptively, exemplified by the lowest 3D pose estimation error so far when trained from scratch. 
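A hypothetical sketch of the 2D-to-3D recovery pretext just described, assuming an orthographic projection, additive noise, and random joint masking (tensor shapes and the `encoder` interface are assumptions):

```python
import torch

def pretext_loss(encoder, pose3d, mask_prob=0.3, noise_std=0.02):
    # pose3d: (batch, frames, joints, 3) ground-truth motion, e.g., from mocap.
    pose2d = pose3d[..., :2] + noise_std * torch.randn_like(pose3d[..., :2])
    mask = torch.rand(pose2d.shape[:-1], device=pose2d.device) < mask_prob
    pose2d = pose2d.masked_fill(mask.unsqueeze(-1), 0.0)  # noisy partial 2D observation
    pred = encoder(pose2d)                                # (batch, frames, joints, 3)
    return ((pred - pose3d) ** 2).mean()                  # recover the 3D motion
```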
More importantly, the proposed framework achieves state-of-the-art performance on all three downstream tasks by simply finetuning the pretrained motion encoder with 1-2 linear layers, which demonstrates the versatility of the learned motion representations. Code and pretrained models will be publicly available.", Masked Frequency Modeling for Self-Supervised Visual Pre-Training,https://openreview.net/forum?id=9-umxtNPx5E,https://openreview.net/pdf?id=9-umxtNPx5E,,"We present Masked Frequency Modeling (MFM), a unified frequency-domain-based approach for self-supervised pre-training of visual models. Instead of randomly inserting mask tokens into the input embeddings in the spatial domain, in this paper, we shift the perspective to the frequency domain. Specifically, MFM first masks out a portion of frequency components of the input image and then predicts the missing frequencies on the frequency spectrum. Our key insight is that predicting masked components in the frequency domain is better suited to revealing underlying image patterns than predicting masked patches in the spatial domain, due to the heavy spatial redundancy. Our findings suggest that with the right configuration of the mask-and-predict strategy, both the structural information within high-frequency components and the low-level statistics among low-frequency counterparts are useful in learning good representations. For the first time, MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations even using none of the following: (i) extra data, (ii) extra model, (iii) mask token. Experimental results on image classification and semantic segmentation, as well as several robustness benchmarks, show the competitive performance and advanced robustness of MFM compared with recent masked image modeling approaches. Furthermore, we also comprehensively investigate the effectiveness of classical image restoration tasks for representation learning from a unified frequency perspective and reveal their intriguing relations with our MFM approach.","unsupervised learning, self-supervised learning, representation learning, masked frequency modeling" Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning,https://openreview.net/forum?id=DHyHRBwJUTN,https://openreview.net/pdf?id=DHyHRBwJUTN,"We present a new tabular math word problem dataset, TabMWP, and we propose a novel approach to it that learns to select in-context examples in few-shot GPT-3 via policy gradient. ","Mathematical reasoning, a core ability of human intelligence, presents unique challenges for machines in abstract thinking and logical reasoning. Recent large pre-trained language models such as GPT-3 have achieved remarkable progress on mathematical reasoning tasks written in text form, such as math word problems (MWP). However, it is unknown if the models can handle more complex problems that involve math reasoning over heterogeneous information, such as tabular data. To fill the gap, we present Tabular Math Word Problems (TabMWP), a new dataset containing 38,431 open-domain grade-level problems that require mathematical reasoning on both textual and tabular data. Each question in TabMWP is aligned with a tabular context, which is presented as an image, semi-structured text, and a structured table. There are two types of questions: free-text and multi-choice, and each problem is annotated with gold solutions to reveal the multi-step reasoning process. 
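Few-shot evaluation of the kind reported next typically concatenates in-context examples with the test question into a single prompt; a hypothetical builder (field names are assumptions, not the dataset's exact schema):

```python
def build_prompt(examples, test_problem):
    """Assemble a few-shot prompt from (table, question, solution) examples."""
    parts = []
    for ex in examples:
        parts.append(f"Table:\n{ex['table']}\nQuestion: {ex['question']}\n"
                     f"Solution: {ex['solution']}\n")
    # The test problem ends with an open 'Solution:' for the model to complete.
    parts.append(f"Table:\n{test_problem['table']}\n"
                 f"Question: {test_problem['question']}\nSolution:")
    return "\n".join(parts)
```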
We evaluate different pre-trained models on TabMWP, including the GPT-3 model in a few-shot setting. As earlier studies suggest, since few-shot GPT-3 relies on the selection of in-context examples, its performance is unstable and can degrade to near chance. This instability is more severe when handling complex problems like TabMWP. To mitigate this, we further propose a novel approach, PromptPG, which utilizes policy gradient to learn to select in-context examples from a small amount of training data and then constructs the corresponding prompt for the test example. Experimental results show that our method outperforms the best baseline by 5.31% on the accuracy metric and reduces the prediction variance significantly compared to random selection, which verifies its effectiveness in the selection of in-context examples.","Mathematical Reasoning, Tabular Math Word Problems, Prompt Learning, Policy Gradient" Lottery Aware Sparsity Hunting: Enabling Federated Learning on Resource-Limited Edge,https://openreview.net/forum?id=qhplAU1BOZW,https://openreview.net/pdf?id=qhplAU1BOZW,We present methodologies for sparse federated learning for the resource-constrained edge (both homogeneous and heterogeneous compute budgets).,"Limited computation and communication capabilities of clients pose significant challenges in federated learning (FL) over resource-limited edge nodes. A potential solution to this problem is to deploy off-the-shelf sparse learning algorithms that train a binary sparse mask on each client with the expectation of training a consistent sparse server mask yielding sparse weight tensors. However, as we investigate in this paper, such naive deployments result in a significant drop in accuracy compared to FL with dense models, especially for clients with limited resource budgets. In particular, our investigations reveal a serious lack of consensus among the trained sparsity masks on clients, which prevents convergence of the server mask and potentially leads to a substantial drop in model performance. Based on these key observations, we propose federated lottery aware sparsity hunting (FLASH), a unified sparse learning framework to make the server win a lottery in terms of yielding a sparse sub-model, able to maintain classification performance under highly resource-limited client settings. Moreover, to support FL on different devices requiring different parameter densities, we leverage our findings to present hetero-FLASH, where clients can have different target sparsity budgets based on their device resource limits. Experimental evaluations with multiple models on various datasets (both IID and non-IID) show the superiority of our models in closing the gap with the unpruned baseline while yielding up to ∼10.1% improved accuracy at ∼10.26x lower communication cost, compared to existing alternatives, at similar hyperparameter settings.","Sparse federated learning (FL), communication efficient FL, computation efficient FL" Self-Organizing Pathway Expansion for Non-Exemplar Incremental Learning,https://openreview.net/forum?id=lqgkka9jzzK,https://openreview.net/pdf?id=lqgkka9jzzK,,"Non-exemplar class-incremental learning aims to recognize both the old and new classes without access to old class samples. The conflict between old and new class optimization is exacerbated since the shared neural pathways can only be differentiated by the incremental samples. To address this problem, we propose a novel self-organizing pathway expansion scheme. 
Our scheme consists of a class-specific pathway organization strategy that decouples the optimization pathways of different classes to enhance the independence of the feature representation, and a pathway-guided feature optimization mechanism to mitigate the update interference between the old and new classes. Extensive experiments on four datasets demonstrate superior incremental performance, outperforming the state-of-the-art methods by margins of 1%, 3%, 2% and 2%, respectively.",Incremental Learning Corruption Depth: Analysis of DNN depth for Misclassification,https://openreview.net/forum?id=Xj1orI5p6Sv,https://openreview.net/pdf?id=Xj1orI5p6Sv,Identify the layers that lead to robust image classification,"Many large and complex deep neural networks have been shown to provide higher accuracy. However, very little is known about the relationship between the complexity of the input data, the type of noise, and the depth needed for correct classification. Existing studies do not address the issue of common corruption adequately, especially in understanding what impact these corruptions leave on the individual parts of a deep neural network. Therefore, we can safely assume that the classification (or misclassification) might be happening at particular layer(s) of a network that accumulate to draw a final correct or incorrect prediction. In this paper, we introduce a novel concept called {\bf corruption depth}, which identifies the network layer/depth up to which the misclassification persists. We assert that the identification of such layers will help in better design of the network by pruning certain layers, in comparison to purifying the entire network, which is computationally heavy. Through our extensive experiments, we present a coherent study that complements the existing, diverse studies of how examples are processed through the network. Our approach also illustrates different philosophies of example memorization and a one-dimensional view of sample or query difficulty. We believe that the understanding of the corruption depth can \textbf{open a new dimension of model explainability}, where in place of just visualizing the attention map, the classification progress can be seen throughout the network.","Classification Depth, Deep Representation, Robust Recognition" MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning,https://openreview.net/forum?id=4Tx2-AH-jG_,https://openreview.net/pdf?id=4Tx2-AH-jG_,,"In this study, we propose Mixed and Masked Image Modeling (MixMIM), a simple but efficient MIM method that is applicable to various hierarchical Vision Transformers. Existing MIM methods replace a random subset of input tokens with a special $\mathrm{[MASK]}$ symbol and aim at reconstructing original image tokens from the corrupted image. However, we find that using the $\mathrm{[MASK]}$ symbol greatly slows down the training and causes training-finetuning inconsistency, due to the large masking ratio (e.g., 60% in SimMIM). In contrast, we replace the masked tokens of one image with visible tokens of another image, i.e., creating a mixed image. We then conduct dual reconstruction to reconstruct the original two images from the mixed input, which significantly improves efficiency. While MixMIM can be applied to various architectures, this paper explores a simpler but stronger hierarchical Transformer and scales it up as MixMIM-B, -L, and -H. 
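The mixing step admits a compact sketch: a shared binary mask takes each token from one image or the other, so the two masked images are complementary (shapes and names are assumptions):

```python
import torch

def mix_tokens(tokens_a, tokens_b, mask_ratio=0.5):
    # tokens_*: (batch, num_tokens, dim). keep==True takes the token from
    # image A; keep==False takes it from image B, giving a mixed image.
    b, n, _ = tokens_a.shape
    keep = torch.rand(b, n, 1, device=tokens_a.device) < mask_ratio
    return torch.where(keep, tokens_a, tokens_b), keep

# Dual reconstruction (conceptual): the decoder predicts both originals from
# the mixed input; image A is supervised at positions where its tokens were
# replaced (~keep), and image B at positions where its tokens were (keep).
```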
Empirical results demonstrate that MixMIM can learn high-quality visual representations efficiently. Notably, MixMIM-B with 88M parameters achieves 85.1% top-1 accuracy on ImageNet-1K by pretraining for 600 epochs. Besides, its transfer performance on 6 other datasets shows that MixMIM has a better FLOPs/performance trade-off than previous MIM methods.","self-supervised learning, masked image modeling" Neuro-Symbolic Procedural Planning with Commonsense Prompting,https://openreview.net/forum?id=iOc57X9KM54,https://openreview.net/pdf?id=iOc57X9KM54,We propose a neuro-symbolic procedural planner that elicits procedural planning knowledge from the large language models with commonsense-infused prompting. We achieve state-of-the-art performance on WikiHow and RobotHow.,"Procedural planning aims to implement complex high-level goals by decomposition into sequential simpler low-level steps. Although procedural planning is a basic skill set for humans in daily life, it remains a challenge for large language models (LLMs) that lack a deep understanding of the cause-effect relations in procedures. Previous methods require manual exemplars to acquire procedural planning knowledge from LLMs in the zero-shot setting. However, such elicited pre-trained knowledge in LLMs induces spurious correlations between goals and steps, which impair the model generalization to unseen tasks. In contrast, this paper proposes a neuro-symbolic procedural PLANner (PLAN) that elicits procedural planning knowledge from the LLMs with commonsense-infused prompting. To mitigate spurious goal-step correlations, we use symbolic program executors on the latent procedural representations to formalize prompts from commonsense knowledge bases as a causal intervention toward the Structural Causal Model. Both automatic and human evaluations on WikiHow and RobotHow show the superiority of PLAN on procedural planning without further training or manual exemplars.","Procedural Planning, Commonsense Knowledge, Prompting, Neuro-Symbolic" ZERO: A Large-scale Chinese Cross-modal Benchmark with a New Vision-Language Framework,https://openreview.net/forum?id=fuxn3HyIZjU,https://openreview.net/pdf?id=fuxn3HyIZjU,,"Vision-language pre-training (VLP) on large-scale datasets has shown premier performance on various downstream tasks. In contrast to plenty of available benchmarks with English corpus, large-scale pre-training and downstream datasets with Chinese corpus remain largely unexplored. In this paper, we build a large-scale Chinese cross-modal benchmark named ZERO, after our database that is made publicly available for the research community to build VLP models. We release a pre-training dataset and five fine-tuning datasets for downstream tasks, and also develop a pre-training framework of pre-Ranking + Ranking with target-guided Distillation and feature-guided Distillation (R2D2) for cross-modal learning. Specifically, a global contrastive pre-ranking is introduced to learn the individual representations of images and texts. We then fuse the representations in a fine-grained ranking manner via an image-text cross encoder and a text-image cross encoder. To further enhance the capability of our method, a two-way distillation strategy is used with target-guided distillation and feature-guided distillation. 
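The global contrastive pre-ranking described above is commonly instantiated as a symmetric InfoNCE objective; a minimal sketch (the exact R2D2 formulation may differ, e.g., in queues or momentum encoders):

```python
import torch
import torch.nn.functional as F

def pre_ranking_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (batch, dim); matched pairs share the same row index.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric: match each image to its text, and each text to its image.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```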
We achieve state-of-the-art performance on eleven downstream datasets from four broad categories of tasks including image-text retrieval, image-text matching, image captioning, and text-to-image generation.", Learning Object-Language Alignments for Open-Vocabulary Object Detection,https://openreview.net/forum?id=mjHlitXvReu,https://openreview.net/pdf?id=mjHlitXvReu,,"Existing object detection methods are bounded to a fixed-set vocabulary by costly labeled data. When dealing with novel categories, the model has to be retrained with more bounding box annotations. Natural language supervision is an attractive alternative for its annotation-free attributes and broader object concepts. However, learning open-vocabulary object detection from language is challenging since image-text pairs do not contain fine-grained object-language alignments. Previous solutions rely on either expensive grounding annotations or distilling classification-oriented vision models. In this paper, we propose a novel open-vocabulary object detection framework that directly learns from image-text pair data. We formulate object-language alignment as a set matching problem between a set of image region features and a set of word embeddings. It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way. Extensive experiments on two benchmark datasets, COCO and LVIS, demonstrate our superior performance over competing approaches on novel categories, e.g. achieving 32.0% mAP on COCO and 21.7% mask mAP on LVIS. Code will be released.", Phase transition for detecting a small community in a large network,https://openreview.net/forum?id=iN3Lh-Vy2TH,https://openreview.net/pdf?id=iN3Lh-Vy2TH,Signed-quadrilateral is optimal among computationally efficient tests for detecting a small community in the degree-corrected stochastic block model.,"How to detect a small community in a large network is an interesting problem, including clique detection as a special case, where a naive degree-based $\chi^2$-test was shown to be powerful in the presence of an Erdös-Renyi (ER) background. Using Sinkhorn's theorem, we show that the signal captured by the $\chi^2$-test may be a modeling artifact, and it may disappear once we replace the Erdös-Renyi model by a broader network model. We show that the recent SgnQ test is more appropriate for such a setting. The test is optimal in detecting communities with sizes comparable to the whole network, but has never been studied for our setting, which is substantially different and more challenging. Using a degree-corrected block model (DCBM), we establish phase transitions of this testing problem concerning the size of the small community and the edge densities in the small and large communities. When the size of the small community is larger than $\sqrt{n}$, the SgnQ test is optimal as it attains the computational lower bound (CLB), the information lower bound for methods allowing polynomial computation time. When the size of the small community is smaller than $\sqrt{n}$, we establish the parameter regime where the SgnQ test has full power and make some conjectures about the CLB. We also study the classical information lower bound (LB) and show that there is always a gap between the CLB and LB in our range of interest. 
","community detection, degree-correct block model, global testing, planted clique, statistical-computational gap" On the Word Boundaries of Emergent Languages Based on Harris's Articulation Scheme,https://openreview.net/forum?id=b4t9_XASt6G,https://openreview.net/pdf?id=b4t9_XASt6G,This paper investigates whether Harris's articulation scheme (HAS) also holds in emergent languages.,"This paper shows that emergent languages in signaling games lack meaningful word boundaries in terms of Harris's Articulation Scheme (HAS), a universal property of natural language. Emergent Languages are artificial communication protocols arising among agents. However, it is not obvious whether such a simulated language would have the same properties as natural language. In this paper, we test if they satisfy HAS. HAS states that word boundaries can be obtained solely from phonemes in natural language. We adopt HAS-based word segmentation and verify whether emergent languages have meaningful word segments. The experiment suggested they do not have, although they meet some preconditions for HAS. We discovered a gap between emergent and natural languages to be bridged, indicating that the standard signaling game satisfies prerequisites but is still missing some necessary ingredients.","Emergent Communication, Emergent Language, Unsupervised Word Segmentation, Harris's Articulation Scheme, Compositionality" Mine yOur owN Anatomy: Revisiting Medical Image Segmentation with Extremely Limited Labels,https://openreview.net/forum?id=U1T5FpFZ6zZ,https://openreview.net/pdf?id=U1T5FpFZ6zZ,"This paper proposes a semi-supervised contrastive learning framework that seamlessly assembles three effective principles: tailness, consistency, and diversity, which outperforms existing semi-supervised and fully-supervised competitors.","Recent studies on contrastive learning have achieved remarkable performance solely by leveraging few labels in the context of medical image segmentation. Existing methods mainly focus on instance discrimination and invariant mapping (i.e., pulling positive samples closer and negative samples apart in the feature space). However, they face three common pitfalls: (1) tailness: medical image data usually follows an implicit long-tail class distribution. Blindly leveraging all pixels in training hence can lead to the data imbalance issues, and cause deteriorated performance; (2) consistency: it remains unclear whether a segmentation model has learned meaningful and yet consistent anatomical features due to the intra-class variations between different anatomical features; and (3) diversity: the intra-slice correlations within the entire dataset have received significantly less attention. This motivates us to seek a principled approach for strategically making use of the dataset itself to discover similar yet distinct samples from different anatomical views. In this paper, we introduce a novel semi-supervised medical image segmentation framework termed Mine yOur owN Anatomy (MONA), and make three contributions. First, prior work argues that every pixel equally matters to the model training; we observe empirically that this alone is unlikely to define meaningful anatomical features, mainly due to lacking the supervision signal. We show two simple solutions towards learning invariances -- through the use of stronger data augmentations and nearest neighbors. 
Second, we construct a set of objectives that encourage the model to be capable of decomposing medical images into a collection of anatomical features in an unsupervised manner. Lastly, our extensive results on three benchmark datasets with different labeled settings validate the effectiveness of our proposed MONA, which achieves new state-of-the-art performance under different labeled settings. Perhaps most impressively, MONA trained with 10% labeled data -- for the first time -- outperforms the supervised counterpart on all three datasets. MONA makes minimal assumptions on domain expertise, and hence constitutes a practical and versatile solution in medical image analysis. Codes will be made available to the public.","Contrastive Learning, Self-supervised Learning, Semi-supervised Learning, Medical Image Segmentation" Zipper: Decoupling the tradeoff Between Robustness and Accuracy,https://openreview.net/forum?id=Fw516fpXI-c,https://openreview.net/pdf?id=Fw516fpXI-c,We propose a bi-expert framework where we simultaneously train base learners with distribution-aware strategies so that they can obtain both satisfying clean accuracy and robustness,"Deep neural networks obtained by standard training have been constantly plagued by adversarial examples. Although adversarial training demonstrates its capability to defend against adversarial examples, unfortunately, training robust classifiers leads to an inevitable drop in natural generalization. To address this issue, we decouple the standard generalization and the robust generalization from joint training and formulate different training strategies for each one. Specifically, instead of minimizing a global loss on the expectation over these two generalization errors, we propose a bi-expert framework called \emph{Zipper} where we simultaneously train base learners with distribution-aware strategies so that they can specialize in their own fields. The parameters of base learners are collected and combined to form a global learner at intervals during the training process, which is then distributed to base learners as initialized parameters for continued training. Theoretically, we show that the risks of Zipper get lower once the base learners are well trained. Extensive experiments verify that Zipper achieves high clean accuracy in the natural setting while remaining considerably robust in the adversarial setting, compared to relevant techniques. ",Adversarial Training TempCLR: Temporal Alignment Representation with Contrastive Learning,https://openreview.net/forum?id=CIFOsnhZvON,https://openreview.net/pdf?id=CIFOsnhZvON,Global sequence matching under temporal order consistency matters in contrastive-based video-paragraph/text learning.,"Video representation learning has been successful in video-text pre-training for zero-shot transfer, where each sentence is trained to be close to the paired video clips in a common feature space. For long videos, given a paragraph of description where the sentences describe different segments of the video, by matching all sentence-clip pairs, the paragraph and the full video are aligned implicitly. However, such a unit-level similarity measure may ignore the global temporal context over a long time span, which inevitably limits the generalization ability. In this paper, we propose a contrastive learning framework TempCLR to compare the full video and the paragraph explicitly. 
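As the continuation explains, TempCLR's sequence-level distance is the minimum cumulative cost of a temporally ordered alignment, computed with dynamic time warping; a minimal, non-differentiable sketch:

```python
import numpy as np

def dtw_distance(cost):
    # cost[i, j]: distance between sentence i and clip j.
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Monotonic moves preserve temporal order: advance the sentence,
            # the clip, or both.
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]  # minimum cumulative alignment cost
```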
As the video/paragraph is formulated as a sequence of clips/sentences, under the constraint of their temporal order, we use dynamic time warping to compute the minimum cumulative cost over sentence-clip pairs as the sequence-level distance. To explore the temporal dynamics, we break the consistency of temporal order by shuffling the video clips or sentences according to the temporal granularity. In this way, we obtain the representations for clips/sentences, which perceive the temporal information and thus facilitate the sequence alignment. In addition to pre-training on the video and paragraph, our approach can also generalize to the matching between different video instances. We evaluate our approach on video retrieval, action step localization, and few-shot action recognition, and achieve consistent performance gains across all three tasks. Detailed ablation studies are provided to justify the approach design. ","Representation learning, Global Sequence Alignment, Zero/Few-shot Transfer" Generative Augmented Flow Networks,https://openreview.net/forum?id=urF_CBK5XC0,https://openreview.net/pdf?id=urF_CBK5XC0,We propose a novel GFlowNet learning framework to incorporate intermediate rewards represented by intrinsic motivation to improve exploration.,"The Generative Flow Network is a probabilistic framework where an agent learns a stochastic policy for object generation, such that the probability of generating an object is proportional to a given reward function. Its effectiveness has been shown in discovering high-quality and diverse solutions, compared to reward-maximizing reinforcement learning-based methods. Nonetheless, GFlowNets only learn from rewards of the terminal states, which can limit their applicability. Indeed, intermediate rewards play a critical role in learning, for example as intrinsic motivation that provides intermediate feedback even in particularly challenging sparse reward tasks. Inspired by this, we propose Generative Augmented Flow Networks (GAFlowNets), a novel learning framework to incorporate intermediate rewards into GFlowNets. We specify intermediate rewards by intrinsic motivation to tackle the exploration problem in sparse reward environments. GAFlowNets can leverage edge-based and state-based intrinsic rewards in a joint way to improve exploration. Based on extensive experiments on the GridWorld task, we demonstrate the effectiveness and efficiency of GAFlowNet in terms of convergence, performance, and diversity of solutions. We further show that GAFlowNet is scalable to a more complex and large-scale molecule generation domain, where it achieves consistent and significant performance improvement.","Generative Flow Networks (GFlowNets), Exploration" Inferring Fluid Dynamics via Inverse Rendering,https://openreview.net/forum?id=EeEU0b9CPD3,https://openreview.net/pdf?id=EeEU0b9CPD3,,"Humans have a strong intuitive understanding of physical processes such as falling fluids from just a glimpse of a scene picture, quickly derived from our immersive visual experiences in memory. This work achieves such a photo-to-fluid-dynamics reconstruction functionality learned from unannotated videos, without any supervision of ground-truth fluid dynamics. In a nutshell, a differentiable Euler simulator modeled with a ConvNet-based pressure projection solver is integrated with a volumetric renderer, supporting end-to-end/coherent differentiable dynamic simulation and rendering. 
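The renderer described next follows, at its core, standard NeRF-style alpha compositing along each ray; a minimal sketch (tensor shapes are assumptions):

```python
import torch

def render_ray(density, rgb, deltas):
    # density: (n_rays, n_samples) non-negative volume values at sampled
    # points; rgb: (n_rays, n_samples, 3); deltas: (n_rays, n_samples)
    # distances between consecutive samples along each ray.
    alpha = 1.0 - torch.exp(-density * deltas)           # per-sample opacity
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)   # transmittance
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = alpha * trans                               # compositing weights
    return (weights.unsqueeze(-1) * rgb).sum(dim=1)       # rendered color, (n_rays, 3)
```

Because every step is differentiable, the photometric error against a video frame can back-propagate through the renderer into the simulated volume, which is what makes the inverse-rendering supervision possible.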
By endowing each sampled point with a fluid volume value, we derive a NeRF-like differentiable renderer dedicated to fluid data; and thanks to this volume-augmented representation, fluid dynamics can be inversely inferred from the error signal between the rendered result and the ground-truth video frame (i.e., inverse rendering). Experiments on our generated Fluid Fall datasets and the DPI Dam Break dataset are conducted to demonstrate both the effectiveness and generalization ability of our method.", How Does Value Distribution in Distributional Reinforcement Learning Help Optimization?,https://openreview.net/forum?id=pT4ref-FMAX,https://openreview.net/pdf?id=pT4ref-FMAX,We study the optimization advantages of distributional reinforcement learning.,"We consider the problem of learning a set of probability distributions from the Bellman dynamics in distributional reinforcement learning~(RL), which learns the whole return distribution rather than only its expectation as in classical RL. Despite its success in obtaining superior performance, we still have a poor understanding of how the value distribution in distributional RL works. In this study, we analyze the optimization benefits of distributional RL by leveraging the additional value distribution information over classical RL in the Neural Fitted Z-Iteration~(Neural FZI) framework. To begin with, we demonstrate that the distribution loss of distributional RL has desirable smoothness characteristics and hence enjoys stable gradients, which is in line with its tendency to promote optimization stability. Furthermore, the acceleration effect of distributional RL is revealed by decomposing the return distribution. It turns out that distributional RL can perform favorably if the value distribution approximation is appropriate, measured by the variance of gradient estimates in each environment for any specific distributional RL algorithm. Rigorous experiments validate the stable optimization behaviors of distributional RL, contributing to its acceleration effects compared to classical RL. The findings of our research illuminate how the value distribution in distributional RL algorithms helps the optimization.","distributional reinforcement learning, optimization" Bort: Towards Explainable Neural Networks with Bounded Orthogonal Constraint,https://openreview.net/forum?id=My57qBufZWs,https://openreview.net/pdf?id=My57qBufZWs,"We propose an optimizer, Bort, for training explainable neural networks with boundedness and orthogonality constraints.","Deep learning has revolutionized human society, yet the black-box nature of deep neural networks hinders further application to reliability-demanding industries. In the attempt to unpack them, many works observe or manipulate internal variables to improve the model's comprehensibility and transparency. However, existing methods rely on intuitive assumptions and lack mathematical guarantees. To bridge this gap, we introduce Bort, an optimizer for improving model explainability with boundedness and orthogonality constraints on model parameters, derived from the sufficient conditions of model comprehensibility and transparency. We perform reconstruction and backtracking on the model representations optimized by Bort and observe an evident improvement in model explainability. Based on Bort, we are able to synthesize explainable adversarial samples without additional parameters and training. 
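A hypothetical soft-penalty stand-in for Bort's two constraints (the actual optimizer enforces boundedness and orthogonality differently; this is only an illustrative sketch):

```python
import torch

def bounded_orthogonal_penalty(weight, bound=1.0):
    # Encourage (i) orthonormal rows of the (flattened) weight matrix and
    # (ii) bounded parameter magnitudes. Both terms are soft penalties here,
    # whereas Bort derives its update from explicit constraints.
    w = weight.reshape(weight.size(0), -1)
    gram = w @ w.t()
    eye = torch.eye(gram.size(0), device=w.device)
    ortho = ((gram - eye) ** 2).sum()                          # orthogonality
    bounded = torch.clamp(weight.abs() - bound, min=0).pow(2).sum()  # boundedness
    return ortho + bounded
```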
Surprisingly, we find Bort consistently improves the classification accuracy of various architectures including ResNet and DeiT on MNIST, CIFAR-10, and ImageNet.","Neural network, explainable AI, optimizer." Coordinate and Generalize: A Unified Framework for Audio-Visual Zero-Shot Learning,https://openreview.net/forum?id=mDHjdjQl0Ae,https://openreview.net/pdf?id=mDHjdjQl0Ae,,"Audio-Visual Zero-Shot Learning (AV-ZSL) aims to train a model that can classify videos of unseen classes leveraging audio and visual data, which is achieved by transferring knowledge obtained from seen classes. We identify two imperative issues that need to be addressed: (1) \emph{How to effectively exploit both the audio and visual information?} and (2) \emph{How to transfer the knowledge from seen classes to unseen classes?} In this paper, we ameliorate both of the issues in a unified framework by enhancing two ingredients that existing methods seldom consider. (1) \emph{Multi-Modal Coordination}: Different from existing methods that simply fuse the audio and visual features by an attention mechanism, we further perform knowledge distillation between the visual and audio branches. This allows information interaction between the two branches and encourages them to learn from each other. (2) \emph{Generalization Capacity}: Existing methods only consider the alignment between the audio-visual features and semantic features on the seen classes, which ignores the generalization capacity. Inspired by the interpretability methods of Deep Neural Networks (DNNs), we propose a novel gradient-based approach to generate transferable masks for the visual and audio features, enforcing the model to focus on the most discriminative segments and benefiting knowledge transfer from seen to unseen classes. Extensive experiments on three challenging benchmarks, ActivityNet-GZSL, UCF-GZSL, and VGGSound-GZSL, demonstrate that our proposed approach can significantly outperform the state-of-the-art methods.", Interpreting Distributional Reinforcement Learning: A Regularization Perspective,https://openreview.net/forum?id=zAbFj7FpD-C,https://openreview.net/pdf?id=zAbFj7FpD-C,We interpret distributional reinforcement learning from the perspective of regularization.,"Distributional reinforcement learning~(RL) is a class of state-of-the-art algorithms that estimate the entire distribution of the total return rather than its expected value alone. The theoretical advantages of distributional RL over expectation-based RL remain elusive, despite the remarkable performance of distributional RL. Our work attributes the superiority of distributional RL to its regularization effect stemming from the value distribution information beyond its expectation alone. We decompose the value distribution into its expectation and the remaining distribution part using a variant of the gross error model in robust statistics. Hence, distributional RL has an additional benefit over expectation-based RL thanks to the impact of a \textit{risk-sensitive entropy regularization} within the Neural Fitted Z-Iteration framework. Meanwhile, we investigate the role of the resulting regularization in actor-critic algorithms by bridging the risk-sensitive entropy regularization of distributional RL and the vanilla entropy in maximum entropy RL. This reveals that distributional RL induces an augmented reward function, which promotes risk-sensitive exploration against the intrinsic uncertainty of the environment. 
Finally, extensive experiments verify the importance of the regularization effect in distributional RL, as well as the mutual impacts of different entropy regularizations. Our study paves the way towards a better understanding of distributional RL, especially when looked at through a regularization lens.","distributional reinforcement learning, regularization, entropy" The Power of Regularization in Solving Extensive-Form Games,https://openreview.net/forum?id=bPiHuNUNv_R,https://openreview.net/pdf?id=bPiHuNUNv_R,,"In this paper, we investigate the power of {\it regularization}, a common technique in reinforcement learning and optimization, in solving extensive-form games (EFGs). We propose a series of new algorithms based on regularizing the payoff functions of the game, and establish a set of convergence results that strictly improve over the existing ones, with either weaker assumptions or stronger convergence guarantees. In particular, we first show that dilated optimistic mirror descent (DOMD), an efficient variant of OMD for solving EFGs, with adaptive regularization can achieve a fast $\tilde O(1/T)$ last-iterate convergence in terms of duality gap without the uniqueness assumption of the Nash equilibrium (NE). Moreover, regularized dilated optimistic multiplicative weights update (\texttt{Reg-DOMWU}), an instance of \texttt{Reg-DOMD}, further enjoys the $\tilde O(1/T)$ last-iterate convergence rate of the distance to the set of NE. This addresses an open question of whether iterate convergence could be obtained for OMWU-type algorithms with constant stepsizes, without the unique NE assumption in both the EFG and normal-form game literature. Second, we show that regularized counterfactual regret minimization (\texttt{Reg-CFR}), with a variant of optimistic mirror descent algorithm as regret-minimizer, can achieve $O(1/T^{1/4})$ best-iterate, and $O(1/T^{3/4})$ average-iterate convergence rate for finding NE in EFGs. Finally, we show that \texttt{Reg-CFR} can achieve asymptotic last-iterate convergence, and optimal $O(1/T)$ average-iterate convergence rate, for finding the NE of perturbed EFGs, which is useful for finding approximate extensive-form perfect equilibria (EFPE). To the best of our knowledge, they constitute the first last-iterate convergence results for CFR-type algorithms, while matching the state-of-the-art average-iterate convergence rate in finding NE for non-perturbed EFGs. We also provide numerical results to corroborate the advantages of our algorithms.","Game Theory, Optimization" Neural Topic Modeling with Embedding Clustering Regularization,https://openreview.net/forum?id=nZYU28EJ3OS,https://openreview.net/pdf?id=nZYU28EJ3OS,We propose a neural topic model that addresses the topic collapsing issue with a novel clustering regularization on word and topic embeddings.,"Topic models have been prevalent for decades with various applications like automatic text analysis due to their effectiveness and interpretability. However, existing topic models commonly suffer from the notorious topic collapsing issue: the discovered topics semantically collapse towards each other, leading to highly repetitive topics, insufficient topic discovery, and damaged model interpretability. In this paper, we propose a new neural topic model, Embedding Clustering Regularization Topic Model (ECRTM), to solve the topic collapsing issue. 
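The clustering regularization introduced next can be approximated, for intuition, by a k-means-style pull of each topic embedding toward the center of its assigned word-embedding cluster (a simplified stand-in, not ECRTM's exact objective):

```python
import torch

def clustering_regularizer(word_emb, topic_emb):
    # word_emb: (V, d) word embeddings; topic_emb: (K, d) topic embeddings.
    # Assign each word to its nearest topic, then pull each topic toward the
    # mean of its assigned words so topics act as separate cluster centers.
    assign = torch.cdist(word_emb, topic_emb).argmin(dim=1)  # (V,)
    loss = word_emb.new_zeros(())
    for k in range(topic_emb.size(0)):
        members = word_emb[assign == k]
        if members.numel():
            loss = loss + ((topic_emb[k] - members.mean(dim=0)) ** 2).sum()
    return loss
```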
In addition to the reconstruction error of existing work, we propose a novel Embedding Clustering Regularization (ECR), which forces each topic embedding to be the center of a separately aggregated word embedding cluster in the semantic space. Instead of collapsing together, topic embeddings are thus pushed away from each other and cover different semantics of word embeddings. Our ECR hence enables each produced topic to contain distinct word semantics, which alleviates topic collapsing. Through jointly optimizing our ECR objective and the neural topic modeling objective, ECRTM generates diverse and coherent topics together with high-quality topic distributions of documents. Extensive experiments on benchmark datasets demonstrate that ECRTM effectively addresses the topic collapsing issue and consistently surpasses state-of-the-art baselines in terms of topic quality, topic distributions of documents, and downstream classification tasks. ", Distributional Reinforcement Learning via Sinkhorn Iterations,https://openreview.net/forum?id=uHrJ1AY1xR1,https://openreview.net/pdf?id=uHrJ1AY1xR1,We design a new class of distributional RL algorithms based on the Sinkhorn divergence.,"Distributional reinforcement learning~(RL) is a class of state-of-the-art algorithms that estimate the entire distribution of the total return rather than only its expectation. The empirical success of distributional RL is determined by the representation of return distributions and the choice of distribution divergence. In this paper, we propose a new class of \textit{Sinkhorn distributional RL~(SinkhornDRL)} algorithms that learn a finite set of statistics, i.e., deterministic samples, from each return distribution and then use Sinkhorn iterations to evaluate the Sinkhorn distance between the current and target Bellman distributions. The Sinkhorn divergence serves as an interpolation between the Wasserstein distance and Maximum Mean Discrepancy~(MMD). SinkhornDRL finds a sweet spot by taking advantage of the geometry of optimal-transport-based distances and the unbiased gradient estimate property of MMD. Finally, compared to state-of-the-art algorithms, SinkhornDRL's competitive performance is demonstrated on the suite of 55 Atari games.","distributional reinforcement learning, sinkhorn divergence" Contextual Symbolic Policy For Meta-Reinforcement Learning,https://openreview.net/forum?id=WZ2L6D8IHoc,https://openreview.net/pdf?id=WZ2L6D8IHoc,"This paper proposes a gradient-based framework to generate contextual symbolic policies for Meta-Reinforcement Learning, improving generalization ability, efficiency and interpretability.","Context-based Meta-Reinforcement Learning (Meta-RL), which conditions the RL agent on the context variables, is a powerful method for learning a generalizable agent. Current context-based Meta-RL methods often construct their contextual policy with a neural network (NN) and directly take the context variables as a part of the input. However, the NN-based policy contains a tremendous number of parameters, which can result in overfitting, difficulty of deployment, and poor interpretability. To improve the generalization ability, efficiency and interpretability, we propose a novel Contextual Symbolic Policy (CSP) framework, which generates contextual policies with a symbolic form based on the context variables for unseen tasks in meta-RL. 
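To make the notion of a symbolic policy network concrete, here is a hypothetical EQL-style layer built from fixed operator primitives with learnable mixing weights (an illustrative sketch, not CSP's exact design):

```python
import torch
import torch.nn as nn

class SymbolicLayer(nn.Module):
    """One layer of a symbolic network: a linear map feeds fixed primitives."""
    def __init__(self, in_dim, units=4):
        super().__init__()
        # Four pre-activation groups: one for sin, one identity, two for a product.
        self.lin = nn.Linear(in_dim, 4 * units)

    def forward(self, x):
        a, b, c, d = self.lin(x).chunk(4, dim=-1)
        # Composing such primitives across layers yields compact symbolic
        # expressions once small weights are pruned away.
        return torch.cat([torch.sin(a), b, c * d], dim=-1)
```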
Our key insight is that the symbolic expression is capable of capturing complex relationships by composing various operators and has a compact form that helps strip out irrelevant information. Thus, CSP learns to produce symbolic policies for meta-RL tasks and extracts the essential common knowledge to achieve higher generalization ability. Besides, the symbolic policies, with their compact form, are efficient to deploy and easier to understand. In the implementation, we construct CSP as a gradient-based framework to learn the symbolic policy from scratch in an end-to-end and differentiable way. The symbolic policy is represented by a symbolic network composed of various symbolic operators. We also employ a path selector to decide the proper symbolic form of the policy and a parameter generator to produce the coefficients of the symbolic policy. Empirically, we evaluate the proposed CSP method on several Meta-RL tasks and demonstrate that the contextual symbolic policy achieves higher performance and efficiency and shows the potential to be interpretable.","meta learning, reinforcement learning, context variables, symbolic policy" Do We Really Achieve Fairness with Explicit Sensitive Attributes?,https://openreview.net/forum?id=SJO188Y53lk,https://openreview.net/pdf?id=SJO188Y53lk,"We find that different samples leak different amounts of sensitive information and show different levels of violation of demographic parity; we thus propose a new metric and method to address this problem.","Recently, the wide usage of machine learning models for high-stakes decision-making has raised concerns about fairness and discrimination. Existing works found that sensitive information of a sample could be leaked completely by sensitive attributes or partially by non-sensitive attributes; thus, removing the sensitive attributes directly from the original features cannot achieve fairness. The current fairness practice is to leverage the explicit sensitive attributes (i.e., as regularization) to debias the prediction, based on a strong assumption that the non-sensitive attributes of all samples leak the sensitive information totally. However, we investigate the distribution of leaked sensitive information from non-sensitive attributes and make two interesting findings: 1) the leaked sensitive information varies distinctly across samples; 2) the violation of demographic parity for samples prone to leaking sensitive information (high-sensitive) is worse than that for low-sensitive samples, indicating the failure of current demographic parity measurements. To this end, we propose a new group fairness metric ($\alpha$-Demographic Parity) to measure the demographic parity for samples with different levels of sensitive information leakage. Furthermore, we move one step forward and propose to achieve $\alpha$-demographic parity by encouraging the independence between the distribution of the sensitive information in non-sensitive attributes and that of the downstream task prediction, which is formulated as a cross-task knowledge distillation framework. Specifically, the sensitive teacher models the distribution of the sensitive information and the fair student models the distribution of the downstream task prediction. Then we encourage the independence between them by minimizing the Hilbert-Schmidt Independence Criterion. Our model can naturally tackle the limited sensitive attribute scenario since the teacher models can be trained with the subset of samples that have sensitive attributes. 
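The HSIC minimization described above admits a compact empirical estimator; a minimal sketch with Gaussian kernels (the biased estimator and fixed bandwidth are simplifications):

```python
import torch

def hsic(x, y, sigma=1.0):
    # x: (n, dx), y: (n, dy). Biased estimator: trace(K H L H) / (n - 1)^2,
    # where H centers the Gram matrices. Driving this toward zero encourages
    # statistical independence between the two representations.
    def gram(z):
        return torch.exp(-torch.cdist(z, z) ** 2 / (2 * sigma ** 2))
    n = x.size(0)
    h = torch.eye(n, device=x.device) - torch.ones(n, n, device=x.device) / n
    k, l = gram(x), gram(y)
    return torch.trace(k @ h @ l @ h) / (n - 1) ** 2
```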
Extensive experiments show the superior performance of our proposed method on $\alpha$-demographic parity and its strong performance in limited sensitive attribute scenarios.","fairness, debias, demographic parity" MLPInit: Embarrassingly Simple GNN Training Acceleration with MLP Initialization,https://openreview.net/forum?id=P8YIphWNEGO,https://openreview.net/pdf?id=P8YIphWNEGO,"We propose an embarrassingly simple, yet hugely effective initialization for GNN training acceleration, by initializing the GNN with a fully trained MLP.","Training graph neural networks (GNNs) on large graphs is complex and extremely time-consuming. This is attributed to overheads caused by sparse matrix multiplication, which are sidestepped when training multi-layer perceptrons (MLPs) with only node features. MLPs, by ignoring graph context, are simple and faster for graph data; however, they usually sacrifice prediction accuracy, limiting their applications for graph data. We observe that for most message passing-based GNNs, we can trivially derive an analog MLP (we call this a PeerMLP) whose weights can be made identical, making us curious about how GNNs using weights from a fully trained PeerMLP would perform. Surprisingly, we find that GNNs initialized with such weights significantly outperform their PeerMLPs for graph data, motivating us to use PeerMLP training as a precursor initialization step for GNN training. To this end, we propose an embarrassingly simple, yet hugely effective initialization method for GNN training acceleration, called MLPInit. Our extensive experiments on multiple large-scale graph datasets with diverse GNN architectures validate that MLPInit can accelerate the training of GNNs (up to 33× speedup on OGB-products) and often improve prediction performance (e.g., up to 7.97% improvement for GraphSAGE across 7 datasets for node classification, and up to 17.81% improvement across 4 datasets for link prediction on the Hits@10 metric). Most importantly, MLPInit is extremely simple to implement and can be flexibly used as a plug-and-play initialization method for message passing-based GNNs.","Graph Neural Network, Large-scale Graph, Acceleration" SinGRAV: Learning a Generative Radiance Volume from a Single Natural Scene,https://openreview.net/forum?id=wamiG4pzNN1,https://openreview.net/pdf?id=wamiG4pzNN1,,"We present a 3D generative model for general natural scenes. Lacking the necessary volumes of 3D data characterizing the target scene, we propose to learn from a single scene. Our key insight is that a natural scene often contains multiple constituents whose geometry, texture, and spatial arrangements follow some clear patterns, but still exhibit rich variations over different regions within the same scene. This suggests localizing the learning of a generative model on substantial local regions. Hence, we exploit a multi-scale convolutional network, which naturally possesses a spatial locality bias, to learn from the statistics of local regions at multiple scales within a single scene. In contrast to existing methods, our learning setup bypasses the need to collect data from many homogeneous 3D scenes for learning common features. We coin our method SinGRAV, for learning a Generative RAdiance Volume from a Single natural scene. 
We demonstrate the ability of SinGRAV to generate plausible and diverse variations from a single scene, the merits of SinGRAV over state-of-the-art generative neural scene methods, as well as the versatility of SinGRAV through its use in a variety of applications, spanning 3D scene editing, composition, and animation. Code and data will be released to facilitate further research.","Generative model, 3D Single Scene" Progressive Compressed Auto-Encoder for Self-supervised Representation Learning,https://openreview.net/forum?id=8T4qmZbTkW7,https://openreview.net/pdf?id=8T4qmZbTkW7,,"Masked Image Modeling (MIM) methods are driven by recovering all masked patches from visible ones. However, patches from the same image are highly correlated and it is redundant to reconstruct all the masked patches in MIM. This redundancy is neglected by existing methods and causes non-negligible overheads in computation and storage that do not necessarily benefit self-supervised learning. In this paper, we present a novel approach named Progressive Compressed AutoEncoder (PCAE) to address this problem by progressively compacting tokens and retaining only the minimally necessary information for representation. In particular, we propose to mitigate the performance degradation caused by token reduction by exploiting the vision transformer to leak information from discarded tokens to the retained ones. Besides, we also propose a progressive discarding strategy to achieve a better trade-off between performance and efficiency. Identifying redundant tokens plays a key role in redundancy reduction. We resolve this issue using a simple yet effective criterion, i.e., we identify redundant tokens according to their similarity to the mean of the token sequence. Thanks to the flexible strategy, PCAE can be employed for both pre-training and downstream fine-tuning and, consequently, reduces the computing overhead non-trivially throughout the training pipeline. Experiments show that PCAE achieves comparable performance while accelerating throughput by up to 1.9x compared with MAE for self-supervised learning, and accelerating throughput by 15\%-57\% with a performance drop within 0.6\% for downstream classification.","MIM, Transformer, self-supervised learning" ConBaT: Control Barrier Transformer for Safety-Critical Policy Learning,https://openreview.net/forum?id=C49AIKljGaa,https://openreview.net/pdf?id=C49AIKljGaa,,"Large-scale self-supervised models have recently revolutionized our ability to perform a variety of tasks within the vision and language domains. However, using such models for autonomous systems is challenging because of safety requirements: besides executing correct actions, an autonomous agent also needs to avoid high-cost and potentially fatal critical mistakes. Traditionally, self-supervised training mostly focuses on imitating previously observed behaviors, and the training demonstrations carry no notion of which behaviors should be explicitly avoided. In this work, we propose Control Barrier Transformer (ConBaT), an approach that learns safe behaviors from demonstrations in a self-supervised fashion. ConBaT is inspired by the concept of control barrier functions in control theory and uses a causal transformer that learns to predict safe robot actions autoregressively, using a critic that requires minimal safety data labeling. During deployment, we employ a lightweight online optimization to find actions that can ensure future states lie within the safe set. 
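The deployment-time safety check can be sketched as a discrete-time control-barrier-style filter over sampled candidate actions (the decrease condition and all names are assumptions; ConBaT's actual online optimization may differ):

```python
import torch

def filter_actions(candidates, h_next, h_now, alpha=0.1):
    # candidates: (k, action_dim) sampled actions; h_now: scalar barrier value
    # at the current state; h_next: (k,) critic-predicted barrier values after
    # each candidate. A common discrete-time CBF condition keeps actions with
    # h_next >= (1 - alpha) * h_now, i.e., the state stays in the safe set.
    safe = h_next >= (1.0 - alpha) * h_now
    if safe.any():
        # Among safe actions, pick the one with the largest safety margin.
        return candidates[h_next.masked_fill(~safe, float('-inf')).argmax()]
    return candidates[h_next.argmax()]  # fall back to the least unsafe action
```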
We apply our approach to different simulated control tasks and show that our method results in safer control policies compared to other classical and learning-based methods.","Learning from demonstration, Control barrier functions, Transformer models" Robust Transfer Learning Based on Minimax Principle,https://openreview.net/forum?id=8Z6OZ3qKHDD,https://openreview.net/pdf?id=8Z6OZ3qKHDD,,"The similarity between target and source tasks is a crucial quantity for theoretical analyses and algorithm designs in transfer learning studies. However, this quantity is often difficult to capture precisely. To address this issue, we make a boundedness assumption on the task similarity and then propose a mathematical framework based on the minimax principle, which minimizes the worst-case expected population risk under this assumption. Furthermore, our proposed minimax problem can be solved analytically, which provides a guideline for designing robust transfer learning models. According to the analytical expression, we interpret the influences of sample sizes, task distances, and the model dimensionality on knowledge transfer. Then, practical algorithms are developed based on the theoretical results. Finally, experiments conducted on image classification tasks show that our approaches can achieve robust and competitive accuracies under random selections of training sets.","Transfer Learning, Minimax Principle, Robustness" Interpreting Neural Networks Through the Lens of Heat Flow,https://openreview.net/forum?id=h-tOz83WrC,https://openreview.net/pdf?id=h-tOz83WrC,Solve the heat equation to interpret neural networks.,"Machine learning models are often developed in a way that prioritizes task-specific performance but defers the understanding of how they actually work. This is especially true nowadays for deep neural networks. In this paper, we step back and consider the basic problem of understanding a learned model represented as a smooth scalar-valued function. We introduce HeatFlow, a framework based upon the heat diffusion process for interpreting the multi-scale behavior of the model around a test point. At its core, our approach looks into the heat flow initialized at the function of interest, which generates a family of functions with increasing smoothness. By applying differential operators to these smoothed functions, summary statistics (i.e., explanations) characterizing the original model on different scales can be drawn. We place an emphasis on studying the heat flow on the data manifold, where the model is trained and expected to be well behaved. Numerical approximation procedures for implementing the proposed method in practice are discussed and demonstrated on image recognition tasks.","interpretable, explanation, heat, Laplacian, attribution, geometry, flow, PDE" Efficient Surrogate Gradients for Training Spiking Neural Networks,https://openreview.net/forum?id=nsT1vO6i3Ri,https://openreview.net/pdf?id=nsT1vO6i3Ri,"We propose a method to change the shape of surrogate gradients, which can improve the performance of spiking neural networks with low extra overhead.","The Spiking Neural Network (SNN) is widely regarded as one of the next-generation neural network infrastructures, yet it suffers from an inherent non-differentiability problem that makes the traditional backpropagation (BP) method infeasible. Surrogate gradients (SG), which are an approximation to the shape of the Dirac $\delta$-function, can help alleviate this issue to some extent. 
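A standard way to realize a surrogate gradient is a custom autograd function whose backward pass replaces the Dirac delta with a finite-width window; a minimal sketch with a rectangular surrogate (the paper's point, as the continuation explains, is precisely that such a fixed shape may be suboptimal):

```python
import torch

class SpikeFn(torch.autograd.Function):
    """Heaviside spike in the forward pass, rectangular surrogate in backward."""

    @staticmethod
    def forward(ctx, v, width=1.0):
        ctx.save_for_backward(v)
        ctx.width = width
        return (v >= 0).float()  # spike when membrane potential crosses threshold

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        # Rectangular window approximating the Dirac delta: nonzero gradient
        # only near the threshold, with area 1.
        sg = (v.abs() < ctx.width / 2).float() / ctx.width
        return grad_out * sg, None
```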
To our knowledge, however, the majority of research keeps a fixed surrogate gradient for all layers, ignoring the fact that there exists a trade-off between the approximation to the delta function and the effective domain of gradients under the given dataset, hence limiting the efficiency of surrogate gradients and impairing the overall model performance. To guide the shape optimization in applying surrogate gradients for training SNN, we propose an indicator $\chi$, which represents the proportion of parameters with non-zero gradients in backpropagation. Further, we present a novel $\chi$-based training pipeline that adaptively makes trade-offs between the surrogate gradients' shapes and their effective domains, followed by a series of ablation experiments for verification. Our algorithm achieves 69.09\% accuracy on the ImageNet dataset using SEW-ResNet34 - a 2.05\% absolute improvement from baseline. Moreover, our method only incurs extremely low extra cost and can be simply integrated into the existing training procedure.","surrogate gradient, spiking neural network, low extra overhead" DCE: Offline Reinforcement Learning With Double Conservative Estimates,https://openreview.net/forum?id=reEMFxIRMAl,https://openreview.net/pdf?id=reEMFxIRMAl,,"Offline reinforcement learning has attracted much interest as a way of addressing the application challenges of traditional reinforcement learning. Offline reinforcement learning uses previously-collected datasets to train agents without any interaction. To address the overestimation of OOD (out-of-distribution) actions, conservative estimates give a low value for all inputs. Previous conservative estimation methods usually struggle to avoid the impact of OOD actions on Q-value estimates. In addition, these algorithms usually sacrifice some computational efficiency to achieve conservative estimation. In this paper, we propose a simple conservative estimation method, double conservative estimates (DCE), which uses two conservative estimation methods to constrain the policy. Our algorithm introduces a V-function to avoid errors on in-distribution actions while implicitly achieving conservative estimation. In addition, our algorithm uses a controllable penalty term that changes the degree of conservatism during training. We theoretically show how this method influences the estimation of OOD actions and in-distribution actions. Our experiments show how the two conservative estimation methods separately impact the estimation of all state-action pairs. DCE demonstrates state-of-the-art performance on D4RL. ","Offline RL, Conservative estimation" The Trade-off between Universality and Label Efficiency of Representations from Contrastive Learning,https://openreview.net/forum?id=rvsbw2YthH_,https://openreview.net/pdf?id=rvsbw2YthH_,We focus on contrastive learning and systematically study a trade-off between label efficiency and universality both empirically and theoretically.,"Pre-trained representations (a.k.a. foundation models) have recently become a prevalent learning paradigm, where one first pre-trains a representation using large-scale unlabeled data, and then learns simple classifiers on top of the representation using small labeled data from the downstream tasks. There are two key desiderata for the representation: label efficiency (the ability to learn an accurate classifier on top of the representation with a small amount of labeled data) and universality (usefulness across a wide range of downstream tasks).
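The indicator $\chi$ described in the surrogate-gradient abstract above can be illustrated with a minimal sketch. Assuming a rectangular surrogate gradient that is non-zero only near the firing threshold (the surrogate shape, `threshold`, and `width` are assumptions), $\chi$ is the fraction of entries that receive a non-zero gradient.

```python
import torch

def chi_indicator(v: torch.Tensor, threshold: float = 1.0, width: float = 1.0) -> float:
    """Proportion of entries with a non-zero surrogate gradient.

    A minimal sketch of the chi indicator, assuming a rectangular surrogate
    whose gradient is non-zero only when the membrane potential lies within
    `width` of the firing threshold.
    """
    active = (v - threshold).abs() < width  # entries inside the effective domain
    return active.float().mean().item()

# Toy usage: widen the surrogate until enough parameters receive gradient.
v = torch.randn(10_000) * 0.5 + 0.6         # simulated membrane potentials
for width in (0.1, 0.5, 1.0):
    print(width, chi_indicator(v, threshold=1.0, width=width))
```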
In this paper, we focus on one of the most popular instantiations of this paradigm: contrastive learning with linear probing, i.e., learning a linear predictor on the representation pre-trained by contrastive learning. We show that there exists a trade-off between the two desiderata so that one may not be able to achieve both simultaneously. Specifically, we provide analysis using a theoretical data model and show that, while more diverse pre-training data result in more diverse features for different tasks (improving universality), they put less emphasis on task-specific features, giving rise to a larger sample complexity for downstream supervised tasks, and thus worse prediction performance. Guided by this analysis, we propose a contrastive regularization method to improve the trade-off. We validate our analysis and method empirically with systematic experiments using real-world datasets and foundation models.","Contrastive Learning, Self-Supervised Learning, Foundation Model, Complexity" S-NeRF: Neural Radiance Fields for Street Views,https://openreview.net/forum?id=gx2yJS-ENqI,https://openreview.net/pdf?id=gx2yJS-ENqI,,"Neural Radiance Fields (NeRFs) aim to synthesize novel views of objects and scenes, given the object-centric camera views with large overlaps. However, we conjecture that this paradigm does not fit the nature of the street views that are collected by many self-driving cars from the large-scale unbounded scenes. Also, the onboard cameras perceive scenes without much overlap. Thus, existing NeRFs often produce blurs, ""floaters"", and other artifacts in street-view synthesis. In this paper, we propose a new street-view NeRF (S-NeRF) that considers novel view synthesis of both the large-scale background scenes and the foreground moving vehicles jointly. Specifically, we improve the scene parameterization function and the camera poses for learning better neural representations from street views. We also use the noisy and sparse LiDAR points to boost the training and learn a robust geometry and a reprojection-based confidence to address depth outliers. Moreover, we extend our S-NeRF to reconstruct moving vehicles, which is impracticable for conventional NeRFs. Thorough experiments on the large-scale driving datasets (e.g., nuScenes and Waymo) demonstrate that our method beats the state-of-the-art rivals, reducing the mean-squared error by 7~40% in street-view synthesis and achieving a 45% PSNR gain for moving vehicle rendering.", Generalized structure-aware missing view completion network for incomplete multi-view clustering,https://openreview.net/forum?id=OdZcJYT5Z4k,https://openreview.net/pdf?id=OdZcJYT5Z4k,A general incomplete multi-view clustering framework via missing view completion and recurrent graph constraint.,"In recent years, incomplete multi-view clustering has been widely regarded as a challenging problem. The missing views inevitably damage the effective information of the multi-view data itself. To date, existing methods for incomplete multi-view clustering usually bypass invalid views according to prior missing information, which is considered a second-best scheme based on evasion. Other methods that attempt to recover missing information are mostly applicable to specific two-view datasets. To handle these problems, we design a general structure-aware missing view completion network (SMVC) for incomplete multi-view clustering.
Concretely, we build a two-stage autoencoder network with a self-attention structure to synchronously extract high-level semantic representations of multiple views and recover the missing data. In addition, we develop a recurrent graph reconstruction mechanism that cleverly leverages the restored views to promote representation learning and further data reconstruction. Extensive experimental results confirm that SMVC has clear advantages over other top methods.","Incomplete multi-view clustering, Missing view imputation, Representation learning, Deep neural network" EXACT: Compositional Augmentation for Image-level Weakly-Supervised Instance Segmentation,https://openreview.net/forum?id=66kLbXgU_ae,https://openreview.net/pdf?id=66kLbXgU_ae,,"We propose EXACT: EXtract-AugContext-pasTe, a compositional image augmentation pipeline for weakly-supervised instance segmentation using only image-level supervision. The proposed method consists of three main components. The first component generates high-quality foreground object masks. To this end, an EM-like approach is proposed that iteratively refines an initial set of object mask proposals generated by a generic entity segmentation method. Next, in the second component, high-quality context-aware background images are generated using a text-to-image compositional synthesis method like DALL-E. Finally, the third component creates a large-scale pseudo-labeled instance segmentation training dataset by compositing the foreground object masks onto the original and generated background images. The proposed approach achieves state-of-the-art weakly-supervised instance segmentation results on both the PASCAL VOC 2012 and MS COCO datasets by using only image-level, weak label information. In particular, it outperforms the best baseline by +7.4 and +2.8 mAP-0.50 on PASCAL and COCO, respectively. Further, the method provides a new solution to the long-tail weakly-supervised instance segmentation problem (when many classes may only have a few training samples), by selectively augmenting under-represented classes.", Learning Visual Representation with Synthetic Images and Topologically-defined Labels,https://openreview.net/forum?id=TnzdAU7c8WM,https://openreview.net/pdf?id=TnzdAU7c8WM,We propose a new type of pretext task for self-supervised learning with synthetic images and mathematically-defined labels to incentivise learning global topological features of images,"We propose a scheme for neural networks to learn visual representation with synthetic images and mathematically-defined labels that capture topological information. To verify that the model acquires a different visual representation than with the usual supervised learning with manually-defined labels, we show that the models pretrained with our scheme can be finetuned for image classification tasks to achieve an improved convergence compared to those trained from scratch. Convolutional neural networks, built upon iterative local operations, are good at learning local features of the image, such as texture, whereas they tend to pay less attention to larger structures. Our method provides a simple way to encourage the model to learn global features through a specifically designed task based on topology.
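The copy-paste compositing step of the EXACT pipeline above can be sketched with simple mask-based alpha blending. This is a minimal illustration under assumptions: the placement policy, blending, and any rescaling in the actual pipeline are not specified here.

```python
import numpy as np

def composite(foreground: np.ndarray, mask: np.ndarray, background: np.ndarray,
              top: int, left: int) -> np.ndarray:
    """Paste a masked foreground object onto a background image.

    A minimal sketch of the compositing step: wherever mask == 1, the
    foreground pixel replaces the background pixel.
    foreground: (h, w, 3) uint8; mask: (h, w) in {0, 1}; background: (H, W, 3).
    """
    out = background.copy()
    h, w = mask.shape
    region = out[top:top + h, left:left + w]
    m = mask[..., None].astype(out.dtype)
    out[top:top + h, left:left + w] = m * foreground + (1 - m) * region
    return out

bg = np.zeros((256, 256, 3), dtype=np.uint8)           # e.g., a generated background
fg = np.full((64, 64, 3), 255, dtype=np.uint8)          # extracted object crop
m = np.ones((64, 64), dtype=np.uint8)                   # its foreground mask
img = composite(fg, m, bg, top=100, left=80)            # pseudo-labeled training image
```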
Furthermore, our method requires neither real images nor manual labels; hence it sheds light on some topics of recent concern in computer vision, such as the cost and fairness of data collection and annotation.","topology, persistent homology, self-supervised learning, synthetic image" Cycle-consistent Masked AutoEncoder for Unsupervised Domain Generalization,https://openreview.net/forum?id=wC98X1qpDBA,https://openreview.net/pdf?id=wC98X1qpDBA,,"Self-supervised learning methods undergo undesirable performance drops when there exists a significant domain gap between training and testing scenarios. Therefore, unsupervised domain generalization (UDG) is proposed to tackle the problem, which requires the model to be trained on several different domains without supervision and generalize well on unseen test domains. Existing methods either rely on a cross-domain and semantically consistent image pair in contrastive methods or the reconstruction pair in generative methods, while such precious image pairs are not available without semantic labels. In this paper, we propose a cycle cross-domain reconstruction task for unsupervised domain generalization in the absence of paired images. The cycle cross-domain reconstruction task converts a masked image from one domain to another domain and then reconstructs the original image from the converted images. To preserve the divergent domain knowledge of decoders in the cycle reconstruction task, we propose a novel domain-contrastive loss to regularize the domain information in reconstructed images encoded with the desirable domain style. Quantitative results on extensive datasets illustrate that our method improves the state-of-the-art unsupervised domain generalization methods by average $\textbf{+5.59\%}, \textbf{+4.52\%}, \textbf{+4.22\%}, \textbf{+7.02\%}$ on $1\%, 5\%, 10\%, 100\%$ PACS, and $\textbf{+5.08\%}, \textbf{+6.49\%}, \textbf{+1.79\%}, \textbf{+0.53\%}$ on $1\%, 5\%, 10\%, 100\%$ DomainNet, respectively. Code will be released upon acceptance.", CFlowNets: Continuous control with Generative Flow Networks,https://openreview.net/forum?id=yAYHho4fATa,https://openreview.net/pdf?id=yAYHho4fATa,Continuous GFlowNets,"Generative flow networks (GFlowNets), as an emerging technique, can be used as an alternative to reinforcement learning for exploratory control tasks. GFlowNets aim to sample actions with a probability proportional to the reward, similar to sampling different candidates in an active learning fashion. However, existing GFlowNets cannot adapt to continuous control tasks because GFlowNets need to form a DAG and compute the flow matching loss by traversing the inflows and outflows of each node in the trajectory. In this paper, we propose generative continuous flow networks (CFlowNets) that can be applied to continuous control tasks. First, we present the theoretical formulation of CFlowNets. Then, a training framework for CFlowNets is proposed, including the action selection process, the flow approximation algorithm, and the continuous flow matching loss function. Afterward, we theoretically prove the error bound of the flow approximation. The error decreases rapidly as the number of flow samples increases.
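One plausible instantiation of the "topologically-defined labels" from the synthetic-image pretext task above is labeling binary images with their Betti numbers. The following hedged SciPy sketch counts connected components (b0) and holes (b1, counted as interior background components); the paper's actual labeling may differ.

```python
import numpy as np
from scipy import ndimage

def betti_numbers_2d(img: np.ndarray):
    """Topological labels for a binary image, assuming Betti-number labels.

    b0 = number of connected foreground components;
    b1 = number of holes, counted as background components that do not
    touch the image border. An illustrative assumption, not the paper's recipe.
    """
    fg = img > 0
    _, b0 = ndimage.label(fg)
    bg_labels, n_bg = ndimage.label(~fg)
    border = np.unique(np.concatenate([bg_labels[0], bg_labels[-1],
                                       bg_labels[:, 0], bg_labels[:, -1]]))
    b1 = n_bg - len([l for l in border if l != 0])
    return b0, b1

# A synthetic ring: one component, one hole -> label (1, 1).
yy, xx = np.mgrid[:64, :64]
r = np.sqrt((yy - 32) ** 2 + (xx - 32) ** 2)
ring = ((r > 10) & (r < 20)).astype(np.uint8)
print(betti_numbers_2d(ring))  # (1, 1)
```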
Finally, experimental results on continuous control tasks demonstrate the performance advantages of CFlowNets compared to many reinforcement learning methods, especially regarding exploration ability.","Continuous control tasks, Generative flow networks" Global Hardest Example Mining with Prototype-based Triplet Loss,https://openreview.net/forum?id=u4k-Zgqr9SN,https://openreview.net/pdf?id=u4k-Zgqr9SN,,"Hard examples are the performance bottleneck of machine learning models, and therefore efficient identification and correct classification of them can significantly improve the model performance. However, most hard example mining schemes search for hard examples in randomly selected mini-batches in each epoch, which often results in locally hardest examples and thus sub-optimal performance. Besides, the triplet loss is commonly adopted to explore the mined hard examples by pulling the hard positives close to and pushing the negatives away from the anchor. However, when the anchor in a triplet is an outlier at or close to the cluster boundary, the positive example will be pulled away from the centroid of the cluster, which would result in an incompact cluster and thus inferior performance. To address the above challenges, we propose global hardest example mining with a prototype-based triplet loss, which is composed of two major components, namely a Prototype-based Global Hardest Example Miner (GHEM) and a Prototype-based Triplet Loss (pTriplet). First, a global hardest example miner (GHEM) is presented to mine first the hardest classes on the prototype-based nearest neighbor graph of classes, and then the hardest examples by searching for examples at the cluster boundaries. Second, a prototype-based triplet loss (pTriplet) is developed, which replaces the anchor with an anchor-fused prototype to alleviate the influence of the outlier anchor and provides a normal anchor for triplet loss. Extensive experiments on typical Computer Vision (CV) and Natural Language Processing (NLP) tasks, namely person re-identification and few-shot relation extraction, demonstrate the effectiveness and generalizability of the proposed scheme, which consistently outperforms the state-of-the-art models. We will publish all source code of this work on GitHub for further research exploration.", Differentiable Gaussianization Layers for Inverse Problems Regularized by Deep Generative Models,https://openreview.net/forum?id=OXP9Ns0gnIq,https://openreview.net/pdf?id=OXP9Ns0gnIq,,"Deep generative models such as GANs and normalizing flows are powerful regularizers for inverse problems. They exhibit great potential for helping reduce ill-posedness and attain high-quality results. However, the latent tensors of such deep generative models can fall out of the desired high-dimensional standard Gaussian distribution during an inversion process, particularly in the presence of data noise and inaccurate forward models. In such cases, deep generative models are ineffective in attaining high-fidelity solutions. To address this issue, we propose to reparameterize and Gaussianize the latent tensors using novel differentiable data-dependent layers wherein custom operators are defined by solving optimization problems. These proposed layers constrain inverse problems to obtain high-fidelity in-distribution solutions.
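The pTriplet idea described above—replacing a possibly outlying anchor with an anchor-fused prototype—can be sketched as follows. The convex fusion rule and the `alpha` weight are assumptions; the paper may fuse differently.

```python
import torch
import torch.nn.functional as F

def prototype_triplet_loss(anchor, positive, negative, prototype,
                           alpha: float = 0.5, margin: float = 0.3):
    """Triplet loss with an anchor-fused prototype.

    A minimal sketch of pTriplet: the raw anchor is blended with its class
    prototype so that an outlier anchor no longer drags positives toward
    the cluster boundary.
    """
    fused = F.normalize(alpha * prototype + (1 - alpha) * anchor, dim=-1)
    d_pos = (fused - positive).pow(2).sum(-1)   # squared distance to positive
    d_neg = (fused - negative).pow(2).sum(-1)   # squared distance to negative
    return F.relu(d_pos - d_neg + margin).mean()

emb = lambda: F.normalize(torch.randn(32, 128), dim=-1)
loss = prototype_triplet_loss(emb(), emb(), emb(), emb())
```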
We tested and validated our technique on three inversion tasks: compressive-sensing MRI, image deblurring, and eikonal tomography (a nonlinear PDE-constrained inverse problem), using two representative deep generative models: StyleGAN2 and Glow, and achieved state-of-the-art performance in terms of accuracy and consistency.","Deep generative models, inverse problems, Gaussianization" Extreme Masking for Learning Instance and Distributed Visual Representations,https://openreview.net/forum?id=4JRX93ADS2r,https://openreview.net/pdf?id=4JRX93ADS2r,A method that uses extremely large masking as a novel augmentation for learning siamese networks.,"The paper presents a scalable approach for learning distributed representations over individual tokens and a holistic instance representation simultaneously. We use self-attention blocks to represent distributed tokens, followed by cross-attention blocks to aggregate the holistic instance. The core of the approach is the use of extremely large token masking (75\%-90\%) as the data augmentation for supervision. Our model, named ExtreMA, follows the plain BYOL approach where the instance representation from the unmasked subset is trained to predict that from the intact input. Learning requires the model to capture informative variations in an instance, instead of encouraging invariances. The paper makes three contributions: 1) Random masking is a strong and computationally efficient data augmentation for learning generalizable attention representations. 2) With multiple samplings per instance, extreme masking greatly speeds up learning and hungers for more data. 3) Distributed representations can be learned from the instance supervision alone, unlike the per-token supervision in masked modeling.","visual representation learning, self-supervised learning, masked modeling" Quality Matters: Embracing Quality Clues for Robust 3D Multi-Object Tracking,https://openreview.net/forum?id=Qnxcl6zWobO,https://openreview.net/pdf?id=Qnxcl6zWobO,,"3D Multi-Object Tracking (MOT) has achieved tremendous progress thanks to the rapid development of 3D object detection and 2D MOT. Recent advanced works generally employ a series of object attributes, e.g., position, size, velocity, and appearance, to provide clues for association in 3D MOT. However, these cues may not be reliable due to some visual noise, such as occlusion and blur, leading to a tracking performance bottleneck. To reveal the dilemma, we conduct extensive empirical analysis to expose the key bottleneck of each clue and how they correlate with each other. The analysis results motivate us to efficiently absorb the merits among all cues, and adaptively produce an optimal tracking manner. Specifically, we present \textit{Location and Velocity Quality Learning}, which efficiently guides the network to estimate the quality of predicted object attributes. Based on these quality estimations, we propose a quality-aware object association (QOA) strategy to leverage the quality score as an important reference factor for achieving robust association. Despite its simplicity, extensive experiments indicate that the proposed strategy significantly boosts tracking performance by 2.2% AMOTA and our method outperforms all existing state-of-the-art works on nuScenes by a large margin.
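The extreme-masking augmentation at the heart of ExtreMA above admits a very small sketch: keep a random 10%-25% token subset and feed it to the online branch, with several samplings per instance as cheap views. The BYOL-style predictor and momentum target are omitted here, and the sampling details are assumptions.

```python
import torch

def extreme_mask(tokens: torch.Tensor, mask_ratio: float = 0.8) -> torch.Tensor:
    """Keep a small random subset of tokens (75%-90% masked).

    A minimal sketch of the masking augmentation in ExtreMA; the retained
    subset would be encoded and trained to predict the representation of
    the intact input.
    """
    B, N, D = tokens.shape
    n_keep = max(1, int(N * (1 - mask_ratio)))
    idx = torch.rand(B, N).argsort(dim=1)[:, :n_keep]      # random subset per image
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))

x = torch.randn(4, 196, 768)
view = extreme_mask(x, mask_ratio=0.8)        # (4, 39, 768) visible tokens
# Multiple samplings per instance give several cheap views of one image:
views = [extreme_mask(x) for _ in range(4)]
```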
Moreover, QTrack achieves 48.0% and 51.1% AMOTA tracking performance on the nuScenes validation and test sets, which significantly reduces the performance gap between pure camera and LiDAR-based trackers.", MGMA: Mesh Graph Masked Autoencoders for Self-supervised Learning on 3D Shape,https://openreview.net/forum?id=C0oEBO4ZpOj,https://openreview.net/pdf?id=C0oEBO4ZpOj,We introduce a self-supervised learning model to extract face nodes and global graph embeddings on meshes.,"We introduce a self-supervised learning model to extract face nodes and global graph embeddings on meshes. We define a graph masking on a mesh graph composed of faces. We evaluate our model on shape classification and segmentation benchmarks. The results suggest that our model outperforms prior state-of-the-art mesh encoders: In the ModelNet40 classification task, it achieves an accuracy of 89.8% and in the ShapeNet segmentation task, it achieves a mean Intersection-over-Union (mIoU) of 78.5. Further, we explore and explain the correlation between test and training masking ratios in MGMA, and we find that the best performance is obtained when mesh graph masked autoencoders are trained and evaluated under different masking ratios. Our work may open up new opportunities to address label scarcity and improve the learning power in geometric deep learning research.","mesh graph, self-supervised learning, masked autoencoder, attention" DBQ-SSD: Dynamic Ball Query for Efficient 3D Object Detection,https://openreview.net/forum?id=ZccFLU-Yk65,https://openreview.net/pdf?id=ZccFLU-Yk65,,"Many point-based 3D detectors adopt point-feature sampling strategies to drop some points for efficient inference. These strategies are typically based on fixed and handcrafted rules, making it difficult to handle complicated scenes. Different from them, we propose a Dynamic Ball Query (DBQ) network to adaptively select a subset of input points according to the input features, and assign the feature transform with a suitable receptive field for each selected point. It can be embedded into some state-of-the-art 3D detectors and trained in an end-to-end manner, which significantly reduces the computational cost. Extensive experiments demonstrate that our method can reduce latency by 30%-100% on KITTI, Waymo, and ONCE datasets. Specifically, the inference speed of our detector can reach 162 FPS on the KITTI scene, and 30 FPS on the Waymo and ONCE scenes without performance degradation. Because redundant points are skipped, some evaluation metrics show significant improvements.", Exploring Low-Rank Property in Multiple Instance Learning for Whole Slide Image Classification,https://openreview.net/forum?id=01KmhBsEPFO,https://openreview.net/pdf?id=01KmhBsEPFO,draft,"The classification of gigapixel-sized whole slide images (WSIs) with slide-level labels can be formulated as a multiple-instance-learning (MIL) problem. State-of-the-art models often consist of two decoupled parts: local feature embedding with a pre-trained model followed by a global feature aggregation network for classification. We leverage the properties of the apparent similarity in high-resolution WSIs, which essentially exhibit \textit{low-rank} structures in the data manifold, to develop a novel MIL with a boost in both feature embedding and feature aggregation.
We extend contrastive learning with a pathology-specific Low-Rank Constraint (LRC) for feature embedding to pull together samples (i.e., patches) belonging to the same pathological tissue in the low-rank subspace and simultaneously push apart those from different latent subspaces. At the feature aggregation stage, we introduce an iterative low-rank attention MIL (ILRA-MIL) model to aggregate features with low-rank learnable latent vectors to model global interactions among all instances. We highlight the importance of instance correlation modelling but refrain from directly using the transformer encoder considering the $O(n^2)$ complexity. ILRA-MIL with LRC pre-trained features achieves strong empirical results across various benchmarks, including (i) 96.49\% AUC on the CAMELYON16 for binary metastasis classification, (ii) 97.63\% AUC on the TCGA-NSCLC for lung cancer subtyping, and (iii) 0.6562 kappa on the large-scale PANDA dataset for prostate cancer classification. Code will be available.","computational pathology, multiple instance learning, low-rank constraint, self-attention" Evaluating and Inducing Personality in Pre-trained Language Models,https://openreview.net/forum?id=0MqQ88Z2Kta,https://openreview.net/pdf?id=0MqQ88Z2Kta,"We propose the Machine Personality Inventory (MPI) dataset for evaluating the machine personality and devise a Chain Prompting method to induce the language model with a specific personality, capable of producing diversified behaviors.","Originating as a philosophical quest, personality discerns how individuals differ from each other in terms of thinking, feeling, and behaving. Toward building social machines that work with humans on a daily basis, we are motivated to ask: (1) Do existing Large Language Models (LLMs) possess personalities, akin to their human counterparts? (2) If so, how can we evaluate them? (3) Further, given this evaluation framework, how can we induce a certain personality in a fully controllable fashion? To tackle these three questions, we propose the Machine Personality Inventory (MPI) dataset for evaluating the machine personality; MPI follows standardized personality tests, built upon the Big Five Personality Factors (Big Five) theory and personality assessment inventories. By evaluating models with MPI, we provide the first piece of evidence showing the existence of personality in LLMs. We further devise a Chain Prompting method to induce LLMs with a specific personality in a controllable manner, capable of producing diversified behaviors. We hope to shed light on future studies by adopting personality as the essential guide for various downstream tasks, building more human-like and in situ dialogue agents.","machine personality, pre-trained language model, personality trait theory, psychometric inventory, prompt" MLM with Global Co-occurrence,https://openreview.net/forum?id=DswOSXvLfuy,https://openreview.net/pdf?id=DswOSXvLfuy,We present MLM-GC (Masked Language Modeling with Global Co-occurrence) for multilingual tasks.,"When pre-training models with the objective of MLM (masked language modeling) on multilingual corpora, the model learns to refine different language spaces to overlap each other for forming isomorphic spaces by understanding structural similarities from local bidirectional information. Global co-occurrence information is the primary source of information available to all methods, which potentially gives additional structural similarities to the model.
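The low-rank attention idea in ILRA-MIL above—letting thousands of instances interact through a few learnable latent vectors rather than via full self-attention—can be sketched as a pooling/broadcast pair of cross-attentions. Layer sizes, the two-step wiring, and the residual connection are assumptions.

```python
import torch
import torch.nn as nn

class LowRankAttention(nn.Module):
    """Global interactions via r << n learnable latents: O(n*r), not O(n^2).

    A minimal sketch of the ILRA idea: latents first attend to the bag to
    summarize it, then the bag attends back to the latents to read global
    context.
    """
    def __init__(self, dim: int = 512, rank: int = 16, heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(rank, dim) * 0.02)
        self.pool = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.broadcast = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, bag: torch.Tensor) -> torch.Tensor:
        # bag: (B, n, dim) patch features of one whole-slide image.
        lat = self.latents.expand(bag.shape[0], -1, -1)
        lat, _ = self.pool(lat, bag, bag)        # latents summarize the bag
        out, _ = self.broadcast(bag, lat, lat)   # bag reads global context back
        return out + bag

feats = torch.randn(1, 10_000, 512)              # 10k patches, no n^2 blow-up
out = LowRankAttention()(feats)
```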
In this work, we push MLM pre-training further to leverage global co-occurrence information. The result is MLM-GC (MLM with Global Co-occurrence) pre-training, in which the model learns local bidirectional information from masking and global co-occurrence information from a log-bilinear regression. In our experiments, MLM-GC pre-training substantially outperforms MLM pre-training for 4 downstream multilingual/cross-lingual tasks and 1 additional monolingual task, showing the advantages of capturing embedding analogies.","MLM pre-training, Multilingual model, Machine Learning for NLP, Language Modeling" Node Classification Beyond Homophily: Towards a General Solution,https://openreview.net/forum?id=kh3JurmKlux,https://openreview.net/pdf?id=kh3JurmKlux,,"Graph neural networks (GNNs) have become core building blocks behind a myriad of graph learning tasks. The vast majority of the existing GNNs are built upon, either implicitly or explicitly, the homophily assumption, which is not always true and could heavily degrade the performance of learning tasks. In response, GNNs tailored for heterophilic graphs have been developed. However, most of the existing works are designed for specific GNN models to address heterophily, which lacks generality. In this paper, we study the problem from the structure learning perspective and propose a family of general solutions named ALT. It can work hand in hand with most of the existing GNNs to decently handle graphs with either low or high homophily. The core of our method is learning to (1) decompose a given graph into two components, (2) extract complementary graph signals from these two components, and (3) adaptively merge the graph signals for node classification. Moreover, analysis based on graph signal processing shows that our framework can empower a broad range of existing GNNs to have adaptive filter characteristics and further modulate the input graph signals, which is critical for handling complex homophilic/heterophilic patterns. The proposed ALT brings significant and consistent performance improvement in node classification for a wide range of GNNs over a variety of real-world datasets.","node classification, structure learning, homophily, heterophily" Leveraging Hierarchical Structure for Multi-Domain Active Learning with Theoretical Guarantees,https://openreview.net/forum?id=iPhccmh9FyK,https://openreview.net/pdf?id=iPhccmh9FyK,We formalize the general definition of multi-domain active learning and propose Composite Active Learning (CAL) as the first general deep AL method for addressing this problem with theoretical guarantees by leveraging hierarchical structure. ,"Active learning (AL) aims to improve model performance within a fixed labeling budget by choosing the most informative data points to label. Existing AL focuses on the single-domain setting, where all data come from the same domain (e.g., the same dataset). However, many real-world tasks often involve multiple domains. For example, in visual recognition, it is often desirable to train an image classifier that works across different environments (e.g., different backgrounds), where images from each environment constitute one domain. Such a multi-domain AL setting is challenging for prior methods because they (1) ignore the similarity among different domains when assigning labeling budget and (2) fail to handle distribution shift of data across different domains. In this paper, we propose the first general method, dubbed composite active learning (CAL), for multi-domain AL.
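The log-bilinear regression on global co-occurrence described in the MLM-GC abstract above can be illustrated with a GloVe-style weighted least-squares term that would be added to the usual MLM loss. The GloVe-like form and the weighting constants are assumptions; the paper's exact objective may differ.

```python
import torch

def log_bilinear_loss(w_i, w_j, b_i, b_j, cooc, x_max: float = 100.0, alpha: float = 0.75):
    """GloVe-style log-bilinear regression on global co-occurrence counts.

    A minimal sketch, assuming the objective fits dot products of token
    embeddings to log co-occurrence counts with a saturating weight.
    w_i, w_j: (B, D) embeddings of co-occurring tokens; cooc: (B,) counts.
    """
    weight = (cooc / x_max).clamp(max=1.0) ** alpha     # down-weight rare pairs
    pred = (w_i * w_j).sum(-1) + b_i + b_j              # log-bilinear score
    return (weight * (pred - cooc.log()) ** 2).mean()

D = 256
w1 = torch.randn(32, D, requires_grad=True)
w2 = torch.randn(32, D, requires_grad=True)
b1 = torch.zeros(32, requires_grad=True)
b2 = torch.zeros(32, requires_grad=True)
counts = torch.randint(1, 500, (32,)).float()
loss = log_bilinear_loss(w1, w2, b1, b2, counts)        # + the MLM loss in practice
```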
Our approach explicitly considers the hierarchical structure of the problem, i.e., domain-level and instance-level structures. CAL first assigns domain-level budgets according to domain-level importance, which is estimated by optimizing an upper error bound that we develop. With the domain-level budgets, CAL then leverages a certain instance-level query strategy to select samples to label from each domain. Our theoretical analysis shows that our method achieves a better error bound compared to current AL methods. Our empirical results demonstrate that our approach significantly outperforms the state-of-the-art AL methods on both synthetic and real-world multi-domain datasets.","Active Learning, Multi-Domain Learning" Causal Balancing for Domain Generalization,https://openreview.net/forum?id=F91SROvVJ_6,https://openreview.net/pdf?id=F91SROvVJ_6,We propose a balanced mini-batch sampling strategy to reduce spurious correlations for domain generalization.,"While machine learning models rapidly advance the state-of-the-art on various real-world tasks, out-of-domain (OOD) generalization remains a challenging problem given the vulnerability of these models to spurious correlations. We propose a balanced mini-batch sampling strategy to transform a biased data distribution into a spurious-free balanced distribution, based on the invariance of the underlying causal mechanisms for the data generation process. We argue that the Bayes optimal classifiers trained on such a balanced distribution are minimax optimal across a diverse enough environment space. We also provide an identifiability guarantee of the latent variable model of the proposed data generation process, when utilizing enough training environments. Experiments are conducted on DomainBed, demonstrating empirically that our method obtains the best performance across 20 baselines reported on the benchmark.","domain generalization, causality, latent variable model" Elastic Aggregation for Federated Optimization,https://openreview.net/forum?id=EWjYk3R2jhr,https://openreview.net/pdf?id=EWjYk3R2jhr,Elastic aggregation works well with other federated optimizers and achieves significant improvements across the board.," Federated learning enables the privacy-preserving training of neural network models using real-world data across distributed clients. FedAvg has become the preferred optimizer for federated learning because of its simplicity and effectiveness. FedAvg uses naïve aggregation to update the server model, interpolating client models based on the number of instances used in their training. However, naïve aggregation suffers from client-drift when the data is heterogeneous~(non-IID), leading to unstable and slow convergence. In this work, we propose a novel aggregation approach, elastic aggregation, to overcome these issues. Elastic aggregation interpolates client models adaptively according to parameter sensitivity, which is measured by computing how much the overall prediction function output changes when each parameter is changed. This measurement is performed in an unsupervised and online manner. Elastic aggregation reduces the magnitudes of updates to the more sensitive parameters so as to prevent the server model from drifting to any one client distribution, and conversely boosts updates to the less sensitive parameters to better explore different client distributions.
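A hedged sketch of the elastic-aggregation rule described above: average the client deltas, then damp updates to highly sensitive parameters and boost the rest. The exact scaling rule, the `tau` hyperparameter, and the sensitivity computation are assumptions; sensitivity is taken here as a pre-computed per-parameter statistic of how much the prediction function changes.

```python
import torch

def elastic_aggregate(server_w: dict, client_ws: list, sensitivity: dict,
                      tau: float = 0.5) -> dict:
    """Sensitivity-aware interpolation of client models (illustrative only)."""
    agg = {}
    for name, w in server_w.items():
        delta = torch.stack([c[name] - w for c in client_ws]).mean(0)
        s = sensitivity[name]
        # Low sensitivity -> larger scale (boost); high sensitivity -> ~1 (damp).
        scale = 1.0 + tau * (1.0 - s / (s.max() + 1e-12))
        agg[name] = w + scale * delta
    return agg

server = {"fc.weight": torch.zeros(4, 4)}
clients = [{"fc.weight": torch.randn(4, 4)} for _ in range(3)]
sens = {"fc.weight": torch.rand(4, 4)}        # assumed measured online, unsupervised
new_server = elastic_aggregate(server, clients, sens)
```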
Empirical results on real and synthetic data as well as analytical results show that elastic aggregation leads to efficient training in both convex and non-convex settings, while being fully agnostic to client heterogeneity and robust to large numbers of clients, partial participation, and imbalanced data. Finally, elastic aggregation works well with other federated optimizers and achieves significant improvements across the board.","Federated Learning, AI Safety, Autonomous Driving, Drug Discovery, Clinical Diagnosis, Recommender Systems" Decouple Graph Neural Networks: Train Multiple Simple GNNs Simultaneously Instead of One,https://openreview.net/forum?id=UGOpvh4vXS2,https://openreview.net/pdf?id=UGOpvh4vXS2,An efficient model to decouple multi-layer graph neural networks and train them by forward and backward training,"Graph neural networks (GNNs) suffer from severe inefficiency due to the exponential growth of node dependency as the number of layers increases. This severely limits the application of stochastic optimization algorithms, so the training of GNNs is usually time-consuming. To address this problem, we propose to decouple a multi-layer GNN into multiple simple modules for more efficient training, comprising classical forward training (FT) and a designed backward training (BT). Under the proposed framework, each module can be trained efficiently in FT by stochastic algorithms without distortion of graph information owing to its simplicity. To avoid the purely unidirectional information delivery of FT and sufficiently train shallow modules with the deeper ones, we develop a backward training mechanism that makes the former modules perceive the latter modules, inspired by the classical backward propagation algorithm. The backward training introduces the reversed information delivery into the decoupled modules as well as the forward information delivery. To investigate how the decoupling and greedy training affect the representational capacity, we theoretically prove that the error produced by linear modules will not accumulate on unsupervised tasks in most cases. The theoretical and experimental results show that the proposed framework is highly efficient with reasonable performance, which may deserve more investigation. ","Graph neural network, efficient training, backward training" Reinforced Sample Reweighting Policy for Semi-supervised Learning,https://openreview.net/forum?id=8FL8vRvlk59,https://openreview.net/pdf?id=8FL8vRvlk59,,"Semi-supervised learning (SSL) has been shown to be an effective paradigm for learning with less labeled data. To improve the performance of SSL, existing methods build sample reweighting or thresholding strategies to handle the category bias or erroneous pseudo labels. However, most of these existing methods are based on heuristic hand-crafted rules, which require laborious adjustment, and may lead to sub-optimal solutions that cannot improve the model performance to the greatest extent. Here, to the best of our knowledge, we pioneer the development of an automatic strategy that boosts the performance of SSL. We introduce an end-to-end sample reweighting policy for semi-supervised learning, with a delicately designed Markov Decision Process (MDP) framework. The MDP framework is constructed with an agent network, which is optimized in a reward-driven manner, and receives the carefully designed state and action representations for decision reference.
We also design a memory paradigm for computation-efficient representation construction and MDP solving. We further introduce a ""pretraining-boosting"" two-stage MDP curriculum where the agent network is first pretrained and then optimized continuously in the deployment phase to catch up with the constantly updated classification network. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple datasets, outperforming previous advanced approaches such as FixMatch.",Semi-supervised Learning Neural Radiance Fields with Geometric Consistency for Few-Shot Novel View Synthesis,https://openreview.net/forum?id=WoByU5W5te0,https://openreview.net/pdf?id=WoByU5W5te0,,"We present a novel method to regularize neural radiance fields (NeRF) in the few-shot setting with geometry-based consistency regularization. The proposed approach leverages NeRF's rendered depth map to warp source images to unobserved viewpoints and impose them as pseudo ground truths to facilitate learning of detailed features. By encouraging consistency at the feature level instead of using a pixel-level reconstruction loss, we regularize the network solely at the semantic and structural levels, while allowing the view-dependent radiance to freely model color variations. Our application of the proposed consistency term is twofold: between observed and unobserved viewpoints, the image rendered at an unseen view is forced to model after the image warped from the input observation, while between observed viewpoints the warped image undergoes optimization for geometry-specific regularization. We also demonstrate an effective method to filter out erroneous warped solutions, along with relevant techniques to stabilize training during optimization. We show that our model achieves competitive results compared to concurrent few-shot NeRF models.","NeRF, 3D Computer Vision" Towards Addressing Label Skews in One-shot Federated Learning,https://openreview.net/forum?id=rzrqh85f4Sc,https://openreview.net/pdf?id=rzrqh85f4Sc,We propose FedOV to significantly improve the test accuracy under diverse label skews in one-shot federated learning.,"Federated learning (FL) has been a popular research area, where multiple clients collaboratively train a model without sharing their local raw data. Among existing FL solutions, one-shot FL is a promising and challenging direction, where the clients conduct FL training with a single communication round. However, while label skew is a common real-world scenario where some clients may have few or no data of some classes, existing one-shot FL approaches that conduct voting on the local models are not able to produce effective global models. Due to the limited number of classes in each party, the local models misclassify the data from unseen classes into seen classes, which leads to very ineffective global models from voting. To address the label skew issue in one-shot FL, we propose a novel approach named FedOV which generates diverse outliers and introduces them as an additional unknown class in local training to improve the voting performance. Specifically, based on open-set recognition, we propose novel outlier generation approaches by corrupting the original features and further develop adversarial learning to enhance the outliers.
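One way to picture FedOV's outlier generation by feature corruption, described above, is shuffling image patches and adding noise, then labeling the result as the extra "unknown" class. The specific corruptions, the 4x4 patch grid, and the `severity` parameter are assumptions (the paper also uses adversarial learning, omitted here).

```python
import torch

def make_outliers(x: torch.Tensor, severity: float = 0.5) -> torch.Tensor:
    """Corrupt real samples into outliers (illustrative sketch).

    Shuffles a 4x4 grid of image patches and adds Gaussian noise; assumes
    H and W are divisible by 4. x: (B, C, H, W).
    """
    B, C, H, W = x.shape
    ph, pw = H // 4, W // 4
    patches = x.unfold(2, ph, ph).unfold(3, pw, pw)            # (B, C, 4, 4, ph, pw)
    patches = patches.contiguous().view(B, C, 16, ph, pw)
    perm = torch.randperm(16)                                  # shuffle the grid
    shuffled = patches[:, :, perm].view(B, C, 4, 4, ph, pw)
    out = shuffled.permute(0, 1, 2, 4, 3, 5).reshape(B, C, H, W)
    return out + severity * torch.randn_like(out)

x = torch.rand(8, 3, 32, 32)
outliers = make_outliers(x)
unknown_label = 10              # extra class index, e.g., for a 10-class task
```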
Our extensive experiments show that FedOV can significantly improve the test accuracy compared to state-of-the-art approaches in various label skew settings.",federated learning Breaking Correlation Shift via Conditional Invariant Regularizer,https://openreview.net/forum?id=-jTaz3CMk72,https://openreview.net/pdf?id=-jTaz3CMk72,"This paper proposes an algorithm to make the model to generalize on data with spurious correlation, the method can be implemented without information on spurious feature. ","Recently, generalization on out-of-distribution (OOD) data with correlation shift has attracted great attention. The correlation shift is caused by the spurious attributes that correlate to the class label, as the correlation between them may vary in training and test data. For such a problem, we show that, given the class label, models that are conditionally independent of the spurious attributes are OOD generalizable. Based on this, a metric, Conditional Spurious Variation (CSV), which controls the OOD generalization error, is proposed to measure such conditional independence. To improve the OOD generalization, we regularize the training process with the proposed CSV. Under mild assumptions, our training objective can be formulated as a nonconvex-concave mini-max problem. An algorithm with a provable convergence rate is proposed to solve the problem. Extensive empirical results verify our algorithm's efficacy in improving OOD generalization. ","OOD Generalization, Spurious Correlation, Optimization" CROM: Continuous Reduced-Order Modeling of PDEs Using Implicit Neural Representations,https://openreview.net/forum?id=FUORz1tG8Og,https://openreview.net/pdf?id=FUORz1tG8Og,We accelerate PDE solvers via rapid latent space traversal of continuous vector fields leveraging implicit neural representations.,"The long runtime of high-fidelity partial differential equation (PDE) solvers makes them unsuitable for time-critical applications. We propose to accelerate PDE solvers using reduced-order modeling (ROM). Whereas prior ROM approaches reduce the dimensionality of discretized vector fields, our continuous reduced-order modeling (CROM) approach builds a smooth, low-dimensional manifold of the continuous vector fields themselves, not their discretization. We represent this reduced manifold using continuously differentiable neural fields, which may train on any and all available numerical solutions of the continuous system, even when they are obtained using diverse methods or discretizations. We validate our approach on an extensive range of PDEs with training data from voxel grids, meshes, and point clouds. Compared to prior discretization-dependent ROM methods, such as linear subspace proper orthogonal decomposition (POD) and nonlinear manifold neural-network-based autoencoders, CROM features higher accuracy, lower memory consumption, dynamically adaptive resolutions, and applicability to any discretization. For equal latent space dimension, CROM exhibits 79$\times$ and 49$\times$ better accuracy, and 39$\times$ and 132$\times$ smaller memory footprint, than POD and autoencoder methods, respectively.
Experiments demonstrate 109$\times$ and 89$\times$ wall-clock speedups over unreduced models on CPUs and GPUs, respectively.","PDE, implicit neural representation, neural field, latent space traversal, reduced-order modeling, numerical methods" Relaxed Combinatorial Optimization Networks with Self-Supervision: Theoretical and Empirical Notes on the Cardinality-Constrained Case,https://openreview.net/forum?id=h21yJhdzbwz,https://openreview.net/pdf?id=h21yJhdzbwz,"We present a Gumbel-Sinkhorn network for cardinality-constrained combinatorial optimization with theoretical and empirical notes. We surpass Erdos Goes Neural on optimization problems, and present an application on predictive portfolio optimization.","Self-supervised neural networks for combinatorial optimization (CO) handle non-differentiable constraints via relaxation. Despite their superiority in efficiency, one possible limitation is that these methods often put the constraints as soft penalty terms in the learning objective, and the degree of constraint-violation usually cannot be accurately or directly modulated. In this paper, we aim to develop a new paradigm to solve the CO problem by incorporating the constraints into the network architecture and computational operators, which is a more natural learning pipeline and decouples the constraint violation penalty from the raw objective optimization. Since such a paradigm is rather general, with only perturbation-based blackbox differentiable learning methods available as generic solvers in the literature, here we consider the commonly used cardinality constraints, which in fact can incorporate many existing CO problem instances as special cases. Specifically, the cardinality constraints are encoded by a differentiable optimal transport layer. We theoretically characterize the constraint-violations of two variants of our architecture (w.r.t. the existing CO network whose constraint-violation is uncontrolled), and we further show that their empirical performances are in line with our theoretical results. On self-supervised learning of pure CO problems on synthetic and real-world data, our networks surpass the state-of-the-art CO network, and are comparable to Gurobi and can sometimes even surpass it. Our general paradigm also enables the application of end-to-end predictive portfolio optimization on real-world asset price data, improving the Sharpe ratio from 1.1 (a predict-then-optimize paradigm with LSTM+Gurobi) to 2.1.","deep learning, combinatorial optimization, facility location problem, max coverage problem, portfolio optimization" Block and Subword-Scaling Floating-Point (BSFP) : An Efficient Non-Uniform Quantization For Low Precision Inference,https://openreview.net/forum?id=VWm4o4l3V9e,https://openreview.net/pdf?id=VWm4o4l3V9e,,"In this paper, we propose Block and Subword-Scaling Floating-Point (BSFP), a non-uniform quantization scheme for the skewed and non-uniform distribution of weight vectors in neural networks. By quantizing each weight vector as the superposition of multiple subword vectors (in two's complement) with scaling factors (in Low-bit Floating-Point, LBFP), BSFP can effectively fit the distribution of weight vectors while maintaining high computation efficiency. Furthermore, we present a grid search-based MSE-optimal quantization flow and a scaled serial processing engine to complete the quantization pipeline and the infrastructure.
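The differentiable optimal-transport encoding of a cardinality constraint mentioned above can be sketched as a Sinkhorn-based soft top-k: transport n unit masses to two bins of capacity k and n - k, with selection scores as negative costs. The entropic regularizer `eps`, the iteration count, and the cost design are assumptions, not the paper's exact layer.

```python
import torch

def soft_topk(scores: torch.Tensor, k: int, eps: float = 0.1, iters: int = 200):
    """Differentiable 'select exactly k of n' via entropic optimal transport.

    Returns soft selection probabilities that sum to k; gradients flow to
    the scores through the Sinkhorn iterations. Illustrative sketch only.
    """
    n = scores.shape[0]
    # Cost of sending item i to the "selected" bin is -score_i, else 0.
    C = torch.stack([-scores, torch.zeros_like(scores)], dim=1)   # (n, 2)
    log_K = -C / eps
    log_mu = torch.zeros(n)                                        # each item: mass 1
    log_nu = torch.log(torch.tensor([float(k), float(n - k)]))     # bin capacities
    f, g = torch.zeros(n), torch.zeros(2)
    for _ in range(iters):                                         # log-domain Sinkhorn
        f = log_mu - torch.logsumexp(log_K + g.unsqueeze(0), dim=1)
        g = log_nu - torch.logsumexp(log_K + f.unsqueeze(1), dim=0)
    plan = torch.exp(log_K + f.unsqueeze(1) + g.unsqueeze(0))      # (n, 2)
    return plan[:, 0]                                              # soft membership

s = torch.randn(10, requires_grad=True)
p = soft_topk(s, k=3)
print(p.sum())   # ~3.0; smaller eps makes the selection sharper
```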
The experimental results on the ImageNet classification task show that our proposed method outperforms state-of-the-art Microsoft Floating Point (MSFP) by up to 20.56% top-1 accuracy at the same weight precision and reduces model size by up to 10.3%. Furthermore, BSFP outperforms MSFP by up to 2.0$\times$ in computing throughput and up to 5.3$\times$ in energy efficiency under the same silicon area budget.", Exploring The Capacity Mismatch Problem in Knowledge Distillation from the View of Soft Labels,https://openreview.net/forum?id=9IUxnGC8e9u,https://openreview.net/pdf?id=9IUxnGC8e9u,"The main contributions of our work are the discovery, analysis, and validation of the effect of the smoothed soft label and a less time-consuming and adaptive transfer of the teacher's knowledge method.","Knowledge distillation (KD) has been extensively employed to transfer knowledge, via soft labels, from a large teacher model to smaller students, where the parameters of the teacher are fixed (or partially fixed) during training. Recent studies show that this mode may cause difficulties in knowledge transfer due to the mismatched model capacities. To alleviate the mismatch problem, temperature-parameter adjustment, label smoothing, and teacher-student joint training (online distillation) have been proposed to smooth the soft labels of a teacher network. However, those methods rarely explain how smoothed soft labels enhance KD performance. The main contributions of our work are the discovery, analysis, and validation of the effect of the smoothed soft label, and a less time-consuming, adaptive method for transferring the teacher's knowledge, namely PESF-KD, which adaptively tunes the soft labels of the teacher network. Technically, we first mathematically formulate the mismatch as the sharpness gap between teachers' and students' predictive distributions, where we show such a gap can be narrowed with the appropriate smoothness of the soft label. Then, we introduce an adapter module for the teacher and only update the adapter to obtain soft labels with appropriate smoothness. Experiments on various benchmarks show that PESF-KD can significantly reduce the training cost while obtaining competitive results compared to advanced online distillation methods. ","knowledge distillation, parameter-efficiency, transfer learning" Rethinking the Effect of Data Augmentation in Adversarial Contrastive Learning,https://openreview.net/forum?id=0qmwFNJyxCL,https://openreview.net/pdf?id=0qmwFNJyxCL,"We revisit adversarial contrastive training through the lens of data augmentation, and propose an effective adversarial contrastive framework that outperforms vanilla supervised adversarial robustness.","Recent works have shown that self-supervised learning can achieve remarkable robustness when integrated with adversarial training (AT). However, the robustness gap between supervised AT (sup-AT) and self-supervised AT (self-AT) remains significant. Motivated by this observation, we revisit existing self-AT and discover an inherent dilemma that affects self-AT robustness: either strong or weak data augmentations are harmful to self-AT, and a medium strength is insufficient to bridge the gap. To resolve this dilemma, we propose a simple remedy named DynACL (Dynamic Adversarial Contrastive Learning). In particular, we propose an augmentation schedule that gradually anneals from a strong augmentation to a weak one to benefit from both extreme cases.
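DynACL's annealing schedule described above can be pictured with a small torchvision sketch: an augmentation strength that decays from 1.0 (strong) to 0.0 (weak) over training and scales the usual contrastive augmentations. The linear decay and the specific transforms scaled by the strength are assumptions; the paper's exact schedule may differ.

```python
import torchvision.transforms as T

def dynacl_strength(epoch: int, total_epochs: int) -> float:
    """Augmentation strength annealed from strong (1.0) to weak (0.0)."""
    return 1.0 - epoch / max(1, total_epochs - 1)

def make_augmentation(strength: float):
    # Scale typical contrastive-learning augmentations by `strength`.
    s = strength
    return T.Compose([
        T.RandomResizedCrop(32, scale=(1.0 - 0.9 * s, 1.0)),
        T.RandomHorizontalFlip(),
        T.RandomApply([T.ColorJitter(0.4 * s, 0.4 * s, 0.4 * s, 0.1 * s)], p=0.8 * s),
        T.RandomGrayscale(p=0.2 * s),
        T.ToTensor(),
    ])

for epoch in (0, 50, 99):                       # strong -> medium -> weak views
    aug = make_augmentation(dynacl_strength(epoch, 100))
```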
Besides, we adopt a fast post-processing stage for adapting it to downstream tasks. Through extensive experiments, we show that DynACL can improve the state-of-the-art self-AT robustness by 8.84% under Auto-Attack on the CIFAR-10 dataset, and can even outperform vanilla supervised adversarial training. We demonstrate, for the first time, that self-supervised AT can attain even better robustness than supervised AT.","adversarial training, contrastive learning, adversarial contrastive learning" FeatER: An Efficient Network for Human Reconstruction Feature map-based TransformER,https://openreview.net/forum?id=sLPhtvZ9A7f,https://openreview.net/pdf?id=sLPhtvZ9A7f,,"Recently, vision transformers have shown great success in a set of human reconstruction tasks such as 2D human pose estimation (2D HPE), 3D human pose estimation (3D HPE), and human mesh reconstruction (HMR) tasks. In these tasks, feature map representations of the human structural information are often extracted first from the image by a CNN (such as HRNet), and then further processed by a transformer to predict the heatmaps (which encode each joint's location into a feature map with a Gaussian distribution) for HPE or HMR. However, existing transformer architectures are not able to process these feature map inputs directly, forcing an unnatural flattening of the location-sensitive human structural information. Furthermore, much of the performance benefit in recent HPE and HMR methods has come at the cost of ever-increasing computation and memory needs. Therefore, to simultaneously address these problems, we propose FeatER, a novel transformer design which preserves the inherent structure of feature map representations when modeling attention while reducing the memory and computational costs. Taking advantage of FeatER, we build an efficient network for a set of human reconstruction tasks including 2D HPE, 3D HPE, and HMR. A feature map reconstruction module is applied to improve the performance of the estimated human pose and mesh. Extensive experiments demonstrate the effectiveness of FeatER on various human pose and mesh datasets. For instance, FeatER outperforms the SOTA method MeshGraphormer while requiring only 5\% of its Params (total parameters) and 16\% of its MACs (Multiply–Accumulate Operations) on the Human3.6M and 3DPW datasets. Code will be publicly available.","human pose estimation, human mesh recovery, transformer architecture" Pareto Automatic Multi-Task Graph Representation Learning,https://openreview.net/forum?id=p0zTRXkTtB8,https://openreview.net/pdf?id=p0zTRXkTtB8,"From a multi-objective perspective, this paper first tries to automatically search for a general-purpose multi-task graph neural network architecture that matches various user-desired task preferences.","Various excellent graph representation learning models, such as graph neural networks (GNNs), can produce highly task-specific embeddings in an end-to-end manner. Due to the low transferability of learned embeddings and limited representational capabilities of handcrafted models, existing efforts cannot efficiently handle multiple downstream tasks simultaneously, especially in resource-constrained scenarios. This paper first tries to automatically search for multi-task GNN architectures to improve the generalization performance of GNNs through knowledge sharing across tasks.
Because of possible task (objective) conflicts and complex dependencies of architectures and weights, the multi-task GNN architecture search is a bi-level multi-objective optimization problem (BL-MOP) to find a set of Pareto architectures and their Pareto weights, representing different trade-offs across tasks at upper and lower levels (UL and LL), respectively. The Pareto optimality of sub-problems results in each Pareto architecture corresponding to a set of Pareto weights, which is particularly challenging in deep learning with high training costs. For the first time, we propose a simple but effective differentiable multi-task GNN architecture search framework (DMTGAS) with convergence guarantees. By introducing consistent task preferences for UL and LL, DMTGAS only alternately optimizes a single architecture and weights via the gradient-based multi-objective optimizer, which neatly overcomes the above optimization difficulties. Experimental results on several tasks in three real-world graph datasets demonstrate the superiority of the GNNs obtained by our proposal compared with existing handcrafted ones.","Graph Representation Learning, Multi-Objective Optimization, Multi-Task Learning, Neural Architecture Search" Semi-supervised Community Detection via Structural Similarity Metrics,https://openreview.net/forum?id=cxvEGLCHpgl,https://openreview.net/pdf?id=cxvEGLCHpgl,"We propose a fast semi-supervised community detection algorithm AngleMin+ based on the structural similarity metric of DCBM, which is able to address degree heterogeneity and non-assortative network and possesses nice theoretical guarantees.","Motivated by the interests of social network analysis and network-based recommendation systems, we consider a semi-supervised community detection problem, where the goal is to estimate the community label of a new node by leveraging the network structure and partially observed community labels of existing nodes. We model the network with a degree-corrected stochastic block model, which allows for severe degree heterogeneity and potentially non-assortative communities. We propose a fast algorithm that computes a `structural similarity metric' between the new node and each of the $K$ communities, aggregating information in labeled and unlabeled data. The estimated label of the new node is equal to the value of $k$ that maximizes this similarity metric. Our method is computationally fast and compares favorably with existing semi-supervised algorithms in numerical performance. In theory, we derive explicit bounds for the misclassification error and show the efficiency of our method by comparing it with an ideal classifier. To the best of our knowledge, our results provide the first semi-supervised community detection algorithm with theoretical guarantees. ","Semi-supervised, Community Detection, Network, DCBM, Degree Heterogeneity, Non-Assortative" DDM$^2$: Self-Supervised Diffusion MRI Denoising with Generative Diffusion Models,https://openreview.net/forum?id=0vqjc50HfcC,https://openreview.net/pdf?id=0vqjc50HfcC,,"Magnetic resonance imaging (MRI) is a common and life-saving medical imaging technique. However, acquiring high signal-to-noise ratio MRI scans requires long scan times, resulting in increased costs and patient discomfort, and decreased throughput. Thus, there is great interest in denoising MRI scans, especially for the subtype of diffusion MRI scans that are severely SNR-limited.
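The classify-by-similarity step in the semi-supervised community detection abstract above can be pictured with a toy, heavily hedged sketch: compare the new node's connection profile with the average profile of labeled members of each community and pick the best-scoring label. This degree-agnostic cosine score is an illustration only, not the paper's AngleMin+ metric.

```python
import numpy as np

def community_similarity(adj: np.ndarray, new_edges: np.ndarray,
                         labels: np.ndarray, K: int):
    """Score a new node against each of K communities (illustrative only).

    adj: (n, n) adjacency; new_edges: (n,) links of the new node to
    existing nodes; labels: (n,) community ids in {0..K-1}, -1 if unlabeled.
    """
    profile = new_edges / (np.linalg.norm(new_edges) + 1e-12)
    scores = []
    for k in range(K):
        centroid = adj[labels == k].mean(0)           # labeled members only
        centroid = centroid / (np.linalg.norm(centroid) + 1e-12)
        scores.append(float(profile @ centroid))
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(0)
A = (rng.random((100, 100)) < 0.05).astype(float)
y = np.repeat([0, 1], 50)
y[::7] = -1                                            # partially observed labels
k_hat, _ = community_similarity(A, A[0], y, K=2)
```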
While most prior MRI denoising methods are supervised in nature, acquiring supervised training datasets for the multitude of anatomies, MRI scanners, and scan parameters proves impractical. Here, we propose Denoising Diffusion Models for Denoising Diffusion MRI (DDM^2), a self-supervised denoising method for MRI denoising using diffusion denoising generative models. Our three-stage framework integrates statistic-based denoising theory into diffusion models and performs denoising through conditional generation. During inference, we represent input noisy measurements as a sample from an intermediate posterior distribution within the diffusion Markov chain. We conduct experiments on 4 real-world in-vivo diffusion MRI datasets and show that our DDM^2 demonstrates superior denoising performance, as ascertained with clinically relevant qualitative visual and quantitative metrics.","Unsupervised MRI Denoising, Diffusion Models" Multivariate Time-series Imputation with Disentangled Temporal Representations,https://openreview.net/forum?id=rdjeCNUS6TG,https://openreview.net/pdf?id=rdjeCNUS6TG,"We propose a multivariate time-series imputation model based on matrix factorization, which composes meaningful disentangled temporal representations that account for multiple explanatory factors (trend, seasonality, local bias).","Multivariate time series often face the problem of missing values. Many time series imputation methods have been developed in the literature. However, these methods all rely on an entangled representation to model the dynamics of time series, which may fail to fully exploit the multiple factors (e.g., periodic patterns) contained in the time series. Moreover, the entangled representation usually has no semantic meaning, and thus often lacks interpretability. In addition, many recent models have been proposed to deal with the whole time series to capture cross-channel correlations and identify temporal dynamics, but they are not scalable to large-scale datasets. Different from existing approaches, we propose TIDER, a novel matrix factorization-based method with disentangled temporal representations that account for multiple factors, namely trend, seasonality, and local bias, to model complex dynamics. The learned disentanglement makes the imputation process more reliable and offers explainability for imputation results. Moreover, TIDER is scalable to large datasets. Empirical results show that our method not only outperforms existing approaches by notable margins on three real-world datasets, but also scales well to large datasets on which existing deep learning based methods struggle. Disentanglement validation experiments further demonstrate the robustness of our model in obtaining accurate and explainable disentangled components.","multivariate time-series imputation, disentangled representation" Knowledge-driven Scene Priors for Semantic Audio-Visual Embodied Navigation,https://openreview.net/forum?id=nYqCVDAXAPE,https://openreview.net/pdf?id=nYqCVDAXAPE,"We introduce knowledge-driven scene priors in audio-visual navigation, combining semantics from a knowledge graph, spatial knowledge from dual Graph Encoder Networks, and background knowledge from pre-training tasks—all within an RL framework.","Generalisation to unseen contexts remains a challenge for embodied navigation agents. In the context of semantic audio-visual navigation (SAVi) tasks, generalisation includes both generalising to unseen indoor visual scenes as well as generalising to unheard sounding objects.
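The factorization-with-disentangled-temporal-factors idea in the TIDER abstract above can be sketched in a few lines of PyTorch. This toy version fits observed entries only and omits the smoothness and periodicity regularizers that would keep the trend, seasonality, and bias components truly disentangled, so treat it as a sketch of the decomposition, not the paper's model.

```python
import torch

T, N, r = 200, 8, 4                               # time steps, series, rank
U = torch.randn(N, r, requires_grad=True)         # per-series factors
trend = torch.randn(T, r, requires_grad=True)     # smooth trend component
season = torch.randn(T, r, requires_grad=True)    # periodic component
bias = torch.randn(T, r, requires_grad=True)      # local bias component

def reconstruct():
    V = trend + season + bias                     # disentangled temporal factors
    return V @ U.T                                # (T, N) imputed matrix

X = torch.randn(T, N)                             # stand-in data
mask = torch.rand(T, N) > 0.3                     # observed entries
opt = torch.optim.Adam([U, trend, season, bias], lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss = ((reconstruct() - X)[mask] ** 2).mean()  # fit observed entries only
    loss.backward()
    opt.step()
```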
Previous SAVi task definitions do not include evaluation conditions on truly novel sounding objects, resorting instead to evaluating agents on unheard sound clips of known objects; meanwhile, previous SAVi methods do not include explicit mechanisms for incorporating domain knowledge about object and region semantics. These weaknesses limit the development and assessment of models' abilities to generalise their learned experience. In this work, we introduce the use of knowledge-driven scene priors in the semantic audio-visual embodied navigation task: we combine semantic information from our novel knowledge graph that encodes object-region relations, spatial knowledge from dual Graph Encoder Networks, and background knowledge from a series of pre-training tasks---all within a reinforcement learning framework for audio-visual navigation. We define a new audio-visual navigation sub-task, where agents are evaluated on novel sounding objects, as opposed to unheard clips of known objects. We show improvements over strong baselines in generalisation to unseen regions and novel sounding objects, within the Habitat-Matterport3D simulation environment, under the SoundSpaces task. We release code, knowledge graphs, and dataset generation details in the supplementary material.","Audio-Visual Navigation, Scene Priors, Object Relations, Embodied AI" Promoting Semantic Connectivity: Dual Nearest Neighbors Contrastive Learning for Unsupervised Domain Generalization,https://openreview.net/forum?id=Iewi8zwGsZr,https://openreview.net/pdf?id=Iewi8zwGsZr,,"Domain Generalization (DG) has achieved great success in generalizing knowledge from source domains to unseen target domains. However, current DG methods rely heavily on labeled source data, which are usually costly to obtain or simply unavailable. Thus, we study a more practical unsupervised domain generalization (UDG) problem. Learning invariant visual representations from different views, i.e., contrastive learning, yields good semantic features for in-domain unsupervised learning. However, it fails in cross-domain scenarios. In this paper, we first delve into the failure of vanilla contrastive learning and point out that semantic connectivity is the key to UDG. Specifically, suppressing the intra-domain connectivity and encouraging the intra-class connectivity help to learn the domain-invariant semantic information. Then, we propose a novel unsupervised domain generalization approach, namely Dual Nearest Neighbors contrastive learning with strong Augmentation (DN$^2$A). DN$^2$A leverages strong augmentations to suppress the intra-domain connectivity and proposes a novel dual nearest neighbors search strategy to find trustworthy cross-domain neighbors along with in-domain neighbors to encourage intra-class connectivity. Experimental results demonstrate that our DN$^2$A outperforms the state-of-the-art by a large margin, e.g., 12.01% and 13.11% accuracy gains with only 1% labels for linear evaluation on PACS and DomainNet, respectively.
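A minimal sketch of the dual nearest neighbors search described in the DN$^2$A abstract above: each sample takes one in-domain and one cross-domain nearest neighbor as extra positives. The paper's trustworthiness filtering is omitted, so this only shows the bare search.

```python
import torch

def dual_nn_positives(z, domain):
    """z: (B, d) L2-normalized embeddings; domain: (B,) integer domain ids.
    Returns, for each sample, the index of its nearest in-domain neighbor
    and its nearest cross-domain neighbor (assumes each sample has at least
    one candidate of each kind in the batch)."""
    sim = z @ z.T
    sim.fill_diagonal_(-float("inf"))              # never match a sample to itself
    same = domain[:, None] == domain[None, :]
    in_dom = sim.masked_fill(~same, -float("inf")).argmax(1)
    cross_dom = sim.masked_fill(same, -float("inf")).argmax(1)
    return in_dom, cross_dom
```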
", Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language,https://openreview.net/forum?id=G2Q2Mh3avow,https://openreview.net/pdf?id=G2Q2Mh3avow,"We present a modular class of systems in which multiple pretrained models may be composed zero-shot via multimodal-informed prompt engineering to capture new multimodal capabilities, without additional finetuning.","We investigate how multimodal prompt engineering can use language as the intermediate representation to combine complementary knowledge from different pretrained (potentially multimodal) language models for a variety of tasks. This approach is both distinct from and complementary to the dominant paradigm of joint multimodal training. It also recalls a traditional systems-building view as in classical NLP pipelines, but with prompting large pretrained multimodal models. We refer to these as Socratic Models (SMs): a modular class of systems in which multiple pretrained models may be composed zero-shot via multimodal-informed prompting to capture new multimodal capabilities, without additional finetuning. We show that these systems provide competitive state-of-the-art performance for zero-shot image captioning and video-to-text retrieval, and also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes), and (iii) robot perception and planning. We hope this work provides (a) results for stronger zero-shot baseline performance with analysis also highlighting their limitations, (b) new perspectives for building multimodal systems powered by large pretrained models, and (c) practical application advantages in certain regimes limited by data scarcity, training compute, or model access.","prompt engineering, multimodal applications, visual language models, large language models, commonsense reasoning" MixPath: A Unified Approach for One-shot Neural Architecture Search ,https://openreview.net/forum?id=gO8vNRKzzjp,https://openreview.net/pdf?id=gO8vNRKzzjp,A multi-path one-shot neural architecture search approach,"Blending multiple convolutional kernels is proved advantageous in neural architecture design. However, current two-stage neural architecture search methods are mainly limited to single-path search spaces. How to efficiently search models of multi-path structures remains a difficult problem. In this paper, we are motivated to train a one-shot multi-path supernet to accurately evaluate the candidate architectures. Specifically, we discover that in the popular search spaces, feature vectors summed from multiple paths are nearly multiples of those from a single path. Such disparity perturbs the supernet training and its ranking ability. Therefore, we propose a novel mechanism called \emph{Shadow Batch Normalization} (SBN) to regularize the disparate feature statistics. Extensive experiments prove that SBNs are capable of stabilizing the optimization and improving ranking performance by a clear margin. 
We call our unified multi-path one-shot approach MixPath; it efficiently generates a series of competitive models on ImageNet.","neural architecture search, multi-path, one-shot" CAST: Concurrent Recognition and Segmentation with Adaptive Segment Tokens,https://openreview.net/forum?id=P17yA67o3VL,https://openreview.net/pdf?id=P17yA67o3VL,A new ViT integrated with data-driven perceptual organization to simultaneously learn image segmentation for free while training the model for unsupervised recognition.,"Recognizing an image and segmenting it into coherent regions are often treated as separate tasks. Human vision, however, has a general sense of segmentation hierarchy before recognition occurs. We are thus inspired to learn image recognition with hierarchical image segmentation based entirely on unlabeled images. Our insight is to learn fine-to-coarse features concurrently at the superpixel, segment, and full-image levels, enforcing consistency and goodness of feature-induced segmentations while maximizing discrimination among image instances. Our model innovates vision transformers on three aspects. 1) We use adaptive segment tokens instead of fixed-shape patch tokens. 2) We create a token hierarchy by inserting graph pooling between transformer blocks, naturally producing consistent multi-scale segmentations while increasing the segment size and reducing the number of tokens. 3) We produce hierarchical image segmentation for free {\it while} training for recognition by maximizing image-wise discrimination. Our work delivers the first concurrent recognition and hierarchical segmentation model without any supervision. Validated on ImageNet and PASCAL VOC, it achieves better recognition and segmentation with higher computational efficiency.", Improving the Latent Space of Image Style Transfer,https://openreview.net/forum?id=MXoeggsH7yP,https://openreview.net/pdf?id=MXoeggsH7yP,We find a widespread problem in style transfer caused by the inappropriate pre-trained encoders to provide supervision signals and design a training scheme to alleviate this problem. ,"Existing neural style transfer studies utilize statistical information of features from a pre-trained encoder as representations of the style and achieve significant improvement in synthesizing artistic images. However, in some cases, the feature statistics from the pre-trained encoder may not be consistent with the visual style we perceive. The style distance between some images of different styles is smaller than that between images of the same style. In such an inappropriate latent space, the objective function of the existing methods will be optimized in the wrong direction, resulting in poor stylization results. In addition, the lack of content details in the features extracted by the pre-trained encoder also leads to the content leak problem. In order to solve these issues in the latent space used by style transfer, we propose two contrastive training schemes to obtain a refined encoder that is more suitable for this task. The style contrastive loss pulls the stylized result closer to the same visual style image and pushes it away from the content image. The content contrastive loss enables the encoder to retain more available details. The training scheme can be directly added to existing style transfer methods and significantly improve their results. Extensive experimental results demonstrate the effectiveness and superiority of our methods.
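A minimal sketch of the Shadow Batch Normalization idea from the MixPath abstract above: keep one BN per possible number of active paths, so the disparate statistics of 1-path and m-path feature sums are normalized separately. The authors' exact implementation may differ.

```python
import torch.nn as nn

class ShadowBN(nn.Module):
    """One BatchNorm per possible number of active paths (1..m), selected at
    forward time by how many paths were summed; this regularizes the scale
    disparity between single-path and multi-path feature sums."""
    def __init__(self, channels, m):
        super().__init__()
        self.bns = nn.ModuleList(nn.BatchNorm2d(channels) for _ in range(m))

    def forward(self, summed_features, n_active_paths):
        return self.bns[n_active_paths - 1](summed_features)
```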
","style transfer, contrastive learning" Multi-lingual Evaluation of Code Generation Models,https://openreview.net/forum?id=Bo7eeXm6An8,https://openreview.net/pdf?id=Bo7eeXm6An8,,"We present MBXP, an execution-based code completion benchmark in 10+ programming languages. This collection of datasets is generated by our conversion framework that translates prompts and test cases from the original MBPP dataset to the corresponding data in a target language. Based on this benchmark, we are able to evaluate code generation models in a multi-lingual fashion, and in particular discover generalization ability of language models on out-of-domain languages, advantages of large multi-lingual models over mono-lingual, benefits of few-shot prompting, and zero-shot translation abilities. In addition, we use our code generation model to perform large-scale bootstrapping to obtain synthetic canonical solutions in several languages. These solutions can be used for other code-related evaluations such as insertion-based, summarization, or code translation tasks where we demonstrate results and release as part of our benchmark.","code generation, execution-based evaluation, test-based evaluation, language models, multi-lingual code generation benchmark, code insertion, code summarization, robustness for code, code translation, zero-shot code translation, multi-lingual, mono-lingual, language models." GRACE-C: Generalized Rate Agnostic Causal Estimation via Constraints,https://openreview.net/forum?id=B_pCIsX8KL_,https://openreview.net/pdf?id=B_pCIsX8KL_,A novel method for causal structure discovery in undersampled time-series with three orders of magnitude speedup under the same theoretical guarantees.,"Graphical structures estimated by causal learning algorithms from time series data can provide highly misleading causal information if the causal timescale of the generating process fails to match the measurement timescale of the data. Existing algorithms provide limited resources to respond to this challenge, and so researchers must either use models that they know are likely misleading, or else forego causal learning entirely. Existing methods face up-to-four distinct shortfalls, as they might a) require that the difference between causal and measurement timescales is known; b) only handle very small number of random variables when the timescale difference is unknown; c) only apply to pairs of variables (albeit with fewer assumptions about prior knowledge); or d) be unable to find a solution given statistical noise in the data. This paper aims to address these challenges. We present an algorithm that combines constraint programming with both theoretical insights into the problem structure and prior information about admissible causal interactions to achieve speed up of multiple orders of magnitude. The resulting system scales to significantly larger sets of random variables ($>100$) without knowledge of the timescale difference while maintaining theoretical guarantees. This method is also robust to edge misidentification and can use parametric connection strengths, while optionally finding the optimal among many possible solutions. 
","Causal structure learning, causal learning, graph theory, brain imaging, fMRI" How Powerful is Implicit Denoising in Graph Neural Networks,https://openreview.net/forum?id=Rsrd5wK4kEh,https://openreview.net/pdf?id=Rsrd5wK4kEh,We theoretically analyze the denoising effect in graph neural networks.,"Graph Neural Networks (GNNs), which aggregate features from neighbors, are widely used for processing graph-structured data due to their powerful representation learning capabilities. It is generally believed that GNNs can implicitly remove feature noises. However, existing works have not rigorously analyzed the implicit denoising effect in graph neural networks. In this work, we conduct a comprehensive theoretical study and analyze when and why implicit denoising happens in GNNs. Our theoretical analysis suggests that the implicit denoising largely depends on the connectivity and size of the graph, as well as the GNN architectures. Motivated by adversarial machine learning in improving the robustness of neural networks, we propose the adversarial graph signal denoising (AGSD) problem. By solving such a problem, we derive a robust graph convolution, where the smoothness of the node representations and the implicit denoising effect can be enhanced. Extensive empirical evaluations verify our theoretical analyses and the effectiveness of our proposed model.","GNN denoising, GNN theory" Unified Detoxifying and Debiasing in Language Generation via Inference-time Adaptive Optimization,https://openreview.net/forum?id=FvevdI0aA_h,https://openreview.net/pdf?id=FvevdI0aA_h,"We propose an inference-time unified detoxifying and debiasing framework, which achieves better balance among effectiveness, computation cost and generation quality.","Recently pre-trained language models (PLMs) have prospered in various natural language generation (NLG) tasks due to their ability to generate fairly fluent text. Nevertheless, these models are observed to capture and reproduce harmful contents in training corpora, typically toxic language and social biases, raising severe moral issues. Prior works on ethical NLG tackle detoxifying and debiasing separately, which is problematic since we find debiased models still exhibit toxicity while detoxified ones even exacerbate biases. To address such a challenge, we propose the first unified framework of detoxifying and debiasing called UDDIA, which jointly formalizes these two problems as rectifying the output space. We theoretically interpret our framework as learning a text distribution mixing weighted attributes. Besides, UDDIA conducts adaptive optimization of only a few parameters during decoding based on a parameter-efficient tuning schema without any training data. This leads to minimal generation quality loss and improved rectification performance with acceptable computational cost. 
Experimental results demonstrate that compared to several strong baselines, UDDIA achieves debiasing and detoxifying simultaneously and better balances efficiency and effectiveness, taking a further step towards practical ethical NLG.","detoxify, debias, language generation" Distribution Aware Metrics for Conditional Natural Language Generation,https://openreview.net/forum?id=QEfpL9Iy2KD,https://openreview.net/pdf?id=QEfpL9Iy2KD,This work introduces alternative methods for the evaluation of conditional natural language generation based on language distributional divergences.,"Traditional automated metrics for evaluating conditional natural language generation use pairwise comparisons between a single generated text and the best-matching gold-standard ground truth text. When multiple ground truths are available, scores are aggregated using an average or max operation across references. While this approach works well when diversity in the ground truth data (i.e. dispersion of the distribution of conditional texts) can be ascribed to noise, such as in automated speech recognition, it does not allow for robust evaluation in the case where diversity in the ground truths represents signal for the model. In this work we argue that existing metrics are not appropriate for domains such as visual description or summarization where ground truths are semantically diverse, and where the diversity in those captions captures useful additional information about the context. We propose a novel paradigm for multi-candidate evaluation of conditional language generation models, and a new family of metrics that compare the {\em distributions} of reference and model-generated caption sets using small sample sets of each. We demonstrate the utility of our approach with a case study in visual description, where we show that existing models optimize for single-description quality over diversity, and gain some insights into how sampling methods and temperature impact description quality and diversity.","Natural Language Generation, Video Description, Image Description, Metrics" GOAT: A Global Transformer on Large-scale Graphs,https://openreview.net/forum?id=z29R0uMiF3v,https://openreview.net/pdf?id=z29R0uMiF3v,A global graph transformer working well on both homophilous and heterophilous graphs.,"Graph transformers have been competitive on graph classification tasks, but they fail to outperform Graph Neural Networks (GNNs) on node classification, which is a common task performed on large-scale graphs for industrial applications. Meanwhile, existing GNN architectures are limited in their ability to perform equally well on both homophilous and heterophilous graphs as their inductive biases are generally tailored to only one setting. To address these issues, we propose GOAT, a scalable global graph transformer. In GOAT, each node conceptually attends to all the nodes in the graph and homophily/heterophily relationships can be learnt adaptively from the data. We provide theoretical justification for our approximate global self-attention scheme, and show it to be scalable to large-scale graphs.
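One simple way to realize the set-versus-set comparison advocated in the distribution-aware metrics abstract above is a kernel MMD between embedded caption sets; this is an illustrative choice under the same distribution-matching view, not necessarily the paper's exact metric family.

```python
import numpy as np

def mmd2(X, Y, gamma=1.0):
    """Squared maximum mean discrepancy with an RBF kernel between two sets
    of caption embeddings X (n, d) and Y (m, d): scores how similar the
    *distributions* of model samples and references are, rather than
    matching each generation to a single best reference."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```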
We demonstrate the competitiveness of GOAT on both heterophilous and homophilous graphs with millions of nodes.","Graph Neural Network, Transformer, Node Classification" An Empirical Study on Anomaly detection Using Density Based and Representative Based Clustering algorithms,https://openreview.net/forum?id=u6ay5ONhWJV,https://openreview.net/pdf?id=u6ay5ONhWJV,"In this paper, we focus on existing anomaly detection approaches, by empirically studying the performance of unsupervised anomaly detection techniques.","In data mining and statistics, anomaly detection is the process of finding data patterns (outcomes, values, or observations) that deviate from the rest of the observations or outcomes. Anomaly detection is heavily used in solving real-world problems in many application domains like medicine, finance, cybersecurity, banking, networking, transportation, and military surveillance for enemy activities, among other fields. In this paper, we present an empirical study of unsupervised anomaly detection techniques such as DBSCAN, DBSCAN++ (with uniform initialization, k-center initialization, uniform with approximate neighbor initialization, and k-center with approximate neighbor initialization), and k-means --(minus minus) algorithms on six benchmark imbalanced datasets. Findings from our in-depth empirical study show that k-means -- is more robust than DBSCAN and DBSCAN++ in terms of the different evaluation measures (F1 score, false alarm rate, adjusted Rand index, and Jaccard coefficient) and running time. We also observe that DBSCAN performs very well on datasets with a small number of data points. ","Anomaly, Outliers, Noise points, ANN, DBSCAN, DBSCAN++, k-means - - (minus minus)" Recommender Transformers with Behavior Pathways,https://openreview.net/forum?id=YsdscENWse9,https://openreview.net/pdf?id=YsdscENWse9,We build the Recommender Transformer (RETR) with a novel Pathway Attention mechanism that can dynamically plan the behavior pathway. It achieves SOTA in both intra-domain and cross-domain benchmarks for sequential recommendation.,"Sequential recommendation requires the recommender to capture the evolving behavior characteristics from logged user behavior data for accurate recommendations. Nevertheless, user behavior sequences are viewed as a script with multiple ongoing threads intertwined. We find that only a small set of pivotal behaviors evolve into the user's future actions. As a result, the future behavior of the user is hard to predict. We refer to this characteristic of each user's sequential behaviors as the \textit{behavior pathway}. Different users have their unique behavior pathways. Among existing sequential models, transformers have shown great capacity for capturing global dependencies. However, these models mainly provide a dense distribution over all previous behaviors using the self-attention mechanism, making the final predictions overwhelmed by the trivial behaviors not adjusted to each user. In this paper, we build the Recommender Transformer (RETR) with a novel Pathway Attention mechanism. RETR can dynamically plan the behavior pathway specified for each user, and sparingly activate the network through this behavior pathway to effectively capture evolving patterns useful for recommendation. The key design is a learned binary route to prevent the behavior pathway from being overwhelmed by trivial behaviors.
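A minimal sketch of the DBSCAN-as-anomaly-detector protocol from the empirical study above, using scikit-learn: points labeled -1 (noise) are treated as anomalies and scored with F1. The data here are random stand-ins, not one of the paper's six benchmarks.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import f1_score

X = np.random.RandomState(0).randn(500, 2)   # stand-in for a benchmark set
y_true = np.zeros(500, dtype=int)
y_true[:25] = 1                              # hypothetical anomaly labels

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
y_pred = (labels == -1).astype(int)          # DBSCAN noise points = anomalies
print("F1:", f1_score(y_true, y_pred))
```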
Pathway attention is model-agnostic and can be applied to a series of transformer-based models for sequential recommendation. We empirically evaluate RETR on seven intra-domain benchmarks, where it yields state-of-the-art performance. On another five cross-domain benchmarks, RETR can capture more domain-invariant representations for sequential recommendation.","Recommendation, Transformers, Behavior Pathway" Outlier Robust Adversarial Training,https://openreview.net/forum?id=8VCiVV97Pji,https://openreview.net/pdf?id=8VCiVV97Pji,,"Supervised learning models are challenged by the intrinsic complexities of training data, such as outliers and minority subpopulations, as well as by intentional attacks at inference time with adversarial samples. While traditional robust learning methods and the recent adversarial training approaches are designed to handle each of the two challenges, to date, no work has been done to develop models that are robust to low-quality training data and to potential adversarial attacks at inference time simultaneously. It is for this reason that we introduce Outlier Robust Adversarial Training (ORAT) in this work. ORAT is based on a bi-level optimization formulation of adversarial training with a robust rank-based loss function. Theoretically, we show that the learning objective of ORAT satisfies the H-consistency in binary classification, which establishes it as a proper surrogate for the adversarial 0/1 loss. Furthermore, we analyze its generalization ability and provide uniform convergence rates with high probability. ORAT can be optimized with a simple algorithm. Experimental evaluations on three benchmark datasets demonstrate the effectiveness and robustness of ORAT in handling outliers and adversarial attacks. ", Towards Discovering Neural Architectures from Scratch,https://openreview.net/forum?id=UIpwFLrJiDi,https://openreview.net/pdf?id=UIpwFLrJiDi,"We introduce an algebraic view on Neural Architecture Search that allows us to construct highly expressive search spaces with context-free grammars, and show that we can efficiently find well-performing architectures.","The discovery of neural architectures from scratch is the long-standing goal of Neural Architecture Search (NAS). Searching over a wide spectrum of neural architectures can facilitate the discovery of previously unconsidered but well-performing architectures. In this work, we take a large step towards discovering neural architectures from scratch by expressing architectures algebraically. This algebraic view leads to a more general method for designing search spaces, which allows us to compactly represent search spaces that are 100s of orders of magnitude larger than common spaces from the literature. Further, we propose a Bayesian Optimization strategy to efficiently search over such huge spaces, and demonstrate empirically that both our search space design and our search strategy can be superior to existing baselines.
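The learned binary route in the RETR abstract above can be realized with a straight-through gate, sketched below; this is one plausible parameterization, not necessarily the authors' exact one.

```python
import torch

def binary_route(scores, tau=1.0):
    """Straight-through binary gate over past behaviors: a learned score per
    behavior is binarized to decide whether it joins the user's behavior
    pathway. Forward pass uses hard 0/1 gates; the backward pass flows
    through the soft sigmoid."""
    probs = torch.sigmoid(scores / tau)
    hard = (probs > 0.5).float()
    return hard + probs - probs.detach()
```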
We open source our algebraic NAS approach and provide APIs for PyTorch and TensorFlow.","Neural Architecture Search, Search Space Design, Bayesian Optimization" Equiformer: Equivariant Graph Attention Transformer for 3D Atomistic Graphs,https://openreview.net/forum?id=KwmPfARgOTD,https://openreview.net/pdf?id=KwmPfARgOTD,"We propose an equivariant graph neural network based on Transformer networks and propose a novel attention mechanism, which improves upon self-attention in typical Transformers.","3D-related inductive biases like translational invariance and rotational equivariance are indispensable to learning on 3D atomistic graphs such as molecules. Inspired by the success of Transformers in various domains, we study how to incorporate these inductive biases into Transformers. In this paper, we present Equiformer, a graph neural network leveraging the strength of Transformer architectures and incorporating SE(3)/E(3)-equivariant features based on irreducible representations (irreps). Irreps features encode equivariant information in channels without complicating graph structures and thus enable us to directly incorporate them into Transformers by using equivariant operations. Moreover, we propose a novel attention mechanism called equivariant graph attention, which considers both content and geometric information contained in irreps features. The proposed equivariant graph attention improves upon typical attention in Transformers by replacing dot product attention with multi-layer perceptron attention and including non-linear message passing. We benchmark Equiformer on QM9, MD17, and OC20 datasets. Experiments demonstrate that Equiformer achieves results competitive with previous models and verify the effectiveness of the proposed attention.","equivariant neural networks, graph neural networks, computational physics, transformer networks" Multiple Instance Learning via Iterative Self-Paced Supervised Contrastive Learning,https://openreview.net/forum?id=ALMbHbLb3PK,https://openreview.net/pdf?id=ALMbHbLb3PK,"We propose a framework for multiple instance learning, which iteratively improves instance-level features by jointly estimating latent instance-level pseudo labels, and show that it outperforms existing methods on three real-world medical datasets.","Learning representations for individual instances when only bag-level labels are available is a fundamental challenge in multiple instance learning (MIL). Recent works have shown promising results using contrastive self-supervised learning (CSSL), which learns to push apart representations corresponding to two different randomly-selected instances. Unfortunately, in real-world applications such as medical image classification, there is often class imbalance, so randomly-selected instances mostly belong to the same majority class, which precludes CSSL from learning inter-class differences. To address this issue, we propose a novel framework, Iterative Self-paced Supervised Contrastive Learning for MIL Representations (ItS2CLR), which improves the learned representation by exploiting instance-level pseudo labels derived from the bag-level labels. The framework employs a novel self-paced sampling strategy to ensure the accuracy of pseudo labels.
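A minimal sketch of the MLP-attention substitution described in the Equiformer abstract above, on an ordinary (non-equivariant) graph: edge attention weights come from an MLP over the pair's features instead of a query-key dot product. The real model operates on SE(3)/E(3)-equivariant irreps features, which are omitted here.

```python
import torch
import torch.nn as nn

class MLPAttention(nn.Module):
    """Graph attention where the unnormalized weight of edge (i, j) is an
    MLP score on the concatenated endpoint features, normalized per node."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.SiLU(),
                                   nn.Linear(dim, 1))

    def forward(self, x, edges):                 # x: (N, d); edges: (E, 2) long
        src, dst = edges[:, 0], edges[:, 1]
        a = self.score(torch.cat([x[src], x[dst]], -1)).squeeze(-1)  # (E,)
        w = torch.zeros_like(a)
        for i in dst.unique():                   # softmax over incoming edges
            m = dst == i
            w[m] = torch.softmax(a[m], 0)
        return torch.zeros_like(x).index_add_(0, dst, w[:, None] * x[src])
```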
We evaluate ItS2CLR on three medical datasets, showing that it improves the quality of instance-level pseudo labels and representations, and outperforms existing MIL methods in terms of both bag- and instance-level accuracy.","multiple instance learning, whole slide image, contrastive learning, medical imaging" Automating Nearest Neighbor Search Configuration with Constrained Optimization,https://openreview.net/forum?id=KfptQCEKVW4,https://openreview.net/pdf?id=KfptQCEKVW4,,"The approximate nearest neighbor (ANN) search problem is fundamental to efficiently serving many real-world machine learning applications. A number of techniques have been developed for ANN search that are efficient, accurate, and scalable. However, such techniques typically have a number of parameters that affect the speed-recall tradeoff, and exhibit poor performance when such parameters aren’t properly set. Tuning these parameters has traditionally been a manual process, demanding in-depth knowledge of the underlying search algorithm. This is becoming an increasingly unrealistic demand as ANN search grows in popularity. To tackle this obstacle to ANN adoption, this work proposes a constrained optimization-based approach to tuning quantization-based ANN algorithms. Our technique takes just a desired search cost or recall as input, and then generates tunings that, empirically, are very close to the speed-recall Pareto frontier and give leading performance on standard benchmarks.","AutoML, Convex Optimization, Vector Retrieval, Hyperparameter Search, ANN" A prototype-oriented clustering for domain shift with source privacy,https://openreview.net/forum?id=pjYWuX78J6p,https://openreview.net/pdf?id=pjYWuX78J6p,We propose a method to solve the problem of unsupervised clustering under domain shift and privacy concerns.,"Unsupervised clustering under domain shift (UCDS) studies how to transfer the knowledge from abundant unlabeled data from multiple source domains to learn the representation of the unlabeled data in a target domain. In this paper, we introduce Prototype-oriented Clustering with Distillation (PCD) to not only improve the performance and applicability of existing methods for UCDS, but also address the concerns on protecting the privacy of both the data and model of the source domains. PCD first constructs a source clustering model by aligning the distributions of prototypes and data. It then distills the knowledge to the target model through cluster labels provided by the source model while simultaneously clustering the target data. Finally, it refines the target model on the target domain data without guidance from the source model. Experiments across multiple benchmarks show the effectiveness and generalizability of our source-private clustering method.","deep learning, clustering, privacy, computer vision" On the Effectiveness of Adapting Pre-trained Transformer Models via Adversarial Noise,https://openreview.net/forum?id=_5jWpg7TK9b,https://openreview.net/pdf?id=_5jWpg7TK9b,We investigate the computation efficiency vs. generalization in adapting natural language understanding tasks and propose a method to accelerates model adaptation of Transformers by up to 9.8 times.,"Pretraining Transformer-based language models followed by adapting the pre-trained models to a downstream task is an effective transfer mechanism in NLP.
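The self-paced sampling in the ItS2CLR abstract above follows the generic pattern sketched below: keep only the most confident instance pseudo-labels and grow the kept fraction over training. The paper's exact schedule and confidence measure may differ.

```python
import numpy as np

def self_paced_select(scores, frac):
    """Keep the `frac` most confident instance pseudo-labels, measuring
    confidence as distance of the predicted probability from 0.5; `frac`
    grows over training so harder instances enter later."""
    conf = np.abs(scores - 0.5)
    k = max(1, int(frac * len(scores)))
    return np.argsort(-conf)[:k]          # indices of selected instances
```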
While it is well known that the pretraining stage is computationally expensive, downstream adaptation is also becoming costly as Transformers rapidly grow in size and fine-tuning pre-trained Transformers is used in ever wider scenarios. In this work, we find that techniques that have demonstrated success in accelerating the pre-training tasks, such as large-batch optimizations, lead to severe accuracy degradation. We find that strong regularization techniques, such as adversarial training, help to close the accuracy gap. However, the computational complexity associated with this approach, due to the high cost of generating adversaries, prevents it from reducing adaptation costs even with a large number of GPUs. As such, we systematically study both the computation efficiency and generalization of adversarial training for adapting pre-trained transformers, under a large-batch optimization regime. Our investigation yields simple yet effective algorithms for adapting transformer models. We show in experiments that our proposed method attains up to 9.8$\times$ adaptation speedups over the baseline on BERT$_{base}$ and RoBERTa$_{large}$, while achieving comparable and sometimes higher accuracy than fine-tuning using existing baselines.","Fast adaptation, Pre-trained Transformer Networks, Natural Language Understanding" Sparse Tokens for Dense Prediction - The Medical Image Segmentation Case,https://openreview.net/forum?id=-D5TVtzt3fP,https://openreview.net/pdf?id=-D5TVtzt3fP,We show how to perform dense prediction efficiently with a sparse token ViT while maintaining performance.,"Can we use sparse tokens for dense prediction, e.g., segmentation? Although token sparsification has been applied to Vision Transformers (ViT) for acceleration on classification tasks, it is still unknown how to perform segmentation from sparse tokens. To this end, we reformulate segmentation as a sparse encoding -> token completion -> dense decoding (SCD) pipeline. We first show empirically that naively applying existing approaches from classification token pruning and masked image modeling (MIM) leads to failure and training inefficiency. This is caused by inappropriate sampling algorithms and the low quality of the restored dense features. In this paper, we propose Soft-topK Token Pruning (STP) and Multi-layer Token Assembly (MTA) to address the above problems. Particularly, in the sparse encoding stage, STP predicts token-wise importance scores with a lightweight sub-network and samples topK-scored tokens. The intractable gradients of topK are approximated through a continuous perturbed score distribution. In the token completion stage, MTA restores a full token sequence by assembling both sparse output tokens and pruned intermediate tokens from multiple layers. Compared to MIM which fills the pruned positions with mask tokens, MTA produces more informative representations allowing more accurate segmentation. The last dense decoding stage is compatible with decoders of existing segmentation frameworks, e.g., UNETR.
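The perturbed top-K selection in STP, described above, can be sketched as Gumbel-perturbed scores followed by top-K. This shows only the sampling side, not the authors' exact gradient estimator for the intractable top-K.

```python
import torch

def soft_topk_sample(scores, k, noise_scale=1.0, training=True):
    """Select k tokens by importance scores; during training, perturb the
    scores with Gumbel noise so the selection is stochastic and gradients
    can be approximated through the perturbed score distribution."""
    if training:
        g = -torch.log(-torch.log(torch.rand_like(scores)))  # Gumbel(0, 1)
        scores = scores + noise_scale * g
    return scores.topk(k, dim=-1).indices
```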
Experiments show that SCD pipelines equipped with our STP and MTA are much faster than baselines without token sparsification in both training (up to 120% higher throughput) and inference (up to 60.6% higher throughput) while maintaining segmentation quality.","token pruning, vision transformer, dense prediction, medical image segmentation" Truncated Diffusion Probabilistic Models and Diffusion-based Adversarial Auto-Encoders,https://openreview.net/forum?id=HDxgaKk956l,https://openreview.net/pdf?id=HDxgaKk956l,"We propose truncated diffusion probabilistic models, which models an implicit prior to truncate the diffusion chain and requires significantly fewer reverse steps to generate high-quality samples.","Employing a forward diffusion chain to gradually map the data to a noise distribution, diffusion-based generative models learn how to generate the data by inferring a reverse diffusion chain. However, this approach is slow and costly because it needs many forward and reverse steps. We propose a faster and cheaper approach that adds noise not until the data become pure random noise, but until they reach a hidden noisy-data distribution that we can confidently learn. Then, we use fewer reverse steps to generate data by starting from this hidden distribution that is made similar to the noisy data. We reveal that the proposed model can be cast as an adversarial auto-encoder empowered by both the diffusion process and a learnable implicit prior. Experimental results show that even with a significantly smaller number of reverse diffusion steps, the proposed truncated diffusion probabilistic models can provide consistent improvements over the non-truncated ones in terms of performance in both unconditional and text-guided image generation.","Diffusion model, adversarial autoencoder, implicit prior" NTK-SAP: Improving neural network pruning by aligning training dynamics,https://openreview.net/forum?id=-5EWhW_4qWP,https://openreview.net/pdf?id=-5EWhW_4qWP,We introduce a pruning-at-initialization method by aligning the eigenspectrum of NTK to that of the dense network.,"Pruning neural networks before training has received increasing interest due to its potential to reduce training time and memory. One popular method is to prune the connections based on a certain metric, but it is not entirely clear what metric is the best choice. Recent advances in neural tangent kernel (NTK) theory suggest that the training dynamics of large enough neural networks are closely related to the spectrum of the NTK. Motivated by this finding, we propose to prune the connections that have the least influence on the spectrum of the NTK. This method can help maintain the NTK spectrum, which may help align the training dynamics with that of its dense counterpart. However, one possible issue is that the fixed-weight-NTK corresponding to a given initial point can be very different from the NTK corresponding to later iterates during the training phase. We further propose to sample multiple realizations of random weights to estimate the NTK spectrum. Note that our approach is weight-agnostic, which is different from most existing methods that are weight-dependent. In addition, we use random inputs to compute the fixed-weight-NTK, making our method data-agnostic as well. We name our foresight pruning algorithm Neural Tangent Kernel Spectrum-Aware Pruning (NTK-SAP).
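A minimal sketch of the truncated sampling described in the truncated diffusion abstract above; `prior_sampler` (the learned implicit prior over the hidden noisy-data distribution) and `denoiser.p_step` (one learned reverse transition) are hypothetical interfaces.

```python
import torch

@torch.no_grad()
def truncated_sample(denoiser, prior_sampler, t_trunc, shape):
    """Reverse diffusion that starts at intermediate step t_trunc from a
    learned prior rather than at pure Gaussian noise at t = T, so far
    fewer reverse steps are needed."""
    x = prior_sampler(shape)             # sample from the learned noisy prior
    for t in reversed(range(t_trunc)):
        x = denoiser.p_step(x, t)        # one reverse transition x_t -> x_{t-1}
    return x
```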
Empirically, our method achieves better performance than all baselines on multiple datasets.","empirical deep learning, pruning at initialization, neural network pruning" One Ring to Bring Them All: Model Adaptation under Domain and Category Shift,https://openreview.net/forum?id=dOq0Jbg9hUt,https://openreview.net/pdf?id=dOq0Jbg9hUt,We propose a simple method which could address source-free universal domain adaptation and also several other different tasks.,"In this paper, we investigate model adaptation under domain and category shift, where the final goal is to achieve $\textit{Source-free Universal Domain Adaptation}$ (SF-UNDA), which addresses the situation where there exist both domain and category shifts between source and target domains. Under the SF-UNDA setting, the model cannot access source data anymore during target adaptation, which aims to address data privacy concerns. We propose a novel training scheme to learn an ($n$+1)-way classifier to predict the $n$ source classes and the unknown class, where samples of only known source categories are available for training. Furthermore, for target adaptation, we simply adopt a weighted entropy minimization to adapt the source pretrained model to the unlabeled target domain without source data. In experiments, we show: $\textbf{1)}$ After source training, the resulting source model can get excellent performance for $\textit{open-set single domain generalization}$; $\textbf{2)}$ After target adaptation, our method surpasses current UNDA approaches which demand source data during adaptation. The versatility across several different tasks strongly demonstrates the efficacy and generalization ability of our method. $\textbf{3)}$ When augmented with a closed-set domain adaptation approach during target adaptation, our source-free method further outperforms the current state-of-the-art UNDA method by 2.5%, 7.2% and 13% on Office-31, Office-Home and VisDA respectively.",Source-free Universal Domain Adaptation Towards Equivariant Graph Contrastive Learning via Cross-Graph Augmentation,https://openreview.net/forum?id=9L1Ts8t66YK,https://openreview.net/pdf?id=9L1Ts8t66YK,We propose a cross-graph augmentation to achieve equivariant self-supervised learning on graphs. ,"Leading graph contrastive learning (GCL) frameworks conform to the invariance mechanism by encouraging insensitivity to different augmented views of the same graph. Despite the promising performance, invariance worsens representations when augmentations cause aggressive semantic shifts. For example, dropping the super-node can dramatically change a social network's topology. In this case, encouraging invariance to the original graph can bring together dissimilar patterns and hurt the task of instance discrimination. To resolve the problem, we draw inspiration from equivariant self-supervised learning and propose Equivariant Graph Contrastive Learning (E-GCL) to encourage the sensitivity to global semantic shifts. Viewing each graph as a transformation to others, we ground the equivariance principle as a cross-graph augmentation -- graph interpolation -- to simulate global semantic shifts. Without using annotation, we supervise the representation of cross-graph augmented views by linearly combining the representations of their original samples. This simple but effective equivariance principle empowers E-GCL with the ability of cross-graph discrimination. It shows significant improvements over the state-of-the-art GCL models in unsupervised learning and transfer learning.
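The weighted entropy minimization used for target adaptation in the abstract above reduces to a few lines; the per-class weighting below is a guess, so treat this as a plain sketch rather than the paper's loss.

```python
import torch
import torch.nn.functional as F

def weighted_entropy_loss(logits, w=None):
    """Entropy minimization over the (n+1)-way output (n source classes plus
    'unknown'); an optional weight vector w of shape (n+1,) lets known and
    unknown classes contribute unequally."""
    p = F.softmax(logits, dim=-1)                 # (B, n+1)
    ent = -(p * torch.log(p.clamp_min(1e-8)))     # elementwise entropy terms
    if w is not None:
        ent = ent * w
    return ent.sum(-1).mean()
```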
Further experiments demonstrate E-GCL's generalization to various graph pre-training frameworks. Code is available at \url{https://anonymous.4open.science/r/E-GCL/}","equivariant, self-supervised learning, contrastive learning, graph neural networks" Configuring Mixed-Integer Linear Programming Solvers with Deep Metric Learning,https://openreview.net/forum?id=itmei3dxTQ5,https://openreview.net/pdf?id=itmei3dxTQ5,We learn similarities among MILP problem instances using deep metric learning to predict an instance-specific solver configuration,"Mixed Integer Linear Programming (MILP) solvers expose a large number of configuration parameters for their internal algorithms. Solutions, and their associated costs or runtimes, are significantly affected by the choice of the configuration parameters, even when problem instances are coming from the same distribution. On the one hand, using the default solver configuration leads to suboptimal solutions. On the other hand, searching and evaluating an exponential number of configurations for every problem instance is time-consuming and in some cases infeasible. In this work, we propose MILPTune -- a machine learning-based approach to predict an instance-aware parameters configuration for MILP solvers. It enables avoiding the expensive search of configuration parameters for each new problem instance, while tuning the solver's behavior for the given instance. Our method trains a metric learning model based on a graph neural network to project problem instances to a space where instances with similar costs are closer to each other. At inference time, and given a new problem instance, we first embed the instance into the learned metric space, and then predict a parameters configuration using nearest neighbor data. Empirical results on real-world problem benchmarks show that our method predicts configuration parameters that improve solution costs by 10-67% compared to the baselines and previous approaches.","Mixed Integer Linear Programming, Metric Learning, Algorithm Configuration" Effective Self-supervised Pre-training on Low-compute networks without Distillation,https://openreview.net/forum?id=cbpRzMy-UZH,https://openreview.net/pdf?id=cbpRzMy-UZH,,"Despite the impressive progress of self-supervised learning (SSL), its applicability to low-compute networks has received limited attention. Reported performance has trailed behind standard supervised pre-training by a large margin, barring self-supervised learning from making an impact on models that are deployed on device. Most prior works attribute this poor performance to the capacity bottleneck of the low-compute networks and opt to bypass the problem through the use of knowledge distillation (KD). In this work, we revisit SSL for efficient neural networks, taking a closer look at the detrimental factors causing the practical limitations, and at whether they are intrinsic to the self-supervised low-compute setting. We find that, contrary to accepted knowledge, there is no intrinsic architectural bottleneck; we diagnose that the performance bottleneck is related to the trade-off between model complexity and regularization strength, and we propose an effective training strategy that achieves the new state-of-the-art for SSL on low-compute networks despite not using KD at all. In particular, we start by empirically observing that the use of local views can have a dramatic impact on the effectiveness of the SSL method. This hints at view sampling being the performance bottleneck for SSL on low-capacity networks.
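A minimal sketch of the interpolation-based equivariance supervision in the E-GCL abstract above: the embedding of a mixed graph is regressed onto the same linear mix of the originals' embeddings. `mix_graphs` stands in for the paper's cross-graph augmentation and is assumed rather than implemented, so this only illustrates the loss shape.

```python
import torch.nn.functional as F

def egcl_mix_loss(encoder, mix_graphs, g_a, g_b, lam=0.5):
    """Equivariance-style supervision via interpolation: encode a mixed
    graph and match it to the linear combination of the two originals'
    (detached) representations."""
    z_a, z_b = encoder(g_a), encoder(g_b)
    z_mix = encoder(mix_graphs(g_a, g_b, lam))
    target = lam * z_a + (1 - lam) * z_b
    return F.mse_loss(z_mix, target.detach())
```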
We hypothesize that the view sampling strategy for large neural networks, which requires matching views in very diverse spatial scales and contexts, is too demanding for low-capacity architectures. We systematize the design of the view sampling mechanism, leading to a new training methodology that consistently improves performance by a wide margin across SSL methods (e.g. MoCo-v2, SwAV or DINO), across low-size networks (convolution-based networks, e.g. MobileNetV2, ResNet18, ResNet34 and vision transformer, e.g. ViT-Ti), and across tasks (linear probe, object detection, instance segmentation, and semi-supervised learning). Our best models establish a new state-of-the-art for SSL methods on low-compute networks across all standard benchmarks despite not using a KD loss term. ","Self-supervised learning, Low-compute network" Graph Neural Bandits,https://openreview.net/forum?id=vRq1XIHV8Go,https://openreview.net/pdf?id=vRq1XIHV8Go,,"Contextual bandits aim to choose the optimal arm with the highest reward out of a set of candidates based on their contextual information, and various bandit algorithms have been applied to personalized recommendation due to their ability to solve the exploitation-exploration dilemma. Motivated by online recommendation scenarios, in this paper, we propose a framework named Graph Neural Bandits (GNB) to leverage the collaborative nature among users empowered by graph neural networks (GNNs). Instead of estimating rigid user clusters, we model the ""fine-grained"" collaborative effects through estimated user graphs in terms of exploitation and exploration individually. Then, to refine the recommendation strategy, we utilize separate GNN-based models on estimated user graphs for exploitation and adaptive exploration. Theoretical analysis and experimental results on multiple real data sets in comparison with state-of-the-art baselines are provided to demonstrate the effectiveness of our proposed framework.","Contextual Bandits, Graph Neural Networks" CoRTX: Contrastive Framework for Real-time Explanation,https://openreview.net/forum?id=L2MUOUp0beo,https://openreview.net/pdf?id=L2MUOUp0beo,Learning real-time model explainer with limited explanation labels.,"Recent advancements in explainable machine learning provide effective and faithful solutions for interpreting model behaviors. However, many explanation methods encounter efficiency issues, which largely limit their deployment in practical scenarios. Real-time explainer (RTX) frameworks have thus been proposed to accelerate the model explanation process by learning a one-feed-forward explainer. Existing RTX frameworks typically build the explainer under the supervised learning paradigm, which requires large amounts of explanation labels as the ground truth. Considering that accurate explanation labels are usually hard to obtain, due to constrained computational resources and limited human efforts, effective explainer training is still challenging in practice. In this work, we propose a COntrastive Real-Time eXplanation (CoRTX) framework to learn the explanation-oriented representation and relieve the intensive dependence of explainer training on explanation labels. Specifically, we design a synthetic strategy to select positive and negative instances for explanation representation learning. Theoretical analysis shows that our selection strategy can benefit the contrastive learning process on explanation tasks.
Experimental results on three real-world datasets further demonstrate the efficiency and efficacy of our proposed CoRTX framework.","Interpretability, explainability, real-time explanation, feature attribution, feature importance ranking" "MPCFORMER: FAST, PERFORMANT AND PRIVATE TRANSFORMER INFERENCE WITH MPC",https://openreview.net/forum?id=CWmvjOEhgH-,https://openreview.net/pdf?id=CWmvjOEhgH-,"We develop a framework that allows fast, performant, and private inference with MPC for Transformer models.","Enabling private inference is crucial for many cloud inference services that are based on Transformer models. However, existing private inference solutions for Transformers can increase the inference latency by more than 60$\times$ or significantly compromise the quality of inference results. In this paper, we design the framework MPCFORMER using secure multi-party computation (MPC) and Knowledge Distillation (KD). It can be used in tandem with many specifically designed MPC-friendly approximations and trained Transformer models. We use MPCFORMER to significantly speed up Transformer model inference in MPC settings while achieving similar ML performance to the original model. We evaluate MPCFORMER with various settings in MPC. On the IMDb dataset, we achieve similar performance to $\text{BERT}_\text{BASE}$, while being 5.3$\times$ faster. On the GLUE benchmark, we achieve 97% performance of $\text{BERT}_\text{BASE}$ with a 2.2$\times$ speedup. We show that MPCFORMER remains effective with different trained Transformer weights such as $\text{ROBERTA}_\text{BASE}$ and larger models including $\text{BERT}_\text{LARGE}$. In particular, we achieve similar performance to $\text{BERT}_\text{LARGE}$, while being 5.93$\times$ faster on the IMDb dataset.","Secure Multiparty Computation, Privacy, Machine Learning, Transformer model" Discovering Distinctive ``Semantics'' in Super-Resolution Networks,https://openreview.net/forum?id=RrO3xNCqz7J,https://openreview.net/pdf?id=RrO3xNCqz7J,,"Image super-resolution (SR) is a representative low-level vision problem. Although deep SR networks have achieved extraordinary success, we are still unaware of their working mechanisms. Specifically, do SR networks learn semantic information, or do they just perform a complex mapping function? What hinders SR networks from generalizing to real-world data? These questions not only raise our curiosity, but also influence SR network development. In this paper, we make a first attempt to answer the above fundamental questions. After comprehensively analyzing the feature representations (via dimensionality reduction and visualization), we successfully discover the distinctive ``semantics'' in SR networks, i.e., deep degradation representations (DDR), which relate to image degradation instead of image content. We show that a well-trained deep SR network is naturally a good descriptor of degradation information. Our experiments also reveal two key factors (adversarial learning and global residual) that influence the extraction of such semantics.
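One example of the "MPC-friendly approximations" the MPCFORMER abstract above refers to is replacing the exponential in softmax with a square, so that only additions and multiplications remain, which are cheap under secret sharing; the constant below is a hypothetical choice, not a value from the paper.

```python
import torch

def quad_softmax(x, c=5.0, dim=-1):
    """MPC-friendly softmax surrogate: (x + c)^2 normalized across `dim`,
    avoiding the exponential that is expensive to evaluate under MPC."""
    q = (x + c) ** 2
    return q / q.sum(dim=dim, keepdim=True)
```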
We further apply DDR in several interesting applications (such as distortion identification, blind SR and generalization evaluation) and achieve promising results, demonstrating the correctness and effectiveness of our findings.", Networks are Slacking Off: Understanding Generalization Problem in Image Deraining,https://openreview.net/forum?id=qGuU8To1y7x,https://openreview.net/pdf?id=qGuU8To1y7x,,"Deep low-level networks are successful in laboratory benchmarks, but still suffer from severe generalization problems in real-world applications, especially for the deraining task. Conventional wisdom in deep learning drives us to use training data of higher complexity, expecting the network to learn richer knowledge and thereby overcome generalization problems. Through extensive systematic experiments, we show that this approach fails to improve their generalization ability but instead makes the networks overfit to degradations even more. Our experiments establish that a deraining network with better generalization can instead be trained by reducing the complexity of the training data. This is because the networks slack off during training, i.e., they learn the less complex elements of the image content and degradation to reduce the training loss. When the background image is less complex than the rain streaks, the network will focus on the reconstruction of the background without overfitting the rain patterns, thus achieving a good generalization effect. Our research demonstrates excellent application potential and provides an indispensable perspective and research methodology for understanding the generalization problem of low-level vision. ", Disparate Impact in Differential Privacy from Gradient Misalignment,https://openreview.net/forum?id=qLOaeRvteqbx,https://openreview.net/pdf?id=qLOaeRvteqbx,"DPSGD can have unfair outcomes on protected groups because of direction errors caused by per-sample gradient clipping, but unfairness can be dramatically reduced with a global clipping technique.","As machine learning becomes more widespread throughout society, aspects including data privacy and fairness must be carefully considered, and are crucial for deployment in highly regulated industries. Unfortunately, the application of privacy enhancing technologies can worsen unfair tendencies in models. In particular, one of the most widely used techniques for private model training, differentially private stochastic gradient descent (DPSGD), frequently intensifies disparate impact on groups within data. In this work we study the fine-grained causes of unfairness in DPSGD and identify gradient misalignment due to inequitable gradient clipping as the most significant source. This observation leads us to a new method for reducing unfairness by preventing gradient misalignment in DPSGD.","Differential privacy, fairness, privacy" IDP: Iterative Differentiable Pruning based on Attention for Deep Neural Networks,https://openreview.net/forum?id=puguRjbs6Rg,https://openreview.net/pdf?id=puguRjbs6Rg,"We propose a differentiable pruning method, IDP, which yields the state-of-the-art pruning quality on popular computer vision and natural language models, based on attention-based soft-mask.","Deep neural network (DNN) pruning is an effective method to reduce the size of a model, improve the inference latency, and minimize the power consumption on DNN accelerators, at the risk of decreasing model accuracy.
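A minimal sketch of the per-sample clipping step in DPSGD that the disparate impact abstract above analyzes: samples with large gradients (often from minority groups) are shrunk the most, which can misalign the averaged direction with the true batch gradient; the paper's fix replaces this with a global scaling shared across samples.

```python
import torch

def dpsgd_clip_and_noise(per_sample_grads, C, sigma):
    """Standard DPSGD aggregation on per-sample gradients of shape (B, d):
    clip each row to norm C, average, and add Gaussian noise calibrated
    to the clipping norm."""
    B = per_sample_grads.size(0)
    norms = per_sample_grads.norm(dim=1, keepdim=True)        # (B, 1)
    clipped = per_sample_grads * (C / norms).clamp(max=1.0)   # per-sample clip
    noise = torch.randn(per_sample_grads.size(1)) * sigma * C / B
    return clipped.mean(0) + noise
```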
In this paper, we propose a novel differentiable pruning scheme, Iterative Differentiable Pruning (IDP), which offers state-of-the-art qualities in model size, accuracy, and training cost. IDP creates attention-based pruning masks for a given sparsity target to achieve the state-of-the-art trade-offs between model accuracy and inference compute with negligible training overhead. We evaluated IDP on various computer vision and natural language processing tasks, and found that IDP delivers state-of-the-art results. For MobileNet-v1 (which is a challenging DNN for pruning), IDP can achieve 68.2% top-1 ImageNet1k accuracy with 86.6% sparsity, which is 2.3% higher accuracy than the latest state-of-the-art pruning algorithms. For ResNet18, IDP offers 69.5% top-1 ImageNet1k accuracy with 85.5% sparsity at the same training budget and 0.8% better top-1 accuracy than the state-of-the-art method. Also, IDP demonstrates over 83.1% accuracy on Multi-Genre Natural Language Inference with 90% sparsity for BERT, while the next best from the existing techniques shows 81.5% accuracy.","pruning, deep learning, attention" Language-Guided Artistic Style Transfer Using the Latent Space of DALL-E,https://openreview.net/forum?id=yDx3GP7Qjfl,https://openreview.net/pdf?id=yDx3GP7Qjfl,We propose a language-guided style transfer method that manipulates the discrete DALL-E latent space using a non-autoregressive sequence translation approach.,"Despite the progress made in the style transfer task, most of the previous work focuses on transferring only relatively simple features like color or texture, while missing other more abstract and creative concepts such as the specific artistic trait of the painter or the overall feeling of the scene. However, these more abstract concepts can be captured by the semantics of the latent space of models like DALL-E or CLIP, which have been trained using huge datasets of images and textual documents. In this paper, we propose a style transfer method that exploits both of these models and uses natural language to describe abstract artistic styles. Specifically, we formulate the language-guided style transfer task as a non-autoregressive token sequence translation in the DALL-E discrete latent space. Moreover, we propose a textual-prompt-based Reinforcement Learning strategy to incorporate style-specific information in the translation network using the CLIP space as the only guidance. Our empirical results show that we can transfer artistic styles using language instructions at different granularities on content images that are not restricted to a specific domain. Our code will be publicly available.","Language-Guided Style Transfer, Non-Autoregressive Transformer, Deep Reinforcement Learning" FADE: Enabling Large-Scale Federated Adversarial Training on Resource-Constrained Edge Devices,https://openreview.net/forum?id=NzrpxT5hTY_,https://openreview.net/pdf?id=NzrpxT5hTY_,We propose a novel framework to enable large-scale federated adversarial training on resource-constrained edge devices.,"Federated adversarial training can effectively add adversarial robustness to privacy-preserving federated learning systems. However, the high demand for memory capacity and computing power makes large-scale federated adversarial training infeasible on resource-constrained edge devices. Few previous studies in federated adversarial training have tried to tackle both memory and computational constraints at the same time.
In this paper, we propose a new framework named Federated Adversarial Decoupled Learning (FADE) to enable AT on resource-constrained edge devices. FADE decouples the entire model into small modules to fit the resource budget of each edge device, and each device only needs to perform AT on a single module in each communication round. We also propose an auxiliary weight decay to alleviate objective inconsistency and achieve a better accuracy-robustness balance in FADE. FADE offers a theoretical guarantee for convergence and adversarial robustness, and our experimental results show that FADE can significantly reduce the consumption of memory and computing power while maintaining accuracy and robustness.","Federated Learning, Adversarial Training" Temporal Relevance Analysis for Video Action Models,https://openreview.net/forum?id=OTiSSCBm1QD,https://openreview.net/pdf?id=OTiSSCBm1QD,The paper provides a deep analysis of the temporal modeling for action recognition.,"In this paper, we provide a deep analysis of temporal modeling for action recognition, an important but underexplored problem in the literature. We first propose a new approach to quantify the temporal relationships between frames captured by CNN-based action models based on layer-wise relevance propagation. We then conduct comprehensive experiments and in-depth analysis to provide a better understanding of how temporal modeling is affected by various factors such as dataset, network architecture, and input frames. With this, we further study some important questions for action recognition that lead to interesting findings. Our analysis shows that there is no strong correlation between temporal relevance and model performance, and that action models tend to capture local temporal information rather than long-range dependencies.","Temporal analysis, Frame relevance, Video data, Action recognition" HNeRV: A Hybrid Neural Representation for Videos,https://openreview.net/forum?id=dOM_GHvkO2h,https://openreview.net/pdf?id=dOM_GHvkO2h,,"Implicit neural representations store videos as neural networks and have performed well for vision tasks such as video compression and denoising. With frame index and/or positional index as input, implicit representations (NeRV, E-NeRV, etc.) reconstruct video frames from fixed and content-agnostic embeddings. Such embeddings largely limit the regression capacity and internal generalization for video interpolation. In this paper, we propose a Hybrid Neural Representation for Videos (HNeRV), where learnable and content-adaptive embeddings act as decoder input. Besides the input embedding, we introduce an HNeRV block to make model parameters evenly distributed across the entire network, so that higher layers (layers near the output) have more capacity to store high-resolution content and video details. With content-adaptive embedding and re-designed model architecture, HNeRV outperforms implicit methods (NeRV, E-NeRV) on the video regression task for both reconstruction quality and convergence speed, and shows better internal generalization. As a simple and efficient video representation, HNeRV also shows decoding advantages for speed, flexibility, and deployment, compared to traditional codecs (H.264, H.265) and learning-based compression methods. 
Finally, we explore the effectiveness of HNeRV on downstream tasks such as video compression and video inpainting.","video neural representation, implicit neural representation" "DeepPipe: Deep, Modular and Extendable Representations of Machine Learning Pipelines",https://openreview.net/forum?id=TT66Tpbus3b,https://openreview.net/pdf?id=TT66Tpbus3b,How to learn Machine Learning pipeline representations to improve their optimization,"Finding accurate Machine Learning pipelines is essential in achieving state-of-the-art AI predictive performance. Unfortunately, most existing Pipeline Optimization techniques rely on flavors of Bayesian Optimization that do not explore the deep interaction between pipeline stages/components (e.g. between hyperparameters of the deployed preprocessing algorithm and the hyperparameters of a classifier). In this paper, we are the first to capture the deep interaction between components of a Machine Learning pipeline. We propose embedding pipelines in a deep latent representation through a novel per-component encoder mechanism. Such pipeline embeddings are used with deep kernel Gaussian Process surrogates inside a Bayesian Optimization setup. Through extensive experiments on large-scale meta-datasets, we demonstrate that learning pipeline embeddings with Deep Neural Networks significantly advances the state of the art in Pipeline Optimization.","Pipeline optimization, meta-learning, bayesian optimization, representation learning" "OTOv2: Automatic, Generic, User-Friendly",https://openreview.net/forum?id=7ynoX1ojPMt,https://openreview.net/pdf?id=7ynoX1ojPMt,,"Only-Train-Once (OTOv1) was recently proposed to drastically simplify and automate the complicated multi-stage procedure for model compression via structured pruning. However, its automation relies on manually partitioning zero-invariant groups (ZIGs) and constructing the slimmer model for specific DNNs beforehand, which necessitates numerous engineering efforts and domain knowledge and prevents its wider application to general scenarios. We propose the second generation of Only-Train-Once (OTOv2), which trains and compresses an arbitrary DNN only once from scratch to produce a more compact model with competitive performance without fine-tuning. OTOv2 is automated and pluggable into various deep learning applications, and requires minimal engineering effort from users. Methodologically, OTOv2 proposes two revolutionary improvements: (i) Autonomy: automatically partitions ZIGs and constructs the compressed model for arbitrary DNNs; and (ii) Dual Half-Space Projected Gradient (DHSPG): a novel optimizer to more reliably solve structured-sparsity problems. Numerically, we demonstrate the generality and autonomy of OTOv2 on a variety of model architectures such as VGG, ResNet, CARN, DenseNet and StackedUnets, the majority of which cannot be handled by OTOv1 without extensive handcrafting. 
Together with benchmark datasets including CIFAR10/100, Fashion-MNIST, SVHN and ImageNet, the effectiveness of OTOv2 is validated by achieving competitive or even better results than the state of the art.","Model Compression, One Shot, Automatic, Generic, User-Friendly" TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second,https://openreview.net/forum?id=cp5PvcI6w8_,https://openreview.net/pdf?id=cp5PvcI6w8_,"We present TabPFN, a trained Transformer that learned to solve small tabular data classification problems at SOTA level in less than a second by training on synthetic data generated by integrating principles from causal reasoning and simplicity. ","We present TabPFN, a trained Transformer that can do supervised classification for small tabular datasets in less than a second, needs no hyperparameter tuning and is competitive with state-of-the-art classification methods. TabPFN is fully entailed in the weights of our network, which accepts training and test samples as a set-valued input and yields predictions for the entire test set in a single forward pass. TabPFN is a Prior-Data Fitted Network (PFN) and is trained offline once, to approximate Bayesian inference on synthetic datasets drawn from our prior. This prior incorporates ideas from causal reasoning: It entails a large space of structural causal models with a preference for simple structures. On $30$ datasets from the OpenML-CC18 suite, we show that our method clearly outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems with up to $70\times$ speedup. This increases to a $3\,200\times$ speedup when a GPU is available. We provide all our code, the trained TabPFN, an interactive browser demo and a Colab notebook at https://github.com/tabpfn-anonym/TabPFNAnonym.","Tabular Data, AutoML, Green AI, Bayesian prediction, Causal Reasoning, Real-time Machine Learning" On the Importance of Architectures and Hyperparameters for Fairness in Face Recognition,https://openreview.net/forum?id=iiRDsy85uXi,https://openreview.net/pdf?id=iiRDsy85uXi,We analyze the impact of architectures and hyperparameters on fairness in face recognition and use NAS to design simultaneously fairer and more accurate models.,"Face recognition systems are deployed across the world by government agencies and contractors for sensitive and impactful tasks, such as surveillance and database matching. Despite their widespread use, these systems are known to exhibit bias across a range of sociodemographic dimensions, such as gender and race. Nonetheless, an array of works proposing pre-processing, training, and post-processing methods have failed to close these gaps. Here, we take a very different approach to this problem, identifying that both architectures and hyperparameters of neural networks are instrumental in reducing bias. We first run a large-scale analysis of the impact of architectures and training hyperparameters on several common fairness metrics and show that the implicit convention of choosing high-accuracy architectures may be suboptimal for fairness. Motivated by our findings, we run the first neural architecture search for fairness, jointly with a search for hyperparameters. We output a suite of models which Pareto-dominate all other competitive architectures in terms of accuracy and fairness. Furthermore, we show that these models transfer well to other face recognition datasets with similar and distinct protected attributes. 
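Since TabPFN classifies an entire test set in one forward pass, its interface can be as simple as a scikit-learn estimator. A hypothetical usage sketch follows; the class name, import path, and constructor defaults are assumptions suggested by the repository name, not verified here:

```python
# Hypothetical usage sketch; the actual package/API may differ.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier  # assumed import path

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()       # no hyperparameter tuning, per the abstract
clf.fit(X_tr, y_tr)            # "fitting" just stores the data as context
print(clf.score(X_te, y_te))   # predictions come from one forward pass
```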
We release our code and raw result files so that researchers and practitioners can replace our fairness metrics with a bias measure of their choice.","Neural Architecture Search, Face Recognition, Fairness, Hyperparameter Optimization" Evaluating natural language processing models with generalization metrics that do not need access to any training or testing data,https://openreview.net/forum?id=naAzVF_v0yA,https://openreview.net/pdf?id=naAzVF_v0yA,,"The search for effective and robust metrics has been the focus of recent theoretical and empirical work on generalization of deep neural networks (NNs). In this paper, we discuss the performance of natural language processing (NLP) models, and we evaluate various existing and novel generalization metrics. Compared to prior studies, we (i) focus on NLP instead of computer vision (CV), (ii) focus on generalization metrics that predict test error instead of the generalization gap, (iii) focus on generalization metrics that do not need access to data, and (iv) focus on the heavy-tail (HT) phenomenon that has received comparatively less attention in the study of deep neural networks. We extend recent HT-based work which focuses on power law (PL) distributions, and we study exponential (EXP) and exponentially truncated power law (E-TPL) fitting to the empirical spectral densities (ESDs) of weight matrices. Our empirical studies are carried out on (i) hundreds of Transformers trained in different settings, in which we systematically vary the amount of data, the model size and the optimization hyperparameters, (ii) a total of 51 pretrained Transformers from eight families of Huggingface NLP models, including BERT, GPT2, ALBERT, etc., and (iii) a total of 28 existing and novel generalization metrics. From our detailed empirical analyses, we show that shape metrics, or the metrics obtained from fitting the shape of the ESDs, perform uniformly better at predicting generalization performance than scale metrics commonly studied in the literature, as measured by the average rank correlations with the generalization performance for all of our experiments. We also show that among the three HT distributions considered in our paper, the E-TPL fitting of ESDs performs the most robustly when the models are trained in experimental settings, while the PL fitting achieves the best performance on well-trained Huggingface models, and that both E-TPL and PL metrics (which are both shape metrics) outperform scale metrics.", Human Motion Diffusion Model,https://openreview.net/forum?id=SJ1kSyO2jwu,https://openreview.net/pdf?id=SJ1kSyO2jwu,,"Natural and expressive human motion generation is the holy grail of computer animation. It is a challenging task, due to the diversity of possible motion, human perceptual sensitivity to it, and the difficulty of accurately describing it. Therefore, current generative solutions are either low-quality or limited in expressiveness. Diffusion models are promising candidates for the human motion domain since they have already shown remarkable generative capabilities in other domains, and due to their many-to-many nature. In this paper, we introduce Motion Diffusion Model (MDM), a carefully adapted classifier-free diffusion-based generative model for human motion data. MDM is transformer-based, combining insights from the motion generation literature. 
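The generalization-metrics entry above fits heavy-tailed distributions to the empirical spectral densities (ESDs) of weight matrices without touching any data. A minimal sketch of one such "shape" metric, assuming a standard Hill-type maximum-likelihood estimator for a power-law tail exponent; the paper's E-TPL and EXP fits are more involved:

```python
import numpy as np

def esd(weight):
    """Eigenvalues of W^T W (squared singular values) -- the ESD support."""
    s = np.linalg.svd(weight, compute_uv=False)
    return s ** 2

def power_law_alpha(eigs, xmin=None):
    """Hill/MLE estimate of the tail exponent alpha for p(x) ~ x^{-alpha},
    a 'shape' metric that needs no training or test data."""
    eigs = np.sort(eigs)
    if xmin is None:
        xmin = np.median(eigs)          # crude xmin choice for the sketch
    tail = eigs[eigs >= xmin]
    return 1.0 + len(tail) / np.sum(np.log(tail / xmin))

W = np.random.randn(512, 256) / np.sqrt(256)  # stand-in for a trained layer
print(power_law_alpha(esd(W)))
```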
A notable design choice is that it predicts the sample itself rather than the noise in each step to facilitate the use of established geometric losses on the locations and velocities of the motion, such as the foot contact loss. As we demonstrate, MDM is a generic approach, enabling different modes of conditioning, and different generation tasks. We show that our model is trained with lightweight resources and yet achieves state-of-the-art results on leading benchmarks for text-to-motion, action-to-motion, and unconditioned motion generation. ", Federated Learning in Non-IID Settings Aided by Differentially Private Synthetic Data,https://openreview.net/forum?id=HHd2OVBoF_-,https://openreview.net/pdf?id=HHd2OVBoF_-,A novel federated learning framework utilizing data augmentation to improve global accuracy among data-heterogeneous clients,"Federated learning (FL) is a privacy-promoting framework that enables a potentially large number of clients to collaboratively train machine learning models. In an FL system, a server coordinates the collaboration by collecting and aggregating clients' model updates while the clients' data remains local and private. A major challenge in federated learning arises when the local data is non-iid -- the setting in which performance of the learned global model may deteriorate significantly compared to the scenario where the data is identically distributed across the clients. In this paper we propose FedDPMS (Federated Differentially Private Means Sharing), an FL algorithm in which clients augment local datasets with data synthesized using differentially private information collected and communicated by a trusted server. In particular, the server matches pairs of clients having complementary local datasets and facilitates differentially-private sharing of the means of latent data representations; the clients then deploy variational auto-encoders to enrich their datasets and thus ameliorate the effects of non-iid data distribution. Our experiments on deep image classification tasks demonstrate that FedDPMS outperforms competing state-of-the-art FL methods specifically developed to address the challenge of federated learning on non-iid data.","Federated Learning, Representation Learning, Differential Privacy" Structure-based Drug Design with Equivariant Diffusion Models,https://openreview.net/forum?id=uKmuzIuVl8z,https://openreview.net/pdf?id=uKmuzIuVl8z,,"Structure-based drug design (SBDD) aims to design small-molecule ligands that bind with high affinity and specificity to pre-determined protein targets. Traditional SBDD pipelines start with large-scale docking of compound libraries from public databases, thus limiting the exploration of chemical space to existing, previously studied regions. Recent machine learning methods approached this problem using an atom-by-atom generation approach, which is computationally expensive. In this paper, we formulate SBDD as a 3D-conditional generation problem and present DiffSBDD, an E(3)-equivariant 3D-conditional diffusion model that generates novel ligands conditioned on protein pockets. Furthermore, we curate a new dataset of experimentally determined binding complex data from Binding MOAD to provide realistic binding scenarios, rather than relying on the synthetic CrossDocked dataset. 
Comprehensive in silico experiments demonstrate the efficiency of DiffSBDD in generating novel and diverse drug-like ligands that engage protein pockets with high binding energies as predicted by in silico docking.","Diffusion Models, Equivariant Neural Networks, Structure-based Drug Design, Molecule Generation, Conditional Generation" Deep reinforced active learning for multi-class image classification,https://openreview.net/forum?id=AvwF6IvT8et,https://openreview.net/pdf?id=AvwF6IvT8et,,"High-accuracy medical image classification can be limited by the costs of acquiring more data as well as the time and expertise needed to label existing images. In this paper, we apply active learning to medical image classification, a method which aims to maximise model performance on a minimal subset from a larger pool of data. We present a new active learning framework, based on deep reinforcement learning, to learn an active learning query strategy to label images based on predictions from a convolutional neural network. Our framework modifies the deep Q-network formulation so that data can additionally be selected using geometric arguments in the latent space of the classifier, allowing for high-accuracy multi-class classification in a batch-based active learning setting and enabling the agent to label datapoints that are both diverse and about which it is most uncertain. We apply our framework to two medical imaging datasets and compare with standard query strategies as well as the most recent reinforcement learning based active learning approach for image classification.","Reinforcement learning, active learning, image classification" HeatDETR: Hardware-Efficient DETR with Device-Adaptive Thinning,https://openreview.net/forum?id=5p4wvBz9xIe,https://openreview.net/pdf?id=5p4wvBz9xIe,,"Vision transformers (ViTs) have continuously achieved new milestones in computer vision. A natural usage of ViTs in detection is to replace the CNN-based backbone with a transformer-based backbone directly, but at the price of a considerable computation burden for deployment on resource-limited edge devices. A more promising usage is the DETR family, which eliminates the need for many hand-designed components in object detection but still cannot reach real-time edge applications. In this paper, we propose a novel hardware-efficient adaptive-thinning DETR (HeatDETR), achieving high-speed, and for the first time even real-time, inference on multiple edge devices. Specifically, it mainly includes three contributions: 1) For decent detection performance, we introduce a backbone design principle based on the visual modeling process that proceeds from locality to globality. Meanwhile, we propose a semantic-augmented module (SAM) in the backbone with the global modeling capabilities of self-attention to enhance low-level semantics. We also introduce an attention-based task-couple module (TCM) to reduce contradictions between classification and regression tasks. 2) For on-device efficiency, we propose a scale-combined module (SCM), through which we transform the multi-level detection process into a single-level process, eliminating multi-branch inference for higher hardware speed while maintaining detection performance. We then revisit the network architectures and operators used in ViT-based models and reparameterized CNNs, identify hardware-efficient designs, and introduce the basic HeatDETR structure. 
3) Based on our device-adaptive model-thinning strategy, deployable end-to-end HeatDETR models can be generated efficiently for target devices. Experiments on the MS COCO dataset show HeatDETR outperforms current DETR-based methods by 0.3%~6.2% AP with 5%~68% speedup on a single Tesla V100. Even real-time inference can be achieved on extremely memory-constrained devices, e.g., a dual-core Intel Core i7 CPU.", ChemSpacE: Interpretable and Interactive Chemical Space Exploration,https://openreview.net/forum?id=IiDeZZZ18zi,https://openreview.net/pdf?id=IiDeZZZ18zi,,"Discovering meaningful molecules in the vast combinatorial chemical space has been a long-standing challenge in many fields from materials science to drug discovery. Recent advances in machine learning, especially generative models, have made remarkable progress and demonstrate considerable promise for automated molecule design. Nevertheless, most molecule generative models remain black-box systems, whose utility is limited by a lack of interpretability and human participation in the generation process. In this work we propose \textbf{Chem}ical \textbf{Spac}e \textbf{E}xplorer (ChemSpacE), a simple yet effective method for exploring the chemical space with pre-trained deep generative models. It enables users to interact with existing generative models and inform the molecule generation process. We demonstrate the efficacy of ChemSpacE on the molecule optimization task and the molecule manipulation task in single-property and multi-property settings. On the molecule optimization task, the performance of ChemSpacE is on par with previous black-box optimization methods yet is considerably faster and more sample efficient. Furthermore, the interface from ChemSpacE facilitates human-in-the-loop chemical space exploration and interactive molecule design.","Molecule Generation, Molecule Manipulation, Human-in-the-loop Molecule Design, Chemical Space Exploration" A UNIFIED VIEW OF FINDING AND TRANSFORMING WINNING LOTTERY TICKETS,https://openreview.net/forum?id=CJl2S0w1mbq,https://openreview.net/pdf?id=CJl2S0w1mbq,This paper presents a novel paradigm that combines the increased regularization term and early stopping to find or transform winning tickets.,"While over-parameterized deep neural networks obtain prominent results on various machine learning tasks, their superfluous parameters usually make model training and inference notoriously inefficient. The Lottery Ticket Hypothesis (LTH) addresses this issue from a novel perspective: it articulates that there always exist sparse and admirable subnetworks in a randomly initialized dense network, which can be realized by an iterative pruning strategy. The Dual Lottery Ticket Hypothesis (DLTH) further investigates sparse network training from a complementary view. Concretely, it introduces a gradually increased regularization term to transform a dense network into an ultra-light subnetwork without sacrificing learning capacity. After revisiting the success of LTH and DLTH, we unify these two research lines by coupling the stability of iterative pruning and the excellent performance of increased regularization, resulting in two new algorithms (UniLTH and UniDLTH) for finding and transforming winning tickets, respectively. 
Unlike LTH, which uses no regularization, and DLTH, which applies regularization throughout training, our methods first train the network without any regularization until the model reaches a certain point (i.e., the validation loss does not decrease for several epochs), and then employ increased regularization for information extrusion and iteratively perform magnitude pruning until the end. We theoretically prove that the early stopping mechanism acts analogously to regularization and can help the optimization trajectory stop at a better point in the parameter space than regularization alone. This not only prevents the parameters from being excessively skewed toward the training distribution (over-fitting), but also better stimulates the network's potential to obtain more powerful subnetworks. Extensive experiments are conducted to show the superiority of our methods in terms of accuracy and sparsity. ","Lottery Tickets Hypothesis, Dual Lottery Tickets Hypothesis, Non-linear increased regularization, early stopping" The Effects of Nonlinearity on Approximation Capacity of Recurrent Neural Networks,https://openreview.net/forum?id=Q4B6g_ubd39,https://openreview.net/pdf?id=Q4B6g_ubd39,"Nonlinear recurrent activations do not make the approximation capacity of RNNs worse, but also not much better.","We study the effects of nonlinear recurrent activations on the approximation properties of recurrent neural networks (RNNs). Previous works indicate that in the linear setting, RNNs show good approximation performance when the target sequential relationship is smooth and has fast decaying memory. Otherwise, RNNs may suffer from the so-called “curse of memory”, meaning that an exponentially large number of neurons is required for accurate approximation. A natural question is whether the recurrent nonlinearity has a substantial effect on RNNs’ approximation capacity and approximation speed. In this paper, we present some negative results in this direction. We discover that, while the addition of nonlinearity does not shrink the hypothesis space, in the sense that nonlinear RNNs can still approximate linear functionals with the same approximation rates established for linear RNNs, it does not essentially alleviate the limitations of RNNs either. In particular, we prove that nonlinear RNNs fail to be universal approximators of arbitrary nonlinear functionals, and any linear functional that can be efficiently approximated must also possess an exponentially decaying memory. ","Recurrent Neural Network, Approximation Theory, Functional Analysis, Dynamical System" Friends to Help: Saving Federated Learning from Client Dropout,https://openreview.net/forum?id=DiKT4rrUD9n,https://openreview.net/pdf?id=DiKT4rrUD9n,This paper proposes an algorithm to address client dropout in Federated Learning that discovers the ``friendship'' among clients and uses the friend client's local update as a substitute for the dropout client. ,"Federated learning (FL) is an outstanding distributed machine learning framework due to its benefits for data privacy and communication efficiency. Since full client participation in many cases is infeasible due to constrained resources, partial participation FL algorithms have been investigated that proactively select/sample a subset of clients, aiming to achieve learning performance close to the full participation case. 
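The unified lottery-ticket entry above prescribes a two-phase recipe: train without regularization until the validation loss plateaus, then grow a regularization term while iteratively magnitude-pruning. A schematic sketch under assumed details (patience-based plateau detection, an L2-style penalty, and a fixed per-round pruning fraction); the exact schedules in the paper may differ:

```python
import torch

def magnitude_prune(model, frac):
    """Zero out the smallest-magnitude `frac` of each weight tensor."""
    with torch.no_grad():
        for p in model.parameters():
            k = int(p.numel() * frac)
            if k == 0:
                continue
            thresh = p.abs().flatten().kthvalue(k).values
            p.mul_((p.abs() > thresh).float())

def train_unified(model, loss_fn, loader, val_loss_fn, epochs=100,
                  patience=5, reg_step=1e-4, prune_frac=0.1):
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    best, stall, reg = float("inf"), 0, 0.0
    for _ in range(epochs):
        for x, y in loader:
            loss = loss_fn(model(x), y)
            if reg > 0:  # phase 2: gradually increased regularization
                loss = loss + reg * sum(p.pow(2).sum() for p in model.parameters())
            opt.zero_grad(); loss.backward(); opt.step()
        v = val_loss_fn(model)                  # hypothetical validation hook
        if v < best:
            best, stall = v, 0
        else:
            stall += 1
        if reg == 0 and stall >= patience:      # early-stopping trigger
            reg = reg_step                      # switch regularization on
        elif reg > 0:
            reg += reg_step                     # keep increasing it
            magnitude_prune(model, prune_frac)  # iterative magnitude pruning
```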
This paper studies a passive partial client participation scenario that is much less well understood, where partial participation is a result of external events, namely client dropout, rather than a decision of the FL algorithm. We cast FL with client dropout as a special case of a larger class of FL problems where clients can submit substitute (possibly inaccurate) local model updates. Based on our convergence analysis, we develop a new algorithm FL-FDMS that discovers friends of clients (i.e., clients whose data distributions are similar) on-the-fly and uses friends' local updates as substitutes for the dropout clients, thereby reducing the substitution error and improving the convergence performance. A complexity reduction mechanism is also incorporated into FL-FDMS, making it both theoretically sound and practically useful. Experiments on MNIST and CIFAR-10 confirmed the superior performance of FL-FDMS in handling client dropout in FL.","Federated Learning, Client Dropout, Partial Participation" Filter-Recovery Network for Multi-Speaker Audio-Visual Speech Separation,https://openreview.net/forum?id=fiB2RjmgwQ6,https://openreview.net/pdf?id=fiB2RjmgwQ6,,"We address the audio-visual speech separation task. Given the face information for each speaker, the goal is to separate the corresponding speech in the speech mixture. Existing works are designed for a controlled setting with a fixed number of speakers, mostly 2 or 3 speakers, which does not scale easily in practical applications. To deal with this, we focus on separating voices for a variable number of speakers with a single model, and build concrete mixture test sets for a fair comparison. There are two prominent issues in complex multi-speaker separation results: 1) some noisy voice segments belonging to other speakers remain; 2) part of the target speech is missing. To deal with these, we propose an effective method, BFRNet, comprising a basic audio-visual speech separator and a Filter-Recovery Network (FRNet). The FRNet filters out the noisy speech and recovers the missing parts of the basic separator's output. Our method achieves state-of-the-art results on audio-visual speech separation datasets. Besides, we apply the FRNet to other methods and achieve general performance improvements, which proves the effectiveness of the proposed FRNet.", Probing into the Fine-grained Manifestation in Multi-modal Image Synthesis,https://openreview.net/forum?id=2xQVAXKjLdH,https://openreview.net/pdf?id=2xQVAXKjLdH,A new method for evaluating the semantic consistency and robustness of multi-modal image synthesis models,"The ever-growing development of multi-modal image synthesis brings unprecedented realism to generation tasks. In practice, it is straightforward to judge the visual quality and realism of an image. However, it is labor-intensive to verify the semantic consistency of automatic generation, which requires a comprehensive understanding and mapping of different modalities. The results of existing models are sorted and displayed largely by relying on global visual-text similarity. However, this coarse-grained approach does not capture the fine-grained semantic alignment between image regions and text spans. To address this issue, we first present a new method to evaluate the cross-modal consistency by inspecting the decomposed semantic concepts. We then introduce a new metric, called MIS-Score, which is designed to measure the fine-grained semantic alignment between a prompt and its generation quantitatively. 
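The FL-FDMS entry above substitutes a dropout client's update with that of its most similar "friend". A minimal sketch, assuming friendship is scored by cosine similarity between the clients' most recent updates; the actual discovery rule and weighting in the paper may differ:

```python
import torch
import torch.nn.functional as F

def aggregate_with_friends(updates, dropped):
    """updates: dict client_id -> flattened model update (None if dropped).
    dropped: dict client_id -> that client's last known (stale) update.
    For each dropout client, reuse the update of the live client whose
    update is most cosine-similar to the dropout client's last one."""
    live = {c: u for c, u in updates.items() if u is not None}
    filled = dict(live)
    for c, last in dropped.items():
        friend = max(live, key=lambda j: F.cosine_similarity(last, live[j], dim=0).item())
        filled[c] = live[friend]          # substitute the friend's update
    return torch.stack(list(filled.values())).mean(dim=0)

updates = {0: torch.randn(8), 1: torch.randn(8), 2: None}
dropped = {2: torch.randn(8)}             # stale update from client 2
print(aggregate_with_friends(updates, dropped))
```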
Moreover, we also develop an automated robustness testing technique with referential transforms to measure the robustness of multi-modal synthesis models. We have conducted comprehensive experiments to evaluate the performance of recent popular models for text-to-image generation. Our study demonstrates that the proposed metric MIS-Score provides a better evaluation criterion than existing coarse-grained ones (e.g., CLIP) for understanding the semantic consistency of synthesized results. Our robustness testing method also proves the existence of biases embedded in the models, hence uncovering their limitations in real applications.","Multi-modal image synthesis, semantic consistency measurement, robustness testing" Can discrete information extraction prompts generalize across language models?,https://openreview.net/forum?id=sbWVtxq8-zE,https://openreview.net/pdf?id=sbWVtxq8-zE,"We show that automatically generated prompts can be learned on a language model and used to retrieve information from another. We further provide some preliminary insights on the nature of these ""universal prompts"".","We study whether automatically-induced prompts that effectively extract information from a language model can also be used, out-of-the-box, to probe other language models for the same information. After confirming that discrete prompts induced with the AutoPrompt algorithm outperform manual and semi-manual prompts on the slot-filling task, we demonstrate a drop in performance for AutoPrompt prompts learned on a model and tested on another. We introduce a way to induce prompts by mixing language models at training time that results in prompts that generalize well across models. We conduct an extensive analysis of the induced prompts, finding that the more general prompts include a larger proportion of existing English words and have a less order-dependent and more uniform distribution of information across their component tokens. Our work provides preliminary evidence that it is possible to generate discrete prompts that can be induced once and used with a number of different models, and gives insights on the properties characterizing such prompts.","prompting, prompt analysis, language model interfaces, prompt generalizations" Deep Power Laws for Hyperparameter Optimization,https://openreview.net/forum?id=NZ8Gb5GOrRu,https://openreview.net/pdf?id=NZ8Gb5GOrRu,Multi-fidelity hyperparameter optimization with deep power laws that achieves state-of-the-art results across diverse benchmarks.,"Hyperparameter optimization is an important subfield of machine learning that focuses on tuning the hyperparameters of a chosen algorithm to achieve peak performance. Recently, a stream of methods has tackled hyperparameter optimization; however, most of them do not exploit the scaling-law property of learning curves. In this work, we propose Deep Power Law (DPL), a neural network model conditioned to yield predictions that follow a power-law scaling pattern. Our model dynamically decides which configurations to pause and train incrementally by making use of multi-fidelity estimation. We compare our method against 7 state-of-the-art competitors on 3 benchmarks related to tabular, image, and NLP datasets covering 59 diverse search spaces. 
Our method achieves the best any-time results across all benchmarks, outperforming all competitors.","hyperparameter optimization, multi-fidelity optimization, power laws, deep neural networks, deep power laws." "A view of mini-batch SGD via generating functions: conditions of convergence, phase transitions, benefit from negative momenta.",https://openreview.net/forum?id=bzaPGEllsjE,https://openreview.net/pdf?id=bzaPGEllsjE,"We develop an analytic framework for analyzing mini-batch SGD dynamics via generating functions, using a novel Spectrally Expressible approximation. ","Mini-batch SGD with momentum is a fundamental algorithm for learning large predictive models. In this paper we develop a new analytic framework to analyze noise-averaged properties of mini-batch SGD for linear models at constant learning rates, momenta and sizes of batches. Our key idea is to consider the dynamics of the second moments of model parameters for a special family of ""Spectrally Expressible"" approximations. This allows us to obtain an explicit expression for the generating function of the sequence of loss values. By analyzing this generating function, we find, in particular, that 1) the SGD dynamics exhibits several convergent and divergent regimes depending on the spectral distributions of the problem; 2) the convergent regimes admit explicit stability conditions, and explicit loss asymptotics in the case of power-law spectral distributions; 3) the optimal convergence rate can be achieved at negative momenta. We verify our theoretical predictions by extensive experiments with MNIST and synthetic problems, and find a good quantitative agreement.","SGD, linear models, optimization, analytic framework, NTK" Curiosity-Driven Unsupervised Data Collection for Offline Reinforcement Learning,https://openreview.net/forum?id=At0BdxvACds,https://openreview.net/pdf?id=At0BdxvACds,We propose a novel adaptive reachability-based method to improve the data collection process in offline reinforcement learning. ,"In offline reinforcement learning (RL), while the majority of efforts focus on engineering sophisticated learning algorithms given a fixed dataset, very few works have been carried out to improve the dataset quality itself. More importantly, it is even challenging to collect a task-agnostic dataset such that the offline RL agent can learn multiple skills from it. In this paper, we propose a Curiosity-driven Unsupervised Data Collection (CUDC) method to improve the data collection process. Specifically, we quantify the agent's internal belief to estimate the probability of the k-step future states being reachable from the current states. Different from existing approaches that implicitly assume a limited feature space with a fixed number of environment steps, CUDC can adapt the number of environment steps to explore. Thus, the feature representation can be substantially diversified with the dynamics information. With this adaptive reachability mechanism in place, the agent can navigate itself to collect higher-quality data with curiosity. 
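The DPL entry above rests on the scaling-law property of learning curves. The core idea can be illustrated with the parametric form alone: fit y(t) ≈ y_inf + a * t^(-b) to a partial learning curve and extrapolate to decide which configurations deserve more budget. A sketch using an assumed functional form fitted with least squares (DPL itself predicts such parameters with a conditioned neural network):

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(t, y_inf, a, b):
    """Validation error as a power law in the training budget t."""
    return y_inf + a * np.power(t, -b)

# Partial learning curve observed at low fidelity (epochs 1..10).
t_obs = np.arange(1, 11, dtype=float)
y_obs = 0.12 + 0.5 * t_obs ** -0.7 + np.random.normal(0, 0.005, t_obs.shape)

params, _ = curve_fit(power_law, t_obs, y_obs, p0=(0.1, 0.5, 0.5), maxfev=10_000)
print("predicted error at epoch 100:", power_law(100.0, *params))
# A multi-fidelity optimizer can now pause configurations whose
# extrapolated final error is dominated by others.
```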
Empirically, CUDC surpasses existing unsupervised methods in sample efficiency and learning performance in various downstream offline RL tasks of the DeepMind control suite.","Offline Reinforcement Learning, Data Collection, Reachability, Unsupervised Learning, Curiosity-Driven Learning" Understanding and Bridging the Modality Gap for Speech Translation,https://openreview.net/forum?id=ZmGrci84heu,https://openreview.net/pdf?id=ZmGrci84heu,We aim to understand the modality gap for speech translation and propose a simple yet effective Cross-modal Regularization with Scheduled Sampling (Cress) method to bridge this gap.,"How to achieve better end-to-end speech translation (ST) by leveraging (text) machine translation (MT) data? Among various existing techniques, multi-task learning is one of the effective ways to share knowledge between ST and MT, thus additional MT data can help to learn the source-to-target mapping. However, due to the differences between speech and text, there is always a gap between ST and MT. In this paper, we first aim to understand this modality gap from the target-side representation differences. We also link the modality gap to another well-known problem in neural machine translation: exposure bias, where the modality gap is relatively small during training except for some hard cases, but keeps increasing during inference due to the cascading effect. To address these problems, we propose the Cross-modal Regularization with Scheduled Sampling (Cress) method. Specifically, we regularize the output predictions of ST and MT, whose target-side contexts are derived by sampling between ground truth words and self-generated words with a varying probability. Furthermore, to handle the difficult cases with large modality gaps, we introduce token-level adaptive training to assign different training weights to target tokens according to the extent of the modality gap. Experiments and analysis show that our approach effectively bridges the modality gap, and achieves significant improvements over a strong baseline, which establishes new state-of-the-art results in all eight directions of the MuST-C dataset.","neural machine translation, speech translation, modality gap" Big Learning: A Universal Machine Learning Paradigm?,https://openreview.net/forum?id=UfFXUfAsnPH,https://openreview.net/pdf?id=UfFXUfAsnPH,"We propose a new machine learning framework named Big Learning, which exhaustively exploits data information and underlies existing foundation models.","Recent breakthroughs based on big/foundation models reveal a vague avenue for AI, that is, \emph{big data, big/foundation models, big learning, $\cdots$}. Following that avenue, here we elaborate on our newly introduced big learning. Specifically, big learning exhaustively exploits the information/tasks inherent in its large-scale \emph{complete/incomplete} training data, by learning to simultaneously model many/all joint/conditional/marginal data distributions (thus named big learning) with one universal foundation model. We reveal that big learning is what existing foundation models are implicitly doing; accordingly, our big learning provides high-level guidance for flexible design and improvements of foundation models. 
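The Cress entry above derives target-side contexts by sampling between ground-truth and self-generated words with a varying probability. A minimal scheduled-sampling sketch of that mixing step; the decay schedule and tensor shapes below are illustrative assumptions:

```python
import torch

def mix_context(gold_tokens, model_tokens, p_gold):
    """Build a decoder input where each position keeps the ground-truth
    token with probability p_gold and otherwise uses the model's own
    prediction.  Decaying p_gold over training exposes the model to its
    own outputs, shrinking the train/inference (exposure-bias) mismatch."""
    keep_gold = torch.rand_like(gold_tokens, dtype=torch.float) < p_gold
    return torch.where(keep_gold, gold_tokens, model_tokens)

gold = torch.tensor([[5, 7, 2, 9]])
pred = torch.tensor([[5, 3, 2, 8]])          # argmax of a previous forward pass
for step, p in enumerate([1.0, 0.75, 0.5]):  # decaying schedule
    print(step, mix_context(gold, pred, p))
```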
Besides, big learning ($i$) is equipped with great flexibilities for complete/incomplete training data and for customizing trustworthy data tasks; ($ii$) potentially delivers all joint/conditional/marginal data capabilities after training; ($iii$) significantly reduces the training-test gap with improved model generalization; and ($iv$) potentially unifies conventional machine learning paradigms and enables their flexible cooperations, manifested as a universal learning paradigm. Preliminary experiments verify the effectiveness of the presented big learning. ","Foundation models, big learning, incomplete data, GAN" Spike Calibration: Bridging the Gap between ANNs and SNNs in ANN-SNN Conversion ,https://openreview.net/forum?id=PFbzoWZyZRX,https://openreview.net/pdf?id=PFbzoWZyZRX,A calibration method based on shifting the initial membrane potential is proposed for ANN-SNN conversion to reach the same level of performance as BPTT.,"Spiking Neural Networks (SNNs) have attracted great attention due to the distinctive characteristics of low power consumption and temporal information processing. ANN-SNN conversion, as the most commonly used method, can make converted SNNs achieve comparable performance to ANNs on large-scale datasets. However, the performance degrades severely at low time-steps, which hampers the practical applications of SNNs on neuromorphic chips. In this paper, instead of evaluating different conversion errors and then eliminating these errors, we define the offset spike to measure the deviation between the actual and desired firing rates of SNNs. We make a detailed analysis of the offset spike and point out that the case of firing one more (or one fewer) spike is the main reason for conversion error. Based on this, we propose an optimization strategy based on shifting the initial membrane potential, and we theoretically derive the corresponding optimal shifting distance for spike calibration. In addition, we also note that our method has a unique iterative property to further reduce conversion error. The experimental results show that our proposed method achieves state-of-the-art performance on CIFAR-10, CIFAR-100, and ImageNet datasets. For example, we reach a top-1 accuracy of 67.12% on ImageNet with 6 time-steps. To the best of our knowledge, this is the first time ANN-SNN conversion can simultaneously achieve high accuracy and ultra-low latency on this complex dataset.","Spiking Neural Networks, Spike Calibration, Ultra-low-latency Conversion" MIA: A Framework for Certified Robustness of Time-Series Classification and Forecasting Against Temporally-Localized Perturbations,https://openreview.net/forum?id=szTcqSSc5Vx,https://openreview.net/pdf?id=szTcqSSc5Vx,,"Recent literature demonstrates that time-series forecasting/classification models are sensitive to input perturbations. However, the defenses for time-series models are relatively under-explored. In this paper, we propose \textbf{M}asking \textbf{I}mputing \textbf{A}ggregation (MIA), a plug-and-play framework to provide an arbitrary deterministic time-series model with certified robustness against temporally-localized perturbations (also known as $\ell_0$-norm localized perturbations), which is to our knowledge the first $\ell_0$-norm defense for time-series models. Our main insight is to let an occluding mask move across the input series, guaranteeing that, for an arbitrary localized perturbation, there exists at least one mask that completely removes the perturbation, so that the prediction on this masked series is unaffected. 
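The spike-calibration entry above shifts the initial membrane potential to avoid firing one spike too few or too many. A toy soft-reset integrate-and-fire simulation showing how the initial potential changes the firing rate under the same input; the optimal shift derived in the paper is not reproduced here, and half the threshold is used purely for illustration:

```python
import numpy as np

def if_neuron(inputs, threshold=1.0, v_init=0.0):
    """Soft-reset integrate-and-fire neuron; returns total spikes emitted.
    The firing rate spikes/T should approximate the ANN activation."""
    v, spikes = v_init, 0
    for x in inputs:
        v += x
        if v >= threshold:
            spikes += 1
            v -= threshold    # soft reset preserves residual charge
    return spikes

T, act = 3, 0.3                  # time-steps and target ANN activation
inputs = np.full(T, act)         # constant input current
for v0 in (0.0, 0.5):            # shifting the initial membrane potential
    rate = if_neuron(inputs, v_init=v0) / T
    print(f"v_init={v0}: SNN rate={rate:.3f} vs ANN activation={act}")
# With v_init=0 the neuron never fires (rate 0.0); with v_init=0.5 the
# rate (0.333) is far closer to the target activation 0.3.
```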
Remarkably, MIA is highly available, as it still works even if we only have query access to the pretrained model. Furthermore, as there is no dedicated defense against $\ell_0$-norm perturbations for time-series models, we specifically adapt two matrix-based defenses to time-series models for comparison. Extensive experiments show that MIA yields stronger robustness as well as practicality.","Certified robustness, time series forecasting, time series classification" Sparse Q-Learning: Offline Reinforcement Learning with Implicit Value Regularization,https://openreview.net/forum?id=ueYYgo2pSSU,https://openreview.net/pdf?id=ueYYgo2pSSU,"We show that some form of Implicit Value Regularization (IVR) will result in the In-sample Learning paradigm in offline RL. We also propose a practical algorithm based on the IVR framework, which obtains new SOTA results.","Most offline reinforcement learning (RL) methods suffer from the trade-off between improving the policy to surpass the behavior policy and constraining the policy to limit the deviation from the behavior policy, since computing Q-values using out-of-distribution actions suffers from errors due to distributional shift. The recently proposed \textit{In-sample Learning} paradigm (e.g., IQL), which improves the policy by quantile regression using only data samples, shows great promise because it learns an optimal policy without querying the value function of any unseen actions. However, it remains unclear how this type of method handles the distributional shift in learning the value function. In this work, we make a key finding that the in-sample learning paradigm arises under the \textit{Implicit Value Regularization} (IVR) framework. This gives a deeper understanding of why the in-sample learning paradigm works, i.e., it applies implicit value regularization to the policy. Based on the IVR framework, we further propose a practical algorithm, which uses the same value regularization as CQL, but in a complete in-sample manner. Compared with IQL, we find that our algorithm introduces sparsity in learning the value function; we thus dub our method Sparse Q-learning (SQL). We verify the effectiveness of SQL on D4RL benchmark datasets. We also show the benefits of sparsity by comparing SQL with IQL in noisy data regimes and show the robustness of in-sample learning by comparing SQL with CQL in small data regimes. Under all settings, SQL achieves better results and converges faster than other baselines. ","Deep Reinforcement Learning, Offline Reinforcement Learning, Value Regularization, Continuous Control" Eliminating Catastrophic Overfitting Via Abnormal Adversarial Examples Regularization,https://openreview.net/forum?id=YjKqWExiy6s,https://openreview.net/pdf?id=YjKqWExiy6s,,"Single-step adversarial training (SSAT) has been shown to defend against iterative-step adversarial attacks while achieving both efficiency and robustness. However, SSAT suffers from catastrophic overfitting (CO) with strong adversaries, in which the classifier's decision boundaries become highly distorted and robust accuracy against iterative-step adversarial attacks suddenly drops from its peak to nearly 0% within a few epochs. In this work, we find that some adversarial examples generated on the network trained by SSAT exhibit anomalous behaviour; that is, although the training data is generated by the inner maximization process, the loss of some adversarial examples decreases instead, which we call abnormal adversarial examples. 
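The MIA entry above slides an occluding mask across the series so that at least one masked copy is guaranteed perturbation-free. A minimal mask-and-aggregate sketch for a black-box classifier; imputation is reduced to zero-filling and the certification bookkeeping is omitted:

```python
import numpy as np

def mia_predict(series, predict_fn, mask_len, n_classes, fill=0.0):
    """Slide an occluding mask of length mask_len over the series, query
    the model on every masked copy, and majority-vote the predictions.
    Any perturbation shorter than mask_len is fully removed by at least
    one mask, so the vote margin can support a robustness certificate
    against temporally-localized attacks."""
    votes = np.zeros(n_classes, dtype=int)
    for start in range(len(series) - mask_len + 1):
        masked = series.copy()
        masked[start:start + mask_len] = fill     # occlude (crude imputation)
        votes[predict_fn(masked)] += 1            # only query access needed
    return votes.argmax(), votes

rng = np.random.default_rng(0)
x = rng.normal(size=64)
toy_model = lambda s: int(s.mean() > 0)           # stand-in classifier
print(mia_predict(x, toy_model, mask_len=8, n_classes=2))
```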
Furthermore, optimizing the network on these abnormal adversarial examples further accelerates the distortion of the model's decision boundaries, and correspondingly, the number of abnormal adversarial examples increases sharply as CO occurs. These observations motivate us to prevent CO by hindering the generation of abnormal adversarial examples. Specifically, we design a novel method, Abnormal Adversarial Examples Regularization (AAER), which explicitly regularizes the number and the logit variation of abnormal adversarial examples to hinder the model from generating them. Extensive experiments demonstrate that our method can prevent CO and further boost adversarial robustness with strong adversaries.", TIB: Detecting Unknown Objects via Two-Stream Information Bottleneck,https://openreview.net/forum?id=Uk3zO5A-CSe,https://openreview.net/pdf?id=Uk3zO5A-CSe,,"Detecting diverse objects, including ones never seen before during model training, is critical for the safe application of object detectors. To this end, a task of unsupervised out-of-distribution object detection (OOD-OD) is proposed to detect unknown objects without relying on an auxiliary dataset. For this task, it is important to reduce the impact of lacking unknown data for supervision and leverage in-distribution (ID) data to improve the model's discrimination ability. In this paper, we propose a method of Two-Stream Information Bottleneck (TIB), which consists of a standard Information Bottleneck and a dedicated Reverse Information Bottleneck (RIB). Specifically, after extracting the features of an ID image, we first define a standard IB network to disentangle instance representations that are beneficial for localizing and recognizing objects. Meanwhile, we present RIB to obtain simulated OOD features to alleviate the impact of lacking unknown data. Unlike the standard IB, which aims to extract task-relevant compact representations, RIB obtains task-irrelevant representations by reversing the optimization objective of the standard IB. Next, to further enhance the discrimination ability, a mixture of information bottlenecks is designed to sufficiently capture object-related information. In the experiments, our method is evaluated on OOD-OD and incremental object detection. The significant performance gains over baselines show the superiority of our method. ", Revisiting Residual Networks for Adversarial Robustness,https://openreview.net/forum?id=4tsqGWfBb3Q,https://openreview.net/pdf?id=4tsqGWfBb3Q,Designing robust convolutional neural networks against adversarial attack. ,"Convolutional neural networks are known to be vulnerable to adversarial attacks. Solutions to improve their robustness have largely focused on developing more effective adversarial training methods, while limited efforts have been devoted to analyzing the effect of architectural elements (such as topology, depth, and width) on adversarial robustness. This paper seeks to resolve this limitation and present a holistic study on the impact of architecture choice on adversarial robustness. We focus on residual networks and consider architecture design at the block level, i.e., topology, kernel size, activation, and normalization, as well as at the network scaling level, i.e., depth and width of each block in the network. We first derive insights on the block structure through systematic ablative experiments and design a novel residual block, dubbed RobustResBlock. 
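The AAER entry above penalizes "abnormal" adversarial examples, i.e., those whose loss drops below the clean loss after the inner maximization. A schematic single-step adversarial-training sketch with an assumed form of the regularizer; the paper's exact number- and logit-variation terms may differ:

```python
import torch
import torch.nn.functional as F

def ssat_step_with_aaer(model, x, y, eps=8 / 255, lam=1.0):
    """One FGSM adversarial-training step plus a toy AAER-style penalty on
    examples whose adversarial loss is *lower* than their clean loss."""
    x.requires_grad_(True)
    clean_logits = model(x)
    clean_loss = F.cross_entropy(clean_logits, y, reduction="none")
    grad = torch.autograd.grad(clean_loss.sum(), x)[0]
    x_adv = (x + eps * grad.sign()).clamp(0, 1).detach()  # FGSM example

    adv_logits = model(x_adv)
    adv_loss = F.cross_entropy(adv_logits, y, reduction="none")
    abnormal = adv_loss < clean_loss.detach()   # loss decreased: abnormal
    # Penalize how far abnormal examples' logits drifted from the clean ones.
    drift = (adv_logits - clean_logits.detach()).pow(2).sum(dim=1)
    reg = (abnormal.float() * drift).mean()
    return adv_loss.mean() + lam * reg          # training loss to backprop
```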
It improves CW40 robust accuracy by ∼3% over wide residual networks (WRNs), the de facto architecture of choice for designing robust architectures. Then we derive insights on the impact of depth and width of the network and design a compound scaling rule, dubbed RobustScaling, to distribute depth and width at a given desired FLOP count. Finally, we combine RobustResBlock and RobustScaling and present a portfolio of adversarially robust residual networks, RobustResNets, spanning a wide spectrum of model capacities. Experimental validation on three datasets across four adversarial attacks demonstrates that RobustResNets consistently outperform both standard WRNs (3∼4% improvement in robust accuracy with about half the parameters) and other robust architectures proposed in existing works.","Adversarial robustness, neural architecture design" Win: Weight-Decay-Integrated Nesterov Acceleration for Adaptive Gradient Algorithms,https://openreview.net/forum?id=CPdc77SQfQ5,https://openreview.net/pdf?id=CPdc77SQfQ5,"We propose a new and general Weight-decay-Integrated Nesterov acceleration for adaptive algorithms to enhance their convergence speed, and we also analyze their convergence to justify their superiority. ","Training deep networks on increasingly large-scale datasets is computationally challenging. In this work, we explore the problem of ``\textit{how to accelerate the convergence of adaptive gradient algorithms in a general manner}"", and aim at providing practical insights to boost the training efficiency. To this end, we propose an effective and general {Weight-decay-Integrated Nesterov acceleration} (Win) for adaptive algorithms to enhance their convergence speed. Taking AdamW and Adam as examples, we minimize a dynamical loss per iteration which combines the vanilla training loss and a dynamic regularizer inspired by the proximal point method (PPM) to improve the convexity of the problem. To introduce Nesterov-like acceleration into AdamW and Adam, we respectively use the first- and second-order Taylor approximations of the vanilla loss to update the variable twice while fixing the above dynamic regularization brought by PPM. In this way, we arrive at our Win acceleration (like Nesterov acceleration) for AdamW and Adam that uses a conservative step and a reckless step to update twice and then linearly combines these two updates for acceleration. Next, we extend this Win acceleration to LAMB and SGD. Our transparent acceleration derivation could provide insights for other accelerated methods and their integration into adaptive algorithms. Besides, we prove the convergence of Win-accelerated adaptive algorithms and justify their convergence superiority over their non-accelerated counterparts by taking AdamW and Adam as examples. Experimental results testify to the faster convergence speed and superior performance of our Win-accelerated AdamW, Adam, LAMB and SGD over their non-accelerated counterparts on vision classification tasks and language modeling tasks with both CNN and Transformer backbones. 
We hope Win acceleration will become a default acceleration option for popular optimizers in the deep learning community to improve training efficiency.","Optimization acceleration in deep learning, network optimizers, deep learning optimizer, deep learning algorithm" A Quasi-Bayesian Nonparametric Density Estimator via Autoregressive Predictive Updates,https://openreview.net/forum?id=Iyi7eb9VIW,https://openreview.net/pdf?id=Iyi7eb9VIW,We introduce a Quasi-Bayesian nonparametric density estimator for moderate-sized data sets that is inspired by an autoregressive Dirichlet Process Mixture Model.,"Bayesian methods are a popular choice for statistical inference in small-data regimes due to the regularization effect induced by the prior. In the context of density estimation, the standard nonparametric Bayesian approach is to target the posterior predictive of the Dirichlet process mixture model. In general, direct estimation of the posterior predictive is intractable and so methods typically resort to approximating the posterior distribution as an intermediate step. The recent development of quasi-Bayesian predictive copula updates, however, has made it possible to perform tractable predictive density estimation without the need for posterior approximation. Although these estimators are computationally appealing, they tend to struggle on non-smooth data distributions. This is due to the comparatively restrictive form of the likelihood models from which the proposed copula updates were derived. To address this shortcoming, we consider a Bayesian nonparametric model with an autoregressive likelihood decomposition and a Gaussian process prior. While the predictive update of such a model is typically intractable, we derive a quasi-Bayesian predictive update that achieves state-of-the-art results on moderate-sized examples.","Bayesian nonparametrics, Dirichlet Process Mixture Models, Quasi-Bayes" Towards Understanding Convergence and Generalization of AdamW,https://openreview.net/forum?id=EfTN2tSGlF,https://openreview.net/pdf?id=EfTN2tSGlF,"We theoretically prove the convergence of AdamW and justify its generalization superiority over both Adam and its $\ell_2$-regularized variant. ","AdamW modifies vanilla Adam by decaying network weights per training iteration, and shows remarkable generalization superiority over Adam and its $\ell_2$-regularized variant. In the context of adaptive gradient algorithms (\eg~Adam), the decoupled weight decay in AdamW differs from the widely used $\ell_2$-regularizer, since the former does not affect optimization steps, while the latter changes the first- and second-order gradient moments and thus the optimization steps. Despite its great success on both vision transformers and CNNs, analyses of AdamW's convergence behavior and of its generalization improvement over ($\ell_2$-regularized) Adam remain absent. To solve this issue, we prove the convergence of AdamW and justify its generalization advantages over Adam and its $\ell_2$-regularized version. Specifically, AdamW can provably converge but minimizes a dynamically regularized loss that combines a vanilla loss and a dynamical regularization induced by the decoupled weight decay, thus leading to its different behaviors compared with Adam and its $\ell_2$-regularized version. Moreover, on both general nonconvex problems and P\L-conditioned problems, we establish the stochastic gradient complexity of AdamW to find a stationary point. 
Such complexity is also applicable to Adam and its $\ell_{2}$-regularized variant, and indeed improves their previously known complexity, especially for modern over-parametrized networks. Besides, we theoretically show that AdamW often enjoys a smaller generalization error bound than both Adam and its $\ell_2$-regularized variant from the Bayesian posterior perspective. This result, for the first time, explicitly reveals the benefits of the unique decoupled weight decay in AdamW. We hope the theoretical results in this work could motivate researchers to propose novel optimizers with faster convergence and better generalization. Experimental results testify to our theoretical implications.","deep learning optimization, network optimizer" GeoVeX: Geospatial Vectors with Hexagonal Convolutional Autoencoders,https://openreview.net/forum?id=7bvWopYY1H,https://openreview.net/pdf?id=7bvWopYY1H,We introduce a new geospatial representation model called GeoVeX to learn global vectors for all geographical locations on Earth land cover (200+ million embeddings).,"We introduce a new geospatial representation model called GeoVeX to learn global vectors for all geographical locations on Earth land cover (200+ million embeddings). GeoVeX is built on a novel model architecture named Hexagonal Convolutional Autoencoders (HCAE) combined with a Zero-Inflated Poisson (ZIP) reconstruction layer, applied to a grid of Uber's H3 hexagons, each one described by the histogram of OpenStreetMap (OSM) geographical tag occurrences. GeoVeX is novel in three aspects: 1) it produces the first geospatial vectors trained on worldwide open data, enabling wide adoption in downstream tasks that may benefit from enriched geographical information, requiring only location coordinates; 2) it represents the first use of hexagonal convolutions within autoencoder architectures, to learn latent representations of a hexagonal grid; and 3) it introduces a spatial-contextual Poisson reconstruction loss function for autoencoder architectures suitable for training on sparse geographical count data. Experiments demonstrate that GeoVeX embeddings can improve upon state-of-the-art geospatial location representations on two different downstream tasks: price prediction in the travel industry and hyperlocal interpolation of climate data from weather stations.","Representation learning, Geospatial Embedding, Convolutional Autoencoders on hexagonal grids, OpenStreetMap, H3 hexagons" Prompt-Matched Semantic Segmentation,https://openreview.net/forum?id=PxFpWq6FNiW,https://openreview.net/pdf?id=PxFpWq6FNiW,We proposed a generic and effective prompt tuning method for semantic segmentation.,"The objective of this work is to explore how to effectively and efficiently adapt pre-trained visual foundation models to downstream tasks, e.g., image semantic segmentation. Conventional methods usually fine-tune the entire network for each specific dataset, making it burdensome to store the massive parameters of these networks. Several recent works attempted to insert some extra trainable parameters into the frozen networks to learn visual prompts for parameter-efficient tuning. However, these works showed poor generality as they were designed specifically for Transformers. Moreover, relying on limited information, these schemes exhibited a poor capacity to learn effective prompts. To alleviate these issues, we propose a novel Inter-Stage Prompt-Matched Framework for generic and effective visual prompt tuning.
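Returning to GeoVeX's Zero-Inflated Poisson reconstruction layer mentioned above: the plain ZIP negative log-likelihood, before the spatial-contextual weighting the paper adds (omitted here), could look like the following sketch.

```python
import numpy as np
from scipy.special import gammaln

def zip_nll(x, lam, pi):
    """Negative log-likelihood of count data x under a Zero-Inflated Poisson:
    P(0) = pi + (1 - pi) * exp(-lam);  P(k) = (1 - pi) * Poisson(k; lam)."""
    pi = np.clip(pi, 1e-12, 1.0 - 1e-12)
    log_pois = x * np.log(lam) - lam - gammaln(x + 1)       # Poisson log-pmf
    log_p0 = np.logaddexp(np.log(pi), np.log1p(-pi) - lam)  # zero outcome
    log_pk = np.log1p(-pi) + log_pois                       # positive counts
    return -np.where(x == 0, log_p0, log_pk).sum()
```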
Specifically, to ensure generality, we divide the pre-trained backbone with frozen parameters into multiple stages and perform prompt learning between different stages, which makes the proposed scheme applicable to various CNN and Transformer architectures. For effective tuning, a lightweight Semantic-aware Prompt Matcher (SPM) is designed to progressively learn reasonable prompts with a recurrent mechanism, guided by the rich information of interim semantic maps. Working as a deep matched filter of representation learning, the proposed SPM can well transform the output of the previous stage into a desirable input for the next stage, thus achieving better matching and stimulation of the pre-trained knowledge. Finally, we apply the proposed method to handle various semantic segmentation tasks. Extensive experiments on five benchmarks show that the proposed scheme can achieve a promising trade-off between parameter efficiency and performance effectiveness.","foundation model, prompt tuning, semantic segmentation, model generality" Split and Merge Proxy: pre-training protein-protein contact prediction by mining rich information from monomer data,https://openreview.net/forum?id=o8fqVVKN3H,https://openreview.net/pdf?id=o8fqVVKN3H,,"Protein-protein contact prediction is a key intelligent biology computation technology for complex multimer protein function analysis but still suffers from low accuracy. An important problem is that the amount of training data cannot meet the requirements of deep-learning-based methods due to the expensive cost of capturing structure information of multimer data. In this paper, we solve this data volume bottleneck in a cheap way, borrowing rich information from monomer data. To utilize monomer (single chain) data in this multimer (multiple chains) problem, we propose a simple but effective pre-training method called Split and Merge Proxy (SMP), which utilizes monomer data to construct a proxy task for model pre-training. This proxy task cuts monomer data into two sub-parts, called a pseudo multimer, and pre-trains the model to merge them back together by predicting their pseudo contacts. The pre-trained model is then used as the initialization for our target – protein-protein contact prediction. Because of the consistency between this proxy task and the final target, the whole method brings a stronger pre-trained model for subsequent fine-tuning, leading to significant performance gains. Extensive experiments validate the effectiveness of our method and show the model performs better than the state of the art by 11.40% and 2.97% on the P@L/10 metric for the bounded benchmarks DIPS-Plus and CASP-CAPRI, respectively. Further, the model also achieves nearly 1.5 times the performance of the state of the art on the harder unbounded benchmark DB5. The code, model, and pre-training data will be released after this paper is accepted.","Protein Bioinformatics, Protein-Protein Contact Prediction, Pre-training" ESD: Expected Squared Difference as a Tuning-Free Trainable Calibration Measure,https://openreview.net/forum?id=bHW9njOSON,https://openreview.net/pdf?id=bHW9njOSON,"We propose a tuning-free calibration objective loss, Expected Squared Difference (ESD), where we view the calibration error from the perspective of the squared difference between two expectations.","Recent studies have shown that modern neural networks tend to be poorly calibrated due to over-confident predictions. Traditionally, post-processing methods have been used to calibrate the model after training.
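For orientation, the quantity such calibration work targets is conventionally estimated with a binned expected calibration error; the bin count below is exactly the kind of internal hyperparameter that tuning-free objectives such as ESD (discussed next) aim to avoid. A standard sketch:

```python
import numpy as np

def binned_ece(conf, correct, n_bins=15):
    """Classical expected calibration error: per-bin gap between mean
    confidence and accuracy, weighted by bin mass. n_bins is an internal
    hyperparameter of the estimator."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece
```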
On the other hand, various trainable calibration measures have been proposed recently to incorporate the calibration objective loss directly into the training process. However, these methods all incorporate internal hyperparameters introduced in the process of obtaining a differentiable calibration measure. Consequently, the performance of these calibration objectives relies on tuning these hyperparameters, incurring more computational cost as the sizes of neural networks and datasets grow. As such, we present Expected Squared Difference (ESD), a tuning-free (i.e., hyperparameter-free) trainable calibration objective loss, where we view the calibration error from the perspective of the squared difference between two expectations. With extensive experiments on several architectures (MLPs, CNNs, Transformers) and datasets, we demonstrate that (1) incorporating ESD into the training improves model calibration in various batch size settings without the need for internal hyperparameter tuning, (2) ESD yields the best-calibrated results compared with previous approaches, and (3) ESD drastically reduces the computational cost required for calibration during training due to the absence of internal hyperparameters. Code will be publicly available.",calibration Interactive Portrait Harmonization,https://openreview.net/forum?id=AP0iZoaRaS,https://openreview.net/pdf?id=AP0iZoaRaS,A new flexible framework that allows users to pick certain regions of the background image and use it to guide the harmonization.,"Current image harmonization methods consider the entire background as the guidance for harmonization. However, this may limit the user's ability to choose any specific object/person in the background to guide the harmonization. To enable flexible interaction between the user and harmonization, we introduce interactive harmonization, a new setting where the harmonization is performed with respect to a selected region in the reference image instead of the entire background. A new flexible framework that allows users to pick certain regions of the background image and use it to guide the harmonization is proposed. Inspired by professional portrait harmonization users, we also introduce a new luminance matching loss to optimally match the color/luminance conditions between the composite foreground and the selected reference region. This framework provides more control to the image harmonization pipeline, achieving visually pleasing portrait edits. Furthermore, we also introduce a new dataset carefully curated for validating portrait harmonization. Extensive experiments on both synthetic and real-world datasets show that the proposed approach is efficient and robust compared to previous harmonization baselines, especially for portraits.","harmonization, image editing, low-level vision" Self-Distillation for Further Pre-training of Transformers,https://openreview.net/forum?id=kj6oK_Hj40,https://openreview.net/pdf?id=kj6oK_Hj40,We propose self-distillation in further pre-training to improve effectiveness of adaptation of pre-trained model to target tasks.,"Pre-training a large transformer model on a massive amount of unlabeled data and fine-tuning it on labeled datasets for diverse downstream tasks has proven to be a successful strategy for a variety of vision and natural language processing tasks. However, direct fine-tuning of the pre-trained model may be suboptimal if there exist large discrepancies across data domains for pre-training and fine-tuning.
To tackle this issue, several previous studies have proposed further pre-training strategies, where we continue to pre-train the model on the target unlabeled dataset before fine-tuning. However, all of them focus solely on language models, and we empirically find that a Vision Transformer is vulnerable to overfitting as we continue to pre-train the model on target unlabeled data. To address this limitation, we propose self-distillation as a regularization for the further pre-training stage. Specifically, we first further pre-train the initial pre-trained model on the target unlabeled data and then consider it as a teacher for self-distillation. Then we take the same initial pre-trained model as a student and enforce its hidden representations to be close to those of the teacher while optimizing the student with a masked auto-encoding objective. We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks. Experimentally, we show that our proposed method outperforms all the relevant baselines. Theoretically, we analyze the proposed method with a simplified model to understand how self-distillation for further pre-training can potentially help improve the performance on downstream tasks.","self-distillation, adaptation of pre-trained models, regularization" Iterative Relaxing Gradient Projection for Continual Learning,https://openreview.net/forum?id=MMBILyoRKQ,https://openreview.net/pdf?id=MMBILyoRKQ,We propose a novel gradient projection approach to facilitate forward knowledge transfer within a fixed network capacity by iteratively searching and relaxing the critical subspace of the frozen space.,"A critical capability for intelligent systems is to continually learn given a sequence of tasks. An ideal continual learner should be able to avoid catastrophic forgetting and effectively leverage past learned experiences to master new knowledge. Among different continual learning algorithms, gradient projection approaches impose hard constraints on the optimization space for new tasks to minimize task interference, yet hinder forward knowledge transfer at the same time. Recent methods use expansion-based techniques to relax the constraints, but a growing network can be computationally expensive. Therefore, it remains a challenge whether we can improve forward knowledge transfer for gradient projection approaches \textit{using a fixed network architecture}. In this work, we propose the Iterative Relaxing Gradient Projection (IRGP) framework. The basic idea is to iteratively search for the parameter subspaces most related to the current task and relax these parameters, then reuse the frozen spaces to facilitate forward knowledge transfer while consolidating previous knowledge. Our framework requires neither memory buffers nor extra parameters. Extensive experiments have demonstrated the superiority of our framework over several strong baselines. We also provide theoretical guarantees for our iterative relaxing strategies.","continual learning, gradient projection methods" Adversarial Counterfactual Environment Model Learning,https://openreview.net/forum?id=zT5T9gHpGI,https://openreview.net/pdf?id=zT5T9gHpGI,We propose a new environment model learning technique with better generalization ability on counterfactual data.,"A good model for action-effect prediction, i.e., the environment model, is essential for sample-efficient policy learning, in which the agent can take numerous free trials to find good policies.
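The self-distillation objective for further pre-training described above pairs a masked auto-encoding loss with a hidden-representation matching term. A rough PyTorch-style sketch follows; the weighting term distill_weight and the use of mean-squared error for the matching term are assumptions of this illustration.

```python
import torch.nn.functional as F

def self_distill_loss(recon, target, student_hidden, teacher_hidden,
                      distill_weight=1.0):
    """Masked auto-encoding reconstruction plus a term pulling the student's
    hidden states toward those of the frozen, further-pre-trained teacher."""
    mae = F.mse_loss(recon, target)                         # MAE objective
    distill = F.mse_loss(student_hidden, teacher_hidden.detach())
    return mae + distill_weight * distill
```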
Currently, the model is commonly learned by fitting historical transition data through empirical risk minimization (ERM). However, we discover that simple data fitting can lead to a model that is totally wrong in guiding policy learning due to the selection bias in offline dataset collection. In this work, we introduce weighted empirical risk minimization (WERM) to handle this problem in model learning. A typical WERM method utilizes inverse propensity scores to re-weight the training data to approximate the target distribution. However, during policy training, the data distributions of the candidate policies can be diverse and unknown. Thus, we propose an adversarial weighted empirical risk minimization (AWRM) objective that learns the model with respect to the worst case of the target distributions. We implement AWRM in a sequential decision structure, resulting in the GALILEO model learning algorithm. We also discover that GALILEO is closely related to adversarial model learning, explaining the empirical effectiveness of the latter. We apply GALILEO to synthetic tasks and verify that GALILEO makes accurate predictions on counterfactual data. We finally apply GALILEO to real-world offline policy learning tasks and find that GALILEO significantly improves policy performance in real-world testing.","offline environment model learning, reinforcement learning, causal inference" Admeta: A Novel Double Exponential Moving Average to Adaptive and Non-adaptive Momentum Optimizers with Bidirectional Looking,https://openreview.net/forum?id=MdSGM9PEQ7,https://openreview.net/pdf?id=MdSGM9PEQ7,"We propose a bidirectional-looking framework, Admeta, in which a novel double exponential moving average mechanism is proposed for adaptive and non-adaptive momentum optimizers.","The optimizer is an essential component for the success of deep learning: it guides the neural network to update its parameters according to the loss on the training set. SGD and Adam are two classical and effective optimizers on which researchers have proposed many variants, such as SGDM and RAdam. In this paper, we innovatively combine the backward-looking and forward-looking aspects of the optimizer algorithm and propose a novel \textsc{Admeta} (\textbf{A} \textbf{D}ouble exponential \textbf{M}oving averag\textbf{E} \textbf{T}o \textbf{A}daptive and non-adaptive momentum) optimizer framework. For the backward-looking part, we propose a DEMA variant scheme, which is motivated by a metric in the stock market, to replace the common exponential moving average scheme. For the forward-looking part, we present a dynamic lookahead strategy which asymptotically approaches a set value, maintaining its speed at the early stage and high convergence performance at the final stage. Based on this idea, we provide two optimizer implementations, \textsc{AdmetaR} and \textsc{AdmetaS}, the former based on RAdam and the latter based on SGDM. Through extensive experiments on diverse tasks, we find that the proposed \textsc{Admeta} optimizer outperforms our base optimizers and shows advantages over recently proposed competitive optimizers.
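The stock-market metric motivating Admeta's backward-looking part is the classical double exponential moving average, DEMA = 2·EMA(x) − EMA(EMA(x)), which tracks a signal with less lag than a single EMA. A minimal sketch of that statistic follows; how exactly it is wired into AdmetaR/AdmetaS is not spelled out above and is not reproduced here.

```python
def ema(prev, x, beta):
    """Standard exponential moving average update."""
    return beta * prev + (1.0 - beta) * x

def dema_step(state, g, beta=0.9):
    """DEMA of a gradient stream: ema2 smooths ema1, and 2*ema1 - ema2
    compensates for the lag introduced by double smoothing."""
    state["ema1"] = ema(state["ema1"], g, beta)
    state["ema2"] = ema(state["ema2"], state["ema1"], beta)
    return 2.0 * state["ema1"] - state["ema2"]

# usage: state = {"ema1": 0.0, "ema2": 0.0}; d = dema_step(state, grad)
```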
We also provide theoretical proofs for these two algorithms, verifying the convergence of our proposed \textsc{Admeta}.","optimizer, double exponential moving average, bidirectional looking, Adam, SGD" Learning from Interval-valued Data,https://openreview.net/forum?id=9NzCUqU7i1,https://openreview.net/pdf?id=9NzCUqU7i1,Learn a classifier with interval-valued observations using multi-view learning.,"The classification problem concerning crisp-valued data has been well resolved. However, interval-valued data, where all of the observations’ features are described by intervals, is also a common type of data in real-world scenarios. For example, the data extracted by many measuring devices are not exact numbers but intervals. In this paper, we focus on a highly challenging problem called learning from interval-valued data (LIND), where we aim to learn a classifier with high performance on interval-valued observations. First, we obtain the estimation error bound of the LIND problem based on Rademacher complexity. Then, we give a theoretical analysis showing the strengths of multi-view learning on classification problems, which inspires us to construct a new framework called the multi-view interval information extraction (Mv-IIE) approach for improving classification accuracy on interval-valued data. Experimental comparisons with several baselines on both synthetic and real-world datasets illustrate the superiority of the proposed framework in handling interval-valued data. Moreover, we describe an application of the Mv-IIE framework in which we can prevent data privacy leakage by transforming crisp-valued (raw) data into interval-valued data.","Machine learning, Interval-valued data, Classification" Feature Synchronization in Backdoor Attacks,https://openreview.net/forum?id=wxyLBOk-ag,https://openreview.net/pdf?id=wxyLBOk-ag,,"Backdoor attacks train models on a mixture of poisoned data and clean data to implant backdoor triggers into the model. An interesting phenomenon has been observed in the training process: the loss of poisoned samples tends to drop significantly faster than that of clean samples, which we call the early-fitting phenomenon. Early-fitting provides a simple but effective method to defend against backdoor attacks, as the poisoned samples can be identified by picking the samples with the lowest loss values in the early training epochs. Therefore, two natural questions arise: (1) What characteristics of poisoned samples cause early-fitting? (2) Is it possible to design stronger attacks to circumvent existing defense methods? To answer the first question, we find that early-fitting could be attributed to a unique property of poisoned samples called synchronization, which depicts the latent similarity between two samples. Meanwhile, the degree of synchronization could be explicitly controlled based on whether it is captured by shallow or deep layers of the model. Then, we give an affirmative answer to the second question by proposing a new backdoor attack method, Deep Backdoor Attack (DBA), which utilizes deep synchronization to reversely generate trigger patterns by activating neurons in the deep layer. Experimental results validate our propositions and the effectiveness of DBA.
Our code is available at https://anonymous.4open.science/r/Deep-Backdoor-Attack-8875","backdoor attacks, model interpretation" Efficient Hyperdimensional Computing,https://openreview.net/forum?id=9RQh6MOOaD,https://openreview.net/pdf?id=9RQh6MOOaD,"Based on a detailed analysis of dimension, accuracy, and orthogonality, this paper proposes a suite of novel techniques that reduce the hypervector dimension significantly while maintaining state-of-the-art accuracies and efficiency.","Hyperdimensional computing (HDC) uses binary vectors of high dimensions to perform classification. Due to its simplicity and massive parallelism, HDC can be highly energy-efficient and well-suited for resource-constrained platforms. However, in trading off orthogonality with efficiency, hypervectors may use tens of thousands of dimensions. In this paper, we examine the necessity of such high dimensions. In particular, we give a detailed theoretical analysis of the relationship among dimensions of hypervectors, accuracy, and orthogonality. The main conclusion of this study is that a much lower dimension, typically less than 100, can achieve similar or even higher detection accuracy compared with state-of-the-art HDC models. Based on this insight, we propose a suite of novel techniques to build HDC models that use binary hypervectors of dimensions that are orders of magnitude smaller than those found in the state-of-the-art HDC models, yet yield equivalent or even improved accuracy and efficiency. For image classification, we achieved an HDC accuracy of 96.88\% with a dimension of only 32 on the MNIST dataset. We further explore our methods on more complex datasets like CIFAR-10 and show the limits of HDC.",Hyperdimensional computing Contextual Convolutional Networks,https://openreview.net/forum?id=PldynS56bN,https://openreview.net/pdf?id=PldynS56bN,"In this paper, we propose to augment potential category memberships as contextual priors in the convolution for contextualized representation learning.","This paper presents a new Convolutional Neural Network, named Contextual Convolutional Network, that capably serves as a general-purpose backbone for visual recognition. Most existing convolutional backbones follow the representation-to-classification paradigm, where representations of the input are first generated by category-agnostic convolutional operations and then fed into classifiers for specific perceptual tasks (e.g., classification and segmentation). In this paper, we deviate from this classic paradigm and propose to augment potential category memberships as contextual priors in the convolution for contextualized representation learning. Specifically, the top-k most likely classes from the preceding stage are encoded as a contextual prior vector. Based on this vector and the preceding features, offsets for spatial sampling locations and kernel weights are generated to modulate the convolution operations. The new convolutions can readily replace their plain counterparts in existing CNNs and can be easily trained end-to-end by standard back-propagation without additional supervision. The qualities of Contextual Convolutional Networks make them compatible with a broad range of vision tasks and boost the state-of-the-art architecture ConvNeXt-Tiny by 1.8% in top-1 accuracy on ImageNet classification. The superiority of the proposed model reveals the potential of contextualized representation learning for vision tasks. Code will be released in the final version.
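A generic hyperdimensional classification pipeline of the kind analyzed in the HDC entry above fits in a few lines. This sketch uses a random bipolar projection encoder and per-class bundling, with dim=32 echoing the low-dimensional regime the paper studies; the paper's own encoding and training refinements may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, proj):
    """Bipolar hypervector: sign of a random projection of the input."""
    return np.where(proj @ x >= 0, 1.0, -1.0)

def train_hdc(X, y, n_classes, dim=32):
    proj = rng.standard_normal((dim, X.shape[1]))
    protos = np.zeros((n_classes, dim))
    for xi, yi in zip(X, y):
        protos[yi] += encode(xi, proj)          # bundle class examples
    return proj, np.where(protos >= 0, 1.0, -1.0)

def predict(x, proj, protos):
    return int(np.argmax(protos @ encode(x, proj)))  # most similar class
```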
",Convolutional Neural Networks Factor Learning Portfolio Optimization Informed by Continuous-Time Finance Models,https://openreview.net/forum?id=UJ4nGMHZYI,https://openreview.net/pdf?id=UJ4nGMHZYI,,"We study financial portfolio optimization in the presence of unknown and uncontrolled system variables referred to as stochastic factors. Existing work falls into two distinct categories: (i) reinforcement learning employs end-to-end policy learning with flexible factor representation, but does not precisely model the dynamics of asset prices or factors; (ii) continuous-time finance methods, in contrast, take advantage of explicitly modeled dynamics but pre-specify, rather than learn, factor representation. We propose FaLPO (factor learning portfolio optimization), a framework that interpolates between these two approaches. Specifically, FaLPO hinges on deep policy gradient to learn a performant investment policy that takes advantage of flexible representation for stochastic factors. Meanwhile, FaLPO also incorporates continuous-time finance models when modeling the dynamics. It uses the optimal policy functional form derived from such models and optimizes an objective that combines policy learning and model calibration. We prove the convergence of FaLPO, and provide performance guarantees via a finite-sample bound. On both synthetic and real-world portfolio optimization tasks, we observe that FaLPO outperforms five leading methods. Finally, we show that FaLPO can be extended to other decision-making problems with stochastic factors.", An Incremental Learning Approach for Sustainable Regional Isolation and Integration,https://openreview.net/forum?id=mNNAjdv3Am,https://openreview.net/pdf?id=mNNAjdv3Am,"Sustainable regional isolation and integration contribute to incremental learning while alleviating ""recency bias"".","Humans are capable of acquiring new knowledge on a constant basis, while integrating and optimizing old knowledge without forgetting them. This is mainly attributed to the human brain’s ability of partitioned learning and memory replay. In this paper, we simulate this ability and propose an incremental learning network of Sustainable Regional Isolation and Integration (SRII). SRII consists of two phases, regional isolation and regional integration, which are iterated to achieve continuous incremental class learning. Regional isolation isolates new learning processes to avoid interfering with existing knowledge, while regional integration introduces knowledge distillation and margin loss regularization term, knowledge distillation to transfer replay knowledge for alleviating catastrophic forgetting, margin loss regularization term to clarify the boundaries of new and old knowledge for alleviating recency bias. Experimental results on the CIFAR100 and miniImageNet datasets demonstrate that SRII outperforms the state-of-the-arts to avoid catastrophic forgetting. In all 5-stage and 10-stage incremental settings, SRII outperforms the baseline and achieves at least $5.27\%+$ average accuracy improvement. 
Our source code is available at https://github.com/Wuziyi123/SRII.","Continual Learning, Incremental Learning, Catastrophic Forgetting, Memory Replay, Regional Isolation, Regional Integration, Alleviate Recency Bias" GraphPNAS: Learning Distribution of Good Neural Architectures via Deep Graph Generative Models,https://openreview.net/forum?id=c0U6KmokuFK,https://openreview.net/pdf?id=c0U6KmokuFK,Learning Distribution of Good Neural Architectures via Probabilistic Deep Graph Generative Models,"Neural architectures can be naturally viewed as computational graphs. Motivated by this perspective, we, in this paper, study neural architecture search (NAS) through the lens of learning random graph models. In contrast to existing NAS methods which largely focus on searching for a single best architecture, i.e., point estimation, we propose GraphPNAS, a deep graph generative model that learns a distribution of well-performing architectures. Relying on graph neural networks (GNNs), our GraphPNAS can better capture topologies of good neural architectures and relations between operators therein. Moreover, our graph generator leads to a learnable probabilistic search method that is more flexible and efficient than the commonly used RNN generator and random search methods. Finally, we learn our generator via an efficient reinforcement learning formulation for NAS. To assess the effectiveness of our GraphPNAS, we conduct extensive experiments on three search spaces, including the challenging RandWire on TinyImageNet, ENAS on CIFAR10, and NAS-Bench-101. The complexity of RandWire is significantly larger than that of other search spaces in the literature. We show that our proposed graph generator consistently outperforms the RNN-based one and achieves better or comparable performance to state-of-the-art NAS methods.","Neural Architecture Search, Deep Generative Models of Graphs, Graph Neural Networks" "Private GANs, Revisited",https://openreview.net/forum?id=QEmn_Hvh7j8,https://openreview.net/pdf?id=QEmn_Hvh7j8,,"We show that with improved training, the standard approach for differentially private GANs -- updating the discriminator with noisy gradients -- achieves or competes with state-of-the-art results for private image synthesis. Existing instantiations of this approach neglect to consider how adding noise only to discriminator updates disrupts the careful balance between generator and discriminator necessary for successful GAN training. We show that a simple fix restores parity: taking more discriminator steps between generator steps. Finally, with the goal of restoring parity between generator and discriminator, we experiment with further modifications to improve discriminator training and see further improvements. For MNIST at $\epsilon=10$, our private GANs improve the record FID from 48.4 to 13.0, as well as downstream classifier accuracy from 83.2\% to 95.0\%.", Hidden Poison: Machine unlearning enables camouflaged poisoning attacks,https://openreview.net/forum?id=MWoZh1gvbxA,https://openreview.net/pdf?id=MWoZh1gvbxA,We show that machine unlearning can be used to implement a new type of camouflaged data poisoning attack.,"We introduce camouflaged data poisoning attacks, a new attack vector that arises in the context of machine unlearning and other settings when model retraining may be induced. An adversary first adds a few carefully crafted points to the training dataset such that the impact on the model's predictions is minimal.
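The Private GANs fix described above is procedural: take several noisy discriminator steps between generator steps. The helper below is a simplified, batch-level stand-in for the per-example gradient clipping of DP-SGD, so treat it as a schematic rather than a privacy-correct implementation; all names here are invented for illustration.

```python
import torch

def clip_and_noise_(params, clip_norm, noise_mult):
    """Schematic DP-style gradient processing: clip the global gradient norm,
    then add Gaussian noise scaled by the clip norm. Real DP-SGD clips
    per-example gradients; this batch-level version only illustrates the idea."""
    grads = [p.grad for p in params if p.grad is not None]
    total = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    scale = min(1.0, float(clip_norm / (total + 1e-12)))
    for g in grads:
        g.mul_(scale).add_(torch.randn_like(g), alpha=noise_mult * clip_norm)

# Training skeleton: call clip_and_noise_ after each discriminator backward
# pass, and take n_d such discriminator steps per (noise-free) generator step.
```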
The adversary subsequently triggers a request to remove a subset of the introduced points, at which point the attack is unleashed and the model's predictions are negatively affected. In particular, we consider clean-label targeted attacks (in which the goal is to cause the model to misclassify a specific test point) on datasets including CIFAR-10, Imagenette, and Imagewoof. This attack is realized by constructing camouflage datapoints that mask the effect of a poisoned dataset.","Machine Unlearning, Poisoning Attack, Camouflaging Poisons" Improve the Adaptation Process by Reasoning From Failed and Successful Cases,https://openreview.net/forum?id=K9DghwWLWF3,https://openreview.net/pdf?id=K9DghwWLWF3,This work presents a new approach to the adaptation process in the case-based reasoning paradigm,"Usually, existing works on adaptation in reasoning-based systems assume that the case base holds only successful cases, i.e., cases having solutions believed to be appropriate for the corresponding problems. However, in practice, the case base could hold failed cases, resulting from an earlier adaptation process but discarded by the revision process. Not considering failed cases would miss an interesting opportunity to learn additional knowledge for improving the adaptation process. This paper proposes a novel approach to the adaptation process in the case-based reasoning paradigm, based on an improved barycentric approach that considers failed cases. The experiment performed on real data demonstrates the benefit of considering failed cases in the adaptation process compared to classical methods that ignore them, thus improving the performance of the case-based reasoning system.","case-based reasoning, adaptation, failed cases, artificial potential field" FEW-SHOT NODE PROMPT TUNING,https://openreview.net/forum?id=ATWW-bUtxH,https://openreview.net/pdf?id=ATWW-bUtxH,"In this paper, we propose Few-shot Node Prompt Tuning as an effective method to tackle general few-shot node classification tasks.","Despite the powerful representation ability of GNNs, recent works have demonstrated that the performance of GNNs can severely degrade when the number of labeled nodes is limited in training data. \textit{Few-shot Node Classification} is one of the problems with an extreme shortage of node labels and has drawn growing attention lately. The current modus operandi, i.e., meta-learning, has succeeded in transferring the structural knowledge learned from \textit{base classes} with abundant labeled nodes to few-shot \textit{novel classes}. However, in real-world scenarios, it is often the case that all the classes on the graph have limited labeled nodes, so meta-learning cannot be directly deployed. In this work, we generalize few-shot node classification by removing the assumption that there exist abundant labeled nodes for the base classes. In the meantime, we propose a novel \textit{Few-shot Node Prompt Tuning} method to effectively elicit substantial prior knowledge in the input graph for solving few-shot node classification tasks without labeled base classes. Specifically, we fix a pretrained graph transformer as the encoder and inject virtual nodes as soft prompts in the embedding space to bridge the gap in training objectives between the pretext tasks and downstream few-shot node classification tasks. Such prompts are small tensors and can be efficiently optimized with a simple classifier corresponding to the few labeled nodes.
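The virtual-node soft prompts just described amount to a small learnable tensor prepended in embedding space to a frozen encoder. A sketch under assumptions: the encoder is taken to map [M, d] token embeddings to [M, d] outputs, and all names here are invented for this illustration.

```python
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    """Frozen encoder plus k learnable virtual-node prompts prepended in
    embedding space; only the prompts and the small head are trained."""
    def __init__(self, encoder, embed_dim, k_prompts=4, n_classes=5):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        self.prompts = nn.Parameter(0.02 * torch.randn(k_prompts, embed_dim))
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, node_embeds):            # node_embeds: [N, embed_dim]
        x = torch.cat([self.prompts, node_embeds], dim=0)
        h = self.encoder(x)                    # assumed [M, d] -> [M, d]
        return self.head(h[self.prompts.shape[0]:])
```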
Since a single pretrained encoder is shared across different tasks, the proposed method retains efficiency and the potential for model ensembling. Extensive experiments on four prevalent node classification datasets show that the proposed method, FS-NPT, is an efficient and effective way to tackle the general few-shot node classification problem. Our implementation is released\footnote{\url{https://github.com/Anonymous-submit-23/FS-NPT.git}}.","node classification, few-shot learning, graph neural networks" Statistical Inference for Fisher Market Equilibrium,https://openreview.net/forum?id=KemSBwOYJC,https://openreview.net/pdf?id=KemSBwOYJC,We propose a statistical inference framework for Fisher market equilibrium.,"Statistical inference under market equilibrium effects has attracted increasing attention recently. In this paper we focus on the specific case of linear Fisher markets. They have been widely used in fair resource allocation of food/blood donations and in budget management in large-scale Internet ad auctions. In resource allocation, it is crucial to quantify the variability of the resource received by the agents (such as blood banks and food banks) in addition to the fairness and efficiency properties of the systems. For ad auction markets, it is important to establish statistical properties of the platform's revenues in addition to their expected values. To this end, we propose a statistical framework based on the concept of infinite-dimensional Fisher markets. In our framework, we observe a market formed by a finite number of items sampled from an underlying distribution (the ``observed market'') and aim to infer several important equilibrium quantities of the underlying long-run market. These equilibrium quantities include individual utilities, social welfare, and pacing multipliers. Through the lens of sample average approximation (SAA), we derive a collection of statistical results and show that the observed market provides useful statistical information about the long-run market. In other words, the equilibrium quantities of the observed market converge to the true ones of the long-run market with strong statistical guarantees. These include consistency, finite-sample bounds, asymptotics, and confidence intervals. As an extension, we discuss revenue inference in quasilinear Fisher markets.","Fisher market equilibrium, first-price auction, statistical inference under interference, revenue management" Auxiliary task discovery through generate and test,https://openreview.net/forum?id=z4g0Vpf5Zki,https://openreview.net/pdf?id=z4g0Vpf5Zki,,"In this paper, we explore an approach to auxiliary task discovery in reinforcement learning based on ideas from representation learning. Auxiliary tasks tend to improve data efficiency by forcing the agent to learn auxiliary prediction and control objectives in addition to the main task of maximizing reward, thus producing better representations. Typically these tasks are designed by people. Meta-learning offers a promising avenue for automatic task discovery; however, these methods are computationally expensive and challenging to tune in practice. In this paper, we explore a complementary approach to auxiliary task discovery: continually generating new auxiliary tasks and preserving only those with high utility. We also introduce a new measure of auxiliary tasks' usefulness based on how useful the features induced by them are for the main task.
Our discovery algorithm significantly outperforms random tasks, hand-designed tasks, and learning without auxiliary tasks across a suite of environments.", MMTSA: Multi-Modal Temporal Segment Attention Network for Efficient Human Activity Recognition,https://openreview.net/forum?id=vQXbQEDJi5,https://openreview.net/pdf?id=vQXbQEDJi5,,"Multimodal sensors (e.g., visual, non-visual, and wearable) provide complementary information to develop robust perception systems for recognizing activities. However, most existing algorithms use dense sampling and heterogeneous sub-networks to extract unimodal features and fuse them at the end of their framework, which causes data redundancy, a lack of multimodal complementary information, and high computational cost. In this paper, we propose a novel multimodal neural architecture based on RGB and IMU wearable sensors (e.g., accelerometer, gyroscope) for human activity recognition, called Multimodal Temporal Segment Attention Network (MMTSA). MMTSA first employs a multimodal data isomorphism mechanism based on the Gramian Angular Field (GAF) and then applies a novel multimodal sparse sampling method to reduce redundancy. Moreover, we propose an inter-segment attention module in MMTSA to fuse multimodal features effectively and efficiently. We demonstrate the importance of IMU data imaging and the attention mechanism in human activity recognition through rigorous evaluation on three public datasets, achieving superior improvements ($11.13\%$ on the MMAct dataset) over the previous state-of-the-art methods.","Multimodal Learning, Human Activity Recognition" Scenario-based Question Answering with Interacting Contextual Properties,https://openreview.net/forum?id=tPrRs6YB2P,https://openreview.net/pdf?id=tPrRs6YB2P,We propose a model for scenario-based QA which requires reasoning over multiple contextual properties in user scenarios to find answers that are consistent with the scenarios and to identify necessary information which is missing from the scenarios.,"In the scenario-based Question Answering (QA) task, models are asked to find answers that are appropriate to the user scenarios associated with the question and identify information that is missing from the scenarios but is necessary for the answers to hold. Scenarios commonly include multiple properties of users, such as age, employment status, and income level for the question “How much can I claim from this benefit”. The properties relevant to a potential answer are given in a document, which will state conditions necessary for the answer to hold. Documents may also specify how conditions interact with each other, e.g. with text like “one of the conditions below must apply”. Although understanding the relationship between conditions is crucial for solving this challenging QA task, limited work has been done so far in modeling this. In this paper, we propose the T-Reasoner model, which solves this problem with three jointly learned modules: an entailment module which checks whether a condition has been satisfied by the scenario, a decoding module which locates eligible answers from documents, and a reasoning module which infers the relationship between conditions and performs a reasoning step to determine the logically consistent answers and identify missing conditions.
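Referring back to MMTSA's GAF-based isomorphism step above: the Gramian Angular Field that turns IMU streams into images has a standard closed form. A sketch of the summation variant follows (whether MMTSA uses the summation or the difference field is not stated above).

```python
import numpy as np

def gramian_angular_field(x):
    """Gramian Angular Summation Field of a 1-D signal: rescale to [-1, 1],
    take phi = arccos(x), and form G[i, j] = cos(phi_i + phi_j)."""
    x = np.asarray(x, dtype=float)
    x = 2.0 * (x - x.min()) / (x.max() - x.min() + 1e-12) - 1.0
    phi = np.arccos(np.clip(x, -1.0, 1.0))
    return np.cos(phi[:, None] + phi[None, :])
```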
T-Reasoner outperforms strong baselines on a synthetic scenario-based QA dataset and achieves a new state-of-the-art on two scenario-based QA benchmarks, outperforming the prior best models by 3-10 points.",Question Answering Easy Differentially Private Linear Regression,https://openreview.net/forum?id=rSUCajhLsQ,https://openreview.net/pdf?id=rSUCajhLsQ,A practical algorithm for differentially private linear regression which does not require data bounds or parameter tuning but is competitive with methods that do.,"Linear regression is a fundamental tool for statistical analysis. This has motivated the development of linear regression methods that also satisfy differential privacy and thus guarantee that the learned model reveals little about any one data point used to construct it. However, existing differentially private solutions assume that the end user can easily specify good data bounds and hyperparameters. Both present significant practical obstacles. In this paper, we study an algorithm which uses the exponential mechanism to select a model with high Tukey depth from a collection of non-private regression models. Given $n$ samples of $d$-dimensional data used to train $m$ models, we construct an efficient analogue using an approximate Tukey depth that runs in time $O(d^2n + dm\log(m))$. We find that this algorithm obtains strong empirical performance in the data-rich setting with no data bounds or hyperparameter selection required.","differential privacy, linear regression" PointDP: Diffusion-driven Purification against 3D Adversarial Point Clouds,https://openreview.net/forum?id=293zPCqNqe,https://openreview.net/pdf?id=293zPCqNqe,"We propose PointDP, a diffusion-driven purification strategy to defend against adversarial point clouds. PointDP consistently achieves the strongest robustness under various attacks.","3D point clouds are a critical data representation in many real-world applications, such as autonomous driving, robotics, and medical imaging. Although the success of deep learning further accelerates the adoption of 3D point clouds in the physical world, deep learning is notoriously vulnerable to adversarial attacks. Various defense solutions have been proposed to build robust models against adversarial attacks. In this work, we identify that the state-of-the-art empirical defense, adversarial training, has a major limitation in 3D point cloud models due to gradient obfuscation, resulting in significant degradation of robustness against strong attacks. To bridge the gap, we propose PointDP, a purification strategy that leverages diffusion models to defend against 3D adversarial attacks. Since PointDP does not rely on predefined adversarial examples for training, it can defend against diverse threats. We extensively evaluate PointDP on six representative 3D point cloud architectures and leverage sixteen strong and adaptive attacks to demonstrate its lower-bound robustness. Our evaluation shows that PointDP achieves significantly better (i.e., 12.6\%-40.3\%) adversarial robustness than state-of-the-art methods under strong attacks bounded by different $\ell_p$ norms.","Adversarial Robustness, Point Cloud Classification, Diffusion Model" Deep Physics-based Deformable Models for Efficient Shape Abstractions,https://openreview.net/forum?id=_MlB0iqfmM,https://openreview.net/pdf?id=_MlB0iqfmM,,"Efficient shape abstraction with explainability is challenging due to the complex geometries of natural objects.
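The selection step in the private regression method above is the exponential mechanism applied to depth scores. The generic mechanism is standard and sketched below; the approximate Tukey-depth scoring it would consume is paper-specific and omitted here.

```python
import numpy as np

def exponential_mechanism(scores, sensitivity, eps, rng):
    """Sample index i with probability proportional to
    exp(eps * scores[i] / (2 * sensitivity)), computed in log-space."""
    logits = eps * np.asarray(scores, dtype=float) / (2.0 * sensitivity)
    logits -= logits.max()                    # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(scores), p=probs))

# usage: idx = exponential_mechanism(depths, 1.0, eps=1.0,
#                                    rng=np.random.default_rng(0))
```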
Recent methods learn to represent objects using a set of simple primitives or fit locally parameterized deformable models to the target shapes. However, these methods are either limited in geometric flexibility or fail to intrinsically offer shape abstractions with explainability. In this paper, we investigate salient and efficient primitive descriptors for accurate shape abstractions, and propose the \textit{Deep Physics-based Deformable Model (DPDM)}. DPDM employs global deformations with parameter functions and local deformations. These properties enable DPDM to abstract complex object shapes with significantly fewer primitives that offer broader geometry coverage and finer details. The DPDM learning formulation is based on physics-based modeling (i.e., dynamics and kinematics) to enable multiscale explainable abstractions. The proposed DPDM is evaluated on two different shape abstraction tasks: 3D shape reconstruction and object segmentation. Extensive experiments on \textit{ShapeNet} demonstrate that DPDM outperforms the state-of-the-art methods in terms of reconstruction accuracy and is more robust since it uses far fewer primitives. We conduct comprehensive experiments on \textit{ACDC}, \textit{M\&Ms}, and \textit{M\&Ms-2} for cardiac MR segmentation, and show the leading abstraction performance of our approach.","Deformable models, Shape abstraction, Deep learning" Benchmarking and Improving Robustness of 3D Point Cloud Recognition against Common Corruptions,https://openreview.net/forum?id=wshUUnnDjc,https://openreview.net/pdf?id=wshUUnnDjc,"We propose ModelNet40-C, a novel corruption robustness dataset and benchmark for point cloud recognition, with RobustNet and PointCutMixup to further improve the robustness.","Deep neural networks on 3D point cloud data have been widely used in the real world, especially in safety-critical applications. However, their robustness against corruptions is less studied. In this paper, we present ModelNet40-C, a comprehensive benchmark on 3D point cloud corruption robustness, consisting of 15 common and realistic corruptions. Our evaluation shows a significant gap between the performances on ModelNet40 and ModelNet40-C for state-of-the-art models. We identify a number of critical insights for future studies on corruption robustness in point cloud recognition. For instance, we unveil that Transformer-based architectures with proper training recipes achieve the strongest robustness. To bridge this gap, we further propose RobustNet and PointCutMixup, which embrace the merits of existing architectural designs to further improve the corruption robustness in the 3D point cloud domain, after evaluating a wide range of augmentation and test-time adaptation strategies. Our codebase and dataset are open-sourced.","Corruption Robustness Benchmark, Point Cloud Classification, Data Augmentation" Visual Recognition with Deep Nearest Centroids,https://openreview.net/forum?id=CsKwavjr7A,https://openreview.net/pdf?id=CsKwavjr7A,,"We devise deep nearest centroids (DNC), a conceptually elegant yet surprisingly effective network for large-scale visual recognition, by revisiting Nearest Centroids, one of the most classic and simple classifiers. Current deep models learn the classifier in a fully parametric manner, ignoring the latent data structure and lacking simplicity and explainability.
DNC instead conducts nonparametric, case-based reasoning; it utilizes sub-centroids of training samples to describe class distributions and clearly explains the classification as the proximity between test data and the class sub-centroids in the feature space. Due to the distance-based nature, the network output dimensionality is flexible, and all the learnable parameters are only for data embedding. That means all the knowledge learnt for ImageNet classification can be completely transferred for pixel recognition learning, under the ‘pre-training and fine-tuning’ paradigm. Apart from its nested simplicity and intuitive decision-making mechanism, DNC can even possess ad-hoc explainability when the sub-centroids are selected as actual training images that humans can view and inspect. Compared with parametric counterparts, DNC performs better on image classification (CIFAR-10, ImageNet) and greatly boosts pixel recognition (ADE20K, Cityscapes), with improved transparency and fewer learnable parameters, using various network architectures (ResNet, Swin) and segmentation models (FCN, DeepLabV3, Swin). We feel this work brings fundamental insights into related fields. Our code will be released.","Nearest centroids classifier, Case-based reasoning, Image classification, Image segmentation, Explainable neural networks" Closing the Gap Between SVRG and TD-SVRG with Gradient Splitting,https://openreview.net/forum?id=_AkC4QYxF5,https://openreview.net/pdf?id=_AkC4QYxF5,We prove a linear convergence time for an SVRG-inspired temporal difference method which is identical to the original convergence time bound of SVRG in the convex setting.,"Temporal difference (TD) learning is a simple algorithm for policy evaluation in reinforcement learning. The performance of TD learning is affected by high variance, and it can be naturally enhanced with variance reduction techniques, such as the Stochastic Variance Reduced Gradient (SVRG) method. Recently, multiple works have sought to fuse TD learning with SVRG to obtain a policy evaluation method with a linear rate of convergence. However, the resulting convergence rate is significantly weaker than what is achieved by SVRG in the setting of convex optimization. In this work, we utilize a recent interpretation of TD learning as the splitting of the gradient of an appropriately chosen function, thus simplifying the algorithm and fusing TD with SVRG. We prove a linear convergence bound that is identical to the convergence bound available for SVRG in the convex setting.","Temporal Difference learning, Reinforcement Learning, SVRG, Optimization" Rethinking Backdoor Data Poisoning Attacks in the Context of Semi-Supervised Learning,https://openreview.net/forum?id=OT1xF6_56J,https://openreview.net/pdf?id=OT1xF6_56J,We investigate vulnerabilities of semi-supervised learning methods to backdoor data poisoning attacks in unlabeled data and identify characteristics necessary for attack success.,"Semi-supervised learning methods can train high-accuracy machine learning models with a fraction of the labeled training samples required for traditional supervised learning. Such methods do not typically involve close review of the unlabeled training samples, making them tempting targets for data poisoning attacks. In this paper, we investigate the vulnerabilities of semi-supervised learning methods to backdoor data poisoning attacks on the unlabeled samples.
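The distance-based decision rule DNC describes above is nearest (sub-)centroid in an embedding space. A sketch, with the embedding and the sub-centroid construction assumed to be given:

```python
import numpy as np

def dnc_predict(z, sub_centroids, sub_labels):
    """Nearest-sub-centroid decision: z is an embedded test point,
    sub_centroids is [K, d], and sub_labels[k] is the class of centroid k."""
    dists = np.linalg.norm(sub_centroids - z, axis=1)
    return sub_labels[int(np.argmin(dists))]
```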
We show that a simple poisoning attack using adversarially perturbed samples is highly effective, achieving an average attack success rate of 93.6%. We introduce a generalized attack framework targeting semi-supervised learning methods to better understand and exploit their limitations and to motivate future defense strategies.","data poisoning, backdoor attacks, semi-supervised learning" Categorial Grammar Induction as a Compositionality Measure for Emergent Languages in Signaling Games,https://openreview.net/forum?id=SI0ON7mZYY,https://openreview.net/pdf?id=SI0ON7mZYY,This paper proposes a method for investigating the non-trivially compositional structure of emergent languages using Categorial Grammar Induction.,"This paper proposes a method to analyze the compositional structure of emergent languages using Categorial Grammar Induction (CGI). Emergent languages are communication protocols arising among agents in environments such as signaling games. Previous work has studied how similar or dissimilar emergent languages are to natural languages in compositionality. However, most of these studies focused on trivial compositionality, assuming flat structures in languages. We further focus on non-trivial compositionality, i.e., the relationship between hierarchical syntax and semantics. To this end, we apply CGI to emergent languages, inspired by previous NLP work. Given sentence-meaning pairs of a language, CGI induces 1) a categorial grammar that describes the syntax of the language and 2) a semantic parser that compositionally maps sentences to meanings. We also propose compositionality measures based on the grammar size and semantic parser performance. CGI and the proposed measures enable deeper insights into the non-trivial compositionality of emergent languages, while correlating well with existing measures like TopSim.","Emergent Communication, Emergent Language, Categorial Grammar Induction, Syntax, Compositionality" LPT: Long-tailed Prompt Tuning for Image Classification,https://openreview.net/forum?id=8pOVAeo8ie,https://openreview.net/pdf?id=8pOVAeo8ie,,"For long-tailed classification tasks, most works often pretrain a big model on a large-scale (unlabeled) dataset, and then fine-tune the whole pretrained model to adapt it to long-tailed data. Though promising, fine-tuning the whole pretrained model tends to incur high computation and deployment costs for different models across different tasks, as well as weakened generalization capability caused by overfitting to certain features of long-tailed data. To alleviate these issues, we propose an effective Long-tailed Prompt Tuning (LPT) method for long-tailed classification tasks. LPT introduces several trainable prompts into a frozen pretrained model to adapt it to long-tailed data. For better effectiveness, we divide prompts into two groups: 1) a shared prompt for the whole long-tailed dataset to learn general features and to adapt the pretrained model to the target long-tailed domain; and 2) group-specific prompts to gather group-specific features for samples that have similar features and also to empower the pretrained model with fine-grained discrimination ability. Then we design a two-phase training paradigm to learn these prompts. In the first phase, we train the shared prompt via conventional supervised prompt tuning to adapt the pretrained model to the desired long-tailed domain.
In the second phase, we use the learnt shared prompt as a query to select a small, best-matched set of prompts for a group of similar samples from the group-specific prompt set to mine the common features of these similar samples, and then optimize these prompts with a dual sampling strategy and the asymmetric Gaussian Clouded Logit loss. By only fine-tuning a few prompts while fixing the pretrained model, LPT can reduce training and deployment costs, since only a few prompts need to be stored, and it enjoys the strong generalization ability of the pretrained model. Experiments show that on various long-tailed benchmarks, with only $\sim$1.1\% extra trainable parameters, LPT achieves comparable or higher performance than previous whole-model fine-tuning methods, and is more robust to domain shift.", Interpretable Out-of-Distribution Detection using Pattern Identification,https://openreview.net/forum?id=APkMDZtY9HL,https://openreview.net/pdf?id=APkMDZtY9HL,We apply pattern detection to Out-of-Distribution detection on an extensive benchmark.,"Out-of-distribution (OoD) detection for data-based programs is a goal of paramount importance. Common approaches in the literature tend to train binary classifiers requiring inside-of-distribution (IoD) and OoD validation samples, and/or implement confidence metrics that are often abstract and therefore difficult to interpret. In this work, we propose to use the PARTICUL pattern identification algorithm in order to build more interpretable and robust OoD detectors for visual classifiers. Crucially, this approach does not require retraining the classifier and is tuned directly to the IoD dataset, making it applicable to domains where OoD does not have a clear definition. Moreover, pattern identification allows us to provide images from the IoD dataset as reference points to better explain our confidence scores. We illustrate the generalization abilities of our approach through an extensive benchmark across four datasets and two definitions of OoD. Our experiments show that the robustness of all metrics under test does not solely depend on the nature of the IoD dataset or the OoD definition, but also on the architecture of the classifier, which stresses the need for thorough experimentation in future work on OoD detection.","out-of-distribution detection, pattern detection, interpretable artificial intelligence, confidence, metric" TopoZero: Digging into Topology Alignment on Zero-Shot Learning,https://openreview.net/forum?id=GOEpRos3w0L,https://openreview.net/pdf?id=GOEpRos3w0L,"We utilize persistent homology to investigate geometry structure alignment, based on which we propose the TopoZero framework to achieve multi-dimensional structure alignment.","Common space learning, associating semantic and visual domains in a common latent space, is essential to transfer knowledge from seen classes to unseen ones in the Zero-Shot Learning (ZSL) realm. Existing methods for common space learning rely heavily on structure alignment due to the heterogeneous nature of the semantic and visual domains, but the existing design is sub-optimal. In this paper, we utilize persistent homology to investigate geometry structure alignment, and observe the two following issues: (i) The sampled mini-batch data points present a distinct structure gap compared to global data points, thus the learned structure alignment space inevitably neglects abundant and accurate global structure information.
(ii) The latent visual and semantic spaces fail to preserve multi-dimensional geometry structure, especially high-dimensional structure information. To address the first issue, we propose a Topology-guided Sampling Strategy (TGSS) to mitigate the gap between sampled and global data points. Both theoretical analyses and empirical results guarantee the effectiveness of the TGSS. To solve the second issue, we introduce a Topology Alignment Module (TAM) to preserve multi-dimensional geometry structure in latent visual and semantic space, respectively. The proposed method is dubbed TopoZero. Empirically, our TopoZero achieves superior performance on three authoritative ZSL benchmark datasets.","Zero-Shot Learning, Structure Alignment, Persistent Homology" Digging into Backbone Design on Face Detection,https://openreview.net/forum?id=NkJOhtNKX91,https://openreview.net/pdf?id=NkJOhtNKX91,"We propose a novel DDSAR score to characterize stage-wise detection ability, based on which, we employ off-the-shelf NAS technology to search FD-friendly backbone architectures.","Face detection (FD) has achieved remarkable success over the past few years, yet these leaps often arrive at the cost of enormous computation. Moreover, when considering a realistic situation, i.e., building a lightweight face detector under a computation-scarce scenario, such heavy computation cost limits the application of the face detector. To remedy this, several pioneering works design tiny face detectors through off-the-shelf neural architecture search (NAS) technologies, which are usually applied to the classification task. However, the searched architectures are sub-optimal for the face detection task since some design criteria differ between the detection and classification tasks. As a representative, the face detection backbone design needs to guarantee the stage-level detection ability, while this is not required for the classification backbone. Furthermore, the detection backbone accounts for a vast share of the inference cost in detection frameworks. Considering the intrinsic design property and the vital role of the face detection backbone, we thus ask a critical question: How to employ NAS to search for FD-friendly backbone architectures? To cope with this question, we propose a distribution-dependent stage-aware ranking score (DDSAR-Score) to explicitly characterize the stage-level expressivity and identify the individual importance of each stage, thus satisfying the aforementioned design criterion of the FD backbone. Based on our proposed DDSAR-Score, we conduct comprehensive experiments on the challenging Wider Face benchmark dataset and achieve dominant performance across a wide range of compute regimes. In particular, compared to the tiniest face detector SCRFD-0.5GF, our method is +2.5% better in Average Precision (AP) score when using the same amount of FLOPs.","Face Detection, Neural Architecture Search, Network Expressivity" Towards Stable Test-time Adaptation in Dynamic Wild World,https://openreview.net/forum?id=g2YraF75Tj,https://openreview.net/pdf?id=g2YraF75Tj,Propose a Sharpness-aware and Reliable entropy minimization method to make online test-time adaptation stable under wild test scenarios: 1) small batch sizes; 2) mixed distribution shifts; 3) imbalanced online label distribution shifts.,"Test-time adaptation (TTA) has been shown to be effective at tackling distribution shifts between training and testing data by adapting a given model on test samples. 
However, the online model updating of TTA may be unstable, and this is often a key obstacle preventing existing TTA methods from being deployed in the real world. Specifically, TTA may fail to improve or even harm the model performance when test data have: 1) mixed distribution shifts, 2) small batch sizes, and 3) online imbalanced label distribution shifts, which are quite common in practice. In this paper, we investigate the reasons for this instability and find that the batch norm layer is a crucial factor hindering TTA stability. Conversely, TTA can perform more stably with batch-agnostic norm layers, i.e., group or layer norm. However, we observe that TTA with group and layer norms does not always succeed and still suffers many failure cases. By digging into the failure cases, we find that certain noisy test samples with large gradients may disturb the model adaptation and result in collapsed trivial solutions, i.e., assigning the same class label to all samples. To address the above collapse issue, we propose a sharpness-aware and reliable entropy minimization method, called SAR, for further stabilizing TTA from two aspects: 1) removing partial noisy samples with large gradients, and 2) encouraging model weights to go to a flat minimum so that the model is robust to the remaining noisy samples. Promising results demonstrate that SAR performs more stably than prior methods and is computationally efficient under the above wild test scenarios. ","Test-time adaptation, Robustness" Exploring Over-smoothing in Graph Attention Networks from the Markov Chain Perspective,https://openreview.net/forum?id=EdH_fkyhAO,https://openreview.net/pdf?id=EdH_fkyhAO,We give a theoretical analysis on the over-smoothing in GAT under the perspective of Markov Chains and propose a method to solve this problem.,"The over-smoothing problem causing the depth limitation is an obstacle to developing deep graph neural networks (GNNs). Compared with Graph Convolutional Networks (GCN), over-smoothing in Graph Attention Network (GAT) has not drawn enough attention. In this work, we analyze the over-smoothing problem in GAT from the Markov chain perspective. First, we establish a connection between GAT and a time-inhomogeneous random walk on the graph. Then we show that GAT is not always over-smoothing, using conclusions from the theory of time-inhomogeneous Markov chains. Finally, we derive a sufficient condition for GAT to avoid over-smoothing based on our findings about the existence of the limiting distribution of the time-inhomogeneous Markov chain. We design experiments to verify our theoretical findings. Results show that our proposed sufficient condition can effectively alleviate the over-smoothing problem in GAT and enhance the performance of the model.","Graph Attention Networks, Over-smoothing, Markov Chain" Sorted eigenvalue comparison $d_{\mathsf{Eig}}$: A simple alternative to $d_{\mathsf{FID}}$,https://openreview.net/forum?id=HehY2ZX2Cz,https://openreview.net/pdf?id=HehY2ZX2Cz,We propose to compare sorted eigenvalues as a simple alternative to FID score.,"For $i = 1, 2$, let $\mathbf{S}_i$ be the sample covariance of $\mathbf{Z}_i$ with $n_i$ $p$-dimensional vectors. First, we theoretically justify an improved Fréchet Inception Distance ($d_{\mathsf{FID}}$) algorithm that replaces np.trace(sqrtm($\mathbf{S}_1 \mathbf{S}_2$)) with np.sqrt(eigvals($\mathbf{S}_1 \mathbf{S}_2$)).sum(). 
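As an aside for readers who want to try the replacement just described, here is a minimal numpy sketch (our own illustration, assuming S1 and S2 are precomputed sample covariance matrices; clipping away tiny negative or imaginary parts is our addition, not part of the entry):

    import numpy as np

    def dfid_cross_term(S1, S2):
        # Eigenvalue-based replacement for np.trace(sqrtm(S1 @ S2)):
        # sum the square roots of the eigenvalues of S1 @ S2.
        ev = np.linalg.eigvals(S1 @ S2)
        ev = np.clip(ev.real, 0.0, None)  # drop tiny negative/imaginary parts
        return np.sqrt(ev).sum()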
With the appearance of unsorted eigenvalues in the improved $d_{\mathsf{FID}}$, we are then motivated to propose sorted eigenvalue comparison ($d_{\mathsf{Eig}}$) as a simple alternative: $d_{\mathsf{Eig}}(\mathbf{S}_1, \mathbf{S}_2)^2=\sum_{j=1}^p (\sqrt{\lambda_j^1} - \sqrt{\lambda_j^2})^2$, and $\lambda_j^i$ is the $j$-th largest eigenvalue of $\mathbf{S}_i$. Second, we present two main takeaways for the improved $d_{\mathsf{FID}}$ and proposed $d_{\mathsf{Eig}}$. (i) $d_{\mathsf{FID}}$: The error bound for computing non-negative eigenvalues of diagonalizable $\mathbf S_1 \mathbf S_2$ is reduced to $\mathcal{O}(\varepsilon) \|\mathbf S_1 \| \|\mathbf S_1 \mathbf S_2 \|$, along with reducing the run time by $\sim25\%$. (ii) $d_{\mathsf{Eig}}$: The error bound for computing non-negative eigenvalues of sample covariance $\mathbf S_i$ is further tightened to $\mathcal{O}(\varepsilon) \|\mathbf S_i \|$, while reducing the run time by $\sim90\%$. Taking a statistical viewpoint (random matrix theory) on $\mathbf{S}_i$, we illustrate the asymptotic stability of its largest eigenvalues, i.e., rigidity estimates of $\mathcal{O}(n_i^{-\frac{1}{2}+\alpha})$. Last, we discuss limitations and future work for $d_{\mathsf{Eig}}$.","Distribution shift, FID, eigenvalue comparison, random matrix theory" Towards Smooth Video Composition,https://openreview.net/forum?id=W918Ora75q,https://openreview.net/pdf?id=W918Ora75q,We develop a simple yet strong baseline for smooth video generation.,"Video generation, with the purpose of producing a sequence of frames, requires synthesizing consistent and persistent dynamic contents over time. This work investigates how to model the temporal relations for composing a video with an arbitrary number of frames, from a few to even infinite, using generative adversarial networks (GANs). First, towards composing adjacent frames, we show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings smooth frame transitions without harming the per-frame quality. Second, through incorporating a temporal shift module (TSM), which was originally designed for video understanding, into the discriminator, we manage to advance the generator in synthesizing more reasonable dynamics. Third, we develop a novel B-Spline based motion representation to ensure the temporal smoothness, and hence achieve infinite-length video generation, going beyond the frame number used in training. We evaluate our approach on a range of datasets and show substantial improvements over baselines on video generation. Code and models will be made publicly available.","video generation, generative adversarial network" Deep Dynamic AutoEncoder for Vision BERT Pretraining,https://openreview.net/forum?id=k4p382L0bw,https://openreview.net/pdf?id=k4p382L0bw,,"Recently, masked image modeling (MIM) has demonstrated promising prospects in self-supervised representation learning. However, existing MIM frameworks recover all masked patches equivalently, ignoring that the reconstruction difficulty of different patches can vary sharply due to their diverse distances from visible patches. In this paper, we propose Deep Dynamic AutoEncoder (DDAE), a novel MIM framework that dynamically focuses on patch reconstructions with different degrees of difficulty at different pretraining phases and depths of the model. In addition to raw pixel regression, DDAE performs dynamic feature self-distillation for intermediate layers to learn semantic information. 
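Picking up the sorted eigenvalue comparison defined two entries above, a minimal numpy sketch of $d_{\mathsf{Eig}}$ (our own illustration, assuming symmetric positive semi-definite sample covariances):

    import numpy as np

    def d_eig(S1, S2):
        # Sorted eigenvalue comparison: match the j-th largest eigenvalue
        # of S1 with the j-th largest eigenvalue of S2.
        l1 = np.linalg.eigvalsh(S1)[::-1]  # descending eigenvalues of S1
        l2 = np.linalg.eigvalsh(S2)[::-1]
        diff = np.sqrt(np.clip(l1, 0, None)) - np.sqrt(np.clip(l2, 0, None))
        return np.sqrt((diff ** 2).sum())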
Our methodology provides more locality inductive bias for ViTs, especially in deep layers, which inherently compensates for the absence of a local prior in the self-attention mechanism. Moreover, our core design, deep dynamic supervision, can be migrated into existing MIM methods (e.g., MAE, BEiT-v2) seamlessly. Experimental results demonstrate the effectiveness of our approach. As a tokenizer-free framework, the base-size DDAE can achieve 83.5% top-1 accuracy with only 100 epochs of pretraining, surpassing MAE and BEiT pretrained for 800 epochs. For a longer pretraining schedule, DDAE achieves 84.3% top-1 accuracy on Imagenet-1K, and 49.3% mIoU on ADE20K for semantic segmentation.", Continuous PDE Dynamics Forecasting with Implicit Neural Representations,https://openreview.net/forum?id=B73niNjbPs,https://openreview.net/pdf?id=B73niNjbPs,"We propose a continuous-time, continuous-space data-driven PDE forecasting model with extensive spatiotemporal extrapolation capabilities including generalization to unseen sparse meshes and resolutions.","Effective data-driven PDE forecasting methods often rely on fixed spatial and / or temporal discretizations. This raises limitations in real-world applications like weather prediction where flexible extrapolation at arbitrary spatiotemporal locations is required. We address this problem by introducing a new data-driven approach, DINo, that models a PDE's flow with continuous-time dynamics of spatially continuous functions. This is achieved by embedding spatial observations independently of their discretization via Implicit Neural Representations in a small latent space temporally driven by a learned ODE. This separate and flexible treatment of time and space makes DINo the first data-driven model to combine the following advantages. It extrapolates at arbitrary spatial and temporal locations; it can learn from sparse irregular grids or manifolds; at test time, it generalizes to new grids or resolutions. DINo outperforms alternative neural PDE forecasters in a variety of challenging generalization scenarios on representative PDE systems.","spatiotemporal forecasting, Partial Differential Equations, PDEs, Implicit Neural Representations, INRs, continuous models, generalization, dynamical systems, physics" Adversarial Collaborative Learning on Non-IID Features,https://openreview.net/forum?id=dOxe6utTKC,https://openreview.net/pdf?id=dOxe6utTKC,The paper proposes a new collaborative learning framework on non-IID features.,"Federated Learning (FL) has been a popular approach to enable collaborative learning among multiple parties without exchanging raw data. However, the model performance of FL may degrade substantially due to non-IID data. While many FL algorithms focus on non-IID labels, FL on non-IID features has largely been overlooked. Different from typical FL approaches, the paper proposes a new learning concept called ADCOL (Adversarial Collaborative Learning) for non-IID features. Instead of adopting the widely used model-averaging scheme, ADCOL conducts training in an adversarial way: the server aims to train a discriminator to distinguish the representations of the parties, while the parties aim to generate a common representation distribution. 
Our experiments on three tasks show that ADCOL achieves better performance than state-of-the-art FL algorithms on non-IID features.","Federated Learning, Collaborative Learning" DiffMimic: Efficient Motion Mimicking with Differentiable Physics,https://openreview.net/forum?id=06mk-epSwZ,https://openreview.net/pdf?id=06mk-epSwZ,Mimic agile skills for physics-based character with differentiable physics simulators.,"Motion mimicking is a foundational task in physics-based character animation. However, most existing motion mimicking methods are built upon reinforcement learning (RL) and suffer from heavy reward engineering, high variance, and slow convergence with hard explorations. Specifically, they usually take tens of hours or even days of training to mimic a simple motion sequence, resulting in poor scalability. In this work, we leverage differentiable physics simulators (DPS) and propose an efficient motion mimicking method dubbed $\textbf{DiffMimic}$. Our key insight is that DPS casts a complex policy learning task into a much simpler state matching problem. In particular, DPS learns a stable policy via analytical gradients with ground-truth physical priors, hence leading to significantly faster and more stable convergence than RL-based methods. Moreover, to escape from local optima, we utilize a \textit{Demonstration Replay} mechanism to enable stable gradient backpropagation over a long horizon. Extensive experiments on standard benchmarks show that DiffMimic has better sample efficiency and time efficiency than existing methods (e.g., DeepMimic). Notably, DiffMimic allows a physically simulated character to learn a back-flip after 10 minutes of training and to cycle it after 3 hours of training, while DeepMimic requires about a day of training to cycle a back-flip. More importantly, we hope DiffMimic can benefit more differentiable animation systems with techniques like differentiable clothes simulation in future research. Our code is available at https://github.com/diffmimic/diffmimic. Qualitative results can be viewed at https://diffmimic-demo-main-g7h0i8.streamlitapp.com",Physics-based Animation Towards Inferential Reproducibility of Machine Learning Research,https://openreview.net/forum?id=li4GQCQWkv,https://openreview.net/pdf?id=li4GQCQWkv,"Methods for inferential reproducibility of machine learning, using signficance testing under meta-parameter variation, variance components, and reliability coefficients.","Non-determinism in deep learning and the consequential randomness and variability in performance evaluation have spawned attempts to foster reproducibility of SOTA benchmark results by sharing data, code, and meta-parameter settings. In this paper, we propose to shift from the goal of duplicating a SOTA training result without any changes to a new type of reproducibility called inferential reproducibility that treats performance variation depending on data characteristics, meta-parameter settings, and their interactions as an inherent and interesting feature of non-deterministic deep learning, not as a bug that needs to be resolved. We propose to answer questions of inferential reproducibility by classical statistical methods: We show how to design a linear mixed effects model (LMEM) to analyze performance evaluation scores of machine learning algorithms, and to conduct statistical inference on the interpretable parameters of this model with a generalized likelihood ratio test (GLRT). 
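A minimal sketch of the kind of LMEM-plus-GLRT analysis this entry describes, using statsmodels (the column names score, system, and dataset are hypothetical, and the paper's own R/Python tooling may structure the model differently):

    import pandas as pd
    import statsmodels.formula.api as smf
    from scipy import stats

    # One row per evaluation run: a score, a fixed 'system' effect,
    # and a 'dataset' grouping treated as a random effect.
    df = pd.read_csv("runs.csv")

    full = smf.mixedlm("score ~ system", df, groups=df["dataset"]).fit(reml=False)
    null = smf.mixedlm("score ~ 1", df, groups=df["dataset"]).fit(reml=False)

    # Likelihood-ratio test of the 'system' effect against a chi-square.
    lr = 2 * (full.llf - null.llf)
    p = stats.chi2.sf(lr, df=len(full.fe_params) - len(null.fe_params))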
This approach allows us to efficiently assess the statistical significance of performance differences between models while simultaneously accounting for variability in meta-parameters and data. Furthermore, performance differences conditional on data properties can be assessed, and a variance component analysis (VCA) can be performed to reveal the contribution of meta-parameters to overall variance. Lastly, a reliability coefficient can be computed to assess the general robustness of the model. Code (R and Python) and sample applications of our tools are publicly available.","reproducibility, variance component analysis, reliability, significance test" Knowledge Distillation based Degradation Estimation for Blind Super-Resolution,https://openreview.net/forum?id=Fg3mYW8owg,https://openreview.net/pdf?id=Fg3mYW8owg,"We propose a knowledge distillation based blind super-resolution network, which can generalize to all degradation processes and achieve SOTA performance efficiently.","Blind image super-resolution (Blind-SR) aims to recover a high-resolution (HR) image from its corresponding low-resolution (LR) input image with unknown degradations. Most of the existing works design an explicit degradation estimator for each degradation to guide SR. However, it is infeasible to provide concrete labels of multiple degradation combinations (e.g., blur, noise, JPEG compression) to supervise the degradation estimator training. In addition, these special designs for certain degradations, such as blur, impede the models from generalizing to handle different degradations. To this end, it is necessary to design an implicit degradation estimator that can extract discriminative degradation representation for all degradations without relying on the supervision of degradation ground-truth. In this paper, we propose a Knowledge Distillation based Blind-SR network (KDSR). It consists of a knowledge distillation based implicit degradation estimator network (KD-IDE) and an efficient SR network. To learn the KDSR model, we first train a teacher network: KD-IDE$_{T}$. It takes paired HR and LR patches as inputs and is optimized with the SR network jointly. Then, we further train a student network KD-IDE$_{S}$, which only takes LR images as input and learns to extract the same implicit degradation representation (IDR) as KD-IDE$_{T}$. In addition, to fully use the extracted IDR, we design a simple, strong, and efficient IDR based dynamic convolution residual block (IDR-DCRB) to build an SR network. We conduct extensive experiments under classic and real-world degradation settings. The results show that KDSR achieves SOTA performance and can generalize to various degradation processes. The source codes and pre-trained models will be released.",Image Super-Resolution Very Large Scale Multi-Agent Reinforcement Learning with Graph Attention Mean Field,https://openreview.net/forum?id=MdiVU9lMmVS,https://openreview.net/pdf?id=MdiVU9lMmVS,A multi-agent reinforcement learning method solving very large scale problem by mean-field technique combining graph attention mechanism.,"With recent advances in reinforcement learning, we have witnessed countless successes of intelligent agents in various domains. Especially, multi-agent reinforcement learning (MARL) is suitable for many real-world scenarios and has vast potential applications. However, typical MARL methods can only handle tens of agents, leaving scenarios with hundreds or even thousands of agents almost unexplored. 
There exist two key challenges in scaling up the number of agents: (1) agent-agent interactions are critical in multi-agent systems, while the number of interactions grows quadratically with the number of agents, causing great computational complexity and difficulty in strategy learning; (2) the strengths of interactions vary among agents and over time, making it difficult to precisely model such interactions. In this paper, we propose the Graph Attention Mean Field (GAT-MF) method, where we convert agent-agent interactions into interactions between each agent and a weighted mean field, greatly reducing the computational complexity. We mathematically prove the correctness of this conversion. We design a graph attention mechanism to automatically capture the different and time-varying strengths of interactions, ensuring the ability of our method to precisely model interactions among the agents. We conduct extensive experiments in both manual and real-world scenarios with more than 3000 agents, demonstrating that, compared with existing MARL methods, our method reaches superior performance and 9.4 times higher computational efficiency.","Multi-agent reinforcement learning, large-scale problems, graph attention, mean field" Graph Contrastive Learning for Skeleton-based Action Recognition,https://openreview.net/forum?id=PLUXnnxUdr4,https://openreview.net/pdf?id=PLUXnnxUdr4,"For GCN-based methods in skeleton-based action recognition, this work extends the graph learning from using intra-sequence local context to exploring cross-sequence global context.","In the field of skeleton-based action recognition, current top-performing graph convolutional networks (GCNs) exploit intra-sequence context to construct adaptive graphs for feature aggregation. However, we argue that such context is still $\textit{local}$ since the rich cross-sequence relations have not been explicitly investigated. In this paper, we propose a graph contrastive learning framework for skeleton-based action recognition ($\textit{SkeletonGCL}$) to explore the $\textit{global}$ context across all sequences. Specifically, SkeletonGCL associates graph learning across sequences by enforcing graphs to be class-discriminative, i.e., intra-class compact and inter-class dispersed, which improves the GCN capacity to distinguish various action patterns. Besides, two memory banks are designed to enrich cross-sequence context from two complementary levels, i.e., instance and semantic levels, enabling graph contrastive learning in multiple context scales. Consequently, SkeletonGCL establishes a new training paradigm, and it can be seamlessly incorporated into current GCNs. Without loss of generality, we combine SkeletonGCL with three GCNs (2s-AGCN, CTR-GCN, and InfoGCN), and achieve consistent improvements on NTU60, NTU120, and NW-UCLA benchmarks. ",Skeleton-based Action Recognition Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation,https://openreview.net/forum?id=s4WVupnJjmX,https://openreview.net/pdf?id=s4WVupnJjmX,,"This paper presents a novel end-to-end framework with Explicit box Detection for multi-person Pose estimation, called ED-Pose, where it unifies the contextual learning between human-level (global) and keypoint-level (local) information. Different from previous one-stage methods, ED-Pose re-considers this task as two explicit box detection processes with a unified representation and regression supervision. First, we introduce a human detection decoder from encoded tokens to extract global features. 
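For the GAT-MF entry above, a toy sketch of an attention-weighted mean field (our own illustration; the names, shapes, and scaling are assumptions, not the paper's code):

    import torch

    def attention_mean_field(h, actions, W_q, W_k):
        # h: (N, d) agent embeddings; actions: (N, a) agent actions.
        # Each agent attends over all agents; the attention-weighted mean
        # action stands in for the N-1 pairwise interactions.
        q = h @ W_q
        k = h @ W_k
        alpha = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
        return alpha @ actions  # (N, a) per-agent mean-field action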
It can provide a good initialization for the subsequent keypoint detection, making the training process converge quickly. Second, to bring in contextual information near keypoints, we regard pose estimation as a keypoint box detection problem to learn both box positions and contents for each keypoint. A human-to-keypoint detection decoder adopts an interactive learning strategy between human and keypoint features to further enhance global and local feature aggregation. In general, ED-Pose is conceptually simple without post-processing and dense heatmap supervision. It demonstrates its effectiveness and efficiency compared with both two-stage and one-stage methods. Notably, explicit box detection boosts the pose estimation performance by 4.5 AP on COCO and 9.9 AP on CrowdPose. For the first time, as a fully end-to-end framework with an L1 regression loss, ED-Pose surpasses heatmap-based Top-down methods under the same backbone by 1.2 AP on COCO and achieves the state-of-the-art with 76.6 AP on CrowdPose without bells and whistles.",Multi-person Pose Estimation Expected Perturbation Scores for Adversarial Detection,https://openreview.net/forum?id=xKlCpphHAsg,https://openreview.net/pdf?id=xKlCpphHAsg,,"Adversarial detection aims to determine whether a given sample is an adversarial one based on the discrepancy between natural and adversarial distributions. Unfortunately, estimating or comparing two data distributions is extremely difficult, especially in the high-dimension space. Recently, the gradients of the log probability density (a.k.a. score) w.r.t. samples are used as an alternative statistic to compute. However, we find that the score is unreliable in identifying adversarial samples due to the insufficient information carried by a single sample. In this paper, we propose a new statistic called expected perturbation score (EPS), which is essentially the expected score of a sample after various perturbations. Specifically, to obtain adequate information regarding one sample, we can perturb it by adding various noises to capture its multi-view observations. We theoretically prove that EPS is a proper statistic to compute the discrepancy between two distributions under mild conditions. In practice, we can use a pre-trained diffusion model to estimate EPS for each sample. Last, we propose an EPS-based adversarial detection (EPS-AD) method, in which we develop EPS-based maximum mean discrepancy (MMD) as a metric to measure the discrepancy between the test sample and natural samples. To verify the validity of our proposed method, we also prove that the EPS-based MMD between natural and adversarial samples is larger than that among natural samples. Empirical studies on CIFAR-10 and ImageNet across different network architectures including ResNet, WideResNet, and ViT show the superior adversarial detection performance of EPS-AD compared to existing methods.", Look Back When Surprised: Stabilizing Reverse Experience Replay for Neural Approximation,https://openreview.net/forum?id=15fiz99C8B,https://openreview.net/pdf?id=15fiz99C8B,We propose a new experience replay which outperforms previous SOTA on most environments ,"Experience replay-based sampling techniques are essential to several reinforcement learning (RL) algorithms since they aid in convergence by breaking spurious correlations. The most popular techniques, such as uniform experience replay (UER) and prioritized experience replay (PER), seem to suffer from sub-optimal convergence and significant bias error, respectively. 
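For the EPS entry above, a schematic sketch of the statistic (our own illustration; score_model and its signature are assumptions standing in for the pre-trained diffusion model the entry mentions):

    import torch

    def expected_perturbation_score(score_model, x, sigmas, n_draws=16):
        # EPS idea: average the score of noisy copies of x over several
        # perturbations, i.e., the expected score after perturbation.
        out = []
        for _ in range(n_draws):
            sigma = sigmas[torch.randint(len(sigmas), (1,)).item()]
            x_noisy = x + sigma * torch.randn_like(x)
            out.append(score_model(x_noisy, sigma))
        return torch.stack(out).mean(dim=0)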
To alleviate this, we introduce a new experience replay method for reinforcement learning, called Introspective Experience Replay (IER). IER picks batches of consecutive data points that immediately precede the ‘surprising’ points. Our proposed approach is based on the theoretically rigorous reverse experience replay (RER), which can be shown to remove bias in the linear approximation setting but can be sub-optimal with neural approximation. We show empirically that IER is stable with neural function approximation and has superior performance compared to state-of-the-art techniques like uniform experience replay (UER), prioritized experience replay (PER), and hindsight experience replay (HER) on the majority of tasks.","Experience Replay, Reinforcement Learning" BQ-NCO: Bisimulation Quotienting for Generalizable Neural Combinatorial Optimization,https://openreview.net/forum?id=5ZLWi--i57,https://openreview.net/pdf?id=5ZLWi--i57,"A generic formulation of Combinatorial Optimization problems as MDP, and pre-processing steps to improve it, with experiments on routing problems","Despite the success of Neural Combinatorial Optimization methods for end-to-end heuristic learning, out-of-distribution generalization remains a challenge. In this paper, we present a novel formulation of combinatorial optimization (CO) problems as Markov Decision Processes (MDPs) that effectively leverages symmetries of the CO problems to improve out-of-distribution robustness. Starting from the standard MDP formulation of constructive heuristics, we introduce a generic transformation based on bisimulation quotienting (BQ) in MDPs. This transformation allows us to reduce the state space by accounting for the intrinsic symmetries of the CO problem and facilitates the MDP solving. We illustrate our approach on the Traveling Salesman and Capacitated Vehicle Routing Problems. We present a BQ reformulation of these problems and introduce a simple attention-based policy network that we train by imitation of (near) optimal solutions for small instances from a single distribution. We obtain new state-of-the-art generalization results for instances with up to 1000 nodes from synthetic and realistic benchmarks that vary both in size and node distributions.", CoGANs: Collaborative Generative Adversarial Networks,https://openreview.net/forum?id=u_-XxuTcnJ7,https://openreview.net/pdf?id=u_-XxuTcnJ7,We introduce a new method to train multi-generator GANs which manages to beat the state-of-the-art for MNIST,"In complex creative scenarios, co-creativity by multiple agents offers great advantages. Each agent has a specific skill set and a set of abilities, which is sometimes not enough to perform a general, large, and complex task single-handedly. These kinds of tasks benefit substantially from collaboration. In deep learning applications, data generation is an example of such a complex, potentially multi-modal task. Previous Generative Adversarial Networks (GANs) focused on using a single generator to generate multi-modal datasets, an approach known to face issues such as mode collapse and failure to converge. The multi-generator based works such as MGAN, MMGAN, MADGAN and AdaGAN either require training a classifier online, use complex mixture models, or add generators sequentially, all of which are computationally complex. 
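For the IER entry above, a toy sketch of the batch rule as we read it (our own illustration; the use of TD-error magnitude as the 'surprise' signal is an assumption):

    import numpy as np

    def ier_indices(td_errors, batch_size):
        # Find the most 'surprising' transition and return the indices of
        # the batch_size transitions leading up to it, in temporal order.
        k = int(np.argmax(np.abs(td_errors)))
        return list(range(max(0, k - batch_size + 1), k + 1))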
In this work, we present a simple, novel approach to training collaborative GANs (CoGAN), with multiple generators and a single critic/discriminator, without introducing external complexities such as a classifier model. We show that this method of workload division matches state-of-the-art quality metrics and makes GAN training robust. We present a proof-of-concept on the MNIST dataset, which has 10 modes of data. The individual generators learn to generate different digits from the distribution, and together learn to generate the whole distribution. We introduce a new component to the generator loss during GAN training, based on the Total Variation Distance (TVD), and show that it significantly improves stability during training and performance over state-of-the-art single-generator GANs.","GANs, Multiple generators" Multiscale Neural Operator: Learning Fast and Grid-independent PDE Solvers,https://openreview.net/forum?id=4inSu6mXdZk,https://openreview.net/pdf?id=4inSu6mXdZk,We are the first to embed grid-independent neural operators as closure model or parametrization in physical simulations -- in doing so we created a fast and accurate surrogate of multiscale PDEs.,"Numerical simulations in climate, chemistry, or astrophysics are computationally too expensive for uncertainty quantification or parameter exploration at high resolution. Reduced-order or surrogate models are multiple orders of magnitude faster, but traditional surrogates are inflexible or inaccurate and pure machine learning (ML)-based surrogates are too data-hungry. We propose a hybrid, flexible surrogate model that exploits known physics for simulating large-scale dynamics and limits learning to the hard-to-model term, which is called parametrization or closure and captures the effect of fine-scale on large-scale dynamics. Leveraging neural operators, we are the first to learn grid-independent, non-local, and flexible parametrizations. Our \textit{multiscale neural operator} is motivated by a rich literature in multiscale modeling, has quasilinear runtime complexity, is more accurate or flexible than state-of-the-art parametrizations, and is demonstrated on the chaotic multiscale Lorenz96 equation.","physics-informed machine learning, pinns, scientific machine learning, neural ODEs, neural operators, machine learning, neural networks, Matryoshka, multiphysics, multiscale, parametrizations, closure, subgrid, superstructures, partial differential equations, PDEs, differential equations, numerical solvers, physics, hpc, surrogate, reduced order modeling, model reduction, uncertainty quantification, climate, fluid dynamics, physics, computational physics" NASiam: Efficient Representation Learning using Neural Architecture Search for Siamese Networks,https://openreview.net/forum?id=apZRm_0VClK,https://openreview.net/pdf?id=apZRm_0VClK,A novel method improving Siamese Networks architecture using Neural Architecture Search.,"Siamese networks are one of the most trending methods to achieve unsupervised visual representation learning. Meanwhile, Neural Architecture Search (NAS) is becoming increasingly important as a technique to discover efficient deep learning architectures. In this article, we present NASiam, a novel approach that uses for the first time differentiable NAS to improve the Multilayer Perceptron projector and predictor (encoder/predictor pair) architectures inside Siamese network frameworks while preserving the simplicity of previous baselines. 
We show that these new architectures allow backbone convolutional models to learn strong representations efficiently. NASiam reaches competitive performance on both small-scale (CIFAR) and large-scale (ImageNet) image classification datasets. We discuss the composition of the NAS-discovered architectures and offer hypotheses on why they manage to prevent collapsing behavior.","Neural Architecture Search, Self-Supervised Learning, Representation Learning, Siamese Networks, Computer Vision" Out-of-distribution Detection with Diffusion-based Neighborhood,https://openreview.net/forum?id=5tKhUU5WBi8,https://openreview.net/pdf?id=5tKhUU5WBi8,We design a general strategy to combine a diffusion model and a Resnet to do OOD detection.,"Out-of-distribution (OOD) detection is an important task to ensure the reliability and safety of deep learning, and discriminator models currently outperform others. However, the feature extraction of such models must compress the data and lose certain information, leaving room for bad cases and malicious attacks. Meanwhile, despite effectively fitting the data distribution and producing high-quality samples, generative models lack suitable indicator scores to match discriminator models in OOD detection tasks. In this paper, we find that these two kinds of models can be combined to solve each other's problems. We introduce diffusion models (DMs), a kind of powerful generative model, into OOD detection and find that the denoising process of DMs also functions as a novel form of asymmetric interpolation. This property establishes a diffusion-based neighborhood for each input data point. Then, we perform discriminator-based OOD detection based on the diffusion-based neighborhood instead of isolated data points. In this combination, the discriminator models provide detection metrics for generative models and the diffusion-based neighborhood reduces the information loss of feature extraction. According to our experiments on CIFAR10 and CIFAR100, our new methods successfully outperform state-of-the-art methods. Our implementation is provided in the supplementary materials.","OOD detection, diffusion model" A Massively Parallel Benchmark for Safe Dexterous Manipulation,https://openreview.net/forum?id=k2Ml8FGtJZp,https://openreview.net/pdf?id=k2Ml8FGtJZp,"Safety DexterousHands is the first large-scale task collection focused on safe dexterous manipulation, offering 10+ manipulators and 100+ task combinations.","Safe Reinforcement Learning (Safe RL) aims to maximize expected total rewards while avoiding violations of certain constraints. Many constrained environments have been designed to evaluate Safe RL algorithms, but they are more focused on simple navigation tasks and exhibit tremendous gaps from the real world. Meanwhile, dexterous manipulation is a challenging topic in the field of robotics, and places high demands on safety constraints to ensure reliable manipulation in the real world. Consequently, we propose Safety DexterousHands, a massively parallel physical benchmark to facilitate experimental validation in Safe RL research. Safety DexterousHands is built in the Isaac Gym, a GPU-level parallel simulator that enables highly efficient RL training. We designed a series of challenging dexterous manipulation tasks around the safety constraints. To the best of our knowledge, Safety DexterousHands is the first large-scale benchmark focused on safe dexterous manipulation, offering 10+ manipulators and 100+ task combinations. 
Our experimental results show that Safe RL algorithms can perfectly solve the safe dexterous manipulation task by exploiting the sparse cost penalty, while unsafe RL algorithms struggle to solve most tasks without causing disruption. We expect that this benchmark can deliver a reliable and comprehensive evaluation for Safe RL algorithms and promote the integration of Safe RL and dexterous manipulation. ","Dexterous Manipulation, Safe Reinforcement Learning, Robot Learning" Never Revisit: Continuous Exploration in Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=gULfK60oYr1,https://openreview.net/pdf?id=gULfK60oYr1,,"Recently, intrinsic motivations are widely used for exploration in multi-agent reinforcement learning. We discover that intrinsic rewards come with the issue of revisitation -- the relative values of intrinsic rewards fluctuate, causing a previously visited sub-space to become attractive again after a period of exploration in other areas. Consequently, agents risk exploring some sub-spaces repeatedly. In this paper, we formally define the concept of revisitation, based on which we propose an observation-distribution matching approach to detect the appearance of revisitation. To avoid it, we add branches to agents' local Q-networks and the mixing network to separate sub-spaces which have already been revisited. Furthermore, to prevent adding branches excessively, we design intrinsic rewards to reduce the probability of and penalize the occurrence of revisitation. By virtue of these advances, our method achieves superior performance on three challenging Google Research Football (GRF) scenarios with sparse rewards. ", Do Not Train It: A Linear Neural Architecture Search of Graph Neural Networks,https://openreview.net/forum?id=TUhgwGQBtE,https://openreview.net/pdf?id=TUhgwGQBtE,,"Neural architecture search (NAS) for graph neural networks (GNNs), called NAS-GNNs, has achieved significant performance over manually designed GNN architectures. However, these methods inherit issues from the conventional NAS methods, such as high computational cost and optimization difficulty. More importantly, previous NAS methods have ignored the uniqueness of GNNs, where the non-linearity has limited effect. Based on this, we are the first to theoretically prove that a GNN fixed with random weights can obtain optimal outputs under mild conditions. With the randomly-initialized weights, we can then seek the optimal architecture parameters via the sparse coding objective and derive a novel NAS-GNNs method, namely neural architecture coding (NAC). Consequently, our NAC holds a no-update scheme on GNNs and can be computed efficiently in linear time. Empirical evaluations on multiple GNN benchmark datasets demonstrate that our approach leads to state-of-the-art performance, which is up to $200\times$ faster and $18.8\%$ more accurate than the strong baselines.","Neural Architecture Search, Graph neural network, Automated Machine Learning" Rethinking the Explanation of Graph Neural Network via Non-parametric Subgraph Matching,https://openreview.net/forum?id=p0MBhpO5wQ,https://openreview.net/pdf?id=p0MBhpO5wQ,,"The great success of graph neural networks (GNNs) provokes the question about explainability: ``Which fraction of the input graph is most decisive for the prediction?'' However, current approaches usually resort to a black-box to decipher another black-box (i.e., GNN), making it difficult to understand how the explanation is made. 
Based on the observation that graphs typically share some joint motif patterns, we propose a novel subgraph matching framework named MatchExplainer to explore explanatory subgraphs. It couples the target graph with other counterpart instances and identifies the most crucial joint substructure by minimizing the node-correspondence-based distance between them. After that, an external graph ranking is performed to select the most informative substructure from all subgraph candidates. Thus, MatchExplainer is entirely non-parametric. Moreover, current graph sampling or node dropping methods usually suffer from the false positive sampling problem. To ameliorate this issue, we take advantage of MatchExplainer to fix the most informative portion of the graph and merely operate graph augmentations on the remaining, less informative part, which is dubbed MatchDrop. We conduct extensive experiments on both synthetic and real-world datasets, showing the effectiveness of our MatchExplainer by outperforming all parametric baselines by large margins. Additional results also demonstrate that our MatchDrop is a general paradigm to be equipped with GNNs for enhanced performance.","Graph Neural Networks, Graph Matching, Explanation" Spikformer: When Spiking Neural Network Meets Transformer ,https://openreview.net/forum?id=frE4fUwz_h,https://openreview.net/pdf?id=frE4fUwz_h,,"We consider two biologically plausible structures, the Spiking Neural Network (SNN) and the self-attention mechanism. The former offers an energy-efficient and event-driven paradigm for deep learning, while the latter has the ability to capture feature dependencies, enabling Transformer to achieve good performance. It is intuitively promising to explore the marriage between them. In this paper, we consider leveraging both the self-attention capability and the biological properties of SNNs, and propose a novel Spiking Self Attention (SSA) as well as a powerful framework, named Spiking Transformer (Spikformer). The SSA mechanism in Spikformer models the sparse visual feature by using spike-form Query, Key, and Value without softmax. Since its computation is sparse and avoids multiplication, SSA is efficient and has low computational energy consumption. It is shown that Spikformer with SSA can outperform the state-of-the-art SNN-like frameworks in image classification on both neuromorphic and static datasets. Spikformer (66.3M parameters), with comparable size to SEW-ResNet-152 (60.2M, 69.26%), can achieve 74.81% top-1 accuracy on ImageNet using 4 time steps, which is the state-of-the-art among directly trained SNN models.","Transformer, Spiking Neural Network" Representation Mutual Learning for End-to-End Weakly-Supervised Semantic Segmentation,https://openreview.net/forum?id=BnznzofWMi,https://openreview.net/pdf?id=BnznzofWMi,"An efficient and decoder-free Representation Mutual Learning (RML) framework for WSSS that combines instance-level, feature-level and pixel-level mutual learning strategies to improve segmentation quality.","In recent years, end-to-end solutions for Weakly Supervised Semantic Segmentation (WSSS) with image-level labels have been developed rapidly. Previous end-to-end methods usually rely on segmentation branches or decoders to predict segmentation masks, bringing additional parameters and computation time. 
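For the Spikformer entry above, a toy sketch of softmax-free spike attention (our own illustration; the shapes and the scale constant are assumptions):

    import torch

    def spiking_self_attention(Q, K, V, scale=0.125):
        # Q, K, V: (T, N, D) binary spike tensors. No softmax is applied,
        # so the attention map stays sparse and non-negative.
        attn = Q @ K.transpose(-2, -1)  # (T, N, N) spike-spike products
        return (attn @ V) * scale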
In this paper, we propose a decoder-free Representation Mutual Learning (RML) framework to directly predict segmentation masks, which leverages collaborative learning and mutual teaching among multi-level feature representations to improve segmentation performance. Our RML is a straightforward and efficient end-to-end WSSS framework, which incorporates instance-level, feature-level and pixel-level representation mutual learning strategies to improve segmentation quality. To enhance the Class Activation Map (CAM) representations, we propose a CAM-driven Instance-level Mutual Learning strategy that preserves the equivariance of CAMs and expands the distance between different classes of semantic prototypes. Besides, we design a Multi-scale Feature-level Mutual Learning strategy, which can align aggregated contextual representations and strengthen the representation capability of contextual representations. Furthermore, we also provide an Affinity-aware Pixel-level Mutual Learning strategy to learn semantic affinity representations. Experiments validate that our RML yields a significant performance improvement over recent end-to-end methods on the Pascal VOC 2012 dataset and the MS COCO 2014 dataset. The code is available in the supplementary material.","Weakly Supervised Semantic Segmentation, Representation Mutual Learning, End-to-End" DeSCo: Towards Scalable Deep Subgraph Counting,https://openreview.net/forum?id=lL8LF0O8Y2,https://openreview.net/pdf?id=lL8LF0O8Y2,"We propose DeSCo, a neural-based deep subgraph counting framework aims to accurately predict the count of query graphs on any given target graph.","Subgraph counting is the problem of determining the number of occurrences of a given query graph in a large target graph. Despite being a #P problem, subgraph counting is a crucial graph analysis method in domains ranging from biology and social science to risk management and software analysis. However, existing exact counting methods take combinatorially long runtime as target and query sizes increase. Existing approximate heuristic methods and neural approaches fall short in accuracy due to high label dynamic range, limited model expressive power, and inability to predict the distribution of subgraph counts in the target graph. Here we propose DeSCo, a neural deep subgraph counting framework, which aims to accurately predict the count and distribution of query graphs on any given target graph. DeSCo uses canonical partition to divide the large target graph into small neighborhood graphs and predict the canonical count objective on each neighborhood. The proposed partition method avoids missing or double-counting any patterns of the target graph. A novel subgraph-based heterogeneous graph neural network is then used to improve the expressive power. Finally, gossip correction improves counting accuracy via prediction propagation with learnable weights. Compared with state-of-the-art approximate heuristic and neural methods, DeSCo achieves a 437x improvement in the mean squared error of count prediction and benefits from polynomial runtime complexity. ","subgraph counting, graph neural network, graph mining" On a Built-in Conflict between Deep Learning and Systematic Generalization,https://openreview.net/forum?id=3TfSOxiRiFH,https://openreview.net/pdf?id=3TfSOxiRiFH,,"Out-of-distribution or systematic generalization is a desirable property that most deep learning algorithms lack. 
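For the DeSCo entry above, a toy sketch of a neighborhood decomposition in the spirit of canonical partition (networkx-based; the radius, the summation, and the predict_count callable are our assumptions, not the paper's exact objective):

    import networkx as nx

    def neighborhood_counts(G, predict_count, radius=2):
        # Decompose the target graph into one small neighborhood per node,
        # query a per-neighborhood count, and sum into a graph-level estimate.
        total = 0.0
        for v in G.nodes:
            total += predict_count(nx.ego_graph(G, v, radius=radius))
        return total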
In this paper, we hypothesize that internal function sharing is one of the reasons for weakened systematic generalization in deep learning for classification tasks. Under equivalent prediction, a model partitions an input space into multiple parts separated by boundaries. Function sharing prefers to reuse boundaries, leading to fewer parts for new outputs, which conflicts with systematic generalization. We show such phenomena in standard deep learning models, such as fully connected, convolutional, residual networks, LSTMs, and (Vision) Transformers. We hope this study provides novel insights and forms a basis for new research directions to improve systematic generalization.","out-of-distribution generalization, systematic generalization, compositional generalization" SepRep-Net: Multi-source Free Domain Adaptation via Model Separation and Reparameterization,https://openreview.net/forum?id=E67OghNSDMf,https://openreview.net/pdf?id=E67OghNSDMf,"We introduce a general approach to multi-source free domain adaptation via model separation and reparameterization, which enhances effectiveness, efficiency and generalizability. ","We consider multi-source free domain adaptation, the problem of adapting multiple existing models to a new domain without accessing the source data. This is a practical problem, which often arises in commercial settings but remains an open question despite the advances in recent years. Previous methods, e.g., model ensemble, are effective, but they also incur significantly increased computational costs. Conventional solutions for efficiency, such as distillation, are limited in preserving source knowledge, i.e., maintaining generalizability. In this work, we propose a novel framework called SepRep-Net, which tackles multi-source free domain adaptation via model Separation and Reparameterization. Concretely, SepRep-Net reassembles multiple existing models into a unified network, while maintaining separate pathways (Separation). During training, separate pathways are optimized in parallel with the information exchange regularly performed via an additional feature merging unit. With our specific design, these pathways can be further reparameterized into a single one to facilitate inference (Reparameterization). SepRep-Net is characterized by 1) effectiveness: competitive performance on the target domain, 2) efficiency: low computational costs, and 3) generalizability: maintaining more source knowledge than existing solutions. As a general approach, SepRep-Net can be seamlessly plugged into various methods. Extensive experiments validate the performance of SepRep-Net on mainstream benchmarks.","multi-source free domain adaptation, generalized domain adaptation" Consistent and Truthful Interpretation with Fourier Analysis,https://openreview.net/forum?id=YnVpYUjzVHC,https://openreview.net/pdf?id=YnVpYUjzVHC,"We find that the previous attribution methods are not consistent with neighborhood predictions, and introduce a new framework with an efficient algorithm to support consistency. ","For many interdisciplinary fields, ML interpretations need to be consistent with \emph{what-if} scenarios related to the current case, i.e., if one factor changes, how does the model react? Although attribution methods are supported by elegant axiomatic systems, they mainly focus on individual inputs, and are generally inconsistent. To support what-if scenarios, we introduce a new objective of consistency based on a notion called truthful interpretation. 
Towards this objective, we apply Fourier analysis of Boolean functions to get consistency guarantees. Experimental results show that for neighborhoods with various radii, our method achieves $2$x--$50$x lower inconsistency compared with the other methods.",AI Interpretability D2Match: Leveraging Deep Learning and Degeneracy for Subgraph Matching,https://openreview.net/forum?id=Om_QvnjjBL2,https://openreview.net/pdf?id=Om_QvnjjBL2,,"Subgraph matching is a fundamental building block for many graph-based applications and is challenging due to its high-order combinatorial nature. However, previous methods usually tackle it by combinatorial optimization or representation learning and suffer from exponential computational cost or matching without theoretical guarantees. In this paper, we develop D2Match by leveraging the efficiency of Deep learning and Degeneracy for subgraph matching. More specifically, we prove that subgraph matching can degenerate to subtree matching, and is subsequently equivalent to finding a perfect matching on a bipartite graph. This matching procedure can be implemented by the built-in tree-structured aggregation mechanism on graph neural networks, which yields linear time complexity. Moreover, circle structures, abstracted as {\em supernodes}, and node attributes can be easily incorporated in D2Match to boost the matching. Finally, we conduct extensive experiments to show the superior performance of our D2Match and confirm that our D2Match indeed exploits the subtrees and differs from existing learning-based subgraph matching methods that depend on memorizing the data distribution divergence.", Multimodal Analogical Reasoning over Knowledge Graphs,https://openreview.net/forum?id=NRHajbzg8y0P,https://openreview.net/pdf?id=NRHajbzg8y0P,Multimodal analogical reasoning over knowledge graphs with a new dataset MARS and a new framework MarT.,"Analogical reasoning is fundamental to human cognition and holds an important place in various fields. However, previous studies mainly focus on single-modal analogical reasoning and neglect to take advantage of structured knowledge. Notably, research in cognitive psychology has demonstrated that information from multimodal sources always brings more powerful cognitive transfer than single-modality sources. To this end, we introduce the new task of multimodal analogical reasoning over a knowledge graph, which requires multimodal reasoning ability with the help of background knowledge. Specifically, we construct a Multimodal Analogical Reasoning dataSet (MARS) and a multimodal knowledge graph MarKG. We evaluate with multimodal knowledge graph embedding and pre-trained Transformer baselines, illustrating the potential challenges of the proposed task. We further propose a novel model-agnostic Multimodal analogical reasoning framework with Transformer (MarT) motivated by the structure mapping theory, which can obtain better performance. We hope our work can deliver benefits and inspire future research.","knowledge graph, multimodal, analogical reasoning, prompt learning, pre-trained language model" QFuture: Learning Future Expectations in Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=yotPBsMyfTe,https://openreview.net/pdf?id=yotPBsMyfTe,future expectations learning,"Building accurate and robust value functions to estimate the expected future return from the current state is critical in Multi-Agent Reinforcement Learning. 
Previous works perform better estimation by strengthening the representation of the value function. However, due to the uncertain and unavailable future, directly estimating the future return from the current state is challenging and cannot be addressed by merely promoting representation ability. In society, humans derive future expectations from currently available information to help evaluate the long-term return of their behavior. Motivated by this, we propose a novel framework, called \textit{future expectation multi-agent Q-learning} (QFuture), for better estimating future expected returns. In this framework, we design a future expectation module (FEM) to build future expectations in the calculation process of the individual (IAV) and joint action-value (JAV). In FEM, the future expectations are modeled as random variables that undergo representation learning by maximizing their mutual information (MI) with the future trajectory given the current observation (in IAV) or state (in JAV). We design a future representation module (FRM) to encode the future trajectory, where a regularizer is designed to ensure informativeness. Experiments on StarCraft II micromanagement tasks and Google Research Football demonstrate that QFuture significantly achieves state-of-the-art performance.","multi-agent reinforcement learning, future expectations learning, value decomposition, mutual information" MMCAP: LEARNING TO BROAD-SIGHT NEURAL NETWORKS BY CLASS ATTENTION POOLING,https://openreview.net/forum?id=u6t9zT8h3p5,https://openreview.net/pdf?id=u6t9zT8h3p5,,"Recently, global average pooling has been believed to lose local information, which saturates the performance of neural networks. For this lossy pooling operation, we propose a new interpretation, termed over-concentration, to explain the real reason why it degrades network performance. We argue that the problem with global average pooling is that it disregards local patterns by relying solely on overly concentrated activations. Global average pooling forces the network to learn objects regardless of their location, so features tend to be activated only in specific regions. To support this claim, we provide a novel analysis of the problems that over-concentration brings about in the network with extensive experiments. We analyze the over-concentration through problems arising from feature variance and dead neurons that are not activated. Based on our analysis, we introduce a multi-token and multi-scale class attention pooling layer to alleviate the over-concentration problem. The proposed attention pooling method captures rich, localized patterns with an efficient network design using multiple scales and tokens. Our method is highly applicable to downstream tasks and network architectures such as CNN, ViT, and MLP-Mixer. In our experiments, the proposed method improves MLP-Mixer, ViT, and CNN architectures with little additional resources, and a network employing our pooling method works well even compared to state-of-the-art networks. We will open-source the proposed pooling method.","class attention, global average pooling, visual recognition, over-concentration, feature variance" GAIN: Enhancing Byzantine Robustness in Federated Learning with Gradient Decomposition,https://openreview.net/forum?id=Mwpw3weZrK8,https://openreview.net/pdf?id=Mwpw3weZrK8,,"Federated learning provides a privacy-aware learning framework by enabling participants to jointly train models without exposing their private data. 
However, federated learning has exhibited vulnerabilities to Byzantine attacks, where the adversary aims to destroy the convergence and performance of the global model. Meanwhile, we observe that most existing robust AGgregation Rules (AGRs) fail to keep the aggregated gradient from deviating from the optimal gradient (the average of honest gradients) in the non-IID setting. We attribute the failure of these AGRs to two newly proposed concepts: identification failure and integrity failure. The identification failure mainly comes from the exacerbated curse of dimensionality in the non-IID setting. The integrity failure is a combined result of the conservative filtering strategy and gradient heterogeneity. In order to address both failures, we propose GAIN, a gradient decomposition scheme that can help adapt existing robust algorithms to heterogeneous datasets. We theoretically show that integrating existing robust AGRs into our GAIN can mitigate the deviation of the aggregated gradient, and thus improve performance. Experiments on various real-world datasets verify the efficacy of our proposed GAIN.","Federated Learning, Byzantine Robustness." Temporary feature collapse phenomenon in early learning of MLPs,https://openreview.net/forum?id=FLMvYXMucWk,https://openreview.net/pdf?id=FLMvYXMucWk,"In this paper, we focus on a typical two-phase phenomenon in the learning of multi-layer perceptrons (MLPs), and we discover and explain the reason for the feature collapse in the first phase.","In this paper, we focus on a typical two-phase phenomenon in the learning of multi-layer perceptrons (MLPs). We discover and explain the reason for the feature collapse phenomenon in the first phase, i.e., the diversity of features over different samples keeps decreasing in the first phase, until samples of different categories share almost the same feature, which hurts the optimization of MLPs. We explain this phenomenon in terms of the learning dynamics of MLPs. Furthermore, we theoretically analyze why four typical operations can alleviate the feature collapse. The code has been attached with the submission.","Neural Networks, Deep Learning Theory, Multi-Layer Perceptrons" MECTA: Memory-Economic Continual Test-Time Model Adaptation,https://openreview.net/forum?id=N92hjSf5NNh,https://openreview.net/pdf?id=N92hjSf5NNh,,"Continual Test-time Adaptation (CTA) is a promising approach to secure accuracy gains in continually-changing environments. The state-of-the-art adaptations improve out-of-distribution model accuracy via computation-efficient online test-time gradient descents, but meanwhile cost several times the memory of inference, even if only a small portion of the parameters are updated. Such high memory consumption substantially impedes the wide application of advanced CTA on memory-constrained devices. In this paper, we provide a novel solution, dubbed MECTA, to drastically improve the memory efficiency of gradient-based CTA. Our profiling shows that the major memory overhead comes from the intermediate cache for back-propagation, which scales with the batch size, channel count, and layer number. Therefore, we propose to reduce batch sizes, adopt an adaptive normalization layer to maintain stable and accurate predictions, and stop the back-propagation caching heuristically. On the other hand, we prune the networks to reduce the computation and memory overheads in optimization and recover the parameters afterward to avoid forgetting.
The proposed MECTA is efficient and can be seamlessly plugged into state-of-the-art CTA algorithms with negligible computation and memory overhead. On three datasets, CIFAR10, CIFAR100, and ImageNet, MECTA improves the accuracy by at least 8.5% with constrained memory and significantly reduces the memory cost of ResNet50 on ImageNet by at least 70% without sacrificing accuracy. Our code will be published upon acceptance.","continual test-time adaptation, memory efficiency" MocoSFL: enabling cross-client collaborative self-supervised learning,https://openreview.net/forum?id=2QGJXyMNoPz,https://openreview.net/pdf?id=2QGJXyMNoPz,"Existing collaborative SSL schemes are not suitable for cross-client applications because of their expensive computation and local data requirements. To address these issues, we propose MocoSFL based on Split Federated Learning and MoCo.","Existing collaborative self-supervised learning (SSL) schemes are not suitable for cross-client applications because of their expensive computation and large local data requirements. To address these issues, we propose MocoSFL, a collaborative SSL framework based on Split Federated Learning (SFL) and Momentum Contrast (MoCo). In MocoSFL, the large backbone model is split into a small client-side model and a large server-side model, and only the small client-side model is processed on the client's local devices. MocoSFL has three key components: (i) vector concatenation, which enables the use of a small batch size and reduces computation and memory requirements by orders of magnitude; (ii) feature sharing, which helps achieve high accuracy regardless of the quality and volume of local data; (iii) frequent synchronization, which helps achieve better non-IID performance because of smaller local model divergence. For a 1,000-client case with non-IID data (each client only has data from 2 random classes of CIFAR-10), MocoSFL can achieve over 84% accuracy with a ResNet-18 model. Next, we present the TAResSFL module, which significantly improves resistance to privacy threats and reduces communication overhead with a small sacrifice in accuracy for a MocoSFL system. On a Raspberry Pi 4B device, the MocoSFL-based scheme requires less than 1MB of memory and less than 40MB of communication, and consumes less than 5W power. Thus, compared to the state-of-the-art FL-based approach, MocoSFL has significant advantages in both accuracy and practicality for cross-client applications.","Self-supervised Learning, Collaborative Learning, Split Federated Learning, Momentum Contrast" Block-level Stiffness Analysis of Residual Networks,https://openreview.net/forum?id=W5U_xEGOaIY,https://openreview.net/pdf?id=W5U_xEGOaIY,In this paper we are the first ones to connect the concepts of stiffness and ResNets via the dynamical systems interpretation to propose that ResNets can be viewed as stiff ODEs.,"Residual Networks (ResNets) can be interpreted as dynamic systems, which are systems whose state changes over time and can be described with ordinary differential equations (ODEs). Specifically, the dynamic systems interpretation views individual residual blocks as ODEs. The numerical solution to an ODE is an approximation and therefore contains an error term. If an ODE is stiff, this error is likely to be amplified and become dominant in the solution calculations, which negatively affects the accuracy of the approximated solution. Therefore, stiff ODEs are often numerically unstable.
In this paper we leverage the dynamic systems interpretation to perform a novel theoretical analysis of ResNets, drawing on findings and tools from the numerical analysis of ODEs. Specifically, we perform block-level stiffness analysis of ResNets. We find that residual blocks towards the end of ResNet models exhibit increased stiffness and that there is a significant correlation between stiffness and model accuracy and loss. Based on these findings, we propose that ResNets behave as stiff, numerically unstable ODEs.","resnets, stiffness, ordinary differential equations" Q-Match: Self-Supervised Learning For Tabular Data by Matching Distributions Induced by a Queue,https://openreview.net/forum?id=zNq-jISUm7E,https://openreview.net/pdf?id=zNq-jISUm7E,A self-supervised method to train models by minimizing the cross-entropy loss between student-teacher distributions generated using a queue of embeddings. This results in better downstream task performance with less labeled data.,"In semi-supervised learning, student-teacher distribution matching has been successful in improving the performance of models using unlabeled data in conjunction with few labeled samples. In this paper, we aim to replicate that success in the self-supervised setup where we do not have access to any labeled data during pre-training. We show it is possible to induce the student-teacher distributions without any knowledge of downstream classes by using a queue of embeddings of samples from the unlabeled dataset. We show that Q-Match outperforms previous self-supervised learning techniques on tabular datasets when measuring downstream classification performance. Furthermore, we show that our method is sample efficient, both in terms of the labels required for downstream task training and the amount of unlabeled data required for pre-training.","self-supervised learning, deep learning for tabular data" Supervised Contrastive Regression,https://openreview.net/forum?id=_QZlje4dZPu,https://openreview.net/pdf?id=_QZlje4dZPu,,"Deep regression models typically learn in an end-to-end fashion and do not explicitly try to learn a regression-aware representation. Their representations tend to be fragmented and fail to capture the continuous nature of regression tasks. In this paper, we propose Supervised Contrastive Regression (SupCR), a framework that learns a regression-aware representation by contrasting samples against each other based on their target distance. SupCR is orthogonal to existing regression models, and can be used in combination with such models to improve performance. Extensive experiments using five real-world regression datasets that span computer vision, human-computer interaction, and healthcare show that using SupCR achieves the state-of-the-art performance and consistently improves prior regression baselines on all datasets, tasks, and input modalities.
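The contrast-by-target-distance idea in the SupCR abstract above can be made concrete with a short sketch. What follows is a minimal, unoptimized reading of such a loss, not the authors' exact formulation: the temperature, the L1 label distance, and the rule that treats every sample farther from the anchor in label space than the positive as a negative are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def supcr_style_loss(features, targets, temperature=0.1):
    """Contrast samples by target distance: for anchor i and positive j,
    any sample k farther from i in label space than j acts as a negative.
    features: (N, D) embeddings; targets: (N,) scalar regression labels."""
    z = F.normalize(features, dim=1)
    sim = z @ z.t() / temperature                      # (N, N) similarities
    y = targets.float()[:, None]
    label_dist = torch.cdist(y, y, p=1)                # (N, N) |y_i - y_k|
    n, loss, count = z.size(0), 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # negatives: strictly farther from the anchor than the positive
            neg = label_dist[i] > label_dist[i, j]
            denom = sim[i, j].exp() + sim[i][neg].exp().sum()
            loss = loss - torch.log(sim[i, j].exp() / denom)
            count += 1
    return loss / max(count, 1)
```

The double loop keeps the sketch readable; a practical version would vectorize the per-anchor masks.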
SupCR also improves robustness to data corruptions, resilience to reduced training data, performance on transfer learning, and generalization to unseen targets.","regression, regression learning" SELF-SUPERVISED PRETRAINING FOR DIFFERENTIALLY PRIVATE LEARNING,https://openreview.net/forum?id=lfzmAJ12sg,https://openreview.net/pdf?id=lfzmAJ12sg,We demonstrate self-supervised pretraining is a scalable solution to deep learning with differential privacy regardless of the size of available public datasets in image classification.,"We demonstrate that self-supervised pretraining (SSP) is a scalable solution to deep learning with differential privacy (DP) in image classification, regardless of the size of the available public datasets. When no public dataset is available, we show that the features generated by SSP on only a single image enable a private classifier to obtain much better utility than non-learned handcrafted features under the same privacy budget. When a moderate- or large-sized public dataset is available, the features produced by SSP greatly outperform features trained with labels on various complex private datasets under the same privacy budget. We also compare multiple DP-enabled training frameworks to train a private classifier on the features generated by SSP.","differential privacy, contrastive learning, learned features, one image" Explainable Artificial Intelligence: Reaping the Fruits of Decision Trees,https://openreview.net/forum?id=K5si8PjaSy,https://openreview.net/pdf?id=K5si8PjaSy,This work assessed node weight patterns toward explaining artificial intelligence systems.,"The recent push for explainable artificial intelligence (XAI) has given rise to extensive work toward understanding the inner workings of neural networks. Much of that work, however, has focused on manipulating input data feeding the network to assess their effect on network output. It is shown in this study that XAI can benefit from investigating the network node, the most fundamental unit of neural networks. Whereas studies on XAI have mostly benefited from a focus on manipulating input data, assessing patterns in node weights may prove equally beneficial, if not more significant, especially when realizing that weight values may not be as random as previously thought. A manipulated, a contrived, and a real dataset were used in this study. Datasets were run on convolutional and deep neural network models. Node rank stability was the central construct used to investigate neuronal patterns in this study. Rank stability was defined as the number of epochs wherein nodes held their rank in terms of weight value compared to their rank at the last epoch, when the model reached convergence, or stability (defined in this study as accuracy $\geq$ 0.90). Findings indicated that neural networks behaved like a decision tree, in that rank stability increased as weight absolute values increased. Decision tree behavior may assist in more efficient pruning algorithms, which may produce distilled models that are simpler to explain to technical and non-technical audiences.","Explainable artificial intelligence, XAI, decision trees, explainability, neural networks, pruning" Meta-Evolve: Continuous Robot Evolution for One-to-many Policy Transfer,https://openreview.net/forum?id=h9yn69de6f,https://openreview.net/pdf?id=h9yn69de6f,,"We investigate the problem of transferring an expert policy from a source robot to multiple different robots.
To solve this problem, we propose a method named Meta-Evolve that uses continuous robot evolution to efficiently transfer the policy to a newly defined meta robot and then to each target robot. Since the meta robot is closer to the target robots, our approach can significantly outperform naive one-to-one policy transfer. We also present three heuristic approaches, with theoretical results, to determine the meta robot. Experiments have shown that with three target robots, our method is able to improve over the baseline of launching multiple independent one-to-one robot-to-robot policy transfers by up to 2.4$\times$ in terms of the training and exploration needed.", Interpretability with full complexity by constraining feature information,https://openreview.net/forum?id=R_OL5mLhsv,https://openreview.net/pdf?id=R_OL5mLhsv,,"Interpretability is a pressing issue for machine learning. Common approaches to interpretable machine learning constrain interactions between features of the input, rendering comprehensible the effects of features on a model's output at the expense of model complexity. We approach interpretability from a different angle: constrain the information about the features utilized by the model, without any restrictions on the complexity of the model. Borrowing from information theory, we use the Distributed Information Bottleneck to find optimal compressions of each feature that maximally preserve information about the output. The learned information allocation, by feature and by feature value, is a rich source of interpretability well-suited to problems with many features and complex feature interactions. The central object of analysis is not a single trained model, but rather a spectrum of models serving as approximations that leverage variable amounts of information and range from the uninformative to the most performant model obtainable. Information is allocated to features by their relevance to the output, tackling the problem of feature selection with inclusion/exclusion existing on a learned continuum. The optimal compression of each feature---at every stage of approximation---allows fine-grained inspection of how feature values are similar or distinct with regards to the prediction. We develop a framework for extracting insight from the spectrum of approximate models and demonstrate its utility on a range of tabular datasets.", Revisiting Group Robustness: Class-specific Scaling is All You Need,https://openreview.net/forum?id=pkgVPeL9gpX,https://openreview.net/pdf?id=pkgVPeL9gpX,"We propose a simple class-specific scaling strategy to control the trade-off between robust and average accuracies, and based on this, we develop a comprehensive performance evaluation metric and advanced algorithm to improve the trade-off.","Group distributionally robust optimization, which aims to improve robust accuracies such as worst-group or unbiased accuracy, is one of the mainstream approaches to mitigate spurious correlation and reduce dataset bias. While existing approaches have apparently gained performance in robust accuracy, these improvements mainly come from a trade-off at the expense of average accuracy. To address the challenges, we first propose a simple class-specific scaling strategy to control the trade-off between robust and average accuracies flexibly and efficiently, which is directly applicable to existing debiasing algorithms without additional training; it reveals that a naive ERM baseline matches or even outperforms the recent debiasing approaches by adopting the class-specific scaling.
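A minimal sketch of how a post-hoc class-specific scaling of the kind described above could be applied and evaluated. The function names, the binary-class sweep, and the specific scaling rule (multiplying logits by per-class factors before the argmax) are illustrative assumptions; the paper's exact formulation may differ.

```python
import numpy as np

def scaled_predictions(logits, class_scales):
    """Post-hoc class-specific scaling: multiply each class's logit by a
    per-class factor before the argmax. logits: (N, C); class_scales: (C,)."""
    return (logits * class_scales[None, :]).argmax(axis=1)

def group_accuracies(preds, labels, groups):
    """Return (worst-group accuracy, average accuracy)."""
    accs = [np.mean(preds[groups == g] == labels[groups == g])
            for g in np.unique(groups)]
    return min(accs), float(np.mean(preds == labels))

# Sweeping one class's scale traces the robust/average trade-off without
# any retraining (logits, labels, groups come from a trained ERM model):
# for s in np.linspace(0.5, 2.0, 16):
#     scales = np.array([1.0, s])      # hypothetical binary-class example
#     print(s, group_accuracies(scaled_predictions(logits, scales),
#                               labels, groups))
```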
Then, we employ this technique to 1) evaluate the performance of existing algorithms in a comprehensive manner by introducing a novel unified metric that summarizes the trade-off between the two accuracies as a scalar value and 2) develop an instance-wise adaptive scaling technique for overcoming the trade-off and improving the performance even further in terms of both accuracies. Experimental results verify the effectiveness of the proposed frameworks in both tasks. ","Group robustness, spurious correlation, debiasing, worst-group accuracy, unbiased accuracy, performance evaluation" Provable Benefits of Representational Transfer in Reinforcement Learning,https://openreview.net/forum?id=sNKZaNkyi7Q,https://openreview.net/pdf?id=sNKZaNkyi7Q,"We present an algorithm that performs efficient transfer learning in low-rank MDPs, and provide sample complexity bounds under minimal assumptions.","We study the problem of representational transfer in RL, where an agent first pretrains in a number of source tasks to discover a shared representation, which is subsequently used to learn a good policy in a target task. We propose a new notion of task relatedness between source and target tasks, and develop a novel approach for representational transfer under this assumption. Concretely, we show that given generative access to source tasks, we can discover a representation, using which subsequent linear RL techniques quickly converge to a near-optimal policy, with only online access to the target task. The sample complexity is close to that of knowing the ground-truth features in the target task, and comparable to prior representation learning results in the source tasks. We complement our positive results with lower bounds without generative access, and validate our findings with empirical evaluation on rich observation MDPs that require deep exploration. ","reinforcement learning theory, representation learning, transfer learning" Set Discrimination Contrastive Learning,https://openreview.net/forum?id=lJsr4DwZm1z,https://openreview.net/pdf?id=lJsr4DwZm1z,We propose a method that integrates the concept of set representation learning to improve self-supervised visual representation learning,"In this work, we propose a self-supervised contrastive learning method that integrates the concept of set-based feature learning. The main idea of our method is to randomly construct sets of instances in a mini-batch and then learn to contrast the set representations. Inspired by set-based feature learning, we aggregate set features from individual sample features by a symmetric function. To improve the effectiveness of our set-based contrastive learning, we propose a set construction scheme built upon sample permutation in a mini-batch that allows a sample to appear in multiple sets, which naturally ensures common features among sets by construction. Our set construction scheme also increases both the number of positive and negative sets in a mini-batch, leading to better representation learning. We demonstrate the robustness of our method by seamlessly integrating it into existing contrastive learning methods such as SimCLR and MoCo.
Extensive experiments demonstrate that our method consistently improves the performance of these contrastive learning methods on various datasets and downstream tasks.","self-supervised learning, contrastive learning" What shapes the loss landscape of self supervised learning?,https://openreview.net/forum?id=3zSn48RUO8M,https://openreview.net/pdf?id=3zSn48RUO8M,We analytically solve the loss landscape of self-supervised learning and identify the causes of complete and dimensional collapse,"Prevention of complete and dimensional collapse of representations has recently become a design principle for self-supervised learning (SSL). However, questions remain in our theoretical understanding: When do those collapses occur? What are the mechanisms and causes? We provide answers to these questions by thoroughly analyzing SSL loss landscapes for a linear model. We derive an analytically tractable theory of the SSL landscape and show that it accurately captures an array of collapse phenomena and identifies their causes. Finally, we leverage the interpretability afforded by the analytical theory to understand how dimensional collapse can be beneficial and what affects the robustness of SSL against data imbalance.","loss landscape, self-supervised learning, collapse" Hard Regularization to Prevent Collapse in Online Deep Clustering without Data Augmentation,https://openreview.net/forum?id=HTbp9Y7g9P,https://openreview.net/pdf?id=HTbp9Y7g9P,regularizing hard cluster assignments with a Bayesian optimization problem to prevent collapse in online deep clustering without data augmentation,"Online deep clustering refers to the joint use of a feature extraction network and a clustering model to assign cluster labels to each new data point or batch as it is processed. While faster and more versatile than offline methods, online clustering can easily reach the collapsed solution where the encoder maps all inputs to the same point and all are put into a single cluster. Successful existing models have employed various techniques to avoid this problem, most of which require data augmentation or which aim to make the average soft assignment across the dataset the same for each cluster. We propose a method that does not require data augmentation, and that, different from existing methods, regularizes the hard assignments. Using a Bayesian framework, we derive an intuitive optimization objective that can be straightforwardly included in the training of the encoder network. Tested on four image datasets, we show that it consistently avoids collapse more robustly than other methods and that it leads to more accurate clustering. We also conduct further experiments and analysis justifying our choice to regularize the hard cluster assignments.","deep learning, clustering, online" Learning Lightweight Object Detectors via Progressive Knowledge Distillation,https://openreview.net/forum?id=UVKwsWsXTt,https://openreview.net/pdf?id=UVKwsWsXTt,We propose a progressive approach to distill knowledge from multiple teacher detectors into a lightweight student.,"Resource-constrained perception systems such as edge computing and vision-for-robotics require vision models to be both accurate and lightweight in computation and memory usage.
Knowledge distillation is one effective strategy to improve the performance of lightweight classification models, but it is less well-explored for structured outputs such as object detection and instance segmentation, where the variable number of outputs and complex internal network modules complicate the distillation. In this paper, we propose a simple yet surprisingly effective sequential approach to knowledge distillation that progressively transfers the knowledge of a set of teachers to a given lightweight student. Our approach is inspired by curriculum learning: To distill knowledge from a highly accurate but complex teacher model, we construct a sequence of teachers to help the student gradually adapt. Our progressive distillation strategy can be easily combined with existing distillation mechanisms to consistently maximize student performance in various settings. To the best of our knowledge, we are the first to successfully distill knowledge from Transformer-based teacher detectors to convolution-based students, and unprecedentedly boost the performance of ResNet-50 based RetinaNet from 36.5% to 42.0% AP and Mask R-CNN from 38.2% to 42.5% AP on the MS COCO benchmark.","object detection, knowledge distillation" Topologically faithful image segmentation via induced matching of persistence barcodes,https://openreview.net/forum?id=a3-QYAgcDBl,https://openreview.net/pdf?id=a3-QYAgcDBl,"In this work, we propose the first topologically and feature-wise, spatially accurate metric and loss function for supervised image segmentation.","Image segmentation is a largely researched field where neural networks find vast applications in many facets of technology. Some of the most popular approaches to train segmentation networks employ loss functions optimizing pixel-overlap, an objective that is insufficient for many segmentation tasks. In recent years, their limitations fueled a growing interest in topology-aware methods, which aim to recover the correct topology of the segmented structures. However, so far, none of the existing approaches achieve a spatially correct matching between the topological features (persistence barcodes) of the label (ground truth) and the prediction (output of the neural network). In this work, we propose the first topologically and feature-wise accurate metric and loss function for supervised image segmentation, which we term TopoMatch. We show how induced matchings guarantee the spatially correct matching between barcodes in a segmentation setting. Furthermore, we propose an efficient algorithm to compute TopoMatch for images. We show that TopoMatch is an interpretable metric to evaluate the topological correctness of segmentations. Moreover, we demonstrate how induced matchings can be used to train segmentation networks and improve the topological correctness of the segmentations across all 6 baseline datasets while preserving volumetric segmentation performance. ","Topology, Segmentation, Machine Learning" Prompt-driven efficient Open-set Semi-supervised Learning,https://openreview.net/forum?id=WcSm-iommPR,https://openreview.net/pdf?id=WcSm-iommPR,,"Open-set Semi-Supervised Learning (OSSL) has always been vulnerable to unseen categories, i.e., outliers that have never been seen in the labeled set. An out-of-distribution (OOD) detector is therefore introduced to identify the outliers that the unlabeled data may contain but that are unseen in the labeled training data, to reduce the damage to the SSL algorithm.
In this work, we suggest using a visual-prompting-driven mechanism to obtain higher effectiveness in the OSSL task. To this end, we propose a prompt-driven efficient OSSL framework, called OpenPrompt, which can propagate class information from labeled to unlabeled data with only a small number of trainable parameters in the input space. Besides, a prompt-driven joint space learning mechanism is proposed to detect OOD data by maximizing the distribution gap between ID and OOD samples in unlabeled data. The experimental results on three public datasets show that OpenPrompt outperforms state-of-the-art methods with less than 1% of trainable parameters. More importantly, OpenPrompt achieves a 4% improvement in terms of AUROC on outlier detection over a fully supervised model on CIFAR10.", Generalizability of Adversarial Robustness Under Distribution Shifts,https://openreview.net/forum?id=V3GQRhBzEi,https://openreview.net/pdf?id=V3GQRhBzEi,We study the generalizability of empirical and certified robustness to unseen domains.,"Recent progress in empirical and certified robustness promises to deliver reliable and deployable Deep Neural Networks (DNNs). Despite that success, most existing evaluations of DNN robustness have been done on images sampled from the same distribution that the model was trained on. Yet, in the real world, DNNs may be deployed in dynamic environments that exhibit significant distribution shifts. In this work, we take a first step towards thoroughly investigating the interplay between empirical and certified adversarial robustness on one hand and domain generalization on another. To do so, we train robust models on multiple domains and evaluate their accuracy and robustness on an unseen domain. We observe that: (1) both empirical and certified robustness generalize to unseen domains, and (2) the level of generalizability does not correlate well with input visual similarity, measured by the FID between source and target domains. We also extend our study to cover a real-world medical application, in which adversarial augmentation enhances both the robustness and generalization accuracy in unseen domains.","domain generalization, adversarial robustness" Uncertainty-Driven Active Vision for Implicit Scene Reconstruction,https://openreview.net/forum?id=Khh7jHEJJFX,https://openreview.net/pdf?id=Khh7jHEJJFX,"We use neural rendering to approximate the observable uncertainty of an occupancy based scene reconstruction model, which we use to select camera parameters for a next-best-view task.","Multi-view implicit scene reconstruction methods have become increasingly popular due to their ability to represent complex scene details. Recent efforts have been devoted to improving the representation of input information and to reducing the number of views required to obtain high quality reconstructions. Yet, perhaps surprisingly, the study of which views to select to maximally improve scene understanding remains largely unexplored. We propose an uncertainty-driven active vision approach for implicit scene reconstruction, which leverages occupancy uncertainty accumulated across the scene using volume rendering to select the next view to acquire. To this end, we develop an occupancy-based reconstruction method which accurately represents scenes using either 2D or 3D supervision.
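One plausible reading of the uncertainty accumulation described in the active-vision abstract above, sketched under stated assumptions: occupancy probabilities are treated as densities for alpha compositing, and a candidate view is scored by the rendering-weighted binary entropy of occupancy along its rays. The paper's exact definition may differ.

```python
import torch

def ray_uncertainty(occ_probs, deltas):
    """Accumulate occupancy uncertainty along rays with volume-rendering
    weights. occ_probs: (R, S) per-sample occupancy in [0, 1];
    deltas: (R, S) distances between consecutive samples along each ray."""
    alphas = 1.0 - torch.exp(-occ_probs * deltas)       # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:, :1]),
                   1.0 - alphas + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alphas * trans                            # rendering weights
    p = occ_probs.clamp(1e-6, 1.0 - 1e-6)
    entropy = -(p * p.log() + (1 - p) * (1 - p).log())  # binary entropy
    return (weights * entropy).sum(dim=1)               # (R,) per-ray score

# A candidate view's score could be the mean of ray_uncertainty over its
# rays; the next view to acquire is then the argmax over candidates.
```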
We evaluate our proposed approach on the ABC dataset and the in-the-wild CO3D dataset, and show that: (1) we are able to obtain high-quality, state-of-the-art occupancy reconstructions; (2) our perspective-conditioned uncertainty definition is effective in driving improvements in next-best-view selection and outperforms strong baseline approaches; and (3) we can further improve shape understanding by performing a gradient-based search on the view selection candidates. Overall, our results highlight the importance of view selection for implicit scene reconstruction, making it a promising avenue to explore further.","Neural Rendering, 3D Reconstruction, Scene Reconstruction, Next Best View, Uncertainty Estimation" No Reason for No Supervision: Improved Generalization in Supervised Models,https://openreview.net/forum?id=3Y5Uhf5KgGK,https://openreview.net/pdf?id=3Y5Uhf5KgGK,,"We consider the problem of training a deep neural network on a given classification task, e.g., ImageNet-1K (IN1K), so that it excels both at the training task and at other (future) transfer tasks. These two seemingly contradictory properties impose a trade-off between improving the model's generalization while maintaining its performance on the original task. Models trained with self-supervised learning tend to generalize better than their supervised counterparts for transfer learning; yet, they still lag behind supervised models on IN1K. In this paper, we propose a supervised learning setup that leverages the best of both worlds. We extensively analyse supervised training using multi-scale crops for data augmentation and an expendable projector head, and reveal that the design of the projector allows us to control the trade-off between performance on the training task and transferability. We further replace the last layer of class weights with class prototypes computed on the fly using a memory bank and derive two models: t-ReX that achieves a new state of the art for transfer learning and outperforms top methods such as DINO and PAWS on IN1K, and t-ReX* that matches the highly optimized RSB-A1 model on IN1K while performing better on transfer tasks. Finally, we perform several analyses of the features and class weights to present insights on how each component of our setup affects the training and learned representations.","supervised learning, transfer learning, representation learning" Linear Convergence of Natural Policy Gradient Methods with Log-Linear Policies,https://openreview.net/forum?id=-z9hdsyUwVQ,https://openreview.net/pdf?id=-z9hdsyUwVQ,We show linear convergence of natural policy gradient methods with log-linear policies without any regularization.,"We consider infinite-horizon discounted Markov decision processes and study the convergence rates of the natural policy gradient (NPG) and the Q-NPG methods with the log-linear policy class. Using the compatible function approximation framework, both methods with log-linear policies can be written as approximate versions of the policy mirror descent (PMD) method. We show that both methods attain linear convergence rates and $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexities using a simple, non-adaptive geometrically increasing step size, without resorting to entropy or other strongly convex regularization.
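For readers unfamiliar with the PMD view mentioned above, the update can be sketched as a multiplicative-weights step. The following is only a schematic reconstruction; the exact constants, the compatible function approximation details, and the step-size schedule are in the paper.

```latex
% Schematic PMD-style update behind (Q-)NPG with log-linear policies:
% a multiplicative-weights step driven by an estimated Q-function, with a
% non-adaptive, geometrically increasing step size (no entropy regularization).
\[
\pi_{t+1}(a \mid s) \;\propto\; \pi_t(a \mid s)\,
\exp\!\big(\eta_t\, \widehat{Q}^{\pi_t}(s,a)\big),
\qquad
\eta_t = \eta_0\,\gamma^{-t} \quad \text{for some } \gamma \in (0,1).
\]
```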
Lastly, as a byproduct, we obtain sublinear convergence rates for both methods with an arbitrary constant step size.","Discounted Markov decision process, natural policy gradient, policy mirror descent, log-linear policy, sample complexity" Active Learning with Controllable Augmentation Induced Acquisition,https://openreview.net/forum?id=vE93gf9kYkf,https://openreview.net/pdf?id=vE93gf9kYkf,,"The mission of active learning is to iteratively identify the most informative data samples to annotate, and therefore to attain decent performance with far fewer samples. Despite the promise, the acquisition of informative unlabeled samples can be unreliable --- particularly during early cycles --- owing to limited data samples and sparse supervision. To tackle this, data augmentation techniques are a straightforward yet promising way to easily extend the exploration of the input space. In this work, we thoroughly study the coupling of data augmentation and active learning, whereby we propose the Controllable Augmentation ManiPulator for Active Learning (CAMPAL). In contrast to the few prior works that touched on this line, CAMPAL emphasizes a tighter and better-controlled integration of data augmentation into the active learning framework, in three respects: (i) carefully designed data augmentation policies applied separately to the labeled and unlabeled data pools in every cycle; (ii) controlled and quantifiably optimizable augmentation strengths; (iii) full but flexible coverage of most (if not all) active learning schemes. Through extensive empirical experiments, we bring the performance of active learning methods to a new level: an absolute performance boost of 16.99% on CIFAR-10 and 12.25% on SVHN with 1,000 annotated samples. Complementary to the empirical results, we further provide theoretical analysis and justification of CAMPAL.","active learning, data augmentation, strength" Learning Axis-Aligned Decision Trees with Gradient Descent,https://openreview.net/forum?id=gwizseh-Iam,https://openreview.net/pdf?id=gwizseh-Iam,"A novel approach to learn univariate, axis-aligned decision trees with gradient descent using a dense tree representation and an adjusted backpropagation algorithm.","Decision Trees are commonly used for many machine learning tasks due to their high interpretability. However, learning a decision tree from data is a difficult optimization problem, since it is non-convex and non-differentiable. Therefore, common approaches learn decision trees using a greedy growth algorithm that minimizes the impurity at each internal node. Unfortunately, this greedy procedure can lead to suboptimal trees. In this paper, we present a novel approach for learning univariate, axis-aligned decision trees with gradient descent. This is achieved by applying backpropagation with an adjusted gradient flow on a dense decision tree representation that optimizes all decision tree parameters jointly. We show that our gradient-based optimization outperforms existing baselines on several binary classification benchmarks and achieves competitive results for multi-class tasks.
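As an illustration of the kind of dense, jointly optimized tree representation described above, here is a hedged sketch of a soft, axis-aligned decision tree trainable by backpropagation. The softmax feature selection, sigmoid splits, and temperature `beta` are common choices assumed for illustration, not necessarily the paper's exact construction.

```python
import torch
import torch.nn as nn

class SoftAxisAlignedTree(nn.Module):
    """Depth-d complete binary tree; each internal node learns a (soft)
    feature choice and a threshold, each leaf a class-logit vector."""
    def __init__(self, n_features, n_classes, depth=3, beta=10.0):
        super().__init__()
        n_internal = 2 ** depth - 1
        self.feat_logits = nn.Parameter(torch.randn(n_internal, n_features))
        self.thresholds = nn.Parameter(torch.zeros(n_internal))
        self.leaves = nn.Parameter(torch.zeros(2 ** depth, n_classes))
        self.depth, self.beta = depth, beta

    def forward(self, x):                                   # x: (N, F)
        # Softmax feature selection keeps splits (approximately) axis-aligned.
        feat = x @ torch.softmax(self.feat_logits, dim=1).t()
        go_right = torch.sigmoid(self.beta * (feat - self.thresholds))
        n, path, node = x.size(0), torch.ones(x.size(0), 1), 0
        for level in range(self.depth):
            probs = go_right[:, node:node + 2 ** level]     # this level's nodes
            path = torch.stack([path * (1 - probs), path * probs], dim=2)
            path = path.reshape(n, -1)                      # leaf-path probs
            node += 2 ** level
        return path @ self.leaves                           # (N, n_classes)

# Trained end-to-end, e.g. with nn.CrossEntropyLoss on the returned logits;
# hardening the softmax/sigmoid after training recovers a univariate tree.
```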
To the best of our knowledge, this is the first approach that attempts to learn univariate, axis-aligned decision trees with gradient descent.","Decision Trees, Gradient Descent" DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models,https://openreview.net/forum?id=4vGwQqviud5,https://openreview.net/pdf?id=4vGwQqviud5,We propose a fast ODE solver for guided sampling of diffusion probabilistic models in around 15 to 20 steps.,"Diffusion probabilistic models (DPMs) have achieved impressive success in high-resolution image synthesis, especially in recent large-scale text-to-image generation applications. An essential technique for improving the sample quality of DPMs is guided sampling, which usually needs a large guidance scale to obtain the best sample quality. The commonly-used fast sampler for guided sampling is DDIM, a first-order diffusion ODE solver that generally needs 100 to 250 steps for high-quality samples. Although recent works propose dedicated high-order solvers and achieve a further speedup for sampling without guidance, their effectiveness for guided sampling has not been well-tested before. In this work, we demonstrate that previous high-order fast samplers suffer from instability issues, and they even become slower than DDIM when the guidance scale grows large. To further speed up guided sampling, we propose DPM-Solver++, a high-order solver for the guided sampling of DPMs. DPM-Solver++ solves the diffusion ODE with the data prediction model and adopts thresholding methods to keep the solution matching the training data distribution. We further propose a multistep variant of DPM-Solver++ to address the instability issue by reducing the effective step size. Experiments show that DPM-Solver++ can generate high-quality samples within only 15 to 20 steps for guided sampling by pixel-space and latent-space DPMs. ","diffusion probabilistic models, score-based generative models, fast sampling, guided sampling" A Class-Aware Representation Refinement Framework for Graph Classification,https://openreview.net/forum?id=9HFobmKAmGv,https://openreview.net/pdf?id=9HFobmKAmGv,CARE computes simple yet powerful class representations and injects them to steer the learning of graph representations towards better class separability,"Graph Neural Networks (GNNs) are widely used for graph representation learning. Despite their prevalence, GNNs suffer from two drawbacks in the graph classification task: the neglect of graph-level relationships and the generalization issue. Each graph is treated separately in GNN message passing/graph pooling, and existing methods to address overfitting operate on each individual graph. This makes the graph representations learnt less effective for downstream classification. In this paper, we propose a Class-Aware Representation rEfinement (CARE) framework for the task of graph classification. CARE computes simple yet powerful class representations and injects them to steer the learning of graph representations towards better class separability. CARE is a plug-and-play framework that is highly flexible and able to incorporate arbitrary GNN backbones without significantly increasing the computational cost. We also theoretically prove that CARE has a better generalization upper bound than its GNN backbone through Vapnik-Chervonenkis (VC) dimension analysis.
Our extensive experiments with 10 well-known GNN backbones on 9 benchmark datasets validate the superiority and effectiveness of CARE over its GNN counterparts.","Graph Neural Network, Representation Learning, Graph Classification" EVA3D: Compositional 3D Human Generation from 2D Image Collections,https://openreview.net/forum?id=g7U9jD_2CUr,https://openreview.net/pdf?id=g7U9jD_2CUr,"We propose EVA3D, a high-quality unconditional 3D human generative model learned from 2D image collections.","Inverse graphics aims to recover 3D models from 2D observations. Utilizing differentiable rendering, recent 3D-aware generative models have shown impressive results on rigid object generation from 2D images. However, it remains challenging to generate articulated objects, like human bodies, due to their complexity and diversity in poses and appearances. In this work, we propose EVA3D, an unconditional 3D human generative model learned from 2D image collections only. EVA3D can sample 3D humans with detailed geometry and render high-quality images (up to 512x256) without bells and whistles (e.g. super resolution). At the core of EVA3D is a compositional human NeRF representation, which divides the human body into local parts. Each part is represented by an individual volume. This compositional representation enables 1) inherent human priors, 2) adaptive allocation of network parameters, 3) efficient training and rendering. Moreover, to accommodate the characteristics of sparse 2D human image collections (e.g. imbalanced pose distribution), we propose a pose-guided sampling strategy for better GAN learning. Extensive experiments validate that EVA3D achieves state-of-the-art 3D human generation performance regarding both geometry and texture quality. Notably, EVA3D demonstrates great potential and scalability to ""inverse-graphics"" diverse human bodies with a clean framework.","3D Human Generation, Human NeRF, Inverse Graphics" Spurious Local Minima Provably Exist for Deep Convolutional Neural Networks,https://openreview.net/forum?id=0sjwFxqLHw3,https://openreview.net/pdf?id=0sjwFxqLHw3,We prove that a general class of spurious local minima exist in the loss landscape of deep convolutional neural networks with squared loss or cross-entropy loss.,"In this paper, we prove that a general family of spurious local minima exist in the loss landscape of deep convolutional neural networks with squared loss or cross-entropy loss. For this purpose, we develop some new techniques to address the challenges introduced by convolutional layers. We solve a combinatorial problem which considers the limited receptive fields of hidden neurons, and the possible distinct activation statuses of different samples and different locations in feature maps, to show that a differentiation of data samples is always possible somewhere in the feature maps. Training loss is then decreased by a perturbation of network parameters that can affect different samples in different ways. Although filters and biases are tied in each feature map, we give a construction in which this perturbation only affects the output of a single ReLU neuron and keeps the outputs at other locations unchanged. Finally, we give an example of a nontrivial spurious local minimum in which different activation patterns of samples are explicitly constructed. Experimental results verify our theoretical findings.
",theoretical issues in deep learning Nearly Minimax Optimal Offline Reinforcement Learning with Linear Function Approximation: Single-Agent MDP and Markov Game,https://openreview.net/forum?id=UP_GHHPw7rP,https://openreview.net/pdf?id=UP_GHHPw7rP,,"Offline reinforcement learning (RL) aims at learning an optimal strategy using a pre-collected dataset without further interactions with the environment. While various algorithms have been proposed for offline RL in the previous literature, the minimax optimality has only been (nearly) established for tabular Markov decision processes (MDPs). In this paper, we focus on offline RL with linear function approximation and propose two new algorithms, LinPEVI-ADV+ and LinPMVI-ADV+, for single-agent MDPs and two-player zero-sum Markov games (MGs), respectively. The proposed algorithms establish pessimism in a variance-reduction manner via reference-advantage decomposition and variance-reweighted ridge regression. Theoretical analysis demonstrates that they can match the performance lower bounds up to logarithmic factors. We also establish new performance lower bounds for MDPs and MGs, which tighten the existing results, to demonstrate the nearly minimax optimality of the proposed algorithms. As a byproduct, equipped with the techniques developed in this paper, we can further improve the suboptimality bound when the feature vector set is finite. To the best of our knowledge, these are the first computationally efficient and nearly minimax optimal algorithms for offline single-agent MDPs and MGs with linear function approximation.",RL theory Semi-Supervised Semantic Segmentation via Boosting Uncertainty on Unlabeled Data,https://openreview.net/forum?id=88Z7kxbZLL3,https://openreview.net/pdf?id=88Z7kxbZLL3,We theoretically analyze and experimentally prove that appropriately boosting uncertainty on unlabeled data can help minimize the distribution gap in semi-supervised semantic segmentation.,"We bring a new perspective to semi-supervised semantic segmentation by providing an analysis on the labeled and unlabeled distributions in training datasets. We firstly figure out that the distribution gap between labeled and unlabeled datasets cannot be ignored, even though the two datasets are sampled from the same distribution. To address this issue, we theoretically analyze and experimentally prove that appropriately boosting uncertainty on unlabeled data can help minimize the distribution gap, which benefits the generalization of the model. We propose two strategies and design an algorithm of uncertainty booster specially for semi-supervised semantic segmentation. Extensive experiments are carried out based on these theories, and the results confirmed the efficacy of the algorithm and strategies. Our plug-and play uncertainty booster is tiny, efficient and robust to hyper parameters, but can significantly promote the performance. In our experiments, our method achieves state-of-the-art performance compared to the current semi-supervised semantic segmentation methods on the popular benchmark: Cityscapes and PASCAL VOC 2012 with different train settings.","Semantic Segmentation, Semi-supervised Learning, Uncertainty in Deep Learning" Clustering Structure Identification With Ordering Graph,https://openreview.net/forum?id=HG0SwOmlaEo,https://openreview.net/pdf?id=HG0SwOmlaEo,,"In machine learning, data is often presented in the form of a graph or similarity (or distance) values between samples. 
Graph-based clustering methods such as spectral clustering are defined for general weighted graphs to identify the clustering structure. Graph construction research has developed significantly for decades, but graph-based partitioning still requires more attention because of its poor performance. For example, spectral clustering needs a post-processing (e.g., K-Means) step to uncover the clustering indicators. Yet, K-Means is sensitive to the initial center setting and easily falls into a local optimum. In this paper, we investigate a new type of graph-based clustering approach. Firstly, we introduce a new algorithm for the purpose of cluster analysis which does not explicitly produce a clustering of a dataset but instead creates an augmented graph representing its density-based ordered clustering structure. This ordered graph contains information equivalent to density-based clustering corresponding to a broad range of parameter settings. Secondly, we find that the graph matrix takes a block-diagonal form because of the nature of the ordering. We propose a partition method to learn the graph matrix's block-diagonal structure and identify the clustering directly. Global optimality is guaranteed theoretically. We test the proposed method on synthetic datasets and five high-dimensional datasets. Experimental results show that the proposed method outperforms state-of-the-art graph-based clustering methods, improving on their performance by roughly 10%-50%. ", Benchmarking Deformable Object Manipulation with Differentiable Physics,https://openreview.net/forum?id=1NAzMofMnWl,https://openreview.net/pdf?id=1NAzMofMnWl,,"Deformable Object Manipulation (DOM) is of significant importance to both daily and industrial applications. Recent successes in differentiable physics simulators allow learning algorithms to train a policy with analytic gradients through environment dynamics, which significantly facilitates the development of DOM algorithms. However, existing DOM benchmarks are either single-object-based or non-differentiable. This leaves open the questions of 1) how a task-specific algorithm performs on other tasks and 2) how a differentiable-physics-based algorithm compares with the non-differentiable ones in general. In this work, we present DaXBench, a differentiable DOM benchmark with a wide object and task coverage. DaXBench includes 9 challenging high-fidelity simulated tasks, covering rope, cloth, and liquid manipulation with various difficulty levels. To better understand the performance of general algorithms on different DOM tasks, we conduct comprehensive experiments over representative DOM methods, ranging from planning to imitation learning and reinforcement learning. In addition, we provide careful empirical studies of existing decision-making algorithms based on differentiable physics, and discuss their limitations, as well as potential future directions.","deformable object manipulation, differentiable physics, benchmark" Voxurf: Voxel-based Efficient and Accurate Neural Surface Reconstruction,https://openreview.net/forum?id=DSy8tP4WctmZ,https://openreview.net/pdf?id=DSy8tP4WctmZ,"We present Voxurf, a voxel-based approach for efficient and accurate neural surface reconstruction.","Neural surface reconstruction aims to reconstruct accurate 3D surfaces based on multi-view images. Previous methods based on neural volume rendering mostly train a fully implicit model with MLPs, which typically require hours of training for a single scene.
Recent efforts explore the explicit volumetric representation to accelerate the optimization via memorizing significant information with learnable voxel grids. However, existing voxel-based methods often struggle to reconstruct fine-grained geometry, even when combined with an SDF-based volume rendering scheme. We reveal that this is because 1) the voxel grids tend to break the color-geometry dependency that facilitates fine-geometry learning, and 2) the under-constrained voxel grids lack spatial coherence and are vulnerable to local minima. In this work, we present Voxurf, a voxel-based surface reconstruction approach that is both efficient and accurate. Voxurf addresses the aforementioned issues via several key designs, including 1) a two-stage training procedure that attains a coherent coarse shape and recovers fine details successively, 2) a dual color network that maintains color-geometry dependency, and 3) a hierarchical geometry feature to encourage information propagation across voxels. Extensive experiments show that Voxurf achieves high efficiency and high quality at the same time. On the DTU benchmark, Voxurf achieves higher reconstruction quality with a 20x training speedup compared to previous fully implicit methods. Our code will be made publicly available.","Surface Reconstruction, Neural Radiance Field" Graph Contrastive Learning with Personalized Augmentation,https://openreview.net/forum?id=bds4tm-XK2I,https://openreview.net/pdf?id=bds4tm-XK2I,,"Graph contrastive learning (GCL) has emerged as an effective tool to learn representations for whole graphs in the absence of labels. The key idea is to maximize the agreement between two augmented views of each graph via data augmentation. Existing GCL models mainly focus on applying \textit{identical augmentation strategies} for all graphs within a given scenario. However, real-world graphs are often not monomorphic but abstractions of diverse natures. Even within the same scenario (e.g., macromolecules and online communities), different graphs might need diverse augmentations to perform effective GCL. Thus, blindly augmenting all graphs without considering their individual characteristics may undermine the performance of GCL methods. However, it is non-trivial to achieve personalized allocation since the search space for all graphs is exponential in the number of graphs. To bridge the gap, we propose the first principled framework, termed \textit{G}raph contrastive learning with \textit{P}ersonalized \textit{A}ugmentation (GPA). It advances conventional GCL by allowing each graph to choose its own suitable augmentation operations. To cope with the huge search space, we design a tailored augmentation selector by converting the discrete space into a continuous one; the selector is a plug-and-play module and can be effectively trained end-to-end with downstream GCL models. Extensive experiments across 10 benchmark graphs from different types and domains demonstrate the superiority of GPA against state-of-the-art competitors. Moreover, by visualizing the learned augmentation distributions across different types of datasets, we show that GPA can effectively identify the most suitable augmentations for each graph based on its characteristics.", Tree Structure LSTM for Chinese Named Entity Recognition,https://openreview.net/forum?id=LdQUvGLk7yU,https://openreview.net/pdf?id=LdQUvGLk7yU,,"In this paper, we remodel the bi-directional LSTM (Bi-LSTM) network for Chinese Named Entity Recognition (NER).
We convert the LSTM from a chain-like structure into a tree structure that follows the dependency parse tree of the sentence. The new tree-structured model can fully leverage the syntactic information of the sentence and is capable of capturing long-range dependencies. In addition, we use dependency parse label embeddings to improve the performance of our approach. Experimental studies on four benchmark Chinese NER datasets have verified the effectiveness of our approach.","LSTM, NER, dependency parsing, tree structure, Chinese" Unfixed Bias Iterator: A New Iterative Format,https://openreview.net/forum?id=zUN76iFTyfi,https://openreview.net/pdf?id=zUN76iFTyfi,A deep learning-based unfixed bias iterator for solving partial differential equations.,"Partial differential equations (PDEs) have a wide range of applications in physics and computational science. Solving PDEs numerically is usually done by first meshing the solution region with the finite difference method (FDM) and then using iterative methods to obtain an approximation of the exact solution on these meshes, hence the decades of research on designing iterators with fast convergence properties. With the renaissance of neural networks, many scholars have considered using deep learning to speed up the solving of PDEs; however, these methods come with poor theoretical guarantees or suboptimal convergence. We build our iterator on top of the existing standard hand-crafted iterative solvers. At the operational level, for each iteration, we use a deep convolutional network to modify the current iterative result based on the historical iterative results as a way to achieve faster convergence. At the theoretical level, because of the introduced historical iterative results, our iterator constitutes a new iterative format: the Unfixed Bias Iterator. We provide theoretical guarantees, proving that our iterator can obtain correct results with convergence, as well as better generalization. Finally, extensive numerical experiments show that our iterator has a convergence speed far beyond that of other iterators and exhibits strong generalization ability.","Partial differential equations, iterators, deep learning" Conditional Positional Encodings for Vision Transformers,https://openreview.net/forum?id=3KWnuT-R1bh,https://openreview.net/pdf?id=3KWnuT-R1bh,A conditional positional encoding scheme for vision transformers,"We propose a conditional positional encoding (CPE) scheme for vision Transformers. Unlike previous fixed or learnable positional encodings that are predefined and independent of input tokens, CPE is dynamically generated and conditioned on the local neighborhood of the input tokens. As a result, CPE can easily generalize to input sequences that are longer than any the model has seen during training. Besides, CPE can keep the desired translation equivariance in vision tasks, resulting in improved performance. We implement CPE with a simple Position Encoding Generator (PEG) so that it can be seamlessly incorporated into the current Transformer framework. Built on PEG, we present Conditional Position encoding Vision Transformer (CPVT).
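A minimal sketch of a PEG-style conditional positional encoding as described above: tokens are reshaped to their 2D grid, passed through a depthwise convolution, and added back. The 3x3 kernel and the absence of a class token are simplifying assumptions for illustration; the paper's full design handles these details.

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Position Encoding Generator sketch: a depthwise convolution over the
    token grid produces a conditional positional encoding that is added back
    to the tokens (the conv's zero padding leaks the absolute-position cue)."""
    def __init__(self, dim, k=3):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim, k, stride=1, padding=k // 2, groups=dim)

    def forward(self, tokens, h, w):                  # tokens: (B, h*w, C)
        b, n, c = tokens.shape
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)
        return tokens + self.proj(feat).flatten(2).transpose(1, 2)

# Usage: one PEG inserted between early encoder blocks of a ViT; because the
# convolution is local, the encoding generalizes to longer token sequences.
x = torch.randn(2, 14 * 14, 192)
print(PEG(192)(x, 14, 14).shape)                      # torch.Size([2, 196, 192])
```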
We demonstrate that CPVT has visually similar attention maps compared to those with learned positional encodings and delivers better results.",Vision Transformer Variational Reparametrized Policy Learning with Differentiable Physics,https://openreview.net/forum?id=tHsu1olr9ZcQ,https://openreview.net/pdf?id=tHsu1olr9ZcQ,,"We study the problem of policy parameterization for reinforcement learning (RL) with high-dimensional continuous action spaces. Our goal is to find a good way to parameterize the policy of continuous RL as a multi-modal distribution. To this end, we propose to treat the continuous RL policy as a generative model over the distribution of optimal trajectories. We use a diffusion-process-like strategy to model the policy and derive a novel variational bound that serves as the optimization objective for learning the policy. To maximize the objective by gradient descent, we introduce the Reparameterized Policy Gradient Theorem. This theorem elegantly connects the classical REINFORCE method and trajectory return optimization for computing the gradient of a policy. Moreover, our method enjoys strong exploration ability due to the multi-modal policy parameterization; notably, when a strong differentiable world model is present, our method also enjoys the fast convergence speed of trajectory optimization. We evaluate our method on numerical problems and manipulation tasks within a differentiable simulator. Qualitative results show its ability to capture the multi-modal distribution of optimal trajectories, and quantitative results show that it can avoid local optima and outperform baseline approaches.",Differentiable Physics Reinforcement Learning A Fairness Analysis on Differentially Private Aggregation of Teacher Ensembles,https://openreview.net/forum?id=Q2WE65ToiLT,https://openreview.net/pdf?id=Q2WE65ToiLT,This paper analyzes the causes of the disparate impacts arising in a popular teacher ensemble model used for differentially private learning tasks,"Private Aggregation of Teacher Ensembles (PATE) is an important private machine learning framework. It combines multiple learning models used as teachers for a student model that learns to predict an output chosen by noisy voting among the teachers. The resulting model satisfies differential privacy and has been shown effective in learning high-quality private models in semi-supervised settings or when one wishes to protect the data labels. This paper asks whether this privacy-preserving framework introduces or exacerbates unfairness and shows that PATE can introduce accuracy disparity among individuals and groups of individuals. The paper analyzes which algorithmic and data properties are responsible for the disproportionate impacts, why these aspects affect different groups disproportionately, and proposes guidelines to mitigate these effects.","Differential Privacy, Fairness, Semisupervised learning" GENERALIZED MATRIX LOCAL LOW RANK REPRESENTATION BY RANDOM PROJECTION AND SUBMATRIX PROPAGATION,https://openreview.net/forum?id=BgMo9ofIQi6,https://openreview.net/pdf?id=BgMo9ofIQi6,We developed a sub-matrix propagation based approach to solve the fundamental mathematical problem of matrix local low rank representation.,"Detecting distinct submatrices with the low-rank property, known as matrix local low rank representation (MLLRR), is a highly desirable matrix representation learning technique for ease of data interpretation.
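Returning to the PATE entry above: its noisy-voting aggregation is simple enough to sketch. The teacher count, noise scale, and seeding below are illustrative assumptions rather than settings from the paper; the mechanism shown is the well-known Laplace noisy-argmax over teacher vote counts.

```python
import numpy as np

def pate_noisy_argmax(teacher_preds, n_classes, scale):
    """Noisy-vote aggregation at the heart of PATE: count teacher votes per
    class, add Laplace noise calibrated to the privacy budget, and return
    the winning label. teacher_preds: (n_teachers,) hard labels, one query."""
    votes = np.bincount(teacher_preds, minlength=n_classes).astype(float)
    votes += np.random.laplace(loc=0.0, scale=scale, size=n_classes)
    return int(votes.argmax())

# Hypothetical example: 250 teachers, 10 classes; a larger noise scale means
# a smaller per-query privacy cost but noisier labels for the student.
rng = np.random.default_rng(0)
preds = rng.integers(0, 10, size=250)
print(pate_noisy_argmax(preds, n_classes=10, scale=20.0))
```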
Based on different mathematical assumptions about the local pattern, the MLLRR problem can be categorized into two sub-problems, namely local constant variation (LCV) and local linear low rank (LLR). Existing solutions for MLLRR have focused only on the LCV problem, which misses a substantial amount of true and interesting patterns. In this work, we develop a novel matrix computational framework called RPSP (Random Probing based Submatrix Propagation) that provides an effective solution for both the LCV and LLR problems. RPSP detects local low rank patterns that grow from small submatrices with the low rank property, which are determined by a random projection approach. RPSP is supported by the theory of random projection. Experiments on synthetic data demonstrate that RPSP outperforms all state-of-the-art methods, with the capacity to robustly and correctly identify low rank matrices under both LCV and LLR settings. On real-world datasets, RPSP also demonstrates its effectiveness in identifying interpretable local low rank matrices.","Matrix decomposition, Local Low Rank matrix detection, Representation learning, Subspace learning" "Stable, Efficient, and Flexible Monotone Operator Implicit Graph Neural Networks",https://openreview.net/forum?id=IajGRJuM7D3,https://openreview.net/pdf?id=IajGRJuM7D3,"We propose stable, efficient, and flexible implicit graph neural networks leveraging monotone operator theory","Implicit graph neural networks (IGNNs) that solve a fixed-point equilibrium equation for representation learning can learn the long-range dependencies (LRD) in the underlying graphs and show remarkable performance for various graph learning tasks. However, the expressivity of IGNNs is limited by the constraints for their well-posedness guarantee. Moreover, when IGNNs become effective for learning LRD, their eigenvalues converge to values that slow down the convergence, and their performance is unstable across different tasks. In this paper, we provide a new well-posedness condition for IGNNs leveraging monotone operator theory. The new well-posedness characterization guides us in designing effective parameterizations to improve the accuracy, efficiency, and stability of IGNNs. Leveraging accelerated operator splitting schemes and graph diffusion convolution, we design efficient and flexible implementations of monotone operator IGNNs that are significantly faster and more accurate than existing IGNNs.","implicit graph neural networks, monotone operator, accelerated operator splitting, orthogonal parameterization" ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills,https://openreview.net/forum?id=b_CQDy9vrD1,https://openreview.net/pdf?id=b_CQDy9vrD1,,"Generalizable manipulation skills, which can be composed to tackle long-horizon and complex daily chores, are one of the cornerstones of Embodied AI. However, existing benchmarks, mostly composed of a suite of simulatable environments, are insufficient to push cutting-edge research because they lack object-level topological and geometric variations, are not based on fully dynamic simulation, or are short of native support for multiple types of manipulation tasks (e.g., stationary/mobile-base, single/dual-arm, rigid/soft-body). To this end, we present ManiSkill2, the next generation of the SAPIEN ManiSkill benchmark, to address critical pain points often encountered by researchers when using benchmarks for generalizable manipulation skills.
ManiSkill2 includes 20 manipulation task families with 2000+ object models and 4M+ demonstration frames, which cover stationary/mobile-base, single/dual-arm, and rigid/soft-body manipulation tasks with 2D/3D-input data simulated by fully dynamic engines. It defines a unified interface and evaluation protocol to support a wide range of algorithms (e.g., classic sense-plan-act, RL, IL), visual observations (point cloud, RGBD), and controllers (e.g., action type and parameterization). Moreover, it empowers fast visual-input learning algorithms so that a CNN-based policy can collect samples at about 2000 FPS with 1 GPU and 16 processes on a regular workstation. It implements a render server infrastructure to allow sharing rendering resources across all environments, thereby significantly reducing memory usage. We will open-source all code of our benchmark (simulator, environments, and baselines) and host an online challenge open to interdisciplinary researchers.","object manipulation, benchmark, computer vision, robotics, reinforcement learning" "LSAP: Rethinking Inversion Fidelity, Perception and Editability in GAN Latent Space",https://openreview.net/forum?id=SZBy3XeXQvd,https://openreview.net/pdf?id=SZBy3XeXQvd,,"As methods have evolved, GAN inversion has come to be divided into two steps. The first step is Image Embedding, in which an encoder or optimization process embeds images to get the corresponding latent codes. Afterward, the second step aims to refine the inversion and editing results, which we name Result Refinement. Although the second step significantly improves fidelity, perception and editability are almost unchanged, remaining deeply dependent on the inverse latent codes obtained in the first step. Therefore, a crucial problem is obtaining latent codes with better perception and editability while retaining the reconstruction fidelity. In this work, we first point out that these two characteristics are related to the degree of alignment (or disalignment) of the inverse codes with the synthetic distribution. Then, we propose the Latent Space Alignment Inversion Paradigm (LSAP), which consists of an evaluation metric and a solution for this problem. Specifically, we introduce the Normalized Style Space ($\mathcal{S^N}$ space) and the $\mathcal{S^N}$ Cosine Distance (SNCD) to measure the disalignment of inversion methods. Since our proposed SNCD is differentiable, it can be optimized in both encoder-based and optimization-based embedding methods to provide a uniform solution. Extensive experiments in various domains demonstrate that SNCD effectively reflects perception and editability, and that our alignment paradigm achieves the state of the art in both steps.", Neural Sorting Networks with Error-Free Differentiable Swap Functions,https://openreview.net/forum?id=vv6siYLQJqS,https://openreview.net/pdf?id=vv6siYLQJqS,,"Sorting is a fundamental operation of all computer systems and has been a long-standing, significant research topic. Beyond the problem formulation of traditional sorting algorithms, we consider sorting problems for more abstract yet expressive inputs, e.g., multi-digit images and image fragments, through a neural sorting network. To learn a mapping from a high-dimensional input to an ordinal variable, the differentiability of sorting networks needs to be guaranteed. In this paper, we define the softening error of a differentiable swap function, and develop an error-free swap function that satisfies the non-decreasing and differentiability conditions.
Furthermore, a permutation-equivariant Transformer network with multi-head attention is adopted to capture dependencies between given inputs and also leverage its model capacity. Experiments on diverse sorting benchmarks demonstrate that our method performs better than or comparably to existing baseline methods.", Twofer: Tackling Continual Domain Shift with Simultaneous Domain Generalization and Adaptation,https://openreview.net/forum?id=L8iZdgeKmI6,https://openreview.net/pdf?id=L8iZdgeKmI6,"To tackle continual domain shift in real-world applications, this work proposes a novel framework for achieving target domain generalization, target domain adaptation, and forgetting compensation at the same time.","In real-world applications, deep learning models often run in non-stationary environments where the target data distribution continually shifts over time. There have been numerous domain adaptation (DA) methods in both online and offline modes to improve cross-domain adaptation ability. However, these DA methods typically only provide good performance after a long period of adaptation and perform poorly on new domains before and during adaptation, especially when domain shifts happen suddenly and momentarily. On the other hand, domain generalization (DG) methods have been proposed to improve the model generalization ability on unadapted domains. However, existing DG works are ineffective for continually changing domains due to severe catastrophic forgetting of learned knowledge. To overcome these limitations of DA or DG in tackling continual domain shifts, we propose Twofer, a framework that simultaneously achieves target domain generalization (TDG), target domain adaptation (TDA), and forgetting alleviation (FA). Twofer includes a training-free data augmentation module to prepare data for TDG, a novel pseudo-labeling mechanism to provide reliable supervision for TDA, and a prototype contrastive alignment algorithm to align different domains for achieving TDG, TDA, and FA. Extensive experiments on Digits, PACS, and DomainNet datasets demonstrate that Twofer substantially outperforms state-of-the-art works in Continual DA, Source-Free DA, Test-Time/Online DA, Single DG, Multiple DG, and Unified DA&DG. We envision this work as a significant milestone in tackling continual data domain shifts, with improved performance across target domain generalization, adaptation, and forgetting alleviation abilities.","Domain Generalization, Domain Adaptation" ModelAngelo: Automated Model Building for Cryo-EM Maps,https://openreview.net/forum?id=65XDF_nwI61,https://openreview.net/pdf?id=65XDF_nwI61,Using graph neural networks to automatically build atomic models in cryo-EM maps,"Electron cryo-microscopy (cryo-EM) produces three-dimensional (3D) maps of the electrostatic potential of biological macromolecules, including proteins. At sufficient resolution, the cryo-EM maps, along with some knowledge about the imaged molecules, allow de novo atomic modelling. Typically, this is done through a laborious manual process. Recent advances in machine learning applications to protein structure prediction show potential for automating this process. Taking inspiration from these techniques, we have built ModelAngelo for automated model building of proteins in cryo-EM maps. ModelAngelo first uses a residual convolutional neural network (CNN) to initialize a graph representation with nodes assigned to individual amino acids of the proteins in the map and edges representing the protein chain.
The graph is then refined with a graph neural network (GNN) that combines the cryo-EM data, the amino acid sequence data and prior knowledge about protein geometries. The GNN refines the geometry of the protein chain and classifies the amino acids for each of its nodes. The final graph is post-processed with a hidden Markov model (HMM) search to map each protein chain to entries in a user-provided sequence file. Application to 28 test cases shows that ModelAngelo outperforms the state of the art and approximates manual building for cryo-EM maps with resolutions better than 3.5 Å.","cryo-em, model building, graph neural networks, attention networks, proteins" Stealing and Defending Transformer-based Encoders,https://openreview.net/forum?id=LoJ6oXzc_P3,https://openreview.net/pdf?id=LoJ6oXzc_P3,We perform attacks against transformer-based encoders and propose a new defense against extraction of vision transformers that combines watermarking with dataset inference.,"Self-supervised learning (SSL) has become the predominant approach to training on large amounts of unlabeled data. New real-world APIs offer services to generate high-dimensional representations for given inputs based on SSL encoders with transformer architectures. Recent efforts highlight that it is possible to steal high-quality SSL encoders based on convolutional neural networks. In this work, we are the first to extend this line of work to stealing and defending transformer-based encoders in both language and vision domains. We show that it is possible to steal transformer-based sentence embedding models solely using their returned representations and with 40x fewer queries than the number of the victim's training data points. We also decrease the number of required stealing queries for the vision encoders by leveraging semi-supervised learning. Finally, to defend vision transformers against stealing attacks, we propose a defense technique that combines watermarking with dataset inference. Our method creates a unique encoder signature based on a private data subset that acts as a secret seed during training. By applying dataset inference on the seed, we can then successfully identify stolen transformers.","model stealing, model extraction, defenses against model extraction, transformers, encoders, self-supervised learning" On the Lower Bound of Minimizing Polyak-Łojasiewicz functions,https://openreview.net/forum?id=2OETPKmG4S0,https://openreview.net/pdf?id=2OETPKmG4S0,"We show that any first-order algorithm requires at least $\tilde{\Omega}\left((L/\mu)^{1-\alpha} \right)$ gradient costs to find an $\epsilon$-approximate optimal solution for a general $L$-smooth, $\mu$-PL function for any $\alpha>0$.","The Polyak-Łojasiewicz (PL) condition [Polyak, 1963] is weaker than strong convexity but suffices to ensure global convergence of the Gradient Descent algorithm. In this paper, we study the lower bound of algorithms using first-order oracles to find an approximate optimal solution. We show that any first-order algorithm requires at least $\Omega\left((L/\mu)^{1-\alpha} \right)$ gradient costs to find an $\epsilon$-approximate optimal solution for a general $L$-smooth function that has a $\mu$-PL constant for any $\alpha>0$. This result demonstrates the near optimality of the Gradient Descent algorithm for minimizing smooth PL functions in the sense that there exists a ``hard'' PL function such that no first-order algorithm can be faster by a polynomial order. In contrast, it is well-known that the momentum technique, e.g.
[Nesterov, 2003, chap. 2] can provably accelerate Gradient Descent to $O\left(\sqrt{L/\hat{\mu}}\log\frac{1}{\epsilon}\right)$ gradient costs for functions that are $L$-smooth and $\hat{\mu}$-strongly convex. Therefore, our result distinguishes the hardness of minimizing a smooth PL function from that of a smooth strongly convex function, as the complexity of the former cannot be improved by any polynomial order in general. ","Polyak-Łojasiewicz Condition, First-order Algorithms, Lower Bound, Complexity" VectorMapNet: End-to-end Vectorized HD Map Learning,https://openreview.net/forum?id=Qx8lUU8CzQ,https://openreview.net/pdf?id=Qx8lUU8CzQ,We propose an end-to-end method that directly generates vectorized maps from sensor data.,"Autonomous driving systems require a good understanding of surrounding environments, including moving obstacles and static High-Definition (HD) semantic map elements. Existing methods approach the semantic map problem by offline manual annotation, which suffers from serious scalability issues. Recent learning-based methods produce dense rasterized segmentation predictions to construct maps. However, these predictions do not include instance information of individual map elements and require heuristic post-processing to obtain vectorized maps. To tackle these challenges, we introduce an end-to-end vectorized HD map learning pipeline, termed VectorMapNet. VectorMapNet takes onboard sensor observations and predicts a sparse set of polylines in the bird's-eye view. This pipeline can explicitly model the spatial relation between map elements and generate vectorized maps that are friendly to downstream autonomous driving tasks. Extensive experiments show that VectorMapNet achieves strong map learning performance on both the nuScenes and Argoverse2 datasets, surpassing previous state-of-the-art methods by 14.2 mAP and 14.6 mAP, respectively. Qualitatively, we also show that VectorMapNet is capable of generating comprehensive maps and capturing more fine-grained details of road geometry. To the best of our knowledge, VectorMapNet is the first work designed towards end-to-end vectorized map learning from onboard sensors.","Autonomous Driving, Map Learning, Transformer" An information-theoretic approach to unsupervised keypoint representation learning,https://openreview.net/forum?id=oPnpibcro8,https://openreview.net/pdf?id=oPnpibcro8,A novel information-theoretic approach to unsupervised keypoint representation learning from videos leveraging local entropy coverage and information transportation maximization.,"Extracting informative representations from videos is fundamental for the effective learning of various downstream tasks. Inspired by classical works on saliency, we present a novel information-theoretic approach to discover meaningful representations from videos in an unsupervised fashion. We argue that local entropy of pixel neighborhoods and its evolution in a video stream is a valuable intrinsic supervisory signal for learning to attend to salient features. We thus abstract visual features into a concise representation of keypoints that serve as dynamic information transporters. We discover in an unsupervised fashion spatio-temporally consistent keypoint representations that carry the prominent information across video frames, thanks to two original information-theoretic losses. First, a loss that maximizes the information covered by the keypoints in a frame. Second, a loss that encourages optimized keypoint transportation over time, thus imposing consistency of the information flow.
We evaluate our keypoint-based representation against state-of-the-art baselines on different downstream tasks such as learning object dynamics. To evaluate the expressivity and consistency of the keypoints, we propose a new set of metrics. Our empirical results showcase the superior performance of our information-driven keypoints, which resolve challenges like attending to both static and dynamic objects, and to objects abruptly entering and leaving the scene.","representation learning, keypoint discovery, unsupervised learning" Distilling Cognitive Backdoor within an Image,https://openreview.net/forum?id=S3D9NLzjnQ5,https://openreview.net/pdf?id=S3D9NLzjnQ5,A novel method that effectively and robustly detects backdoor samples in the dataset.,"This paper proposes a simple method to distill and detect backdoor patterns within an image: \emph{Cognitive Distillation} (CD). The idea is to extract the ""minimal essence"" from an input image responsible for the model's prediction. CD optimizes an input mask to extract a small pattern from the input image that can lead to the same model output (i.e., logits or deep features). The extracted pattern can help understand the cognitive mechanism of a model on clean vs. backdoor images and is thus called a \emph{Cognitive Pattern} (CP). Using CD and the distilled CPs, we uncover an interesting phenomenon of backdoor attacks: despite the various forms and sizes of trigger patterns used by different attacks, the CPs of backdoor samples are all surprisingly and suspiciously small. One can thus leverage the learned mask to detect and remove backdoor examples from poisoned training datasets. We conduct extensive experiments to show that CD can robustly detect a wide range of advanced backdoor attacks. We also show that CD can be applied to help detect potential biases in face datasets.","Backdoor sample detection, Backdoor defence" Formulating and Proving the Trend of DNNs Learning Simple Concepts,https://openreview.net/forum?id=pG9RSmBrY3,https://openreview.net/pdf?id=pG9RSmBrY3,We theoretically prove and empirically verify that DNNs mainly learn simple interactive concepts.,"This paper theoretically explains the intuition that simple concepts are more likely to be learned by deep neural networks (DNNs) than complex concepts. Beyond empirical studies, our research first specifies an exact definition of the complexity of a concept that increases the learning difficulty. Specifically, it is proven that the inference logic of a neural network can be represented as a causal graph. In this way, causal patterns in the causal graph can be used to formulate interactive concepts learned by the neural network. Based on this formulation, we explain the reason why simple interactive concepts in the data are more likely to be learned than complex interactive concepts. More crucially, we discover that our research provides a new perspective to explain previous understandings of conceptual complexity. The code will be released when the paper is accepted.","representation complexity, deep neural network" Curriculum Reinforcement Learning via Morphology-Environment Co-Evolution,https://openreview.net/forum?id=ZytN-E8vZk,https://openreview.net/pdf?id=ZytN-E8vZk,,"Throughout their long history, natural species have learned to survive by evolving physical structures adapted to environmental changes.
In contrast, current reinforcement learning (RL) studies mainly focus on training an agent with a fixed morphology (e.g., skeletal structure and joint attributes) in a fixed environment, which can hardly generalize to changing environments or new tasks. In this paper, we optimize an RL agent and its morphology through ''morphology-environment co-evolution (MECE)'', in which the morphology keeps being updated to adapt to the changing environment, while the environment is modified progressively to bring new challenges and stimulate the improvement of the morphology. This leads to a curriculum to train generalizable RL agents, whose morphologies and policies are optimized for different environments. Instead of hand-crafting the curriculum, we train two policies to automatically change the morphology and the environment. To this end, (1) we develop two novel and effective rewards for the two policies, which are solely based on the learning dynamics of the RL agent; (2) we design a scheduler to automatically determine when to change the environment and the morphology. We find these two designs are critical to the success of MECE, as verified by extensive ablation studies. In experiments on two classes of tasks, the morphology and RL policies trained via MECE exhibit significantly better generalization performance in unseen test environments than SOTA morphology optimization methods. Our ablation studies on the two MECE policies further show that the co-evolution between the morphology and environment is the key to the success.","Curriculum RL, Morphology Evolution" Domain Generalization with Small Data,https://openreview.net/forum?id=RKiWwhocuiU,https://openreview.net/pdf?id=RKiWwhocuiU, A novel domain generalization method in the context of insufficient data is proposed in this work,"In this work, we propose to tackle the problem of domain generalization in the context of insufficient samples. Instead of extracting latent feature embeddings based on deterministic models, we propose to learn a domain-invariant representation based on a probabilistic framework by mapping each data point into probabilistic embeddings. Specifically, we first extend empirical maximum mean discrepancy (MMD) to a novel probabilistic MMD that can measure the discrepancy between mixture distributions (i.e., source domains) consisting of a series of latent distributions rather than latent points. Moreover, instead of imposing the contrastive semantic alignment (CSA) loss based on pairs of latent points, a novel probabilistic CSA loss encourages positive probabilistic embedding pairs to be closer while pulling other negative ones apart. Benefiting from the learned representation captured by probabilistic models, our proposed method can marry the measurement of the distribution over distributions (i.e., the global perspective alignment) with the distribution-based contrastive semantic alignment (i.e., the local perspective alignment).
Extensive experimental results on three challenging medical datasets show the effectiveness of our proposed method in the context of insufficient data compared with state-of-the-art baseline methods.","domain generalization, small data, healthcare, medical image" 3D generation on ImageNet,https://openreview.net/forum?id=U2WjB9xxZ9q,https://openreview.net/pdf?id=U2WjB9xxZ9q,3D generation on ImageNet,"All existing 3D-from-2D generators are designed for well-curated and alignable datasets: objects can be placed in the same position, similarly scaled and oriented, such that the camera always points to the center of the scene. This alignment procedure is infeasible for diverse, in-the-wild datasets: 1) it requires expensive annotation for each object category; and 2) most images are inherently ""non-alignable"" (e.g., it is impossible to align a ""cat face"" with a ""kitchen""). As a result, existing 3D generators are not scalable to large in-the-wild datasets. In this work, for the first time, we propose a 3D generator which works on non-aligned datasets. First, we develop a technique to use an off-the-shelf, imprecise depth estimator to incorporate the 3D inductive bias into a GAN-based generator. Then, we create a novel learnable camera parametrization which does not use any alignment assumptions and construct a camera gradient penalty regularization. Finally, we propose a simple distillation-based technique to transfer the knowledge from an off-the-shelf feature embedder, like ResNet50, into our discriminator. Ours is the first work to develop a 3D generator for non-aligned data. We conduct experiments on SDIP Dogs, SDIP Elephants, LSUN Horse, and ImageNet at 256x256 resolution to demonstrate the effectiveness of our ideas. Visualizations: https://u2wjb9xxz9q.github.io.","3d-generation, gans, generative adversarial networks, knowledge distillation, nerf, stylegan, radiance fields, volume rendering" Revocable Deep Reinforcement Learning with Affinity Regularization for Outlier-Robust Graph Matching,https://openreview.net/forum?id=QjQibO3scV_,https://openreview.net/pdf?id=QjQibO3scV_,,"Graph matching (GM) has been a building block in various areas including computer vision and pattern recognition. Despite the recent impressive progress, existing deep GM methods often have difficulty in handling outliers, which are ubiquitous in practice. We propose a deep reinforcement learning-based approach, RGM, whose sequential node matching scheme naturally fits the strategy for selective inlier matching against outliers. A revocable action framework is devised to improve the agent's flexibility against the complex constrained GM task. Moreover, we propose a quadratic approximation technique to regularize the affinity score in the presence of outliers. As such, the agent can finish inlier matching in a timely manner when the affinity score stops growing; otherwise, an additional parameter, i.e., the number of inliers, would be needed to avoid matching outliers. In this paper, we focus on learning the back-end solver under the most general form of GM: Lawler's QAP, whose input is the affinity matrix. Notably, our approach can also boost existing GM methods that use such input. Experiments on multiple real-world datasets demonstrate its performance regarding both accuracy and robustness.","Graph Matching, Reinforcement Learning, Quadratic Assignment, Affinity Regularization, Combinatorial Optimization."
"Hierarchical Prompting Improves Visual Recognition On Accuracy, Data Efficiency and Explainability",https://openreview.net/forum?id=nZ5_rXpikfK,https://openreview.net/pdf?id=nZ5_rXpikfK,"Hierarchical prompting improves visual recognition on accuracy, data efficiency and explainability.","When humans try to distinguish some inherently similar visual concepts, e.g., Rosa Peace and China Rose, they may use the underlying hierarchical taxonomy to prompt the recognition. For example, given a prompt that the image belongs to the rose family, a person can narrow down the category range and thus focuses on the comparison between different roses. In this paper, we explore the hierarchical prompting for deep visual recognition (image classification, in particular) based on the prompting mechanism of the transformer. We show that the transformer can take the similar benefit by injecting the coarse-class prompts into the intermediate blocks. The resulting Transformer with Hierarchical Prompting (TransHP) is very simple and consists of three steps: 1) TransHP learns a set of prompt tokens to represent the coarse classes, 2) learns to predict the coarse class of the input image using an intermediate block, and 3) absorbs the prompt token of the predicted coarse class into the feature tokens. Consequently, the injected coarse-class prompt conditions (influences) the subsequent feature extraction and encourages better focus on the relatively subtle differences among the descendant classes. Through extensive experiments on popular image classification datasets, we show that this simple hierarchical prompting improves visual recognition on classification accuracy (e.g., improving ViT-B/16 by $+2.83\%$ ImageNet classification accuracy), training data efficiency (e.g., $+12.69\%$ improvement over the baseline under $10\%$ ImageNet training data), and model explainability.","hierarchical prompting, visual recognition, vision transformer" Convergence of the mini-batch SIHT algorithm,https://openreview.net/forum?id=g4PH7bjVZWC,https://openreview.net/pdf?id=g4PH7bjVZWC,This paper provides stochastic convergence analysis of the mini-batch stochastic hard thresholding algorithm,"The Iterative Hard Thresholding (IHT) algorithm has been considered extensively as an effective deterministic algorithm for solving sparse optimizations. The IHT algorithm benefits from the information of the batch (full) gradient at each point and this information is a crucial key for the convergence analysis of the generated sequence. However, this strength becomes a weakness when it comes to machine learning and high dimensional statistical applications because calculating the batch gradient at each iteration is computationally expensive or impractical. Fortunately, in these applications the objective function has a summation structure that can be taken advantage of to approximate the batch gradient by the stochastic mini-batch gradient. In this paper, we study the mini-batch Stochastic IHT (SIHT) algorithm for solving the sparse optimizations. As opposed to previous works where increasing and variable mini-batch size is necessary for derivation, we fix the mini-batch size according to a lower bound that we derive and show our work. To prove stochastic convergence of the objective value function we first establish a critical sparse stochastic gradient descent property. 
Using this stochastic gradient descent property, we show that the sequence generated by the stochastic mini-batch SIHT is a supermartingale sequence and converges with probability one. Unlike previous work, we do not assume the function to be restricted strongly convex. To the best of our knowledge, in the regime of sparse optimization, this is the first time in the literature that the sequence of stochastic function values has been shown to converge with probability one with the mini-batch size fixed for all steps.","mini-batch gradient, sparse optimization, iterative hard thresholding, IHT, SIHT, mini-batch" Decomposing Texture and Semantics for Out-of-distribution Detection,https://openreview.net/forum?id=BLBulxMHuOp,https://openreview.net/pdf?id=BLBulxMHuOp,We propose a novel OOD detection framework that decomposes the definition of the in-distribution as texture and semantics. ,"Out-of-distribution (OOD) detection has made significant progress in recent years because the distribution mismatch between training and testing can severely deteriorate the reliability of a machine learning system. Nevertheless, the lack of a precise interpretation of the in-distribution limits the application of OOD detection methods to real-world system pipelines. To tackle this issue, we decompose the definition of the in-distribution into texture and semantics, motivated by real-world scenarios. In addition, we design new benchmarks to measure the robustness that OOD detection methods should have. To achieve a good balance between the OOD detection performance and robustness, our method takes a divide-and-conquer approach. That is, the model first tackles each component of the texture and semantics separately, and then combines them later. This design philosophy is empirically validated on a series of benchmarks, including not only ours but also the conventional counterparts.","Out-of-distribution detection, Fourier analysis, Normalizing flow model" Selective Classification Via Neural Network Training Dynamics,https://openreview.net/forum?id=oWbTcqr8g7,https://openreview.net/pdf?id=oWbTcqr8g7,We propose a novel selective classification algorithm in which we derive a score based on the label disagreement of intermediate models obtained during training.,"Selective classification is the task of rejecting inputs that a model would predict incorrectly, through a trade-off between input space coverage and model accuracy. Current methods for selective classification impose constraints on either the model architecture or the loss function; this inhibits their usage in practice. In contrast to prior work, we show that state-of-the-art selective classification performance can be attained solely from studying the (discretized) training dynamics of a model. We propose a general framework that, for a given test input, monitors metrics capturing the disagreement with the final predicted label over intermediate models obtained during training; we then reject data points exhibiting too much disagreement at late stages in training. In particular, we instantiate a method that tracks when the label predicted during training stops disagreeing with the final predicted label.
Our experimental evaluation shows that our method achieves state-of-the-art accuracy/coverage trade-offs on typical selective classification benchmarks.","selective classification, example difficulty, reject option, training dynamics, optimization" Rethinking the Expressive Power of GNNs via Graph Biconnectivity,https://openreview.net/forum?id=r9hNv76KoT3,https://openreview.net/pdf?id=r9hNv76KoT3,,"Designing expressive Graph Neural Networks (GNNs) is a central topic in learning graph-structured data. While numerous approaches have been proposed to improve GNNs with respect to the Weisfeiler-Lehman (WL) test, for most of them, there is still a lack of deep understanding of what additional power they can systematically and provably gain. In this paper, we take a fundamentally different perspective to study the expressive power of GNNs beyond the WL test. Specifically, we introduce a novel class of expressivity metrics via graph biconnectivity and highlight their importance in both theory and practice. As biconnectivity can be easily calculated using simple algorithms that have linear computational costs, it is natural to expect that popular GNNs can learn it easily as well. However, after a thorough review of prior GNN architectures, we surprisingly find that most of them are not expressive for any of these metrics. The only exception is the ESAN framework (Bevilacqua et al., 2022), for which we give a theoretical justification of its power. We proceed to introduce a principled and more efficient approach, called the Generalized Distance Weisfeiler-Lehman (GD-WL), which is provably expressive for all biconnectivity metrics. Practically, we show GD-WL can be implemented by a Transformer-like architecture that preserves expressiveness and enjoys full parallelizability. A set of experiments on both synthetic and real datasets demonstrates that our approach can consistently outperform prior GNN architectures.","Graph Neural Networks, Expressive Power, Weisfeiler-Lehman test, Graph Transformer, Biconnectivity" One Transformer Can Understand Both 2D & 3D Molecular Data,https://openreview.net/forum?id=vZTp1oPV3PC,https://openreview.net/pdf?id=vZTp1oPV3PC,,"Unlike vision and language data which usually has a unique format, molecules can naturally be characterized using different chemical formulations. One can view a molecule as a 2D graph or define it as a collection of atoms located in a 3D space. For molecular representation learning, most previous works designed neural networks only for a particular data format, making the learned models likely to fail for other data formats. We believe a general-purpose neural network model for chemistry should be able to handle molecular tasks across data modalities. To achieve this goal, in this work, we develop a novel Transformer-based Molecular model called Transformer-M, which can take molecular data of 2D or 3D formats as input and generate meaningful semantic representations. Using the standard Transformer as the backbone architecture, Transformer-M develops two separated channels to encode 2D and 3D structural information and incorporate them with the atom features in the network modules. When the input data is in a particular format, the corresponding channel will be activated, and the other will be disabled. By training on 2D and 3D molecular data with properly designed supervised signals, Transformer-M automatically learns to leverage knowledge from different data modalities and correctly capture the representations. 
We conducted extensive experiments for Transformer-M. All empirical results show that Transformer-M can simultaneously achieve strong performance on 2D and 3D tasks, suggesting its broad applicability. The code and models will be made publicly available at \url{https://anonymous}.","Transformer, general-purpose molecular model, 2D molecular representation, 3D molecular representation" Hyperbolic Binary Neural Network,https://openreview.net/forum?id=Lv3MfAEgvVv,https://openreview.net/pdf?id=Lv3MfAEgvVv,We propose a Hyperbolic Binary Neural Network that updates the parameters in hyperbolic space.,"Binary Neural Network (BNN) converts the full-precision weights and activations to extreme 1-bit counterparts, which makes it especially suitable for deployment on lightweight mobile devices. Neural network binarization is usually formulated as a constrained optimization problem, which restricts its optimization potential. In this paper, we introduce the dynamic exponential map that converts a constrained problem in a Riemannian manifold into an unconstrained one in Euclidean space. Specifically, we propose a Hyperbolic Binary Neural Network (HBNN) by representing the parameter vector in Euclidean space as one in hyperbolic space, which enables us to optimize the parameters in an unconstrained space. By analyzing the parameterized representation, we show that the dynamic exponential map is a diffeomorphism in the Poincaré ball. Theoretically, this property implies that no extra saddle points or local minima are created in the Poincaré ball, which also explains the good performance of the HBNN. Experiments on CIFAR10, CIFAR100, and ImageNet classification datasets with VGGsmall, ResNet18, and ResNet34 demonstrate the superiority of our HBNN over existing state-of-the-art methods.","Neural network quantization, Hyperbolic geometry, Riemannian manifold" Generating Diverse Cooperative Agents by Learning Incompatible Policies,https://openreview.net/forum?id=UkU05GOH7_6,https://openreview.net/pdf?id=UkU05GOH7_6,We show that incompatible policies are not similar. LIPO generates diverse cooperative partners by learning a population of incompatible policies.,"Training a robust cooperative agent requires diverse partner agents. However, obtaining those agents is difficult. Previous works aim to learn diverse behaviors by changing the state-action distribution of agents. However, without information about the task's goal, the diversified agents are not guided to find other important, albeit sub-optimal, solutions: the agents might learn only variations of the same solution. In this work, we propose to learn diverse behaviors via policy compatibility. Conceptually, policy compatibility measures whether policies of interest can coordinate effectively. We theoretically show that incompatible policies are not similar. Thus, policy compatibility, which has been used exclusively as a measure of robustness, can be used as a proxy for learning diverse behaviors. Then, we incorporate the proposed objective into a population-based training scheme to allow concurrent training of multiple agents. Additionally, we use state-action information to induce local variations of each policy. Empirically, the proposed method consistently discovers more solutions than baseline methods across various multi-goal cooperative environments.
Finally, in multi-recipe Overcooked, we show that our method produces populations of behaviorally diverse agents, which enables generalist agents trained with such a population to be more robust.","multi-agent systems, cooperation, collaboration, reinforcement learning, diversity, robustness" Mind the Gap: Offline Policy Optimization for Imperfect Rewards,https://openreview.net/forum?id=WumysvcMvV6,https://openreview.net/pdf?id=WumysvcMvV6,This paper proposes an offline policy optimization approach for imperfect rewards.,"The reward function is essential in reinforcement learning (RL), serving as the guiding signal that incentivizes an agent to solve a given task. However, the reward function is notoriously difficult to design or even approximate. In many cases, only a sub-par reward function can be obtained, sometimes even with zero reward signal, which often inflicts substantial performance loss or imposes stringent requirements on expert demonstrations. In this study, we propose a unified offline policy optimization approach, \textit{RGM} (Reward Gap Minimization), which can smartly handle diverse types of imperfect rewards. RGM is formulated as a bi-level optimization problem: the upper layer optimizes a reward correction term that performs state-action visitation distribution matching w.r.t. a small set of expert data, and the lower layer solves a pessimistic RL problem with the corrected rewards. By exploiting the duality of the lower level problem, we derive a tractable algorithm that enables sample-based learning without any online interactions. Comprehensive experiments demonstrate that RGM achieves superior performance to existing methods under diverse settings of imperfect rewards. Further, RGM can effectively correct wrong or inconsistent rewards against expert preference, as well as retrieve useful information from biased rewards.","offline policy optimization, imperfect rewards, reward gap" Time Series are Images: Vision Transformer for Irregularly Sampled Time Series,https://openreview.net/forum?id=lRgEbHxowq,https://openreview.net/pdf?id=lRgEbHxowq,We propose an approach that transforms time series data into line graph images and utilizes a vision transformer to perform the time series classification task.,"Irregularly sampled time series are often observed in medical applications. Highly customized models have been developed to tackle the irregularity. In this work, we propose a simple yet effective approach that transforms irregularly sampled time series into line graph images and adapts vision transformers to perform time series classification in a way similar to image classification. Our approach simplifies the model design without assuming prior knowledge. Despite its simplicity, we show that it is able to outperform state-of-the-art specialized algorithms on several popular healthcare and human activity datasets, especially in the challenging leave-sensors-out setting where a subset of variables is masked during testing.
We hope this work can provide beneficial insight into leveraging fast-evolving computer vision techniques in the time series analysis domain.","irregularly sampled time series, vision transformer" Gamma Sampling: Fine-grained Controlling Language Models without Training,https://openreview.net/forum?id=LUdVQkS2CK,https://openreview.net/pdf?id=LUdVQkS2CK,We propose a new simple guided decoding method which does not require any training data to achieve fine-grained controllable text generation while maintaining a fast generation speed.,"The dominant approaches for controlling language models are effective at controlling high-level attributes (e.g., topic and sentiment). However, these methods often require condition-specific data or are computationally expensive. We propose a new simple guided decoding method, Gamma Sampling, which does not require any training data to achieve fine-grained controllable text generation while maintaining a fast generation speed. Gamma Sampling introduces attribute-related information (provided by humans or language models themselves) into the sampling process to guide language models to generate texts with desired attributes. Since no training is involved, Gamma Sampling can be easily applied to any language model for controllable text generation. Through experiments, we show that Gamma Sampling-steered GPT2-small (117M) outperforms baselines such as PPLM (345M) and CTRL (1.6B) in diversity, attribute relevance, and overall quality of generated samples.","guided-decoding, fine-grained control, data-free, fast generation speed" Token-Label Alignment for Vision Transformers,https://openreview.net/forum?id=pAcoOcnF__U,https://openreview.net/pdf?id=pAcoOcnF__U,We propose a token-label alignment method to improve the performance of vision transformers.,"Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs). They mix two images as inputs for training and assign them a mixed label with the same ratio. While they are shown effective for vision transformers (ViTs), we identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies. We empirically observe that the contributions of input tokens fluctuate during forward propagation, which might induce a different mixing ratio in the output tokens. The training target computed by the original data mixing strategy can thus be inaccurate, resulting in less effective training. To address this, we propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token. We reuse the computed attention at each layer for efficient token-label alignment, introducing only negligible additional training costs. Extensive experiments demonstrate that our method improves the performance of ViTs on image classification, semantic segmentation, object detection, and transfer learning tasks.","Vision transformer, data mixing." 3D-Scene-Entities: Using Phrase-to-3D-Object Correspondences for Richer Visio-Linguistic Models in 3D Scenes,https://openreview.net/forum?id=coMWK6WGkBP,https://openreview.net/pdf?id=coMWK6WGkBP,,"Recently, there has been significant progress in connecting natural language to real-world 3D scenes.
Namely, for the problems of reference disambiguation and discriminative reference production for objects in 3D scenes, various deep-learning-based approaches have been explored by tapping into novel datasets such as ScanRefer (Chen et al., 2019) and ReferIt3D (Achlioptas et al., 2020). In this paper, we curate a large-scale and complementary dataset extending both the aforementioned ones by associating all objects mentioned in a referential sentence with their underlying instances in a 3D scene. Specifically, our 3D Scene Entities (3D-Scent) dataset provides an explicit correspondence between 369,039 objects, spanning 705 scenes, over 84,015 natural referential sentences. Crucially, we show that by incorporating simple and intuitive losses that enable learning from this new dataset, we can significantly improve the performance of several recently introduced neural listening architectures, including improving the SoTA by 5.0% on both the ScanRefer and Nr3D benchmarks. Moreover, we experiment with competitive baseline methods for the task of language generation and show that, as with neural-listeners, 3D neural-speakers can also noticeably benefit from training with 3D-Scent. Last but not least, our carefully conducted experimental studies strongly support the conclusion that, by learning on 3D-Scent, commonly used visio-linguistic 3D architectures can become more semantically robust in their generalization without needing to provide these newly collected annotations at test time.", Label Distribution Learning via Implicit Distribution Representation,https://openreview.net/forum?id=9MniHf5dmH,https://openreview.net/pdf?id=9MniHf5dmH,,"In contrast to multi-label learning, label distribution learning characterizes the polysemy of examples by a label distribution to represent richer semantics. In the learning process of label distribution, the training data is collected mainly by manual annotation or by label enhancement algorithms that generate label distributions. Unfortunately, the complexity of the manual annotation task or the inaccuracy of the label enhancement algorithm leads to noise and uncertainty in the label distribution training set. To alleviate this problem, we introduce implicit distributions into the label distribution learning framework to characterize the uncertainty of each label value. Specifically, we use deep implicit representation learning to construct a label distribution matrix with Gaussian prior constraints, where each row component corresponds to the distribution estimate of each label value, and this row component is constrained by a prior Gaussian distribution to mitigate the interference of noise and uncertainty in the label distribution dataset. Finally, each row component of the label distribution matrix is transformed into a standard label distribution form by using the self-attention algorithm. In addition, several regularization techniques are applied in the training phase to improve the performance of the model.","label distribution learning, implicit distribution, Gaussian distribution, self-attention algorithm" MultiWave: Multiresolution Deep Architectures through Wavelet Decomposition for Multivariate Timeseries Forecasting and Prediction,https://openreview.net/forum?id=lZKBhpedXk,https://openreview.net/pdf?id=lZKBhpedXk,In multivariate time-series datasets, changes in signals occur at different frequencies.
MultiWave decomposes signals into different frequencies, removes the irrelevant ones, and models each frequency group using a separate model component.,"One of the challenges in multivariate time series modeling is that changes in signals occur with different frequencies, even when the sampling rate is consistent across signals. In the case of multivariate time series prediction, the outcome is also determined by patterns of different frequencies. These encapsulate both long-term and short-term effects, which have so far not been sufficiently leveraged by deep learning time series models. We fill this gap by introducing a framework, called MultiWave, which augments any deep learning time series model with components operating at the intrinsic frequencies of the signals. MultiWave applies wavelet decomposition on each signal to obtain subsignals of different frequencies and groups all subsignals in the same frequency band together to train a component. The output of the components is combined through a gating mechanism that removes irrelevant frequencies for the given predictive task. We show that MultiWave accurately determines the informative frequency bands and that the augmented models, whose components are trained to operate on those bands, outperform the original models. We further show that applying MultiWave on top of different deep learning models improves their performance in several real-world applications.","Time series, Wavelets, Wavelet decomposition, Recurrent Neural Networks, Deep Learning" Learning to Compose Soft Prompts for Compositional Zero-Shot Learning,https://openreview.net/forum?id=S8-A2FXnIh,https://openreview.net/pdf?id=S8-A2FXnIh,"We introduce compositional soft prompting (CSP), a parameter-efficient learning technique to improve the zero-shot compositionality of large-scale pretrained vision-language models (VLMs).","We introduce compositional soft prompting (CSP), a parameter-efficient learning technique to improve the zero-shot compositionality of large-scale pretrained vision-language models (VLMs) like CLIP. We develop CSP for compositional zero-shot learning, the task of predicting unseen attribute-object compositions (e.g., old cat and young tiger). VLMs have a flexible text encoder that can represent arbitrary classes as natural language prompts but they often underperform task-specific architectures on the compositional zero-shot benchmark datasets. CSP treats the attributes and objects that define classes as learnable tokens of vocabulary. During training, the vocabulary is tuned to recognize classes that compose tokens in multiple ways (e.g., old cat and white cat). At test time, we recompose the learned attribute-object vocabulary in new combinations to recognize novel classes. We show that CSP outperforms CLIP on benchmark datasets by an average of 10.9 percentage points on AUC. CSP also outperforms CoOp, a soft prompting method that fine-tunes the prefix context tokens, by an average of 5.8 percentage points on AUC.
We perform additional experiments to show that CSP improves generalization to higher-order attribute-attribute-object compositions (e.g., old white cat) and combinations of pretrained attributes and fine-tuned objects.","compositional zero-shot learning, prompts, foundation models" SQA3D: Situated Question Answering in 3D Scenes,https://openreview.net/forum?id=IDJx97BC38,https://openreview.net/pdf?id=IDJx97BC38,We introduce a grand challenge for embodied agents to understand situations and reason about 3D scenes accordingly.,"We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g., 3D scan), SQA3D requires the tested agent to first understand its situation (position, orientation, etc.) in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation. Based upon 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations. These questions examine a wide spectrum of reasoning capabilities for an intelligent agent, ranging from spatial relation comprehension to commonsense understanding, navigation, and multi-hop reasoning. SQA3D poses a significant challenge to current multi-modal models, especially 3D reasoning models. We evaluate various state-of-the-art approaches and find that the best one only achieves an overall score of 47.20%, while amateur human participants can reach 90.06%. We believe SQA3D could facilitate future embodied AI research with stronger situation understanding and reasoning capability.","3D vision, scene understanding, visual question answering, embodied AI" The Benefits of Model-Based Generalization in Reinforcement Learning,https://openreview.net/forum?id=w1w4dGJ4qV,https://openreview.net/pdf?id=w1w4dGJ4qV,We show how algorithms that generate experience with a learned parametric model can generalize in a way that is inherently more powerful than relying on value-function generalization and experience replay.,"Model-Based Reinforcement Learning (RL) is widely believed to have the potential to improve sample efficiency by allowing an agent to synthesize large amounts of imagined experience. Experience Replay (ER) can be considered a simple kind of model, which has proved extremely effective at improving the stability and efficiency of deep RL. In principle, a learned parametric model could improve on ER by generalizing from real experience to augment the dataset with additional plausible experience. However, owing to the many design choices involved in empirically successful algorithms, it can be very hard to establish where the benefits are actually coming from. Here, we provide theoretical and empirical insight into when, and how, we can expect data generated by a learned model to be useful. First, we provide a general theorem motivating how learning a model as an intermediate step can narrow down the set of possible value functions more than learning a value function directly from data using the Bellman equation. Second, we provide an illustrative example showing empirically how a similar effect occurs in a more concrete setting with neural network function approximation. Finally, we provide extensive experiments showing the benefit of model-based learning for online RL in environments with combinatorial complexity, but a factored structure that allows a learned model to generalize.
In these experiments, we take care to control for other factors in order to isolate, insofar as possible, the benefit of using experience generated by a learned model relative to ER alone.","model-based reinforcement learning, generalization" Revisiting Higher-Order Gradient Methods for Multi-Agent Reinforcement Learning,https://openreview.net/forum?id=gvMAooaEi3,https://openreview.net/pdf?id=gvMAooaEi3,"We revisit the use of higher-order gradient information in multi-agent reinforcement learning, identify its limitations, and introduce novel approaches that extend its application scope to a broader range of problems.","This paper revisits Higher-Order Gradient (HOG) methods for Multi-Agent Reinforcement Learning (MARL). HOG methods are algorithms in which agents use higher-order gradient information to account for other agents' anticipated learning, and are shown to improve coordination in games with self-interested agents. So far, however, HOG methods are only applied to games with low-dimensional state spaces due to inefficient computation and preservation of higher-order gradient information. In this work, we address these limitations and propose a HOG framework that can be applied to games with higher-dimensional state spaces. Moreover, we show that current HOG methods, when applied to games with common-interest agents, i.e., team games, can lead to miscoordination between the agents. To solve this, we propose Hierarchical Reasoning (HR) to improve coordination in team games, and we experimentally show that our proposed HR significantly outperforms state-of-the-art methods in standard multi-agent games. With our contributions, we greatly improve the applicability of HOG methods for MARL. For reproducibility, the code used for our work will be shared after the reviewing process.","Multi-agent reinforcement learning, Higher-order gradient-based optimization" SWORD: Demystify the Secrets of Open-world Instance Recognition,https://openreview.net/forum?id=oPYySRqti-,https://openreview.net/pdf?id=oPYySRqti-,,"Current state-of-the-art instance recognition models have demonstrated strong ability in closed-world environments while struggling in open-world scenarios, where the novel objects are not annotated in the pre-defined taxonomy during training. The challenge arises because, in the unlabeled regions, novel objects and backdrop co-exist and are hard to differentiate. To demystify the secrets hidden in the unannotated areas, we present a conceptually simple yet effective open-world instance recognition model, SWORD, answering the two critical questions: (1) How to discover the novel objects? We identify that direct classification training would make the features of novel objects degrade to the background. We demonstrate that a simple stop-gradient operation not only prevents feature degradation, but also allows the network to enjoy the merit of heuristic label assignment. (2) How to distinguish the objects from the backdrop? By maintaining a universal object queue, we obtain the object center for performing contrastive learning, in order to enlarge the distinction between objects and background. While the previous works only focus on pursuing recall and neglect precision, we show the prominence of SWORD by giving consideration to both criteria and achieving state-of-the-art performance in various open-world cross-category and cross-dataset generalizations. In particular, on the VOC to non-VOC setup, our method sets a new state-of-the-art of 39.6% on ARb100. 
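A minimal sketch of the higher-order gradient update that HOG methods of this kind build on, in the spirit of LOLA: agent 1 differentiates through agent 2's anticipated one-step learning update. `loss1` and `loss2` are assumed to be differentiable functions of both policy parameter tensors; this is an illustration, not the paper's framework.

```python
import torch

def hog_step(theta1, theta2, loss1, loss2, lr_inner=0.1, lr_outer=0.01):
    # Anticipate the opponent's gradient step, keeping the graph so that
    # gradients can flow through it (this is where higher-order terms enter).
    g2 = torch.autograd.grad(loss2(theta1, theta2), theta2,
                             create_graph=True)[0]
    theta2_look = theta2 - lr_inner * g2
    # Differentiate agent 1's loss at the opponent's *anticipated* parameters;
    # d(theta2_look)/d(theta1) contributes the second-order correction.
    g1 = torch.autograd.grad(loss1(theta1, theta2_look), theta1)[0]
    return (theta1 - lr_outer * g1).detach()
```

Keeping `create_graph=True` on the inner gradient is exactly what makes the computation expensive in high-dimensional state spaces, which is the limitation the paper targets.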
For COCO to UVO generalization, SWORD significantly outperforms the previous best open-world model by 6.0% on APb and 9.0% on ARb100, respectively.", Efficient Covariance Estimation for Sparsified Functional Data,https://openreview.net/forum?id=WmOF--p0PP,https://openreview.net/pdf?id=WmOF--p0PP,Novel sparsification schemes for functional data are proposed and the covariance estimation is shown to be asymptotically equivalent to sample covariance computed without sparsification.,"To avoid the prohibitive computation cost of sending the entire data, we propose four sparsification schemes: Random-knots, Random-knots-Spatial, B-spline, and B-spline-Spatial, and present the corresponding nonparametric estimation of the covariance function. The covariance estimators are asymptotically equivalent to the sample covariance computed directly from the original data. The estimated functional principal components effectively approximate the infeasible principal components under regularity conditions. The convergence rate reflects that leveraging spatial correlation and B-spline interpolation helps to reduce information loss. A data-driven selection method is further applied to determine the number of eigenfunctions in the model. Extensive numerical experiments are conducted to illustrate the theoretical results. ","functional data, covariance estimation, spatial correlation, convergence rate" Sparse Mixture-of-Experts are Domain Generalizable Learners,https://openreview.net/forum?id=RecZ9nB9Q4,https://openreview.net/pdf?id=RecZ9nB9Q4,We theoretically investigate the impact of backbone architecture on DG. We propose a novel SOTA model Generalizable Mixture-of-Experts (GMoE) for DG.,"Human visual perception can easily generalize to out-of-distribution visual data, which is far beyond the capability of modern machine learning models. Domain generalization (DG) aims to close this gap, with existing DG methods mainly focusing on the loss function design. In this paper, we propose to explore an orthogonal direction, i.e., the design of the backbone architecture. It is motivated by an empirical finding that transformer-based models trained with empirical risk minimization (ERM) outperform CNN-based models employing state-of-the-art (SOTA) DG algorithms on multiple DG datasets. We develop a formal framework to characterize a network's robustness to distribution shifts by studying its architecture's alignment with the correlations in the dataset. This analysis guides us to propose a novel DG model built upon vision transformers, namely \emph{Generalizable Mixture-of-Experts (GMoE)}. Extensive experiments on DomainBed demonstrate that GMoE trained with ERM outperforms SOTA DG baselines by a large margin. Moreover, GMoE is complementary to existing DG methods and its performance is substantially improved when trained with DG algorithms.","domain generalization, mixture-of-experts, algorithmic alignment, visual attributes" Structure-Sensitive Graph Dictionary Embedding for Graph Classification,https://openreview.net/forum?id=dyr1wSqCZC,https://openreview.net/pdf?id=dyr1wSqCZC,,"Graph structure expression plays an important role in distinguishing various graphs. In this work, we propose a Structure-Sensitive Graph Dictionary Embedding (SS-GDE) framework to transform an input graph into the space of a graph dictionary for the graph classification task. 
Instead of naively using a base graph dictionary, we propose variational graph dictionary adaptation (GDA) to generate a personalized dictionary (named the adapted graph dictionary) catering to each input graph. In particular, for the adaptation, Bernoulli sampling is introduced to adjust substructures of the base graph keys, which tremendously increases the expression capacity of the base dictionary. To make cross-graph measurement sensitive as well as stable, multi-sensitivity Wasserstein encoding is proposed to produce the embeddings by designing multi-scale attention on optimal transport. To optimize the framework, we introduce mutual information as the objective, which reduces to variational inference of the adapted graph dictionary. We evaluate SS-GDE on multiple graph classification datasets, and the experimental results demonstrate the effectiveness and the superiority over the state-of-the-art methods.","graph classification, variational inference, attention, mutual information" FlexPose: Pose Distribution Adaptation with Few-shot Guidance,https://openreview.net/forum?id=sL8mQ4L_5L,https://openreview.net/pdf?id=sL8mQ4L_5L,We transfer a human pose distribution to another one with only few-shot guidance and apply it to multiple pose-based tasks.,"Annotating human pose images can be costly. Meanwhile, there is an unavoidable major performance drop when a pre-trained pose estimation model is evaluated on a new dataset. We observe that pose distributions from different datasets share similar pose priors with different geometric transformations, which inspires us to learn a pose generator that can flexibly be adapted to generate the pose of a new pose distribution. In this paper, we treat human poses as skeleton images and propose a scheme to transfer a pre-trained pose annotation generator to generate poses from the transferred distribution of a newly collected dataset with only a few annotation guidances. By finetuning a limited number of linear layers, the transferred generator is able to generate similar pose annotations to the target pose distribution. We evaluate our FlexPose on several cross-dataset settings qualitatively and quantitatively. FlexPose surprisingly achieves around 41.8$\%$ performance improvement on the Unsupervised Pose Estimation task when it transfers the pose distribution of the COCO, 3DHP and Surreal datasets to that of the H36M dataset.","Pose Adapation, Human Pose Detection, Few-shot" PEER: A Collaborative Language Model,https://openreview.net/forum?id=KbYevcLjnc,https://openreview.net/pdf?id=KbYevcLjnc,"We introduce PEER, a language model trained to mimic the collaborative editing process by which humans often write text.","Textual content is often the output of a collaborative writing process: We start with an initial draft, ask for suggestions, and repeatedly make changes. Agnostic of this process, today’s language models are trained to generate only the final result. As a consequence, they lack several abilities crucial for collaborative writing: They are unable to update existing texts, difficult to control and incapable of verbally planning or explaining their actions. To address these shortcomings, we introduce PEER, a collaborative language model that is trained to imitate the entire writing process itself. PEER can write drafts, add suggestions, propose edits and provide explanations for its actions. 
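A minimal sketch of the Bernoulli substructure sampling described in the SS-GDE entry above, assuming each dictionary key is an adjacency matrix; a relaxed Bernoulli keeps the sampling differentiable so it can sit inside the variational objective. Names and the edge-level granularity are assumptions for illustration.

```python
import torch
from torch.distributions import RelaxedBernoulli

def adapt_key(adj, logits, temperature=0.5):
    """adj: (n, n) adjacency of one base graph key; logits: (n, n) edge scores.

    Samples a soft on/off mask over substructures (here: edges) and applies
    it to the base key, yielding one adapted graph dictionary key.
    """
    mask = RelaxedBernoulli(temperature, logits=logits).rsample()
    return adj * mask  # adapted key: substructures softly switched on/off
```

Sampling many such masks per key is what expands the expression capacity of a fixed base dictionary.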
Crucially, we train multiple instances of PEER able to infill various parts of the writing process, enabling the use of self-training techniques for increasing the quality, amount and diversity of training data. This unlocks PEER's full potential by making it applicable in domains for which no edit histories are available and improving its ability to follow instructions, to write useful comments, and to explain its actions. We show that PEER achieves strong performance across various domains and editing tasks.","Language Models, Controllability, Prompting, Zero-Shot Learning, Editing" Guide Detectors in Pixel Space with Global Positioning and Abductive Matching,https://openreview.net/forum?id=R8GW1hR1kE,https://openreview.net/pdf?id=R8GW1hR1kE,,"End-to-end object detectors integrate prior knowledge into a concise framework. DETR (DEtection TRansformer) contains two steps: Learn object queries in the representation space and match the queries with boxes in the pixel space. The ambiguity of object queries in DETR leads to an uncertain assignment in the Hungarian Matching. The formulation loss in the pixel space will in turn affect the learned representations. Therefore, we propose the Abductive DETR, which learns object queries in the representation space with global positioning in the pixel space and matches object queries in the pixel space with the abductive awareness from the representation space. Experimentally, Abductive DETR can be transferred to other DETR-variant methods and achieves a satisfactory improvement. It takes only 2 epochs to reach 98.7% accuracy in predicting the number of objects. Compared with other state-of-the-art methods on the MS COCO dataset, Abductive DETR also achieves outstanding performance and arrives at convergence much faster. Our code will be made publicly available soon.","Object Detection, Abductive DETR" A simple but effective and efficient global modeling paradigm for image restoration,https://openreview.net/forum?id=8sqKEkAO3jv,https://openreview.net/pdf?id=8sqKEkAO3jv,"This is the first attempt to propose a theoretically feasible, simple but effective global modeling paradigm for image restoration.","Global modelling-based image restoration frameworks (e.g., Transformer-like architectures) have gained popularity. Despite the remarkable advancement, the success may be at the cost of model parameters and FLOPs while the intrinsic characteristics of the specific task are ignored. The objective of our work is orthogonal to previous studies and we thus tailor a simple yet effective global modelling paradigm for image restoration. The key insights which motivate our study are two-fold: 1) the Fourier transform is capable of disentangling the image degradation and content components, acting as an image degradation prior embedded into the image restoration framework; 2) the Fourier domain innately embraces a global property, where each pixel in Fourier space is involved with all the spatial pixels. We obey the de facto global modeling rule ``spatial interaction + channel evolution"" of previous studies. Differently, we customize the core designs: multi-scale Fourier period spatial modeling and Fourier channel evolution. Equipped with the above designs, our image restoration paradigm is verified on mainstream image restoration tasks including image de-raining, image enhancement, image de-hazing, and guided image super-resolution. The extensive experiments suggest that our paradigm achieves competitive performance with fewer computational resources. 
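Background sketch for the Hungarian Matching step referenced in the Abductive DETR entry above: DETR-style detectors assign object queries to ground-truth boxes with an optimal one-to-one assignment. The cost terms here are simplified for illustration (real detectors typically add a GIoU term); this is not the paper's code.

```python
import torch
from scipy.optimize import linear_sum_assignment

def match(pred_boxes, pred_logits, gt_boxes, gt_labels):
    """pred_boxes: (Q, 4), pred_logits: (Q, K), gt_boxes: (G, 4), gt_labels: (G,)."""
    prob = pred_logits.softmax(-1)[:, gt_labels]     # (Q, G) class scores
    l1 = torch.cdist(pred_boxes, gt_boxes, p=1)      # (Q, G) box distances
    cost = (-prob + l1).detach().cpu().numpy()       # lower cost = better match
    rows, cols = linear_sum_assignment(cost)         # optimal 1-to-1 assignment
    return rows, cols                                # query idx -> gt idx
```

Ambiguous queries make this cost matrix nearly flat, which is exactly the uncertain-assignment problem the abstract describes.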
Our main focus is not to beat previous frameworks but to provide an alternative customized global modelling-based image restoration framework. Code will be publicly available.","image restoration, image de-raining, image de-hazing, image enhancement" Contrastive Continuity on Augmentation Stability Rehearsal for Continual Self-Supervised Learning,https://openreview.net/forum?id=YhzSxtB3LNk,https://openreview.net/pdf?id=YhzSxtB3LNk,This paper proposes C$^2$ ASR to address catastrophic forgetting in continual self-supervised learning,"Self-supervised learning, which is able to learn powerful representations without any manual annotations, has attracted a lot of attention recently. In order to cope with a variety of real-world scenarios, it also needs to develop the ability to continuously learn, i.e., Continual Self-Supervised Learning (CSSL). However, simple rehearsal or regularization will bring some negative effects while alleviating catastrophic forgetting in CSSL, e.g., overfitting on the rehearsal samples or hindering the learning of fresh knowledge. In order to address catastrophic forgetting without overfitting on the rehearsal samples, we propose Augmentation Stability Rehearsal (ASR) in this paper, which selects the most representative and discriminative samples by estimating the augmentation stability for rehearsal. Meanwhile, we design a matching strategy for ASR to dynamically update the rehearsal buffer. In addition, we further propose Contrastive Continuity on Augmentation Stability Rehearsal (C$^2$ ASR) based on ASR, which preserves as much information shared among seen task streams as possible to prevent catastrophic forgetting and discards the redundant information to free up the ability to learn fresh knowledge. Our method achieves substantial improvements over state-of-the-art CSSL methods on a variety of CSSL benchmarks. The source code will be released soon. ","continual learning, self-supervised learning, continual self-supervised learning" Empowering Networks With Scale and Rotation Equivariance Using A Similarity Convolution,https://openreview.net/forum?id=NJENsJ37sQ,https://openreview.net/pdf?id=NJENsJ37sQ,,"The translation-equivariant nature of CNNs is a reason for their great success in the field of computer vision. However, networks do not enjoy more general equivariance properties, such as equivariance to rotation or scaling. This limits the generalization performance of the network. In this paper, we devise a method that provides networks with equivariance with respect to translation, rotation, and scaling simultaneously. We define a convolution-like operation and ensure equivariance based on our proposed scalable Fourier-Argand representation. The method has efficiency similar to a traditional network, since it hardly introduces any additional learnable parameters and does not rely on group theory. We verify the quality of our approach in the image classification task, demonstrating the robustness and the generalization ability to both scaled and rotated inputs.",Representation Learning Uncertainty Calibration via Knowledge Flow under Long-tailed Distribution,https://openreview.net/forum?id=GLOtO2QbNp,https://openreview.net/pdf?id=GLOtO2QbNp,We propose a novel method to realize the calibration under long-tailed distribution,"How to estimate the uncertainty of a given model is a crucial problem. 
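A minimal sketch of the "Fourier channel evolution" idea from the global-modeling entry above: because every Fourier coefficient involves all spatial pixels, a pointwise operation in the frequency domain is inherently global. Shapes and the real/imaginary channel treatment are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class FourierChannelMix(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 1x1 conv over stacked real/imaginary parts = channel evolution.
        self.mix = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)

    def forward(self, x):                        # x: (B, C, H, W)
        f = torch.fft.rfft2(x, norm="ortho")     # (B, C, H, W//2+1), complex
        z = torch.cat([f.real, f.imag], dim=1)   # treat re/im as channels
        z = self.mix(z)
        re, im = z.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(re, im),
                                s=x.shape[-2:], norm="ortho")
```

Despite using only 1x1 kernels, the block has a global receptive field, which is the paper's efficiency argument against heavy attention layers.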
Current calibration techniques treat different classes equally and thus implicitly assume that the distribution of training data is balanced, but ignore the fact that real-world data often follows a long-tailed distribution. In this paper, we explore the problem of calibrating the model trained from a long-tailed distribution. Due to the difference between the imbalanced training distribution and balanced test distribution, existing calibration methods such as temperature scaling cannot generalize well to this problem. Specific calibration methods for domain adaptation are also not applicable because they rely on unlabeled target domain instances which are not available. Models trained from a long-tailed distribution tend to be overconfident on head classes. To this end, we propose a novel knowledge-flow-based calibration method that estimates importance weights for samples of tail classes to realize long-tailed calibration. Our method models the distribution of each class as a Gaussian distribution and views the source statistics of head classes as a prior to calibrate the target distributions of tail classes. We transfer knowledge from head classes to get the target probability density of tail classes. The importance weight is estimated by the ratio of the target probability density over the source probability density. Extensive experiments on CIFAR-10-LT, MNIST-LT, CIFAR-100-LT, and ImageNet-LT datasets demonstrate the effectiveness of our method.","Long-tailed, Calibration" $1\times1$ Convolution is All You Need for Image Super-Resolution,https://openreview.net/forum?id=eySeuMAqICL,https://openreview.net/pdf?id=eySeuMAqICL,We exploit and demonstrate the effectiveness of the $1\times1$ convolution with spatial-shift in SR task.,"In resource-constrained environments, such as mobile devices, lightweight and efficient architectures are crucial for the deployment of single image super-resolution (SISR) deep models. Due to the advantage of achieving a good trade-off between model capacity and efficiency, $3\times3$ convolutions are widely utilized in current convolutional neural networks (CNN). Compared to the normal $3\times3$ convolution, a $1\times1$ convolution involves a smaller computation burden but lacks the ability to represent and aggregate spatial information. Accordingly, a common belief in the literature is that $1\times1$ convolutions alone cannot constitute a powerful SR network. In this paper, we revisit $1\times1$ convolutions in the lightweight scenario and demonstrate that a fully $1\times1$ convolutional network with strong learning ability can be achieved for SISR, thanks to a manual spatial-shift operation. We investigate the feature aggregation scheme in normal $3\times3$ convolution and analogously extend the $1\times1$ convolution with a parameter-free spatial-shift operation, simplified as the shift-conv layer. In the proposed SISR method, we replace all normal $3\times3$ convolutions with shift-conv layers and present the $\mathbf{S}$hift-$\mathbf{C}$onv-based $\mathbf{N}$etwork (SCNet). Extensive experiments demonstrate that SCNets with all $1\times1$ convolutions obtain even better results than SR models with normal $3\times3$ convolutions that have a larger model size.","Single Image Super-Resolution, Lightweight Single Image Super-Resolution" ISS: Image as Stepping Stone for Text-Guided 3D Shape Generation,https://openreview.net/forum?id=GMRodZ8OlVr,https://openreview.net/pdf?id=GMRodZ8OlVr,An efficient text-guided 3D shape generation framework without needing paired text and shape. 
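A minimal sketch of a shift-conv layer in the spirit of the SCNet entry above: a parameter-free spatial shift gives $1\times1$ convolutions access to neighboring pixels. The 8-direction-plus-identity grouping is an assumption for illustration; real implementations would pad rather than wrap at borders.

```python
import torch
import torch.nn as nn

class ShiftConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        # (dy, dx) shifts: identity + the 8 spatial neighbors.
        self.shifts = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1),
                       (1, 1), (1, -1), (-1, 1), (-1, -1)]

    def forward(self, x):                          # x: (B, C, H, W)
        groups = x.chunk(len(self.shifts), dim=1)  # one channel group per shift
        shifted = [torch.roll(g, s, dims=(2, 3))   # note: roll wraps at borders
                   for g, s in zip(groups, self.shifts)]
        return self.conv(torch.cat(shifted, dim=1))
```

After the shift, each output pixel of the 1x1 conv mixes features from a 3x3 spatial neighborhood, recovering the aggregation ability of a 3x3 kernel at 1x1 cost.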
,"Text-guided 3D shape generation remains challenging due to the absence of large paired text-shape data, the substantial semantic gap between these two modalities, and the structural complexity of 3D shapes. This paper presents a new framework called Image as Stepping Stone (ISS) for the task by introducing 2D image as a stepping stone to connect the two modalities and to eliminate the need for paired text-shape data. Our key contribution is a two-stage feature-space-alignment approach that maps CLIP features to shapes by harnessing a pre-trained single-view reconstruction (SVR) model with multi-view supervisions: first map the CLIP image feature to the detail-rich shape space in the SVR model, then map the CLIP text feature to the shape space and optimize the mapping by encouraging CLIP consistency between the input text and the rendered images. Further, we formulate a text-guided shape stylization module to dress up the output shapes with novel textures. Beyond existing works on 3D shape generation from text, our new approach is general for creating shapes in a broad range of categories, without requiring paired text-shape data. Experimental results manifest that our approach outperforms the state-of-the-arts and our baselines in terms of fidelity and consistency with text. Further, our approach can stylize the generated shapes with both realistic and fantasy structures and textur","Text, 3D shape, CLIP, differentiable rendering" Robust and Controllable Object-Centric Learning through Energy-based Models,https://openreview.net/forum?id=wcNtbEtcGIC,https://openreview.net/pdf?id=wcNtbEtcGIC,,"Humans are remarkably good at understanding and reasoning about complex visual scenes. The capability of decomposing low-level observations into discrete objects allows us to build a grounded abstract representation and identify the compositional structure of the world. Thus it is a crucial step for machine learning models to be capable of inferring objects and their properties from visual scene without explicit supervision. However, existing works on object-centric representation learning are either relying on tailor-made neural network modules or assuming sophisticated models of underlying generative and inference processes. In this work, we present EGO, a conceptually simple and general approach to learning object-centric representation through energy-based model. By forming a permutation-invariant energy function using vanilla attention blocks that are readily available in Transformers, we can infer object-centric latent variables via gradient-based MCMC methods where permutation equivariance is automatically guaranteed. We show that EGO can be easily integrated into existing architectures, and can effectively extract high-quality object-centric representations, leading to better segmentation accuracy and competitive downstream task performance. We empirically evaluate the robustness of the learned representation from EGO against distribution shift. Finally, we demonstrate the effectiveness of EGO in systematic compositional generalization, by recomposing learned energy functions for novel scene generation and manipulation.", Learning Antidote Data to Individual Unfairness,https://openreview.net/forum?id=9U-cIq9P2p4,https://openreview.net/pdf?id=9U-cIq9P2p4,,"Fairness is an essential factor for machine learning systems deployed in high-stake applications. 
Among all fairness notions, individual fairness, following a consensus that `similar individuals should be treated similarly,' is a vital notion to guarantee fair treatment for individual cases. Previous methods typically characterize individual fairness as a prediction-invariant problem when perturbing sensitive attributes, and solve it by adopting the Distributionally Robust Optimization (DRO) paradigm. However, adversarial perturbations along a direction covering sensitive information do not consider the inherent feature correlations or innate data constraints, and thus mislead the model to optimize at off-manifold and unrealistic samples. In light of this, we propose a method to learn and generate antidote data that approximately follows the data distribution to remedy individual unfairness. These on-manifold antidote data can be used through a generic optimization procedure with original training data, resulting in a pure pre-processing approach to individual unfairness, or can also fit well with the in-processing DRO paradigm. Through extensive experiments, we demonstrate that our antidote data resists individual unfairness at a minimal or zero cost to the model's predictive utility.","Individual Fairness, Antidote Data, Machine Learning Fairness" Does Continual Learning Equally Forget All Parameters?,https://openreview.net/forum?id=gPWtHmCaBiY,https://openreview.net/pdf?id=gPWtHmCaBiY,,"Distribution shift (e.g., task or domain shift) in continual learning (CL) usually results in catastrophic forgetting of neural networks. Although it can be alleviated by repeatedly replaying buffered data, the every-step replay is time-consuming and the memory to store historical data is usually too small for retraining all parameters. In this paper, we study which modules in neural networks are more prone to forgetting by investigating their training dynamics during CL. Our proposed metrics show that only a few modules are more task-specific and sensitively alter between tasks, while others can be shared across tasks as common knowledge. Hence, we attribute forgetting mainly to the former and find that finetuning them only on a small buffer at the end of any CL method can bring non-trivial improvement. Due to the small number of finetuned parameters, such ``Forgetting Prioritized Finetuning (FPF)'' is efficient in terms of both the computation and the buffer size required. We further propose a more efficient and simpler method that entirely removes the every-step replay and replaces it with only $k$ applications of FPF periodically triggered during CL. Surprisingly, this ``$k$-FPF'' performs comparably to FPF and outperforms the SOTA CL methods while significantly reducing the computational overhead and cost. In experiments on several benchmarks of class- and domain-incremental CL, FPF consistently improves existing CL methods by a large margin and $k$-FPF further excels in efficiency without degrading accuracy. We also empirically studied the impact of the buffer size, epochs per task, and finetuned modules on the cost and accuracy of our methods.", Voting from Nearest Tasks: Meta-Vote Pruning of Pretrained Models for Downstream Tasks,https://openreview.net/forum?id=D6gktu1C7C_,https://openreview.net/pdf?id=D6gktu1C7C_,,"As a few large-scale pre-trained models become the major choices of various applications, new challenges arise for model pruning, e.g., can we avoid pruning the same model from scratch for every downstream task? 
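A minimal sketch of the FPF step described in the entry above: after continual training, only the forgetting-prone modules are tuned on a small buffer. Which modules qualify is determined by the paper's metrics; the name-based filter and hyperparameters below are stand-in assumptions.

```python
import itertools
import torch
import torch.nn.functional as F

def fpf_finetune(model, buffer_loader, sensitive=("classifier",), steps=100):
    # Freeze everything except the task-specific (forgetting-prone) modules.
    for name, p in model.named_parameters():
        p.requires_grad = any(s in name for s in sensitive)
    opt = torch.optim.SGD(
        [p for p in model.parameters() if p.requires_grad], lr=1e-3)
    # Cycle through the small rehearsal buffer for a fixed number of steps.
    for x, y in itertools.islice(itertools.cycle(buffer_loader), steps):
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
```

Because only a small parameter subset is updated, the same routine can be triggered just $k$ times during training to obtain the $k$-FPF variant.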
How to reuse the pruning results of previous tasks to accelerate the pruning for a new task? To address these challenges, we create a small model for a new task from the pruned models of similar tasks. We show that a few fine-tuning steps on this model suffice to produce a promising pruned model for the new task. We study this ``meta-pruning'' from nearest tasks on two major classes of pre-trained models, convolutional neural networks (CNNs) and vision transformers (ViTs), under a limited budget of pruning iterations. Our study begins by investigating the overlap of pruned models for similar tasks and how the overlap changes over different layers and blocks. Inspired by these discoveries, we develop a simple but effective ``Meta-Vote Pruning (MVP)'' method that significantly reduces the pruning iterations for a new task by initializing a sub-network from the pruned models of its nearest tasks. In experiments, we demonstrate MVP's advantages in accuracy, efficiency, and generalization through extensive empirical studies and comparisons with popular pruning methods over several datasets.", AdPE: Adversarial Positional Embeddings for Pretraining Vision Transformers via MAE+,https://openreview.net/forum?id=maT89nOQi9i,https://openreview.net/pdf?id=maT89nOQi9i,We propose an Adversarial Positional Embedding (AdPE) approach for self-supervised learning.,"Unsupervised learning of vision transformers seeks to pretrain an encoder via pretext tasks without labels. Among them is Masked Image Modeling (MIM), which aligns with the pretraining of language transformers by predicting masked patches as a pretext task. A criterion in unsupervised pretraining is that the pretext task needs to be sufficiently hard to prevent the transformer encoder from learning trivial low-level features that do not generalize well to downstream tasks. For this purpose, we propose an Adversarial Positional Embedding (AdPE) approach -- It distorts the local visual structures by perturbing the position encodings so that the learned transformer cannot simply use the locally correlated patches to predict the missing ones. We hypothesize that it forces the transformer encoder to learn more discriminative features in a global context with stronger generalizability to downstream tasks. We consider both absolute and relative positional encodings, where adversarial positions can be imposed both in the embedding mode and the coordinate mode. We also present a new MAE+ baseline that brings the performance of the MIM pretraining to a new level with the AdPE. The experiments demonstrate that our approach can improve the fine-tuning accuracy of MAE by $0.8\%$ and $0.4\%$ over 1600 epochs of pretraining ViT-B and ViT-L on ImageNet-1K. For the transfer learning task, it outperforms the MAE with the ViT-B backbone by $2.6\%$ in mIoU on ADE20K, and by $3.2\%$ in AP$^{bbox}$ and $1.6\%$ in AP$^{mask}$ on COCO, respectively. These results are obtained with the AdPE being a pure MIM approach that does not use any extra models or external datasets for pretraining.","Self-Supervised Learning, Adversarial Pre-training, Positional Embedding" STREET: A MULTI-TASK STRUCTURED REASONING AND EXPLANATION BENCHMARK,https://openreview.net/forum?id=1C_kSW1-k0,https://openreview.net/pdf?id=1C_kSW1-k0,"We introduce STREET, a unified multi-task and multi-domain natural language reasoning and explanation benchmark.","We introduce STREET, a unified multi-task and multi-domain natural language reasoning and explanation benchmark. 
Unlike most existing question-answering (QA) datasets, we expect models to not only answer questions, but also produce step-by-step structured explanations describing how premises in the question are used to produce intermediate conclusions that can prove the correctness of a certain answer. We perform extensive evaluation with popular language models such as few-shot prompted GPT-3 and fine-tuned T5. We find that these models still lag behind human performance when producing such structured reasoning steps. We believe this work will provide a way for the community to better train and test systems on multi-step reasoning and explanations in natural language.","natural language understanding, question answering, structured explanations, soft reasoning, dataset" Topology-aware robust optimization,https://openreview.net/forum?id=ylMq8MBnAp,https://openreview.net/pdf?id=ylMq8MBnAp,We propose a new principled optimization method that seamlessly integrates topological information to develop strong OOD resilience,"Out-of-distribution (OOD) generalization is a challenging machine learning problem yet highly desirable in many high-stake applications. Existing methods suffer from overly pessimistic modeling with low generalization confidence. As generalizing to arbitrary test distributions is impossible, we hypothesize that further structure on the topology of distributions is crucial in developing strong OOD resilience. To this end, we propose topology-aware robust optimization (TRO) that seamlessly integrates distributional topology in a principled optimization framework. More specifically, TRO solves two optimization objectives: (1) Topology Learning, which explores the data manifold to uncover the distributional topology; (2) Learning on Topology, which exploits the topology to constrain robust optimization for tightly-bounded generalization risks. We theoretically demonstrate the effectiveness of our approach, and empirically show that it significantly outperforms the state of the art in a wide range of tasks including classification, regression, and semantic segmentation. Moreover, we empirically find that the learned topology is highly explainable and consistent with human knowledge and scientific plausibility.","out-of-distribution generalization, distributionally robust optimization" Exploring Neural Network Representational Similarity using Filter Subspaces,https://openreview.net/forum?id=yi4vd8VqROx,https://openreview.net/pdf?id=yi4vd8VqROx,,"Analyzing representational similarity in neural networks is crucial to numerous tasks, such as interpreting or transferring deep models. One typical approach is to input probing data into convolutional neural networks (CNNs) as stimuli to reveal their deep representation for model similarity analysis. Those methods are often computationally expensive and stimulus-dependent. By representing the filter subspace in a CNN as a set of filter atoms, previous work has reported competitive performance in continual learning by learning a different set of filter atoms for each task while sharing common atom coefficients across tasks. Inspired by this observation, in this paper, we propose a new paradigm for reducing representational similarity analysis in CNNs to filter subspace distance assessment. Specifically, when filter atom coefficients are shared across networks, model representational similarity can be significantly simplified to calculating the cosine distance among respective filter atoms, to achieve \textit{millions of times} computation reduction. 
We provide both theoretical and empirical evidence that this simplified filter subspace-based similarity preserves a strong linear correlation with other popular stimulus-based metrics, while being significantly more efficient and robust to probing data. We further validate the effectiveness of the proposed method in various applications, such as analyzing training dynamics as well as in federated and continual learning. We hope our findings can help further explorations of real-time large-scale representational similarity analysis in neural networks.", A Close Look at Token Mixer: From Attention to Convolution,https://openreview.net/forum?id=8l5GjEqGiRG,https://openreview.net/pdf?id=8l5GjEqGiRG,"We take a close look at two classical token-mixers, convolution and attention. Detailed comparison and visual analysis motivate us to present a novel fully convolutional vision transformer, which achieves promising performance on several benchmarks.","There is an increasingly intensive debate about the effectiveness of ConvNets and Transformers in the vision field. Originating from the language processing community, Transformers show great promise for many vision tasks due to the insightful architecture design and attention mechanism. Nevertheless, we soon witnessed the revenge of ConvNets, which surpassed Transformer variants in mainstream vision tasks. In this paper, we are not engaging in this debate; instead, we look into the details of attention and convolution. By looking into the self-attention responses in Transformers, we empirically find that 1.) Vision Transformers present a query-irrelevant behavior in deep layers, where the attention maps exhibit nearly consistent contexts in global scope, regardless of the query patch position (also head-irrelevant). This phenomenon indicates that a global context may hide behind the self-attention mechanism. 2.) The attention maps are intrinsically sparse; introducing the knowledge from ConvNets would largely smooth the attention and improve the performance. Motivated by these, we generalize the self-attention formulation to abstract the query-irrelevant global context directly and further integrate the global context into convolutions. The resulting model, a Fully Convolutional Vision Transformer (i.e., FCViT), purely consists of convolutional layers and firmly inherits the merits of both the attention mechanism and convolutions, including the dynamic property, weight sharing, and short- and long-range feature modeling, etc. Experimental results demonstrate the effectiveness of FCViT. With less than 14M parameters, our FCViT-S12 outperforms the related ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K. When scaling FCViT to larger models, we still perform better than the previous state-of-the-art ConvNeXt with even fewer parameters. FCViT-based models also demonstrate promising transferability to downstream tasks, like object detection, instance segmentation, and semantic segmentation. 
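A minimal sketch of the filter-subspace similarity from the entry above: when two networks share atom coefficients, representational similarity reduces to cosine similarity between their filter atoms. This assumes the atoms are aligned one-to-one across networks (which shared coefficients imply); pure numpy, not the authors' code.

```python
import numpy as np

def atom_similarity(atoms_a, atoms_b):
    """atoms_*: (n_atoms, k, k) arrays of filter atoms from two networks."""
    a = atoms_a.reshape(len(atoms_a), -1).astype(float)
    b = atoms_b.reshape(len(atoms_b), -1).astype(float)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float((a * b).sum(axis=1).mean())  # mean cosine similarity
```

No probing data ever passes through either network, which is where the claimed orders-of-magnitude computation reduction comes from.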
Code and pretrained models are available at: https://anonymous.4open.science/r/FCViT-pytorch.","Convolution, Attention, Visual Representation" EAGLE: Large-scale Learning of Turbulent Fluid Dynamics with Mesh Transformers,https://openreview.net/forum?id=mfIX4QpsARJ,https://openreview.net/pdf?id=mfIX4QpsARJ,We introduce a new large-scale dataset for learning non-steady fluid mechanics and a method based on self-attention on graphs,"Estimating fluid dynamics is classically done through the simulation and integration of numerical models solving the Navier-Stokes equations, which is computationally complex and time-consuming even on high-end hardware. This is a notoriously hard problem to solve, which has recently been addressed with machine learning, in particular graph neural networks (GNN) and variants trained and evaluated on datasets of static objects in static scenes with fixed geometry. We attempt to go beyond existing work in complexity and introduce a new model, method and benchmark. We propose EAGLE: a large-scale dataset of ∼1.1 million 2D meshes resulting from simulations of unsteady fluid dynamics caused by a moving flow source interacting with nonlinear scene structure of varying geometries, with 600 different scenes of three different types in total. To perform future forecasting of pressure and velocity on the challenging EAGLE dataset, we introduce a new mesh transformer. It leverages node clustering, graph pooling and global attention to learn long-range dependencies between spatially distant data points without needing a large number of iterations, as existing GNN methods do. We show that our transformer outperforms the state of the art on both existing synthetic and real datasets and on EAGLE. Finally, we highlight that our approach learns to attend to airflow, integrating complex information in a single iteration.","Learning Fluid Mechanics, Simulation, Graph networks" Momentum in Momentum for Adaptive Optimization,https://openreview.net/forum?id=qQz1UKDCiy7,https://openreview.net/pdf?id=qQz1UKDCiy7,,"Adaptive gradient methods, e.g., Adam, have achieved tremendous success in machine learning. Employing adaptive learning rates according to the gradients, such methods are able to attain rapid training of modern deep neural networks. Nevertheless, they are observed to suffer from compromised generalization capacity compared with stochastic gradient descent (SGD) and tend to be trapped in local minima at an early stage during the training process. Intriguingly, we discover that the issue can be resolved by substituting the gradient in the second raw moment estimate term with its momentumized version in Adam. The intuition is that the gradient with momentum contains more accurate directional information, and therefore its second moment estimation is a preferable option for learning rate scaling than that of the raw gradient. We thereby propose AdaM$^3$, a new optimizer that reaches the goal of training quickly while generalizing much better. We further develop a theory to back up the improvement in generalization and provide novel convergence guarantees for our designed optimizer. Extensive experiments on a variety of tasks and models demonstrate that AdaM$^3$ exhibits state-of-the-art performance and superior training stability consistently. Considering the simplicity and effectiveness of AdaM$^3$, we believe it has the potential to become a new standard method in deep learning. 
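A minimal sketch of the AdaM$^3$ idea as described here: Adam's second raw moment is estimated from the momentumized gradient $m_t$ rather than the raw gradient $g_t$. Written as a plain numpy update rule for clarity; hyperparameter defaults are conventional Adam values, not necessarily the paper's.

```python
import numpy as np

def adam3_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One parameter update; t is the 1-based step count."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * m ** 2     # <- m (not g) is squared here
    m_hat = m / (1 - b1 ** t)          # standard bias corrections
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Replacing `m ** 2` with `g ** 2` recovers vanilla Adam, so the single changed line is the whole difference between the two update rules.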
Code will be publicly available.", Limitless Stability for Graph Convolutional Networks ,https://openreview.net/forum?id=XqcQhVUr2h0,https://openreview.net/pdf?id=XqcQhVUr2h0,We develop a general and novel stability theory for graph convolutional networks able to deal with undirected graphs as well as topology-changing perturbations,"This work establishes rigorous, novel and widely applicable stability guarantees and transferability bounds for general graph convolutional networks -- without reference to any underlying limit object or statistical distribution. Crucially, utilized graph-shift operators are not necessarily assumed to be normal, allowing for the treatment of networks on both directed- and undirected graphs within the developed framework. In the undirected setting, stability to node-level perturbations is related to an 'adequate spectral covering' property of the filters in each layer. Stability to edge-level perturbations is discussed and related to properties of the utilized filters such as their Lipschitz constants. Results on stability to vertex-set non-preserving perturbations are obtained by utilizing recently developed mathematical-physics based tools. As an exemplifying application of the developed theory, it is showcased that general graph convolutional networks utilizing the un-normalized graph Laplacian as graph-shift-operator can be rendered stable to collapsing strong edges in the underlying graph if filters are mandated to be constant at infinity. These theoretical results are supported by corresponding numerical investigations showcasing the response of filters and networks to such perturbations. ","Graph Convolutional Networks, Graph Neural Networks, Stability, Transferability, Spectral Graph Theory, Rigorous Proofs" MiSAL: Active Learning for Every Budget,https://openreview.net/forum?id=h2ktOJbrT_4,https://openreview.net/pdf?id=h2ktOJbrT_4,Different budget sizes call for different active learning strategies; we introduce a practical method to determine in advance which strategy should be used and when.,"In supervised Active Learning (AL), the learner can manipulate the labeled training set by choosing examples to be labeled by an oracle. The size of the labeled set is termed budget. Recent years have seen significant progress in this domain in the context of deep active learning. In particular, it has been shown that in general, different families of AL strategies are suitable for high and low budgets. Here we address for the first time the problem of deciding which family of strategies is most suitable for a given budget in a given task. We start from the theoretical analysis of a mixture model, which motivates a computational approach to decide on the most suitable family of methods for the task and budget at hand. We then propose a practical decision algorithm, which determines what family of strategies should be preferred. Using this algorithm, we introduce MiSAL - a mixed strategy active learning algorithm. MiSAL combines AL strategies from different families, resulting in a method that fits all budgets. 
We support the analysis with an empirical study, showing the superiority of our method when dealing with image datasets.","Deep Active learning, Low budget, High budget, Deep learning" SOM-CPC: Unsupervised Contrastive Learning with Self-Organizing Maps for Structured Representations of High-Rate Time Series,https://openreview.net/forum?id=DAxQXzdq8SF,https://openreview.net/pdf?id=DAxQXzdq8SF,"This work proposes SOM-CPC, an unsupervised model for interpretable 2D representation learning of high-rate time series.","Continuous monitoring with an ever-increasing number of sensors has become ubiquitous across many application domains. Acquired data are typically high-dimensional and difficult to interpret, but they are also hypothesized to lie on a low-dimensional manifold. Dimensionality reduction techniques have, therefore, been sought. Popular linear methods like Principal Component Analysis (PCA) have been extended to non-linear techniques such as Self-Organizing Maps (SOMs) or deep learning (DL) models. DL models have the ability to act on raw data, avoiding heuristic feature selection, but the resulting latent space is often unstructured and still multi-dimensional. PCA and SOMs, on the other hand, need to be preceded by a feature-extraction step, but can then map high-dimensional features to 2D space. In this work we propose SOM-CPC, a model that jointly optimizes Contrastive Predictive Coding and a SOM to find an organized 2D manifold. We address a largely unexplored and challenging set of scenarios comprising high-rate time series, and show on both synthetic and real-life data (medical sleep data and audio recordings) that SOM-CPC outperforms both DL-based feature extraction, followed by PCA, K-means or a SOM, and strong deep-SOM baselines that jointly optimize a DL model and a SOM. SOM-CPC has great potential to expose latent patterns in high-rate data streams and may therefore contribute to a better understanding of many different processes and systems. ","Contrastive Predictive Coding, Self-Organizing Maps, Time series, Dimensionality Reduction" DIVISION: Memory Efficient Training via Dual Activation Precision,https://openreview.net/forum?id=6Pv8AMSylux,https://openreview.net/pdf?id=6Pv8AMSylux,A simple and transparent framework to reduce the memory cost of DNN training.,"Activation compressed training (ACT) has been shown to be a promising way to reduce the memory cost of training deep neural networks (DNNs). However, existing work on ACT relies on searching for the optimal bit-width during DNN training to reduce the quantization noise, which makes the procedure complicated and less transparent. To this end, we propose a simple and effective method to compress DNN training. Our method is motivated by an instructive observation: DNN backward propagation mainly utilizes the low-frequency component (LFC) of the activation maps, while the majority of memory is for caching the high-frequency component (HFC) during the training. This indicates the HFC of activation maps is highly redundant and compressible during DNN training, which inspires our proposed Dual Activation Precision (DIVISION). During training, DIVISION preserves a high-precision copy of the LFC and compresses the HFC into a lightweight copy with low numerical precision. This can significantly reduce the memory cost without negatively affecting the precision of backward propagation such that DIVISION maintains competitive model accuracy. 
Experimental results show DIVISION achieves over 10× compression of activation maps, and significantly higher training throughput than state-of-the-art ACT methods, without loss of model accuracy. The code is available at https://anonymous.4open.science/r/division-5CC0/ ","DNN training, activation compressed training, memory efficient training, frequency domain" Lossless Dataset Compression Via Dataset Quantization,https://openreview.net/forum?id=zRCEbtS646c,https://openreview.net/pdf?id=zRCEbtS646c,,"The power of state-of-the-art deep learning models heavily depends on large amounts (millions or even billions) of training data, which hinders researchers with limited resources from conducting relevant research and causes heavy CO2 emissions. Dataset distillation methods are thus developed to compress large datasets into smaller ones to reduce model training cost, by synthesizing samples to match the original ones w.r.t. certain metrics like the training loss. However, existing methods generally suffer from poor scalability (not applicable to compressing large-scale datasets such as ImageNet), and limited generalizability when training other model architectures. We empirically observe the reason is that the condensed datasets have lost the sample diversity of the original datasets. Driven by this, we study dataset compression from a new perspective—what is the minimum number of pixels necessary to represent the whole dataset without losing its diversity?—and develop a new dataset quantization (DQ) framework. DQ conducts compression at two levels: the sample level and the pixel level. It introduces a sample-level quantizer to find a compact set of samples to better represent the distribution of the full dataset and a pixel-level quantizer to find the minimum number of pixels to describe every single image. Combining these two quantizers, DQ achieves a new state-of-the-art lossless dataset compression ratio and provides compressed datasets practical for training models with a large variety of architectures. Specifically, for image classification, it successfully removes 40% of the data with only 0.4% top-5 accuracy drop on ImageNet and almost zero accuracy drop on CIFAR-10. We further verify that the model weights pre-trained on the 40% compressed dataset only lose 0.2% mAP on the COCO dataset for object detection and 0.3% mIoU on ADE20k for segmentation. Code will be made public.","Dataset distillation, coreset selection" "CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable and Controllable Text-Guided Image Manipulation",https://openreview.net/forum?id=9OEW_t2uO4u,https://openreview.net/pdf?id=9OEW_t2uO4u,"We propose a novel approach to enforce better disentanglement, interpretability and controllability for text-guided image manipulation.","The recently introduced Contrastive Language-Image Pre-Training (CLIP) bridges images and text by embedding them into a joint latent space. This opens the door to ample literature that aims to manipulate an input image by providing a textual explanation. However, due to the discrepancy between image and text embeddings in the joint space, using text embeddings as the optimization target often introduces undesired artifacts in the resulting images. Disentanglement, interpretability, and controllability are also hard to guarantee for manipulation. To alleviate these problems, we propose to define corpus subspaces spanned by prompts to capture specific image characteristics. 
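A minimal sketch of the DIVISION-style dual-precision compression described two entries above: cache a high-precision low-frequency copy of each activation (here approximated by average pooling) plus a low-bit quantized high-frequency residual. The pooling factor and bit-width are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def compress_activation(x, factor=4, bits=2):
    """x: (B, C, H, W) activation with H, W divisible by factor."""
    lfc = F.avg_pool2d(x, factor)                          # LFC: high precision
    hfc = x - F.interpolate(lfc, scale_factor=factor)      # residual = HFC
    lo, hi = hfc.min(), hfc.max()
    q = torch.round((hfc - lo) / (hi - lo) * (2 ** bits - 1))
    return lfc, q.to(torch.uint8), (lo, hi)                # HFC: low precision

def decompress_activation(lfc, q, rng, factor=4, bits=2):
    lo, hi = rng
    hfc = q.float() / (2 ** bits - 1) * (hi - lo) + lo
    return F.interpolate(lfc, scale_factor=factor) + hfc
```

The memory win comes from the asymmetry: the small LFC keeps full precision while the bulky HFC is stored in a few bits per element.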
We introduce CLIP projection-augmentation embedding (PAE) as an optimization target to improve the performance of text-guided image manipulation. Our method is a simple and general paradigm that can be easily computed and adapted, and smoothly incorporated into any CLIP-based latent manipulation algorithm to improve performance. To demonstrate the effectiveness of our method, we conduct several theoretical and empirical system studies. As a case study, we utilize the method for text-guided semantic face editing. We quantitatively and qualitatively demonstrate that PAE facilitates a more disentangled, interpretable, and controllable image manipulation method with state-of-the-art quality and accuracy.","computer vision, text-guided image manipulation, latent manipulation" NICO++: Towards Better Benchmarking for Domain Generalization,https://openreview.net/forum?id=q08xeIw1HA1,https://openreview.net/pdf?id=q08xeIw1HA1,,"Despite the remarkable performance that modern deep neural networks have achieved on independent and identically distributed (I.I.D.) data, they can crash under distribution shifts. Most current evaluation methods for domain generalization (DG) adopt the leave-one-out strategy as a compromise on the limited number of domains. We propose a large-scale benchmark with extensive labeled domains named NICO++ along with more rational evaluation methods for comprehensively evaluating DG algorithms. To evaluate DG datasets, we propose two metrics to quantify covariate shift and concept shift, respectively. Two novel generalization bounds from the perspective of data construction are proposed to prove that limited concept shift and significant covariate shift favor the evaluation capability for generalization. Through extensive experiments, NICO++ shows its superior evaluation capability compared with current DG datasets and its contribution in alleviating unfairness caused by the leak of oracle knowledge in model selection. ", Gradient Norm Regularizer Seeks Flat Minima and Improves Generalization,https://openreview.net/forum?id=z4eslwuymzQ,https://openreview.net/pdf?id=z4eslwuymzQ,,"The heavy overparameterization of current deep neural networks requires model generalization guarantees. Recently, flat minima have been proven to be effective for improving generalization, and sharpness-aware minimization (SAM) achieves state-of-the-art performance. Yet we show that SAM fails to measure flatness/sharpness when there are multiple minima within the perturbation radius. We present a novel regularizer named Gradient Norm Regularizer (GNR) to seek minima with uniformly small curvature across all directions and measure sharpness even when multiple minima are within the perturbation radius. We show that GNR bounds both the maximum eigenvalue of the Hessian at local minima and the regularization function of SAM. We present experimental results showing that GNR improves the generalization of models trained with current optimizers such as SGD and AdamW on various datasets and networks. Furthermore, we show that GNR can help SAM find flatter minima and achieve better generalization. ", Token Merging: Your ViT But Faster,https://openreview.net/forum?id=JroZRaRw7Eu,https://openreview.net/pdf?id=JroZRaRw7Eu,"We merge tokens in a ViT at runtime using a fast custom matching algorithm. 
Our method, ToMe, can increase training and inference speed, lower training memory, and be applied with and without training.","We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train. ToMe gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as pruning while being more accurate. Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video with only a 0.2-0.3% accuracy drop in each case. ToMe can also easily be applied during training, improving in practice training speed up to 2x for MAE fine-tuning on video. Training with ToMe further minimizes accuracy drop, leading to 2x the throughput of ViT-B on audio for only a 0.4% mAP drop. Qualitatively, we find that ToMe merges object parts into one token, even over multiple frames of video. Overall, ToMe’s accuracy and speed are competitive with state-of-the-art on images, video, and audio.","token merging, token pruning, inference speed, training speed, throughput, off-the-shelf, fine tuning" TiDAL: Learning Training Dynamics for Active Learning,https://openreview.net/forum?id=anRa-qu7ZjQ,https://openreview.net/pdf?id=anRa-qu7ZjQ,TiDAL: Learning Training Dynamics for Active Learning,"Active learning (AL) aims to select the most useful data samples from an unlabeled data pool and annotate them to expand the labeled dataset under a limited budget. Especially, uncertainty-based methods choose the most uncertain samples, which are known to be effective in improving model performance. However, AL literature often overlooks training dynamics (TD), defined as the ever-changing model behavior during optimization via stochastic gradient descent, even though other areas of literature have empirically shown that TD provides important clues for measuring the sample uncertainty. In this paper, we propose a novel AL method, Training Dynamics for Active Learning (TiDAL), which leverages the TD to quantify uncertainties of unlabeled data. Since tracking the TD of all the large-scale unlabeled data is impractical, TiDAL utilizes an additional prediction module that learns the TD of labeled data. To further justify the design of TiDAL, we provide theoretical and empirical evidence to support the usefulness of leveraging TD for AL. Experimental results show that our TiDAL achieves better or comparable performance on both balanced and imbalanced benchmark datasets compared to state-of-the-art AL methods, which estimate data uncertainty using only static information after model training.","active learning, training dynamics" CompletionFormer: Depth Completion with Convolutions and Vision Transformers,https://openreview.net/forum?id=LxEfHeknf4z,https://openreview.net/pdf?id=LxEfHeknf4z,,"This paper proposes a joint convolutional attention and Transformer block, which deeply couples the convolutional layer and Vision Transformer into one block, as the basic unit to construct our depth completion model in a pyramidal structure. This hybrid structure naturally benefits both the local connectivity of convolutions and the global context of the Transformer in one single model. As a result, our CompletionFormer outperforms state-of-the-art CNNs-based methods on the outdoor KITTI Depth Completion benchmark and indoor NYUv2 dataset, achieving significantly higher efficiency (near 1/3 FLOPs) compared to pure Transformer-based methods. 
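A minimal sketch of ToMe-style bipartite soft matching as described in its entry above: tokens are split into two alternating sets, each token in A proposes its most similar partner in B, and the r strongest pairs are merged by averaging. Simplified for illustration (no size weighting, no class-token protection; colliding merges into the same destination keep only one source, where the real method reduces all of them).

```python
import torch

def bipartite_merge(x, r):
    """x: (N, C) tokens; returns (N - r, C) tokens after merging r pairs."""
    xn = x / x.norm(dim=-1, keepdim=True)
    a, b = xn[0::2], xn[1::2]                    # alternating split
    scores = a @ b.t()                           # (|A|, |B|) cosine similarities
    best, idx = scores.max(dim=-1)               # each A token's best B partner
    order = best.argsort(descending=True)
    merge_a, keep_a = order[:r], order[r:]       # r strongest edges get merged
    dst = x[1::2].clone()
    dst[idx[merge_a]] = (dst[idx[merge_a]] + x[0::2][merge_a]) / 2
    return torch.cat([x[0::2][keep_a], dst], dim=0)
```

Run between attention and MLP in every block, this shrinks the token count layer by layer, which is where the throughput gain comes from.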
Especially when the captured depth is highly sparse, the performance gap over other methods grows much larger.","Depth, Depth Completion, Vision Transformer, Computer Vision" Provable Adaptivity in Adam,https://openreview.net/forum?id=l0mX03b3UZv,https://openreview.net/pdf?id=l0mX03b3UZv,We explain why Adam is faster than SGD through convergence analysis.,"Adaptive Moment Estimation (Adam) has been observed to converge faster than stochastic gradient descent (SGD) in practice. However, such an advantage has not been theoretically characterized -- the existing convergence rate of Adam is no better than SGD. We attribute this mismatch between theory and practice to a commonly used assumption: the gradient is globally Lipschitz continuous (called $L$-smooth condition). Specifically, compared to SGD, Adam adaptively chooses a learning rate better suited to the local gradient Lipschitz constant (called local smoothness). This effect becomes prominent when the local smoothness varies drastically across the domain. In this paper, we analyze the convergence of Adam under a condition called $(L_0,L_1)$-smooth condition, which allows the gradient Lipschitz constant to change with the gradient norm. This condition has been empirically verified to be more realistic for deep neural networks \citep{zhang2019gradient} than the $L$-smooth condition. Under the $(L_0,L_1)$-smooth condition, we establish the convergence for Adam with practical hyperparameters. As such, we argue that Adam can adapt to this local smoothness condition, justifying Adam's \emph{adaptivity}. In contrast, SGD can be arbitrarily slow under this condition. Our result can shed light on the benefit of adaptive gradient methods over non-adaptive ones.","Benefit of Adam, convergence, SGD, optimization" MS3: A Multimodal Supervised Pretrained Model for Semantic Segmentation,https://openreview.net/forum?id=RBNk9cpT1AW,https://openreview.net/pdf?id=RBNk9cpT1AW,This paper proposes a multi-dataset pretraining model with multimodal supervision for semantic segmentation and outperforms ImageNet pretraining under both standard fine-tuning and some rapid deployment scenarios.,"Due to the limited labeled data, current segmentation models are usually transferred from ImageNet pretrained models. This pipeline introduces task gaps, where the pretraining is based on global image-level recognition while the downstream is focused on local pixel-level prediction. In this paper, we aim to mitigate this task gap and build a segmentation-oriented pretrained model, so that different downstream segmentation tasks can be adapted more easily and effectively. Towards this goal, we combine off-the-shelf annotations from diverse segmentation datasets and make use of both visual and language supervision for joint training. The highlight is that the two kinds of supervision are complementary and can be combined to better model the class relations across diverse datasets. The proposed learning framework, termed MS3 (short for Multimodal Supervision for Semantic Segmentation), not only adjusts and improves the quality of language embeddings to fit the segmentation scene, but also generates momentum-updated visual embeddings for each category to facilitate better visual representation modeling. 
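The momentum-updated per-category visual embeddings mentioned above can be sketched as a simple exponential moving average; names and the momentum value below are our assumptions, not the paper's code:

```python
import torch

class MomentumClassEmbeddings:
    """Illustrative EMA update of one visual embedding per category:
    each step blends the running embedding with the mean feature of
    that category in the current batch."""
    def __init__(self, num_classes: int, dim: int, momentum: float = 0.999):
        self.embed = torch.zeros(num_classes, dim)
        self.m = momentum

    @torch.no_grad()
    def update(self, feats: torch.Tensor, labels: torch.Tensor) -> None:
        # feats: (N, dim) pixel/region features; labels: (N,) class ids
        for c in labels.unique():
            mean_c = feats[labels == c].mean(dim=0)
            self.embed[c] = self.m * self.embed[c] + (1 - self.m) * mean_c
```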
Besides, considering that the original one-by-one pixel-embedding pairing may cause similar classes from other datasets to be incorrectly pulled away, we further extend the original loss with multi-label mapping via cross-modal information exchange to better model the class relations. Experiments conducted on several benchmarks demonstrate that MS3 consistently outperforms the ImageNet pretrained models by a considerable margin under standard fine-tuning, as well as fitting some rapid deployment scenarios, e.g., frozen-backbone fine-tuning or zero-shot prediction.","multi-dataset, multi-modal, semantic segmentation" An Analysis of Information Bottlenecks,https://openreview.net/forum?id=h8RIDPvVubq,https://openreview.net/pdf?id=h8RIDPvVubq,,"Learning representations with information bottlenecks is a powerful information-theoretic approach for learning effective representations where unnecessary information is minimized while task-relevant information is maximized. Many machine learning algorithms have been derived based on information bottlenecks of representations. This study mathematically relates information bottlenecks of intermediate representations to the corresponding expected loss in general settings. We investigate the merit of our new mathematical findings with experiments across a range of architectures and learning settings. Through the theory and experiments, we provide a new foundation for understanding current and future methods for learning intermediate representations with information bottlenecks.", De Novo Molecular Generation via Connection-aware Motif Mining,https://openreview.net/forum?id=Q_Jexl8-qDi,https://openreview.net/pdf?id=Q_Jexl8-qDi,We propose a fragment-based model for molecular generation. It first mines connection-aware motifs from the molecule library and then leverages a connection-aware generator to generate novel drug candidates.,"De novo molecular generation is an essential task for scientific discovery. Recently, fragment-based deep generative models have attracted much research attention due to their flexibility in generating novel molecules based on existing molecule fragments. However, the motif vocabulary, i.e., the collection of frequent fragments, is usually built upon heuristic rules, which makes it difficult to capture common substructures from large amounts of molecules. In this work, we propose MiCaM to generate molecules based on mined connection-aware motifs. Specifically, it leverages a data-driven algorithm to automatically discover motifs from a molecule library by iteratively merging subgraphs based on their frequency. The obtained motif vocabulary consists of not only molecular motifs (i.e., the frequent fragments), but also their connection information, indicating how the motifs are connected with each other. Based on the mined connection-aware motifs, MiCaM builds a connection-aware generator, which simultaneously picks up motifs and determines how they are connected. We test our method on distribution-learning benchmarks (i.e., generating novel molecules to resemble the distribution of a given training set) and goal-directed benchmarks (i.e., generating molecules with target properties), and achieve significant improvements over previous fragment-based baselines. 
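The frequency-based merging loop described above is analogous to byte-pair encoding; below is a toy sketch on token sequences as a stand-in (the actual algorithm merges molecular subgraphs, and all names here are ours):

```python
from collections import Counter

def mine_motifs(sequences, num_merges):
    """BPE-style frequency merging: repeatedly merge the most frequent
    adjacent pair of fragments into a new vocabulary entry. Purely
    illustrative; MiCaM operates on molecular graphs, not strings."""
    seqs = [list(s) for s in sequences]
    vocab = set(t for s in seqs for t in s)
    for _ in range(num_merges):
        pairs = Counter(p for s in seqs for p in zip(s, s[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged = a + b
        vocab.add(merged)
        for i, s in enumerate(seqs):              # apply the merge everywhere
            out, j = [], 0
            while j < len(s):
                if j + 1 < len(s) and s[j] == a and s[j + 1] == b:
                    out.append(merged); j += 2
                else:
                    out.append(s[j]); j += 1
            seqs[i] = out
    return vocab

# e.g. mine_motifs(["ccocc", "ccncc"], 3) grows a small motif vocabulary
```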
Furthermore, we demonstrate that our method can effectively mine domain-specific motifs for different tasks.","Molecular generation, Graph generation, Motif mining" Multiplane NeRF-Supervised Disentanglement of Depth and Camera Pose from Videos,https://openreview.net/forum?id=wralRReGNHi,https://openreview.net/pdf?id=wralRReGNHi,,"We propose to perform self-supervised disentanglement of depth and camera pose from large-scale videos. We introduce an Autoencoder-based method to reconstruct the input video frames for training, without using any ground-truth annotations of depth and camera. The encoders of our model estimate the monocular depth and camera pose as the disentangled representations. The decoder then constructs a Multiplane NeRF representation based on the depth encoder feature, and performs rendering to reconstruct the input frames with the estimated camera. The disentanglement is learned with the reconstruction error, based on the assumption that the scene structure does not change in short periods of time in videos. Once the model is learned, it can be applied to multiple applications including depth estimation, camera pose estimation, and single image novel view synthesis. We show substantial improvements over previous self-supervised approaches on all tasks and even better results than counterparts trained with camera ground-truths in some applications. Our code will be made publicly available. ","Multiple Image Plane, Disentanglement from Video" Shared Knowledge Lifelong Learning,https://openreview.net/forum?id=7ZaJfk915b1,https://openreview.net/pdf?id=7ZaJfk915b1,,"In Lifelong Learning (LL), agents continually learn as they encounter new conditions and tasks. Most current LL is limited to a single agent that learns tasks sequentially. Dedicated LL machinery is then deployed to mitigate the forgetting of old tasks as new tasks are learned. This is inherently slow. We propose a new Shared Knowledge Lifelong Learning (SKILL) learning paradigm, which deploys a population of LL agents that each learn different tasks independently and in parallel. After learning their respective tasks, agents share and consolidate their knowledge over a communication network, so that, in the end, all agents can master all tasks. Our approach relies on a frozen backbone embedded in all agents at manufacturing time, so that only the last layer head plus some small adjustments to the backbone beneficial biases are learned for each task. To eliminate the need for a task oracle, agents also learn and share summary statistics about their training datasets (Gaussian Mixture Clusters), or share a few training images, to help other agents assign test samples to the correct head using a Mahalanobis task mapper. On a new, very challenging dataset with 102 image classification tasks, we achieve significant speedup over 18 LL baselines (e.g., >9,000x speedup over single-agent EWC) while also achieving higher (and SOTA) accuracy.", GANet: Graph-Aware Network for Point Cloud Completion with Displacement-Aware Point Augmentor,https://openreview.net/forum?id=uFZt0ZJi8BX,https://openreview.net/pdf?id=uFZt0ZJi8BX,"Our proposed GANet effectively learns from contour information of partial point clouds, and it delivers state-of-the-art results on multiple benchmarks and exhibits impressive efficiency.","Notably, real-world data (e.g., LiDAR-based point clouds) is commonly sparse, uneven, occluded, and truncated. 
The point cloud completion task, which aims to predict a complete and accurate shape from a partial observation, has accordingly drawn much attention. However, existing methods commonly adopt PointNet or PointNet++ to extract features of incomplete point clouds. In this paper, we propose an end-to-end Graph-Aware Network (\textbf{GANet}) to effectively learn from the contour information of the partial point clouds. Moreover, we design a Displacements-Aware Point Augmentor (DPA) to upsample and refine coarse point clouds. With our graph-based feature extractors and Displacements-Aware Transformer, our DPA can precisely capture the geometric and structural features to refine the complete point clouds. Experiments on PCN and MVP datasets demonstrate that our GANet achieves state-of-the-art results on the task of shape completion.","Point cloud completion, Graph-aware network, Displacements-aware point augmentor" Multiple output samples for each input in a single-output Gaussian process,https://openreview.net/forum?id=jNpvW1ozbj3,https://openreview.net/pdf?id=jNpvW1ozbj3,This paper proposes to extend the Gaussian process framework to allow for multiple output samples for each input from the same task in the training set.,"The standard Gaussian Process (GP) is formulated to only consider a single output sample for each input in the training set. Datasets for subjective tasks, such as spoken language assessment, may be annotated with output labels from multiple human raters for each input. This paper proposes to generalise the GP to allow for multiple output samples per input in the training set. This differs from a multi-output GP, because all output samples are from the same task here. The output density function is formulated to be the joint likelihood of observing all output samples. Through this, the hyper-parameters are optimised using a criterion that is similar to minimising a Kullback-Leibler divergence. The test set predictions are inferred fairly similarly to a standard GP, with a key difference being in the optimised hyper-parameters. This approach is evaluated on spoken language assessment tasks, using the public speechocean762 dataset and an internal Tamil language dataset. The results show that by using the proposed method, the GP is able to compute a test set output distribution that is more similar to the collection of reference outputs annotated by multiple human raters.","Gaussian process, multiple outputs, subjective, uncertainty, spoken language assessment" Demystifying the Optimization and Generalization of Deep PAC-Bayesian Learning,https://openreview.net/forum?id=EQiRSnqUYOh,https://openreview.net/pdf?id=EQiRSnqUYOh,,"In addition to being a successful generalization bound analysis tool, the PAC-Bayesian bound can also be incorporated into an objective function to train a probabilistic neural network, which we refer to simply as {\it PAC-Bayesian Learning}. PAC-Bayesian learning has been proven to be able to achieve a competitive expected test set error numerically, while providing a tight generalization bound in practice, through gradient descent training. Despite its empirical success, the theoretical analysis of deep PAC-Bayesian learning for neural networks is rarely explored. To this end, this paper proposes a theoretical convergence and generalization analysis for PAC-Bayesian learning. 
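For reference, a classical PAC-Bayes bound of the kind such training objectives build on (quoted from the general literature, not from this paper's sharper analysis): with probability at least $1-\delta$ over an i.i.d. sample of size $n$, for any fixed prior $P$ and all posteriors $Q$,

```latex
\mathbb{E}_{h \sim Q}\big[L(h)\big] \;\le\; \mathbb{E}_{h \sim Q}\big[\widehat{L}(h)\big]
\;+\; \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}} ,
```

where $L$ and $\widehat{L}$ denote the expected and empirical losses; minimizing the right-hand side over $Q$ is what is meant by training with a PAC-Bayesian objective.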
For a deep and wide probabilistic neural network, we show that when PAC-Bayesian learning is applied, the convergence result corresponds to solving a kernel ridge regression when the probabilistic neural tangent kernel (PNTK) is used as its kernel. Based on this finding, we further obtain an analytic and guaranteed PAC-Bayesian generalization bound for the first time, which is an improvement over the Rademacher complexity-based bound for deterministic neural networks. Finally, drawing insight from our theoretical results, we propose a proxy measure for efficient hyperparameter selection, which is proven to be time-saving on various benchmarks.","PAC-Bayes, Probabilistic Neural Networks, Neural Tangent Kernel" WeightRelay: Efficient Heterogenous Federated Learning on Time Series,https://openreview.net/forum?id=wYIKh2Z7bI8,https://openreview.net/pdf?id=wYIKh2Z7bI8,"To train multiple models on a smaller budget, we do not need to train every one from scratch. We can initialize some of the models with trained weights from the others for fast convergence.","Federated learning for heterogeneous devices aims to obtain models of various structural configurations in order to fit multiple devices according to their hardware configurations and external environments. Existing solutions train those heterogeneous models simultaneously, which requires extra cost (e.g., computation, communication, or data) to transfer knowledge between models. In this paper, we propose a method, namely weight relay (WeightRelay), that obtains heterogeneous models without any extra training cost. Specifically, we find that, compared with the classic random weight initialization, initializing the weight of a large neural network with the weight of a well-trained small network can reduce the number of training epochs while maintaining similar performance. Therefore, we can order models from the smallest to the largest and train them one by one. Each model (except the first one) can be initialized with the prior model's trained weight to reduce training cost. In the experiment, we evaluate weight relay on 128 time series datasets from multiple domains, and the results confirm the effectiveness of WeightRelay.","Federated learning, Heterogeneous models, Time series classification" Revisiting the Entropy Semiring for Neural Speech Recognition,https://openreview.net/forum?id=SNgLnzFQeiD,https://openreview.net/pdf?id=SNgLnzFQeiD,A numerically stable open-source implementation of the entropy semiring for CTC and RNN-T; obtained SOTA on Librispeech streaming.,"In streaming settings, speech recognition models have to map sub-sequences of speech to text before the full audio stream becomes available. However, since alignment information between speech and text is rarely available during training, models need to learn it in a completely self-supervised way. In practice, the exponential number of possible alignments makes this extremely challenging, with models often learning peaky or sub-optimal alignments. Prima facie, the exponential nature of the alignment space makes it difficult to even quantify the uncertainty of a model's alignment distribution. Fortunately, it has been known for decades that the entropy of a probabilistic finite state transducer can be computed in time linear to the size of the transducer via a dynamic programming reduction based on semirings. 
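The semiring reduction referenced here replaces scalar probabilities with pairs that additionally accumulate entropy terms, so the usual shortest-distance dynamic program computes entropy for free. A minimal sketch of the standard construction (not the paper's numerically stable, log-space variant):

```python
from dataclasses import dataclass
import math

@dataclass(frozen=True)
class EntropyWeight:
    """Element of the entropy (expectation) semiring: a pair <p, r>,
    where p is a path probability and r accumulates -p log p terms."""
    p: float
    r: float

    def __add__(self, o):   # semiring "plus": combine alternative paths
        return EntropyWeight(self.p + o.p, self.r + o.r)

    def __mul__(self, o):   # semiring "times": extend a path
        return EntropyWeight(self.p * o.p, self.p * o.r + o.p * self.r)

def lift(p: float) -> EntropyWeight:
    return EntropyWeight(p, -p * math.log(p))

# Two alternative two-step paths with probabilities 0.5*0.8 and 0.5*0.2:
total = lift(0.5) * lift(0.8) + lift(0.5) * lift(0.2)
# total.r equals the sum of -p log p over both complete paths.
```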
In this work, we revisit the entropy semiring for neural speech recognition models, and show how alignment entropy can be used to supervise models through regularization or distillation. We also contribute an open-source implementation of CTC and RNN-T in the semiring framework that includes numerically stable and highly parallel variants of the entropy semiring. Empirically, we observe that the addition of alignment distillation improves the accuracy and latency of an already well-optimized teacher-student distillation model, achieving state-of-the-art performance on the Librispeech dataset in the streaming scenario.","semiring, asr, ctc, rnn-t, entropy, regularization, distillation, streaming, speech recognition" Rethinking skip connection model as a learnable Markov chain,https://openreview.net/forum?id=yQdBtFfleh6,https://openreview.net/pdf?id=yQdBtFfleh6,The penal connection introduces only negligible computational burden and can be implemented with one line of code in most popular deep learning frameworks.,"Over the past few years since the birth of ResNet, the skip connection has become the de facto standard for the design of modern architectures due to its widespread adoption, easy optimization, and proven performance. Prior work has explained the effectiveness of the skip connection mechanism from different perspectives. In this work, we take a deep dive into the behavior of models with skip connections, which can be formulated as a learnable Markov chain. An efficient Markov chain is preferred, as it always maps the input data to the target domain in a better way. However, while a model can be explained as a Markov chain, it is not guaranteed to be optimized into an efficient one by existing SGD-based optimizers, which are prone to getting trapped in locally optimal points. To move toward a more efficient Markov chain, we propose a simple routine of penal connection that makes any residual-like model a learnable Markov chain. Aside from that, the penal connection can also be viewed as a particular form of model regularization and can be easily implemented with one line of code in the most popular deep learning frameworks. The encouraging experimental results in multi-modal translation and image recognition empirically confirm our conjecture of the learnable Markov chain view and demonstrate the superiority of the proposed penal connection.","Language translation, image classification, transformer" "Activation Function: Absolute Function,One Function Behaves more Individualized",https://openreview.net/forum?id=h9ThYkkgSD4,https://openreview.net/pdf?id=h9ThYkkgSD4,A new activation function,"Inspired by mechanisms in the natural world, an activation function is proposed: the absolute-value function. Through tests on the MNIST dataset with fully-connected and convolutional neural networks, several conclusions are drawn. The accuracy curve of the absolute function fluctuates slightly, unlike the accuracy curves of ReLU and leaky ReLU. The absolute function keeps the negative parts with the same weight as the positive parts, so its individualization is more active than that of the ReLU and leaky ReLU functions, and it is less likely to over-fit. Tests on MNIST with an autoencoder indicate that the leaky ReLU function handles classification tasks well, while the absolute function handles generation tasks well, because classification requires more universality and generation requires more individualization. 
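For concreteness, the activation in question is simply the absolute value, shown here next to the baselines it is compared against (a trivial sketch; variable names are ours):

```python
import numpy as np

def absolute_activation(x: np.ndarray) -> np.ndarray:
    """Absolute-value activation: unlike ReLU, negative inputs are kept
    with the same magnitude as positive ones, differing only in sign."""
    return np.abs(x)

# Comparison on the same inputs:
x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
relu = np.maximum(x, 0.0)            # zeroes out the negative part
leaky = np.where(x > 0, x, 0.01 * x) # keeps a small negative slope
```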
Pleasurable and painful stimuli differ not only in magnitude but also in sign, so the negative part should be kept. Stimuli that occur frequently have low values and appear around zero in Figure 1, while stimuli that occur only occasionally have high values and appear far from zero in Figure 1. High values therefore correspond to strong stimulation, which is individualization.","activation function, absolute function, individualization, universality, over-fitting, Z-Score, abstract network, concrete network,stimulation" ImageNet-E: Benchmarking Neural Network Robustness via Attribute Editing,https://openreview.net/forum?id=C1A2HD6EEGO,https://openreview.net/pdf?id=C1A2HD6EEGO,A new robustness benchmark that can help to evaluate the robustness against different object attributes,"Recent studies have shown that higher accuracy on ImageNet usually leads to better robustness against different corruptions. In this paper, instead of following the traditional research paradigm that investigates new out-of-distribution corruptions or perturbations deep models may encounter, we conduct model debugging on in-distribution data to explore which object attributes a model may be sensitive to. To achieve this goal, we create a toolkit for object editing with controls of backgrounds, sizes, positions, and directions, and create a rigorous benchmark named ImageNet-E(diting) for evaluating the image classifier robustness in terms of object attributes. With our ImageNet-E, we evaluate the performance of current deep learning models, including both convolutional neural networks and vision transformers. We find that most models are quite sensitive to attribute changes. An imperceptible change in the background can lead to an average drop of 10.15% in top-1 accuracy. We also evaluate some robust models, including both adversarially trained models and other robustly trained models, and find that some models show worse robustness against attribute changes than vanilla models. Based on these findings, we discover ways to enhance attribute robustness with preprocessing, architecture designs, and training strategies. We hope this work can provide some insights to the community and open up a new avenue for research in robust computer vision. The code and dataset will be publicly available. ","Robustness, benchmark, attribute" Measuring axiomatic identifiability of counterfactual image models,https://openreview.net/forum?id=lZOUQQvwI3q,https://openreview.net/pdf?id=lZOUQQvwI3q,We use the axiomatic definition of counterfactual to derive metrics that enable quantifying the correctness of approximate counterfactual inference models.,"We present a general framework for evaluating image counterfactuals. The power and flexibility of deep generative models make them valuable tools for learning mechanisms in structural causal models. However, their flexibility makes counterfactual identifiability impossible in the general case. Motivated by these issues, we revisit Pearl's axiomatic definition of counterfactuals to determine the necessary constraints of any counterfactual inference model: composition, reversibility, and effectiveness. We frame counterfactuals as functions of an input variable, its parents, and counterfactual parents and use the identifiability constraints to restrict the set of functions that could represent the counterfactual, thus deriving distance metrics between the approximate and ideal functions. 
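Two of the axioms translate directly into checkable distances for any candidate counterfactual function; the toy sketch below uses an L2 distance and an additive mechanism as stand-ins (both are our assumptions, not the paper's metrics):

```python
import numpy as np

def axiom_violations(cf, x, pa, pa_prime):
    """Illustrative checks for an approximate counterfactual function
    cf(x, parents, counterfactual_parents) -> x'. Composition: a null
    intervention is a no-op. Reversibility: intervening and reverting
    recovers the input."""
    composition = np.linalg.norm(cf(x, pa, pa) - x)
    reversibility = np.linalg.norm(cf(cf(x, pa, pa_prime), pa_prime, pa) - x)
    return {"composition": composition, "reversibility": reversibility}

# An ideal additive mechanism x = f(pa) + u satisfies both axioms exactly:
f = lambda pa: 2.0 * pa
cf = lambda x, pa, pa2: f(pa2) + (x - f(pa))   # abduct u, act, predict
print(axiom_violations(cf, np.array([3.0]), np.array([1.0]), np.array([2.0])))
# both violations are 0.0 for this ideal mechanism
```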
We demonstrate how these metrics can be used to compare and choose between different approximate models and to provide insight into a model's identifiability shortcomings and trade-offs.","Counterfactual inference, Generative Models, Computer Vision" Alternating Differentiation for Optimization Layers,https://openreview.net/forum?id=KKBMz-EL4tD,https://openreview.net/pdf?id=KKBMz-EL4tD,We propose a new implicit differentiation framework (Alt-Diff) that decouples optimization layers in an alternating way to increase the computational speed. We also prove the convergence of Alt-Diff and show the upper bound of the truncation error.,"The idea of embedding optimization problems into deep neural networks as optimization layers to encode constraints and inductive priors has taken hold in recent years. Most existing methods focus on implicitly differentiating Karush–Kuhn–Tucker (KKT) conditions in a way that requires expensive computations on the Jacobian matrix, which can be slow and memory-intensive. In this paper, we develop a new framework, named Alternating Differentiation (Alt-Diff), that differentiates optimization problems (here, specifically in the form of convex optimization problems with polyhedral constraints) in a fast and recursive way. Alt-Diff decouples the differentiation procedure into a primal update and a dual update in an alternating way. Accordingly, Alt-Diff substantially decreases the dimensions of the Jacobian matrix and thus significantly increases the computational speed of implicit differentiation. Further, we present the computational complexity of the forward and backward pass of Alt-Diff and show that Alt-Diff enjoys quadratic computational complexity in the backward pass. Another notable difference between Alt-Diff and the state of the art is that Alt-Diff can be truncated for the optimization layer. We theoretically show that: 1) Alt-Diff can converge to consistent gradients obtained by differentiating KKT conditions; 2) the error between the gradient obtained by the truncated Alt-Diff and by differentiating KKT conditions is upper bounded by the same order as the variables' truncation error. Therefore, Alt-Diff can be truncated to further increase computational speed without sacrificing much accuracy. A series of comprehensive experiments demonstrate that Alt-Diff yields results comparable to the state of the art in far less time. ","Alternating differentiation, optimization layers, unrolling, implicit models" Cross-Domain Autonomous Driving Perception using Contrastive Appearance Adaptation,https://openreview.net/forum?id=Ox0ZtZKG9_-,https://openreview.net/pdf?id=Ox0ZtZKG9_-,,"Addressing domain shifts for complex perception tasks in autonomous driving has long been a challenging problem. In this paper, we show that existing domain adaptation methods pay little attention to the \textit{content mismatch} issue between source and target images, thereby weakening the domain adaptation performance and the decoupling of domain-invariant and domain-specific representations. To solve the aforementioned problems, we propose an image-level domain adaptation framework that aims at adapting source-domain images to the target domain with content-aligned image pairs. 
Our framework consists of three mutually beneficial modules in a cycle: a \textit{cross-domain content alignment} module to generate source-target pairs with consistent content representations in a self-supervised manner, a \textit{reference-guided image synthesis} module using the generated content-aligned source-target image pairs, and a \textit{contrastive learning} module that learns a domain-invariant feature extractor from the generated images in a self-supervised way. Our contrastive appearance adaptation is task-agnostic and robust to complex perception tasks in autonomous driving. Our proposed method demonstrates state-of-the-art results in cross-domain object detection, semantic segmentation, and depth estimation, as well as better image synthesis ability both qualitatively and quantitatively.","Domain adaptation, Object detection, Semantic segmentation, Depth estimation" Out-of-distribution Detection with Implicit Outlier Transformation,https://openreview.net/forum?id=hdghx6wbGuD,https://openreview.net/pdf?id=hdghx6wbGuD,,"Outlier exposure (OE) is powerful in out-of-distribution (OOD) detection, enhancing detection capability via model fine-tuning with surrogate OOD data. However, surrogate data typically deviate from test OOD data. Thus, the performance of OE when facing unseen OOD data can be weakened. To address this issue, we propose a novel OE-based approach that makes the model perform well even in unseen OOD situations. It leads to a min-max learning scheme---searching to synthesize OOD data that lead to worst-case judgments and learning from such OOD data for uniform performance in OOD detection. In our realization, these worst-case OOD data are synthesized by transforming original surrogate ones, where the associated transform functions are learned implicitly based on our novel insight that model perturbation leads to data transformation. Our methodology offers an efficient way of synthesizing OOD data, which can further benefit the detection model, besides the surrogate OOD data. We conduct extensive experiments under various OOD detection setups, demonstrating the effectiveness of our method against its advanced counterparts. ",out-of-distribution detection Parameter Averaging for Feature Ranking,https://openreview.net/forum?id=NYtq-CsRP3H,https://openreview.net/pdf?id=NYtq-CsRP3H,"In this work, we introduce a novel method based on parameter averaging to estimate accurate and robust feature importance in the tabular data setting, referred to as XTab.","Neural networks are known to be sensitive to initialisation. Methods that rely on neural networks for feature ranking are not robust, since their rankings can vary when the model is initialized and trained with different random seeds. In this work, we introduce a novel method based on parameter averaging to estimate accurate and robust feature importance in the tabular data setting, referred to as XTab. We first initialize and train multiple instances of a shallow network (referred to as local masks) with ""different random seeds"" for a downstream task. We then obtain a global mask model by ""averaging the parameters"" of the local masks. We show that although the parameter averaging might result in a global model with higher loss, it still leads to the discovery of the ground-truth feature importance more consistently than an individual model does. 
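The parameter-averaging step itself is a simple element-wise mean over the local masks' weights; a minimal sketch assuming architecturally identical PyTorch state dicts (names are ours):

```python
import torch

def average_parameters(state_dicts):
    """Element-wise mean of several models' parameters, producing the
    weights of a single global model. Assumes all state dicts come from
    the same architecture, trained with different random seeds."""
    keys = state_dicts[0].keys()
    return {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
            for k in keys}

# usage sketch:
# global_mask.load_state_dict(
#     average_parameters([m.state_dict() for m in local_masks]))
```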
We conduct extensive experiments on a variety of synthetic and real-world data, demonstrating that XTab can be used to obtain global feature importance that is not sensitive to sub-optimal model initialisation.","Parameter averaging, feature ranking, feature importance, robustness, interpretability, tabular data" Gradient Estimation for Unseen Domain Risk Minimization with Pre-Trained Models,https://openreview.net/forum?id=BqxE86ufTzq,https://openreview.net/pdf?id=BqxE86ufTzq,Gradient Estimation for Unseen Domain Risk Minimization with Pre-Trained Models,"Domain generalization aims to build generalized models that perform well on unseen domains when only source domains are available for model optimization. Recent studies have demonstrated that large-scale pre-trained models could play an important role in domain generalization by providing their generalization power. However, large-scale pre-trained models are not fully equipped with target task-specific knowledge due to a discrepancy between the pre-training objective and the target task. Although the task-specific knowledge could be learned from source domains by fine-tuning, this hurts the generalization power of the pre-trained models because of gradient bias toward the source domains. To address this issue, we propose a new domain generalization method that estimates unobservable gradients that reduce potential risks in unseen domains, using a large-scale pre-trained model. Our proposed method allows the pre-trained model to learn task-specific knowledge further while preserving its generalization ability with the estimated gradients. Experimental results show that our proposed method outperforms baseline methods on DomainBed, a standard benchmark in domain generalization. We also provide extensive analyses to demonstrate that the estimated unobserved gradients relieve the gradient bias, and the pre-trained model learns the task-specific knowledge without sacrificing its generalization power.","domain generalization, gradient estimation, pre-trained models" Nearing or Surpassing: Overall Evaluation of Human-Machine Dynamic Vision Ability,https://openreview.net/forum?id=LGbzYw_pnsc,https://openreview.net/pdf?id=LGbzYw_pnsc,,"Dynamic visual ability (DVA), a fundamental function of the human visual system, has been successfully modeled by many computer vision tasks in recent decades. However, this prosperous development has mainly concentrated on using deep neural networks (DNNs) to simulate the human DVA system, while evaluation systems still simply compare performance between machines, making it tough to determine how large the gap is between humans and machines in dynamic vision tasks. In fact, neglecting this issue not only makes it hard to determine the correctness of current research routes, but also prevents truly measuring the DVA intelligence of machines. To answer this question, this work designs a comprehensive evaluation system based on the 3E paradigm -- we carefully pick 87 videos from various dimensions to construct the environment, confirming that it covers both perceptual and cognitive components of DVA; select 20 representative machines and 15 human subjects to form the task executors, ensuring that different model structures can help us observe the effectiveness of research development; and finally quantify their DVA with a strict evaluation process. Based on detailed experimental analyses, we first determine that the current algorithm research route has effectively shortened the gap. 
In addition, we further summarize the weaknesses of different executors, and design a human-machine cooperation mechanism with superhuman performance. In summary, the contributions include: (1) quantifying the DVA of humans and machines, (2) proposing a new view to evaluate DVA intelligence based on human-machine comparison, and (3) providing a possibility of human-machine cooperation. The datasets, toolkits, codes, and evaluation metrics will be open-sourced to help researchers develop intelligent research on dynamic vision tasks.","Dynamic Visual Ability, Machine Intelligence Evaluation, Single Object Tracking" Re-balancing Adversarial Training Over Unbalanced Datasets,https://openreview.net/forum?id=U_BPCe6yKb9,https://openreview.net/pdf?id=U_BPCe6yKb9,,"In this paper, we study adversarial training on datasets that obey the long-tailed distribution, which is practical but rarely explored in previous works. Compared with conventional adversarial training on balanced datasets, this process falls into the dilemma of generating uneven adversarial examples (AEs) and an unbalanced feature embedding space, causing the resulting model to exhibit low robustness and accuracy on tail data. To combat that, we propose a new adversarial training framework -- Re-balancing Adversarial Training (REAT). This framework consists of two components: (1) a new training strategy inspired by the notion of effective number to guide the model to generate more balanced and informative AEs; (2) a carefully constructed penalty function to enforce a satisfactory feature space. Evaluation results on different datasets and model structures prove that REAT can enhance the model's robustness and preserve the model's clean accuracy.", Extracting Robust Models with Uncertain Examples,https://openreview.net/forum?id=cMAjKYftNwx,https://openreview.net/pdf?id=cMAjKYftNwx,,"Model extraction attacks are proven to be a severe privacy threat to Machine Learning as a Service (MLaaS). A variety of techniques have been designed to steal a remote machine learning model with high accuracy and fidelity. However, how to extract a robust model with similar resilience against adversarial attacks has never been investigated. This paper presents the first study toward this goal. We first show that existing extraction solutions either fail to maintain model accuracy or model robustness, or suffer from the robust overfitting issue. Then we propose Boundary Entropy Searching Thief (BEST), a novel model extraction attack to achieve both accuracy and robustness extraction under restricted attack budgets. BEST generates a new kind of uncertain examples for querying and reconstructing the victim model. These samples have uniform confidence scores across different classes, which can perfectly balance the trade-off between model accuracy and robustness. Extensive experiments demonstrate that BEST outperforms existing attack methods over different datasets and model architectures under limited data. 
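One generic way to obtain samples with near-uniform confidence is to ascend the prediction entropy by gradient steps; the sketch below illustrates that idea only, and is not the BEST attack's actual search procedure (all names and hyperparameters are ours):

```python
import torch
import torch.nn.functional as F

def make_uncertain_example(model, x, steps=20, lr=0.01):
    """Perturb x so the model's class confidences become near-uniform,
    i.e. the softmax entropy is maximized. model(x) is assumed to
    return logits."""
    x = x.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        probs = F.softmax(model(x), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
        loss = -entropy                  # ascend entropy toward uniformity
        opt.zero_grad(); loss.backward(); opt.step()
    return x.detach()
```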
It can also effectively invalidate state-of-the-art extraction defenses.", Neural Groundplans: Persistent Neural Scene Representations from a Single Image,https://openreview.net/forum?id=Pza24zf9FpS,https://openreview.net/pdf?id=Pza24zf9FpS,"We train a self-supervised model that learns to map a single image to a 3D representation of the scene, with separate components for the immovable and movable 3D regions.","We present a method to map 2D image observations of a scene to a persistent 3D scene representation, enabling novel view synthesis and disentangled representation of the movable and immovable components of the scene. Motivated by the bird’s-eye-view (BEV) representation commonly used in vision and robotics, we propose conditional neural groundplans, ground-aligned 2D feature grids, as persistent and memory-efficient scene representations. Our method is trained self-supervised from unlabeled multi-view observations using differentiable rendering, and learns to complete geometry and appearance of occluded regions. In addition, we show that we can leverage multi-view videos at training time to learn to separately reconstruct static and movable components of the scene from a single image at test time. The ability to separately reconstruct movable objects enables a variety of downstream tasks using simple heuristics, such as extraction of object-centric 3D representations, novel view synthesis, instance-level segmentation, 3D bounding box prediction, and scene editing. This highlights the value of neural groundplans as a backbone for efficient 3D scene understanding models.","Neural scene representations, 3D, nerf, scene understanding, neural rendering, object-centric representations" Unified Vision and Language Prompt Learning,https://openreview.net/forum?id=1QQnYd02etI,https://openreview.net/pdf?id=1QQnYd02etI,,"Prompt tuning, a parameter- and data-efficient transfer learning paradigm that tunes only a small number of parameters in a model's input space, has become a trend in the vision community since the emergence of large vision-language models like CLIP. We present a systematic study on two representative prompt tuning methods, namely text prompt tuning and visual prompt tuning. A major finding is that none of the unimodal prompt tuning methods performs consistently well: text prompt tuning fails on data with high intra-class visual variances while visual prompt tuning cannot handle low inter-class variances. To combine the best from both worlds, we propose a simple approach called Unified Prompt Tuning (UPT), which essentially learns a tiny neural network to jointly optimize prompts across different modalities. Extensive experiments on over 11 vision datasets show that UPT achieves a better trade-off than the unimodal counterparts on few-shot learning benchmarks, as well as on domain generalization benchmarks. Code and models will be released to facilitate future research.", Semi-supervised Counting via Pixel-by-pixel Density Distribution Modelling,https://openreview.net/forum?id=t-hNmA0cVSW,https://openreview.net/pdf?id=t-hNmA0cVSW,,"This paper focuses on semi-supervised crowd counting, where only a small portion of the training data are labeled. We formulate the pixel-wise density value to be regressed as a probability distribution, instead of a single deterministic value, and utilize a dual-branch structure to model the corresponding discrete form of the distribution function. On this basis, we propose a semi-supervised crowd counting model. 
Firstly, we enhance the transformer decoder by using density tokens to specialize the forward passes of the decoders w.r.t. different density intervals; secondly, we design a pixel-wise distribution matching loss to measure the differences in the pixel-wise density distributions between the prediction and the ground truth; thirdly, we propose an interleaving consistency regularization term to align the predictions of the two branches and make them consistent. Extensive experiments on four datasets show that our method clearly outperforms the competitors by a large margin under various labeled ratio settings.","Computer Vision, Crowd Counting, Semi-Supervised Learning" Calibrating Multimodal Learning,https://openreview.net/forum?id=hTWqB327Oay,https://openreview.net/pdf?id=hTWqB327Oay,,"Multimodal machine learning has achieved remarkable progress in a wide range of scenarios. However, the reliability of multimodal learning remains largely unexplored. In this paper, through extensive empirical studies, we identify that current methods suffer from unreliable predictive confidence that tends to rely on partial modalities when estimating confidence. Specifically, we find that the confidence estimated by current models could even increase when some modalities are corrupted. To address the issue, we introduce an intuitive principle for multimodal classification, i.e., the confidence should not increase when one modality is removed. Accordingly, we propose a novel regularization technique, i.e., Calibrating Multimodal Learning (CML) regularization, to calibrate the predictive confidence of previous methods. This technique can be flexibly incorporated into existing models and improves performance in terms of confidence calibration, classification accuracy, and model robustness.", Understanding Self-Supervised Pretraining with Part-Aware Representation Learning,https://openreview.net/forum?id=3tYvDb4dwab,https://openreview.net/pdf?id=3tYvDb4dwab,"We study the capability of learning part-aware representations of self-supervised pretraining methods, including contrastive learning and masked image modeling.","In this paper, we are interested in understanding self-supervised pretraining through studying the capability that self-supervised representation pretraining methods learn part-aware representations. The study is mainly motivated by the fact that random views, used in contrastive learning, and random masked (visible) patches, used in masked image modeling, are often about object parts. We explain that masked image modeling is a part-to-part task: the masked patches of the object are hallucinated from the visible patches, and that contrastive learning is a part-to-whole task: the projection layer hallucinates the whole object representation from the object part representation learned from the encoder. The explanation suggests that the self-supervised pretrained encoder is required to understand the object part. We empirically compare the off-the-shelf encoders pretrained with several representative methods on object-level recognition and part-level recognition. The results show that the fully-supervised model outperforms self-supervised models for object-level recognition, and most self-supervised contrastive learning and masked image modeling methods outperform the fully-supervised method for part-level recognition. 
It is observed that the combination of contrastive learning and masked image modeling further improves the performance.","Part-aware representation, Self-supervised learning, Masked image modeling, Contrastive learning" E-CRF: Embedded Conditional Random Field for Boundary-caused Class Weights Confusion in Semantic Segmentation,https://openreview.net/forum?id=g1GnnCI1OrC,https://openreview.net/pdf?id=g1GnnCI1OrC,,"Modern semantic segmentation methods devote much effort to adjusting image feature representations to improve the segmentation performance in various ways, such as architecture design, attention mechanisms, etc. However, almost all those methods neglect the particularity of class weights (in the classification layer) in segmentation models. In this paper, we notice that the class weights of categories that tend to share many adjacent boundary pixels lack discrimination, thereby limiting the performance. We call this issue Boundary-caused Class Weights Confusion (BCWC). We focus on this problem and propose a novel method named Embedded Conditional Random Field (E-CRF) to alleviate it. E-CRF innovatively fuses the CRF into the CNN network as an organic whole for more effective end-to-end optimization. The reasons are twofold. It utilizes the CRF to guide message passing between pixels in high-level features to purify the feature representation of boundary pixels, with the help of inner pixels belonging to the same object. More importantly, it enables optimizing class weights in terms of both scale and direction during backpropagation. We provide a detailed theoretical analysis to prove this. Besides, superpixels are integrated into E-CRF and serve as an auxiliary to exploit local object priors for more reliable message passing. Finally, our proposed method yields impressive results on the ADE20K, Cityscapes, and Pascal Context datasets. Code will be available.", Sample Complexity of Nonparametric Off-Policy Evaluation on Low-Dimensional Manifolds using Deep Networks,https://openreview.net/forum?id=9x3CO0ZU9LR,https://openreview.net/pdf?id=9x3CO0ZU9LR,,"We consider the off-policy evaluation problem of reinforcement learning using deep convolutional neural networks. We analyze the deep fitted Q-evaluation method for estimating the expected cumulative reward of a target policy, when the data are generated from an unknown behavior policy. We show that, by choosing network size appropriately, one can leverage any low-dimensional manifold structure in the Markov decision process and obtain a sample-efficient estimator without suffering from the curse of high data ambient dimensionality. Specifically, we establish a sharp error bound for fitted Q-evaluation, which depends on the intrinsic dimension of the state-action space, the smoothness of the Bellman operator, and a function class-restricted $\chi^2$-divergence. It is noteworthy that the restricted $\chi^2$-divergence measures the behavior and target policies' {\it mismatch in the function space}, which can be small even if the two policies are not close to each other in their tabular forms. We also develop a novel approximation result for convolutional neural networks in Q-function estimation. 
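For readers unfamiliar with fitted Q-evaluation, the procedure alternates regression steps on Bellman targets; a minimal sketch for a discrete action space (the regressor choice and (s, a, r, s') data layout are our assumptions, whereas the paper analyzes CNN function classes):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fitted_q_evaluation(data, target_policy, gamma=0.99, iters=50):
    """Repeatedly regress Q(s, a) onto r + gamma * Q(s', pi(s')),
    where pi is the target policy and the data come from an unknown
    behavior policy."""
    s, a, r, s_next = data                 # (N, d), (N,), (N,), (N, d)
    q = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=200)
    phi = lambda states, acts: np.hstack([states, acts.reshape(-1, 1)])
    targets = r.copy()
    for _ in range(iters):
        q.fit(phi(s, a), targets)          # one regression step
        a_pi = target_policy(s_next)       # target policy's actions
        targets = r + gamma * q.predict(phi(s_next, a_pi))
    return q  # evaluate E[return] by averaging q over initial states
```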
Numerical experiments are provided to support our theoretical analysis.","RL theory, deep off-policy evaluation, neural network function approximation, manifold data" Stochastic Differentially Private and Fair Learning,https://openreview.net/forum?id=3nM5uhPlfv6,https://openreview.net/pdf?id=3nM5uhPlfv6,"The first efficient differentially private fair learning algorithm that is guaranteed to converge, even when stochastic minibatches of data are used in each iteration of training.","Machine learning models are increasingly used in high-stakes decision-making systems. In such applications, a major concern is that these models sometimes discriminate against certain demographic groups such as individuals with a certain race, gender, or age. Another major concern in these applications is the violation of the privacy of users. While fair learning algorithms have been developed to mitigate discrimination issues, these algorithms can still leak sensitive information, such as individuals’ health or financial records. Utilizing the notion of differential privacy (DP), prior works aimed at developing learning algorithms that are both private and fair. However, existing algorithms for DP fair learning are either not guaranteed to converge or require a full batch of data in each iteration of the algorithm to converge. In this paper, we provide the first stochastic differentially private algorithm for fair learning that is guaranteed to converge. Here, the term “stochastic"" refers to the fact that our proposed algorithm converges even when minibatches of data are used at each iteration (i.e., stochastic optimization). Our framework is flexible enough to permit different fairness notions, including demographic parity and equalized odds. In addition, our algorithm can be applied to non-binary classification tasks with multiple (non-binary) sensitive attributes. As a byproduct of our convergence analysis, we provide the first utility guarantee for a DP algorithm for solving nonconvex-strongly concave min-max problems. Our numerical experiments show that the proposed algorithm consistently offers significant performance gains over the state-of-the-art baselines, and can be applied to larger scale problems with non-binary target/sensitive attributes.","algorithmic fairness, differential privacy, private fair learning, stochastic optimization" CLIP-FLOW: CONTRASTIVE LEARNING WITH ITERATIVE PSEUDO LABELING FOR OPTICAL FLOW,https://openreview.net/forum?id=esRySujigfO,https://openreview.net/pdf?id=esRySujigfO,A semi-supervised framework for optical flow with iterative pseudo labeling and contrastive flow loss to facilitate representation learning with unlabeled data,"Synthetic datasets are often used to pretrain end-to-end optical flow networks, due to the lack of a large amount of labeled, real scene data. But major drops in accuracy occur when moving from synthetic to real scenes. To this end, we propose CLIP-Flow, a semi-supervised iterative pseudo labeling framework to transfer the pretraining knowledge to the target real domain. We leverage large-scale, unlabeled real data to facilitate transfer learning with the supervision of iteratively updated pseudo ground truth labels, bridging the domain gap between the synthetic and the real. 
In addition, we propose a contrastive flow loss on reference features and the features warped by pseudo ground truth flows, to further promote accurate matching and suppress mismatches due to motion, occlusion, or noisy pseudo labels. We adopt RAFT as the backbone and obtain an F1-all error of 4.11%, i.e., a 19% error reduction from RAFT (5.10%), ranking 2nd at submission time on the KITTI 2015 benchmark. Our framework can also be extended to other models, e.g., CRAFT, reducing the F1-all error from 4.79% to 4.66% on the KITTI 2015 benchmark. ","Optical Flow, Contrastive Learning, Semi-supervised Learning" Smooth-Reduce: Leveraging Patches for Improved Certified Robustness,https://openreview.net/forum?id=VqDrqeQ8-C4,https://openreview.net/pdf?id=VqDrqeQ8-C4,"We present a novel, patch-based approach to simulate ensembles that improves randomized smoothing performance.","Randomized smoothing (RS) has been shown to be a fast, scalable technique for certifying the robustness of deep neural network classifiers. However, methods based on RS require augmenting data with large amounts of noise, which leads to significant drops in accuracy. We propose a training-free, modified smoothing approach, Smooth-Reduce, that leverages patching and aggregation to provide improved classifier certificates. Our algorithm classifies overlapping patches extracted from an input image, and aggregates the predicted logits to certify a larger radius around the input. We study two aggregation schemes --- max and mean --- and show that both approaches provide better certificates in terms of certified accuracy, average certified radii and abstention rates as compared to concurrent approaches. We also provide theoretical guarantees for such certificates, and empirically show significant improvements over other randomized smoothing methods that require expensive retraining. Further, we extend our approach to videos and provide meaningful certificates for video classifiers.","Adversarial defenses, Certifiable defenses, Randomized Smoothing, Ensemble Models, Robust Video Classifiers" "CAN: A simple, efficient and scalable contrastive masked autoencoder framework for learning visual representations",https://openreview.net/forum?id=qmV_tOHp7B9,https://openreview.net/pdf?id=qmV_tOHp7B9,"We propose a minimal and conceptually clean synthesis of (C) contrastive learning, (A) masked autoencoders, and (N) noise prediction for self-supervised learning on images","We introduce CAN, a simple, efficient and scalable method for self-supervised learning of visual representations. Our framework is a minimal and conceptually clean synthesis of (C) contrastive learning, (A) masked autoencoders, and (N) the noise prediction approach used in diffusion models. The learning mechanisms are \emph{complementary} to one another: contrastive learning shapes the embedding space across a batch of image samples; masked autoencoders focus on reconstruction of the low-frequency spatial correlations in a single image sample; and noise prediction encourages the reconstruction of the high-frequency components of an image. The combined approach results in a robust, scalable and simple-to-implement algorithm. The training process is symmetric, with $50\%$ of patches in \emph{both views} being masked at random, yielding a considerable efficiency improvement over prior contrastive learning methods. 
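A generic sketch of how such a three-part objective can be combined in one loss (our illustration under assumed shapes and loss weights, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def can_style_loss(z1, z2, pred_patches, true_patches, pred_noise, true_noise,
                   temperature=0.1, w_rec=1.0, w_noise=1.0):
    """(C) InfoNCE on the two views' embeddings, (A) masked-patch
    reconstruction, and (N) noise prediction, summed with weights.
    z1, z2: (B, D) view embeddings; the patch/noise tensors are the
    model's predictions and their targets."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature                 # (B, B) similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    contrastive = F.cross_entropy(logits, labels)    # matched pairs on diagonal
    reconstruction = F.mse_loss(pred_patches, true_patches)  # masked patches
    denoising = F.mse_loss(pred_noise, true_noise)           # added noise
    return contrastive + w_rec * reconstruction + w_noise * denoising
```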
Extensive empirical studies on linear evaluation, finetuning, transfer learning, and robustness demonstrate that our approach achieves strong downstream performance. For instance, when pre-training ViT-B encoders on the curated ImageNet dataset, CAN achieves $74.8\%$ top-1 linear probing accuracy, an absolute improvement of $6.8\%$ over MAE and $1.3\%$ over SimCLR with the same architecture and data augmentations. CAN is especially useful for pre-training on larger uncurated datasets such as JFT-300M: the finetuned performance on ImageNet of our ViT-L model is $85.9\%$, compared to $85.0\%$ for SimCLR, and $85.4\%$ for MAE. For linear probe on ImageNet, CAN achieves $75.4\%$ compared to $71.8\%$ for SimCLR and $64.1\%$ for MAE. The overall FLOPs load is $41\%$ \emph{lower} than SimCLR\footnote{Our code will be released at \url{www.xxx.yyy}.}. ","Self supervised learning, contrastive learning, masked autoencoders" On The Inadequacy of Optimizing Alignment and Uniformity in Contrastive Learning of Sentence Representations,https://openreview.net/forum?id=MxvHVNukama,https://openreview.net/pdf?id=MxvHVNukama,,"Contrastive learning is widely used in areas such as visual representation learning (VRL) and sentence representation learning (SRL). Considering the differences between VRL and SRL in terms of negative sample size and evaluation focus, we believe that the solid findings obtained in VRL may not carry over entirely to SRL. In this work, we consider the suitability of the decoupled form of contrastive loss, i.e., alignment and uniformity, in SRL. We find a performance gap between sentence representations obtained by jointly optimizing alignment and uniformity on the STS task and those obtained using contrastive loss. Further, we find that the joint optimization of alignment and uniformity during training is prone to overfitting, which does not occur with the contrastive loss. Analyzing them based on the variation of the gradient norms, we find that there is a property of ``gradient dissipation'' in contrastive loss and believe that it is the key to preventing overfitting. We simulate similar ""gradient dissipation"" of contrastive loss on four optimization objectives of two forms, and achieve the same or even better performance than contrastive loss on the STS tasks, confirming our hypothesis.","Sentence representation learning, Contrastive learning, Alignment, Uniformity" Self-supervised Video Representation Learning with Motion-Aware Masked Autoencoders,https://openreview.net/forum?id=lU5R_e4zXza,https://openreview.net/pdf?id=lU5R_e4zXza,"We propose a motion-aware MAE method, MotionMAE, for self-supervised spatiotemporal representation learning from unlabeled videos.","Masked autoencoders (MAEs) have emerged recently as state-of-the-art self-supervised spatiotemporal representation learners. Inheriting from their image counterparts, however, existing video MAEs still focus largely on static appearance learning whilst being limited in learning dynamic temporal information, and are hence less effective for downstream video tasks. To resolve this drawback, in this work we present a motion-aware variant – MotionMAE. Apart from learning to reconstruct individual masked patches of video frames, our model is designed to additionally predict the corresponding motion structure information over time. This motion information is available in the temporal difference of nearby frames. 
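The temporal-difference motion signal mentioned here is simple to construct; a minimal sketch (function name is ours):

```python
import torch

def motion_targets(frames: torch.Tensor) -> torch.Tensor:
    """Motion structure targets as frame differences: for a clip of shape
    (T, C, H, W), the target at time t is frames[t+1] - frames[t],
    giving T-1 difference maps to predict alongside masked patches."""
    return frames[1:] - frames[:-1]

# usage: clip of 16 frames -> 15 motion target maps
targets = motion_targets(torch.randn(16, 3, 224, 224))
```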
As a result, our model can effectively extract both static appearance and dynamic motion information, leading to a superior spatiotemporal representation learning capability. Extensive experiments show that our MotionMAE significantly outperforms both the supervised learning baseline and state-of-the-art MAE alternatives, under both domain-specific and domain-generic pretraining-then-finetuning settings. In particular, when using ViT-B as the backbone, our MotionMAE surpasses the prior state-of-the-art model by a margin of 1.2% on Something-Something V2 and 3.2% on UCF101 in the domain-specific pretraining setting. Encouragingly, it also surpasses the competing MAEs by a large margin of over 3% on the challenging video object segmentation task.","Masked autoencoders, Self-supervised spatiotemporal representation learning" Bidirectional Learning for Offline Model-based Biological Sequence Design,https://openreview.net/forum?id=luEG3j9LW5-,https://openreview.net/pdf?id=luEG3j9LW5-,We adapt bidirectional learning to biological sequence design and propose Adaptive-$\eta$ to tune learning rates for gradient-based algorithms on offline model-based optimization. ,"Offline model-based optimization aims to maximize a black-box objective function with a static dataset of designs and their scores. In this paper, we focus on biological sequence design to maximize some sequence score. A recent approach employs bidirectional learning, combining a forward mapping for exploitation and a backward mapping for constraint, and it relies on the neural tangent kernel (NTK) of an infinitely wide network to build a proxy model. Though effective, the NTK cannot learn features because of its parametrization, and its use prevents the incorporation of powerful pre-trained Language Models (LMs) that can capture the rich biophysical information in millions of biological sequences. We adopt an alternative proxy model, adding a linear head to a pre-trained LM, and propose a linearization scheme. This yields a closed-form loss and also takes into account the biophysical information in the pre-trained LM. In addition, the forward mapping and the backward mapping play different roles and thus deserve different weights during sequence optimization. To achieve this, we train an auxiliary model and leverage its weak supervision signal via a bi-level optimization framework to effectively learn how to balance the two mappings. Further, by extending the framework, we develop the first learning rate adaptation module Adaptive-$\eta$, which is compatible with all gradient-based algorithms for offline model-based optimization. Experimental results on DNA/protein sequence design tasks verify the effectiveness of our algorithm. Our code is available at https://anonymous.4open.science/r/BIB-ICLR2023-Submission/README.md.","offline model-based optimization, meta learning, biological sequence design." Neural Collapse Inspired Feature-Classifier Alignment for Few-Shot Class-Incremental Learning,https://openreview.net/forum?id=y5W8tpojhtJ,https://openreview.net/pdf?id=y5W8tpojhtJ,An interpretable solution inspired by neural collapse for few-shot class-incremental learning,"Few-shot class-incremental learning (FSCIL) has been a challenging problem as only a few training samples are accessible for each novel class in the new sessions. 
Finetuning the backbone or adjusting the classifier prototypes trained in the prior sessions would inevitably cause a misalignment between the features and classifier prototypes of old classes, which explains the well-known catastrophic forgetting problem. In this paper, we deal with this misalignment dilemma in FSCIL inspired by the recently discovered phenomenon named neural collapse, which reveals that the last-layer features of the same class will collapse into a vertex, and the vertices of all classes are aligned with the classifier prototypes, which are formed as a simplex equiangular tight frame (ETF). It corresponds to an optimal geometric structure for classification due to the maximized Fisher Discriminant Ratio. We propose a neural collapse inspired framework for FSCIL. A group of classifier prototypes is pre-assigned as a simplex ETF for the whole label space, including the base session and all the incremental sessions. During training, the classifier prototypes are not learnable, and we adopt a novel loss function that drives the features into their corresponding prototypes. Theoretical analysis shows that our method holds the neural collapse optimality and does not break the feature-classifier alignment in an incremental fashion. Experiments on the miniImageNet, CUB-200, and CIFAR-100 datasets demonstrate that our proposed framework outperforms state-of-the-art methods. Our code will be publicly available. ","few-shot class-incremental learning, neural collapse" Self-conditioned Embedding Diffusion for Text Generation,https://openreview.net/forum?id=OpzV3lp3IMC,https://openreview.net/pdf?id=OpzV3lp3IMC,"Our continuous diffusion framework operates on word embeddings, enabling flexible and scalable diffusion models for text generation.","Can continuous diffusion models bring the same performance breakthrough on natural language as they did for image generation? To circumvent the discrete nature of text data, we can simply project tokens into a continuous space of embeddings, as is standard in language modeling. We propose Self-conditioned Embedding Diffusion (SED), a continuous diffusion mechanism that operates on token embeddings and allows us to learn flexible and scalable diffusion models for both conditional and unconditional text generation. Through qualitative and quantitative evaluation, we show that our text diffusion models generate samples comparable with those produced by standard autoregressive language models — while being in theory more efficient on accelerator hardware at inference time. Our work paves the way for scaling up diffusion models for text, similarly to autoregressive models, and for improving performance with recent refinements to continuous diffusion.","language models, diffusion models, generative models" Decoupling Concept Bottleneck Model,https://openreview.net/forum?id=vVbUB9oWUup,https://openreview.net/pdf?id=vVbUB9oWUup,We analyze the concept/label trade-off for Concept Bottleneck Model (CBM) and propose a new interactive and interpretable AI system to alleviate this issue.,"Concept Bottleneck Model (CBM) is a powerful kind of interpretable neural network, which utilizes high-level concepts to explain model decisions and interact with humans. However, CBM cannot always work as expected due to the troublesome collection and common insufficiency of high-level concepts in real-world scenarios. 
In this paper, we theoretically reveal that insufficient concept information will induce a mixture of explicit and implicit information, which further leads to the inherent dilemma of concept and label distortions in CBM. Motivated by the proposed theorem, we present Decoupling Concept Bottleneck Model (DCBM), a novel concept-based model decoupling heterogeneous information into explicit and implicit concepts, while still retaining high prediction performance and interpretability. Extensive experiments demonstrate success in alleviating concept/label distortions, where DCBM achieves state-of-the-art performance in both concept and label learning tasks. Especially for situations where concepts are insufficient, DCBM significantly outperforms other concept bottleneck-based models, achieving error rates 24.95% and 20.09% lower than other CBMs on concept and label prediction, respectively. Moreover, to enable effective human-machine interactions for DCBM, we devise two algorithms based on mutual information (MI) estimation, including forward intervention and backward rectification, which can automatically correct labels and trace back to wrong concepts. The construction of the interaction regime can be formulated as a lightweight min-max optimization problem solvable within minutes. Multiple experiments show that such interactions can effectively improve concept/label accuracy. ","Interpretability, Concept-based Model" OhMG: Zero-shot Open-vocabulary Human Motion Generation,https://openreview.net/forum?id=4lGL_ruf--t,https://openreview.net/pdf?id=4lGL_ruf--t,"We propose a zero-shot open-vocabulary human motion generation framework, with guidance from a large foundation model (i.e., CLIP)","Generating motion in line with text has attracted increasing attention. However, open-vocabulary human motion generation still remains largely untouched and suffers from the lack of diverse labeled data. The good news is that recent studies of large foundation models (e.g., CLIP) have demonstrated superior performance on few/zero-shot image-text alignment, largely reducing the need for manually labeled data. In this paper, we take advantage of CLIP for open-vocabulary 3D human motion generation in a zero-shot manner. Specifically, our model is composed of two stages, i.e., text2pose and pose2motion generation. For text2pose generation, to address the difficulty of optimization with direct supervision from CLIP, we propose to carve the versatile CLIP model into a slimmer but more specific model for aligning 3D poses and texts, via a novel pipeline distillation strategy. Optimizing with the distilled 3D pose-text model, we manage to concretize the text-pose knowledge of CLIP into a text2pose generator effectively and efficiently. As for pose2motion, drawing inspiration from advanced language models, we pretrain a transformer-based motion model, which makes up for the lack of motion dynamics in CLIP. 
After that, by formulating the conditioning poses as prompts, the motion generator can generate motions that refer to the conditioning poses in a controllable and flexible manner.","foundation model, contrastive language-image pretraining, human motion generation, zero-shot, open-vocabulary" AQUILA: Communication Efficient Federated Learning with Adaptive Quantization of Lazily-Aggregated Gradients,https://openreview.net/forum?id=COrdS9G6TJ8,https://openreview.net/pdf?id=COrdS9G6TJ8,,"The development and deployment of federated learning (FL) have been bottlenecked by the heavy communication overheads of high-dimensional models between the distributed device nodes and the central server. To achieve better error-communication trade-offs, recent efforts have been made to either adaptively reduce the communication frequency by skipping unimportant updates, a.k.a. lazily-aggregated quantization (LAQ), or adjust the quantization bits for each communication. In this paper, we propose a unifying communication-efficient framework for FL based on adaptive quantization of lazily-aggregated gradients (AQUILA), which adaptively adjusts two mutually dependent factors, the communication frequency and the quantization level, in a synergistic way. Specifically, we start from a careful investigation of the classical LAQ scheme and formulate AQUILA as an optimization problem where the optimal quantization level per communication is selected by minimizing the model deviation caused by update skipping. Meanwhile, we create a new lazy aggregation strategy to fit the novel quantization criterion better and thus keep the communication frequency at an appropriate level. The effectiveness and convergence of the proposed AQUILA framework are theoretically verified. The experimental results demonstrate that AQUILA can reduce the overall transmitted bits by around 60% compared to existing methods while achieving the same level of model accuracy in a number of non-homogeneous FL scenarios, including Non-IID data distribution and heterogeneous model architecture. The proposed AQUILA is highly adaptive and compatible with existing FL settings.","Federated Learning, communication efficiency, adaptive quantization" Token Turing Machines,https://openreview.net/forum?id=3m_awcLrg8E,https://openreview.net/pdf?id=3m_awcLrg8E,"Token Turing Machines (TTM) is a sequential, autoregressive transformer model with memory for real-world sequential decision making, modernizing Neural Turing Machines.","We propose Token Turing Machines (TTM), a sequential, autoregressive Transformer model with memory for real-world sequential decision making. Our model is inspired by the seminal Neural Turing Machine, and has an external memory consisting of a set of tokens which summarise the previous history. This memory is efficiently addressed, read and written using a Transformer as the processing unit/controller at each step. The model's memory module ensures that a new observation will only be processed with the contents of the memory (and not the entire history), meaning that it can efficiently process long sequences with a bounded computational cost at each step. 
We show that TTM outperforms alternatives, such as other Transformer models designed for long sequences and recurrent neural networks, on two real-world sequential decision making tasks: online temporal activity localization from videos and vision-based robot action policy learning.","memory, Neural Turing Machine, robot learning, sequence" Generalizing Multimodal Variational Methods to Sets,https://openreview.net/forum?id=_uR2KmSfU8g,https://openreview.net/pdf?id=_uR2KmSfU8g,This paper presents a novel variational method on sets called the Set Multimodal VAE (SMVAE) for learning the joint-modality posterior directly while handling the missing modality problem. ,"Making sense of multiple modalities can yield a more comprehensive description of real-world phenomena. However, learning the co-representation of diverse modalities is still a long-standing endeavor in emerging machine learning applications and research. Previous generative approaches for multimodal input approximate a joint-modality posterior by uni-modality posteriors as product-of-experts (PoE) or mixture-of-experts (MoE). We argue that these approximations lead to a defective bound for the optimization process and loss of semantic connection among modalities. This paper presents a novel variational method on sets called the Set Multimodal VAE (SMVAE) for learning a multimodal latent space while handling the missing modality problem. By modeling the joint-modality posterior distribution directly, the proposed SMVAE learns to exchange information between multiple modalities and compensate for the drawbacks caused by factorization. In public datasets of various domains, the experimental results demonstrate that the proposed method is applicable to order-agnostic cross-modal generation while achieving outstanding performance compared to the state-of-the-art multimodal methods. The source code for our method is available online at https://anonymous.4open.science/r/SMVAE-9B3C/.","multimodal variational autoencoder, self attention, unsupervised learning, set representation learning" Towards a Unified View on Visual Parameter-Efficient Transfer Learning,https://openreview.net/forum?id=ti6fH3EhFkv,https://openreview.net/pdf?id=ti6fH3EhFkv,This paper investigates the positional importance of trainable parameters for adapting a large model to downstream tasks.,"Since the release of various large-scale natural language processing (NLP) pre-trained models, parameter efficient transfer learning (PETL) has become a popular paradigm capable of achieving impressive performance on various downstream tasks. PETL aims at making good use of the representation knowledge in the pre-trained large models by fine-tuning a small number of parameters. Recently, developing various PETL techniques for vision tasks has also attracted increasing attention. Popular PETL techniques such as Prompt Tuning and Adapter have been proposed for high-level visual downstream tasks such as image classification and video recognition. However, Prefix-tuning remains under-explored for vision tasks. In this work, we intend to adapt large video-based models to downstream tasks with a good parameter-accuracy trade-off. Towards this goal, we propose a framework with a unified view of PETL called visual-PETL (V-PETL) to investigate the effects of different PETL techniques, data scales of downstream domains, positions of trainable parameters, and other aspects affecting the trade-off. 
Specifically, we analyze the positional importance of trainable parameters and the differences between NLP and vision tasks in terms of data structures and pre-training mechanisms while implementing various PETL techniques, especially the under-explored prefix-tuning technique. Based on a comprehensive understanding of the differences between NLP and video data, we propose a new variation of the prefix-tuning module, called parallel attention (PATT), for video-based downstream tasks. An extensive empirical analysis on two video datasets via different frozen backbones has been carried out, and the findings show that the proposed PATT can effectively contribute to other PETL techniques. An effective scheme, Swin-BAPAT, derived from the proposed V-PETL framework achieves significantly better performance than the state-of-the-art AdaptFormer-Swin with slightly more parameters and outperforms full-tuning with far fewer parameters.","Parameter Efficient, Transfer Learning, Domain Adaptation" Everyone's Preference Changes Differently: Weighted Multi-Interest Retrieval Model,https://openreview.net/forum?id=usa87QW3_r9,https://openreview.net/pdf?id=usa87QW3_r9,A joint-modeling of unbiased multiple user interests and the interest weights.,"User embeddings (vectorized representations of a user) are essential in recommendation systems. Numerous approaches have been proposed to construct a representation for the user in order to find similar items for retrieval tasks, and they have been proven effective in industrial recommendation systems. Recently, people have discovered the power of using multiple embeddings to represent a user, with the hope that each embedding represents the user's interest in a certain topic. With multi-interest representation, it's important to model the user's preference over the different topics and how the preference changes with time. However, existing approaches either fail to estimate the user's affinity to each interest or unreasonably assume every interest of every user fades at an equal rate with time, thus hurting the performance of candidate retrieval. In this paper, we propose the Multi-Interest Preference (MIP) model, an approach that not only produces multi-interest representations for users by using the user's sequential engagement more effectively but also automatically learns a set of weights to represent the preference over each embedding so that the candidates can be retrieved from each interest proportionally. Extensive experiments have been done on various industrial-scale datasets to demonstrate the effectiveness of our approach. ","recommendation system, sequential models, temporal dynamics, user behavior modeling, multi-interest representation" Variational Autoencoders with Decremental Information Bottleneck for Disentanglement,https://openreview.net/forum?id=og1UqadquNk,https://openreview.net/pdf?id=og1UqadquNk,"We present a novel decremental variational autoencoder with disentanglement-invariant transformations, termed DeVAE, for balancing disentanglement and reconstruction fidelity.","One major challenge of disentanglement learning with variational autoencoders is the trade-off between disentanglement and reconstruction fidelity. Previous methods, which spread the conflict between disentanglement and reconstruction over time, lose the disentanglement constraint when expanding the information bottleneck, which causes the information diffusion problem. 
To tackle this issue, we present a novel decremental variational autoencoder, termed DeVAE, with disentanglement-invariant transformations that spread the conflict across multiple latent spaces, balancing disentanglement and reconstruction fidelity by gradually decreasing the information bottlenecks of the diverse latent spaces. Benefiting from the multiple latent spaces, DeVAE allows simultaneous optimization of multiple objectives, optimizing reconstruction while keeping the disentanglement constraint and avoiding information diffusion. DeVAE is also compatible with large models with high-dimensional latent spaces. Experimental results on dSprites and Shapes3D show that DeVAE achieves the best performance on both disentanglement and reconstruction.","disentanglement, variational autoencoders" Volumetric Optimal Transportation by Fast Fourier Transform,https://openreview.net/forum?id=EVrz7UM-ZDm,https://openreview.net/pdf?id=EVrz7UM-ZDm,"Optimal transport, Monge-Amp\`ere equation, Elliptic PDE, Fast Fourier transform","The optimal transportation map finds the most economical way to transport one probability measure to another, and it has been applied in a broad range of applications in machine learning and computer vision. By the Brenier theory, computing the optimal transport map is equivalent to solving a Monge-Amp\`ere equation, which is highly non-linear. Therefore, the computation of optimal transportation maps is intrinsically challenging. In this work, we propose a novel and powerful method, FFT-OT (fast Fourier transform-optimal transport), to solve 3-dimensional OT problems. The method is based on several key ideas: first, the Monge-Amp\`ere equation is linearized to a sequence of linear elliptic PDEs with spatially and temporally varying coefficients; second, the obliqueness property of optimal transportation maps is reformulated as a Neumann boundary condition; and third, the variable-coefficient elliptic PDEs are approximated by constant-coefficient elliptic PDEs and solved by FFT on GPUs. We also prove that the algorithm converges linearly, namely the approximation error decreases exponentially fast. Experimental results show that the FFT-OT algorithm is more than a hundred times faster than conventional methods based on convex geometry. Furthermore, the method can be directly applied for sampling from complex 3D density functions in machine learning and for magnifying volumetric data in medical imaging. ","Optimal transport, Monge-Ampere equation, Elliptic PDE, Fast Fourier transform" GFlowNets and variational inference,https://openreview.net/forum?id=uKiE0VIluA-,https://openreview.net/pdf?id=uKiE0VIluA-,We theoretically and empirically compare and contrast GFlowNets with hierarchical variational inference.,"This paper builds bridges between two families of probabilistic algorithms: (hierarchical) variational inference (VI), which is typically used to model distributions over continuous spaces, and generative flow networks (GFlowNets), which have been used for distributions over discrete structures such as graphs. We demonstrate that, in certain cases, VI algorithms are equivalent to special cases of GFlowNets in the sense of equality of expected gradients of their learning objectives. We then point out the differences between the two families and show how these differences emerge experimentally. 
Notably, GFlowNets, which borrow ideas from reinforcement learning, are more amenable than VI to off-policy training without the cost of high gradient variance induced by importance sampling. We argue that this property of GFlowNets can provide advantages for capturing diversity in multimodal target distributions.","variational inference, GFlowNets, probabilistic modeling, weighted importance sampling" Neural Networks and the Chomsky Hierarchy,https://openreview.net/forum?id=WbxHAzkeQcn,https://openreview.net/pdf?id=WbxHAzkeQcn,"Large-scale empirical study to determine the computational complexity class of a number of neural network architectures, which allows forecasting limitations on generalization capabilities.","Reliable generalization lies at the heart of safe ML and AI. However, understanding when and how neural networks generalize remains one of the most important unsolved problems in the field. In this work, we conduct an extensive empirical study (10250 models, 15 tasks) to investigate whether insights from the theory of computation can predict the limits of neural network generalization in practice. We demonstrate that grouping tasks according to the Chomsky hierarchy allows us to forecast whether certain architectures will be able to generalize to out-of-distribution inputs. This includes negative results where even extensive amounts of data and training time never lead to any non-trivial generalization, despite models having sufficient capacity to fit the training data perfectly. Our results show that, for our subset of tasks, RNNs and Transformers fail to generalize on non-regular tasks, LSTMs can solve regular and counter-language tasks, and only networks augmented with structured memory (such as a stack or memory tape) can successfully generalize on context-free and context-sensitive tasks.","length generalization, memory-augmented neural networks, recurrent neural networks" DeepSAT: An EDA-Driven Learning Framework for SAT,https://openreview.net/forum?id=ep_8uwxouZO,https://openreview.net/pdf?id=ep_8uwxouZO,,"We present DeepSAT, a novel end-to-end learning framework for the Boolean satisfiability (SAT) problem. Unlike existing solutions trained on random SAT instances with relatively weak supervision, we propose applying the knowledge of the well-developed electronic design automation (EDA) field for SAT solving. Specifically, we first resort to logic synthesis algorithms to pre-process SAT instances into optimized and-inverter graphs (AIGs). By doing so, the distribution diversity among various SAT instances can be dramatically reduced, which facilitates improving the generalization capability of the learned model. Next, we regard the distribution of SAT solutions as a product of conditional Bernoulli distributions. Based on this observation, we approximate the SAT solving procedure with a conditional generative model, leveraging a novel directed acyclic graph neural network (DAGNN) with two polarity prototypes for conditional SAT modeling. To effectively train the generative model, with the help of logic simulation tools, we obtain the probabilities of nodes in the AIG being logic ‘1’ as rich supervision. We conduct comprehensive experiments on various SAT problems. Our results show that DeepSAT achieves significant accuracy improvements over state-of-the-art learning-based SAT solutions, especially when generalized to SAT instances that are relatively large or have diverse distributions. 
", Neural ePDOs: Spatially Adaptive Equivariant Partial Differential Operator Based Networks,https://openreview.net/forum?id=D1Iqfm7WTkk,https://openreview.net/pdf?id=D1Iqfm7WTkk,We propose a novel spatial adaptive equivariant PDOs-based network which achieves superior performance than previous works. ,"Endowing deep learning models with symmetry priors can lead to a considerable performance improvement. As an interesting bridge between physics and deep learning, the equivariant partial differential operators (PDOs) have drawn much researchers' attention recently. However, to ensure the PDOs translation equivariance, previous works have to require coefficient matrices to be constant and spatially shared for their linearity, which could lead to the sub-optimal feature learning at each position. In this work, we propose a novel nonlinear PDOs scheme that is both spatially adaptive and translation equivariant. The coefficient matrices are obtained by local features through a generator rather than spatially shared. Besides, we establish a new theory on incorporating more equivariance like rotations for such PDOs. Based on our theoretical results, we efficiently implement the generator with an equivariant multilayer perceptron (EMLP). As such equivariant PDOs are generated by neural networks, we call them Neural ePDOs. In experiments, we show that our method can significantly improve previous works with smaller model size in various datasets. Especially, we achieve the state-of-the-art performance on the MNIST-rot dataset with only half parameters of the previous best model.","Equivariance, Partial differential operators" An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion,https://openreview.net/forum?id=NAQvF08TcyG,https://openreview.net/pdf?id=NAQvF08TcyG,"We present the task of personalized text-to-image generation, and introduce an inversion-based method that allows us to synthesize novel scenes of user-provided visual concepts, guided by natural language instructions.","Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In other words, we ask: how can we use language-guided models to turn *our* cat into a painting, or imagine a new product based on *our* favorite toy? Here we present a simple approach that allows such creative freedom. Using only $3$-$5$ images of a user-provided concept, like an object or a style, we learn to represent it through new ``words"" in the embedding space of a frozen text-to-image model. These ``words"" can be composed into natural language sentences, guiding *personalized* creation in an intuitive way. Notably, we find evidence that a *single* word embedding is sufficient for capturing unique and varied concepts. We compare our approach to a wide range of baselines, and demonstrate that it can more faithfully portray the concepts across a range of applications and tasks. 
Our code, data and new words will be available.","Personalized generation, text-to-image, inversion" Cutting Long Gradient Flows: Decoupling End-to-End Backpropagation Based on Supervised Contrastive Learning,https://openreview.net/forum?id=6BO4lP8K1N1,https://openreview.net/pdf?id=6BO4lP8K1N1,We cut long gradient flows into multiple shorter ones and maintain comparable test accuracy.,"End-to-end backpropagation (BP) is the foundation of current deep learning technology. Unfortunately, as a network becomes deeper, BP becomes inefficient for various reasons. This paper proposes a new methodology for decoupling BP to transform a long gradient flow into multiple short ones in order to address the optimization issues caused by long gradient flows. We report thorough experiments conducted to illustrate the effectiveness of our model compared with BP and associated learning (AL), a state-of-the-art methodology for backpropagation decoupling. We will release the source code for the experiments after acceptance. ", Hierarchical Relational Learning for Few-Shot Knowledge Graph Completion,https://openreview.net/forum?id=zlwBI2gQL3K,https://openreview.net/pdf?id=zlwBI2gQL3K,,"Knowledge graphs (KGs) are powerful in terms of their inference abilities, but are also notorious for their incompleteness and long-tail distribution of relations. To address these challenges and expand the coverage of KGs, few-shot KG completion aims to make predictions for triplets involving novel relations when only a few training triplets are provided as reference. Previous methods have focused on designing local neighbor aggregators to learn entity-level information and/or imposing a sequential dependency assumption at the triplet level to learn meta relation information. However, pairwise triplet-level interactions and context-level relational information have been largely overlooked for learning meta representations of few-shot relations. In this paper, we propose a hierarchical relational learning method (HiRe) for few-shot KG completion. By jointly capturing three levels of relational information (entity-level, triplet-level and context-level), HiRe can effectively learn and refine the meta representation of few-shot relations, and consequently generalize well to new unseen relations. Extensive experiments on two benchmark datasets validate the superiority of HiRe over state-of-the-art methods. The code of HiRe can be found in the supplementary material and will be released after acceptance.","few-shot learning, knowledge graph completion" Learn to Know Unknowns: A Bionic Memory Network for Unsupervised Anomaly Detection,https://openreview.net/forum?id=SNzzt94tGzP,https://openreview.net/pdf?id=SNzzt94tGzP,"We proposed a biomimetic neural network for unsupervised anomaly detection inspired by the hippocampus-cortex cascade, enabling the model to know the unknowns.","Is generalization always beneficial? Over-strong generalization renders the model insensitive to anomalies. Unsupervised anomaly detection requires only unlabeled non-anomalous data to learn and generalize normal patterns, which results in a modest reconstruction error when reconstructing normal instances and a significant reconstruction error when reconstructing anomalies. However, over-strong generalization leads to indistinguishable reconstruction errors for normal instances and anomalies, which means that the model reconstructs the unknown anomalies well, resulting in unnoticeable reconstruction errors. 
Inspired by the cascade structure of the hippocampus and cortex in human brain memory, we propose a re-representation memory network called Random Forgetting Twin Memory (RFTM) to decompose the latent space and introduce a configurable reintegration mechanism to suppress overgeneralization. RFTM shows striking brain-like memory characteristics, which enables the model to know what it does not know. RFTM offers the convenience of a single-line-of-code boost at the model level, without adding any extra loss terms at the loss function level. RFTM-based models have achieved state-of-the-art experimental results on different public benchmarks.","Unsupervised learning, Anomaly detection, Memory bank" Function-Consistent Feature Distillation,https://openreview.net/forum?id=pgHNOcxEdRI,https://openreview.net/pdf?id=pgHNOcxEdRI,,"As a commonly used technique in model compression of deep neural networks, feature distillation makes the student model mimic the intermediate features of the teacher model, in hopes that the underlying knowledge in the features could provide extra guidance to the student. Nearly all existing feature-distillation methods use L2 distance or its slight variants as the distance metric between teacher and student features. However, while L2 distance is isotropic w.r.t. all dimensions, the neural network’s operation on different dimensions is usually anisotropic, i.e., perturbations with the same 2-norm but in different dimensions of intermediate features lead to changes in the final output with largely different magnitudes. Considering this, we argue that the similarity between teacher and student features should \textit{not} be measured merely based on their appearance (i.e., L2 distance), but should, more importantly, be measured by their difference in function, namely how the latter parts of the network will read, decode, and process them. Therefore, we propose Function-Consistent Feature Distillation (FCFD), which explicitly optimizes the functional similarity between teacher and student features. The core idea of FCFD is to make teacher and student features not only numerically similar, but more importantly produce similar outputs when fed to the latter part of the same network. With FCFD, the student mimics the teacher more faithfully and learns more from the teacher. Extensive experiments on image classification and object detection demonstrate the superiority of FCFD over existing methods. Furthermore, we can combine FCFD with many existing methods to obtain even higher accuracy. Codes will be publicly released soon.","knowledge distillation, feature distillation, function consistency" Multi-User Reinforcement Learning with Low Rank Rewards,https://openreview.net/forum?id=OiLPUTbiic5Y,https://openreview.net/pdf?id=OiLPUTbiic5Y,A statistically efficient method for learning policies collaboratively across multiple users with the same state-space transitions but a low-rank reward matrix. ," In this work, we consider the problem of collaborative multi-user reinforcement learning. In this setting, there are multiple users with the same state-action space and transition probabilities but with different rewards. Under the assumption that the reward matrix of the $N$ users has a low-rank structure -- a standard and practically successful assumption in the offline collaborative filtering setting -- the question is whether we can design algorithms with significantly lower sample complexity compared to the ones that learn the MDP individually for each user. 
Our main contribution is an algorithm which explores rewards collaboratively with $N$ user-specific MDPs and can learn rewards efficiently in two key settings: tabular MDPs and linear MDPs. When $N$ is large and the rank is constant, the sample complexity per MDP depends logarithmically on the size of the state space, which represents an exponential reduction (in the state-space size) when compared to standard ``non-collaborative'' algorithms. ","Low Rank Matrix Estimation, Collaborative Reinforcement Learning" Domain Specific Denoising Diffusion Probabilistic Models for Brain Dynamics,https://openreview.net/forum?id=Qsbh0IgVG_8,https://openreview.net/pdf?id=Qsbh0IgVG_8,,"The distribution differences in brain dynamics across human subjects, a kind of human-subject noise referred to as human artifacts, have severely limited the generalization ability of brain dynamics recognition. Previous human artifact removal methods normally utilize traditional spectrum filtering or blind source separation techniques, based on simple assumptions of prior distributions, which limits the capacity to learn the domain variance of each subject. We propose a new approach to model the removal of human artifacts as a generative denoising process, which can generate and learn subject-specific domain variance and invariant brain signals simultaneously. We propose the Domain Specific Denoising Diffusion Probabilistic Model (DS-DDPM) to decompose the denoising process into subject domain variance and invariant content at each step. Subtle constraints and probabilistic design are proposed to formulate domain variance and invariant content into orthogonal spaces and to further supervise the domain variance with a subject classifier. This method is the first work to explicitly separate human subject-specific variance through generative denoising processes, outperforming previous methods in two aspects: 1) DS-DDPM can learn more accurate subject-specific domain variance through domain generative learning rather than previous filtering methods; 2) DS-DDPM is the first work that can explicitly generate subject noise distributions. Comprehensive experimental results suggest that DS-DDPM can help alleviate domain distribution bias for cross-domain brain dynamics signal recognition.","Denoising Diffusion Probabilistic Models, EEG Signal, Domain Variance Generation, Subject Difference, Deep Learning" The Devil is in the Wrongly-classified Samples: Towards Unified Open-set Recognition,https://openreview.net/forum?id=xLr0I_xYGAs,https://openreview.net/pdf?id=xLr0I_xYGAs,,"Open-set Recognition (OSR) aims to identify test samples whose classes are not seen during the training process. Recently, Unified Open-set Recognition (UOSR) has been proposed to reject not only unknown samples but also known but wrongly classified samples, which tends to be more practical in real-world applications. In this paper, we deeply analyze the UOSR task under different training and evaluation settings to shed light on this promising research direction. For this purpose, we first evaluate the UOSR performance of several OSR methods and show a significant finding that the uncertainty distribution of almost all these methods is actually closer to the expectation of UOSR than OSR. We show that the reason lies in the known but wrongly classified samples, as their uncertainty distribution is extremely close to that of unknown samples rather than that of known and correctly classified samples. 
Second, we analyze how the two training settings of OSR (i.e., pre-training and outlier exposure) influence UOSR. We find that although both are beneficial for distinguishing known and correctly classified samples from unknown samples, pre-training is also helpful for identifying known but wrongly classified samples while outlier exposure is not. In addition to different training settings, we also formulate a new evaluation setting for UOSR, called few-shot UOSR, where only one or five samples per unknown class are available during evaluation to help identify unknown samples. We propose FS-KNNS for few-shot UOSR to achieve state-of-the-art performance under all settings.", Approximated Anomalous Diffusion: Gaussian Mixture Score-based Generative Models,https://openreview.net/forum?id=yc9xen7EAzd,https://openreview.net/pdf?id=yc9xen7EAzd,,"Score-based generative models (SGMs) can generate high-quality samples via Langevin dynamics with a drift term and a diffusion term (Gaussian noise) iteratively calculated and added to a sample until convergence. In biological systems, it is observed that neural populations can conduct heavy-tailed L\'{e}vy dynamics for sampling-based probabilistic representation through neural fluctuations. Critically, unlike the existing sampling process of SGMs, L\'{e}vy dynamics can produce both large jumps and small roaming to explore the sampling space, resulting in better samples than Langevin dynamics, which lacks large jumps. Motivated by this contrast, we explore a new class of SGMs with sampling based on L\'{e}vy dynamics. However, exact numerical simulation of L\'{e}vy dynamics is significantly more challenging and intractable. We hence propose an approximate solution by leveraging Gaussian mixture noises during training to achieve the desired large-jump and small-roaming properties. Theoretically, GM-SGMs conduct a probabilistic graphical model used by empirical Bayes for sampling, expanding the maximum a posteriori (MAP) estimation applied by conventional SGMs. Extensive experiments on challenging image generation tasks show that our GM-SGMs exhibit superior sampling quality over prior state-of-the-art SGMs across various sampling iterations.", MCAL: Minimum Cost Human-Machine Active Labeling,https://openreview.net/forum?id=1FxRPKrH8bw,https://openreview.net/pdf?id=1FxRPKrH8bw,A framework to address the prohibitive data labeling cost challenge using hybrid human-machine labeling.,"Today, groundtruth generation relies on datasets annotated by cloud-based annotation services. These rely on human annotation, which can be prohibitively expensive. In this paper, we consider the problem of hybrid human-machine labeling, which trains a classifier to accurately auto-label part of the data set. However, training the classifier can be expensive too. We propose an iterative approach that minimizes total overall cost by, at each step, jointly determining which samples to label using humans and which to label using the trained classifier. We validate our approach on well-known public data sets such as Fashion-MNIST, CIFAR-10, CIFAR-100, and ImageNet. 
In some cases, our approach has 6× lower overall cost relative to having humans label the entire dataset, and is always cheaper than the cheapest competing strategy.","Active Labeling, Groundtruth Annotation, Dataset Labeling" BoxTeacher: Exploring High-Quality Pseudo Labels for Weakly Supervised Instance Segmentation,https://openreview.net/forum?id=YjFvx8VBl1,https://openreview.net/pdf?id=YjFvx8VBl1,This paper presents an end-to-end training framework named BoxTeacher to boost the performance of weakly supervised instance segmentation with high-quality pseudo labels. ,"Labeling objects with pixel-wise segmentation requires a huge amount of human labor compared to bounding boxes. Most existing methods for weakly supervised instance segmentation focus on designing heuristic losses with priors from bounding boxes. However, we find that box-supervised methods can produce some fine segmentation masks, and we wonder whether the detectors could learn from these fine masks while ignoring low-quality masks. To answer this question, we present BoxTeacher, an efficient and end-to-end training framework for high-performance weakly supervised instance segmentation, which leverages a sophisticated teacher to generate high-quality masks as pseudo labels. Considering that massive noisy masks hurt the training, we present a mask-aware confidence score to estimate the quality of pseudo masks and propose the noise-aware pixel loss and noise-reduced affinity loss to adaptively optimize the student with pseudo masks. Extensive experiments demonstrate the effectiveness of the proposed BoxTeacher. Without bells and whistles, BoxTeacher remarkably achieves $34.4$ mask AP and $35.4$ mask AP with ResNet-50 and ResNet-101, respectively, on the challenging MS-COCO dataset, which outperforms the previous state-of-the-art methods by a significant margin. The code and models will be available later.","Weakly supervised instance segmentation, instance segmentation, object detection" A Simple and Provable Method to Adapt Pre-trained Model across Domains with Few Samples,https://openreview.net/forum?id=eW2zCT1gm3,https://openreview.net/pdf?id=eW2zCT1gm3,This paper proposes a Simple and Provable method to quickly adapt a given pre-trained model across domains with few samples.,"Adapting the pre-trained model across domains with few samples, known as cross-domain few-shot learning, is a challenging task in statistical machine learning. Most previous efforts focused on training robust and transferable feature representations but rarely explored how to train an accurate few-shot model from a given pre-trained model. In this paper, we are interested in the performance of training a cross-domain few-shot classifier with representations from different layers of a pre-trained model and the impact of reducing the dimensionality of these representations. Based on this, we propose a simple and provable method, Average Pooling Ensemble Few-shot Learning (APEF). We demonstrate the effectiveness of average pooling and ensembling in cross-domain few-shot image classification both theoretically and experimentally. 
In particular, we provide a theoretical analysis in the PAC-Bayesian framework to illustrate why our method works, and we also empirically evaluate our approach on the challenging CD-FSL benchmark, which shows that our proposed method consistently outperforms all baselines.","Cross-domain few-shot learning, PAC-Bayesian framework, dimensionality reduction, ensemble" CD-Depth: Unsupervised Domain Adaptation for Depth Estimation via Cross Domain Integration,https://openreview.net/forum?id=VfAUPNStOS_,https://openreview.net/pdf?id=VfAUPNStOS_,,"Despite the efficiency of data collection for depth estimation in synthetic environments, we cannot take full advantage of this benefit due to the distribution gap between the synthetic and the real world. In this paper, we introduce a new unsupervised domain adaptation framework, CD-Depth, for depth estimation to alleviate domain shift by extracting structure-consistent and domain-agnostic latents using the following methods. (1) We propose domain-agnostic latent mapping, which projects images from different domains into a shared latent space by removing redundant domain features for estimating monocular depth. (2) We also fuse visual signals from both RGB and latent domains to fully exploit multi-domain information with adaptive-window-based cross-attention. Our proposed framework achieves state-of-the-art results in unsupervised domain adaptation for depth estimation on both indoor and outdoor datasets and produces better generalization performance on an unseen dataset. ","monocular depth estimation, unsupervised domain adaptation" SegNeRF: 3D Part Segmentation with Neural Radiance Fields,https://openreview.net/forum?id=D9WJEsALpI1,https://openreview.net/pdf?id=D9WJEsALpI1,We perform 3D part segmentation on novel objects using only images by leveraging volume rendering.,"Recent advances in Neural Radiance Fields (NeRF) boast impressive performances for generative tasks such as novel view synthesis and 3D reconstruction. Methods based on neural radiance fields are able to represent the 3D world implicitly by relying exclusively on posed images. Yet, they have seldom been explored in the realm of discriminative tasks such as 3D part segmentation. In this work, we attempt to bridge that gap by proposing SegNeRF: a neural field representation that integrates a semantic field along with the usual radiance field. SegNeRF inherits from previous works the ability to perform novel view synthesis and 3D reconstruction, and enables 3D part segmentation from a few images. Our extensive experiments on PartNet show that SegNeRF is capable of simultaneously predicting geometry, appearance, and semantic information from posed images, even for unseen objects. The predicted semantic fields allow SegNeRF to achieve an average mIoU of 30.30% for 2D novel view segmentation, and 37.46% for 3D part segmentation, boasting competitive performance against point-based methods by using only a few posed images. 
Additionally, SegNeRF is able to generate an explicit 3D model from a single image of an object taken in the wild, with its corresponding part segmentation.", EyeDAS: Securing Perception of Autonomous Cars Against the Stereoblindness Syndrome,https://openreview.net/forum?id=qaJj2vTwrG5,https://openreview.net/pdf?id=qaJj2vTwrG5,,"The ability to detect whether an object is a 2D or 3D object is extremely important in autonomous driving, since a detection error can have life-threatening consequences, endangering the safety of the driver, passengers, pedestrians, and others on the road. Methods proposed to distinguish between 2D and 3D objects (e.g., liveness detection methods) are not suitable for autonomous driving, because they are object-dependent or do not consider the constraints associated with autonomous driving (e.g., the need for real-time decision-making while the vehicle is moving). In this paper, we present EyeDAS, a novel few-shot learning-based method aimed at securing an object detector (OD) against the threat posed by the stereoblindness syndrome (i.e., the inability to distinguish between 2D and 3D objects). We evaluate EyeDAS's real-time performance using 2,000 objects extracted from seven YouTube video recordings of street views taken by a dash cam from the driver's seat perspective. When applying EyeDAS to seven state-of-the-art ODs as a countermeasure, EyeDAS was able to reduce the 2D misclassification rate from 71.42-100% to 2.4% with a 3D misclassification rate of 0% (TPR of 1.0). We also show that EyeDAS outperforms the baseline method and achieves an AUC of over 0.999.", Learnable Topological Features For Phylogenetic Inference via Graph Neural Networks,https://openreview.net/forum?id=hVVUY7p64WL,https://openreview.net/pdf?id=hVVUY7p64WL,Novel phylogenetic inference methods based on learnable topological features via graph neural networks,"Structural information of phylogenetic tree topologies plays an important role in phylogenetic inference. However, finding appropriate topological structures for specific phylogenetic inference tasks often requires significant design effort and domain expertise. In this paper, we propose a novel structural representation method for phylogenetic inference based on learnable topological features. By combining the raw node features that minimize the Dirichlet energy with modern graph representation learning techniques, our learnable topological features can provide efficient structural information of phylogenetic trees that automatically adapts to different downstream tasks without requiring domain expertise. We demonstrate the effectiveness and efficiency of our method on a simulated data tree probability estimation task and a benchmark of challenging real data variational Bayesian phylogenetic inference problems.","phylogenetic inference, learnable topological features, graph neural network, density estimation, variational inference" HRDFuse: Monocular 360$^\circ$ Depth Estimation by Collaboratively Learning Holistic-with-Regional Depth Distributions,https://openreview.net/forum?id=A85gMNB01t1,https://openreview.net/pdf?id=A85gMNB01t1,"This paper proposes a novel solution for monocular 360$^\circ$ depth estimation, which predicts an ERP format depth map by collaboratively learning the holistic-with-regional information from the ERP image and its TP patches.","Depth estimation from a monocular 360$^\circ$ image is a burgeoning problem as a 360$^\circ$ image provides holistic sensing of a scene with a wide field of view. 
Recently, some methods, \eg, OmniFusion, have applied the tangent projection (TP) to represent a 360$^\circ$ image and predicted depth values via patch-wise regressions, which are merged to get a depth map in equirectangular projection (ERP) format. However, these methods suffer from 1) the non-trivial process of merging a large number of patches; 2) less smooth and accurate depth results caused by ignoring the holistic contextual information contained only in the ERP image and directly regressing the depth value of each pixel. In this paper, we propose a novel framework, HRDFuse, that subtly combines the potential of convolutional neural networks (CNNs) and transformers by collaboratively learning the holistic contextual information from the ERP and the regional structural information from the TP. Firstly, we propose a spatial feature alignment (SFA) module that learns feature similarities between the TP and ERP to aggregate the TP features into a complete ERP feature map in a pixel-wise manner. Secondly, we propose a collaborative depth distribution classification (CDDC) module that learns the holistic-with-regional histograms capturing the ERP and TP depth distributions. As such, the final depth values can be predicted as a linear combination of histogram bin centers. Lastly, we adaptively combine the depth predictions from the two projections to obtain the final depth map. Extensive experiments on three benchmark datasets show that our method achieves smoother and more accurate depth results while favorably surpassing the SOTA methods by a significant margin.","3D Computer Vision, Scene Analysis and Understanding, Depth distribution classification, Feature representation learning" SpQAT: A Sparse Quantization-Aware Training Method,https://openreview.net/forum?id=QfLU7FtXDUn,https://openreview.net/pdf?id=QfLU7FtXDUn,"We develop an efficient sparse QAT method, dubbed SpQAT, based on the partly scratch-off lottery ticket phenomenon we observed.","Quantization-aware training (QAT) has been demonstrated to not only reduce computational cost and storage footprint, but also well retain the performance of full-precision neural networks. However, the tedious retraining requirement greatly weakens the practical value of QAT methods. In this paper, we attempt to reduce the training costs of QAT methods, which, to the best of our knowledge, have barely been investigated in the literature. Our motivation rests on a straightforward-yet-valuable observation: a large portion of quantized weights, referred to as the partly scratch-off lottery ticket, reach the optimal quantization level after a few training epochs. This naturally inspires us to reduce computation by freezing these weights in the remaining training period. Accordingly, we develop an efficient sparse QAT method, dubbed SpQAT. It freezes a weight once the distance between the full-precision one and its quantization level is smaller than a controllable threshold. Along these lines, we show that the proposed SpQAT accurately identifies the partly scratch-off lottery ticket and results in a sparse weight gradient where many weights are pulled out of the training and their related computations are avoided. Extensive experiments demonstrate the efficacy of our SpQAT with 20%-60% weight gradient sparsity. 
With the elimination of the related gradient calculations in backward propagation, the performance of our SpQAT is still on par with or even better than the compared baseline.","efficient training, quantization-aware training, network quantization" Double dynamic sparse training for GANs,https://openreview.net/forum?id=wmMUAg_l4Qk,https://openreview.net/pdf?id=wmMUAg_l4Qk,We propose a quantity named balance ratio to investigate and improve dynamic sparse training for GANs.,"The past decade has witnessed a drastic increase in the size of modern deep neural networks (DNNs), especially for generative adversarial networks (GANs). Since GANs usually suffer from high computational complexity, researchers have shown an increased interest in applying pruning methods to reduce the training and inference costs of GANs. Among different pruning methods invented for supervised learning, dynamic sparse training (DST) has gained increasing attention recently as it enjoys excellent training efficiency with comparable performance to post-hoc pruning. Hence, applying DST on GANs, where we train a sparse GAN with a fixed parameter count throughout training, seems to be a good candidate for reducing GAN training costs. However, a few challenges, including degraded training stability, emerge due to the adversarial nature of GANs. To address this, we introduce a quantity called balance ratio (BR) to quantify the balance of the generator and the discriminator. We conduct a series of experiments to show the importance of BR in understanding sparse GAN training. Building upon single dynamic sparse training (SDST), where only the generator is adjusted during training, we propose double dynamic sparse training (DDST) to control the BR during GAN training. Empirically, DDST automatically determines the density of the discriminator and greatly boosts the performance of sparse GANs on multiple datasets.","empirical deep learning, neural network pruning, dynamic sparse training" Bayesian Robust Graph Contrastive Learning,https://openreview.net/forum?id=Xecc-oeRzMr,https://openreview.net/pdf?id=Xecc-oeRzMr,,"Graph Neural Networks (GNNs) have been widely used to learn node representations, with outstanding performance on various tasks such as node classification. However, noise, which inevitably exists in real-world graph data, would considerably degrade the performance of GNNs, as revealed by recent studies. In this work, we propose a novel and robust method, Bayesian Robust Graph Contrastive Learning (BRGCL), which trains a GNN encoder to learn robust node representations. The BRGCL encoder is a completely unsupervised encoder. Two steps are iteratively executed at each epoch of training the BRGCL encoder: (1) estimating confident nodes and computing robust cluster prototypes of node representations through a novel Bayesian nonparametric method; (2) prototypical contrastive learning between the node representations and the robust cluster prototypes. Experiments on public benchmarks demonstrate the superior performance of BRGCL and the robustness of the learned node representations.","Graph Neural Networks, Contrastive Learning, Bayesian Nonparametric Learning, Noise" Hardware-restriction-aware training (HRAT) for memristor neural networks,https://openreview.net/forum?id=aPQRSQCDF2-,https://openreview.net/pdf?id=aPQRSQCDF2-,,"The memristor neural network (MNN), which utilizes memristor crossbars for vector-matrix multiplication, has huge advantages in terms of scalability and energy efficiency for neuromorphic computing. 
MNN weights are usually trained offline and then deployed as memristor conductances through a sequence of programming voltage pulses. Although weight uncertainties caused by process variation have been addressed in variation-aware training algorithms, efficient design and training of MNNs have not been systematically explored to date. In this work, we propose Hardware-Restriction-Aware Training (HRAT), which takes into account various non-negligible limitations and non-idealities of memristor devices, circuits, and systems. HRAT considers the MNN's realistic behavior and circuit restrictions during offline training, thereby bridging the gap between offline training and hardware deployment. HRAT uses a new batch normalization (BN) fusing strategy to align the distortion caused by hardware restrictions between offline training and hardware inference. This not only improves inference accuracy but also eliminates the need for dedicated circuitry for BN operations. Furthermore, most normal-scale signals are limited in amplitude due to the non-destructive threshold voltage restriction of memristors. To avoid input signal distortion in memristor crossbars, HRAT dynamically adjusts the input signal magnitude during training using a learned scale factor. These scale factors can be incorporated into the parameters of the linear operation together with the fused BN, so no additional signal scaling circuits are required. To evaluate the proposed HRAT methodology, FC-4 and LeNet-5 on MNIST are first trained with HRAT and then deployed in hardware. Hardware simulations match well with the offline HRAT results. We also carried out various experiments using VGG-16 on the CIFAR datasets. The study shows that HRAT leads to high-performance MNNs without device calibration or on-chip training, thus greatly facilitating commercial MNN deployment.","Neuromorphic computing, Memristor, Neural network training, Hardware restrictions" FreeSeg: Free Mask from Interpretable Contrastive Language-Image Pretraining for Semantic Segmentation,https://openreview.net/forum?id=kTBTu1XxvFC,https://openreview.net/pdf?id=kTBTu1XxvFC,"We use natural language as supervision for open world segmentation, via masks freely available from the raw feature maps of a pretrained model, which is straightforward and effective.","Fully supervised semantic segmentation learns from dense masks, which incurs heavy annotation costs for a closed set. In this paper, we use natural language as supervision, without any pixel-level annotation, for open world segmentation. We call the proposed framework FreeSeg, where the mask is freely available from the raw feature map of the pretraining model. Compared with zero-shot or open-set segmentation, FreeSeg doesn't require any annotated masks, and it predicts a wide range of categories, going beyond class-agnostic unsupervised segmentation. Specifically, FreeSeg obtains free masks from the Image-Text Similarity Map (ITSM) of Interpretable Contrastive Language-Image Pretraining (ICLIP). Our core improvements are the smoothed min pooling for dense ICLIP, together with the partial label and pixel strategies for segmentation. Furthermore, FreeSeg is straightforward, without complex designs like grouping, clustering, or retrieval. Besides its simplicity, FreeSeg surpasses the previous state-of-the-art by large margins, e.g.
13.4% higher mIoU on the VOC dataset in the same settings.","Semantic Segmentation, Open-vocabulary, Zero-shot, Contrastive Language-Image Pretraining, Interpretability" DifFace: Blind Face Restoration with Diffused Error Contraction,https://openreview.net/forum?id=Mof47lISH6N,https://openreview.net/pdf?id=Mof47lISH6N,We propose a new blind face restoration method that consists of an error compressor and a Markov chain partially borrowed from a pre-trained diffusion model. ,"While deep learning-based methods for blind face restoration have achieved unprecedented success, they still suffer from two major limitations. First, most of them deteriorate when facing complex degradations outside their training data. Second, these methods require multiple constraints, e.g., fidelity, perceptual, and adversarial losses, which require laborious hyper-parameter tuning to stabilize and balance their influences. In this work, we propose a novel method named DifFace that is able to cope with unseen and complex degradations more gracefully without complicated loss designs. The key idea of our method is to establish a posterior distribution from the observed low-quality (LQ) image to its high-quality (HQ) counterpart. In particular, we design a transition distribution from the LQ image to the intermediate state of a pre-trained diffusion model, and then gradually transition from this intermediate state to the HQ target by recursively applying the pre-trained diffusion model. The transition distribution only relies on a restoration backbone that is trained with an L2 loss on some synthetic data, which favorably avoids the cumbersome training process of existing methods. Moreover, the transition distribution is capable of contracting the error of the restoration backbone, thus making our method more robust to unknown degradations. Comprehensive experiments show that DifFace is superior to current state-of-the-art methods, especially in cases with severe degradations. Code and model will be released.","Face Restoration, Diffusion Model, Super-resolution" ViTKD: Practical Guidelines for ViT Feature Knowledge Distillation,https://openreview.net/forum?id=RuLGBgoonoM,https://openreview.net/pdf?id=RuLGBgoonoM,A feature-based knowledge distillation method for ViT,"Knowledge Distillation (KD) for Convolutional Neural Networks (CNNs) has been extensively studied as a way to boost the performance of a small model. Recently, Vision Transformer (ViT) has achieved great success on many computer vision tasks, and KD for ViT is also desired. However, besides the output logit-based KD, other feature-based KD methods for CNNs cannot be directly applied to ViT due to the huge structural gap. In this paper, we explore feature-based distillation for ViT. Based on the nature of feature maps in ViT, we design a series of controlled experiments and derive three practical guidelines for ViT's feature distillation. Some of our findings are even opposite to the practices of the CNN era. Based on the three guidelines, we propose our feature-based method ViTKD, which brings consistent and considerable improvements to the student. On ImageNet-1k, we boost DeiT-Tiny from 74.42% to 76.06%, DeiT-Small from 80.55% to 81.95%, and DeiT-Base from 81.76% to 83.46%. Moreover, ViTKD and logit-based KD methods are complementary and can be applied together directly. This combination can further improve the performance of the student. 
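Since the ViTKD summary above notes that feature-based and logit-based KD are complementary, a generic sketch of the combination follows; the specific ViTKD losses differ, so treat this as a minimal stand-in with illustrative names (e.g., `proj` is an assumed student-to-teacher feature projection).

```python
# Generic combined feature + logit distillation loss (not the exact ViTKD loss).
import torch.nn.functional as F

def kd_loss(s_feat, t_feat, s_logits, t_logits, proj, T=4.0, alpha=1.0):
    feat_term = F.mse_loss(proj(s_feat), t_feat)            # feature imitation
    logit_term = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                          F.softmax(t_logits / T, dim=-1),
                          reduction="batchmean") * T * T    # soft-label matching
    return feat_term + alpha * logit_term
```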
Specifically, the students DeiT-Tiny, DeiT-Small, and DeiT-Base achieve 77.78%, 83.59%, and 85.41%, respectively.","Knowledge Distillation, Vision Transformer, Image Classification" Fairness-aware Contrastive Learning with Partially Annotated Sensitive Attributes,https://openreview.net/forum?id=woa783QMul,https://openreview.net/pdf?id=woa783QMul,Proposing a new problem of fair unsupervised representation learning with limited annotated sensitive attributes and a fairness-aware contrastive learning framework.,"Learning high-quality representations is important and essential for visual recognition. Unfortunately, traditional representation learning suffers from fairness issues since the model may learn information about sensitive attributes. Recently, a series of studies have been proposed to improve fairness by explicitly decorrelating target labels and sensitive attributes. Most of these methods, however, rely on the assumption that fully annotated labels on both the target variable and the sensitive attributes are available, which is unrealistic due to the expensive annotation cost. In this paper, we investigate the novel and practical problem of Fair Unsupervised Representation Learning with Partially annotated Sensitive labels (FURL-PS). FURL-PS has two key challenges: 1) how to make full use of the samples that are not annotated with sensitive attributes; 2) how to eliminate bias in the dataset without target labels. To address these challenges, we propose a general Fairness-aware Contrastive Learning (FairCL) framework consisting of two stages. Firstly, we generate contrastive sample pairs, which share the same visual information apart from sensitive attributes, for each instance in the original dataset. In this way, we construct a balanced and unbiased dataset. Then, we perform fair contrastive learning by minimizing the distance between representations of contrastive sample pairs. Besides, we also propose an unsupervised way to balance the utility and fairness of learned representations by feature reweighting. Extensive experimental results illustrate the effectiveness of our method in terms of fairness and utility, even with very limited annotated sensitive attributes and serious data bias.","Fair Representation Learning, Semi-supervised Learning, Contrastive Learning, Data Augmentation" Training Instability and Disharmony Between ReLU and Batch Normalization,https://openreview.net/forum?id=BSUoWl5yfv,https://openreview.net/pdf?id=BSUoWl5yfv,We mathematically show how the disharmony between ReLU and BN causes temporal gradient explosion and training instability. We also propose a better solution to the problem.,"Deep neural networks based on batch normalization and ReLU-like activation functions experience instability during the early stages of training, owing to the high gradient induced by temporal gradient explosion. ReLU reduces the variance by more than the expected amount, and batch normalization amplifies the gradient during its recovery. In this paper, we mathematically explain how the gradient explodes while the forward propagation remains stable, as well as how the problem is alleviated during training. 
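As a worked illustration of the variance effect mentioned above (our own assumption of a zero-mean Gaussian pre-activation, not the paper's derivation): for $x \sim \mathcal{N}(0, \sigma^2)$,
\[
\mathbb{E}[\mathrm{ReLU}(x)] = \frac{\sigma}{\sqrt{2\pi}}, \qquad
\mathbb{E}[\mathrm{ReLU}(x)^2] = \frac{\sigma^2}{2}, \qquad
\operatorname{Var}[\mathrm{ReLU}(x)] = \frac{\pi - 1}{2\pi}\,\sigma^2 \approx 0.34\,\sigma^2,
\]
i.e., the variance drops below the naively expected halving $\sigma^2/2$, so batch normalization must rescale by a larger factor, which amplifies gradients in the backward pass.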
Based on this, we propose a Layer-wise Asymmetric Learning rate Clipping (LALC) algorithm, which outperforms existing learning rate scaling methods in large-batch training and can also be used to replace WarmUp in small-batch training.","Deep learning, Gradient Exploding, ReLU, Batch normalization, Training instability, LARS, WarmUp" Rotamer Density Estimators are Unsupervised Learners of the Effect of Mutations on Protein-Protein Interaction,https://openreview.net/forum?id=_X9Yl1K2mD,https://openreview.net/pdf?id=_X9Yl1K2mD,,"Protein-protein interactions play a fundamental role in a broad range of biological processes. Predicting the effect of amino acid mutations on binding is crucial to protein engineering. Traditional biophysical and statistical methods have dominated the area for years, but they depend heavily on expert priors and face a trade-off between efficiency and accuracy. Recent success in deep learning for proteins has made data-driven approaches more appealing than ever. Nevertheless, the major challenge is the scarcity of experimental mutational data annotated with the change in binding affinity. In this work, we demonstrate that mutational effects on binding can be predicted from the change in conformational flexibility of the protein-protein interface. We propose a flow-based generative model to estimate the probability distribution of conformations (named Rotamer Density Estimator, RDE) and use entropy as the measure of flexibility. The model is trained solely on protein structures and does not require supervision from experimental values of binding affinity changes. Further, the unsupervised representations extracted by the model can be used by simple downstream neural networks for even more accurate prediction. The proposed method outperforms empirical energy functions and other machine learning-based approaches.","effect of mutations, protein-protein interaction, unsupervised learning" "Faster Neural Architecture ""Search"" for Deep Image Prior",https://openreview.net/forum?id=_k0CnK5V7F,https://openreview.net/pdf?id=_k0CnK5V7F,We develop a faster and training-free architecture design strategy to estimate the required architecture for each image in advance.,"Deep image prior (DIP) is known for leveraging the spectral bias of convolutional neural networks (CNNs) towards lower frequencies in various single-image restoration tasks. Such inductive bias has been widely attributed to the network architecture. Existing studies therefore either handcraft the architecture or use automated neural architecture search (NAS). However, there is still a lack of understanding of how the architectural choice corresponds to the image to be restored, leading to an excessively large search space that is expensive in both time and computation for typical NAS techniques. As a result, the architecture is often searched and fixed for the whole dataset, while the best-performing one could be image-dependent. Moreover, common architecture search requires ground-truth supervision, which is often not accessible. In this work, we present a simple yet effective \emph{training-free} approach to estimate the required architecture for \emph{every image} in advance. This is motivated by our empirical finding that the width and depth of a good network prior are correlated with the texture of the image, which can be estimated during pre-processing. Accordingly, the design space is substantially shrunk to a handful of subnetworks within a given large network. 
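The RDE summary above uses entropy as the flexibility measure; one standard way to estimate entropy from a flow-based model is Monte Carlo over its own samples. The sketch below assumes the flow exposes `sample` and `log_prob` methods, which is an assumption about the API, not the paper's code.

```python
# MC estimate of differential entropy H = -E[log p(x)] from a normalizing flow.
import torch

@torch.no_grad()
def flow_entropy(flow, num_samples: int = 1024) -> torch.Tensor:
    x = flow.sample(num_samples)        # assumed API: draw conformations
    return -flow.log_prob(x).mean()     # assumed API: exact log-densities
```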
The experiments on denoising across different noise levels show that a subnetwork with proper setups could be a more effective network prior than the original network while being highly under-parameterized, so that it does not critically require early stopping, unlike the original large network.","Deep Image Prior, Image Denoising, Self-Supervised Learning" Dilated convolution with learnable spacings,https://openreview.net/forum?id=Q3-1vRh3HOA,https://openreview.net/pdf?id=Q3-1vRh3HOA,Dilated convolution with learnable spacings: a new method that improves the accuracy of state-of-the-art CNNs,"Recent works indicate that convolutional neural networks (CNNs) need large receptive fields (RFs) to compete with visual transformers and their attention mechanism. In CNNs, RFs can simply be enlarged by increasing the convolution kernel sizes. Yet the number of trainable parameters, which scales quadratically with the kernel's size in the 2D case, rapidly becomes prohibitive, and the training is notoriously difficult. This paper presents a new method to increase the RF size without increasing the number of parameters. The dilated convolution (DC) has already been proposed for the same purpose. DC can be seen as a convolution with a kernel that contains only a few non-zero elements placed on a regular grid. Here we present a new version of the DC in which the spacings between the non-zero elements, or equivalently their positions, are no longer fixed but learnable via backpropagation, thanks to an interpolation technique. We call this method “Dilated Convolution with Learnable Spacings” (DCLS) and generalize it to the n-dimensional convolution case. However, our main focus here will be on the 2D case. We first tried our approach on ResNet50: we drop-in replaced the standard convolutions with DCLS ones, which increased the accuracy of ImageNet1k classification at iso-parameters, but at the expense of throughput. Next, we used the recent ConvNeXt state-of-the-art convolutional architecture and drop-in replaced the depthwise convolutions with DCLS ones. This not only increased the accuracy of ImageNet1k classification but also that of typical downstream and robustness tasks, again at iso-parameters but this time with negligible cost on throughput, as ConvNeXt uses separable convolutions. Conversely, classic DC led to poor performance with both ResNet50 and ConvNeXt.","deep learning, convolution, dilated convolution, receptive field" PatchDCT: Patch Refinement for High Quality Instance Segmentation,https://openreview.net/forum?id=t9Zd7Oi5JPl,https://openreview.net/pdf?id=t9Zd7Oi5JPl,,"High-quality instance segmentation has shown growing importance in computer vision. Without any refinement, DCT-Mask directly generates high-resolution masks from compressed vectors. To further refine masks obtained from compressed vectors, we propose, for the first time, a compressed-vector-based multi-stage refinement framework. However, the vanilla combination does not bring significant gains, because changes in some elements of the DCT vector affect the prediction of the entire mask. Thus, we propose a simple and novel method named PatchDCT, which separates the mask decoded from a DCT vector into several patches and refines each patch with the designed classifier and regressor. Specifically, the classifier is used to distinguish mixed patches from all patches, and to correct previously mispredicted foreground and background patches. 
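The DCLS construction above (non-zero kernel elements at learnable positions, made differentiable by interpolation) can be sketched compactly: each weight is bilinearly scattered onto a dense kernel so its position receives gradients. This is a simplified single-channel sketch with illustrative names, not the paper's n-dimensional implementation.

```python
import torch

def build_dcls_kernel(weights: torch.Tensor, pos: torch.Tensor, size: int):
    """weights: (K,) values; pos: (K, 2) learnable positions in [0, size-1]."""
    kernel = weights.new_zeros(size, size)
    x0 = pos.floor().long().clamp(0, size - 2)   # integer top-left corners
    frac = pos - x0.float()                      # fractional offsets (keep grad)
    for k in range(weights.shape[0]):
        (i, j), (di, dj) = x0[k], frac[k]
        kernel[i, j]         += weights[k] * (1 - di) * (1 - dj)
        kernel[i, j + 1]     += weights[k] * (1 - di) * dj
        kernel[i + 1, j]     += weights[k] * di * (1 - dj)
        kernel[i + 1, j + 1] += weights[k] * di * dj
    return kernel  # usable as a standard conv2d kernel; positions are trainable
```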
In contrast, the regressor is used for DCT vector prediction on mixed patches, further refining the segmentation quality at boundary locations. Experiments show that our method achieves 2.0%, 3.2%, and 4.5% AP and 3.4%, 5.3%, and 7.0% Boundary AP improvements over Mask-RCNN on COCO, LVIS, and Cityscapes, respectively. It also surpasses DCT-Mask by 0.7%, 1.1%, and 1.3% AP and 0.9%, 1.7%, and 4.2% Boundary AP on COCO, LVIS, and Cityscapes. Besides, the performance of PatchDCT is also competitive with other state-of-the-art methods, and the code will be made publicly available.",Instance Segmentation Global Prototype Encoding for Incremental Video Highlights Detection,https://openreview.net/forum?id=WuDCu0aZXO0,https://openreview.net/pdf?id=WuDCu0aZXO0,,"Video highlights detection (VHD) is an active research field in computer vision, aiming to locate the most user-appealing clips given raw video inputs. However, most VHD methods are based on the closed-world assumption, \emph{i.e.}, a fixed number of highlight categories is defined in advance and all training data are available beforehand. Consequently, existing methods have poor scalability with respect to increasing highlight domains and training data. To address the above issues, we propose a novel video highlight detection method named \textbf{G}lobal \textbf{P}rototype \textbf{E}ncoding (GPE) that learns incrementally to adapt to new domains via parameterized prototypes. To facilitate this new research direction, we collect a finely annotated dataset termed \emph{LiveFood}, including over 5,100 live gourmet videos that consist of four domains: \emph{cooking}, \emph{eating}, \emph{ingredients} and \emph{presentation}. To the best of our knowledge, this is the first work to explore video highlight detection in the incremental learning setting, opening up new ground for applying VHD in practical scenarios where both the concerned highlight domains and the training data increase over time. We demonstrate the effectiveness of GPE through extensive experiments. Notably, GPE surpasses popular domain-incremental learning methods on \emph{LiveFood}, achieving significant mAP improvements on all domains. The code and dataset will be made publicly available.","Video highlights detection, Incremental learning, Newly released dataset" WaGI: Wavelet-based GAN Inversion for Preserving High-Frequency Image Details,https://openreview.net/forum?id=ejQVau3Z-QQ,https://openreview.net/pdf?id=ejQVau3Z-QQ,,"Recent GAN inversion models focus on preserving image-specific details through various methods, e.g., generator tuning or feature mixing. While these are helpful for preserving details compared to naive low-rate latent inversion, they still fail to maintain high-frequency features precisely. In this paper, we point out that existing GAN inversion models have inherent limitations in both structural and training aspects, which preclude the delicate reconstruction of high-frequency features. In particular, we prove that the widely used loss term in GAN inversion is biased toward mainly reconstructing low-frequency features. To overcome this problem, we propose a novel GAN inversion model, coined WaGI, which enables handling high-frequency features explicitly, by using a novel wavelet-based loss term and a newly proposed wavelet fusion scheme. To the best of our knowledge, WaGI is the first approach to interpret GAN inversion in the frequency domain. 
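The wavelet-based loss in the WaGI summary above is not spelled out here, but its spirit can be sketched with a one-level Haar transform that penalizes error only on the high-frequency subbands; the transform choice and weighting are assumptions.

```python
# One-level Haar decomposition and a high-frequency-only reconstruction loss.
import torch.nn.functional as F

def haar_subbands(x):
    """x: (B, C, H, W) with even H, W; returns (LL, LH, HL, HH) subbands."""
    a, b = x[..., ::2, ::2], x[..., ::2, 1::2]
    c, d = x[..., 1::2, ::2], x[..., 1::2, 1::2]
    return ((a + b + c + d) / 2, (a - b + c - d) / 2,
            (a + b - c - d) / 2, (a - b - c + d) / 2)

def high_freq_loss(pred, target):
    _, *hf_p = haar_subbands(pred)
    _, *hf_t = haar_subbands(target)
    return sum(F.mse_loss(p, t) for p, t in zip(hf_p, hf_t))
```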
We demonstrate that WaGI shows outstanding results on both inversion and editing, compared to existing state-of-the-art GAN inversion models. In particular, WaGI robustly preserves high-frequency features of images even in the editing scenario. We will release our code with the pre-trained model after the review.","GAN inversion, wavelet transform" Neural-Symbolic Recursive Machine for Systematic Generalization,https://openreview.net/forum?id=m7rFrsO0YWb,https://openreview.net/pdf?id=m7rFrsO0YWb,"We present the Neural-Symbolic Recursive Machine for systematic generalization, which achieves state-of-the-art performance on SCAN, PCFG, and HINT.","Despite their tremendous success, existing machine learning models still fall short of human-like systematic generalization: learning compositional rules from limited data and applying them to unseen combinations in various domains. We propose the Neural-Symbolic Recursive Machine (NSR) to tackle this deficiency. The core representation of NSR is a Grounded Symbol System (GSS) with combinatorial syntax and semantics, which entirely emerges from training data. Akin to neuroscience studies suggesting separate brain systems for perceptual, syntactic, and semantic processing, NSR implements analogous separate modules of neural perception, syntactic parsing, and semantic reasoning, which are jointly learned by a deduction-abduction algorithm. We prove that NSR is expressive enough to model various sequence-to-sequence tasks. Superior systematic generalization is achieved via the inductive biases of equivariance and recursiveness embedded in NSR. In experiments, NSR achieves state-of-the-art performance on three benchmarks from different domains: SCAN for semantic parsing, PCFG for string manipulation, and HINT for arithmetic reasoning. Specifically, NSR achieves 100% generalization accuracy on SCAN and PCFG and outperforms state-of-the-art models on HINT by about 23%. Our NSR demonstrates stronger generalization than pure neural networks due to its symbolic representation and inductive biases. NSR also demonstrates better transferability than existing neural-symbolic approaches, as it requires less domain-specific knowledge.","Systematic Generalization, Compositional Generalization, Neural-symbolic" ChiroDiff: Modelling chirographic data with Diffusion Models,https://openreview.net/forum?id=1ROAstc9jv,https://openreview.net/pdf?id=1ROAstc9jv,"Learning a diffusion model for continuous-time chirographic data (e.g. handwriting, sketches, etc.)","Generative modelling of continuous-time geometric constructs, a.k.a. ""chirographic data"" such as handwriting, sketches, drawings, etc., has been accomplished through autoregressive distributions. Such strictly-ordered discrete factorization, however, falls short of capturing key properties of chirographic data -- it fails to build a holistic understanding of the temporal concept due to one-way visibility (causality). Consequently, temporal data has been modelled as discrete token sequences of fixed sampling rate instead of capturing the true underlying concept. In this paper, we introduce a powerful model class, namely ""Denoising Diffusion Probabilistic Models"" or DDPMs, for chirographic data, which specifically addresses these flaws. Our model, named ""ChiroDiff"", being non-autoregressive, learns to capture holistic concepts and therefore remains resilient to higher temporal sampling rates up to a good extent. Moreover, we show that many important downstream utilities (e.g. 
conditional sampling, creative mixing) can be flexibly implemented using ChiroDiff. We further show that some unique use-cases like stochastic vectorization, de-noising/healing, and controlled abstraction are also possible with this model class. We perform quantitative and qualitative evaluations of our framework on relevant datasets (VectorMNIST, KanjiVG, Quick, Draw!, etc.) and find it to be better than or on par with competing approaches.","chirographic data, continuous-time, diffusion model, generative model" Object Localization helps Action Recognition Models Adapt to New Environments,https://openreview.net/forum?id=qtVUTPpTNq,https://openreview.net/pdf?id=qtVUTPpTNq,,"Consider a real-world problem where we wish to adapt an existing action recognition (AR) model to a new environment. A common approach is to fine-tune the model on a set of labeled videos of actions performed in that environment. Such an approach is costly, since we need to record and annotate the videos and fine-tune the model. At the same time, there has been recent interest in AR models that take an object-centric approach. In many cases these models are more structured, e.g., containing a module dedicated to object localization. Could we perform adaptation to a new environment via objects alone? We propose to re-use a previously trained AR model and \emph{only adapt its object localization module}. Specifically, we train class-agnostic detectors that can adapt to each new environment. The idea of performing AR model adaptation via objects is novel and promising. While it requires some annotated images with localized objects in the new environment, such supervision cost is lower than that of the conventional approach above. We conduct experiments on unseen kitchens in within- and across-dataset settings using the Epic-Kitchen and EGTEA benchmarks, and show that AR models equipped with our object detectors can efficiently adapt to new environments.", Active Topological Mapping by Metric-Free Exploration via Task and Motion Imitation,https://openreview.net/forum?id=AB4xZG9uzGl,https://openreview.net/pdf?id=AB4xZG9uzGl,A novel framework for building a metric-free topological map for exploration and navigation,"Topological map is an effective environment representation for visual navigation. It is a graph of image nodes and spatial neighborhood edges without metric information such as global or relative agent poses. However, currently such map construction relies on either less-efficient random exploration or more demanding training involving metric information. To overcome these issues, we propose active topological mapping (ATM), consisting of active visual exploration and topological mapping by visual place recognition. Our main novelty is the simple and lightweight active exploration policy that works entirely in the image feature space, involving no metric information. More specifically, ATM's metric-free exploration is based on task and motion planning (TAMP). The task planner is a recurrent neural network that uses the latest local image observation sequence to hallucinate a feature as the next-step best exploration goal. The motion planner then fuses the current and the hallucinated features to generate an action taking the agent towards the hallucinated feature goal. The two planners are jointly trained via deeply-supervised imitation learning from expert exploration demonstrations. 
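The two-planner structure in the ATM summary above maps naturally to a small recurrent model; the sketch below fixes arbitrary dimensions and a discrete action head as assumptions (the abstract does not specify them).

```python
# Sketch of ATM's exploration policy: a recurrent task planner hallucinates the
# next goal feature; a motion planner maps (current, goal) features to actions.
import torch
import torch.nn as nn

class ExplorationPolicy(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_actions=4):
        super().__init__()
        self.task_planner = nn.GRU(feat_dim, hidden, batch_first=True)
        self.goal_head = nn.Linear(hidden, feat_dim)
        self.motion_planner = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions))

    def forward(self, feat_seq):                  # (B, T, feat_dim) observations
        _, h = self.task_planner(feat_seq)
        goal = self.goal_head(h[-1])              # hallucinated goal feature
        fused = torch.cat([feat_seq[:, -1], goal], dim=-1)
        return self.motion_planner(fused), goal   # action logits + goal target
```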
Extensive experiments in both exploration and navigation tasks on the photo-realistic Gibson and MP3D datasets validate ATM's effectiveness and generalizability.","Topological Mapping, Feature-Space Task and Motion Planning, Visual Navigation, Deeply-Supervised Learning" SoundCount: Sound Counting from Raw Audio with Dyadic Decomposition Neural Network,https://openreview.net/forum?id=53T6FlFulCV,https://openreview.net/pdf?id=53T6FlFulCV,A novel and general framework for sound crowd counting from the raw sound waveform,"In this paper, we study an underexplored, yet important and challenging problem: counting the number of distinct sound events in data characterized by a high degree of polyphonicity and spectral overlap. A key example is counting individual bird calls in bioacoustic data, from which biodiversity can be estimated. We do so by systematically proposing a novel end-to-end trainable neural network, designing new evaluation protocols, quantifying the difficulty of counting depending on sound polyphonicity, and creating a new dataset tailored for concurrent sound event counting. Unlike existing methods that all apply frequency-selective filters on the raw waveform in a one-stage manner, our neural network progressively decomposes the raw waveform dyadically in the frequency domain. Taking inspiration from wavelet decomposition, intermediate waveforms convolved by a parent filter are successively processed by a pair of children filters that evenly split the parent filter's carried frequency response. An energy gain normalization module is introduced to normalize the loudness variance and spectrum overlap of received sound events. The network is fully convolutional and parameter-frugal, so it is lightweight and computationally efficient. We further design a set of polyphony-aware metrics to quantify the sound counting difficulty level from different perspectives. To show the efficiency and generalization of our method (which we call DyDecNet), we conduct experiments on bioacoustic bird sound (both synthetic and real-world), telephone-ring, and music sound data. Comprehensive experimental results show our method outperforms existing sound event detection (SED) methods significantly. The dyadic decomposition front-end network can be used by existing methods to improve their performance accordingly.","Sound Crowd Count, Dyadic Decomposition Network, Learnable Filters, Acoustic Crowd Counting" SoundNeRirF: Receiver-to-Receiver Sound Neural Room Impulse Response Field,https://openreview.net/forum?id=CxPw6TeByX4,https://openreview.net/pdf?id=CxPw6TeByX4,Propose a receiver-to-receiver sound neural room acoustics rendering field,"We present SoundNeRirF, a framework that learns a continuous receiver-to-receiver neural room impulse response field (r2r-RIR) to help a robot efficiently predict the sound to be heard at novel locations. It represents a room acoustic scene as a continuous 6D function, whose input is a reference receiver's 3D position and a target receiver's 3D position, and whose outputs are an inverse room impulse response (inverse-RIR) and a forward room impulse response (forward-RIR) that jointly project the sound from the reference position to the target position. SoundNeRirF requires knowledge of neither the sound sources (e.g. the location and number of sound sources) nor the room's acoustic properties (e.g. room size, geometry, materials). Instead, it merely depends on a sparse set of sound receivers' positions, as well as the recorded sound at each position. 
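The 6D field just described admits a compact sketch: an MLP maps the two receiver positions to a pair of RIRs, applied to the reference recording as linear filters. Filter lengths, hidden sizes, and the FFT-based application are assumptions for illustration.

```python
import torch
import torch.nn as nn

class R2RField(nn.Module):
    """Maps (reference position, target position) to inverse and forward RIRs."""
    def __init__(self, rir_len=1024, hidden=256):
        super().__init__()
        self.rir_len = rir_len
        self.mlp = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * rir_len))

    def forward(self, ref_pos, tgt_pos, ref_sound):   # ref_sound: (B, n)
        rirs = self.mlp(torch.cat([ref_pos, tgt_pos], dim=-1))
        inv_rir, fwd_rir = rirs.split(self.rir_len, dim=-1)
        n = ref_sound.shape[-1]                       # assumes n >= rir_len
        S = torch.fft.rfft(ref_sound, n)              # filter in frequency domain
        H = torch.fft.rfft(inv_rir, n) * torch.fft.rfft(fwd_rir, n)
        return torch.fft.irfft(S * H, n)              # predicted target sound
```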
We instantiate the continuous 6D function as multi-layer perceptrons (MLPs), so it is fully differentiable and continuous at any spatial position. During the training stage, SoundNeRirF is encouraged to implicitly encode the interaction between sound sources, receivers, and room acoustic properties by minimizing the discrepancy between the predicted sound and the truly heard sound at the target position. During inference, the sound at a novel position is predicted by giving a reference position and the corresponding reference sound. Extensive experiments on both synthetic and real-world datasets show SoundNeRirF is capable of predicting high-fidelity and audio-realistic sound that fully captures room reverberation characteristics, significantly outperforming existing methods in terms of accuracy and efficiency.","Sound Neural Rendering Field, Sound Prediction, Representation Learning, Receiver-to-Receiver Modelling" Towards Sustainable Self-supervised Learning,https://openreview.net/forum?id=jT1HcWv6PgO,https://openreview.net/pdf?id=jT1HcWv6PgO,This paper proposes a new method towards the sustainable self-supervised learning goal.,"Though increasingly expensive to train, most self-supervised learning (SSL) models have repeatedly been trained from scratch but not fully utilized, since only a few SOTA models are adopted for downstream tasks. In this work, we explore a sustainable SSL framework with two major challenges: i) learning a stronger new SSL model based on the existing pretrained SSL model in a cost-friendly manner, and ii) allowing the training of the new model to be compatible with various base models. We propose a Target-Enhanced Conditional (TEC) scheme, which introduces two components to existing mask-reconstruction based SSL. Firstly, we introduce patch-relation enhanced targets to encourage the new model to learn semantic-relation knowledge from the base model using incomplete inputs. This hardening and target-enhancing could help the new model surpass the base model, since they enforce additional patch relation modeling to handle incomplete input. Secondly, we introduce a conditional adapter that adaptively adjusts the new model's predictions to align with the target of each base model. Experimental results show that our TEC scheme can accelerate the learning speed and also improve SOTA SSL models, e.g., MAE and iBOT, taking an explorative step towards sustainable SSL.","sustainable, self-supervised learning, vision transformer" Real-Time Image Demoir$\acute{e}$ing on Mobile Devices,https://openreview.net/forum?id=PmP_sf3JkrH,https://openreview.net/pdf?id=PmP_sf3JkrH,This paper presents a dynamic demoireing acceleration method towards real-time image demoireing on mobile devices.,"Moir$\acute{e}$ patterns appear frequently when taking photos of digital screens, drastically degrading the image quality. Despite the advance of CNNs in image demoir$\acute{e}$ing, existing networks have heavy designs, causing a massive computation burden for mobile devices. In this paper, we launch the first study on accelerating demoir$\acute{e}$ing networks and propose a dynamic demoir$\acute{e}$ing acceleration method (DDA) towards real-time deployment on mobile devices. Our stimulus stems from the simple-yet-universal fact that moir$\acute{e}$ patterns are often unevenly distributed across an image. Consequently, excessive computation is wasted on non-moir$\acute{e}$ areas. Therefore, we reallocate computation costs in proportion to the complexity of image patches. 
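The patch-wise reallocation just described can be sketched as a simple router; the complexity score is a placeholder here (the actual moir$\acute{e}$ prior, combining colorfulness and frequency cues, is described next), and the threshold is an assumption.

```python
# Route image patches to networks of different capacity by a complexity score.
import torch

def route_patches(patches, score_fn, small_net, large_net, thresh=0.5):
    """patches: (N, C, h, w); score_fn returns a scalar complexity per patch."""
    scores = torch.stack([score_fn(p) for p in patches])
    heavy = scores > thresh
    out = torch.empty_like(patches)
    if heavy.any():
        out[heavy] = large_net(patches[heavy])    # complex patches: big model
    if (~heavy).any():
        out[~heavy] = small_net(patches[~heavy])  # easy patches: cheap model
    return out
```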
In order to achieve this aim, we measure the complexity of an image patch by a novel moir$\acute{e}$ prior that considers both the colorfulness and the frequency information of moir$\acute{e}$ patterns. Then, we restore more complex image patches using larger networks, while less complex ones are assigned to smaller networks to relieve the computation burden. Finally, we train all networks in a parameter-shared supernet paradigm to avoid an additional parameter burden. Extensive experiments on several benchmarks demonstrate the efficacy of our DDA. In addition, the acceleration evaluated on a VIVO X80 Pro smartphone equipped with a Snapdragon 8 Gen 1 chip also shows that our method can drastically reduce the inference time, leading to real-time image demoir$\acute{e}$ing on mobile devices. ","Image Demoireing, Network Acceleration" Domain Generalization via Independent Regularization from Early-branching Networks,https://openreview.net/forum?id=uPPbSJcMBXf,https://openreview.net/pdf?id=uPPbSJcMBXf,"We find an early-branching structure is essential when using independent regularization for DG, and with a new augmentation strategy, our method can outperform most existing SOTA.","Learning domain-invariant feature representations is critical for achieving domain generalization, where a model is required to perform well on unseen domains. The critical challenge is that standard training often results in entangled domain-invariant and domain-specific features. To address this issue, we use a dual-branching network to learn two features, one for the domain classification problem and the other for the original target classification problem, and the feature of the latter is required to be independent of the former. While this idea seems straightforward, we show that several factors need to be carefully considered for it to work effectively. In particular, we investigate different branching structures and discover that the common practice of using a shared base feature extractor with two lightweight prediction heads is detrimental to the performance. Instead, a simple early-branching architecture, where the domain classification and target classification branches share the first few blocks while diverging thereafter, leads to better results. Moreover, we also incorporate a random style augmentation scheme as an extension to further unleash the power of the proposed method, which can be seamlessly integrated into the dual-branching network by our loss terms. Such an extension gives rise to an effective domain generalization method. Experimental results show that the proposed method outperforms state-of-the-art domain generalization methods on various benchmark datasets.","domain generalization, representational learning" AutoSKDBERT: Learn to Stochastically Distill BERT,https://openreview.net/forum?id=csARsNPKgVi,https://openreview.net/pdf?id=csARsNPKgVi,AutoSKDBERT stochastically selects a teacher model from a predefined teacher team to distill the student model in each iteration with a learnable categorical distribution.,"In this paper, we propose AutoSKDBERT, a new knowledge distillation paradigm for BERT compression, which stochastically samples a teacher from a predefined teacher team following a categorical distribution in each step, to transfer knowledge into the student. AutoSKDBERT aims to discover the optimal categorical distribution, which plays an important role in achieving high performance. 
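The per-step teacher sampling described above reduces to drawing from a learnable categorical distribution; the sketch below shows only that sampling step (the two-phase weight optimization is summarized next), with illustrative names.

```python
import torch

class TeacherSampler:
    """Learnable categorical distribution over a predefined teacher team."""
    def __init__(self, num_teachers: int):
        self.logits = torch.zeros(num_teachers, requires_grad=True)

    def sample(self):
        dist = torch.distributions.Categorical(logits=self.logits)
        idx = dist.sample()                  # one teacher per training step
        return idx, dist.log_prob(idx)       # log-prob enables weight updates
```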
The optimization procedure of AutoSKDBERT can be divided into two phases: 1) phase-1 optimization distinguishes effective teachers from ineffective teachers, and 2) phase-2 optimization further optimizes the sampling weights of the effective teachers to obtain a satisfactory categorical distribution. Moreover, after phase-1 optimization completes, AutoSKDBERT adopts a teacher selection strategy to discard the ineffective teachers, whose sampling weights are reassigned to the effective teachers. In particular, to alleviate the gap between categorical distribution optimization and evaluation, we also propose a stochastic single-weight optimization strategy which only updates the weight of the sampled teacher in each step. Extensive experiments on the GLUE benchmark show that the proposed AutoSKDBERT achieves state-of-the-art scores compared to previous compression approaches on several downstream tasks, including pushing MRPC F1 and accuracy to 93.2 (0.6 point absolute improvement) and 90.7 (1.2 point absolute improvement), and RTE accuracy to 76.9 (2.9 point absolute improvement).","Categorical distribution optimization, Stochastic knowledge distillation, BERT compression, GLUE" QCRS: Improve Randomized Smoothing using Quasi-Concave Optimization,https://openreview.net/forum?id=OzHFdcvucgb,https://openreview.net/pdf?id=OzHFdcvucgb,Improve traditional randomized smoothing using Quasi-Concave Optimization,"Randomized smoothing is currently the state-of-the-art method that provides certified robustness for neural networks. However, it often cannot achieve an adequate certified region on real-world datasets. One way to obtain a larger certified region is to use an input-specific algorithm instead of using a fixed Gaussian filter for all data points. Several methods based on this idea have been proposed, but they either suffer from high computational costs or gain marginal improvement in certified radius. In this work, we show that by exploiting the quasiconvex problem structure, we can find the optimal certified radii for most data points with slight computational overhead. This observation leads to an efficient and effective input-specific randomized smoothing algorithm. We conduct extensive experiments and empirical analysis on CIFAR-10 and ImageNet. The results show that the proposed method significantly enhances the certified radii with low computational overhead.","Randomized Smoothing, Robustness" Training A Multi-stage Deep Classifier with Feedback Signals,https://openreview.net/forum?id=LmNckrTpTBo,https://openreview.net/pdf?id=LmNckrTpTBo,,"A Multi-Stage Classifier (MSC) - several classifiers working sequentially in an arranged order, with a classification decision partially made at each step - is widely used in industrial applications for various resource-limitation reasons. The classifiers of a multi-stage process are usually Neural Network (NN) models trained independently or in their inference order, without considering the signals from the latter stages. Aimed at the two-stage binary classification process, the most common type of MSC, we propose a novel training framework named Feedback Training. The classifiers are trained in an order reverse to their actual working order, and the classifier at the later stage is used to guide the training of the initial-stage classifier via a sample weighting method. We experimentally show the efficacy of our proposed approach, and its clear superiority in the few-shot training scenario. 
","multi-stage classification, training framework" Is Self-Supervised Contrastive Learning More Robust Than Supervised Learning?,https://openreview.net/forum?id=FPdDFUVYVPl,https://openreview.net/pdf?id=FPdDFUVYVPl,,"Prior work on self-supervised contrastive learning has primarily focused on evaluating recognition accuracy, but has overlooked other behavioral aspects. In addition to accuracy, distributional robustness plays a critical role in the reliability of machine learning models. We design and conduct a series of robustness tests to quantify the behavioral differences between contrastive learning and supervised learning under downstream and pre-training data distribution changes. These tests leverage data corruptions at multiple levels, ranging from pixel-level distortion to patch-level shuffling and to dataset-level distribution shift, including both natural and unnatural corruptions. Our tests unveil intriguing robustness behaviors of contrastive and supervised learning: while we generally observe that contrastive learning is more robust than supervised learning under downstream corruptions, we surprisingly discover the robustness vulnerability of contrastive learning under pixel- and patch-level corruptions during pre-training. Furthermore, we observe the higher dependence of contrastive learning on spatial image coherence information during pre-training, e.g., it is particularly sensitive to global patch shuffling. We explain these results by connecting them to feature space uniformity and data augmentation. Our analysis has implications for improving the downstream robustness of supervised learning, and calls for more studies on understanding contrastive learning.", An Empirical Study of Metrics to Measure Representational Harms in Pre-Trained Language Models,https://openreview.net/forum?id=3leZITnUE9r,https://openreview.net/pdf?id=3leZITnUE9r,Measuring implicit hate in pretrained language models,"Large-scale Pre-Trained Language Models (PTLMs) capture knowledge from massive human-written data which contains latent societal biases and toxic content. In this paper, we leverage the primary task of PTLMs, i.e., language modeling, and propose a new metric to quantify manifested implicit representational harms in PTLMs towards 13 marginalized demographics. Using this metric, we conducted an empirical analysis of 24 widely used PTLMs. Our analysis provides insights into the correlation between the proposed metric in this work and other related fairness metrics. We observe that our metric correlates with the majority of gender-specific fairness metrics in the literature. Through extensive experiments, we explore the connections between PTLM architectures and representational harms across two dimensions: the depth and width of the networks. We found that prioritizing depth over width mitigates representational harms in some PTLMs.","Natural Language Processing, Fairness, Safety" Unsupervised Learning of Causal Relationships from Unstructured Data,https://openreview.net/forum?id=2xNKMFGPJU5,https://openreview.net/pdf?id=2xNKMFGPJU5,We propose a modification to the VAE that learns variables and causal relationships between them in an unsupervised way.,"Endowing deep neural networks with the ability to reason about cause and effect would be an important step toward making them more robust and interpretable. 
In this work, we propose a variational framework that allows deep networks to learn latent variables and their causal relationships from unstructured data, with no supervision or labeled interventions. Starting from an abstract Structural Equation Model (SEM), we show that maximizing its posterior probability yields a construction similar to a Variational Auto-Encoder (VAE), but with a structured prior coupled by non-linear equations. This prior represents an interpretable SEM with learnable parameters (such as a physical model or dependence structure), which can be fitted to data while simultaneously learning the latent variables. Unfortunately, computing KL-divergences with this non-linear prior is intractable. We show how linearizing arbitrary SEMs via back-propagation produces local non-isotropic Gaussian priors, for which the KL-divergences can be computed efficiently and differentiably. We propose two versions: one for IID data (such as images), which detects related causal variables within a sample, and one for non-IID data (such as video), which detects variables that are also related over time. Our proposal is complementary to causal discovery techniques, which assume given variables; ours instead discovers both the variables and their causal relationships. We experiment with recovering causal models from images, and with learning temporal relations based on the Super Mario Bros video game.","causality, deep learning, causal representation learning, unsupervised, VAE" The Biased Artist: Exploiting Cultural Biases via Homoglyphs in Text-Guided Image Generation Models,https://openreview.net/forum?id=Liuo-Bk-beq,https://openreview.net/pdf?id=Liuo-Bk-beq,Inducing cultural biases into text-conditional image generation models by replacing single characters in the text prompts with homoglyphs.,"Text-guided image generation models, such as DALL-E2 and Stable Diffusion, have recently received much attention from academia and the general public. Provided with textual descriptions, these models are capable of generating high-quality images depicting various concepts and styles. However, such models are trained on large amounts of public data and implicitly learn relationships from their training data that are not immediately apparent. We demonstrate that common multimodal models have implicitly learned cultural biases that can be triggered and injected into the generated images by simply replacing single characters in the textual description with visually similar non-Latin characters. These so-called homoglyph replacements enable malicious users or service providers to induce biases into the generated images and even render the whole generation process useless. We practically illustrate such attacks on DALL-E2 and Stable Diffusion as text-guided image generation models and further show that CLIP also behaves similarly. Our results further indicate that text encoders trained on multilingual data provide a way to mitigate the effects of homoglyph replacements.","Text-Guided Image Generation Models, Bias, DALL-E 2, Security" Parameterized projected Bellman operator,https://openreview.net/forum?id=lTb1kzFA84J,https://openreview.net/pdf?id=lTb1kzFA84J,A novel reinforcement learning approach that obtains an approximation of the Bellman operator to overcome the limitations of the regular Bellman operator.,"The Bellman operator is a cornerstone of reinforcement learning, widely used in a plethora of works, from value-based methods to modern actor-critic approaches. 
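The homoglyph replacement described in the entry above is simple to demonstrate; the mapping below uses a few Cyrillic look-alikes as an illustrative (not exhaustive) example.

```python
# Swap a single Latin character for a visually similar non-Latin one: the prompt
# looks unchanged to a human, but the text encoder sees different tokens.
HOMOGLYPHS = {"o": "\u043e", "a": "\u0430", "e": "\u0435"}  # Cyrillic look-alikes

def inject_homoglyph(prompt: str, char: str) -> str:
    """Replace the first occurrence of `char` with its look-alike."""
    return prompt.replace(char, HOMOGLYPHS[char], 1)

# inject_homoglyph("A photo of a city", "o") renders nearly identically, yet can
# steer a text-guided generator toward a different cultural context.
```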
In problems with unknown models, the Bellman operator requires transition samples that strongly determine its behavior, as uninformative samples can result in negligible updates or long detours before reaching the fixed point. In this work, we introduce the novel idea of obtaining an approximation of the Bellman operator, which we call the projected Bellman operator (PBO). Our PBO is a parametric operator on the parameter space of a given value function. Given the parameters of a value function, PBO outputs the parameters of a new value function and converges to a fixed point in the limit, as a standard Bellman operator does. Notably, our PBO can approximate repeated applications of the true Bellman operator at once, as opposed to the sequential nature of the standard Bellman operator. We prove the important consequences of this finding for different classes of problems by analyzing PBO in terms of stability, convergence, and approximation error. Finally, we propose an approximate value-iteration algorithm to show how PBO can overcome the limitations of classical methods, opening up multiple research directions as a novel paradigm in reinforcement learning.","reinforcement learning, bellman operator, operator learning, approximate value iteration" Module-wise Training of Residual Networks via the Minimizing Movement Scheme,https://openreview.net/forum?id=XGT4bsvI6y,https://openreview.net/pdf?id=XGT4bsvI6y,We introduce a regularization inspired by the minimizing movement scheme for gradient flows in distribution space for layer-wise training of neural networks. ,"Greedy layer-wise or module-wise training of neural networks is compelling in constrained and on-device settings, as it circumvents a number of problems of end-to-end back-propagation. However, it suffers from a stagnation problem, whereby early layers overfit and deeper layers stop increasing the test accuracy after a certain depth. We propose to solve this issue by introducing a simple module-wise regularization inspired by the minimizing movement scheme for gradient flows in distribution space. The method, which we call TRGL for Transport Regularized Greedy Learning, is particularly well-adapted to residual networks. We study it theoretically, proving that it leads to greedy modules that are regular and that successively solve the task. Experimentally, we show improved accuracy of module-wise trained networks when our regularization is added.","Deep learning, Layer-wise training, Optimal transport, Locking problems, Parallelism" Cross-Level Distillation and Feature Denoising for Cross-Domain Few-Shot Classification,https://openreview.net/forum?id=Kn-HA8DFik,https://openreview.net/pdf?id=Kn-HA8DFik,We design a cross-level distillation and a feature denoising operation for handling cross-domain few-shot classification. Our approach can surpass the SOTA method by 5.44% on 1-shot and 1.37% on 5-shot classification tasks in the BSCD-FSL benchmark.,"Conventional few-shot classification aims at learning a model on a large labeled base dataset and rapidly adapting to a target dataset that is from the same distribution as the base dataset. However, in practice, the base and the target datasets of few-shot classification are usually from different domains, which is the problem of cross-domain few-shot classification. We tackle this problem by making a small proportion of unlabeled images in the target domain accessible in the training stage. 
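The minimizing-movement-inspired regularization in the module-wise training entry above can be sketched per module: the task loss is augmented with a penalty on how far the module transports its input, echoing the proximal form $\arg\min_h F(h) + \|h - x\|^2 / (2\tau)$. The head, penalty form, and $\tau$ are assumptions for illustration.

```python
import torch

def module_loss(module, head, x, y, tau=0.1):
    h = module(x)                                        # residual block output
    task = torch.nn.functional.cross_entropy(head(h), y)
    transport = (h - x).pow(2).flatten(1).sum(dim=1).mean()  # movement penalty
    return task + transport / (2 * tau)
```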
In this setup, even though the base data are sufficient and labeled, the large domain shift still makes transferring knowledge from the base dataset difficult. We meticulously design a cross-level knowledge distillation method, which can strengthen the ability of the model to extract more discriminative features in the target dataset by guiding the network's shallow layers to learn higher-level information. Furthermore, we propose a feature denoising operation that reduces feature redundancy and mitigates overfitting in the evaluation stage. Our approach can surpass the previous state-of-the-art method, Dynamic-Distillation, by 5.44% on 1-shot and 1.37% on 5-shot classification tasks on average in the BSCD-FSL benchmark. The implementation code will be available soon.","cross-domain few-shot classification, cross-level distillation, feature denoising" kaBEDONN: posthoc eXplainable Artificial Intelligence with Data Ordered Neural Network,https://openreview.net/forum?id=ZaOG6ci_IP7,https://openreview.net/pdf?id=ZaOG6ci_IP7,A posthoc method for providing explanation to blackbox algorithms by querying similar data,"Different approaches to eXplainable Artificial Intelligence (XAI) have been explored, including (1) the systematic study of the effect of individual training data samples on the final model and (2) posthoc attribution methods that assign importance values to the components of each data sample. Combining concepts from both approaches, we introduce kaBEDONN, a system of ordered datasets coupled with a posthoc and model-agnostic method for querying \textit{relevant} training data samples. These \textit{relevant} data are intended as the explanations for model predictions that are both user-friendly and easily adjustable by developers. Explanations can thus be finetuned and damage control can be performed with ease.","Explainable Artificial Intelligence, Neural Network, instance-based learning" DELTA: DEBIASED FULLY TEST-TIME ADAPTATION,https://openreview.net/forum?id=eGm22rqG93,https://openreview.net/pdf?id=eGm22rqG93,,"Fully test-time adaptation aims at adapting a pre-trained model to the test stream during real-time inference, which is urgently required when the test distribution differs from the training distribution. Several efforts have been devoted to improving adaptation performance. However, we find that two unfavorable biases are concealed in the prevalent adaptation methodologies, like test-time batch normalization (BN) and self-learning. First, we reveal that the normalization statistics in test-time BN are inherently biased in favor of currently received test samples. Second, we show that during test-time adaptation, the parameter update is biased towards some dominant classes. In addition to the extensively studied IID test stream, we further observe that the biases can be exacerbated in more complicated test environments, such as non-IID or class-imbalanced data. In this paper, we provide a plug-in solution called DELTA for debiased fully test-time adaptation, which consists of two components: (i) Test-time batch renormalization (TBR), introduced to alleviate the bias in normalization statistics, and (ii) Dynamic online re-weighting (DOT), designed to address the class bias within optimization. We investigate various test-time adaptation methods on three commonly used datasets and four scenarios. 
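DELTA's TBR component, as summarized above, tempers current-batch statistics with running statistics; a hedged sketch using the standard batch-renormalization correction terms follows (clipping bounds are illustrative assumptions, not the paper's settings).

```python
import torch

def test_time_batch_renorm(x, mu_run, var_run, eps=1e-5, rmax=3.0, dmax=5.0):
    """x: (B, C) features; mu_run/var_run: running statistics per channel."""
    mu_b = x.mean(dim=0)
    sig_b = x.var(dim=0, unbiased=False).add(eps).sqrt()
    sig_run = var_run.add(eps).sqrt()
    r = (sig_b / sig_run).clamp(1 / rmax, rmax).detach()         # scale correction
    d = ((mu_b - mu_run) / sig_run).clamp(-dmax, dmax).detach()  # shift correction
    return (x - mu_b) / sig_b * r + d
```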
Previous approaches only work well in certain scenarios while failing in others; DELTA can help them deal with all scenarios simultaneously, leading to new state-of-the-art test-time adaptation performance.", Bit-Pruning: A Sparse Multiplication-Less Dot-Product,https://openreview.net/forum?id=YUDiZcZTI8,https://openreview.net/pdf?id=YUDiZcZTI8,Mult-less dot-product comprised of add and shift; add is pruned during training to reduce energy ,"Dot-product is a central building block in neural networks. However, multiplication ($\texttt{mult}$) in dot-product consumes intensive energy and space costs that challenge deployment on resource-constrained edge devices. In this study, we realize energy-efficient neural networks by exploiting a $\texttt{mult}$-less, sparse dot-product. We first reformulate a dot-product between an integer weight and activation into an equivalent operation comprised of additions followed by bit-shifts ($\texttt{add-shift-add}$). In this formulation, the number of $\texttt{add}$ operations equals the number of non-zero bits of the integer weight in binary format. Leveraging this observation, we propose Bit-Pruning, which removes unnecessary bits in each weight value during training to reduce the energy consumption of $\texttt{add-shift-add}$. Bit-Pruning can be seen as soft Weight-Pruning, as it prunes bits, not the whole weight element. In extensive experiments, we demonstrate that sparse $\texttt{mult}$-less networks trained with Bit-Pruning show a better accuracy-energy trade-off than sparse $\texttt{mult}$ networks trained with Weight-Pruning. ","pruning, non-uniform quantization, power of two, 2bit" Abstract-to-Executable Trajectory Translation for One-Shot Task Generalization,https://openreview.net/forum?id=S9GpoS2TmN,https://openreview.net/pdf?id=S9GpoS2TmN,We tackle the problem of one-shot generalization for long-horizon tasks by learning a model to translate an abstract trajectory to an executable trajectory.,"Training long-horizon robotic policies in complex physical environments is essential for many applications, such as robotic manipulation. However, learning a policy that can generalize to unseen tasks is challenging. In this work, we propose to achieve one-shot task generalization by decoupling plan generation and plan execution. Specifically, our method solves complex long-horizon tasks in three steps: building a paired abstract environment by simplifying geometry and physics, generating abstract trajectories, and solving the original task via an abstract-to-executable trajectory translator. In the abstract environment, complex dynamics such as physical manipulation are removed, making abstract trajectories easier to generate. However, this introduces a large domain gap between abstract trajectories and the actual executed trajectories, as abstract trajectories lack low-level details and are not aligned frame-to-frame with the executed trajectories. In a manner reminiscent of language translation, our approach leverages a seq-to-seq model to overcome the large domain gap between the abstract and executable trajectories, enabling the low-level policy to follow the abstract trajectory. Experimental results on various unseen long-horizon tasks with different robot embodiments demonstrate the practicality of our method for achieving one-shot task generalization. 
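The add-shift-add reformulation in the Bit-Pruning entry above is concrete enough to verify directly: one shifted add per set bit of the weight, so removing bits removes adds. A pure-Python illustration:

```python
def add_shift_add(x: int, w: int) -> int:
    """Compute x * w with only adds and shifts (w a non-negative integer)."""
    acc, bit = 0, 0
    while w >> bit:
        if (w >> bit) & 1:           # one add per non-zero bit of the weight
            acc += x << bit
        bit += 1
    return acc

assert add_shift_add(13, 11) == 13 * 11   # 11 = 0b1011 -> three shifted adds
```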
Videos and more details can be found in the supplementary materials and on the project page: https://sites.google.com/view/abstract-to-executable-iclr23/","Trajectory Translation, One-Shot Generalization, Long-Horizon Task, Reinforcement Learning" Unveiling The Mask of Position-Information Pattern Through the Mist of Image Features,https://openreview.net/forum?id=QDE5hzxVpS,https://openreview.net/pdf?id=QDE5hzxVpS,We demonstrate a more accurate paradigm in quantifying the strength of positional information in CNN models,"Recent studies have shown that paddings in convolutional neural networks encode absolute position information which can negatively affect the model performance for certain tasks. However, existing metrics for quantifying the strength of positional information remain unreliable and frequently lead to erroneous results. To address this issue, we propose novel metrics for measuring and visualizing the encoded positional information. We formally define the encoded information as Position-information Pattern from Padding (PPP) and conduct a series of experiments to study its properties as well as its formation. The proposed metrics measure the presence of positional information more reliably than the existing metrics based on PosENet and tests in F-Conv. We also demonstrate that for any extant (and proposed) padding schemes, PPP is primarily a learning artifact and is less dependent on the characteristics of the underlying padding schemes.","positional information, position encoding, padding, CNN" KNN-Diffusion: Image Generation via Large-Scale Retrieval,https://openreview.net/forum?id=x5mtJD2ovc,https://openreview.net/pdf?id=x5mtJD2ovc,,"Recent text-to-image models have achieved impressive results. However, since they require large-scale datasets of text-image pairs, it is impractical to train them on new domains where data is scarce or not labeled. In this work, we propose using large-scale retrieval methods, in particular, efficient k-Nearest-Neighbors (kNN), which offers novel capabilities: (1) training a small and efficient text-to-image diffusion model without any text, (2) generating out-of-distribution images by simply swapping the retrieval database at inference time, and (3) performing text-driven local semantic manipulations while preserving object identity. To demonstrate the robustness of our method, we apply our kNN approach on two state-of-the-art diffusion backbones, and show results on several different datasets. As evaluated by human studies and automatic metrics, our method achieves state-of-the-art results compared to existing approaches that train text-to-image generation models using images only (without paired text data).", Steering Prototypes with Prompt Tuning for Rehearsal-free Continual Learning,https://openreview.net/forum?id=BSww-NrOzJ,https://openreview.net/pdf?id=BSww-NrOzJ,,"Prototypes, as representations of class embeddings, have been explored to reduce the memory footprint or avoid bias towards the latest task in continual learning. However, prototype-based methods still suffer from performance deterioration due to semantic drift and prototype interference. In this work, we propose a simple and novel framework for rehearsal-free continual learning. We show that task-specific prompt-tuning, when coupled with a contrastive loss design, can effectively address both issues and largely improve the potency of prototypes. 
The proposed framework excels at three challenging benchmarks, resulting in 3% to 6% absolute improvements over state-of-the-art methods without the use of a rehearsal buffer or a test-time oracle. Furthermore, the proposed framework largely bridges the performance gap between incremental learning and offline joint learning, demonstrating a promising design schema for continual learning.","Continual Learning, Prompt Tuning, Prototype, Contrastive Learning" Normalized Activation Function: Toward Better Convergence,https://openreview.net/forum?id=k1lWBmJuyf,https://openreview.net/pdf?id=k1lWBmJuyf,Normalizing gradient variance by simply constructing an affine transformation after each activation function.,"Activation functions are essential for neural networks to introduce non-linearity. A great number of empirical experiments have validated various activation functions, yet theoretical research on activation functions is insufficient. In this work, we study the impact of activation functions on the variance of gradients and propose an approach to normalize activation functions so as to keep the same gradient variance across all layers, allowing the neural network to achieve better convergence. First, we complement previous work on the analysis of gradient variance, which considered the impact of activation functions only in an idealized initial state that can hardly be preserved during training, and we derive a property that good activation functions should satisfy as far as possible. Second, we offer an approach to normalize activation functions, independent of the initialization method, and empirically verify its effectiveness on prevalent activation functions. Our experiments further suggest that the speed of convergence is roughly related to the property derived above. We benchmark our normalized activation functions against common activation functions, and the results show that the normalized variants consistently outperform their unnormalized counterparts. For example, normalized Swish outperforms vanilla Swish on ResNet50 by 1.4% with Tiny ImageNet and by 1.2% with CIFAR-100 in terms of top-1 accuracy. Our method improves the performance for both fully-connected networks and residual networks.","normalization, activation function, initialization" IS SYNTHETIC DATA FROM GENERATIVE MODELS READY FOR IMAGE RECOGNITION?,https://openreview.net/forum?id=nUmCcZ5RKF,https://openreview.net/pdf?id=nUmCcZ5RKF,We present the first study on the state-of-the-art text-to-image generation models for image recognition.,"Recent text-to-image generation models have shown promising results in generating high-fidelity photo-realistic images. Though the results are astonishing to human eyes, how applicable these generated images are for recognition tasks remains under-explored. In this work, we extensively study whether and how synthetic images generated from state-of-the-art text-to-image generation models can be used for image recognition tasks, and focus on two perspectives: synthetic data for improving classification models in data-scarce settings (i.e., zero-shot and few-shot), and synthetic data for large-scale model pre-training for transfer learning. We showcase the strengths and shortcomings of synthetic data from existing generative models, and propose strategies for better applying synthetic data for recognition tasks. 
Our code will be released.","data generation, image recognition, text-to-image synthesis" Learnable Behavior Control: Breaking Atari Human World Records via Sample-Efficient Behavior Selection,https://openreview.net/forum?id=FeWvD0L_a4,https://openreview.net/pdf?id=FeWvD0L_a4,We have constructed a general framework to control the behaviors in RL and achieved SOTA performance in the Atari 1B benchmark.,"The exploration problem is one of the main challenges in deep reinforcement learning (RL). Recent promising works have tried to handle the problem with population-based methods, which collect samples with diverse behaviors derived from a population of different exploratory policies. Goal-directed policy selection has been adopted for behavior control. However, the behavior selection space is largely limited by the predefined policy population, which further limits behavior diversity. In this paper, we propose a general framework called Learnable Behavioral Control (LBC) to address the limitation, which a) enables a significantly enlarged behavior selection space by formulating a hybrid behavior mapping from all policies; b) constructs a unified goal-directed learnable process for behavior selection. We introduce LBC into distributed off-policy actor-critic methods and achieve behavior control by optimizing the selection of the behavior mappings with bandit-based meta-controllers. Our agents have achieved a 10077.52% mean human normalized score and surpassed 24 human world records within 1B training frames in the Arcade Learning Environment, demonstrating significant state-of-the-art (SOTA) performance without degrading sample efficiency.","Deep Reinforcement Learning, The Arcade Learning Environment, Human World Records, Behavioral Control" Decompose to Generalize: Species-Generalized Animal Pose Estimation,https://openreview.net/forum?id=nQai_B1Zrt,https://openreview.net/pdf?id=nQai_B1Zrt,,"This paper tackles the cross-species generalization problem in animal pose estimation, aiming to learn a pose estimator that generalizes well to novel species. We find that the relations between different joints are important, with a two-fold impact: 1) on the one hand, some relations are consistent across all species and may help two joints mutually confirm each other, e.g., the eyes help confirm the nose and vice versa because they are close in all species; 2) on the other hand, some relations are inconsistent across species due to species variation and may bring severe distraction rather than benefit. With these two insights, we propose a Decompose-to-Generalize (D-Gen) pose estimation method to break the inconsistent relations while preserving the consistent ones. Specifically, D-Gen first decomposes the body joints into several joint concepts so that each concept contains multiple closely-related joints. Given these joint concepts, D-Gen 1) promotes the interaction between intra-concept joints to enhance their reliable mutual confirmation, and 2) suppresses the interaction between inter-concept joints to prohibit their mutual distraction. Importantly, we explore various decomposition approaches, i.e., heuristic, geometric and attention-based approaches. Experimental results show that all these decomposition approaches yield reasonable joint concepts and substantially improve cross-species generalization (with the attention-based approach performing best). 
","Pose Estimation, Domain Generalization, Transfer Learning" Correcting the Sub-optimal Bit Allocation,https://openreview.net/forum?id=KWSPJ1tuYX,https://openreview.net/pdf?id=KWSPJ1tuYX,Correcting the bit allocation in neural video compression by extending semi-amortized variational inference to non-factorized latent.,"In this paper, we investigate the problem of bit allocation in Neural Video Compression (NVC). First, we reveal that a recent bit allocation approach claimed to be optimal is, in fact, sub-optimal due to its implementation. Specifically, we find that its sub-optimality lies in the improper application of semi-amortized variational inference (SAVI) on latent with non-factorized variational posterior. Then, we show that the corrected version of SAVI on non-factorized latent requires recursively applying back-propagating through gradient ascent, based on which we derive the corrected optimal bit allocation algorithm. Due to the computational in-feasibility of the corrected bit allocation, we design an efficient approximation to make it practical. Empirical results show that our proposed correction significantly improves the incorrect bit allocation in terms of R-D performance and bitrate error, and outperforms all other bit allocation methods by a large margin. The source code is provided in the supplementary material.","neural video compression, semi-amortized variational auto-encoder" IDEAL: Query-Efficient Data-Free Learning from Black-Box Models,https://openreview.net/forum?id=ConT6H7MWL,https://openreview.net/pdf?id=ConT6H7MWL,query-efficiently learn from black-box model APIs to train a good student without any real data,"Knowledge Distillation (KD) is a typical method for training a lightweight student model with the help of a well-trained teacher model. However, most KD methods require access to either the teacher's training data or model parameter, which is unrealistic. To tackle this problem, recent works study KD under data-free and black-box settings. Nevertheless, these works require a large number of queries to the teacher model, which incurs significant monetary and computational costs. To address these problems, we propose a novel method called \emph{query-effIcient Data-free lEarning blAck-box modeLs} (IDEAL), which aims to query-efficiently learn from black-box model APIs to train a good student without any real data. % a small number of queries. In detail, IDEAL trains the student model in two stages: data generation and model distillation. Note that IDEAL does not require any query in the data generation stage and queries the teacher only once for each sample in the distillation stage. Extensive experiments on various real-world datasets show the effectiveness of the proposed IDEAL. For instance, IDEAL can improve the performance of the best baseline method DFME by 5.83\% on CIFAR10 dataset with only $0.02\times$ the query budget of DFME. Our code will be published upon acceptance.","black-box model, knowledge distillation" MapTR: Structured Modeling and Learning for Online Vectorized HD Map Construction,https://openreview.net/forum?id=k7p_YAO7yE,https://openreview.net/pdf?id=k7p_YAO7yE,We present a structured end-to-end framework for efficient online vectorized HD map construction.,"High-definition (HD) map provides abundant and precise environmental information of the driving scene, serving as a fundamental and indispensable component for planning in autonomous driving system. 
We present MapTR, a structured end-to-end Transformer for efficient online vectorized HD map construction. We propose a unified permutation-equivalent modeling approach, i.e., modeling each map element as a point set with a group of equivalent permutations, which accurately describes the shape of the map element and stabilizes the learning process. We design a hierarchical query embedding scheme to flexibly encode structured map information and perform hierarchical bipartite matching for map element learning. MapTR achieves the best performance and efficiency with only camera input among existing vectorized map construction approaches on the nuScenes dataset. In particular, MapTR-nano runs at real-time inference speed ($25.1$ FPS) on RTX 3090, $8\times$ faster than the existing state-of-the-art camera-based method while achieving $5.0$ higher mAP. Even compared with the existing state-of-the-art multi-modality method, MapTR-nano achieves $0.7$ higher mAP and $8\times$ faster inference speed, and MapTR-tiny achieves $13.5$ higher mAP and $3\times$ faster inference speed. Abundant qualitative results show that MapTR maintains stable and robust map construction quality in complex and various driving scenes. MapTR is of great application value in autonomous driving. Code will be released to facilitate further research and application.","Autonomous Driving, Online Vectorized HD Map Construction, End-to-End" (LA)YER-NEIGH(BOR) SAMPLING: DEFUSING NEIGHBORHOOD EXPLOSION,https://openreview.net/forum?id=b553eG8Wkb,https://openreview.net/pdf?id=b553eG8Wkb,Paper presents a new sampling algorithm combining layer and neighbor sampling methods,"Graph Neural Networks have recently received significant attention; however, training them at a large scale remains a challenge. Minibatch training coupled with sampling is used to alleviate this challenge. However, existing approaches either suffer from the neighborhood explosion phenomenon or do not perform well. To deal with these issues, we propose a new sampling algorithm called LAyer-neighBOR sampling (LABOR). It is designed to be a direct replacement for Neighbor Sampling with the same fanout hyperparameter while sampling much fewer vertices, without sacrificing quality. By design, the variance of the estimator of each vertex matches Neighbor Sampling from the point of view of a single vertex. In our experiments, we demonstrate the superiority of our approach in terms of model convergence behaviour against Neighbor Sampling and other layer sampling approaches under the same limited vertex sampling budget constraints.","Graph Neural Networks, Sampling" Probing into Overfitting for Video Recognition,https://openreview.net/forum?id=-0tPmzgXS5,https://openreview.net/pdf?id=-0tPmzgXS5,We propose a data augmentation tailored for action recognition which shows consistent improvement over various models and datasets.,"Video recognition methods based on 2D networks have thrived in recent years, leveraging advanced image classification techniques. However, overfitting is an even more severe problem in 2D video recognition models as 1) the scale of video datasets is relatively small compared to image recognition datasets like ImageNet; 2) the current pipeline treats background and semantic frames equally during optimization, which aggravates overfitting. Motivated by these challenges, we design a video-specific data augmentation approach, named Ghost Motion (GM), to alleviate overfitting. 
Specifically, GM shifts channels along the temporal dimension so that semantic motion information diffuses into other frames that may originally be irrelevant, leading to improved frame-wise accuracy. In addition, for challenging video samples with significant temporal dependency (e.g., Something-Something), we further scale the logits during training to prevent overconfident predictions on background frames. Comprehensive empirical validation on various popular datasets shows that the proposed method can improve the generalization of existing methods and is compatible with other competing data augmentation approaches.","Action Recognition, Data Augmentation, Overfitting" Image as Set of Points,https://openreview.net/forum?id=awnvqZja69,https://openreview.net/pdf?id=awnvqZja69,"We introduce Context Cluster, a new paradigm that considers an image as a set of points and employs a clustering method for feature extraction.","What is an image, and how do we extract latent features? Convolutional Networks (ConvNets) consider an image as organized pixels in a rectangular shape and extract features via convolutional operation in a local region; Vision Transformers (ViTs) treat an image as a sequence of patches and extract features via attention mechanism in a global range. In this work, we introduce a straightforward and promising paradigm for visual representation, which is called Context Clusters. Context clusters (CoCs) view an image as a set of unorganized points and extract features via a simplified clustering algorithm. In detail, each point includes the raw feature (e.g., color) and positional information (e.g., coordinates), and a simplified clustering algorithm is employed to group and extract deep features hierarchically. Our CoCs are convolution- and attention-free, relying only on a clustering algorithm for spatial interaction. Owing to the simple design, we show CoCs endow gratifying interpretability via the visualization of the clustering process. Our CoCs aim at providing a new perspective on images and visual representation, which may enjoy broad applications in different domains and exhibit profound insights. Even though we are not targeting SOTA performance, CoCs still achieve comparable or even better performance than ConvNets or ViTs on several benchmarks.","Clustering, Image Processing, Context Cluster, Representation" Examining the Value of Neural Filter Pruning -- Retrospect and Prospect,https://openreview.net/forum?id=DFaFg1u7UT,https://openreview.net/pdf?id=DFaFg1u7UT,"We study the ""value of filter pruning"" issue and show that prior negative conclusions may stem from suboptimal LR setups; we provide further insights into the underlying reason.","Neural network filter pruning is one of the major methods in model compression and acceleration. Despite the remarkable progress in the past several years, there is an ongoing debate concerning the value of filter pruning -- some works in 2019 argue that filter pruning is of no value since they found training the pruned network from scratch can achieve similar or even better performance than pruning a pretrained model. This argument fundamentally challenges the value of many filter pruning works. However, to date, the community has not formally responded to such acute questioning. In this paper, we present extensive empirical analyses to show the seeming contradiction is due to suboptimal learning rate schedule settings. We introduce stricter comparison setups and show filter pruning still has value within the same training epoch budgets. 
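The channel-shift operation described at the top of the Ghost Motion entry above can be sketched as follows. This is a hypothetical, TSM-style rendering: the fraction of shifted channels (fold_div) and the exact mixing GM uses are assumptions, not taken from the paper.

```python
import numpy as np

def ghost_motion_shift(x, fold_div=8):
    # x: (N, T, C, H, W) clip features. Shift one slice of channels
    # backward in time and another forward, so each frame mixes in
    # semantic motion information from its temporal neighbors.
    n, t, c, h, w = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift backward in time
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift forward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels stay put
    return out
```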
Apart from justifying the value of filter pruning empirically, we further examine the reason behind it and discover that the poor trainability caused by pruning is largely responsible for the sub-optimality of the learning rate schedule, thus calling for an urgent need to recover trainability after pruning. This paper does not target new SOTA performance of filter pruning. Instead, we focus on clarifying the existing mysteries in filter pruning towards a better understanding.","Neural network filter pruning, value of pruning, trainability, dynamical isometry" Sparse Misinformation Detector,https://openreview.net/forum?id=4Udi4sd8qz9,https://openreview.net/pdf?id=4Udi4sd8qz9,"We present an efficient sparse misinformation detector based on a special sparsity pattern (CircuSparsity), with very encouraging performance.","We present Sparse Misinformation Detector (SMD), a new efficient misinformation detection network with regular fine-grained sparsity. We propose two technical components to enable SMD. First, CircuSparsity, a new hardware-friendly sparsity pattern, is introduced for improved training and testing efficiency. Second, through dedicated empirical analyses, we discover that document-level misinformation detection is largely insensitive to a compact model size, which inspires us to use an early exit for the document-level misinformation classifier. With these two techniques, we successfully achieve efficient misinformation detection at both the document and event levels with one single model. Empirically, our approach significantly outperforms the original dense misinformation detection network while enjoying 50% to 75% sparsity. Extensive experiments and analyses demonstrate the merits of our method compared to other top-performing counterpart approaches. To the best of our knowledge, this is the first attempt at efficient misinformation detection from the sparse network training perspective.","Misinformation detection, fake news detection, sparse training, network pruning" Hybrid Neuro-Symbolic Reasoning based on Multimodal Fusion,https://openreview.net/forum?id=SFyOjfEOJO,https://openreview.net/pdf?id=SFyOjfEOJO,A hybrid neural/symbolic modeling to enhance complex image classifications using commonsense knowledge.,"Deep neural models and symbolic Artificial Intelligence (AI) systems have contrasting advantages and disadvantages. Neural models can be trained from raw, incomplete and noisy data to obtain abstractions of features at various levels, but their uninterpretability is well-known. On the other hand, traditional rule-based symbolic reasoning encodes domain knowledge, but its failure is often attributed to the knowledge acquisition bottleneck. We propose to build a hybrid learning and reasoning system based on a multimodal fusion approach that brings together advantageous features from both paradigms. Specifically, we enhance convolutional neural networks (CNNs) with the structured information of ‘if-then’ symbolic logic rules obtained via word embeddings corresponding to propositional symbols and terms. With many dozens of intuitive rules relating the type of a scene to its typical constituent objects, we are able to achieve significant improvement over the base CNN-based classification. 
Our approach is extensible to handle first-order logical syntax for rules and other deep learning models.","Neural Networks, Deep Learning, Symbolic Reasoning, Multimodal Fusion, Word Embedding, Rule-based Reasoning" Distilling Text-Image Foundation Models,https://openreview.net/forum?id=VsqE7E-lWB,https://openreview.net/pdf?id=VsqE7E-lWB,"We focus on the image classification task, and investigate the capacity gap resistance of CLIP in knowledge distillation.","Large pretrained foundation models (such as CLIP, DALL-E) are among the most recent significant advances in the AI community. Their implications are profound. This paper examines the value of these foundation models as a model knowledge base -- we aim to distill the knowledge in these foundation models for training lightweight models designed for specific tasks in practical application scenarios with improved performance. Despite abundant progress in knowledge distillation (KD) in traditional models trained under the supervision of class labels in datasets encoded as integers, distilling such text-image contrastive learning models has not been explored extensively. Meanwhile, KD is well known to suffer from the capacity gap problem (i.e., distilling knowledge from a teacher significantly larger than a student often degrades the performance of the student). The teacher-student capacity gap in distilling foundation models is even larger. How to overcome this potential issue thus remains elusive. This paper presents detailed analyses of these questions aiming to successfully tap into a pretrained foundation model (CLIP) to boost the student's performance. Besides the practical performance benefits, several interesting discoveries are unveiled: (1) CLIP is not hindered by the capacity gap, which may prompt us to re-evaluate whether the ""capacity-gap"" issue is really due to the capacity gap; (2) we find this is largely because CLIP is not over-confident in the wrong labels when it misclassifies input image samples.","foundation models, CLIP, knowledge distillation, capacity gap" Trainability Preserving Neural Pruning,https://openreview.net/forum?id=AZFvpnnewr,https://openreview.net/pdf?id=AZFvpnnewr,We present a new filter pruning approach that effectively preserves trainability during pruning with encouraging performance.,"Many recent pruning works show trainability plays a critical role in network structured pruning -- if left unattended, broken trainability can lead to severe under-performance and unintentionally amplify the effect of the finetuning learning rate, resulting in biased (or even misinterpreted) benchmark results. In this paper, we present trainability preserving pruning (TPP), a scalable method to preserve network trainability against pruning, aiming for improved pruning performance. TPP regularizes the Gram matrix of convolutional filters to decorrelate the pruned filters from the retained filters. In addition to the convolutional layers, in the spirit of preserving the trainability of the whole network, we also propose to regularize the batch normalization parameters. Empirically, TPP performs on par with the ground-truth trainability recovery method on linear MLP networks. On non-linear networks (ResNet56/VGG19 on CIFAR10/100), our TPP outperforms the other counterpart schemes by a clear margin. 
Moreover, extensive results on ImageNet with ResNets show that TPP consistently performs favorably against other top-performing structured pruning approaches.","neural network structured pruning, trainability, kernel orthogonalization" Rotation Invariant Quantization for Model Compression,https://openreview.net/forum?id=gurtzTlw6Q,https://openreview.net/pdf?id=gurtzTlw6Q,"In this study, we investigate the theoretical limits of post-training NN model compression rates using the rate-distortion theory, proving that the highest compression rate is attained by a simple single-letter (scalar) rotation-invariant solution.","Post-training Neural Network (NN) model compression is an attractive approach for deploying large, memory-consuming models on devices with limited memory resources. In this study, we investigate the theoretical limits of NN model compression rates using rate-distortion theory. First, we prove that the highest compression is attained by a simple single-letter (scalar) rotation-invariant solution. Then, based on these insights, we suggest a Rotation-Invariant Quantization (RIQ) technique that finds the optimal single-letter solution efficiently, yielding a different rate at each layer, i.e., mixed-precision quantization. We rigorously evaluate RIQ and demonstrate its capabilities on various models and tasks. For example, RIQ facilitates $\times 19.4$ and $\times 52.9$ compression ratios on pre-trained VGG dense and pruned models, respectively, with $<0.4\%$ degradation in accuracy. Code is available in the supplementary material.","Neural Network, Model compression, Rate-distortion, Quantization" Robustness Exploration of Semantic Information in Adversarial Training,https://openreview.net/forum?id=SWUGykek_T,https://openreview.net/pdf?id=SWUGykek_T,,"In this paper, we look into the problem of adversarial robustness from the semantic information perspective. We demonstrate a novel insight that adversarial attacks destroy the correlation between visual representations and semantic word vectors, and that adversarial training restores it. We further find that the correlation between robust features of different categories is consistent with the correlation between the corresponding semantic word vectors. Based on this, we introduce semantic information to assist model training and propose Semantic Constraint Adversarial Robust Learning (SCARL). First, through an information-theoretic lens, we formulate the mutual information between the visual representation and the corresponding semantic word vector in the embedding space to bridge the information gap. We further provide a differentiable lower bound to optimize such mutual information efficiently. Second, we propose a novel semantic structural constraint, encouraging the trained model to keep the structure of visual representations consistent with that of semantic word vectors. Finally, we combine these two techniques with adversarial training to learn robust visual representations. 
We conduct extensive experiments on several benchmarks, demonstrating that semantic information is indeed beneficial to model robustness.","Adversarial training, Semantic information, Adversarial robustness" Learning Implicit Scale Conditioned Memory Compensation for Talking Head Generation,https://openreview.net/forum?id=OUMNXSAek8,https://openreview.net/pdf?id=OUMNXSAek8,We propose a novel implicit scale conditioned memory compensation network (MCNet) for high-fidelity talking head generation.,"Talking head video generation aims to animate the pose and expression of a person in a target driving video using the motion information contained in the video, while maintaining the identity of the person in a given still source image. Highly dynamic and complex motions in the driving video cause ambiguous generation from the source image, because the still source image cannot provide sufficient appearance information for occluded regions or delicate expressions, which produces severe artifacts and significantly degrades the generation quality. However, existing works mainly focus on learning more accurate motion estimation and representation in 2D and 3D, and they ignore the facial structural prior in addressing the facial ambiguities. Therefore, effective handling of the ambiguities in the dramatic appearance changes of the source to largely improve facial details and completeness in generation still remains barely explored. To this end, we propose a novel implicit scale conditioned memory compensation network (MCNet) for high-fidelity talking head generation. Specifically, considering that human faces are symmetric and structured, we aim to automatically learn a representative global facial memory bank from all training data as a prior to compensate for the facial generation features. Each face in the source image has a scale that is reflected in the detected facial keypoints. To better query the learned global memory, we further propose to learn implicit scale representations from the discrete keypoints, which are used to condition the query of the global memory and obtain scale-aware memory for feature compensation. Extensive experiments from quantitative and qualitative perspectives demonstrate that MCNet can learn representative and complementary facial memory, and can clearly outperform previous state-of-the-art methods on the VoxCeleb1 and CelebV datasets. ",Talking Head Generation On the Dynamics under the Averaged Sample Margin Loss and Beyond,https://openreview.net/forum?id=P_O91UpSX0M,https://openreview.net/pdf?id=P_O91UpSX0M,We investigate the dynamics of the averaged sample margin loss and provide some insights for improvements.,"Recent works have studied implicit biases in deep learning, especially the behavior of last-layer features and classifier weights. However, they usually need to simplify the dynamics under gradient descent due to the intractability of loss functions and neural architectures. In this paper, we introduce a concise loss function as a surrogate, namely the Averaged Sample Margin (ASM) loss, which offers more mathematical opportunities to analyze the closed-form dynamics while requiring few simplifications or assumptions, and allows for more practical considerations. Based on the layer-peeled model that views last-layer features as free optimization variables, we build a complete analysis for the unconstrained, regularized, and spherically constrained cases. 
We show that these dynamics mainly \textit{converge exponentially fast} to a solution depending on the initialization of features and classifier weights, which can help explain why the training of deep neural networks usually takes only a few hundred epochs. Our theoretical results can also provide insights for improving practical training with the ASM loss or other losses, such as explicit feature regularization and rescaled learning rates for spherical cases. Finally, we empirically demonstrate these theoretical results and insights with extensive experiments.","Implicit bias, neural collapse, gradient descent" ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers,https://openreview.net/forum?id=CvfiXFOW2n,https://openreview.net/pdf?id=CvfiXFOW2n,,"Although Transformers have successfully transitioned from their language modelling origins to image-based applications, their quadratic computational complexity remains a challenge, particularly for dense prediction. In this paper, we propose a content-based sparse attention method, as an alternative to dense self-attention, aiming to reduce the computational complexity while retaining the ability to model long-range dependencies. Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count. The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost. In addition, we extend the clustering-guided attention from single-scale to multi-scale, which is conducive to dense prediction tasks. We label the proposed Transformer architecture ClusTR, and demonstrate that it achieves state-of-the-art performance on various vision tasks at lower computational cost and with fewer parameters. For instance, our ClusTR small model with 22.7M parameters achieves 83.2% Top-1 accuracy on ImageNet. Source code and ImageNet models will be made publicly available.","Vision Transformer, Clustering, Multi-scale, Efficient" DrML: Diagnosing and Rectifying Vision Models using Language,https://openreview.net/forum?id=D-zfUK7BR6c,https://openreview.net/pdf?id=D-zfUK7BR6c,Our work highlights a distinct advantage of multi-modal embedding space: the ability to diagnose vision classifiers through natural language.,"Recent multi-modal contrastive learning models have demonstrated the ability to learn an embedding space suitable for building strong vision classifiers, by leveraging the rich information in large-scale image-caption datasets. Our work highlights a distinct advantage of this multi-modal embedding space: the ability to diagnose vision classifiers through natural language. The traditional process of diagnosing model behaviors in deployment settings involves labor-intensive data acquisition and annotation. Our proposed method, DrML, can discover high-error data slices, identify influential attributes and further rectify undesirable model behaviors, without requiring any visual data. Through a combination of theoretical explanation and empirical verification, we present conditions under which classifiers trained on embeddings from one modality can be equivalently applied to embeddings from another modality. 
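The clustering-then-attention idea in the ClusTR entry above can be sketched as follows: keys and values are clustered to shrink the token count before attention. Everything here (plain k-means, the cluster count, single-head attention) is an assumed simplification for illustration, not the paper's implementation.

```python
import numpy as np

def clustered_attention(q, k, v, n_clusters=16, iters=5):
    # q: (Nq, d); k, v: (Nk, d), with Nk >= n_clusters.
    # Naive k-means over the keys.
    centers = k[np.random.choice(len(k), n_clusters, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((k[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            if np.any(assign == c):
                centers[c] = k[assign == c].mean(0)
    # Aggregate values per cluster so the token count shrinks to n_clusters.
    v_c = np.stack([v[assign == c].mean(0) if np.any(assign == c)
                    else np.zeros(v.shape[1]) for c in range(n_clusters)])
    # Attention over the clustered sequence: O(Nq * n_clusters) instead of O(Nq * Nk).
    logits = q @ centers.T / np.sqrt(q.shape[1])
    attn = np.exp(logits - logits.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    return attn @ v_c
```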
On a range of image datasets with known error slices, we demonstrate that our method can effectively identify error slices and influential attributes, and can further use language to rectify failure modes of the classifier.","model diagnosis, multi-modal contrastive learning, vision and language" Semantic Grouping Network for Audio Source Separation,https://openreview.net/forum?id=k8nG8lWMMjn,https://openreview.net/pdf?id=k8nG8lWMMjn,"We propose a novel Semantic Grouping Network, termed SGN, that can disentangle sound representations and extract high-level semantic info for each source to guide separation.","Audio source separation is a typical and challenging problem that aims to separate individual sources from an audio mixture. Recently, audio-visual separation approaches take advantage of the natural synchronization between the two modalities to boost separation performance. They extract high-level semantics from visual inputs as guidance to help disentangle sound representations for individual sources. Can we directly learn to disentangle the individual semantics from the sound itself? The dilemma is that multiple sound sources are mixed together in the original space. To tackle the difficulty, in this paper, we present a novel Semantic Grouping Network, termed SGN, that can directly disentangle sound representations and extract high-level semantic information for each source from the input audio mixture. Specifically, SGN aggregates category-wise source features through learnable class tokens of sounds. Then, the aggregated semantic features can be used as guidance to separate the corresponding audio sources from the mixture. The proposed audio source separation framework is simple, flexible, and scalable. Compared to existing sound separation methods, our framework supports audio separation from a flexible number of sources and generalizes to sound sources from different domains. We conducted extensive experiments on both music-only and universal sound separation benchmarks: MUSIC and FUSS. The results demonstrate that our SGN significantly outperforms previous audio-only methods and audio-visual models without utilizing additional visual cues.","audio source separation, audio-visual separation" "Neural Shape Compiler: A Unified Framework for Transforming between Text, Point Cloud, and Program",https://openreview.net/forum?id=anFy7Tb1zW,https://openreview.net/pdf?id=anFy7Tb1zW,"We propose Neural Shape Compiler to translate between three hierarchical shape abstractions: Point Cloud, Shape Program, and Text.","3D shapes have complementary abstractions from low-level geometry to part-based hierarchies to languages, which convey different levels of information. This paper presents a unified framework to translate between pairs of shape abstractions: Text ⟺ Point Cloud ⟺ Program. We propose Neural Shape Compiler to model the abstraction transformation as a conditional generation process. It converts 3D shapes of three abstract types into unified discrete shape code, transforms each shape code into code of other abstract types through the proposed ShapeCode Transformer, and decodes them to output the target shape abstraction. Point Cloud code is obtained in a class-agnostic way by the proposed PointVQVAE. On Text2Shape, ShapeGlot, ABO, Genre, and Program Synthetic datasets, Neural Shape Compiler shows strengths in Text ⟹ Point Cloud, Point Cloud ⟹ Text, Point Cloud ⟹ Program, and Point Cloud Completion tasks. 
Additionally, Neural Shape Compiler benefits from jointly training on all heterogeneous data and tasks.","Multimodal Learning, Generative Models, Representation Learning" Improving Corruption Robustness with Adversarial Feature Alignment Transformers,https://openreview.net/forum?id=YWZ90TiPBM,https://openreview.net/pdf?id=YWZ90TiPBM,We improve the robustness of transformers by enhancing the stability of the self-attention mechanism.,"Despite their success, vision transformers remain vulnerable to image corruptions, such as noise or blur. Indeed, we find that the vulnerability mainly stems from the unstable self-attention mechanism, which is inherently built upon patch-based inputs and often becomes overly sensitive to corruptions across patches. For example, when we occlude only a small number of patches with random noise (e.g., 10%), these patch-based corruptions would lead to severe accuracy drops and greatly mislead the intermediate features as well as the corresponding attentions over them. To alleviate this issue, we seek to explicitly reduce the sensitivity of attention layers to patch-based corruptions and improve the overall robustness of transformers. In this paper, we propose the Adversarial Feature Alignment Transformer (AFAT) that aligns the features between clean examples and patch-based corruptions. To construct these corrupted examples, we build a patch corruption model to identify and occlude the patches that could severely distract the intermediate attention layers. We highlight that the corruption model is trained adversarially to the following feature alignment process, which is essentially different from existing methods. In experiments, AFAT greatly improves the stability of attention layers and consistently yields better robustness on various benchmarks, including CIFAR-10/100-C, ImageNet-A, ImageNet-C, and ImageNet-P.","Corruption Robustness, Attention Stability, Feature Alignment" Sharpness-aware Quantization for Deep Neural Networks,https://openreview.net/forum?id=IAy-lKeb3z,https://openreview.net/pdf?id=IAy-lKeb3z,"We propose a novel method, dubbed Sharpness-Aware Quantization (SAQ), to smooth the loss landscape and improve the generalization performance of the quantized models.","Network quantization has gained increasing attention since it can significantly reduce the model size and computational overhead. However, due to the discrete nature of quantization, a small change in full-precision weights might incur a large change in quantized weights, which leads to severe loss fluctuations and thus a sharp loss landscape. The fluctuating loss makes the gradients unstable during training, resulting in considerable performance degradation. Recently, Sharpness-Aware Minimization (SAM) has been proposed to smooth the loss landscape and improve the generalization performance of models. Nevertheless, how to customize SAM to quantized models is non-trivial due to the effect of clipping and discretization in quantization. In this paper, we propose a novel method, dubbed Sharpness-Aware Quantization (SAQ), to smooth the loss landscape and improve the generalization performance of quantized models, which explores the effect of SAM in model compression, particularly quantization, for the first time. Specifically, we first propose a unified view of quantization and SAM, where we consider them as introducing quantization noises and adversarial perturbations to the model weights. 
According to whether the quantization noises and adversarial perturbations depend on each other, SAQ can be divided into three cases, which we analyze and compare comprehensively. Extensive experiments on both convolutional neural networks and Transformers show that SAQ improves the generalization performance of quantized models, yielding SOTA results in uniform quantization. For example, on ImageNet, our SAQ outperforms the model trained with the conventional optimization procedure (i.e., SGD) by 1.1% in Top-1 accuracy on 4-bit ResNet-50. Our 4-bit ResNet-34 surpasses the previous SOTA quantization method by 1.0% in Top-1 accuracy.","Sharpness-aware Minimization, Quantization, CNNs, Transformers" Robust Generalization against Corruptions via Worst-Case Sharpness Minimization,https://openreview.net/forum?id=vPvy4x-0H52,https://openreview.net/pdf?id=vPvy4x-0H52,Mitigating sharpness of the worst-case distributions for robust generalization against corruptions.,"Robust generalization aims to deal with the most challenging data distributions, which are rarely present in the training set and contain severe noise corruptions. Common solutions such as distributionally robust optimization (DRO) focus on the worst-case empirical risk to ensure low training error on the uncommon noisy distributions. However, due to the over-parameterized model being optimized on scarce worst-case data, DRO fails to produce a smooth loss landscape, thus struggling to generalize well to the test set. Therefore, instead of focusing on the worst-case risk minimization, we propose SharpDRO, which penalizes the sharpness of the worst-case distribution, measuring the loss changes around the neighborhood of the learned parameters. Through worst-case sharpness minimization, the proposed method successfully produces a flat loss curve on the corrupted distributions, thus achieving robust generalization. Moreover, by considering whether the distribution annotation is available, we apply SharpDRO to two problem settings and design a worst-case selection process for robust generalization. Through simulating real-world noisy distributions using CIFAR10/100 and ImageNet30 datasets, we show that SharpDRO exhibits strong generalization ability against severe corruptions and exceeds well-known baseline methods with large performance gains.","distributionally robust optimization, out-of-distribution generalization" Harnessing Out-Of-Distribution Examples via Augmenting Content and Style,https://openreview.net/forum?id=boNyg20-JDm,https://openreview.net/pdf?id=boNyg20-JDm,Harnessing OOD examples through data augmentation that changes the content and style.,"Machine learning models are vulnerable to Out-Of-Distribution (OOD) examples, a problem that has drawn much attention. However, current methods lack a full understanding of different types of OOD data: there are benign OOD data that can be properly adapted to enhance the learning performance, while other malign OOD data would severely degenerate the classification result. To harness OOD data, this paper proposes the HOOD method, which leverages the content and style of each image instance to identify benign and malign OOD data. Particularly, we design a variational inference framework to causally disentangle content and style features by constructing a structural causal model. Subsequently, we augment the content and style through an intervention process to produce malign and benign OOD data, respectively. 
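Both the SAQ and SharpDRO entries above build on sharpness-aware minimization. For reference, this is the standard SAM objective and its one-step approximation of the inner maximization, i.e., the form both papers customize; the notation here is generic and not taken from either paper:

```latex
\min_{w}\; \max_{\|\epsilon\|_2 \le \rho} L(w + \epsilon),
\qquad
\hat{\epsilon}(w) \approx \rho \,\frac{\nabla_w L(w)}{\|\nabla_w L(w)\|_2},
\qquad
w \leftarrow w - \eta\, \nabla_w L\bigl(w + \hat{\epsilon}(w)\bigr).
```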
The benign OOD data contain novel styles but retain the contents of interest, and they can be leveraged to help train a style-invariant model. In contrast, the malign OOD data inherit unknown contents but carry familiar styles; detecting them can improve model robustness against deceptive anomalies. Thanks to the proposed novel disentanglement and data augmentation techniques, HOOD can effectively deal with OOD examples in unknown and open environments, whose effectiveness is empirically validated in three typical OOD applications including OOD detection, open-set semi-supervised learning, and open-set domain adaptation.","out-of-distribution, open-set learning" On Stability and Generalization of Bilevel Optimization Problems,https://openreview.net/forum?id=LPwlqyrnwg,https://openreview.net/pdf?id=LPwlqyrnwg,Generalization bounds in different settings for single-timescale gradient-based methods,"(Stochastic) bilevel optimization is a frequently encountered problem in machine learning with a wide range of applications such as meta-learning, hyper-parameter optimization, and reinforcement learning. Most of the existing studies on this problem have focused only on analyzing the convergence or improving the convergence rate, while little effort has been devoted to understanding its generalization behaviors. In this paper, we conduct a thorough analysis of the generalization of first-order (gradient-based) methods for the bilevel optimization problem. We first establish a fundamental connection between algorithmic stability and generalization error in different forms and give a high-probability generalization bound which improves the previous best one from $O(\sqrt{n})$ to $O(\log n)$, where $n$ is the sample size. We then provide the first stability bounds for the general case where both inner and outer level parameters are subject to continuous update, while existing work allows only the outer level parameter to be updated. Our analysis can be applied in various standard settings such as strongly-convex-strongly-convex (SC-SC), convex-convex (C-C), and nonconvex-nonconvex (NC-NC). Our analysis for the NC-NC setting can also be extended to a particular nonconvex-strongly-convex (NC-SC) setting that is commonly encountered in practice. Finally, we corroborate our theoretical analysis and demonstrate how iterations can affect the generalization error by experiments on meta-learning and hyper-parameter optimization.","bilevel optimization, generalization, stability" Learning GFlowNets from partial episodes for improved convergence and stability,https://openreview.net/forum?id=UYS38ssi1M,https://openreview.net/pdf?id=UYS38ssi1M,GFlowNet training is made faster and more stable by learning from subtrajectories.,"Generative flow networks (GFlowNets) are a family of algorithms for training a sequential sampler of discrete objects under an unnormalized target density and have been successfully used for various probabilistic modeling tasks. Existing training objectives for GFlowNets are either local to states or transitions, or propagate a reward signal over an entire sampling trajectory. We argue that these alternatives represent opposite ends of a gradient bias-variance tradeoff and propose a way to exploit this tradeoff to mitigate its harmful effects. Inspired by the TD($\lambda$) algorithm in reinforcement learning, we introduce subtrajectory balance or SubTB($\lambda$), a GFlowNet training objective that can learn from partial action subsequences of varying lengths. 
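For the GFlowNet entry above, one consistent way to write a subtrajectory balance objective for a partial trajectory $(s_m, \ldots, s_n)$ with state flow $F$, forward policy $P_F$, and backward policy $P_B$ is the squared log-ratio below; SubTB($\lambda$) then weights subtrajectories by $\lambda^{n-m}$. This rendering is a reconstruction from the abstract's description, not quoted from the paper:

```latex
\mathcal{L}_{\mathrm{SubTB}}(s_m, \ldots, s_n) =
\left(
\log \frac{F(s_m)\, \prod_{i=m}^{n-1} P_F(s_{i+1} \mid s_i)}
          {F(s_n)\, \prod_{i=m}^{n-1} P_B(s_i \mid s_{i+1})}
\right)^{2}.
```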
We show that SubTB($\lambda$) accelerates sampler convergence in previously studied and new environments and enables training GFlowNets in environments with longer action sequences and sparser reward landscapes than was previously possible. We also perform a comparative analysis of stochastic gradient dynamics, shedding light on the bias-variance tradeoff in GFlowNet training and the advantages of subtrajectory balance.","GFlowNets, probabilistic modeling, reinforcement learning" DropIT: Dropping Intermediate Tensors for Memory-Efficient DNN Training,https://openreview.net/forum?id=Kn6i2BZW69w,https://openreview.net/pdf?id=Kn6i2BZW69w,"DropIT can save memory & improve accuracy, providing a new perspective on dropping in activation-compressed training beyond quantization.","A standard hardware bottleneck when training deep neural networks is GPU memory. The bulk of memory is occupied by caching intermediate tensors for gradient computation in the backward pass. We propose a novel method to reduce this footprint - Dropping Intermediate Tensors (DropIT). DropIT drops min-k elements of the intermediate tensors and approximates gradients from the sparsified tensors in the backward pass. Theoretically, DropIT reduces noise on estimated gradients and therefore has a higher rate of convergence than vanilla SGD. Experiments show that we can drop up to 90% of the intermediate tensor elements in fully-connected and convolutional layers while achieving higher testing accuracy for Visual Transformers and Convolutional Neural Networks on various tasks (e.g., classification, object detection). Our code and models are available at https://anonymous.4open.science/r/dropit-iclr177submission.","dropping intermediate tensors, dropping activations, activation compressed training, top-k, vision transformer, cnn" Self-attentive Rationalization for Graph Contrastive Learning,https://openreview.net/forum?id=CdU7ApBxICO,https://openreview.net/pdf?id=CdU7ApBxICO,Graph contrastive learning framework with self-attentive rationalization,"Graph augmentation is the key component to reveal instance-discriminative features of a graph as its rationale in graph contrastive learning (GCL). Existing rationale-aware augmentation mechanisms in GCL frameworks roughly fall into two categories, each suffering from inherent limitations: (1) non-heuristic methods with the guidance of domain knowledge to preserve salient features, which require expensive expertise and lack generality, or (2) heuristic augmentations with a co-trained auxiliary model to identify crucial substructures, which face not only the dilemma between system complexity and transformation diversity, but also the instability stemming from the co-training of two separate sub-models. Inspired by recent studies on transformers, we propose $\underline{S}$elf-attentive $\underline{R}$ationale guided $\underline{G}$raph $\underline{C}$ontrastive $\underline{L}$earning (SR-GCL), which integrates the rationale finder and the encoder, leverages the self-attention values in the transformer module as natural guidance to delineate semantically informative substructures from both node- and edge-wise views, and contrasts on rationale-aware augmented pairs. 
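The min-k dropping in the DropIT entry above amounts to sparsifying an intermediate tensor before caching it for the backward pass. A minimal sketch, with a hypothetical function name and the drop ratio as an assumed hyperparameter:

```python
import numpy as np

def dropit_cache(x, drop_ratio=0.9):
    # Drop the drop_ratio fraction of elements with the smallest
    # magnitudes ("min-k"); the sparsified tensor is what gets cached,
    # and the backward pass approximates gradients from it.
    k = int(x.size * drop_ratio)
    if k == 0:
        return x.copy()
    thresh = np.partition(np.abs(x).ravel(), k - 1)[k - 1]
    return np.where(np.abs(x) > thresh, x, 0.0)  # ties at thresh also drop
```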
","Graph Contrastive Learning, Self-supervised Learning, Transformer, Rationalization, self-attention" A Unified Framework of Soft Threshold Pruning,https://openreview.net/forum?id=cCFqcrq0d8,https://openreview.net/pdf?id=cCFqcrq0d8,,"Soft threshold pruning is among the cutting-edge pruning methods with state-of-the-art performance. However, previous methods either aimlessly perform searching on the threshold scheduler or simply train the threshold, lacking theoretical explanation from a unified perspective. In this work, we reformulate soft threshold pruning as an implicit optimization problem solved using the *Iterative Shrinkage-Thresholding* Algorithm (ISTA), a classic method from the fields of sparse recovery and compressed sensing. Under this theoretical framework, all threshold tuning strategies proposed in previous studies of soft threshold pruning are explained as a specific arrangement style of regularization term. We further derive an optimal threshold scheduler through an in-depth study of threshold scheduling based on our framework. This scheduler keeps $L_1$-regularization in the equivalent optimization problem stable and corresponds to a consistent objective function and can be, in principle, applied to sparsify any mathematical model that includes parameters trained via SGD. We conduct extensive experiments and verify its state-of-the-art performance on both Artificial Neural Networks (ResNet-50 and MobileNet-V1) and Spiking Neural Networks (SEW ResNet-18) on ImageNet datasets. It further evolves into a family of novel pruning methods, including sparsify-during-training, early pruning, and pruning at initialization via analysis based on our framework.","Network Pruning, Network Compression, Spiking Neural Networks" Efficient Automatic Machine Learning via Design Graphs,https://openreview.net/forum?id=KVljrqehulG,https://openreview.net/pdf?id=KVljrqehulG,"We propose FALCON, an efficient AutoML method that searches for the optimal model design on design graphs.","Despite the success of automated machine learning (AutoML), which aims to find the best design, including the architecture of deep networks and hyper-parameters, conventional AutoML methods are computationally expensive and hardly provide insights into the relations of different model design choices. To tackle the challenges, we propose FALCON, an efficient sample-based method to search for the optimal model design. Our key insight is to model the design space of possible model designs as a design graph, where the nodes represent design choices, and the edges denote design similarities. FALCON features 1) a task-agnostic module, which performs message passing on the design graph via a Graph Neural Network (GNN), and 2) a task-specific module, which conducts label propagation of the known model performance information on the design graph. Both modules are combined to predict the design performances in the design space, navigating the search direction. We conduct extensive experiments on 27 node and graph classification tasks from various application domains, and an image classification task on the CIFAR-10 dataset. We empirically show that FALCON can efficiently obtain the well-performing designs for each task using only 30 explored nodes. 
Specifically, FALCON has a time cost comparable to that of the one-shot approaches while achieving an average improvement of 3.3% compared with the best baselines.","Automated Machine Learning, Sample efficiency, Design graph, Graph Neural Networks" TaskPrompter: Spatial-Channel Multi-Task Prompting for Dense Scene Understanding,https://openreview.net/forum?id=-CwPopPJda,https://openreview.net/pdf?id=-CwPopPJda,We propose a novel multi-task prompting framework to concurrently learn task-specific and task-generic representations as well as cross-task interactions along spatial and channel dimensions based on transformers for multiple dense prediction tasks.,"Learning effective representations simultaneously from multiple tasks in a unified network framework is a fundamental paradigm for multi-task dense visual scene understanding. This requires jointly modeling (i) task-generic and (ii) task-specific representations, and (iii) cross-task representation interactions. Existing works typically model these three perspectives with separately designed structures, using shared network modules for task-generic learning, different modules for task-specific learning, and establishing connections among these components for cross-task interactions. Modeling these three perspectives in each network layer in an end-to-end manner is barely explored in the literature; doing so can not only minimize the effort of carefully designing empirical structures for the three multi-task representation learning objectives, but also greatly improve the representation learning capability of the multi-task network, since all the model capacity will be used to optimize the three objectives together. In this paper, we propose TaskPrompter, a novel spatial-channel multi-task prompting transformer framework to achieve this target. Specifically, we design a set of spatial-channel task prompts and learn their spatial and channel interactions with the shared image tokens in each transformer layer with an attention mechanism, as aggregating spatial and channel information is critical for dense prediction tasks. Each task prompt learns a task-specific representation for one task, while all the prompts can jointly contribute to the learning of the shared image token representations, and the interactions between different task prompts model the cross-task relationship. To decode dense predictions for multiple tasks with the learned spatial-channel task prompts from the transformer, we accordingly design a dense task prompt decoding mechanism, which queries the shared image tokens using task prompts to obtain spatial- and channel-wise task-specific representations. Extensive experiments on two challenging multi-task dense scene understanding benchmarks (i.e. NYUD-V2 and Pascal-Context) show the superiority of the proposed framework, and TaskPrompter establishes new state-of-the-art performance on multi-task dense prediction. Code and models will be made publicly available.","Multi-task Learning, Scene Understanding, Computer Vision" Optimizing Server-side Aggregation For Robust Federated Learning via Subspace Training,https://openreview.net/forum?id=yUcvOAre_7,https://openreview.net/pdf?id=yUcvOAre_7,,"Non-IID data distribution across clients and poisoning attacks are two main challenges in real-world federated learning systems. While both of them have attracted great research interest with specific strategies developed, no known solution manages to address them in a unified framework.
To overcome both challenges, we propose SmartFL, a generic approach that optimizes the server-side aggregation process with a small amount of on-server proxy data (e.g., around one hundred samples for CIFAR-10) via a subspace training technique. Specifically, the aggregation weight of each participating client at each round is optimized using the server-side proxy data, which is essentially the optimization of the global model in the convex hull spanned by the client models. Since at each round the number of tunable parameters optimized on the server side equals the number of participating clients (and is thus independent of the model size), we are able to train a global model with massive parameters using only a small amount of server-side proxy data. We provide theoretical analyses of the convergence and generalization capacity of SmartFL. Empirically, SmartFL achieves state-of-the-art performance on both federated learning with non-IID data distribution and federated learning with malicious clients. ","federated learning, server-side aggregation, subspace training" Measuring Asymmetric Gradient Discrepancy in Parallel Continual Learning,https://openreview.net/forum?id=aNWiwR2HiOs,https://openreview.net/pdf?id=aNWiwR2HiOs,We propose a Maximum Discrepancy Optimization (MaxDO) strategy to minimize the maximum asymmetric discrepancy among multiple gradients in parallel continual learning.,"In Parallel Continual Learning (PCL), multiple parallel tasks start and end training unpredictably, giving rise to training conflict and catastrophic forgetting issues. The two issues arise because the gradients from parallel tasks differ in directions and magnitudes. Thus, in this paper, we formulate PCL as a minimum-distance optimization problem among gradients and propose an explicit Asymmetric Gradient Distance (AGD) to evaluate the gradient discrepancy in PCL. AGD considers both gradient magnitude ratios and directions, and tolerates updates with a small gradient in the inverse direction, which reduces the imbalanced influence of gradients on parallel task training. Moreover, we propose a novel Maximum Discrepancy Optimization (MaxDO) strategy to minimize the maximum discrepancy among multiple gradients. By solving MaxDO with AGD, parallel training reduces the influence of training conflict and suppresses the catastrophic forgetting of finished tasks. Extensive experiments validate the effectiveness of our approach on three image recognition datasets.","Multi-Task Learning, Continual Learning, Gradient Discrepancy" Individual Privacy Accounting for Differentially Private Stochastic Gradient Descent,https://openreview.net/forum?id=D4JQEKlTyG,https://openreview.net/pdf?id=D4JQEKlTyG,We compute individual privacy parameters for DP-SGD and show the privacy guarantee varies across different groups.,"Differentially private stochastic gradient descent (DP-SGD) is the workhorse algorithm for recent advances in private deep learning. It provides a single privacy guarantee to all datapoints in the dataset. We propose an efficient algorithm to compute privacy guarantees for individual examples when releasing models trained by DP-SGD. We use our algorithm to investigate individual privacy parameters across a number of datasets. We find that most examples enjoy stronger privacy guarantees than the worst-case bound. We further discover that the training loss and the privacy parameter of an example are well-correlated.
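The convex-hull aggregation described in the SmartFL entry above can be sketched compactly: one tunable weight per client, optimized on the proxy set. The sketch below is ours, not the authors' code; `aggregate_on_proxy` is a hypothetical name, and it uses `torch.func.functional_call` (PyTorch >= 2.0) to evaluate the model under a softly merged parameter set.

```python
import torch
import torch.nn.functional as F

def aggregate_on_proxy(model, client_states, proxy_loader, steps=100, lr=0.1):
    """Sketch of convex-hull server-side aggregation in the spirit of SmartFL:
    one tunable logit per client (softmax keeps the combination convex),
    optimized against the small on-server proxy set. Illustrative only."""
    n = len(client_states)
    logits = torch.zeros(n, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    names = list(client_states[0].keys())
    data = iter(proxy_loader)
    for _ in range(steps):
        try:
            x, y = next(data)
        except StopIteration:
            data = iter(proxy_loader)
            x, y = next(data)
        w = F.softmax(logits, dim=0)                 # convex coefficients
        merged = {k: sum(w[i] * client_states[i][k] for i in range(n))
                  for k in names}
        loss = F.cross_entropy(torch.func.functional_call(model, merged, (x,)), y)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        w = F.softmax(logits, dim=0)
        return {k: sum(w[i] * client_states[i][k] for i in range(n))
                for k in names}
```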
This implies that groups that are underserved in terms of model utility are simultaneously underserved in terms of privacy guarantees. For example, on CIFAR-10, the average $\varepsilon$ of the class with the lowest test accuracy is 43.6% higher than that of the class with the highest accuracy. We also run membership inference attacks to show this reflects disparate empirical privacy risks. ","individual privacy for DP-SGD, fairness in privacy" CI-VAE: a Class-Informed Deep Variational Autoencoder for Enhanced Class-Specific Data Interpolation,https://openreview.net/forum?id=jdEXFqGjdh,https://openreview.net/pdf?id=jdEXFqGjdh,A deep learning framework for interpolations in high-dimensional data,"We propose the Class-Informed Variational Autoencoder (CI-VAE) to enable interpolation between arbitrary pairs of observations of the same class. CI-VAE combines the general VAE architecture with a linear discriminator layer on the latent space to enforce the construction of a latent space such that observations from different classes are linearly separable. In conventional VAEs, classes usually overlap in the latent space. In CI-VAE, however, the enforced linear separability of classes in the latent space allows for robust latent-space linear traversal and data generation between two arbitrary observations of the same class. Class-specific data interpolation has extensive potential applications in science, particularly in biology, such as uncovering the biological trajectory of diseases or cancer. We used the MNIST dataset of handwritten digits as a case study to compare the performance of CI-VAE and VAE in class-specific data augmentation. We showed that CI-VAE significantly improved class-specific linear traversal and data augmentation compared with VAE while maintaining comparable reconstruction error. In a study of colon cancer genomics data, we showed that the interpolation between normal cells and tumor cells using CI-VAE may enhance our understanding of cancer development. ","Variational Auto Encoder, Supervised, Latent Space Traversal, Data Interpolation, Discriminator" Attention De-sparsification Matters: Inducing Diversity in Digital Pathology Representation Learning,https://openreview.net/forum?id=SUcUqu_X30,https://openreview.net/pdf?id=SUcUqu_X30,"We introduce Di-SSL, a diversity-inducing self-supervised learning method to enhance the representation learning in Digital Pathology.","In this work, we develop Di-SSL, a Diversity-inducing Self-Supervised Learning technique for histopathology image analysis. SSL techniques, such as contrastive and non-contrastive approaches, have been shown to learn rich and effective representations without any human supervision. Lately, computational pathology has also benefited from the resounding success of SSL. In this work, we develop a novel domain-aware pretext task to enhance representation learning in digital pathology. Our analysis of vanilla SSL-pretrained models' attention distribution reveals an insightful observation: sparsity in attention, i.e., models tend to localize most of their attention to some prominent patterns in the image. Although attention sparsity can be beneficial in natural images, where the prominent patterns are the objects of interest themselves, it can be sub-optimal in digital pathology; this is because, unlike natural images, digital pathology scans are not object-centric, but rather a complex phenotype of various spatially intermixed biological components.
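The CI-VAE entry above combines two standard pieces, a VAE and a linear classifier on the latent code, so it lends itself to a short sketch. The following is a minimal, illustrative PyTorch module under assumed MNIST-like sizes; layer widths and the loss weighting are our choices, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CIVAE(nn.Module):
    """Sketch of the CI-VAE idea: a plain VAE plus a linear classifier on the
    latent code, pushing the latent space toward linear class separability."""
    def __init__(self, x_dim=784, z_dim=16, n_classes=10):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)
        self.logvar = nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, x_dim))
        self.disc = nn.Linear(z_dim, n_classes)   # the class-informing head

    def forward(self, x, y):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        recon = self.dec(z)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
        return (F.mse_loss(recon, x) + kl
                + F.cross_entropy(self.disc(z), y))  # linear-separability term
```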
Inadequate diversification of attention in these complex images could result in crucial information loss. To address this, we first leverage cell segmentation to densely extract multiple histopathology-specific representations. We then propose a dense pretext task for SSL, designed to match the multiple corresponding representations between the views. Through this, the model learns to attend to various components more closely and evenly, thus inducing adequate diversification in attention for capturing context-rich representations. Through quantitative and qualitative analysis on multiple slide-level tasks across cancer types, and patch-level classification tasks, we demonstrate the efficacy of our method and observe that the attention is more globally distributed. Specifically, we obtain a relative improvement in accuracy of up to 6.9% in slide-level and 2% in patch-level classification tasks (corresponding AUC improvements of up to 7.9% and 0.7%, respectively) over a baseline SSL model.","Computational pathology, Cell segmentation, Self supervised learning, Vision Transformer, Sparse attention" Learning Domain-Agnostic Representation for Disease Diagnosis,https://openreview.net/forum?id=-HHJZlRpGb,https://openreview.net/pdf?id=-HHJZlRpGb,"We propose a disentanglement model in medical imaging diagnosis, in order to achieve robustness across multiple centers.","In clinical environments, image-based diagnosis is desired to achieve robustness on multi-center samples. Toward this goal, a natural way is to capture only clinically disease-related features. However, such disease-related features are often entangled with center effects, preventing robust transfer to unseen centers/domains. To disentangle disease-related features, we first leverage structural causal modeling to explicitly model disease-related features and center effects, which are provably disentangled from each other. Guided by this, we propose a novel Domain Agnostic Representation Model (DarMo) based on a variational auto-encoder. To facilitate disentanglement, we design domain-agnostic and domain-aware encoders to respectively capture disease-related features and varied center effects, the latter by incorporating a domain-aware batch normalization layer. Besides, we constrain the disease-related features to predict the disease label as well as clinical attributes well, by incorporating a Graph Convolutional Network (GCN) into our decoder. The effectiveness and utility of our method are demonstrated by its superior performance over others on both public and in-house datasets.","multi-center disease diagnosis, mammogram classification" DOTIN: Dropping Out Task-Irrelevant Nodes for GNNs,https://openreview.net/forum?id=kkLRWnh9ST1,https://openreview.net/pdf?id=kkLRWnh9ST1,We propose a new method to drop task-irrelevant nodes to increase scalability and efficiency.,"Scalability is an important consideration for deep graph neural networks. Inspired by the conventional pooling layers in CNNs, many recent graph learning approaches have introduced the pooling strategy to reduce the size of graphs for learning, such that scalability and efficiency can be improved. However, these pooling-based methods are mainly tailored to a single graph-level task and pay more attention to local information, limiting their performance in multi-task settings, which often require task-specific global information.
In this paper, departing from these pooling-based efforts, we design a new approach called DOTIN (\underline{D}ropping \underline{O}ut \underline{T}ask-\underline{I}rrelevant \underline{N}odes) to reduce the size of graphs. Specifically, by introducing $K$ learnable virtual nodes to represent the graph embeddings targeted to $K$ different graph-level tasks, respectively, up to 90\% of raw nodes with low attentiveness under an attention model (a transformer in this paper) can be adaptively dropped without notable performance degradation. Achieving almost the same accuracy, our method speeds up GAT by about 50\% on graph-level tasks including graph classification and graph edit distance (GED), with about 60\% less memory, on the D\&D dataset.","Graph Networks, Graph pooling" Boosting Out-of-Distribution Detection with Multiple Pre-trained Models,https://openreview.net/forum?id=QHWXmoYNw-Z,https://openreview.net/pdf?id=QHWXmoYNw-Z,,"Out-of-Distribution (OOD) detection, i.e., identifying whether an input is sampled from a novel distribution other than the training distribution, is a critical task for safely deploying machine learning systems in the open world. Recently, post hoc detection utilizing pre-trained models has shown promising performance and can be scaled to large-scale problems. This advance raises a natural question: Can we leverage the diversity of multiple pre-trained models to improve the performance of post hoc detection methods? In this work, we propose a detection enhancement method by ensembling multiple detection decisions derived from a zoo of pre-trained models. Our approach uses the p-value instead of the commonly used hard threshold and leverages a fundamental framework of multiple hypothesis testing to control the true positive rate for In-Distribution (ID) data. We focus on the usage of model zoos and provide systematic empirical comparisons with current state-of-the-art methods on various OOD detection benchmarks. The proposed ensemble scheme shows consistent improvement compared to single-model detectors and significantly outperforms the current competitive methods. Our method substantially improves the relative performance by $65.40\%$ and $26.96\%$ on the CIFAR10 and ImageNet benchmarks.","Out-of-Distribution Detection, Model Zoo, Ensemble" Minimax Optimal Kernel Operator Learning via Multilevel Training,https://openreview.net/forum?id=zEn1BhaNYsC,https://openreview.net/pdf?id=zEn1BhaNYsC,,"Learning mappings between infinite-dimensional function spaces has achieved empirical success in many disciplines of machine learning, including generative modeling, functional data analysis, causal inference, and multi-agent reinforcement learning. In this paper, we study the statistical limit of learning a Hilbert-Schmidt operator between two infinite-dimensional Sobolev reproducing kernel Hilbert spaces. We establish the information-theoretic lower bound in terms of the Sobolev Hilbert-Schmidt norm and show that a regularization that learns the spectral components below the bias contour and ignores the ones above the variance contour can achieve the optimal learning rate. At the same time, the spectral components between the bias and variance contours give us flexibility in designing computationally feasible machine learning algorithms.
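The attentiveness-based node dropping in the DOTIN entry above can be illustrated in a few lines. The sketch below scores raw nodes by their attention from $K$ virtual task tokens and keeps only the top fraction; a real implementation would sit inside a transformer layer, and all names here are our own.

```python
import torch

def drop_task_irrelevant_nodes(node_feats, virtual_feats, keep_ratio=0.1):
    """Sketch of DOTIN-style dropping: score raw nodes by their attentiveness
    to K learnable virtual (task) nodes and keep only the most attended ones.

    node_feats:    (N, d) raw node embeddings
    virtual_feats: (K, d) learnable task tokens
    """
    d = node_feats.shape[1]
    attn = torch.softmax(virtual_feats @ node_feats.t() / d ** 0.5, dim=-1)  # (K, N)
    scores = attn.sum(dim=0)                        # aggregate attentiveness per node
    k = max(1, int(keep_ratio * node_feats.shape[0]))
    keep = scores.topk(k).indices
    return node_feats[keep], keep
```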
Based on this observation, we develop a multilevel kernel operator learning algorithm that is optimal when learning linear operators between infinite-dimensional function spaces.", STViT: Semantic Tokens for Efficient Global and Local Vision Transformers,https://openreview.net/forum?id=KdAxKVwAmP,https://openreview.net/pdf?id=KdAxKVwAmP,,"The computational complexity, quadratic in the number of tokens, limits the practical applications of Vision Transformers (ViTs). Several works propose to prune redundant tokens to achieve efficient ViTs. However, these methods generally suffer from (i) dramatic accuracy drops, (ii) application difficulty in the local vision transformer, and (iii) non-general-purpose networks for downstream tasks. In this work, we propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers, which can also be revised to serve as a backbone for downstream tasks. The semantic tokens represent cluster centers, and they are initialized by pooling image tokens in space and recovered by attention, which can adaptively represent global or local semantic information. Due to the cluster properties, a few semantic tokens can attain the same effect as vast numbers of image tokens, for both global and local vision transformers. For instance, only 16 semantic tokens on DeiT-(Tiny,Small,Base) can achieve the same accuracy with more than 100% inference speed improvement and nearly 60% FLOPs reduction; on Swin-(Tiny,Small,Base), we can employ 16 semantic tokens in each window to further speed it up by around 20% with a slight accuracy increase. Besides great success in image classification, we also extend our method to video recognition. In addition, we design an STViT-R(ecover) network to restore the detailed spatial information based on the STViT, making it work for downstream tasks, something previous token sparsification methods are powerless to do. Experiments demonstrate that our method can achieve competitive results compared to the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for the backbone.","token reduction algorithm, efficient vision transformer, global and local vision transformer, downstream tasks" Learning a 3D-Aware Encoder for Style-based Generative Radiance Field,https://openreview.net/forum?id=W4ub8fyCpED,https://openreview.net/pdf?id=W4ub8fyCpED,,"We tackle the task of GAN inversion for 3D generative radiance fields (e.g., StyleNeRF). In the inversion task, we aim to learn an inversion function to project an input image to the latent space of a generator and then synthesize novel views of the original image based on the latent code. Compared with GAN inversion for 2D generative models, 3D inversion not only needs to 1) preserve the identity of the input image, but also 2) ensure 3D consistency in generated novel views. This requires the latent code obtained from the single-view image to be invariant across multiple views. To address this new challenge, we propose a two-stage encoder for 3D generative NeRF inversion. In the first stage, we introduce a base encoder that converts the input image to a latent code. To ensure the latent code can be used to synthesize identity-preserving and 3D-consistent novel-view images, we utilize identity contrastive learning to train the base encoder. Since collecting real-world multi-view images of the same identity is expensive, we leverage multi-view images synthesized by the generator itself for contrastive learning.
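The initialize-by-pooling, recover-by-attention step in the STViT entry above is simple enough to sketch. Below is an illustrative PyTorch function under assumed shapes; the pooling size and the single cross-attention pass are our simplifications of the described mechanism.

```python
import torch
import torch.nn.functional as F

def semantic_tokens(image_tokens, h, w, pool=4):
    """Sketch of STViT-style semantic tokens: cluster centers start as
    spatially pooled image tokens and are refined by attending back to all
    image tokens.

    image_tokens: (B, N, d) with N == h * w
    Returns:      (B, pool*pool, d) semantic tokens
    """
    B, N, d = image_tokens.shape
    grid = image_tokens.transpose(1, 2).reshape(B, d, h, w)
    centers = F.adaptive_avg_pool2d(grid, pool).flatten(2).transpose(1, 2)
    attn = torch.softmax(centers @ image_tokens.transpose(1, 2) / d ** 0.5, dim=-1)
    return centers + attn @ image_tokens       # recovered semantic tokens
```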
Second, to better preserve the identity of the input image, we introduce a residual encoder to refine the latent code and add finer details to the output image. Through extensive experiments, we demonstrate that our proposed two-stage encoder qualitatively and quantitatively exhibits superiority over the existing encoders for GAN inversion in both image reconstruction and novel-view rendering.", MixQuant: A Quantization Bit-width Search that Can Optimize the Performance of your Quantization Method,https://openreview.net/forum?id=EGx_FtsO1eu,https://openreview.net/pdf?id=EGx_FtsO1eu,"We propose MixQuant, a search algorithm that finds the optimal custom quantization bit-width for each layer weight based on roundoff error minimization and can be combined with any quantization method as a form of pre-processing optimization.","Quantization is a technique for creating efficient Deep Neural Networks (DNNs), which involves performing computations and storing tensors at lower bit-widths than f32 floating point precision. Quantization reduces model size and inference latency, and therefore allows DNNs to be deployed on platforms with constrained computational resources and in real-time systems. However, quantization can lead to numerical instability caused by roundoff error, which leads to inaccurate computations and, therefore, a decrease in quantized model accuracy. In this paper we focus on simulated quantized inference, where the quantized model parameters are stored in low precision, but the mathematical operations on them (e.g. matrix multiplications and additions) are performed with floating point arithmetic. This means that the DNN parameters are first quantized from f32 to, for example, int4, and then dequantized back to f32 to perform computations. We show that the roundtrip process of quantizing and dequantizing the model parameters leads to roundoff error, which may cause numerical instability. Similarly to prior works, which have shown that both biases and activations are more sensitive to quantization and are best kept in full precision or quantized with higher bit-widths, we show that some weights are more sensitive than others, which should be reflected in their quantization bit-widths. To that end we propose MixQuant, a search algorithm that finds the optimal custom quantization bit-width for each layer weight based on roundoff error and can be combined with any quantization method as a form of pre-processing optimization. We show that combining MixQuant with BRECQ, a state-of-the-art quantization method, yields better quantized model accuracy than BRECQ alone. Additionally, we combine MixQuant with vanilla asymmetric quantization to show that MixQuant has the potential to optimize the performance of any quantization technique.","neural network quantization, rounding error, bit-width search" Logical Entity Representation in Knowledge-Graphs for Differentiable Rule Learning,https://openreview.net/forum?id=JdgO-ht1uTN,https://openreview.net/pdf?id=JdgO-ht1uTN,We propose logical entity representation (LERP) to incorporate contextual information of entities into logical rule learning.,"Probabilistic logical rule learning has shown great strength in logical rule mining and knowledge graph completion. It learns logical rules to predict missing edges by reasoning on existing edges in the knowledge graph. However, previous efforts have largely been limited to only modeling chain-like Horn clauses such as $R_1(x,z) \land R_2(z,y) \Rightarrow H(x,y)$.
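The roundoff-error-driven search in the MixQuant entry above admits a compact sketch: measure the quantize-dequantize error per layer at each candidate bit-width and pick the cheapest width whose error is tolerable. The tolerance rule and names below are ours, not the paper's algorithm.

```python
import torch

def choose_bitwidths(weights, bit_options=(2, 4, 8), tol=0.05):
    """Sketch of a roundoff-error-driven per-layer bit-width search in the
    spirit of MixQuant. `weights` maps layer names to weight tensors."""
    def qdq(w, bits):
        # symmetric quantize-dequantize roundtrip at the given bit-width
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max() / qmax + 1e-12
        return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

    assignment = {}
    for name, w in weights.items():
        errs = {b: ((qdq(w, b) - w).norm() / w.norm()).item() for b in bit_options}
        ok = [b for b in bit_options if errs[b] < tol]
        # cheapest bit-width whose relative roundoff error stays small,
        # falling back to the widest option for sensitive layers
        assignment[name] = min(ok) if ok else max(bit_options)
    return assignment
```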
This formulation overlooks additional contextual information from the neighboring sub-graphs of entity variables $x$, $y$ and $z$. Intuitively, there is a large gap here, as local sub-graphs have been found to provide important information for knowledge graph completion. Inspired by these observations, we propose Logical Entity RePresentation (LERP) to encode contextual information of entities in the knowledge graph. A LERP is designed as a vector of probabilistic logical functions on the entity's neighboring sub-graph. It is an interpretable representation while allowing for differentiable optimization. We can then incorporate LERP into probabilistic logical rule learning to learn more expressive rules. Empirical results demonstrate that with LERP, our model outperforms other rule learning methods in knowledge graph completion and is comparable or even superior to state-of-the-art black-box methods. Moreover, we find that our model can discover a more expressive family of logical rules. LERP can also be further combined with embedding learning methods like TransE to make them more interpretable.","Probabilistic Logical Rule Learning, Knowledge Graph Completion, Logical Representation Learning" S-SOLVER: Numerically Stable Adaptive Step Size Solver for Neural ODEs,https://openreview.net/forum?id=f23tQmoxWz-,https://openreview.net/pdf?id=f23tQmoxWz-,"We propose a neural ODE adaptive step size solver that is more numerically stable thanks to novel, more reliable local truncation error estimation.","A neural ordinary differential equation (ODE) is a relation between an unknown function and its derivatives, where the ODE is parameterized by a neural network. Therefore, obtaining a solution to a neural ODE requires a solver that performs numerical integration. Dopri5 is one of the most popular neural ODE solvers and also the default solver in torchdiffeq, a PyTorch library of ODE solvers. It is an adaptive step size solver based on the Runge-Kutta (RK) numerical methods. These methods rely on estimation of the local truncation error to select and adjust the integration step size, which determines the numerical stability of the solution. A step size that is too large leads to numerical instability, while a step size that is too small may cause the solver to take unnecessarily many steps, which is computationally expensive and may even cause rounding error build-up. Therefore, accurate local truncation error estimation is paramount for choosing an appropriate step size to obtain an accurate, numerically stable, and fast solution to the ODE. In this paper we propose a novel local truncation error approximation that is the first to consider solutions of four different RK orders to obtain a more reliable error estimate. This leads to a novel solver, S-SOLVER (Stable Solver), which is more numerically stable and therefore more accurate. We demonstrate S-SOLVER's competitive performance in experiments on image recognition with ODE-Net, learning Hamiltonian dynamics with Symplectic ODE-Net, and continuous normalizing flows (CNF).","neural ODEs, ODE solvers, numerical integration, numerical stability" TT-NF: Tensor Train Neural Fields,https://openreview.net/forum?id=e1e9CGUj-3,https://openreview.net/pdf?id=e1e9CGUj-3,We repurpose the tensor train decomposition to learn compressed neural fields via backpropagation through samples.,"Learning neural fields has been an active topic in deep learning research, focusing, among other issues, on finding more compact and easy-to-fit representations.
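To illustrate the step-size control loop that the S-SOLVER entry above improves, here is a generic sketch of local-truncation-error-based adaptation. Note the hedge: the paper's estimator compares solutions of four RK orders, whereas this sketch uses plain step doubling with RK4 as a stand-in error pair.

```python
import numpy as np

def rk_adaptive_step(f, t, y, h, tol=1e-6):
    """One adaptive integration step: estimate the local truncation error,
    reject and shrink the step if it is too large, otherwise advance and
    grow the step. f(t, y) returns dy/dt as a NumPy array."""
    def rk4(t, y, h):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h / 2 * k1)
        k3 = f(t + h / 2, y + h / 2 * k2)
        k4 = f(t + h, y + h * k3)
        return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

    full = rk4(t, y, h)
    half = rk4(t + h / 2, rk4(t, y, h / 2), h / 2)
    err = np.max(np.abs(full - half))          # local truncation error proxy
    if err > tol:                              # reject: shrink the step
        return t, y, h / 2, False
    h_new = h * min(2.0, 0.9 * (tol / max(err, 1e-16)) ** 0.2)
    return t + h, half, h_new, True            # accept: advance with finer answer
```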
In this paper, we introduce a novel low-rank representation termed Tensor Train Neural Fields (TT-NF) for learning neural fields on dense regular grids, along with efficient methods for sampling from them. Our representation is a TT parameterization of the neural field, trained with backpropagation to minimize a non-convex objective. We analyze the effect of low-rank compression on the downstream task quality metrics in two settings. First, we demonstrate the efficiency of our method in a sandbox task of tensor denoising, which admits comparison with SVD-based schemes designed to minimize reconstruction error. Furthermore, we apply the proposed approach to Neural Radiance Fields, where the low-rank structure of the field corresponding to the best quality can be discovered only through learning. ","neural fields, deep learning, tensor train, tensor decompositions, sampling, radiance fields, voxels" Partial transportability for domain generalization,https://openreview.net/forum?id=mVn2JGzlET,https://openreview.net/pdf?id=mVn2JGzlET,"This paper investigates the problem of domain generalization from the perspective of transportability theory. We propose the task of partial transportability and provide solutions that highlight new contrasts with ""invariance learning"" methods.","Learning prediction models that generalize to related domains is one of the most fundamental challenges in artificial intelligence. There exists a growing literature that argues for learning invariant associations using data from multiple source domains. However, whether invariant predictors generalize to a given target domain depends crucially on the assumed structural changes between domains. Using the perspective of transportability theory, we show that invariance learning, and the settings in which invariant predictors are optimal in terms of worst-case losses, is a special case of a more general partial transportability task. Specifically, the partial transportability task seeks to identify/bound a conditional expectation $\mathbb E_{P_{\pi^*}}[y\mid\mathbf x]$ in an unseen domain $\pi^*$ using knowledge of qualitative changes across domains in the form of causal graphs and data from source domains $\pi^1,\dots,\pi^k$. We show that solutions to this problem have a much wider generalization guarantee that subsumes those of invariance learning and other robust optimization methods that are inspired by causality. For computations in practice, we develop an algorithm that provably provides tight bounds asymptotically in the number of data samples from source domains for any partial transportability problem with discrete observables, and we illustrate its use on synthetic datasets. ","Causality, domain generalization" CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training,https://openreview.net/forum?id=4JVdg72e7f,https://openreview.net/pdf?id=4JVdg72e7f,,"Pre-training across 3D vision and language remains under development because of limited training data. Recent works attempt to transfer vision-language pre-training models to 3D vision. PointCLIP converts point cloud data to multi-view depth maps, adopting CLIP for shape classification. However, its performance is restricted by the domain gap between rendered depth maps and images, as well as the diversity of depth distributions. To address this issue, we propose CLIP2Point, an image-depth pre-training method based on contrastive learning to transfer CLIP to the 3D domain and adapt it to point cloud classification.
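The tensor-train parameterization named in the TT-NF entry above has a textbook evaluation rule worth showing: one entry of the full tensor is the product of one slice per core. The sketch below is a generic TT lookup, not the paper's sampling machinery.

```python
import torch

def tt_eval(cores, idx):
    """Evaluate one entry of a tensor-train parameterized field by
    multiplying one slice per core. cores[k] has shape (r_k, n_k, r_{k+1})
    with boundary ranks r_0 = r_D = 1; idx is one index per mode."""
    v = cores[0][:, idx[0], :]                 # (1, r_1)
    for core, i in zip(cores[1:], idx[1:]):
        v = v @ core[:, i, :]                  # contract along the rank dim
    return v.squeeze()                         # scalar entry of the tensor
```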
We introduce a new depth rendering setting that produces a better visual effect, and then render 52,460 pairs of images and depth maps from ShapeNet for pre-training. The pre-training scheme of CLIP2Point combines cross-modality learning, which enforces the depth features to capture expressive visual and textual information, with intra-modality learning, which enhances the invariance of depth aggregation. Additionally, we propose a novel Dual-Path Adapter (DPA) module, i.e., a dual-path structure with simplified adapters for few-shot learning. The dual-path structure allows the joint use of CLIP and CLIP2Point, and the simplified adapters fit few-shot tasks well without post-search. Experimental results show that CLIP2Point is effective in transferring CLIP knowledge to 3D vision. Our CLIP2Point outperforms PointCLIP and other self-supervised 3D networks, achieving state-of-the-art results on zero-shot and few-shot classification.", Feint in Multi-Player Games,https://openreview.net/forum?id=WbyWDWoXD3,https://openreview.net/pdf?id=WbyWDWoXD3,"First work to formalize, implement and examine Feint in Multi-Player Games","This paper introduces the first formalization, implementation and quantitative evaluation of Feint in Multi-Player Games. Our work first formalizes Feint from the perspective of Multi-Player Games, in terms of its temporal, spatial, and collective impacts. The formalization is built upon the \textit{Non-transitive Active Markov Game Model}, where Feint can have a considerable impact. Then, our work considers practical implementation details of Feint in Multi-Player Games, given the state-of-the-art progress of multi-agent modeling to date (namely, Multi-Agent Reinforcement Learning). Finally, our work quantitatively examines the effectiveness of our design, and the results show that our design of Feint can (1) greatly improve the reward gains from the game; (2) significantly improve the diversity of Multi-Player Games; and (3) only incur negligible overheads in terms of time consumption. We conclude that our design of Feint is effective and practical in making Multi-Player Games more interesting.","Feint, Multi-Player Games" Succinct Compression: Lossless Compression for Fast and Memory-Efficient Deep Neural Network Inference,https://openreview.net/forum?id=VNzq9PBFta,https://openreview.net/pdf?id=VNzq9PBFta,First work to introduce Succinct Data Structures for Fast and Memory-Efficient Computations of Deep Neural Networks,"This paper introduces ``Succinct Compression'', a method to provide lossless compression of Deep Neural Network (DNN) models for fast and memory-efficient inference. The key insight of our method is to leverage the concept of \textit{Succinct Data Structures}, which support fast queries without decompressing the compressed representations. Our method builds on three new insights. First, we introduce two basic building blocks to formulate DNN models, and show how they can be extended to be synergistic with compressed models (e.g. pruned or quantized models). Then, we propose a scheme to enable mixed-formulation inference for different layers, to better extract their benefits. Finally, our method exploits a specialized execution pipeline to incorporate different model formulations for fast inference.
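The cross-modality term in the CLIP2Point entry above is a standard symmetric contrastive objective over matched image/depth pairs; here is a minimal sketch of that general pattern (ours, not the paper's exact loss or temperature).

```python
import torch
import torch.nn.functional as F

def cross_modal_nce(img_emb, depth_emb, tau=0.07):
    """Symmetric InfoNCE over a batch of matched image/depth embeddings:
    the diagonal pairs are positives, all other pairings are negatives."""
    img = F.normalize(img_emb, dim=-1)
    dep = F.normalize(depth_emb, dim=-1)
    logits = img @ dep.t() / tau                   # (B, B) similarity matrix
    targets = torch.arange(img.shape[0], device=img.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```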
We quantitatively demonstrate that our method can (1) enable faster and more memory-efficient inference on uncompressed models; (2) be synergistic with a variety of structure-altered/unaltered compression schemes, with better speedup and compression ratios while preserving accuracy; and (3) outperform all other state-of-the-art Model Coding approaches. ","Succinct Data Structures, Deep Neural Networks, Efficient Inference" BEVDistill: Cross-Modal BEV Distillation for Multi-View 3D Object Detection,https://openreview.net/forum?id=-2zfgNS917,https://openreview.net/pdf?id=-2zfgNS917,We distill LiDAR-based knowledge into multi-view 3D detectors with cross-modal BEV distillation.,"3D object detection from multiple image views is a fundamental and challenging task for visual scene understanding. Owing to its low cost and high efficiency, multi-view 3D object detection has demonstrated promising application prospects. However, accurately detecting objects through perspective views is extremely difficult due to the lack of depth information. Current approaches tend to adopt heavy backbones for image encoders, making them inapplicable for real-world deployment. Different from images, LiDAR points are superior in providing spatial cues, resulting in highly precise localization. In this paper, we explore the incorporation of LiDAR-based detectors for multi-view 3D object detection. Instead of directly training a depth prediction network, we unify the image and LiDAR features in the Bird-Eye-View (BEV) space and adaptively transfer knowledge across non-homogeneous representations in a teacher-student paradigm. To this end, we propose BEVDistill, a cross-modal BEV knowledge distillation (KD) framework for multi-view 3D object detection. Extensive experiments demonstrate that the proposed method outperforms current KD approaches on a highly competitive baseline, BEVFormer, without introducing any extra cost in the inference phase. Notably, our best model achieves 59.4 NDS on the nuScenes test leaderboard, setting a new state of the art in comparison with various image-based detectors.","object detection, 3d detection, BEV perception" Expanding Datasets With Guided Imagination,https://openreview.net/forum?id=gbC0cLDB6X,https://openreview.net/pdf?id=gbC0cLDB6X,,"The power of Deep Neural Networks (DNNs) depends heavily on the training data quantity, quality and diversity. However, in many real scenarios, it is costly and time-consuming to collect and annotate large-scale data. This has severely hindered the application of DNNs. To address this challenge, we explore a new task of dataset expansion, which seeks to automatically create new labeled samples to expand a small dataset. To this end, we present a Guided Imagination Framework (GIF) that leverages recently developed large generative models (e.g., DALL-E2) to ``imagine'' and create informative new data from seed data to expand small datasets. Specifically, GIF conducts imagination by optimizing the latent features of seed data in a semantically meaningful space, which are fed into the generative models to generate photo-realistic images with new content. For guiding the imagination towards creating samples useful for model training, we exploit the zero-shot recognition ability of CLIP and introduce three criteria to encourage informative sample generation, i.e., prediction consistency, entropy maximization and diversity promotion.
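The teacher-student transfer in the BEVDistill entry above can be pictured as a feature-matching loss in the shared BEV space. The sketch below is a generic stand-in with an illustrative foreground weighting, not the paper's actual loss formulation.

```python
import torch.nn.functional as F

def bev_distill_loss(student_bev, teacher_bev, fg_mask, alpha=0.5):
    """Align the camera student's BEV features (B, C, H, W) with a frozen
    LiDAR teacher's, weighting foreground BEV cells (fg_mask: (B, H, W) in
    [0, 1]) more heavily than the background."""
    per_cell = F.mse_loss(student_bev, teacher_bev, reduction='none').mean(dim=1)
    weights = alpha + (1.0 - alpha) * fg_mask      # emphasize object regions
    return (weights * per_cell).mean()
```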
With these essential criteria as guidance, GIF works well for expanding datasets in different domains, leading to a 29.9\% accuracy gain on average over six natural image datasets, and a 10.4\% accuracy gain on average over three medical image datasets. The source code will be made public. ","Dataset Expansion, Guided Imagination" ThinkSum: Probabilistic reasoning over sets using large language models,https://openreview.net/forum?id=HdYxZ_OVZG,https://openreview.net/pdf?id=HdYxZ_OVZG,A wise System 2 for large language models: Think (parallel model call) + Sum (aggregate results to make a prediction).,"Large language models (LLMs) have a substantial capacity for high-level analogical reasoning: reproducing patterns in linear text that occur in their training data (zero-shot evaluation) or in the provided context (few-shot in-context learning). However, recent studies show that even the largest LLMs fail in scenarios that require reasoning over multiple objects or facts or making sequences of logical deductions. We propose a two-stage probabilistic inference paradigm, ThinkSum, that reasons over sets of objects or facts in a structured manner. In the first stage (Think -- 'fast' retrieval of associations), an LLM is queried in parallel over a set of phrases extracted from the prompt or an auxiliary model call. In the second stage (Sum -- 'slow' probabilistic inference or reasoning), the results of these queries are aggregated to make the final prediction. We demonstrate the advantages of ThinkSum on the BIG-bench suite of evaluation tasks, achieving improvements over the state of the art using GPT-family models on ten difficult tasks, often with far smaller model variants. We compare and contrast ThinkSum with other proposed modifications to direct prompting of LLMs, such as variants of chain-of-thought prompting. We argue that because the probabilistic inference in ThinkSum is performed outside of calls to the LLM, ThinkSum is less sensitive to prompt design, yields more interpretable predictions, and can be flexibly combined with latent variable models to extract structured knowledge from LLMs.","NLP, language models, prompting, zero-shot learning" Universal Unlearnable Examples: Cluster-wise Perturbations without Label-consistency,https://openreview.net/forum?id=pHO19kq_yT,https://openreview.net/pdf?id=pHO19kq_yT,"We propose a novel method called UniversalCP, which is effective in a more practical scenario.","There is a growing interest in employing unlearnable examples against privacy leaks on the Internet, which prevents unauthorized models from being properly trained by adding invisible image noise. However, existing attack methods rely on an ideal assumption called label-consistency. In this work, we clarify a more practical scenario called \emph{label-inconsistency}, which allows hackers and protectors to hold different labels for the same image. Inspired by disrupting \emph{uniformity} and \emph{discrepancy}, we present a novel method called \emph{UniversalCP} for the label-inconsistency scenario, which generates universal unlearnable examples via cluster-wise perturbations. Furthermore, we investigate a new strategy of selecting CLIP as the surrogate model, since vision-and-language pre-training models are trained on large-scale data with richer semantic supervision.
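The Think/Sum split described in the ThinkSum entry above maps naturally onto a scoring loop: many parallel model queries, then aggregation outside the model. The sketch below assumes a hypothetical `llm_logprob(prompt, answer)` scoring interface and uses a simple likelihood average as the Sum stage; the paper's aggregation rules are task-specific.

```python
import math

def think_sum(llm_logprob, question, candidates, facts):
    """Score each candidate answer by querying the model once per extracted
    fact (Think, trivially parallelizable) and aggregating the resulting
    log-probabilities outside the model (Sum)."""
    scores = {}
    for cand in candidates:
        # Think: independent queries, one per fact conditioning the question
        logps = [llm_logprob(f"{fact}\n{question}", cand) for fact in facts]
        # Sum: probabilistic aggregation (here: mean likelihood over facts)
        scores[cand] = math.log(sum(math.exp(lp) for lp in logps) / len(logps))
    return max(scores, key=scores.get)
```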
We also verify the effectiveness of the proposed methods and the strategy for selecting surrogate models under a variety of experimental settings, including black-box backbones, datasets, and even the commercial platforms Microsoft {\tt Azure} and Baidu {\tt PaddlePaddle}.","privacy-preserving, unlearnable example, adversarial attack, poisoning attack" Confidence and Dispersity Speak: Characterising Prediction Matrix for Unsupervised Accuracy Estimation,https://openreview.net/forum?id=O7gAffL9a0,https://openreview.net/pdf?id=O7gAffL9a0,This work proposes a simple but effective method (prediction diversity) to predict how well a model generalizes to out-of-distribution datasets,"This work focuses on estimating how well a model performs on out-of-distribution (OOD) datasets without using labels. Our intuition is that a well-performing model should give predictions with high confidence and high dispersity. While recent methods study the prediction confidence, this work finds that dispersity is another informative cue. Confidence reflects whether the individual prediction is certain; dispersity indicates how the overall predictions are distributed across all categories. To achieve a more accurate estimation, we propose to jointly consider these two properties by using the nuclear norm of the prediction matrix. In our experiments, we extensively validate the effectiveness of the nuclear norm for various models (e.g., ViT and ConvNeXt), different datasets (e.g., ImageNet and CUB-200), and diverse types of distribution shifts (e.g., style shift and reproduction shift). We show that the nuclear norm is more accurate and robust in predicting OOD accuracy than existing methods. Lastly, we study the limitations of the nuclear norm and discuss potential directions.","Out-of-distribution generalization, Unsupervised Accuracy Estimation, Prediction Diversity, Distribution Shift" On the Calibration Set Difficulty and Out-of-distribution Calibration,https://openreview.net/forum?id=E2Y_xv8ybf,https://openreview.net/pdf?id=E2Y_xv8ybf,Calibration set difficulty impacts out-of-distribution calibration performance,"Model calibration usually requires optimizing some parameters (e.g., temperature) w.r.t. an objective function (e.g., negative log-likelihood). In this paper, we report a plain, important but often neglected fact: the objective function is influenced by calibration set difficulty, i.e., the ratio of the number of incorrectly classified samples to that of correctly classified samples. If a test set has a drastically different difficulty level from the calibration set, the optimal calibration parameters of the two datasets would be different. In other words, a calibrator optimal on the calibration set would be suboptimal on the OOD test set and thus have degraded performance. With this knowledge, we propose a simple and effective method named adaptive calibrator ensemble (ACE) to calibrate OOD datasets whose difficulty is usually higher than that of the calibration set. Specifically, two calibration functions are trained, one for in-distribution data (low difficulty), and the other for severely OOD data (high difficulty). To achieve desirable calibration on a new OOD dataset, ACE uses an adaptive weighting method that strikes a balance between the two extreme functions. When plugged in, ACE generally improves the performance of a few state-of-the-art calibration schemes on a series of OOD benchmarks.
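The nuclear-norm cue in the confidence-and-dispersity entry above is a one-liner in practice. The sketch below computes it over a softmax prediction matrix; the normalization constant is a common choice for mapping the score to [0, 1] and is our assumption, not necessarily the paper's.

```python
import numpy as np

def nuclear_norm_score(probs):
    """Confidence-and-dispersity cue: the nuclear norm (sum of singular
    values) of the (N, C) softmax prediction matrix, normalized to [0, 1].
    Higher values correlate with higher accuracy on the unlabeled set."""
    n, c = probs.shape
    nuc = np.linalg.norm(probs, ord='nuc')
    return nuc / np.sqrt(n * min(n, c))
```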
Importantly, such improvement does not come at the cost of the in-distribution calibration accuracy.","neural network calibration, out-of-distribution calibration" Design of the topology for contrastive visual-textual alignment,https://openreview.net/forum?id=UyC1dXUA-n,https://openreview.net/pdf?id=UyC1dXUA-n,We change the topology of the embedding space to an oblique manifold for better alignment performance.,"Pre-training on weakly related image-text pairs in the contrastive style shows great power in learning semantically aligned cross-modal models. The common choice to measure the distance between the feature representations of image-text pairs is the cosine similarity, which, mathematically, can be considered the negative inner-product distance between features embedded on a sphere. However, empirically, aligning image-text pairs on the spherical topology is vulnerable to the semantic ambiguity phenomenon resulting from noise in the pre-training datasets. Specifically, with noisy training data, instead of the optimal alignment-uniformity solution, the system would achieve an equilibrium (a gap between the distances of positive and negative pairs) when the gradients for attraction and repulsion are neutralized. Although intuitively the model should always find this equilibrium given a sufficiently long training scheme, its numerical values might be out of the distance range (e.g. [-1, 1] for the cosine similarity). In prior studies, this problem is partly tackled by introducing a learnable softmax temperature parameter, in other words, by explicitly scaling the range of the distance function. In this work, we alternatively design the topology of the embedding space and its endowed distance function. Motivated by studies that make use of Riemannian geometry for visual tasks, we propose a rather simple solution to address the aforementioned equilibrium problem. That is, we map the feature representations onto the oblique manifold endowed with the negative inner product as the distance function. In the experimental analysis, we show that we can improve the baseline performance by a large margin (e.g. 4\% in the zero-shot image-to-text retrieval task) by changing only two lines of the training code.", Slimmable Networks for Contrastive Self-supervised Learning,https://openreview.net/forum?id=7PURWDjJCf3,https://openreview.net/pdf?id=7PURWDjJCf3,,"Self-supervised learning makes great progress in large model pre-training but suffers in training small models. Previous solutions to this problem mainly rely on knowledge distillation and indeed have a two-stage learning procedure: first train a large teacher model, then distill it to improve the generalization ability of small ones. In this work, we present a new one-stage solution to obtain pre-trained small models without extra teachers: slimmable networks for contrastive self-supervised learning (SlimCLR). A slimmable network contains a full network and several weight-sharing sub-networks. We can pre-train only once and obtain various networks, including small ones with low computation costs. However, in self-supervised cases, the interference between weight-sharing networks leads to severe performance degradation. One piece of evidence of the interference is gradient imbalance: a small proportion of parameters produces dominant gradients during backpropagation, and the main parameters may not be fully optimized. The divergence in gradient directions of various networks may also cause interference between networks.
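The two-extremes interpolation in the ACE entry above reduces to a few lines once a difficulty estimate for the new test set is available. The sketch below uses a linear blend of two temperatures; the blending rule and the `difficulty` input are illustrative assumptions, not the paper's exact weighting method.

```python
import numpy as np

def ace_calibrate(logits, t_id, t_ood, difficulty):
    """Blend an in-distribution temperature t_id and a severe-OOD temperature
    t_ood according to an estimated difficulty in [0, 1], then apply the
    interpolated temperature to the logits."""
    t = (1.0 - difficulty) * t_id + difficulty * t_ood
    z = logits / t
    z = z - z.max(axis=1, keepdims=True)       # numerically stable softmax
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)
```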
To overcome these problems, we make the main parameters produce dominant gradients and provide consistent guidance for sub-networks via three techniques: slow-start training of sub-networks, online distillation, and loss re-weighting according to model sizes. Besides, a switchable linear probe layer is applied during linear evaluation to avoid the interference of weight-sharing linear layers. We instantiate SlimCLR with typical contrastive learning frameworks and achieve better performance than prior art with fewer parameters and FLOPs.","self-supervised learning, contrastive learning, slimmable networks" Interpretable Single/Multi-label Text Classification with Unsupervised Constituent-label alignments,https://openreview.net/forum?id=MLJ5TF5FtXH,https://openreview.net/pdf?id=MLJ5TF5FtXH,An inherently interpretable model architecture with explicit unsupervised label-to-constituent alignments.,"Deep neural networks based on layer-stacking architectures have historically suffered from poor inherent interpretability. Meanwhile, symbolic probabilistic models function with clear interpretability, but how to combine them with neural networks to enhance their performance remains to be explored. In this paper, we try to marry these two systems for text classification via structured language models. Specifically, we propose a novel label extraction framework based on binary syntax trees. Both the structures and intermediate representations of the trees are learned using a pretrained neural network in an unsupervised manner. Inference and learning are made efficient using dynamic programming over tree structures. Our experiments demonstrate that our approach achieves good prediction results in single/multi-label text classification and has explicit and inherent constituent-level interpretability.","Interpretability, natural language processing, text classification, unsupervised learning, structured language model, multiple instance learning, recursive neural network" Suppressing the Heterogeneity: A Strong Feature Extractor for Few-shot Segmentation,https://openreview.net/forum?id=CGuvK3U09LH,https://openreview.net/pdf?id=CGuvK3U09LH,,"This paper tackles the Few-shot Semantic Segmentation (FSS) task with a focus on learning the feature extractor. The feature extractor has somehow been overlooked by recent state-of-the-art methods, which directly use a deep model pretrained on ImageNet for feature extraction (without further fine-tuning). Against this background, we think the FSS feature extractor deserves exploration, and we observe the heterogeneity (i.e., the intra-class diversity in the raw images) as a critical challenge hindering intra-class feature compactness. The heterogeneity has three levels from coarse to fine: 1) Sample-level: the inevitable distribution gap between the support and query images makes them heterogeneous from each other. 2) Region-level: the background in FSS actually contains multiple regions with different semantics. 3) Patch-level: some neighboring patches belonging to the same class may appear quite different from each other. Motivated by these observations, we propose a feature extractor with Multi-level Heterogeneity Suppressing (MuHS). MuHS leverages the attention mechanism in the transformer backbone to effectively suppress all three levels of heterogeneity.
Concretely, MuHS reinforces the attention/interaction between different samples (query and support), different regions, and neighboring patches by constructing cross-sample attention, cross-region interaction, and a novel masked image segmentation (inspired by the recent masked image modeling), respectively. We empirically show that 1) MuHS brings consistent improvement for various FSS heads and 2) using a simple linear classification head, MuHS sets new state-of-the-art results on multiple FSS datasets, validating the importance of FSS feature learning.","deep learning, computer vision, few-shot learning, few-shot semantic segmentation" Defactorization Transformer: Modeling Long Range Dependency with Local Window Cost,https://openreview.net/forum?id=m0R-SYjUpTL,https://openreview.net/pdf?id=m0R-SYjUpTL,,"Transformers have astounding representational power but typically consume considerable computation and memory. The currently popular Swin transformer reduces computational and memory costs via a local window strategy. However, this inevitably causes two drawbacks: i) the local window-based self-attention weakens global dependency modeling capability; ii) recent studies point out that local windows impair robustness. This paper proposes a novel defactorization self-attention mechanism (DeSA) that enjoys the advantages of both local window cost and long-range dependency modeling. Specifically, we defactorize a large area of feature tokens into non-overlapping subsets and obtain a strictly limited number of key tokens enriched with long-range information through cross-set interaction. Equipped with a new mixed-grained multi-head attention that adjusts the granularity of the key features in different heads, DeSA is capable of modeling long-range dependency while aggregating multi-grained information at a computational and memory cost equivalent to that of the local window-based self-attention. With DeSA, we present a family of models named defactorization vision transformer (DeViT). Extensive experiments show that our DeViT achieves state-of-the-art performance on both classification and downstream tasks, while demonstrating strong robustness to corrupted and biased data. Compared with Swin-T, our DeViT-B2 significantly improves classification accuracy by $1\%$ and robustness by $6\%$, and reduces model parameters by $14\%$. Our code will soon be publicly available at https://github.com/anonymous0519/DeViT.", MaPLe: Multi-modal Prompt Learning,https://openreview.net/forum?id=8mWlBArp1qx,https://openreview.net/pdf?id=8mWlBArp1qx,Multi-modal prompt learning for improving synergy between learned language and vision representations for fine-tuning CLIP on downstream image recognition tasks.,"Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to perform well. Inspired by the Natural Language Processing (NLP) literature, recent CLIP adaptation approaches learn prompts as the textual inputs to fine-tune CLIP for downstream tasks. We note that using prompting to adapt representations in a single branch of CLIP (language or vision) is sub-optimal, since it does not allow the flexibility to dynamically adjust both representation spaces on a downstream task. In this work, we propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Our design promotes strong coupling between the vision-language prompts to ensure mutual synergy and discourages learning independent uni-modal solutions. Further, we learn separate prompts across different early stages to progressively model the stage-wise feature relationships and allow rich context learning. We evaluate the effectiveness of our approach on three representative tasks: generalization to novel classes, new target datasets, and unseen domain shifts. Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes and 2.72% on the overall harmonic mean, averaged over 11 diverse image recognition datasets. Our code and models will be publicly released.","Vision-language models, Prompt learning, Generalization, Fine-tuning, Transfer learning" Communication Efficient Fair Federated Recommender System,https://openreview.net/forum?id=ZLv-8v0Sp_H,https://openreview.net/pdf?id=ZLv-8v0Sp_H,A random sampling based Fair Federated Recommender System.,"Federated Recommender Systems (FRSs) aim to provide recommendations to clients in a distributed manner with privacy preservation. FRSs suffer from high communication costs due to the communication between the server and many clients. Some past literature on federated supervised learning shows that sampling clients randomly improves communication efficiency without jeopardizing accuracy. However, each user is considered a separate client in FRS, and clients communicate only item gradients. Thus, incorporating random sampling and determining the number of clients to be sampled in each communication round to retain the model's accuracy in FRS becomes challenging. This paper provides sample complexity bounds on the number of clients that must be sampled in an FRS to preserve accuracy. Next, we consider the issue of demographic bias in FRS, quantified as the difference in the average error rates across different groups. Supervised learning algorithms mitigate the group bias by adding a fairness constraint to the training loss, which requires sharing protected attributes with the server. This is prohibited in a federated setting to ensure clients' privacy. We design RS-FairFRS, a Random Sampling based Fair Federated Recommender System, which trains to achieve a fair global model. In addition, it also trains local clients towards a fair global model to reduce demographic bias at the client level without the need to share their protected attributes. We empirically demonstrate all our results across the two most popular real-world datasets (ML1M, ML100k) and different sensitive features (age and gender) to prove that RS-FairFRS helps reduce communication cost and demographic bias with improved model accuracy. ","Federated Learning, Recommender Systems, Bias and Fairness" Grassmannian Class Representation in Deep Learning,https://openreview.net/forum?id=GmjwnzduXzf,https://openreview.net/pdf?id=GmjwnzduXzf,,"We generalize the class representative vector found in deep classification networks to linear subspaces and show that the new formulation enables the simultaneous enhancement of inter-class discrimination and intra-class feature variation. Traditionally, the logit is computed by the inner product between a feature and the class vector. In our modeling, classes are subspaces, and the logit is defined as the norm of the projection of a feature onto the subspace.
Since the set of subspaces forms a Grassmann manifold, finding the optimal subspace representation for classes amounts to optimizing the loss on a Grassmannian. We integrate Riemannian SGD into existing deep learning frameworks such that the class subspaces on a Grassmannian are jointly optimized with the other model parameters in Euclidean space. Compared to the vector form, subspaces have two appealing properties: they can be multi-dimensional and they are scaleless. Empirically, we reveal that these distinct characteristics improve various tasks. (1) Image classification. The new formulation brings the top-1 accuracy of ResNet50-D on ImageNet-1K from 78.04% to 79.37% using the standard augmentation in 100 training epochs. This confirms that the representative capability of subspaces is more powerful than that of vectors. (2) Feature transfer. Subspaces provide freedom for features to vary, and we observe that the intra-class variability of features increases when the subspace dimensions are larger. Consequently, the quality of features is better for downstream tasks. The average transfer accuracy across 6 datasets improves from 77.98% to 80.12% compared to the strong baseline of vanilla softmax. (3) Long-tail classification. The scaleless property of subspaces benefits classification in the long-tail scenario and improves the accuracy of ImageNet-LT from 46.83% to 48.94% compared to the standard formulation. With these encouraging results, we believe that more applications could benefit from the Grassmannian class representation. Codes will be released.","Grassmannian, geometric optimization, classification, feature transfer, long-tail" Refining Visual Representation for Generalized Zero-Shot Recognition through Implicit-Semantics-Guided Metric Learning,https://openreview.net/forum?id=r63dkNZj7I5,https://openreview.net/pdf?id=r63dkNZj7I5,,"Deep metric learning (DML) is effective in addressing the large intra- and the small inter-class variation problem in visual recognition; however, when applied to generalized zero-shot learning (GZSL), in which the label of a target image may belong to an unseen category, this technique can be easily biased towards seen classes. Alternatively, in GZSL some form of semantic space is available, which plays an important role in relating seen and unseen classes and is widely used to guide the learning of visual representation. To take advantage of DML while avoiding overfitting to seen classes, we propose a novel representation learning framework---Metric Learning with Implicit Semantics (MLIS)---to refine discriminative and generalizable visual features for GZSL. Specifically, we disentangle the effects of semantics on the feature extractor and the image classification parts of the model, so that semantics only participate in feature learning, and classification only uses the refined visual features. We further relax the visual-semantic alignment requirement, avoiding performing pair-wise comparisons between the image and the class embeddings. Experimental results demonstrate that the proposed MLIS framework bridges DML and GZSL. It achieves state-of-the-art performance, and integrates robustly and flexibly with several metric-learning-based loss functions.
","generalized zero-shot learning, metric learning, multi-class classification" Reward Learning with Trees: Methods and Evaluation,https://openreview.net/forum?id=xl2-MIX2DCD,https://openreview.net/pdf?id=xl2-MIX2DCD,"We show that reward learning with tree models can be competitive with neural networks, and demonstrate some of its interpretability benefits.","Recent efforts to learn reward functions from human feedback have tended to use deep neural networks, whose lack of transparency hampers our ability to explain agent behaviour or verify alignment. We explore the merits of learning intrinsically interpretable tree models instead. We develop a recently proposed method for learning reward trees from preference labels, and show it to be broadly competitive with neural networks on challenging high-dimensional tasks, with good robustness to limited or corrupted data. Having found that reward tree learning can be done effectively in complex settings, we then consider why it should be used, demonstrating that the interpretable reward structure gives significant scope for traceability, verification and explanation.","reinforcement learning, reward learning, alignment, human-agent interaction, explainable AI, XAI, interpretability, decision trees" Achieve the Minimum Width of Neural Networks for Universal Approximation,https://openreview.net/forum?id=hfUJ4ShyDEU,https://openreview.net/pdf?id=hfUJ4ShyDEU,"We prove that the minimum width of FNN for UAP is $w^*_{\min} = \max(d_x,d_y)$ which is achievable.","The universal approximation property (UAP) of neural networks is fundamental for deep learning, and it is well known that wide neural networks are universal approximators of continuous functions within both the $L^p$ norm and the continuous/uniform norm. However, the exact minimum width, $w_{\min}$, for the UAP has not been studied thoroughly. Recently, using a decoder-memorizer-encoder scheme, \citet{Park2021Minimum} found that $w_{\min} = \max(d_x+1,d_y)$ for both the $L^p$-UAP of ReLU networks and the $C$-UAP of ReLU+STEP networks, where $d_x,d_y$ are the input and output dimensions, respectively. In this paper, we consider neural networks with an arbitrary set of activation functions. We prove that both $C$-UAP and $L^p$-UAP for functions on compact domains share a universal lower bound of the minimal width; that is, $w^*_{\min} = \max(d_x,d_y)$. In particular, the critical width, $w^*_{\min}$, for $L^p$-UAP can be achieved by leaky-ReLU networks, provided that the input or output dimension is larger than one. Our construction is based on the approximation power of neural ordinary differential equations and the ability to approximate flow maps by neural networks. The nonmonotone or discontinuous activation functions case and the one-dimensional case are also discussed.","Universal Approximation, Feedforward Neural Network, Leaky-ReLU" H2RBox: Horizonal Box Annotation is All You Need for Oriented Object Detection,https://openreview.net/forum?id=NPfDKT9OUJ3,https://openreview.net/pdf?id=NPfDKT9OUJ3,,"Oriented object detection emerges in many applications from aerial images to autonomous driving, while many existing detection benchmarks are annotated with horizontal bounding box only which is also less costive than fine-grained rotated box, leading to a gap between the readily available training corpus and the rising demand for oriented object detection. 
This paper proposes a simple yet effective oriented object detection approach called H2RBox that uses only horizontal box annotation for weakly-supervised training, closing the above gap and showing competitive performance even against detectors trained with rotated boxes. The core of our method is weakly- and self-supervised learning, which predicts the angle of the object by learning the consistency between two different views. To the best of our knowledge, H2RBox is the first horizontal box annotation-based oriented object detector. Compared to an alternative, i.e., horizontal box-supervised instance segmentation with our post adaptation to oriented object detection, our approach is not susceptible to mask prediction quality and performs more robustly in complex scenes containing a large number of dense objects and outliers. Experimental results show that H2RBox has significant performance and speed advantages over horizontal box-supervised instance segmentation methods, as well as lower memory requirements. Compared to rotated box-supervised oriented object detectors, our method shows very close performance and speed, and even surpasses them in some cases. Source code will be made publicly available.","Oriented Object Detection, Rotated Object Detection" Sparse and Hierarchical Masked Modeling for Convolutional Representation Learning,https://openreview.net/forum?id=NRxydtWup1S,https://openreview.net/pdf?id=NRxydtWup1S,This paper presents a simple yet powerful framework to pre-train a convolutional network (convnet) with Sparse masKed modeling.,"This paper presents a simple yet powerful framework to pre-train a convolutional network (convnet) with Sparse masKed modeling. SparK addresses key challenges in applying transformer-specialized masked modeling to convolutional models: (i) the convolution operation cannot handle irregular, randomly masked input; (ii) the single-scale nature of existing masked modeling is inconsistent with the convnet's hierarchical structure. For (i), we sparsely gather the unmasked pixels into a sparse image and use sparse convolution for encoding. For the latter, we develop a hierarchical encoder-decoder to reconstruct from multi-scale encoded features to fully exploit the advantage of hierarchy. As the first hierarchical masked modeling method designed for convnets, SparK exploits their untapped potential. On three downstream tasks, SparK surpasses both state-of-the-art contrastive learning and \textit{transformer-based} masked modeling by similarly large margins (around +1.0%). Improvements on object detection and instance segmentation are more substantial (>1.0%), verifying the strong transferability of features learned by SparK. We also demonstrate SparK's favorable scaling behavior by observing more gains on larger models. Taking all results together, a promising future of generative pre-training on convnets has been initially shown by SparK.
Codes will be made publicly available.","Self-Supervised Learning, Masked Autoencoding, Masked Pre-training, Masked Modeling, Convolutional Neural Networks" Functional Relation Field: A Model-Agnostic Framework for Multivariate Time Series Forecasting,https://openreview.net/forum?id=BM10-kHq8uX,https://openreview.net/pdf?id=BM10-kHq8uX,Functional Relation Field: A Model-Agnostic Framework for Multivariate Time Series Forecasting,"In multivariate time series forecasting, the most popular strategy for modeling the relationship between multiple time series is the construction of a graph, where each time series is represented as a node and related nodes are connected by edges, i.e., spatial-temporal graph neural networks. The graph structure is either given a priori or learned based on the similarity between nodes. However, the relationship between multiple time series is typically complicated; for instance, the sum of outflows from upstream nodes may be equal to the inflows of downstream nodes. Such relations widely exist in many real-world multivariate time series forecasting scenarios, yet are far from well studied. In these cases, a graph might only be a crude description of the dependency between nodes. To this end, we explore a new framework to model the inter-node relationship in a more precise way based on our proposed inductive bias for graphs, Functional Relation Field, where a group of functions parameterized by neural networks are learned to characterize the dependency between multiple time series. These learned functions are versatile: they can be used to discover the underlying graph structure by identifying the most relevant neighbors of the target node; on the other hand, the learned functions form a “field” where the nodes in the backbone prediction networks are enforced to satisfy the constraints defined by these functions. Experiments on a toy dataset show that our approach can recover the true constraint relationship between nodes, and two real-world datasets, MiniApp calling traffic and road networks, are also considered with various backbone networks. Results show that the prediction error can be reduced remarkably with the aid of the proposed functional relation field framework.","Functional Relation Field, Spatio-Temporal Forecasting, Constraint Optimization, Multivariate Time Series" Motion-inductive Self-supervised Object Discovery in Videos,https://openreview.net/forum?id=K5qR1F14qPE,https://openreview.net/pdf?id=K5qR1F14qPE,"We propose a motion-inductive model through directly processing consecutive RGB frames to segment the foreground objects and train it by flow reconstruction between pairwise frames, i.e. without any mask annotations.","In this paper, we consider the task of unsupervised object discovery in videos. Previous works have shown promising results via processing optical flows to segment objects. However, taking flow as input brings about two drawbacks. First, flow cannot capture sufficient cues when objects remain static or partially occluded. Second, it is challenging to establish temporal coherency from flow-only input, due to the missing texture information. To tackle these limitations, we propose a model for directly processing consecutive RGB frames, and infer the optical flow between any pair of frames using a layered representation, with the opacity channels being treated as the segmentation.
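A hedged sketch of the Functional Relation Field idea from the forecasting abstract above: a small network g learns to predict a target node's value from its neighbors, and the learned relation is then imposed as a soft constraint on the forecaster's outputs. The network sizes, node indices, penalty weight, and all names are illustrative assumptions, not the paper's exact design.

```python
# Sketch: a learned functional relation used as a soft constraint on forecasts.
import torch
import torch.nn as nn

g = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))  # neighbors -> target node

def constrained_loss(pred, target, neighbor_idx=(0, 1, 2), node_idx=3, lam=0.1):
    # pred, target: (batch, num_nodes) forecasts and ground truth.
    mse = ((pred - target) ** 2).mean()
    # Enforce the learned relation x_node ~= g(x_neighbors) on the predictions.
    relation = g(pred[:, list(neighbor_idx)]).squeeze(-1)
    penalty = ((pred[:, node_idx] - relation) ** 2).mean()
    return mse + lam * penalty

pred, target = torch.randn(32, 4), torch.randn(32, 4)
print(constrained_loss(pred, target))
```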
Additionally, to enforce object permanence, we apply a temporal consistency loss on the inferred masks from randomly paired frames, which reflect motions at different paces, encouraging the model to segment objects even if they do not move at the current time point. Experimentally, we demonstrate superior performance over previous state-of-the-art methods on three public video segmentation datasets (DAVIS2016, SegTrackv2, and FBMS-59), while being computationally efficient by avoiding the overhead of computing optical flow as input.","Video Object Segmentation, Motion Segmentation, Object Discovery" Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment,https://openreview.net/forum?id=ptbePrczhRt,https://openreview.net/pdf?id=ptbePrczhRt,,"Detection Transformer (DETR) relies on one-to-one assignment for end-to-end object detection and lacks the capability of exploiting multiple positive object queries. We present a novel DETR training approach, named {\em Group DETR}, to support one-to-many assignment in a group-wise manner. To achieve it, we make simple modifications during training: (i) adopt $K$ groups of object queries; (ii) conduct decoder self-attention on each group of object queries with the same parameters; (iii) perform one-to-one assignment for each group, leading to $K$ positive object queries for each ground-truth object. At inference, we only use one group of object queries, making no modifications to model architectures and inference processes. Group DETR is a versatile training method and is applicable to various DETR variants. Our experiments show that Group DETR significantly speeds up training convergence and improves the performance of various DETR-based methods.", Transcendental Idealism of Planner: Evaluating Perception from Planning Perspective for Autonomous Driving,https://openreview.net/forum?id=E08kaoSiQl0,https://openreview.net/pdf?id=E08kaoSiQl0,The paper proposes a systematic and principled framework to evaluate the consequence of perception module error from the perspective of autonomous vehicle planning.,"Evaluating the performance of the perception module in autonomous driving is one of the most critical tasks in developing these complex intelligent systems. While module-level unit test methodologies adopted from traditional computer vision tasks are viable to a certain extent, it remains far less explored how to evaluate, in a consistent and holistic manner, the impact that changes in a perception module have on the planning of an autonomous vehicle. In this work, we propose a principled framework that provides a coherent and systematic understanding of how perception modules affect the planning module that actually controls the vehicle. Specifically, planning of an autonomous vehicle is formulated as an expected utility maximisation problem, where all input signals from upstream modules jointly provide a world state description, and the planner aims to find the optimal action to execute by maximising the expected utility determined by both the world state and the action. We show that, under some mild conditions, the objective function can be represented as an inner product between the world state description and the utility function in a Hilbert space. This geometric interpretation enables a novel way to formulate, analyse and evaluate the impact of noise in world state estimation on the solution to the problem, and leads to a universal quantitative metric for this purpose.
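A minimal sketch of the group-wise assignment described in the Group DETR abstract above: queries are split into K groups and Hungarian (one-to-one) matching is run independently per group, so every ground-truth object ends up with K positive queries. The random cost matrix below stands in for the real matching cost; group sizes and names are illustrative assumptions.

```python
# Sketch: group-wise one-to-many assignment via per-group Hungarian matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

K, queries_per_group, num_gt = 3, 10, 4
cost = np.random.rand(K * queries_per_group, num_gt)   # (total queries, GT objects)

positives = []
for k in range(K):
    group_cost = cost[k * queries_per_group:(k + 1) * queries_per_group]
    q_idx, gt_idx = linear_sum_assignment(group_cost)  # one-to-one within the group
    positives += [(k * queries_per_group + q, g) for q, g in zip(q_idx, gt_idx)]

print(len(positives))  # K * num_gt = 12 positive query-object pairs overall
```

At inference only one group of queries is kept, so the extra groups cost nothing at test time.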
The whole framework resembles the idea of transcendental idealism in the classical philosophy literature, which gives our approach its name.","Autonomous Driving, Utility Maximisation, Hilbert Space, Planning, Perception" Pushing the Limits of Fewshot Anomaly Detection in Industry Vision: Graphcore,https://openreview.net/forum?id=xzmqxHdZAwO,https://openreview.net/pdf?id=xzmqxHdZAwO,,"In the area of fewshot anomaly detection (FSAD), efficient visual features play an essential role in memory bank (M)-based methods. However, these methods do not account for the relationship between the visual feature and its rotated visual feature, drastically limiting the anomaly detection performance. To push the limits, we reveal that the rotation-invariant feature property has a significant impact on industrial FSAD. Specifically, we utilize graph representation in FSAD and provide a novel visual isometric invariant feature (VIIF) as the anomaly measurement feature. As a result, VIIF can robustly improve the anomaly discriminating ability and can further reduce the size of redundant features stored in M by a large amount. Besides, we provide a novel model, GraphCore, via VIIFs that enables fast unsupervised FSAD training and improves anomaly detection performance. A comprehensive evaluation is provided for comparing GraphCore and other SOTA anomaly detection models under our proposed fewshot anomaly detection setting, which shows GraphCore can increase average AUC by 5.8\%, 4.1\%, 3.4\%, and 1.6\% on MVTec AD and by 25.5\%, 22.0\%, 16.9\%, and 14.1\% on MPDD for 1, 2, 4, and 8-shot cases, respectively.", Evaluating Weakly Supervised Object Localization Methods Right? A Study on Heatmap-based XAI and Neural Backed Decision Tree,https://openreview.net/forum?id=X55dLasnEcC,https://openreview.net/pdf?id=X55dLasnEcC,Evaluating object localization using XAI methods on MaxBoxAcc metrics. NBDT is tested too as an extension.,"Choe et al. have investigated several aspects of Weakly Supervised Object Localization (WSOL) with only image labels. They addressed the ill-posed nature of the problem and showed that WSOL has not significantly improved beyond the baseline method class activation mapping (CAM). We report the results of similar experiments on ResNet50 with some crucial differences: (1) we perform WSOL using heatmap-based eXplainable AI (XAI) methods; (2) our model is not class agnostic since we are interested in the XAI aspect as well. Under a similar protocol, we find that XAI methods perform WSOL with very sub-standard MaxBoxAcc scores. The experiment is then repeated for the same model trained with Neural Backed Decision Tree (NBDT) and we find that vanilla CAM yields significantly better WSOL performance after NBDT training.","object localization, computer vision, deep learning, deep neural network" HyperFeel: An Efficient Federated Learning Framework Using Hyperdimensional Computing,https://openreview.net/forum?id=CqoBqextqY,https://openreview.net/pdf?id=CqoBqextqY,,"Federated Learning (FL) aims to establish a shared model across decentralized clients under the privacy-preserving constraint. Each client learns an independent model with local data, and only model updates are communicated.
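In symbols, the planning formulation stated in the Transcendental Idealism of Planner abstract above can be written as below; the notation is ours, not the authors', and is meant only as a hedged restatement of the inner-product interpretation.

```latex
% Hedged restatement of the expected-utility planning objective above:
% p is the world-state description from perception, U the utility function.
a^{*} \;=\; \arg\max_{a}\; \mathbb{E}_{s \sim p}\!\left[\,U(s, a)\,\right]
      \;=\; \arg\max_{a}\; \langle\, p,\; U(\cdot, a) \,\rangle_{\mathcal{H}}
```

Under this reading, perception noise perturbs $p$, and its effect on the chosen action can be analysed geometrically through the inner product, which is what motivates the universal quantitative metric mentioned above.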
However, as the FL model typically employs computation-intensive neural networks, the major issues in Federated Learning are (i) significant computation overhead for local training; (ii) the massive communication overhead that arises from sending around the model updates; (iii) notable performance degradation resulting from the non-IID scenario. In this work, we propose HyperFeel, an efficient learning framework for federated learning based on Hyper-Dimensional Computing (HDC), that can significantly improve communication/storage efficiency over existing works with nearly no performance degradation. Unlike current solutions that employ neural networks as the learned model, HyperFeel introduces a simple yet effective computing paradigm that encodes and represents data using hyperdimensional vectors. Then, it performs concise and highly parallel operations for encryption, computation, and communication, taking advantage of the lightweight feature representation of hyperdimensional vectors. To further enhance HyperFeel's performance, we propose a two-fold optimization scheme combining the characteristics of encoding and updating in hyper-dimensional computing. On the one hand, we design a personalization update based on hyperdimensional computing with a client-specific model, which achieves better accuracy on non-IID data. On the other hand, we extend the framework from horizontal FL to vertical FL based on a shared encoding mechanism. Comprehensive experimental results demonstrate that our method consistently outperforms state-of-the-art FL models. Notably, we achieve $26\times$ storage reduction and up to $81\times$ communication reduction over FedAvg, with minimal accuracy drops on FEMNIST and Synthetic. \emph{Code will be open-sourced in the camera-ready version.}", TEAS: Exploiting Spiking Activity for Temporal-wise Adaptive Spiking Neural Networks,https://openreview.net/forum?id=AbRe0e_R07,https://openreview.net/pdf?id=AbRe0e_R07,,"Spiking neural networks (SNNs) are energy-efficient alternatives to commonly used deep artificial neural networks (ANNs). However, their sequential computation pattern over multiple time steps makes processing latency a significant hindrance to deployment. In existing SNNs deployed on time-driven hardware, all layers generate and receive spikes in a synchronized manner, forcing them to share the same time steps. This often leads to significant time redundancy in the spike sequences and considerable repetitive processing. Motivated by the effectiveness of dynamic neural networks for boosting efficiency, we propose a temporal-wise adaptive SNN, namely TEAS, in which each layer is configured with an independent number of time steps to fully exploit the potential of SNNs. Specifically, given an SNN, the number of time steps of each layer is configured according to its contribution to the final performance of the whole network. Then, we exploit a temporal transforming module to produce a dynamic policy that adapts the temporal information dynamically during inference. The adaptive configuration generating process also enables trading off model complexity against accuracy.
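For context on the hyperdimensional encoding that the HyperFeel abstract above builds on, here is a textbook HDC sketch: random bipolar hypervectors are bound (elementwise product) and bundled (majority sign) into a single lightweight representation. This is the generic paradigm, not necessarily HyperFeel's exact encoder; the dimensionality, feature/level counts, and names are assumptions.

```python
# Generic hyperdimensional-computing encoding: bind per-feature and per-level
# hypervectors, then bundle all pairs by majority vote.
import numpy as np

rng = np.random.default_rng(0)
D = 10_000                                        # hypervector dimensionality
feature_hv = rng.choice([-1, 1], size=(8, D))     # one random hypervector per feature
level_hv = rng.choice([-1, 1], size=(4, D))       # one per quantized feature level

def encode(sample_levels: np.ndarray) -> np.ndarray:
    bound = feature_hv * level_hv[sample_levels]  # (8, D): bind via elementwise product
    return np.sign(bound.sum(axis=0))             # bundle; ties yield 0 entries

hv = encode(rng.integers(0, 4, size=8))
print(hv.shape, np.unique(hv))                     # (10000,) values in {-1, 0, 1}
```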
Extensive experiments on a variety of challenging datasets demonstrate that our method provides significant savings in energy efficiency and processing latency under similar accuracy, outperforming existing state-of-the-art methods.",Spiking Neural Network Quasi-Conservative Score-based Generative Models,https://openreview.net/forum?id=ALuRpkAeQP,https://openreview.net/pdf?id=ALuRpkAeQP,"In this paper, we propose Quasi-Conservative Score-based Generative Models (QCSGMs), which are designed to maintain both the architectural flexibility and the property of conservativeness of score-based generative models.","Existing Score-based Generative Models (SGMs) can be categorized into constrained SGMs (CSGMs) or unconstrained SGMs (USGMs) according to their parameterization approaches. CSGMs model the probability density functions as Boltzmann distributions, and assign their predictions as the negative gradients of some scalar-valued energy functions. On the other hand, USGMs employ flexible architectures capable of directly estimating scores without the need to explicitly model energy functions. In this paper, we demonstrate that the architectural constraints of CSGMs may limit their score-matching ability. In addition, we show that USGMs' inability to preserve the property of conservativeness may lead to serious sampling inefficiency and degraded sampling performance in practice. To address the above issues, we propose Quasi-Conservative Score-based Generative Models (QCSGMs) for keeping the advantages of both CSGMs and USGMs. Our theoretical derivations demonstrate that the training objective of QCSGMs can be efficiently integrated into the training processes by leveraging the Hutchinson trace estimator. In addition, our experimental results on the CIFAR-10, CIFAR-100, ImageNet, and SVHN datasets validate the effectiveness of QCSGMs. Finally, we justify the advantage of QCSGMs using an example of a one-layer autoencoder.","Score-based Generative Models, Conservativeness" Multi-Modal Few-Shot Temporal Action Detection,https://openreview.net/forum?id=WLMaYqspJl,https://openreview.net/pdf?id=WLMaYqspJl,Detecting action temporally from very few annotated samples using vision-language,"Conventional temporal action detection (TAD) methods rely on supervised learning from many labeled training videos, rendering them unscalable to new classes. Recent approaches to solving this problem include few-shot (FS) and zero-shot (ZS) TAD. The former can adapt a pretrained vision model to a new task represented by as few as a single video per class, whilst the latter synthesizes some semantic description given a new class (e.g., generating the classifier using a pretrained vision-language (ViL) model). In this work, we further introduce a hybrid problem setup, multi-modality few-shot (MMFS) TAD, that integrates the respective advantages of FS-TAD and ZS-TAD by accounting for both few-shot support videos (i.e., visual modality) and new class names (i.e., textual modality) in a single formulation. To tackle this MMFS-TAD problem, we introduce a novel {\bf\em MUlti-modality PromPt mETa-learning} (MUPPET) method. Our key idea is to construct multi-modal prompts by mapping few-shot support videos to the textual token space of a pretrained ViL model (e.g., CLIP) using a meta-learned adapter-equipped visual semantics tokenizer; this facilitates a joint use of the two input modalities for learning richer representation. To address the large intra-class variation challenge, we further design a query feature regulation scheme.
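The QCSGM abstract above relies on the Hutchinson trace estimator; below is the standard estimator tr(J) ≈ E_v[vᵀJv] with Rademacher probes, shown on its own rather than as the paper's full training objective. The function, sample count, and test case are illustrative.

```python
# Standard Hutchinson estimator of the trace of a Jacobian via VJPs.
import torch

def hutchinson_trace(f, x, n_samples=16):
    # Estimate tr(J_f(x)) as E_v[v^T J v] with Rademacher probe vectors v.
    x = x.detach().requires_grad_(True)
    y = f(x)                                               # y has the same shape as x
    est = torch.zeros(())
    for _ in range(n_samples):
        v = torch.randint_like(y, 0, 2) * 2 - 1            # entries in {-1, +1}
        (vjp,) = torch.autograd.grad(y, x, grad_outputs=v, retain_graph=True)
        est = est + (vjp * v).sum()                        # v^T J v
    return est / n_samples

x = torch.randn(5)
# For elementwise tanh the Jacobian is diagonal, so the estimate is exact:
print(hutchinson_trace(torch.tanh, x), (1 - torch.tanh(x) ** 2).sum())
```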
Extensive experiments on ActivityNet v1.3 and THUMOS14 demonstrate that our MUPPET outperforms state-of-the-art FS-TAD, ZS-TAD, and alternative methods under a variety of MMFS-TAD settings, often by a large margin.","action detection, video understanding, vision language, few-shot learning" Representation Learning for Low-rank General-sum Markov Games,https://openreview.net/forum?id=8FroynZv4C,https://openreview.net/pdf?id=8FroynZv4C,We provide a general representation learning framework for multi-player general-sum Markov games.,"We study multi-agent general-sum Markov games with nonlinear function approximation. We focus on low-rank Markov games whose transition matrix admits a hidden low-rank structure on top of an unknown non-linear representation. The goal is to design an algorithm that (1) finds an $\varepsilon$-equilibrium policy sample efficiently without prior knowledge of the environment or the representation, and (2) permits a deep-learning friendly implementation. We leverage representation learning and present a model-based and a model-free approach to construct an effective representation from collected data. For both approaches, the algorithm achieves a sample complexity of poly$(H,d,A,1/\varepsilon)$, where $H$ is the game horizon, $d$ is the dimension of the feature vector, $A$ is the size of the joint action space and $\varepsilon$ is the optimality gap. When the number of players is large, the above sample complexity can scale exponentially with the number of players in the worst case. To address this challenge, we consider Markov Games with a factorized transition structure and present an algorithm that escapes such exponential scaling. To the best of our knowledge, this is the first sample-efficient algorithm for multi-agent general-sum Markov games that incorporates (non-linear) function approximation. We accompany our theoretical result with a neural network-based implementation of our algorithm and evaluate it against the widely used deep RL baseline, DQN with fictitious play.","Reinforcement Learning, Multi Agent, Representation Learning" Multi-Domain Long-Tailed Learning by Augmenting Disentangled Representations,https://openreview.net/forum?id=v6dqNREneyw,https://openreview.net/pdf?id=v6dqNREneyw,Balanced augmenting disentangled representations benefit the robustness of multi-domain long-tailed learning,"There is an inescapable long-tailed class-imbalance issue in many real-world classification problems. Existing long-tailed classification methods focus on the single-domain setting, where all examples are drawn from the same distribution. However, real-world scenarios often involve multiple domains with distinct imbalanced class distributions. We study this multi-domain long-tailed learning problem and aim to produce a model that generalizes well across all classes and domains. Towards that goal, we introduce TALLY, which produces invariant predictors by augmenting hidden representations over domains and classes in a balanced way. Built upon a proposed selective balanced sampling strategy, TALLY achieves this by mixing the semantic representation of one example with the domain-associated nuisances of another, producing a new representation for use as data augmentation. To improve the disentanglement of semantic representations, TALLY further utilizes a domain-invariant class prototype that averages out domain-specific effects. We evaluate TALLY on four long-tailed variants of classical domain generalization benchmarks and two real-world imbalanced multi-domain datasets.
The results indicate that TALLY consistently outperforms other state-of-the-art methods in both subpopulation shift and domain shift.","multi-domain long-tailed learning, balanced representation augmentation, out-of-distribution robustness" Surgical Fine-Tuning Improves Adaptation to Distribution Shifts,https://openreview.net/forum?id=APuPRxjHvZ,https://openreview.net/pdf?id=APuPRxjHvZ,Selectively fine-tuning a subset of layers outperforms full fine-tuning when transferring to tasks with various distribution shifts.,"A common approach to transfer learning under distribution shift is to fine-tune the last few layers of a pre-trained model, preserving learned features while also adapting to the new task. This paper shows that in such settings, selectively fine-tuning a subset of layers (which we term surgical fine-tuning) matches or outperforms commonly used fine-tuning approaches. Moreover, the type of distribution shift influences which subset is more effective to tune: for example, for image corruptions, fine-tuning only the first few layers works best. We validate our findings systematically across seven real-world data tasks spanning three types of distribution shifts. Theoretically, we prove that for two-layer neural networks in an idealized setting, first-layer tuning can outperform fine-tuning all layers. Intuitively, fine-tuning more parameters on a small target dataset can cause information learned during pre-training to be forgotten, and the relevant information depends on the type of shift.","Transfer learning, fine-tuning, parameter freezing, distortion of pre-trained models" Diversify and Disambiguate: Out-of-Distribution Robustness via Disagreement,https://openreview.net/forum?id=RVTOp3MwT3n,https://openreview.net/pdf?id=RVTOp3MwT3n,"Given underspecified data, (1) find a diverse set of solutions and (2) choose the best one.","Real-world machine learning problems often exhibit shifts between the source and target distributions, in which source data does not fully convey the desired behavior on target inputs. Different functions that achieve near-perfect source accuracy can make differing predictions on test inputs, and such ambiguity makes robustness to distribution shifts challenging. We propose DivDis, a simple two-stage framework for identifying and resolving ambiguity in data. DivDis first learns a diverse set of hypotheses that achieve low source loss but make differing predictions on target inputs. We then disambiguate by selecting one of the discovered functions using additional information, for example, a small number of target labels. Our experimental evaluation shows improved performance in subpopulation shift and domain generalization settings, demonstrating that DivDis can scalably adapt to distribution shifts in image and text classification benchmarks.","Out-of-distribution robustness, spurious correlations, underspecification, ambiguity, ensembles" MaSS: Multi-attribute Selective Suppression,https://openreview.net/forum?id=a3OY2j9kJc-,https://openreview.net/pdf?id=a3OY2j9kJc-,Selectively suppress the attributes while preserving the rest of attributes,"The recent rapid advances in the development and deployment of machine learning technologies largely depend on the vast richness of data available today, in terms of both the quantity and the rich content contained within. 
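A minimal sketch of the surgical fine-tuning recipe from the abstract above: freeze all parameters, then unfreeze only the chosen subset of layers (here the earliest ones, which the abstract reports working best for image corruptions). The choice of a torchvision ResNet-18 and the learning rate are illustrative assumptions.

```python
# Sketch: "surgical" fine-tuning by unfreezing only the first layers.
import torch
from torchvision import models

model = models.resnet18(weights=None)  # in practice, load pretrained weights here
for p in model.parameters():
    p.requires_grad = False                                   # freeze everything
for p in list(model.conv1.parameters()) + list(model.layer1.parameters()):
    p.requires_grad = True                                    # unfreeze early layers only

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable params")
```

For other shift types the same pattern applies with a different unfrozen subset, e.g. only the final fully connected layer.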
For example, biometric data such as images and voices could reveal people's attributes like age, gender, sentiment, and origin, whereas location/motion data could be used to infer people's activity levels, transportation modes, and life habits. Along with the new services and applications enabled by such technological advances, various governmental policies are put in place to regulate such data usage and protect people's privacy and rights. As a result, data owners often opt for simple data obfuscation (e.g., blur people's faces in images) or withholding data altogether, which leads to severe data quality degradation and greatly limits the data's potential utility. Aiming for a sophisticated mechanism which gives data owners fine-grained control while retaining the maximal degree of data utility, we propose Multi-attribute Selective Suppression, or MaSS, a general framework for performing precisely targeted data surgery to simultaneously suppress any selected set of attributes while preserving the rest for downstream machine learning tasks. MaSS learns a data modifier through adversarial games between two sets of networks, where one is aimed at suppressing selected attributes, and the other ensures the retention of the rest of the attributes via general contrastive loss as well as explicit classification metrics. We carried out an extensive evaluation of our proposed method using multiple datasets from different domains including facial images, voice audio, and video clips, and obtained highly promising results in MaSS' generalizability and capability of drastically suppressing targeted attributes (e.g., reducing inference on such attributes to random guessing) while imposing virtually no impact on the data's usability in other downstream ML tasks.","Multi-attribute, GAN, Attribute Suppression" Mimic before Reconstruct: Enhance Masked Autoencoders with Feature Mimicking,https://openreview.net/forum?id=UoBJm4V21md,https://openreview.net/pdf?id=UoBJm4V21md,,"Masked Autoencoders (MAE) have been popular paradigms for large-scale vision representation pre-training. However, MAE solely reconstructs the low-level RGB signals after the decoder and lacks supervision upon high-level semantics for the encoder, thus suffering from sub-optimal learned representations and long pre-training epochs. To alleviate this, previous methods simply replace the pixel reconstruction targets of 75% masked tokens with encoded features from pre-trained image-image (DINO) or image-language (CLIP) contrastive learning. Different from those efforts, we propose to Mimic before Reconstruct for Masked Autoencoders, named MR-MAE, which jointly learns high-level and low-level representations without interference during pre-training. For high-level semantics, MR-MAE employs a mimic loss over 25% visible tokens from the encoder to capture the pre-trained patterns encoded in CLIP and DINO. For low-level structures, we inherit the reconstruction loss in MAE to predict RGB pixel values for 75% masked tokens after the decoder. As MR-MAE applies high-level and low-level targets respectively at different partitions, the learning conflicts between them can be naturally overcome and contribute to superior visual representations for various downstream tasks. On ImageNet-1K, the MR-MAE base pre-trained for only 200 epochs achieves 85.0\% top-1 accuracy after fine-tuning, surpassing MAE base pre-trained for 1600 epochs by +1.4%.
Furthermore, by appending masked convolution stages, MR-MCMAE reaches 85.8%, surpassing the previous state-of-the-art BEiT V2 base by +0.3% with far fewer computational resources (25% vs 100% tokens fed into the encoder, and 400 vs 1600 pre-training epochs). Code and pre-trained models will be released.","Masked Autoencoders, Masked Convolution, Off-the-shelf pertained model DINO and CLIP" Neural Attention Memory,https://openreview.net/forum?id=S1Jgnb7mLfI,https://openreview.net/pdf?id=S1Jgnb7mLfI,Neural attention memory is a differentiable NN memory architecture based on attention which is efficient and powerful.,"Scaled dot-product attention has become the essence of state-of-the-art deep neural networks for various machine learning tasks. Despite its ubiquitous accomplishments, it is inefficient for long-sequence tasks and problematic for tasks requiring memory states such as compositional generalization. We propose a novel perspective of the attention mechanism by reinventing it as a memory architecture for neural networks, namely Neural Attention Memory (NAM). NAM follows the same query-key-value structure by constructing a memory matrix while reducing its computational complexity from quadratic to linear in the sequence length. NAM writes a memory matrix via the sum of outer products of value and unit key vectors, and reads it by multiplying the matrix with a unit query vector. Indeed, we show that our normalized outer-product attention mechanism is mathematically equivalent to the conventional attention mechanism. Then, we evaluate a NAM-based Transformer on Long Range Arena tasks and demonstrate its efficiency and efficacy. Finally, we propose two NAM-based memory-augmented neural networks, namely Long Short-Term Attention Memory (LSAM) and NAM Turing Machine (NAM-TM), and test their compositional generalization capability using four different tasks. LSAM replaces LSTM's long-term cell state with a NAM memory matrix, and NAM-TM implements a Turing tape data structure using NAM read/write primitives. The experimental results show that the proposed models outperform traditional Transformer and LSTM, as well as DNC. NAM opens up possibilities in diverse machine learning research problems, including hierarchical data modeling, efficient edge inference, and few-shot learning.","Neuro-symbolic AI, Transformer, Memory-augmented neural network, compositional generalization" Meta Optimal Transport,https://openreview.net/forum?id=qhu9uX4QlP8,https://openreview.net/pdf?id=qhu9uX4QlP8,We learn to predict the solution to optimal transport problems,"We study the use of amortized optimization to predict optimal transport (OT) maps from the input measures, which we call Meta OT. This helps repeatedly solve similar OT problems between different measures by leveraging the knowledge and information present from past problems to rapidly predict and solve new problems. Otherwise, standard methods ignore the knowledge of the past solutions and suboptimally re-solve each problem from scratch.
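A minimal sketch of the NAM read/write primitives described in the abstract above: writing adds the outer product of a value and a unit key; reading multiplies the memory matrix by a unit query. The dimensions are illustrative.

```python
# Sketch of NAM-style outer-product memory: rank-1 associative write + read.
import numpy as np

def nam_write(M, key, value):
    key = key / np.linalg.norm(key)            # unit key vector
    return M + np.outer(value, key)            # rank-1 associative write

def nam_read(M, query):
    query = query / np.linalg.norm(query)      # unit query vector
    return M @ query                           # retrieve the associated value

rng = np.random.default_rng(0)
M = np.zeros((8, 16))
k, v = rng.standard_normal(16), rng.standard_normal(8)
M = nam_write(M, k, v)
print(np.allclose(nam_read(M, k), v))          # True: querying with k recovers v
```

Because the memory is a fixed-size matrix updated once per token, processing a sequence costs time linear in its length, which is the efficiency claim made above.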
We instantiate Meta OT models in discrete and continuous (Wasserstein-2) settings between images, spherical data, and color palettes and use them to improve the computational time of standard OT solvers by multiple orders of magnitude.","optimal transport, meta learning, amortized optimization" On amortizing convex conjugates for optimal transport,https://openreview.net/forum?id=TQ5WUwS_4ai,https://openreview.net/pdf?id=TQ5WUwS_4ai,"State-of-the-art continuous Wasserstein-2 potential learning, and along the way I improved Jax's L-BFGS implementation to run in 3% of the time for solving batches of optimization problems","This paper focuses on computing the convex conjugate operation that arises when solving Euclidean Wasserstein-2 optimal transport problems. This conjugation, which is also referred to as the Legendre-Fenchel conjugate or c-transform, is considered difficult to compute, and in practice, Wasserstein-2 methods are limited by not being able to exactly conjugate the dual potentials in continuous space. I show that combining an amortized approximation to the conjugate with an exact solver is computationally easy. This combination significantly improves the quality of transport maps learned for the Wasserstein-2 benchmark by Korotin et al. (2021a) and is able to model many 2-dimensional couplings and flows considered in the literature. To attain these results, I have also implemented a new parallel Armijo line search for L-BFGS that runs in ~3% of the time of Jax's default sequential Wolfe line search. All of the baselines, methods, and solvers considered in this paper are also available as part of a new software library for Euclidean Wasserstein-2 optimal transport.","optimal transport, wasserstein-2, convex conjugate, c-transform, amortized optimization" Exploring Visual Interpretability for Contrastive Language-Image Pretraining,https://openreview.net/forum?id=7uTvSvC7hGO,https://openreview.net/pdf?id=7uTvSvC7hGO,"A visual interpretability work for CLIP. We observe CLIP shows opposite visualization results, and find the reason is semantic shift at the pooling layer. Then, we solve this problem with nontrivial improvements.","Contrastive Language-Image Pre-training (CLIP) learns rich representations via readily available supervision of natural language. It improves the performance of downstream vision tasks, including but not limited to zero-shot classification, long-tail recognition, segmentation, retrieval, captioning, and video tasks. However, the visual interpretability of CLIP is rarely studied, especially with respect to the raw feature map. To provide visual explanations of its predictions, we propose the Image-Text Similarity Map (ITSM). Based on it, we surprisingly find that CLIP prefers the background regions over the foregrounds, and shows erroneous visualizations that contradict human understanding. Experimentally, we find the devil is in the pooling part, where inappropriate pooling methods lead to a phenomenon called semantic shift. To correct and boost the visualization results, we propose the Masked Max Pooling, with an attention map from the self-supervised image encoder. Meanwhile, interpretability and recognition require different representations. To address the problem, we propose dual projections to cater to this requirement. We integrate the above methods as Interpretable Contrastive Language-Image Pre-training (ICLIP). Our experiments suggest that ICLIP greatly improves the interpretability of CLIP.
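For reference, the conjugate operation at the center of the amortized-conjugates abstract above is the standard Legendre-Fenchel conjugate; the notation below is ours, not the paper's.

```latex
% Standard Legendre-Fenchel (convex) conjugate of a potential f:
f^{*}(y) \;=\; \sup_{x}\;\bigl[\langle x, y\rangle - f(x)\bigr]
```

In the Wasserstein-2 setting described above, an amortized model predicts the maximizing $x(y)$ as a warm start, and an exact solver such as L-BFGS then refines it, which is the combination the abstract reports as computationally easy.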
For example, ICLIP yields nontrivial improvements of 32.85% and 49.10% on the VOC 2012 dataset.","Visual Interpretability, Explainability, Contrastive Language-Image Pretraining, Multimodality" Example-based Planning via Dual Gradient Fields,https://openreview.net/forum?id=nVYND1kLOug,https://openreview.net/pdf?id=nVYND1kLOug,We introduce an example-based planning framework via score-matching.,"Path planning is one of the key abilities of an intelligent agent. However, both learning-based and sample-based planners still require explicitly defining the task by manually designing the reward function or optimisation objectives, which limits their scope of application. Formulating the path planning problem from a new perspective, example-based planning aims to find the most efficient path to increase the likelihood of the target distribution, given a set of target examples. In this work, we introduce Dual Gradient Fields (DualGFs), an offline-learning example-based planning framework built upon score matching. There are two gradient fields in DualGFs: a target gradient field that guides task completion and a support gradient field that ensures moving within environmental constraints. In the learning process, instead of interacting with the environment, the agents are trained with two sets of offline examples, i.e., the target gradients and support gradients are trained by target examples and support examples, respectively. The support examples are randomly sampled from free space, e.g., states without collisions. DualGF is a weighted mixture of the two fields, combining the merits of both. To update the mixing ratio adaptively, we further propose a fields-balancing mechanism based on Lagrangian relaxation. Experimental results across four tasks (navigation, tracking, particle rearrangement, and room rearrangement) demonstrate the scalability and effectiveness of our method.","Example-based Planning, Score-Matching, Path Planning, Reinforcement Learning" DualAfford: Learning Collaborative Visual Affordance for Dual-gripper Manipulation,https://openreview.net/forum?id=I_YZANaz5X,https://openreview.net/pdf?id=I_YZANaz5X,We propose a novel learning framework to learn collaborative affordance for dual-gripper manipulation tasks.,"It is essential yet challenging for future home-assistant robots to understand and manipulate diverse 3D objects in daily human environments. Towards building scalable systems that can perform diverse manipulation tasks over various 3D shapes, recent works have advocated and demonstrated promising results in learning visual actionable affordance, which labels every point over the input 3D geometry with an action likelihood of accomplishing the downstream task (e.g., pushing or picking up). However, these works only studied single-gripper manipulation tasks, yet many real-world tasks require two hands working collaboratively. In this work, we propose a novel learning framework, DualAfford, to learn collaborative affordance for dual-gripper manipulation tasks. The core design of the approach is to reduce the quadratic problem for two grippers into two disentangled yet interconnected subtasks for efficient learning. Using the large-scale PartNet-Mobility and ShapeNet datasets, we set up four benchmark tasks for dual-gripper manipulation. Experiments prove the effectiveness and superiority of our method over three baselines. We will release code and data upon acceptance.
A video demonstration can be found at https://sites.google.com/view/dualafford.","Visual Actionable Representation for Robotics, Visual Understanding of 3D Shapes" GraphCG: Unsupervised Discovery of Steerable Factors in Graphs,https://openreview.net/forum?id=wUcNUTnOvq,https://openreview.net/pdf?id=wUcNUTnOvq,We develop an unsupervised graph controllable generation method to steer factors on molecular graphs and point clouds.,"Deep generative models have been widely developed for graph data such as molecular graphs and point clouds. Yet, much less investigation has been carried out on understanding the learned latent space of deep graph generative models. Such understanding can open up a unified perspective and provide guidelines for essential tasks like controllable generation. To this end, this work develops a method called GraphCG for unsupervised discovery of steerable factors in the latent space of deep graph generative models. We first examine the representation space of recent deep generative models trained for graph data, and observe that the learned representation space is not perfectly disentangled. Thus, our method is designed for discovering steerable factors of graph data in a model-agnostic and task-agnostic manner. Specifically, GraphCG learns semantically rich directions by maximizing the corresponding mutual information, such that graphs edited along the same direction possess certain steerable factors. We conduct experiments on two types of graph data, molecular graphs and point clouds. Both the quantitative and qualitative results show the effectiveness of GraphCG for discovering steerable factors.","graph, controllable generation, molecular graph, point clouds" Molecular Geometry Pretraining with SE(3)-Invariant Denoising Distance Matching,https://openreview.net/forum?id=CjTHVo1dvR,https://openreview.net/pdf?id=CjTHVo1dvR,"We propose GeoSSL, a self-supervised learning method using denoising distance matching for molecular geometry pretraining.","Pretraining molecular representations is critical in a variety of applications for drug and material discovery due to the limited number of labeled molecules, yet most existing work focuses on pretraining on 2D molecular graphs. The power of pretraining on 3D geometric structures, however, has been less explored. This is owing to the difficulty of finding a sufficient proxy task that can empower the pretraining to effectively extract essential features from the geometric structures. Motivated by the dynamic nature of 3D molecules, where the continuous motion of a molecule in the 3D Euclidean space forms a smooth potential energy surface, we propose a 3D coordinate denoising pretraining framework to model such an energy landscape. Leveraging an SE(3)-invariant score matching method, we propose GeoSSL in which the coordinate denoising proxy task is effectively boiled down to denoising the pairwise atomic distances in a molecule. Our comprehensive experiments confirm the effectiveness and robustness of our proposed method.","molecule, pretraining, representation, geometry, denoising score matching" SIMPLE: Specialized Model-Sample Matching for Domain Generalization,https://openreview.net/forum?id=BqrPeZ_e5P,https://openreview.net/pdf?id=BqrPeZ_e5P,,"In domain generalization (DG), most existing methods aspire to fine-tune a specific pretrained model through novel DG algorithms. In this paper, we propose an alternative direction, i.e., to efficiently leverage a pool of pretrained models without fine-tuning.
Through extensive empirical and theoretical evidence, we demonstrate that (1) pretrained models already possess generalization ability to some extent, while there is no single best pretrained model across all distribution shifts, and (2) the out-of-distribution (OOD) generalization error depends on the fit between the pretrained model and unseen test distributions. This analysis motivates us to incorporate diverse pretrained models and to dispatch the best matched models for each OOD sample by means of recommendation techniques. To this end, we propose SIMPLE, a specialized model-sample matching method for domain generalization. First, the predictions of pretrained models are adapted to the target domain by a linear label space transformation. A matching network aware of model specialty is then proposed to dynamically recommend proper pretrained models to predict each test sample. The experiments on DomainBed show that our method achieves significant performance improvements (up to 12.2% on an individual dataset and 3.9% on average) compared to state-of-the-art (SOTA) methods, and a further 6.1% gain by enlarging the pretrained model pool. Moreover, our method is highly efficient and achieves more than 1000 times training speedup compared to conventional DG methods that fine-tune a pretrained model.","domain generalization, ensemble learning, pretrained model" The Augmented Image Prior: Distilling 1000 Classes by Extrapolating from a Single Image,https://openreview.net/forum?id=6kxApT2r2i,https://openreview.net/pdf?id=6kxApT2r2i,We show that it is possible to extrapolate to semantic classes such as those of ImageNet or Kinetics using just a single datum plus heavy augmentations as visual inputs.,"What can neural networks learn about the visual world when provided with only a single image as input? While any image obviously cannot contain the multitudes of all existing objects, scenes and lighting conditions -- within the space of all $256^{3\cdot224\cdot224}$ possible $224$-sized square images, it might still provide a strong prior for natural images. To analyze this ``augmented image prior'' hypothesis, we develop a simple framework for training neural networks from scratch using a single image and augmentations, with knowledge distillation from a supervised pretrained teacher. With this, we find the answer to the above question to be: `surprisingly, a lot'. In quantitative terms, we find accuracies of $94\%$/$74\%$ on CIFAR-10/100, $69$\% on ImageNet, and by extending this method to video and audio, $51\%$ on Kinetics-400 and $84$\% on SpeechCommands. In extensive analyses spanning 13 datasets, we disentangle the effects of augmentations, the choice of data and network architectures, and also provide qualitative evaluations that include lucid ``panda neurons'' in networks that have never even seen one.","Augmentations, Single Image Learning, Distillation" Trust-consistent Visual Semantic Embedding for Image-Text Matching,https://openreview.net/forum?id=Xi8JtRx75B,https://openreview.net/pdf?id=Xi8JtRx75B,,"Visual Semantic Embedding (VSE), as a link between Computer Vision and Natural Language Processing, aims at jointly learning cross-modal embeddings to bridge the discrepancy across visual and textual spaces. In recent years, VSE has achieved great success in image-text matching benefiting from the outstanding representation power of deep learning.
However, existing methods produce retrieved results relying only on the ranking of cross-modal similarities, even when the retrieved results are unreliable and uncertain. That is to say, they cannot self-evaluate the quality of retrieved results for trustworthy retrieval, and thus ignore the ubiquitous uncertainty in data and models. To address this problem, we propose a novel VSE-based method for image-text matching, namely Trust-consistent Visual Semantic Embedding (TcVSE), to embrace trustworthy retrieval and self-evaluation for image-text matching. To be specific, first, TcVSE models the evidence based on cross-modal similarities to capture accurate uncertainty. Second, a simple yet effective consistency module is presented to enforce subjective opinions of bidirectional VSE models (i2t+t2i) to be consistent for high reliability and accuracy. Finally, extensive comparison experiments are conducted to demonstrate the superiority of TcVSE on two widely-used benchmark datasets, i.e., Flickr30K and MS-COCO. Furthermore, some qualitative experiments are carried out to provide comprehensive and insightful analyses for the reliability and rationality of our method.","visual semantic embedding, image-text matching, uncertainty learning, multi-modal learning" Rethinking Knowledge Distillation via Cross-Entropy,https://openreview.net/forum?id=J13x0dErg1,https://openreview.net/pdf?id=J13x0dErg1,A new teacher-based knowledge distillation method and a new teacher-free knowledge distillation method,"Knowledge Distillation (KD) has been developed extensively and has boosted various tasks. The classical KD method adds the KD loss to the original cross-entropy (CE) loss. We try to decompose the KD loss to explore its relation with the CE loss. Surprisingly, we find it can be regarded as a combination of the CE loss and an extra loss that has the same form as the CE loss. However, we notice that the extra loss forces the student's relative probability to learn the teacher's absolute probability. Moreover, the sums of the two probabilities differ, making the loss hard to optimize. To address this issue, we revise the formulation and propose a distributed loss. In addition, we utilize the teacher's target output as the soft target, proposing the soft loss. Combining the soft loss and the distributed loss, we propose a new KD loss (NKD). Furthermore, we smooth the student's target output to treat it as the soft target for training without teachers and propose a teacher-free new KD loss (tf-NKD). Our method achieves state-of-the-art performance on CIFAR-100 and ImageNet. For example, with ResNet-34 as the teacher, we boost the ImageNet Top-1 accuracy of ResNet18 from 69.90% to 71.96%. In training without teachers, MobileNet, ResNet-18 and SwinTransformer-Tiny achieve 70.04%, 70.76%, and 81.48%, which are 0.83%, 0.86%, and 0.30% higher than the baseline, respectively.","Knowledge Distillation, Image Classification" Protein structure generation via folding diffusion,https://openreview.net/forum?id=Nkd7AS2USRd,https://openreview.net/pdf?id=Nkd7AS2USRd,"Inspired by the protein folding process, we introduce a new diffusion-based generative model that acts on the inter-residue angles in protein backbones and generates diverse, designable protein structures without needing equivariance mechanisms.","The ability to computationally generate novel yet physically foldable protein structures could lead to new biological discoveries and new treatments targeting yet incurable diseases.
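For reference, here is the classical KD objective that the knowledge-distillation abstract above sets out to decompose: a cross-entropy term on ground-truth labels plus a temperature-scaled KL term toward the teacher. This is the standard baseline loss, not the proposed NKD loss; the temperature and mixing weight are illustrative.

```python
# The classical KD loss: CE on labels + temperature-scaled KL toward the teacher.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * T * T                                   # rescale gradients for temperature T
    return (1 - alpha) * ce + alpha * kl

s, t = torch.randn(4, 10), torch.randn(4, 10)
print(kd_loss(s, t, torch.tensor([1, 0, 3, 7])))
```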
Despite recent advances in protein structure prediction, directly generating diverse, novel protein structures from neural networks remains difficult. In this work, we present a new diffusion-based generative model that designs protein backbone structures via a procedure that mirrors the native folding process. We describe protein backbone structure as a series of consecutive angles capturing the relative orientation of the constituent amino acid residues, and generate new structures by denoising from a random, unfolded state towards a stable folded structure. Not only does this mirror how proteins biologically twist into energetically favorable conformations, but the inherent shift and rotational invariance of this representation also crucially alleviates the need for complex equivariant networks. We train a denoising diffusion probabilistic model with a simple transformer backbone and demonstrate that our resulting model unconditionally generates highly realistic protein structures with complexity and structural patterns akin to those of naturally-occurring proteins. As a useful resource, we release the first open-source codebase and trained models for protein structure diffusion.","Generative modeling of protein backbone structures, structural biology, diffusion, diffusion modeling, generative modeling, proteins, internal coordinates" Backpropagation Path Search On Adversarial Transferability,https://openreview.net/forum?id=UAB7seI4nq,https://openreview.net/pdf?id=UAB7seI4nq,We boost adversarial transferability by searching paths in the backpropagation process.,"Transfer-based attackers craft adversarial examples against surrogate models and transfer them to victim models deployed in black-box settings. It is generally accepted that gradients from diverse modules of surrogate models used for perturbation generation contribute differently to transferability. In this paper, we propose backPropagation pAth Search (PAS), which enhances adversarial transferability from the backpropagation perspective. We use structural reparameterization to make the basic modules of DNNs (i.e., convolution and activation) compute the forward pass as normal but backpropagate the gradients in a skip-connection form. Thus, a DAG-based search space is constructed for the backpropagation path. PAS employs Bayesian Optimization to search for the most transferable path and reduces the search overhead via a one-step approximation. We conduct comprehensive attack experiments in a wide range of transfer settings, showing that PAS improves the attack success rate by a large margin for both normally trained and defense models.","Adversarial Attack, Adversarial Transferability, Black-box Attack" Delving into Semantic Scale Imbalance,https://openreview.net/forum?id=07tc5kKRIo,https://openreview.net/pdf?id=07tc5kKRIo,"Our proposed semantic scale, like the number of samples, is a natural measure of class imbalance and does not depend on the model’s predictions.","Model bias triggered by long-tailed data has been widely studied. However, measures based on the number of samples cannot explain three phenomena simultaneously: (1) Given enough data, the classification performance gain is marginal with additional samples. (2) Classification performance decays precipitously as the number of training samples decreases when there is insufficient data. (3) Models trained on sample-balanced datasets still have different biases for different classes. 
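A minimal sketch of the forward (noising) process such an angle-based diffusion model would use, assuming a standard DDPM schedule and a simple wrap-to-[-π, π) treatment of angular periodicity; the paper's exact handling of angles and its schedule may differ.

```python
import numpy as np

def wrap(theta):
    """Keep angles in [-pi, pi), respecting the periodicity of the representation."""
    return np.mod(theta + np.pi, 2.0 * np.pi) - np.pi

T = 1000
betas = np.linspace(1e-4, 0.02, T)             # a common DDPM noise schedule (assumption)
alphas_cumprod = np.cumprod(1.0 - betas)

def q_sample(angles, t, rng):
    """Standard DDPM forward step applied to backbone angles."""
    a = alphas_cumprod[t]
    noise = rng.standard_normal(angles.shape)
    return wrap(np.sqrt(a) * angles + np.sqrt(1.0 - a) * noise)

rng = np.random.default_rng(0)
angles = rng.uniform(-np.pi, np.pi, size=(128, 6))   # 128 residues, 6 angles each (illustrative)
noisy = q_sample(angles, t=500, rng=rng)
print(noisy.shape, float(noisy.min()), float(noisy.max()))
```

The reverse model then denoises from a uniformly random ("unfolded") angle state back to a folded structure, which is where the transformer backbone comes in.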
In this work, we define and quantify the semantic scale of classes, which is equivalent to the feature diversity of classes. Experimentally, we find a marginal effect of semantic scale, which neatly accounts for the first two phenomena. Further, we propose a quantitative measurement of semantic scale imbalance, which can accurately reflect model bias on multiple datasets, even on sample-balanced data, revealing a novel perspective for the study of class imbalance. Due to the prevalence of semantic scale imbalance, we propose semantic-scale-balanced learning, including a general loss improvement scheme and a dynamic re-weighting training framework that overcomes the challenge of calculating semantic scales in real-time during iterations. Comprehensive experiments show that dynamic semantic-scale-balanced learning consistently improves model performance on large-scale long-tailed and non-long-tailed datasets, making it a good starting point for mitigating the prevalent but unnoticed model bias. ","Imbalanced Learning, Model bias, Long-tailed distribution" Masked Surfel Prediction for Self-Supervised Point Cloud Learning,https://openreview.net/forum?id=GN6cm7uSjV,https://openreview.net/pdf?id=GN6cm7uSjV,Incorporating local geometry information explicitly into masked auto-encoding,"Masked auto-encoding is a popular and effective self-supervised learning approach to point cloud learning. However, most of the existing methods reconstruct only the masked points and overlook the local geometry information, which is also important for understanding point cloud data. In this work, we make the first attempt, to the best of our knowledge, to incorporate local geometry information explicitly into masked auto-encoding, and propose a novel Masked Surfel Prediction (MaskSurf) method. Specifically, given the input point cloud masked at a high ratio, we learn a transformer-based encoder-decoder network to estimate the underlying masked surfels by simultaneously predicting the surfel positions (i.e., points) and per-surfel orientations (i.e., normals). The predictions of points and normals are supervised by the Chamfer Distance and a newly introduced Position-Indexed Normal Distance in a set-to-set manner. Our MaskSurf is validated on six downstream tasks under three fine-tuning strategies. In particular, MaskSurf outperforms its closest competitor, Point-MAE, by 1.2\% on the real-world dataset of ScanObjectNN under the OBJ-BG setting, justifying the advantages of masked surfel prediction over masked point cloud reconstruction.","Self-supervised point cloud learning, surfel representation, masked auto-encoding" Do Spiking Neural Networks Learn Similar Representation with Artificial Neural Networks? A Pilot Study on SNN Representation,https://openreview.net/forum?id=OZG9yDOz0b,https://openreview.net/pdf?id=OZG9yDOz0b,A systematic study of the representation differences between ANNs and SNNs is conducted in this work. ,"Spiking Neural Networks (SNNs) have recently attracted much research interest owing to their bio-plausibility and energy efficiency. The biomimetic spatial-temporal communication and computation mechanisms are the key differences that set SNNs apart from current Artificial Neural Networks (ANNs). However, some essential questions pertaining to SNNs remain little studied: Do SNNs learn representations similar to those of ANNs? Does the time dimension in spiking neurons provide additional information? 
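A sketch of the two supervision signals the MaskSurf abstract names: the standard set-to-set Chamfer Distance, plus one plausible reading of the Position-Indexed Normal Distance that compares normals at nearest-point matches. The indexing rule and the sign-invariant normal comparison are assumptions for illustration; the paper's exact definition may differ.

```python
import torch
import torch.nn.functional as F

def chamfer_distance(p, q):
    """Set-to-set Chamfer Distance between point sets p (N, 3) and q (M, 3)."""
    d = torch.cdist(p, q)                              # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def position_indexed_normal_distance(p, n_p, q, n_q):
    """Compare normals at positions matched by nearest neighbors (assumed indexing)."""
    idx = torch.cdist(p, q).argmin(dim=1)              # nearest point in q for each point in p
    cos = F.cosine_similarity(n_p, n_q[idx], dim=1)
    return (1.0 - cos.abs()).mean()                    # normals agree up to sign

pred_pts = torch.randn(64, 3)
pred_nrm = F.normalize(torch.randn(64, 3), dim=1)
gt_pts = torch.randn(80, 3)
gt_nrm = F.normalize(torch.randn(80, 3), dim=1)
loss = chamfer_distance(pred_pts, gt_pts) \
     + position_indexed_normal_distance(pred_pts, pred_nrm, gt_pts, gt_nrm)
print(loss.item())
```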
In this paper, we aim to answer these questions by conducting a representation similarity analysis between SNNs and ANNs using Centered Kernel Alignment~(CKA). We start by analyzing the spatial dimension of the networks, including both the width and the depth. Furthermore, our analysis of residual connections shows that SNNs learn a periodic pattern, which rectifies their representations towards being ANN-like. We additionally investigate the effect of the time dimension on SNN representation, finding that deeper layers encourage more dynamics along the time dimension. Other aspects, such as potential improvements in accuracy, efficiency, and adversarial robustness, are also analyzed using CKA. We hope this work will inspire future research to fully comprehend the representation of SNNs.","Spiking Neural Networks, Artificial Neural Network, Representation Similarity Analysis" DAG Matters! GFlowNets Enhanced Explainer for Graph Neural Networks,https://openreview.net/forum?id=jgmuRzM-sb6,https://openreview.net/pdf?id=jgmuRzM-sb6,,"Uncovering rationales behind predictions of graph neural networks (GNNs) has received increasing attention over the years. Existing literature mainly focuses on selecting a subgraph, through combinatorial optimization, to provide faithful explanations. However, the exponential size of candidate subgraphs limits the applicability of state-of-the-art methods to large-scale GNNs. We improve on this through a different approach: by proposing a generative structure – GFlowNets-based GNN Explainer (GFlowExplainer), we turn the optimization problem into a step-by-step generative problem. Our GFlowExplainer aims to learn a policy that generates a distribution of subgraphs for which the probability of a subgraph is proportional to its reward. The proposed approach eliminates the influence of node sequence and thus does not need any pre-training strategies. We also propose a new cut vertex matrix to efficiently explore parent states for the GFlowNets structure, thus making our approach applicable in a large-scale setting. We conduct extensive experiments on both synthetic and real datasets, and both qualitative and quantitative results show the superiority of our GFlowExplainer.","GNN, Interpretability" Generalized Category Discovery via Adaptive GMMs without Knowing the Class Number,https://openreview.net/forum?id=oQjWltREeRA,https://openreview.net/pdf?id=oQjWltREeRA,,"In this paper, we address the problem of generalized category discovery (GCD), \ie, given a set of images where some are labelled and the rest are not, the task is to automatically cluster the images in the unlabelled data, leveraging the information from the labelled data, while the unlabelled data contain images from the labelled classes and also new ones. GCD is similar to semi-supervised learning (SSL) but is more realistic and challenging, as SSL assumes all the unlabelled images are from the same classes as the labelled ones. We also do not assume the class number in the unlabelled data is known a priori, making the GCD problem even harder. To tackle the problem of GCD without knowing the class number, we propose an EM-like framework that alternates between representation learning and class number estimation. We propose a semi-supervised variant of the Gaussian Mixture Model (GMM) with a stochastic splitting and merging mechanism to dynamically determine the prototypes by examining the cluster compactness and separability. 
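Linear CKA, the similarity index used in the SNN-vs-ANN analysis above, is compact enough to state in full. The time-averaging of SNN activations below is one illustrative way to obtain a (samples × features) matrix from spatio-temporal activations, not necessarily the paper's choice.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activation matrices X (n, d1) and Y (n, d2)."""
    X = X - X.mean(axis=0)                 # center features over the sample dimension
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
ann_act = rng.standard_normal((512, 64))               # ANN layer activations (samples x features)
snn_act = rng.standard_normal((512, 4, 64)).mean(1)    # SNN activations averaged over 4 time steps
print(linear_cka(ann_act, snn_act))                    # ~0 for random features, 1 for identical ones
```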
With these prototypes, we leverage prototypical contrastive learning for representation learning on the partially labelled data, subject to the constraints imposed by the labelled data. Our framework alternates between these two steps until convergence. The cluster assignment for an unlabelled instance can then be retrieved by identifying its nearest prototype. We comprehensively evaluate our framework on both generic image classification datasets and challenging fine-grained object recognition datasets, achieving state-of-the-art performance. ","generalized category discovery, transfer learning, clustering, deep learning" A MULTI-SCALE STRUCTURE-PRESERVING HETEROLOGOUS IMAGE TRANSFORMATION ALGORITHM BASED ON CONDITIONAL ADVERSARIAL NETWORK LEARNING,https://openreview.net/forum?id=P45P8xfL_n,https://openreview.net/pdf?id=P45P8xfL_n,Proposed new model structure and two loss functions reduce distortion and blur in generated heterologous images,"Image transformation model learning is a basic technology for image enhancement, image super-resolution, image generation, multimodal image fusion, etc. It uses deep convolutional networks as a representation model for arbitrary functions, and uses fitting optimization on paired image training sets to solve for the transformation model between images in the different sets. Owing to the complex and diverse changes in the 3D shape of the actual scene and the pixel-level optical properties of materials, solving the heterologous image conversion model is an ill-posed problem. In recent years, most of the proposed conditional adversarial learning methods for image transformation networks only consider an overall consistency loss constraint on the image, and the generated images often contain pseudo-features or local structural deformations. To solve this problem, using the idea of multi-scale image coding and perception, this paper proposes a multi-scale structure-preserving heterologous image transformation method based on conditional adversarial network learning. First, using the idea of multi-scale coding and reconstruction, a lightweight multi-scale, step-by-step generator network structure is designed. Then, two multi-scale image structure loss functions are proposed and combined with the existing overall consistency loss to form the loss function for generative adversarial learning. Finally, test experiments are performed on the KAIST-MPD-set1 dataset. The experimental results show that, compared with state-of-the-art algorithms, the proposed algorithm better suppresses local structural distortion and has significant advantages on evaluation indicators such as RMSE, LPIPS, PSNR, and SSIM.","Heterologous Image Transformation, Multi-scale feature encoding, Generative Adversarial Networks" Metro: Memory-Enhanced Transformer for Retrosynthetic Planning via Reaction Tree,https://openreview.net/forum?id=9JjGZsDvHb,https://openreview.net/pdf?id=9JjGZsDvHb,We use reaction database to search the retrosynthetic routes and introduce a memory network to learn the context information of the route.,"Retrosynthetic planning plays a critical role in drug discovery and organic chemistry. Starting from a target molecule as the root node, it aims to find a complete reaction tree subject to the constraint that all leaf nodes belong to a set of starting materials. Multi-step reactions are crucial because they determine the production flow in the organic chemical industry. 
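The final cluster-assignment step in the GCD framework above is a nearest-prototype lookup. A sketch assuming cosine similarity as the distance; the actual metric and the prototype count (determined by the split-and-merge mechanism) are the paper's, not fixed here.

```python
import torch
import torch.nn.functional as F

def assign_to_prototypes(feats, prototypes):
    """Retrieve cluster assignments as the nearest prototype (cosine similarity assumed)."""
    sim = F.normalize(feats, dim=1) @ F.normalize(prototypes, dim=1).T   # (N, K)
    return sim.argmax(dim=1)                                             # nearest prototype index

feats = torch.randn(32, 128)         # embeddings of unlabelled instances
prototypes = torch.randn(12, 128)    # current prototypes (count found by split/merge)
print(assign_to_prototypes(feats, prototypes))
```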
However, existing datasets lack curation of tree-structured multi-step reactions and fail to provide such reaction trees, limiting models' understanding of organic molecule transformations. In this work, we first develop a benchmark curated for the retrosynthetic planning task, which consists of 124,869 reaction trees retrieved from the public USPTO-full dataset. On top of that, we propose Metro: Memory-Enhanced Transformer for RetrOsynthetic planning. Specifically, the dependency among molecules in the reaction tree is captured as context information for multi-step retrosynthesis predictions through transformers with a memory module. Extensive experiments show that Metro dramatically outperforms existing single-step retrosynthesis models by at least 10.7% in top-1 accuracy. The experiments demonstrate the superiority of exploiting context information in the retrosynthetic planning task. Moreover, the proposed model can be directly used for synthetic accessibility analysis, as it is trained on reaction trees with the shortest depths. Our work is the first step towards a brand new formulation for retrosynthetic planning in the aspects of data construction, model design, and evaluation.","Retrosynthetic Planning, Transformer, Memory Network, Reaction Database, Reaction tree" In the ZONE: Measuring difficulty and progression in curriculum generation,https://openreview.net/forum?id=TJjaQEOK8a,https://openreview.net/pdf?id=TJjaQEOK8a,This work proposes a Bayesian computational framework to operationalize ``the zone of proximal development'' and to improve existing curriculum generation algorithms.,"A common strategy in curriculum generation for reinforcement learning is to train a teacher network to generate tasks that fall within a student network's ``zone of proximal development'' (ZPD). These are tasks that are not too easy and not too hard for the student. Albeit intuitive, ZPD is not well understood computationally. We propose ZONE, a novel computational framework that operationalizes ZPD. It formalizes ZPD through the language of Bayesian probability theory, revealing that tasks should be selected by difficulty (the student's success probability on the task) and learning progression (the degree of change in the student's model parameters). ZONE operationalizes ZPD with two techniques that we apply on top of existing algorithms. One is REJECT, which rejects tasks outside a difficulty scope, and the other is GRAD, which prioritizes tasks that maximize the student's gradient norm. Compared to the original algorithms, the ZONE techniques improve the student’s generalization performance on discrete Minigrid environments and continuous control Mujoco domains with up to $9 \times$ higher success. ZONE also accelerates the student's learning by training on up to $10\times$ less data.","curriculum learning, multiagent, Bayesian" Object Tracking by Hierarchical Part-Whole Attention,https://openreview.net/forum?id=IzI055GrvG,https://openreview.net/pdf?id=IzI055GrvG,,"In this paper, we show that hierarchical representations of objects can provide an informative and low-noise proxy for associating objects of interest in multi-object tracking. This aligns with the intuition that we usually only need to compare a small region of a target's body to distinguish it from other objects. We build the hierarchical representation in levels of (1) target body parts, (2) the whole target body, and (3) the union area of the target and other overlapping objects. 
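A minimal sketch of the two ZONE techniques named above. The difficulty band, the stand-in student, and the task encodings are assumptions for illustration; only the two rules themselves (reject by success probability, prioritize by induced gradient norm) come from the abstract.

```python
import torch
import torch.nn as nn

def reject(tasks, success_prob, lo=0.1, hi=0.9):
    """REJECT: keep only tasks whose estimated success probability lies in a difficulty band."""
    return [t for t in tasks if lo <= success_prob(t) <= hi]

def grad_score(student, loss):
    """GRAD: score a task by the norm of the gradient it induces in the student."""
    grads = torch.autograd.grad(loss, list(student.parameters()))
    return torch.sqrt(sum((g ** 2).sum() for g in grads))

torch.manual_seed(0)
student = nn.Linear(4, 2)                                   # stand-in student network
tasks = [torch.randn(4) for _ in range(8)]                  # task encodings (illustrative)
candidates = reject(tasks, lambda t: torch.sigmoid(t.sum()).item())
scores = [grad_score(student, student(t).pow(2).mean()) for t in candidates]
best = candidates[int(torch.stack(scores).argmax())] if candidates else None
```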
Furthermore, with a transformer-based spatio-temporal attention mechanism, we can solve tracking in a global fashion while keeping the process online. We design our method by combining the representation with the transformer and name it Hierarchical Part-Whole Attention, or HiPWA for short. Experiments on multiple datasets demonstrate its effectiveness. Moreover, previous methods mostly leverage transformers to exploit long temporal context during association, which requires heavy computational resources. HiPWA instead focuses on a more informative representation of objects in every single frame, making it more robust to the length of the temporal context and more computationally economical. ","multi-object tracking, transformer, visual representation" Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking,https://openreview.net/forum?id=nId8ZtIXub,https://openreview.net/pdf?id=nId8ZtIXub,,"Recent advances in object detection and re-identification have greatly improved the performance of Multi-Object Tracking (MOT) methods, but progress in motion modeling has been limited. The motion model is a key component of many MOT methods and is commonly used to predict an object's future position. However, mainstream motion models in MOT naively assume that object motion is linear. They rely on detections in each frame as observations to update the motion models. However, in practice, the observations can be noisy and even missing, especially in crowded scenes, which greatly degrades the performance of existing MOT methods. In this work, we show that a simple filtering-based motion model can still obtain state-of-the-art tracking performance if proper care is given to missing observations and noisy estimates. We emphasize the role of observations in recovering lost tracks and in reducing the error accumulated by the linear motion assumption while the target is lost. In contrast to the popular motion-based method SORT, which is estimation-centric, we name our method Observation-Centric SORT (OC-SORT). It remains simple, online, and real-time, but improves robustness to occlusion and non-linear motion. It achieves state-of-the-art on multiple MOT benchmarks, including MOT17, MOT20, KITTI, head tracking, and especially DanceTrack, where the object motion is highly non-linear.",multi-object tracking scFormer: a universal representation learning approach for single-cell data using transformers,https://openreview.net/forum?id=7hdmA0qtr5,https://openreview.net/pdf?id=7hdmA0qtr5,,"Single-cell sequencing has emerged as a promising technique to decode cellular heterogeneity and analyze gene functions. With the high throughput of modern techniques and resulting large-scale sequencing data, deep learning has been used extensively to learn representations of individual cells for downstream tasks. However, most existing methods rely on fully connected networks and are unable to model complex relationships between both cell and gene representations. We hereby propose scFormer, a novel transformer-based deep learning framework to jointly optimize cell and gene embeddings for single-cell biology in an unsupervised manner. By drawing parallels between natural language processing and genomics, scFormer applies self-attention to learn salient gene and cell embeddings through masked gene modelling. 
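For context on the filtering-based motion model that SORT-style trackers (including OC-SORT) build on: a constant-velocity Kalman filter, shown here for a single coordinate, with a missing detection handled by prediction only. The noise covariances are illustrative, and OC-SORT's observation-centric corrections are not shown.

```python
import numpy as np

dt = 1.0
F_cv = np.array([[1.0, dt], [0.0, 1.0]])    # constant-velocity state transition
H = np.array([[1.0, 0.0]])                  # only the position (the detection) is observed
Q = 1e-2 * np.eye(2)                        # process noise (assumed)
R = np.array([[1e-1]])                      # observation noise (assumed)

def predict(x, P):
    return F_cv @ x, F_cv @ P @ F_cv.T + Q

def update(x, P, z):                        # skipped on frames with missing detections
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(2) - K @ H) @ P
    return x, P

x, P = np.array([0.0, 1.0]), np.eye(2)      # start at position 0 with velocity 1
for z in ([1.1], [2.0], None, [4.2]):       # a missing detection at frame 3
    x, P = predict(x, P)
    if z is not None:
        x, P = update(x, P, np.array(z))
print(x)                                     # estimated [position, velocity]
```

During the missed frame the state evolves purely under the linear-motion assumption, which is exactly the error accumulation the OC-SORT abstract emphasizes.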
scFormer provides a unified framework to readily address a variety of downstream tasks such as data integration, analysis of gene function, and perturbation response prediction. Extensive experiments using scFormer show state-of-the-art performance on seven datasets across the relevant tasks. The scFormer model implementation is available at https://anonymous.4open.science/r/scFormer-E7E4/","single-cell genomics, self-supervised learning, transformer" Understanding the Training Dynamics in Federated Deep Learning via Aggregation Weight Optimization,https://openreview.net/forum?id=VJjtgKzrmj,https://openreview.net/pdf?id=VJjtgKzrmj,"We provide new understanding of the training dynamics of federated learning with neural networks and devise a practical tool for aggregation weight optimization, improving global model generalization.","From the server's perspective, federated learning (FL) learns a global model by iteratively sampling a cohort of clients and updating the global model with the summed local gradients of the cohort. We find this process is analogous to mini-batch SGD in centralized training. In mini-batch SGD, a model is learned by iteratively sampling a batch of data and updating the model with the summed gradients of the batch. In this paper, we delve into the training dynamics in FL by learning from the experience of optimization and generalization in mini-batch SGD. Specifically, we focus on two aspects: \emph{client coherence} (analogous to sample coherence in mini-batch SGD) and \emph{global weight shrinking regularization} (analogous to weight decay in mini-batch SGD). We find that the roles of both aspects are determined by the aggregation weights assigned to each client during global model updating. Thus, we use aggregation weight optimization on the server as a tool to study how client heterogeneity and the number of local epochs affect the global training dynamics in FL. Besides, we propose an effective method for \textbf{Fed}erated \textbf{A}ggregation \textbf{W}eight \textbf{O}ptimization, named \textsc{\textbf{FedAWO}}. Extensive experiments verify that our method can improve the generalization of the global model by a large margin on different datasets and models.","Federated learning, deep learning, weighted aggregation, training dynamics, optimization, neural network." BiBench: Benchmarking and Analyzing Network Binarization,https://openreview.net/forum?id=e1u9PVnwNr,https://openreview.net/pdf?id=e1u9PVnwNr,"We present BiBench, aiming to rigorously benchmark and analyze network binarization.","Neural network binarization emerges as one of the most promising compression approaches with extraordinary computation and memory savings by minimizing the bit-width of weights and activations. However, despite being a generic technique, recent works reveal that applying binarization in a wide range of realistic scenarios involving diverse tasks, architectures, and hardware is not trivial. Moreover, common challenges, such as severe degradation in accuracy and limited efficiency gains, suggest that specific attributes of binarization are not thoroughly studied and adequately understood. To close this gap, we present BiBench, a rigorously designed benchmark with in-depth analysis for network binarization. We first carefully scrutinize the requirements of binarization in the actual production setting. We thus define the evaluation tracks and metrics for a fair and systematic investigation. 
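A sketch of the server-side weighted aggregation that aggregation weight optimization tunes. The softmax parameterization of the weights is an illustrative assumption (it keeps the weights on the simplex); in the paper's framing, scaling the overall magnitude of the weights below one would act as global weight shrinking.

```python
import torch

def aggregate(global_w, client_deltas, agg_logits):
    """Weighted FedAvg-style update with learnable aggregation weights (illustrative)."""
    w = torch.softmax(agg_logits, dim=0)       # per-client aggregation weights
    new_w = {}
    for k in global_w:
        new_w[k] = global_w[k] + sum(w[i] * client_deltas[i][k]
                                     for i in range(len(client_deltas)))
    return new_w

global_w = {"fc.weight": torch.zeros(2, 4)}
deltas = [{"fc.weight": torch.randn(2, 4)} for _ in range(5)]   # local updates from 5 clients
agg_logits = torch.zeros(5, requires_grad=True)  # would be optimized on the server
print(aggregate(global_w, deltas, agg_logits)["fc.weight"])
```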
We then perform a comprehensive evaluation with a rich collection of milestone binarization algorithms. Our benchmark results show that binarization still faces severe accuracy challenges, and that newer state-of-the-art binarization algorithms bring diminishing improvements, even at the expense of efficiency. Moreover, the actual deployment of certain binarization operations reveals a surprisingly large deviation from their theoretical consumption. Finally, we provide suggestions based on our benchmark results and analysis, aimed at establishing a paradigm for accurate and efficient binarization among existing techniques. We hope BiBench paves the way towards more extensive adoption of network binarization and serves as a foundation for future research.","Model Binarization, Network Compression, Deep Learning" Contextual Image Masking Modeling via Synergized Contrasting without View Augmentation for Faster and Better Visual Pretraining,https://openreview.net/forum?id=A3sgyt4HWp,https://openreview.net/pdf?id=A3sgyt4HWp,We propose a novel framework for synergizing MIM and contrastive learning in a closed loop.,"We propose a new contextual masked image modeling (MIM) approach called contrasting-aided contextual MIM (ccMIM), under the MIM paradigm for visual pretraining. Specifically, we adopt importance sampling to select the masked patches with richer semantic information for reconstruction, instead of the random sampling used in previous MIM works. As such, reconstructing from the remaining, less semantic patches becomes more difficult and aids learning. To speed up the possibly slowed convergence due to our more difficult reconstruction task, we further propose a new contrastive loss that aligns the tokens of the vision transformer extracted from the selected masked patches and the remaining ones, respectively. The hope is that it serves as a regularizer for patch feature learning such that image-level global information is captured in both masked and unmasked patches; notably, such single-view contrasting avoids the tedious image augmentation step required in recent efforts to introduce contrastive learning to MIM (to speed up convergence and improve discriminative ability). Meanwhile, the attention score from the contrastive global feature can also carry effective semantic clues to in turn guide our masked patch selection scheme. As a consequence, our contextual MIM and contrastive learning are synergetically performed in a loop (semantic patch selection-token alignment contrasting) to achieve the best of both worlds: fast convergence and strong performance on downstream tasks without ad-hoc augmentations, as verified by empirical results on ImageNet-1K for both classification and dense vision tasks. ","Mask Image Modeling, Self-supervised learning" Patch-Level Contrasting without Patch Correspondence for Accurate and Dense Contrastive Representation Learning,https://openreview.net/forum?id=10R_bcjFwJ,https://openreview.net/pdf?id=10R_bcjFwJ,We propose a new self-supervised learning method to learn both spatial-sensitive and global-discriminative information,"We propose ADCLR: \underline{A}ccurate and \underline{D}ense \underline{C}ontrastive \underline{R}epresentation \underline{L}earning, a novel self-supervised learning framework for learning accurate and dense vision representations. To extract spatial-sensitive information, ADCLR introduces query patches for contrasting in addition to global contrasting. 
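For context on what the benchmarked binarization algorithms share, here is the canonical 1-bit weight/activation scheme with a straight-through estimator for the backward pass; the per-output scaling is the XNOR-Net-style choice, one common variant among those BiBench evaluates.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """1-bit quantization with a straight-through estimator in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)
    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1.0).float()   # pass gradients only inside [-1, 1]

def binary_linear(x, w):
    alpha = w.abs().mean(dim=1, keepdim=True)        # per-output scale to reduce quantization error
    return BinarizeSTE.apply(x) @ (BinarizeSTE.apply(w) * alpha).T

x = torch.randn(8, 32, requires_grad=True)
w = torch.randn(16, 32, requires_grad=True)
binary_linear(x, w).sum().backward()                 # gradients flow through the STE
print(w.grad.shape)
```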
Compared with previous dense contrasting methods, ADCLR enjoys three main merits: i) achieving both global-discriminative and spatial-sensitive representations, ii) being model-efficient (no extra parameters beyond the global contrasting baseline), and iii) being correspondence-free and thus simpler to implement. Our approach achieves new state-of-the-art performance for contrastive methods. On classification tasks, for ViT-S, ADCLR achieves 78.1\% top-1 accuracy on ImageNet with linear probing, outperforming our baseline (DINO, i.e., without our devised techniques as plug-ins) by 1.1\%. For ViT-B, ADCLR achieves 79.8\% and 84.0\% accuracy on ImageNet by linear probing and fine-tuning, outperforming DINO by 0.6\% and 0.4\%, respectively. For dense tasks, on MS-COCO, ADCLR achieves significant improvements of 44.3\% AP on object detection and 39.7\% AP on instance segmentation, outperforming the previous SOTA method SelfPatch by 2.2\% and 1.2\%, respectively. On ADE20K, ADCLR outperforms SelfPatch by 1.0\% mIoU and 1.2\% mAcc on the segmentation task.","self supervised learning, contrastive learning" Winograd Structured Pruning for Fast Winograd Convolution ,https://openreview.net/forum?id=E9_04otJ62,https://openreview.net/pdf?id=E9_04otJ62,"We propose a novel Winograd structured pruning method, which prunes the weights in the Winograd domain in a structured form with optimized pruning unit size for fast Winograd convolution on parallel processors.","Convolutional Neural Networks (CNNs) are computationally intensive, which limits their deployment on mobile devices. To minimize operation counts in CNNs, pruning optimization techniques and Winograd’s minimal filtering algorithm are widely used; however, the benefit of pruning disappears when both optimizations are simply applied together in a CNN. To take full advantage of both approaches, two pruning methods were previously proposed: one applies pruning after kernel transformation, and the other applies filter pruning on Winograd convolution. Unfortunately, the first method is hardware-unfriendly and the second suffers from a significant loss of accuracy. Thus, we propose a structured pruning method specialized for Winograd convolution that maximizes hardware utilization by considering the conversion algorithm of parallel processors. We analyze the conversion algorithm of Winograd convolution on parallel processing units; then, we prune the weights in the Winograd domain in a structured form with an optimized pruning unit size, which maximizes the parallelism of the hardware while minimizing the loss of accuracy. For VGG-16 on the ImageNet dataset, the inference time of our method is $1.84$ and $2.89$ times better than the previous two pruning methods with less than $1\%$ accuracy loss.","Winograd convolution, structured pruning, GPU, parallel processor" Continuous-Discrete Convolution for (3+1)D Geometry-Sequence Modeling in Proteins,https://openreview.net/forum?id=P5Z-Zl9XJ7,https://openreview.net/pdf?id=P5Z-Zl9XJ7,This paper proposes a Continuous-Discrete Convolution (CDConv) for the (3+1)D geometry-sequence structure modeling in proteins.,"The structure of proteins involves 3D geometry of amino acid coordinates and 1D sequence of peptide chains. The 3D structure exhibits irregularity because amino acids are distributed unevenly in Euclidean space and their coordinates are continuous variables. In contrast, the 1D structure is regular because amino acids are arranged uniformly in the chains and their sequential positions (orders) are discrete variables. 
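The Winograd transform whose domain this pruning method operates in can be shown in miniature with F(2,3), which produces two outputs of a 3-tap convolution using four multiplications instead of six; the paper prunes the transformed weights (the G-domain) in structured units. A self-checking sketch:

```python
import numpy as np

# Winograd F(2,3) transform matrices
BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
G  = np.array([[1, 0, 0], [.5, .5, .5], [.5, -.5, .5], [0, 0, 1]], float)
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)

def winograd_f23(d, g):          # d: 4 inputs, g: 3 filter taps
    m = (G @ g) * (BT @ d)       # 4 elementwise multiplies in the Winograd domain
    return AT @ m                # 2 outputs; structured pruning zeroes entries of G @ g

d, g = np.arange(4.0), np.array([1.0, 2.0, 3.0])
print(winograd_f23(d, g))                    # -> [d0*g0+d1*g1+d2*g2, d1*g0+d2*g1+d3*g2]
print(np.correlate(d, g, mode="valid"))      # matches the direct computation
```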
Moreover, geometric coordinates and sequential orders lie in two different types of spaces, and their units of length are incompatible. These inconsistencies make it challenging to capture the (3+1)D structure while avoiding the impact of sequence and geometry modeling on each other. This paper proposes a Continuous-Discrete Convolution (CDConv) that uses irregular and regular approaches to model the geometry and sequence structures, respectively. Specifically, CDConv employs independent learnable weights for different regular sequential displacements but directly encodes geometric displacements due to their irregularity. In this way, CDConv significantly improves protein modeling by reducing the impact of geometric irregularity on sequence modeling. Extensive experiments on a range of tasks, including protein fold classification, enzyme reaction classification, gene ontology term prediction and enzyme commission number prediction, demonstrate the effectiveness of the proposed CDConv. Our code will be publicly available. ","Protein representation learning, 3D geometry modeling, 1D sequence modeling, continuous convolution, discrete convolution." ELODI: Ensemble Logit Difference Inhibition for Positive-Congruent Training,https://openreview.net/forum?id=IJwhRE510b,https://openreview.net/pdf?id=IJwhRE510b,,"Negative flips are errors introduced in a classification system when a legacy model is updated. Existing methods to reduce the negative flip rate (NFR) either do so at the expense of overall accuracy by forcing a new model to imitate the old models, or use ensembles, which multiply inference cost prohibitively. We analyze the role of ensembles in reducing NFR and observe that they remove negative flips that are typically not close to the decision boundary but often exhibit large deviations in the distance among their logits. Based on this observation, we present a method, called Ensemble Logit Difference Inhibition (ELODI), to train a classification system that achieves paragon performance in both error rate and NFR, at the inference cost of a single model. The method distills a homogeneous ensemble to a single student model which is used to update the classification system. ELODI also introduces a generalized distillation objective, Logit Difference Inhibition (LDI), which penalizes changes in the logits between the reference ensemble and the student single model. On multiple image classification benchmarks, model updates with ELODI demonstrate superior accuracy retention and NFR reduction. ","positive-congruent training, negative flip, ensemble learning" Model-agnostic Measure of Generalization Difficulty,https://openreview.net/forum?id=4XMAzZasId,https://openreview.net/pdf?id=4XMAzZasId,We propose a model-agnostic measure of the generalization difficulty of a task.,"The measure of a machine learning algorithm is the difficulty of the tasks it can perform, and sufficiently difficult tasks are critical drivers of strong machine learning models. However, quantifying the generalization difficulty of machine learning benchmarks has remained challenging. We propose what is, to our knowledge, the first model-agnostic measure of the inherent generalization difficulty of tasks. Our inductive bias complexity measure quantifies the total information required to generalize well on a task minus the information provided by the data. It does so by measuring the fractional volume occupied by hypotheses that generalize on a task given that they fit the training data. 
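One plausible reading of the LDI objective named in the ELODI abstract: match the student's logits to the reference ensemble's up to a per-sample offset, i.e., penalize changes in pairwise logit gaps. The mean-centering below is an assumption; the paper's exact normalization may differ.

```python
import torch
import torch.nn.functional as F

def ldi_loss(z_student, z_ensemble):
    """Penalize changes in pairwise logit gaps between student and reference ensemble."""
    zs = z_student - z_student.mean(dim=1, keepdim=True)    # remove per-sample offset
    ze = z_ensemble - z_ensemble.mean(dim=1, keepdim=True)
    return F.mse_loss(zs, ze)

z_new = torch.randn(16, 10, requires_grad=True)    # logits of the new (student) model
z_ref = torch.randn(16, 10)                        # averaged logits of the reference ensemble
print(ldi_loss(z_new, z_ref).item())
```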
It scales exponentially with the intrinsic dimensionality of the space over which the model must generalize but only polynomially in resolution per dimension, showing that tasks that require generalizing over many dimensions are drastically more difficult than tasks involving more detail in fewer dimensions. Our measure can be applied to compute and compare supervised learning, reinforcement learning and meta-learning task difficulties against each other. Applied empirically, it formally quantifies intuitively expected trends, e.g. that in terms of required inductive bias, MNIST $<$ CIFAR10 $<$ Imagenet and fully observable Markov decision processes (MDPs) $<$ partially observable MDPs. Further, we show that classification of complex images $<$ few-shot meta-learning with simple images. Our measure provides a quantitative metric to guide the construction of more complex tasks requiring greater inductive bias, and thereby encourages the development of more sophisticated architectures and learning algorithms with more powerful generalization capabilities.","generalization, inductive bias, information theory, manifold, complexity" Efficient Multi-Task Reinforcement Learning via Selective Behavior Sharing,https://openreview.net/forum?id=KjKZaJ5Gbv,https://openreview.net/pdf?id=KjKZaJ5Gbv,Sharing behaviors between tasks to improve exploration for multitask reinforcement learning.,"The ability to leverage shared behaviors between tasks is critical for sample-efficient multi-task reinforcement learning (MTRL). Prior approaches based on parameter sharing or policy distillation share behaviors uniformly across tasks and states or focus on learning one optimal policy. Therefore, they are fundamentally limited when tasks have conflicting behaviors, because no single optimal policy exists. Our key insight is that we can instead share exploratory behavior, which can be helpful even when the optimal behaviors differ. Furthermore, as we learn each task, we can guide exploration by sharing behaviors in a task- and state-dependent way. To this end, we propose a novel MTRL method, Q-switch Mixture of Policies (QMP), that learns to selectively share exploratory behavior between tasks by using a mixture of policies based on estimated discounted returns to gather training data. Experimental results on manipulation and locomotion tasks demonstrate that our method outperforms prior behavior-sharing methods, highlighting the importance of task- and state-dependent sharing. ","Reinforcement Learning, Multitask Reinforcement Learning" Efficient Exploration via Fragmentation and Recall,https://openreview.net/forum?id=ED2Jjms9A4H,https://openreview.net/pdf?id=ED2Jjms9A4H,We propose a novel framework for exploration based on fragmentation-and-recall.,"Efficient exploration and model-building are critical for learning in large state spaces. However, agents typically face problems like getting stuck locally during exploration and catastrophic forgetting in their construction of models when the environments are heterogeneous. Here, we propose and apply the concept of Fragmentation-and-Recall to solve spatial navigation (FarMap) and reinforcement learning (FarCuriosity) problems. Agents construct local maps or local models, respectively, which are used to predict the current observation. High-surprisal points lead to a fragmentation event. At fracture points, we store the current map or model fragment in a long-term memory (LTM) and initialize a new fragment. 
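A minimal sketch of the Q-switch described in the QMP abstract: every task policy proposes an action for the current state, and the current task's critic picks the proposal with the highest estimated return to gather training data. The stand-in policies and critic below are illustrative, not the paper's networks.

```python
import torch

def qmp_select_action(state, policies, q_current):
    """Q-switch: pick the task policy proposal the current task's critic values most."""
    proposals = [pi(state) for pi in policies]                  # one action per task policy
    values = torch.stack([q_current(state, a) for a in proposals])
    return proposals[values.argmax()]

torch.manual_seed(0)
state = torch.randn(8)
policies = [(lambda s, W=torch.randn(2, 8): torch.tanh(W @ s)) for _ in range(3)]  # stand-ins
q_current = lambda s, a: (s[:2] * a).sum()                      # stand-in critic for the current task
print(qmp_select_action(state, policies, q_current))
```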
On the other hand, fragments are recalled (and thus reused) from LTM if the observation at their fracture point matches the agent’s current observation during exploration. The set of fracture points defines a set of intrinsic potential subgoals. Agents choose their next subgoal from the set of near and far potential subgoals in the current fragment or in LTM, respectively. Thus, local maps and model fragments guide exploration locally and avoid catastrophic forgetting when learning heterogeneous environments, while LTM promotes exploration more globally. We evaluate FarMap and FarCuriosity on complex procedurally-generated spatial environments and on reinforcement learning benchmarks, and demonstrate that the proposed methods are more efficient at exploration and memory use, and at harvesting extrinsic rewards, respectively.","fragmentation, recall, exploration, cognitive science, neuroscience, curiosity, reinforcement learning, spatial navigation" Hedge Your Actions: Flexible Reinforcement Learning for Complex Action Spaces,https://openreview.net/forum?id=jU-AXLS2bl,https://openreview.net/pdf?id=jU-AXLS2bl,Flexible reinforcement learning under complex innumerable action spaces via listwise action retrieval,"Real-world decision-making is often associated with large and complex action representations, which can even be unsuited to the task. For instance, the items in recommender systems have generic representations that apply to each user differently, and the actuators of a household robot can be high-dimensional and noisy. Prior works in discrete and continuous action-space reinforcement learning (RL) define a retrieval-selection framework to deal with problems of scale. The retrieval agent outputs in the space of action representations to retrieve a few samples for a selection critic to evaluate. However, learning such retrieval actors becomes increasingly inefficient as the complexity of the action space rises. Thus, we propose to treat the retrieval task as one of listwise RL: proposing a list of action samples that enable the selection phase to maximize the environment reward. By hedging its action proposals, we show that our agent is more flexible and sample-efficient than conventional approaches while learning under a complex action space. Results are also presented at \url{https://sites.google.com/view/complexaction}.","Efficient Reinforcement Learning, Large Action Space, Listwise Action Retrieval"
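A compact sketch of the fragmentation-and-recall loop shared by FarMap and FarCuriosity: fragment when surprisal spikes, store the current fragment in LTM keyed by the fracture observation, and recall a stored fragment when the current observation matches its fracture point. The squared-error stand-in for surprisal and the matching tolerance are assumptions for illustration.

```python
import numpy as np

class FragmentationRecall:
    def __init__(self, threshold=3.0):
        self.threshold = threshold            # surprisal level that triggers fragmentation
        self.fragment = {}                    # current local map / model fragment
        self.ltm = []                         # long-term memory: (fracture_obs, fragment) pairs

    def step(self, obs, pred):
        surprisal = float(np.sum((obs - pred) ** 2))   # stand-in for -log p(obs)
        if surprisal <= self.threshold:
            return "continue"                 # keep building the current fragment
        for key, frag in self.ltm:            # at a fracture point, try to recall first
            if np.allclose(key, obs, atol=0.1):
                self.fragment = frag
                return "recalled"
        self.ltm.append((obs.copy(), self.fragment))   # else store the fragment in LTM
        self.fragment = {}                    # and start a fresh one
        return "fragmented"

agent = FragmentationRecall()
rng = np.random.default_rng(0)
for _ in range(5):
    print(agent.step(rng.standard_normal(4), np.zeros(4)))
```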